- Basic operations: emptiness checks, whitespace trimming.
- Complex tasks: pattern matching, substring extraction, text normalization.
- Advanced techniques: regex matching, Unicode normalization.
ClickHouse function reference
empty
Arguments:
- `x` (String): Input string.
Returns:
- 1 for an empty string or 0 for a non-empty string. (UInt8)
The `empty` function is also available for arrays and UUIDs.
notEmpty
Checks whether the input string is non-empty. A string is considered non-empty if it contains at least one byte, even if this byte is a space or a null byte.
Syntax: `notEmpty(x)`
Arguments:
- `x` (String): Input value.
Returns:
- 1 for a non-empty string or 0 for an empty string. (UInt8)
For example, `notEmpty` returns 0 for an empty string and 1 for a non-empty string containing the word “guacamole”.
This function is also available for arrays and UUIDs.
length
Returns the length of a string in bytes (not characters or Unicode code points).
Syntax: `length(s)`
Arguments:
- `s` (String or Array): The input string or array.
Returns:
- The length of `s` in bytes. (UInt64)
`length` counts bytes, not characters, and some UTF-8 characters occupy more than one byte. For example, ‘Jalapeño’ is 8 characters long but occupies 9 bytes, because ‘ñ’ takes 2 bytes.
To count characters instead of bytes, use the `lengthUTF8` function.
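The byte versus code point distinction can be checked outside ClickHouse. The following is a minimal Python sketch, not ClickHouse code; the helper names `byte_length` and `code_point_length` are made up for illustration.

```python
def byte_length(text: str) -> int:
    """Number of bytes in the UTF-8 encoding of text (what a byte-based length reports)."""
    return len(text.encode("utf-8"))


def code_point_length(text: str) -> int:
    """Number of Unicode code points in text (what a code-point-based length reports)."""
    return len(text)


word = "Jalape\u00f1o"  # 'Jalapeño'; the 'ñ' (U+00F1) needs 2 bytes in UTF-8
print(byte_length(word), code_point_length(word))  # 9 8
```

For pure ASCII input both helpers return the same number, which matches the note that the byte and code point lengths agree for ASCII strings.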
lengthUTF8
Returns the length of a string in Unicode code points (not in bytes).
Syntax: `lengthUTF8(s)`
Aliases:
- char_length
- character_length
Arguments:
- `s` (String): The input string. Must contain valid UTF-8 encoded text.
Returns:
- The length of the string `s` in Unicode code points. (UInt64)
For example, `lengthUTF8` correctly counts the Unicode characters in the Spanish greeting and taco reference, including the inverted exclamation mark.
- This function assumes the input string contains valid UTF-8 encoded text. If this assumption is violated, the result is undefined and no exception is thrown.
- `lengthUTF8` differs from the `length` function, which returns the length in bytes.
- For ASCII strings, `lengthUTF8` and `length` return the same result.
See also:
- `length`: Returns the length of a string in bytes.
left
Returns a substring of string `s` with a specified offset starting from the left.
Syntax: `left(s, offset)`
Arguments:
- `s` (String or FixedString): The string to calculate a substring from.
- `offset` ((U)Int*): The number of bytes of the offset.
Returns:
- For positive `offset`: A substring of `s` with `offset` many bytes, starting from the left of the string.
- For negative `offset`: A substring of `s` with `length(s) - |offset|` bytes, starting from the left of the string.
- An empty string if `offset` is 0.
The `left` function operates on bytes, not characters. For UTF-8 encoded strings, consider using `leftUTF8` to work with characters instead of bytes.
leftUTF8
Returns a substring of a UTF-8 encoded string with a specified number of characters from the left.
Syntax: `leftUTF8(s, length)`
Arguments:
- `s` (String): The UTF-8 encoded string to extract the substring from.
- `length` (UInt): The number of characters to extract.
Returns:
- A substring of `s` containing the leftmost `length` characters. (String)
If `length` is greater than the number of characters in the string, the entire string is returned. If `length` is 0 or negative, an empty string is returned.
leftPad
Pads a string from the left with spaces or with a specified string (multiple times, if needed) until the resulting string reaches the specified length.
Syntax: `leftPad(string, length[, pad_string])`
Alias: LPAD
Arguments:
- `string` (String): Input string that should be padded.
- `length` (UInt or Int): The length of the resulting string. If the value is smaller than the input string length, then the input string is shortened to `length` characters.
- `pad_string` (String, optional): The string to pad the input string with. If not specified, then the input string is padded with spaces.
Returns:
- A left-padded string of the given length. (String)
For example:
- ‘Taco’ is padded with asterisks to a length of 10 characters.
- ‘Burrito’ is padded with spaces to a length of 10 characters.
The function measures string length in bytes, not in characters. For Unicode strings, consider using `leftPadUTF8` instead.
leftPadUTF8
Pads a string from the left with spaces or a specified string (multiple times, if needed) until the resulting string reaches the given length. Unlike `leftPad`, which measures the string length in bytes, this function measures the string length in Unicode code points.
Syntax: `leftPadUTF8(string, length[, pad_string])`
Arguments:
- `string` (String): Input string that should be padded.
- `length` (UInt or Int): The length of the resulting string. If the value is smaller than the input string length, then the input string is shortened to `length` characters.
- `pad_string` (String, optional): The string to pad the input string with. If not specified, then the input string is padded with spaces.
Returns:
- A left-padded string of the given length. (String)
For example:
- `padded_taco` is padded with taco emojis to reach a length of 10 Unicode characters.
- `padded_burrito` is padded with spaces to reach a length of 10 Unicode characters.
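The padding rules described for the byte-based and code-point-based variants can be modelled in a few lines. This is a Python sketch under the assumption that the pad string is repeated and then cut to fit; the function names are hypothetical and this is not ClickHouse's implementation.

```python
def _left_pad(value, length, pad):
    """Pad or truncate value (str or bytes) to exactly `length` units."""
    length = max(length, 0)
    if length <= len(value):
        # A target shorter than the input truncates the input.
        return value[:length]
    missing = length - len(value)
    # Repeat the pad and cut it so the total is exactly `length` units.
    return (pad * missing)[:missing] + value


def left_pad_utf8(text: str, length: int, pad: str = " ") -> str:
    """Lengths are counted in Unicode code points."""
    return _left_pad(text, length, pad)


def left_pad_bytes(data: bytes, length: int, pad: bytes = b" ") -> bytes:
    """Lengths are counted in bytes."""
    return _left_pad(data, length, pad)


print(left_pad_utf8("Taco", 10, "*"))     # ******Taco
print(left_pad_utf8("Burrito", 10))       # three spaces, then Burrito
```

With multi-byte input the two variants diverge: ‘Jalapeño’ is 9 bytes but 8 code points, so padding to 10 adds one pad unit in the byte version and two in the code point version.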
right
Returns a substring of a string starting from the right.
Syntax: `right(s, offset)`
Arguments:
- `s` (String or FixedString): The input string.
- `offset` ((U)Int*): The number of bytes to extract.
Returns:
- For positive `offset`: A substring of `s` with `offset` many bytes, starting from the right of the string.
- For negative `offset`: A substring of `s` with `length(s) - |offset|` bytes, starting from the right of the string.
- An empty string if `offset` is 0.
For example:
- `last_four` returns the last 4 characters of ‘Crunchy Taco’.
- `all_but_last_four` returns all but the last 4 characters of ‘Soft Taco’.
rightUTF8
Returns a substring of a UTF-8 encoded string with a specified offset starting from the right.
Syntax: `rightUTF8(s, offset)`
Arguments:
- `s` (String or FixedString): The UTF-8 encoded string to calculate a substring from.
- `offset` ((U)Int*): The number of characters for the offset.
Returns:
- For positive `offset`: A substring of `s` with `offset` many characters, starting from the right of the string.
- For negative `offset`: A substring of `s` with `length(s) - |offset|` characters, starting from the right of the string.
- An empty string if `offset` is 0.
The function counts characters, not bytes. This is important for multi-byte UTF-8 characters.
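The three offset cases (positive, negative, zero) and the byte versus character difference can be sketched in Python. The helper `right` below is hypothetical and only models the rules as stated in this section; passing `str` counts code points, passing `bytes` counts bytes.

```python
def right(value, offset: int):
    """Model of a right-substring on str (code points) or bytes (bytes)."""
    if offset == 0:
        return value[:0]              # empty str or empty bytes
    if offset > 0:
        return value[-offset:]        # the rightmost `offset` units
    # Negative offset: keep len(value) - |offset| units counted from the right,
    # which is the same as dropping the first |offset| units.
    return value[abs(offset):]


print(right("Crunchy Taco", 4))                        # Taco
print(right("Jalape\u00f1o", 2))                       # ño
print(right("Jalape\u00f1o".encode("utf-8"), 2))       # b'\xb1o' (cuts 'ñ' in half)
```

The last call shows why the UTF-8 variant matters: taking 2 bytes from the right splits the 2-byte ‘ñ’ and leaves an invalid UTF-8 sequence.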
rightPad
Pads a string from the right with spaces or with a specified string (multiple times, if needed) until the resulting string reaches the specified length.
Syntax: `rightPad(string, length[, pad_string])`
Alias: RPAD
Arguments:
- `string` (String): Input string that should be padded.
- `length` (UInt or Int): The length of the resulting string. If the value is smaller than the input string length, then the input string is shortened to `length` characters.
- `pad_string` (String, optional): The string to pad the input string with. If not specified, then the input string is padded with spaces.
Returns:
- A right-padded string of the given length. (String)
rightPadUTF8
Pads a string from the right with spaces or a specified string (multiple times, if needed) until the resulting string reaches the given length. Unlike `rightPad`, which measures the string length in bytes, this function measures the string length in Unicode code points.
Syntax: `rightPadUTF8(string, length[, pad_string])`
Arguments:
- `string` (String): Input string that should be padded.
- `length` (UInt or Int): The desired length of the resulting string.
- `pad_string` (String, optional): The string to pad the input string with. If not specified, the input string is padded with spaces.
Returns:
- A right-padded string of the given length. (String)
For example:
- ‘Taco’ is padded with taco emojis to reach a length of 10 Unicode code points.
- ‘Burrito’ is padded with spaces to reach a length of 10 Unicode code points.
lower
Converts ASCII Latin symbols in a string to lowercase.
Syntax: `lower(input)`
Alias: lcase
Arguments:
- `input` (String): The input string.
Returns:
- A string with all ASCII Latin characters converted to lowercase. (String)
This function only affects ASCII Latin characters (A-Z). It does not change other characters or Unicode symbols. For Unicode-aware lowercase conversion, use the `lowerUTF8` function.
upper
Converts ASCII Latin characters in a string to uppercase.
Syntax: `upper(input)`
Alias: ucase
Arguments:
- `input` (String): The input string.
Returns:
- A string with ASCII Latin characters converted to uppercase. (String)
This function only affects ASCII Latin characters (a-z). It does not change other characters or consider language-specific uppercase rules.
lowerUTF8
Converts a string to lowercase, assuming the string contains valid UTF-8 encoded text.
Syntax: `lowerUTF8(input)`
Arguments:
- `input` (String): A string to convert.
Returns:
- The input string converted to lowercase. (String)
This function does not detect the language. For some languages (e.g., Turkish), the result might not be exactly correct due to specific uppercase/lowercase rules. If the length of the UTF-8 byte sequence differs between upper and lower case for a code point, the result may be incorrect for that code point.
Unlike the `lower` function, `lowerUTF8` correctly handles UTF-8 encoded characters beyond the ASCII range.
upperUTF8
Converts a string to uppercase, assuming the string contains valid UTF-8 encoded text.
Syntax: `upperUTF8(input)`
Arguments:
- `input` (String): A string to convert.
Returns:
- The input string converted to uppercase. (String)
This function does not detect the language, so results might not be exactly correct for some languages (e.g., Turkish). If the length of the UTF-8 byte sequence differs between upper and lower case for a code point, the result may be incorrect for that code point.
isValidUTF8
Checks whether a string contains valid UTF-8 encoded text.
Syntax: `isValidUTF8(input)`
Arguments:
- `input` (String): The string to check.
Returns:
- 1 if the input string contains valid UTF-8 encoded text, 0 otherwise. (UInt8)
toValidUTF8
Replaces invalid UTF-8 characters with the � (U+FFFD) replacement character. Consecutive invalid characters are collapsed into a single replacement character.
Syntax: `toValidUTF8(input_string)`
Arguments:
- `input_string` (String): Any set of bytes represented as a String data type.
Returns:
- A valid UTF-8 string. (String)
For example, the invalid sequence `\xF0\x80\x80\x80` is replaced with a single � character in the resulting string.
This function is useful when dealing with potentially corrupted or improperly encoded text data, ensuring that the output is always valid UTF-8 for further processing or display.
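The replace-then-collapse behavior described above can be approximated in Python. This is a sketch, not ClickHouse's algorithm: it relies on Python's `errors="replace"` decoding and then collapses runs of U+FFFD, which means a run of literal U+FFFD characters already present in the input would be collapsed too. The name `to_valid_utf8` is made up for illustration.

```python
import re

REPLACEMENT = "\ufffd"  # the U+FFFD replacement character


def to_valid_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, turning each run of invalid bytes into one U+FFFD."""
    # errors="replace" substitutes U+FFFD for every undecodable piece.
    decoded = raw.decode("utf-8", errors="replace")
    # Collapse consecutive replacement characters into a single one.
    return re.sub(REPLACEMENT + "+", REPLACEMENT, decoded)


print(to_valid_utf8(b"Taco\xF0\x80\x80\x80Bell"))  # Taco�Bell
```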
repeat
Repeats a string a specified number of times.
Syntax: `repeat(s, n)`
Arguments:
- `s` (String): The string to repeat.
- `n` (UInt or Int): The number of times to repeat the string.
Returns:
- `s` repeated `n` times. If `n` <= 0, the function returns an empty string. (String)
If you need to repeat a string a large number of times, be mindful of the resulting string length to avoid excessive memory usage.
space
Returns a string consisting of a specified number of space characters.
Syntax: `space(n)`
Alias: SPACE
Arguments:
- `n` (UInt or Int): The number of space characters to generate.
Returns:
- `n` space characters. If `n` <= 0, an empty string is returned. (String)
For example, `space()` can be used to create a consistent alignment for an ingredients list, regardless of the taco name length.
reverse
Reverses the sequence of bytes in a string.
Syntax: `reverse(s)`
Arguments:
- `s` (String): The input string.
Returns:
- A string with the bytes in reverse order. (String)
For proper handling of Unicode characters, use the `reverseUTF8` function instead.
reverseUTF8
Reverses a sequence of Unicode code points in a string.
Syntax: `reverseUTF8(str)`
Arguments:
- `str` (String): The input string to reverse.
Returns:
- A string with the sequence of Unicode code points reversed. (String)
Unlike the `reverse` function, which reverses the sequence of bytes, `reverseUTF8` preserves the integrity of multi-byte Unicode characters.
Be cautious when using this function with strings that contain combining characters or complex scripts, as the visual representation of the reversed text might not always be meaningful.
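The difference between reversing bytes and reversing code points is easy to see in Python. This sketch uses hypothetical helper names and is only meant to illustrate the two behaviors described above.

```python
def reverse_bytes(text: str) -> bytes:
    """Reverse the UTF-8 bytes of text (multi-byte characters get scrambled)."""
    return text.encode("utf-8")[::-1]


def reverse_code_points(text: str) -> str:
    """Reverse text one Unicode code point at a time."""
    return text[::-1]


word = "Jalape\u00f1o"  # 'Jalapeño'
print(reverse_code_points(word))  # oñepalaJ
print(reverse_bytes(word))        # b'o\xb1\xc3epalaJ' (no longer valid UTF-8)
```

Reversing code points still splits combining sequences: an ‘n’ followed by a combining tilde comes out as the tilde followed by ‘n’, which is the caveat about combining characters noted above.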
concat
Concatenates the given string arguments.
Syntax: `concat(s1, s2, …)`
Arguments:
- `s1`, `s2`, … (String or FixedString): Strings to concatenate.
Returns:
- A String created by concatenating the arguments.
- If any argument is `NULL`, the function returns `NULL`.
- Arguments of other types are automatically converted to strings, but this may impact performance.
concatAssumeInjective
Concatenates strings, assuming the result is injective (unique) for different input values. This can be used for query optimization, particularly with GROUP BY operations.
Syntax: `concatAssumeInjective(s1, s2, …)`
Arguments:
- `s1`, `s2`, … (String or FixedString): Strings to concatenate.
Returns:
- The concatenated string. (String)
Use this function only when you are certain that the concatenation result is unique for different inputs. Incorrect usage may lead to wrong query results.
concatWithSeparator
Concatenates strings with a specified separator.
Syntax: `concatWithSeparator(separator, s1, s2, …)`
Alias: concat_ws
Arguments:
- `separator` (String): The string to use as a separator.
- `s1`, `s2`, … (String): The strings to concatenate.
Returns:
- A string containing all the input strings concatenated with the separator between them. (String)
If any of the input strings is NULL, it is skipped in the result without adding a separator.
concatWithSeparatorAssumeInjective
Concatenates strings with a specified separator, assuming the result is injective. This function can be used for optimization of GROUP BY operations.
Syntax: `concatWithSeparatorAssumeInjective(separator, s1, s2, …)`
Arguments:
- `separator` (String): The string to use as a separator between the concatenated strings.
- `s1`, `s2`, … (String): The strings to concatenate.
Returns:
- A string created by concatenating the input strings with the specified separator. (String)
This function assumes that the concatenation result is injective, meaning different arguments always produce different results. This assumption allows for certain query optimizations.
substring
Returns a substring of a string starting at a specified position.
Syntax: `substring(s, offset[, length])`
Aliases:
- substr
- mid
Arguments:
- `s` (String): The input string.
- `offset` (Integer): The starting position of the substring (1-based index).
- `length` (Integer, optional): The maximum length of the substring.
Returns:
- A substring of `s`. (String)
For example:
- `without_length` returns the substring starting from the 7th character to the end.
- `with_length` returns a 5-character substring starting from the 7th character.
Notes:
- If `offset` is 0, an empty string is returned.
- If `offset` is negative, the substring starts that many characters from the end of the string.
- If `length` is omitted, the substring extends to the end of the string.
substringUTF8
Returns a substring of a UTF-8 encoded string starting at a specified position.
Syntax: `substringUTF8(s, offset[, length])`
Arguments:
- `s` (String): The UTF-8 encoded string to extract the substring from.
- `offset` (Integer): The starting position of the substring.
- `length` (Integer, optional): The maximum length of the substring.
Returns:
- A substring of `s`. (String)
The `offset` argument is 1-based, meaning the first character is at position 1. If `offset` is 0, an empty string is returned. If `offset` is negative, the substring starts that many characters from the end of the string.
If the optional `length` argument is omitted, the substring extends to the end of the string.
This function operates on Unicode code points rather than bytes, making it suitable for strings containing multi-byte characters.
substringIndex
Returns a substring of a string before a specified number of occurrences of a delimiter.
Syntax: `substringIndex(s, delim, count)`
Alias: SUBSTRING_INDEX
Arguments:
- `s` (String): The input string to extract the substring from.
- `delim` (String): The delimiter string.
- `count` (Int or UInt): The number of delimiter occurrences.
  - If positive, counts from the left.
  - If negative, counts from the right.
Returns:
- A substring of `s`. (String)
For example, `substringIndex` with a count of 2 returns everything before the second occurrence of the comma, giving us the first two taco toppings from the list.
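The counting rule can be sketched in Python by splitting on the delimiter. This is an illustrative model only: `substring_index` is a hypothetical name, a non-empty delimiter is assumed, and the toppings list is made-up sample data.

```python
def substring_index(text: str, delim: str, count: int) -> str:
    """Return the part of text before `count` occurrences of delim.

    Positive count: counted from the left. Negative count: counted from the right.
    """
    if count == 0:
        return ""
    parts = text.split(delim)  # assumes delim is not empty
    if count > 0:
        # Keep the first `count` pieces, i.e. everything before the count-th delimiter.
        return delim.join(parts[:count])
    # Keep the last |count| pieces, i.e. everything after the |count|-th delimiter from the right.
    return delim.join(parts[count:])


toppings = "lettuce,cheese,salsa,guacamole"
print(substring_index(toppings, ",", 2))   # lettuce,cheese
print(substring_index(toppings, ",", -1))  # guacamole
```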
If `count` is 0 or the delimiter is not found, the function returns an empty string. If `count` exceeds the number of delimiters, the entire string is returned.
substringIndexUTF8
Returns the substring of a UTF-8 encoded string before a specified number of occurrences of a delimiter.
Syntax: `substringIndexUTF8(s, delim, count)`
Arguments:
- `s` (String): The UTF-8 encoded string to extract the substring from.
- `delim` (String): The delimiter string.
- `count` (Int): The number of delimiter occurrences to count before extracting the substring. If positive, counts from the left; if negative, counts from the right.
Returns:
- A substring of `s`. (String)
This function assumes that the input string contains valid UTF-8 encoded text. If this assumption is violated, the behavior is undefined and no exception is thrown.
appendTrailingCharIfAbsent
Appends a specified character to the end of a string if it’s not already present.
Syntax: `appendTrailingCharIfAbsent(s, c)`
Arguments:
- `s` (String): The input string.
- `c` (String): The character to append (length 1).
Returns:
- A string with the specified character appended if it wasn’t already present. (String)
For example:
- `with_slash` adds a trailing ‘/’ to ‘Taco’ since it wasn’t present.
- `already_has_slash` doesn’t modify ‘Burrito/’ because it already ends with ‘/’.
convertCharset
Converts a string from one character encoding to another.
Syntax: `convertCharset(s, from, to)`
Arguments:
- `s` (String): The input string to convert.
- `from` (String): The source character encoding.
- `to` (String): The target character encoding.
Returns:
- The converted string. (String)
The available character encodings depend on the ICU library version used during ClickHouse compilation. Make sure to use valid encoding names supported by your ClickHouse installation.
base58Encode
Encodes a string using Base58 in the “Bitcoin” alphabet.
Syntax: `base58Encode(plaintext)`
Arguments:
- `plaintext` (String): String to encode.
Returns:
- The Base58 encoded string. (String)
base58Decode
Decodes a Base58-encoded string.
Syntax: `base58Decode(encoded)`
Arguments:
- `encoded` (String): A Base58-encoded string.
Returns:
- The decoded value of the argument. (String)
If the input string is not a valid Base58-encoded value, an exception is thrown. For a version that returns an empty string instead of throwing an exception, use `tryBase58Decode()`.
tryBase58Decode
Attempts to decode a Base58-encoded string. If decoding fails, it returns an empty string instead of throwing an exception.
Syntax: `tryBase58Decode(encoded)`
Arguments:
- `encoded` (String): A string containing Base58-encoded data.
Returns:
- A string containing the decoded value of the argument. (String)
- An empty string if decoding fails.
For example:
- ‘3dc8KtHrwM’ is a valid Base58-encoded string that decodes to ‘Taco’.
- ‘invalid_base58’ is not a valid Base58-encoded string, so an empty string is returned.
base64Encode
Encodes a string using Base64 encoding according to RFC 4648.
Syntax: `base64Encode(input)`
Alias: TO_BASE64
Arguments:
- `input` (String): The string to encode.
Returns:
- A Base64-encoded string. (String)
The function follows the standard Base64 encoding scheme, which is commonly used for encoding binary data for transmission over text-based protocols or storage in text formats.
base64URLEncode
Encodes a string using Base64 URL-safe encoding, according to RFC 4648.
Syntax: `base64URLEncode(input)`
Arguments:
- `input` (String): The string to encode.
Returns:
- A string containing the Base64 URL-safe encoded value of the input. (String)
This encoding replaces the `+` and `/` characters with `-` and `_` respectively, making the output safe for use in URLs without additional encoding.
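The URL-safe alphabet can be tried out with Python's standard `base64` module. This is a sketch of the RFC 4648 URL-safe variant with the `=` padding stripped, as this entry describes; the helper names are hypothetical and the code is not ClickHouse's implementation.

```python
import base64


def base64_url_encode(data: bytes) -> str:
    """URL-safe Base64 ('-' and '_' instead of '+' and '/') without '=' padding."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def base64_url_decode(text: str) -> bytes:
    """Inverse of base64_url_encode: restore the padding, then decode."""
    padded = text + "=" * (-len(text) % 4)
    return base64.urlsafe_b64decode(padded)


raw = b"\xfb\xff"                      # standard Base64 would give '+/8='
print(base64_url_encode(raw))          # -_8
print(base64_url_decode("-_8") == raw) # True
```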
The encoded output does not include padding `=` characters, further improving URL compatibility.
base64Decode
Decodes a Base64-encoded string.
Syntax: `base64Decode(encoded)`
Arguments:
- `encoded` (String): A Base64-encoded string.
Returns:
- The decoded string. (String)
If the input string is not a valid Base64-encoded value, an exception is thrown. For a version that returns an empty string instead of throwing an exception, use `tryBase64Decode`.
base64URLDecode
Decodes a string encoded with the URL-safe variant of Base64 specified in RFC 4648.
Syntax: `base64URLDecode(encodedUrl)`
Arguments:
- `encodedUrl` (String): A string containing Base64-encoded data with the URL-safe modifications.
Returns:
- The decoded string. (String)
If the input string is not a valid Base64URL-encoded value, an exception is thrown. For a version that returns an empty string instead of throwing an exception, use `tryBase64URLDecode`.
tryBase64Decode
Attempts to decode a Base64-encoded string. If decoding fails, it returns an empty string instead of throwing an exception.
Syntax: `tryBase64Decode(encoded)`
Arguments:
- `encoded` (String): A string containing Base64-encoded data.
Returns:
- A string containing the decoded value of the argument. (String)
- An empty string if decoding fails.
tryBase64URLDecode
Decodes a Base64URL-encoded string. If the input is not a valid Base64URL-encoded string, it returns an empty string instead of throwing an exception.
Syntax: `tryBase64URLDecode(encodedString)`
Arguments:
- `encodedString` (String): A string containing Base64URL-encoded data.
Returns:
- The decoded string, or an empty string if decoding fails. (String)
For example:
- `decoded_url` shows a successfully decoded Base64URL string.
- `failed_decode` returns an empty string because the input is not valid Base64URL.
endsWith
Returns whether a string ends with the specified suffix.
Syntax: `endsWith(str, suffix)`
Arguments:
- `str` (String): The string to check.
- `suffix` (String): The suffix to look for.
Returns:
- 1 if `str` ends with `suffix`, 0 otherwise. (UInt8)
endsWithUTF8
Returns whether a UTF-8 encoded string ends with the specified suffix.
Syntax: `endsWithUTF8(str, suffix)`
Arguments:
- `str` (String): The string to check.
- `suffix` (String): The suffix to look for.
Returns:
- 1 if `str` ends with `suffix`, 0 otherwise. (UInt8)
Note that this function differs from the non-UTF8 version `endsWith` in that it matches `str` and `suffix` by UTF-8 characters rather than bytes.
startsWith
Returns whether a string begins with a specified prefix.
Syntax: `startsWith(str, prefix)`
Arguments:
- `str` (String): The string to check.
- `prefix` (String): The prefix to look for at the start of `str`.
Returns:
- 1 if `str` starts with `prefix`, 0 otherwise. (UInt8)
startsWithUTF8
Returns whether a UTF-8 encoded string starts with the specified prefix.
Syntax: `startsWithUTF8(str, prefix)`
Arguments:
- `str` (String): The UTF-8 encoded string to check.
- `prefix` (String): The prefix to look for at the start of `str`.
Returns:
- 1 if `str` starts with `prefix`, 0 otherwise. (UInt8)
This function differs from `startsWith` in that it matches `str` and `prefix` by UTF-8 characters rather than bytes. This is particularly useful when working with non-ASCII characters.
This function assumes that the input strings contain valid UTF-8 encoded text. If this assumption is violated, the behavior is undefined and no exception is thrown.
trim
Removes specified characters from the start and/or end of a string.
Arguments:
- `trim_character` (String): Character(s) to remove.
- `input_string` (String): String to trim.
Returns:
- A string with specified characters removed from the start and/or end. (String)
trimLeft
Removes the consecutive occurrences of whitespace (ASCII-character 32) from the start of a string.
Syntax: `trimLeft(input_string)`
Alias: ltrim
Arguments:
- `input_string` (String): String to trim.
Returns:
- A string without leading whitespace. (String)
trimRight
Removes the consecutive occurrences of whitespace (ASCII-character 32) from the end of a string.
Syntax: `trimRight(input_string)`
Alias: rtrim
Arguments:
- `input_string` (String): String to trim.
Returns:
- A string without trailing whitespace. (String)
trimBoth
Removes the consecutive occurrences of whitespace (ASCII-character 32) from both ends of a string.
Syntax: `trimBoth(input_string)`
Alias: trim(input_string)
Arguments:
- `input_string` (String): String to trim.
Returns:
- A string without leading and trailing whitespace. (String)
CRC32
Calculates the CRC32 checksum of a string using the CRC-32-IEEE 802.3 polynomial and initial value 0xffffffff (zlib implementation).
Syntax: `CRC32(str)`
Arguments:
- `str` (String): Input string.
Returns:
- The CRC32 checksum of the string. (UInt32)
The CRC32 function is not cryptographically secure. For security-sensitive applications, use cryptographic hash functions instead.
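For readers who want to compare values, Python's `zlib.crc32` also computes the zlib flavor of CRC-32 (IEEE 802.3 polynomial, initial value 0xffffffff). The sketch below assumes the two agree because the entry names the zlib implementation; the wrapper name `crc32_checksum` is made up.

```python
import zlib


def crc32_checksum(data: bytes) -> int:
    """zlib-style CRC-32 of data as an unsigned 32-bit integer."""
    # Masking keeps the result unsigned and within 32 bits.
    return zlib.crc32(data) & 0xFFFFFFFF


# '123456789' is the conventional CRC test input; its CRC-32 is 0xCBF43926.
print(hex(crc32_checksum(b"123456789")))  # 0xcbf43926
```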
CRC32IEEE
Calculates the CRC32 checksum of a string using the CRC-32-IEEE 802.3 polynomial.
Syntax: `CRC32IEEE(string)`
Arguments:
- `string` (String): The input string to calculate the checksum for.
Returns:
- The CRC32 checksum. (UInt32)
CRC64
Calculates the CRC64 checksum of a string, using the CRC-64-ECMA polynomial.
Syntax: `CRC64(str)`
Arguments:
- `str` (String): Input string.
Returns:
- The CRC64 checksum of the input string. (UInt64)
normalizeQuery
Replaces literals, sequences of literals and complex aliases with placeholders in a SQL query string.
Syntax: `normalizeQuery(query)`
Arguments:
- `query` (String): The SQL query string to normalize.
Returns:
- A normalized version of the input query with placeholders. (String)
The function:
- Replaces literals with `?` placeholders.
- Replaces sequences of literals with `?..` placeholders.
- Replaces complex aliases (containing whitespace, more than two digits, or at least 36 bytes long such as UUIDs) with `?` placeholders.
For example, the literals `5.99` and `"hot"` are replaced with `?` placeholders in the normalized query.
The normalization process helps in identifying structurally similar queries, which can be beneficial for query analysis, caching, and optimization strategies.
normalizeQueryKeepNames
Replaces literals and sequences of literals with placeholders, but preserves complex aliases in SQL queries. This function is useful for analyzing complex query logs.
Syntax: `normalizeQueryKeepNames(query)`
Arguments:
- `query` (String): The SQL query to normalize.
Returns:
- A normalized version of the input query with placeholders for literals, but preserving complex aliases. (String)
For example:
- `normalizeQuery` replaces all literals and aliases with placeholders.
- `normalizeQueryKeepNames` preserves the complex alias `total_tacos` while still replacing other literals with placeholders.
normalizedQueryHash
Returns a 64-bit hash value for a normalized query string. This function is useful for analyzing query logs and identifying similar queries with different literal values.
Syntax: `normalizedQueryHash(query)`
Arguments:
- `query` (String): The SQL query string to be hashed.
Returns:
- A 64-bit hash value of the normalized query. (UInt64)
The function works in two steps:
- Normalizes the query by replacing literals, sequences of literals, and complex aliases with placeholders.
- Calculates a hash value for the normalized query.
For example, take two queries that differ only in the literal compared against the `filling` column. The `normalizedQueryHash` function returns the same hash value for both queries, demonstrating its ability to identify structurally similar queries.
This function is particularly useful for query analysis, performance tuning, and identifying frequently executed query patterns in large-scale systems.
normalizedQueryHashKeepNames
Syntax: `normalizedQueryHashKeepNames(query)`
Arguments:
- `query` (String): The SQL query to be normalized and hashed.
Returns:
- The hash of the normalized query. (UInt64)
This function normalizes the input query by replacing literals with placeholders, but preserves complex aliases (containing whitespace, more than two digits, or at least 36 bytes long such as UUIDs). It then calculates a hash of the normalized query.
This function is useful for analyzing query logs and identifying similar queries while maintaining the readability of complex aliases.
For example:
- `hash1` and `hash2` are identical because `normalizedQueryHash` treats the different fillings as placeholders.
- `hash3` and `hash4` are identical to each other but different from `hash1` and `hash2` because `normalizedQueryHashKeepNames` preserves the complex alias `total_tacos_123`.
normalizeUTF8NFC
Converts a string to NFC (Normalization Form Canonical Composition) normalized form, assuming the string contains valid UTF-8 encoded text.
Syntax: `normalizeUTF8NFC(s)`
Arguments:
- `s` (String): Input string containing valid UTF-8 encoded text.
Returns:
- A string transformed to NFC normalization form. (String)
For example:
- The original ‘â’ is represented by 2 bytes.
- After NFC normalization, it remains ‘â’ but is now represented as a single Unicode code point.
- The length of the normalized form is still 2 bytes, as it’s a single character that requires 2 bytes in UTF-8 encoding.
If the input string is not valid UTF-8, no exception is thrown, but the result is undefined. Always ensure your input is valid UTF-8 before using this function.
normalizeUTF8NFD
Converts a string to NFD normalized form, assuming the string contains valid UTF-8 encoded text.
Syntax: `normalizeUTF8NFD(s)`
Arguments:
- `s` (String): UTF-8 encoded input string.
Returns:
- A string transformed to NFD normalization form. (String)
For example:
- The original ‘â’ is 2 bytes long.
- After NFD normalization, it’s decomposed into ‘a’ and the combining diacritic ‘◌̂’.
- The normalized form is 3 bytes long.
If the input string contains invalid UTF-8, no exception is thrown, but the result is undefined.
normalizeUTF8NFKC
Converts a string to NFKC normalized form, assuming the string contains valid UTF-8 encoded text.
Syntax: `normalizeUTF8NFKC(s)`
Arguments:
- `s` (String): UTF-8 encoded input string.
Returns:
- A string transformed to NFKC normalization form. (String)
For example:
- We use a Spanish letter ‘ñ’ which can be represented in different Unicode forms.
- `original_length` shows the byte length of the original string.
- `nfkc` shows the normalized form, which visually looks the same.
- `nfkc_length` shows the byte length after normalization, which remains 2 in this case.
normalizeUTF8NFKD
Converts a string to NFKD normalized form, assuming the string contains valid UTF-8 encoded text.
Syntax: `normalizeUTF8NFKD(s)`
Arguments:
- `s` (String): UTF-8 encoded input string.
Returns:
- A string transformed to NFKD normalization form. (String)
For example:
- We use a character ‘ñ’ (as in “jalapeño”).
- The original UTF-8 encoded ‘ñ’ has a length of 2 bytes.
- After NFKD normalization, it’s decomposed into ‘n’ and the combining tilde, resulting in a 3-byte string.
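The byte lengths quoted for the four normalization forms can be reproduced with Python's standard `unicodedata` module. This is an illustrative sketch using Python's Unicode tables rather than ClickHouse; the helper name `normalized_byte_lengths` is made up.

```python
import unicodedata


def normalized_byte_lengths(text: str) -> dict:
    """UTF-8 byte length of text after each Unicode normalization form."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return {form: len(unicodedata.normalize(form, text).encode("utf-8")) for form in forms}


# Precomposed 'â' (U+00E2) and 'ñ' (U+00F1) are 2 bytes each in UTF-8.
# The decomposing forms split them into a base letter plus a 2-byte combining mark.
print(normalized_byte_lengths("\u00e2"))  # {'NFC': 2, 'NFD': 3, 'NFKC': 2, 'NFKD': 3}
print(normalized_byte_lengths("\u00f1"))  # {'NFC': 2, 'NFD': 3, 'NFKC': 2, 'NFKD': 3}
```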
encodeXMLComponent
Escapes characters with special meaning in XML so they can be safely placed into XML text nodes or attributes.
Syntax: `encodeXMLComponent(x)`
Arguments:
- `x` (String): Input string to encode.
Returns:
- The input string with special XML characters escaped. (String)
The following characters are escaped:
- `<` becomes `&lt;`
- `&` becomes `&amp;`
- `>` becomes `&gt;`
- `"` becomes `&quot;`
- `'` becomes `&apos;`
For example, the `&` symbol is encoded to `&amp;`, ensuring it doesn’t interfere with XML parsing.
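The five replacements can be reproduced with Python's `xml.sax.saxutils.escape`, which handles `&`, `<` and `>` by default and accepts extra mappings for the two quote characters. This is a sketch of the same escaping rules, not ClickHouse code; the wrapper name `encode_xml_component` is hypothetical.

```python
from xml.sax.saxutils import escape


def encode_xml_component(text: str) -> str:
    """Escape the five XML special characters: & < > " '."""
    # escape() replaces '&' first, so the entities added afterwards are not double-escaped.
    return escape(text, {'"': "&quot;", "'": "&apos;"})


print(encode_xml_component('Tacos & "Burritos" <hot>'))
# Tacos &amp; &quot;Burritos&quot; &lt;hot&gt;
```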
decodeXMLComponent
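Decodes XML-encoded special characters in a string.
Syntax: `decodeXMLComponent(x)`
Arguments:
- `x` (String): The input string to decode.
Returns:
- The decoded string. (String)
The following entities are decoded:
- `&quot;` to `"`
- `&amp;` to `&`
- `&apos;` to `'`
- `&gt;` to `>`
- `&lt;` to `<`
Numeric character references in both decimal (like `&#10003;`) and hexadecimal (like `&#x2713;`) forms are supported.
For example, `decodeXMLComponent` decodes the `&amp;` entity to `&` and a numeric character reference for ñ (such as `&#241;`) to `ñ`.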
This function is useful when working with XML data or when you need to unescape HTML entities in text. It’s particularly handy when processing data from web scraping or API responses that may contain encoded characters.
decodeHTMLComponent
Decodes HTML entities and numeric character references in a string.
Syntax: `decodeHTMLComponent(encoded_string)`
Arguments:
- `encoded_string` (String): A string containing HTML entities or numeric character references.
Returns:
- The decoded string with HTML entities and numeric character references converted to their corresponding characters. (String)
This function decodes HTML entities (like `&amp;`, `&lt;`, `&gt;`) and numeric character references (like `&#10003;` or `&#x2713;`) in the input string. It supports both decimal and hexadecimal forms of numeric character references.
For example, the heart symbol (♥) and the hot pepper emoji (🌶), written as an entity or a numeric character reference, are decoded into their corresponding Unicode characters.
This function is useful when working with HTML-encoded text, especially when processing web scraping results or user-generated content that may contain HTML entities.
extractTextFromHTML
Extracts plain text from HTML or XHTML content.
Syntax: `extractTextFromHTML(html)`
Arguments:
- `html` (String): Input HTML string.
Returns:
- Extracted plain text. (String)
Extraction rules:
- Comments are skipped.
- CDATA is included verbatim.
- `<script>` and `<style>` elements are removed with all their content.
- Other tags are removed, but their content is kept.
- HTML entities are not decoded.
- Whitespace is collapsed or inserted according to specific rules.
The function does not fully conform to HTML, XML, or XHTML specifications but provides a reasonably accurate and fast implementation for most use cases.
ascii
Returns the ASCII code point (as Int32) of the first character of a string.
Syntax: `ascii(s)`
Arguments:
- `s` (String): Input string.
Returns:
- The ASCII code of the first character. (Int32)
- If `s` is empty, returns 0.
- If the first character is not an ASCII character, the result is undefined.
For non-ASCII characters or empty strings, consider using additional checks or alternative functions for more robust handling.
soundex
Returns the Soundex code of a string.
Syntax: `soundex(val)`
Arguments:
- `val` (String): Input string.
Returns:
- The Soundex code of the input string. (String)
Soundex is primarily designed for English words and names. It may not work as effectively for other languages or non-Latin alphabets.
punycodeEncode
Encodes a UTF-8 string using Punycode encoding.
Syntax: `punycodeEncode(string)`
Arguments:
- `string` (String): The UTF-8 encoded input string to be encoded.
Returns:
- A Punycode-encoded representation of the input string. (String)
punycodeDecode
Decodes a Punycode-encoded string back into a UTF-8 encoded plaintext string.
Syntax: `punycodeDecode(encoded)`
Arguments:
- `encoded` (String): A Punycode-encoded string.
Returns:
- The decoded UTF-8 plaintext string. (String)
If the input is not a valid Punycode-encoded string, an exception is thrown. For a version that returns an empty string instead of throwing an exception, see `tryPunycodeDecode`.
tryPunycodeDecode
Attempts to decode a Punycode-encoded string. If the input is not a valid Punycode-encoded string, it returns an empty string instead of throwing an exception.
Syntax: `tryPunycodeDecode(encoded)`
Arguments:
- `encoded` (String): A Punycode-encoded string.
Returns:
- The decoded string if successful, or an empty string if decoding fails. (String)
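The encode, decode, and "try" pattern can be sketched with Python's built-in `punycode` codec. The function names below are hypothetical, and the sketch only shows the contract (an empty string instead of an exception); it makes no claim about which inputs ClickHouse itself rejects.

```python
def punycode_encode(text: str) -> str:
    """Punycode-encode a Unicode string into ASCII."""
    return text.encode("punycode").decode("ascii")


def punycode_decode(encoded: str) -> str:
    """Decode Punycode; raises UnicodeError on input the codec cannot handle."""
    return encoded.encode("ascii").decode("punycode")


def try_punycode_decode(encoded: str) -> str:
    """Like punycode_decode, but returns an empty string instead of raising."""
    try:
        return punycode_decode(encoded)
    except UnicodeError:
        return ""


print(punycode_encode("M\u00fcnchen"))      # Mnchen-3ya
print(try_punycode_decode("Mnchen-3ya"))    # München
print(repr(try_punycode_decode("\u00f1")))  # '' (Punycode text must be ASCII)
```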
idnaEncode
Encodes a domain name using the Internationalized Domain Names in Applications (IDNA) mechanism, converting it to its ASCII representation (ToASCII algorithm).
Syntax: `idnaEncode(domain)`
Arguments:
- `domain` (String): A UTF-8 encoded domain name string.
Returns:
- The ASCII representation of the domain name according to the IDNA mechanism. (String)
Notes:
- The input string must be UTF-8 encoded and translatable to an ASCII string, otherwise an exception is thrown.
- No percent decoding or trimming of tabs, spaces, or control characters is performed.
- For error handling without exceptions, use the `tryIdnaEncode` function.
tryIdnaEncode
Attempts to encode a domain name using the Internationalized Domain Names in Applications (IDNA) mechanism, returning an empty string if the encoding fails.
Syntax: tryIdnaEncode(domain)
Arguments:
- domain (String): The domain name to encode.
Returns:
- The ASCII representation of the domain name according to the IDNA ToASCII algorithm. (String)
- An empty string if the encoding fails.

- ‘tacos.méxico.com’ is successfully encoded to its IDNA representation.
- ‘invalid domain!@#’ fails to encode, resulting in an empty string.
idnaDecode
Returns the Unicode (UTF-8) representation of a domain name according to the Internationalized Domain Names in Applications (IDNA) mechanism.
Syntax: idnaDecode(domain)
Arguments:
- domain (String): An IDNA-encoded domain name.
Returns:
- The Unicode (UTF-8) representation of the input domain name. (String)

- If the input is invalid, the function returns the input string without modification.
- Repeated application of idnaEncode() and idnaDecode() may not return the original string due to case normalization.
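The case-normalization caveat is easy to see with Python's built-in idna codec. This is only an analogy: Python implements the older IDNA 2003 rules, the helper name is hypothetical, and the database functions may follow a different revision of the standard.

```python
# Round trip through Python's standard "idna" codec.
# Assumption: used as an analogy for idnaEncode()/idnaDecode(); the exact
# mapping rules of the SQL functions may differ.

def idna_round_trip(domain: str) -> str:
    ascii_form = domain.encode("idna")  # ToASCII: non-ASCII labels become xn--...
    return ascii_form.decode("idna")    # ToUnicode: back to readable text


original = "MÜNCHEN.de"
restored = idna_round_trip(original)
print(original, "->", restored)  # the non-ASCII label comes back lowercased
```

Because the encoder lowercases the label before converting it, comparisons against the original input should be case-insensitive or performed on the encoded form.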
byteHammingDistance
Calculates the Hamming distance between two byte strings.
Syntax: byteHammingDistance(string1, string2)
Alias: mismatches
Arguments:
- string1 (String): First input string.
- string2 (String): Second input string.
Returns:
- The Hamming distance between the two strings. (UInt64)
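As a rough illustration, the Hamming distance counts the positions at which two byte strings differ. The Python sketch below assumes that, for inputs of unequal length, every unmatched trailing byte counts as one mismatch; that handling is an assumption of this example and is not stated in this reference.

```python
# Byte-wise Hamming distance, written as a standalone illustration.
# Assumption: extra bytes in the longer string each count as one mismatch.

def byte_hamming_distance(a: bytes, b: bytes) -> int:
    # Compare position by position over the shared prefix length.
    mismatches = sum(x != y for x, y in zip(a, b))
    # Account for the bytes that have no counterpart.
    return mismatches + abs(len(a) - len(b))


print(byte_hamming_distance(b"karolin", b"kathrin"))
```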
stringJaccardIndex
Calculates the Jaccard similarity index between two strings.
Syntax: stringJaccardIndex(string1, string2)
Arguments:
- string1 (String): First input string.
- string2 (String): Second input string.
Returns:
- The Jaccard similarity index as a number between 0 and 1. (Float64)
The Jaccard index is calculated as the size of the intersection divided by the size of the union of two sets. In the context of strings, each string is treated as a set of characters. A higher value indicates greater similarity between the strings.
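The set-based definition above can be written out in a few lines of Python. This is a sketch of the formula, not the database's implementation; the helper name and the result for two empty strings are assumptions.

```python
# Jaccard index over the sets of bytes in each string.
# Assumption: two empty inputs yield 0.0; the SQL function may differ.

def string_jaccard_index(a: bytes, b: bytes) -> float:
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


# "clickhouse" has 9 distinct bytes, "mouse" has 5, and they share 4
# ("o", "u", "s", "e"), so the union has 10 members.
print(string_jaccard_index(b"clickhouse", b"mouse"))
```

Because each string is reduced to a set, repeated characters and character order have no effect on the result.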
stringJaccardIndexUTF8
Calculates the Jaccard similarity index between two UTF-8 encoded strings.
Syntax: stringJaccardIndexUTF8(string1, string2)
Arguments:
- string1 (String): First input string.
- string2 (String): Second input string.
Returns:
- The Jaccard similarity index between the two strings, as a number between 0 and 1. (Float64)
The function assumes that the input strings contain valid UTF-8 encoded text. If this assumption is violated, the behavior is undefined and no exception is thrown.
editDistance
Calculates the edit distance between two strings.
Syntax: editDistance(string1, string2)
Alias: levenshteinDistance
Arguments:
- string1 (String): First input string.
- string2 (String): Second input string.
Returns:
- The edit distance between the two input strings. (UInt32)
The edit distance is also known as the Levenshtein distance.
editDistanceUTF8
Calculates the edit distance between two UTF-8 encoded strings.
Syntax: editDistanceUTF8(string1, string2)
Arguments:
- string1 (String): First UTF-8 encoded string to compare.
- string2 (String): Second UTF-8 encoded string to compare.
Returns:
- The edit distance between the two strings. (UInt32)
The function assumes that the input strings are valid UTF-8 encoded text. If this assumption is violated, the behavior is undefined and no exception is thrown.
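The practical difference between editDistance and editDistanceUTF8 is the unit being compared: bytes versus Unicode code points. The Python sketch below implements the standard Levenshtein recurrence over any sequence, so the same function can be run on both representations. It is an illustration under that assumption, with a hypothetical helper name.

```python
# Levenshtein distance over arbitrary sequences (str or bytes).
# Assumption: a textbook dynamic-programming version for illustration.

def edit_distance(a, b) -> int:
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


word_a, word_b = "niño", "nino"
print(edit_distance(word_a, word_b))                                  # code points
print(edit_distance(word_a.encode("utf-8"), word_b.encode("utf-8")))  # bytes
```

The "ñ" occupies two bytes in UTF-8, so the byte-level comparison reports a larger distance than the code-point comparison for the same pair of words.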
damerauLevenshteinDistance
Calculates the Damerau-Levenshtein distance between two strings.
Syntax: damerauLevenshteinDistance(string1, string2)
Arguments:
- string1 (String): First input string.
- string2 (String): Second input string.
Returns:
- The Damerau-Levenshtein distance between the two input strings. (UInt32)
The Damerau-Levenshtein distance is an extension of the Levenshtein distance that allows for transpositions of adjacent characters in addition to insertions, deletions, and substitutions.
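The effect of allowing transpositions can be sketched in Python. The version below is the restricted variant (often called optimal string alignment), chosen because it is the shortest to write down; whether the database function uses this variant or the unrestricted one is not stated here, so treat the choice as an assumption.

```python
# Restricted Damerau-Levenshtein (optimal string alignment) distance.
# Assumption: illustrative variant; the SQL function may use a different one.

def osa_distance(a: str, b: str) -> int:
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Swap of two adjacent characters counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("ab", "ba"))  # one transposition instead of two substitutions
```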
jaroSimilarity
Calculates the Jaro similarity between two strings.
Syntax: jaroSimilarity(string1, string2)
Arguments:
- string1 (String): First input string.
- string2 (String): Second input string.
Returns:
- A number representing the Jaro similarity between the two input strings. (Float64)
jaroWinklerSimilarity
Calculates the Jaro-Winkler similarity between two strings.
Syntax: jaroWinklerSimilarity(string1, string2)
Arguments:
- string1 (String): First input string.
- string2 (String): Second input string.
Returns:
- A number representing the Jaro-Winkler similarity between the two input strings. (Float64)
- The value ranges from 0 to 1, where 1 indicates identical strings and 0 indicates no similarity.
The Jaro-Winkler similarity is particularly useful for short strings such as person names. It gives more favorable ratings to strings that match from the beginning, making it suitable for comparing things like product names or ingredients in a recipe.
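The prefix bonus described above can be sketched in Python. The code follows the commonly published definitions of Jaro and Jaro-Winkler, with a scaling factor of 0.1 and a prefix cap of 4 characters; those constants, the helper names, and the handling of empty inputs are assumptions of this example rather than documented properties of the SQL functions.

```python
# Jaro and Jaro-Winkler similarity, following the usual textbook definitions.
# Assumption: prefix_scale=0.1 and max_prefix=4 are the conventional defaults.

def jaro_similarity(a: str, b: str) -> float:
    if a == b:
        return 1.0  # assumption: identical inputs, including empty, score 1
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        lo, hi = max(0, i - window), min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_matched[j] and b[j] == ch:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions.
    a_seq = [a[i] for i in range(len(a)) if a_matched[i]]
    b_seq = [b[j] for j in range(len(b)) if b_matched[j]]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (
        matches / len(a) + matches / len(b) + (matches - transpositions) / matches
    ) / 3


def jaro_winkler_similarity(a: str, b: str,
                            prefix_scale: float = 0.1,
                            max_prefix: int = 4) -> float:
    jaro = jaro_similarity(a, b)
    prefix = 0
    for x, y in zip(a, b):
        if x != y or prefix == max_prefix:
            break
        prefix += 1
    # Boost the score in proportion to the shared prefix length.
    return jaro + prefix * prefix_scale * (1 - jaro)


print(jaro_similarity("MARTHA", "MARHTA"))
print(jaro_winkler_similarity("MARTHA", "MARHTA"))
```

For this pair the two strings share the prefix "MAR", so the Jaro-Winkler score is higher than the plain Jaro score.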
initcap
Converts the first letter of each word in a string to uppercase and the rest to lowercase.
Syntax: initcap(string)
Arguments:
- string (String): The input string.
Returns:
- A string with the first letter of each word capitalized. (String)
This function may produce unexpected results for words containing apostrophes or internal capital letters. For example, an apostrophe is treated as a word boundary, so an 's' that follows it is capitalized. This is known behavior with no current plans for modification.
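The word-boundary behavior can be reproduced with a small Python sketch. It assumes that a word is a run of ASCII letters and digits and that every other character, including an apostrophe, ends the current word; the helper name is hypothetical and the rule is a simplification for illustration.

```python
# ASCII-only "initial capitals" sketch.
# Assumption: words are runs of ASCII alphanumerics; anything else is a boundary.

def initcap_ascii(text: str) -> str:
    out = []
    in_word = False
    for ch in text:
        if ch.isascii() and ch.isalnum():
            # First character of a word is uppercased, the rest lowercased.
            out.append(ch.lower() if in_word else ch.upper())
            in_word = True
        else:
            out.append(ch)
            in_word = False
    return "".join(out)


print(initcap_ascii("joe's taco"))  # the apostrophe starts a new "word"
print(initcap_ascii("macDonald"))   # the internal capital is lowercased
```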
initcapUTF8
Converts the first letter of each word in a UTF-8 encoded string to uppercase and the rest to lowercase.
Syntax: initcapUTF8(string)
Arguments:
- string (String): The input string to be converted.
Returns:
- A string with the first letter of each word capitalized. (String)

- This function does not detect language, so results may not be exactly correct for some languages (e.g., Turkish with i/İ vs. i/I).
- If the length of the UTF-8 byte sequence differs between upper and lower case for a code point, the result may be incorrect for that code point.
- Words are defined as sequences of alphanumeric characters separated by non-alphanumeric characters.
Unlike the initcap function, initcapUTF8 correctly handles multi-byte UTF-8 characters, making it suitable for text in various languages and scripts.
firstLine
Returns the first line from a multi-line string.
Syntax: firstLine(s)
Arguments:
- s (String): Input string.
Returns:
- The first line of the input string. (String)
Example: