String functions
Manipulate and analyze text data.
String functions offer a wide range of capabilities for text manipulation and analysis. These functions cover:
- Basic operations: emptiness checks, whitespace trimming.
- Complex tasks: pattern matching, substring extraction, text normalization.
- Advanced techniques: regex matching, Unicode normalization.
ClickHouse function reference
empty
Arguments:
x
(String
) — Input string.
Returns:
- Returns 1 for an empty string or 0 for a non-empty string. (
UInt8
)
Example:
Result:
This function checks if a string contains any characters, including spaces. It returns 1 (true) for an empty string and 0 (false) for any non-empty string, even if it only contains whitespace.
The empty
function is also available for [arrays] and [UUIDs].
notEmpty
Checks whether the input string is non-empty. A string is considered non-empty if it contains at least one byte, even if this byte is a space or a null byte.
Syntax:
Arguments:
x
(String
): Input value.
Returns:
- Returns 1 for a non-empty string or 0 for an empty string. [
UInt8
]
Example:
Result:
In this example, notEmpty
returns 0 for an empty string and 1 for a non-empty string containing the word “guacamole”.
This function is also available for [arrays] and [UUIDs].
length
Returns the length of a string in bytes (not characters or Unicode code points).
Syntax
Arguments:
s
(String
orArray
): The input string or array.
Returns:
- The length of
s
in bytes. [UInt64
]
Example:
Result:
This function also works with arrays:
Result:
In the case of strings, length
counts bytes, not characters. For example, some UTF-8 characters may occupy more than one byte:
Result:
Note that ‘Jalapeño’ is 8 characters long, but occupies 9 bytes due to the ‘ñ’ character.
To count Unicode characters instead of bytes, use the lengthUTF8
function.
lengthUTF8
Returns the length of a string in Unicode code points (not in bytes).
Syntax
Alias:
- char_length
- character_length
Arguments:
s
(String
): The input string. Must contain valid UTF-8 encoded text.
Returns:
- The length of the string
s
in Unicode code points. [UInt64
]
Example:
Result:
In this example, lengthUTF8
correctly counts the Unicode characters in the Spanish greeting and taco reference, including the inverted exclamation mark.
- This function assumes the input string contains valid UTF-8 encoded text. If this assumption is violated, the result is undefined and no exception is thrown.
lengthUTF8
differs from thelength
function, which returns the length in bytes.- For ASCII strings,
lengthUTF8
andlength
will return the same result.
See Also
length
: Returns the length of a string in bytes.
left
Returns a substring of string s with a specified offset starting from the left.
Syntax:
Arguments:
s
(String
orFixedString
): The string to calculate a substring from.offset
(UInt*
): The number of bytes of the offset.
Returns:
- For positive offset: A substring of s with offset many bytes, starting from the left of the string.
- For negative offset: A substring of s with length(s) - |offset| bytes, starting from the left of the string.
- An empty string if length is 0.
Example:
Result:
This example extracts the first 7 characters from the string ‘Crunchy Taco’, resulting in ‘Crunchy’.
Result:
This example extracts all but the last 6 characters from ‘Spicy Burrito’, resulting in ‘Spicy Bur’.
The left
function operates on bytes, not characters. For UTF-8 encoded strings, consider using leftUTF8
to work with characters instead of bytes.
leftUTF8
Returns a substring of a UTF-8 encoded string with a specified number of characters from the left.
Syntax
Arguments:
s
(String
): The UTF-8 encoded string to extract the substring from.length
(UInt
): The number of characters to extract.
Returns:
- A substring of
s
containing the leftmostlength
characters. (String
)
Example:
Result:
This function is useful when you need to extract a fixed number of characters from the beginning of a UTF-8 string, such as getting the first few words of a taco description or the prefix of a menu item name.
If length
is greater than the number of characters in the string, the entire string is returned. If length
is 0 or negative, an empty string is returned.
leftPad
Pads a string from the left with spaces or with a specified string (multiple times, if needed) until the resulting string reaches the specified length.
Syntax:
Alias:
- LPAD
Arguments:
string
(String
): Input string that should be padded.length
(UInt
orInt
): The length of the resulting string. If the value is smaller than the input string length, then the input string is shortened to length characters.pad_string
(String
, optional): The string to pad the input string with. If not specified, then the input string is padded with spaces.
Returns:
- A left-padded string of the given length. (
String
)
Example:
Result:
In this example:
- ‘Taco’ is padded with asterisks to a length of 10 characters.
- ‘Burrito’ is padded with spaces to a length of 10 characters.
The function measures string length in bytes, not in characters. For Unicode strings, consider using leftPadUTF8
instead.
leftPadUTF8
Pads a string from the left with spaces or a specified string (multiple times, if needed) until the resulting string reaches the given length. Unlike leftPad
which measures the string length in bytes, this function measures the string length in Unicode code points.
Syntax:
Arguments:
string
(String
): Input string that should be padded.length
(UInt
orInt
): The length of the resulting string. If the value is smaller than the input string length, then the input string is shortened tolength
characters.pad_string
(String
, optional): The string to pad the input string with. If not specified, then the input string is padded with spaces.
Returns:
- A left-padded string of the given length. (
String
)
Example:
Result:
In this example:
padded_taco
is padded with taco emojis to reach a length of 10 Unicode characters.padded_burrito
is padded with spaces to reach a length of 10 Unicode characters.
This function is particularly useful when working with strings containing non-ASCII characters or emojis, ensuring consistent visual alignment based on the number of Unicode characters rather than bytes.
right
Returns a substring of a string starting from the right.
Syntax
Arguments:
s
(String
orFixedString
): The input string.offset
(UInt*
): The number of characters to extract.
Returns:
- For positive
offset
: A substring ofs
withoffset
many characters, starting from the right of the string. - For negative
offset
: A substring ofs
withlength(s) - |offset|
characters, starting from the right of the string. - An empty string if
offset
is 0.
Example:
Result:
In this example:
last_four
returns the last 4 characters of ‘Crunchy Taco’.all_but_last_four
returns all but the last 4 characters of ‘Soft Taco’.
rightUTF8
Returns a substring of a UTF-8 encoded string with a specified offset starting from the right.
Syntax:
Arguments:
s
(String
orFixedString
): The UTF-8 encoded string to calculate a substring from.offset
(UInt*
): The number of characters for the offset.
Returns:
- For positive offset: A substring of
s
withoffset
many characters, starting from the right of the string. - For negative offset: A substring of
s
withlength(s) - |offset|
characters, starting from the right of the string. - An empty string if
offset
is 0.
Example:
Result:
Result:
This function assumes that the string contains valid UTF-8 encoded text. If this assumption is violated, no exception is thrown and the result is undefined.
The function counts characters, not bytes. This is important for multi-byte UTF-8 characters.
rightPad
Pads a string from the right with spaces or with a specified string (multiple times, if needed) until the resulting string reaches the specified length.
Syntax:
Alias:
- RPAD
Arguments:
string
(String
) — Input string that should be padded.length
(UInt
orInt
) — The length of the resulting string. If the value is smaller than the input string length, then the input string is shortened to length characters.pad_string
(String
, optional) — The string to pad the input string with. If not specified, then the input string is padded with spaces.
Returns:
- A right-padded string of the given length. (
String
)
Example:
Result:
In this example, the ‘Taco’ string is right-padded with taco emojis until it reaches a length of 10 characters. This function can be useful for formatting strings to a fixed width, which is particularly handy when creating text-based reports or aligning data in console output.
rightPadUTF8
Pads a string from the right with spaces or a specified string (multiple times, if needed) until the resulting string reaches the given length. Unlike rightPad
which measures the string length in bytes, this function measures the string length in Unicode code points.
Syntax
Arguments:
string
(String
) — Input string that should be padded.length
(UInt
orInt
) — The desired length of the resulting string.pad_string
(String
, optional) — The string to pad the input string with. If not specified, the input string is padded with spaces.
Returns:
- A right-padded string of the given length. (
String
)
Example:
Result:
In this example:
- ‘Taco’ is padded with taco emojis to reach a length of 10 Unicode code points.
- ‘Burrito’ is padded with spaces to reach a length of 10 Unicode code points.
This function is particularly useful when working with strings containing non-ASCII characters or emojis, ensuring consistent string lengths based on visible characters rather than byte count.
lower
Converts ASCII Latin symbols in a string to lowercase.
Syntax
Alias:
- lcase
Arguments:
input
(String
): The input string.
Returns:
- A string with all ASCII Latin characters converted to lowercase. [
String
]
Example:
Result:
This function only affects ASCII Latin characters (A-Z). It does not change other characters or Unicode symbols. For Unicode-aware lowercase conversion, use the lowerUTF8
function.
upper
Converts ASCII Latin characters in a string to uppercase.
Syntax
Alias:
- ucase
Arguments:
input
(String
): The input string.
Returns:
- A string with ASCII Latin characters converted to uppercase. [
String
]
Example:
Result:
This function only affects ASCII Latin characters (a-z). It does not change other characters or consider language-specific uppercase rules.
lowerUTF8
Converts a string to lowercase, assuming the string contains valid UTF-8 encoded text.
Syntax
Arguments:
input
(String
): A string to convert.
Returns:
- The input string converted to lowercase. (
String
)
Example:
Result:
This function does not detect the language. For some languages (e.g., Turkish), the result might not be exactly correct due to specific uppercase/lowercase rules. If the length of the UTF-8 byte sequence differs between upper and lower case for a code point, the result may be incorrect for that code point.
Unlike the lower
function, lowerUTF8
correctly handles UTF-8 encoded characters beyond the ASCII range.
upperUTF8
Converts a string to uppercase, assuming the string contains valid UTF-8 encoded text.
Syntax
Arguments:
input
(String
): A string to convert.
Returns:
- The input string converted to uppercase. [
String
]
Example:
Result:
This function does not detect the language, so results might not be exactly correct for some languages (e.g., Turkish). If the length of the UTF-8 byte sequence differs between upper and lower case for a code point, the result may be incorrect for that code point.
isValidUTF8
Checks whether a string contains valid UTF-8 encoded text.
Syntax
Arguments:
input
(String
): The string to check.
Returns:
- Returns 1 if the input string contains valid UTF-8 encoded text, 0 otherwise. [
UInt8
]
Example:
Result:
In this example, ‘Spicy jalapeño taco’ is valid UTF-8, while ‘Broken \xC3\x28 taco’ contains an invalid UTF-8 sequence.
This function is useful for validating input data, especially when working with user-generated content or data from external sources where UTF-8 encoding is expected but not guaranteed.
toValidUTF8
Replaces invalid UTF-8 characters with the � (U+FFFD) replacement character. Consecutive invalid characters are collapsed into a single replacement character.
Syntax
Arguments:
input_string
(String
): Any set of bytes represented as a String data type.
Returns:
- A valid UTF-8 string. (
String
).
Example:
Result:
In this example, the invalid UTF-8 sequence \xF0\x80\x80\x80
is replaced with a single � character in the resulting string.
This function is useful when dealing with potentially corrupted or improperly encoded text data, ensuring that the output is always valid UTF-8 for further processing or display.
repeat
Repeats a string a specified number of times.
Syntax
Arguments:
s
(String
): The string to repeat.n
(UInt
orInt
): The number of times to repeat the string.
Returns:
A string containing s
repeated n
times. If n
<=
0, the function returns an empty string. (String
)
Example:
Result:
In this example, we create a taco chant by repeating the word “Taco” three times.
If you need to repeat a string a large number of times, be mindful of the resulting string length to avoid excessive memory usage.
space
Returns a string consisting of a specified number of space characters.
Syntax:
Alias:
- SPACE
Arguments:
n
(UInt
orInt
): The number of space characters to generate.
Returns:
A string containing n
space characters. If n
<=
0, an empty string is returned. [String
]
Example:
Result:
This function is useful for padding strings or creating visual separators in query results. For example, you could use it to align taco ingredients in a formatted list:
Result:
In this example, space()
is used to create a consistent alignment for the ingredients list, regardless of the taco name length.
reverse
Reverses the sequence of bytes in a string.
Syntax:
Arguments:
s
(String
): The input string.
Returns:
- A string with the bytes in reverse order. [
String
]
Example:
Result:
This function reverses the byte order of the input string. For ASCII strings, this effectively reverses the characters. However, for multi-byte encodings like UTF-8, it may produce unexpected results as it operates on bytes, not characters.
For proper handling of Unicode characters, use the reverseUTF8
function instead.
reverseUTF8
Reverses a sequence of Unicode code points in a string.
Syntax
Arguments:
str
(String
): The input string to reverse.
Returns:
- A string with the sequence of Unicode code points reversed. (
String
)
Description
This function reverses the order of Unicode code points in the input string. It assumes that the string contains valid UTF-8 encoded text. If this assumption is violated, no exception is thrown and the result is undefined.
Unlike the reverse
function, which reverses the sequence of bytes, reverseUTF8
preserves the integrity of multi-byte Unicode characters.
Example:
Result:
In this example, the Unicode emoji 🌶️ is treated as a single code point and remains intact in the reversed string.
Be cautious when using this function with strings that contain combining characters or complex scripts, as the visual representation of the reversed text might not always be meaningful.
concat
Concatenates the given string arguments.
Syntax
Arguments:
s1
,s2
, … (String
orFixedString
): Strings to concatenate.
Returns:
- A
String
created by concatenating the arguments.
- If any argument is
NULL
, the function returnsNULL
. - Arguments of other types are automatically converted to strings, but this may impact performance.
Example:
Result:
This example concatenates multiple strings to create a taco description.
concatAssumeInjective
Concatenates strings, assuming the result is injective (unique) for different input values. This can be used for query optimization, particularly with GROUP BY operations.
Syntax
Arguments:
s1
,s2
, … (String
orFixedString
): Strings to concatenate.
Returns:
- The concatenated string. (
String
)
Example:
Result:
This example assumes that concatenating ‘Carne ’ and ‘Asada’ always produces a unique result, allowing for potential query optimization.
Use this function only when you are certain that the concatenation result is unique for different inputs. Incorrect usage may lead to wrong query results.
concatWithSeparator
Concatenates strings with a specified separator.
Syntax
Alias:
- concat_ws
Arguments:
separator
(String
): The string to use as a separator.s1
,s2
, … (String
): The strings to concatenate.
Returns:
- A string containing all the input strings concatenated with the separator between them. [
String
]
Example:
Result:
This example creates a comma-separated list of taco fillings.
If any of the input strings is NULL, it is skipped in the result without adding a separator.
concatWithSeparatorAssumeInjective
Concatenates strings with a specified separator, assuming the result is injective. This function can be used for optimization of GROUP BY operations.
Syntax
Arguments:
separator
(String
): The string to use as a separator between the concatenated strings.s1
,s2
, … (String
): The strings to concatenate.
Returns:
- A string created by concatenating the input strings with the specified separator. (
String
)
This function assumes that the concatenation result is injective, meaning different arguments always produce different results. This assumption allows for certain query optimizations.
Example:
Result:
In this example, we concatenate taco fillings with a comma separator and use the result for grouping. The function assumes that different combinations of fillings will always produce unique strings, allowing for potential query optimizations.
substring
Returns a substring of a string starting at a specified position.
Syntax
Aliases:
- substr
- mid
Arguments:
s
(String
): The input string.offset
(Integer
): The starting position of the substring (1-based index).length
(Integer
, optional): The maximum length of the substring.
Returns:
- A substring of
s
. (String
)
Example:
Result:
In this example:
without_length
returns the substring starting from the 7th character to the end.with_length
returns a 5-character substring starting from the 7th character.
- If
offset
is 0, an empty string is returned. - If
offset
is negative, the substring starts that many characters from the end of the string. - If
length
is omitted, the substring extends to the end of the string.
substringUTF8
Returns a substring of a UTF-8 encoded string starting at a specified position.
Syntax
Arguments:
s
(String
): The UTF-8 encoded string to extract the substring from.offset
(Integer
): The starting position of the substring.length
(Integer
, optional): The maximum length of the substring.
Returns:
- A substring of
s
. (String
)
Example:
Result:
This function assumes that the input string contains valid UTF-8 encoded text. If this assumption is violated, the result is undefined and no exception is thrown.
The offset
argument is 1-based, meaning the first character is at position 1. If offset
is 0, an empty string is returned. If offset
is negative, the substring starts that many characters from the end of the string.
If the optional length
argument is omitted, the substring extends to the end of the string.
This function operates on Unicode code points rather than bytes, making it suitable for strings containing multi-byte characters.
substringIndex
Returns a substring of a string before a specified number of occurrences of a delimiter.
Syntax
Alias:
- SUBSTRING_INDEX
Arguments:
s
(String
): The input string to extract the substring from.delim
(String
): The delimiter string.count
(Int
orUInt
): The number of delimiter occurrences.- If positive, counts from the left
- If negative, counts from the right
Returns:
- A substring of
s
. (String
)
Example:
Result:
In this example, substringIndex
returns everything before the second occurrence of the comma, giving us the first two taco toppings from the list.
If count
is 0 or the delimiter is not found, the function returns an empty string. If count
exceeds the number of delimiters, the entire string is returned.
substringIndexUTF8
Returns the substring of a UTF-8 encoded string before a specified number of occurrences of a delimiter.
Syntax
Arguments:
s
(String
): The UTF-8 encoded string to extract the substring from.delim
(String
): The delimiter string.count
(Int
): The number of delimiter occurrences to count before extracting the substring. If positive, counts from the left; if negative, counts from the right.
Returns:
- A substring of
s
. (String
)
Example:
Result:
In this example, the function returns the substring before the second occurrence of ’, ’ in the taco menu string.
This function assumes that the input string contains valid UTF-8 encoded text. If this assumption is violated, the behavior is undefined and no exception is thrown.
appendTrailingCharIfAbsent
Appends a specified character to the end of a string if it’s not already present.
Syntax:
Arguments:
s
(String
): The input string.c
(String
): The character to append. (length 1)
Returns:
- A string with the specified character appended if it wasn’t already present. (
String
)
Example:
Result:
In this example:
with_slash
adds a trailing ’/’ to ‘Taco’ since it wasn’t present.already_has_slash
doesn’t modify ‘Burrito/’ because it already ends with ’/‘.
This function is useful for ensuring strings have a consistent ending, which can be helpful when working with paths or URLs.
convertCharset
Converts a string from one character encoding to another.
Syntax:
Arguments:
s
(String
): The input string to convert.from
(String
): The source character encoding.to
(String
): The target character encoding.
Returns:
- The converted string. (
String
)
Example:
Result:
In this example, the function converts the UTF-8 encoded string ‘Jalapeño’ to ASCII, replacing the ‘ñ’ character with ‘n’.
The available character encodings depend on the ICU library version used during ClickHouse compilation. Make sure to use valid encoding names supported by your ClickHouse installation.
base58Encode
Encodes a string using Base58 in the “Bitcoin” alphabet.
Syntax:
Arguments:
plaintext
(String
): String to encode.
Returns:
- The Base58 encoded string. [
String
]
Example:
Result:
This example encodes the phrase “Taco Tuesday” using Base58 encoding. The result is a compact, URL-safe string that can be used for various purposes, such as generating short identifiers for taco orders or menu items.
base58Decode
Decodes a Base58-encoded string.
Syntax
Arguments:
encoded
(String
): A Base58-encoded string.
Returns:
- The decoded value of the argument. (
String
)
Example:
Result:
In this example, we decode a Base58-encoded taco name. The function returns the original string ‘Carnitas’.
If the input string is not a valid Base58-encoded value, an exception is thrown. For a version that returns an empty string instead of throwing an exception, use tryBase58Decode()
.
tryBase58Decode
Attempts to decode a Base58-encoded string. If decoding fails, it returns an empty string instead of throwing an exception.
Syntax
Arguments:
encoded
(String
): A string containing Base58-encoded data.
Returns:
- A string containing the decoded value of the argument. (
String
) - An empty string if decoding fails.
Example:
Result:
In this example, we attempt to decode two strings:
- ‘3dc8KtHrwM’ is a valid Base58-encoded string that decodes to ‘Taco’.
- ‘invalid_base58’ is not a valid Base58-encoded string, so an empty string is returned.
This function is useful when you need to handle potentially invalid Base58-encoded data without causing query errors.
base64Encode
Encodes a string using Base64 encoding according to RFC 4648.
Syntax
Alias:
- TO_BASE64
Arguments:
input
(String
): The string to encode.
Returns:
- A Base64-encoded string. (
String
)
Example:
Result:
This example encodes the phrase “Spicy Taco” into its Base64 representation.
The function follows the standard Base64 encoding scheme, which is commonly used for encoding binary data for transmission over text-based protocols or storage in text formats.
base64URLEncode
Encodes a string using Base64 URL-safe encoding, according to RFC 4648.
Syntax
Arguments:
input
(String
): The string to encode.
Returns:
- A string containing the Base64 URL-safe encoded value of the input. [
String
]
Example:
Result:
This function is useful for encoding data that needs to be included in URLs or filenames. It replaces the standard Base64 +
and /
characters with -
and _
respectively, making the output safe for use in URLs without additional encoding.
The encoded output does not include padding =
characters, further improving URL compatibility.
base64Decode
Decodes a base64-encoded string.
Syntax
Arguments:
encoded
(String
): A base64-encoded string.
Returns:
- The decoded string. (
String
)
Example:
Result:
This example decodes the base64-encoded string ‘VGFjbyBUdWVzZGF5’, which represents “Taco Tuesday”.
If the input string is not a valid base64-encoded value, an exception is thrown. For a version that returns an empty string instead of throwing an exception, use tryBase64Decode
.
base64URLDecode
Decodes a base64-encoded URL string according to the URL-safe variant of base64 encoding specified in RFC 4648.
Syntax
Arguments:
encodedUrl
(String
): A string containing a base64-encoded URL with URL-safe modifications.
Returns:
- The decoded string. (
String
)
Example:
Result:
This function decodes the base64URL-encoded string ‘aHR0cHM6Ly90YWNvLmNvbQ’, which represents the URL ‘https://taco.com’.
If the input string is not a valid base64URL-encoded value, an exception is thrown. For a version that returns an empty string instead of throwing an exception, use tryBase64URLDecode
.
tryBase64Decode
Attempts to decode a Base64-encoded string. If decoding fails, it returns an empty string instead of throwing an exception.
Syntax
Arguments:
encoded
(String
): A string containing Base64-encoded data.
Returns:
- A string containing the decoded value of the argument. (
String
) - An empty string if decoding fails.
Example:
Result:
In this example, ‘VGFjbyBUdWVzZGF5’ is successfully decoded to ‘Taco Tuesday’, while the invalid Base64 string returns an empty result instead of throwing an error.
This function is useful when you need to handle potentially invalid Base64 data without interrupting your query execution.
tryBase64URLDecode
Decodes a Base64URL-encoded string. If the input is not a valid Base64URL-encoded string, it returns an empty string instead of throwing an exception.
Syntax
Arguments:
encodedString
(String
): A string containing Base64URL-encoded data.
Returns:
- The decoded string, or an empty string if decoding fails. [
String
]
Example:
Result:
In this example:
decoded_url
shows a successfully decoded Base64URL string.failed_decode
returns an empty string because the input is not valid Base64URL.
This function is useful when you need to safely decode Base64URL-encoded data without risking exceptions for invalid inputs, such as when processing user-supplied or potentially corrupted data in web applications.
endsWith
Returns whether a string ends with the specified suffix.
Syntax
Arguments:
str
(String
): The string to check.suffix
(String
): The suffix to look for.
Returns:
1
ifstr
ends withsuffix
,0
otherwise. [UInt8
]
Example:
Result:
This example checks if the string ‘Spicy Taco’ ends with ‘Taco’. The result is 1 (true), confirming that it does end with ‘Taco’.
endsWithUTF8
Returns whether a UTF-8 encoded string ends with the specified suffix.
Syntax
Arguments:
str
(String
): The string to check.suffix
(String
): The suffix to look for.
Returns:
1
ifstr
ends withsuffix
,0
otherwise. [UInt8
]
Example:
Result:
This function is UTF-8 aware, so it correctly handles multi-byte characters. For example:
Result:
Note that this function differs from the non-UTF8 version endsWith
in that it matches str
and suffix
by UTF-8 characters rather than bytes.
startsWith
Returns whether a string begins with a specified prefix.
Syntax
Arguments:
str
(String
): The string to check.prefix
(String
): The prefix to look for at the start ofstr
.
Returns:
1
ifstr
starts withprefix
,0
otherwise. [UInt8
]
Example:
Result:
This example checks if the string ‘Taco Tuesday’ starts with ‘Taco’. The result is 1 (true) because the string indeed begins with ‘Taco’.
startsWithUTF8
Returns whether a UTF-8 encoded string starts with the specified prefix.
Syntax
Arguments:
str
(String
): The UTF-8 encoded string to check.prefix
(String
): The prefix to look for at the start ofstr
.
Returns:
1
ifstr
starts withprefix
,0
otherwise. [UInt8
]
Example:
Result:
This function differs from startsWith
in that it matches str
and prefix
by UTF-8 characters rather than bytes. This is particularly useful when working with non-ASCII characters.
Example with non-ASCII characters
Result:
In this case, both functions return the same result. However, if the string started with a multi-byte UTF-8 character, the results might differ.
This function assumes that the input strings contain valid UTF-8 encoded text. If this assumption is violated, the behavior is undefined and no exception is thrown.
trim
Removes specified characters from the start and/or end of a string.
Syntax
Arguments:
trim_character
(String
): Character(s) to remove.input_string
(String
): String to trim.
Returns:
- A string with specified characters removed from the start and/or end. (
String
)
Example:
Result:
This example removes spaces, parentheses, and any combination thereof from both ends of the string, leaving the taco filling intact.
trimLeft
Removes the consecutive occurrences of whitespace (ASCII-character 32) from the start of a string.
Syntax:
Alias:
- ltrim
Arguments:
input_string
(String
): String to trim.
Returns:
- A string without leading whitespaces. [
String
]
Example:
Result:
This function is useful for cleaning up user input or formatting data where leading spaces might interfere with further processing or display. It only removes ASCII space characters (code 32) from the beginning of the string, leaving other whitespace characters (like tabs or newlines) and the rest of the string untouched.
trimRight
Removes the consecutive occurrences of whitespace (ASCII-character 32) from the end of a string.
Syntax:
Alias:
- rtrim
Arguments:
input_string
(String
): String to trim.
Returns:
- A string without trailing whitespaces. [
String
]
Example:
Result:
This function is useful for cleaning up strings that may have inconsistent trailing spaces, such as user input or data from external sources. It helps standardize string data for further processing or comparison operations.
trimBoth
Removes the consecutive occurrences of whitespace (ASCII-character 32) from both ends of a string.
Syntax:
Alias:
- trim(input_string)
Arguments:
input_string
(String
): String to trim.
Returns:
- A string without leading and trailing whitespaces. [
String
]
Example:
Result:
This function is useful for cleaning up user input or formatting text data where extra whitespace at the beginning or end of strings is undesirable. It’s particularly handy when processing taco order data where customers might accidentally include extra spaces in their input.
CRC32
Calculates the CRC32 checksum of a string using CRC-32-IEEE 802.3 polynomial and initial value 0xffffffff (zlib implementation).
Syntax:
Arguments:
str
(String
): Input string.
Returns:
- The CRC32 checksum of the string. [
UInt32
]
Example:
Result:
This example calculates the CRC32 checksum of a popular taco menu item. The checksum can be used for data integrity checks or as a simple hash function.
The CRC32 function is not cryptographically secure. For security-sensitive applications, use cryptographic hash functions instead.
CRC32IEEE
Calculates the CRC32 checksum of a string using the CRC-32-IEEE 802.3 polynomial.
Syntax
Arguments:
string
(String
): The input string to calculate the checksum for.
Returns:
- The CRC32 checksum. (
UInt32
)
Example:
Result:
This example calculates the CRC32 checksum of the taco name ‘Crunchy Taco Supreme’ using the IEEE polynomial. The checksum can be used for data integrity checks or as a simple hash function.
CRC64
Calculates the CRC64 checksum of a string, using the CRC-64-ECMA polynomial.
Syntax:
Arguments:
str
(String
): Input string.
Returns:
- The CRC64 checksum of the input string. [
UInt64
]
Example:
Result:
This example calculates the CRC64 checksum of a taco order name. The CRC64 function can be useful for data integrity checks or generating unique identifiers for strings.
normalizeQuery
Replaces literals, sequences of literals and complex aliases with placeholders in a SQL query string.
Syntax
Arguments:
query
(String
): The SQL query string to normalize.
Returns:
- A normalized version of the input query with placeholders. (
String
)
Description
This function performs the following normalizations:
- Replaces literals with
?
placeholders - Replaces sequences of literals with
?..
placeholders - Replaces complex aliases (containing whitespace, more than two digits, or at least 36 bytes long such as UUIDs) with
?
placeholders
This function is useful for analyzing and grouping similar queries, regardless of their specific literal values.
Example:
Result:
In this example, the literal values 5.99
and "hot"
are replaced with ?
placeholders in the normalized query.
The normalization process helps in identifying structurally similar queries, which can be beneficial for query analysis, caching, and optimization strategies.
normalizeQueryKeepNames
Replaces literals and sequences of literals with placeholders, but preserves complex aliases in SQL queries. This function is useful for analyzing complex query logs.
Syntax
Arguments:
query
(String
): The SQL query to normalize.
Returns:
- A normalized version of the input query with placeholders for literals, but preserving complex aliases. [
String
]
Example:
Result:
In this example:
normalizeQuery
replaces all literals and aliases with placeholders.normalizeQueryKeepNames
preserves the complex aliastotal_tacos
while still replacing other literals with placeholders.
This function is particularly useful when you want to analyze query patterns in logs while retaining important named elements for context.
normalizedQueryHash
Returns a 64-bit hash value for a normalized query string. This function is useful for analyzing query logs and identifying similar queries with different literal values.
Syntax
Arguments:
query
(String
): The SQL query string to be hashed.
Returns:
- A 64-bit hash value of the normalized query. (
UInt64
)
Details
The function performs the following steps:
- Normalizes the query by replacing literals, sequences of literals, and complex aliases with placeholders.
- Calculates a hash value for the normalized query.
This process allows queries with different literal values but the same structure to produce the same hash value.
Example:
Result:
In this example, both queries have the same structure but different literal values for the filling
column. The normalizedQueryHash
function returns the same hash value for both queries, demonstrating its ability to identify structurally similar queries.
This function is particularly useful for query analysis, performance tuning, and identifying frequently executed query patterns in large-scale systems.
Syntax:
Arguments:
query
(String
): The SQL query to be normalized and hashed.
Returns:
A 64-bit hash value of the normalized query. [UInt64
]
Description:
This function normalizes the input query by replacing literals with placeholders, but preserves complex aliases (containing whitespace, more than two digits, or at least 36 bytes long such as UUIDs). It then calculates a hash of the normalized query.
This function is useful for analyzing query logs and identifying similar queries while maintaining the readability of complex aliases.
Example:
Result:
In this example:
hash1
andhash2
are identical becausenormalizedQueryHash
treats the different fillings as placeholders.hash3
andhash4
are identical to each other but different fromhash1
andhash2
becausenormalizedQueryHashKeepNames
preserves the complex aliastotal_tacos_123
.
This function is particularly useful when you want to identify similar queries in your logs while maintaining the ability to distinguish between queries with different complex aliases or table names.
normalizeUTF8NFC
Converts a string to NFC (Normalization Form Canonical Composition) normalized form, assuming the string contains valid UTF-8 encoded text.
Syntax
Arguments:
s
(String
): Input string containing valid UTF-8 encoded text.
Returns:
- A string transformed to NFC normalization form. (
String
)
Example:
Result:
In this example:
- The original ‘â’ is represented by 2 bytes.
- After NFC normalization, it remains ‘â’ but is now represented as a single Unicode code point.
- The length of the normalized form is still 2 bytes, as it’s a single character that requires 2 bytes in UTF-8 encoding.
This function is particularly useful when you need to ensure consistent representation of Unicode characters, especially for string comparisons or when working with text that may contain composed characters.
If the input string is not valid UTF-8, no exception is thrown, but the result is undefined. Always ensure your input is valid UTF-8 before using this function.
normalizeUTF8NFD
Converts a string to NFD normalized form, assuming the string contains valid UTF-8 encoded text.
Syntax
Arguments:
s
(String
): UTF-8 encoded input string.
Returns:
- String transformed to NFD normalization form. (
String
)
Example:
Result:
In this example:
- The original ‘â’ is 2 bytes long.
- After NFD normalization, it’s decomposed into ‘a’ and the combining diacritic ‘◌̂’.
- The normalized form is 3 bytes long.
This function is useful for text processing tasks where you need to work with canonically decomposed Unicode characters, such as in certain string comparison or sorting operations in multilingual environments.
If the input string contains invalid UTF-8, no exception is thrown, but the result is undefined.
normalizeUTF8NFKC
Converts a string to NFKC normalized form, assuming the string contains valid UTF-8 encoded text.
Syntax
Arguments:
s
(String
): UTF-8 encoded input string.
Returns:
- String transformed to NFKC normalization form. [
String
]
Example:
Result:
In this example:
- We use a Spanish letter ‘ñ’ which can be represented in different Unicode forms.
original_length
shows the byte length of the original string.nfkc
shows the normalized form, which visually looks the same.nfkc_length
shows the byte length after normalization, which remains 2 in this case.
This function is useful for standardizing Unicode text, especially when dealing with characters that can have multiple representations. NFKC normalization can help ensure consistent string comparisons and improve text processing in multilingual applications.
normalizeUTF8NFKD
Converts a string to NFKD normalized form, assuming the string contains valid UTF-8 encoded text.
Syntax
Arguments:
s
(String
): UTF-8 encoded input string.
Returns:
String
: String transformed to NFKD normalization form.
Example:
Result:
In this example:
- We use a character ‘ñ’ (as in “jalapeño”).
- The original UTF-8 encoded ‘ñ’ has a length of 2 bytes.
- After NFKD normalization, it’s decomposed into ‘n’ and the combining tilde, resulting in a 3-byte string.
This function is useful for text processing tasks where you need to normalize Unicode strings, such as when implementing search functionality for a taco recipe database that includes ingredients with diacritical marks.
encodeXMLComponent
Escapes characters with special meaning in XML so they can be safely placed into XML text nodes or attributes.
Syntax
Arguments:
x
(String
): Input string to encode.
Returns:
- The input string with special XML characters escaped. [
String
]
Details
The following characters are replaced:
<
becomes<
&
becomes&
>
becomes>
"
becomes"
'
becomes'
This function is useful when constructing XML content to ensure that string values don’t break the XML structure.
Example:
Result:
In this example, the &
symbol is encoded to &
, ensuring it doesn’t interfere with XML parsing.
decodeXMLComponent
Decodes XML-encoded special characters in a string.
Syntax
Arguments:
x
(String
): The input string to decode.
Returns:
- The decoded string. (
String
)
Description
This function decodes the following XML entities:
"
to"
&
to&
'
to'
>
to>
<
to<
It also replaces numeric character references with their corresponding Unicode characters. Both decimal (like ✓
) and hexadecimal (like ✓
) forms are supported.
Example:
Result:
In this example, decodeXMLComponent
decodes the &
entity to &
and the numeric character reference ñ
to ñ
.
This function is useful when working with XML data or when you need to unescape HTML entities in text. It’s particularly handy when processing data from web scraping or API responses that may contain encoded characters.
decodeHTMLComponent
Decodes HTML entities and numeric character references in a string.
Syntax
Arguments:
encoded_string
(String
): A string containing HTML entities or numeric character references.
Returns:
- The decoded string with HTML entities and numeric character references converted to their corresponding characters. (
String
)
Description
This function decodes HTML entities (like &
, <
, >
) and numeric character references (like ✓
or ✓
) in the input string. It supports both decimal and hexadecimal forms of numeric character references.
Example:
Result:
In this example, the heart symbol (♥
) and the hot pepper emoji (🌶
) are decoded into their corresponding Unicode characters.
This function is useful when working with HTML-encoded text, especially when processing web scraping results or user-generated content that may contain HTML entities.
extractTextFromHTML
Extracts plain text from HTML or XHTML content.
Syntax
Arguments:
html
(String
): Input HTML string.
Returns:
- Extracted plain text. (
String
)
Description
This function extracts plain text from HTML or XHTML content using the following rules:
- Comments are skipped.
- CDATA is included verbatim.
<script>
and<style>
elements are removed with all their content.- Other tags are removed, but their content is kept.
- HTML entities are not decoded.
- Whitespace is collapsed or inserted according to specific rules.
Example:
Result:
This function is useful for converting HTML content into plain text for further analysis or processing, such as creating a bag of words from web page content.
The function does not fully conform to HTML, XML, or XHTML specifications but provides a reasonably accurate and fast implementation for most use cases.
ascii
Returns the ASCII code point (as Int32
) of the first character of a string.
Syntax
Arguments:
s
(String
): Input string.
Returns:
- The ASCII code of the first character. [
Int32
] - If
s
is empty, returns 0. - If the first character is not an ASCII character, the result is undefined.
Example:
Result:
This example returns the ASCII code for ‘T’, which is 84.
For non-ASCII characters or empty strings, consider using additional checks or alternative functions for more robust handling.
soundex
Returns the Soundex code of a string.
Syntax:
Arguments:
val
(String
): Input string.
Returns:
- The Soundex code of the input string. [
String
]
Example:
Result:
This function converts similar-sounding words to the same representation, allowing for phonetic matching. It’s particularly useful for name matching in databases where spelling may vary.
Soundex is primarily designed for English words and names. It may not work as effectively for other languages or non-Latin alphabets.
punycodeEncode
Encodes a UTF-8 string using Punycode encoding.
Syntax
Arguments:
string
(String
): The UTF-8 encoded input string to be encoded.
Returns:
- A Punycode-encoded representation of the input string. [
String
]
Example:
Result:
In this example, the function encodes the string “Spicy Jalapeño Taco” into its Punycode representation. Note how the non-ASCII character ‘ñ’ is encoded, while the ASCII characters remain unchanged.
This function is particularly useful when dealing with Internationalized Domain Names (IDNs) or any scenario where you need to represent Unicode strings using only ASCII characters.
punycodeDecode
Decodes a Punycode-encoded string back into a UTF-8 encoded plaintext string.
Syntax
Arguments:
encoded
(String
): A Punycode-encoded string.
Returns:
- The decoded UTF-8 plaintext string. (
String
)
Example:
Result:
In this example, the Punycode-encoded string ‘xn—mnchen-3ya’ is decoded to reveal the original UTF-8 string ‘münchen’.
If the input is not a valid Punycode-encoded string, an exception is thrown. For a version that returns an empty string instead of throwing an exception, see tryPunycodeDecode
.
This function is particularly useful when working with Internationalized Domain Names (IDNs) or other scenarios where non-ASCII characters need to be represented in a limited character set and then decoded back to their original form.
tryPunycodeDecode
Attempts to decode a Punycode-encoded string. If the input is not a valid Punycode-encoded string, it returns an empty string instead of throwing an exception.
Syntax:
Arguments:
encoded
(String
): A Punycode-encoded string.
Returns:
- The decoded string if successful, or an empty string if decoding fails. [
String
]
Example:
Result:
In this example, ‘xn—mnchen-3ya’ is successfully decoded to ‘münchen’, while ‘invalid-punycode’ results in an empty string due to failed decoding.
This function is useful when working with Internationalized Domain Names (IDNs) or other Punycode-encoded strings, especially when you want to handle potential decoding errors gracefully in your data processing pipeline.
idnaEncode
Encodes a domain name using the Internationalized Domain Names in Applications (IDNA) mechanism, converting it to its ASCII representation (ToASCII algorithm).
Syntax
Arguments:
domain
(String
): A UTF-8 encoded domain name string.
Returns:
- The ASCII representation of the domain name according to the IDNA mechanism. [
String
]
Example:
Result:
This function converts the Unicode domain name ‘taco.méxico.com’ to its Punycode equivalent ‘taco.xn—mxico-bsa.com’.
- The input string must be UTF-8 encoded and translatable to an ASCII string, otherwise an exception is thrown.
- No percent decoding or trimming of tabs, spaces, or control characters is performed.
- For error handling without exceptions, use the
tryIdnaEncode
function.
tryIdnaEncode
Attempts to encode a string using the Internationalized Domain Names in Applications (IDNA) mechanism, returning an empty string if the encoding fails.
Syntax
Arguments:
domain
(String
): The domain name to encode.
Returns:
- The ASCII representation of the domain name according to the IDNA ToASCII algorithm. (
String
) - An empty string if the encoding fails.
Example:
Result:
In this example:
- ‘tacos.méxico.com’ is successfully encoded to its IDNA representation.
- ‘invalid domain!@#’ fails to encode, resulting in an empty string.
This function is useful for safely handling internationalized domain names in applications, especially when dealing with potentially invalid input.
idnaDecode
Returns the Unicode (UTF-8) representation of a domain name according to the Internationalized Domain Names in Applications (IDNA) mechanism.
Syntax
Arguments:
domain
(String
): An IDNA-encoded domain name.
Returns:
- The Unicode (UTF-8) representation of the input domain name.
String
- If the input is invalid, the function returns the input string without modification.
- Repeated application of
idnaEncode()
andidnaDecode()
may not return the original string due to case normalization.
Example:
Result:
This example decodes an IDNA-encoded domain name, converting it back to its Unicode representation with proper diacritical marks.
byteHammingDistance
Calculates the Hamming distance between two byte strings.
Syntax:
Alias:
- mismatches
Arguments:
string1
(String
): First input string.string2
(String
): Second input string.
Returns:
- The Hamming distance between the two strings. [
UInt64
]
Example:
Result:
In this example, we calculate the Hamming distance between two taco toppings. The result is 8, indicating the number of positions at which the corresponding bytes are different.
stringJaccardIndex
Calculates the Jaccard similarity index between two strings.
Syntax
Arguments:
string1
(String
): First input string.string2
(String
): Second input string.
Returns:
- The Jaccard similarity index as a number between 0 and 1. (
Float64
)
Example:
Result:
In this example, we calculate the Jaccard similarity index between two popular taco fillings. The low similarity score indicates that these taco varieties have few common ingredients or characteristics.
The Jaccard index is calculated as the size of the intersection divided by the size of the union of two sets. In the context of strings, each string is treated as a set of characters. A higher value indicates greater similarity between the strings.
stringJaccardIndexUTF8
Calculates the Jaccard similarity index between two UTF-8 encoded strings.
Syntax
Arguments:
string1
(String
): First input string.string2
(String
): Second input string.
Returns:
- The Jaccard similarity index between the two strings, as a number between 0 and 1.
- Type:
Float64
.
Example:
Result:
This example calculates the Jaccard similarity index between two different taco fillings. The low similarity score indicates that these strings (and tacos) are quite different.
The function assumes that the input strings contain valid UTF-8 encoded text. If this assumption is violated, the behavior is undefined and no exception is thrown.
editDistance
Calculates the edit distance between two strings.
Syntax
Alias:
- levenshteinDistance
Arguments:
string1
(String
): First input string.string2
(String
): Second input string.
Returns:
- The edit distance between the two input strings. (
UInt32
)
Example:
Result:
In this example, we calculate the edit distance between ‘salsa’ and ‘guacamole’. The result is 7, indicating the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other.
The edit distance is also known as the Levenshtein distance.
editDistanceUTF8
Calculates the edit distance between two UTF-8 encoded strings.
Syntax
Arguments:
string1
(String
): First UTF-8 encoded string to compare.string2
(String
): Second UTF-8 encoded string to compare.
Returns:
- The edit distance between the two strings.
UInt32
.
Example:
Query:
Result:
This example calculates the edit distance between two taco toppings, ‘salsa verde’ and ‘guacamole’. The result shows that it takes 9 operations (insertions, deletions, or substitutions) to transform one string into the other.
The function assumes that the input strings are valid UTF-8 encoded text. If this assumption is violated, the behavior is undefined and no exception is thrown.
damerauLevenshteinDistance
Calculates the Damerau-Levenshtein distance between two strings.
Syntax:
Arguments:
string1
(String
): First input string.string2
(String
): Second input string.
Returns:
- The Damerau-Levenshtein distance between the two input strings. [
UInt32
]
Example:
Result:
This example calculates the Damerau-Levenshtein distance between two taco toppings. A lower value indicates more similarity between the strings.
The Damerau-Levenshtein distance is an extension of the Levenshtein distance that allows for transpositions of adjacent characters in addition to insertions, deletions, and substitutions.
jaroSimilarity
Calculates the Jaro similarity between two strings.
Syntax
Arguments:
string1
(String
): First input string.string2
(String
): Second input string.
Returns:
- A number representing the Jaro similarity between the two input strings.
Float64
.
The Jaro similarity is a value between 0 and 1, where 1 indicates identical strings and 0 indicates no similarity.
Example:
Result:
In this example, we calculate the Jaro similarity between two taco fillings. The result shows a moderate similarity between ‘carnitas’ and ‘carne asada’.
jaroWinklerSimilarity
Calculates the Jaro-Winkler similarity between two strings.
Syntax
Arguments:
string1
(String
): First input string.string2
(String
): Second input string.
Returns:
- A number representing the Jaro-Winkler similarity between the two input strings. [
Float64
]- Range is 0 to 1, where 1 indicates identical strings and 0 indicates no similarity.
Example:
Result:
This example calculates the Jaro-Winkler similarity between two taco sauce names. The result of 0.805556 indicates a high degree of similarity, which is expected as the strings differ only in their last word.
The Jaro-Winkler similarity is particularly useful for short strings such as person names. It gives more favorable ratings to strings that match from the beginning, making it suitable for comparing things like product names or ingredients in a recipe.
initcap
Converts the first letter of each word in a string to uppercase and the rest to lowercase.
Syntax
Arguments:
string
(String
): The input string.
Returns:
- A string with the first letter of each word capitalized. [
String
]
Example:
Result:
This function may produce unexpected results for words containing apostrophes or internal capital letters. For example:
Result:
The apostrophe is treated as a word boundary, and the ‘S’ is capitalized. This is a known behavior with no current plans for modification.
initcapUTF8
Converts the first letter of each word in a UTF-8 encoded string to uppercase and the rest to lowercase.
Syntax
Arguments:
string
(String
): The input string to be converted.
Returns:
- A string with the first letter of each word capitalized. (
String
)
Example:
Result:
This function assumes the input string contains valid UTF-8 encoded text. If this assumption is violated, no exception is thrown, and the result is undefined.
- This function does not detect language, so results may not be exactly correct for some languages (e.g., Turkish with i/İ vs. i/I).
- If the length of the UTF-8 byte sequence differs between upper and lower case for a code point, the result may be incorrect for that code point.
- Words are defined as sequences of alphanumeric characters separated by non-alphanumeric characters.
Unlike the ASCII-only initcap
function, initcapUTF8
correctly handles multi-byte UTF-8 characters, making it suitable for text in various languages and scripts.
firstLine
Returns the first line from a multi-line string.
Syntax:
Arguments:
s
(String
): Input string.
Returns:
The first line of the input string, or the entire string if there are no line separators. [String
]
Example:
Result:
This function is useful when you need to extract the first line from a multi-line text field, such as grabbing the first item from a list of taco fillings.
Was this page helpful?