String functions offer a wide range of capabilities for text manipulation and analysis. These functions cover:

  • Basic operations: emptiness checks, whitespace trimming.
  • Complex tasks: pattern matching, substring extraction, text normalization.
  • Advanced techniques: regex matching, Unicode normalization.

ClickHouse function reference

empty

Arguments:

  • x (String) — Input string.

Returns:

  • Returns 1 for an empty string or 0 for a non-empty string. (UInt8)

Example:

SELECT
	empty('') AS is_empty,
	empty('Carne asada') AS is_not_empty;

Result:

| is_empty | is_not_empty |
|----------|--------------|
| 1        | 0            |

This function checks if a string contains any characters, including spaces. It returns 1 (true) for an empty string and 0 (false) for any non-empty string, even if it only contains whitespace.

The empty function is also available for [arrays] and [UUIDs].

notEmpty

Checks whether the input string is non-empty. A string is considered non-empty if it contains at least one byte, even if this byte is a space or a null byte.

Syntax:

notEmpty(x)

Arguments:

  • x (String): Input value.

Returns:

  • Returns 1 for a non-empty string or 0 for an empty string. [UInt8]

Example:

SELECT
	notEmpty('') AS empty_string,
	notEmpty('guacamole') AS non_empty_string;

Result:

| empty_string | non_empty_string |
|--------------|------------------|
| 0            | 1                |

In this example, notEmpty returns 0 for an empty string and 1 for a non-empty string containing the word “guacamole”.

This function is also available for [arrays] and [UUIDs].

length

Returns the length of a string in bytes (not characters or Unicode code points).

Syntax

length(s)

Arguments:

  • s (String or Array): The input string or array.

Returns:

  • The length of s in bytes. [UInt64]

Example:

SELECT
	length('Crunchy Taco') AS taco_length;

Result:

| taco_length |
|-------------|
| 12          |

This function also works with arrays:

SELECT
  length(['Beef', 'Chicken', 'Fish']) AS taco_fillings_count;

Result:

| taco_fillings_count |
|---------------------|
| 3                   |

In the case of strings, length counts bytes, not characters. For example, some UTF-8 characters may occupy more than one byte:

SELECT
  length('Jalapeño') AS spicy_length;

Result:

| spicy_length |
|--------------|
| 9            |

Note that ‘Jalapeño’ is 8 characters long, but occupies 9 bytes due to the ‘ñ’ character.

To count Unicode characters instead of bytes, use the lengthUTF8 function.

lengthUTF8

Returns the length of a string in Unicode code points (not in bytes).

Syntax

lengthUTF8(s)

Alias:

  • char_length
  • character_length

Arguments:

  • s (String): The input string. Must contain valid UTF-8 encoded text.

Returns:

  • The length of the string s in Unicode code points. [UInt64]

Example:

SELECT
  lengthUTF8('¡Hola, tacos!') AS utf8_length;

Result:

| utf8_length |
|-------------|
| 13          |

In this example, lengthUTF8 correctly counts the Unicode characters in the Spanish greeting and taco reference, including the inverted exclamation mark.

  • This function assumes the input string contains valid UTF-8 encoded text. If this assumption is violated, the result is undefined and no exception is thrown.
  • lengthUTF8 differs from the length function, which returns the length in bytes.
  • For ASCII strings, lengthUTF8 and length will return the same result.

See Also

  • length: Returns the length of a string in bytes.

left

Returns a substring of string s with a specified offset starting from the left.

Syntax:

left(s, offset)

Arguments:

  • s (String or FixedString): The string to calculate a substring from.
  • offset (UInt*): The number of bytes of the offset.

Returns:

  • For positive offset: A substring of s with offset many bytes, starting from the left of the string.
  • For negative offset: A substring of s with length(s) - |offset| bytes, starting from the left of the string.
  • An empty string if length is 0.

Example:

SELECT
	left('Crunchy Taco', 7) AS taco_type;

Result:

| taco_type |
|-----------|
| Crunchy   |

This example extracts the first 7 characters from the string ‘Crunchy Taco’, resulting in ‘Crunchy’.

SELECT
	left('Spicy Burrito', -6) AS taco_filling;

Result:

| taco_filling |
|--------------|
| Spicy Bur    |

This example extracts all but the last 6 characters from ‘Spicy Burrito’, resulting in ‘Spicy Bur’.

The left function operates on bytes, not characters. For UTF-8 encoded strings, consider using leftUTF8 to work with characters instead of bytes.

leftUTF8

Returns a substring of a UTF-8 encoded string with a specified number of characters from the left.

Syntax

leftUTF8(s, length)

Arguments:

  • s (String): The UTF-8 encoded string to extract the substring from.
  • length (UInt): The number of characters to extract.

Returns:

  • A substring of s containing the leftmost length characters. (String)

Example:

SELECT
  leftUTF8('Spicy Jalapeño Taco', 5) AS taco_prefix;

Result:

| taco_prefix |
|-------------|
| Spicy       |

This function is useful when you need to extract a fixed number of characters from the beginning of a UTF-8 string, such as getting the first few words of a taco description or the prefix of a menu item name.

If length is greater than the number of characters in the string, the entire string is returned. If length is 0 or negative, an empty string is returned.

leftPad

Pads a string from the left with spaces or with a specified string (multiple times, if needed) until the resulting string reaches the specified length.

Syntax:

leftPad(string, length[, pad_string])

Alias:

  • LPAD

Arguments:

  • string (String): Input string that should be padded.
  • length (UInt or Int): The length of the resulting string. If the value is smaller than the input string length, then the input string is shortened to length characters.
  • pad_string (String, optional): The string to pad the input string with. If not specified, then the input string is padded with spaces.

Returns:

  • A left-padded string of the given length. (String)

Example:

SELECT
	leftPad('Taco', 10, '*') AS padded_taco,
	leftPad('Burrito', 10);

Result:

| padded_taco  | leftPad('Burrito', 10) |
|--------------|------------------------|
| ******Taco   | Burrito                |

In this example:

  • ‘Taco’ is padded with asterisks to a length of 10 characters.
  • ‘Burrito’ is padded with spaces to a length of 10 characters.

The function measures string length in bytes, not in characters. For Unicode strings, consider using leftPadUTF8 instead.

leftPadUTF8

Pads a string from the left with spaces or a specified string (multiple times, if needed) until the resulting string reaches the given length. Unlike leftPad which measures the string length in bytes, this function measures the string length in Unicode code points.

Syntax:

leftPadUTF8(string, length[, pad_string])

Arguments:

  • string (String): Input string that should be padded.
  • length (UInt or Int): The length of the resulting string. If the value is smaller than the input string length, then the input string is shortened to length characters.
  • pad_string (String, optional): The string to pad the input string with. If not specified, then the input string is padded with spaces.

Returns:

  • A left-padded string of the given length. (String)

Example:

SELECT
	leftPadUTF8('Taco', 10, '🌮') AS padded_taco,
	leftPadUTF8('Burrito', 10) AS padded_burrito;

Result:

| padded_taco       | padded_burrito |
|-------------------|----------------|
| 🌮🌮🌮🌮🌮🌮Taco |     Burrito    |

In this example:

  • padded_taco is padded with taco emojis to reach a length of 10 Unicode characters.
  • padded_burrito is padded with spaces to reach a length of 10 Unicode characters.

This function is particularly useful when working with strings containing non-ASCII characters or emojis, ensuring consistent visual alignment based on the number of Unicode characters rather than bytes.

Returns a substring of a string starting from the right.

Syntax

right(s, offset)

Arguments:

  • s (String or FixedString): The input string.
  • offset (UInt*): The number of characters to extract.

Returns:

  • For positive offset: A substring of s with offset many characters, starting from the right of the string.
  • For negative offset: A substring of s with length(s) - |offset| characters, starting from the right of the string.
  • An empty string if offset is 0.

Example:

SELECT
	right('Crunchy Taco', 4) AS last_four,
	right('Soft Taco', -4) AS all_but_last_four;

Result:

| last_four | all_but_last_four |
|-----------|-------------------|
| Taco      | Soft              |

In this example:

  • last_four returns the last 4 characters of ‘Crunchy Taco’.
  • all_but_last_four returns all but the last 4 characters of ‘Soft Taco’.

rightUTF8

Returns a substring of a UTF-8 encoded string with a specified offset starting from the right.

Syntax:

rightUTF8(s, offset)

Arguments:

  • s (String or FixedString): The UTF-8 encoded string to calculate a substring from.
  • offset (UInt*): The number of characters for the offset.

Returns:

  • For positive offset: A substring of s with offset many characters, starting from the right of the string.
  • For negative offset: A substring of s with length(s) - |offset| characters, starting from the right of the string.
  • An empty string if offset is 0.

Example:

SELECT
  rightUTF8('Spicy Taco', 4);

Result:

| rightUTF8('Spicy Taco', 4) |
|----------------------------|
| Taco                       |
SELECT
  rightUTF8('Jalapeño Taco', -7);

Result:

| rightUTF8('Jalapeño Taco', -7) |
|--------------------------------|
| ño Taco                        |

This function assumes that the string contains valid UTF-8 encoded text. If this assumption is violated, no exception is thrown and the result is undefined.

The function counts characters, not bytes. This is important for multi-byte UTF-8 characters.

rightPad

Pads a string from the right with spaces or with a specified string (multiple times, if needed) until the resulting string reaches the specified length.

Syntax:

rightPad(string, length[, pad_string])

Alias:

  • RPAD

Arguments:

  • string (String) — Input string that should be padded.
  • length (UInt or Int) — The length of the resulting string. If the value is smaller than the input string length, then the input string is shortened to length characters.
  • pad_string (String, optional) — The string to pad the input string with. If not specified, then the input string is padded with spaces.

Returns:

  • A right-padded string of the given length. (String)

Example:

SELECT
	rightPad('Taco', 10, '🌮') AS padded_taco;

Result:

| padded_taco       |
|-------------------|
| Taco🌮🌮🌮🌮🌮🌮 |

In this example, the ‘Taco’ string is right-padded with taco emojis until it reaches a length of 10 characters. This function can be useful for formatting strings to a fixed width, which is particularly handy when creating text-based reports or aligning data in console output.

rightPadUTF8

Pads a string from the right with spaces or a specified string (multiple times, if needed) until the resulting string reaches the given length. Unlike rightPad which measures the string length in bytes, this function measures the string length in Unicode code points.

Syntax

rightPadUTF8(string, length[, pad_string])

Arguments:

  • string (String) — Input string that should be padded.
  • length (UInt or Int) — The desired length of the resulting string.
  • pad_string (String, optional) — The string to pad the input string with. If not specified, the input string is padded with spaces.

Returns:

  • A right-padded string of the given length. (String)

Example:

SELECT
	rightPadUTF8('Taco', 10, '🌮') AS padded_taco,
	rightPadUTF8('Burrito', 10) AS padded_burrito;

Result:

| padded_taco       | padded_burrito |
|-------------------|----------------|
| Taco🌮🌮🌮🌮🌮🌮 | Burrito        |

In this example:

  • ‘Taco’ is padded with taco emojis to reach a length of 10 Unicode code points.
  • ‘Burrito’ is padded with spaces to reach a length of 10 Unicode code points.

This function is particularly useful when working with strings containing non-ASCII characters or emojis, ensuring consistent string lengths based on visible characters rather than byte count.

lower

Converts ASCII Latin symbols in a string to lowercase.

Syntax

lower(input)

Alias:

  • lcase

Arguments:

  • input (String): The input string.

Returns:

  • A string with all ASCII Latin characters converted to lowercase. [String]

Example:

SELECT
	lower('CRUNCHY TACO') AS lowercase_taco;

Result:

| lowercase_taco |
|----------------|
| crunchy taco   |

This function only affects ASCII Latin characters (A-Z). It does not change other characters or Unicode symbols. For Unicode-aware lowercase conversion, use the lowerUTF8 function.

upper

Converts ASCII Latin characters in a string to uppercase.

Syntax

upper(input)

Alias:

  • ucase

Arguments:

  • input (String): The input string.

Returns:

  • A string with ASCII Latin characters converted to uppercase. [String]

Example:

SELECT
	upper('spicy taco') AS uppercase_taco;

Result:

| uppercase_taco |
|----------------|
| SPICY TACO     |

This function only affects ASCII Latin characters (a-z). It does not change other characters or consider language-specific uppercase rules.

lowerUTF8

Converts a string to lowercase, assuming the string contains valid UTF-8 encoded text.

Syntax

lowerUTF8(input)

Arguments:

  • input (String): A string to convert.

Returns:

  • The input string converted to lowercase. (String)

Example:

SELECT
  lowerUTF8('TACO TUESDAY') AS lowercase_taco;

Result:

| lowercase_taco |
|----------------|
| taco tuesday   |

This function does not detect the language. For some languages (e.g., Turkish), the result might not be exactly correct due to specific uppercase/lowercase rules. If the length of the UTF-8 byte sequence differs between upper and lower case for a code point, the result may be incorrect for that code point.

Unlike the lower function, lowerUTF8 correctly handles UTF-8 encoded characters beyond the ASCII range.

upperUTF8

Converts a string to uppercase, assuming the string contains valid UTF-8 encoded text.

Syntax

upperUTF8(input)

Arguments:

  • input (String): A string to convert.

Returns:

  • The input string converted to uppercase. [String]

Example:

SELECT
  upperUTF8('jalapeño') AS spicy_taco;

Result:

| spicy_taco |
|------------|
| JALAPEÑO   |

This function does not detect the language, so results might not be exactly correct for some languages (e.g., Turkish). If the length of the UTF-8 byte sequence differs between upper and lower case for a code point, the result may be incorrect for that code point.

isValidUTF8

Checks whether a string contains valid UTF-8 encoded text.

Syntax

isValidUTF8(input)

Arguments:

  • input (String): The string to check.

Returns:

  • Returns 1 if the input string contains valid UTF-8 encoded text, 0 otherwise. [UInt8]

Example:

SELECT
	isValidUTF8('Spicy jalapeño taco') AS valid_utf8,
	isValidUTF8('Broken \xC3\x28 taco') AS invalid_utf8;

Result:

| valid_utf8 | invalid_utf8 |
|------------|--------------|
| 1          | 0            |

In this example, ‘Spicy jalapeño taco’ is valid UTF-8, while ‘Broken \xC3\x28 taco’ contains an invalid UTF-8 sequence.

This function is useful for validating input data, especially when working with user-generated content or data from external sources where UTF-8 encoding is expected but not guaranteed.

toValidUTF8

Replaces invalid UTF-8 characters with the � (U+FFFD) replacement character. Consecutive invalid characters are collapsed into a single replacement character.

Syntax

toValidUTF8(input_string)

Arguments:

  • input_string (String): Any set of bytes represented as a String data type.

Returns:

  • A valid UTF-8 string. (String).

Example:

SELECT
  toValidUTF8('Spicy jalapeño \xF0\x80\x80\x80 taco') AS valid_utf8;

Result:

| valid_utf8              |
|-------------------------|
| Spicy jalapeño � taco   |

In this example, the invalid UTF-8 sequence \xF0\x80\x80\x80 is replaced with a single � character in the resulting string.

This function is useful when dealing with potentially corrupted or improperly encoded text data, ensuring that the output is always valid UTF-8 for further processing or display.

repeat

Repeats a string a specified number of times.

Syntax

repeat(s, n)

Arguments:

  • s (String): The string to repeat.
  • n (UInt or Int): The number of times to repeat the string.

Returns:

A string containing s repeated n times. If n <= 0, the function returns an empty string. (String)

Example:

SELECT
  repeat('Taco ', 3) AS taco_chant;

Result:

| taco_chant     |
|----------------|
| Taco Taco Taco |

In this example, we create a taco chant by repeating the word “Taco” three times.

If you need to repeat a string a large number of times, be mindful of the resulting string length to avoid excessive memory usage.

space

Returns a string consisting of a specified number of space characters.

Syntax:

space(n)

Alias:

  • SPACE

Arguments:

  • n (UInt or Int): The number of space characters to generate.

Returns:

A string containing n space characters. If n <= 0, an empty string is returned. [String]

Example:

SELECT
	space(5) AS five_spaces,
	space(0) AS no_spaces;

Result:

| five_spaces | no_spaces |
|-------------|-----------|
|             |           |

This function is useful for padding strings or creating visual separators in query results. For example, you could use it to align taco ingredients in a formatted list:

SELECT
	taco_name,
	space(20 - length(taco_name)) || ingredients AS formatted_ingredients
FROM
	taco_menu;

Result:

| taco_name | formatted_ingredients            |
|-----------|----------------------------------|
| Carnitas  |    Pork, Onions, Cilantro        |
| Pescado   |     Fish, Cabbage, Salsa         |
| Veggie    |   Beans, Avocado, Cheese         |

In this example, space() is used to create a consistent alignment for the ingredients list, regardless of the taco name length.

reverse

Reverses the sequence of bytes in a string.

Syntax:

reverse(s)

Arguments:

  • s (String): The input string.

Returns:

  • A string with the bytes in reverse order. [String]

Example:

SELECT reverse('Taco Tuesday');

Result:

| reverse('Taco Tuesday') |
|-------------------------|
| yadseuT ocaT            |

This function reverses the byte order of the input string. For ASCII strings, this effectively reverses the characters. However, for multi-byte encodings like UTF-8, it may produce unexpected results as it operates on bytes, not characters.

For proper handling of Unicode characters, use the reverseUTF8 function instead.

reverseUTF8

Reverses a sequence of Unicode code points in a string.

Syntax

reverseUTF8(str)

Arguments:

  • str (String): The input string to reverse.

Returns:

  • A string with the sequence of Unicode code points reversed. (String)

Description

This function reverses the order of Unicode code points in the input string. It assumes that the string contains valid UTF-8 encoded text. If this assumption is violated, no exception is thrown and the result is undefined.

Unlike the reverse function, which reverses the sequence of bytes, reverseUTF8 preserves the integrity of multi-byte Unicode characters.

Example:

SELECT
  reverseUTF8('Spicy 🌶️ Taco Tuesday!') AS reversed_text;

Result:

| reversed_text           |
|-------------------------|
| !yadseuT ocaT 🌶️ ycipS  |

In this example, the Unicode emoji 🌶️ is treated as a single code point and remains intact in the reversed string.

Be cautious when using this function with strings that contain combining characters or complex scripts, as the visual representation of the reversed text might not always be meaningful.

concat

Concatenates the given string arguments.

Syntax

concat(s1, s2, ...)

Arguments:

  • s1, s2, … (String or FixedString): Strings to concatenate.

Returns:

  • A String created by concatenating the arguments.
  • If any argument is NULL, the function returns NULL.
  • Arguments of other types are automatically converted to strings, but this may impact performance.

Example:

SELECT
  concat('Crunchy ', 'taco', ' with ', 'salsa') AS taco_description;

Result:

| taco_description         |
|--------------------------|
| Crunchy taco with salsa  |

This example concatenates multiple strings to create a taco description.

concatAssumeInjective

Concatenates strings, assuming the result is injective (unique) for different input values. This can be used for query optimization, particularly with GROUP BY operations.

Syntax

concatAssumeInjective(s1, s2, ...)

Arguments:

  • s1, s2, … (String or FixedString): Strings to concatenate.

Returns:

  • The concatenated string. (String)

Example:

SELECT
	concatAssumeInjective('Carne ', 'Asada') AS taco_name,
	COUNT(*) AS order_count
FROM
	taco_orders
GROUP BY
	concatAssumeInjective('Carne ', 'Asada');

Result:

| taco_name    | order_count |
|--------------|-------------|
| Carne Asada  | 42          |

This example assumes that concatenating ‘Carne ’ and ‘Asada’ always produces a unique result, allowing for potential query optimization.

Use this function only when you are certain that the concatenation result is unique for different inputs. Incorrect usage may lead to wrong query results.

concatWithSeparator

Concatenates strings with a specified separator.

Syntax

concatWithSeparator(separator, s1, s2, ...)

Alias:

  • concat_ws

Arguments:

  • separator (String): The string to use as a separator.
  • s1, s2, … (String): The strings to concatenate.

Returns:

  • A string containing all the input strings concatenated with the separator between them. [String]

Example:

SELECT
	concatWithSeparator(', ', 'Carne Asada', 'Al Pastor', 'Pollo') AS taco_menu;

Result:

| taco_menu                      |
|--------------------------------|
| Carne Asada, Al Pastor, Pollo  |

This example creates a comma-separated list of taco fillings.

If any of the input strings is NULL, it is skipped in the result without adding a separator.

concatWithSeparatorAssumeInjective

Concatenates strings with a specified separator, assuming the result is injective. This function can be used for optimization of GROUP BY operations.

Syntax

concatWithSeparatorAssumeInjective(separator, s1, s2, ...)

Arguments:

  • separator (String): The string to use as a separator between the concatenated strings.
  • s1, s2, … (String): The strings to concatenate.

Returns:

  • A string created by concatenating the input strings with the specified separator. (String)

This function assumes that the concatenation result is injective, meaning different arguments always produce different results. This assumption allows for certain query optimizations.

Example:

SELECT
	concatWithSeparatorAssumeInjective(', ', 'Carne Asada', 'Al Pastor', 'Pollo') AS taco_fillings,
	COUNT(*) AS order_count
FROM
	taco_orders
GROUP BY
	concatWithSeparatorAssumeInjective(', ', 'Carne Asada', 'Al Pastor', 'Pollo');

Result:

| taco_fillings                  | order_count |
|--------------------------------|-------------|
| Carne Asada, Al Pastor, Pollo  | 42          |

In this example, we concatenate taco fillings with a comma separator and use the result for grouping. The function assumes that different combinations of fillings will always produce unique strings, allowing for potential query optimizations.

substring

Returns a substring of a string starting at a specified position.

Syntax

substring(s, offset[, length])

Aliases:

  • substr
  • mid

Arguments:

  • s (String): The input string.
  • offset (Integer): The starting position of the substring (1-based index).
  • length (Integer, optional): The maximum length of the substring.

Returns:

  • A substring of s. (String)

Example:

SELECT
	substring('Carne Asada Taco', 7) AS without_length,
	substring('Carne Asada Taco', 7, 5) AS with_length;

Result:

| without_length | with_length |
|----------------|-------------|
| Asada Taco     | Asada       |

In this example:

  • without_length returns the substring starting from the 7th character to the end.
  • with_length returns a 5-character substring starting from the 7th character.
  • If offset is 0, an empty string is returned.
  • If offset is negative, the substring starts that many characters from the end of the string.
  • If length is omitted, the substring extends to the end of the string.

substringUTF8

Returns a substring of a UTF-8 encoded string starting at a specified position.

Syntax

substringUTF8(s, offset[, length])

Arguments:

  • s (String): The UTF-8 encoded string to extract the substring from.
  • offset (Integer): The starting position of the substring.
  • length (Integer, optional): The maximum length of the substring.

Returns:

  • A substring of s. (String)

Example:

SELECT
  substringUTF8('Spicy Jalapeño Taco', 7, 8) AS taco_filling;

Result:

| taco_filling |
|--------------|
| Jalapeño     |

This function assumes that the input string contains valid UTF-8 encoded text. If this assumption is violated, the result is undefined and no exception is thrown.

The offset argument is 1-based, meaning the first character is at position 1. If offset is 0, an empty string is returned. If offset is negative, the substring starts that many characters from the end of the string.

If the optional length argument is omitted, the substring extends to the end of the string.

This function operates on Unicode code points rather than bytes, making it suitable for strings containing multi-byte characters.

substringIndex

Returns a substring of a string before a specified number of occurrences of a delimiter.

Syntax

substringIndex(s, delim, count)

Alias:

  • SUBSTRING_INDEX

Arguments:

  • s (String): The input string to extract the substring from.
  • delim (String): The delimiter string.
  • count (Int or UInt): The number of delimiter occurrences.
    • If positive, counts from the left
    • If negative, counts from the right

Returns:

  • A substring of s. (String)

Example:

SELECT
	substringIndex('salsa,guacamole,queso,pico de gallo', ',', 2) AS taco_toppings;

Result:

| taco_toppings     |
|-------------------|
| salsa,guacamole   |

In this example, substringIndex returns everything before the second occurrence of the comma, giving us the first two taco toppings from the list.

If count is 0 or the delimiter is not found, the function returns an empty string. If count exceeds the number of delimiters, the entire string is returned.

substringIndexUTF8

Returns the substring of a UTF-8 encoded string before a specified number of occurrences of a delimiter.

Syntax

substringIndexUTF8(s, delim, count)

Arguments:

  • s (String): The UTF-8 encoded string to extract the substring from.
  • delim (String): The delimiter string.
  • count (Int): The number of delimiter occurrences to count before extracting the substring. If positive, counts from the left; if negative, counts from the right.

Returns:

  • A substring of s. (String)

Example:

SELECT
	substringIndexUTF8('Carne asada tacos, al pastor tacos, fish tacos', ', ', 2) AS result;

Result:

| result                                |
|---------------------------------------|
| Carne asada tacos, al pastor tacos    |

In this example, the function returns the substring before the second occurrence of ’, ’ in the taco menu string.

This function assumes that the input string contains valid UTF-8 encoded text. If this assumption is violated, the behavior is undefined and no exception is thrown.

appendTrailingCharIfAbsent

Appends a specified character to the end of a string if it’s not already present.

Syntax:

appendTrailingCharIfAbsent(s, c)

Arguments:

  • s (String): The input string.
  • c (String): The character to append. (length 1)

Returns:

  • A string with the specified character appended if it wasn’t already present. (String)

Example:

SELECT
	appendTrailingCharIfAbsent('Taco', '/') AS with_slash,
	appendTrailingCharIfAbsent('Burrito/', '/') AS already_has_slash;

Result:

| with_slash | already_has_slash |
|------------|-------------------|
| Taco/      | Burrito/          |

In this example:

  • with_slash adds a trailing ’/’ to ‘Taco’ since it wasn’t present.
  • already_has_slash doesn’t modify ‘Burrito/’ because it already ends with ’/’.

This function is useful for ensuring strings have a consistent ending, which can be helpful when working with paths or URLs.

convertCharset

Converts a string from one character encoding to another.

Syntax:

convertCharset(s, from, to)

Arguments:

  • s (String): The input string to convert.
  • from (String): The source character encoding.
  • to (String): The target character encoding.

Returns:

  • The converted string. (String)

Example:

SELECT
  convertCharset('Jalapeño', 'UTF-8', 'ASCII') AS ascii_jalapeno;

Result:

| ascii_jalapeno |
|----------------|
| Jalapeno       |

In this example, the function converts the UTF-8 encoded string ‘Jalapeño’ to ASCII, replacing the ‘ñ’ character with ‘n’.

The available character encodings depend on the ICU library version used during ClickHouse compilation. Make sure to use valid encoding names supported by your ClickHouse installation.

base58Encode

Encodes a string using Base58 in the “Bitcoin” alphabet.

Syntax:

base58Encode(plaintext)

Arguments:

  • plaintext (String): String to encode.

Returns:

  • The Base58 encoded string. [String]

Example:

SELECT
  base58Encode('Taco Tuesday') AS encoded_taco;

Result:

| encoded_taco      |
|-------------------|
| 5R3YyEqpuYBCDTzn  |

This example encodes the phrase “Taco Tuesday” using Base58 encoding. The result is a compact, URL-safe string that can be used for various purposes, such as generating short identifiers for taco orders or menu items.

base58Decode

Decodes a Base58-encoded string.

Syntax

base58Decode(encoded)

Arguments:

  • encoded (String): A Base58-encoded string.

Returns:

  • The decoded value of the argument. (String)

Example:

SELECT
  base58Decode('3dc8KtHrwM') AS decoded_taco_name;

Result:

| decoded_taco_name |
|-------------------|
| Carnitas          |

In this example, we decode a Base58-encoded taco name. The function returns the original string ‘Carnitas’.

If the input string is not a valid Base58-encoded value, an exception is thrown. For a version that returns an empty string instead of throwing an exception, use tryBase58Decode().

tryBase58Decode

Attempts to decode a Base58-encoded string. If decoding fails, it returns an empty string instead of throwing an exception.

Syntax

tryBase58Decode(encoded)

Arguments:

  • encoded (String): A string containing Base58-encoded data.

Returns:

  • A string containing the decoded value of the argument. (String)
  • An empty string if decoding fails.

Example:

SELECT
	tryBase58Decode('3dc8KtHrwM') AS decoded_taco,
	tryBase58Decode('invalid_base58') AS invalid_taco;

Result:

| decoded_taco | invalid_taco |
|--------------|--------------|
| Taco         |              |

In this example, we attempt to decode two strings:

  • ‘3dc8KtHrwM’ is a valid Base58-encoded string that decodes to ‘Taco’.
  • ‘invalid_base58’ is not a valid Base58-encoded string, so an empty string is returned.

This function is useful when you need to handle potentially invalid Base58-encoded data without causing query errors.

base64Encode

Encodes a string using Base64 encoding according to RFC 4648.

Syntax

base64Encode(input)

Alias:

  • TO_BASE64

Arguments:

  • input (String): The string to encode.

Returns:

  • A Base64-encoded string. (String)

Example:

SELECT
  base64Encode('Spicy Taco') AS encoded_taco;

Result:

| encoded_taco     |
|------------------|
| U3BpY3kgVGFjbw== |

This example encodes the phrase “Spicy Taco” into its Base64 representation.

The function follows the standard Base64 encoding scheme, which is commonly used for encoding binary data for transmission over text-based protocols or storage in text formats.

base64URLEncode

Encodes a string using Base64 URL-safe encoding, according to RFC 4648.

Syntax

base64URLEncode(input)

Arguments:

  • input (String): The string to encode.

Returns:

  • A string containing the Base64 URL-safe encoded value of the input. [String]

Example:

SELECT
  base64URLEncode('Spicy jalapeño taco') AS encoded_taco;

Result:

| encoded_taco                    |
|---------------------------------|
| U3BpY3kgamFsYXBlw7FvIHRhY28     |

This function is useful for encoding data that needs to be included in URLs or filenames. It replaces the standard Base64 + and / characters with - and _ respectively, making the output safe for use in URLs without additional encoding.

The encoded output does not include padding = characters, further improving URL compatibility.

base64Decode

Decodes a base64-encoded string.

Syntax

base64Decode(encoded)

Arguments:

  • encoded (String): A base64-encoded string.

Returns:

  • The decoded string. (String)

Example:

SELECT
  base64Decode('VGFjbyBUdWVzZGF5') AS decoded_text;

Result:

| decoded_text |
|--------------|
| Taco Tuesday |

This example decodes the base64-encoded string ‘VGFjbyBUdWVzZGF5’, which represents “Taco Tuesday”.

If the input string is not a valid base64-encoded value, an exception is thrown. For a version that returns an empty string instead of throwing an exception, use tryBase64Decode.

base64URLDecode

Decodes a base64-encoded URL string according to the URL-safe variant of base64 encoding specified in RFC 4648.

Syntax

base64URLDecode(encodedUrl)

Arguments:

  • encodedUrl (String): A string containing a base64-encoded URL with URL-safe modifications.

Returns:

  • The decoded string. (String)

Example:

SELECT
  base64URLDecode('aHR0cHM6Ly90YWNvLmNvbQ') AS decoded_url;

Result:

| decoded_url        |
|--------------------|
| https://taco.com   |

This function decodes the base64URL-encoded string ‘aHR0cHM6Ly90YWNvLmNvbQ’, which represents the URL ‘https://taco.com’.

If the input string is not a valid base64URL-encoded value, an exception is thrown. For a version that returns an empty string instead of throwing an exception, use tryBase64URLDecode.

tryBase64Decode

Attempts to decode a Base64-encoded string. If decoding fails, it returns an empty string instead of throwing an exception.

Syntax

tryBase64Decode(encoded)

Arguments:

  • encoded (String): A string containing Base64-encoded data.

Returns:

  • A string containing the decoded value of the argument. (String)
  • An empty string if decoding fails.

Example:

SELECT
	tryBase64Decode('VGFjbyBUdWVzZGF5') AS decoded_success,
	tryBase64Decode('invalid_base64!!!') AS decoded_failure;

Result:

| decoded_success | decoded_failure |
|-----------------|-----------------|
| Taco Tuesday    |                 |

In this example, ‘VGFjbyBUdWVzZGF5’ is successfully decoded to ‘Taco Tuesday’, while the invalid Base64 string returns an empty result instead of throwing an error.

This function is useful when you need to handle potentially invalid Base64 data without interrupting your query execution.

tryBase64URLDecode

Decodes a Base64URL-encoded string. If the input is not a valid Base64URL-encoded string, it returns an empty string instead of throwing an exception.

Syntax

tryBase64URLDecode(encodedString)

Arguments:

  • encodedString (String): A string containing Base64URL-encoded data.

Returns:

  • The decoded string, or an empty string if decoding fails. [String]

Example:

SELECT
	tryBase64URLDecode('aHR0cHM6Ly90YWNvLmNvbQ') AS decoded_url,
	tryBase64URLDecode('invalid_base64url') AS failed_decode;

Result:

| decoded_url        | failed_decode |
|--------------------|---------------|
| https://taco.com   |               |

In this example:

  • decoded_url shows a successfully decoded Base64URL string.
  • failed_decode returns an empty string because the input is not valid Base64URL.

This function is useful when you need to safely decode Base64URL-encoded data without risking exceptions for invalid inputs, such as when processing user-supplied or potentially corrupted data in web applications.

endsWith

Returns whether a string ends with the specified suffix.

Syntax

endsWith(str, suffix)

Arguments:

  • str (String): The string to check.
  • suffix (String): The suffix to look for.

Returns:

  • 1 if str ends with suffix, 0 otherwise. [UInt8]

Example:

SELECT
	endsWith('Spicy Taco', 'Taco') AS ends_with_taco;

Result:

| ends_with_taco |
|----------------|
| 1              |

This example checks if the string ‘Spicy Taco’ ends with ‘Taco’. The result is 1 (true), confirming that it does end with ‘Taco’.

endsWithUTF8

Returns whether a UTF-8 encoded string ends with the specified suffix.

Syntax

endsWithUTF8(str, suffix)

Arguments:

  • str (String): The string to check.
  • suffix (String): The suffix to look for.

Returns:

  • 1 if str ends with suffix, 0 otherwise. [UInt8]

Example:

SELECT
  endsWithUTF8('Spicy jalapeño taco', 'taco') AS ends_with_taco;

Result:

| ends_with_taco |
|----------------|
| 1              |

This function is UTF-8 aware, so it correctly handles multi-byte characters. For example:

SELECT
	endsWithUTF8('Habanero salsa picante', 'picante') AS ends_with_picante,
	endsWithUTF8('Jalapeño poppers', 'ño') AS ends_with_enye;

Result:

| ends_with_picante | ends_with_enye |
|-------------------|----------------|
| 1                 | 1              |

Note that this function differs from the non-UTF8 version endsWith in that it matches str and suffix by UTF-8 characters rather than bytes.

startsWith

Returns whether a string begins with a specified prefix.

Syntax

startsWith(str, prefix)

Arguments:

  • str (String): The string to check.
  • prefix (String): The prefix to look for at the start of str.

Returns:

  • 1 if str starts with prefix, 0 otherwise. [UInt8]

Example:

SELECT
	startsWith('Taco Tuesday', 'Taco') AS starts_with_taco;

Result:

| starts_with_taco  |
|-------------------|
| 1                 |

This example checks if the string ‘Taco Tuesday’ starts with ‘Taco’. The result is 1 (true) because the string indeed begins with ‘Taco’.

startsWithUTF8

Returns whether a UTF-8 encoded string starts with the specified prefix.

Syntax

startsWithUTF8(str, prefix)

Arguments:

  • str (String): The UTF-8 encoded string to check.
  • prefix (String): The prefix to look for at the start of str.

Returns:

  • 1 if str starts with prefix, 0 otherwise. [UInt8]

Example:

SELECT
  startsWithUTF8('Taco Tuesday', 'Taco') AS starts_with_taco;

Result:

| starts_with_taco |
|------------------|
| 1                |

This function differs from startsWith in that it matches str and prefix by UTF-8 characters rather than bytes. This is particularly useful when working with non-ASCII characters.

Example with non-ASCII characters

SELECT
	startsWithUTF8('Jalapeño Taco', 'Jal') AS utf8_match,
	startsWith('Jalapeño Taco', 'Jal') AS byte_match;

Result:

| utf8_match | byte_match |
|------------|------------|
| 1          | 1          |

In this case, both functions return the same result. However, if the string started with a multi-byte UTF-8 character, the results might differ.

This function assumes that the input strings contain valid UTF-8 encoded text. If this assumption is violated, the behavior is undefined and no exception is thrown.

trim

Removes specified characters from the start and/or end of a string.

Syntax

trim([LEADING|TRAILING|BOTH] trim_character FROM] input_string)

Arguments:

  • trim_character (String): Character(s) to remove.
  • input_string (String): String to trim.

Returns:

  • A string with specified characters removed from the start and/or end. (String)

Example:

SELECT
	trim(BOTH ' ()' FROM '( Carne Asada Tacos )');

Result:

| trim(BOTH ' ()' FROM '( Carne Asada Tacos )') |
|-----------------------------------------------|
| Carne Asada Tacos                             |

This example removes spaces, parentheses, and any combination thereof from both ends of the string, leaving the taco filling intact.

trimLeft

Removes the consecutive occurrences of whitespace (ASCII-character 32) from the start of a string.

Syntax:

trimLeft(input_string)

Alias:

  • ltrim

Arguments:

  • input_string (String): String to trim.

Returns:

  • A string without leading whitespaces. [String]

Example:

SELECT
	trimLeft('   Crunchy Taco   ') AS trimmed_taco;

Result:

| trimmed_taco     |
|------------------|
| Crunchy Taco     |

This function is useful for cleaning up user input or formatting data where leading spaces might interfere with further processing or display. It only removes ASCII space characters (code 32) from the beginning of the string, leaving other whitespace characters (like tabs or newlines) and the rest of the string untouched.

trimRight

Removes the consecutive occurrences of whitespace (ASCII-character 32) from the end of a string.

Syntax:

trimRight(input_string)

Alias:

  • rtrim

Arguments:

  • input_string (String): String to trim.

Returns:

  • A string without trailing whitespaces. [String]

Example:

SELECT
	trimRight('Crunchy taco    ') AS trimmed_taco;

Result:

| trimmed_taco |
|--------------|
| Crunchy taco |

This function is useful for cleaning up strings that may have inconsistent trailing spaces, such as user input or data from external sources. It helps standardize string data for further processing or comparison operations.

trimBoth

Removes the consecutive occurrences of whitespace (ASCII-character 32) from both ends of a string.

Syntax:

trimBoth(input_string)

Alias:

  • trim(input_string)

Arguments:

  • input_string (String): String to trim.

Returns:

  • A string without leading and trailing whitespaces. [String]

Example:

SELECT
  trimBoth('  Crunchy Taco Supreme  ') AS trimmed_taco;

Result:

| trimmed_taco           |
|------------------------|
| Crunchy Taco Supreme   |

This function is useful for cleaning up user input or formatting text data where extra whitespace at the beginning or end of strings is undesirable. It’s particularly handy when processing taco order data where customers might accidentally include extra spaces in their input.

CRC32

Calculates the CRC32 checksum of a string using CRC-32-IEEE 802.3 polynomial and initial value 0xffffffff (zlib implementation).

Syntax:

CRC32(str)

Arguments:

  • str (String): Input string.

Returns:

  • The CRC32 checksum of the string. [UInt32]

Example:

SELECT CRC32('Crunchy Taco Supreme') AS taco_checksum;

Result:

| taco_checksum |
|---------------|
| 3254710669    |

This example calculates the CRC32 checksum of a popular taco menu item. The checksum can be used for data integrity checks or as a simple hash function.

The CRC32 function is not cryptographically secure. For security-sensitive applications, use cryptographic hash functions instead.

CRC32IEEE

Calculates the CRC32 checksum of a string using the CRC-32-IEEE 802.3 polynomial.

Syntax

CRC32IEEE(string)

Arguments:

  • string (String): The input string to calculate the checksum for.

Returns:

  • The CRC32 checksum. (UInt32)

Example:

SELECT
  CRC32IEEE('Crunchy Taco Supreme') AS taco_checksum;

Result:

| taco_checksum |
|---------------|
| 3254741547    |

This example calculates the CRC32 checksum of the taco name ‘Crunchy Taco Supreme’ using the IEEE polynomial. The checksum can be used for data integrity checks or as a simple hash function.

CRC64

Calculates the CRC64 checksum of a string, using the CRC-64-ECMA polynomial.

Syntax:

CRC64(str)

Arguments:

  • str (String): Input string.

Returns:

  • The CRC64 checksum of the input string. [UInt64]

Example:

SELECT
  CRC64('Crunchy taco supreme') AS taco_checksum;

Result:

| taco_checksum        |
|----------------------|
| 13123123778010110439 |

This example calculates the CRC64 checksum of a taco order name. The CRC64 function can be useful for data integrity checks or generating unique identifiers for strings.

normalizeQuery

Replaces literals, sequences of literals and complex aliases with placeholders in a SQL query string.

Syntax

normalizeQuery(query)

Arguments:

  • query (String): The SQL query string to normalize.

Returns:

  • A normalized version of the input query with placeholders. (String)

Description

This function performs the following normalizations:

  • Replaces literals with ? placeholders
  • Replaces sequences of literals with ?.. placeholders
  • Replaces complex aliases (containing whitespace, more than two digits, or at least 36 bytes long such as UUIDs) with ? placeholders

This function is useful for analyzing and grouping similar queries, regardless of their specific literal values.

Example:

SELECT
	normalizeQuery('SELECT id, name FROM tacos WHERE price > 5.99 AND spice_level = "hot"') AS normalized_query;

Result:

| normalized_query                                                   |
|--------------------------------------------------------------------|
| SELECT id, name FROM tacos WHERE price > ? AND spice_level = ?     |

In this example, the literal values 5.99 and "hot" are replaced with ? placeholders in the normalized query.

The normalization process helps in identifying structurally similar queries, which can be beneficial for query analysis, caching, and optimization strategies.

normalizeQueryKeepNames

Replaces literals and sequences of literals with placeholders, but preserves complex aliases in SQL queries. This function is useful for analyzing complex query logs.

Syntax

normalizeQueryKeepNames(query)

Arguments:

  • query (String): The SQL query to normalize.

Returns:

  • A normalized version of the input query with placeholders for literals, but preserving complex aliases. [String]

Example:

SELECT
	normalizeQuery('SELECT COUNT(*) AS total_tacos FROM taco_orders WHERE price > 9.99') AS normalized_query,
	normalizeQueryKeepNames('SELECT COUNT(*) AS total_tacos FROM taco_orders WHERE price > 9.99') AS normalized_query_keep_names;

Result:

| normalized_query                                   | normalized_query_keep_names                                |
|----------------------------------------------------|------------------------------------------------------------|
| SELECT COUNT(*) AS `?` FROM `?` WHERE `?` > ?      | SELECT COUNT(*) AS total_tacos FROM `?` WHERE `?` > ?      |

In this example:

  • normalizeQuery replaces all literals and aliases with placeholders.
  • normalizeQueryKeepNames preserves the complex alias total_tacos while still replacing other literals with placeholders.

This function is particularly useful when you want to analyze query patterns in logs while retaining important named elements for context.

normalizedQueryHash

Returns a 64-bit hash value for a normalized query string. This function is useful for analyzing query logs and identifying similar queries with different literal values.

Syntax

normalizedQueryHash(query)

Arguments:

  • query (String): The SQL query string to be hashed.

Returns:

  • A 64-bit hash value of the normalized query. (UInt64)

Details

The function performs the following steps:

  1. Normalizes the query by replacing literals, sequences of literals, and complex aliases with placeholders.
  2. Calculates a hash value for the normalized query.

This process allows queries with different literal values but the same structure to produce the same hash value.

Example:

SELECT
	normalizedQueryHash('SELECT count(*) FROM tacos WHERE filling = "carnitas"') AS hash1,
	normalizedQueryHash('SELECT count(*) FROM tacos WHERE filling = "al pastor"') AS hash2;

Result:

| hash1             | hash2             |
|-------------------|-------------------|
| 13793980780653099 | 13793980780653099 |

In this example, both queries have the same structure but different literal values for the filling column. The normalizedQueryHash function returns the same hash value for both queries, demonstrating its ability to identify structurally similar queries.

This function is particularly useful for query analysis, performance tuning, and identifying frequently executed query patterns in large-scale systems.

Syntax:

normalizedQueryHashKeepNames(query)

Arguments:

  • query (String): The SQL query to be normalized and hashed.

Returns:

A 64-bit hash value of the normalized query. [UInt64]

Description:

This function normalizes the input query by replacing literals with placeholders, but preserves complex aliases (containing whitespace, more than two digits, or at least 36 bytes long such as UUIDs). It then calculates a hash of the normalized query.

This function is useful for analyzing query logs and identifying similar queries while maintaining the readability of complex aliases.

Example:

SELECT
	normalizedQueryHash('SELECT count(*) FROM tacos WHERE filling = "carnitas"') AS hash1,
	normalizedQueryHash('SELECT count(*) FROM tacos WHERE filling = "al pastor"') AS hash2,
	normalizedQueryHashKeepNames('SELECT count(*) AS total_tacos_123 FROM tacos WHERE filling = "carnitas"') AS hash3,
	normalizedQueryHashKeepNames('SELECT count(*) AS total_tacos_123 FROM tacos WHERE filling = "al pastor"') AS hash4;

Result:

| hash1      | hash2      | hash3      | hash4      |
|------------|------------|------------|------------|
| 1234567890 | 1234567890 | 9876543210 | 9876543210 |

In this example:

  • hash1 and hash2 are identical because normalizedQueryHash treats the different fillings as placeholders.
  • hash3 and hash4 are identical to each other but different from hash1 and hash2 because normalizedQueryHashKeepNames preserves the complex alias total_tacos_123.

This function is particularly useful when you want to identify similar queries in your logs while maintaining the ability to distinguish between queries with different complex aliases or table names.

normalizeUTF8NFC

Converts a string to NFC (Normalization Form Canonical Composition) normalized form, assuming the string contains valid UTF-8 encoded text.

Syntax

normalizeUTF8NFC(s)

Arguments:

  • s (String): Input string containing valid UTF-8 encoded text.

Returns:

  • A string transformed to NFC normalization form. (String)

Example:

SELECT
	length('â') AS original_length,
	normalizeUTF8NFC('â') AS nfc,
	length(normalizeUTF8NFC('â')) AS nfc_length;

Result:

| original_length | nfc | nfc_length |
|-----------------|-----|------------|
| 2               | â   | 2          |

In this example:

  • The original ‘â’ is represented by 2 bytes.
  • After NFC normalization, it remains ‘â’ but is now represented as a single Unicode code point.
  • The length of the normalized form is still 2 bytes, as it’s a single character that requires 2 bytes in UTF-8 encoding.

This function is particularly useful when you need to ensure consistent representation of Unicode characters, especially for string comparisons or when working with text that may contain composed characters.

If the input string is not valid UTF-8, no exception is thrown, but the result is undefined. Always ensure your input is valid UTF-8 before using this function.

normalizeUTF8NFD

Converts a string to NFD normalized form, assuming the string contains valid UTF-8 encoded text.

Syntax

normalizeUTF8NFD(s)

Arguments:

  • s (String): UTF-8 encoded input string.

Returns:

  • String transformed to NFD normalization form. (String)

Example:

SELECT
	length('â') AS original_length,
	normalizeUTF8NFD('â') AS nfd,
	length(normalizeUTF8NFD('â')) AS nfd_length;

Result:

| original_length | nfd | nfd_length |
|-----------------|-----|------------|
| 2               | â   | 3          |

In this example:

  • The original ‘â’ is 2 bytes long.
  • After NFD normalization, it’s decomposed into ‘a’ and the combining diacritic ‘◌̂’.
  • The normalized form is 3 bytes long.

This function is useful for text processing tasks where you need to work with canonically decomposed Unicode characters, such as in certain string comparison or sorting operations in multilingual environments.

If the input string contains invalid UTF-8, no exception is thrown, but the result is undefined.

normalizeUTF8NFKC

Converts a string to NFKC normalized form, assuming the string contains valid UTF-8 encoded text.

Syntax

normalizeUTF8NFKC(s)

Arguments:

  • s (String): UTF-8 encoded input string.

Returns:

  • String transformed to NFKC normalization form. [String]

Example:

SELECT
	length('ñ') AS original_length,
	normalizeUTF8NFKC('ñ') AS nfkc,
	length(normalizeUTF8NFKC('ñ')) AS nfkc_length;

Result:

| original_length | nfkc | nfkc_length |
|-----------------|------|-------------|
| 2               | ñ    | 2           |

In this example:

  • We use a Spanish letter ‘ñ’ which can be represented in different Unicode forms.
  • original_length shows the byte length of the original string.
  • nfkc shows the normalized form, which visually looks the same.
  • nfkc_length shows the byte length after normalization, which remains 2 in this case.

This function is useful for standardizing Unicode text, especially when dealing with characters that can have multiple representations. NFKC normalization can help ensure consistent string comparisons and improve text processing in multilingual applications.

normalizeUTF8NFKD

Converts a string to NFKD normalized form, assuming the string contains valid UTF-8 encoded text.

Syntax

normalizeUTF8NFKD(s)

Arguments:

  • s (String): UTF-8 encoded input string.

Returns:

  • String: String transformed to NFKD normalization form.

Example:

SELECT
	length('ñ') AS original_length,
	normalizeUTF8NFKD('ñ') AS nfkd,
	length(normalizeUTF8NFKD('ñ')) AS nfkd_length;

Result:

| original_length | nfkd | nfkd_length |
|-----------------|------|-------------|
| 2               | ñ    | 3           |

In this example:

  • We use a character ‘ñ’ (as in “jalapeño”).
  • The original UTF-8 encoded ‘ñ’ has a length of 2 bytes.
  • After NFKD normalization, it’s decomposed into ‘n’ and the combining tilde, resulting in a 3-byte string.

This function is useful for text processing tasks where you need to normalize Unicode strings, such as when implementing search functionality for a taco recipe database that includes ingredients with diacritical marks.

encodeXMLComponent

Escapes characters with special meaning in XML so they can be safely placed into XML text nodes or attributes.

Syntax

encodeXMLComponent(x)

Arguments:

  • x (String): Input string to encode.

Returns:

  • The input string with special XML characters escaped. [String]

Details

The following characters are replaced:

  • < becomes &lt;
  • & becomes &amp;
  • > becomes &gt;
  • " becomes &quot;
  • ' becomes &apos;

This function is useful when constructing XML content to ensure that string values don’t break the XML structure.

Example:

SELECT
  encodeXMLComponent('Spicy jalapeño & habanero tacos');

Result:

| encodeXMLComponent('Spicy jalapeño & habanero tacos') |
|-------------------------------------------------------|
| Spicy jalapeño &amp; habanero tacos                   |

In this example, the & symbol is encoded to &amp;, ensuring it doesn’t interfere with XML parsing.

decodeXMLComponent

Decodes XML-encoded special characters in a string.

Syntax

decodeXMLComponent(x)

Arguments:

  • x (String): The input string to decode.

Returns:

  • The decoded string. (String)

Description

This function decodes the following XML entities:

  • &quot; to "
  • &amp; to &
  • &apos; to '
  • &gt; to >
  • &lt; to <

It also replaces numeric character references with their corresponding Unicode characters. Both decimal (like &#10003;) and hexadecimal (like &#x2713;) forms are supported.

Example:

SELECT
	decodeXMLComponent('Taco &amp; Burrito') AS decoded_text,
	decodeXMLComponent('Jalape&#241;o') AS decoded_spicy;

Result:

| decoded_text    | decoded_spicy |
|-----------------|---------------|
| Taco & Burrito  | Jalapeño      |

In this example, decodeXMLComponent decodes the &amp; entity to & and the numeric character reference &#241; to ñ.

This function is useful when working with XML data or when you need to unescape HTML entities in text. It’s particularly handy when processing data from web scraping or API responses that may contain encoded characters.

decodeHTMLComponent

Decodes HTML entities and numeric character references in a string.

Syntax

decodeHTMLComponent(encoded_string)

Arguments:

  • encoded_string (String): A string containing HTML entities or numeric character references.

Returns:

  • The decoded string with HTML entities and numeric character references converted to their corresponding characters. (String)

Description

This function decodes HTML entities (like &amp;, &lt;, &gt;) and numeric character references (like &#10003; or &#x2713;) in the input string. It supports both decimal and hexadecimal forms of numeric character references.

Example:

SELECT
	decodeHTMLComponent('I &hearts; Tacos') AS decoded_text,
	decodeHTMLComponent('Spicy &#x1F336; Taco') AS spicy_taco;

Result:

| decoded_text | spicy_taco    |
|--------------|---------------|
| I ♥ Tacos    | Spicy 🌶 Taco |

In this example, the heart symbol (&hearts;) and the hot pepper emoji (&#x1F336;) are decoded into their corresponding Unicode characters.

This function is useful when working with HTML-encoded text, especially when processing web scraping results or user-generated content that may contain HTML entities.

extractTextFromHTML

Extracts plain text from HTML or XHTML content.

Syntax

extractTextFromHTML(html)

Arguments:

  • html (String): Input HTML string.

Returns:

  • Extracted plain text. (String)

Description

This function extracts plain text from HTML or XHTML content using the following rules:

  1. Comments are skipped.
  2. CDATA is included verbatim.
  3. <script> and <style> elements are removed with all their content.
  4. Other tags are removed, but their content is kept.
  5. HTML entities are not decoded.
  6. Whitespace is collapsed or inserted according to specific rules.

Example:

SELECT
  extractTextFromHTML('<p>Taco <b>Tuesday</b> is the <i>best</i>!</p>') AS plain_text;

Result:

| plain_text                |
|---------------------------|
| Taco Tuesday is the best! |

This function is useful for converting HTML content into plain text for further analysis or processing, such as creating a bag of words from web page content.

The function does not fully conform to HTML, XML, or XHTML specifications but provides a reasonably accurate and fast implementation for most use cases.

ascii

Returns the ASCII code point (as Int32) of the first character of a string.

Syntax

ascii(s)

Arguments:

  • s (String): Input string.

Returns:

  • The ASCII code of the first character. [Int32]
  • If s is empty, returns 0.
  • If the first character is not an ASCII character, the result is undefined.

Example:

SELECT
  ascii('Taco') AS first_char_ascii;

Result:

| first_char_ascii  |
|-------------------|
| 84                |

This example returns the ASCII code for ‘T’, which is 84.

For non-ASCII characters or empty strings, consider using additional checks or alternative functions for more robust handling.

soundex

Returns the Soundex code of a string.

Syntax:

soundex(val)

Arguments:

  • val (String): Input string.

Returns:

  • The Soundex code of the input string. [String]

Example:

SELECT
	soundex('guacamole') AS soundex_code;

Result:

| soundex_code |
|--------------|
| G254         |

This function converts similar-sounding words to the same representation, allowing for phonetic matching. It’s particularly useful for name matching in databases where spelling may vary.

Soundex is primarily designed for English words and names. It may not work as effectively for other languages or non-Latin alphabets.

punycodeEncode

Encodes a UTF-8 string using Punycode encoding.

Syntax

punycodeEncode(string)

Arguments:

  • string (String): The UTF-8 encoded input string to be encoded.

Returns:

  • A Punycode-encoded representation of the input string. [String]

Example:

SELECT
  punycodeEncode('Spicy Jalapeño Taco') AS encoded_name;

Result:

| encoded_name           |
|------------------------|
| Spicy Jalapeo-quc Taco |

In this example, the function encodes the string “Spicy Jalapeño Taco” into its Punycode representation. Note how the non-ASCII character ‘ñ’ is encoded, while the ASCII characters remain unchanged.

This function is particularly useful when dealing with Internationalized Domain Names (IDNs) or any scenario where you need to represent Unicode strings using only ASCII characters.

punycodeDecode

Decodes a Punycode-encoded string back into a UTF-8 encoded plaintext string.

Syntax

punycodeDecode(encoded)

Arguments:

  • encoded (String): A Punycode-encoded string.

Returns:

  • The decoded UTF-8 plaintext string. (String)

Example:

SELECT
  punycodeDecode('xn--mnchen-3ya') AS decoded_city;

Result:

| decoded_city |
|--------------|
| münchen      |

In this example, the Punycode-encoded string ‘xn—mnchen-3ya’ is decoded to reveal the original UTF-8 string ‘münchen’.

If the input is not a valid Punycode-encoded string, an exception is thrown. For a version that returns an empty string instead of throwing an exception, see tryPunycodeDecode.

This function is particularly useful when working with Internationalized Domain Names (IDNs) or other scenarios where non-ASCII characters need to be represented in a limited character set and then decoded back to their original form.

tryPunycodeDecode

Attempts to decode a Punycode-encoded string. If the input is not a valid Punycode-encoded string, it returns an empty string instead of throwing an exception.

Syntax:

tryPunycodeDecode(encoded)

Arguments:

  • encoded (String): A Punycode-encoded string.

Returns:

  • The decoded string if successful, or an empty string if decoding fails. [String]

Example:

SELECT
	tryPunycodeDecode('xn--mnchen-3ya') AS decoded_city,
	tryPunycodeDecode('invalid-punycode') AS invalid_decode;

Result:

| decoded_city | invalid_decode |
|--------------|----------------|
| münchen      |                |

In this example, ‘xn—mnchen-3ya’ is successfully decoded to ‘münchen’, while ‘invalid-punycode’ results in an empty string due to failed decoding.

This function is useful when working with Internationalized Domain Names (IDNs) or other Punycode-encoded strings, especially when you want to handle potential decoding errors gracefully in your data processing pipeline.

idnaEncode

Encodes a domain name using the Internationalized Domain Names in Applications (IDNA) mechanism, converting it to its ASCII representation (ToASCII algorithm).

Syntax

idnaEncode(domain)

Arguments:

  • domain (String): A UTF-8 encoded domain name string.

Returns:

  • The ASCII representation of the domain name according to the IDNA mechanism. [String]

Example:

SELECT
  idnaEncode('taco.méxico.com');

Result:

| idnaEncode('taco.méxico.com') |
|-------------------------------|
| taco.xn--mxico-bsa.com        |

This function converts the Unicode domain name ‘taco.méxico.com’ to its Punycode equivalent ‘taco.xn—mxico-bsa.com’.

  • The input string must be UTF-8 encoded and translatable to an ASCII string, otherwise an exception is thrown.
  • No percent decoding or trimming of tabs, spaces, or control characters is performed.
  • For error handling without exceptions, use the tryIdnaEncode function.

tryIdnaEncode

Attempts to encode a string using the Internationalized Domain Names in Applications (IDNA) mechanism, returning an empty string if the encoding fails.

Syntax

tryIdnaEncode(domain)

Arguments:

  • domain (String): The domain name to encode.

Returns:

  • The ASCII representation of the domain name according to the IDNA ToASCII algorithm. (String)
  • An empty string if the encoding fails.

Example:

SELECT
	tryIdnaEncode('tacos.méxico.com') AS encoded_domain,
	tryIdnaEncode('invalid domain!@#') AS failed_encoding;

Result:

| encoded_domain          | failed_encoding |
|-------------------------|-----------------|
| xn--tacos-mxico-r9a.com |                 |

In this example:

  • ‘tacos.méxico.com’ is successfully encoded to its IDNA representation.
  • ‘invalid domain!@#’ fails to encode, resulting in an empty string.

This function is useful for safely handling internationalized domain names in applications, especially when dealing with potentially invalid input.

idnaDecode

Returns the Unicode (UTF-8) representation of a domain name according to the Internationalized Domain Names in Applications (IDNA) mechanism.

Syntax

idnaDecode(domain)

Arguments:

  • domain (String): An IDNA-encoded domain name.

Returns:

  • The Unicode (UTF-8) representation of the input domain name. String
  • If the input is invalid, the function returns the input string without modification.
  • Repeated application of idnaEncode() and idnaDecode() may not return the original string due to case normalization.

Example:

SELECT
  idnaDecode('xn--taco-8va.xn--guacamole-7za.com') AS decoded_domain;

Result:

| decoded_domain         |
|------------------------|
| tácó.guácámólé.com     |

This example decodes an IDNA-encoded domain name, converting it back to its Unicode representation with proper diacritical marks.

byteHammingDistance

Calculates the Hamming distance between two byte strings.

Syntax:

byteHammingDistance(string1, string2)

Alias:

  • mismatches

Arguments:

  • string1 (String): First input string.
  • string2 (String): Second input string.

Returns:

  • The Hamming distance between the two strings. [UInt64]

Example:

SELECT
	byteHammingDistance('salsa', 'guacamole') AS hamming_distance;

Result:

| hamming_distance |
|------------------|
| 8                |

In this example, we calculate the Hamming distance between two taco toppings. The result is 8, indicating the number of positions at which the corresponding bytes are different.

stringJaccardIndex

Calculates the Jaccard similarity index between two strings.

Syntax

stringJaccardIndex(string1, string2)

Arguments:

  • string1 (String): First input string.
  • string2 (String): Second input string.

Returns:

  • The Jaccard similarity index as a number between 0 and 1. (Float64)

Example:

SELECT
	stringJaccardIndex('carne asada', 'al pastor') AS taco_similarity;

Result:

| taco_similarity |
|-----------------|
| 0.125           |

In this example, we calculate the Jaccard similarity index between two popular taco fillings. The low similarity score indicates that these taco varieties have few common ingredients or characteristics.

The Jaccard index is calculated as the size of the intersection divided by the size of the union of two sets. In the context of strings, each string is treated as a set of characters. A higher value indicates greater similarity between the strings.

stringJaccardIndexUTF8

Calculates the Jaccard similarity index between two UTF-8 encoded strings.

Syntax

stringJaccardIndexUTF8(string1, string2)

Arguments:

  • string1 (String): First input string.
  • string2 (String): Second input string.

Returns:

  • The Jaccard similarity index between the two strings, as a number between 0 and 1.
  • Type: Float64.

Example:

SELECT
  stringJaccardIndexUTF8('carnitas', 'al pastor') AS taco_similarity;

Result:

| taco_similarity |
|-----------------|
| 0.125           |

This example calculates the Jaccard similarity index between two different taco fillings. The low similarity score indicates that these strings (and tacos) are quite different.

The function assumes that the input strings contain valid UTF-8 encoded text. If this assumption is violated, the behavior is undefined and no exception is thrown.

editDistance

Calculates the edit distance between two strings.

Syntax

editDistance(string1, string2)

Alias:

  • levenshteinDistance

Arguments:

  • string1 (String): First input string.
  • string2 (String): Second input string.

Returns:

  • The edit distance between the two input strings. (UInt32)

Example:

SELECT
  editDistance('salsa', 'guacamole') AS edit_distance;

Result:

| edit_distance |
|---------------|
| 7             |

In this example, we calculate the edit distance between ‘salsa’ and ‘guacamole’. The result is 7, indicating the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other.

The edit distance is also known as the Levenshtein distance.

editDistanceUTF8

Calculates the edit distance between two UTF-8 encoded strings.

Syntax

editDistanceUTF8(string1, string2)

Arguments:

  • string1 (String): First UTF-8 encoded string to compare.
  • string2 (String): Second UTF-8 encoded string to compare.

Returns:

  • The edit distance between the two strings. UInt32.

Example:

Query:

SELECT
  editDistanceUTF8('salsa verde', 'guacamole') AS edit_distance;

Result:

| edit_distance |
|---------------|
| 9             |

This example calculates the edit distance between two taco toppings, ‘salsa verde’ and ‘guacamole’. The result shows that it takes 9 operations (insertions, deletions, or substitutions) to transform one string into the other.

The function assumes that the input strings are valid UTF-8 encoded text. If this assumption is violated, the behavior is undefined and no exception is thrown.

damerauLevenshteinDistance

Calculates the Damerau-Levenshtein distance between two strings.

Syntax:

damerauLevenshteinDistance(string1, string2)

Arguments:

  • string1 (String): First input string.
  • string2 (String): Second input string.

Returns:

  • The Damerau-Levenshtein distance between the two input strings. [UInt32]

Example:

SELECT
  damerauLevenshteinDistance('salsa', 'guacamole') AS taco_similarity;

Result:

| taco_similarity |
|-----------------|
| 7               |

This example calculates the Damerau-Levenshtein distance between two taco toppings. A lower value indicates more similarity between the strings.

The Damerau-Levenshtein distance is an extension of the Levenshtein distance that allows for transpositions of adjacent characters in addition to insertions, deletions, and substitutions.

jaroSimilarity

Calculates the Jaro similarity between two strings.

Syntax

jaroSimilarity(string1, string2)

Arguments:

  • string1 (String): First input string.
  • string2 (String): Second input string.

Returns:

  • A number representing the Jaro similarity between the two input strings. Float64.

The Jaro similarity is a value between 0 and 1, where 1 indicates identical strings and 0 indicates no similarity.

Example:

SELECT
  jaroSimilarity('carnitas', 'carne asada') AS taco_similarity;

Result:

| taco_similarity |
|-----------------|
| 0.6111111111111 |

In this example, we calculate the Jaro similarity between two taco fillings. The result shows a moderate similarity between ‘carnitas’ and ‘carne asada’.

jaroWinklerSimilarity

Calculates the Jaro-Winkler similarity between two strings.

Syntax

jaroWinklerSimilarity(string1, string2)

Arguments:

  • string1 (String): First input string.
  • string2 (String): Second input string.

Returns:

  • A number representing the Jaro-Winkler similarity between the two input strings. [Float64]
    • Range is 0 to 1, where 1 indicates identical strings and 0 indicates no similarity.

Example:

SELECT
  jaroWinklerSimilarity('salsa verde', 'salsa roja') AS similarity;

Result:

| similarity |
|------------|
| 0.805556   |

This example calculates the Jaro-Winkler similarity between two taco sauce names. The result of 0.805556 indicates a high degree of similarity, which is expected as the strings differ only in their last word.

The Jaro-Winkler similarity is particularly useful for short strings such as person names. It gives more favorable ratings to strings that match from the beginning, making it suitable for comparing things like product names or ingredients in a recipe.

initcap

Converts the first letter of each word in a string to uppercase and the rest to lowercase.

Syntax

initcap(string)

Arguments:

  • string (String): The input string.

Returns:

  • A string with the first letter of each word capitalized. [String]

Example:

SELECT
	initcap('welcome to the taco shop') AS capitalized_title;

Result:

| capitalized_title         |
|---------------------------|
| Welcome To The Taco Shop  |

This function may produce unexpected results for words containing apostrophes or internal capital letters. For example:

SELECT
	initcap('jalapeño''s spicy taco') AS spicy_title;

Result:

| spicy_title              |
|--------------------------|
| Jalapeño'S Spicy Taco    |

The apostrophe is treated as a word boundary, and the ‘S’ is capitalized. This is a known behavior with no current plans for modification.

initcapUTF8

Converts the first letter of each word in a UTF-8 encoded string to uppercase and the rest to lowercase.

Syntax

initcapUTF8(string)

Arguments:

  • string (String): The input string to be converted.

Returns:

  • A string with the first letter of each word capitalized. (String)

Example:

SELECT
  initcapUTF8('spicy jalapeño taco') AS capitalized_taco;

Result:

| capitalized_taco    |
|---------------------|
| Spicy Jalapeño Taco |

This function assumes the input string contains valid UTF-8 encoded text. If this assumption is violated, no exception is thrown, and the result is undefined.

  • This function does not detect language, so results may not be exactly correct for some languages (e.g., Turkish with i/İ vs. i/I).
  • If the length of the UTF-8 byte sequence differs between upper and lower case for a code point, the result may be incorrect for that code point.
  • Words are defined as sequences of alphanumeric characters separated by non-alphanumeric characters.

Unlike the ASCII-only initcap function, initcapUTF8 correctly handles multi-byte UTF-8 characters, making it suitable for text in various languages and scripts.

firstLine

Returns the first line from a multi-line string.

Syntax:

firstLine(s)

Arguments:

  • s (String): Input string.

Returns:

The first line of the input string, or the entire string if there are no line separators. [String]

Example:

SELECT
  firstLine('Carne asada\nAl pastor\nPollo') AS first_taco_filling;

Result:

| first_taco_filling |
|--------------------|
| Carne asada        |

This function is useful when you need to extract the first line from a multi-line text field, such as grabbing the first item from a list of taco fillings.