String splitting functions provide powerful tools for dividing strings into arrays based on various criteria.

ClickHouse function reference

splitByChar

Splits a string into substrings separated by a specified character.

Syntax:

splitByChar(separator, s[, max_substrings])

Arguments:

  • separator (String): The separator character.
  • s (String): The string to split.
  • max_substrings (UInt64, optional): Maximum number of substrings to return. Default is 0 (return all substrings).

Returns:

An array of selected substrings. (Array(String))

Example:

SELECT
  splitByChar(',', 'carnitas,al_pastor,barbacoa,lengua');

Result:

| splitByChar(',', 'carnitas,al_pastor,barbacoa,lengua') |
|--------------------------------------------------------|
| ['carnitas','al_pastor','barbacoa','lengua']           |

In this example, we split a string of taco fillings into an array using a comma as the separator.

Empty substrings may be selected when:

  • A separator occurs at the beginning or end of the string
  • There are multiple consecutive separators
  • The original string is empty

When max_substrings is 0 or omitted, the function returns as many substrings as possible.

splitByString

Splits a string into substrings separated by a specified string delimiter.

Syntax:

splitByString(separator, s[, max_substrings])

Arguments:

  • separator (String): The string separator.
  • s (String): The string to split.
  • max_substrings (UInt64, optional): Maximum number of substrings to return. Default is 0 (return all substrings).

Returns:

An array of selected substrings. (Array(String))

  • Empty substrings may be selected when:
    • A non-empty separator occurs at the beginning or end of the string
    • There are multiple consecutive non-empty separators
    • The original string s is empty while the separator is not empty
  • If separator is an empty string, it splits the string s into an array of single characters

Example:

SELECT
  splitByString(', ', 'beef, chicken, carnitas, barbacoa') AS taco_fillings;

Result:

| taco_fillings                                |
|----------------------------------------------|
| ['beef','chicken','carnitas','barbacoa']     |

In this example, we split a string of taco fillings into an array using ’, ’ as the separator.

Example with empty separator:

SELECT
  splitByString('', 'salsa') AS salsa_letters;

Result:

| salsa_letters       |
|---------------------|
| ['s','a','l','s']   |

When the separator is an empty string, the function splits the input string into individual characters.

Example with max_substrings:

SELECT
  splitByString(', ', 'mild, medium, hot, extra hot', 2) AS sauce_levels;

Result:

| sauce_levels      |
|-------------------|
| ['mild','medium'] |

By specifying max_substrings as 2, we limit the output to the first two substrings.

splitByRegexp

Splits a string into substrings using a regular expression as the separator.

Syntax:

splitByRegexp(regexp, s[, max_substrings])

Arguments:

  • regexp (String or FixedString): Regular expression used as the separator.
  • s (String): The string to split.
  • max_substrings (UInt64, optional): Maximum number of substrings to return. Default is 0 (return all substrings).

Returns:

An array of selected substrings. (Array(String))

Example:

SELECT
  splitByRegexp('\\s+', 'carnitas tacos al pastor barbacoa');

Result:

| splitByRegexp('\\s+', 'carnitas tacos al pastor barbacoa') |
|------------------------------------------------------------|
| ['carnitas','tacos','al','pastor','barbacoa']              |

In this example, the function splits the taco order into individual words using whitespace as the separator.

  1. Empty substrings may be selected when:

    • A non-empty regular expression match occurs at the beginning or end of the string
    • There are multiple consecutive non-empty regular expression matches
    • The original string s is empty while the regular expression is not empty
  2. If no match is found for the regular expression, the string s won’t be split.

Example with max_substrings:

SELECT
  splitByRegexp('\\s+', 'carnitas tacos al pastor barbacoa', 3);

Result:

| splitByRegexp('\\s+', 'carnitas tacos al pastor barbacoa', 3) |
|---------------------------------------------------------------|
| ['carnitas','tacos','al']                                     |

This function is useful for parsing structured text data, extracting specific patterns, or tokenizing strings based on complex separators.

splitByWhitespace

Splits a string into substrings separated by whitespace characters.

Syntax:

splitByWhitespace(s[, max_substrings])

Arguments:

  • s (String): The string to split.
  • max_substrings (UInt64, optional): The maximum number of substrings to return.
    • Default value: 0 (return all possible substrings)

Returns:

An array of selected substrings. (Array(String))

Example:

SELECT
  splitByWhitespace('Carne asada tacos al pastor') AS taco_types;

Result:

| taco_types                                 |
|--------------------------------------------|
| ['Carne','asada','tacos','al','pastor']    |

In this example, the function splits the taco order into individual words.

  1. Empty substrings are not included in the result.
  2. When max_substrings > 0, the function returns at most that many substrings.

Example with max_substrings:

SELECT
	splitByWhitespace('Carne asada tacos al pastor', 3) AS limited_taco_types;

Result:

| limited_taco_types         |
|----------------------------|
| ['Carne','asada','tacos']  |

This function is useful for parsing space-separated values or tokenizing text for further analysis.

splitByNonAlpha

Splits a string into substrings separated by non-alphabetic characters (whitespace and punctuation). Returns an array of selected substrings.

Syntax:

splitByNonAlpha(s[, max_substrings])

Arguments:

  • s (String): The string to split.
  • max_substrings (UInt64, optional): The maximum number of substrings to return. Default is 0 (return all substrings).

Returns:

An array of selected substrings. (Array(String))

Example:

SELECT
  splitByNonAlpha('Crunchy tacos: $3.99, Soft tacos: $3.50!');

Result:

| splitByNonAlpha('Crunchy tacos: $3.99, Soft tacos: $3.50!')      |
|------------------------------------------------------------------|
| ['Crunchy','tacos','3','99','Soft','tacos','3','50']             |

In this example, the function splits the taco menu string into an array of words and numbers, removing all non-alphabetic characters.

Example with max_substrings:

SELECT
	splitByNonAlpha('Crunchy tacos: $3.99, Soft tacos: $3.50!', 3) AS limited_taco_types;

Result:

| splitByNonAlpha('Crunchy tacos: $3.99, Soft tacos: $3.50!', 3)  |
|-----------------------------------------------------------------|
| ['Crunchy','tacos','3']                                         |

This function is useful for tokenizing text, extracting words from sentences, or parsing structured strings where non-alphabetic characters serve as delimiters.

arrayStringConcat

Concatenates string representations of values listed in the array with an optional separator.

Syntax:

arrayStringConcat(arr[, separator])

Arguments:

  • arr (Array): The input array of strings to concatenate.
  • separator (String, optional): The separator to use between array elements. Defaults to an empty string.

Returns:

A string containing the concatenated elements of the input array. (String)

Example:

SELECT
  arrayStringConcat(['Carne Asada', 'Al Pastor', 'Pollo'], ', ') AS taco_order;

Result:

| taco_order                      |
|---------------------------------|
| Carne Asada, Al Pastor, Pollo   |

In this example, arrayStringConcat joins the array of taco types into a single string, separating each type with a comma and space.

If the separator is omitted, the array elements are concatenated without any characters between them.

alphaTokens

Selects substrings of consecutive bytes from the ranges a-z and A-Z. Returns an array of substrings.

Syntax:

alphaTokens(s[, max_substrings])

Alias:

  • splitByAlpha

Arguments:

  • s (String): The string to split.
  • max_substrings (UInt64, optional): Maximum number of substrings to return. Default is 0 (return all substrings).

Returns:

An array of selected substrings. [Array(String)]

Example:

SELECT
  alphaTokens('Crunchy Taco Supreme 2.99') AS taco_tokens;

Result:

| taco_tokens                     |
|---------------------------------|
| ['Crunchy','Taco','Supreme']    |

In this example, alphaTokens extracts all consecutive alphabetic substrings from the taco menu item, ignoring numbers and spaces.

extractAllGroups

Extracts all groups from non-overlapping substrings matched by a regular expression.

Syntax:

extractAllGroups(text, regexp)

Arguments:

  • text (String or FixedString): The input text to search for matches.
  • regexp (String or FixedString): The regular expression to match against. Must be a constant.

Returns:

  • An array of arrays containing the matched groups. [Array(Array(String))]
    • If at least one matching group is found, returns an array of arrays, where each inner array represents a match and contains the captured groups.
    • If no matching groups are found, returns an empty array.

Example:

SELECT
	extractAllGroups('Carnitas Taco=$5.99, Fish Taco="$6.50"', '("[^"]+"|\\w+)=("[^"]+"|\\w+)') AS taco_prices;

Result:

| taco_prices                                         |
|-----------------------------------------------------|
| [['Carnitas Taco','5.99'],['Fish Taco','"$6.50"']]  |

In this example:

  • The function extracts price information for different types of tacos.
  • Each inner array contains two elements: the taco name and its price.
  • The regular expression matches patterns like name=price, where both name and price can be quoted or unquoted.

This function is particularly useful for parsing structured text data, such as log files or semi-structured data formats, where you need to extract multiple pieces of information based on a pattern.

ngrams

Splits a UTF-8 string into n-grams of a specified size.

Syntax:

ngrams(string, ngramsize)

Arguments:

  • string (String or FixedString): The input string to split into n-grams.
  • ngramsize (UInt): The size of each n-gram.

Returns:

An array of strings, where each string is an n-gram of the specified size. [Array(String)]

Example:

SELECT
  ngrams('Taco Tuesday', 3) AS taco_grams;

Result:

| taco_grams                                          |
|-----------------------------------------------------|
| ['Tac','aco',' Tu','Tue','ues','esd','sda','day']   |

In this example, the function splits the phrase “Taco Tuesday” into 3-grams. Each element in the resulting array represents a substring of 3 consecutive characters from the input string.

The function considers UTF-8 characters, so it works correctly with multi-byte characters as well.

This function is useful for text analysis, particularly in natural language processing tasks such as text classification or similarity measurement.

tokens

Splits a string into tokens using non-alphanumeric ASCII characters as separators.

Syntax:

tokens(input_string)

Arguments:

  • input_string (String): The string to be tokenized.

Returns:

An array of tokens extracted from the input string. [Array(String)]

Example:

SELECT
  tokens('spicy_taco, mild_salsa; extra_guacamole') AS taco_ingredients;

Result:

| taco_ingredients                                       |
|--------------------------------------------------------|
| ['spicy','taco','mild','salsa','extra','guacamole']    |

In this example, the tokens function splits the input string into separate words, removing the non-alphanumeric separators (underscore, comma, semicolon, and space).

The function treats any sequence of non-alphanumeric ASCII characters as a separator. Multiple consecutive separators are treated as a single separator.

This function is useful for text processing tasks such as tokenizing ingredient lists, parsing user input, or preparing text for further analysis.