Splitting string functions
Divide strings into arrays
String splitting functions provide powerful tools for dividing strings into arrays based on various criteria.
ClickHouse function reference
splitByChar
Splits a string into substrings separated by a specified character.
Syntax:
Arguments:
separator
(String
): The separator character.s
(String
): The string to split.max_substrings
(UInt64
, optional): Maximum number of substrings to return. Default is 0 (return all substrings).
Returns:
An array of selected substrings. (Array(String)
)
Example:
Result:
In this example, we split a string of taco fillings into an array using a comma as the separator.
Empty substrings may be selected when:
- A separator occurs at the beginning or end of the string
- There are multiple consecutive separators
- The original string is empty
When max_substrings
is 0 or omitted, the function returns as many substrings as possible.
splitByString
Splits a string into substrings separated by a specified string delimiter.
Syntax:
Arguments:
separator
(String
): The string separator.s
(String
): The string to split.max_substrings
(UInt64
, optional): Maximum number of substrings to return. Default is 0 (return all substrings).
Returns:
An array of selected substrings. (Array(String)
)
- Empty substrings may be selected when:
- A non-empty separator occurs at the beginning or end of the string
- There are multiple consecutive non-empty separators
- The original string
s
is empty while the separator is not empty
- If
separator
is an empty string, it splits the strings
into an array of single characters
Example:
Result:
In this example, we split a string of taco fillings into an array using ’, ’ as the separator.
Example with empty separator:
Result:
When the separator is an empty string, the function splits the input string into individual characters.
Example with max_substrings:
Result:
By specifying max_substrings
as 2, we limit the output to the first two substrings.
splitByRegexp
Splits a string into substrings using a regular expression as the separator.
Syntax:
Arguments:
regexp
(String
orFixedString
): Regular expression used as the separator.s
(String
): The string to split.max_substrings
(UInt64
, optional): Maximum number of substrings to return. Default is 0 (return all substrings).
Returns:
An array of selected substrings. (Array(String)
)
Example:
Result:
In this example, the function splits the taco order into individual words using whitespace as the separator.
-
Empty substrings may be selected when:
- A non-empty regular expression match occurs at the beginning or end of the string
- There are multiple consecutive non-empty regular expression matches
- The original string
s
is empty while the regular expression is not empty
-
If no match is found for the regular expression, the string
s
won’t be split.
Example with max_substrings:
Result:
This function is useful for parsing structured text data, extracting specific patterns, or tokenizing strings based on complex separators.
splitByWhitespace
Splits a string into substrings separated by whitespace characters.
Syntax:
Arguments:
s
(String
): The string to split.max_substrings
(UInt64
, optional): The maximum number of substrings to return.- Default value: 0 (return all possible substrings)
Returns:
An array of selected substrings. (Array(String)
)
Example:
Result:
In this example, the function splits the taco order into individual words.
- Empty substrings are not included in the result.
- When
max_substrings > 0
, the function returns at most that many substrings.
Example with max_substrings:
Result:
This function is useful for parsing space-separated values or tokenizing text for further analysis.
splitByNonAlpha
Splits a string into substrings separated by non-alphabetic characters (whitespace and punctuation). Returns an array of selected substrings.
Syntax:
Arguments:
s
(String
): The string to split.max_substrings
(UInt64
, optional): The maximum number of substrings to return. Default is 0 (return all substrings).
Returns:
An array of selected substrings. (Array(String)
)
Example:
Result:
In this example, the function splits the taco menu string into an array of words and numbers, removing all non-alphabetic characters.
Example with max_substrings:
Result:
This function is useful for tokenizing text, extracting words from sentences, or parsing structured strings where non-alphabetic characters serve as delimiters.
arrayStringConcat
Concatenates string representations of values listed in the array with an optional separator.
Syntax:
Arguments:
arr
(Array
): The input array of strings to concatenate.separator
(String
, optional): The separator to use between array elements. Defaults to an empty string.
Returns:
A string containing the concatenated elements of the input array. (String
)
Example:
Result:
In this example, arrayStringConcat
joins the array of taco types into a single string, separating each type with a comma and space.
If the separator is omitted, the array elements are concatenated without any characters between them.
alphaTokens
Selects substrings of consecutive bytes from the ranges a-z and A-Z. Returns an array of substrings.
Syntax:
Alias:
- splitByAlpha
Arguments:
s
(String
): The string to split.max_substrings
(UInt64
, optional): Maximum number of substrings to return. Default is 0 (return all substrings).
Returns:
An array of selected substrings. [Array(String)
]
Example:
Result:
In this example, alphaTokens
extracts all consecutive alphabetic substrings from the taco menu item, ignoring numbers and spaces.
extractAllGroups
Extracts all groups from non-overlapping substrings matched by a regular expression.
Syntax:
Arguments:
text
(String
orFixedString
): The input text to search for matches.regexp
(String
orFixedString
): The regular expression to match against. Must be a constant.
Returns:
- An array of arrays containing the matched groups. [
Array(Array(String))
]- If at least one matching group is found, returns an array of arrays, where each inner array represents a match and contains the captured groups.
- If no matching groups are found, returns an empty array.
Example:
Result:
In this example:
- The function extracts price information for different types of tacos.
- Each inner array contains two elements: the taco name and its price.
- The regular expression matches patterns like
name=price
, where both name and price can be quoted or unquoted.
This function is particularly useful for parsing structured text data, such as log files or semi-structured data formats, where you need to extract multiple pieces of information based on a pattern.
ngrams
Splits a UTF-8 string into n-grams of a specified size.
Syntax:
Arguments:
string
(String
orFixedString
): The input string to split into n-grams.ngramsize
(UInt
): The size of each n-gram.
Returns:
An array of strings, where each string is an n-gram of the specified size. [Array(String)
]
Example:
Result:
In this example, the function splits the phrase “Taco Tuesday” into 3-grams. Each element in the resulting array represents a substring of 3 consecutive characters from the input string.
The function considers UTF-8 characters, so it works correctly with multi-byte characters as well.
This function is useful for text analysis, particularly in natural language processing tasks such as text classification or similarity measurement.
tokens
Splits a string into tokens using non-alphanumeric ASCII characters as separators.
Syntax:
Arguments:
input_string
(String
): The string to be tokenized.
Returns:
An array of tokens extracted from the input string. [Array(String)
]
Example:
Result:
In this example, the tokens
function splits the input string into separate words, removing the non-alphanumeric separators (underscore, comma, semicolon, and space).
The function treats any sequence of non-alphanumeric ASCII characters as a separator. Multiple consecutive separators are treated as a single separator.
This function is useful for text processing tasks such as tokenizing ingredient lists, parsing user input, or preparing text for further analysis.
Was this page helpful?