String Search functions
Find patterns and substrings within text.
ClickHouse function reference
position
Returns the position (in bytes, starting from 1) of a substring in a string.
Syntax
Alias:
position(needle IN haystack)
Arguments
haystack
(String
): String to search in.needle
(String
): Substring to search for.start_pos
(UInt
, optional): Position (1-based) inhaystack
to start the search.
Returns
- Starting position in bytes (counting from 1) if the substring was found. (
UInt64
) - 0 if the substring was not found. (
UInt64
)
If needle
is empty, these rules apply:
- If no
start_pos
specified: return 1 - If
start_pos = 0
: return 1 - If
start_pos >= 1
andstart_pos <= length(haystack) + 1
: returnstart_pos
- Otherwise: return 0
Example
Result:
In this example:
- ‘taco’ is found at position 13
- ‘burrito’ is not found, so 0 is returned
- Starting from position 10, the next ‘a’ is found at position 15
Example with empty needle:
Result:
This example demonstrates the behavior with an empty needle string and various start positions.
locate
Returns the position of a substring within a string.
Syntax:
Arguments:
needle
(String
): Substring to search for.haystack
(String
): String to search in.start_pos
(UInt
, optional): Position to start the search from (1-based).
Returns:
- The position (in bytes) of the first occurrence of
needle
inhaystack
, starting from 1. - 0 if the substring is not found.
- If
needle
is empty, the function follows these rules:- If no
start_pos
is specified: returns 1 - If
start_pos = 0
: returns 1 - If
start_pos >= 1
andstart_pos <= length(haystack) + 1
: returnsstart_pos
- Otherwise: returns 0
- If no
Example:
Result:
In this example:
- ‘salsa’ is found at position 7 in the string.
- ‘guac’ is not found, so it returns 0.
- ‘a’ is found at position 13 when starting the search from position 8.
The locate
function is case-sensitive. For case-insensitive searches, consider using locateCaseInsensitive
or locateCaseInsensitiveUTF8
functions.
positionCaseInsensitive
Returns the position (in bytes, starting from 1) of a substring in a string, ignoring case.
Syntax
Arguments
haystack
(String
): String to search in.needle
(String
): Substring to search for.start_pos
(UInt
, optional): Position to start the search from (1-based).
Returns
- The position (in bytes) of the found substring, starting from 1.
- 0 if the substring was not found.
Example
Result:
This function performs a case-insensitive search, so ‘taco’ matches ‘Taco’ at position 1.
Example with start_pos
Result:
Here, the search starts from position 7, finding the second occurrence of ‘Taco’ at position 14.
The function follows the same rules for empty substrings as the position
function. If the needle
is an empty string, these rules apply:
- If no
start_pos
was specified: return 1 - If
start_pos = 0
: return 1 - If
start_pos >= 1
andstart_pos <= length(haystack) + 1
: returnstart_pos
- Otherwise: return 0
This function is useful when you need to find a substring within a larger string without worrying about the case of the characters, such as searching for ingredient names in a recipe database regardless of capitalization.
positionUTF8
Returns the position (in Unicode code points, starting from 1) of a substring in a string, assuming both the substring and the string are UTF-8 encoded.
Syntax
Arguments
haystack
(String
): The string to search in.needle
(String
): The substring to search for.start_pos
(UInt
, optional): The position (1-based) inhaystack
at which the search starts.
Returns
- The starting position in Unicode code points (1-based) if the substring was found.
- 0 if the substring was not found.
Example
Result:
In this example, positionUTF8
correctly counts the Unicode character ‘ñ’ as a single code point, returning its position as 6.
If the needle
substring is empty, these rules apply:
- If no
start_pos
was specified: return 1 - If
start_pos = 0
: return 1 - If
start_pos >= 1
andstart_pos <= length(haystack) + 1
: returnstart_pos
- Otherwise: return 0
positionCaseInsensitiveUTF8
Returns the position (in Unicode code points) of a substring in a string, ignoring case and assuming UTF-8 encoding.
Syntax
Arguments
haystack
(String
): The string to search in.needle
(String
): The substring to search for.start_pos
(UInt
, optional): The position to start searching from (1-based).
Returns
- The position (in Unicode code points) of the first occurrence of
needle
inhaystack
starting fromstart_pos
, if found. - 0, if
needle
is not found.
Type: UInt64
Example
Result:
This example finds the position of ‘ñO’ (case-insensitive) in the UTF-8 encoded string, returning 6 as the position of ‘ño’ in ‘Jalapeño’.
- The function is case-insensitive, so ‘A’ and ‘a’ are considered the same.
- The function counts positions in Unicode code points, not bytes.
- If
needle
is empty, the function returns 1. - If
start_pos
is 0, it’s treated as 1. - If
start_pos
is greater than the length ofhaystack
, the function returns 0.
This function is useful for searching within text that may contain non-ASCII characters and where case should be ignored, such as when dealing with user-generated content or multilingual data.
multiSearchAllPositions
Returns an array of positions (in bytes, starting from 1) of substrings in a string.
Syntax
Arguments
haystack
(String
): String to search in.needle
(Array(String)
): Array of substrings to search for.
Returns:
- Array of positions.
- For each needle:
- If found, the position (in bytes from 1) of the first occurrence.
- If not found, 0.
Type: Array(UInt64)
Example
Query:
Result:
In this example:
- ‘taco’ is found at position 13
- ‘salsa’ and ‘guacamole’ are not found, so their positions are 0
The function supports up to 2^8 needles.
multiSearchAllPositionsCaseInsensitive
Returns an array of positions (in bytes, starting from 1) of all occurrences of the needle substrings in the haystack string, ignoring case.
Syntax
Arguments
haystack
(String
): String in which to search.needle
(Array(String)
): Array of substrings to search for.
Returns:
- Array of the same size as the
needle
array, where each element is:- The position (in bytes) of the corresponding needle in haystack, if found.
- 0 if the substring was not found.
[Array(UInt64)
]
Example
Result:
In this example:
- ‘THE’ is found at position 1 (case-insensitive match with ‘The’)
- ‘dog’ is found at position 41
- ‘quick’ is found at position 5
- ‘TACO’ is not found, so its position is 0
This function is useful for simultaneously searching for multiple substrings in a given text, ignoring case differences. It can be particularly handy in text analysis, log parsing, or when working with user-generated content where case sensitivity is not important.
The function supports up to 2^8 (256) needles.
multiSearchAllPositionsUTF8
Searches for multiple substrings in a UTF-8 encoded string and returns an array of positions.
Syntax
Arguments
haystack
(String
): UTF-8 encoded string in which to search.needle
(Array(String)
): Array of UTF-8 encoded substrings to search for.
Returns
- An array of the same size as the
needle
array, where each element is:- The position (in bytes, starting from 1) of the corresponding
needle
if found. - 0 if the substring was not found.
- The position (in bytes, starting from 1) of the corresponding
[Array(UInt64)
]
Example
Result:
In this example:
- ‘jalapeño’ is found at position 7
- ‘habanero’ is found at position 21
- ‘salsa’ is not found, so its position is 0
- This function is UTF-8 aware, so it correctly handles multi-byte characters.
- The function supports up to 2^8 needles.
- If any non-constant
haystack
orneedle
arguments are more than 32KB in size, the function will throw an exception.
multiSearchAllPositionsCaseInsensitiveUTF8
Returns an array of positions (in bytes, starting from 1) of all occurrences of multiple needle substrings in a haystack string. The search is case-insensitive and assumes UTF-8 encoding.
Syntax
Arguments
haystack
(String
): UTF-8 encoded string in which the search is performed.needle
(Array(String)
): Array of UTF-8 encoded substrings to be searched.
Returns
- Array of arrays, where each inner array contains:
- The starting positions (in bytes, counting from 1) of the corresponding needle if found.
- 0 if the needle was not found.
Type: Array(Array(UInt64))
Example
Result:
In this example:
- ‘taco’ is found at position 7 (case-insensitive)
- ‘SPICY’ is found at position 1 (case-insensitive)
- ‘wednesday’ is not found, so 0 is returned
- This function supports up to 2^8 needles.
- The function is optimized for searching multiple substrings in a single pass through the haystack.
- It’s particularly useful for tasks like keyword matching or filtering based on multiple criteria.
multiSearchFirstPosition
Returns the position of the leftmost occurrence of any of the substrings in the haystack string.
Syntax
Arguments
haystack
(String
): String in which to search.needle
(Array(String)
): Array of substrings to search for.
Returns
- The position (in bytes) of the leftmost occurrence of any needle substring in the haystack string.
- 0 if none of the needles are found.
Type: UInt64
Example
Result:
In this example, ‘tacos’ is the leftmost occurrence among the search terms, starting at position 8 in the haystack string.
This function is case-sensitive. For case-insensitive searches, use multiSearchFirstPositionCaseInsensitive
.
multiSearchFirstPositionCaseInsensitive
Returns the position of the first occurrence of any of the specified substrings in the given string, ignoring case. The search is performed case-insensitively.
Syntax
Arguments
haystack
(String
): The string to search in.needle
(Array(String)
): Array of substrings to search for.
Returns
- The position (in bytes) of the first occurrence of any matched substring (1-based).
- 0 if none of the substrings are found.
Type: UInt64
Example
Result:
In this example, ‘TACO’ is found first at position 9, ignoring case.
This function is useful for efficiently finding the first occurrence of multiple substrings in a given text, without being sensitive to letter case.
multiSearchFirstPositionUTF8
Returns the position of the first occurrence of any of the substrings in the string, assuming UTF-8 encoding.
Syntax
Arguments
haystack
(String
): The UTF-8 encoded string to search in.needle
(Array(String)
): An array of UTF-8 encoded substrings to search for.
Returns
- The position (in bytes) of the leftmost match of any of the needles within the haystack.
- 0 if none of the needles are found.
Type: UInt64
Example
Result:
In this example, ‘jalapeño’ is found at position 7 (counting bytes, not characters) in the UTF-8 encoded string.
- This function is case-sensitive.
- If you need case-insensitive search, use
multiSearchFirstPositionCaseInsensitiveUTF8
instead. - The function returns the position in bytes, not in characters, which is important for UTF-8 strings where characters can occupy multiple bytes.
- The function supports up to 2^8 needles in the array.
multiSearchFirstPositionCaseInsensitiveUTF8
Returns the position of the first occurrence of any of the substrings in the string, ignoring case and assuming UTF-8 encoding.
Syntax
Arguments
haystack
(String
): The UTF-8 encoded string to search in.needle
(Array(String)
): Array of UTF-8 encoded substrings to search for.
Returns:
- The position (in bytes) of the leftmost match of any of the needles.
- 0 if none of the needles are found.
Type: UInt64
Example
Result:
This function finds the first occurrence of either ‘TACO’ or ‘BURRITO’ (case-insensitive) in the UTF-8 encoded string of taco types. It returns 9, which is the byte position of the first ‘Tacos’ in ‘Carnitas Tacos’.
- The function is case-insensitive, so ‘TACO’ matches ‘Tacos’.
- The search is performed on UTF-8 encoded strings, so it correctly handles multi-byte characters.
- If none of the needles are found, the function returns 0.
- The function supports up to 2^8 needles in the array.
This function is useful for efficiently finding the first occurrence of multiple substrings in a larger string, especially when dealing with case-insensitive, UTF-8 encoded text.
multiSearchFirstIndex
Returns the index of the first matching needle string found in the haystack string.
Syntax
Arguments
haystack
(String
): String to search in.needle
(Array(String)
): Array of strings to search for.
Returns
- The index (1-based) of the first matching needle string.
- 0 if none of the needle strings are found.
Type: UInt8
Example
Result:
In this example, ‘tacos’ is the first needle found in the haystack, so the function returns 1 (the index of ‘tacos’ in the needle array).
- The search is case-sensitive.
- If you need case-insensitive search, use
multiSearchFirstIndexCaseInsensitive
instead. - For UTF-8 encoded strings, use
multiSearchFirstIndexUTF8
. - The function returns as soon as it finds the first match, which can be more efficient than searching for all occurrences.
multiSearchFirstIndexCaseInsensitive
Returns the index of the first matched needle in the haystack string, ignoring case. The index is 1-based, meaning the first position is 1, not 0.
Syntax
Arguments
haystack
(String
): The string to search in.needle
(Array(String)
): Array of substrings to search for.
Returns
- The index of the first matched needle (1-based).
- 0 if none of the needles are found.
Type: UInt8
Example
Result:
In this example, ‘TACO’ is the first needle that matches (case-insensitively) in the haystack, and its index is 1.
multiSearchFirstIndexUTF8
Returns the index of the first found needle in the haystack string, assuming both are UTF-8 encoded. The index is 1-based, meaning the first position is 1, not 0.
Syntax
Arguments
haystack
(String
): UTF-8 encoded string in which to search.needle
(Array(String)
): Array of UTF-8 encoded substrings to search for.
Returns
- The index (starting from 1) of the leftmost found needle.
- 0 if none of the needles are found.
Type: UInt8
Example
Result:
In this example, ‘jalapeño’ is the first needle found in the haystack, so the function returns 1.
This function is case-sensitive. For a case-insensitive version, use multiSearchFirstIndexCaseInsensitiveUTF8
.
multiSearchFirstIndexCaseInsensitiveUTF8
Returns the index of the first found needle from the array in the haystack string, ignoring case and assuming UTF-8 encoding.
Syntax
Arguments
haystack
(String
): UTF-8 encoded string in which to search.needle
(Array(String)
): Array of UTF-8 encoded strings to search for.
Returns:
- Index (starting from 1) of the leftmost found needle.
- 0 if none of the needles are found.
Type: UInt8
Example
Query:
Result:
In this example, ‘TACO’ (index 1) is found first in the haystack, ignoring case.
This function is useful for efficiently finding the first occurrence of multiple strings within a larger text, while ignoring case differences and properly handling UTF-8 encoded strings.
multiSearchAny
Returns 1 if at least one of the strings in the needles
array matches the haystack
string, and 0 otherwise.
Syntax
Arguments
haystack
(String
): String in which to search.needles
(Array(String)
): Array of substrings to search for.
Returns
- 1 if there was at least one match.
- 0 if there was no match.
Type: UInt8
Example
Result:
In this example, the function returns 1 because the substring ‘taco’ is found in the haystack string.
This function is case-sensitive. For case-insensitive searches, use multiSearchAnyCaseInsensitive
. For UTF-8 encoded strings, use multiSearchAnyUTF8
or multiSearchAnyCaseInsensitiveUTF8
.
multiSearchAnyCaseInsensitive
Returns 1 if at least one of the strings in the needles
array matches the haystack
string case-insensitively, and 0 otherwise.
Syntax
Arguments
haystack
(String
): String in which to search.needles
(Array(String)
): Array of strings to search for.
Returns
- 1 if there was at least one case-insensitive match.
- 0 if there was no match.
Type: UInt8
Example
Query:
Result:
In this example, the function returns 1 because ‘TACO’ matches ‘Taco’ in the haystack string, ignoring case.
This function is useful for efficient case-insensitive searches of multiple substrings within a larger string. It’s particularly handy when dealing with user input or when case sensitivity is not important in your search criteria.
multiSearchAnyUTF8
Checks if any of the substrings from the needles
array are present in the haystack
string. Assumes both haystack
and needles
are UTF-8 encoded strings.
Syntax
Arguments
haystack
(String
): UTF-8 encoded string to search in.needles
(Array(String)
): Array of UTF-8 encoded substrings to search for.
Returns
- 1 if at least one match is found.
- 0 if no matches are found.
Type: UInt8
Example
Query:
Result:
In this example, the function returns 1 because the substring ‘taco’ is found in the haystack string.
This function is case-sensitive. For a case-insensitive version, use multiSearchAnyCaseInsensitiveUTF8
.
multiSearchAnyCaseInsensitiveUTF8
Checks if any of the UTF-8 encoded substrings are present in the given UTF-8 encoded string, ignoring case.
Syntax
Arguments
haystack
(String
): UTF-8 encoded string in which to search.needle
(Array(String)
): Array of UTF-8 encoded substrings to search for.
Returns
- 1 if any of the substrings are found (ignoring case).
- 0 if none of the substrings are found.
Type: UInt8
Example
Query:
Result:
In this example, the function returns 1 because ‘SUPREME’ is found in the haystack ‘Crunchy Taco Supreme’, ignoring case.
This function is case-insensitive and works with UTF-8 encoded strings. It’s useful for checking the presence of multiple substrings in a given text, regardless of their case or Unicode representation.
match
Checks whether a string matches a regular expression pattern.
Syntax
Alias:
REGEXP
Arguments
haystack
(String
): The string to search in.pattern
(String
): The regular expression pattern with re2 syntax.
Returns
1
if the string matches the pattern.0
if the string does not match the pattern.
[UInt8
]
- The function uses UTF-8 encoding. The regular expression must not contain null bytes.
- Unlike re2’s default behavior,
.
matches line breaks. To disable this, prepend the pattern with(?-s)
. - For simple substring searches, consider using
like
orposition
functions for better performance.
Example
Result:
This example checks if the menu items contain the word “Taco”.
Alias
The REGEXP
operator is an alias for match
:
multiMatchAllIndices
Returns an array of all indices that match the haystack string using multiple regular expression patterns.
Syntax
Arguments
haystack
(String
): String to search in.pattern1, pattern2, ..., patternN
(Array(String)
): Regular expressions with re2 syntax.
Returns
- Array of indices (1-based) of patterns that match the haystack, in the order they appear in the haystack. (
Array(UInt64)
)
Example
Result:
In this example:
- Index 1 corresponds to ‘quick’
- Index 2 corresponds to ‘brown’
- Index 8 corresponds to ‘lazy’
- ‘taco’ is not found, so it’s not included in the result
extract
Extracts a fragment of a string using a regular expression. If the haystack does not match the pattern, an empty string is returned.
For regex without subpatterns, the function uses the fragment that matches the entire regex. Otherwise, it uses the fragment that matches the first subpattern.
Syntax:
Arguments:
haystack
(String
): String in which the search is performed.pattern
(String
): Regular expression with re2 syntax.
Returns:
- The matched fragment, or an empty string if no match is found. [
String
]
Example:
Result:
In this example, extract
finds the price pattern in the taco description string.
If you only want to search for substrings in a string, consider using functions like position
or like
instead, as they are generally faster than regular expression-based functions.
extractAll
Extracts all fragments of a string that match a regular expression pattern.
Syntax:
Arguments:
haystack
(String
) - The string to search in.pattern
(String
) - The regular expression pattern to match.
Returns:
An array of strings containing all matches of the pattern in the haystack. If no matches are found, an empty array is returned. [Array(String)
]
Example:
Result:
In this example, extractAll
finds all words ending in “dillas” or “dilla” in the given string about Mexican food.
The regular expression pattern uses \w+
to match one or more word characters, followed by “dillas” with an optional “s” at the end. The backslashes are escaped in the SQL string.
extractAllGroupsHorizontal
Matches all groups of the haystack string using the pattern regular expression. Returns an array of arrays, where the first array includes all fragments matching the first group, the second array - matching the second group, etc.
Syntax:
Arguments:
haystack
(String
) — Input string.pattern
(String
) — Regular expression with re2 syntax. Must contain groups, each group enclosed in parentheses. If pattern contains no groups, an exception is thrown.
Returns:
- Array of arrays of matches. (
Array
)
If haystack
does not match the pattern regex, an array of empty arrays is returned.
Example:
Result:
In this example:
- The first array contains all matched taco names:
['Carne asada','Al pastor','Chorizo']
- The second array contains all matched prices:
['10','12','11']
This function is useful for extracting structured data from strings, such as parsing menu items and prices from a text description.
extractAllGroupsVertical
Matches all groups of the haystack string using the pattern regular expression. Returns an array of arrays, where each array includes matching fragments from every group. Fragments are grouped in order of appearance in the haystack.
Syntax:
Arguments:
haystack
(String
): Input string.pattern
(String
): Regular expression with re2 syntax. Must contain groups, each group enclosed in parentheses. If pattern contains no groups, an exception is thrown.
Returns:
Array of arrays of matches. (Array
)
If haystack does not match the pattern regex, an empty array is returned.
Example:
Result:
In this example:
- The function extracts all price-item pairs from the taco special menu.
- Each inner array contains the price (first group) and the item name (second group).
- The results are grouped vertically, meaning each inner array represents a complete match of both groups.
This function is particularly useful when you need to extract and group multiple related pieces of information from a string using a regular expression with multiple capture groups.
like
Checks if a string matches a pattern.
Syntax
Alias:
haystack LIKE pattern
(operator)
Arguments
haystack
(String
): The string to search in.pattern
(String
): The LIKE pattern to match against.
Returns
1
if the string matches the pattern,0
otherwise. [UInt8
]
- The
LIKE
pattern can contain the following special characters:%
matches any number of characters (including zero characters)_
matches any single character\
escapes special characters (%
,_
, and\
)
- Matching is based on UTF-8 encoding
- For patterns of the form
%needle%
, the function is as fast as theposition
function - Other patterns are internally converted to regular expressions
Example
Result:
In this example, we check if the taco type ‘Carne Asada’ matches various LIKE patterns.
To match against literal %
, _
, and \
characters, prepend them with a backslash: \%
, \_
, and \\
. Note that in SQL strings, backslashes must be escaped as well, so you would actually write \\%
, \\_
, and \\\\
.
notLike
Returns whether a string does not match a pattern. This function is the negation of the like
function.
Syntax:
Alias:
haystack NOT LIKE pattern
(operator)
Arguments:
haystack
(String
): The string to search in.pattern
(String
): The pattern to search for, which can contain the following wildcards:%
— Matches any number of characters (including zero characters)_
— Matches any single character
Returns:
1
if the string does not match the pattern.0
if the string matches the pattern.
[UInt8
]
Example:
Result:
In this example:
'Spicy Beef Taco'
matches the pattern'Spicy%'
, sonotLike
returns0
.'Veggie Taco'
does not match the pattern'Spicy%'
, sonotLike
returns1
.
This function is useful for filtering out rows that do not match a certain pattern.
The notLike
function is case-sensitive. For case-insensitive matching, use notILike
.
ilike
Performs a case-insensitive pattern matching operation.
Syntax:
Alias:
haystack ILIKE pattern
(operator)
Arguments:
haystack
(String
): The string to search in.pattern
(String
): The LIKE pattern to match against.
Returns:
Returns 1 if the string matches the pattern, 0 otherwise. [UInt8
]
Example:
Result:
In this example, ilike
checks if each taco_name
contains the word “spicy” (case-insensitive). The function returns 1 for names containing “spicy” in any case, and 0 otherwise.
The ilike
function is similar to like
, but it ignores case when matching. It’s particularly useful when you need to perform case-insensitive string matching operations in your queries.
The ilike
function follows the same pattern matching rules as like
, where %
matches any number of characters and _
matches any single character.
notILike
Returns whether the string haystack
does not match the case-insensitive LIKE expression pattern
.
Syntax:
Alias:
haystack NOT ILIKE pattern
(operator)
Arguments:
haystack
(String
): The string to search in.pattern
(String
): The LIKE expression pattern to match against.
Returns:
1
ifhaystack
does not matchpattern
,0
otherwise. [UInt8
]
Example:
Result:
In this example, notILike
returns 0
because ‘Spicy Taco’ does match the case-insensitive pattern ‘%taco%’.
- The
ILIKE
pattern matching is case-insensitive. %
matches any number of characters (including zero characters)._
matches any single character.- To escape special characters (
%
,_
,\
), precede them with a backslash (\
). - This function is the negation of the
ilike
function.
When working with literal backslashes in SQL strings, you may need to escape them twice: once for the SQL string literal, and once for the LIKE pattern. For example, to match a literal underscore, you might use '\\\_'
.
ngramDistance
Calculates the 4-gram distance between two strings. Returns a value between 0 and 1, where a smaller value indicates more similar strings.
Syntax
Arguments
haystack
(String
) — First string for comparison.needle
(String
) — Second string for comparison.
Returns:
- A value between 0 and 1 representing the similarity between the two strings. (
Float32
)
Details
The function counts the symmetric difference between two multisets of 4-grams and normalizes it by the sum of their cardinalities. A result closer to 0 indicates more similar strings.
This function will throw an exception if constant needle
or haystack
arguments are more than 32Kb in size. For non-constant arguments larger than 32Kb, the distance is always 1.
Example
Result:
In this example, the 4-gram distance between “Carne Asada Taco” and “Carnitas Taco” is calculated. The result of 0.5555556 indicates a moderate level of similarity between the two taco names.
See Also
ngramDistanceCaseInsensitive
ngramDistanceUTF8
ngramDistanceCaseInsensitiveUTF8
ngramDistanceCaseInsensitive
Calculates the 4-gram distance between two strings, ignoring case. Returns a value between 0 and 1, where 0 indicates identical strings and 1 indicates completely different strings.
Syntax
Arguments
haystack
(String
): First string for comparison.needle
(String
): Second string for comparison.
Returns
- The 4-gram distance between the strings, ignoring case. (
Float32
)
Example
Result:
This example compares two taco descriptions, ignoring case. The result (0.23529412) indicates that the strings are quite similar, with the main difference being “Beef” vs “chicken”.
- The function is case-insensitive, so “Taco” and “taco” are treated as identical.
- If constant arguments are more than 32KB in size, an exception is thrown.
- For non-constant arguments larger than 32KB, the distance is always 1.
This function is useful for fuzzy string matching and comparing similar strings while ignoring differences in capitalization.
ngramDistanceUTF8
Calculates the 4-gram distance between two UTF-8 encoded strings.
Syntax:
Arguments:
haystack
(String
): First UTF-8 encoded string for comparison.needle
(String
): Second UTF-8 encoded string for comparison.
Returns:
- A value between 0 and 1 representing the similarity between the two strings. The smaller the result, the more similar the strings are to each other. [
Float32
]
Example:
Result:
This example calculates the 4-gram distance between two different taco names. The result of 0.6666667 indicates that the strings are somewhat different, but have some similarity.
Implementation details:
The function uses the 3-gram distance for UTF-8 strings. It’s not a perfectly fair n-gram distance. The function uses 2-byte hashes to hash n-grams and then calculates the symmetric difference between these hash tables – collisions may occur.
If constant haystack
or needle
arguments are more than 32KB in size, an exception will be thrown. For non-constant arguments exceeding 32KB, the distance is always 1.
This function is useful for fuzzy string matching and comparison, especially when dealing with UTF-8 encoded text such as taco menu items in different languages or with special characters.
ngramDistanceCaseInsensitiveUTF8
Calculates the 4-gram distance between two UTF-8 encoded strings, ignoring case. Returns a value between 0 and 1, where 0 indicates identical strings and 1 indicates completely different strings.
Syntax
Arguments
haystack
(String
): First UTF-8 encoded string for comparison.needle
(String
): Second UTF-8 encoded string for comparison.
Returns
- A value between 0 and 1 representing the similarity between the two strings. [
Float32
]
Example
Result:
This example demonstrates that the function is case-insensitive, returning 0 for identical strings that differ only in case.
- This function uses the 3-gram distance for UTF-8 strings.
- The comparison is not a perfectly fair n-gram distance. It uses 2-byte hashes to hash n-grams and then calculates the symmetric difference between these hash tables, which may result in collisions.
- For case-insensitive UTF-8 comparison, it doesn’t use a fair
tolower
function. Instead, it zeroes the 5th bit (starting from zero) of each codepoint byte and the first bit of the zeroth byte if there are more than one byte. This works for Latin and most Cyrillic letters.
If you need to compare strings without considering UTF-8 encoding or case sensitivity, you can use the ngramDistance
function instead.
ngramSearch
Calculates the non-symmetric difference between a needle string and a haystack string. This function is useful for fuzzy string search.
Syntax
Arguments
haystack
(String
): String in which the search is performed.needle
(String
): String to be searched.
Returns
- A value between 0 and 1 representing the likelihood of the needle being in the haystack. The bigger the result, the more likely needle is in haystack. (
Float32
)
Example
Result:
In this example, we calculate the likelihood of finding ‘Supreme Taco’ in ‘Crunchy Taco Supreme’. The result of 0.6666667 indicates a high likelihood of a match.
The function calculates the number of n-grams from the needle minus the common number of n-grams normalized by the number of needle n-grams. It uses 4-grams by default.
This function is part of a family of ngram functions. Case-insensitive and UTF-8 variants are available: ngramSearchCaseInsensitive
, ngramSearchUTF8
, and ngramSearchCaseInsensitiveUTF8
.
ngramSearchCaseInsensitive
Calculates the non-symmetric difference between a needle string and a haystack string, ignoring case. Returns a value between 0 and 1, where a higher value indicates a higher likelihood of the needle being in the haystack.
This function is useful for fuzzy string searching.
Syntax
Arguments
haystack
(String
): String in which to search.needle
(String
): Substring to search for.
Returns:
- A value between 0 and 1. (
Float32
)- Closer to 0: Strings are less similar.
- Closer to 1: Strings are more similar or likely to contain the needle.
Example
Result:
This example shows that ‘taco’ is found in ‘Crunchy Taco Supreme’, ignoring the case difference.
- The function uses 4-grams for comparison.
- For UTF-8 strings, consider using
ngramSearchCaseInsensitiveUTF8
instead. - If constant needle or haystack arguments are more than 32KB in size, an exception is thrown.
- For non-constant arguments larger than 32KB, the result is always 1.
See also: ngramSearch
, ngramSearchUTF8
, ngramSearchCaseInsensitiveUTF8
ngramSearchUTF8
Calculates the non-symmetric difference between a needle string and a haystack string, assuming both are UTF-8 encoded. This function is useful for fuzzy string searching.
Syntax
Arguments
haystack
(String
): UTF-8 encoded string in which to search.needle
(String
): UTF-8 encoded string to search for.
Returns
- A value between 0 and 1 representing the likelihood of the needle being in the haystack. The larger the result, the more likely the needle is present. (
Float32
)
Details
- This function uses 3-gram distances for UTF-8 strings.
- The implementation uses 2-byte hashes to hash n-grams and calculates the non-symmetric difference between these hash tables.
- Collisions may occur, which can affect the accuracy of the result.
Example
Result:
In this example, the function returns a high value (0.75) indicating a strong likelihood that ‘jalapeño’ is present in the haystack string, despite the difference in case.
This function is more computationally intensive than exact string matching functions. Use it when fuzzy matching is required, such as for user input correction or similarity searches in text data.
ngramSearchCaseInsensitiveUTF8
Performs a case-insensitive n-gram search on UTF-8 encoded strings, calculating the likelihood of a needle being present in a haystack.
Syntax
Arguments
haystack
(String
): UTF-8 encoded string to search in.needle
(String
): UTF-8 encoded substring to search for.
Returns
- A value between 0 and 1 representing the likelihood of the needle being in the haystack. (
Float32
)
The larger the result, the more likely the needle is present in the haystack.
Example
Result:
This example searches for ‘TACO’ in ‘Spicy Carnitas Tacos’, ignoring case. The result of 0.625 indicates a high likelihood of the word being present.
- This function uses 3-gram distances for UTF-8 strings.
- The case-insensitive comparison is not perfectly fair for all Unicode characters.
- For non-Latin scripts, the function may not provide accurate results in all cases.
This function is useful for fuzzy string searching in multilingual contexts where case sensitivity is not important.
countSubstrings
Returns the number of occurrences of a substring in a string.
Syntax
Arguments
haystack
(String
): String in which to search.needle
(String
): Substring to search for.start_pos
(UInt
, optional): Position (1-based) inhaystack
at which to start the search.
Returns
- The number of occurrences. (
UInt64
)
Examples
Count occurrences of “taco” in a string:
Result:
Using the start_pos
argument:
Result:
This function is case-sensitive. For case-insensitive counting, use countSubstringsCaseInsensitive
.
countSubstringsCaseInsensitive
Returns how often a substring needle
occurs in a string haystack
, ignoring case.
Syntax
Arguments
haystack
(String
): String in which the search is performed.needle
(String
): Substring to be searched.start_pos
(UInt
, optional): Position (1-based) inhaystack
at which the search starts.
Returns
- The number of occurrences. (
UInt64
)
Examples
Query:
Result:
Example with start_pos
argument:
Query:
Result:
In this example, the search starts from the 6th character, skipping the first occurrence of ‘taco’.
countSubstringsCaseInsensitiveUTF8
Returns how often a substring needle
occurs in a string haystack
. Ignores case and assumes that haystack
is a UTF-8 string.
Syntax
Arguments
haystack
(String
): UTF-8 string in which the search is performed.needle
(String
): Substring to be searched.start_pos
(UInt
, optional): Position (1-based) inhaystack
at which the search starts.
Returns
- The number of occurrences. (
UInt64
)
Example
Query:
Result:
Example with start_pos
argument:
Query:
Result:
In this example, the function starts searching from the 15th character, which is after the first “Taco”, so it only counts the last two occurrences.
countMatches
Returns the number of regular expression matches for a pattern in a string.
Syntax:
Arguments:
haystack
(String
): The string to search in.pattern
(String
): The regular expression with re2 syntax.
Returns:
- The number of matches. (
UInt64
)
Example:
Result:
In this example, countMatches
finds three occurrences of the word “taco” (case-sensitive) in the given string about a Taco Tuesday special.
The function is case-sensitive. For case-insensitive matching, use countMatchesCaseInsensitive
.
countMatchesCaseInsensitive
Returns the number of case-insensitive regular expression matches for a pattern in a string.
Syntax:
Arguments:
haystack
(String
): The string to search in.pattern
(String
): The regular expression with re2 syntax.
Returns:
- The number of matches. (
UInt64
)
Example:
Result:
This function is similar to countMatches
, but it performs a case-insensitive search. It’s useful when you need to count occurrences of a pattern regardless of letter case, such as finding mentions of “taco” in menu item names.
regexpExtract
Extracts the first string in a haystack that matches a regular expression pattern and corresponds to the specified regex group index.
Syntax:
Alias:
- REGEXP_EXTRACT(haystack, pattern[, index])
Arguments:
haystack
(String
): String in which the regexp pattern will be matched.pattern
(String
): Regular expression. Must be a constant string.index
(UInt
orInt
, optional): An integer greater than or equal to 0, with a default value of 1. Represents which regex group to extract.
Returns:
- The extracted substring matching the specified group. (
String
)
- If
pattern
contains multiple regex groups,index
indicates which group to extract. - An
index
of 0 means matching the entire regular expression.
Example:
Result:
In this example:
- The first
regexpExtract
call extracts the taco type (group 1). - The second call extracts the price (group 2).
- The third call with index 0 returns the entire matched substring.
hasSubsequence
Returns whether a string contains a given subsequence.
Syntax
Arguments
haystack
(String
): String to search in.needle
(String
): Subsequence to search for.
Returns
- 1 if
needle
is a subsequence ofhaystack
, 0 otherwise. [UInt8
]
A subsequence is a sequence that can be derived from the given string by deleting zero or more elements without changing the order of the remaining elements.
Example
Result:
This example checks if ‘aco’ is a subsequence of ‘Taco Tuesday’. The result is 1 (true) because ‘aco’ can be derived from ‘Taco Tuesday’ by deleting some characters while maintaining the order of the remaining ones.
This function is case-sensitive. For case-insensitive matching, use hasSubsequenceCaseInsensitive
.
hasSubsequenceCaseInsensitive
Returns whether a string contains a given subsequence, ignoring case.
Syntax
Arguments
haystack
(String
): The string to search in.needle
(String
): The subsequence to search for.
Returns
1
ifneedle
is a subsequence ofhaystack
(ignoring case),0
otherwise. [UInt8
]
Example
Result:
This function is useful for case-insensitive substring matching, particularly when you need to check if a certain sequence of characters appears in a string, regardless of their case.
hasSubsequenceUTF8
Returns 1 if needle
is a subsequence of haystack
, or 0 otherwise. This function assumes haystack
and needle
are UTF-8 encoded strings. A subsequence is a sequence that can be derived from the given string by deleting zero or more elements without changing the order of the remaining elements.
Syntax
Arguments
haystack
(String
): String in which the search is performed.needle
(String
): Subsequence to be searched.
Returns
- 1, if
needle
is a subsequence ofhaystack
- 0, otherwise
Type: UInt8
Example
Query:
Result:
This example checks if ‘Tasty Tacos’ is a subsequence of the given string, which it is, so the function returns 1.
hasSubsequenceCaseInsensitiveUTF8
Checks if a given substring is present as a case-insensitive subsequence in a UTF-8 encoded string.
Syntax
Arguments
haystack
(String
): The UTF-8 encoded string to search in.needle
(String
): The UTF-8 encoded subsequence to search for.
Returns
1
ifneedle
is a case-insensitive subsequence ofhaystack
,0
otherwise. [UInt8
]
Example
Result:
This function is useful for case-insensitive substring matching in UTF-8 encoded text, such as finding menu items or ingredients in multilingual taco descriptions.
The function assumes both haystack
and needle
are valid UTF-8 encoded strings. If they are not, the behavior is undefined.
hasToken
Checks if a given token is present in a string.
Syntax
Arguments
haystack
(String
): String in which to search.token
(String
): Token to search for. Must be a constant string.
Returns
- 1 if the token is present in the haystack.
- 0 if the token is not present.
Type: UInt8
Description
A token is defined as a maximal length substring between two non-alphanumeric ASCII characters (or the boundaries of the haystack string).
This function is optimized for performance and is supported by the tokenbf_v1
index specialization.
Example
Result:
In this example, ‘Taco’ is found as a token within the ‘Crunchy Taco Supreme’ string, so the function returns 1.
The function is case-sensitive. Use hasTokenCaseInsensitive
for case-insensitive matching.
hasTokenOrNull
Returns whether a given token is present in a string, with special handling for ill-formed tokens.
Syntax
Arguments
haystack
(String
) — String in which to search.token
(String
) — Token to search for. Must be a constant string.
Returns
1
if the token is present in the haystack.0
if the token is not present in the haystack.NULL
if the token is ill-formed.
[Nullable(UInt8)
]
Details
This function is similar to hasToken
, but returns NULL
instead of throwing an error when the token is ill-formed. A token is considered ill-formed if it contains non-alphanumeric ASCII characters.
The function is case-sensitive. For a case-insensitive version, use hasTokenCaseInsensitiveOrNull
.
This function is supported by the tokenbf_v1
index specialization.
Example
Result:
Result:
Result:
In this last example, NULL
is returned because ‘Doritos,Locos’ is an ill-formed token due to the comma.
hasTokenCaseInsensitive
Returns 1 if a given token is present in a haystack, 0 otherwise. Ignores case.
Syntax:
Arguments:
haystack
(String
): String in which the search is performed.token
(String
): Maximal length substring between two non-alphanumeric ASCII characters (or boundaries of haystack).
Returns:
- 1 if the token is present in the haystack, 0 otherwise. (
UInt8
)
Implementation details:
Token must be a constant string. Supported by tokenbf_v1 index specialization.
Example:
Result:
In this example, the function returns 1 because ‘taco’ is present in the haystack ‘Crunchy Taco Supreme’, ignoring the case difference.
hasTokenCaseInsensitiveOrNull
Returns whether a given token is present in a string, ignoring case. Returns NULL
if the token is ill-formed.
Syntax
Arguments
haystack
(String
): String in which to search.token
(String
): Token to search for. Must be a constant string.
Returns
1
if the token is present in the haystack (case-insensitive).0
if the token is not present.NULL
if the token is ill-formed.
Type: Nullable(UInt8)
- A token is defined as a maximal length substring between two non-alphanumeric ASCII characters (or the boundaries of the haystack).
- This function is supported by the
tokenbf_v1
index specialization.
Example
Result:
Result:
In this example, the second query returns NULL
because ‘burrito,taco’ is not a valid token (it contains a comma).
Was this page helpful?