String functions in ClickHouse® are key tools for developers and data engineers. These functions enable efficient data manipulation, cleaning, and querying, which are crucial for building data applications. In this article, we'll explore the essential string functions in ClickHouse®, providing detailed explanations and practical examples you can use to learn and as a reference.
Introduction to ClickHouse® string functions
In this post, you will learn about various string functions available in ClickHouse®. We'll cover basic functions like length
and empty
, as well as more advanced functions like replaceRegexpAll
and concatWithSeparator
. Each function will be explained with practical examples, ensuring you can apply them effectively in your projects.
String functions are crucial for data cleaning, transformation, and querying. They allow you to standardize data formats, remove unwanted characters, and extract meaningful information from raw data. By understanding these functions, you can ensure your data is accurate, consistent, and ready for analysis.
A word about the examples
To illustrate various string functions in ClickHouse, we'll use a consistent example throughout this article. The theme is an imaginary SaaS company that tracks orders for taquerias. We'll create a table to store order data and use this schema to demonstrate each function.
Let's start by creating a table called orders
to store our data.
CREATE TABLE orders (
order_id UInt32,
customer_name String,
order_details String,
order_date DateTime
) ENGINE = MergeTree()
ORDER BY order_id;
Next, we'll insert some sample data into the orders
table. This data will be used in all the examples that follow.
INSERT INTO orders VALUES
(1, 'John Doe', '3 Tacos, 2 Burritos', '2023-01-15 12:34:56'),
(2, 'Jane Smith', '1 Taco, 1 Quesadilla', '2023-01-16 13:45:07'),
(3, 'Carlos Garcia', '5 Tacos, 1 Nacho', '2023-01-17 14:56:18'),
(4, 'Maria Rodriguez', '2 Burritos, 3 Quesadillas', '2023-01-18 15:07:29');
This table and data will serve as the foundation for demonstrating the various string functions in ClickHouse.
Next, we will explore basic string functions like determining the length of strings and checking for emptiness. These fundamental operations are crucial for validating and processing string data effectively.
Basic string functions: length and emptiness
We'll explore functions to determine string length and check for emptiness, helping you avoid common errors and improve efficiency in your data operations.
length
The length
function returns the length of a string in bytes. This is useful for validating data and ensuring that strings meet certain length requirements.
Attributes:
s
(String): The string whose length you want to determine.
Returns:
- An integer representing the length of the string in bytes.
Example:
Let's find the length of the customer_name
column in our orders
table.
SELECT
order_id,
customer_name,
length(customer_name) AS name_length
FROM
orders;
Output:
┌─order_id─┬─customer_name─────┬─name_length─┐
│ 1 │ John Doe │ 8 │
│ 2 │ Jane Smith │ 10 │
│ 3 │ Carlos Garcia │ 13 │
│ 4 │ Maria Rodriguez │ 15 │
└──────────┴───────────────────┴─────────────┘
The equivalent functions for calculating string length in other platforms are:
- Postgres:
length(customer_name)
- DuckDB:
length(customer_name)
- Snowflake:
length(customer_name)
- Databricks:
length(customer_name)
lengthUTF8
The lengthUTF8
function returns the length of a string in Unicode code points, which is particularly important for strings containing multi-byte characters.
Attributes:
s
(String): The string whose length you want to determine.
Returns:
- An integer representing the length of the string in Unicode code points.
Example:
Let's find the length of the customer_name
column in Unicode code points.
SELECT
order_id,
customer_name,
lengthUTF8(customer_name) AS name_length_utf8
FROM
orders;
Output:
┌─order_id─┬─customer_name─────┬─name_length_utf8─┐
│ 1 │ John Doe │ 8 │
│ 2 │ Jane Smith │ 10 │
│ 3 │ Carlos Garcia │ 13 │
│ 4 │ Maria Rodriguez │ 15 │
└──────────┴───────────────────┴──────────────────┘
The equivalent functions for lengthUTF8
in other platforms are:
- Postgres:
length(customer_name)
. Postgres handles UTF-8 natively, so the standardlength
function works for UTF-8 strings. - DuckDB:
length(customer_name)
. DuckDB also handles UTF-8 natively. - Snowflake:
length(customer_name)
. Snowflake'slength
function is UTF-8 aware. - Databricks:
length(customer_name)
. Databricks SQL also uses UTF-8 by default.
empty
The empty
function checks whether a given string is empty. This is useful for data validation, ensuring that required fields are not left blank.
Attributes:
x
(String): The string to check for emptiness.
Returns:
- A boolean value:
1
if the string is empty,0
otherwise.
Example:
Check if any customer_name
entries are empty.
SELECT
order_id,
customer_name,
empty(customer_name) AS is_empty
FROM
orders;
Output:
┌─order_id─┬─customer_name─────┬─is_empty─┐
│ 1 │ John Doe │ 0 │
│ 2 │ Jane Smith │ 0 │
│ 3 │ Carlos Garcia │ 0 │
│ 4 │ Maria Rodriguez │ 0 │
└──────────┴───────────────────┴──────────┘
The equivalent functions for empty
in other platforms are:
- Postgres:
customer_name = ''
orlength(customer_name) = 0
- DuckDB:
customer_name = ''
orlength(customer_name) = 0
- Snowflake:
customer_name = ''
orlength(customer_name) = 0
- Databricks:
customer_name = ''
orlength(customer_name) = 0
For these platforms, you would modify the query as follows:
SELECT
order_id,
customer_name,
CASE WHEN customer_name = '' THEN 1 ELSE 0 END AS is_empty
FROM
orders;
Note that these platforms don't have a direct equivalent to ClickHouse's empty
function, so we use a comparison to an empty string or a length check to achieve the same result.
notEmpty
The notEmpty
function checks whether a given string is not empty. This is the inverse of the empty
function and is equally useful for data validation.
Attributes:
x
(String): The string to check for non-emptiness.
Returns:
- A boolean value:
1
if the string is not empty,0
otherwise.
Example:
Check if all customer_name
entries are not empty.
SELECT
order_id,
customer_name,
notEmpty(customer_name) AS is_not_empty
FROM
orders;
Output:
┌─order_id─┬─customer_name─────┬─is_not_empty─┐
│ 1 │ John Doe │ 1 │
│ 2 │ Jane Smith │ 1 │
│ 3 │ Carlos Garcia │ 1 │
│ 4 │ Maria Rodriguez │ 1 │
└──────────┴───────────────────┴──────────────┘
The equivalent functions for notEmpty
in other platforms are:
- Postgres:
customer_name != ''
orlength(customer_name) > 0
- DuckDB:
customer_name != ''
orlength(customer_name) > 0
- Snowflake:
customer_name != ''
orlength(customer_name) > 0
- Databricks:
customer_name != ''
orlength(customer_name) > 0
For these platforms, you would modify the query as follows:
SELECT
order_id,
customer_name,
CASE WHEN customer_name != '' THEN 1 ELSE 0 END AS is_not_empty
FROM
orders;
Note that these platforms don't have a direct equivalent to ClickHouse's notEmpty
function, so we use a comparison to an empty string or a length check to achieve the same result.
Common pitfalls
When working with string functions, it's essential to be aware of common pitfalls to avoid errors and inefficiencies:
- Byte Length vs. Character Length: The
length
function returns the length in bytes, which may differ from the character count, especially for multi-byte characters. UselengthUTF8
for accurate character counts. - Empty Strings: Ensure that fields expected to contain data are not empty. Use the
empty
andnotEmpty
functions to validate data before processing.
String replacement functions
ClickHouse provides robust string replacement functions for data cleaning and transformation. This section examines essential replacement functions with real-world examples.
replaceOne
The replaceOne
function replaces the first occurrence of a specified pattern in a string with a replacement string. This is useful for correcting single instances of incorrect data.
Attributes:
haystack
(String): The original string.pattern
(String): The substring to be replaced.replacement
(String): The string to replace the pattern.
Returns:
- A new string with the first occurrence of the pattern replaced.
Example:
Replace the first occurrence of "Tacos" with "Taco" in the order_details
column.
SELECT
order_id,
order_details,
replaceOne(order_details, 'Tacos', 'Taco') AS corrected_order_details
FROM
orders;
Output:
┌─order_id─┬─order_details────────────┬─corrected_order_details──────────┐
│ 1 │ 3 Tacos, 2 Burritos │ 3 Taco, 2 Burritos │
│ 2 │ 1 Taco, 1 Quesadilla │ 1 Taco, 1 Quesadilla │
│ 3 │ 5 Tacos, 1 Nacho │ 5 Taco, 1 Nacho │
│ 4 │ 2 Burritos, 3 Quesadillas│ 2 Burritos, 3 Quesadillas │
└──────────┴──────────────────────────┴──────────────────────────────────┘
The equivalent functions in other platforms are:
- Postgres:
regexp_replace(order_details, 'Tacos', 'Taco', 1)
. The fourth parameter '1' limits the replacement to one occurrence. - DuckDB:
regexp_replace(order_details, 'Tacos', 'Taco', 1)
. Similar to Postgres, the fourth parameter '1' limits the replacement to one occurrence. - Snowflake:
regexp_replace(order_details, 'Tacos', 'Taco', 1, 1)
. The fourth parameter '1' specifies to start at the first character, and the fifth parameter '1' limits the replacement to one occurrence. - Databricks:
regexp_replace(order_details, 'Tacos', 'Taco', 1)
. Like Postgres and DuckDB, this limits the replacement to the first occurrence.
replaceAll
The replaceAll
function replaces all occurrences of a specified pattern in a string with a replacement string. This is ideal for thorough data cleaning.
Attributes:
haystack
(String): The original string.pattern
(String): The substring to be replaced.replacement
(String): The string to replace the pattern.
Returns:
- A new string with all occurrences of the pattern replaced.
Example:
Replace all occurrences of "Tacos" with "Taco" in the order_details
column.
SELECT
order_id,
order_details,
replaceAll(order_details, 'Tacos', 'Taco') AS corrected_order_details
FROM
orders;
Output:
┌─order_id─┬─order_details────────────┬─corrected_order_details──────────┐
│ 1 │ 3 Tacos, 2 Burritos │ 3 Taco, 2 Burritos │
│ 2 │ 1 Taco, 1 Quesadilla │ 1 Taco, 1 Quesadilla │
│ 3 │ 5 Tacos, 1 Nacho │ 5 Taco, 1 Nacho │
│ 4 │ 2 Burritos, 3 Quesadillas│ 2 Burritos, 3 Quesadillas │
└──────────┴──────────────────────────┴──────────────────────────────────┘
The equivalent functions in other platforms are:
- Postgres:
regexp_replace(order_details, 'Tacos', 'Taco', 'g')
. The 'g' flag ensures all occurrences are replaced. - DuckDB:
regexp_replace(order_details, 'Tacos', 'Taco')
. By default, this replaces all occurrences. - Snowflake:
regexp_replace(order_details, 'Tacos', 'Taco')
. This replaces all occurrences by default. - Databricks:
regexp_replace(order_details, 'Tacos', 'Taco')
. This also replaces all occurrences by default.
Note that while ClickHouse uses a specific replaceAll
function, many other platforms use regexp_replace
with default behavior or flags to achieve the same result.
replaceRegexpOne
The replaceRegexpOne
function replaces the first occurrence of a pattern matching a regular expression in a string. This is useful for more complex replacements based on patterns.
Attributes:
haystack
(String): The original string.pattern
(String): The regular expression pattern to be replaced.replacement
(String): The string to replace the pattern.
Returns:
- A new string with the first occurrence of the pattern replaced.
Example:
Replace the first occurrence of any digit with "X" in the order_details
column.
SELECT
order_id,
order_details,
replaceRegexpOne(order_details, '\\d', 'X') AS corrected_order_details
FROM
orders;
Output:
┌─order_id─┬─order_details────────────┬─corrected_order_details──────────┐
│ 1 │ 3 Tacos, 2 Burritos │ X Tacos, 2 Burritos │
│ 2 │ 1 Taco, 1 Quesadilla │ X Taco, 1 Quesadilla │
│ 3 │ 5 Tacos, 1 Nacho │ X Tacos, 1 Nacho │
│ 4 │ 2 Burritos, 3 Quesadillas│ X Burritos, 3 Quesadillas │
└──────────┴──────────────────────────┴──────────────────────────────────┘
The equivalent functions in other platforms are:
- Postgres:
regexp_replace(order_details, '\\d', 'X')
. By default, Postgres replaces only the first occurrence, so no'g'
flag is needed. - DuckDB:
regexp_replace(order_details, '\\d', 'X')
. This function will replace the first occurrence by default. - Snowflake:
regexp_replace(order_details, '\\d', 'X', 1)
. This limits the replacement to the first occurrence. - Databricks:
regexp_replace(order_details, '\\d', 'X')
. By default, this function replaces the first occurrence.
replaceRegexpAll
The replaceRegexpAll
function replaces all occurrences of a pattern matching a regular expression in a string. This is useful for comprehensive data cleaning based on complex patterns.
Attributes:
haystack
(String): The original string.pattern
(String): The regular expression pattern to be replaced.replacement
(String): The string to replace the pattern.
Returns:
- A new string with all occurrences of the pattern replaced.
Example:
Replace all occurrences of any digit with "X" in the order_details
column.
SELECT
order_id,
order_details,
replaceRegexpAll(order_details, '\\d', 'X') AS corrected_order_details
FROM
orders;
Output:
┌─order_id─┬─order_details────────────┬─corrected_order_details──────────┐
│ 1 │ 3 Tacos, 2 Burritos │ X Tacos, X Burritos │
│ 2 │ 1 Taco, 1 Quesadilla │ X Taco, X Quesadilla │
│ 3 │ 5 Tacos, 1 Nacho │ X Tacos, X Nacho │
│ 4 │ 2 Burritos, 3 Quesadillas│ X Burritos, X Quesadillas │
└──────────┴──────────────────────────┴──────────────────────────────────┘
The equivalent functions in other platforms are:
- Postgres:
regexp_replace(order_details, '\\d', 'X', 'g')
. In Postgres, the 'g' flag in theregexp_replace()
function stands for “global”, meaning that it replaces all occurrences of the pattern in the string. Without the 'g' flag, only the first occurrence of the pattern would be replaced. - DuckDB:
regexp_replace(order_details, '\\d', 'X')
. - Snowflake:
regexp_replace(order_details, '\\d', 'X')
. - Databricks:
regexp_replace(order_details, '\\d', 'X')
.
Efficiency tips
To optimize replacement operations and avoid performance hits:
- Use specific patterns: When using regular expressions, be as specific as possible to avoid unnecessary replacements and improve performance.
- Limit scope: Apply replacement functions only to the necessary columns and rows to minimize the processing load.
Substring functions
Extracting substrings is a common requirement in data manipulation, and precision is crucial to avoid errors. This section covers essential substring functions in ClickHouse, helping you perform these operations accurately and efficiently.
substring
The substring
function returns a portion of a string starting from a specified offset, with an optional length parameter. This is useful for extracting specific segments of data for further analysis or reporting.
Attributes:
s
(String): The original string.offset
(UInt32): The starting position for the substring (1-based index).length
(UInt32, optional): The number of characters to extract. If omitted, the substring extends to the end of the string.
Returns:
- A substring of the original string.
Example:
Extract the first three characters from the order_details
column.
SELECT
order_id,
order_details,
substring(order_details, 1, 3) AS order_prefix
FROM
orders;
Output:
┌─order_id─┬─order_details────────────┬─order_prefix─┐
│ 1 │ 3 Tacos, 2 Burritos │ 3 T │
│ 2 │ 1 Taco, 1 Quesadilla │ 1 T │
│ 3 │ 5 Tacos, 1 Nacho │ 5 T │
│ 4 │ 2 Burritos, 3 Quesadillas│ 2 B │
└──────────┴──────────────────────────┴──────────────┘
The equivalent functions in other platforms are:
- Postgres:
substring(order_details, 1, 3)
. - DuckDB:
substring(order_details, 1, 3)
. - Snowflake:
substring(order_details, 1, 3)
. - Databricks:
substring(order_details, 1, 3)
.
substringUTF8
The substringUTF8
function operates similarly to substring
but considers Unicode code points, making it suitable for strings containing multi-byte characters.
Attributes:
s
(String): The original string.offset
(UInt32): The starting position for the substring (1-based index).length
(UInt32, optional): The number of characters to extract. If omitted, the substring extends to the end of the string.
Returns:
- A substring of the original string, considering Unicode code points.
Example:
Extract the first three characters from the customer_name
column, considering Unicode code points.
SELECT
order_id,
customer_name,
substringUTF8(customer_name, 1, 3) AS name_prefix
FROM
orders;
Output:
┌─order_id─┬─customer_name─────┬─name_prefix─┐
│ 1 │ John Doe │ Joh │
│ 2 │ Jane Smith │ Jan │
│ 3 │ Carlos Garcia │ Car │
│ 4 │ Maria Rodriguez │ Mar │
└──────────┴───────────────────┴─────────────┘
The equivalent functions in other platforms are:
- Postgres:
substring(customer_name, 1, 3)
. - DuckDB:
substring(customer_name, 1, 3)
. - Snowflake:
substring(customer_name, 1, 3)
. - Databricks:
substring(customer_name, 1, 3)
.
These platforms natively handle UTF-8, so no separate substringUTF8()
function is needed.
Advanced extraction
In more complex scenarios, you might need to extract substrings based on patterns or dynamic conditions. Here are a few advanced techniques:
- Extracting dynamic segments:
- Use functions like
position
to find the starting point of a substring dynamically. - Combine
substring
with other functions to extract variable-length segments.
- Use functions like
Example:
Extract the part of order_details
after the first comma.
SELECT
order_id,
order_details,
substring(order_details, position(order_details, ',') + 2) AS details_after_comma
FROM
orders;
Output:
┌─order_id─┬─order_details────────────┬─details_after_comma─┐
│ 1 │ 3 Tacos, 2 Burritos │ 2 Burritos │
│ 2 │ 1 Taco, 1 Quesadilla │ 1 Quesadilla │
│ 3 │ 5 Tacos, 1 Nacho │ 1 Nacho │
│ 4 │ 2 Burritos, 3 Quesadillas│ 3 Quesadillas │
└──────────┴──────────────────────────┴─────────────────────┘
The equivalent functions in other platforms are:
- Postgres:
substring(order_details, position(order_details, ',') + 1)
. - DuckDB:
substring(order_details, instr(order_details, ',') + 2)
. - Snowflake:
substring(order_details, position(order_details, ',') + 2)
. - Databricks:
substring(order_details, instr(order_details, ',') + 2)
.
- Using regular expressions:
Combine regexpExtract
with substring
for more sophisticated extraction based on patterns.
Example:
Extract the first numeric value from order_details
.
SELECT
order_id,
order_details,
substring(order_details, position(order_details, regexpExtract(order_details, '\\d+', 0)), length(regexpExtract(order_details, '\\d+', 0))) AS first_number
FROM
orders;
Output:
┌─order_id─┬─order_details────────────┬─first_number─┐
│ 1 │ 3 Tacos, 2 Burritos │ 3 │
│ 2 │ 1 Taco, 1 Quesadilla │ 1 │
│ 3 │ 5 Tacos, 1 Nacho │ 5 │
│ 4 │ 2 Burritos, 3 Quesadillas│ 2 │
└──────────┴──────────────────────────┴──────────────┘
The equivalent functions in other platforms are:
- Postgres: Use
substring()
withposition()
andregexp_matches()
for extracting the first number. Example:substring(order_details, position(order_details, (regexp_matches(order_details, '\\d+'))[1]))
. - DuckDB: Use
substring()
withinstr()
andregexp_extract()
for extracting the first number. - Snowflake: Use
substring()
withposition()
andregexp_substr()
for extracting the first number. - Databricks: Use
substring()
withinstr()
andregexp_extract()
for extracting the first number.
String concatenation
Merging strings is often come in handy in data processing. This section explores key ClickHouse functions for combining strings effectively, while highlighting potential issues to avoid.
concat
The concat
function combines multiple strings into one without any separator. This is useful for merging fields or constructing new strings from existing data.
Attributes:
s1, s2, ...
(String): The strings to be concatenated.
Returns:
- A single string resulting from the concatenation of all input strings.
Example:
Concatenate customer_name
and order_details
into a single string.
SELECT
order_id,
customer_name,
order_details,
concat(customer_name, ' - ', order_details) AS full_order_description
FROM
orders;
Output:
┌─order_id─┬─customer_name─────┬─order_details────────────┬─full_order_description────────────────────────────┐
│ 1 │ John Doe │ 3 Tacos, 2 Burritos │ John Doe - 3 Tacos, 2 Burritos │
│ 2 │ Jane Smith │ 1 Taco, 1 Quesadilla │ Jane Smith - 1 Taco, 1 Quesadilla │
│ 3 │ Carlos Garcia │ 5 Tacos, 1 Nacho │ Carlos Garcia - 5 Tacos, 1 Nacho │
│ 4 │ Maria Rodriguez │ 2 Burritos, 3 Quesadillas│ Maria Rodriguez - 2 Burritos, 3 Quesadillas │
└──────────┴───────────────────┴──────────────────────────┴──────────────────────────────────────────────────┘
The equivalent functions in other platforms are:
- Postgres:
concat(customer_name, ' - ', order_details)
. - DuckDB:
concat(customer_name, ' - ', order_details)
. - Snowflake:
concat(customer_name, ' - ', order_details)
. - Databricks:
concat(customer_name, ' - ', order_details)
.
concatWithSeparator
The concatWithSeparator
function concatenates multiple strings using a specified separator. This is particularly useful for creating delimited strings or formatted outputs.
Attributes:
sep
(String): The separator to be used between concatenated strings.expr1, expr2, expr3...
(String): The strings to be concatenated.
Returns:
- A single string resulting from the concatenation of all input strings with the specified separator.
Example:
Concatenate customer_name
and order_details
with a comma separator.
SELECT
order_id,
customer_name,
order_details,
concatWithSeparator(', ', customer_name, order_details) AS full_order_description
FROM
orders;
Output:
┌─order_id─┬─customer_name─────┬─order_details────────────┬─full_order_description────────────────────────┐
│ 1 │ John Doe │ 3 Tacos, 2 Burritos │ John Doe, 3 Tacos, 2 Burritos │
│ 2 │ Jane Smith │ 1 Taco, 1 Quesadilla │ Jane Smith, 1 Taco, 1 Quesadilla │
│ 3 │ Carlos Garcia │ 5 Tacos, 1 Nacho │ Carlos Garcia, 5 Tacos, 1 Nacho │
│ 4 │ Maria Rodriguez │ 2 Burritos, 3 Quesadillas│ Maria Rodriguez, 2 Burritos, 3 Quesadillas │
└──────────┴───────────────────┴──────────────────────────┴───────────────────────────────────────────────┘
The equivalent functions in other platforms are:
- Postgres:
concat_ws(', ', customer_name, order_details)
. - DuckDB:
concat_ws(', ', customer_name, order_details)
. - Snowflake:
concat_ws(', ', customer_name, order_details)
. - Databricks:
concat_ws(', ', customer_name, order_details)
.
Common mistakes
When using string concatenation functions, it's important to be aware of common pitfalls to avoid unexpected results:
- Incorrect separator usage: Ensure the separator is correctly specified in
concatWithSeparator
. An incorrect or missing separator can lead to improperly formatted strings. - Unexpected results: Be mindful of null values in the strings being concatenated. ClickHouse may handle nulls differently, potentially leading to unexpected results in the final concatenated string.
Case conversion functions
Case conversion functions in ClickHouse allow you to change character case for data formatting and standardization. This section explores key functions for accurate and efficient case conversion operations.
upper
The upper
function converts ASCII Latin symbols in a string to uppercase. This is useful for standardizing data formats, such as converting all names to uppercase for consistency.
Attributes:
input
(String): The string to be converted to uppercase.
Returns:
- A new string with all ASCII Latin characters converted to uppercase.
Example:
Convert the customer_name
column to uppercase.
SELECT
order_id,
customer_name,
upper(customer_name) AS customer_name_upper
FROM
orders;
Output:
┌─order_id─┬─customer_name─────┬─customer_name_upper─┐
│ 1 │ John Doe │ JOHN DOE │
│ 2 │ Jane Smith │ JANE SMITH │
│ 3 │ Carlos Garcia │ CARLOS GARCIA │
│ 4 │ Maria Rodriguez │ MARIA RODRIGUEZ │
└──────────┴───────────────────┴─────────────────────┘
The equivalent functions in other platforms are:
- Postgres:
upper(customer_name)
. - DuckDB:
upper(customer_name)
. - Snowflake:
upper(customer_name)
. - Databricks:
upper(customer_name)
.
lower
The lower
function converts ASCII Latin symbols in a string to lowercase. This is useful for ensuring uniformity in data entry, such as converting all email addresses to lowercase.
Attributes:
input
(String): The string to be converted to lowercase.
Returns:
- A new string with all ASCII Latin characters converted to lowercase.
Example:
Convert the customer_name
column to lowercase.
SELECT
order_id,
customer_name,
lower(customer_name) AS customer_name_lower
FROM
orders;
Output:
┌─order_id─┬─customer_name─────┬─customer_name_lower─┐
│ 1 │ John Doe │ john doe │
│ 2 │ Jane Smith │ jane smith │
│ 3 │ Carlos Garcia │ carlos garcia │
│ 4 │ Maria Rodriguez │ maria rodriguez │
└──────────┴───────────────────┴─────────────────────┘
The equivalent functions in other platforms are:
- Postgres:
lower(customer_name)
. - DuckDB:
lower(customer_name)
. - Snowflake:
lower(customer_name)
. - Databricks:
lower(customer_name)
.
upperUTF8
The upperUTF8
function converts characters in a UTF-8 encoded string to uppercase. This is particularly important for strings containing non-ASCII characters.
Attributes:
input
(String): The string to be converted to uppercase.
Returns:
- A new string with all characters converted to uppercase, considering UTF-8 encoding.
Example:
Convert the customer_name
column to uppercase, considering UTF-8 encoding.
SELECT
order_id,
customer_name,
upperUTF8(customer_name) AS customer_name_upper_utf8
FROM
orders;
Output:
┌─order_id─┬─customer_name─────┬─customer_name_upper_utf8─┐
│ 1 │ John Doe │ JOHN DOE │
│ 2 │ Jane Smith │ JANE SMITH │
│ 3 │ Carlos Garcia │ CARLOS GARCIA │
│ 4 │ Maria Rodriguez │ MARIA RODRIGUEZ │
└──────────┴───────────────────┴──────────────────────────┘
The equivalent functions in other platforms are:
- Postgres:
upper(customer_name)
. - DuckDB:
upper(customer_name)
. - Snowflake:
upper(customer_name)
. - Databricks:
upper(customer_name)
.
These platforms natively handle UTF-8, so no separate upperUTF8()
function is needed.
lowerUTF8
The lowerUTF8
function converts characters in a UTF-8 encoded string to lowercase. This ensures that all characters, including non-ASCII ones, are uniformly converted.
Attributes:
input
(String): The string to be converted to lowercase.
Returns:
- A new string with all characters converted to lowercase, considering UTF-8 encoding.
Example:
Convert the customer_name
column to lowercase, considering UTF-8 encoding.
SELECT
order_id,
customer_name,
lowerUTF8(customer_name) AS customer_name_lower_utf8
FROM
orders;
Output:
┌─order_id─┬─customer_name─────┬─customer_name_lower_utf8─┐
│ 1 │ John Doe │ john doe │
│ 2 │ Jane Smith │ jane smith │
│ 3 │ Carlos Garcia │ carlos garcia │
│ 4 │ Maria Rodriguez │ maria rodriguez │
└──────────┴───────────────────┴──────────────────────────┘
The equivalent functions in other platforms are:
- Postgres:
lower(customer_name)
. - DuckDB:
lower(customer_name)
. - Snowflake:
lower(customer_name)
. - Databricks:
lower(customer_name)
.
These platforms natively handle UTF-8, so no separate lowerUTF8()
function is needed.
Padding functions
Padding helps format and align data, ensuring strings meet specific length requirements and are presented consistently. This section covers essential padding functions in ClickHouse.
leftPad
The leftPad
function pads a string on the left with a specified character or string until it reaches a desired length. This is useful for aligning data in reports or ensuring string fields meet minimum length requirements.
Attributes:
string
(String): The original string to be padded.length
(UInt32): The desired total length of the padded string.pad_string
(String, optional): The string to pad with. If omitted, spaces are used by default.
Returns:
- A new string padded on the left to the specified length.
Example:
Pad the customer_name
column to a length of 10 characters with spaces.
SELECT
order_id,
customer_name,
leftPad(customer_name, 10) AS padded_customer_name
FROM
orders;
Output:
┌─order_id─┬─customer_name─────┬─padded_customer_name─┐
│ 1 │ John Doe │ John Doe │
│ 2 │ Jane Smith │ Jane Smith │
│ 3 │ Carlos Garcia │ Carlos Gar │
│ 4 │ Maria Rodriguez │ Maria Rodr │
└──────────┴───────────────────┴──────────────────────┘
The equivalent functions in other platforms are:
- Postgres:
lpad(customer_name, 10)
. - DuckDB:
lpad(customer_name, 10)
. - Snowflake:
lpad(customer_name, 10)
. - Databricks:
lpad(customer_name, 10)
.
rightPad
The rightPad
function pads a string on the right with a specified character or string until it reaches a desired length. This is useful for formatting data for display or ensuring string fields have a uniform length.
Attributes:
string
(String): The original string to be padded.length
(UInt32): The desired total length of the padded string.pad_string
(String, optional): The string to pad with. If omitted, spaces are used by default.
Returns:
- A new string padded on the right to the specified length.
Example:
Pad the customer_name
column to a length of 0 characters with spaces.
SELECT
order_id,
customer_name,
rightPad(customer_name, 10) AS padded_customer_name
FROM
orders;
Output:
┌─order_id─┬─customer_name─────┬─padded_customer_name─┐
│ 1 │ John Doe │ John Doe │
│ 2 │ Jane Smith │ Jane Smith │
│ 3 │ Carlos Garcia │ Carlos Gar │
│ 4 │ Maria Rodriguez │ Maria Rodr │
└──────────┴───────────────────┴──────────────────────┘
The equivalent functions in other platforms are:
- Postgres:
rpad(customer_name, 10)
. - DuckDB:
rpad(customer_name, 10)
. - Snowflake:
rpad(customer_name, 10)
. - Databricks:
rpad(customer_name, 10)
.
leftPadUTF8
The leftPadUTF8
function pads a string on the left with a specified character or string, considering UTF-8 encoding. This is important for strings containing multi-byte characters.
Attributes:
string
(String): The original string to be padded.length
(UInt32): The desired total length of the padded string.pad_string
(String, optional): The string to pad with. If omitted, spaces are used by default.
Returns:
- A new string padded on the left to the specified length, considering UTF-8 encoding.
Example:
Pad the customer_name
column to a length of 20 characters with spaces, considering UTF-8 encoding.
SELECT
order_id,
customer_name,
leftPadUTF8(customer_name, 20) AS padded_customer_name_utf8
FROM
orders;
Output:
┌─order_id─┬─customer_name─────┬─padded_customer_name_utf8─┐
│ 1 │ John Doe │ John Doe │
│ 2 │ Jane Smith │ Jane Smith │
│ 3 │ Carlos Garcia │ Carlos Garcia │
│ 4 │ Maria Rodriguez │ Maria Rodriguez │
└──────────┴───────────────────┴───────────────────────────┘
The equivalent functions in other platforms are:
- Postgres:
lpad(customer_name, 20)
. - DuckDB:
lpad(customer_name, 20)
. - Snowflake:
lpad(customer_name, 20)
. - Databricks:
lpad(customer_name, 20)
.
These platforms natively handle UTF-8, so no separate leftPadUTF8()
function is needed.
rightPadUTF8
The rightPadUTF8
function pads a string on the right with a specified character or string, considering UTF-8 encoding. This ensures that all characters, including multi-byte ones, are uniformly padded.
Attributes:
string
(String): The original string to be padded.length
(UInt32): The desired total length of the padded string.pad_string
(String, optional): The string to pad with. If omitted, spaces are used by default.
Returns:
- A new string padded on the right to the specified length, considering UTF-8 encoding.
Example:
Pad the customer_name
column to a length of 20 characters with spaces, considering UTF-8 encoding.
SELECT
order_id,
customer_name,
rightPadUTF8(customer_name, 20) AS padded_customer_name_utf8
FROM
orders;
Output:
┌─order_id─┬─customer_name─────┬─padded_customer_name_utf8─┐
│ 1 │ John Doe │ John Doe │
│ 2 │ Jane Smith │ Jane Smith │
│ 3 │ Carlos Garcia │ Carlos Garcia │
│ 4 │ Maria Rodriguez │ Maria Rodriguez │
└──────────┴───────────────────┴───────────────────────────┘
The equivalent functions in other platforms are:
- Postgres:
rpad(customer_name, 20)
. - DuckDB:
rpad(customer_name, 20)
. - Snowflake:
rpad(customer_name, 20)
. - Databricks:
rpad(customer_name, 20)
.
These platforms natively handle UTF-8, so no separate rightPadUTF8()
function is needed.
String reversal and repetition
String reversal and repetition functions in ClickHouse allow for unique string manipulations. These tools ensure data accuracy and meet specific formatting needs. This section explores key reversal and repetition functions, providing practical examples to demonstrate their use.
reverse
The reverse
function reverses the characters in a string. This can be useful for tasks such as creating palindromes, reversing order information, or preparing data for specific analytical needs.
Attributes:
s
(String): The string to be reversed.
Returns:
- A new string with the characters in reverse order.
Example:
Reverse the customer_name
column.
SELECT
order_id,
customer_name,
reverse(customer_name) AS reversed_customer_name
FROM
orders;
Output:
┌─order_id─┬─customer_name─────┬─reversed_customer_name_utf8─┐
│ 1 │ John Doe │ eoD nhoJ │
│ 2 │ Jane Smith │ htimS enaJ │
│ 3 │ Carlos Garcia │ aicraG solraC │
│ 4 │ Maria Rodriguez │ zeugirdoR airaM │
└──────────┴───────────────────┴─────────────────────────────┘
The equivalent functions in other platforms are:
- Postgres:
reverse(customer_name)
. - DuckDB:
reverse(customer_name)
. - Snowflake:
reverse(customer_name)
. - Databricks:
reverse(customer_name)
.
reverseUTF8
The reverseUTF8
function reverses the characters in a UTF-8 encoded string. This is important for strings containing multi-byte characters, ensuring the reversal is accurate and maintains character integrity.
Attributes:
s
(String): The UTF-8 encoded string to be reversed.
Returns:
- A new string with the characters in reverse order, considering UTF-8 encoding.
Example:
Reverse the customer_name
column, considering UTF-8 encoding.
SELECT
order_id,
customer_name,
reverseUTF8(customer_name) AS reversed_customer_name_utf8
FROM
orders;
Output:
┌─order_id─┬─customer_name─────┬─reversed_customer_name_utf8─┐
│ 1 │ John Doe │ eoD nhoJ │
│ 2 │ Jane Smith │ htimS enaJ │
│ 3 │ Carlos Garcia │ aicraG solraC │
│ 4 │ Maria Rodriguez │ zeugirdoR airaM │
└──────────┴───────────────────┴─────────────────────────────┘
The equivalent functions in other platforms are:
- Postgres:
reverse(customer_name)
. - DuckDB:
reverse(customer_name)
. - Snowflake:
reverse(customer_name)
. - Databricks:
reverse(customer_name)
.
These platforms natively handle UTF-8, so no separate reverseUTF8()
function is needed.
repeat
The repeat
function repeats a string a specified number of times. This is useful for generating repeated patterns, creating test data, or formatting strings for display.
Attributes:
s
(String): The string to be repeated.n
(UInt32): The number of times to repeat the string.
Returns:
- A new string consisting of the original string repeated
n
times.
Example:
Repeat the customer_name
column three times.
SELECT
order_id,
customer_name,
repeat(customer_name, 3) AS repeated_customer_name
FROM
orders;
Output:
┌─order_id─┬─customer_name─────┬─repeated_customer_name───────────────────────────────┐
│ 1 │ John Doe │ John DoeJohn DoeJohn Doe │
│ 2 │ Jane Smith │ Jane SmithJane SmithJane Smith │
│ 3 │ Carlos Garcia │ Carlos GarciaCarlos GarciaCarlos Garcia │
│ 4 │ Maria Rodriguez │ Maria RodriguezMaria RodriguezMaria Rodriguez │
└──────────┴───────────────────┴──────────────────────────────────────────────────────┘
The equivalent functions in other platforms are:
- Postgres:
repeat(customer_name, 3)
. - DuckDB:
repeat(customer_name, 3)
. - Snowflake:
repeat(customer_name, 3)
. - Databricks:
repeat(customer_name, 3)
.
Potential issues
When using string reversal and repetition functions, be aware of common pitfalls to avoid unexpected results:
- Incorrect usage: Ensure the correct function is used for the desired operation. For example, use
reverseUTF8
for UTF-8 encoded strings to maintain character integrity. - Output length: Be mindful of the output length when using the
repeat
function, as repeating a string many times can result in very long strings, potentially leading to performance issues or truncation.
String search functions
String search functions are vital tools for developers working with data in ClickHouse. These functions enable precise substring location within larger text fields. This section explores key string search capabilities in ClickHouse, providing you with the means to efficiently locate and analyze specific data patterns.
position
The position
function returns the position of the first occurrence of a substring within a string. This is useful for locating specific data points within larger strings.
Attributes:
s
(String): The original string.substring
(String): The substring to search for.
Returns:
- An integer representing the 1-based index of the first occurrence of the substring. If the substring is not found, it returns 0.
Example:
Find the position of "Tacos" in the order_details
column.
SELECT
order_id,
order_details,
position(order_details, 'Tacos') AS tacos_position
FROM
orders;
Output:
┌─order_id─┬─order_details────────────┬─tacos_position─┐
│ 1 │ 3 Tacos, 2 Burritos │ 3 │
│ 2 │ 1 Taco, 1 Quesadilla │ 0 │
│ 3 │ 5 Tacos, 1 Nacho │ 3 │
│ 4 │ 2 Burritos, 3 Quesadillas│ 0 │
└──────────┴──────────────────────────┴────────────────┘
The equivalent functions in other platforms are:
- Postgres:
position('Tacos' IN order_details)
. - DuckDB:
instr(order_details, 'Tacos')
. - Snowflake:
position('Tacos' IN order_details)
. - Databricks:
instr(order_details, 'Tacos')
.
multiSearchAny
The multiSearchAny
function checks if any of the given substrings are present in the string. This is useful for validating the presence of multiple potential matches.
Attributes:
s
(String): The original string.substrings
(Array of Strings): The substrings to search for.
Returns:
- A boolean value:
1
if at least one substring matches, otherwise0
.
Example:
Check if any of the substrings "Tacos" or "Burritos" are present in the order_details
column.
SELECT
order_id,
order_details,
multiSearchAny(order_details, ['Tacos', 'Burritos']) AS contains_tacos_or_burritos
FROM
orders;
Output:
┌─order_id─┬─order_details────────────┬─contains_tacos_or_burritos─┐
│ 1 │ 3 Tacos, 2 Burritos │ 1 │
│ 2 │ 1 Taco, 1 Quesadilla │ 0 │
│ 3 │ 5 Tacos, 1 Nacho │ 1 │
│ 4 │ 2 Burritos, 3 Quesadillas│ 1 │
└──────────┴──────────────────────────┴────────────────────────────┘
The equivalent functions in other platforms are:
- Postgres: No direct equivalent. You would use
position()
orstrpos()
withOR
, e.g.,position(order_details, 'Tacos') > 0 OR position(order_details, 'Burritos') > 0
. - DuckDB: No direct equivalent. You would use
instr()
withOR
conditions, e.g.,instr(order_details, 'Tacos') > 0 OR instr(order_details, 'Burritos') > 0
. - Snowflake: No direct equivalent. You would use
position()
withOR
, e.g.,position(order_details, 'Tacos') > 0 OR position(order_details, 'Burritos') > 0
. - Databricks: No direct equivalent. You would use
instr()
withOR
, e.g.,instr(order_details, 'Tacos') > 0 OR instr(order_details, 'Burritos') > 0
.
multiSearchAllPositions
The multiSearchAllPositions
function returns an array of positions for each substring found within the string. This is useful for detailed analysis of where each substring occurs.
Attributes:
s
(String): The original string.substrings
(Array of Strings): The substrings to search for.
Returns:
- An array of integers representing the positions of each substring. If a substring is not found, its position is 0.
Example:
Find the positions of "Tacos" and "Burritos" in the order_details
column.
SELECT
order_id,
order_details,
multiSearchAllPositions(order_details, ['Tacos', 'Burritos']) AS positions
FROM
orders;
Output:
┌─order_id─┬─order_details────────────┬─positions───┐
│ 1 │ 3 Tacos, 2 Burritos │ [3, 12] │
│ 2 │ 1 Taco, 1 Quesadilla │ [0, 0] │
│ 3 │ 5 Tacos, 1 Nacho │ [3, 0] │
│ 4 │ 2 Burritos, 3 Quesadillas│ [0, 3] │
└──────────┴──────────────────────────┴─────────────┘
The equivalent functions in other platforms are:
- Postgres: No direct equivalent. You would need to use multiple
position()
orstrpos()
calls and store the positions in an array manually. - DuckDB: No direct equivalent. You would have to manually capture positions using
instr()
and create an array of results. - Snowflake: No direct equivalent. Multiple
position()
calls would need to be combined into an array manually. - Databricks: No direct equivalent. You would use multiple
instr()
calls and manually build the array of positions.
multiSearchFirstIndex
The multiSearchFirstIndex
function returns the index of the first found substring in the array. If no substring matches, it returns 0. This is useful for quickly identifying the first match among multiple potential substrings.
Attributes:
s
(String): The original string.substrings
(Array of Strings): The substrings to search for.
Returns:
- An integer representing the index of the first found substring in the array.
Example:
Find the first occurrence of either "Tacos" or "Burritos" in the order_details
column.
SELECT
order_id,
order_details,
multiSearchFirstIndex(order_details, ['Tacos', 'Burritos']) AS first_match_index
FROM
orders;
Output:
┌─order_id─┬─order_details────────────┬─first_match_index─┐
│ 1 │ 3 Tacos, 2 Burritos │ 1 │
│ 2 │ 1 Taco, 1 Quesadilla │ 0 │
│ 3 │ 5 Tacos, 1 Nacho │ 1 │
│ 4 │ 2 Burritos, 3 Quesadillas│ 2 │
└──────────┴──────────────────────────┴───────────────────┘
The equivalent functions in other platforms are:
- Postgres: No direct equivalent. You would need to use multiple
position()
orstrpos()
calls and compare the results with aCASE
orLEAST()
statement. - DuckDB: No direct equivalent. You would manually compare positions using
instr()
for each search term. - Snowflake: No direct equivalent. You would use
position()
for each term and compare the results. - Databricks: No direct equivalent. Use
instr()
for each term and manually calculate the first match.
Trimming functions
Trimming functions clean and format data by removing unwanted whitespace or characters from strings. This section covers how to use these functions effectively to improve data quality and consistency.
trim
The trim
function removes all specified characters from the start, end, or both ends of a string. By default, it removes common whitespace characters.
Attributes:
s
(String): The string to be trimmed.chars
(String, optional): The characters to be removed. If omitted, common whitespace characters are removed by default.
Returns:
- A new string with the specified characters removed from both ends.
Example:
Trim whitespace from both ends of the customer_name
column.
SELECT
order_id,
customer_name,
trim(customer_name) AS trimmed_customer_name
FROM
orders;
Output:
┌─order_id─┬─customer_name─────┬─trimmed_customer_name─┐
│ 1 │ John Doe │ John Doe │
│ 2 │ Jane Smith │ Jane Smith │
│ 3 │ Carlos Garcia │ Carlos Garcia │
│ 4 │ Maria Rodriguez │ Maria Rodriguez │
└──────────┴───────────────────┴───────────────────────┘
The equivalent functions in other platforms are:
- Postgres:
trim(customer_name)
. - DuckDB:
trim(customer_name)
. - Snowflake:
trim(customer_name)
. - Databricks:
trim(customer_name)
.
trimLeft
The trimLeft
function removes all specified characters from the beginning of a string. This is useful for cleaning up leading whitespace or other unwanted characters.
Attributes:
s
(String): The string to be trimmed.chars
(String, optional): The characters to be removed. If omitted, common whitespace characters are removed by default.
Returns:
- A new string with the specified characters removed from the beginning.
Example:
Trim leading whitespace from the customer_name
column.
SELECT
order_id,
customer_name,
trimLeft(customer_name) AS left_trimmed_customer_name
FROM
orders;
Output:
┌─order_id─┬─customer_name─────┬─left_trimmed_customer_name─┐
│ 1 │ John Doe │ John Doe │
│ 2 │ Jane Smith │ Jane Smith │
│ 3 │ Carlos Garcia │ Carlos Garcia │
│ 4 │ Maria Rodriguez │ Maria Rodriguez │
└──────────┴───────────────────┴────────────────────────────┘
The equivalent functions in other platforms are:
- Postgres:
ltrim(customer_name)
. - DuckDB:
ltrim(customer_name)
. - Snowflake:
ltrim(customer_name)
. - Databricks:
ltrim(customer_name)
.
trimRight
The trimRight
function removes all specified characters from the end of a string. This is particularly useful for cleaning up trailing whitespace or other unwanted characters.
Attributes:
s
(String): The string to be trimmed.chars
(String, optional): The characters to be removed. If omitted, common whitespace characters are removed by default.
Returns:
- A new string with the specified characters removed from the end.
Example:
Trim trailing whitespace from the customer_name
column.
SELECT
order_id,
customer_name,
trimRight(customer_name) AS right_trimmed_customer_name
FROM
orders;
Output:
┌─order_id─┬─customer_name─────┬─right_trimmed_customer_name─┐
│ 1 │ John Doe │ John Doe │
│ 2 │ Jane Smith │ Jane Smith │
│ 3 │ Carlos Garcia │ Carlos Garcia │
│ 4 │ Maria Rodriguez │ Maria Rodriguez │
└──────────┴───────────────────┴─────────────────────────────┘
The equivalent functions in other platforms are:
- Postgres:
rtrim(customer_name)
. - DuckDB:
rtrim(customer_name)
. - Snowflake:
rtrim(customer_name)
. - Databricks:
rtrim(customer_name)
.
trimBoth
The trimBoth
function removes all specified characters from both ends of a string. This function is useful for thorough cleaning of strings, ensuring no unwanted characters remain at either end.
Attributes:
s
(String): The string to be trimmed.chars
(String, optional): The characters to be removed. If omitted, common whitespace characters are removed by default.
Returns:
- A new string with the specified characters removed from both ends.
Example:
Trim whitespace from both ends of the customer_name
column.
SELECT
order_id,
customer_name,
trimBoth(customer_name) AS both_trimmed_customer_name
FROM
orders;
Output:
┌─order_id─┬─customer_name─────┬─both_trimmed_customer_name─┐
│ 1 │ John Doe │ John Doe │
│ 2 │ Jane Smith │ Jane Smith │
│ 3 │ Carlos Garcia │ Carlos Garcia │
│ 4 │ Maria Rodriguez │ Maria Rodriguez │
└──────────┴───────────────────┴────────────────────────────┘
The equivalent functions in other platforms are:
- Postgres:
trim(customer_name)
. - DuckDB:
trim(customer_name)
. - Snowflake:
trim(customer_name)
. - Databricks:
trim(customer_name)
.
Regular expressions: pattern matching
Regular expressions (regex) are powerful pattern-matching tools for string manipulation. They enable complex searches, replacements, and extractions based on specific patterns. This section covers regex capabilities in ClickHouse.
like
The like
function checks whether a string matches a simple pattern containing %
(wildcard for any number of characters) and _
(wildcard for a single character). This function is useful for basic pattern matching without the complexity of full regex.
Attributes:
s
(String): The string to be checked.pattern
(String): The pattern to match against.
Returns:
- A boolean value:
1
if the string matches the pattern, otherwise0
.
Example:
Check if order_details
contains the word "Tacos".
SELECT
order_id,
order_details,
order_details LIKE '%Tacos%' AS contains_tacos
FROM
orders;
Output:
┌─order_id─┬─order_details────────────┬─contains_tacos─┐
│ 1 │ 3 Tacos, 2 Burritos │ 1 │
│ 2 │ 1 Taco, 1 Quesadilla │ 0 │
│ 3 │ 5 Tacos, 1 Nacho │ 1 │
│ 4 │ 2 Burritos, 3 Quesadillas│ 0 │
└──────────┴──────────────────────────┴────────────────┘
The equivalent functions in other platforms are:
- Postgres:
order_details LIKE '%Tacos%'
. - DuckDB:
order_details LIKE '%Tacos%'
. - Snowflake:
order_details LIKE '%Tacos%'
. - Databricks:
order_details LIKE '%Tacos%'
.
regexpExtract
The regexpExtract
function extracts a substring from a string that matches a regular expression pattern. This function is useful for extracting specific parts of a string based on complex patterns.
Attributes:
s
(String): The original string.pattern
(String): The regular expression pattern to match.index
(UInt32, optional): The index of the capturing group to return. Defaults to1
.
Returns:
- A substring that matches the specified pattern.
Example:
Extract the first numeric value from order_details
.
SELECT
order_id,
order_details,
regexpExtract(order_details, '\\d+', 0) AS first_number
FROM
orders;
Output:
┌─order_id─┬─order_details────────────┬─first_number─┐
│ 1 │ 3 Tacos, 2 Burritos │ 3 │
│ 2 │ 1 Taco, 1 Quesadilla │ 1 │
│ 3 │ 5 Tacos, 1 Nacho │ 5 │
│ 4 │ 2 Burritos, 3 Quesadillas│ 2 │
└──────────┴──────────────────────────┴──────────────┘
The equivalent functions in other platforms are:
- Postgres:
regexp_matches()
combined with array access. - DuckDB:
regexp_extract()
with the capture group index. - Snowflake:
regexp_substr()
for extracting the match. - Databricks:
regexp_extract()
with the capture group index.
Conclusion
In this post, we've explored essential string functions in ClickHouse, covering a wide range of operations from basic manipulations to complex pattern matching. We've learned about:
- Case conversion functions like
lower
,upper
, andcapitalize
- String manipulation functions such as
substring
,concat
, andreplace
- Trimming functions including
trimLeft
,trimRight
, andtrimBoth
- Pattern matching capabilities with
like
andregexpExtract
We explained each function with its syntax, usage, and practical examples, along with comparisons to equivalent functions in other popular database systems like Postgres, DuckDB, Snowflake, and Databricks.
This post serves as a valuable reference for developers and data analysts working with ClickHouse. By bookmarking this guide, you'll have quick access to these essential string functions, their usage, and how they compare across different platforms.
You can get started with Propel's Serverless ClickHouse today and receive $15 in monthly credits forever. At any point, upgrade to pay-as-you-go, or contact us to learn more about our volume-based discounts. Visit our pricing page for details.