Imagine a data structure that can hold multiple values of different types in a single column. That's what tuples in ClickHouse offer. Tuples provide developers with a flexible and efficient way to handle more advanced data structures within a single table.
Whether you're a seasoned ClickHouse developer or just getting started, this guide will walk you through the basics and advanced techniques of using tuples in ClickHouse.
Tuples in ClickHouse
Tuples are a versatile data structure in ClickHouse. They let you handle collections of different data types in the same column. At its core, a tuple in ClickHouse is an ordered collection of elements that can have different data types. Unlike arrays, which store elements of the same type, tuples can combine various data types within a single structure.
For example, you might use a tuple to store a taco name (string), price (float), and order count (integer) together.
The syntax for defining a tuple in ClickHouse is straightforward:
Tuple(T1, T2, ..., Tn)
Where T1, T2, ..., Tn represent the data types of the elements in the tuple. Note that tuples in ClickHouse must contain at least one element.
Creating and manipulating tuples in ClickHouse
Let's dive deeper into how to create, access, and manipulate tuples in ClickHouse. Creating tuples in ClickHouse is easy. There are two primary methods:
1. Using the tuple()
function:
SELECT tuple(1, 'Al pastor', 5.00) AS tacos;
2. Using parentheses:
SELECT (2, 'Carnitas', 4.50) AS tacos;
Both methods return the same result: a collection of heterogeneous data types.
Tuples can also form the backbone of table structures:
-- Step 1: Create the Table
CREATE TABLE taco_order (
order_id Int32,
taco_tuple Tuple(String, Int32)
)
ENGINE = MergeTree()
ORDER BY order_id;
-- Step 2: Insert Sample Data
INSERT INTO taco_order VALUES (1, ('Beef', 3));
INSERT INTO taco_order VALUES (2, ('Chicken', 5));
INSERT INTO taco_order VALUES (3, ('Veggie', 2));
Once created, accessing tuple elements is straightforward. ClickHouse provides a dot notation for element retrieval:
SELECT
order_id,
taco_tuple.1 AS taco_type,
taco_tuple.2 AS quantity
FROM
taco_order;
Here, the taco_tuple
column encapsulates two distinct pieces of information within a single tuple.
┌─order_id─┬─taco_type─┬─quantity─┐
│ 3 │ Veggie │ 2 │
│ 1 │ Beef │ 3 │
│ 2 │ Chicken │ 5 │
└──────────┴───────────┴──────────┘
Tuples vs. rows: understanding the distinction in ClickHouse
While both tuples and rows serve as data containers in ClickHouse, their roles differ. Rows represent individual records, with each column corresponding to a specific attribute. Tuples offer a more versatile approach to data grouping within a single column.
Tuples vs. arrays: choosing the right data structure in ClickHouse
Understanding the differences between tuples and arrays is crucial. Arrays in ClickHouse are homogeneous collections, meaning all elements must share the same data type. They are ideal for lists of similar items, like timestamps or product ratings.
Tuples, on the other hand, excel at grouping heterogeneous data (data of different types) within a single column.
Tuples use cases
Tuples are versatile in various ClickHouse operations. These are some of the use cases when tuples are very handy.
Query-time data aggregation
Tuples are useful for temporary data aggregation in complex queries involving multiple columns. In this example well show a query-time aggregation using tuples.
-- Create the table
CREATE TABLE taco_purchases (
customer_id Int32,
taco_price Float32
)
ENGINE = MergeTree()
ORDER BY customer_id;
-- Insert data
INSERT INTO taco_purchases VALUES (1, 5.0);
INSERT INTO taco_purchases VALUES (1, 7.5);
INSERT INTO taco_purchases VALUES (1, 6.0);
INSERT INTO taco_purchases VALUES (2, 8.0);
INSERT INTO taco_purchases VALUES (2, 6.5);
INSERT INTO taco_purchases VALUES (3, 3.0);
INSERT INTO taco_purchases VALUES (3, 4.5);
INSERT INTO taco_purchases VALUES (3, 5.0);
INSERT INTO taco_purchases VALUES (3, 4.0);
-- Query the data
SELECT
customer_id,
tuple(sum(taco_price), avg(taco_price), count(*)) AS taco_stats
FROM taco_purchases
GROUP BY customer_id;
This output shows the customer_id
along with the aggregated taco purchase statistics (sum, average, and count of taco_price
) for each customer_id
in the taco_purchases
table.
This example shows how a tuple can encapsulate three distinct aggregations.
┌─customer_id─┬───────taco_stats────────────┐
│ 3 │ (16.5,4.125,4) │
│ 2 │ (14.5,7.25,2) │
│ 1 │ (18.5,6.166666666666667,3) │
└─────────────┴─────────────────────────────┘
Aggregating Data with sumMap
Tuples can be used in the sumMap
function to aggregate data by summing values associated with specific keys. This is particularly useful for scenarios where you need to aggregate metrics over time or other dimensions.
CREATE TABLE order_map (
created_at DateTime,
orderStatusMapTuple Tuple(Array(Int32), Array(Int32))
)
ENGINE = MergeTree()
ORDER BY created_at;
INSERT INTO order_map VALUES ('2024-07-01 00:00:00', ([1, 2, 3], [10, 20, 30]));
INSERT INTO order_map VALUES ('2024-07-01 01:00:00', ([1, 2, 3], [20, 40, 60]));
INSERT INTO order_map VALUES ('2024-07-01 02:00:00', ([1, 2, 3], [40, 80, 120]));
SELECT toDate(created_at) AS date, sumMap(orderStatusMapTuple) FROM order_map GROUP BY date;
In this example, sumMap
aggregates the values for each key in the tuple, providing a summarized view of the data. Thus, the final result of the query will be:
┌───────date──────┬──────────────sumMap(orderStatusMapTuple)──────────────┐
│ 2024-07-01 │ ([1,2,3],[70,140,210]) │
└─────────────────┴───────────────────────────────────────────────────────┘
JSON and tuples in materialized views
Tuples can be used in materialized Views to transform and store data in a more query-efficient structure. ClickHouse provides powerful functionality for working with JSON and tuples, particularly when dealing with nested JSON messages. Using tuples in materialized views allows for efficient parsing and storage of fixed-structure JSON data. Here are some practical examples:
Consider a nested JSON message with a fixed structure, like this:
{"taco_type": "Beef", "order_details": {"quantity": 42, "order_time": "2023-09-01 12:23:34.231"}}
Here's how you can use Tuple()
to parse this JSON message in a materialized view:
Create the Source Table:
CREATE TABLE tacos.nest_tuple_source
(
`taco_type` String,
`order_details` Tuple(`quantity` UInt8, `order_time` DateTime64(3))
)
ENGINE = MergeTree()
ORDER BY created_at;
Create the Materialized View:
CREATE MATERIALIZED VIEW tacos.mv_nest_tuple TO tacos.nest_tuple_destination
AS
SELECT
taco_type AS taco_type,
order_details.1 AS quantity,
order_details.2 AS order_time
FROM tacos.nest_tuple_source;
Create the Destination Table:
CREATE TABLE tacos.nest_tuple_destination
(
`taco_type` String,
`quantity` UInt8,
`order_time` DateTime64(3)
)
ENGINE = MergeTree()
ORDER BY order_time;
Insert Data:
INSERT INTO tacos.nest_tuple_source FORMAT JSONEachRow {"taco_type": "Beef", "order_details": {"quantity": 42, "order_time": "2023-09-01 12:23:34.231"}};
INSERT INTO tacos.nest_tuple_source FORMAT JSONEachRow {"taco_type": "Chicken", "order_details": {"quantity": 15, "order_time": "2023-09-02 13:45:12.123"}};
INSERT INTO tacos.nest_tuple_source FORMAT JSONEachRow {"taco_type": "Veggie", "order_details": {"quantity": 28, "order_time": "2023-09-03 09:10:56.789"}};
Query the Destination Table:
SELECT *
FROM tacos.nest_tuple_destination;
The output would be:
┌─taco_type─┬─quantity─┬──order_time──────────────────────┐
│ Beef │ 42 │ 2023-09-01 12:23:34.231 │
│ Chicken │ 15 │ 2023-09-02 13:45:12.123 │
│ Veggie │ 28 │ 2023-09-03 09:10:56.789 │
└───────────┴──────────┴──────────────────────────────────┘
Key tips:
- Case sensitivity: ClickHouse column names are case-sensitive. Ensure that the JSON message key names match the column names in the source table exactly.
- Fixed schema: The
Tuple()
approach works best with fixed JSON schemas. For dynamic schemas, use JSONExtract functions. - Index reference: For tuples, use index references inside the materialized view. Named tuples inside the MV won't work.
- Use Aliases: Always use
AS
(alias) for columns affected by functions in materialized views.
Denormalizing data
Tuples are effective in denormalizing data, which involves flattening tables to reduce query latency by avoiding joins. This is particularly beneficial for applications with heavy read requirements and complex queries.
CREATE TABLE taco_orders (
order_id UInt32,
customer_name String,
order_stats Tuple(TacoCount UInt32, TotalPrice Float32)
)
ENGINE = MergeTree()
ORDER BY order_id;
INSERT INTO taco_orders VALUES (1, 'John Doe', (10, 25.0));
INSERT INTO taco_orders VALUES (2, 'Jane Smith', (5, 12.5));
When querying the table, the stored tuple allows for efficient data retrieval without the need for complex joins. For example:
SELECT * FROM taco_orders;
This output shows the order_id
, customer_name
, and the order_stats
tuple containing TacoCount
and TotalPrice
for each order in the taco_orders
table.
┌─order_id---─┬─customer_name─┬─order_stats---──┐
│ 2 │ Jane Smith │ (5,12.5) │
│ 1 │ John Doe │ (10,25) │
└─────────────┴───────────────┴─────────────────┘
In this example, the order_stats
tuple stores related statistics directly within the taco_orders
table, reducing the need for joins and speeding up read operations
Complex filtering: Using tuples with IN
operators and lambda functions
Tuples can be used with IN
operators and lambda functions to perform complex filtering and transformations on data. This is useful for scenarios requiring advanced data manipulation, such as filtering taco orders based on specific criteria.
This example shows how to filter rows using tuples with the IN
operator and select orders based on specific combinations of taco type and quantity. It also shows how to use lambda functions for array filtering to filter an array of taco orders based on conditions applied to tuple elements.
Create the table and insert the data
CREATE TABLE taco_orders (
order_id UInt32,
taco_type String,
quantity UInt32,
price Float32
) ENGINE = MergeTree ORDER BY order_id;
INSERT INTO taco_orders VALUES (1, 'Beef', 10, 25.0);
INSERT INTO taco_orders VALUES (2, 'Chicken', 5, 12.5);
INSERT INTO taco_orders VALUES (3, 'Veggie', 20, 40.0);
INSERT INTO taco_orders VALUES (4, 'Beef', 7, 17.5);
INSERT INTO taco_orders VALUES (5, 'Chicken', 3, 7.5);
Filtering Rows Using Tuples with IN Operator
SELECT * FROM taco_orders
WHERE (taco_type, quantity) IN (('Beef', 10), ('Chicken', 5));
This query filters rows where the combination of taco_type and quantity matches any of the specified tuples (('Beef', 10) or ('Chicken', 5)).
┌─order_id─┬─taco_type─┬─quantity─┬─price─┐
│ 1 │ Beef │ 10 │ 25 │
│ 2 │ Chicken │ 5 │ 12.5 │
└──────────┴───────────┴──────────┴───────┘
Using Lambda Functions for Array Filtering
Tuples can be used with IN
operators and lambda functions to perform complex data filtering and transformations. This is useful for scenarios requiring advanced data manipulation.
SELECT arrayFilter(x -> x.2 > 5, [('Beef', 10), ('Chicken', 5), ('Veggie', 20), ('Beef', 7), ('Chicken', 3)]) as array;
This query filters an array of tuples, keeping only those tuples where the second element (quantity) is greater than 5.
┌─array──────────────────────────────────┐
│ [('Beef',10),('Veggie',20),('Beef',7)] │
└────────────────────────────────────────┘
By using tuples with IN operators and lambda functions, you can perform advanced filtering and data manipulation more efficiently and readable. This is particularly useful for complex scenarios like filtering semi-structured metadata.
Accessing tuple elements
Tuples allow for easy access to individual elements using the dot operator, which is useful for extracting specific data points from a tuple.
SELECT (10, 'Beef Taco').1 AS quantity, (10, 'Beef Taco').2 AS taco_type;
This query demonstrates how to access individual elements of a tuple directly in a SELECT statement. This is the output:
┌─quantity─┬─taco_type─┐
│ 10 │ Beef Taco │
└──────────┴───────────┘
Individual elements are accessed by their index, not their name.
Correct:
tuple.1 AS column1, tuple.2 AS column2;
Incorrect:
tuple.column1 AS column1, tuple.column2 AS column2;
Multi-column sorting
Suppose you have a table with taco products, and you want to order the results by multiple columns. This examples shows how to store taco product details and then order them by category, price, and stock in descending order.
-- Create the table
CREATE TABLE taco_products (
product_id UInt32,
product_name String,
category String,
price Float32,
stock UInt32
) ENGINE = MergeTree()
ORDER BY product_id;
-- Insert individual rows
INSERT INTO taco_products VALUES (1, 'Beef Taco', 'Main', 5.99, 100);
INSERT INTO taco_products VALUES (2, 'Chicken Taco', 'Main', 4.99, 150);
INSERT INTO taco_products VALUES (3, 'Veggie Taco', 'Main', 3.99, 200);
INSERT INTO taco_products VALUES (4, 'Fish Taco', 'Seafood', 6.99, 80);
INSERT INTO taco_products VALUES (5, 'Shrimp Taco', 'Seafood', 7.99, 50);
INSERT INTO taco_products VALUES (6, 'Cheese Taco', 'Vegetarian', 4.49, 120);
-- Order the results
SELECT *
FROM taco_products
ORDER BY (category, price, stock) DESC;
The expected output will list all taco products ordered by category, price, and stock in descending order.
┌─product_id─┬─product_name─┬─category───┬─price─┬─stock─┐
│ 6 │ Cheese Taco │ Vegetarian │ 4.49 │ 120 │
│ 5 │ Shrimp Taco │ Seafood │ 7.99 │ 50 │
│ 4 │ Fish Taco │ Seafood │ 6.99 │ 80 │
│ 1 │ Beef Taco │ Main │ 5.99 │ 100 │
│ 2 │ Chicken Taco │ Main │ 4.99 │ 150 │
│ 3 │ Veggie Taco │ Main │ 3.99 │ 200 │
└────────────┴──────────────┴────────────┴───────┴───────┘
How to convert an array to a tuple
In ClickHouse, converting an array to a tuple can be useful when you need to treat each element of the array as a distinct entity within a single composite structure. This can be done using the arrayJoin
and tuple
functions together.
Suppose you have a table with an array column, and you want to convert each array into a tuple.
-- Create the table
CREATE TABLE taco_order_quantities (
order_id UInt32,
taco_quantities Array(Int32)
) ENGINE = MergeTree()
ORDER BY order_id;
-- Insert data into the table
INSERT INTO taco_order_quantities VALUES (1, [10, 20, 30]), (2, [40, 50, 60]);
-- Convert array to tuple
SELECT
order_id,
tuple(arrayJoin(taco_quantities)) AS quantity_tuple
FROM taco_order_quantities;
For the given data, the arrayJoin function will create multiple rows for each array element, and the tuple function will convert these elements into tuples. The output will look something like this:
In this example:
- We first create a table
taco_order_quantities
with anorder_id
column and an array column namedtaco_quantities
. - We insert some test data into the table.
- We use the
arrayJoin
function to flatten the array and then convert it into a tuple using the tuple function.
┌─order_id─┬─quantity_tuple─┐
│ 1 │ (10) │
│ 1 │ (20) │
│ 1 │ (30) │
│ 2 │ (40) │
│ 2 │ (50) │
│ 2 │ (60) │
└──────────┴────────────────┘
This query will convert each element of the array into a tuple, allowing you to handle the data in a more structured way.
How to convert a tuple into an array
Converting a tuple into an array in ClickHouse can be useful when you need to handle the elements of a tuple as a unified collection. This is particularly handy for functions that operate on arrays.
Suppose you have a table with a tuple column, and you want to convert the tuple into an array.
-- Create the table
CREATE TABLE taco_tuple_to_array (
order_id UInt32,
taco_quantities Tuple(Int32, Int32, Int32)
) ENGINE = MergeTree()
ORDER BY order_id;
-- Insert data into the table
INSERT INTO taco_tuple_to_array VALUES (1, (10, 20, 30)), (2, (40, 50, 60));
-- Convert tuple to array
SELECT
order_id,
array(taco_quantities.1, taco_quantities.2, taco_quantities.3) AS taco_quantities_array
FROM taco_tuple_to_array;
In this example:
- We first create a table
taco_tuple_to_array
with anorder_id
column and atuple
column namedtaco_quantities
. - We insert some test data into the table.
- We use the
array
function to convert the elements of the tuple into an array.
┌─order_id─┬─taco_quantities_array─┐
│ 1 │ [10,20,30] │
│ 2 │ [40,50,60] │
└──────────┴───────────────────────┘
This query effectively converts the tuple into an array, enabling you to leverage ClickHouse's array functions for further data manipulation.
Tuple functions in ClickHouse
ClickHouse's robust suite of tuple functions elevates data manipulation.
The tuple()
function allows for the creation of tuples on the fly:
SELECT tuple(user_id, order_date, total_amount) AS order_summary
FROM orders
WHERE total_amount > 1000
The tupleElement()
function supports both index-based and name-based access:
SELECT
tupleElement(user_data, 'name') AS user_name,
tupleElement(user_data, 2) AS user_age
FROM users
The untuple()
function syntactically substitutes tuple elements directly in the query:
SELECT untuple((user_id, username, email))
FROM user_profiles
The tupleHammingDistance()
function calculates the Hamming distance between two tuples:
SELECT tupleHammingDistance(
(1, 'active', 'premium'),
(1, 'inactive', 'basic')
) AS profile_difference
The tupleToNameValuePairs()
function transforms named tuples into arrays of name-value pairs:
SELECT tupleToNameValuePairs((name := 'John', age := 30, city := 'New York'))
AS user_attributes
Arithmetic operations on tuples are streamlined with functions like tuplePlus()
, tupleMinus()
, tupleMultiply()
, and tupleDivide()
:
SELECT
tuplePlus((1, 2, 3), (4, 5, 6)) AS sum_tuple,
tupleMultiply((2, 3, 4), (5, 6, 7)) AS product_tuple
The flattenTuple()
function simplifies complex nested tuples:
SELECT flattenTuple((a := 1, b := (c := 2, d := 3))) AS flattened_data
These tuple functions enable expressive and efficient queries, making them an essential part of any ClickHouse developer's toolkit.
Working with tuples: Advanced techniques and best practices
Data type detection
ClickHouse employs intelligent type inference for tuple elements. For example:
SELECT toTypeName(tuple(1, NULL)) AS x;
This query returns Tuple(UInt8, Nullable(Nothing))
, showcasing ClickHouse's ability to efficiently handle both integer and NULL values.
Performance considerations
While tuples offer flexibility, they can impact query performance if misused. To maximize efficiency:
- Avoid unnecessary tuple creation in frequently executed queries.
- Use tuple comparison judiciously, as element-wise comparisons can be costly for large datasets.
Nested tuples: navigating complex structures
Nested tuples can represent hierarchical data effectively. The flattenTuple
function is invaluable for working with such structures:
SELECT flattenTuple((1, ('a', 2), (('b', 3), 'c'))) AS flattened;
Debugging tuple-related issues
When troubleshooting tuple problems:
- Use
toTypeName()
to verify the inferred types of tuple elements. - Employ
toString()
to inspect tuple contents.
Conclusion
In conclusion, tuples in ClickHouse offer a powerful and flexible way to handle collections of heterogeneous data types within a single column. Whether you're aggregating data on-the-fly, denormalizing tables for faster reads, or transforming JSON structures in materialized views, tuples provide an efficient and expressive mechanism for complex data operations. By understanding the nuances of tuple creation, manipulation, and advanced functions, you can leverage this data structure to optimize your queries and enhance your data processing workflows. As with any powerful tool, careful consideration of performance implications and best practices will ensure that you get the most out of tuples in your ClickHouse projects.
You can get started withPropel's Serverless ClickHouse today and receive $15 in monthly credits forever. At any point, upgrade to pay-as-you-go, or contact us to learn more about our volume-based discounts. Visit our pricing page for details.