Choosing the right primary key in ClickHouse is crucial for optimizing data storage and query performance. This post covers the key points you need to know to make the right choice.
Understanding ClickHouse primary keys
Unlike traditional relational databases where primary keys enforce uniqueness and are used for indexing, ClickHouse primary keys serve a different purpose. In ClickHouse, primary keys are used for data organization and sparse indexing, allowing for efficient data skipping during queries. They don't enforce uniqueness and are always a prefix of the sorting key, which determines how data is physically stored on disk for optimized read performance.
Role and importance
- Indexing and storage: ClickHouse uses a sophisticated system for indexing and storing data. The primary key plays a pivotal role in this system by creating the main index file for the table.
- Query performance: The primary key significantly affects the performance of most analytical queries. It helps in structuring data for efficient retrieval.
- Data ordering: If an `ORDER BY` clause is not defined, the primary key will be used to order the data. This makes the primary key the default sorting mechanism for the table.
How primary keys work
ClickHouse primary keys work differently from traditional databases. Here's how they function:
- Sparse indexing: ClickHouse uses a sparse indexing system. Instead of indexing every row, it creates index marks for every N-th row (where N is the index granularity, typically 8192 rows).
- Data skipping: The primary key allows ClickHouse to skip large portions of data during query execution. When a query includes a condition on the primary key columns, ClickHouse can quickly determine which parts of the data can be skipped.
- Prefix matching: Queries that match a prefix of the primary key can be executed more efficiently. For example, if the primary key is (a, b, c), queries filtering on 'a' or 'a and b' will be more efficient than those filtering only on 'c'.
- No uniqueness guarantee: Unlike in many other databases, the primary key in ClickHouse does not enforce uniqueness. It's primarily used for data organization and query optimization.
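To see these mechanics concretely, ClickHouse's `EXPLAIN indexes = 1` reports which parts and granules the primary index allows a query to skip. Here is a minimal sketch, assuming a hypothetical `events` table (the table and its columns are illustrative, not from the example later in this post):

```sql
-- Hypothetical table: the sorting key (site_id, event_time) also serves
-- as the primary key and builds the sparse index.
CREATE TABLE events (
    site_id UInt32,
    event_time DateTime,
    payload String
) ENGINE = MergeTree()
ORDER BY (site_id, event_time);

-- EXPLAIN indexes = 1 shows, for each part, how many granules the
-- primary key lets ClickHouse skip when filtering on a key prefix.
EXPLAIN indexes = 1
SELECT count()
FROM events
WHERE site_id = 42;
```

Because `site_id` is the first column of the key, the condition above matches a key prefix and the granule counts in the output should drop sharply compared to a filter on `payload` alone.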
Choosing the right primary key
Selecting an optimal primary key in ClickHouse requires careful consideration of your data structure and query patterns. Here are some key factors to keep in mind when making this crucial decision:
- Query patterns: The primary key should include columns frequently used in query filters (WHERE clauses) and sorting (ORDER BY clauses). This ensures that the primary key aligns with the most common query patterns, optimizing performance.
- Column order: The order of columns in the primary key should cover most partial primary key usage cases. Typically, columns with low cardinality should be placed first, followed by columns with higher cardinality.
- Data sorting: The primary key serves as the default sorting mechanism. If additional data ordering is needed, explicitly define an `ORDER BY` clause that includes more columns.
Practical tips
- Low to high cardinality: Always try to order the primary key columns from low to high cardinality. This helps in optimizing the index and query performance.
- Filtering columns: Pick columns that are frequently used in filtering. This ensures that the primary key is effective in narrowing down the data during queries.
- Hierarchical data: For data with a tree-like structure, organize the primary key columns from root to leaves. This approach aligns with the natural hierarchy of the data and can improve query performance when filtering on higher-level categories.
- Time series data: For time-series data, it often makes sense to put the timestamp as the last column in the `ORDER BY` clause. This helps keep related data close together, improving data locality and compression.
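Putting these tips together, here is a sketch of a primary key ordered from low to high cardinality for a hypothetical page-view table (the table, column names, and cardinality estimates are illustrative):

```sql
-- Hypothetical example: key column cardinality grows left to right,
-- with the timestamp last, per the time-series tip above.
CREATE TABLE page_views (
    country LowCardinality(String),  -- low cardinality: ~200 values
    site_id UInt32,                  -- medium cardinality: thousands of sites
    user_id UInt64,                  -- high cardinality: millions of users
    view_time DateTime,              -- highest cardinality: nearly unique
    url String
) ENGINE = MergeTree()
ORDER BY (country, site_id, user_id, view_time);
```

With this ordering, queries filtering on `country`, on `(country, site_id)`, or on `(country, site_id, user_id)` all match a key prefix and benefit from data skipping.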
Example
Let's see these concepts at work with a practical example. We'll explore a taco ordering system. This example will demonstrate how to structure a table, choose an appropriate primary key, and optimize for common query patterns in a real-world application.
Imagine you have a table with the following schema and data:

```sql
-- Schema for taco_orders
order_id String,
order_timestamp DateTime,
taco_type String,
quantity UInt32,
price_per_taco Float32,
customer_id UInt32,
special_instructions String,
toppings Array(String)
```
```sql
-- Data
INSERT INTO taco_orders VALUES ('a1b2c3d4', '2024-08-06 12:00:00', 'Beef', 2, 3.50, 1011, '{"spicy": true, "extra_cheese": false}', ['lettuce', 'tomato', 'salsa']);
INSERT INTO taco_orders VALUES ('e5f6g7h8', '2024-08-06 12:05:00', 'Chicken', 3, 3.00, 1022, '{"spicy": false, "extra_guac": true}', ['lettuce', 'guacamole']);
INSERT INTO taco_orders VALUES ('i9j0k1l2', '2024-08-06 12:10:00', 'Veggie', 1, 2.75, 1033, '{"gluten_free": true}', ['beans', 'corn', 'avocado']);
INSERT INTO taco_orders VALUES ('m3n4o5p6', '2024-08-06 12:15:00', 'Fish', 2, 4.00, 1044, '{"extra_sauce": true}', ['cabbage', 'lime', 'crema']);
```
Consider the following common query patterns:
1. Total sales by taco type:

```sql
SELECT
    taco_type,
    SUM(quantity) AS total_tacos_sold,
    SUM(quantity * price_per_taco) AS total_revenue
FROM taco_orders
GROUP BY taco_type;
```

2. Orders over time:

```sql
SELECT
    toHour(order_timestamp) AS hour,
    COUNT(*) AS order_count,
    SUM(quantity) AS total_tacos_sold,
    AVG(quantity * price_per_taco) AS avg_order_value
FROM taco_orders
GROUP BY hour
ORDER BY hour;
```

3. Orders over time broken down by taco type:

```sql
SELECT
    toHour(order_timestamp) AS hour,
    taco_type,
    COUNT(*) AS order_count,
    SUM(quantity) AS total_tacos_sold,
    AVG(quantity * price_per_taco) AS avg_order_value
FROM taco_orders
GROUP BY hour, taco_type
ORDER BY hour, taco_type;
```
For the given queries and the table schema, the optimal primary key should support efficient filtering, grouping, and aggregation based on the common query patterns. The primary key should be chosen to optimize queries that involve `order_timestamp` and `taco_type`, ensuring good performance for both time-based and type-based analyses.
Common query patterns and key columns
- Total sales by taco type:
  - Columns involved: `taco_type`, `quantity`, `price_per_taco`
- Orders over time:
  - Columns involved: `order_timestamp`, `quantity`, `price_per_taco`
- Orders over time broken down by taco type:
  - Columns involved: `order_timestamp`, `taco_type`, `quantity`, `price_per_taco`
Recommended primary key
A composite primary key consisting of `order_timestamp` and `taco_type` would be effective. This combination ensures good performance for both time-based and type-based queries.

```sql
PRIMARY KEY (order_timestamp, taco_type)
```
Explanation
- `order_timestamp`:
  - Frequently used in time-based queries (e.g., `toHour(order_timestamp)` to extract the hour).
  - Helps in maintaining data locality and compression for time-series data.
  - Optimizes queries that group by the hour derived from `order_timestamp`.
- `taco_type`:
  - Used in aggregations and groupings for sales analysis.
  - Enhances query performance when filtering or grouping by `taco_type`.
Table definition
```sql
CREATE TABLE taco_orders (
    order_id String,
    order_timestamp DateTime,
    taco_type String,
    quantity UInt32,
    price_per_taco Float32,
    customer_id UInt32,
    special_instructions String,
    toppings Array(String)
) ENGINE = MergeTree()
PRIMARY KEY (order_timestamp, taco_type);
```
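Because no separate `ORDER BY` clause is given here, the sorting key defaults to the primary key. ClickHouse also lets the sorting key extend the primary key, as long as the primary key remains a prefix of it. A sketch of that variant, using a hypothetical `taco_orders_v2` table name:

```sql
-- Variant: the sparse index still covers (order_timestamp, taco_type),
-- but rows sharing the same timestamp and type are additionally
-- sorted by order_id on disk.
CREATE TABLE taco_orders_v2 (
    order_id String,
    order_timestamp DateTime,
    taco_type String,
    quantity UInt32,
    price_per_taco Float32,
    customer_id UInt32,
    special_instructions String,
    toppings Array(String)
) ENGINE = MergeTree()
PRIMARY KEY (order_timestamp, taco_type)
ORDER BY (order_timestamp, taco_type, order_id);
```

This keeps the index small while giving you a finer on-disk ordering, which can help compression and deterministic ordering of rows within a granule.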
How the primary key optimizes queries:
- Total sales by taco type: The primary key `(order_timestamp, taco_type)` allows efficient grouping by `taco_type` since it is part of the primary key. This reduces the amount of data to be scanned and grouped.
- Orders over time: `order_timestamp`, being the first part of the primary key, ensures that time-based queries are highly efficient. The data is already sorted by time, making it faster to group and aggregate by hour.
- Orders over time broken down by taco type: The combination of `order_timestamp` and `taco_type` in the primary key ensures that the data is efficiently grouped and ordered by both time and taco type. This supports the efficient execution of queries that require grouping by both hour and taco type.

By structuring the primary key as `(order_timestamp, taco_type)`, we optimize for the most common query patterns, ensuring efficient filtering, grouping, and aggregation across different query types.
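For instance, a query restricted to the lunch hour filters on the leading primary key column, so ClickHouse can skip granules outside that time range entirely (a sketch against the table defined above):

```sql
-- Lunch-hour revenue by taco type: the WHERE clause filters on
-- order_timestamp, the first primary key column, so only granules
-- whose key range overlaps 12:00-13:00 are read.
SELECT
    taco_type,
    SUM(quantity * price_per_taco) AS revenue
FROM taco_orders
WHERE order_timestamp >= '2024-08-06 12:00:00'
  AND order_timestamp <  '2024-08-06 13:00:00'
GROUP BY taco_type
ORDER BY revenue DESC;
```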
Conclusion
Choosing the right primary key in ClickHouse involves understanding its role in indexing and data storage, and considering common query patterns. By following these guidelines, you can significantly enhance the performance and efficiency of your ClickHouse database.
You can get started with Propel's Serverless ClickHouse today and receive $15 in monthly credits forever. At any point, upgrade to pay-as-you-go, or contact us to learn more about our volume-based discounts. Visit our pricing page for details.