Materialized views in ClickHouse®: The data transformation Swiss Army knife

What they are, how they work, and 10 practical examples

Materialized views in ClickHouse

Propel

Imagine having a tool that could answer complex analytical queries instantly. In the realm of data, ClickHouse® materialized views come close to this ideal. They allow you to precompute query results and store them as tables, significantly speeding up query performance. But how exactly do they work, and what are the benefits and trade-offs? Read on!

Lightning-fast queries with ClickHouse® materialized views

ClickHouse® materialized views are a powerful feature that allows you to precompute and store query results as separate tables. Unlike regular views, which are virtual and computed on the fly, materialized views are physically stored on disk and automatically updated when the source data changes.

The key benefits of using materialized views in ClickHouse® are:

  • Improved query performance: By precomputing and storing results, materialized views can significantly speed up complex queries, often by orders of magnitude.
  • Reduced computational overhead: Materialized views shift the computational burden from query time to insertion time, reducing the load on the system during peak query hours.

However, there are also some trade-offs to consider:

  • Increased storage usage: Since materialized views store pre-computed results, they consume additional disk space.
  • Potential write amplification: Updating materialized views on every insert, update, or delete operation on the source table can increase write overhead.
  • Insert errors: An error in a materialized view will cause the insert on the source table to fail.

Under the hood, ClickHouse® materialized views work by creating a trigger-based mechanism that automatically updates the view whenever the source table changes. When you insert data in the source table, ClickHouse® executes the materialized view's SELECT query on the new data and inserts the results into the view's target table.

One key characteristic of materialized views in ClickHouse® is incremental updates. Instead of recomputing the entire view on every change, ClickHouse intelligently updates only the affected parts of the view. This makes materialized views highly efficient, even for large datasets and frequent updates.

Creating materialized views in ClickHouse®

Creating a materialized view in ClickHouse is straightforward using the CREATE MATERIALIZED VIEW statement. The basic syntax looks like this:

CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db.]table_name 
[TO[db.]name] 
[ENGINE = engine] 
[POPULATE] 
AS SELECT ...

Let's break this down:

  • [IF NOT EXISTS]: Optional clause to prevent an error if a view with the same name already exists.
  • [db.]table_name: The name of the materialized view, optionally prefixed with the database name.
  • [TO [db.]name]: The target table to store the materialized view results. If omitted, ClickHouse® creates a target table with the same name as the view.
  • [ENGINE = engine]: The table engine for the materialized view. If not specified, the engine is derived from the SELECT query.
  • [POPULATE]: Optional keyword to backfill the view with existing data from the source table(s). However, it's not recommended for production use, as it can lead to data inconsistency if new data is inserted during the population process and does not work with the TO clause.
  • AS SELECT ...: The SELECT query that defines the materialized view.

Choosing the right table engine is crucial for materialized view performance and functionality. Some common choices are:

  • MergeTree: The default table engine for materialized views. It provides efficient data storage and retrieval for append-only data, with support for partitioning and data replication.
  • ReplacingMergeTree: An engine designed to handle duplicate rows and updates by keeping only the latest version based on a specified version column. This is useful for scenarios where rows might be updated frequently, and you want to maintain only the most recent data.
  • SummingMergeTree: An engine optimized for storing pre-aggregated data. It automatically sums values with the same primary key, making it ideal for materialized views that compute aggregates like COUNT or SUM. However, ClickHouse® now supports all the aggregate functions of the AggregatingMergeTree in the SummingMergeTree 🤷🏻.
  • AggregatingMergeTree: Similar to SummingMergeTree, but uses ClickHouse's aggregate functions.

Creating a materialized view with the TO clause vs implicit table

When creating a materialized view, we recommend using the TO clause to specify a custom target table name instead of using the materialized view implicit table. Next, we’ll explain how to create each and why we recommend using the TO clause to land the output of a materialized view in its own table.

Creating a materialized view using the TO clause allows you to specify a custom target table where the materialized view's results will be stored. This provides you with control over the name, schema, sorting key, and storage engine of the target table. It is useful for organizing views and controlling access permissions.

CREATE MATERIALIZED VIEW my_materialized_view
TO target_table
AS SELECT ...

If the TO clause is omitted, ClickHouse automatically creates an implicit target table with the same name as the materialized view. This table will store the materialized view's results. While this method is simpler and requires less configuration, it gives you less control over the target table's specifics.

CREATE MATERIALIZED VIEW my_materialized_view
AS SELECT ...

As mentioned above, the POPULATE clause will not work with the TO clause. Additionally, it results in inconsistencies when populating when new inserts are coming in. Therefore, we do not recommend you use it.

Materialized views example

This example demonstrates how to create a materialized view in ClickHouse that performs incremental aggregations and stores the count, total quantity, and total sales amount for each type of taco in a separate table.

Let’s start by defining our source table taco_sales. This table will store sales data for a taco restaurant. We will use this source table and data through the post.

CREATE TABLE taco_sales
(
    sale_id UInt32,
    sale_timestamp DateTime,
    taco_type String,
    quantity UInt32,
    price_per_taco Float32,
    customer_id UInt32,
    special_instructions String,
    toppings Array(String)
) ENGINE = MergeTree()
ORDER BY sale_timestamp;

INSERT INTO taco_sales VALUES (11, '2024-07-01 12:50:00', 'Beef', 4, 2.50, 110, '{\"spicy\": true, \"extra_cheese\": false, \"no_onions\": true}', ['lettuce', 'cheese', 'salsa']);
INSERT INTO taco_sales VALUES (12, '2024-07-01 12:55:00', 'Chicken', 5, 3.00, 111, '{\"spicy\": false, \"extra_cheese\": true, \"gluten_free\": true}', ['lettuce', 'guacamole']);
INSERT INTO taco_sales VALUES (13, '2024-07-01 13:00:00', 'Veggie', 3, 2.00, 112, '{\"spicy\": true, \"extra_cheese\": true, \"no_sour_cream\": true}', ['tomato', 'onion', 'cheese']);
INSERT INTO taco_sales VALUES (14, '2024-07-01 13:05:00', 'Fish', 2, 3.50, 113, '{\"spicy\": false, \"extra_cheese\": false, \"extra_avocado\": true}', ['lettuce', 'tomato', 'avocado']);
INSERT INTO taco_sales VALUES (15, '2024-07-01 13:10:00', 'Beef', 1, 2.50, 114, '{\"spicy\": true, \"extra_cheese\": true, \"no_tomato\": true}', ['cheese', 'sour cream']);
INSERT INTO taco_sales VALUES (16, '2024-07-01 13:15:00', 'Chicken', 4, 3.00, 115, '{\"spicy\": false, \"extra_cheese\": false, \"extra_lettuce\": true}', ['lettuce', 'tomato', 'salsa']);
INSERT INTO taco_sales VALUES (17, '2024-07-01 13:20:00', 'Veggie', 2, 2.00, 116, '{\"spicy\": true, \"extra_cheese\": false, \"no_avocado\": true}', ['lettuce', 'avocado']);
INSERT INTO taco_sales VALUES (18, '2024-07-01 13:25:00', 'Fish', 5, 3.50, 117, '{\"spicy\": false, \"extra_cheese\": true, \"no_cilantro\": true}', ['lettuce', 'onion', 'sour cream']);
INSERT INTO taco_sales VALUES (19, '2024-07-01 13:30:00', 'Beef', 3, 2.50, 118, '{\"spicy\": true, \"extra_cheese\": false, \"extra_salsa\": true}', ['tomato', 'cheese']);
INSERT INTO taco_sales VALUES (20, '2024-07-01 13:35:00', 'Chicken', 2, 3.00, 119, '{\"spicy\": false, \"extra_cheese\": true, \"extra_sour_cream\": true}', ['lettuce', 'sour cream']);

Next, we are going to create our target table that stores the resulting aggregates:

CREATE TABLE taco_sales_aggregated
(
    taco_type String,
    order_count UInt32,
    total_quantity UInt32,
    total_sales Float32
) ENGINE = SummingMergeTree()
ORDER BY taco_type;

With the source and target tables created, we can create our materialized view, which incrementally aggregates order_count, total_quantity, and total_sales by taco_type.

CREATE MATERIALIZED VIEW taco_sales_aggregated_mv
TO taco_sales_aggregated
AS
SELECT
    taco_type,
    count(*) as order_count,
    sum(quantity) AS total_quantity,
    sum(quantity * price_per_taco) AS total_sales
FROM taco_sales
GROUP BY taco_type;

Lastly, we are going to insert some data into the taco_sales source table.

INSERT INTO taco_sales VALUES (21, '2024-07-01 13:40:00', 'Beef', 4, 2.50, 110, '{\"spicy\": true, \"extra_cheese\": false, \"no_onions\": true}', ['lettuce', 'cheese', 'salsa']);
INSERT INTO taco_sales VALUES (22, '2024-07-01 13:45:00', 'Chicken', 5, 3.00, 111, '{\"spicy\": false, \"extra_cheese\": true, \"gluten_free\": true}', ['lettuce', 'guacamole']);
INSERT INTO taco_sales VALUES (23, '2024-07-01 13:50:00', 'Veggie', 3, 2.00, 112, '{\"spicy\": true, \"extra_cheese\": true, \"no_sour_cream\": true}', ['tomato', 'onion', 'cheese']);
INSERT INTO taco_sales VALUES (24, '2024-07-01 13:55:00', 'Fish', 2, 3.50, 113, '{\"spicy\": false, \"extra_cheese\": false, \"extra_avocado\": true}', ['lettuce', 'tomato', 'avocado']);
INSERT INTO taco_sales VALUES (25, '2024-07-01 14:00:00', 'Beef', 1, 2.50, 114, '{\"spicy\": true, \"extra_cheese\": true, \"no_tomato\": true}', ['cheese', 'sour cream']);
INSERT INTO taco_sales VALUES (26, '2024-07-01 14:05:00', 'Chicken', 4, 3.00, 115, '{\"spicy\": false, \"extra_cheese\": false, \"extra_lettuce\": true}', ['lettuce', 'tomato', 'salsa']);
INSERT INTO taco_sales VALUES (27, '2024-07-01 14:10:00', 'Veggie', 2, 2.00, 116, '{\"spicy\": true, \"extra_cheese\": false, \"no_avocado\": true}', ['lettuce', 'avocado']);
INSERT INTO taco_sales VALUES (28, '2024-07-01 14:15:00', 'Fish', 5, 3.50, 117, '{\"spicy\": false, \"extra_cheese\": true, \"no_cilantro\": true}', ['lettuce', 'onion', 'sour cream']);
INSERT INTO taco_sales VALUES (29, '2024-07-01 14:20:00', 'Beef', 3, 2.50, 118, '{\"spicy\": true, \"extra_cheese\": false, \"extra_salsa\": true}', ['tomato', 'cheese']);
INSERT INTO taco_sales VALUES (30, '2024-07-01 14:25:00', 'Chicken', 2, 3.00, 119, '{\"spicy\": false, \"extra_cheese\": true, \"extra_sour_cream\": true}', ['lettuce', 'sour cream']);

Once you've created a materialized view in ClickHouse, and inserted some data into the source table, querying it is as simple as running a SELECT statement on the target table. The materialized view will automatically keep the target table up-to-date as new data is inserted into the source table.

However, when querying materialized views that use the SummingMergeTree, ReplacingMergeTree, or AggregatingMergeTree engines, it's important to use the FINAL keyword. This ensures that the query returns fully aggregated results, as these engines may store intermediate aggregation states that have not yet been merged together.

SELECT * from taco_sales_aggregated FINAL

This returns:

┌─taco_type─┬─order_count─┬─total_quantity─┬─total_sales─┐
│ Beef      │  3820│ Veggie    │  2510│ Chicken   │  31133│ Fish      │  2724.5└───────────┴─────────────┴────────────────┴─────────────┘

Note that the order_count in the target table adds up to 10, while we’ve inserted 20 records in the source table. In fact, it only includes the records we inserted AFTER creating the materialized view. To process the records that existed BEFORE we need to backfill the materialized view.

Backfilling materialized views

When you create a materialized view, it starts processing new data immediately. However, any data that existed in the source table before the materialized view was created will not be included in the view. To include this historical data, you need to backfill the materialized view.

The approach we recommend to backfill a materialized view is to use the INSERT SELECT statement. This method allows you to manually insert data from the source table into the target table of the materialized view. Here's how you can do it:

  1. Identify the target table: Determine the target table where the materialized view stores its results. This is the table specified in the TO clause of the CREATE MATERIALIZED VIEW statement.
  2. Insert historical data: Use the INSERT SELECT statement to copy data from the source table to the target table. Ensure that the SELECT query matches the query defined in the materialized view.
  3. Filter to include only the “old” rows: Use the WHERE clause to include only the rows before the materialized view was created.

To backfill our example, we can run:

INSERT INTO taco_sales_aggregated
SELECT
    taco_type,
    count() AS order_count,
    sum(quantity) AS total_quantity,
    sum(quantity * price_per_taco) AS total_sales
FROM taco_sales
WHERE sale_timestamp > toDateTime('2024-07-01 13:35:00')
GROUP BY taco_type;

This query inserts the historical data from the taco_sales table into the taco_sales_aggregated table, ensuring that the materialized view is backfilled with all the data before the materialized view was created.

Now when you query the target table, you get the following results that include all rows created before and after the materialized view was created. The order_count column now sums up to 20, which is the total number of rows our source table has.

┌─taco_type─┬─order_count─┬─total_quantity─┬─total_sales─┐
│ Beef      │ 61640│ Chicken   │ 62266│ Fish      │ 41449│ Veggie    │ 41020└───────────┴─────────────┴────────────────┴─────────────┘

By using the INSERT SELECT method, you can backfill materialized views without using the POPULATE clause, which can lead to data inconsistency issues if new data is inserted during the population process.

Handling schema changes in materialized views

When the schema of the source table changes, you need to update the materialized view and its target table to reflect these changes. This involves the following steps:

  1. Add a column to the source table
  2. Add a column to the target table
  3. Delete the materialized view
  4. Recreate the materialized view
  5. Backfill the materialized view

Example

Let's say we want to add a new column discount to the taco_sales source table to store discount information for each sale.

Add a column to the source table

ALTER TABLE taco_sales ADD COLUMN discount Float32 DEFAULT 0;

Add a column to the target table

Then, create the new target table with the additional total_discount column.

ALTER TABLE taco_sales_aggregated ADD COLUMN total_discount Float32 DEFAULT 0;

Delete the materialized view

Deleting the materialized view stops new rows from being written to the target table.

DROP MATERIALIZED VIEW IF EXISTS taco_sales_aggregated_mv;

Recreate the materialized view

Create the new materialized view to include the discount column:

CREATE MATERIALIZED VIEW taco_sales_aggregated_mv
TO taco_sales_aggregated
AS
SELECT
    taco_type,
    count(*) AS order_count,
    sum(quantity) AS total_quantity,
    sum(quantity * price_per_taco) AS total_sales,
    sum(discount) AS total_discount
FROM taco_sales
GROUP BY taco_type;

Backfill the materialized view

Insert historical data into the new target table:

INSERT INTO taco_sales_aggregated
SELECT
    taco_type,
    count(*) AS order_count,
    sum(quantity) AS total_quantity,
    sum(quantity * price_per_taco) AS total_sales,
    sum(discount) AS total_discount
FROM taco_sales
WHERE sale_timestamp < toDateTime('2024-07-01 13:35:00')
GROUP BY taco_type;

Make sure the WHERE clause only includes rows BEFORE the materialized view was created.

By following these steps, you can handle schema changes in your source table and ensure that your materialized views and target tables remain up-to-date and consistent.

Chaining materialized views

One powerful feature of materialized views in ClickHouse is the ability to chain them together. This means you can create a materialized view that depends on another materialized view, allowing you to build complex, multi-stage data transformations. Chained views can be useful for gradually aggregating data at different granularities or for implementing incremental data processing pipelines.

Chaining materialized views in ClickHouse allows you to build complex, multi-stage data transformation pipelines. This can be particularly useful for gradually aggregating data at different granularities or for implementing incremental data processing pipelines.

Example

Let's extend our previous example by creating a second materialized view that further aggregates the data from the taco_sales_aggregated table. In this example, we will compute the average price per taco type based on the aggregated sales data.

Let's create a new target table to store the average price per taco type:

CREATE TABLE taco_sales_avg_price
(
    taco_type String,
    avg_price_per_taco Float32
) ENGINE = AggregatingMergeTree()
ORDER BY taco_type;

Next, we will create a second materialized view that computes the average price per taco type based on the data from the taco_sales_aggregated table:

CREATE MATERIALIZED VIEW taco_sales_avg_price_mv
TO taco_sales_avg_price
AS
SELECT
    taco_type,
    avgState(total_sales / total_quantity) AS avg_price_per_taco
FROM taco_sales_aggregated
GROUP BY taco_type;

With this setup, whenever data is inserted into the taco_sales table, the taco_sales_aggregated_mv materialized view will update the taco_sales_aggregated table. Then, the taco_sales_avg_price_mv materialized view will update the taco_sales_avg_price table based on the updated data from taco_sales_aggregated.

To query the final result, you can use the avgMerge function (more on this later) to get the average price per taco type:

SELECT
    taco_type,
    avgMerge(avg_price_per_taco) AS avg_price
FROM taco_sales_avg_price
GROUP BY taco_type;

By chaining materialized views in this manner, you can efficiently handle complex data transformations and aggregations in ClickHouse.

Materialized views use cases and examples

Materialized views in ClickHouse offer a variety of applications, from data transformation and enrichment to performance optimization and real-time analytics. Let's explore some common use cases and provide examples for each.

Incremental aggregations with state functions

Materialized views can perform incremental aggregations, where the aggregated data is updated incrementally as new data arrives. This can significantly improve query performance for aggregate queries.

ClickHouse offers a set of state functions that enable incremental aggregation. Functions like sumState, avgState, and quantileState store the intermediate state of an aggregation, which can be merged later to get the final result.

To use state functions in materialized views, create a target table with AggregateFunction columns to store these intermediate states. In the materialized view's SELECT query, use the state functions for aggregations. When querying the materialized view, use the corresponding Merge functions (e.g., sumMerge, avgMerge) to obtain the final results.

In the example above, our taco_sales table stores individual sales transactions, and the materialized view  taco_sales_aggregated_mv stores the aggregates in the target table  taco_sales_aggregated.

Similarly, we can create a materialized view to calculate not just the order_count, total_quantity, and total_sales, but also the average_price per type of taco.

We’ll use the same source taco_sales table, but we’ll create a new target table that includes the average_price. Note that we use a AggregateFunction(avg, Float32) for our average field.

CREATE TABLE taco_sales_aggregated
(
    taco_type String,
    order_count UInt32,
    total_quantity UInt32,
    total_sales Float32,
    average_price AggregateFunction(avg, Float32)
) ENGINE = SummingMergeTree()
ORDER BY taco_type;

Next, we can create our materialized view and insert some data

CREATE MATERIALIZED VIEW taco_sales_aggregated_mv
TO taco_sales_aggregated
AS
SELECT
    taco_type,
    count(*) AS order_count,
    sum(quantity) AS total_quantity,
    sum(quantity * price_per_taco) AS total_sales,
    avgState(price_per_taco) AS average_price
FROM taco_sales
GROUP BY taco_type;

INSERT INTO taco_sales VALUES (31, '2024-07-01 14:30:00', 'Beef', 3, 2.50, 120, '{\"spicy\": true, \"extra_cheese\": true, \"no_cilantro\": true}', ['lettuce', 'cheese', 'salsa']);
INSERT INTO taco_sales VALUES (32, '2024-07-01 14:35:00', 'Chicken', 6, 3.00, 121, '{\"spicy\": false, \"extra_cheese\": false, \"gluten_free\": true}', ['lettuce', 'guacamole', 'salsa']);
INSERT INTO taco_sales VALUES (33, '2024-07-01 14:40:00', 'Veggie', 4, 2.00, 122, '{\"spicy\": true, \"extra_cheese\": false, \"no_sour_cream\": true}', ['tomato', 'onion', 'cheese']);
INSERT INTO taco_sales VALUES (34, '2024-07-01 14:45:00', 'Fish', 3, 3.50, 123, '{\"spicy\": false, \"extra_cheese\": true, \"extra_avocado\": true}', ['lettuce', 'tomato', 'avocado']);
INSERT INTO taco_sales VALUES (35, '2024-07-01 14:50:00', 'Beef', 5, 2.50, 124, '{\"spicy\": true, \"extra_cheese\": false, \"no_tomato\": true}', ['cheese', 'sour cream', 'onion']);

You’ll notice here we use the avgState() function instead of just the avg().

If we just make this query:

SELECT * from taco_sales_aggregated FINAL

We’ll get:

┌─taco_type─┬─order_count─┬─total_quantity─┬─total_sales─┬─average_price───┐
│ Fish      │ 1310.5        │ \0\0\0\0\0\0\f@ │
│ Veggie    │ 148           │ \0\0\0\0\0\0\0@ │
│ Chicken   │ 1618          │ \0\0\0\0\0\0\b@ │
│ Beef      │ 2820          │ \0\0\0\0\0\0@   │
└───────────┴─────────────┴────────────────┴─────────────┴─────────────────┘

Notice the average_price has some weird values. That is because it is an aggregate type, and we need to use the avgMerge() function to read it.

SELECT
  taco_type,
  order_count,
  total_quantity,
  total_sales,
  avgMerge(average_price) as average_price
FROM taco_sales_aggregated FINAL
GROUP BY taco_type, order_count, total_quantity, total_sales

This now returns:

┌─taco_type─┬─order_count─┬─total_quantity─┬─total_sales─┬─average_price─┐
│ Chicken   │ 16183│ Beef      │ 28202.5│ Veggie    │ 1482│ Fish      │ 1310.53.5└───────────┴─────────────┴────────────────┴─────────────┴───────────────┘

Now we have our incrementally aggregated averages.

There are a couple of things you need to know when working with Aggregate Functions:

  • When using the SummmingMergeTree COUNT and SUM don’t need aggregate functions.
  • Both the SummingMergeTree and the AggregatingMergeTree support -State() functions to write and -Merge() functions to read.
  • Given the two points above, we recommend always using the SummingMergeTree for incremental aggregations.
  • You need to include the FINAL to ensure you get the fully aggregated results.

Formatting for analytics

Data stored in ClickHouse may not always be in a format suitable for analytics. Materialized views can transform and format data to make it more analytics-friendly. For example, if you wanted to extract relevant parts of a date (e.g., year, month) for easier aggregation:

The target table would look like:

CREATE TABLE taco_sales_formatted
(
    sale_id UInt32,
    sale_timestamp DateTime,
    taco_type String,
    quantity UInt32,
    price_per_taco Float32,
    customer_id UInt32,
    special_instructions String,
    toppings Array(String),
    sale_date Date,
    sale_year UInt16,
    sale_month UInt8
) ENGINE = MergeTree()
ORDER BY sale_timestamp;

And the materialized view:

CREATE MATERIALIZED VIEW taco_sales_formatted_mv
TO taco_sales_formatted
AS 
SELECT
  sale_id,
  sale_timestamp,
  taco_type,
  quantity,
  price_per_taco,
  customer_id,
  special_instructions,
  toppings,
  toDate(sale_timestamp) AS sale_date,
  toYear(sale_timestamp) AS sale_year,
  toMonth(sale_timestamp) AS sale_month
FROM taco_sales;

-- INSERT into the source table
INSERT INTO taco_sales VALUES (11, '2024-07-01 12:50:00', 'Beef', 4, 2.50, 110, '{\"spicy\": true, \"extra_cheese\": false, \"no_onions\": true}', ['lettuce', 'cheese', 'salsa']);
INSERT INTO taco_sales VALUES (12, '2024-07-01 12:55:00', 'Chicken', 5, 3.00, 111, '{\"spicy\": false, \"extra_cheese\": true, \"gluten_free\": true}', ['lettuce', 'guacamole']);
INSERT INTO taco_sales VALUES (13, '2024-07-01 13:00:00', 'Veggie', 3, 2.00, 112, '{\"spicy\": true, \"extra_cheese\": true, \"no_sour_cream\": true}', ['tomato', 'onion', 'cheese']);
INSERT INTO taco_sales VALUES (14, '2024-07-01 13:05:00', 'Fish', 2, 3.50, 113, '{\"spicy\": false, \"extra_cheese\": false, \"extra_avocado\": true}', ['lettuce', 'tomato', 'avocado']);
INSERT INTO taco_sales VALUES (15, '2024-07-01 13:10:00', 'Beef', 1, 2.50, 114, '{\"spicy\": true, \"extra_cheese\": true, \"no_tomato\": true}', ['cheese', 'sour cream']);
INSERT INTO taco_sales VALUES (16, '2024-07-01 13:15:00', 'Chicken', 4, 3.00, 115, '{\"spicy\": false, \"extra_cheese\": false, \"extra_lettuce\": true}', ['lettuce', 'tomato', 'salsa']);
INSERT INTO taco_sales VALUES (17, '2024-07-01 13:20:00', 'Veggie', 2, 2.00, 116, '{\"spicy\": true, \"extra_cheese\": false, \"no_avocado\": true}', ['lettuce', 'avocado']);
INSERT INTO taco_sales VALUES (18, '2024-07-01 13:25:00', 'Fish', 5, 3.50, 117, '{\"spicy\": false, \"extra_cheese\": true, \"no_cilantro\": true}', ['lettuce', 'onion', 'sour cream']);
INSERT INTO taco_sales VALUES (19, '2024-07-01 13:30:00', 'Beef', 3, 2.50, 118, '{\"spicy\": true, \"extra_cheese\": false, \"extra_salsa\": true}', ['tomato', 'cheese']);
INSERT INTO taco_sales VALUES (20, '2024-07-01 13:35:00', 'Chicken', 2, 3.00, 119, '{\"spicy\": false, \"extra_cheese\": true, \"extra_sour_cream\": true}', ['lettuce', 'sour cream']);

When we query the taco_sales_formatted target table:

SELECT * FROM taco_sales_formatted

We get the formatted sale_date, sale_year, and sale_month:

┌─sale_id─┬─sale_timestamp───────┬─taco_type─┬─quantity─┬─price_per_taco─┬─customer_id─┬─special_instructions───────────────────────────────────────────────┬─toppings─────────────────────────┬─sale_date──┬─sale_year─┬─sale_month─┐
202024-07-01 13:35:00  │ Chicken   │ 23119         │ {"spicy": false, "extra_cheese": true, "extra_sour_cream": true}   │ ['lettuce','sour cream']         │ 2024-07-0120247192024-07-01 13:30:00  │ Beef      │ 32.5118         │ {"spicy": true, "extra_cheese": false, "extra_salsa": true}        │ ['tomato','cheese']              │ 2024-07-0120247112024-07-01 12:50:00  │ Beef      │ 42.5110         │ {"spicy": true, "extra_cheese": false, "no_onions": true}          │ ['lettuce','cheese','salsa']     │ 2024-07-0120247122024-07-01 12:55:00  │ Chicken   │ 53111         │ {"spicy": false, "extra_cheese": true, "gluten_free": true}        │ ['lettuce','guacamole']          │ 2024-07-0120247132024-07-01 13:00:00  │ Veggie    │ 32112         │ {"spicy": true, "extra_cheese": true, "no_sour_cream": true}       │ ['tomato','onion','cheese']      │ 2024-07-0120247142024-07-01 13:05:00  │ Fish      │ 23.5113         │ {"spicy": false, "extra_cheese": false, "extra_avocado": true}     │ ['lettuce','tomato','avocado']   │ 2024-07-0120247152024-07-01 13:10:00  │ Beef      │ 12.5114         │ {"spicy": true, "extra_cheese": true, "no_tomato": true}           │ ['cheese','sour cream']          │ 2024-07-0120247162024-07-01 13:15:00  │ Chicken   │ 43115         │ {"spicy": false, "extra_cheese": false, "extra_lettuce": true}     │ ['lettuce','tomato','salsa']     │ 2024-07-0120247172024-07-01 13:20:00  │ Veggie    │ 22116         │ {"spicy": true, "extra_cheese": false, "no_avocado": true}         │ ['lettuce','avocado']            │ 2024-07-0120247182024-07-01 13:25:00  │ Fish      │ 53.5117         │ {"spicy": false, "extra_cheese": true, "no_cilantro": true}        │ ['lettuce','onion','sour cream'] │ 2024-07-0120247└─────────┴──────────────────────┴───────────┴──────────┴────────────────┴─────────────┴────────────────────────────────────────────────────────────────────┴──────────────────────────────────┴────────────┴───────────┴────────────┘

Materialized views are a great tool to format and get your data ready for analytics.

Data filtering and routing

Materialized views can filter and route data into different tables based on specific criteria. This is particularly useful for scenarios where you want to separate data into multiple tables for easier management and faster queries.

For example, let's say we want to route taco sales data into separate tables based on the type of taco.

For this example, we’ll create a beef_taco_sales table and use a materialized view to filter and route the data accordingly.

CREATE TABLE beef_taco_sales
(
    sale_id UInt32,
    sale_timestamp DateTime,
    taco_type String,
    quantity UInt32,
    price_per_taco Float32,
    customer_id UInt32,
    special_instructions String,
    toppings Array(String)
) ENGINE = MergeTree()
ORDER BY sale_timestamp;

Next, create materialized views to route the data into the appropriate tables:

CREATE MATERIALIZED VIEW beef_taco_sales_mv
TO beef_taco_sales
AS
SELECT
    sale_id,
    sale_timestamp,
    taco_type,
    quantity,
    price_per_taco,
    customer_id,
    special_instructions,
    toppings
FROM taco_sales
WHERE taco_type = 'Beef';

With these materialized views in place, any new data inserted into the taco_sales table will be automatically routed to the appropriate target table based on the type of taco.

INSERT INTO taco_sales VALUES (36, '2024-07-01 14:55:00', 'Beef', 3, 2.50, 125, '{"spicy": true, "extra_cheese": false, "no_cilantro": true}', ['lettuce', 'cheese', 'salsa']);
INSERT INTO taco_sales VALUES (37, '2024-07-01 15:00:00', 'Chicken', 4, 3.00, 126, '{"spicy": false, "extra_cheese": true, "gluten_free": true}', ['lettuce', 'guacamole', 'salsa']);
INSERT INTO taco_sales VALUES (38, '2024-07-01 15:05:00', 'Veggie', 5, 2.00, 127, '{"spicy": true, "extra_cheese": false, "no_sour_cream": true}', ['tomato', 'onion', 'cheese']);
INSERT INTO taco_sales VALUES (39, '2024-07-01 15:10:00', 'Fish', 2, 3.50, 128, '{"spicy": false, "extra_cheese": true, "extra_avocado": true}', ['lettuce', 'tomato', 'avocado']);

You can now query each target table to see the filtered and routed data:

SELECT * FROM beef_taco_sales;

Resulting in:

┌─sale_id─┬─sale_timestamp───────┬─taco_type─┬─quantity─┬─price_per_taco─┬─customer_id─┬─special_instructions───────────────────────────────────────────┬─toppings──────────────────────────────┐
362024-07-01 14:55:00  │ Beef      │ 32.5125         │ {"spicy": true, "extra_cheese": false, "no_cilantro": true}    │ ['lettuce','cheese','salsa']          │
└─────────┴──────────────────────┴───────────┴──────────┴────────────────┴─────────────┴────────────────────────────────────────────────────────────────┴───────────────────────────────────────┘

By using materialized views for data filtering and routing, you can efficiently manage and query your data based on specific criteria, making your data processing pipelines more organized and performant.

Sliding window aggregations

Sliding window aggregations are useful for analyzing data over a rolling time period. This type of aggregation allows you to compute metrics over a moving window of time, which is particularly valuable in real-time analytics and monitoring scenarios.

In ClickHouse, sliding window aggregations can be implemented using materialized views that aggregate data over a specified time window. Let's walk through an example to illustrate how this can be done.

Suppose we want to compute the total sales and average sales for each taco type over the last 1 hour. We'll create a target table to store these aggregations and then set up a materialized view to update this table with the sliding window calculations.

First, create the target table:

CREATE TABLE taco_sales_sliding_window
(
    taco_type String,
    window_end DateTime,
    total_sales Float32,
    avg_sales AggregateFunction(avg, Float64)
) ENGINE = SummingMergeTree()
ORDER BY (taco_type, window_end);

Next, create a materialized view to perform the sliding window aggregation. We'll use the toStartOfHour function to define our 1-hour window and compute the total and average sales within each window:

CREATE MATERIALIZED VIEW taco_sales_sliding_window_mv
TO taco_sales_sliding_window
AS
SELECT
    taco_type,
    toStartOfHour(sale_timestamp) + INTERVAL 1 HOUR AS window_end,
    sum(quantity * price_per_taco) AS total_sales,
    avgState(quantity * price_per_taco) AS avg_sales
FROM taco_sales
GROUP BY taco_type, window_end;

With this setup, the materialized view will keep the taco_sales_sliding_window table updated with the total and average sales for each taco type over the last 1 hour.

Let's insert some sample data into the taco_sales table to observe how the sliding window aggregations work:

INSERT INTO taco_sales VALUES (40, '2024-07-01 15:20:00', 'Beef', 5, 2.50, 129, '{"spicy": true, "extra_cheese": true}', ['lettuce', 'cheese']);
INSERT INTO taco_sales VALUES (41, '2024-07-01 15:25:00', 'Chicken', 3, 3.00, 130, '{"spicy": false, "extra_cheese": false}', ['lettuce', 'salsa']);
INSERT INTO taco_sales VALUES (42, '2024-07-01 15:30:00', 'Veggie', 4, 2.00, 131, '{"spicy": true, "extra_cheese": true}', ['tomato', 'onion', 'cheese']);
INSERT INTO taco_sales VALUES (43, '2024-07-01 15:35:00', 'Fish', 2, 3.50, 132, '{"spicy": false, "extra_cheese": false}', ['lettuce', 'tomato']);
INSERT INTO taco_sales VALUES (44, '2024-07-01 15:40:00', 'Beef', 2, 2.50, 133, '{"spicy": true, "extra_cheese": false}', ['lettuce']);
INSERT INTO taco_sales VALUES (45, '2024-07-01 15:45:00', 'Chicken', 1, 3.00, 134, '{"spicy": false, "extra_cheese": true}', ['sour cream']);
INSERT INTO taco_sales VALUES (46, '2024-07-01 15:50:00', 'Veggie', 3, 2.00, 135, '{"spicy": true, "extra_cheese": false}', ['lettuce', 'tomato']);
INSERT INTO taco_sales VALUES (47, '2024-07-01 15:55:00', 'Fish', 4, 3.50, 136, '{"spicy": false, "extra_cheese": true}', ['cheese']);
INSERT INTO taco_sales VALUES (48, '2024-07-01 16:00:00', 'Beef', 1, 2.50, 137, '{"spicy": true, "extra_cheese": true}', ['lettuce', 'salsa']);

After inserting the data, when you query the taco_sales_sliding_window table, you will get the aggregated results for the last 1-hour windows:

SELECT 
  taco_type,
  window_end,
  total_sales,
  avgMerge(avg_sales) as avg_sale_hour
FROM taco_sales_sliding_window FINAL
GROUP BY taco_type, window_end, total_sales
ORDER BY window_end DESC;

This will output:

┌─taco_type─┬─window_end─────────────┬─total_sales─┬─avg_sale_hour─┐
│ Beef      │ 2024-07-01 17:00:002.52.5│ Veggie    │ 2024-07-01 16:00:00147│ Chicken   │ 2024-07-01 16:00:00126│ Beef      │ 2024-07-01 16:00:0017.58.75│ Fish      │ 2024-07-01 16:00:002110.5└───────────┴────────────────────────┴─────────────┴───────────────┘

By using sliding window aggregations, you can maintain up-to-date metrics over a rolling time window, providing valuable insights for real-time analytics.

Enriching and denormalizing data with joins

Materialized views in ClickHouse can also be used to enrich and denormalize data by joining it with other tables. This is particularly useful when you want to combine data from multiple sources into a single table for easier querying and analysis.

Let's consider an example where we want to enrich our taco sales data with customer information. We have two tables: taco_sales and customers.

The customers table contains customer details such as customer_id, customer_name, and customer_email:

CREATE TABLE customers
(
    customer_id UInt32,
    customer_name String,
    customer_email String
) ENGINE = MergeTree()
ORDER BY customer_id;

We want to create a new enriched table that combines the taco sales data with customer information. First, let's create the target table:

CREATE TABLE enriched_taco_sales
(
    sale_id UInt32,
    sale_timestamp DateTime,
    taco_type String,
    quantity UInt32,
    price_per_taco Float32,
    customer_id UInt32,
    special_instructions String,
    toppings Array(String),
    customer_name String,
    customer_email String
) ENGINE = MergeTree()
ORDER BY sale_timestamp;

Next, create a materialized view that joins the taco_sales table with the customers table to populate the enriched_taco_sales table:

CREATE MATERIALIZED VIEW enriched_taco_sales_mv
TO enriched_taco_sales
AS
SELECT
    ts.sale_id,
    ts.sale_timestamp,
    ts.taco_type,
    ts.quantity,
    ts.price_per_taco,
    ts.customer_id,
    ts.special_instructions,
    ts.toppings,
    c.customer_name,
    c.customer_email
FROM taco_sales ts
LEFT JOIN customers c ON ts.customer_id = c.customer_id;

Now, let's insert some sample data into the customers table:

INSERT INTO customers VALUES (110, 'John Doe', 'john.doe@example.com');
INSERT INTO customers VALUES (111, 'Jane Smith', 'jane.smith@example.com');
INSERT INTO customers VALUES (112, 'Alice Johnson', 'alice.johnson@example.com');
INSERT INTO customers VALUES (113, 'Bob Brown', 'bob.brown@example.com');

And insert some sample data into the taco_sales table:

INSERT INTO taco_sales VALUES (50, '2024-07-01 16:10:00', 'Beef', 2, 2.50, 110, '{"spicy": true, "extra_cheese": false}', ['lettuce', 'cheese']);
INSERT INTO taco_sales VALUES (51, '2024-07-01 16:15:00', 'Chicken', 3, 3.00, 111, '{"spicy": false, "extra_cheese": true}', ['lettuce', 'salsa']);
INSERT INTO taco_sales VALUES (52, '2024-07-01 16:20:00', 'Veggie', 4, 2.00, 112, '{"spicy": true, "extra_cheese": true}', ['tomato', 'onion', 'cheese']);
INSERT INTO taco_sales VALUES (53, '2024-07-01 16:25:00', 'Fish', 2, 3.50, 113, '{"spicy": false, "extra_cheese": false}', ['lettuce', 'tomato']);

When we query the enriched_taco_sales table:

SELECT * FROM enriched_taco_sales;

We get the enriched data with customer information:

┌─sale_id─┬─sale_timestamp───────┬─taco_type─┬─quantity─┬─price_per_taco─┬─customer_id─┬─special_instructions──────────────────────────────────┬─toppings─────────────────────┬─customer_name─┬─customer_email───────────────┐
532024-07-01 16:25:00  │ Fish      │ 23.5113         │ {"spicy": false, "extra_cheese": false}               │ ['lettuce','tomato']         │ Bob Brown     │ bob.brown@example.com        │
512024-07-01 16:15:00  │ Chicken   │ 33111         │ {"spicy": false, "extra_cheese": true}                │ ['lettuce','salsa']          │ Jane Smith    │ jane.smith@example.com       │
522024-07-01 16:20:00  │ Veggie    │ 42112         │ {"spicy": true, "extra_cheese": true}                 │ ['tomato','onion','cheese']  │ Alice Johnson │ alice.johnson@example.com    │
502024-07-01 16:10:00  │ Beef      │ 22.5110         │ {"spicy": true, "extra_cheese": false}                │ ['lettuce','cheese']         │ John Doe      │ john.doe@example.com         │
└─────────┴──────────────────────┴───────────┴──────────┴────────────────┴─────────────┴───────────────────────────────────────────────────────┴──────────────────────────────┴───────────────┴──────────────────────────────┘

By using materialized views to join tables, you can efficiently enrich and denormalize your data, making it more valuable and easier to analyze.

Flattening JSON

Materialized views in ClickHouse can also be used to flatten JSON data stored in columns, making it easier to query and analyze. Let's see an example where we flatten the special_instructions JSON column in the taco_sales table.

First, we'll create a new table to store the flattened data:

CREATE TABLE flattened_taco_sales
(
    sale_id UInt32,
    sale_timestamp DateTime,
    taco_type String,
    quantity UInt32,
    price_per_taco Float32,
    customer_id UInt32,
    spicy UInt8,
    extra_cheese UInt8,
    no_sour_cream UInt8,
    extra_avocado UInt8,
    no_tomato UInt8,
    extra_salsa UInt8
) ENGINE = MergeTree()
ORDER BY sale_timestamp;

Next, we'll create a materialized view to extract and flatten the JSON fields from the special_instructions column into separate columns:

CREATE MATERIALIZED VIEW flattened_taco_sales_mv
TO flattened_taco_sales
AS
SELECT
    sale_id,
    sale_timestamp,
    taco_type,
    quantity,
    price_per_taco,
    customer_id,
    JSONExtractBool(special_instructions, 'spicy') AS spicy,
    JSONExtractBool(special_instructions, 'extra_cheese') AS extra_cheese,
    JSONExtractBool(special_instructions, 'no_sour_cream') AS no_sour_cream,
    JSONExtractBool(special_instructions, 'extra_avocado') AS extra_avocado,
    JSONExtractBool(special_instructions, 'no_tomato') AS no_tomato,
    JSONExtractBool(special_instructions, 'extra_salsa') AS extra_salsa
FROM taco_sales;

Let's insert some sample data into the taco_sales table to see how the flattening works:

INSERT INTO taco_sales VALUES (54, '2024-07-01 16:30:00', 'Beef', 1, 2.50, 138, '{"spicy": true, "extra_cheese": true, "no_sour_cream": false}', ['lettuce', 'cheese']);
INSERT INTO taco_sales VALUES (55, '2024-07-01 16:35:00', 'Chicken', 2, 3.00, 139, '{"spicy": false, "extra_cheese": false, "extra_avocado": true}', ['lettuce', 'salsa']);
INSERT INTO taco_sales VALUES (56, '2024-07-01 16:40:00', 'Veggie', 3, 2.00, 140, '{"spicy": true, "extra_cheese": false, "no_tomato": true}', ['tomato', 'onion', 'cheese']);
INSERT INTO taco_sales VALUES (57, '2024-07-01 16:45:00', 'Fish', 4, 3.50, 141, '{"spicy": false, "extra_cheese": true, "extra_salsa": true}', ['lettuce', 'tomato']);

When we query the flattened_taco_sales table:

SELECT * FROM flattened_taco_sales;

We get the flattened data:

┌─sale_id─┬─sale_timestamp───────┬─taco_type─┬─quantity─┬─price_per_taco─┬─customer_id─┬─spicy─┬─extra_cheese─┬─no_sour_cream─┬─extra_avocado─┬─no_tomato─┬─extra_salsa─┐
542024-07-01 16:30:00  │ Beef      │ 12.5138110000552024-07-01 16:35:00  │ Chicken   │ 23139000100562024-07-01 16:40:00  │ Veggie    │ 32140100010572024-07-01 16:45:00  │ Fish      │ 43.5141010001└─────────┴──────────────────────┴───────────┴──────────┴────────────────┴─────────────┴───────┴──────────────┴───────────────┴───────────────┴────────────┴────────────┘

By using materialized views to flatten JSON data, you can make your data more accessible and easier to work with for querying and analysis.

Building entity tables from CDC events

Change Data Capture (CDC) is a technique used to track and record changes in a database. In ClickHouse, materialized views can be utilized to build entity tables from CDC events, allowing you to maintain up-to-date tables that reflect the current state of your entities based on incoming change events.

Let's walk through an example where we build and maintain an orders entity table from CDC events.

First, we'll create a table to store the raw CDC events. Each event will have a type (INSERT, UPDATE, or DELETE), an order_id, and the associated order details.

CREATE TABLE cdc_events
(
    event_type Enum('INSERT' = 1, 'UPDATE' = 2, 'DELETE' = 3),
    order_id UInt32,
    order_timestamp DateTime,
    customer_id UInt32,
    order_amount Float32,
    order_status String
) ENGINE = MergeTree()
ORDER BY (order_id, order_timestamp);

Next, we'll create the target orders table, which will store the most recent state of each order.

CREATE TABLE orders
(
    order_id UInt32,
    order_timestamp DateTime,
    customer_id UInt32,
    order_amount Float32,
    order_status String
) ENGINE = ReplacingMergeTree(order_timestamp)
ORDER BY order_id;

Now, we'll create a materialized view that processes the cdc_events table and updates the orders table accordingly.  The order_id in the ORDER BY clause will be the table sorting key and will determine row uniqueness. The order_timestamp tells the ReplacingMergeTree to consider the row with the most recent  order_timestamp as the last version of the row.

The materialized view will handle INSERT and UPDATE events by upserting the order details and will delete rows for DELETE events.

CREATE MATERIALIZED VIEW orders_mv
TO orders
AS
SELECT
    order_id,
    order_timestamp,
    customer_id,
    order_amount,
    order_status
FROM cdc_events
WHERE event_type IN ('INSERT', 'UPDATE');

To handle DELETE events, we'll create a separate materialized view that marks the rows as deleted in the orders table by setting a special order_status.

CREATE MATERIALIZED VIEW delete_orders_mv
TO orders
AS
SELECT
    order_id,
    order_timestamp,
    customer_id,
    order_amount,
    'DELETED' AS order_status
FROM cdc_events
WHERE event_type = 'DELETE';

Let's insert some sample CDC events into the cdc_events table to see how the entity table is built and maintained.

-- Insert new orders
INSERT INTO cdc_events VALUES ('INSERT', 1, '2024-07-01 10:00:00', 1001, 50.00, 'PROCESSING');
INSERT INTO cdc_events VALUES ('INSERT', 2, '2024-07-01 10:05:00', 1002, 75.00, 'PROCESSING');

-- Update an existing order
INSERT INTO cdc_events VALUES ('UPDATE', 1, '2024-07-01 10:10:00', 1001, 55.00, 'SHIPPED');

-- Delete an order
INSERT INTO cdc_events VALUES ('DELETE', 2, '2024-07-01 10:15:00', 1002, 75.00, 'PROCESSING');

When we query the orders table:

SELECT * FROM orders FINAL;

We get the current state of the orders:

┌─order_id─┬─order_timestamp───────┬─customer_id─┬─order_amount─┬─order_status─┐
12024-07-01 10:10:00100155.0 │ SHIPPED      │
22024-07-01 10:15:00100275.0 │ DELETED      │
└──────────┴───────────────────────┴─────────────┴──────────────┴──────────────┘

By using materialized views to process CDC events and maintain entity tables, you can ensure that your tables always reflect the latest state of your entities, making it easier to query and analyze your data.

Creating derived columns

Derived columns are calculated based on existing columns in a table. They can be useful for adding new metrics, normalizing data, or enhancing data for better analysis. In ClickHouse, you can create derived columns using materialized views.

Let's walk through an example where we add a derived column to calculate the total sale amount for each order in the taco_sales table.

Let's create a new table to store the data with the derived column total_sale_amount:

CREATE TABLE taco_sales_with_total
(
    sale_id UInt32,
    sale_timestamp DateTime,
    taco_type String,
    quantity UInt32,
    price_per_taco Float32,
    customer_id UInt32,
    special_instructions String,
    toppings Array(String),
    total_sale_amount Float32
) ENGINE = MergeTree()
ORDER BY sale_timestamp;

Now, we'll create a materialized view that calculates the total_sale_amount column by multiplying the quantity by the price_per_taco:

CREATE MATERIALIZED VIEW taco_sales_with_total_mv
TO taco_sales_with_total
AS
SELECT
    sale_id,
    sale_timestamp,
    taco_type,
    quantity,
    price_per_taco,
    customer_id,
    special_instructions,
    toppings,
    quantity * price_per_taco AS total_sale_amount
FROM taco_sales;

Let's insert some sample data into the taco_sales table:

INSERT INTO taco_sales VALUES (58, '2024-07-01 17:00:00', 'Beef', 3, 2.50, 142, '{"spicy": true, "extra_cheese": false}', ['lettuce', 'cheese']);
INSERT INTO taco_sales VALUES (59, '2024-07-01 17:05:00', 'Chicken', 2, 3.00, 143, '{"spicy": false, "extra_cheese": true}', ['lettuce', 'salsa']);
INSERT INTO taco_sales VALUES (60, '2024-07-01 17:10:00', 'Veggie', 4, 2.00, 144, '{"spicy": true, "extra_cheese": true}', ['tomato', 'onion', 'cheese']);
INSERT INTO taco_sales VALUES (61, '2024-07-01 17:15:00', 'Fish', 1, 3.50, 145, '{"spicy": false, "extra_cheese": false}', ['lettuce', 'tomato']);

When we query the taco_sales_with_total table:

SELECT * FROM taco_sales_with_total;

We get the data with the derived total_sale_amount column:

┌─sale_id─┬─sale_timestamp───────┬─taco_type─┬─quantity─┬─price_per_taco─┬─customer_id─┬─special_instructions─────────────────────────────────┬─toppings─────────────────────┬─total_sale_amount─┐
582024-07-01 17:00:00  │ Beef      │ 32.5142         │ {"spicy": true, "extra_cheese": false}               │ ['lettuce','cheese']         │ 7.5592024-07-01 17:05:00  │ Chicken   │ 23143         │ {"spicy": false, "extra_cheese": true}               │ ['lettuce','salsa']          │ 6.0602024-07-01 17:10:00  │ Veggie    │ 42144         │ {"spicy": true, "extra_cheese": true}                │ ['tomato','onion','cheese']  │ 8.0612024-07-01 17:15:00  │ Fish      │ 13.5145         │ {"spicy": false, "extra_cheese": false}              │ ['lettuce','tomato']         │ 3.5└─────────┴──────────────────────┴───────────┴──────────┴────────────────┴─────────────┴──────────────────────────────────────────────────────┴──────────────────────────────┴───────────────────┘

By creating derived columns with materialized views, you can enrich your data and make it more useful for analysis and reporting.

Deduplicating data

In ClickHouse, deduplicating data can be achieved efficiently using materialized views. Deduplication is useful when you want to ensure that your data does not contain duplicate records, which can occur due to various reasons such as multiple data imports or erroneous data entry.

Let's walk through an example where we deduplicate taco sales data based on a unique combination of sale_id and sale_timestamp.

First, let's create a table to store the raw taco sales data, which may contain duplicates:

CREATE TABLE raw_taco_sales
(
    sale_id UInt32,
    sale_timestamp DateTime,
    taco_type String,
    quantity UInt32,
    price_per_taco Float32,
    customer_id UInt32,
    special_instructions String,
    toppings Array(String)
) ENGINE = MergeTree()
ORDER BY sale_timestamp;

Next, we'll create a deduplicated version of the taco sales table, ensuring that each combination of sale_id and sale_timestamp is unique:

CREATE TABLE deduplicated_taco_sales
(
    sale_id UInt32,
    sale_timestamp DateTime,
    taco_type String,
    quantity UInt32,
    price_per_taco Float32,
    customer_id UInt32,
    special_instructions String,
    toppings Array(String)
) ENGINE = ReplacingMergeTree()
ORDER BY (sale_id, sale_timestamp);

Now, we'll create a materialized view that inserts data into the deduplicated_taco_sales table, effectively deduplicating the data as it is inserted:

CREATE MATERIALIZED VIEW deduplicated_taco_sales_mv
TO deduplicated_taco_sales
AS
SELECT
    sale_id,
    sale_timestamp,
    taco_type,
    quantity,
    price_per_taco,
    customer_id,
    special_instructions,
    toppings
FROM raw_taco_sales;

Let's insert some sample data into the raw_taco_sales table, including some duplicate records:

INSERT INTO raw_taco_sales VALUES (62, '2024-07-01 18:00:00', 'Beef', 2, 2.50, 146, '{"spicy": true, "extra_cheese": false}', ['lettuce', 'cheese']);
INSERT INTO raw_taco_sales VALUES (63, '2024-07-01 18:05:00', 'Chicken', 1, 3.00, 147, '{"spicy": false, "extra_cheese": true}', ['lettuce', 'salsa']);
INSERT INTO raw_taco_sales VALUES (62, '2024-07-01 18:00:00', 'Beef', 2, 2.50, 146, '{"spicy": true, "extra_cheese": false}', ['lettuce', 'cheese']); -- Duplicate
INSERT INTO raw_taco_sales VALUES (64, '2024-07-01 18:10:00', 'Veggie', 3, 2.00, 148, '{"spicy": true, "extra_cheese": true}', ['tomato', 'onion', 'cheese']);
INSERT INTO raw_taco_sales VALUES (64, '2024-07-01 18:10:00', 'Veggie', 3, 2.00, 148, '{"spicy": true, "extra_cheese": true}', ['tomato', 'onion', 'cheese']); -- Duplicate

When we query the deduplicated_taco_sales table:

SELECT * FROM deduplicated_taco_sales FINAL ORDER BY sale_timestamp;

We get the deduplicated data:

┌─sale_id─┬─sale_timestamp───────┬─taco_type─┬─quantity─┬─price_per_taco─┬─customer_id─┬─special_instructions─────────────────────────────────┬─toppings─────────────────────┐
622024-07-01 18:00:00  │ Beef      │ 22.5146         │ {"spicy": true, "extra_cheese": false}               │ ['lettuce','cheese']         │
632024-07-01 18:05:00  │ Chicken   │ 13147         │ {"spicy": false, "extra_cheese": true}               │ ['lettuce','salsa']          │
642024-07-01 18:10:00  │ Veggie    │ 32148         │ {"spicy": true, "extra_cheese": true}                │ ['tomato','onion','cheese']  │
└─────────┴──────────────────────┴───────────┴──────────┴────────────────┴─────────────┴──────────────────────────────────────────────────────┴──────────────────────────────┘

By using materialized views with ReplacingMergeTree, you can effectively deduplicate your data, ensuring that your tables contain unique records based on specified criteria.

Reordering tables with a new sorting key

In ClickHouse, the order of data within a table can significantly impact query performance. By reordering tables with a new sorting key, you can optimize data retrieval and improve the efficiency of your queries. The sorting key determines the order in which data is stored on disk, and changing it can help to better align with typical query patterns.

Let's walk through an example of how to reorder a table with a new sorting key.

To reorder the table with a new sorting key, we'll first create a new table with the desired sorting key. In this example, we'll reorder the table by customer_id and sale_timestamp:

CREATE TABLE taco_sales_reordered
(
    sale_id UInt32,
    sale_timestamp DateTime,
    taco_type String,
    quantity UInt32,
    price_per_taco Float32,
    customer_id UInt32,
    special_instructions String,
    toppings Array(String)
) ENGINE = MergeTree()
ORDER BY (customer_id, sale_timestamp);

Next, we'll create a materialized view to insert data from the original taco_sales table into the newly reordered table:

CREATE MATERIALIZED VIEW taco_sales_reordered_mv
TO taco_sales_reordered
AS
SELECT
    sale_id,
    sale_timestamp,
    taco_type,
    quantity,
    price_per_taco,
    customer_id,
    special_instructions,
    toppings
FROM taco_sales;

Now, let's insert some sample data into the taco_sales table:

INSERT INTO taco_sales VALUES (70, '2024-07-01 17:00:00', 'Beef', 2, 2.50, 150, '{"spicy": true, "extra_cheese": false}', ['lettuce', 'cheese']);
INSERT INTO taco_sales VALUES (71, '2024-07-01 17:05:00', 'Chicken', 1, 3.00, 151, '{"spicy": false, "extra_cheese": true}', ['lettuce', 'salsa']);
INSERT INTO taco_sales VALUES (72, '2024-07-01 17:10:00', 'Veggie', 4, 2.00, 152, '{"spicy": true, "extra_cheese": true}', ['tomato', 'onion', 'cheese']);
INSERT INTO taco_sales VALUES (73, '2024-07-01 17:15:00', 'Fish', 1, 3.50, 153, '{"spicy": false, "extra_cheese": false}', ['lettuce', 'tomato']);

Finally, we can query the taco_sales_reordered table to see the reordered data:

SELECT * FROM taco_sales_reordered;

We get the data reordered by customer_id and sale_timestamp:

┌─customer_id─┬─sale_timestamp───────┬─sale_id─┬─taco_type─┬─quantity─┬─price_per_taco─┬─special_instructions────────────────────────────┬─toppings──────────────────────┐
1532024-07-01 17:15:0073      │ Fish      │ 13.5            │ {"spicy": false, "extra_cheese": false}         │ ['lettuce','tomato']          │
1522024-07-01 17:10:0072      │ Veggie    │ 42              │ {"spicy": true, "extra_cheese": true}           │ ['tomato','onion','cheese']   │
1502024-07-01 17:00:0070      │ Beef      │ 22.5            │ {"spicy": true, "extra_cheese": false}          │ ['lettuce','cheese']          │
1512024-07-01 17:05:0071      │ Chicken   │ 13              │ {"spicy": false, "extra_cheese": true}          │ ['lettuce','salsa']           │
└─────────────┴──────────────────────┴─────────┴───────────┴──────────┴────────────────┴─────────────────────────────────────────────────┴───────────────────────────────┘

By reordering tables with a new sorting key, you can optimize ClickHouse for your specific query patterns, leading to faster and more efficient data retrieval.

Performance considerations and optimization

This section will cover best practices and strategies for enhancing materialized view performance, managing resources efficiently, and ensuring ClickHouse operates at peak efficiency.

Impact on insert performance and mitigation strategies

Materialized views can affect insert performance due to the extra processing required during data ingestion. Each insert into the source table triggers an update in the materialized view's target table, potentially increasing overall insert latency.

To address this impact, consider the following strategies:

  1. Simplify the materialized view's SELECT query. Avoid using expensive operations like JOINs or complex aggregations unless absolutely necessary.
  2. Use the insert_deduplication setting to prevent duplicate inserts into the materialized view, reducing unnecessary updates and improving insert performance.
  3. Adjust the max_insert_threads setting to control the number of background threads used for materialized view updates. Increasing this value can help parallelize the update process and improve insert throughput.
  4. Consider using a buffer table to stage incoming data before inserting it into the main table. This allows you to batch inserts and reduce the frequency of materialized view updates.

Choosing the right aggregation granularity

When designing materialized views for aggregations, it's important to choose the right granularity. Aggregating data at too fine a granularity can lead to a large number of rows in the materialized view, impacting query performance and storage usage. Conversely, aggregating at too coarse a granularity may not provide the desired level of detail for analysis.

To find the optimal aggregation granularity, consider the following factors:

  1. Query patterns: Analyze the most common query patterns and their required level of detail. Choose a granularity that satisfies the majority of your queries.
  2. Data volume: Estimate the expected data volume at different aggregation levels. If the data volume is too high at a fine granularity, consider rolling up the data to a coarser level.
  3. Performance requirements: Evaluate the query performance at different aggregation levels. If the query performance is not satisfactory at a fine granularity, consider aggregating the data further.
  4. Business needs: Consult with business stakeholders to understand their reporting and analysis requirements. Ensure that the chosen granularity meets their needs.

Optimizing storage usage

Materialized views can consume significant storage space, especially when aggregating large datasets. To optimize storage usage, consider the following techniques:

  1. Use appropriate data types: Choose the smallest data type that can accurately represent your data. For example, use UInt8 instead of UInt64 for small integer values. This can significantly reduce storage usage.
  2. Enable compression: ClickHouse supports various compression codecs, such as LZ4, ZSTD, and Delta. Enabling compression can reduce the storage footprint of your materialized views. Choose the appropriate codec based on your data characteristics and performance requirements.
  3. Prune old data: If your materialized views contain time-series data, consider pruning old data using TTL (Time-To-Live) settings. This allows you to automatically remove data that is no longer needed, freeing up storage space.

Monitoring materialized view performance

To ensure the optimal performance of your materialized views, it's essential to monitor their resource usage and performance metrics. ClickHouse provides several system tables that can help you track materialized view performance:

  1. system.views_log: This table contains information about the materialized views' execution, including the query, execution time, and any errors encountered. You can use this table to monitor the performance and health of your materialized views.
  2. system.query_log: This table logs all queries executed by the ClickHouse server, including queries related to materialized view updates. You can use this table to analyze the performance of materialized view queries and identify potential bottlenecks.
  3. system.metrics: This table provides various metrics about the ClickHouse server, including the number of background merges, the number of INSERT queries, and the volume of inserted data. These metrics can help you understand the overall system performance and identify any issues related to materialized views.
  4. system.events: This table tracks various events in the ClickHouse server, such as the number of INSERT queries, the number of SELECT queries, and the number of merges. You can use this table to monitor the activity related to materialized views and identify any anomalies.

By regularly monitoring these system tables, you can proactively identify and address any performance issues related to materialized views.

Handling high-cardinality dimensions

High-cardinality dimensions, such as user IDs or product SKUs, can pose challenges when designing materialized views. Aggregating data at a granular level for high-cardinality dimensions can lead to a large number of rows in the materialized view, which can impact query performance and storage usage.

To handle high-cardinality dimensions in materialized views, consider the following strategies:

  1. Pre-aggregate data: Instead of aggregating data at the most granular level, pre-aggregate the data at a higher level. For example, instead of aggregating sales data by individual user ID, aggregate it by user segment or demographic.
  2. Use dimensionality reduction techniques: Apply dimensionality reduction techniques, such as clustering or principal component analysis (PCA), to reduce the cardinality of the dimensions. This can help compress the data and reduce the number of rows in the materialized view.
  3. Employ data sketches: Use data sketching algorithms, such as HyperLogLog or T-Digest, to approximate aggregations for high-cardinality dimensions. These algorithms provide a trade-off between accuracy and storage usage, allowing you to estimate aggregates without storing the full granularity of the data.
  4. Partition data: You can partition your materialized views based on high-cardinality dimensions. This can help distribute the data across multiple partitions and improve query performance. ClickHouse supports custom partitioning keys that allow you to define partitions based on arbitrary expressions.

By carefully designing your materialized views and employing these strategies, you can effectively handle high-cardinality dimensions and ensure optimal performance and storage usage.

Limitations and alternatives

While materialized views offer numerous benefits and use cases, they may not always be the optimal solution for every scenario. It's essential to understand the limitations of materialized views and consider alternative approaches when dealing with complex transformations, data consistency, and frequently updated data.

Complex transformations and alternative approaches

Materialized views in ClickHouse are designed to handle simple to moderate data transformations, such as aggregations, filtering, and joins. However, when dealing with complex transformations that involve multiple steps or require custom logic, materialized views might not be the most suitable option.

In such cases, alternative approaches like projections or aggregate functions can be more effective. Projections allow you to define a set of columns that are stored separately from the main table, optimized for specific query patterns. Aggregate functions, on the other hand, enable you to perform complex calculations and transformations on the data during query execution.

For example, if you need to perform a multi-step transformation that involves multiple aggregate functions and window functions, it might be more efficient to use a combination of projections and aggregate functions instead of a materialized view.

Materialized view maintenance

Materialized views in ClickHouse require careful maintenance, especially when the source table undergoes significant changes like truncation or deletion. If the source table is truncated or deleted, the materialized view will not automatically reflect these changes and may contain stale data.

To handle such scenarios, you need to manually truncate or drop the materialized view and recreate it from scratch. This can be a time-consuming process, especially for large datasets.

Frequently updated data

Materialized views are best suited for data that is relatively stable and does not undergo frequent updates. When dealing with frequently updated data, materialized views can introduce overhead and impact performance.

In such cases, it might be more efficient to use alternative approaches like live views or real-time aggregation using ClickHouse's State combinator. Live views allow you to define a view that is always up-to-date with the latest data, without the need for materialization. Real-time aggregation using the State combinator enables you to perform incremental aggregations on streaming data, reducing the overhead of frequent updates.

Future improvements and feature requests

While materialized views in ClickHouse are already a powerful tool, there is always room for improvement. Some potential future improvements and feature requests include:

  • Improved handling of schema changes in the source table, with automatic propagation to the materialized view
  • Enhanced monitoring and alerting capabilities for materialized view performance and consistency
  • Simplified maintenance and management of materialized views in large-scale, distributed environments

As the ClickHouse community continues to grow and evolve, we can expect to see further enhancements and innovations in the realm of materialized views, addressing the limitations and challenges faced by users today.

Conclusion

Materialized views in ClickHouse® offer a powerful and flexible way to transform your data. By leveraging materialized views, you can achieve real-time data aggregation, deduplication, and transformation, significantly improving query performance and simplifying complex data pipelines. However, it's essential to carefully consider the specific use cases, limitations, and alternative approaches to ensure that you are making the most effective use of this feature.

As you implement materialized views in your ClickHouse® environment, be sure to monitor their performance, manage resources efficiently, and stay informed about best practices and future improvements. By doing so, you can harness the full potential of materialized views to build robust, high-performance data solutions that meet your organization's needs.

In conclusion, materialized views are a valuable addition to your ClickHouse® toolkit, providing opportunities to streamline data processing and enhance query performance. With thoughtful design and ongoing optimization, you can maximize the benefits of materialized views and unlock new possibilities for your data-driven initiatives.

You can get started with Propel's Serverless ClickHouse® today and receive $15 in monthly credits forever. At any point, upgrade to pay-as-you-go, or contact us to learn more about our volume-based discounts. Visit our pricing page for details.

Related posts

What's new in ClickHouse® version 24.10

This is some text inside of a div block.

Heading 1

Heading 2

Heading 3

Heading 4

Heading 5
Heading 6

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.

Block quote

Ordered list

  1. Item 1
  2. Item 2
  3. Item 3

Unordered list

  • Item A
  • Item B
  • Item C

Text link

Bold text

Emphasis

Superscript

Subscript

Funnel analysis in ClickHouse®

This is some text inside of a div block.

Heading 1

Heading 2

Heading 3

Heading 4

Heading 5
Heading 6

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.

Block quote

Ordered list

  1. Item 1
  2. Item 2
  3. Item 3

Unordered list

  • Item A
  • Item B
  • Item C

Text link

Bold text

Emphasis

Superscript

Subscript

What's new in ClickHouse® version 24.9

This is some text inside of a div block.

Heading 1

Heading 2

Heading 3

Heading 4

Heading 5
Heading 6

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.

Block quote

Ordered list

  1. Item 1
  2. Item 2
  3. Item 3

Unordered list

  • Item A
  • Item B
  • Item C

Text link

Bold text

Emphasis

Superscript

Subscript

Start shipping today

Deliver the analytics your customers have been asking for.