Window functions in ClickHouse®

What are they, how they work, examples, and use cases

Propel

Window functions open up new analytical possibilities. They enable users to perform complex calculations and derive insights that were previously difficult or impossible with standard SQL aggregations. With window functions, you can calculate running totals, rank results, identify trends, and more—all without cumbersome subqueries or temporary tables.

In this guide, we'll explore window functions in ClickHouse®. We'll cover the syntax, key concepts, and practical applications through real-world examples.

Introduction to window functions in ClickHouse®

At their core, window functions in ClickHouse® are SQL functions that perform calculations across a set of table rows related to the current row. They operate on a "window" of data, which is a subset of rows defined by a partition and/or an ordering.

This concept of a "window" is crucial. Imagine you have a table with taco order data, and you want to calculate the running total of sales for each taco. With window functions, you can define a window that partitions the data by taco and orders it by date. Then, for each row, the function calculates the sum of sales up to and including that row within its respective partition.

The key advantage of window functions over traditional GROUP BY aggregations is that they return a result for each input row, rather than collapsing the data into a single summary value per group. This allows you to perform complex analytical queries while still maintaining the granularity of the original data.

Some common use cases for window functions include:

Calculating running totals, moving averages, or cumulative sums
Ranking or percentile calculations within a group
Identifying the first or last value within a partition
Comparing values between the current row and previous or subsequent rows

Here's a simple example to illustrate the basic syntax. Suppose you have the following table and data:

-- Step 1: Create the Table
CREATE TABLE taco_orders
(
    taco String,
    sales UInt32,
    date Date
)
ENGINE = MergeTree()
ORDER BY date;

-- Step 2: Insert Sample Data
INSERT INTO taco_orders (taco, date, sales) VALUES ('Al Pastor', '2024-06-01', 10);
INSERT INTO taco_orders (taco, date, sales) VALUES ('Al Pastor', '2024-06-02', 15);
INSERT INTO taco_orders (taco, date, sales) VALUES ('Al Pastor', '2024-06-03', 20);
INSERT INTO taco_orders (taco, date, sales) VALUES ('Carnitas', '2024-06-01', 8);
INSERT INTO taco_orders (taco, date, sales) VALUES ('Carnitas', '2024-06-02', 12);
INSERT INTO taco_orders (taco, date, sales) VALUES ('Carnitas', '2024-06-03', 18);
INSERT INTO taco_orders (taco, date, sales) VALUES ('Barbacoa', '2024-06-01', 5);
INSERT INTO taco_orders (taco, date, sales) VALUES ('Barbacoa', '2024-06-02', 10);
INSERT INTO taco_orders (taco, date, sales) VALUES ('Barbacoa', '2024-06-03', 15);

In this query, the SUM function is applied over a window that partitions the data by taco and orders it by date.

SELECT
    date,
    taco,
    sales,
    SUM(sales) OVER (
	    PARTITION BY taco 
	    ORDER BY date) AS running_total_by_taco
FROM taco_orders;

The result is a new column, running_total_by_taco, which contains the cumulative sum of sales for each taco, calculated up to and including the current row.

┌───────date─┬─taco───────────┬─sales─┬─running_total_by_taco─┐
│ 2024-06-01 │ Beef Taco      │ 10    │ 10                    │
│ 2024-06-02 │ Beef Taco      │ 15    │ 25                    │
│ 2024-06-03 │ Beef Taco      │ 7     │ 32                    │
│ 2024-06-01 │ Chicken Taco   │ 20    │ 20                    │
│ 2024-06-02 │ Chicken Taco   │ 10    │ 30                    │
│ 2024-06-03 │ Chicken Taco   │ 5     │ 35                    │
│ 2024-06-01 │ Veggie Taco    │ 8     │ 8                     │
│ 2024-06-02 │ Veggie Taco    │ 12    │ 20                    │
│ 2024-06-03 │ Veggie Taco    │ 10    │ 30                    │
└────────────┴────────────────┴───────┴───────────────────────┘

If you wanted a running total for all rows, you would just omit the PARTITION BY clause.

SELECT
    date,
    taco,
    sales,
    SUM(sales) OVER (
	    ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
     ) AS running_total
FROM taco_orders;

And you get:

┌───────date─┬─taco──────┬─sales─┬─running_total─┐
│ 2024-06-01 │ Barbacoa  │ 5     │ 5             │
│ 2024-06-01 │ Al Pastor │ 10    │ 15            │
│ 2024-06-01 │ Carnitas  │ 8     │ 23            │
│ 2024-06-02 │ Barbacoa  │ 10    │ 33            │
│ 2024-06-02 │ Al Pastor │ 15    │ 48            │
│ 2024-06-02 │ Carnitas  │ 12    │ 60            │
│ 2024-06-03 │ Barbacoa  │ 15    │ 75            │
│ 2024-06-03 │ Al Pastor │ 20    │ 95            │
│ 2024-06-03 │ Carnitas  │ 18    │ 113           │
└────────────┴───────────┴───────┴───────────────┘

As you can see, window functions provide a powerful and flexible way to analyze data in ClickHouse®. In the following sections, we'll explore the various types of window functions, their syntax, and how to use them effectively in your queries.

Syntax of window functions in ClickHouse®

Window functions in ClickHouse® work on a subset of rows, called a "window frame", defined in relation to the current row. The general syntax for a window function is:

SELECT 
    column1, 
    column2, 
    ..., 
    window_function() OVER (
        PARTITION BY partition_column 
        ORDER BY order_column
    ) 
FROM table_name;

Let's break down each component:

PARTITION BY clause: This is optional and divides the result set into distinct partitions, to which the window function is applied independently. If omitted, the entire result set is treated as a single partition.
ORDER BY clause: This clause is also optional and defines the logical order of rows within each partition. The window function calculations are performed based on this ordering.
frame_clause: This specifies the sliding window of rows that the function operates on. It can be defined using either the ROWS or RANGE syntax.

The ROWS frame defines the window in terms of physical row offsets. For example:

ROWS BETWEEN 3 PRECEDING AND CURRENT ROW

This creates a window that includes the current row and the three rows preceding it.

On the other hand, the RANGE frame defines the window based on a logical range of values. For instance:

RANGE BETWEEN 10 PRECEDING AND CURRENT ROW

This window includes all rows within a ten-day range preceding the current row, based on a date or timestamp column specified in the ORDER BY clause.

It's important to understand the execution model of window functions in ClickHouse®. They are evaluated after the WHERE, GROUP BY, and HAVING clauses, but before the final ORDER BY and LIMIT operations. This means that window functions can access the filtered and grouped result set, but their results can still be further sorted and limited.

Supported window functions in ClickHouse®

ClickHouse® offers a variety of window functions that enable powerful analytical queries. These functions can be broadly categorized into two groups: aggregate functions and ranking functions.

Aggregate functions

All standard ClickHouse® aggregate functions, such as SUM, AVG, and COUNT, can be used as window functions. This allows you to perform calculations across a sliding window of rows, which is particularly useful for time series analysis. For example, to calculate a running total of sales over time, you can use the following query:

SELECT
    date,
    value,
    SUM(value) OVER (ORDER BY date) AS running_total
FROM sales;

This query will output the sales value for each date, along with the cumulative sum of sales up to that date.

┌───────date─┬─taco───────────┬─sales─┬─running_total─┐
│ 2024-06-01 │ Veggie Taco    │ 8     │ 38            │
│ 2024-06-01 │ Beef Taco      │ 10    │ 38            │
│ 2024-06-01 │ Chicken Taco   │ 20    │ 38            │
│ 2024-06-02 │ Beef Taco      │ 15    │ 75            │
│ 2024-06-02 │ Chicken Taco   │ 10    │ 75            │
│ 2024-06-02 │ Veggie Taco    │ 12    │ 75            │
│ 2024-06-03 │ Veggie Taco    │ 10    │ 97            │
│ 2024-06-03 │ Beef Taco      │ 7     │ 97            │
│ 2024-06-03 │ Chicken Taco   │ 5     │ 97            │
│ 2024-06-04 │ Beef Taco      │ 12    │ 97            │
└────────────┴────────────────┴───────┴───────────────┘

‍

Ranking functions

ClickHouse® provides several ranking functions that assign a rank to each row within its partition based on a specified ordering. These functions are particularly useful for identifying top performers or outliers within a group.

ROW_NUMBER()

Assigns a unique integer to each row. For instance, to give a consecutive row number to out taco orders, you can use the following query:

SELECT
    ROW_NUMBER() OVER (ORDER BY date, taco ASC) AS row_number,
    date,
    taco,
    sales
FROM taco_orders
ORDER BY row_number;

This query would result in:

┌─row_number─┬─date-------┬─taco──────┬─sales─┐
│ 1          │ 2024-06-01 │ Al Pastor │ 10    │
│ 2          │ 2024-06-01 │ Barbacoa  │ 5     │
│ 3          │ 2024-06-01 │ Carnitas  │ 8     │
│ 4          │ 2024-06-02 │ Al Pastor │ 15    │
│ 5          │ 2024-06-02 │ Barbacoa  │ 10    │
│ 6          │ 2024-06-02 │ Carnitas  │ 12    │
│ 7          │ 2024-06-03 │ Al Pastor │ 20    │
│ 8          │ 2024-06-03 │ Barbacoa  │ 15    │
│ 9          │ 2024-06-03 │ Carnitas  │ 18    │
└────────────┴────────────┴───────────┴───────┘

‍

RANK()

Assigns a rank to each row within its partition, with gaps for ties. To rank tacos by their sales in descending order, allowing for ties, you can use:

SELECT
    RANK() OVER (ORDER BY sales DESC) AS rank,
    date,
    taco,
    sales
FROM taco_orders;

This would result in:

┌─rank─┬───────date─┬─taco───────────┬─sales─┐
│ 1    │ 2024-06-01 │ Chicken Taco   │ 20    │
│ 2    │ 2024-06-02 │ Beef Taco      │ 15    │
│ 3    │ 2024-06-02 │ Veggie Taco    │ 12    │
│ 4    │ 2024-06-03 │ Veggie Taco    │ 10    │
│ 4    │ 2024-06-01 │ Beef Taco      │ 10    │
│ 4    │ 2024-06-02 │ Chicken Taco   │ 10    │
│ 7    │ 2024-06-01 │ Veggie Taco    │ 8     │
│ 8    │ 2024-06-03 │ Beef Taco      │ 7     │
│ 9    │ 2024-06-03 │ Chicken Taco   │ 5     │
└──────┴────────────┴────────────────┴───────┘

‍

DENSE_RANK()

Similar to RANK(), but without gaps in the ranking sequence. Using the same example as above, but with DENSE_RANK(), consecutive ranks will be assigned in case of ties.

SELECT
    DENSE_RANK() OVER (ORDER BY sales DESC) AS rank,
    date,
    taco,
    sales
FROM taco_orders
ORDER BY rank;

This would result in:

┌─rank─┬───────date─┬─taco───────────┬─sales─┐
│ 1    │ 2024-06-01 │ Chicken Taco   │ 20    │
│ 2    │ 2024-06-02 │ Beef Taco      │ 15    │
│ 3    │ 2024-06-02 │ Veggie Taco    │ 12    │
│ 4    │ 2024-06-03 │ Veggie Taco    │ 10    │
│ 4    │ 2024-06-01 │ Beef Taco      │ 10    │
│ 4    │ 2024-06-02 │ Chicken Taco   │ 10    │
│ 5    │ 2024-06-01 │ Veggie Taco    │ 8     │
│ 6    │ 2024-06-03 │ Beef Taco      │ 7     │
│ 7    │ 2024-06-03 │ Chicken Taco   │ 5     │
└──────┴────────────┴────────────────┴───────┘

‍

Value functions

ClickHouse® also offers functions that allow you to access specific values within a window frame.

FIRST_VALUE()

Returns the first value in the window frame. This is handy for comparing the current value to the initial value in a sequence. For example, to retrieve the initial sales for each taco:

SELECT
    date,
    sales,
    FIRST_VALUE(sales) OVER (ORDER BY date) AS initial_day_sales
FROM taco_orders
ORDER BY date, taco;

Resulting in:

┌─date───────┬─sales─┬─initial_day_sales─┐
│ 2024-06-01 │ 10    │ 5                 │
│ 2024-06-01 │ 5     │ 5                 │
│ 2024-06-01 │ 8     │ 5                 │
│ 2024-06-02 │ 15    │ 5                 │
│ 2024-06-02 │ 10    │ 5                 │
│ 2024-06-02 │ 12    │ 5                 │
│ 2024-06-03 │ 20    │ 5                 │
│ 2024-06-03 │ 15    │ 5                 │
│ 2024-06-03 │ 18    │ 5                 │
└────────────┴───────┴───────────────────┘

‍

LAST_VALUE()

Returns the last value in the window frame. To get the latest sale for each taco, you can use:

SELECT
    date,
    taco,
    sales,
    LAST_VALUE(sales) OVER (
        ORDER BY date
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS latest_sales
FROM taco_orders
ORDER BY date, taco;

Note that LAST_VALUE() requires a frame definition to work as expected.

The result would be:

┌───────date─┬─taco──────┬─sales─┬─latest_sales─┐
│ 2024-06-01 │ Al Pastor │ 10    │ 18           │
│ 2024-06-01 │ Barbacoa  │ 5     │ 18           │
│ 2024-06-01 │ Carnitas  │ 8     │ 18           │
│ 2024-06-02 │ Al Pastor │ 15    │ 18           │
│ 2024-06-02 │ Barbacoa  │ 10    │ 18           │
│ 2024-06-02 │ Carnitas  │ 12    │ 18           │
│ 2024-06-03 │ Al Pastor │ 20    │ 18           │
│ 2024-06-03 │ Barbacoa  │ 15    │ 18           │
│ 2024-06-03 │ Carnitas  │ 18    │ 18           │
└────────────┴───────────┴───────┴──────────────┘

‍

Offset functions

While the LAG() and LEAD() functions, which allow you to access data from previous or subsequent rows in the result set, are not natively supported in ClickHouse®, they can be emulated using the anyOrNull() function and a specific frame definition.

LAG() equivalent

Accesses data from a previous row in the same result set. To get the previous sales for each taco, you can use:

SELECT
    date,
    taco,
    sales,
    anyOrNull(sales) OVER (
        ORDER BY date
        ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
    ) AS previous_sales
FROM taco_orders
ORDER BY date;

This query returns:

┌─date───────┬─taco──────────┬─sales─┬─previous_sales─┐
│ 2024-06-01 │ Veggie Taco   │ 8     │ \N             │
│ 2024-06-01 │ Beef Taco     │ 10    │ 8              │
│ 2024-06-01 │ Chicken Taco  │ 20    │ 10             │
│ 2024-06-02 │ Veggie Taco   │ 12    │ 20             │
│ 2024-06-02 │ Beef Taco     │ 15    │ 12             │
│ 2024-06-02 │ Chicken Taco  │ 10    │ 15             │
│ 2024-06-03 │ Veggie Taco   │ 10    │ 10             │
│ 2024-06-03 │ Beef Taco     │ 7     │ 10             │
│ 2024-06-03 │ Chicken Taco  │ 5     │ 7              │
└────────────┴───────────────┴───────┴────────────────┘

‍

LEAD() equivalent

Accesses data from a subsequent row in the same result set. To get the next sales for each taco, you can use:

SELECT
    date,
    taco,
    sales,
    anyOrNull(sales) OVER (
        ORDER BY date
        ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING
    ) AS following_sales
FROM taco_orders
ORDER BY date;

This query results in:

┌─────date───┬──taco───────┬─sales─┬─following_sales─┐
│ 2024-06-01 │ Veggie Taco │ 8     │ 10              │
│ 2024-06-01 │ Beef Taco   │ 10    │ 20              │
│ 2024-06-01 │ Chicken Taco│ 20    │ 12              │
│ 2024-06-02 │ Veggie Taco │ 12    │ 15              │
│ 2024-06-02 │ Beef Taco   │ 15    │ 10              │
│ 2024-06-02 │ Chicken Taco│ 10    │ 10              │
│ 2024-06-03 │ Veggie Taco │ 10    │ 7               │
│ 2024-06-03 │ Beef Taco   │ 7     │ 5               │
│ 2024-06-03 │ Chicken Taco│ 5     │ \N              │
└────────────┴─────────────┴───────┴─────────────────┘

‍

Use cases and examples

Window functions in ClickHouse enable a wide range of analytical use cases. Let's explore some common real-world scenarios and see how window functions can help solve them.

Running totals

Calculating cumulative sums is a common requirement in financial reporting and sales analysis. With window functions, this becomes a straightforward task. For instance, to track the cumulative revenue over time per taco, you can use the following query:

SELECT
    date,
    taco,
    sales,
    SUM(sales) OVER (
      PARTITION BY taco
      ORDER BY date
    ) AS cumulative_revenue
FROM taco_orders
ORDER BY taco, date;

This query will output the revenue for each date, along with the running total up to that date.

┌───────date─┬─taco───────────┬─sales─┬─cumulative_revenue─┐
│ 2024-06-01 │ Beef Taco      │ 10    │ 10                 │
│ 2024-06-02 │ Beef Taco      │ 15    │ 25                 │
│ 2024-06-03 │ Beef Taco      │ 7     │ 32                 │
│ 2024-06-01 │ Chicken Taco   │ 20    │ 20                 │
│ 2024-06-02 │ Chicken Taco   │ 10    │ 30                 │
│ 2024-06-03 │ Chicken Taco   │ 5     │ 35                 │
│ 2024-06-01 │ Veggie Taco    │ 8     │ 8                  │
│ 2024-06-02 │ Veggie Taco    │ 12    │ 20                 │
│ 2024-06-03 │ Veggie Taco    │ 10    │ 30                 │
└────────────┴────────────────┴───────┴────────────────────┘

This allows you to easily monitor your business's growth and identify trends.

Moving averages

In time series analysis, moving averages are often used to smooth out short-term fluctuations and highlight longer-term trends. ClickHouse® makes it easy to compute rolling averages using window functions. For example, to calculate a 2-day moving average of sales per taco, you can use:

SELECT
    date,
    taco,
    sales,
    AVG(sales) OVER (
        PARTITION BY taco
        ORDER BY date
        ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
    ) AS 2day_moving_avg
FROM taco_orders
ORDER BY taco;

This query will output the temperature for each date, along with the average temperature over the past 2 days.

┌───────date─┬─taco───────────┬─sales─┬─2day_moving_avg─┐
│ 2024-06-01 │ Beef Taco      │ 10    │ 10              │
│ 2024-06-02 │ Beef Taco      │ 15    │ 12.5            │
│ 2024-06-03 │ Beef Taco      │ 7     │ 11              │
│ 2024-06-01 │ Chicken Taco   │ 20    │ 20              │
│ 2024-06-02 │ Chicken Taco   │ 10    │ 15              │
│ 2024-06-03 │ Chicken Taco   │ 5     │ 7.5             │
│ 2024-06-01 │ Veggie Taco    │ 8     │ 8               │
│ 2024-06-02 │ Veggie Taco    │ 12    │ 10              │
│ 2024-06-03 │ Veggie Taco    │ 10    │ 11              │
└────────────┴────────────────┴───────┴─────────────────┘

By adjusting the window size, you can compute moving averages for different time periods, such as 30-day or 90-day moving averages.

Ranking

Ranking is a powerful way to compare and contrast entities within a group. ClickHouse® provides several ranking functions that assign ranks to rows based on specified criteria. For instance, to rank tacos by their sales within each date, you can use:

SELECT
    date,
    RANK() OVER (PARTITION BY date ORDER BY sales DESC) AS sales_rank,
    taco
FROM taco_orders
ORDER BY date, sales_rank;

This query will output each product along with its sales and its rank within its category.

┌───────date─┬─sales_rank─┬─taco───────────┐
│ 2024-06-01 │ 1          │ Chicken Taco   │
│ 2024-06-01 │ 2          │ Beef Taco      │
│ 2024-06-01 │ 3          │ Veggie Taco    │
│ 2024-06-02 │ 1          │ Beef Taco      │
│ 2024-06-02 │ 2          │ Veggie Taco    │
│ 2024-06-02 │ 3          │ Chicken Taco   │
│ 2024-06-03 │ 1          │ Veggie Taco    │
│ 2024-06-03 │ 2          │ Beef Taco      │
│ 2024-06-03 │ 3          │ Chicken Taco   │
└────────────┴────────────┴────────────────┘

You can easily identify the top-selling tacos in each day and track their performance over time.

Percentiles

Percentiles are a useful way to understand the distribution of a dataset. ClickHouse® does not support a PERCENT_RANK() function, but there is a workaround. The function is based on a simple formula (Rank — 1) / (Total Rows — 1), so we can implement it using window functions that ClickHouse® supports. For example, to compute the percentile rank of each taco's sales per day, we can use:

SELECT
    taco,
    sales,
    RANK() OVER (PARTITION BY taco ORDER BY date) as rank,
    COUNT(*) OVER (PARTITION BY taco) as total,
    (rank - 1)/(total - 1) as percentile_rank
FROM taco_orders;

This query will output each student's score along with their percentile rank.

┌─taco───────────┬─sales─┬─rank─┬─total─┬─percentile_rank─┐
│ Beef Taco      │ 10    │ 1    │ 3     │ 0               │
│ Beef Taco      │ 15    │ 2    │ 3     │ 0.5             │
│ Beef Taco      │ 7     │ 3    │ 3     │ 1               │
│ Chicken Taco   │ 20    │ 1    │ 3     │ 0               │
│ Chicken Taco   │ 10    │ 2    │ 3     │ 0.5             │
│ Chicken Taco   │ 5     │ 3    │ 3     │ 1               │
│ Veggie Taco    │ 8     │ 1    │ 3     │ 0               │
│ Veggie Taco    │ 12    │ 2    │ 3     │ 0.5             │
│ Veggie Taco    │ 10    │ 3    │ 3     │ 1               │
└────────────────┴───────┴──────┴───────┴─────────────────┘

You can quickly identify high and low performers and see how each taco compares to their peers.

Year-over-year comparisons

Comparing current performance with the same period in the previous year is a common way to measure growth and identify seasonal patterns. With ClickHouse®, you can easily perform year-over-year comparisons using the equivalent of the LAG() function.

To compare sales with last year, let’s first insert some values for the year before:

-- Step 2: Insert Sample Data for the year before
INSERT INTO taco_orders (taco, sales, date) VALUES ('Beef Taco', 8, '2023-06-01');
INSERT INTO taco_orders (taco, sales, date) VALUES ('Beef Taco', 18, '2023-06-02');
INSERT INTO taco_orders (taco, sales, date) VALUES ('Beef Taco', 5, '2023-06-03');
INSERT INTO taco_orders (taco, sales, date) VALUES ('Chicken Taco', 12, '2023-06-01');
INSERT INTO taco_orders (taco, sales, date) VALUES ('Chicken Taco', 4, '2023-06-02');
INSERT INTO taco_orders (taco, sales, date) VALUES ('Chicken Taco', 1, '2023-06-03');
INSERT INTO taco_orders (taco, sales, date) VALUES ('Veggie Taco', 4, '2023-06-01');
INSERT INTO taco_orders (taco, sales, date) VALUES ('Veggie Taco', 15, '2023-06-02');
INSERT INTO taco_orders (taco, sales, date) VALUES ('Veggie Taco', 9, '2023-06-03');

To compare daily sales with the same day last year, you can use:

SELECT
    date,
    taco,
    sales,
    anyOrNull(sales) OVER (
        PARTITION BY taco
        ORDER BY date
        RANGE BETWEEN 366 PRECEDING AND 366 PRECEDING
    ) AS sales_last_year
FROM taco_orders
ORDER BY date;

Double check the PRECEDING number and make sure you are getting the value from the year before.

This query will output the sales for each date, along with the sales for the same date in the previous year.

┌─date───────┬──taco─────────┬─sales─┬─sales_last_year─┐
│ 2023-06-01 │ Chicken Taco  │ 12    │ \N              │
│ 2023-06-01 │ Beef Taco     │ 8     │ \N              │
│ 2023-06-01 │ Veggie Taco   │ 4     │ \N              │
│ 2023-06-02 │ Chicken Taco  │ 4     │ \N              │
│ 2023-06-02 │ Beef Taco     │ 18    │ \N              │
│ 2023-06-02 │ Veggie Taco   │ 15    │ \N              │
│ 2023-06-03 │ Chicken Taco  │ 1     │ \N              │
│ 2023-06-03 │ Beef Taco     │ 5     │ \N              │
│ 2023-06-03 │ Veggie Taco   │ 9     │ \N              │
│ 2024-06-01 │ Chicken Taco  │ 20    │ 12              │
│ 2024-06-01 │ Beef Taco     │ 10    │ 8               │
│ 2024-06-01 │ Veggie Taco   │ 8     │ 4               │
│ 2024-06-02 │ Chicken Taco  │ 10    │ 4               │
│ 2024-06-02 │ Beef Taco     │ 15    │ 18              │
│ 2024-06-02 │ Veggie Taco   │ 12    │ 15              │
│ 2024-06-03 │ Chicken Taco  │ 5     │ 1               │
│ 2024-06-03 │ Beef Taco     │ 7     │ 5               │
│ 2024-06-03 │ Veggie Taco   │ 10    │ 9               │
└────────────┴───────────────┴───────┴─────────────────┘

‍

Identifying the top N per group

Finding the top performers within each group is a common task. ClickHouse allows you to easily identify the top N entities per group using window functions. For example, to find the top 2 highest-sales days for each taco, you can use:

WITH ranked AS (
    SELECT
        ROW_NUMBER() OVER (PARTITION BY taco ORDER BY sales DESC) AS rank,
        taco,
        date,
        sales
    FROM taco_orders
)
SELECT *
FROM ranked
WHERE rank <= 2;

This query first assigns a rank to each date within each taco based on sales. It then selects only those dates whose rank is less than or equal to 2. This allows you to quickly identify the top days for each taco.

┌─rank─┬─taco───────────┬───date─────┬─sales─┐
│ 1    │ Beef Taco      │ 2024-06-02 │ 15    │
│ 2    │ Beef Taco      │ 2024-06-01 │ 10    │
│ 1    │ Chicken Taco   │ 2024-06-01 │ 20    │
│ 2    │ Chicken Taco   │ 2024-06-02 │ 10    │
│ 1    │ Veggie Taco    │ 2024-06-02 │ 12    │
│ 2    │ Veggie Taco    │ 2024-06-03 │ 10    │
└──────┴────────────────┴────────────┴───────┘

‍

Detecting anomalies

Identifying outliers and anomalies in time series data is crucial for monitoring system or business health and detecting fraud. ClickHouse allows you to use window functions to compute z-scores, which measure how many standard deviations a value is from the mean. For instance, to detect anomalies in taco sales, you can use:

SELECT
    date,
    sales,
    ABS(sales - AVG(sales) OVER w) / stddevPop(sales) OVER w AS z_score
FROM taco_orders
WINDOW w AS (
    ORDER BY date
    ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
)
ORDER BY date;

This query computes the z-score for each sales data point by comparing it to the average of the readings in a 7-day window centered on the current date. Readings with a high z-score (typically greater than 3) are likely to be anomalies and warrant further investigation.

┌──date──────┬─sales─┬──────────────z_score──────────────┐
│ 2024-06-01 │ 8     │ 0.9878783399072131                │
│ 2024-06-01 │ 10    │ 0.7150969419341942                │
│ 2024-06-01 │ 20    │ 1.879901969149606                 │
│ 2024-06-02 │ 12    │ 0.03763549647474893               │
│ 2024-06-02 │ 15    │ 0.7567874686642695                │
│ 2024-06-02 │ 10    │ 0.2764328025752782                │
│ 2024-06-03 │ 10    │ 0.051502620262460275              │
│ 2024-06-03 │ 7     │ 0.7095748751868993                │
│ 2024-06-03 │ 5     │ 1.4142135623730951                │
└────────────┴───────┴───────────────────────────────────┘

‍

Best practices for using window functions in ClickHouse

Optimize partitioning

Partitioning is crucial for window function performance. The PARTITION BY clause determines how data is divided into groups. Choosing the right partition columns is essential for both query performance and result accuracy.

Partition columns should ideally have a relatively low cardinality, meaning they have a limited number of distinct values. This allows ClickHouse to efficiently distribute data across partitions and parallelize computation. However, if the cardinality is too low, partitions may become too large, negating the benefits of partitioning.

Conversely, if the partition key has very high cardinality, such as a unique identifier, data will be fragmented into many small partitions. This can lead to excessive overhead and slow down the query.

The best approach is to find a balance based on your specific data and query patterns. Experiment with different partition keys and measure the impact on performance and accuracy.

See our guide on When and how to choose a partition key.

Use appropriate frame clauses

Frame clauses define the subset of the partition visible to the window function. By default, the frame includes all rows from the start of the partition to the current row. However, this may not always be necessary or efficient, especially for large datasets.

To optimize performance, specify the smallest frame that meets your requirements. For example, if you only need to compare each row with the previous one, use ROWS BETWEEN 1 PRECEDING AND CURRENT ROW. This limits the amount of data processed for each row.

Be cautious with unbounded following frames (ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING), as they can lead to significant performance overhead. Unless you need to look ahead to future rows, try to use a bounded frame instead.

Combine with materialized views

If you frequently run the same window function queries on large datasets, consider pre-calculating the results and storing them in a materialized view. This can greatly speed up query times, as the expensive window function computations are done only once when the view is created or updated.

Materialized views are especially useful for complex queries involving multiple window functions or joins. By materializing intermediate results, you can avoid redundant computations and improve overall performance.

However, keep in mind that materialized views come with a maintenance cost. They need to be updated whenever the underlying data changes, which can impact write performance. Therefore, they are most suitable for scenarios where the data is relatively stable, and the queries are read-heavy.

Be cautious with large windows

Window functions can be computationally expensive, especially when the window frame is large. If you define a window that spans a significant portion of the dataset, the query may take a long time to complete or even fail due to resource constraints.

In such cases, consider whether you need exact results for all rows. If an approximation is sufficient, you can use approximate aggregate functions like approxQuantile or approxDistinct. These functions trade accuracy for performance by using probabilistic data structures to estimate results.

Another option is to break down the large window into smaller chunks and process them incrementally. For example, instead of calculating a global rank across the entire dataset, you can calculate local ranks within each partition and then combine them as needed.

Use the sorting key and partition key effectively

The sorting and partition key can significantly speed up window function queries by optimizing data sorting and filtering. When using the ORDER BY clause, make sure to use the columns in the table’s sorting key.

Similarly, if you are using the PARTITION BY clause, consider using the sorting key columns. This helps ClickHouse quickly read the data needed to perform the query.

Read our How to select a table engine and sorting key guide.

Consider alternatives

While window functions are a powerful tool for analytics, they may not always be the most efficient solution. In some cases, other ClickHouse features like arrays or nested data structures can achieve similar results with better performance.

For example, if you need to calculate a running total for a specific metric, you can store the metric values in an array and use the arraySum function to compute the cumulative sum. This can be more efficient than using a window function, especially if the data is already stored in an array format.

Similarly, if you need to analyze hierarchical data, such as a tree of categories, you can use nested data structures to represent the hierarchy and use recursive queries to traverse it. This can be more natural and efficient than trying to simulate the hierarchy with window functions.

Of course, the choice between window functions and alternative approaches depends on your specific use case and data model. It's worth experimenting with different options and measuring their performance to find the best solution for your needs.

ClickHouse vs Snowflake window functions

While both ClickHouse and Snowflake support window functions, there are some notable differences in their implementations. Here's a comparison table highlighting the key features:

Feature	ClickHouse	Snowflake
Syntax	Standard SQL window function syntax	Standard SQL window function syntax
PARTITION BY support	Yes	Yes
ORDER BY support	Yes	Yes
ROWS frame	Supported	Supported
RANGE frame	Supported, but limited for date/time types	Fully supported, including date/time intervals
GROUPS frame	Not supported	Supported
Aggregate functions	All ClickHouse aggregate functions supported	All Snowflake aggregate functions supported
LAG/LEAD functions	Not natively supported, requires a workaround	Natively supported
EXCLUDE clause	Not supported	Supported
Performance	Generally faster for large datasets	May be slower for very large datasets
Scalability	Highly scalable, designed for big data	Scalable, but expensive at scale.

One of the main differences is in the support for RANGE frames. While both databases support ROWS frames, ClickHouse has limited support for RANGE frames when it comes to date and time types. In contrast, Snowflake fully supports RANGE frames, including intervals for date and time types, making it more flexible for time-series analysis.

Another difference is in the support for GROUPS frames. Snowflake supports GROUPS frames, which allow you to define the window frame based on groups of rows with the same values in the ORDER BY clause. ClickHouse, on the other hand, does not currently support GROUPS frames.

When it comes to aggregate functions, both databases support a wide range of functions that can be used with window functions. However, there are some differences in the specific functions available.

One notable difference is in the support for LAG and LEAD functions. Snowflake natively supports these functions, which allow you to access values from previous or subsequent rows in the window. In ClickHouse, you need to use a workaround involving the anyOrNull function and a specific frame definition.

Snowflake also supports the EXCLUDE clause, which allows you to exclude certain rows from the window frame based on different criteria (CURRENT ROW, GROUP, TIES, or NO OTHERS). This can be useful for advanced windowing scenarios. ClickHouse does not currently support the EXCLUDE clause.

ClickHouse vs PostgreSQL window functions

ClickHouse and PostgreSQL both support window functions, but there are some differences in their implementations. Let's take a closer look at how they compare:

Feature	ClickHouse	PostgreSQL
Syntax	Standard SQL window function syntax	Standard SQL window function syntax
PARTITION BY support	Yes	Yes
ORDER BY support	Yes	Yes
ROWS frame	Supported	Supported
RANGE frame	Supported, but limited for date/time types	Fully supported, including date/time intervals
GROUPS frame	Not supported	Supported
Aggregate functions	All ClickHouse aggregate functions supported	All PostgreSQL aggregate functions supported
LAG/LEAD functions	Not natively supported, requires workaround	Natively supported
EXCLUDE clause	Not supported	Supported
Performance	Generally faster for large datasets	May be slower for very large datasets
Scalability	Highly scalable, designed for big data	Limited scalability for very large datasets

PostgreSQL has been a pioneer in implementing window functions, with support dating back to version 8.4 released in 2009. As a result, it has a mature and feature-rich implementation, including full support for RANGE frames with date and time intervals, GROUPS frames, and the EXCLUDE clause.

ClickHouse, on the other hand, has a window function implementation that is relatively new, but it has quickly caught up to support most of the standard features.

One notable difference is in the support for LAG and LEAD functions. PostgreSQL natively supports these functions, making it easy to access values from previous or subsequent rows in the window. In ClickHouse, you need to use a workaround involving the anyOrNull function and a specific frame definition, which can be less intuitive.

ClickHouse vs Databricks window functions

Databricks, built on Apache Spark, is another data platform that supports window functions. Let's see how it compares to ClickHouse:

Feature	ClickHouse	Databricks
Syntax	Standard SQL window function syntax	Standard SQL window function syntax
PARTITION BY support	Yes	Yes
ORDER BY support	Yes	Yes
ROWS frame	Supported	Supported
RANGE frame	Supported, but limited for date/time types	Fully supported, including date/time intervals
GROUPS frame	Not supported	Not supported
Aggregate functions	All ClickHouse aggregate functions supported	All Spark SQL aggregate functions supported
LAG/LEAD functions	Not natively supported, requires workaround	Natively supported
EXCLUDE clause	Not supported	Not supported
Performance	Generally faster for single-node operations	Better for distributed, large-scale processing
Scalability	Highly scalable, designed for big data	Highly scalable, designed for big data

Databricks and ClickHouse have a lot in common when it comes to window functions. Both support the standard SQL syntax, PARTITION BY, ORDER BY, and ROWS frames. They also offer a wide range of aggregate functions that can be used with window functions.

However, there are a few key differences. Like PostgreSQL, Databricks fully supports RANGE frames with date and time intervals, while ClickHouse has limitations in this area. Databricks also natively supports LAG and LEAD functions, which can simplify queries that need to access values from previous or subsequent rows.

Neither platform supports the GROUPS frame or the EXCLUDE clause, which are less commonly used features.

Limitations and workarounds for ClickHouse window functions

While ClickHouse's support for window functions is extensive, there are a few limitations to be aware of. However, most of these can be overcome with clever workarounds or alternative approaches.

Lack of native `LAG` and `LEAD` functions

The LAG and LEAD functions are commonly used to access values from preceding or following rows. In ClickHouse, you can achieve similar functionality using the anyOrNull function with an appropriate ROWS frame. For example:

SELECT
    anyOrNull(value) OVER (
    	ORDER BY date ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
    ) AS prev_value
FROM table;

This workaround allows you to get the value from the previous row, similar to what LAG(value, 1) would do in other databases.

No support for `RANGE` frames with date and time types

While ClickHouse supports RANGE frames, it doesn't allow using INTERVAL syntax for defining the range. Instead, you need to use integer offsets representing the number of seconds. For instance, instead of:

RANGE BETWEEN INTERVAL 1 DAY PRECEDING AND CURRENT ROW

You would use:

RANGE BETWEEN 86400 PRECEDING AND CURRENT ROW

Here, 86400 represents the number of seconds in a day.

No support for the `GROUPS` frame

The GROUPS frame groups rows with the same values into a single frame. In most cases, similar results can be achieved using ROWS or RANGE frames with appropriate partitioning and ordering.

No `EXCLUDE` clause

The EXCLUDE clause allows you to exclude certain rows from the window frame. If you need this functionality, you'll need to implement the logic manually using subqueries or conditional aggregations.

No `PERCENT_RANK()` function

Currently, ClickHouse does not support the PERCENT_RANK() function, which is used to calculate percentiles. However, as illustrated in the percentiles example above, this can be accomplished using the RANK function.

Conclusion

ClickHouse has robust window function support that enables complex calculations across related rows. While certain functionalities have limitations and workarounds, the overall performance and feature set are more than enough for the vast majority of use cases.

When working with window functions, it's crucial to understand how partitioning and frame clauses affect the results and performance. Proper use of these clauses can significantly improve query efficiency, especially with large datasets.

ClickHouse's performance with window functions is generally excellent, thanks to its columnar storage and vectorized query execution. However, there are cases where the current implementation may face challenges, such as when working with very large windows or in distributed setups. In these situations, techniques like approximate aggregations or incremental processing can help mitigate the issues

As you explore window functions further, don't hesitate to experiment with different approaches and share your findings with the community.

You can get started with Propel's Serverless ClickHouse today and receive $15 in monthly credits forever. At any point, upgrade to pay-as-you-go, or contact us to learn more about our volume-based discounts. Visit our pricing page for details.

What's new in ClickHouse® version 24.10

This is some text inside of a div block.

Heading 1

Heading 2

Heading 3

Heading 4

Heading 5

Heading 6

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.