Window functions open up new analytical possibilities. They enable users to perform complex calculations and derive insights that were previously difficult or impossible with standard SQL aggregations. With window functions, you can calculate running totals, rank results, identify trends, and more—all without cumbersome subqueries or temporary tables.
In this guide, we'll explore window functions in ClickHouse. We'll cover the syntax, key concepts, and practical applications through real-world examples.
Introduction to window functions in ClickHouse
At their core, window functions in ClickHouse are SQL functions that perform calculations across a set of table rows related to the current row. They operate on a "window" of data, which is a subset of rows defined by a partition and/or an ordering.
This concept of a "window" is crucial. Imagine you have a table with taco order data, and you want to calculate the running total of sales for each taco. With window functions, you can define a window that partitions the data by taco and orders it by date. Then, for each row, the function calculates the sum of sales up to and including that row within its respective partition.
The key advantage of window functions over traditional GROUP BY aggregations is that they return a result for each input row, rather than collapsing the data into a single summary value per group. This allows you to perform complex analytical queries while still maintaining the granularity of the original data.
Some common use cases for window functions include:
- Calculating running totals, moving averages, or cumulative sums
- Ranking or percentile calculations within a group
- Identifying the first or last value within a partition
- Comparing values between the current row and previous or subsequent rows
Here's a simple example to illustrate the basic syntax. Suppose you have the following table and data:
-- Step 1: Create the Table
CREATE TABLE taco_orders
(
taco String,
sales UInt32,
date Date
)
ENGINE = MergeTree()
ORDER BY date;
-- Step 2: Insert Sample Data
INSERT INTO taco_orders (taco, date, sales) VALUES ('Al Pastor', '2024-06-01', 10);
INSERT INTO taco_orders (taco, date, sales) VALUES ('Al Pastor', '2024-06-02', 15);
INSERT INTO taco_orders (taco, date, sales) VALUES ('Al Pastor', '2024-06-03', 20);
INSERT INTO taco_orders (taco, date, sales) VALUES ('Carnitas', '2024-06-01', 8);
INSERT INTO taco_orders (taco, date, sales) VALUES ('Carnitas', '2024-06-02', 12);
INSERT INTO taco_orders (taco, date, sales) VALUES ('Carnitas', '2024-06-03', 18);
INSERT INTO taco_orders (taco, date, sales) VALUES ('Barbacoa', '2024-06-01', 5);
INSERT INTO taco_orders (taco, date, sales) VALUES ('Barbacoa', '2024-06-02', 10);
INSERT INTO taco_orders (taco, date, sales) VALUES ('Barbacoa', '2024-06-03', 15);
In this query, the SUM
function is applied over a window that partitions the data by taco
and orders it by date
.
SELECT
date,
taco,
sales,
SUM(sales) OVER (
PARTITION BY taco
ORDER BY date) AS running_total_by_taco
FROM taco_orders;
The result is a new column, running_total_by_taco
, which contains the cumulative sum of sales for each taco, calculated up to and including the current row.
┌───────date─┬─taco───────────┬─sales─┬─running_total_by_taco─┐
│ 2024-06-01 │ Beef Taco │ 10 │ 10 │
│ 2024-06-02 │ Beef Taco │ 15 │ 25 │
│ 2024-06-03 │ Beef Taco │ 7 │ 32 │
│ 2024-06-01 │ Chicken Taco │ 20 │ 20 │
│ 2024-06-02 │ Chicken Taco │ 10 │ 30 │
│ 2024-06-03 │ Chicken Taco │ 5 │ 35 │
│ 2024-06-01 │ Veggie Taco │ 8 │ 8 │
│ 2024-06-02 │ Veggie Taco │ 12 │ 20 │
│ 2024-06-03 │ Veggie Taco │ 10 │ 30 │
└────────────┴────────────────┴───────┴───────────────────────┘
If you wanted a running total for all rows, you would just omit the PARTITION BY
clause.
SELECT
date,
taco,
sales,
SUM(sales) OVER (
ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total
FROM taco_orders;
And you get:
┌───────date─┬─taco──────┬─sales─┬─running_total─┐
│ 2024-06-01 │ Barbacoa │ 5 │ 5 │
│ 2024-06-01 │ Al Pastor │ 10 │ 15 │
│ 2024-06-01 │ Carnitas │ 8 │ 23 │
│ 2024-06-02 │ Barbacoa │ 10 │ 33 │
│ 2024-06-02 │ Al Pastor │ 15 │ 48 │
│ 2024-06-02 │ Carnitas │ 12 │ 60 │
│ 2024-06-03 │ Barbacoa │ 15 │ 75 │
│ 2024-06-03 │ Al Pastor │ 20 │ 95 │
│ 2024-06-03 │ Carnitas │ 18 │ 113 │
└────────────┴───────────┴───────┴───────────────┘
As you can see, window functions provide a powerful and flexible way to analyze data in ClickHouse. In the following sections, we'll explore the various types of window functions, their syntax, and how to use them effectively in your queries.
Syntax of window functions in ClickHouse
Window functions in ClickHouse work on a subset of rows, called a "window frame", defined in relation to the current row. The general syntax for a window function is:
SELECT
column1,
column2,
...,
window_function() OVER (
PARTITION BY partition_column
ORDER BY order_column
)
FROM table_name;
Let's break down each component:
PARTITION BY
clause: This is optional and divides the result set into distinct partitions, to which the window function is applied independently. If omitted, the entire result set is treated as a single partition.ORDER BY
clause: This clause is also optional and defines the logical order of rows within each partition. The window function calculations are performed based on this ordering.frame_clause
: This specifies the sliding window of rows that the function operates on. It can be defined using either theROWS
orRANGE
syntax.
The ROWS
frame defines the window in terms of physical row offsets. For example:
ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
This creates a window that includes the current row and the three rows preceding it.
On the other hand, the RANGE
frame defines the window based on a logical range of values. For instance:
RANGE BETWEEN 10 PRECEDING AND CURRENT ROW
This window includes all rows within a ten-day range preceding the current row, based on a date or timestamp column specified in the ORDER BY
clause.
It's important to understand the execution model of window functions in ClickHouse. They are evaluated after the WHERE
, GROUP BY
, and HAVING
clauses, but before the final ORDER BY
and LIMIT
operations. This means that window functions can access the filtered and grouped result set, but their results can still be further sorted and limited.
Supported window functions in ClickHouse
ClickHouse offers a variety of window functions that enable powerful analytical queries. These functions can be broadly categorized into two groups: aggregate functions and ranking functions.
Aggregate functions
All standard ClickHouse aggregate functions, such as SUM
, AVG
, and COUNT
, can be used as window functions. This allows you to perform calculations across a sliding window of rows, which is particularly useful for time series analysis. For example, to calculate a running total of sales over time, you can use the following query:
SELECT
date,
value,
SUM(value) OVER (ORDER BY date) AS running_total
FROM sales;
This query will output the sales value for each date, along with the cumulative sum of sales up to that date.
┌───────date─┬─taco───────────┬─sales─┬─running_total─┐
│ 2024-06-01 │ Veggie Taco │ 8 │ 38 │
│ 2024-06-01 │ Beef Taco │ 10 │ 38 │
│ 2024-06-01 │ Chicken Taco │ 20 │ 38 │
│ 2024-06-02 │ Beef Taco │ 15 │ 75 │
│ 2024-06-02 │ Chicken Taco │ 10 │ 75 │
│ 2024-06-02 │ Veggie Taco │ 12 │ 75 │
│ 2024-06-03 │ Veggie Taco │ 10 │ 97 │
│ 2024-06-03 │ Beef Taco │ 7 │ 97 │
│ 2024-06-03 │ Chicken Taco │ 5 │ 97 │
│ 2024-06-04 │ Beef Taco │ 12 │ 97 │
└────────────┴────────────────┴───────┴───────────────┘
Ranking functions
ClickHouse provides several ranking functions that assign a rank to each row within its partition based on a specified ordering. These functions are particularly useful for identifying top performers or outliers within a group.
ROW_NUMBER()
Assigns a unique integer to each row. For instance, to give a consecutive row number to out taco orders, you can use the following query:
SELECT
ROW_NUMBER() OVER (ORDER BY date, taco ASC) AS row_number,
date,
taco,
sales
FROM taco_orders
ORDER BY row_number;
This query would result in:
┌─row_number─┬─date-------┬─taco──────┬─sales─┐
│ 1 │ 2024-06-01 │ Al Pastor │ 10 │
│ 2 │ 2024-06-01 │ Barbacoa │ 5 │
│ 3 │ 2024-06-01 │ Carnitas │ 8 │
│ 4 │ 2024-06-02 │ Al Pastor │ 15 │
│ 5 │ 2024-06-02 │ Barbacoa │ 10 │
│ 6 │ 2024-06-02 │ Carnitas │ 12 │
│ 7 │ 2024-06-03 │ Al Pastor │ 20 │
│ 8 │ 2024-06-03 │ Barbacoa │ 15 │
│ 9 │ 2024-06-03 │ Carnitas │ 18 │
└────────────┴────────────┴───────────┴───────┘
RANK()
Assigns a rank to each row within its partition, with gaps for ties. To rank tacos by their sales in descending order, allowing for ties, you can use:
SELECT
RANK() OVER (ORDER BY sales DESC) AS rank,
date,
taco,
sales
FROM taco_orders;
This would result in:
┌─rank─┬───────date─┬─taco───────────┬─sales─┐
│ 1 │ 2024-06-01 │ Chicken Taco │ 20 │
│ 2 │ 2024-06-02 │ Beef Taco │ 15 │
│ 3 │ 2024-06-02 │ Veggie Taco │ 12 │
│ 4 │ 2024-06-03 │ Veggie Taco │ 10 │
│ 4 │ 2024-06-01 │ Beef Taco │ 10 │
│ 4 │ 2024-06-02 │ Chicken Taco │ 10 │
│ 7 │ 2024-06-01 │ Veggie Taco │ 8 │
│ 8 │ 2024-06-03 │ Beef Taco │ 7 │
│ 9 │ 2024-06-03 │ Chicken Taco │ 5 │
└──────┴────────────┴────────────────┴───────┘
DENSE_RANK()
Similar to RANK()
, but without gaps in the ranking sequence. Using the same example as above, but with DENSE_RANK()
, consecutive ranks will be assigned in case of ties.
SELECT
DENSE_RANK() OVER (ORDER BY sales DESC) AS rank,
date,
taco,
sales
FROM taco_orders
ORDER BY rank;
This would result in:
┌─rank─┬───────date─┬─taco───────────┬─sales─┐
│ 1 │ 2024-06-01 │ Chicken Taco │ 20 │
│ 2 │ 2024-06-02 │ Beef Taco │ 15 │
│ 3 │ 2024-06-02 │ Veggie Taco │ 12 │
│ 4 │ 2024-06-03 │ Veggie Taco │ 10 │
│ 4 │ 2024-06-01 │ Beef Taco │ 10 │
│ 4 │ 2024-06-02 │ Chicken Taco │ 10 │
│ 5 │ 2024-06-01 │ Veggie Taco │ 8 │
│ 6 │ 2024-06-03 │ Beef Taco │ 7 │
│ 7 │ 2024-06-03 │ Chicken Taco │ 5 │
└──────┴────────────┴────────────────┴───────┘
Value functions
ClickHouse also offers functions that allow you to access specific values within a window frame.
FIRST_VALUE()
Returns the first value in the window frame. This is handy for comparing the current value to the initial value in a sequence. For example, to retrieve the initial sales for each taco:
SELECT
date,
sales,
FIRST_VALUE(sales) OVER (ORDER BY date) AS initial_day_sales
FROM taco_orders
ORDER BY date, taco;
Resulting in:
┌─date───────┬─sales─┬─initial_day_sales─┐
│ 2024-06-01 │ 10 │ 5 │
│ 2024-06-01 │ 5 │ 5 │
│ 2024-06-01 │ 8 │ 5 │
│ 2024-06-02 │ 15 │ 5 │
│ 2024-06-02 │ 10 │ 5 │
│ 2024-06-02 │ 12 │ 5 │
│ 2024-06-03 │ 20 │ 5 │
│ 2024-06-03 │ 15 │ 5 │
│ 2024-06-03 │ 18 │ 5 │
└────────────┴───────┴───────────────────┘
LAST_VALUE()
Returns the last value in the window frame. To get the latest sale for each taco, you can use:
SELECT
date,
taco,
sales,
LAST_VALUE(sales) OVER (
ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS latest_sales
FROM taco_orders
ORDER BY date, taco;
Note that LAST_VALUE()
requires a frame definition to work as expected.
The result would be:
┌───────date─┬─taco──────┬─sales─┬─latest_sales─┐
│ 2024-06-01 │ Al Pastor │ 10 │ 18 │
│ 2024-06-01 │ Barbacoa │ 5 │ 18 │
│ 2024-06-01 │ Carnitas │ 8 │ 18 │
│ 2024-06-02 │ Al Pastor │ 15 │ 18 │
│ 2024-06-02 │ Barbacoa │ 10 │ 18 │
│ 2024-06-02 │ Carnitas │ 12 │ 18 │
│ 2024-06-03 │ Al Pastor │ 20 │ 18 │
│ 2024-06-03 │ Barbacoa │ 15 │ 18 │
│ 2024-06-03 │ Carnitas │ 18 │ 18 │
└────────────┴───────────┴───────┴──────────────┘
Offset functions
While the LAG()
and LEAD()
functions, which allow you to access data from previous or subsequent rows in the result set, are not natively supported in ClickHouse, they can be emulated using the anyOrNull()
function and a specific frame definition.
LAG()
equivalent
Accesses data from a previous row in the same result set. To get the previous sales for each taco, you can use:
SELECT
date,
taco,
sales,
anyOrNull(sales) OVER (
ORDER BY date
ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
) AS previous_sales
FROM taco_orders
ORDER BY date;
This query returns:
┌─date───────┬─taco──────────┬─sales─┬─previous_sales─┐
│ 2024-06-01 │ Veggie Taco │ 8 │ \N │
│ 2024-06-01 │ Beef Taco │ 10 │ 8 │
│ 2024-06-01 │ Chicken Taco │ 20 │ 10 │
│ 2024-06-02 │ Veggie Taco │ 12 │ 20 │
│ 2024-06-02 │ Beef Taco │ 15 │ 12 │
│ 2024-06-02 │ Chicken Taco │ 10 │ 15 │
│ 2024-06-03 │ Veggie Taco │ 10 │ 10 │
│ 2024-06-03 │ Beef Taco │ 7 │ 10 │
│ 2024-06-03 │ Chicken Taco │ 5 │ 7 │
└────────────┴───────────────┴───────┴────────────────┘
LEAD()
equivalent
Accesses data from a subsequent row in the same result set. To get the next sales for each taco, you can use:
SELECT
date,
taco,
sales,
anyOrNull(sales) OVER (
ORDER BY date
ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING
) AS following_sales
FROM taco_orders
ORDER BY date;
This query results in:
┌─────date───┬──taco───────┬─sales─┬─following_sales─┐
│ 2024-06-01 │ Veggie Taco │ 8 │ 10 │
│ 2024-06-01 │ Beef Taco │ 10 │ 20 │
│ 2024-06-01 │ Chicken Taco│ 20 │ 12 │
│ 2024-06-02 │ Veggie Taco │ 12 │ 15 │
│ 2024-06-02 │ Beef Taco │ 15 │ 10 │
│ 2024-06-02 │ Chicken Taco│ 10 │ 10 │
│ 2024-06-03 │ Veggie Taco │ 10 │ 7 │
│ 2024-06-03 │ Beef Taco │ 7 │ 5 │
│ 2024-06-03 │ Chicken Taco│ 5 │ \N │
└────────────┴─────────────┴───────┴─────────────────┘
Use cases and examples
Window functions in ClickHouse enable a wide range of analytical use cases. Let's explore some common real-world scenarios and see how window functions can help solve them.
Running totals
Calculating cumulative sums is a common requirement in financial reporting and sales analysis. With window functions, this becomes a straightforward task. For instance, to track the cumulative revenue over time per taco, you can use the following query:
SELECT
date,
taco,
sales,
SUM(sales) OVER (
PARTITION BY taco
ORDER BY date
) AS cumulative_revenue
FROM taco_orders
ORDER BY taco, date;
This query will output the revenue for each date, along with the running total up to that date.
┌───────date─┬─taco───────────┬─sales─┬─cumulative_revenue─┐
│ 2024-06-01 │ Beef Taco │ 10 │ 10 │
│ 2024-06-02 │ Beef Taco │ 15 │ 25 │
│ 2024-06-03 │ Beef Taco │ 7 │ 32 │
│ 2024-06-01 │ Chicken Taco │ 20 │ 20 │
│ 2024-06-02 │ Chicken Taco │ 10 │ 30 │
│ 2024-06-03 │ Chicken Taco │ 5 │ 35 │
│ 2024-06-01 │ Veggie Taco │ 8 │ 8 │
│ 2024-06-02 │ Veggie Taco │ 12 │ 20 │
│ 2024-06-03 │ Veggie Taco │ 10 │ 30 │
└────────────┴────────────────┴───────┴────────────────────┘
This allows you to easily monitor your business's growth and identify trends.
Moving averages
In time series analysis, moving averages are often used to smooth out short-term fluctuations and highlight longer-term trends. ClickHouse makes it easy to compute rolling averages using window functions. For example, to calculate a 2-day moving average of sales per taco, you can use:
SELECT
date,
taco,
sales,
AVG(sales) OVER (
PARTITION BY taco
ORDER BY date
ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
) AS 2day_moving_avg
FROM taco_orders
ORDER BY taco;
This query will output the temperature for each date, along with the average temperature over the past 2 days.
┌───────date─┬─taco───────────┬─sales─┬─2day_moving_avg─┐
│ 2024-06-01 │ Beef Taco │ 10 │ 10 │
│ 2024-06-02 │ Beef Taco │ 15 │ 12.5 │
│ 2024-06-03 │ Beef Taco │ 7 │ 11 │
│ 2024-06-01 │ Chicken Taco │ 20 │ 20 │
│ 2024-06-02 │ Chicken Taco │ 10 │ 15 │
│ 2024-06-03 │ Chicken Taco │ 5 │ 7.5 │
│ 2024-06-01 │ Veggie Taco │ 8 │ 8 │
│ 2024-06-02 │ Veggie Taco │ 12 │ 10 │
│ 2024-06-03 │ Veggie Taco │ 10 │ 11 │
└────────────┴────────────────┴───────┴─────────────────┘
By adjusting the window size, you can compute moving averages for different time periods, such as 30-day or 90-day moving averages.
Ranking
Ranking is a powerful way to compare and contrast entities within a group. ClickHouse provides several ranking functions that assign ranks to rows based on specified criteria. For instance, to rank tacos by their sales within each date, you can use:
SELECT
date,
RANK() OVER (PARTITION BY date ORDER BY sales DESC) AS sales_rank,
taco
FROM taco_orders
ORDER BY date, sales_rank;
This query will output each product along with its sales and its rank within its category.
┌───────date─┬─sales_rank─┬─taco───────────┐
│ 2024-06-01 │ 1 │ Chicken Taco │
│ 2024-06-01 │ 2 │ Beef Taco │
│ 2024-06-01 │ 3 │ Veggie Taco │
│ 2024-06-02 │ 1 │ Beef Taco │
│ 2024-06-02 │ 2 │ Veggie Taco │
│ 2024-06-02 │ 3 │ Chicken Taco │
│ 2024-06-03 │ 1 │ Veggie Taco │
│ 2024-06-03 │ 2 │ Beef Taco │
│ 2024-06-03 │ 3 │ Chicken Taco │
└────────────┴────────────┴────────────────┘
You can easily identify the top-selling tacos in each day and track their performance over time.
Percentiles
Percentiles are a useful way to understand the distribution of a dataset. ClickHouse does not support a PERCENT_RANK()
function, but there is a workaround. The function is based on a simple formula (Rank — 1) / (Total Rows — 1), so we can implement it using window functions that ClickHouse supports. For example, to compute the percentile rank of each taco's sales per day, we can use:
SELECT
taco,
sales,
RANK() OVER (PARTITION BY taco ORDER BY date) as rank,
COUNT(*) OVER (PARTITION BY taco) as total,
(rank - 1)/(total - 1) as percentile_rank
FROM taco_orders;
This query will output each student's score along with their percentile rank.
┌─taco───────────┬─sales─┬─rank─┬─total─┬─percentile_rank─┐
│ Beef Taco │ 10 │ 1 │ 3 │ 0 │
│ Beef Taco │ 15 │ 2 │ 3 │ 0.5 │
│ Beef Taco │ 7 │ 3 │ 3 │ 1 │
│ Chicken Taco │ 20 │ 1 │ 3 │ 0 │
│ Chicken Taco │ 10 │ 2 │ 3 │ 0.5 │
│ Chicken Taco │ 5 │ 3 │ 3 │ 1 │
│ Veggie Taco │ 8 │ 1 │ 3 │ 0 │
│ Veggie Taco │ 12 │ 2 │ 3 │ 0.5 │
│ Veggie Taco │ 10 │ 3 │ 3 │ 1 │
└────────────────┴───────┴──────┴───────┴─────────────────┘
You can quickly identify high and low performers and see how each taco compares to their peers.
Year-over-year comparisons
Comparing current performance with the same period in the previous year is a common way to measure growth and identify seasonal patterns. With ClickHouse, you can easily perform year-over-year comparisons using the equivalent of the LAG()
function.
To compare sales with last year, let’s first insert some values for the year before:
-- Step 2: Insert Sample Data for the year before
INSERT INTO taco_orders (taco, sales, date) VALUES ('Beef Taco', 8, '2023-06-01');
INSERT INTO taco_orders (taco, sales, date) VALUES ('Beef Taco', 18, '2023-06-02');
INSERT INTO taco_orders (taco, sales, date) VALUES ('Beef Taco', 5, '2023-06-03');
INSERT INTO taco_orders (taco, sales, date) VALUES ('Chicken Taco', 12, '2023-06-01');
INSERT INTO taco_orders (taco, sales, date) VALUES ('Chicken Taco', 4, '2023-06-02');
INSERT INTO taco_orders (taco, sales, date) VALUES ('Chicken Taco', 1, '2023-06-03');
INSERT INTO taco_orders (taco, sales, date) VALUES ('Veggie Taco', 4, '2023-06-01');
INSERT INTO taco_orders (taco, sales, date) VALUES ('Veggie Taco', 15, '2023-06-02');
INSERT INTO taco_orders (taco, sales, date) VALUES ('Veggie Taco', 9, '2023-06-03');
To compare daily sales with the same day last year, you can use:
SELECT
date,
taco,
sales,
anyOrNull(sales) OVER (
PARTITION BY taco
ORDER BY date
RANGE BETWEEN 366 PRECEDING AND 366 PRECEDING
) AS sales_last_year
FROM taco_orders
ORDER BY date;
Double check the PRECEDING
number and make sure you are getting the value from the year before.
This query will output the sales for each date, along with the sales for the same date in the previous year.
┌─date───────┬──taco─────────┬─sales─┬─sales_last_year─┐
│ 2023-06-01 │ Chicken Taco │ 12 │ \N │
│ 2023-06-01 │ Beef Taco │ 8 │ \N │
│ 2023-06-01 │ Veggie Taco │ 4 │ \N │
│ 2023-06-02 │ Chicken Taco │ 4 │ \N │
│ 2023-06-02 │ Beef Taco │ 18 │ \N │
│ 2023-06-02 │ Veggie Taco │ 15 │ \N │
│ 2023-06-03 │ Chicken Taco │ 1 │ \N │
│ 2023-06-03 │ Beef Taco │ 5 │ \N │
│ 2023-06-03 │ Veggie Taco │ 9 │ \N │
│ 2024-06-01 │ Chicken Taco │ 20 │ 12 │
│ 2024-06-01 │ Beef Taco │ 10 │ 8 │
│ 2024-06-01 │ Veggie Taco │ 8 │ 4 │
│ 2024-06-02 │ Chicken Taco │ 10 │ 4 │
│ 2024-06-02 │ Beef Taco │ 15 │ 18 │
│ 2024-06-02 │ Veggie Taco │ 12 │ 15 │
│ 2024-06-03 │ Chicken Taco │ 5 │ 1 │
│ 2024-06-03 │ Beef Taco │ 7 │ 5 │
│ 2024-06-03 │ Veggie Taco │ 10 │ 9 │
└────────────┴───────────────┴───────┴─────────────────┘
Identifying the top N per group
Finding the top performers within each group is a common task. ClickHouse allows you to easily identify the top N entities per group using window functions. For example, to find the top 2 highest-sales days for each taco, you can use:
WITH ranked AS (
SELECT
ROW_NUMBER() OVER (PARTITION BY taco ORDER BY sales DESC) AS rank,
taco,
date,
sales
FROM taco_orders
)
SELECT *
FROM ranked
WHERE rank <= 2;
This query first assigns a rank to each date within each taco based on sales. It then selects only those dates whose rank is less than or equal to 2. This allows you to quickly identify the top days for each taco.
┌─rank─┬─taco───────────┬───date─────┬─sales─┐
│ 1 │ Beef Taco │ 2024-06-02 │ 15 │
│ 2 │ Beef Taco │ 2024-06-01 │ 10 │
│ 1 │ Chicken Taco │ 2024-06-01 │ 20 │
│ 2 │ Chicken Taco │ 2024-06-02 │ 10 │
│ 1 │ Veggie Taco │ 2024-06-02 │ 12 │
│ 2 │ Veggie Taco │ 2024-06-03 │ 10 │
└──────┴────────────────┴────────────┴───────┘
Detecting anomalies
Identifying outliers and anomalies in time series data is crucial for monitoring system or business health and detecting fraud. ClickHouse allows you to use window functions to compute z-scores, which measure how many standard deviations a value is from the mean. For instance, to detect anomalies in taco sales, you can use:
SELECT
date,
sales,
ABS(sales - AVG(sales) OVER w) / stddevPop(sales) OVER w AS z_score
FROM taco_orders
WINDOW w AS (
ORDER BY date
ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
)
ORDER BY date;
This query computes the z-score for each sales data point by comparing it to the average of the readings in a 7-day window centered on the current date. Readings with a high z-score (typically greater than 3) are likely to be anomalies and warrant further investigation.
┌──date──────┬─sales─┬──────────────z_score──────────────┐
│ 2024-06-01 │ 8 │ 0.9878783399072131 │
│ 2024-06-01 │ 10 │ 0.7150969419341942 │
│ 2024-06-01 │ 20 │ 1.879901969149606 │
│ 2024-06-02 │ 12 │ 0.03763549647474893 │
│ 2024-06-02 │ 15 │ 0.7567874686642695 │
│ 2024-06-02 │ 10 │ 0.2764328025752782 │
│ 2024-06-03 │ 10 │ 0.051502620262460275 │
│ 2024-06-03 │ 7 │ 0.7095748751868993 │
│ 2024-06-03 │ 5 │ 1.4142135623730951 │
└────────────┴───────┴───────────────────────────────────┘
Best practices for using window functions in ClickHouse
Optimize partitioning
Partitioning is crucial for window function performance. The PARTITION BY
clause determines how data is divided into groups. Choosing the right partition columns is essential for both query performance and result accuracy.
Partition columns should ideally have a relatively low cardinality, meaning they have a limited number of distinct values. This allows ClickHouse to efficiently distribute data across partitions and parallelize computation. However, if the cardinality is too low, partitions may become too large, negating the benefits of partitioning.
Conversely, if the partition key has very high cardinality, such as a unique identifier, data will be fragmented into many small partitions. This can lead to excessive overhead and slow down the query.
The best approach is to find a balance based on your specific data and query patterns. Experiment with different partition keys and measure the impact on performance and accuracy.
See our guide on When and how to choose a partition key.
Use appropriate frame clauses
Frame clauses define the subset of the partition visible to the window function. By default, the frame includes all rows from the start of the partition to the current row. However, this may not always be necessary or efficient, especially for large datasets.
To optimize performance, specify the smallest frame that meets your requirements. For example, if you only need to compare each row with the previous one, use ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
. This limits the amount of data processed for each row.
Be cautious with unbounded following frames (ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
), as they can lead to significant performance overhead. Unless you need to look ahead to future rows, try to use a bounded frame instead.
Combine with materialized views
If you frequently run the same window function queries on large datasets, consider pre-calculating the results and storing them in a materialized view. This can greatly speed up query times, as the expensive window function computations are done only once when the view is created or updated.
Materialized views are especially useful for complex queries involving multiple window functions or joins. By materializing intermediate results, you can avoid redundant computations and improve overall performance.
However, keep in mind that materialized views come with a maintenance cost. They need to be updated whenever the underlying data changes, which can impact write performance. Therefore, they are most suitable for scenarios where the data is relatively stable, and the queries are read-heavy.
Be cautious with large windows
Window functions can be computationally expensive, especially when the window frame is large. If you define a window that spans a significant portion of the dataset, the query may take a long time to complete or even fail due to resource constraints.
In such cases, consider whether you need exact results for all rows. If an approximation is sufficient, you can use approximate aggregate functions like approxQuantile
or approxDistinct
. These functions trade accuracy for performance by using probabilistic data structures to estimate results.
Another option is to break down the large window into smaller chunks and process them incrementally. For example, instead of calculating a global rank across the entire dataset, you can calculate local ranks within each partition and then combine them as needed.
Use the sorting key and partition key effectively
The sorting and partition key can significantly speed up window function queries by optimizing data sorting and filtering. When using the ORDER BY
clause, make sure to use the columns in the table’s sorting key.
Similarly, if you are using the PARTITION BY
clause, consider using the sorting key columns. This helps ClickHouse quickly read the data needed to perform the query.
Read our How to select a table engine and sorting key guide.
Consider alternatives
While window functions are a powerful tool for analytics, they may not always be the most efficient solution. In some cases, other ClickHouse features like arrays or nested data structures can achieve similar results with better performance.
For example, if you need to calculate a running total for a specific metric, you can store the metric values in an array and use the arraySum
function to compute the cumulative sum. This can be more efficient than using a window function, especially if the data is already stored in an array format.
Similarly, if you need to analyze hierarchical data, such as a tree of categories, you can use nested data structures to represent the hierarchy and use recursive queries to traverse it. This can be more natural and efficient than trying to simulate the hierarchy with window functions.
Of course, the choice between window functions and alternative approaches depends on your specific use case and data model. It's worth experimenting with different options and measuring their performance to find the best solution for your needs.
ClickHouse vs Snowflake window functions
While both ClickHouse and Snowflake support window functions, there are some notable differences in their implementations. Here's a comparison table highlighting the key features:
One of the main differences is in the support for RANGE
frames. While both databases support ROWS
frames, ClickHouse has limited support for RANGE
frames when it comes to date and time types. In contrast, Snowflake fully supports RANGE
frames, including intervals for date and time types, making it more flexible for time-series analysis.
Another difference is in the support for GROUPS
frames. Snowflake supports GROUPS
frames, which allow you to define the window frame based on groups of rows with the same values in the ORDER BY
clause. ClickHouse, on the other hand, does not currently support GROUPS
frames.
When it comes to aggregate functions, both databases support a wide range of functions that can be used with window functions. However, there are some differences in the specific functions available.
One notable difference is in the support for LAG
and LEAD
functions. Snowflake natively supports these functions, which allow you to access values from previous or subsequent rows in the window. In ClickHouse, you need to use a workaround involving the anyOrNull
function and a specific frame definition.
Snowflake also supports the EXCLUDE
clause, which allows you to exclude certain rows from the window frame based on different criteria (CURRENT ROW
, GROUP
, TIES
, or NO OTHERS
). This can be useful for advanced windowing scenarios. ClickHouse does not currently support the EXCLUDE
clause.
ClickHouse vs PostgreSQL window functions
ClickHouse and PostgreSQL both support window functions, but there are some differences in their implementations. Let's take a closer look at how they compare:
PostgreSQL has been a pioneer in implementing window functions, with support dating back to version 8.4 released in 2009. As a result, it has a mature and feature-rich implementation, including full support for RANGE
frames with date and time intervals, GROUPS
frames, and the EXCLUDE
clause.
ClickHouse, on the other hand, has a window function implementation that is relatively new, but it has quickly caught up to support most of the standard features.
One notable difference is in the support for LAG
and LEAD
functions. PostgreSQL natively supports these functions, making it easy to access values from previous or subsequent rows in the window. In ClickHouse, you need to use a workaround involving the anyOrNull
function and a specific frame definition, which can be less intuitive.
ClickHouse vs Databricks window functions
Databricks, built on Apache Spark, is another data platform that supports window functions. Let's see how it compares to ClickHouse:
Databricks and ClickHouse have a lot in common when it comes to window functions. Both support the standard SQL syntax, PARTITION BY
, ORDER BY
, and ROWS
frames. They also offer a wide range of aggregate functions that can be used with window functions.
However, there are a few key differences. Like PostgreSQL, Databricks fully supports RANGE
frames with date and time intervals, while ClickHouse has limitations in this area. Databricks also natively supports LAG
and LEAD
functions, which can simplify queries that need to access values from previous or subsequent rows.
Neither platform supports the GROUPS
frame or the EXCLUDE
clause, which are less commonly used features.
Limitations and workarounds for ClickHouse window functions
While ClickHouse's support for window functions is extensive, there are a few limitations to be aware of. However, most of these can be overcome with clever workarounds or alternative approaches.
Lack of native LAG
and LEAD
functions
The LAG
and LEAD
functions are commonly used to access values from preceding or following rows. In ClickHouse, you can achieve similar functionality using the anyOrNull
function with an appropriate ROWS
frame. For example:
SELECT
anyOrNull(value) OVER (
ORDER BY date ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
) AS prev_value
FROM table;
This workaround allows you to get the value from the previous row, similar to what LAG(value, 1)
would do in other databases.
No support for RANGE
frames with date and time types
While ClickHouse supports RANGE
frames, it doesn't allow using INTERVAL
syntax for defining the range. Instead, you need to use integer offsets representing the number of seconds. For instance, instead of:
RANGE BETWEEN INTERVAL 1 DAY PRECEDING AND CURRENT ROW
You would use:
RANGE BETWEEN 86400 PRECEDING AND CURRENT ROW
Here, 86400 represents the number of seconds in a day.
No support for the GROUPS
frame
The GROUPS
frame groups rows with the same values into a single frame. In most cases, similar results can be achieved using ROWS
or RANGE
frames with appropriate partitioning and ordering.
No EXCLUDE
clause
The EXCLUDE
clause allows you to exclude certain rows from the window frame. If you need this functionality, you'll need to implement the logic manually using subqueries or conditional aggregations.
No PERCENT_RANK()
function
Currently, ClickHouse does not support the PERCENT_RANK()
function, which is used to calculate percentiles. However, as illustrated in the percentiles example above, this can be accomplished using the RANK function.
Conclusion
ClickHouse has robust window function support that enables complex calculations across related rows. While certain functionalities have limitations and workarounds, the overall performance and feature set are more than enough for the vast majority of use cases.
When working with window functions, it's crucial to understand how partitioning and frame clauses affect the results and performance. Proper use of these clauses can significantly improve query efficiency, especially with large datasets.
ClickHouse's performance with window functions is generally excellent, thanks to its columnar storage and vectorized query execution. However, there are cases where the current implementation may face challenges, such as when working with very large windows or in distributed setups. In these situations, techniques like approximate aggregations or incremental processing can help mitigate the issues
As you explore window functions further, don't hesitate to experiment with different approaches and share your findings with the community.
You can get started with Propel's Serverless ClickHouse today and receive $15 in monthly credits forever. At any point, upgrade to pay-as-you-go, or contact us to learn more about our volume-based discounts. Visit our pricing page for details.