Quantiles & percentiles
Calculate various types of quantiles and percentiles.
ClickHouse function reference
quantile
Computes an approximate quantile of a numeric data sequence.
Syntax:
Alias:
- median
Arguments:
level
(numeric, optional): Level of quantile. Constant floating-point number from 0 to 1. We recommend using a level value in the range of [0.01, 0.99]. Default value: 0.5. Atlevel=0.5
the function calculates the median.expr
(numeric,Date
, orDateTime
): Expression over the column values resulting in numeric data types,Date
, orDateTime
.
Returns:
Approximate quantile of the specified level.
Float64
for numeric data type input.Date
if input values have theDate
type.DateTime
if input values have theDateTime
type.
Example:
Result:
Alias:
- medianBFloat16
Arguments:
expr
(numeric): Column with numeric data. Supported types:Integer
,Float
.
Parameters:
level
(Float
, optional): Level of quantile. Possible values are in the range [0, 1]. Default value: 0.5.
Returns:
Approximate quantile of the specified level. Type: Float64
.
Example:
Result:
In this example:
median_price
calculates the median (0.5 quantile) of taco prices.q3_price
calculates the third quartile (0.75 quantile) of taco prices.median_quantity
calculates the median quantity of tacos sold.
Note that due to the bfloat16 approximation, results may slightly differ from exact quantile calculations.
quantileDD
Computes an approximate quantile of a sample with relative-error guarantees using a Distributed Deterministic (DD) sketch.
Syntax:
Arguments:
expr
(numeric): The column containing numeric data to compute the quantile from.
Parameters:
relative_accuracy
(Float64
): The relative accuracy of the quantile, between 0 and 1. Lower values increase accuracy but require more memory. Recommended value is 0.001 or higher.level
(Float64
, optional): The quantile level to compute, between 0 and 1. Default: 0.5 (median).
Returns:
Approximate quantile of the specified level. [Float64
]
Example:
Result:
In this example:
taco_price_75th_percentile
calculates the 75th percentile of taco prices with 1% relative accuracy.median_tacos_ordered
calculates the median number of tacos ordered with 1% relative accuracy.
The memory usage of the sketch depends on the data range and relative accuracy. It’s approximately log(max_value/min_value)/relative_accuracy
. Lower relative_accuracy
values or wider data ranges increase memory usage.
This function is useful for efficiently calculating approximate quantiles on large datasets where exact calculations would be too resource-intensive.
quantileDeterministic
Computes an approximate quantile of a numeric data sequence using a deterministic algorithm.
This function uses reservoir sampling with a reservoir size up to 8192 and a deterministic sampling algorithm. The result is deterministic for the same inputs and determinator.
Syntax:
Alias:
- medianDeterministic
Arguments:
level
(numeric, optional): Level of quantile. Constant float from 0 to 1. Default: 0.5 (median).expr
(numeric,Date
, orDateTime
): Expression over column values resulting in numeric,Date
, orDateTime
types.determinator
(numeric): Number whose hash is used instead of a random number generator to make the sampling deterministic.
Returns:
Approximate quantile of the specified level.
Float64
for numeric inputDate
for Date inputDateTime
for DateTime input
Example:
Result:
This example calculates the 75th percentile of taco prices, using the customer_id
as a determinator to ensure consistent results.
When using multiple quantile*
functions with different levels in a query, internal states are not combined, potentially reducing query efficiency. In such cases, consider using the quantiles
function instead.
quantileExact
Calculates the exact quantile of a numeric data sequence.
Syntax
Alias:
- medianExact
Arguments
level
(numeric): Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using a level value in the range of [0.01, 0.99]. Default value: 0.5. At level=0.5 the function calculates median.expr
(numeric,Date
, orDateTime
): Expression over the column values resulting in numeric data types, Date or DateTime.
Returns:
Quantile of the specified level.
Float64
for numeric data type input.Date
if input values have the Date type.DateTime
if input values have the DateTime type.
Performance notes:
The function:
- Calculates the exact quantile by fully sorting the data.
- Consumes O(n) memory, where n is the number of values passed.
- Is very effective for a small number of values.
When using multiple quantile*
functions with different levels in a query, the internal states are not combined (less efficient). In this case, use the quantiles
function instead.
Example
Result:
In this example, we calculate the median (50th percentile) of the numbers from 0 to 9.
quantileExactLow
Computes the exact quantile of a numeric data sequence, similar to quantileExact
, but with a different method for handling the median and other quantile levels.
Syntax
Alias:
- medianExactLow
Arguments
level
(numeric, optional): Level of quantile. Constant floating-point number from 0 to 1. We recommend using a level value in the range of [0.01, 0.99]. Default value: 0.5. At level=0.5 the function calculates the median.expr
(numeric,Date
, orDateTime
): Expression over the column values resulting in numeric data types, Date, or DateTime.
Returns:
Quantile of the specified level.
Float64
for numeric data type input.Date
if input values have the Date type.DateTime
if input values have the DateTime type.
Performance notes:
- All passed values are combined into an array, which is then fully sorted. The sorting algorithm’s complexity is O(N·log(N)), where N is the number of elements.
- For the median (level = 0.5):
- With an even number of elements, returns the lower of the two middle values.
- With an odd number of elements, returns the middle value.
- For other levels, returns the element at the index corresponding to
level * size_of_array
. - This function is similar to Python’s median_low implementation.
Example
Result:
Result:
When using multiple quantile*
functions with different levels in a query, the internal states are not combined, potentially reducing query efficiency. In such cases, consider using the quantiles
function instead.
quantileExactHigh
Calculates the exact high quantile of a numeric data sequence.
Syntax:
Alias:
- medianExactHigh
Arguments:
level
(numeric, optional): Level of quantile. Constant floating-point number from 0 to 1. We recommend using a level value in the range of [0.01, 0.99]. Default value: 0.5. At level=0.5 the function calculates the median.expr
(numeric,Date
, orDateTime
): Expression over the column values resulting in numeric data types, Date, or DateTime.
Returns:
Quantile of the specified level.
Float64
for numeric data type input.Date
if input values have the Date type.DateTime
if input values have the DateTime type.
Performance notes:
- All passed values are combined into an array, which is then fully sorted. The sorting algorithm’s complexity is O(N·log(N)), where N is the number of elements.
- For the median (level = 0.5):
- With an even number of elements, returns the higher of the two middle values.
- With an odd number of elements, returns the middle value.
- For other levels, returns the element at the index corresponding to
level * size_of_array
. - This function is similar to Python’s median_high implementation.
When using multiple quantile*
functions with different levels in a query, the internal states are not combined, potentially reducing query efficiency. In such cases, consider using the quantiles
function instead.
Example:
Result:
In this example, quantileExactHigh
calculates the 75th percentile of taco prices, representing the price threshold for more expensive tacos in the dataset.
quantileExactExclusive
Exactly computes the quantile of a numeric data sequence, using an exclusive method.
This function is equivalent to the PERCENTILE.EXC
Excel function.
Syntax:
Arguments:
level
(numeric, optional): Level of quantile. Constant floating-point number from (0, 1) - bounds not included. Default value: 0.5. At level=0.5 the function calculates the median.expr
(numeric,Date
, orDateTime
): Expression over the column values resulting in numeric data types,Date
, orDateTime
.
Returns:
Quantile of the specified level.
Float64
for numeric data type input.Date
if input values have theDate
type.DateTime
if input values have theDateTime
type.
To get the exact value, all passed values are combined into an array, which is then partially sorted. Therefore, the function consumes O(n) memory, where n is the number of values passed. However, for a small number of values, the function is very effective.
When using multiple quantileExactExclusive
functions with different levels in a query, the internal states are not combined (i.e., the query works less efficiently than it could). In this case, use the quantilesExactExclusive
function.
Example:
Result:
In this example, we calculate the 60th percentile of taco prices using the exclusive method. The result shows that 60% of tacos are priced below $9.59.
quantileExactInclusive
Exactly computes the quantile of a numeric data sequence.
This function is equivalent to the PERCENTILE.INC Excel function.
Syntax
Arguments
level
(Float64
, optional): Level of quantile. Possible values: [0, 1] — bounds included. Default value: 0.5. At level=0.5 the function calculates median.expr
(numeric,Date
, orDateTime
): Expression over the column values resulting in numeric data types,Date
, orDateTime
.
Returns
Quantile of the specified level.
Float64
for numeric data type input.Date
if input values have theDate
type.DateTime
if input values have theDateTime
type.
Example
Result:
In this example, we calculate the 60th percentile of taco prices from our sales data.
- To get the exact value, all passed values are combined into an array, which is then partially sorted. This consumes O(n) memory, where n is the number of values passed.
- When using multiple
quantileExactInclusive
functions with different levels in a query, the internal states are not combined, potentially reducing query efficiency. In such cases, consider using thequantilesExactInclusive
function instead.
quantileExactWeighted
Exactly computes the quantile of a numeric data sequence, taking into account the weight of each element.
Syntax:
Alias:
- medianExactWeighted
Arguments:
level
(numeric, optional): Level of quantile. Constant floating-point number from 0 to 1. We recommend using a level value in the range of [0.01, 0.99]. Default value: 0.5. At level=0.5 the function calculates median.expr
(numeric,Date
, orDateTime
): Expression over the column values resulting in numeric data types,Date
orDateTime
.weight
(numeric): Column with weights of sequence members. Weight is a number of value occurrences.
Returns:
Quantile of the specified level.
Float64
for numeric data type input.Date
if input values have the Date type.DateTime
if input values have the DateTime type.
Example:
Result:
In this example, we calculate the 75th percentile of taco prices, weighted by the order quantity. This gives us an idea of what price point represents an “expensive” taco in our sales data.
- This function uses less memory than
quantileExact
when passed values are frequently repeated, as it uses a hash table in its algorithm. - When using multiple
quantile*
functions with different levels in a query, the internal states are not combined, potentially reducing query efficiency. In such cases, consider using thequantiles
function instead.
quantileGK
Computes the quantile of a numeric data sequence using the Greenwald-Khanna algorithm. This function allows you to control the accuracy of the approximate quantile result.
Syntax:
Alias:
- medianGK
Arguments:
accuracy
(UInt8
): Accuracy of quantile. Constant positive integer. Larger values mean less error. For example, if set to 100, the computed quantile will have an error no greater than 1% with high probability.level
(numeric, optional): Level of quantile. Constant floating-point number from 0 to 1. Default value: 0.5. At level=0.5 the function calculates the median.expr
(numeric,Date
, orDateTime
): Expression over column values resulting in numeric data types,Date
, orDateTime
.
Returns:
Quantile of the specified level and accuracy.
Float64
for numeric data type input.Date
if input values have the Date type.DateTime
if input values have the DateTime type.
Example:
Result:
In this example, quantileGK
calculates the 25th percentile (first quartile) of taco prices with an accuracy setting of 100.
There’s a trade-off between accuracy and computational complexity. Higher accuracy requires more memory and computational resources but provides more precise results. Lower accuracy allows for faster and more memory-efficient computation but with slightly lower precision.
quantileInterpolatedWeighted
Computes the quantile of a numeric data sequence using linear interpolation, taking into account the weight of each element.
Syntax:
Alias:
- medianInterpolatedWeighted
Arguments:
level
(numeric, optional): Level of quantile. Constant floating-point number from 0 to 1. Default value: 0.5. Atlevel=0.5
the function calculates the median.expr
(numeric,Date
, orDateTime
): Expression over the column values resulting in numeric data types,Date
, orDateTime
.weight
(numeric): Column with weights of sequence members. Weight is a number of value occurrences.
Returns:
The quantile of the specified level.
- For numeric data type input:
Float64
- For
Date
input:Date
- For
DateTime
input:DateTime
Example:
Let’s say we have a table of taco orders with their prices and quantities:
Result:
In this example, we calculate the 75th percentile of taco prices, weighted by the quantity ordered. This gives us an idea of the price point at which 75% of the tacos (by quantity) are sold at or below.
When using multiple quantile*
functions with different levels in a query, the internal states are not combined, which may impact query efficiency. In such cases, consider using the quantiles
function instead.
quantiles
Calculates multiple quantiles of a numeric data sequence in one pass.
Syntax:
Arguments:
level1, level2, ...
(numeric): Levels of quantiles to compute. Each level should be a float between 0 and 1.expr
(numeric): Expression over the column values. Typically a column name.
Returns:
An array of computed quantiles corresponding to the specified levels.
Example:
Result:
In this example, we calculate the 25th, 50th (median), and 75th percentiles of taco prices from the taco_sales
table.
This function is more efficient than calling quantile
multiple times for different levels, as it computes all quantiles in a single pass over the data.
quantileTDigest
Computes an approximate quantile of a numeric data sequence using the t-digest algorithm.
Syntax:
Alias:
- medianTDigest
Arguments:
level
(numeric, optional): Level of quantile. Constant floating-point number from 0 to 1. We recommend using a level value in the range of [0.01, 0.99]. Default value: 0.5. At level=0.5 the function calculates median.expr
(numeric,Date
, orDateTime
): Expression over the column values resulting in numeric data types,Date
, orDateTime
.
Returns:
Approximate quantile of the specified level.
Float64
for numeric data type input.Date
if input values have theDate
type.DateTime
if input values have theDateTime
type.
Example:
Result:
This example calculates the 75th percentile of taco prices, which could be considered the price point for more expensive tacos.
- Memory consumption is log(n), where n is the number of values.
- The result depends on the order of running the query and is nondeterministic.
- Performance is lower than
quantile
orquantileTiming
, but it offers a better ratio of state size to precision. - When using multiple
quantile*
functions with different levels in a query, the internal states are not combined. In this case, use thequantiles
function for better efficiency.
quantileTDigestWeighted
Computes an approximate quantile of a numeric data sequence using the t-digest algorithm, taking into account the weight of each sequence member. The maximum error is 1%. Memory consumption is log(n), where n is the number of values.
Syntax:
Alias:
- medianTDigestWeighted
Arguments:
level
(numeric, optional): Level of quantile. Constant floating-point number from 0 to 1. Default value: 0.5. At level=0.5 the function calculates the median.expr
(numeric,Date
, orDateTime
): Expression over the column values resulting in numeric data types,Date
, orDateTime
.weight
(numeric): Column with weights of sequence elements. Weight is the number of value occurrences.
Returns:
- Approximate quantile of the specified level.
Return Type:
Float64
for numeric data type input.Date
if input values have theDate
type.DateTime
if input values have theDateTime
type.
Example:
This query calculates the 75th percentile of taco prices, weighted by the order quantity.
- The performance is lower than
quantile
orquantileTiming
. - The result depends on the order of running the query and is nondeterministic.
- When using multiple
quantile*
functions with different levels in a query, the internal states are not combined. - Not recommended for tiny data sets as it can lead to significant error. Consider using
quantileTDigest
for small datasets.
quantileTiming
Calculates the quantile of a numeric data sequence with determined precision. This function is optimized for distributions like web page loading times or backend response times.
Syntax:
Alias:
- medianTiming
Arguments:
- level (numeric, optional): Level of quantile. Constant float from 0 to 1. Default: 0.5 (median). Recommended range: [0.01, 0.99].
- expr (numeric): Expression over column values returning a numeric type.
Returns:
Quantile of the specified level. (Float32)
- Negative values lead to undefined behavior.
- Values greater than 30,000 are capped at 30,000.
- Result is deterministic (not affected by query processing order).
- For multiple quantile* functions with different levels, use quantiles for better efficiency.
- Precise if total values ≤ 5670.
- Precise if total values > 5670 and value < 1024ms.
- Otherwise, rounded to nearest multiple of 16ms.
Example:
Result:
This example calculates the 75th percentile of taco order response times.
If no values are passed (e.g., when using quantileTimingIf
), the function returns NaN.
quantileTimingWeighted
Calculates the weighted quantile of a numeric data sequence with a specified level of precision. This function is optimized for distributions like web page loading times or backend response times.
Syntax:
Alias:
- medianTimingWeighted
Arguments:
level
(numeric, optional): Level of quantile. Constant float from 0 to 1. Default: 0.5 (median). Recommended range: [0.01, 0.99].expr
(numeric): Expression over column values returning aFloat*
type number.- Negative values produce undefined behavior.
- Values greater than 30,000 (30 seconds) are capped at 30,000.
weight
(numeric): Column with weights of sequence elements (number of value occurrences).
Returns:
Quantile of the specified level. (Float32
)
- Results are deterministic (independent of query processing order).
- For page loading time quantiles, this function is more effective and accurate than
quantile
. - If no values are passed (when using
quantileTimingIf
), it returnsNaN
.
Calculation is accurate if:
- Total number of values ≤ 5670.
- Total number of values > 5670, but page loading time < 1024ms.
Otherwise, the result is rounded to the nearest multiple of 16ms.
Example:
Calculate the weighted median response time for taco orders:
Result:
This example calculates the weighted median response time for taco orders, where response_time
is the time taken to prepare an order, and order_count
is the number of times that particular response time occurred.
When using multiple quantile*
functions with different levels in a query, internal states are not combined, reducing efficiency. In such cases, consider using the quantiles
function instead.
quantilesTimingWeighted
Calculates the weighted quantiles of a numeric data sequence with a specified level of precision. This function is optimized for distributions like web page loading times or backend response times.
Syntax:
Arguments:
level1, level2, ...
(numeric): Levels of quantiles to compute. Optional. Constant floating-point numbers from 0 to 1. Recommended range: [0.01, 0.99].expr
(numeric): Expression over column values returning a numeric data type.weight
(numeric): Column with weights of sequence elements. Weight represents the number of value occurrences.
Returns:
An array of quantiles corresponding to the specified levels.
- For values greater than 30,000 (representing times > 30 seconds), the function assumes 30,000.
- Negative input values lead to undefined behavior.
- The calculation is accurate for up to 5,670 total values or for times less than 1,024ms when exceeding 5,670 values.
- For larger datasets, results are rounded to the nearest multiple of 16ms.
Example:
taco_response_quantiles |
---|
[700,1000,1000] |
Was this page helpful?