Correlation and information theory
Calculate correlation coefficients and information theory metrics.
Remember, correlation does not imply causation.
ClickHouse function reference
contingency
Calculates the contingency coefficient between two columns. The contingency function is similar to the cramersV function but uses a different denominator in the square root calculation.
Syntax:
contingency(column1, column2)
Arguments:
- column1 (any): The first column to compare.
- column2 (any): The second column to compare.
Returns:
A value between 0 and 1, where a larger result indicates a closer association between the two columns.
Return type: Float64
Example:
Result:
In this example, we calculate the association between taco types and salsa types in taco orders.
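The difference in denominators can be illustrated outside the database. The following is a minimal Python sketch of the textbook Pearson contingency coefficient, sqrt(chi2 / (chi2 + n)), and it is not necessarily how ClickHouse computes the value internally. The helper names and the sample orders are hypothetical.

```python
from collections import Counter
from math import sqrt

def chi_squared(pairs):
    """Pearson chi-squared statistic for a list of (a, b) category pairs."""
    n = len(pairs)
    joint = Counter(pairs)               # observed count of each (a, b) cell
    row = Counter(a for a, _ in pairs)   # marginal counts of the first variable
    col = Counter(b for _, b in pairs)   # marginal counts of the second variable
    chi2 = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n
            chi2 += (joint[(a, b)] - expected) ** 2 / expected
    return chi2

def contingency_coefficient(pairs):
    """Textbook contingency coefficient: sqrt(chi2 / (chi2 + n))."""
    chi2 = chi_squared(pairs)
    return sqrt(chi2 / (chi2 + len(pairs)))

# Hypothetical orders where each taco type always comes with the same salsa.
perfect = [("beef", "mild"), ("beef", "mild"), ("fish", "hot"), ("fish", "hot")]
# Hypothetical orders where taco type tells us nothing about the salsa.
independent = [("beef", "mild"), ("beef", "hot"), ("fish", "mild"), ("fish", "hot")]

print(contingency_coefficient(perfect))      # about 0.707
print(contingency_coefficient(independent))  # 0.0
```

Under this definition a perfectly associated 2x2 table gives about 0.707 rather than 1, because the denominator chi2 + n is always larger than chi2.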
cramersV
Calculates the [Cramér’s V statistic](https://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V), which measures the strength of association between two categorical variables.
Syntax:
cramersV(x, y)
Arguments:
- x (any): The first categorical variable.
- y (any): The second categorical variable.
Returns:
A value between 0 and 1, where:
- 0 indicates no association
- 1 indicates perfect association
Return type: Float64
Example:
Result:
In this example, we calculate the association strength between taco_type and salsa_preference. The result of 0.28 suggests a moderate association between these two variables in taco orders.
Cramér’s V is particularly useful for comparing the strength of association between pairs of categorical variables, even when they have different numbers of categories.
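To make the 0-to-1 scale concrete, here is a minimal Python sketch of the textbook definition, sqrt(chi2 / (n * min(r - 1, c - 1))), where r and c are the numbers of distinct categories. It is an illustration only, not ClickHouse's implementation, and the function name and sample data are hypothetical.

```python
from collections import Counter
from math import sqrt

def cramers_v(pairs):
    """Textbook Cramér's V for a list of (a, b) category pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    row = Counter(a for a, _ in pairs)
    col = Counter(b for _, b in pairs)
    chi2 = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n
            chi2 += (joint[(a, b)] - expected) ** 2 / expected
    k = min(len(row), len(col)) - 1
    if k == 0:
        # Assumption: with a single category the statistic is undefined.
        return float("nan")
    return sqrt(chi2 / (n * k))

# Hypothetical data: salsa is fully determined by taco type.
perfect = [("beef", "mild"), ("beef", "mild"), ("fish", "hot"), ("fish", "hot")]
# Hypothetical data: every combination occurs equally often.
independent = [("beef", "mild"), ("beef", "hot"), ("fish", "mild"), ("fish", "hot")]

print(cramers_v(perfect))      # 1.0
print(cramers_v(independent))  # 0.0
```

Dividing by min(r - 1, c - 1) is what lets values be compared across tables with different numbers of categories.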
cramersVBiasCorrected
Calculates the bias-corrected Cramér’s V, a measure of the strength of association between two categorical columns in a table. The bias correction gives a more accurate estimate, especially for small sample sizes or when variables have many categories.
The bias-corrected version typically returns lower values than the uncorrected cramersV function, offering a more conservative and often more realistic estimate of the association.
Syntax:
cramersVBiasCorrected(column1, column2)
Arguments:
- column1 (any): The first column to be compared.
- column2 (any): The second column to be compared.
Returns:
A value between 0 and 1, where:
- 0 indicates no association between the columns’ values
- 1 indicates complete association
Return type: Float64
Example:
Result:
In this example, we compare the association between taco types and salsa types in orders. The bias-corrected version shows no association, providing a more conservative estimate of the relationship between these variables.
The bias-corrected version is generally preferred, especially when dealing with smaller datasets or variables with many categories, as it provides a more accurate representation of the true association.
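The effect of the correction on a small sample can be sketched in Python. The block below uses the commonly cited correction (subtract the expected bias (r - 1)(c - 1) / (n - 1) from phi squared and shrink the category counts), which is an assumption about the formula and not a statement about ClickHouse's internals. The function name and sample orders are hypothetical.

```python
from collections import Counter
from math import sqrt

def cramers_v_both(pairs):
    """Return (uncorrected V, bias-corrected V) for (a, b) category pairs.

    Assumes more than two observations and at least two categories per variable.
    """
    n = len(pairs)
    joint = Counter(pairs)
    row = Counter(a for a, _ in pairs)
    col = Counter(b for _, b in pairs)
    chi2 = sum(
        (joint[(a, b)] - row[a] * col[b] / n) ** 2 / (row[a] * col[b] / n)
        for a in row
        for b in col
    )
    phi2 = chi2 / n
    r, c = len(row), len(col)
    v = sqrt(phi2 / min(r - 1, c - 1))
    # Bias correction: remove the association expected by chance alone.
    phi2_corr = max(0.0, phi2 - (r - 1) * (c - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    c_corr = c - (c - 1) ** 2 / (n - 1)
    v_corr = sqrt(phi2_corr / min(r_corr - 1, c_corr - 1))
    return v, v_corr

# Hypothetical sample of 10 orders with only a slight imbalance.
sample = (
    [("beef", "mild")] * 3 + [("beef", "hot")] * 2
    + [("fish", "mild")] * 2 + [("fish", "hot")] * 3
)
v, v_corr = cramers_v_both(sample)
print(v, v_corr)
```

For this sample the uncorrected value is about 0.2, while the corrected value is 0 because the observed imbalance is smaller than what chance alone would be expected to produce in 10 rows.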
entropy
Calculates the Shannon entropy of a column of values.
Syntax:
entropy(val)
Arguments:
- val (any type): Column of values.
Returns:
The Shannon entropy as a Float64.
Example:
Result:
In this example:
- filling_entropy shows the entropy of taco fillings, indicating the diversity of choices.
- quantity_entropy represents the entropy of order quantities, reflecting the variability in order sizes.
A higher entropy value suggests more diversity or randomness in the data, while a lower value indicates that a few values dominate, making the data more predictable.
The Shannon entropy is a measure of the average amount of information contained in each element of a set. It’s useful for analyzing the distribution and unpredictability of data in various fields, including information theory and data compression.
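As a rough illustration of the definition, the Python sketch below computes the entropy of a list of values from their relative frequencies. It assumes a base-2 logarithm, so the result is in bits. The function name and the sample fillings are hypothetical.

```python
from collections import Counter
from math import log2

def shannon_entropy(values):
    """Shannon entropy in bits: sum of p * log2(1 / p) over distinct values."""
    n = len(values)
    return sum((count / n) * log2(n / count) for count in Counter(values).values())

# Hypothetical taco fillings.
diverse = ["beef", "fish", "bean", "pork"]   # four equally likely fillings
skewed = ["beef", "beef", "beef", "fish"]    # one filling dominates
constant = ["beef", "beef", "beef", "beef"]  # no uncertainty at all

print(shannon_entropy(diverse))   # 2.0
print(shannon_entropy(skewed))    # about 0.811
print(shannon_entropy(constant))  # 0.0
```

Four equally likely values need two bits each, while a column with a single repeated value carries no information.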
rankCorr
Calculates the rank correlation coefficient between two columns.
Syntax:
rankCorr(x, y)
Arguments:
- x (Float32 or Float64): The first set of values.
- y (Float32 or Float64): The second set of values.
Returns:
The rank correlation coefficient as a Float64 value ranging from -1 to +1.
- A value close to +1 indicates a strong positive correlation.
- A value close to -1 indicates a strong negative correlation.
- A value close to 0 indicates little to no correlation.
Example:
Result:
In this example, we calculate the rank correlation between taco spiciness and customer satisfaction. The result of 0.8 suggests a strong positive correlation between the two.
The function requires at least two non-null pairs of observations to compute the correlation. If there are fewer than two pairs, an exception will be thrown.
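For intuition, Spearman's rank correlation can be sketched in Python as the Pearson correlation of the ranks, with tied values sharing their average rank. This is an illustrative sketch and not ClickHouse's implementation. The function names and the sample scores are hypothetical.

```python
from math import sqrt

def average_ranks(values):
    """1-based ranks, where tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the group of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rank_corr(x, y):
    """Pearson correlation of the ranks; needs at least two pairs.

    Raises ZeroDivisionError if either input is constant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length with at least two pairs")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)

# Hypothetical spiciness levels and satisfaction scores.
spiciness = [1.0, 2.0, 3.0, 4.0, 5.0]
satisfaction = [6.0, 8.0, 7.0, 10.0, 9.0]

print(spearman_rank_corr(spiciness, satisfaction))               # about 0.8
print(spearman_rank_corr(spiciness, [1.0, 4.0, 9.0, 16.0, 25.0]))  # 1.0
```

Because only the ordering matters, any strictly increasing relationship yields +1 even when it is not linear.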
See also: Spearman’s rank correlation coefficient