Remember, correlation does not imply causation.

ClickHouse function reference

contingency

Calculates the contingency coefficient between two columns. The contingency function is similar to the cramersV function but uses a different denominator in the square root calculation.

Syntax:

contingency(column1, column2)

Arguments:

  • column1 (any): The first column to compare.
  • column2 (any): The second column to compare.

Returns:

A value between 0 and 1, where a larger result indicates a closer association between the two columns. [Float64]

Example:

SELECT
	contingency(taco_type, salsa_type) AS taco_salsa_association
FROM
(
    SELECT 'Beef' AS taco_type, 'Mild' AS salsa_type
    UNION ALL
    SELECT 'Chicken' AS taco_type, 'Medium' AS salsa_type
    UNION ALL
    SELECT 'Vegetarian' AS taco_type, 'Hot' AS salsa_type
    UNION ALL
    SELECT 'Beef' AS taco_type, 'Medium' AS salsa_type
    UNION ALL
    SELECT 'Chicken' AS taco_type, 'Hot' AS salsa_type
    UNION ALL
    SELECT 'Vegetarian' AS taco_type, 'Mild' AS salsa_type
) AS taco_orders;

Result:

| taco_salsa_association |
|------------------------|
| 0.1643989873053573     |

In this example, we calculate the association between taco types and salsa types in taco orders.

cramersV

Calculates the [Cramér’s V statistic](https://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V#:~:text=In%20statistics%2C%20Cram%C3%A9r’s%20V%20(sometimes,by%20Harald%20Cram%C3%A9r%20in%201946.), which measures the strength of association between two categorical variables.

Syntax:

cramersV(x, y)

Arguments:

  • x (any): The first categorical variable.
  • y (any): The second categorical variable.

Returns:

A value between 0 and 1, where:

  • 0 indicates no association
  • 1 indicates perfect association

Return type: Float64

Example:

SELECT
	cramersV(taco_type, salsa_preference) AS association_strength
FROM
(
    SELECT 'Beef' AS taco_type, 'Mild' AS salsa_preference
    UNION ALL
    SELECT 'Chicken' AS taco_type, 'Medium' AS salsa_preference
    UNION ALL
    SELECT 'Vegetarian' AS taco_type, 'Hot' AS salsa_preference
    UNION ALL
    SELECT 'Beef' AS taco_type, 'Medium' AS salsa_preference
    UNION ALL
    SELECT 'Chicken' AS taco_type, 'Hot' AS salsa_preference
    UNION ALL
    SELECT 'Vegetarian' AS taco_type, 'Mild' AS salsa_preference
) AS taco_orders;

Result:

| association_strength |
|----------------------|
| 0.2886751345948129   |

In this example, we calculate the association strength between taco_type and salsa_preference. The result of 0.28 suggests a moderate association between these two variables in taco orders.

Cramér’s V is particularly useful for comparing the strength of association between pairs of categorical variables, even when they have different numbers of categories.

cramersVBiasCorrected

Calculates the bias-corrected Cramer’s V, a measure of association between two columns in a table.

Cramer’s V measures the strength of association between two categorical variables. This function uses a bias correction to provide a more accurate measure, especially for small sample sizes or when variables have many categories.

The bias-corrected version typically returns lower values compared to the uncorrected cramersV function, offering a more conservative and often more realistic estimate of the association.

Syntax:

cramersVBiasCorrected(column1, column2)

Arguments:

  • column1 (any): The first column to be compared.
  • column2 (any): The second column to be compared.

Returns:

A value between 0 and 1, where:

  • 0 indicates no association between the columns’ values
  • 1 indicates complete association

Return type: Float64

Example:

SELECT
	cramersV(taco_type, salsa_type) AS uncorrected,
	cramersVBiasCorrected(taco_type, salsa_type) AS bias_corrected
FROM
(
    SELECT 'Beef' AS taco_type, 'Mild' AS salsa_type
    UNION ALL
    SELECT 'Chicken' AS taco_type, 'Medium' AS salsa_type
    UNION ALL
    SELECT 'Vegetarian' AS taco_type, 'Hot' AS salsa_type
    UNION ALL
    SELECT 'Beef' AS taco_type, 'Medium' AS salsa_type
    UNION ALL
    SELECT 'Chicken' AS taco_type, 'Hot' AS salsa_type
    UNION ALL
    SELECT 'Vegetarian' AS taco_type, 'Mild' AS salsa_type
) AS taco_orders;

Result:

| uncorrected         | bias_corrected       |
|---------------------|----------------------|
| 0.2886751345948129  | 0                    |

In this example, we compare the association between taco types and salsa types in orders. The bias-corrected version shows no association, providing a more conservative estimate of the relationship between these variables.

The bias-corrected version is generally preferred, especially when dealing with smaller datasets or variables with many categories, as it provides a more accurate representation of the true association.

entropy

Calculates the Shannon entropy of a column of values.

Syntax:

entropy(val)

Arguments:

  • val (any type): Column of values.

Returns:

The Shannon entropy as a Float64.

Example:

SELECT
	entropy(filling) AS filling_entropy,
	entropy(quantity) AS quantity_entropy
FROM (
    SELECT 'beef' AS filling, 5 AS quantity
    UNION ALL
    SELECT 'beef' AS filling, 5 AS quantity
    UNION ALL
    SELECT 'chicken' AS filling, 3 AS quantity
    UNION ALL
    SELECT 'fish' AS filling, 2 AS quantity
    UNION ALL
    SELECT 'veggie' AS filling, 1 AS quantity
) AS taco_orders;

Result:

| filling_entropy    | quantity_entropy   |
|--------------------|--------------------|
| 1.9219280948873623 | 1.9219280948873623 |

In this example:

  • filling_entropy shows the entropy of taco fillings, indicating the diversity of choices.
  • quantity_entropy represents the entropy of order quantities, reflecting the variability in order sizes.

A higher entropy value suggests more diversity or randomness in the data, while a lower value indicates more uniformity or predictability.

The Shannon entropy is a measure of the average amount of information contained in each element of a set. It’s useful for analyzing the distribution and unpredictability of data in various fields, including information theory and data compression.

rankCorr

Calculates the rank correlation coefficient between two columns.

Syntax:

rankCorr(x, y)

Arguments:

  • x (Float32 or Float64): The first set of values.
  • y (Float32 or Float64): The second set of values.

Returns:

The rank correlation coefficient as a Float64 value ranging from -1 to +1.

  • A value close to +1 indicates a strong positive correlation.
  • A value close to -1 indicates a strong negative correlation.
  • A value close to 0 indicates little to no correlation.

Example:

SELECT
	rankCorr(taco_spiciness, customer_satisfaction) AS spice_satisfaction_correlation
FROM
(
    SELECT 5 AS taco_spiciness, 8 AS customer_satisfaction
    UNION ALL
    SELECT 7 AS taco_spiciness, 9 AS customer_satisfaction
    UNION ALL
    SELECT 6 AS taco_spiciness, 7 AS customer_satisfaction
    UNION ALL
    SELECT 8 AS taco_spiciness, 10 AS customer_satisfaction
) AS taco_reviews;

Result:

| spice_satisfaction_correlation |
|--------------------------------|
| 0.80                           |

In this example, we calculate the rank correlation between taco spiciness and customer satisfaction. The result of 0.8 suggests a strong positive correlation, between taco spiciness and customer satisfaction.

The function requires at least two non-null pairs of observations to compute the correlation. If there are fewer than two pairs, an exception will be thrown.

See also: Spearman’s rank correlation coefficient