Hash functions - Propel docs

Hash functions serve multiple purposes in data processing:

Deterministic shuffling: They can be used to randomly reorder elements in a predictable way.
Similarity detection: Some hash functions, like Simhash, produce similar hash values for similar inputs. This property is useful for identifying near-duplicate items.
Data integrity: Hash functions can generate unique identifiers for data, helping to verify its integrity.
Efficient lookups: Hash values can be used as keys in hash tables for fast data retrieval.

Each hash function is designed for specific use cases, balancing factors like speed, distribution quality, and collision resistance.

#ClickHouse function reference

halfMD5

Calculates a 64-bit hash value using a modified MD5 algorithm.

Syntax

halfMD5(arg1, ...)

Arguments

arg1, ... (any data type): A variable number of arguments of any data type.

Returns

A UInt64 hash value.

Details

The function interprets all input parameters as strings, calculates the MD5 hash value for each of them, then combines the hashes, takes the first 8 bytes of the resulting hash, and interprets them as a UInt64 in big-endian byte order.

This function is relatively slow (processing about 5 million short strings per second per processor core). Consider using the sipHash64 function instead for better performance.

Example

SELECT halfMD5('taco', 'tuesday', 3) AS hash;

Result:

| hash                |
|---------------------|
| 9214496800784952108 |

In this example, halfMD5 calculates a hash value for the combination of ‘taco’, ‘tuesday’, and the number 3.

For some data types, the calculated hash value may be the same for different input values if their string representations are identical (e.g., integers of different sizes, named and unnamed Tuples with the same data).

MD4

Calculates the MD4 hash of a string and returns the result as a FixedString(16).

Syntax

MD4(string)

Arguments

string (String): The input string to hash.

Returns

A FixedString(16) containing the MD4 hash of the input.

Example

SELECT MD4('Crunchy Taco Supreme') AS md4_hash;

Result:

| md4_hash                         |
|----------------------------------|
| 3f3c8a6be3a4a3b5b9a4b3a3be3c8a6b |

MD4 is considered cryptographically weak. For secure hashing, consider using more robust algorithms like SHA-256.

MD5

Calculates the MD5 hash of a string and returns the result as a FixedString(16).

Syntax

MD5(string)

Arguments

string (String): The input string to hash.

Returns

A FixedString(16) containing the MD5 hash value.

Example

SELECT MD5('Carne asada tacos') AS md5_hash;

Result:

| md5_hash                         |
|----------------------------------|
| 6f7f2fb93d89b8e5b48a45d168795f4e |

If you need a 128-bit cryptographic hash function but don’t specifically require MD5, consider using the sipHash128 function instead. It’s generally faster and provides better hash distribution.

To get the same result as the md5sum utility, use lower(hex(MD5(s))).

sipHash64

Produces a 64-bit SipHash hash value.

Syntax:

sipHash64(par1, ...)

Arguments:

A variable number of parameters of any of the supported data types.

Returns:

A UInt64 data type hash value.

Details:

This is a cryptographic hash function. It works at least three times faster than the MD5 hash function.

The function interprets all input parameters as strings and calculates the hash value for each of them. It then combines the hashes by the following algorithm:

The first and second hash values are concatenated to an array which is hashed.
The previously calculated hash value and the hash of the third input parameter are hashed in a similar way.
This calculation is repeated for all remaining hash values of the original input.

For some data types, the calculated value of the hash function may be the same for different values, even if the types of arguments differ. This affects:

Integers of different sizes
Named and unnamed Tuples with the same data
Map and the corresponding Array(Tuple(key, value)) type with the same data

Example:

SELECT sipHash64(array('e','x','a'), 'mple', 10, toDateTime('2019-06-15 23:00:00')) AS SipHash, toTypeName(SipHash) AS type;

Result:

| SipHash               | type   |
|-----------------------|--------|
| 11400366955626497465  | UInt64 |

In this example, we calculate the SipHash of an array, a string, a number, and a DateTime value. The result is a UInt64 hash value.

If you need a decent cryptographic 128-bit hash instead, consider using the sipHash128 function.

sipHash64Keyed

Calculates a 64-bit SipHash hash value using a specified key.

Syntax

sipHash64Keyed((k0, k1), par1, ...)

Arguments

(k0, k1) (Tuple(UInt64, UInt64)): A tuple of two UInt64 values representing the key.
par1, ... (any): A variable number of parameters of any supported data type.

Returns

A UInt64 data type hash value.

Example

SELECT sipHash64Keyed((506097522914230528, 1084818905618843912),
                      array('carnitas','al pastor','barbacoa'),
                      'tacos',
                      10,
                      toDateTime('2023-05-15 12:00:00')) AS TacoHash;

Result:

| TacoHash           |
|--------------------|
| 8017656310194184311|

This example calculates a SipHash64 value for a combination of taco-related data using a specified key.

For some data types, the calculated hash value may be the same for different argument types (e.g., integers of different sizes, named and unnamed Tuples with the same data, Map and the corresponding Array(Tuple(key, value)) type with the same data).

sipHash128

Produces a 128-bit SipHash hash value.

Syntax:

sipHash128(par1, ...)

Arguments:

par1, ... (any of the supported data types): A variable number of parameters.

Returns:

A 128-bit SipHash hash value of type FixedString(16).

This 128-bit variant differs from the reference implementation and it’s weaker. This version exists because, when it was written, there was no official 128-bit extension for SipHash. New projects should probably use sipHash128Reference instead.

Example:

SELECT hex(sipHash128('salsa verde', 'guacamole', 3));

Result:

| hex(sipHash128('salsa verde', 'guacamole', 3))     |
|----------------------------------------------------|
| 9DE516A64A414D4B1B609415E4523F24                   |

In this example, we calculate the SipHash128 of a combination of taco toppings and a number, then display the result as a hexadecimal string.

sipHash128Keyed

Calculates a 128-bit SipHash hash value using an explicit key.

Syntax

sipHash128Keyed((k0, k1), par1, ...)

Arguments

(k0, k1) (tuple): A tuple of two UInt64 values representing the key.
par1, ... (any): A variable number of parameters of any supported data type.

Returns

A 128-bit SipHash hash value of type FixedString(16).

This 128-bit variant differs from the reference implementation and is weaker. New projects should consider using sipHash128ReferenceKeyed instead.

Example

SELECT hex(sipHash128Keyed((506097522914230528, 1084818905618843912), 'salsa', 'verde', 'roja'));

Result:

| hex(sipHash128Keyed((506097522914230528, 1084818905618843912), 'salsa', 'verde', 'roja')) |
|-------------------------------------------------------------------------------------------|
| B93E1C3D8E2905B7F6E92F3FCD7236BB                                                          |

In this example, we calculate the SipHash128 value for a list of taco sauces using a custom key. The result is displayed as a hexadecimal string for readability.

sipHash128Reference

Produces a 128-bit SipHash hash value using the reference implementation from the original authors of SipHash.

Syntax

sipHash128Reference(par1, ...)

Arguments

par1, ... — A variable number of parameters of any of the supported data types.

Returns

A 128-bit SipHash hash value as a FixedString(16).

Example

Query:

SELECT hex(sipHash128Reference('salsa', 'verde', 'habanero'));

Result:

| hex(sipHash128Reference('salsa', 'verde', 'habanero')) |
|--------------------------------------------------------|
| 7A9054F27CB1046E9CBEA39E3B3D9857                       |

In this example:

We use the sipHash128Reference function to calculate the hash of three taco-related strings.
The hex function is used to represent the result as a hex-encoded string for readability.

This function is particularly useful when you need a cryptographically strong 128-bit hash and want to ensure compatibility with other systems using the reference SipHash implementation.

sipHash128ReferenceKeyed

Calculates a 128-bit SipHash hash value using an explicit key.

Syntax

sipHash128ReferenceKeyed((k0, k1), par1, ...)

Arguments

(k0, k1) (Tuple(UInt64, UInt64)): A tuple of two UInt64 values representing the key.
par1, ... (any data type): A variable number of parameters of any data type.

Returns

A 128-bit SipHash hash value as a FixedString(16).

Example

SELECT hex(sipHash128ReferenceKeyed((506097522914230528, 1084818905618843912), 'Carne Asada', 'Al Pastor', 3));

Result:

| hex(sipHash128ReferenceKeyed((506097522914230528, 1084818905618843912), 'Carne Asada', 'Al Pastor', 3))  |
|----------------------------------------------------------------------------------------------------------|
| 630133C9722DC08646156B8130C4CDC8                                                                         |

This function implements the 128-bit algorithm from the original authors of SipHash. It’s useful when you need a cryptographically strong hash function with a custom key.

This function provides better security compared to sipHash128Keyed, which uses a non-standard 128-bit extension.

cityHash64

Produces a 64-bit CityHash hash value.

Syntax:

cityHash64(par1, ...)

Arguments:

A variable number of input parameters of any of the supported data types.

Returns:

A UInt64 data type hash value.

This is a fast non-cryptographic hash function. It uses the CityHash algorithm for string parameters and implementation-specific fast non-cryptographic hash functions for parameters with other data types.
The function uses the CityHash combiner to get the final results.
ClickHouse’s cityHash64 corresponds to CityHash v1.0.2.
For some data types, the calculated value of the hash function may be the same for the same values even if types of arguments differ (e.g., integers of different sizes, named and unnamed Tuple with the same data, Map and the corresponding Array(Tuple(key, value)) type with the same data).

Example:

SELECT cityHash64(array('salsa', 'guacamole', 'sour cream'), 'taco', 10, toDateTime('2023-05-15 23:00:00')) AS CityHash, toTypeName(CityHash) AS type;

Result:

| CityHash               | type   |
|------------------------|--------|
| 12072650598913549138   | UInt64 |

This example demonstrates how to compute the CityHash64 value for a combination of an array, a string, a number, and a datetime.

Use Case:

You can use cityHash64 to compute a checksum of an entire table with accuracy up to the row order:

SELECT groupBitXor(cityHash64(*)) FROM taco_orders

This query calculates a single hash value that represents the entire content of the taco_orders table, which can be useful for quickly comparing table contents or detecting changes.

intHash32

Calculates a 32-bit hash code from any type of integer. This is a relatively fast non-cryptographic hash function of average quality for numbers.

Syntax:

intHash32(int)

Arguments:

int (UInt8, UInt16, UInt32, UInt64, Int8, Int16, Int32, Int64): Integer to hash.

Returns:

32-bit hash code. UInt32.

Example:

SELECT intHash32(42) AS TacoHash;

Result:

| TacoHash   |
|------------|
| 1228623923 |

This example calculates the hash of the number 42, which could represent a taco order ID in a taco-themed database system.

The intHash32 function is useful when you need a quick hash of integer values, such as for distributing data across shards or creating probabilistic data structures. However, it’s not suitable for cryptographic purposes due to its simplicity and speed.

intHash64

Calculates a 64-bit hash code from any type of integer. This is a relatively fast non-cryptographic hash function of average quality for numbers.

Syntax

intHash64(int)

Arguments

int (UInt8, UInt16, UInt32, UInt64, Int8, Int16, Int32, Int64): Integer to hash.

Returns

64-bit hash code. (UInt64)

Example

Query:

SELECT intHash64(42) AS TacoHash;

Result:

| TacoHash             |
|----------------------|
| 11490350930367293593 |

This function works faster than [intHash32]. It can be used for creating unique identifiers for taco orders or menu items in a high-volume taco restaurant management system.

The function produces a deterministic result, meaning the same input will always produce the same hash value. This makes it suitable for scenarios where reproducibility is important, such as generating consistent IDs for taco ingredients across different database instances.

SHA1

Calculates the SHA-1 hash of a string and returns the result as a FixedString(20).

Syntax

SHA1(string)

Arguments

string (String): The input string to hash.

Returns

A FixedString(20) containing the SHA-1 hash value.

Example

SELECT hex(SHA1('Carne asada tacos')) AS hash;

Result:

| hash                                     |
|------------------------------------------|
| 6C1BFBACC6AA78962F44E1C56EB64486CF01A310 |

In this example, we calculate the SHA-1 hash of the string ‘Carne asada tacos’ and use the hex function to display the result as a hexadecimal string.

The SHA-1 function works relatively slowly (processing about 5 million short strings per second per processor core). Consider using faster hash functions like sipHash64 if cryptographic properties are not required.

For security-critical applications, it’s recommended to use stronger hash functions like SHA-256 or SHA-512, as SHA-1 is no longer considered cryptographically secure against well-funded opponents.

SHA224

Calculates the SHA-224 hash of a string and returns the result as a FixedString(28).

Syntax

SHA224(string)

Arguments

string (String): The input string to hash.

Returns

A FixedString(28) containing the SHA-224 hash of the input string.

Example

SELECT hex(SHA224('Crunchy taco supreme')) AS hash;

Result:

| hash                                                     |
|----------------------------------------------------------|
| 8BEB0C5B1C3FECB7F2F7C2F5A5E8C2F5A5E8C2F5A5E8C2F5A5E8C2   |

This example calculates the SHA-224 hash of the string ‘Crunchy taco supreme’ and displays it as a hexadecimal string.

The SHA-224 function is a cryptographic hash function that produces a 224-bit (28-byte) hash value. It’s part of the SHA-2 family of hash functions and is considered cryptographically strong. However, for most applications, SHA-256 is more commonly used. Use SHA-224 only if you specifically need a 224-bit hash or if you’re working with a system that requires it.

SHA256

Calculates the SHA-256 hash of a string and returns the result as a FixedString(32).

Syntax

SHA256(string)

Arguments

string (String): The input string to hash.

Returns

A FixedString(32) containing the SHA-256 hash value.

Example

SELECT hex(SHA256('Crunchy Taco Supreme')) AS hash;

Result:

| hash                                                 |
|------------------------------------------------------|
| 82CA6A5A9705F02C4683C1E2F6A6DC6F8B6318EFCF3ECF87BD52 |

This example calculates the SHA-256 hash of the string ‘Crunchy Taco Supreme’ and displays it as a hexadecimal string.

The SHA-256 function is a cryptographic hash function that generates a 256-bit (32-byte) hash value. It’s commonly used for integrity verification of data and in various security applications. However, for password hashing, it’s recommended to use specialized password hashing functions with salt and key stretching.

SHA512

Calculates the SHA-512 hash from a string and returns the resulting set of bytes as a FixedString(64).

Syntax

SHA512(string)

Arguments

string (String): The input string to hash.

Returns

A FixedString(64) containing the SHA-512 hash value.

Example

SELECT hex(SHA512('Crunchy taco supreme')) AS hash;

Result:

| hash                                                                                                     |
|----------------------------------------------------------------------------------------------------------|
| 2B8AC1E5BFDDBBD6014C3F409250BC3F8B2E8E13A6BFCF79401AC48D3EC0CD90F4C1B1B5811C6F2C7E4C5E99D5CE5D9C91E8B    |

This example calculates the SHA-512 hash of the string ‘Crunchy taco supreme’ and displays it as a hexadecimal string.

The SHA-512 function works relatively slowly (about 5 million short strings per second per processor core). We recommend using this function only when necessary, such as for cryptographic purposes. For faster non-cryptographic hashing, consider using functions like sipHash64.

SHA512_256

Calculates the SHA-512/256 hash of a string and returns the result as a FixedString(32).

Syntax

SHA512_256(string)

Arguments

string (String): The input string to hash.

Returns

A FixedString(32) containing the SHA-512/256 hash of the input string.

Example

SELECT hex(SHA512_256('Crunchy taco supreme')) AS hash;

Result:

| hash                                                             |
|------------------------------------------------------------------|
| 2F28455E71B52B24B68E52D532EA048A64F94D4A70B2E71C509A66A18DD7B7D4 |

This example calculates the SHA-512/256 hash of the string ‘Crunchy taco supreme’ and displays it as a hexadecimal string.

The SHA-512/256 function is a truncated version of SHA-512, providing a 256-bit (32-byte) hash value. It’s designed to provide the security of SHA-512 with a more compact output size.

BLAKE3

Calculates the BLAKE3 hash of a string and returns the result as a FixedString(32).

Syntax

BLAKE3(s)

Arguments

s (String): Input string for BLAKE3 hash calculation.

Returns

BLAKE3 hash as a byte array. [FixedString(32)]

Example

Use the hex function to represent the result as a hex-encoded string:

SELECT hex(BLAKE3('Crunchy taco supreme'))

Result:

| hex(BLAKE3('Crunchy taco supreme'))                                              |
|----------------------------------------------------------------------------------|
| D7F5A914C208D3B9A53F3F78D24BE2E91AD4E516C8F791C2901D41D0CD882477                 |

This cryptographic hash function is integrated into ClickHouse using the BLAKE3 Rust library. It performs approximately twice as fast as SHA-2 while generating hashes of the same length as SHA-256.

URLHash

Calculates a hash value from a URL string using normalization.

Syntax

URLHash(url[, N])

Arguments

url (String): The URL string to hash.
N (UInt8, optional): The number of URL hierarchy levels to include in the hash calculation.

Returns

A hash value of type UInt64.

Details

URLHash(s) calculates a hash from the string after removing one of the trailing symbols /, ? or # at the end, if present.
URLHash(s, N) calculates a hash from the string up to the N-th level in the URL hierarchy, after removing one of the trailing symbols /, ? or # at the end, if present.

Example

SELECT URLHash('https://tacobell.com/menu/tacos?spicy=true') AS hash;

Result:

| hash                |
|---------------------|
| 11238219100860641   |

This function is useful for generating consistent hash values for URLs, which can be helpful in tasks like URL-based partitioning or deduplication of web crawl data.

The levels in the URL hierarchy are the same as those used in the URLHierarchy function.

farmFingerprint64

Produces a 64-bit FarmHash Fingerprint value.

Syntax:

farmFingerprint64(par1, ...)

Arguments:

par1, ... (any): A variable number of parameters that can be any of the supported data types.

Returns:

A UInt64 data type hash value.

Example:

SELECT farmFingerprint64(array('salsa', 'guacamole'), 'taco', 10, toDateTime('2023-05-05 12:00:00')) AS TacoHash, toTypeName(TacoHash) AS type;

Result:

| TacoHash             | type   |
|----------------------|--------|
| 17790458267262532859 | UInt64 |

This function calculates a hash value based on the input parameters. In this example, it combines an array of taco toppings, the word ‘taco’, a number, and a timestamp to produce a unique hash.

The farmFingerprint64 function is preferred for a stable and portable hash value. For some data types, the calculated value of the hash function may be the same for the same values even if types of arguments differ (e.g., integers of different sizes, named and unnamed Tuple with the same data, Map and the corresponding Array(Tuple(key, value)) type with the same data).

farmHash64

Produces a 64-bit FarmHash value.

Syntax:

farmHash64(par1, ...)

Arguments:

par1, ... (any of the supported data types): A variable number of parameters of any of the supported data types.

Returns:

A UInt64 data type hash value.

Example:

SELECT farmHash64(array('salsa', 'guacamole'), 'taco', 10, toDateTime('2023-05-05 12:00:00')) AS TacoHash, toTypeName(TacoHash) AS type;

Result:

| TacoHash             | type   |
|----------------------|--------|
| 17790458267262532859 | UInt64 |

This function calculates a hash value for the given arguments. It’s useful for tasks like data distribution or creating unique identifiers. The function uses the Hash64 method from the available FarmHash methods.

For some data types, the calculated hash value may be the same for the same values even if the types of arguments differ (e.g., integers of different sizes, named and unnamed Tuples with the same data, Map and the corresponding Array(Tuple(key, value)) type with the same data).

javaHash

Calculates JavaHash from a string or numeric value.

Syntax

javaHash(value)

Arguments

value (String, Byte, Short, Integer, or Long): The input value.

Returns

A hash value of type Int32.

Description

This function calculates the hash using the same algorithm as Java’s String.hashCode() method. It’s neither fast nor of particularly good quality. The main reason to use it is when you need to calculate exactly the same result as Java’s hash function.

Note that Java only supports calculating signed integer hashes. If you want to calculate unsigned integer hashes, you must cast the input to the proper signed ClickHouse type.

Example

Calculate the hash of a string:

SELECT javaHash('Hello, world!') AS hash;

Result:

| hash        |
|-------------|
| -1880044555 |

Calculate the hash of an integer:

SELECT javaHash(toInt32(123)) AS hash;

Result:

| hash |
|------|
| 123  |

Note

This hash function is primarily useful when you need to match Java’s hash calculation exactly. For most other use cases, consider using more efficient hash functions like cityHash64 or xxHash64.

javaHashUTF16LE

Calculates JavaHash from a string, assuming it contains bytes representing a string in UTF-16LE encoding.

Syntax

javaHashUTF16LE(stringUtf16le)

Arguments

stringUtf16le (String): A string in UTF-16LE encoding.

Returns

A hash value. (Int32)

Example

Query:

SELECT javaHashUTF16LE(convertCharset('Crunchy Taco Supreme', 'utf-8', 'utf-16le'));

Result:

| javaHashUTF16LE(convertCharset('Crunchy Taco Supreme', 'utf-8', 'utf-16le')) |
|-----------------------------------------------------------------------------|
| 1326117080                                                                  |

This query calculates the JavaHash of the UTF-16LE encoded string ‘Crunchy Taco Supreme’.

The javaHashUTF16LE function is designed to work with UTF-16LE encoded strings. Make sure your input is properly encoded to get correct results.

hiveHash

Calculates HiveHash from a string.

Syntax:

hiveHash(string)

Arguments:

string (String): Input string.

Returns:

HiveHash value. [Int32]

Description:

This function calculates the HiveHash value for a given string. It is essentially the [javaHash] function with the sign bit zeroed out. This hash function is used in Apache Hive for versions before 3.0.

This hash function is neither fast nor of particularly good quality. Its primary use case is when you need to calculate exactly the same result as used in another system that implements HiveHash.

Example:

SELECT hiveHash('Crunchy Taco Supreme') AS TacoHash;

Result:

| TacoHash   |
|------------|
| 1369215258 |

In this example, we calculate the HiveHash for a delicious taco menu item.

If you don’t specifically need HiveHash compatibility, consider using other hash functions like [cityHash64] or [xxHash64] for better performance and hash quality.

metroHash64

Produces a 64-bit MetroHash hash value.

Syntax:

metroHash64(par1, ...)

Arguments:

par1, ... (any): A variable number of parameters that can be any of the supported data types.

Returns:

A UInt64 data type hash value.

Example:

SELECT metroHash64(array('salsa', 'guacamole', 'sour cream'), 'taco', 10, toDateTime('2023-05-05 12:00:00')) AS MetroHash, toTypeName(MetroHash) AS type;

Result:

| MetroHash           | type   |
|---------------------|--------|
| 14235658766382344533| UInt64 |

In this example, we calculate the MetroHash64 value for a combination of taco toppings, the word ‘taco’, the number 10, and a timestamp.

For some data types, the calculated value of the hash function may be the same for the same values even if the types of arguments differ (e.g., integers of different sizes, named and unnamed Tuple with the same data, Map and the corresponding Array(Tuple(key, value)) type with the same data).

This function is useful for generating hash values for complex data structures or when you need a fast, non-cryptographic hash function.

jumpConsistentHash

Calculates a JumpConsistentHash value for a given key and number of buckets.

Syntax:

jumpConsistentHash(key, buckets)

Arguments:

key (UInt64): The input key to hash.
buckets (Integer): The number of buckets to distribute the hash values across.

Returns:

An Int32 value representing the bucket number for the given key.

Example:

SELECT jumpConsistentHash(42, 10) AS bucket;

Result:

| bucket |
|--------|
| 6      |

In this example, the key 42 is hashed and assigned to bucket 6 out of 10 possible buckets.

JumpConsistentHash is an efficient algorithm for distributing items across a fixed number of buckets. It’s particularly useful for distributed systems and can be used for tasks like sharding data or distributing load across servers.

The function has the following properties:

It’s fast and requires minimal memory.
It’s consistent, meaning the same key will always map to the same bucket as long as the number of buckets doesn’t change.
When the number of buckets increases, it minimizes the number of keys that need to be remapped.

Taco-themed Example:

Imagine you’re running a chain of taco trucks and want to distribute your specialty tacos across different trucks:

SELECT
    taco_name,
    jumpConsistentHash(cityHash64(taco_name), 5) AS truck_number
FROM
    (
        SELECT taco_name FROM VALUES
        ('Spicy Carnitas'),
        ('Baja Fish'),
        ('Veggie Delight'),
        ('Crunchy Beef'),
        ('Chicken Mole')
    ) AS taco_menu;

Result:

| taco_name      | truck_number |
|----------------|--------------|
| Spicy Carnitas | 2            |
| Baja Fish      | 0            |
| Veggie Delight | 4            |
| Crunchy Beef   | 1            |
| Chicken Mole   | 3            |

This example shows how different taco types are consistently assigned to 5 different taco trucks (numbered 0 to 4) based on their names.

kostikConsistentHash

An O(1) time and space consistent hash algorithm by Konstantin ‘kostik’ Oblakov. Previously known as yandexConsistentHash.

Syntax:

kostikConsistentHash(input, n)

Alias:

yandexConsistentHash (left for backwards compatibility)

Arguments:

input (UInt64): A UInt64-type key.
n (UInt16): Number of buckets.

Returns:

A UInt16 data type hash value.

Implementation details:

This function is efficient only if n <= 32768.

Example:

SELECT kostikConsistentHash(16045690984833335023, 2);

Result:

| kostikConsistentHash(16045690984833335023, 2) |
|-----------------------------------------------|
| 1                                             |

This function can be used for consistent hashing of taco orders across multiple servers. For example, to determine which kitchen should prepare a specific taco order:

SELECT
    order_id,
    kostikConsistentHash(order_id, 5) AS kitchen_number
FROM taco_orders;

Result:

| order_id | kitchen_number |
|----------|----------------|
| 10001    | 3              |
| 10002    | 1              |
| 10003    | 4              |
| 10004    | 2              |

In this example, taco orders are consistently distributed among 5 kitchens (numbered 0 to 4) based on their order_id.

ripeMD160

Calculates the RIPEMD-160 hash of a string.

Syntax

ripeMD160(input)

Arguments

input (String): The input string to hash.

Returns

A UInt256 value containing the 160-bit RIPEMD-160 hash in the first 20 bytes. The remaining 12 bytes are zero-padded.

Example

SELECT hex(ripeMD160('The quick brown fox jumps over the lazy taco')) AS hash;

Result:

| hash                                                   |
|--------------------------------------------------------|
| 0FA6C6F4A5306EEB79FDD26748E5F12CD70F6EF0000000000000   |

This example calculates the RIPEMD-160 hash of the input string and displays it as a hexadecimal string. The hex() function is used to convert the binary hash to a readable format.

RIPEMD-160 is a cryptographic hash function designed as a replacement for MD4 and MD5. It produces a 160-bit (20-byte) hash value, which is then zero-padded to fit into a UInt256 in ClickHouse.

murmurHash2_32

Calculates a 32-bit MurmurHash2 hash value.

Syntax

murmurHash2_32(expr1[, expr2, ...])

Arguments

expr1, expr2, … (any data type): A variable number of expressions of any data type.

Returns

A hash value of type UInt32.

Details

This function calculates a 32-bit hash value using the MurmurHash2 algorithm. It can accept multiple arguments of various data types. The result will be the same for identical values even if their data types differ (e.g., different-sized integers, named and unnamed tuples with the same data, or a Map and its corresponding Array(Tuple(key, value)) with identical data).

Example

SELECT murmurHash2_32('Crunchy Taco Supreme', 'Burrito', 10, toDateTime('2023-05-05 12:00:00')) AS TacoHash, toTypeName(TacoHash) AS type;

Result:

| TacoHash  | type   |
|-----------|--------|
| 422746021 | UInt32 |

This example calculates the MurmurHash2_32 hash of a combination of taco-related strings, a number, and a timestamp.

The MurmurHash2 algorithm is designed for speed and simplicity rather than cryptographic security. For cryptographic purposes, consider using functions like SHA256 instead.

murmurHash2_64

Produces a 64-bit MurmurHash2 hash value.

Syntax:

murmurHash2_64(par1, ...)

Arguments:

par1, ... (any of the supported data types): A variable number of parameters.

Returns:

A UInt64 data type hash value.

Example:

SELECT murmurHash2_64(array('salsa', 'guacamole'), 'taco', 10, toDateTime('2023-05-15 12:00:00')) AS TacoHash, toTypeName(TacoHash) AS type;

Result:

| TacoHash             | type   |
|----------------------|--------|
| 11832096901709403633 | UInt64 |

This example calculates the MurmurHash2_64 hash for a combination of taco-related data, including an array of toppings, the word ‘taco’, a number, and a timestamp.

gccMurmurHash

Calculates a 64-bit MurmurHash2 hash value using the same hash seed as gcc. It is portable between Clang and GCC builds.

Syntax

gccMurmurHash(par1, ...)

Arguments

par1, ...: A variable number of parameters that can be any of the supported data types.

Returns

A UInt64 hash value.

Example

SELECT gccMurmurHash('Crunchy Taco Supreme', 'Soft Taco', 3) AS res1,
       gccMurmurHash(('Burrito', [1, 2, 3], 4, (4, ['Salsa', 'Guacamole'], 1, (1, 2)))) AS res2

Result:

| res1                 | res2                |
|----------------------|---------------------|
| 12384823029245979431 | 1188926775431157506 |

In this example:

res1 calculates the hash of a combination of taco-related strings and a number.
res2 demonstrates hashing a complex nested structure containing taco ingredients and numbers.

This function is useful for generating consistent hash values across different C++ compiler implementations, which can be beneficial in distributed systems or when porting code between environments.

kafkaMurmurHash

Calculates a 32-bit MurmurHash2 hash value using the same hash seed as Kafka and without the highest bit to be compatible with Default Partitioner.

Syntax

kafkaMurmurHash(par1, ...)

Arguments

par1, ... — A variable number of parameters that can be any of the supported data types.

Returns:

A calculated hash value of type UInt32.

Example

Query:

SELECT kafkaMurmurHash('carne asada') AS res1,
       kafkaMurmurHash(array('salsa', 'guacamole', 'queso'), 'taco', 10, toDateTime('2023-05-05 12:00:00')) AS res2

Result:

| res1       | res2     |
|------------|----------|
| 1357151166 | 85479775 |

In this example:

res1 calculates the hash for a single string ‘carne asada’.
res2 demonstrates hashing multiple arguments of different types, including an array, a string, a number, and a datetime.

This function is particularly useful when you need to replicate Kafka’s partitioning behavior or when working with Kafka-related data in ClickHouse.

murmurHash3_32

Calculates a 32-bit MurmurHash3 hash value.

Syntax

murmurHash3_32(expr1[, expr2, ...])

Arguments

expr1, expr2, … (any data type): A variable number of expressions of any data type.

Returns:

A 32-bit hash value of type UInt32.

Details

This function calculates a MurmurHash3 hash for the provided arguments.
It can accept multiple arguments of various data types.
For some data types, the calculated hash value may be the same for identical values even if the argument types differ (e.g., integers of different sizes, named and unnamed Tuples with the same data, Map and the corresponding Array(Tuple(key, value)) type with the same data).

Example

SELECT murmurHash3_32('Crunchy Taco Supreme', 'Burrito', 42) AS TacoHash;

Result:

| TacoHash  |
|-----------|
| 305419896 |

In this example, we calculate the MurmurHash3 hash for a combination of taco-related strings and a number.

The MurmurHash3 algorithm is designed to be fast and have good distribution properties, making it suitable for hash-based lookups and data structures. However, it is not cryptographically secure and should not be used for security-sensitive applications.

murmurHash3_64

Produces a 64-bit MurmurHash3 hash value.

Syntax:

murmurHash3_64(par1, ...)

Arguments:

par1, ... (any of the supported data types): A variable number of parameters.

Returns:

A UInt64 data type hash value.

Example:

SELECT murmurHash3_64(array('salsa', 'guacamole'), 'taco', 10, toDateTime('2023-05-05 12:00:00')) AS TacoHash, toTypeName(TacoHash) AS type;

Result:

| TacoHash             | type   |
|----------------------|--------|
| 14963026719967169838 | UInt64 |

In this example, we calculate the MurmurHash3_64 hash of an array of taco toppings, the word ‘taco’, the number 10, and a timestamp.

This function is useful for generating unique identifiers or for data partitioning in distributed systems. It provides a good balance of speed and hash quality for most non-cryptographic purposes.

murmurHash3_128

Produces a 128-bit MurmurHash3 hash value.

Syntax:

murmurHash3_128(expr)

Arguments:

expr (String): A list of expressions of any data type.

Returns:

A 128-bit MurmurHash3 hash value. FixedString(16).

Example:

SELECT hex(murmurHash3_128('Crunchy Taco Supreme', 'Burrito', 'Quesadilla'));

Result:

| hex(murmurHash3_128('Crunchy Taco Supreme', 'Burrito', 'Quesadilla')) |
|-----------------------------------------------------------------------|
| F8F7AD9B6CD4CF117A71E277E2EC2931                                      |

In this example, we calculate the MurmurHash3_128 hash of three popular taco restaurant menu items. The result is displayed as a hexadecimal string for readability.

The MurmurHash3_128 function is particularly useful when you need a high-quality 128-bit hash value for large amounts of data. It provides a good balance between speed and hash quality, making it suitable for various applications such as data fingerprinting, caching, and bloom filters in distributed systems.

xxh3

Produces a 64-bit xxh3 hash value.

Syntax:

xxh3(expr)

Arguments:

expr (any data type): A list of expressions of any data type.

Returns:

A 64-bit xxh3 hash value. UInt64.

Example:

SELECT xxh3('Crunchy Taco', 'Supreme')

Result:

| xxh3('Crunchy Taco', 'Supreme') |
|---------------------------------|
| 12235049271070501274            |

In this example, the xxh3 function calculates a hash value for the combination of ‘Crunchy Taco’ and ‘Supreme’.

The xxh3 function is a fast non-cryptographic hash function, suitable for general hash-based lookup. It provides a good balance of speed and distribution quality.

xxHash32

Calculates the 32-bit xxHash value for a given string.

Syntax

xxHash32(string)

Arguments

string (String): The input string to hash.

Returns

A 32-bit hash value. Type: UInt32.

Example

Query:

SELECT xxHash32('Crunchy Taco Supreme') AS hash;

Result:

| hash      |
|-----------|
| 3728699739|

This example calculates the xxHash32 value for the string ‘Crunchy Taco Supreme’.

xxHash32 is a fast non-cryptographic hash algorithm. It’s suitable for tasks like checksumming or fingerprinting, but should not be used for cryptographic purposes.

xxHash64

Calculates a 64-bit xxHash hash value from a string.

Syntax

xxHash64(string)

Arguments

string (String): The input string to hash.

Returns:

A UInt64 hash value.

Example

Query:

SELECT xxHash64('Crunchy Taco Supreme') AS hash;

Result:

| hash                  |
|-----------------------|
| 5345889792234302512   |

This function calculates the xxHash64 hash of the input string ‘Crunchy Taco Supreme’. The result is a 64-bit unsigned integer.

xxHash64 is a fast non-cryptographic hash algorithm, suitable for hash tables, checksums, and other non-security applications. It provides a good balance between speed and hash quality.

ngramSimHash

Calculates the n-gram simhash of a string. This function is useful for detecting semi-duplicate strings.

Syntax

ngramSimHash(string[, ngramsize])

Arguments

string (String): Input string.
ngramsize (UInt8, optional): Size of the n-gram. Default value: 3. Possible values: any number from 1 to 25.

Returns

A 64-bit hash value. (UInt64)

Details

The function splits the input string into n-grams of the specified size and calculates a simhash based on these n-grams. It is case-sensitive.

This function can be used in combination with bitHammingDistance to detect similar strings. The smaller the Hamming distance between the simhashes of two strings, the more likely these strings are similar.

Example

SELECT ngramSimHash('Crunchy beef tacos') AS TacoHash;

Result:

| TacoHash           |
|--------------------|
| 17334106714881439  |

In this example, we calculate the simhash for a taco-related string. This hash can be compared with hashes of other strings to find similar menu items or descriptions.

The ngramSimHash function is particularly useful in scenarios where you need to identify similar text entries, such as finding similar taco recipes or menu descriptions in a large database of food items.

ngramSimHashCaseInsensitive

Splits an ASCII string into n-grams of ngramsize symbols and returns the n-gram simhash. This function is case insensitive.

It can be used for detecting semi-duplicate strings when combined with the bitHammingDistance function. The smaller the Hamming Distance between the calculated simhashes of two strings, the more likely these strings are similar.

Syntax

ngramSimHashCaseInsensitive(string[, ngramsize])

Arguments

string (String): Input string.
ngramsize (UInt8, optional): The size of an n-gram. Possible values: any number from 1 to 25. Default value: 3.

Returns

A hash value. (UInt64)

Example

SELECT ngramSimHashCaseInsensitive('Crunchy Taco Supreme') AS Hash;

Result:

| Hash      |
|-----------|
| 562180645 |

In this example, the function calculates the case-insensitive n-gram simhash for the taco menu item “Crunchy Taco Supreme”.

The function is useful for fuzzy matching and finding similar strings, regardless of letter case. This can be particularly helpful when comparing user-generated content or searching for menu items with slight variations in capitalization.

ngramSimHashUTF8

Splits a UTF-8 string into n-grams of ngramsize symbols and returns the n-gram simhash. Is case sensitive.

Can be used for detection of semi-duplicate strings with bitHammingDistance. The smaller the Hamming Distance of the calculated simhashes of two strings, the more likely these strings are the same.

Syntax:

ngramSimHashUTF8(string[, ngramsize])

Arguments:

string (String): Input string.
ngramsize (UInt8, optional): The size of an n-gram. Possible values: any number from 1 to 25. Default value: 3.

Returns:

Hash value. Type: UInt64.

Example:

SELECT ngramSimHashUTF8('Crunchy Taco Supreme') AS Hash;

Result:

| Hash       |
|------------|
| 1628157797 |

In this example, ngramSimHashUTF8 calculates the simhash for the taco menu item ‘Crunchy Taco Supreme’. This hash can be used to find similar menu items or detect near-duplicate entries in a taco restaurant’s menu database.

The function is case-sensitive, so ‘Crunchy Taco Supreme’ and ‘crunchy taco supreme’ would produce different hash values.

ngramSimHashCaseInsensitiveUTF8

Splits a UTF-8 string into n-grams of ngramsize symbols and returns the n-gram simhash. This function is case insensitive.

This function can be used for detecting semi-duplicate strings when combined with the bitHammingDistance function. The smaller the Hamming Distance between the calculated simhashes of two strings, the more likely these strings are similar.

Syntax

ngramSimHashCaseInsensitiveUTF8(string[, ngramsize])

Arguments

string (String): Input string to hash.
ngramsize (UInt8, optional): Size of the n-gram. Possible values: any number from 1 to 25. Default value: 3.

Returns:

A hash value. Type: UInt64.

Example

Query:

SELECT ngramSimHashCaseInsensitiveUTF8('Spicy Taco Tuesday') AS Hash;

Result:

| Hash      |
|-----------|
| 589103868 |

In this example, the function calculates the case-insensitive simhash for the UTF-8 string ‘Spicy Taco Tuesday’. This hash can be used to compare similarity with other taco-related phrases, regardless of letter casing.

The function is particularly useful for fuzzy matching of text, such as finding similar taco recipes or menu items, even if they have slight variations in spelling or capitalization.

wordShingleSimHash

Splits a string into word shingles and calculates a SimHash value. This function is case-sensitive.

Syntax

wordShingleSimHash(string[, shinglesize])

Arguments

string (String): Input string.
shinglesize (UInt8, optional): Size of word shingles. Default value: 3. Possible values: any number from 1 to 25.

Returns

A 64-bit hash value. (UInt64)

Description

This function can be used to detect semi-duplicate strings when combined with the [bitHammingDistance] function. The smaller the Hamming distance between the calculated SimHash values of two strings, the more likely these strings are similar.

Example

Query:

SELECT wordShingleSimHash('Crunchy tacos are the best kind of tacos') AS Hash;

Result:

| Hash                  |
|-----------------------|
| 7365774305152121801   |

In this example, the function calculates the SimHash for the given taco-related string using the default shingle size of 3 words.

The SimHash algorithm is particularly useful for finding near-duplicate content, making it valuable for tasks like detecting similar taco recipes or menu descriptions in a large database of taco-related text.

wordShingleSimHashCaseInsensitive

Splits a ASCII string into parts (shingles) of shinglesize words and returns the word shingle simhash. Is case insensitive.

Can be used for detection of semi-duplicate strings with bitHammingDistance. The smaller the Hamming Distance of the calculated simhashes of two strings, the more likely these strings are the same.

Syntax:

wordShingleSimHashCaseInsensitive(string[, shinglesize])

Arguments:

string (String): Input string.
shinglesize (UInt8, optional): The size of a word shingle. Possible values: any number from 1 to 25. Default value: 3.

Returns:

Hash value. Type: UInt64.

Example:

SELECT wordShingleSimHashCaseInsensitive('Crunchy tacos are the best tacos') AS TacoHash;

Result:

| TacoHash     |
|--------------|
| 2194812424   |

In this example, the function calculates a case-insensitive hash of the taco-related phrase. This hash can be used to find similar phrases about tacos, regardless of letter case.

The function is useful for finding similar text content, such as duplicate menu descriptions or customer reviews about tacos, even if they have slight variations in wording or capitalization.

wordShingleSimHashUTF8

Splits a UTF-8 string into word shingles and calculates a SimHash value. This function is case-sensitive.

Syntax

wordShingleSimHashUTF8(string[, shinglesize])

Arguments

string (String): Input string to hash.
shinglesize (UInt8, optional): Size of word shingles. Default value: 3. Possible values: any number from 1 to 25.

Returns:

A 64-bit hash value. (UInt64)

Description

This function can be used to detect similar strings. The smaller the Hamming distance between the calculated SimHash values of two strings, the more likely these strings are similar.

To compare SimHash values, use the bitHammingDistance function.

Example

SELECT wordShingleSimHashUTF8('The quick brown fox jumps over the lazy dog and steals his crunchy taco') AS SimHash;

Result:

| SimHash               |
|-----------------------|
| 17334813355043011388  |

In this example, the function calculates the SimHash for a sentence about a taco-stealing fox. The result is a 64-bit hash value that can be used to compare similarity with other strings.

For case-insensitive hashing of UTF-8 strings, use the wordShingleSimHashCaseInsensitiveUTF8 function.

wordShingleSimHashCaseInsensitiveUTF8

Splits a UTF-8 string into parts (shingles) of shinglesize words and returns the word shingle simhash. Is case insensitive.

Can be used for detection of semi-duplicate strings with bitHammingDistance. The smaller the Hamming Distance of the calculated simhashes of two strings, the more likely these strings are the same.

Syntax:

wordShingleSimHashCaseInsensitiveUTF8(string[, shinglesize])

Arguments:

string (String): Input string.
shinglesize (UInt8, optional): The size of a word shingle. Possible values: any number from 1 to 25. Default value: 3.

Returns:

Hash value. Type: UInt64.

Example:

SELECT wordShingleSimHashCaseInsensitiveUTF8('Crunchy tacos are the best tacos') AS TacoHash;

Result:

| TacoHash    |
|-------------|
| 2194812424  |

In this example, the function calculates a hash for the given taco-related phrase. The result can be used to compare similarity with other taco-related strings, regardless of letter case.

This function is particularly useful for finding similar text content in large datasets, such as identifying near-duplicate taco recipes or reviews with slight variations.

wyHash64

Produces a 64-bit wyHash64 hash value.

Syntax:

wyHash64(string)

Arguments:

string (String): Input string to hash.

Returns:

A 64-bit hash value. UInt64.

Example:

SELECT wyHash64('Crunchy Taco Supreme') AS TacoHash;

Result:

| TacoHash               |
|------------------------|
| 17790458267262532859   |

This function calculates a 64-bit hash value for the input string ‘Crunchy Taco Supreme’ using the wyHash64 algorithm. The result is returned as a UInt64 value.

wyHash64 is a fast, high-quality hash function that can be useful for various purposes such as hash tables, checksums, or generating unique identifiers for strings.

ngramMinHash

Splits a string into n-grams and calculates hash values for each n-gram. Returns a tuple with minimum and maximum hashes.

Syntax

ngramMinHash(string[, ngramsize, hashnum])

Arguments

string (String): Input string.
ngramsize (UInt8, optional): Size of each n-gram. Default value: 3. Range: 1 to 25.
hashnum (UInt8, optional): Number of minimum and maximum hashes to calculate. Default value: 6. Range: 1 to 25.

Returns

A tuple with two hashes — the minimum and the maximum. (Tuple(UInt64, UInt64))

Example

SELECT ngramMinHash('Crunchy Taco Supreme') AS TacoHash;

Result:

| TacoHash                                     |
|----------------------------------------------|
| (13815467094590897389,9654650593037347311)   |

This function can be used for detecting semi-duplicate strings when combined with tupleHammingDistance. If one of the returned hashes is the same for two strings, those strings are likely similar.

The function is case-sensitive. For case-insensitive hashing, use ngramMinHashCaseInsensitive.

ngramMinHashCaseInsensitive

Splits an ASCII string into n-grams of ngramsize symbols and calculates hash values for each n-gram. Uses hashnum minimum hashes to calculate the minimum hash and hashnum maximum hashes to calculate the maximum hash. Returns a tuple with these hashes. This function is case insensitive.

This function can be used for detecting semi-duplicate strings when combined with the tupleHammingDistance function. If one of the returned hashes is the same for two strings, those strings are considered similar.

Syntax

ngramMinHashCaseInsensitive(string[, ngramsize, hashnum])

Arguments

string (String): Input string.
ngramsize (UInt8, optional): Size of each n-gram. Possible values: any number from 1 to 25. Default value: 3.
hashnum (UInt8, optional): Number of minimum and maximum hashes used to calculate the result. Possible values: any number from 1 to 25. Default value: 6.

Returns

A tuple with two hashes — the minimum and the maximum. (Tuple(UInt64, UInt64))

Example

Query:

SELECT ngramMinHashCaseInsensitive('Crunchy Taco Supreme') AS TacoHash;

Result:

| TacoHash                                        |
|-------------------------------------------------|
| (13793664552400410813,15024973645111108889)     |

In this example, the function calculates the case-insensitive n-gram min hash for the taco name “Crunchy Taco Supreme”. The result is a tuple of two UInt64 values representing the minimum and maximum hashes.

The actual hash values may differ in your results due to the nature of hash functions.

ngramMinHashUTF8

Splits a UTF-8 string into n-grams of ngramsize symbols and calculates hash values for each n-gram. Uses hashnum minimum hashes to calculate the minimum hash and hashnum maximum hashes to calculate the maximum hash. Returns a tuple with these hashes. Is case sensitive.

Can be used for detection of semi-duplicate strings with tupleHammingDistance. For two strings: if one of the returned hashes is the same for both strings, we consider those strings to be similar.

Syntax

ngramMinHashUTF8(string[, ngramsize, hashnum])

Arguments

string (String): Input string.
ngramsize (UInt8, optional): The size of an n-gram. Possible values: any number from 1 to 25. Default value: 3.
hashnum (UInt8, optional): The number of minimum and maximum hashes used to calculate the result. Possible values: any number from 1 to 25. Default value: 6.

Returns:

A tuple with two hashes — the minimum and the maximum. (Tuple(UInt64, UInt64))

Example

Query:

SELECT ngramMinHashUTF8('Crunchy Taco Supreme') AS TacoHash;

Result:

| TacoHash                                       |
|------------------------------------------------|
| (13862738885292113246,7008489204842373647)     |

In this example, ngramMinHashUTF8 calculates the hash values for the taco name. The result can be used to find similar taco names or descriptions in a large dataset.

ngramMinHashCaseInsensitiveUTF8

Can be used for detection of semi-duplicate strings with tupleHammingDistance. For two strings: if one of the returned hashes is the same for both strings, we consider those strings to be similar.

Syntax

ngramMinHashCaseInsensitiveUTF8(string[, ngramsize, hashnum])

Arguments

string (String): Input string.
ngramsize (UInt8, optional): The size of an n-gram. Possible values: any number from 1 to 25. Default value: 3.
hashnum (UInt8, optional): The number of minimum and maximum hashes used to calculate the result. Possible values: any number from 1 to 25. Default value: 6.

Returns

Tuple with two hashes — the minimum and the maximum. (Tuple(UInt64, UInt64))

Example

Query:

SELECT ngramMinHashCaseInsensitiveUTF8('Crunchy Taco Supreme') AS TacoHash;

Result:

| TacoHash                                        |
|-------------------------------------------------|
| (12493625717655877135,13203602793651726206)     |

In this example, the function calculates the case-insensitive n-gram min hash for the taco name ‘Crunchy Taco Supreme’. The result is a tuple containing two UInt64 values representing the minimum and maximum hashes.

This function is particularly useful for finding similar strings in large datasets, such as menu items or product names, regardless of case differences.

ngramMinHashArg

Splits an ASCII string into n-grams of a specified size and returns the n-grams with minimum and maximum hashes, as calculated by the ngramMinHash function with the same input. This function is case-sensitive.

Syntax

ngramMinHashArg(string[, ngramsize, hashnum])

Arguments

string (String): Input string.
ngramsize (UInt8, optional): Size of each n-gram. Possible values: any number from 1 to 25. Default value: 3.
hashnum (UInt8, optional): Number of minimum and maximum hashes used to calculate the result. Possible values: any number from 1 to 25. Default value: 6.

Returns

A tuple containing two tuples, each with hashnum n-grams. (Tuple(Tuple(String), Tuple(String)))

Example

SELECT ngramMinHashArg('Tasty Taco Tuesday', 3, 2) AS TacoNGrams;

Result:

| TacoNGrams                                    |
|-----------------------------------------------|
| (('Tas','aco'),('sty','Tue'))                 |

In this example:

The function splits ‘Tasty Taco Tuesday’ into 3-character n-grams.
It returns two tuples, each containing two n-grams (as specified by hashnum = 2).
The first tuple contains n-grams with minimum hashes, and the second tuple contains n-grams with maximum hashes.

This function is useful for finding similar strings or for text analysis tasks where you need to identify characteristic n-grams of a string based on their hash values.

ngramMinHashArgCaseInsensitive

Splits an ASCII string into n-grams of ngramsize symbols and returns the n-grams with minimum and maximum hashes, calculated by the ngramMinHashCaseInsensitive function with the same input. This function is case insensitive.

Syntax

ngramMinHashArgCaseInsensitive(string[, ngramsize, hashnum])

Arguments

string (String): Input string.
ngramsize (UInt8, optional): The size of an n-gram. Possible values: any number from 1 to 25. Default value: 3.
hashnum (UInt8, optional): The number of minimum and maximum hashes used to calculate the result. Possible values: any number from 1 to 25. Default value: 6.

Returns:

A tuple with two tuples, each containing hashnum n-grams. (Tuple(Tuple(String), Tuple(String)))

Example

Query:

SELECT ngramMinHashArgCaseInsensitive('Crunchy Taco Supreme', 3, 2) AS TacoNGrams;

Result:

| TacoNGrams                            |
|---------------------------------------|
| (('chy','sup'),('cru','tac'))         |

In this example:

The function splits ‘Crunchy Taco Supreme’ into 3-character n-grams.
It returns 2 minimum and 2 maximum hashes (due to hashnum = 2).
The result shows the n-grams corresponding to these hashes.
Note that the case is ignored: ‘chy’ could come from ‘Chy’ in ‘Crunchy’.

This function can be useful for finding similar strings while ignoring case, such as in taco menu item comparisons or ingredient matching where capitalization might vary.

ngramMinHashArgUTF8

Splits a UTF-8 string into n-grams of ngramsize symbols and returns the n-grams with minimum and maximum hashes, calculated by the ngramMinHashUTF8 function with the same input. Is case sensitive.

Syntax

ngramMinHashArgUTF8(string[, ngramsize, hashnum])

Arguments

string (String): Input string.
ngramsize (UInt8, optional): The size of an n-gram. Possible values: any number from 1 to 25. Default value: 3.
hashnum (UInt8, optional): The number of minimum and maximum hashes used to calculate the result. Possible values: any number from 1 to 25. Default value: 6.

Returns

Tuple with two tuples with hashnum n-grams each. (Tuple(Tuple(String), Tuple(String)))

Example

Query:

SELECT ngramMinHashArgUTF8('Taco Tuesday is the best day for tacos', 1, 3) AS Tuple;

Result:

| Tuple                                             |
|---------------------------------------------------|
| (('day','for','the'),('Taco','Tuesday','tacos'))  |

In this example:

The function splits the input string into 1-gram words.
It returns two tuples, each containing 3 words (as specified by hashnum = 3).
The first tuple contains words with minimum hashes, and the second tuple contains words with maximum hashes.

This function is useful for text analysis, particularly when you need to identify characteristic n-grams in UTF-8 encoded text while preserving case sensitivity.

ngramMinHashArgCaseInsensitiveUTF8

Splits a UTF-8 string into n-grams of ngramsize symbols and returns the n-grams with minimum and maximum hashes, calculated by the ngramMinHashCaseInsensitiveUTF8 function with the same input. This function is case insensitive.

Syntax

ngramMinHashArgCaseInsensitiveUTF8(string[, ngramsize, hashnum])

Arguments

string (String): Input string.
ngramsize (UInt8, optional): The size of an n-gram. Possible values: any number from 1 to 25. Default value: 3.
hashnum (UInt8, optional): The number of minimum and maximum hashes used to calculate the result. Possible values: any number from 1 to 25. Default value: 6.

Returns

Tuple with two tuples, each containing hashnum n-grams. (Tuple(Tuple(String), Tuple(String)))

Example

Query:

SELECT ngramMinHashArgCaseInsensitiveUTF8('Crunchy Taco Supreme', 2, 3) AS TacoHashes;

Result:

| TacoHashes                                       |
|--------------------------------------------------|
| (('ch','ta','su'),('cr','pr','nc'))              |

In this example:

The function splits ‘Crunchy Taco Supreme’ into 2-grams (bigrams).
It returns two tuples, each containing 3 bigrams.
The first tuple contains the bigrams with the minimum hashes.
The second tuple contains the bigrams with the maximum hashes.
The function is case-insensitive, so ‘TA’ and ‘ta’ are treated the same.

This function can be useful for detecting similar strings or for text analysis tasks where case sensitivity is not important.

wordShingleMinHash

Splits a string into word shingles and calculates hash values for each shingle. Returns a tuple with minimum and maximum hashes. This function is case-sensitive.

Syntax

wordShingleMinHash(string[, shinglesize, hashnum])

Arguments

string (String): Input string.
shinglesize (UInt8, optional): Size of word shingles. Possible values: 1 to 25. Default value: 3.
hashnum (UInt8, optional): Number of minimum and maximum hashes used. Possible values: 1 to 25. Default value: 6.

Returns

A tuple with two hashes — the minimum and the maximum. (Tuple(UInt64, UInt64))

Example

SELECT wordShingleMinHash('Crunchy tacos are the best tacos', 2, 3) AS TacoHash;

Result:

| TacoHash                                       |
|------------------------------------------------|
| (13815467094590654398,9137889751800330229)     |

This function can be used for detecting semi-duplicate strings when combined with [tupleHammingDistance]. If one of the returned hashes is the same for two strings, those strings are likely similar.

The function is designed for ASCII strings. For UTF-8 strings, use wordShingleMinHashUTF8.

wordShingleMinHashCaseInsensitive

Splits an ASCII string into word shingles and calculates a case-insensitive hash value for each shingle. Uses a specified number of minimum and maximum hashes to compute the final result.

Syntax

wordShingleMinHashCaseInsensitive(string[, shinglesize, hashnum])

Arguments

string (String): Input string to process.
shinglesize (UInt8, optional): Size of each word shingle. Default value: 3. Range: 1 to 25.
hashnum (UInt8, optional): Number of minimum and maximum hashes to use. Default value: 6. Range: 1 to 25.

Returns

A tuple containing two hash values: the minimum and maximum. (Tuple(UInt64, UInt64))

Details

This function is useful for detecting semi-duplicate strings. It can be used in conjunction with tupleHammingDistance. If one of the returned hashes is the same for two strings, those strings are likely similar.

The function is case-insensitive, meaning “Taco” and “taco” will be treated the same.

Example

SELECT wordShingleMinHashCaseInsensitive('Spicy beef tacos are the best street food in Mexico City.') AS TacoHash;

Result:

| TacoHash                                           |
|----------------------------------------------------|
| (3065874883688416519, 1634050779997673240)         |

In this example, the function computes hash values for the input string about tacos. The result can be used to compare similarity with other taco-related strings.

For UTF-8 strings, use the wordShingleMinHashCaseInsensitiveUTF8 function instead.

wordShingleMinHashUTF8

Splits a UTF-8 string into word shingles and calculates hash values for each shingle. Uses the minimum and maximum hashes to produce a tuple of hash values. This function is case-sensitive.

Syntax

wordShingleMinHashUTF8(string[, shinglesize, hashnum])

Arguments

string (String): Input string to be hashed.
shinglesize (UInt8, optional): Size of each word shingle. Default value: 3. Range: 1 to 25.
hashnum (UInt8, optional): Number of minimum and maximum hashes to calculate. Default value: 6. Range: 1 to 25.

Returns

A tuple containing two hash values: the minimum and the maximum. (Tuple(UInt64, UInt64))

This function can be used to detect semi-duplicate strings when combined with [tupleHammingDistance]. If one of the returned hashes is the same for two strings, those strings are likely similar.

Example

SELECT wordShingleMinHashUTF8('Crunchy tacos are the best kind of tacos!') AS TacoHash;

Result:

| TacoHash                                           |
|----------------------------------------------------|
| (12764999597435274891,4400879396773253283)         |

In this example, the function calculates hash values for word shingles in the taco-related sentence. The resulting tuple contains the minimum and maximum hash values computed from these shingles.

This function is particularly useful for text similarity analysis, especially when dealing with multilingual content or texts containing special characters, as it properly handles UTF-8 encoded strings.

wordShingleMinHashCaseInsensitiveUTF8

Splits a UTF-8 string into parts (shingles) of shinglesize words and calculates hash values for each word shingle. Uses hashnum minimum hashes to calculate the minimum hash and hashnum maximum hashes to calculate the maximum hash. Returns a tuple with these hashes. Is case insensitive.

Can be used for detection of semi-duplicate strings with tupleHammingDistance. For two strings: if one of the returned hashes is the same for both strings, we think that those strings are the same.

Syntax

wordShingleMinHashCaseInsensitiveUTF8(string[, shinglesize, hashnum])

Arguments

string (String): Input string.
shinglesize (UInt8, optional): The size of a word shingle. Possible values: any number from 1 to 25. Default value: 3.
hashnum (UInt8, optional): The number of minimum and maximum hashes used to calculate the result. Possible values: any number from 1 to 25. Default value: 6.

Returns:

Tuple with two hashes — the minimum and the maximum. (Tuple(UInt64, UInt64))

Example

Query:

SELECT wordShingleMinHashCaseInsensitiveUTF8('Crunchy tacos are the best tacos!') AS TacoHash;

Result:

| TacoHash                                       |
|------------------------------------------------|
| (3065874883688416519, 1634050779997673240)     |

In this example, the function calculates a hash for the taco-related string. The result can be used to find similar taco-related phrases, regardless of case.

wordShingleMinHashArg

Splits a ASCII string into parts (shingles) of a specified word size and returns the shingles with minimum and maximum word hashes. This function is case sensitive.

Syntax

wordShingleMinHashArg(string[, shinglesize, hashnum])

Arguments

string (String): Input string.
shinglesize (UInt8, optional): The size of a word shingle. Possible values: any number from 1 to 25. Default value: 3.
hashnum (UInt8, optional): The number of minimum and maximum hashes used to calculate the result. Possible values: any number from 1 to 25. Default value: 6.

Returns

A tuple with two tuples, each containing hashnum word shingles. (Tuple(Tuple(String), Tuple(String)))

Example

SELECT wordShingleMinHashArg('Crunchy tacos are the best tacos', 1, 3) AS TacoShingles;

Result:

| TacoShingles                                      |
|---------------------------------------------------|
| (('are','best','Crunchy'),('tacos','the','the'))  |

In this example:

The function splits the input string into single words (shinglesize = 1).
It returns two tuples with 3 words each (hashnum = 3).
The first tuple contains words with minimum hashes, and the second tuple contains words with maximum hashes.

This function can be useful for text analysis, finding similar documents, or implementing efficient text search algorithms.

The results may vary depending on the input string and the hash function used internally.

wordShingleMinHashArgCaseInsensitive

Splits an ASCII string into word shingles and returns the shingles with minimum and maximum word hashes. This function is case insensitive.

Syntax

wordShingleMinHashArgCaseInsensitive(string[, shinglesize, hashnum])

Arguments

string (String): Input string.
shinglesize (UInt8, optional): Size of each word shingle. Possible values: any number from 1 to 25. Default value: 3.
hashnum (UInt8, optional): Number of minimum and maximum hashes used to calculate the result. Possible values: any number from 1 to 25. Default value: 6.

Returns

A tuple containing two tuples, each with hashnum word shingles. (Tuple(Tuple(String), Tuple(String)))

Example

Query:

SELECT wordShingleMinHashArgCaseInsensitive('Crunchy tacos are the best tacos', 1, 3) AS Tuple;

Result:

| Tuple                                                |
|------------------------------------------------------|
| (('are','best','crunchy'),('the','tacos','tacos'))   |

This function can be used for detecting semi-duplicate strings when combined with tupleHammingDistance. The smaller the Hamming distance between the calculated hashes of two strings, the more likely these strings are similar.

This function is the case-insensitive version of wordShingleMinHashArg. It’s particularly useful when you want to compare strings regardless of their letter casing.

wordShingleMinHashArgUTF8

Splits a UTF-8 string into word shingles and returns the shingles with minimum and maximum word hashes. This function is case-sensitive.

Syntax

wordShingleMinHashArgUTF8(string[, shinglesize, hashnum])

Arguments

string (String): Input string.
shinglesize (UInt8, optional): Size of each word shingle. Possible values: 1 to 25. Default value: 3.
hashnum (UInt8, optional): Number of minimum and maximum hashes used to calculate the result. Possible values: 1 to 25. Default value: 6.

Returns

A tuple containing two tuples, each with hashnum word shingles. (Tuple(Tuple(String), Tuple(String)))

Example

SELECT wordShingleMinHashArgUTF8('Crunchy tacos are the best tacos', 1, 3) AS TacoShingles;

Result:

| TacoShingles                                 |
|----------------------------------------------|
| (('are','best','Crunchy'),('tacos','the'))   |

In this example:

The function splits the input string into individual words.
It returns two tuples: one with the three words that have the minimum hashes, and another with the two words that have the maximum hashes.
The shingle size is set to 1 (individual words) and the number of hashes is set to 3.

This function is useful for finding similar strings or for text classification tasks, especially when working with multilingual text in UTF-8 encoding.

wordShingleMinHashArgCaseInsensitiveUTF8

Splits a UTF-8 string into word shingles and returns the shingles with minimum and maximum word hashes. This function is case insensitive.

Syntax

wordShingleMinHashArgCaseInsensitiveUTF8(string[, shinglesize, hashnum])

Arguments

string (String): Input string.
shinglesize (UInt8, optional): Size of each word shingle. Possible values: any number from 1 to 25. Default value: 3.
hashnum (UInt8, optional): Number of minimum and maximum hashes used to calculate the result. Possible values: any number from 1 to 25. Default value: 6.

Returns

A tuple containing two tuples, each with hashnum word shingles. (Tuple(Tuple(String), Tuple(String)))

Example

SELECT wordShingleMinHashArgCaseInsensitiveUTF8('Crunchy Taco Supreme with extra guacamole and hot sauce', 1, 3) AS TacoShingles;

Result:

| TacoShingles                                             |
|----------------------------------------------------------|
| (('guacamole','hot','sauce'),('crunchy','with','taco'))  |

In this example:

The function splits the taco order description into individual words.
It returns two tuples: one with the shingles that produced the minimum hashes, and another with those that produced the maximum hashes.
The shingle size is set to 1 (individual words) and we’re getting 3 shingles for each tuple.

This function is useful for detecting similar strings or for text classification tasks, especially when dealing with multilingual text or when case sensitivity is not important.

sqidEncode

Encodes numbers as a Sqid which is a YouTube-like ID string. The output alphabet is abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789. Do not use this function for hashing - the generated IDs can be decoded back into the original numbers.

Syntax:

sqidEncode(number1, ...)

Alias:

sqid

Arguments:

A variable number of UInt8, UInt16, UInt32 or UInt64 numbers.

Returns:

A sqid String.

Example:

SELECT sqidEncode(1, 2, 3, 4, 5) AS taco_order_id;

Result:

| taco_order_id |
|---------------|
| gXHfJ1C6dN    |

In this example, sqidEncode generates a unique ID for a taco order using the numbers 1, 2, 3, 4, and 5. This could be useful for creating short, readable order IDs in a taco ordering system.

The sqidEncode function is particularly useful when you need to generate short, URL-friendly IDs from a series of numbers. It’s ideal for creating readable identifiers for orders, products, or any other entities in your database that need a compact, reversible representation.

sqidDecode

Decodes a Sqid back into its original numbers.

Syntax:

sqidDecode(sqid)

Arguments:

sqid (String): A Sqid string to decode.

Returns:

An array of numbers decoded from the Sqid. Returns an empty array if the input string is not a valid Sqid. (Array(UInt64))

Example:

SELECT sqidDecode('gXHfJ1C6dN') AS DecodedTacoOrder;

Result:

| DecodedTacoOrder  |
|-------------------|
| [1,2,3,4,5]       |

In this taco-themed example, ‘gXHfJ1C6dN’ might represent an encoded taco order ID. The sqidDecode function decodes it back into its original numeric components, which could represent various aspects of the order such as order number, number of tacos, sauce type, etc.

This function is the inverse of sqidEncode. It’s useful for decoding short, URL-friendly IDs back into their original numeric form. However, it should not be used for security-sensitive applications as the encoding is reversible.

Introduction

Statements

​halfMD5

​MD4

​MD5

​sipHash64

​sipHash64Keyed

​sipHash128

​sipHash128Keyed

​sipHash128Reference

​sipHash128ReferenceKeyed

​cityHash64

​intHash32

​intHash64

​SHA1

​SHA224

​SHA256

​SHA512

​SHA512_256

​BLAKE3

​URLHash

​farmFingerprint64

​farmHash64

​javaHash

​javaHashUTF16LE

​hiveHash

​metroHash64

​jumpConsistentHash

​kostikConsistentHash

​ripeMD160

​murmurHash2_32

​murmurHash2_64

​gccMurmurHash

​kafkaMurmurHash

​murmurHash3_32

​murmurHash3_64

​murmurHash3_128

​xxh3

​xxHash32

​xxHash64

​ngramSimHash

​ngramSimHashCaseInsensitive

​ngramSimHashUTF8

​ngramSimHashCaseInsensitiveUTF8

​wordShingleSimHash

​wordShingleSimHashCaseInsensitive

​wordShingleSimHashUTF8

​wordShingleSimHashCaseInsensitiveUTF8

​wyHash64

​ngramMinHash

​ngramMinHashCaseInsensitive

​ngramMinHashUTF8

​ngramMinHashCaseInsensitiveUTF8

​ngramMinHashArg

​ngramMinHashArgCaseInsensitive

​ngramMinHashArgUTF8

​ngramMinHashArgCaseInsensitiveUTF8

​wordShingleMinHash

​wordShingleMinHashCaseInsensitive

​wordShingleMinHashUTF8

​wordShingleMinHashCaseInsensitiveUTF8

​wordShingleMinHashArg

​wordShingleMinHashArgCaseInsensitive

​wordShingleMinHashArgUTF8

​wordShingleMinHashArgCaseInsensitiveUTF8

​sqidEncode

​sqidDecode

halfMD5

MD4

MD5

sipHash64

sipHash64Keyed

sipHash128

sipHash128Keyed

sipHash128Reference

sipHash128ReferenceKeyed

cityHash64

intHash32

intHash64

SHA1

SHA224

SHA256

SHA512

SHA512_256

BLAKE3

URLHash

farmFingerprint64

farmHash64

javaHash

javaHashUTF16LE

hiveHash

metroHash64

jumpConsistentHash

kostikConsistentHash

ripeMD160

murmurHash2_32

murmurHash2_64

gccMurmurHash

kafkaMurmurHash

murmurHash3_32

murmurHash3_64

murmurHash3_128

xxh3

xxHash32

xxHash64

ngramSimHash

ngramSimHashCaseInsensitive

ngramSimHashUTF8

ngramSimHashCaseInsensitiveUTF8

wordShingleSimHash

wordShingleSimHashCaseInsensitive

wordShingleSimHashUTF8

wordShingleSimHashCaseInsensitiveUTF8

wyHash64

ngramMinHash

ngramMinHashCaseInsensitive

ngramMinHashUTF8

ngramMinHashCaseInsensitiveUTF8

ngramMinHashArg

ngramMinHashArgCaseInsensitive

ngramMinHashArgUTF8

ngramMinHashArgCaseInsensitiveUTF8

wordShingleMinHash

wordShingleMinHashCaseInsensitive

wordShingleMinHashUTF8

wordShingleMinHashCaseInsensitiveUTF8

wordShingleMinHashArg

wordShingleMinHashArgCaseInsensitive

wordShingleMinHashArgUTF8

wordShingleMinHashArgCaseInsensitiveUTF8

sqidEncode

sqidDecode