Hash functions
Generate hash values for data integrity and lookups.
Hash functions serve multiple purposes in data processing:
-
Deterministic shuffling: They can be used to randomly reorder elements in a predictable way.
-
Similarity detection: Some hash functions, like
Simhash
, produce similar hash values for similar inputs. This property is useful for identifying near-duplicate items. -
Data integrity: Hash functions can generate unique identifiers for data, helping to verify its integrity.
-
Efficient lookups: Hash values can be used as keys in hash tables for fast data retrieval.
Each hash function is designed for specific use cases, balancing factors like speed, distribution quality, and collision resistance.
#ClickHouse function reference
halfMD5
Calculates a 64-bit hash value using a modified MD5 algorithm.
Syntax
Arguments
arg1, ...
(any data type): A variable number of arguments of any data type.
Returns
- A
UInt64
hash value.
Details
The function interprets all input parameters as strings, calculates the MD5 hash value for each of them, then combines the hashes, takes the first 8 bytes of the resulting hash, and interprets them as a UInt64
in big-endian byte order.
This function is relatively slow (processing about 5 million short strings per second per processor core). Consider using the sipHash64
function instead for better performance.
Example
Result:
In this example, halfMD5
calculates a hash value for the combination of ‘taco’, ‘tuesday’, and the number 3.
For some data types, the calculated hash value may be the same for different input values if their string representations are identical (e.g., integers of different sizes, named and unnamed Tuples with the same data).
MD4
Calculates the MD4 hash of a string and returns the result as a FixedString(16).
Syntax
Arguments
string
(String
): The input string to hash.
Returns
- A
FixedString(16)
containing the MD4 hash of the input.
Example
Result:
MD4 is considered cryptographically weak. For secure hashing, consider using more robust algorithms like SHA-256.
MD5
Calculates the MD5 hash of a string and returns the result as a FixedString(16).
Syntax
Arguments
string
(String
): The input string to hash.
Returns
- A
FixedString(16)
containing the MD5 hash value.
Example
Result:
If you need a 128-bit cryptographic hash function but don’t specifically require MD5, consider using the sipHash128
function instead. It’s generally faster and provides better hash distribution.
To get the same result as the md5sum utility, use lower(hex(MD5(s)))
.
sipHash64
Produces a 64-bit SipHash hash value.
Syntax:
Arguments:
- A variable number of parameters of any of the supported data types.
Returns:
- A
UInt64
data type hash value.
Details:
This is a cryptographic hash function. It works at least three times faster than the MD5 hash function.
The function interprets all input parameters as strings and calculates the hash value for each of them. It then combines the hashes by the following algorithm:
- The first and second hash values are concatenated to an array which is hashed.
- The previously calculated hash value and the hash of the third input parameter are hashed in a similar way.
- This calculation is repeated for all remaining hash values of the original input.
For some data types, the calculated value of the hash function may be the same for different values, even if the types of arguments differ. This affects:
- Integers of different sizes
- Named and unnamed Tuples with the same data
- Map and the corresponding
Array(Tuple(key, value))
type with the same data
Example:
Result:
In this example, we calculate the SipHash of an array, a string, a number, and a DateTime value. The result is a UInt64 hash value.
If you need a decent cryptographic 128-bit hash instead, consider using the sipHash128
function.
sipHash64Keyed
Calculates a 64-bit SipHash hash value using a specified key.
Syntax
Arguments
(k0, k1)
(Tuple(UInt64, UInt64)
): A tuple of two UInt64 values representing the key.par1, ...
(any): A variable number of parameters of any supported data type.
Returns
- A
UInt64
data type hash value.
Example
Result:
This example calculates a SipHash64 value for a combination of taco-related data using a specified key.
For some data types, the calculated hash value may be the same for different argument types (e.g., integers of different sizes, named and unnamed Tuples with the same data, Map and the corresponding Array(Tuple(key, value)) type with the same data).
sipHash128
Produces a 128-bit SipHash hash value.
Syntax:
Arguments:
par1, ...
(any of the supported data types): A variable number of parameters.
Returns:
- A 128-bit SipHash hash value of type
FixedString(16)
.
This 128-bit variant differs from the reference implementation and it’s weaker. This version exists because, when it was written, there was no official 128-bit extension for SipHash. New projects should probably use sipHash128Reference
instead.
Example:
Result:
In this example, we calculate the SipHash128 of a combination of taco toppings and a number, then display the result as a hexadecimal string.
sipHash128Keyed
Calculates a 128-bit SipHash hash value using an explicit key.
Syntax
Arguments
(k0, k1)
(tuple): A tuple of twoUInt64
values representing the key.par1, ...
(any): A variable number of parameters of any supported data type.
Returns
- A 128-bit SipHash hash value of type
FixedString(16)
.
This 128-bit variant differs from the reference implementation and is weaker. New projects should consider using sipHash128ReferenceKeyed
instead.
Example
Result:
In this example, we calculate the SipHash128 value for a list of taco sauces using a custom key. The result is displayed as a hexadecimal string for readability.
sipHash128Reference
Produces a 128-bit SipHash hash value using the reference implementation from the original authors of SipHash.
Syntax
Arguments
par1, ...
— A variable number of parameters of any of the supported data types.
Returns
- A 128-bit SipHash hash value as a
FixedString(16)
.
Example
Query:
Result:
In this example:
- We use the
sipHash128Reference
function to calculate the hash of three taco-related strings. - The
hex
function is used to represent the result as a hex-encoded string for readability.
This function is particularly useful when you need a cryptographically strong 128-bit hash and want to ensure compatibility with other systems using the reference SipHash implementation.
sipHash128ReferenceKeyed
Calculates a 128-bit SipHash hash value using an explicit key.
Syntax
Arguments
(k0, k1)
(Tuple(UInt64, UInt64)
): A tuple of two UInt64 values representing the key.par1, ...
(any data type): A variable number of parameters of any data type.
Returns
A 128-bit SipHash hash value as a FixedString(16)
.
Example
Result:
This function implements the 128-bit algorithm from the original authors of SipHash. It’s useful when you need a cryptographically strong hash function with a custom key.
This function provides better security compared to sipHash128Keyed, which uses a non-standard 128-bit extension.
cityHash64
Produces a 64-bit CityHash hash value.
Syntax:
Arguments:
- A variable number of input parameters of any of the supported data types.
Returns:
- A
UInt64
data type hash value.
- This is a fast non-cryptographic hash function. It uses the CityHash algorithm for string parameters and implementation-specific fast non-cryptographic hash functions for parameters with other data types.
- The function uses the CityHash combiner to get the final results.
- ClickHouse’s
cityHash64
corresponds to CityHash v1.0.2. - For some data types, the calculated value of the hash function may be the same for the same values even if types of arguments differ (e.g., integers of different sizes, named and unnamed Tuple with the same data, Map and the corresponding Array(Tuple(key, value)) type with the same data).
Example:
Result:
This example demonstrates how to compute the CityHash64 value for a combination of an array, a string, a number, and a datetime.
Use Case:
You can use cityHash64
to compute a checksum of an entire table with accuracy up to the row order:
This query calculates a single hash value that represents the entire content of the taco_orders
table, which can be useful for quickly comparing table contents or detecting changes.
intHash32
Calculates a 32-bit hash code from any type of integer. This is a relatively fast non-cryptographic hash function of average quality for numbers.
Syntax:
Arguments:
int
(UInt8
,UInt16
,UInt32
,UInt64
,Int8
,Int16
,Int32
,Int64
): Integer to hash.
Returns:
- 32-bit hash code.
UInt32
.
Example:
Result:
This example calculates the hash of the number 42, which could represent a taco order ID in a taco-themed database system.
The intHash32
function is useful when you need a quick hash of integer values, such as for distributing data across shards or creating probabilistic data structures. However, it’s not suitable for cryptographic purposes due to its simplicity and speed.
intHash64
Calculates a 64-bit hash code from any type of integer. This is a relatively fast non-cryptographic hash function of average quality for numbers.
Syntax
Arguments
int
(UInt8
,UInt16
,UInt32
,UInt64
,Int8
,Int16
,Int32
,Int64
): Integer to hash.
Returns
- 64-bit hash code. (
UInt64
)
Example
Query:
Result:
This function works faster than [intHash32]. It can be used for creating unique identifiers for taco orders or menu items in a high-volume taco restaurant management system.
The function produces a deterministic result, meaning the same input will always produce the same hash value. This makes it suitable for scenarios where reproducibility is important, such as generating consistent IDs for taco ingredients across different database instances.
SHA1
Calculates the SHA-1 hash of a string and returns the result as a FixedString(20).
Syntax
Arguments
string
(String
): The input string to hash.
Returns
- A
FixedString(20)
containing the SHA-1 hash value.
Example
Result:
In this example, we calculate the SHA-1 hash of the string ‘Carne asada tacos’ and use the hex
function to display the result as a hexadecimal string.
The SHA-1 function works relatively slowly (processing about 5 million short strings per second per processor core). Consider using faster hash functions like sipHash64
if cryptographic properties are not required.
For security-critical applications, it’s recommended to use stronger hash functions like SHA-256 or SHA-512, as SHA-1 is no longer considered cryptographically secure against well-funded opponents.
SHA224
Calculates the SHA-224 hash of a string and returns the result as a FixedString(28).
Syntax
Arguments
string
(String
): The input string to hash.
Returns
- A
FixedString(28)
containing the SHA-224 hash of the input string.
Example
Result:
This example calculates the SHA-224 hash of the string ‘Crunchy taco supreme’ and displays it as a hexadecimal string.
The SHA-224 function is a cryptographic hash function that produces a 224-bit (28-byte) hash value. It’s part of the SHA-2 family of hash functions and is considered cryptographically strong. However, for most applications, SHA-256 is more commonly used. Use SHA-224 only if you specifically need a 224-bit hash or if you’re working with a system that requires it.
SHA256
Calculates the SHA-256 hash of a string and returns the result as a FixedString(32).
Syntax
Arguments
string
(String
): The input string to hash.
Returns
- A
FixedString(32)
containing the SHA-256 hash value.
Example
Result:
This example calculates the SHA-256 hash of the string ‘Crunchy Taco Supreme’ and displays it as a hexadecimal string.
The SHA-256 function is a cryptographic hash function that generates a 256-bit (32-byte) hash value. It’s commonly used for integrity verification of data and in various security applications. However, for password hashing, it’s recommended to use specialized password hashing functions with salt and key stretching.
SHA512
Calculates the SHA-512 hash from a string and returns the resulting set of bytes as a FixedString(64).
Syntax
Arguments
string
(String
): The input string to hash.
Returns
- A
FixedString(64)
containing the SHA-512 hash value.
Example
Result:
This example calculates the SHA-512 hash of the string ‘Crunchy taco supreme’ and displays it as a hexadecimal string.
The SHA-512 function works relatively slowly (about 5 million short strings per second per processor core). We recommend using this function only when necessary, such as for cryptographic purposes. For faster non-cryptographic hashing, consider using functions like sipHash64
.
SHA512_256
Calculates the SHA-512/256 hash of a string and returns the result as a FixedString(32).
Syntax
Arguments
string
(String
): The input string to hash.
Returns
- A
FixedString(32)
containing the SHA-512/256 hash of the input string.
Example
Result:
This example calculates the SHA-512/256 hash of the string ‘Crunchy taco supreme’ and displays it as a hexadecimal string.
The SHA-512/256 function is a truncated version of SHA-512, providing a 256-bit (32-byte) hash value. It’s designed to provide the security of SHA-512 with a more compact output size.
BLAKE3
Calculates the BLAKE3 hash of a string and returns the result as a FixedString(32).
Syntax
Arguments
s
(String
): Input string for BLAKE3 hash calculation.
Returns
- BLAKE3 hash as a byte array. [
FixedString(32)
]
Example
Use the hex
function to represent the result as a hex-encoded string:
Result:
This cryptographic hash function is integrated into ClickHouse using the BLAKE3 Rust library. It performs approximately twice as fast as SHA-2 while generating hashes of the same length as SHA-256.
URLHash
Calculates a hash value from a URL string using normalization.
Syntax
Arguments
url
(String
): The URL string to hash.N
(UInt8
, optional): The number of URL hierarchy levels to include in the hash calculation.
Returns
- A hash value of type
UInt64
.
Details
URLHash(s)
calculates a hash from the string after removing one of the trailing symbols/
,?
or#
at the end, if present.URLHash(s, N)
calculates a hash from the string up to the N-th level in the URL hierarchy, after removing one of the trailing symbols/
,?
or#
at the end, if present.
Example
Result:
This function is useful for generating consistent hash values for URLs, which can be helpful in tasks like URL-based partitioning or deduplication of web crawl data.
The levels in the URL hierarchy are the same as those used in the URLHierarchy function.
farmFingerprint64
Produces a 64-bit FarmHash Fingerprint value.
Syntax:
Arguments:
par1, ...
(any): A variable number of parameters that can be any of the supported data types.
Returns:
- A
UInt64
data type hash value.
Example:
Result:
This function calculates a hash value based on the input parameters. In this example, it combines an array of taco toppings, the word ‘taco’, a number, and a timestamp to produce a unique hash.
The farmFingerprint64 function is preferred for a stable and portable hash value. For some data types, the calculated value of the hash function may be the same for the same values even if types of arguments differ (e.g., integers of different sizes, named and unnamed Tuple with the same data, Map and the corresponding Array(Tuple(key, value)) type with the same data).
farmHash64
Produces a 64-bit FarmHash value.
Syntax:
Arguments:
par1, ...
(any of the supported data types): A variable number of parameters of any of the supported data types.
Returns:
- A
UInt64
data type hash value.
Example:
Result:
This function calculates a hash value for the given arguments. It’s useful for tasks like data distribution or creating unique identifiers. The function uses the Hash64 method from the available FarmHash methods.
For some data types, the calculated hash value may be the same for the same values even if the types of arguments differ (e.g., integers of different sizes, named and unnamed Tuples with the same data, Map and the corresponding Array(Tuple(key, value)) type with the same data).
javaHash
Calculates JavaHash from a string or numeric value.
Syntax
Arguments
value
(String
,Byte
,Short
,Integer
, orLong
): The input value.
Returns
- A hash value of type
Int32
.
Description
This function calculates the hash using the same algorithm as Java’s String.hashCode() method. It’s neither fast nor of particularly good quality. The main reason to use it is when you need to calculate exactly the same result as Java’s hash function.
Note that Java only supports calculating signed integer hashes. If you want to calculate unsigned integer hashes, you must cast the input to the proper signed ClickHouse type.
Example
Calculate the hash of a string:
Result:
Calculate the hash of an integer:
Result:
Note
This hash function is primarily useful when you need to match Java’s hash calculation exactly. For most other use cases, consider using more efficient hash functions like cityHash64
or xxHash64
.
javaHashUTF16LE
Calculates JavaHash from a string, assuming it contains bytes representing a string in UTF-16LE encoding.
Syntax
Arguments
stringUtf16le
(String
): A string in UTF-16LE encoding.
Returns
- A hash value. (
Int32
)
Example
Query:
Result:
This query calculates the JavaHash of the UTF-16LE encoded string ‘Crunchy Taco Supreme’.
The javaHashUTF16LE
function is designed to work with UTF-16LE encoded strings. Make sure your input is properly encoded to get correct results.
hiveHash
Calculates HiveHash from a string.
Syntax:
Arguments:
string
(String
): Input string.
Returns:
- HiveHash value. [
Int32
]
Description:
This function calculates the HiveHash value for a given string. It is essentially the [javaHash] function with the sign bit zeroed out. This hash function is used in Apache Hive for versions before 3.0.
This hash function is neither fast nor of particularly good quality. Its primary use case is when you need to calculate exactly the same result as used in another system that implements HiveHash.
Example:
Result:
In this example, we calculate the HiveHash for a delicious taco menu item.
If you don’t specifically need HiveHash compatibility, consider using other hash functions like [cityHash64] or [xxHash64] for better performance and hash quality.
metroHash64
Produces a 64-bit MetroHash hash value.
Syntax:
Arguments:
par1, ...
(any): A variable number of parameters that can be any of the supported data types.
Returns:
- A
UInt64
data type hash value.
Example:
Result:
In this example, we calculate the MetroHash64 value for a combination of taco toppings, the word ‘taco’, the number 10, and a timestamp.
For some data types, the calculated value of the hash function may be the same for the same values even if the types of arguments differ (e.g., integers of different sizes, named and unnamed Tuple with the same data, Map and the corresponding Array(Tuple(key, value)) type with the same data).
This function is useful for generating hash values for complex data structures or when you need a fast, non-cryptographic hash function.
jumpConsistentHash
Calculates a JumpConsistentHash value for a given key and number of buckets.
Syntax:
Arguments:
key
(UInt64
): The input key to hash.buckets
(Integer): The number of buckets to distribute the hash values across.
Returns:
An Int32
value representing the bucket number for the given key.
Example:
Result:
In this example, the key 42
is hashed and assigned to bucket 6 out of 10 possible buckets.
JumpConsistentHash is an efficient algorithm for distributing items across a fixed number of buckets. It’s particularly useful for distributed systems and can be used for tasks like sharding data or distributing load across servers.
The function has the following properties:
- It’s fast and requires minimal memory.
- It’s consistent, meaning the same key will always map to the same bucket as long as the number of buckets doesn’t change.
- When the number of buckets increases, it minimizes the number of keys that need to be remapped.
Taco-themed Example:
Imagine you’re running a chain of taco trucks and want to distribute your specialty tacos across different trucks:
Result:
This example shows how different taco types are consistently assigned to 5 different taco trucks (numbered 0 to 4) based on their names.
kostikConsistentHash
An O(1) time and space consistent hash algorithm by Konstantin ‘kostik’ Oblakov. Previously known as yandexConsistentHash.
Syntax:
Alias:
- yandexConsistentHash (left for backwards compatibility)
Arguments:
input
(UInt64
): A UInt64-type key.n
(UInt16
): Number of buckets.
Returns:
- A
UInt16
data type hash value.
Implementation details:
This function is efficient only if n <=
32768.
Example:
Result:
This function can be used for consistent hashing of taco orders across multiple servers. For example, to determine which kitchen should prepare a specific taco order:
Result:
In this example, taco orders are consistently distributed among 5 kitchens (numbered 0 to 4) based on their order_id.
ripeMD160
Calculates the RIPEMD-160 hash of a string.
Syntax
Arguments
input
(String
): The input string to hash.
Returns
- A
UInt256
value containing the 160-bit RIPEMD-160 hash in the first 20 bytes. The remaining 12 bytes are zero-padded.
Example
Result:
This example calculates the RIPEMD-160 hash of the input string and displays it as a hexadecimal string. The hex()
function is used to convert the binary hash to a readable format.
RIPEMD-160 is a cryptographic hash function designed as a replacement for MD4 and MD5. It produces a 160-bit (20-byte) hash value, which is then zero-padded to fit into a UInt256 in ClickHouse.
murmurHash2_32
Calculates a 32-bit MurmurHash2 hash value.
Syntax
Arguments
expr1
,expr2
, … (any data type): A variable number of expressions of any data type.
Returns
- A hash value of type
UInt32
.
Details
This function calculates a 32-bit hash value using the MurmurHash2 algorithm. It can accept multiple arguments of various data types. The result will be the same for identical values even if their data types differ (e.g., different-sized integers, named and unnamed tuples with the same data, or a Map and its corresponding Array(Tuple(key, value)) with identical data).
Example
Result:
This example calculates the MurmurHash2_32 hash of a combination of taco-related strings, a number, and a timestamp.
The MurmurHash2 algorithm is designed for speed and simplicity rather than cryptographic security. For cryptographic purposes, consider using functions like SHA256 instead.
murmurHash2_64
Produces a 64-bit MurmurHash2 hash value.
Syntax:
Arguments:
par1, ...
(any of the supported data types): A variable number of parameters.
Returns:
- A
UInt64
data type hash value.
Example:
Result:
This example calculates the MurmurHash2_64 hash for a combination of taco-related data, including an array of toppings, the word ‘taco’, a number, and a timestamp.
For some data types, the calculated value of the hash function may be the same for the same values even if the types of arguments differ (e.g., integers of different sizes, named and unnamed Tuple with the same data, Map and the corresponding Array(Tuple(key, value)) type with the same data).
gccMurmurHash
Calculates a 64-bit MurmurHash2 hash value using the same hash seed as gcc. It is portable between Clang and GCC builds.
Syntax
Arguments
par1, ...
: A variable number of parameters that can be any of the supported data types.
Returns
- A
UInt64
hash value.
Example
Result:
In this example:
res1
calculates the hash of a combination of taco-related strings and a number.res2
demonstrates hashing a complex nested structure containing taco ingredients and numbers.
This function is useful for generating consistent hash values across different C++ compiler implementations, which can be beneficial in distributed systems or when porting code between environments.
kafkaMurmurHash
Calculates a 32-bit MurmurHash2 hash value using the same hash seed as Kafka and without the highest bit to be compatible with Default Partitioner.
Syntax
Arguments
par1, ...
— A variable number of parameters that can be any of the supported data types.
Returns:
- A calculated hash value of type
UInt32
.
Example
Query:
Result:
In this example:
res1
calculates the hash for a single string ‘carne asada’.res2
demonstrates hashing multiple arguments of different types, including an array, a string, a number, and a datetime.
This function is particularly useful when you need to replicate Kafka’s partitioning behavior or when working with Kafka-related data in ClickHouse.
murmurHash3_32
Calculates a 32-bit MurmurHash3 hash value.
Syntax
Arguments
expr1
,expr2
, … (any data type): A variable number of expressions of any data type.
Returns:
- A 32-bit hash value of type
UInt32
.
Details
- This function calculates a MurmurHash3 hash for the provided arguments.
- It can accept multiple arguments of various data types.
- For some data types, the calculated hash value may be the same for identical values even if the argument types differ (e.g., integers of different sizes, named and unnamed Tuples with the same data, Map and the corresponding
Array(Tuple(key, value))
type with the same data).
Example
Result:
In this example, we calculate the MurmurHash3 hash for a combination of taco-related strings and a number.
The MurmurHash3 algorithm is designed to be fast and have good distribution properties, making it suitable for hash-based lookups and data structures. However, it is not cryptographically secure and should not be used for security-sensitive applications.
murmurHash3_64
Produces a 64-bit MurmurHash3 hash value.
Syntax:
Arguments:
par1, ...
(any of the supported data types): A variable number of parameters.
Returns:
- A
UInt64
data type hash value.
Example:
Result:
In this example, we calculate the MurmurHash3_64 hash of an array of taco toppings, the word ‘taco’, the number 10, and a timestamp.
For some data types, the calculated value of the hash function may be the same for the same values even if the types of arguments differ (e.g., integers of different sizes, named and unnamed Tuple with the same data, Map and the corresponding Array(Tuple(key, value)) type with the same data).
This function is useful for generating unique identifiers or for data partitioning in distributed systems. It provides a good balance of speed and hash quality for most non-cryptographic purposes.
murmurHash3_128
Produces a 128-bit MurmurHash3 hash value.
Syntax:
Arguments:
expr
(String
): A list of expressions of any data type.
Returns:
A 128-bit MurmurHash3 hash value. FixedString(16)
.
Example:
Result:
In this example, we calculate the MurmurHash3_128 hash of three popular taco restaurant menu items. The result is displayed as a hexadecimal string for readability.
The MurmurHash3_128 function is particularly useful when you need a high-quality 128-bit hash value for large amounts of data. It provides a good balance between speed and hash quality, making it suitable for various applications such as data fingerprinting, caching, and bloom filters in distributed systems.
xxh3
Produces a 64-bit xxh3 hash value.
Syntax:
Arguments:
expr
(any data type): A list of expressions of any data type.
Returns:
A 64-bit xxh3 hash value. UInt64
.
Example:
Result:
In this example, the xxh3
function calculates a hash value for the combination of ‘Crunchy Taco’ and ‘Supreme’.
The xxh3
function is a fast non-cryptographic hash function, suitable for general hash-based lookup. It provides a good balance of speed and distribution quality.
xxHash32
Calculates the 32-bit xxHash value for a given string.
Syntax
Arguments
string
(String
): The input string to hash.
Returns
- A 32-bit hash value. Type:
UInt32
.
Example
Query:
Result:
This example calculates the xxHash32 value for the string ‘Crunchy Taco Supreme’.
xxHash32 is a fast non-cryptographic hash algorithm. It’s suitable for tasks like checksumming or fingerprinting, but should not be used for cryptographic purposes.
xxHash64
Calculates a 64-bit xxHash hash value from a string.
Syntax
Arguments
string
(String
): The input string to hash.
Returns:
- A
UInt64
hash value.
Example
Query:
Result:
This function calculates the xxHash64 hash of the input string ‘Crunchy Taco Supreme’. The result is a 64-bit unsigned integer.
xxHash64 is a fast non-cryptographic hash algorithm, suitable for hash tables, checksums, and other non-security applications. It provides a good balance between speed and hash quality.
ngramSimHash
Calculates the n-gram simhash of a string. This function is useful for detecting semi-duplicate strings.
Syntax
Arguments
string
(String
): Input string.ngramsize
(UInt8
, optional): Size of the n-gram. Default value: 3. Possible values: any number from 1 to 25.
Returns
- A 64-bit hash value. (
UInt64
)
Details
The function splits the input string into n-grams of the specified size and calculates a simhash based on these n-grams. It is case-sensitive.
This function can be used in combination with bitHammingDistance
to detect similar strings. The smaller the Hamming distance between the simhashes of two strings, the more likely these strings are similar.
Example
Result:
In this example, we calculate the simhash for a taco-related string. This hash can be compared with hashes of other strings to find similar menu items or descriptions.
The ngramSimHash function is particularly useful in scenarios where you need to identify similar text entries, such as finding similar taco recipes or menu descriptions in a large database of food items.
ngramSimHashCaseInsensitive
Splits an ASCII string into n-grams of ngramsize
symbols and returns the n-gram simhash. This function is case insensitive.
It can be used for detecting semi-duplicate strings when combined with the bitHammingDistance
function. The smaller the Hamming Distance between the calculated simhashes of two strings, the more likely these strings are similar.
Syntax
Arguments
string
(String
): Input string.ngramsize
(UInt8
, optional): The size of an n-gram. Possible values: any number from 1 to 25. Default value: 3.
Returns
- A hash value. (
UInt64
)
Example
Result:
In this example, the function calculates the case-insensitive n-gram simhash for the taco menu item “Crunchy Taco Supreme”.
The function is useful for fuzzy matching and finding similar strings, regardless of letter case. This can be particularly helpful when comparing user-generated content or searching for menu items with slight variations in capitalization.
ngramSimHashUTF8
Splits a UTF-8 string into n-grams of ngramsize
symbols and returns the n-gram simhash. Is case sensitive.
Can be used for detection of semi-duplicate strings with bitHammingDistance
. The smaller the Hamming Distance of the calculated simhashes of two strings, the more likely these strings are the same.
Syntax:
Arguments:
string
(String
): Input string.ngramsize
(UInt8
, optional): The size of an n-gram. Possible values: any number from 1 to 25. Default value: 3.
Returns:
- Hash value. Type:
UInt64
.
Example:
Result:
In this example, ngramSimHashUTF8
calculates the simhash for the taco menu item ‘Crunchy Taco Supreme’. This hash can be used to find similar menu items or detect near-duplicate entries in a taco restaurant’s menu database.
The function is case-sensitive, so ‘Crunchy Taco Supreme’ and ‘crunchy taco supreme’ would produce different hash values.
ngramSimHashCaseInsensitiveUTF8
Splits a UTF-8 string into n-grams of ngramsize
symbols and returns the n-gram simhash. This function is case insensitive.
This function can be used for detecting semi-duplicate strings when combined with the bitHammingDistance
function. The smaller the Hamming Distance between the calculated simhashes of two strings, the more likely these strings are similar.
Syntax
Arguments
string
(String
): Input string to hash.ngramsize
(UInt8
, optional): Size of the n-gram. Possible values: any number from 1 to 25. Default value: 3.
Returns:
- A hash value. Type:
UInt64
.
Example
Query:
Result:
In this example, the function calculates the case-insensitive simhash for the UTF-8 string ‘Spicy Taco Tuesday’. This hash can be used to compare similarity with other taco-related phrases, regardless of letter casing.
The function is particularly useful for fuzzy matching of text, such as finding similar taco recipes or menu items, even if they have slight variations in spelling or capitalization.
wordShingleSimHash
Splits a string into word shingles and calculates a SimHash value. This function is case-sensitive.
Syntax
Arguments
string
(String
): Input string.shinglesize
(UInt8
, optional): Size of word shingles. Default value: 3. Possible values: any number from 1 to 25.
Returns
- A 64-bit hash value. (
UInt64
)
Description
This function can be used to detect semi-duplicate strings when combined with the [bitHammingDistance] function. The smaller the Hamming distance between the calculated SimHash values of two strings, the more likely these strings are similar.
Example
Query:
Result:
In this example, the function calculates the SimHash for the given taco-related string using the default shingle size of 3 words.
The SimHash algorithm is particularly useful for finding near-duplicate content, making it valuable for tasks like detecting similar taco recipes or menu descriptions in a large database of taco-related text.
wordShingleSimHashCaseInsensitive
Splits a ASCII string into parts (shingles) of shinglesize
words and returns the word shingle simhash. Is case insensitive.
Can be used for detection of semi-duplicate strings with bitHammingDistance
. The smaller the Hamming Distance of the calculated simhashes of two strings, the more likely these strings are the same.
Syntax:
Arguments:
string
(String
): Input string.shinglesize
(UInt8
, optional): The size of a word shingle. Possible values: any number from 1 to 25. Default value: 3.
Returns:
- Hash value. Type:
UInt64
.
Example:
Result:
In this example, the function calculates a case-insensitive hash of the taco-related phrase. This hash can be used to find similar phrases about tacos, regardless of letter case.
The function is useful for finding similar text content, such as duplicate menu descriptions or customer reviews about tacos, even if they have slight variations in wording or capitalization.
wordShingleSimHashUTF8
Splits a UTF-8 string into word shingles and calculates a SimHash value. This function is case-sensitive.
Syntax
Arguments
string
(String
): Input string to hash.shinglesize
(UInt8
, optional): Size of word shingles. Default value: 3. Possible values: any number from 1 to 25.
Returns:
- A 64-bit hash value. (
UInt64
)
Description
This function can be used to detect similar strings. The smaller the Hamming distance between the calculated SimHash values of two strings, the more likely these strings are similar.
To compare SimHash values, use the bitHammingDistance function.
Example
Result:
In this example, the function calculates the SimHash for a sentence about a taco-stealing fox. The result is a 64-bit hash value that can be used to compare similarity with other strings.
For case-insensitive hashing of UTF-8 strings, use the wordShingleSimHashCaseInsensitiveUTF8 function.
wordShingleSimHashCaseInsensitiveUTF8
Splits a UTF-8 string into parts (shingles) of shinglesize words and returns the word shingle simhash. Is case insensitive.
Can be used for detection of semi-duplicate strings with bitHammingDistance
. The smaller the Hamming Distance of the calculated simhashes of two strings, the more likely these strings are the same.
Syntax:
Arguments:
string
(String
): Input string.shinglesize
(UInt8
, optional): The size of a word shingle. Possible values: any number from 1 to 25. Default value: 3.
Returns:
- Hash value. Type:
UInt64
.
Example:
Result:
In this example, the function calculates a hash for the given taco-related phrase. The result can be used to compare similarity with other taco-related strings, regardless of letter case.
This function is particularly useful for finding similar text content in large datasets, such as identifying near-duplicate taco recipes or reviews with slight variations.
wyHash64
Produces a 64-bit wyHash64 hash value.
Syntax:
Arguments:
string
(String
): Input string to hash.
Returns:
- A 64-bit hash value.
UInt64
.
Example:
Result:
This function calculates a 64-bit hash value for the input string ‘Crunchy Taco Supreme’ using the wyHash64 algorithm. The result is returned as a UInt64 value.
wyHash64 is a fast, high-quality hash function that can be useful for various purposes such as hash tables, checksums, or generating unique identifiers for strings.
ngramMinHash
Splits a string into n-grams and calculates hash values for each n-gram. Returns a tuple with minimum and maximum hashes.
Syntax
Arguments
string
(String
): Input string.ngramsize
(UInt8
, optional): Size of each n-gram. Default value: 3. Range: 1 to 25.hashnum
(UInt8
, optional): Number of minimum and maximum hashes to calculate. Default value: 6. Range: 1 to 25.
Returns
- A tuple with two hashes — the minimum and the maximum. (
Tuple(UInt64, UInt64)
)
Example
Result:
This function can be used for detecting semi-duplicate strings when combined with tupleHammingDistance
. If one of the returned hashes is the same for two strings, those strings are likely similar.
The function is case-sensitive. For case-insensitive hashing, use ngramMinHashCaseInsensitive
.
ngramMinHashCaseInsensitive
Splits an ASCII string into n-grams of ngramsize
symbols and calculates hash values for each n-gram. Uses hashnum
minimum hashes to calculate the minimum hash and hashnum
maximum hashes to calculate the maximum hash. Returns a tuple with these hashes. This function is case insensitive.
This function can be used for detecting semi-duplicate strings when combined with the tupleHammingDistance
function. If one of the returned hashes is the same for two strings, those strings are considered similar.
Syntax
Arguments
string
(String
): Input string.ngramsize
(UInt8
, optional): Size of each n-gram. Possible values: any number from 1 to 25. Default value: 3.hashnum
(UInt8
, optional): Number of minimum and maximum hashes used to calculate the result. Possible values: any number from 1 to 25. Default value: 6.
Returns
- A tuple with two hashes — the minimum and the maximum. (
Tuple(UInt64, UInt64)
)
Example
Query:
Result:
In this example, the function calculates the case-insensitive n-gram min hash for the taco name “Crunchy Taco Supreme”. The result is a tuple of two UInt64 values representing the minimum and maximum hashes.
The actual hash values may differ in your results due to the nature of hash functions.
ngramMinHashUTF8
Splits a UTF-8 string into n-grams of ngramsize
symbols and calculates hash values for each n-gram. Uses hashnum
minimum hashes to calculate the minimum hash and hashnum
maximum hashes to calculate the maximum hash. Returns a tuple with these hashes. Is case sensitive.
Can be used for detection of semi-duplicate strings with tupleHammingDistance
. For two strings: if one of the returned hashes is the same for both strings, we consider those strings to be similar.
Syntax
Arguments
string
(String
): Input string.ngramsize
(UInt8
, optional): The size of an n-gram. Possible values: any number from 1 to 25. Default value: 3.hashnum
(UInt8
, optional): The number of minimum and maximum hashes used to calculate the result. Possible values: any number from 1 to 25. Default value: 6.
Returns:
- A tuple with two hashes — the minimum and the maximum. (
Tuple(UInt64, UInt64)
)
Example
Query:
Result:
In this example, ngramMinHashUTF8
calculates the hash values for the taco name. The result can be used to find similar taco names or descriptions in a large dataset.
ngramMinHashCaseInsensitiveUTF8
Splits a UTF-8 string into n-grams of ngramsize
symbols and calculates hash values for each n-gram. Uses hashnum
minimum hashes to calculate the minimum hash and hashnum
maximum hashes to calculate the maximum hash. Returns a tuple with these hashes. Is case insensitive.
Can be used for detection of semi-duplicate strings with tupleHammingDistance
. For two strings: if one of the returned hashes is the same for both strings, we consider those strings to be similar.
Syntax
Arguments
string
(String
): Input string.ngramsize
(UInt8
, optional): The size of an n-gram. Possible values: any number from 1 to 25. Default value: 3.hashnum
(UInt8
, optional): The number of minimum and maximum hashes used to calculate the result. Possible values: any number from 1 to 25. Default value: 6.
Returns
- Tuple with two hashes — the minimum and the maximum. (
Tuple(UInt64, UInt64)
)
Example
Query:
Result:
In this example, the function calculates the case-insensitive n-gram min hash for the taco name ‘Crunchy Taco Supreme’. The result is a tuple containing two UInt64 values representing the minimum and maximum hashes.
This function is particularly useful for finding similar strings in large datasets, such as menu items or product names, regardless of case differences.
ngramMinHashArg
Splits an ASCII string into n-grams of a specified size and returns the n-grams with minimum and maximum hashes, as calculated by the ngramMinHash
function with the same input. This function is case-sensitive.
Syntax
Arguments
string
(String
): Input string.ngramsize
(UInt8
, optional): Size of each n-gram. Possible values: any number from 1 to 25. Default value: 3.hashnum
(UInt8
, optional): Number of minimum and maximum hashes used to calculate the result. Possible values: any number from 1 to 25. Default value: 6.
Returns
- A tuple containing two tuples, each with
hashnum
n-grams. (Tuple(Tuple(String), Tuple(String))
)
Example
Result:
In this example:
- The function splits ‘Tasty Taco Tuesday’ into 3-character n-grams.
- It returns two tuples, each containing two n-grams (as specified by
hashnum = 2
). - The first tuple contains n-grams with minimum hashes, and the second tuple contains n-grams with maximum hashes.
This function is useful for finding similar strings or for text analysis tasks where you need to identify characteristic n-grams of a string based on their hash values.
ngramMinHashArgCaseInsensitive
Splits an ASCII string into n-grams of ngramsize
symbols and returns the n-grams with minimum and maximum hashes, calculated by the ngramMinHashCaseInsensitive
function with the same input. This function is case insensitive.
Syntax
Arguments
string
(String
): Input string.ngramsize
(UInt8
, optional): The size of an n-gram. Possible values: any number from 1 to 25. Default value: 3.hashnum
(UInt8
, optional): The number of minimum and maximum hashes used to calculate the result. Possible values: any number from 1 to 25. Default value: 6.
Returns:
- A tuple with two tuples, each containing
hashnum
n-grams. (Tuple(Tuple(String), Tuple(String))
)
Example
Query:
Result:
In this example:
- The function splits ‘Crunchy Taco Supreme’ into 3-character n-grams.
- It returns 2 minimum and 2 maximum hashes (due to
hashnum = 2
). - The result shows the n-grams corresponding to these hashes.
- Note that the case is ignored: ‘chy’ could come from ‘Chy’ in ‘Crunchy’.
This function can be useful for finding similar strings while ignoring case, such as in taco menu item comparisons or ingredient matching where capitalization might vary.
ngramMinHashArgUTF8
Splits a UTF-8 string into n-grams of ngramsize
symbols and returns the n-grams with minimum and maximum hashes, calculated by the ngramMinHashUTF8
function with the same input. Is case sensitive.
Syntax
Arguments
string
(String
): Input string.ngramsize
(UInt8
, optional): The size of an n-gram. Possible values: any number from 1 to 25. Default value: 3.hashnum
(UInt8
, optional): The number of minimum and maximum hashes used to calculate the result. Possible values: any number from 1 to 25. Default value: 6.
Returns
- Tuple with two tuples with
hashnum
n-grams each. (Tuple(Tuple(String), Tuple(String))
)
Example
Query:
Result:
In this example:
- The function splits the input string into 1-gram words.
- It returns two tuples, each containing 3 words (as specified by
hashnum = 3
). - The first tuple contains words with minimum hashes, and the second tuple contains words with maximum hashes.
This function is useful for text analysis, particularly when you need to identify characteristic n-grams in UTF-8 encoded text while preserving case sensitivity.
ngramMinHashArgCaseInsensitiveUTF8
Splits a UTF-8 string into n-grams of ngramsize
symbols and returns the n-grams with minimum and maximum hashes, calculated by the ngramMinHashCaseInsensitiveUTF8
function with the same input. This function is case insensitive.
Syntax
Arguments
string
(String
): Input string.ngramsize
(UInt8
, optional): The size of an n-gram. Possible values: any number from 1 to 25. Default value: 3.hashnum
(UInt8
, optional): The number of minimum and maximum hashes used to calculate the result. Possible values: any number from 1 to 25. Default value: 6.
Returns
- Tuple with two tuples, each containing
hashnum
n-grams. (Tuple(Tuple(String), Tuple(String))
)
Example
Query:
Result:
In this example:
- The function splits ‘Crunchy Taco Supreme’ into 2-grams (bigrams).
- It returns two tuples, each containing 3 bigrams.
- The first tuple contains the bigrams with the minimum hashes.
- The second tuple contains the bigrams with the maximum hashes.
- The function is case-insensitive, so ‘TA’ and ‘ta’ are treated the same.
This function can be useful for detecting similar strings or for text analysis tasks where case sensitivity is not important.
wordShingleMinHash
Splits a string into word shingles and calculates hash values for each shingle. Returns a tuple with minimum and maximum hashes. This function is case-sensitive.
Syntax
Arguments
string
(String
): Input string.shinglesize
(UInt8
, optional): Size of word shingles. Possible values: 1 to 25. Default value: 3.hashnum
(UInt8
, optional): Number of minimum and maximum hashes used. Possible values: 1 to 25. Default value: 6.
Returns
- A tuple with two hashes — the minimum and the maximum. (
Tuple(UInt64, UInt64)
)
Example
Result:
This function can be used for detecting semi-duplicate strings when combined with [tupleHammingDistance]. If one of the returned hashes is the same for two strings, those strings are likely similar.
The function is designed for ASCII strings. For UTF-8 strings, use wordShingleMinHashUTF8
.
wordShingleMinHashCaseInsensitive
Splits an ASCII string into word shingles and calculates a case-insensitive hash value for each shingle. Uses a specified number of minimum and maximum hashes to compute the final result.
Syntax
Arguments
string
(String
): Input string to process.shinglesize
(UInt8
, optional): Size of each word shingle. Default value: 3. Range: 1 to 25.hashnum
(UInt8
, optional): Number of minimum and maximum hashes to use. Default value: 6. Range: 1 to 25.
Returns
- A tuple containing two hash values: the minimum and maximum. (
Tuple(UInt64, UInt64)
)
Details
This function is useful for detecting semi-duplicate strings. It can be used in conjunction with tupleHammingDistance
. If one of the returned hashes is the same for two strings, those strings are likely similar.
The function is case-insensitive, meaning “Taco” and “taco” will be treated the same.
Example
Result:
In this example, the function computes hash values for the input string about tacos. The result can be used to compare similarity with other taco-related strings.
For UTF-8 strings, use the wordShingleMinHashCaseInsensitiveUTF8
function instead.
wordShingleMinHashUTF8
Splits a UTF-8 string into word shingles and calculates hash values for each shingle. Uses the minimum and maximum hashes to produce a tuple of hash values. This function is case-sensitive.
Syntax
Arguments
string
(String
): Input string to be hashed.shinglesize
(UInt8
, optional): Size of each word shingle. Default value: 3. Range: 1 to 25.hashnum
(UInt8
, optional): Number of minimum and maximum hashes to calculate. Default value: 6. Range: 1 to 25.
Returns
- A tuple containing two hash values: the minimum and the maximum. (
Tuple(UInt64, UInt64)
)
This function can be used to detect semi-duplicate strings when combined with [tupleHammingDistance]. If one of the returned hashes is the same for two strings, those strings are likely similar.
Example
Result:
In this example, the function calculates hash values for word shingles in the taco-related sentence. The resulting tuple contains the minimum and maximum hash values computed from these shingles.
This function is particularly useful for text similarity analysis, especially when dealing with multilingual content or texts containing special characters, as it properly handles UTF-8 encoded strings.
wordShingleMinHashCaseInsensitiveUTF8
Splits a UTF-8 string into parts (shingles) of shinglesize words and calculates hash values for each word shingle. Uses hashnum minimum hashes to calculate the minimum hash and hashnum maximum hashes to calculate the maximum hash. Returns a tuple with these hashes. Is case insensitive.
Can be used for detection of semi-duplicate strings with tupleHammingDistance
. For two strings: if one of the returned hashes is the same for both strings, we think that those strings are the same.
Syntax
Arguments
string
(String
): Input string.shinglesize
(UInt8
, optional): The size of a word shingle. Possible values: any number from 1 to 25. Default value: 3.hashnum
(UInt8
, optional): The number of minimum and maximum hashes used to calculate the result. Possible values: any number from 1 to 25. Default value: 6.
Returns:
- Tuple with two hashes — the minimum and the maximum. (
Tuple(UInt64, UInt64)
)
Example
Query:
Result:
In this example, the function calculates a hash for the taco-related string. The result can be used to find similar taco-related phrases, regardless of case.
wordShingleMinHashArg
Splits a ASCII string into parts (shingles) of a specified word size and returns the shingles with minimum and maximum word hashes. This function is case sensitive.
Syntax
Arguments
string
(String
): Input string.shinglesize
(UInt8
, optional): The size of a word shingle. Possible values: any number from 1 to 25. Default value: 3.hashnum
(UInt8
, optional): The number of minimum and maximum hashes used to calculate the result. Possible values: any number from 1 to 25. Default value: 6.
Returns
- A tuple with two tuples, each containing
hashnum
word shingles. (Tuple(Tuple(String), Tuple(String))
)
Example
Result:
In this example:
- The function splits the input string into single words (
shinglesize = 1
). - It returns two tuples with 3 words each (
hashnum = 3
). - The first tuple contains words with minimum hashes, and the second tuple contains words with maximum hashes.
This function can be useful for text analysis, finding similar documents, or implementing efficient text search algorithms.
The results may vary depending on the input string and the hash function used internally.
wordShingleMinHashArgCaseInsensitive
Splits an ASCII string into word shingles and returns the shingles with minimum and maximum word hashes. This function is case insensitive.
Syntax
Arguments
string
(String
): Input string.shinglesize
(UInt8
, optional): Size of each word shingle. Possible values: any number from 1 to 25. Default value: 3.hashnum
(UInt8
, optional): Number of minimum and maximum hashes used to calculate the result. Possible values: any number from 1 to 25. Default value: 6.
Returns
- A tuple containing two tuples, each with
hashnum
word shingles. (Tuple(Tuple(String), Tuple(String))
)
Example
Query:
Result:
This function can be used for detecting semi-duplicate strings when combined with tupleHammingDistance
. The smaller the Hamming distance between the calculated hashes of two strings, the more likely these strings are similar.
This function is the case-insensitive version of wordShingleMinHashArg
. It’s particularly useful when you want to compare strings regardless of their letter casing.
wordShingleMinHashArgUTF8
Splits a UTF-8 string into word shingles and returns the shingles with minimum and maximum word hashes. This function is case-sensitive.
Syntax
Arguments
string
(String
): Input string.shinglesize
(UInt8
, optional): Size of each word shingle. Possible values: 1 to 25. Default value: 3.hashnum
(UInt8
, optional): Number of minimum and maximum hashes used to calculate the result. Possible values: 1 to 25. Default value: 6.
Returns
- A tuple containing two tuples, each with
hashnum
word shingles. (Tuple(Tuple(String), Tuple(String))
)
Example
Result:
In this example:
- The function splits the input string into individual words.
- It returns two tuples: one with the three words that have the minimum hashes, and another with the two words that have the maximum hashes.
- The shingle size is set to 1 (individual words) and the number of hashes is set to 3.
This function is useful for finding similar strings or for text classification tasks, especially when working with multilingual text in UTF-8 encoding.
wordShingleMinHashArgCaseInsensitiveUTF8
Splits a UTF-8 string into word shingles and returns the shingles with minimum and maximum word hashes. This function is case insensitive.
Syntax
Arguments
string
(String
): Input string.shinglesize
(UInt8
, optional): Size of each word shingle. Possible values: any number from 1 to 25. Default value: 3.hashnum
(UInt8
, optional): Number of minimum and maximum hashes used to calculate the result. Possible values: any number from 1 to 25. Default value: 6.
Returns
- A tuple containing two tuples, each with
hashnum
word shingles. (Tuple(Tuple(String), Tuple(String))
)
Example
Result:
In this example:
- The function splits the taco order description into individual words.
- It returns two tuples: one with the shingles that produced the minimum hashes, and another with those that produced the maximum hashes.
- The shingle size is set to 1 (individual words) and we’re getting 3 shingles for each tuple.
This function is useful for detecting similar strings or for text classification tasks, especially when dealing with multilingual text or when case sensitivity is not important.
sqidEncode
Encodes numbers as a Sqid which is a YouTube-like ID string. The output alphabet is abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789. Do not use this function for hashing - the generated IDs can be decoded back into the original numbers.
Syntax:
Alias:
- sqid
Arguments:
- A variable number of
UInt8
,UInt16
,UInt32
orUInt64
numbers.
Returns:
A sqid String
.
Example:
Result:
In this example, sqidEncode
generates a unique ID for a taco order using the numbers 1, 2, 3, 4, and 5. This could be useful for creating short, readable order IDs in a taco ordering system.
The sqidEncode
function is particularly useful when you need to generate short, URL-friendly IDs from a series of numbers. It’s ideal for creating readable identifiers for orders, products, or any other entities in your database that need a compact, reversible representation.
sqidDecode
Decodes a Sqid back into its original numbers.
Syntax:
Arguments:
sqid
(String
): A Sqid string to decode.
Returns:
- An array of numbers decoded from the Sqid. Returns an empty array if the input string is not a valid Sqid. (
Array(UInt64)
)
Example:
Result:
In this taco-themed example, ‘gXHfJ1C6dN’ might represent an encoded taco order ID. The sqidDecode
function decodes it back into its original numeric components, which could represent various aspects of the order such as order number, number of tacos, sauce type, etc.
This function is the inverse of sqidEncode
. It’s useful for decoding short, URL-friendly IDs back into their original numeric form. However, it should not be used for security-sensitive applications as the encoding is reversible.
Was this page helpful?