URL functions
Parse and manipulate URL strings.
URL functions provide tools for working with URL strings. These functions are useful for:
- Web analytics
- Log processing
- Any task involving URL data
With URL functions, you can:
- Extract key components of a URL:
  - Protocol (e.g., http, https)
  - Domain
  - Path
  - Query parameters
- Perform common URL operations:
  - Encoding
  - Decoding
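For orientation, here is a minimal Python sketch of the same idea using the standard library's urllib.parse. It is only an analogue of the components listed above, not how the functions in this reference are implemented, and the URL is a made-up example.

```python
from urllib.parse import quote, unquote, urlsplit

# Hypothetical URL, used only for illustration.
url = "https://www.example.com:8443/menu/tacos?size=large&sauce=hot#reviews"

parts = urlsplit(url)
protocol = parts.scheme   # "https"
domain = parts.hostname   # "www.example.com"
path = parts.path         # "/menu/tacos"
query = parts.query       # "size=large&sauce=hot"

# Percent-encode and decode a single component.
encoded = quote("extra guac", safe="")  # "extra%20guac"
decoded = unquote(encoded)              # "extra guac"

print(protocol, domain, path, query, encoded, decoded)
```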
ClickHouse function reference
protocol
Extracts the protocol from a URL.
Syntax:
Arguments:
- url (String): URL to extract protocol from.
Returns:
- The protocol, or an empty string if it cannot be determined. (String)
Example:
Result:
This function extracts the protocol (e.g., http, https, ftp) from the given URL. If the protocol cannot be determined, it returns an empty string.
This function is optimized for performance and may not strictly follow RFC 3986. For RFC-compliant parsing, use protocolRFC instead.
domain
Extracts the hostname from a URL.
Syntax:
Arguments:
- url (String): URL.
Returns:
- The hostname if the input string can be parsed as a URL, otherwise an empty string. (String)
Example:
Result:
This function works with URLs specified with or without a protocol. For example:
Result:
The function is optimized for performance and may not strictly follow RFC 3986. For RFC-compliant parsing, use domainRFC.
domainRFC
Extracts the hostname from a URL, conforming to RFC 3986.
Syntax:
Arguments:
- url (String): URL.
Returns:
- The hostname if the input string can be parsed as a URL, otherwise an empty string. (String)
Example:
Result:
This function is similar to domain, but it strictly follows RFC 3986. It’s particularly useful when dealing with URLs that contain special characters or complex structures.
The domainRFC function can handle URLs with user information and non-standard ports, which the non-RFC variant might struggle with.
domainWithoutWWW
Returns the domain name without the leading www. subdomain, if present.
Syntax
Arguments
- url (String): URL.
Returns:
- The domain name without the leading www. subdomain, if present. (String)
Example
Result:
This function is useful for normalizing domain names in URL analysis, especially when you want to treat ‘www.example.com’ and ‘example.com’ as the same domain.
If the URL doesn’t contain www. at the beginning of the domain, the function returns the domain as is. If the input is not a valid URL or doesn’t contain a domain, an empty string is returned.
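The normalization described above can be approximated in Python. This is a sketch under stated assumptions: the helper name domain_without_www is hypothetical, and urllib.parse only recognizes a host when the URL includes a protocol, which is stricter than the behavior described for domain.

```python
from urllib.parse import urlsplit

def domain_without_www(url):
    """Return the host with a leading 'www.' removed, or '' if there is no host."""
    host = urlsplit(url).hostname or ""
    # Only strip the prefix when it is at the very start of the host.
    return host[4:] if host.startswith("www.") else host

print(domain_without_www("https://www.example.com/menu"))
print(domain_without_www("https://example.com/menu"))
```

Both calls yield the same host, which is what makes the function handy for grouping traffic by site.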
domainWithoutWWWRFC
Returns the domain without the leading www. if present. This function conforms to RFC 3986.
Syntax
Arguments
- url (String): URL.
Returns:
- The domain name without the leading www. if the input can be parsed as a URL, otherwise an empty string. (String)
Example
Query:
Result:
In this example, domainWithoutWWWRFC extracts the domain ‘tacosoft.com’ from the URL, removing the www. prefix in conformance with RFC 3986.
This function is particularly useful when you need to extract the domain from URLs that may contain various components like usernames, passwords, ports, and query parameters, while ensuring compliance with RFC 3986.
topLevelDomain
Extracts the top-level domain from a URL.
Syntax:
Arguments:
- url (String): URL.
Returns:
- The top-level domain if the input string can be parsed as a URL, otherwise an empty string. (String)
Example:
Result:
This function extracts ‘com’ as the top-level domain from the TacoSoft URL.
The URL can be specified with or without a protocol. For example:
will also return ‘com’.
If the function cannot parse the input as a URL or if there’s no valid top-level domain, it returns an empty string.
topLevelDomainRFC
Extracts the top-level domain from a URL. This function conforms to RFC 3986.
Syntax
Arguments
- url (String): URL.
Returns
- The top-level domain name if the input string can be parsed as a URL, otherwise an empty string. (String)
Example
Result:
This function is particularly useful when you need to extract the top-level domain from URLs in a way that strictly adheres to RFC 3986 standards. It can handle complex URLs and properly parse various top-level domains, including country code top-level domains (ccTLDs) and new generic top-level domains (gTLDs).
Unlike its non-RFC counterpart, topLevelDomainRFC can correctly handle URLs with special characters in the user info part (the part before the @ symbol), such as %, ;, =, and &, as defined in RFC 3986.
firstSignificantSubdomain
Returns the “first significant subdomain” of a URL.
Syntax
Arguments
- url (String): URL.
Returns
- The first significant subdomain. (String)
Description
The first significant subdomain is determined as follows:
- If the second-level domain is com, net, org, or co (for example, the co in example.co.uk), it returns the third-level domain.
- In all other cases, it returns the second-level domain.
Examples
Result:
Result:
The list of “insignificant” second-level domains and other implementation details may change in future versions.
This function is optimized for performance and may not strictly follow URL parsing standards. For RFC-compliant parsing, use firstSignificantSubdomainRFC.
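The rule in the Description can be sketched in Python as follows. This is an illustrative approximation only: the helper name and the four-entry set are assumptions taken from the examples in the text, the real list may differ, and edge cases such as www handling or single-label hosts are not modeled.

```python
from urllib.parse import urlsplit

# Assumption: only the second-level domains named in the text.
INSIGNIFICANT_SECOND_LEVEL = {"com", "net", "org", "co"}

def first_significant_subdomain(url):
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    # Assumption: hosts with fewer than two labels yield an empty string.
    if len(labels) < 2 or "" in labels:
        return ""
    # e.g. news.example.co.uk: second-level "co" is insignificant, so take "example".
    if len(labels) >= 3 and labels[-2] in INSIGNIFICANT_SECOND_LEVEL:
        return labels[-3]
    # e.g. news.example.com: second-level "example" is the significant one.
    return labels[-2]

print(first_significant_subdomain("https://news.example.com/"))
print(first_significant_subdomain("https://news.example.co.uk/"))
```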
firstSignificantSubdomainRFC
Returns the “first significant subdomain” of a URL, conforming to RFC 3986.
Syntax
Arguments
- url (String): URL.
Returns
- The first significant subdomain. (String)
Description
The first significant subdomain is determined as follows:
- If the second-level domain is com, net, org, or co (for example, the co in example.co.uk), it returns the third-level domain.
- For other domains, it returns the second-level domain.
This function is similar to firstSignificantSubdomain, but strictly adheres to RFC 3986 for URL parsing.
Examples
Result:
Result:
The list of “insignificant” second-level domains and other implementation details may change in future versions.
cutToFirstSignificantSubdomain
Returns the part of the domain that includes top-level subdomains up to the “first significant subdomain”.
Syntax
Arguments
- url (String): URL.
Returns
- The part of the domain that includes top-level subdomains up to the first significant subdomain if possible, otherwise an empty string. (String)
Example
Result:
In this example:
- subdomain returns ‘tacosoft.com’, which is the part of the domain up to the first significant subdomain.
- subdomain2 returns ‘tacosoft’, as ‘www’ is not considered significant.
- subdomain3 returns an empty string, as ‘tacosoft’ alone is not a valid domain with a significant subdomain.
This function is useful for extracting the main part of a domain while ignoring less significant subdomains like ‘www’ or country-specific extensions.
cutToFirstSignificantSubdomainRFC
Returns the part of the domain that includes top-level subdomains up to the “first significant subdomain”. Similar to cutToFirstSignificantSubdomain, but conforms to RFC 3986.
Syntax
Arguments
- url (String): URL.
Returns
- The part of the domain that includes top-level subdomains up to the first significant subdomain if possible, otherwise an empty string. (String)
Example
Result:
In this example, cutToFirstSignificantSubdomainRFC correctly extracts ‘delicious-tacos.com’ from the URL, accounting for the user information and port as per RFC 3986. The non-RFC version fails to parse the URL correctly due to the presence of user information.
This function is particularly useful when dealing with complex URLs that may contain user information, non-standard ports, or other elements that require strict adherence to URL standards.
cutToFirstSignificantSubdomainWithWWW
Returns the part of the domain that includes top-level subdomains up to the “first significant subdomain”, without stripping www.
Syntax:
Arguments:
- url (String): URL.
Returns:
- The part of the domain that includes top-level subdomains up to the first significant subdomain (with www) if possible, otherwise an empty string. (String)
Example:
Result:
In this example:
- For domain, the function returns tacosoft.com, preserving the www if it was present.
- For domain2, it returns tacosoft.co, keeping the www.
- For domain3, it returns tacosoft.io as there’s no subdomain to cut.
This function is useful when you want to extract the main part of a domain while preserving the www subdomain if it exists.
cutToFirstSignificantSubdomainWithWWWRFC
Returns the part of the domain that includes top-level subdomains up to the “first significant subdomain”, without stripping www. This function conforms to RFC 3986.
Syntax
Arguments
- url (String): URL.
Returns:
- The part of the domain that includes top-level subdomains up to the first significant subdomain (with www) if possible, otherwise an empty string. (String)
Example
Result:
This function is useful for extracting the main part of a domain while preserving the “www” prefix if present. It handles various top-level domains correctly, including country-specific ones.
The function adheres to RFC 3986, ensuring proper handling of special characters and edge cases in URLs.
port
Extracts the port number from a URL, or returns a default port if not specified.
Syntax
Arguments
- url (String): URL to extract the port from.
- default_port (UInt16, optional): The default port number to return if no port is specified in the URL.
Returns
- The port number from the URL, or the default port if not specified. (UInt16)
Example
Result:
In this example:
- port_number extracts the explicitly specified port (8443) from the URL.
- default_port returns the provided default value (80) since no port is specified in the URL.
If the URL cannot be parsed or doesn’t contain a port, and no default port is provided, the function returns 0.
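The fallback logic described here (explicit port, else the supplied default, else 0) can be sketched in Python. The helper name port_or_default and the URLs are hypothetical, and urllib.parse stands in for the actual parser.

```python
from urllib.parse import urlsplit

def port_or_default(url, default_port=0):
    """Return the explicit port, or default_port when none can be read."""
    try:
        explicit = urlsplit(url).port
    except ValueError:
        # Raised by urllib.parse for malformed or out-of-range ports.
        return default_port
    return explicit if explicit is not None else default_port

print(port_or_default("https://example.com:8443/order"))
print(port_or_default("https://example.com/order", 80))
print(port_or_default("https://example.com/order"))
```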
portRFC
Returns the port number from a URL, or a default port if not specified. This function conforms to RFC 3986.
Syntax
Arguments
- url (String): URL to extract the port from.
- default_port (UInt16, optional): The port number to return if no port is specified in the URL. Default: 0.
Returns
- The port number from the URL, or the default port if not specified. (UInt16)
Example
Result:
In this example:
- port_with_url returns 8080, which is explicitly specified in the URL.
- port_with_default returns 443 (the default HTTPS port) since no port is specified in the URL.
This function is RFC 3986 compliant, which means it correctly handles URLs with special characters or unusual formats. For non-RFC compliant URL parsing, use the port function instead.
path
Extracts the path from a URL without the query string.
Syntax
Arguments
- url (String): URL.
Returns
- The path component of the URL without the query string. (String)
Example
Result:
This function extracts /menu/burritos from the URL, omitting the query string ?size=large&extra=guac and the fragment #nutrition.
If the URL does not contain a path, an empty string is returned.
pathFull
Returns the full path of a URL, including the query string and fragment.
Syntax:
Arguments:
- url (String): URL.
Returns:
- The full path of the URL, including the query string and fragment. (String)
Example:
Result:
In this example, pathFull extracts the complete path from the URL, including the query string ?size=large and the fragment #spiciness.
If the URL doesn’t contain a path, query string, or fragment, an empty string is returned.
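To make the difference between path and pathFull concrete, here is a small Python sketch. The helper names and the host are hypothetical, and urllib.parse is used as a stand-in parser.

```python
from urllib.parse import urlsplit

def path_only(url):
    """Path without query string or fragment."""
    return urlsplit(url).path

def path_full(url):
    """Path followed by '?query' and '#fragment' when they are present."""
    parts = urlsplit(url)
    result = parts.path
    if parts.query:
        result += "?" + parts.query
    if parts.fragment:
        result += "#" + parts.fragment
    return result

url = "https://example.com/menu/burritos?size=large#spiciness"
print(path_only(url))
print(path_full(url))
```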
queryString
Extracts the query string from a URL, without the initial question mark and without the fragment (# and everything after it).
Syntax
Arguments
- url (String): URL to extract the query string from.
Returns
- The query string without the initial question mark and fragment identifier. (String)
Example
Result:
In this example, queryString extracts ‘items=3&sauce=hot’ from the URL, omitting the initial ‘?’ and everything from ‘#’ onward.
If the URL doesn’t contain a query string, an empty string is returned.
fragment
Extracts the fragment identifier from a URL, without the initial hash symbol.
Syntax
Arguments
- url (String): URL to extract the fragment from.
Returns
- The fragment identifier without the initial hash symbol, or an empty string if there is no fragment. (String)
Example
Result:
In this example, fragment extracts ‘spicy-tacos’ from the URL, which represents the specific section of the taco menu being referenced.
If the URL doesn’t contain a fragment identifier, an empty string is returned:
Result:
This function is useful for analyzing user navigation patterns within a webpage or for processing URLs that use fragment identifiers to represent different views or states of a single-page application.
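As an illustration of how the query string and fragment are separated, the following Python sketch uses urllib.parse. The helper names and URLs are hypothetical, and this is an analogue of queryString and fragment rather than their implementation.

```python
from urllib.parse import urlsplit

def query_string(url):
    """Query string without the leading '?' and without the fragment."""
    return urlsplit(url).query

def fragment(url):
    """Fragment without the leading '#', or '' when there is none."""
    return urlsplit(url).fragment

url = "https://example.com/order?items=3&sauce=hot#spicy-tacos"
print(query_string(url))
print(fragment(url))
print(repr(fragment("https://example.com/order?items=3")))
```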
queryStringAndFragment
Returns the query string and fragment identifier from a URL.
Syntax
Arguments
- url (String): URL.
Returns:
- The query string and fragment identifier. (String)
Example
Result:
This function returns the query string without the leading ‘?’, followed by the fragment identifier including its ‘#’.
If the URL doesn’t contain a query string or fragment identifier, an empty string is returned.
- This function does not decode URL-encoded characters.
- If only a fragment identifier is present (without a query string), it will still be returned.
extractURLParameter
Extracts the value of a specified parameter from a URL.
Syntax
Arguments
- url (String): The URL to extract the parameter from.
- name (String): The name of the parameter to extract.
Returns
- The value of the specified parameter if present in the URL, otherwise an empty string. (String)
- If there are multiple parameters with the same name, the function returns the value of the first occurrence.
- The function assumes that the parameter in the URL is encoded in the same way as in the name argument.
Example
Result:
In this example, the function extracts the value of the ‘toppings’ parameter from a taco order URL.
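The two notes above (first occurrence wins, names are compared without decoding) can be sketched in Python. The helper name extract_url_parameter and the URLs are hypothetical; this mirrors the documented behavior, not the actual implementation.

```python
from urllib.parse import urlsplit

def extract_url_parameter(url, name):
    """Return the raw value of the first parameter called name, or ''."""
    for pair in urlsplit(url).query.split("&"):
        key, _, value = pair.partition("=")
        # Names are compared as written: no percent-decoding is applied.
        if key == name:
            return value
    return ""

url = "https://example.com/order?toppings=cheese&size=large&toppings=salsa"
print(extract_url_parameter(url, "toppings"))
print(repr(extract_url_parameter(url, "sauce")))
```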
extractURLParameters
Extracts all parameters and their values from a URL query string.
Syntax
Arguments
- url (String): The URL to extract parameters from.
Returns
- An array of name=value strings corresponding to the URL parameters. (Array(String))
Example
Result:
This function extracts all parameters from the URL’s query string, returning them as an array of strings in the format name=value. The values are not decoded.
If the URL doesn’t contain any parameters, an empty array is returned.
extractURLParameterNames
Extracts the names of parameters from a URL.
Syntax
Arguments
- url (String): URL to extract parameter names from.
Returns
- An array of strings containing the names of URL parameters. (Array(String))
Example
Result:
This function extracts the names of all parameters from the URL query string. It does not decode the parameter names, so they are returned exactly as they appear in the URL.
- If the URL has no parameters, an empty array is returned.
- The function does not handle duplicate parameter names in any special way; all occurrences will be included in the result.
- The order of parameter names in the resulting array matches their order in the URL.
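A short Python sketch shows the shape of the arrays returned by extractURLParameters and extractURLParameterNames: raw name=value pairs, duplicates kept, and URL order preserved. The helper names and URL are hypothetical.

```python
from urllib.parse import urlsplit

def extract_url_parameters(url):
    """Raw 'name=value' pairs in URL order, without decoding."""
    return [pair for pair in urlsplit(url).query.split("&") if pair]

def extract_url_parameter_names(url):
    """Parameter names in URL order, duplicates included."""
    return [pair.partition("=")[0] for pair in extract_url_parameters(url)]

url = "https://example.com/order?taco=al%20pastor&size=large&taco=carnitas"
print(extract_url_parameters(url))
print(extract_url_parameter_names(url))
```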
URLHierarchy
Returns an array of progressively longer prefixes of the URL, each cut at a / or ? separator in the path and query string. Consecutive separator characters are counted as one, and each cut is made after all the consecutive separator characters.
Syntax
Arguments
- url (String): URL.
Returns
- An array of strings containing the hierarchical parts of the URL. (Array(String))
Example
Result:
This function is useful for analyzing URL structures and creating hierarchical representations of web pages. It can be particularly helpful in web analytics tasks, site structure analysis, or when working with URL-based data.
The function includes the protocol and domain in the result, unlike URLPathHierarchy, which focuses only on the path.
URLPathHierarchy
Returns an array containing the URL path hierarchy, excluding the protocol and host.
Syntax
Arguments
- url (String): URL.
Returns
- An array of strings representing the URL path hierarchy. (Array(String))
Example
Result:
This function splits the URL path into hierarchical components:
- It removes the protocol and domain.
- It splits the remaining path at each forward slash (/).
- It builds an array where each element is a progressively longer portion of the path.
The function is useful for analyzing URL structures and creating hierarchical reports based on URL paths.
The root path (‘/’) is not included in the result array.
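The three steps above can be sketched in Python. This is a simplified illustration: the helper name url_path_hierarchy is hypothetical, and the sketch only looks at the path, ignoring query strings and repeated slashes.

```python
from urllib.parse import urlsplit

def url_path_hierarchy(url):
    """Progressively longer path prefixes, skipping the root '/'."""
    path = urlsplit(url).path  # drops the protocol and domain
    prefixes = []
    for i, ch in enumerate(path):
        # Skip the leading slash so the root path is not reported.
        if ch == "/" and i > 0:
            prefixes.append(path[: i + 1])
    # Add the full path unless the last cut already covers it.
    if path and not path.endswith("/"):
        prefixes.append(path)
    return prefixes

print(url_path_hierarchy("https://example.com/menu/tacos/al-pastor"))
```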
encodeURLComponent
Encodes a URL component by replacing certain characters with their percent-encoded equivalents.
Syntax:
Arguments:
- url (String): The URL component to encode.
Returns:
- The encoded URL component. (String)
Example:
Result:
In this example, the function encodes the taco name “Spicy Jalapeño Taco”, replacing spaces with “%20” and the “ñ” character with its UTF-8 percent-encoded equivalent “%C3%B1”.
This function is useful when you need to include special characters or non-ASCII characters in a URL, ensuring that the URL remains valid and properly formatted. It’s particularly helpful when constructing URLs with query parameters that may contain special characters.
decodeURLComponent
Decodes a URL-encoded string.
Syntax:
Arguments:
- url (String): The URL-encoded string to decode.
Returns:
- The decoded string. (String)
Example:
Result:
This function decodes URL-encoded characters in the input string. For example, %20 is decoded to a space, %3A to a colon, and %2F to a forward slash.
The decodeURLComponent function is the inverse of encodeURLComponent. It’s particularly useful when working with URLs or query parameters that may contain special characters or spaces.
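The round trip between encoding and decoding can be illustrated with Python's urllib.parse. This is only an analogue: the exact set of characters left unescaped by encodeURLComponent may differ from Python's quote, and the encoded URL string below is a made-up example.

```python
from urllib.parse import quote, unquote

name = "Spicy Jalapeño Taco"

# safe="" asks quote to escape every reserved character, including "/".
encoded = quote(name, safe="")
decoded = unquote(encoded)

print(encoded)  # Spicy%20Jalape%C3%B1o%20Taco
print(decoded)  # Spicy Jalapeño Taco
print(unquote("https%3A%2F%2Fexample.com%2Fspicy%20tacos"))
```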
encodeURLFormComponent
Encodes a URL component following RFC 1866, where spaces are encoded as plus signs (+).
Syntax:
Arguments:
- url (String): URL component to encode.
Returns:
- The encoded URL component. (String)
Example:
Result:
This function is particularly useful when constructing URLs for web forms or API requests where spaces and special characters need to be properly encoded. It ensures that the URL remains valid and can be correctly interpreted by web servers.
This function differs from encodeURLComponent in that it specifically encodes spaces as plus signs, which is the expected behavior for URL-encoded form data.
decodeURLFormComponent
Decodes a URL-encoded form component string.
Syntax
Arguments
- encoded_string (String): A URL-encoded string.
Returns:
- The decoded string. (String)
Description
This function decodes a URL-encoded form component string, following RFC 1866. It specifically handles the following:
- Converts + (plus) to a space character.
- Decodes percent-encoded sequences (e.g., %20 to space, %2B to +).
Example
Result:
In this example, decodeURLFormComponent converts the + to a space and decodes %21 to an exclamation mark, resulting in “Spicy Taco!”.
This function is particularly useful when working with form data submitted via HTTP POST requests or when processing URL query parameters.
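Form-style encoding, where a space becomes a plus sign, has a close Python analogue in quote_plus and unquote_plus. The sketch below only illustrates the convention and is not the implementation behind encodeURLFormComponent or decodeURLFormComponent.

```python
from urllib.parse import quote_plus, unquote_plus

form_encoded = quote_plus("Spicy Taco!")      # space -> "+", "!" -> "%21"
form_decoded = unquote_plus(form_encoded)     # back to the original text

# A literal plus sign survives only if it was percent-encoded as %2B.
literal_plus = unquote_plus("1%2B1")

print(form_encoded, form_decoded, literal_plus)
```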
netloc
Extracts the network locality (username:password@host:port) from a URL.
Syntax:
Arguments:
- url (String): URL.
Returns:
- The network locality part of the URL (username:password@host:port). (String)
Example:
Result:
In this example, netloc extracts the network locality part from a URL for a taco restaurant’s online ordering system, including the username, password, host, and port.
If the URL doesn’t contain any network locality information, an empty string is returned.
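Python's urllib.parse exposes the same username:password@host:port portion under the attribute netloc, which makes for a compact illustration. The credentials and host below are invented for the example.

```python
from urllib.parse import urlsplit

# Hypothetical URL with user info and an explicit port.
parts = urlsplit("https://chef:s3cret@orders.example.com:8080/menu?dish=taco")

network_location = parts.netloc
print(network_location)                 # chef:s3cret@orders.example.com:8080
print(parts.username, parts.hostname)   # chef orders.example.com
```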
cutWWW
Removes the leading www. from a URL’s domain, if present.
Syntax
Arguments
- url (String): The URL to process.
Returns
- The URL with the leading www. removed from the domain, if present. Otherwise, returns the original URL. (String)
Example
Result:
In this example, cutWWW removes the www. from the domain of the Taco Bell website URL.
This function only removes the www. prefix if it appears at the beginning of the domain. It does not affect other parts of the URL or remove www. if it appears elsewhere in the URL.
cutQueryString
Removes the query string from a URL, including the question mark.
Syntax:
Arguments:
- url (String): The URL to process.
Returns:
- The URL with the query string removed. (String)
Example:
Result:
In this example, cutQueryString removes the query string ?category=burritos&spicy=true from the URL, leaving only the base URL.
If the URL does not contain a query string, the function returns the original URL unchanged.
cutFragment
Removes the fragment identifier from a URL, including the hash symbol (#).
Syntax
Arguments
- url (String): The URL to process.
Returns
- The URL with the fragment identifier removed. (String)
Example
Result:
This function is useful when you want to remove the fragment identifier (everything after and including the ‘#’ symbol) from a URL. It’s commonly used in web analytics to standardize URLs for better grouping and analysis.
If the URL doesn’t contain a fragment identifier, the function returns the original URL unchanged.
cutQueryStringAndFragment
Removes the query string and fragment identifier from a URL, including the question mark and number sign.
Syntax
Arguments
- url (String): The URL to process.
Returns
- The URL with query string and fragment removed. (String)
Example
Result:
This function is useful when you need to extract the base URL without any parameters or fragments. It’s particularly handy for cleaning up URLs before analysis or grouping.
If the URL doesn’t contain a query string or fragment, it remains unchanged.
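The cut operations amount to truncating the URL at a separator. A minimal Python sketch, with hypothetical helper names and a made-up URL, shows the idea for cutFragment and cutQueryStringAndFragment:

```python
def cut_fragment(url):
    """Drop '#' and everything after it."""
    return url.split("#", 1)[0]

def cut_query_string_and_fragment(url):
    """Drop everything from the first '?' or '#', whichever comes first."""
    for i, ch in enumerate(url):
        if ch in "?#":
            return url[:i]
    return url  # nothing to cut: the URL is returned unchanged

url = "https://example.com/menu?item=taco#reviews"
print(cut_fragment(url))
print(cut_query_string_and_fragment(url))
```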
cutURLParameter
Removes a specified parameter from a URL.
Syntax
Arguments
- url (String): The URL to modify.
- name (String or Array(String)): The name of the parameter to remove, or an array of names.
Returns
- The modified URL with the specified parameter(s) removed. (String)
Example
Result:
In this example:
- url_without_toppings removes the ‘toppings’ parameter from the URL.
- url_without_size_and_sauce removes both the ‘size’ and ‘sauce’ parameters from the URL.
This function does not encode or decode characters in parameter names. For example, ‘Salsa Type’ and ‘Salsa%20Type’ are treated as different parameter names.
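Removing one or several parameters by exact name can be sketched in Python as follows. The helper name cut_url_parameter and the URL are hypothetical, names are matched without decoding as the note above describes, and the behavior when every parameter is removed is not modeled.

```python
from urllib.parse import urlsplit, urlunsplit

def cut_url_parameter(url, names):
    """Remove the parameters whose raw name matches names (a str or a list of str)."""
    if isinstance(names, str):
        names = [names]
    parts = urlsplit(url)
    kept = [
        pair
        for pair in parts.query.split("&")
        if pair and pair.partition("=")[0] not in names
    ]
    # Rebuild the URL with the filtered query string.
    return urlunsplit(parts._replace(query="&".join(kept)))

url = "https://example.com/order?size=large&toppings=cheese&sauce=hot"
print(cut_url_parameter(url, "toppings"))
print(cut_url_parameter(url, ["size", "sauce"]))
```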