Selecting table engine and sorting key

In this guide, you’ll learn how to select the optimal table engine and sorting key for your table in Propel’s Serverless ClickHouse.

Background

This section provides some background on ClickHouse table engines and the role the sorting key plays while organizing data in tables.

If you don’t need or want to understand the details, you can skip to the section
👉 How to choose the right table engine and sorting key .

Understanding table engines

Table engines determine how ClickHouse tables store, process, read, and update their data.

Table engines are organized in families. The main family (and the one that Propel supports) is the MergeTree family. Other, less common table engines are beyond the scope of this guide.

Overview of the MergeTree table engines

The MergeTree is ClickHouse’s primary table engine family. MergeTree tables optimize for fast inserts and reads. They achieve this by storing the inserted data in parts and merging those parts in the background. The various MergeTree tables differ in their merging approach, with each having unique merging rules.

The table types include:

MergeTree: The most common engine in the MergeTree family. It is designed for append-only, immutable data, as merging consolidates parts but never deletes records.
ReplacingMergeTree: A variation of the MergeTree engine. When merging data parts, it replaces old rows with new ones that have the same sorting key.
SummingMergeTree: This engine is designed for use cases where you need to calculate the total for numeric columns. When merging data parts, it sums up columns for rows with the same sorting key.
AggregatingMergeTree: This engine is ideal for incremental data aggregation. When merging data parts, it aggregates columns for rows with the same sorting key.
CollapsingMergeTree: This engine is designed for scenarios where you need to account for a record’s state changes. When merging data parts, it deletes (”collapses”) pairs of records with the same sorting key and opposite sign column.
VersionedCollapsingMergeTree: This engine is an extension of the CollapsingMergeTree. It uses a different collapsing algorithm that allows inserting the data in any order.

Understanding the sorting key

The sorting key in ClickHouse is a set of one or more columns that ClickHouse uses to organize the rows within a table. The sorting key determines the order of the rows in the table, influences how parts are merged, and significantly impacts the query performance.

If the rows are sorted well, ClickHouse can efficiently skip over unneeded rows and thus optimize query performance.

How to choose the right table engine and sorting key

In this section, we’ll guide you through how to choose the right table engine and sorting key for your table.

What type of data do you have?

The first step is to figure out what type of data you have.

Three common types of data stored: append-only data, mutable records, and incremental aggregations.

Append-only data: This type of data refers to data that is only added to and never modified or deleted. Once a record is inserted, it remains unchanged. Examples include:
- Events
- Log data
- Transaction records
- Time-series data
- Any immutable data
Mutable records: These are records that can be updated or changed after they’ve been inserted. Examples include:
- Dimension data
- Entity data like customers, employees, users, products, etc.
- Pre-aggregated data where rows are periodically updated
Incremental aggregations: This type of data refers to aggregations that are calculated incrementally as data is written. Examples include:
- Real-time game leaderboard
- Real-time e-commerce sales KPI

Append-only data

The recommended table engine for append-only data is the MergeTree**,** which is the most common choice in ClickHouse. It provides fast inserts and great query performance.

Append-only data is best stored in the MergeTree table engine

Next, we’ll help you choose a good sorting key for your table. This will depend on the schema.

A good sorting key usually has 3 to 5 columns. The following questions will help you choose the right sorting key for your query access patterns.

Identify primary timestamp

Determine if your records have a primary timestamp that you will filter on, like a created_at date. If yes, include it in the sorting key.

Identify frequently filtered columns

Identify other columns you will filter on in most queries. For example, customer_id or workspace_id columns for multi-tenant apps. You don’t need to include all columns you are going to filter on, only the filters that you are always (or most often) going to use. If you have such columns, place them to the left of the timestamp in the sorting key.

Order columns by cardinality

Arrange the columns in your sorting key by putting columns with low cardinality first (most important for filtering) and then columns with high cardinality (and less important for filtering).

Consider hierarchical relationships

If you have hierarchy or tree-like relations between the columns, arrange the columns from “root” to “leaves” in the sorting key. For example, order them as continent, country, then city name.

Example: Sorting key for an events table in a multi-tenant app

The sorting key for an events table will depend on the specific schema and query access patterns.

For example, imagine our application supports multiple customers. Each customer can have multiple workspaces, with events belonging to individual workspaces. Then, the schema for our events table could look like this:

created_at (timestamp)
customer_id (string)
workspace_id (string)
event_type (string)
payload (JSON)

Let’s assume that your query filters will of the form:

    customer_id   =  'customer 001'
AND workspace_id  =  'workspace B'
AND created_at   >=  NOW() - INTERVAL '1 month'

Then your sorting key should be: customer_id, workspace_id, then created_at.

mutation {
  createDataPoolV2(
    input: {
      uniqueName: "events"
      description: "Data Pool for events"
      timestamp: { columnName: "created_at" }
      columns: [
        { columnName: "created_at", type: TIMESTAMP, isNullable: true }
        { columnName: "customer_id", type: TIMESTAMP, isNullable: true }
        { columnName: "workspace_id", type: STRING, isNullable: true }
        { columnName: "event_type", type: STRING, isNullable: true }
        { columnName: "payload", type: JSON, isNullable: true }
      ]
      tableSettings: {
        engine: {
          mergeTree: {
            type: MERGE_TREE
          }
        }
        orderBy: ["customer_id", "workspace_id", "created_at"]
      }
    }
  ) {
    dataPool {
      id
      uniqueName
      description
      tableSettings {
        orderBy
      }
    }
  }
}

resource "propel_data_pool" "events" {
  unique_name  = "events"
  description  = "Data Pool for events"
  timestamp    = "created_at"

  column {
    name       = "created_at"
    type       = "TIMESTAMP"
    nullable   = true
  }
  column {
    name       = "customer_id"
    type       = "STRING"
    nullable   = true
  }
  column {
    name       = "workspace_id"
    type       = "STRING"
    nullable   = true
  }
  column {
    name       = "event_type"
    type       = "STRING"
    nullable   = true
  }
  column {
    name       = "payload"
    type       = "JSON"
    nullable   = true
  }

  table_settings {
    engine {
      type = "MERGE_TREE"
    }
    order_by = ["customer_id", "workspace_id", "created_at"]
  }
}

The customer_id is the lowest granularity so it should go first. The workspace_id belongs to a customer, so they have a tree-like relationship, so it should go next. Once the data is reduced to a given workspace_id of a customer_id, we’d want it ordered by time, adding the created_at last.

This way, your data will be sorted like the table below:

An illustration of the events table, with arrows indicating the sort order of its rows, first by the customer_id column, then by the workspace_id column, and finally by the created_at column.

Important things to note to take advantage of this sorting key:

You can provide just the customer_id or both the customer_id and workspace_id in the filters or GROUP BYs and still benefit from the sorting key as it reduces rows read.
If you just provide the created_at, you will NOT take advantage of the sorting key, as Propel will have to scan the full table.
With Materialized Views, you can create different versions of the same table with different sorting keys to support different query patterns.

Mutable records

The recommended table engine for mutable records is the ReplacingMergeTree. It provides fast inserts, handles updates, and provides great query performance for medium-sized tables (up to 100 million records). For tables over 100 million records, see the “Handling large updating tables” section below.

Mutable records are best stored in the ReplacingMergeTree table engine

We’ll help you choose a good sorting key and settings for your table. This will depend on the schema, how uniqueness is defined, and the scale of your data.

Determine uniqueness column

What column (it has to be immutable) determines the uniqueness of a record in your table?

If it is something like user_id, include it in the sorting key.
If it is a combined uniqueness key, like an order_id and a product_id, include them all in the sorting key.

mutation {
  createDataPoolV2(
    input: {
      uniqueName: "users"
      description: "Data Pool for users"
      columns: [
        { columnName: "user_id", type: STRING, isNullable: false }
        { columnName: "created_at", type: TIMESTAMP, isNullable: true }
        { columnName: "first_name", type: STRING, isNullable: true }
        { columnName: "last_name", type: STRING, isNullable: true }
        { columnName: "email", type: STRING, isNullable: true }
      ]
      tableSettings: {
        engine: {
          replacingMergeTree: {
            type: REPLACING_MERGE_TREE
          }
        }
        orderBy: ["user_id"]
      }
    }
  ) {
    dataPool {
      id
      uniqueName
      description
      tableSettings {
        orderBy
      }
    }
  }
}

resource "propel_data_pool" "users" {
  unique_name  = "users"
  description  = "Data Pool for users"

  column {
    name       = "user_id"
    type       = "STRING"
    nullable   = false
  }
  column {
    name       = "created_at"
    type       = "TIMESTAMP"
    nullable   = true
  }
  column {
    name       = "first_name"
    type       = "STRING"
    nullable   = true
  }
  column {
    name       = "last_name"
    type       = "STRING"
    nullable   = true
  }
  column {
    name       = "email"
    type       = "STRING"
    nullable   = true
  }

  table_settings {
    engine {
      type = "REPLACING_MERGE_TREE"
    }
    order_by = ["user_id"]
  }
}

Check for version or updated_at column

Do your records contain a version or updated_at column that increases with each update to the record?

If yes, include it in the ReplacingMergeTree version column.

The version column, ver, is an optional column that allows you to control which version of a row Propel considers ‘newest’. If unspecified, Propel considers the most recently inserted row newest.

Propel automatically filters out older versions of rows when querying and periodically deletes them in the background during merges.

mutation {
  createDataPoolV2(
    input: {
      uniqueName: "users"
      description: "Data Pool for users"
      columns: [
        { columnName: "user_id", type: STRING, isNullable: false }
        { columnName: "created_at", type: TIMESTAMP, isNullable: true }
        { columnName: "first_name", type: STRING, isNullable: true }
        { columnName: "last_name", type: STRING, isNullable: true }
        { columnName: "email", type: STRING, isNullable: true }
        { columnName: "updated_at", type: TIMESTAMP, isNullable: false }
      ]
      tableSettings: {
        engine: {
          replacingMergeTree: {
            type: REPLACING_MERGE_TREE
            ver: "updated_at"
          }
        }
        orderBy: ["user_id"]
      }
    }
  ) {
    dataPool {
      id
      uniqueName
      description
      tableSettings {
        orderBy
      }
    }
  }
}

resource "propel_data_pool" "users" {
  unique_name  = "users"
  description  = "Data Pool for users"

  column {
    name       = "user_id"
    type       = "STRING"
    nullable   = false
  }
  column {
    name       = "created_at"
    type       = "TIMESTAMP"
    nullable   = true
  }
  column {
    name       = "first_name"
    type       = "STRING"
    nullable   = true
  }
  column {
    name       = "last_name"
    type       = "STRING"
    nullable   = true
  }
  column {
    name       = "email"
    type       = "STRING"
    nullable   = true
  }
  column {
    name       = "updated_at"
    type       = "TIMESTAMP"
    nullable   = false
  }

  table_settings {
    engine {
      type = "REPLACING_MERGE_TREE"
      ver  = "updated_at"
    }
    order_by = ["user_id"]
  }
}

Consider table size

If your table has less than ~100 million records, this is all you need. You’ll get fast queries and updates based on the record identifier.

Since the sorting key determines the record uniqueness in ReplacingMergeTree tables, keeping things simple is advantageous.

Example: Sorting key for a “Users” entity table

The sorting key for an entity table will depend on how uniqueness is defined.

Consider the schema below:

user_id (string)
created_at (string)
first_name (string)
last_name (string)
email (string)

In this case, the user_id determines uniqueness. Therefore, your sorting key should be just user_id.

mutation {
  createDataPoolV2(
    input: {
      uniqueName: "users"
      description: "Data Pool for users"
      columns: [
        { columnName: "user_id", type: STRING, isNullable: false }
        { columnName: "created_at", type: TIMESTAMP, isNullable: true }
        { columnName: "first_name", type: STRING, isNullable: true }
        { columnName: "last_name", type: STRING, isNullable: true }
        { columnName: "email", type: STRING, isNullable: true }
      ]
      tableSettings: {
        engine: {
          replacingMergeTree: {
            type: REPLACING_MERGE_TREE
          }
        }
        orderBy: ["user_id"]
      }
    }
  ) {
    dataPool {
      id
      uniqueName
      description
      tableSettings {
        orderBy
      }
    }
  }
}

resource "propel_data_pool" "users" {
  unique_name  = "users"
  description  = "Data Pool for users"

  column {
    name       = "user_id"
    type       = "STRING"
    nullable   = false
  }
  column {
    name       = "created_at"
    type       = "TIMESTAMP"
    nullable   = true
  }
  column {
    name       = "first_name"
    type       = "STRING"
    nullable   = true
  }
  column {
    name       = "last_name"
    type       = "STRING"
    nullable   = true
  }
  column {
    name       = "email"
    type       = "STRING"
    nullable   = true
  }

  table_settings {
    engine {
      type = "REPLACING_MERGE_TREE"
    }
    order_by = ["user_id"]
  }
}

Putting the user_id in the sorting key will enforce uniqueness during merges and allow for efficient user lookups by ID.

This way, your data will be sorted like the table below:

An illustration of the “users” entity table, with arrows indicating the sort order of its rows by the user_id column.

Handling large updating tables

If you have updatable data that is over 100 million records with mutable data that needs to be queried fast, the following questions will help you choose a sorting key and understand the tradeoffs.

Do you need to filter or aggregate your data by time? Like, “Give me all the users created last month”? Does your data have a created_at timestamp column?

If so, you can add the created_at to the sorting key to the left of the user_id from the example above. Take into account that:

The created_at column cannot be mutable.
The created_at column will now be part of the uniqueness definition.

The resulting sorting key would be: created_at, user_id.

mutation {
  createDataPoolV2(
    input: {
      uniqueName: "users"
      description: "Data Pool for users"
      timestamp: { columnName: "created_at" }
      columns: [
        { columnName: "user_id", type: STRING, isNullable: false }
        { columnName: "created_at", type: TIMESTAMP, isNullable: false }
        { columnName: "first_name", type: STRING, isNullable: true }
        { columnName: "last_name", type: STRING, isNullable: true }
        { columnName: "email", type: STRING, isNullable: true }
      ]
      tableSettings: {
        engine: {
          replacingMergeTree: {
            type: REPLACING_MERGE_TREE
          }
        }
        orderBy: ["created_at", "user_id"]
      }
    }
  ) {
    dataPool {
      id
      uniqueName
      description
      tableSettings {
        orderBy
      }
    }
  }
}

resource "propel_data_pool" "users" {
  unique_name  = "users"
  description  = "Data Pool for users"
  timestamp    = "created_at"

  column {
    name       = "user_id"
    type       = "STRING"
    nullable   = false
  }
  column {
    name       = "created_at"
    type       = "TIMESTAMP"
    nullable   = false
  }
  column {
    name       = "first_name"
    type       = "STRING"
    nullable   = true
  }
  column {
    name       = "last_name"
    type       = "STRING"
    nullable   = true
  }
  column {
    name       = "email"
    type       = "STRING"
    nullable   = true
  }

  table_settings {
    engine {
      type = "REPLACING_MERGE_TREE"
    }
    order_by = ["created_at", "user_id"]
  }
}

Do you need to filter or aggregate your data by any other dimension? For example, “Give me all the users that signed up on mobile with an iOS device,” determined by a signup_channel and signup_os columns.

If so, you can add the signup_channel and signup_os columns to the sorting key to the left of the created_at and/or user_id from the examples above. Again, take into account that:

The signup_channel and signup_oscolumns cannot be mutable.
The signup_channel and signup_oscolumns will now be part of the uniqueness definition.

The resulting sorting key would be: signup_channel, signup_os, created_at, user_id.

mutation {
  createDataPoolV2(
    input: {
      uniqueName: "users"
      description: "Data Pool for users"
      timestamp: { columnName: "created_at" }
      columns: [
        { columnName: "user_id", type: STRING, isNullable: false }
        { columnName: "created_at", type: TIMESTAMP, isNullable: false }
        { columnName: "first_name", type: STRING, isNullable: true }
        { columnName: "last_name", type: STRING, isNullable: true }
        { columnName: "email", type: STRING, isNullable: true }
        { columnName: "signup_channel", type: STRING, isNullable: false }
        { columnName: "signup_os", type: STRING, isNullable: false }
      ]
      tableSettings: {
        engine: {
          replacingMergeTree: {
            type: REPLACING_MERGE_TREE
          }
        }
        orderBy: ["signup_channel", "signup_os", "created_at", "user_id"]
      }
    }
  ) {
    dataPool {
      id
      uniqueName
      description
      tableSettings {
        orderBy
      }
    }
  }
}

resource "propel_data_pool" "users" {
  unique_name  = "users"
  description  = "Data Pool for users"
  timestamp    = "created_at"

  column {
    name       = "user_id"
    type       = "STRING"
    nullable   = false
  }
  column {
    name       = "created_at"
    type       = "TIMESTAMP"
    nullable   = false
  }
  column {
    name       = "first_name"
    type       = "STRING"
    nullable   = true
  }
  column {
    name       = "last_name"
    type       = "STRING"
    nullable   = true
  }
  column {
    name       = "email"
    type       = "STRING"
    nullable   = true
  }
  column {
    name       = "signup_channel"
    type       = "STRING"
    nullable   = false
  }
  column {
    name       = "signup_os"
    type       = "STRING"
    nullable   = false
  }

  table_settings {
    engine {
      type = "REPLACING_MERGE_TREE"
    }
    order_by = ["signup_channel", "signup_os", "created_at", "user_id"]
  }
}

Incremental aggregations

The recommended table engine for incremental aggregations are the SummingMergeTree and AggregatingMergeTree, depending on the type of aggregations you need to do.

For simple SUM and COUNT aggregations, the SummingMergeTree will do. For other types of aggregations, like AVG, MAX, MIN, PERCENTILES, UNIQUES, and others, we recommend you use the AggregatingMergeTree.

Incremental aggregations are best stored in the SummingMergeTree and AggregatingMergeTree table engines

We’ll help you choose a good sorting key and settings for your table. This will depend on the schema and the columns you need to group by.

Identify grouping columns

Determine which columns you want to group by. Add these columns to the sorting key. Both the SummingMergeTree and the AggregatingMergeTree replace rows during merging with the same sorting key with a single row containing the aggregate values.

Choose aggregation type

Decide if you’re only doing SUM or COUNT aggregations, which would indicate using the SummingMergeTree.

Configure summarization (for SummingMergeTree)

If using SummingMergeTree, note that Propel will automatically summarize all numeric columns not part of the sorting key. If you want to specify which columns to summarize, you can enter them in the “columns” parameter.

Example: Sorting key for a “Visits” aggregated table

The sorting key for an incremental aggregation table will include the dimensions you need to group by.

Consider the schema below:

domain (string)
path (string)
visits (Int)

In this case, we want to group visits by domain and path.

Putting the domain and path in the sorting key will “roll up” the visits by two columns.

mutation {
  createDataPoolV2(
    input: {
      uniqueName: "visits"
      description: "Data Pool for visits"
      columns: [
        { columnName: "domain", type: STRING, isNullable: false }
        { columnName: "path", type: STRING, isNullable: false }
        { columnName: "visits", type: INT64, isNullable: true }
      ]
      tableSettings: {
        engine: {
          summingMergeTree: {
            type: SUMMING_MERGE_TREE
          }
        }
        orderBy: ["domain", "path"]
      }
    }
  ) {
    dataPool {
      id
      uniqueName
      description
      tableSettings {
        orderBy
      }
    }
  }
}

resource "propel_data_pool" "visits" {
  unique_name  = "visits"
  description  = "Data Pool for visits"

  column {
    name       = "domain"
    type       = "STRING"
    nullable   = false
  }
  column {
    name       = "path"
    type       = "STRING"
    nullable   = false
  }
  column {
    name       = "visits"
    type       = "INT64"
    nullable   = true
  }

  table_settings {
    engine {
      type = "SUMMING_MERGE_TREE"
    }
    order_by = ["domain", "path"]
  }
}

This way, your data will be sorted like the table below:

An illustration of the “visits” aggregated table, with rows aggregated by the domain and path columns.

Frequently asked questions

What is the difference between the sorting key and the primary key?

What happens if I don't define a sorting key for my table?

How can I change the sorting key of an existing table?

Do merges happen on a given schedule? Is it configurable?

Can a table have multiple sorting keys to support different access patterns?

What happens if I use mutable columns in my sorting key?

Get Started

Manage

Identity & Access

Resources

Enterprise

Selecting table engine and sorting key

Background

Understanding table engines

Overview of the MergeTree table engines

Understanding the sorting key

How to choose the right table engine and sorting key

What type of data do you have?

Append-only data

Example: Sorting key for an events table in a multi-tenant app

Mutable records

Example: Sorting key for a “Users” entity table

Handling large updating tables

Incremental aggregations

Example: Sorting key for a “Visits” aggregated table

Frequently asked questions

Get Started

Manage

Identity & Access

Resources

Enterprise

​Background

​Understanding table engines

​Overview of the MergeTree table engines

​Understanding the sorting key

​How to choose the right table engine and sorting key

​What type of data do you have?

​Append-only data

​Example: Sorting key for an events table in a multi-tenant app

​Mutable records

​Example: Sorting key for a “Users” entity table

​Handling large updating tables

​Incremental aggregations

​Example: Sorting key for a “Visits” aggregated table

​Frequently asked questions

Background

Understanding table engines

Overview of the MergeTree table engines

Understanding the sorting key

How to choose the right table engine and sorting key

What type of data do you have?

Append-only data

Example: Sorting key for an events table in a multi-tenant app

Mutable records

Example: Sorting key for a “Users” entity table

Handling large updating tables

Incremental aggregations

Example: Sorting key for a “Visits” aggregated table

Frequently asked questions