Updating Data in ClickHouse

Propel’s Serverless ClickHouse supports two methods for updating data: real-time updates and batch updates. Read on to learn when and how to use each.

Real-time updates

Real-time updates allow you to update existing rows in a table, which are immediately reflected when querying.

Use cases

  • Update rows in a table and query their most recent state
  • Update slowly changing dimensions
  • Update entity tables (users, accounts, etc.)

How it works

Real-time updates in Propel are supported by ClickHouse’s ReplacingMergeTree table engine. This table engine replaces rows that share the same sorting key, keeping only the most recently inserted row (or the row with the largest “ver” column value).

De-duplication happens at query time and asynchronously, through background merges. For those familiar with ClickHouse, Propel automatically includes the "FINAL" modifier on queries to ReplacingMergeTree tables.
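
For readers who want to see the underlying mechanics in plain ClickHouse SQL, here is a minimal sketch; the table, columns, and values are hypothetical and do not reflect Propel’s generated DDL:

```sql
-- Hypothetical table: the row with the largest "updated_at" (the version column) wins.
CREATE TABLE orders
(
    order_id   UInt64,
    status     String,
    updated_at DateTime
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY order_id;  -- the sorting key defines row uniqueness

-- Inserting a row with an existing sorting key value acts as an update.
INSERT INTO orders VALUES (1, 'pending',   '2024-05-01 10:00:00');
INSERT INTO orders VALUES (1, 'delivered', '2024-05-02 09:30:00');

-- With FINAL (which Propel adds automatically), only the latest version is returned:
-- 1 | delivered | 2024-05-02 09:30:00
SELECT * FROM orders FINAL;
```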

However, this approach has two implications:

  1. The row count in ReplacingMergeTree tables may be larger than the logical number of rows, because duplicate rows that haven’t been merged yet may still exist in the table. These rows are de-duplicated at query time (see the example below).
  2. De-duplicating records at query time can have a performance impact on very large tables. This can be compensated for by using a larger Propeller.
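
For example, before a background merge has collapsed the two inserts from the sketch above, the physical row count and the de-duplicated row count can differ:

```sql
-- Continuing the hypothetical "orders" table above:
SELECT count() FROM orders;        -- may return 2: both versions still exist on disk
SELECT count() FROM orders FINAL;  -- returns 1: duplicates are collapsed at query time
```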

Create a Data Pool to handle real-time updates

In this section, we will guide you through the process of creating a Data Pool that can handle real-time updates. You can create one via the Console, the API, or Terraform; the steps below use the Console.


To create a Data Pool that can handle real-time updates in the Console, first go to the “Data Pools” section, then click “Create Data Pool”.

Propel's Data Pools Console page

Select the Data Pool type you need to create.

Create new Data Pool

In the “Table Settings” step, select “Mutable Data”.

ClickHouse table settings configuration

Answer the questions about how to uniquely identify records and about your query patterns. For more information, read the “Mutable records” section of the How to select a table engine and sorting key guide.

ClickHouse table settings configuration questions

Next, you will get the suggested table settings. Note that the table engine is ReplacingMergeTree, and the sorting key determines the uniqueness of the row.

Suggested ClickHouse table settings

Click “Next” to finish setting up the Data Pool. Once it is created, you can go to the “Details” tab and verify the settings.

Data Pool details


Create a Materialized View to handle real-time updates

In this section, we will guide you through the process of creating a Materialized View that can handle real-time updates. You can create one via the Console, the API, or Terraform; the steps below use the Console.


To create a Materialized View that can handle real-time updates in the Console, first go to the “Materialized View” section and click “Create new Materialized View”.

Create new Materialized View

Then, enter the query that defines the Materialized View. For this example, we are going to de-duplicate rows from the “TacoSoft Demo Data” table, so we just need to select all records.

Materialized View query
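
For reference, the defining query in this example is simply a select of all columns from the source table; a minimal sketch (the exact quoting of the table name may differ in your setup):

```sql
-- Copy every record from the source table; de-duplication is handled by the
-- destination Data Pool's ReplacingMergeTree engine, not by this query.
SELECT *
FROM "TacoSoft Demo Data"
```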

Select “New Data Pool”.

Materialized View destination

Give your destination Data Pool a name and description.

Materialized View destination configuration

Select “Mutable records”, then click “Continue”.

Materialized View destination configuration questions

Answer the questions about how to uniquely identify records and about your query patterns. For more information, read the “Mutable records” section of the How to select a table engine and sorting key guide.

Suggested ClickHouse table settings

Next, you will get the suggested table settings. Note that the table engine is ReplacingMergeTree, and the sorting key determines the uniqueness of the row.

Materialized View details

Click “Continue” to finish setting up the Materialized View.

Once you complete setting up the Materialized View, you will have a Data Pool with the de-duplicated records.


Batch updates

Batch updates are useful for one-off or automated jobs where you need to update many rows in a table.

Use cases

  • Backfill columns
  • Update column(s) for all table rows
  • Update column(s) for table rows matching a filter
  • Redact columns on specific rows

How it works

Batch updates in Propel are supported by table mutations in ClickHouse. When you perform a batch update, Propel issues an ALTER TABLE … UPDATE statement to ClickHouse and monitors its progress. The mutation proceeds part-by-part, partition-by-partition, updating rows in the table.
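
The statement ClickHouse runs under the hood looks roughly like the following sketch; the table, columns, and filter are hypothetical:

```sql
-- An asynchronous mutation: affected data parts are rewritten in the background.
ALTER TABLE taco_orders
    UPDATE status = 'cancelled'
    WHERE order_date < '2024-01-01';

-- Progress can be observed in ClickHouse's system.mutations table.
SELECT command, is_done
FROM system.mutations
WHERE table = 'taco_orders';
```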

Propel provides a simple way to update data asynchronously using the Console or the createUpdateDataPoolRecordsJob API. An update data operation on a Data Pool updates the data matching the filters provided.

Create a batch update job

In this section, we guide you through the process of creating a batch update job. You can create one via the Console or the API; the steps below use the Console.

There are two parts to the update job.

First, the filters that select which records to update. Only records matching these filters will be updated.

Second, the columns and values to be updated (see the example after the note below). The value can be:

  • Another column name. For example, sauce_name.
  • A string in single quotes. For example, 'al pastor'.
  • A number. For example, 10.
  • A JSON property. For example, order.id.
  • An expression. For example, total_price * quantity or CONCAT(taco_name, '🌮').
💡IMPORTANT
To update a column to a string value, you must wrap the value in single quotes, for example 'al pastor'.
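
Expressed as the equivalent ClickHouse statement, an update combining these value types might look like the following sketch; the taco_orders table and its columns are hypothetical and only mirror the examples above:

```sql
ALTER TABLE taco_orders
    UPDATE
        sauce        = sauce_name,              -- another column name
        taco_name    = 'al pastor',             -- a string in single quotes
        quantity     = 10,                      -- a number
        total_amount = total_price * quantity,  -- an expression
        label        = CONCAT(taco_name, '🌮')  -- an expression using a function
    WHERE restaurant_id = 42;                   -- only matching records are updated
```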

You can initiate an update job in the Console by navigating to the Data Pool whose data you need to update, clicking on the “Operations” tab, and then clicking “Update data”.

Update data Console UI for ClickHouse

Here, you can specify the filters of the data to update and set the values to update.

Update data Console UI for ClickHouse

Lastly, you can monitor the progress of the update job and see when it has completed.

Update data Console UI for ClickHouse


Notes on updating non-nullable columns:

  • Suppose we have a non-nullable column A and a nullable column B, and we execute an update that sets A = B + 1. When the job encounters a record where B is null, the expression evaluates to null, which cannot be assigned to A because A is non-nullable. The job fails at that point: records processed before the null value are updated, while the remaining records are left unchanged (see the sketch below).
  • If the column being updated has a different data type than the update expression, the result will be null, which can cause the same error as above if the column is non-nullable.
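
A minimal sketch of the first failure mode, using hypothetical table and column names:

```sql
-- A is non-nullable, B is nullable.
CREATE TABLE example
(
    id UInt64,
    A  Int64,
    B  Nullable(Int64)
)
ENGINE = MergeTree
ORDER BY id;

INSERT INTO example VALUES (1, 0, 5), (2, 0, NULL), (3, 0, 7);

-- B + 1 evaluates to null for id = 2; assigning null to the non-nullable
-- column A fails the mutation. Rows processed before the failure keep their
-- updated values; the remaining rows are left unchanged.
ALTER TABLE example UPDATE A = B + 1 WHERE 1;
```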

Conclusion

Managing data in Propel's Serverless ClickHouse requires an understanding of how it handles updates. Real-time updates are achieved through the ReplacingMergeTree table engine, but there are implications to consider, such as larger physical row counts and a performance impact on read operations.

Batch updates, on the other hand, are better suited for one-off or automated jobs that modify many rows at once, such as backfilling columns or redacting data, and can be accomplished via the Console or the createUpdateDataPoolRecordsJob API.

Further reading