Get started with Propel's Serverless ClickHouse® forever-free plan today. Propel is the only Serverless ClickHouse® with a true pay-per-query pricing and instant auto-scaling.
ClickHouse® is renowned for its high performance in managing large datasets, and one of the most interesting features behind its flexibility is the variety of specialized table engines it offers. Among them, the ReplacingMergeTree stands out as a powerful tool for keeping data clean and up-to-date without manual interventions or complicated processing pipelines.
If you've ever struggled with deduplication in data warehouses or wondered how to efficiently manage mutable data in a system designed for speed and simplicity, the ReplacingMergeTree can be your answer.
The ReplacingMergeTree engine is part of the MergeTree family in ClickHouse. It's a powerful tool when used correctly, but it can also "shoot you in the foot" if misused. One of the main use cases of ReplacingMergeTree is handling real-time updates. In particular, it allows us to build pipelines that keep mutable data in OLTP databases in sync with ClickHouse®.
In recent releases, ReplacingMergeTree has added new features, so we decided to provide a detailed explanation of both the old and new functionalities to help ClickHouse users get the most out of it.
What is the ReplacingMergeTree table engine?
ReplacingMergeTree is a variant of the standard MergeTree table engine in ClickHouse, designed specifically to help with deduplication and replacing outdated or incorrect data. While traditional MergeTree engines focus on high-speed inserts and query efficiency, ReplacingMergeTree adds the capacity to replace rows based on a primary key, making it ideal for use cases where data may need to be updated or corrected over time.
In simpler terms, it's a way to ensure you have the latest version of each record in your table, while still benefiting from ClickHouse’s strengths in performance and scalability.
How does ReplacingMergeTree work?
At its core, ReplacingMergeTree works by merging rows based on a given sorting key. Whenever a merge operation occurs, the engine keeps only one row for each unique sorting key value.
You can think of it as a garbage collector for old versions of your data. Here’s how it typically works:
- Data insertion: Whenever you insert data into a ReplacingMergeTree table, ClickHouse will store all versions of a row that have the same sorting key. It doesn’t immediately replace older versions during the insertion.
- Merging process: Over time, ClickHouse performs background merges, and during these merges, it will remove the outdated versions of the rows, keeping only the latest one.
- Version columns (optional): You can also specify a "version" column to indicate which row should be retained during a merge. This is useful when you need more explicit control, for example, if you want the row with the highest version number to be kept. This is useful to handle out of order events.
Defining a good sorting key
The most important part of the ReplacingMergeTree engine definition is the ORDER BY
expression. Unlike a generic MergeTree engine, ORDER BY
has a different meaning here: it defines an “eventually unique” key for the table. It is much closer to the traditional primary key of OLTP databases, which enforces uniqueness of keys. The difference is that instead of rejecting non-unique values, ReplacingMergeTree de-duplicates them, keeping only the last version.
The ORDER BY
for ReplacingMergeTree defines a unique column or a unique combination of columns that acts as a key. When a row with the same key is added to the table, it is considered a new version of the row. ClickHouse then returns the new version in queries and removes old versions in the background during merges. Note that until the old version is removed in the background, both old and new versions can appear in queries. Special syntax is required to query the new version only.
In designing a good key, you must combine two goals: query performance and uniqueness. Consider the following best practice:
- Use the beginning of the
ORDER BY
to optimize for query performance, and add unique columns at the end.
For example:
CREATE TABLE taco_order_rmt AS taco_order
ENGINE = ReplacingMergeTree
PARTITION BY toYYYYMM(taco_order_date)
ORDER BY (taco_type, salsa_spice_level, filling_type, tortilla_type, customer_id, order_id)
Here, the unique columns are added at the end of an ORDER BY
that is properly optimized for performance. This approach helps balance performance and deduplication requirements.
Practical use cases
- Real-time updates: ReplacingMergeTree can be used for real-time updates, where data is inserted or updated in place to replace old values. This technique is often referred to as 'upserts'. No special syntax is required to insert or update rows.
- Deduplication: If your data ingestion pipeline sometimes sends duplicate records due to retries, using ReplacingMergeTree can help keep only one instance of each record, reducing storage costs and improving query accuracy.
- Slowly changing dimensions (SCD): When managing dimensions that change slowly, you may want to keep only the latest version to minimize storage and ensure that reporting reflects the most up-to-date information. ReplacingMergeTree is a natural fit here.
Things to watch out for
While ReplacingMergeTree is powerful, it comes with a few caveats:
- Merge timing: The replacement process depends on ClickHouse’s background merge operations, which means there's no strict guarantee of immediate consistency. ClickHouse also offers no control of when the merges happen. Until the merge happens, you may still see multiple versions of a row.
- Storage considerations: Until the merges occur, all versions of a row are kept in storage, which can temporarily increase the storage footprint.
- More rows than expected: Inflated row numbers may occur since the table might still have duplicates, leading to more rows than expected until the merging process is complete.
- Explicit control over versions: If you use a version column, make sure you have a solid versioning strategy. Incorrect versions could lead to keeping the wrong rows.
ORDER BY
Updates: Columns used in theORDER BY
clause cannot be updated using the replacing logic. When designing theORDER BY
for ReplacingMergeTree, keep in mind which columns may need to be updated.
ReplacingMergeTree example
To define a table using ReplacingMergeTree, your SQL would look something like this:
CREATE TABLE taco_orders (
order_id UInt64,
taco_type String,
order_time DateTime,
version UInt64
) ENGINE = ReplacingMergeTree(version)
ORDER BY order_id;
In this example, ClickHouse will merge rows with the same order_id
and keep the one with the highest version
during the merge process. This is particularly useful for scenarios where data corrections need to be applied without additional cleanup steps.
Querying ReplacingMergeTree
When querying ReplacingMergeTree tables, there are a couple of options to ensure you get the latest versions:
- Modifier: The
FINAL
modifier applies the replacing logic at query time, ensuring you get the latest version of each row. However, it comes with a performance cost, especially when querying large datasets. - Setting: In the most recent versions, the
final
setting can be specified in the query or user profile, allowing you to avoid adding theFINAL
keyword explicitly to every query.
When to use ReplacingMergeTree
ReplacingMergeTree is best suited for scenarios where you need:
- Latest version only: You only care about keeping the most recent version of a record.
- Automatic cleanup: You want ClickHouse to handle deduplication automatically as part of its merge process.
- Efficient storage for mutable data: You need to manage data that changes over time but want to avoid the complexity of handling updates manually.
Conclusion
ReplacingMergeTree adds an essential layer of flexibility to ClickHouse, making it easier to manage changing data without sacrificing performance. Whether you’re looking to deduplicate incoming records, manage corrections efficiently, or just want to keep your dataset clean, ReplacingMergeTree is a tool worth considering. Properly used, ReplacingMergeTree can help you implement real-time synchronization pipelines from OLTP databases like MySQL or PostgreSQL, maintaining data quality without the headaches.
Get started with Propel's Serverless ClickHouse® forever-free plan today. Propel is the only Serverless ClickHouse® with a true pay-per-query pricing and instant auto-scaling. Contact us to learn more about our volume-based discounts. Visit our pricing page for details.