Ordering Keys
Ordering Keys (a.k.a. sorting keys) define how data is sorted on disk and indexed for a table in ClickHouse. When replicating from Postgres, ClickPipes sets the Postgres primary key of a table as the ordering key for the corresponding table in ClickHouse. In most cases, the Postgres primary key serves as a sufficient ordering key, as ClickHouse is already optimized for fast scans, and custom ordering keys are often not required.
As describe in the migration guide, for larger use cases you should include additional columns beyond the Postgres primary key in the ClickHouse ordering key to optimize queries.
By default with CDC, choosing an ordering key different from the Postgres primary key can cause data deduplication issues in ClickHouse. This happens because the ordering key in ClickHouse serves a dual role: it controls data indexing and sorting while acting as the deduplication key. The easiest way to address this issue is by defining refreshable materialized views.
Use refreshable materialized views
A simple way to define custom ordering keys (ORDER BY) is using refreshable materialized views (MVs). These allow you to periodically (e.g., every 5 or 10 minutes) copy the entire table with the desired ordering key.
Below is an example of a Refreshable MV with a custom ORDER BY and required deduplication:
Custom ordering keys without refreshable materialized views
If refreshable materialized views don't work due to the scale of data, here are a few recommendations you can follow to define custom ordering keys on larger tables and overcome deduplication-related issues.
Choose ordering key columns that don't change for a given row
When including additional columns in the ordering key for ClickHouse (besides the primary key from Postgres), we recommend selecting columns that don't change for each row. This helps prevent data consistency and deduplication issues with ReplacingMergeTree.
For example, in a multi-tenant SaaS application, using (tenant_id, id) as the ordering key is a good choice. These columns uniquely identify each row, and tenant_id remains constant for an id even if other columns change. Since deduplication by id aligns with deduplication by (tenant_id, id), it helps avoid data deduplication issues that could arise if tenant_id were to change.
Set Replica Identity on Postgres tables to custom ordering key
For Postgres CDC to function as expected, it is important to modify the REPLICA IDENTITY on tables to include the ordering key columns. This is essential for handling DELETEs accurately.
If the REPLICA IDENTITY does not include the ordering key columns, Postgres CDC will not capture the values of columns other than the primary key - this is a limitation of Postgres logical decoding. All ordering key columns besides the primary key in Postgres will have nulls. This affects deduplication, meaning the previous version of the row may not be deduplicated with the latest deleted version (where _peerdb_is_deleted is set to 1).
In the above example with owneruserid and id, if the primary key does not already include owneruserid, you need to have a UNIQUE INDEX on (owneruserid, id) and set it as the REPLICA IDENTITY for the table. This ensures that Postgres CDC captures the necessary column values for accurate replication and deduplication.
Below is an example of how to do this on the events table. Make sure to apply this to all tables with modified ordering keys.
