Kafka to ClickHouse
Ingest data from Kafka topics.
Ingest real-time data from self-hosted Kafka, Confluent Cloud, AWS MSK, or Redpanda to Propel.
Get started with Kafka
Step-by-step instructions to connect your Kafka cluster to Propel.
Architecture
Kafka Data Pools connect to the specified Kafka topics and ingest their messages into Propel in real time.
Features
Kafka Data Pools support the following features:
| Feature name | Supported | Notes |
| --- | --- | --- |
| Real-time ingestion | ✅ | See How the Kafka Data Pool works. |
| Deduplication | ✅ | See the deduplication section. |
| Batch Delete API | ✅ | See Batch Delete API. |
| Batch Update API | ✅ | See Batch Update API. |
| API configurable | ✅ | See the [API](/docs/management-api) docs. |
| Terraform configurable | ✅ | See Terraform docs. |
How does the Kafka Data Pool work?
The Kafka Data Pool connects to the specified Kafka topics and reads messages in real time, starting from the earliest available offset.
These messages are then ingested into the Data Pool. Once there, you can query them via SQL or the Query APIs, or transform them with Materialized Views.
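For example, once messages land in the Data Pool, you can inspect them directly with SQL. A minimal sketch, assuming a hypothetical Data Pool named `kafka_events` (the metadata columns are described in the next section):

```sql
-- Inspect the ten most recent messages in a Kafka Data Pool.
-- "kafka_events" is a hypothetical Data Pool name; adjust to your setup.
SELECT
    _timestamp,
    _topic,
    _partition,
    _offset,
    _propel_payload
FROM kafka_events
ORDER BY _timestamp DESC
LIMIT 10;
```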
Schemaless ingestion
The Kafka Data Pool stores the message body in the `_propel_payload` column and the Kafka- and ingestion-related metadata in the other columns. This flexible approach allows JSON Kafka messages to be ingested without pre-defined schemas.
| Column | Type | Description |
| --- | --- | --- |
| `_timestamp` | TIMESTAMP | The timestamp of the message. |
| `_topic` | STRING | The Kafka topic. |
| `_key` | STRING | The key of the message. |
| `_offset` | INT64 | The offset of the message. |
| `_partition` | INT64 | The partition of the Kafka topic. |
| `_propel_payload` | JSON | The raw message payload in JSON. |
| `_propel_received_at` | TIMESTAMP | When the message was read by Propel. |
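Because the payload is stored as JSON, individual fields can be pulled out at query time. A sketch assuming ClickHouse-style dot notation on the JSON column; `customer_id` and `amount` are hypothetical payload fields:

```sql
-- Extract individual fields from the raw JSON payload at query time.
-- "customer_id" and "amount" are hypothetical fields of the message body.
SELECT
    _timestamp,
    _propel_payload.customer_id AS customer_id,
    _propel_payload.amount      AS amount
FROM kafka_events
WHERE _topic = 'orders'
LIMIT 100;
```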
Message deduplication
The Kafka Data Pool automatically deduplicates messages. Duplicates can occur when a message is sent twice by the producer or read twice due to intermittent connectivity between Propel and the Kafka stream. The uniqueness of a message is determined by the combination of `_topic`, `_partition`, and `_offset`.
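One way to see this uniqueness key in action is a sanity check that counts rows sharing the same (`_topic`, `_partition`, `_offset`) combination; with deduplication applied, it should return no rows. Again assuming the hypothetical `kafka_events` Data Pool:

```sql
-- Sanity check: each (_topic, _partition, _offset) combination
-- should appear exactly once after deduplication.
SELECT
    _topic,
    _partition,
    _offset,
    count(*) AS copies
FROM kafka_events
GROUP BY _topic, _partition, _offset
HAVING copies > 1;
```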
Supported formats
The Kafka Data Pool supports the ingestion of JSON messages, which are stored in the `_propel_payload` column.
Transforming data
Once your data is in a Kafka Data Pool, you can use Materialized Views to:
- Flatten nested JSON into tabular form (see the sketch after this list)
- Flatten JSON arrays into individual rows
- Combine data from multiple source Data Pools through JOINs
- Calculate new derived columns from existing data
- Perform incremental aggregations
- Sort rows with a different sorting key
- Filter out unnecessary data based on conditions
- De-duplicate rows
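As an illustration of the first case, flattening nested JSON into tabular form, here is a minimal sketch of the SELECT statement a Materialized View could be defined with. The `kafka_events` Data Pool name and the nested `order` payload structure are assumptions, and the exact JSON accessors depend on your SQL dialect:

```sql
-- Sketch of a Materialized View query that flattens a hypothetical
-- nested "order" payload into tabular columns.
SELECT
    _timestamp                            AS received_at,
    _propel_payload.order.id              AS order_id,
    _propel_payload.order.customer.email  AS customer_email,
    _propel_payload.order.total           AS order_total
FROM kafka_events
WHERE _topic = 'orders';
```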