Unlock scalable analytics with your DynamoDB data using ClickHouse.
Get started with Propel's Serverless ClickHouse forever-free plan today. Propel is the only Serverless ClickHouse with a true pay-per-query pricing and instant auto-scaling.
When working with DynamoDB, a common challenge is making the data more accessible for analytics. This typically involves transferring the data to a columnar store like ClickHouse and flattening the JSON structure emitted by DynamoDB streams into a tabular format. This transformation makes it significantly easier to query the data in systems such as ClickHouse.
In this post, we'll walk through how to flatten DynamoDB JSON using ClickHouse to set up a scalable, analyzable pipeline.
Step 1: Get your DynamoDB events to ClickHouse
Before we start flattening anything, we first need to make sure that the DynamoDB events are flowing smoothly into ClickHouse. Once you are landing the DynamoDB data in ClickHouse it will be available for further transformations.
If you are using Propel's Serverless ClickHouse you can use a Webhook Data Pool to ingest DynamoDB streams into ClickHouse. The entire DynamoDB event will be in a column called _propel_payload
.
Step 2: Creating a Materialized View for flattening
The first step in the transformation is to create a Materialized View that extracts and flattens the top level keys of the DynamoDB JSON event into its own table. These top-level keys are standard for all DynamoDB events so you can use this SQL to create a Materialized View regardless of the structure of your data.
Here is the SQL query you can use for this process:
SELECT toString("_propel_payload.awsRegion") AS awsRegion, toUInt64("_propel_payload.dynamodb.ApproximateCreationDateTime") AS ApproximateCreationDateTime, parseDateTimeBestEffort(toUInt64("_propel_payload.dynamodb.ApproximateCreationDateTime")/ 1000000) AS ApproximateCreationDateTimeTimestamp, toString("_propel_payload.dynamodb.ApproximateCreationDateTimePrecision") AS ApproximateCreationDateTimePrecision, toString("_propel_payload.dynamodb.Keys.pk.S") AS KeyPK, toString("_propel_payload.dynamodb.Keys.sk.S") AS KeySK, ifNull("_propel_payload.dynamodb.NewImage", '{}')::JSON AS NewImage, ifNull("_propel_payload.dynamodb.OldImage", '{}')::JSON AS OldImage, "_propel_payload.dynamodb.SizeBytes"::BIGINT AS SizeBytes, toString("_propel_payload.eventID") AS eventID, toString("_propel_payload.eventName") AS eventName, toString("_propel_payload.eventSource") AS eventSource, toString("_propel_payload.recordFormat") AS recordFormat, toString("_propel_payload.tableName") AS tableName, toString("_propel_payload.userIdentity") AS userIdentity FROM <Data Pool name>;
This Materialized View in ClickHouse is performing several key operations to flatten and transform the DynamoDB JSON data:
- Data Extraction: It's extracting specific fields from the nested JSON structure of the DynamoDB event, which is stored in the
_propel_payload
column. - Type Casting: The view is casting various fields to appropriate data types. For example:
toString()
is used for string fields.toUInt64()
for the unix timestamp.
- Timestamp Conversion: It's converting the
ApproximateCreationDateTime
from unix timestamp to a human-readable timestamp usingparseDateTimeBestEffort()
. - JSON Handling: The
NewImage
andOldImage
fields are being preserved as JSON objects, with a default empty object{}
if they're null. - Field Renaming: The view is giving more descriptive names to some fields, like
KeyPK
andKeySK
for the primary and sort keys. - Data Flattening: By extracting nested fields into top-level columns, it's flattening the complex DynamoDB JSON structure into a more queryable tabular format.
This Materialized View essentially creates a new table that presents the DynamoDB event data in a more analytics-friendly structure, making it easier to query and analyze in ClickHouse.
Test the query and ensure that it returns the expected results.
After the query is working, you can proceed to create the Materialized View. In Propel’s Serverless ClickHouse, follow these steps:
- Go to the Materialized View section in the Propel Console.
- Click "Create Materialized View".
- Use the SQL query above to define the view.
- Set the target Data Pool name, for example,
flattened_dynamodb_events
. - Select "Append-only data" as the data pool type, ideal for the DynamoDB event stream.
- Configure the table engine and sorting key, focusing on the timestamp column for efficient ordering of events.
Step 3: Flattening individual tables
With the main envelope of the DynamoDB event flattened, the next step is to unpack each table individually into its own Data Pool. This process makes it easier to query data specific to each DynamoDB table, particularly if you have a more traditional multi-table design.
Start by identifying which tables you need to create separate Data Pools for:
SELECT tableName, COUNT() as numRecords FROM flattened_dynamodb_events GROUP BY tableName;
For each table identified, create another Materialized View with a SQL like this:
SELECT eventName, tableName, KeyPK, KeySK, parseDateTimeBestEffortOrNull("NewImage._ct.S") AS ct, "NewImage._deleted.BOOL" AS deleted, toString("NewImage._et.S") AS et, parseDateTimeBestEffortOrNull("NewImage._lastChangedAt.N") AS lastChangedAt, -- Additional columns you need to flatten ... NewImage, OldImage FROM flattened_dynamodb_events WHERE tableName = '<YOUR_TABLE>';
Repeat this process for each table, creating a Materialized View that writes the flattened records to a new Data Pool.
Step 4: Handling single-table design
If your DynamoDB setup uses a single-table design—a popular approach for optimizing data storage and access patterns—you will need to identify different entities within the single table.
You can use a query like the following to get started:
SELECT COALESCE( toString("NewImage._et.S"), toString("OldImage._et.S") ) AS entity, tableName, MAX(KeyPK) as maxPK, MAX(KeySK) as maxSK, MIN(KeyPK) as minPK, MIN(KeySK) as minSK, MAX(NewImage), MAX(OldImage), COUNT() as numRecords FROM flattened_dynamodb_events GROUP BY tableName, COALESCE(toString("NewImage._et.S"), toString("OldImage._et.S"));
For each entity identified, create a corresponding Materialized View to output it into its own Data Pool.
Conclusion
By flattening your DynamoDB JSON using Propel and ClickHouse, you can make the data easier to work with, especially for analytics and complex querying. This process involves creating multiple Materialized Views that help unpack the nested JSON structure and partition the data by table or entity. Whether you use single-table design or multiple tables in DynamoDB, flattening the structure can make a significant difference in your ability to handle the data effectively.
This approach ensures you have the flexibility to query your data efficiently and make the most of your event-driven data architecture.
Get started with Propel's Serverless ClickHouse forever-free plan today. Propel is the only Serverless ClickHouse with a true pay-per-query pricing and instant auto-scaling. Contact us to learn more about our volume-based discounts. Visit our pricing page for details.