Flattening DynamoDB JSON in ClickHouse®

Unlock scalable analytics with your DynamoDB data using ClickHouse®.

Propel

Unlock scalable analytics with your DynamoDB data using ClickHouse®.

Get started with Propel's Serverless ClickHouse® forever-free plan today. Propel is the only Serverless ClickHouse® with a true pay-per-query pricing and instant auto-scaling.

When working with DynamoDB, a common challenge is making the data more accessible for analytics. This typically involves transferring the data to a columnar store like ClickHouse and flattening the JSON structure emitted by DynamoDB streams into a tabular format. This transformation makes it significantly easier to query the data in systems such as ClickHouse.

In this post, we'll walk through how to flatten DynamoDB JSON using ClickHouse to set up a scalable, analyzable pipeline.

Step 1: Get your DynamoDB events to ClickHouse®

Before we start flattening anything, we first need to make sure that the DynamoDB events are flowing smoothly into ClickHouse®. Once you are landing the DynamoDB data in ClickHouse it will be available for further transformations.

If you are using Propel's Serverless ClickHouse you can use a Webhook Data Pool to ingest DynamoDB streams into ClickHouse®. The entire DynamoDB event will be in a column called _propel_payload.

Step 2: Creating a Materialized View for flattening

The first step in the transformation is to create a Materialized View that extracts and flattens the top level keys of the DynamoDB JSON event into its own table. These top-level keys are standard for all DynamoDB events so you can use this SQL to create a Materialized View regardless of the structure of your data.

Here is the SQL query you can use for this process:

SELECT
  toString("_propel_payload.awsRegion") AS awsRegion,
  toUInt64("_propel_payload.dynamodb.ApproximateCreationDateTime") AS ApproximateCreationDateTime,
  parseDateTimeBestEffort(toUInt64("_propel_payload.dynamodb.ApproximateCreationDateTime")/ 1000000) AS ApproximateCreationDateTimeTimestamp,
  toString("_propel_payload.dynamodb.ApproximateCreationDateTimePrecision") AS ApproximateCreationDateTimePrecision,
  toString("_propel_payload.dynamodb.Keys.pk.S") AS KeyPK,
  toString("_propel_payload.dynamodb.Keys.sk.S") AS KeySK,
  ifNull("_propel_payload.dynamodb.NewImage", '{}')::JSON AS NewImage,
  ifNull("_propel_payload.dynamodb.OldImage", '{}')::JSON AS OldImage,
  "_propel_payload.dynamodb.SizeBytes"::BIGINT AS SizeBytes,
  toString("_propel_payload.eventID") AS eventID,
  toString("_propel_payload.eventName") AS eventName,
  toString("_propel_payload.eventSource") AS eventSource,
  toString("_propel_payload.recordFormat") AS recordFormat,
  toString("_propel_payload.tableName") AS tableName,
  toString("_propel_payload.userIdentity") AS userIdentity
FROM 
  <Data Pool name>;

This Materialized View in ClickHouse is performing several key operations to flatten and transform the DynamoDB JSON data:

Data Extraction: It's extracting specific fields from the nested JSON structure of the DynamoDB event, which is stored in the _propel_payload column.
Type Casting: The view is casting various fields to appropriate data types. For example:
- toString() is used for string fields.
- toUInt64() for the unix timestamp.
Timestamp Conversion: It's converting the ApproximateCreationDateTime from unix timestamp to a human-readable timestamp using parseDateTimeBestEffort().
JSON Handling: The NewImage and OldImage fields are being preserved as JSON objects, with a default empty object {} if they're null.
Field Renaming: The view is giving more descriptive names to some fields, like KeyPK and KeySK for the primary and sort keys.
Data Flattening: By extracting nested fields into top-level columns, it's flattening the complex DynamoDB JSON structure into a more queryable tabular format.

This Materialized View essentially creates a new table that presents the DynamoDB event data in a more analytics-friendly structure, making it easier to query and analyze in ClickHouse.

Test the query and ensure that it returns the expected results.

After the query is working, you can proceed to create the Materialized View. In Propel’s Serverless ClickHouse, follow these steps:

Go to the Materialized View section in the Propel Console.
Click "Create Materialized View".
Use the SQL query above to define the view.
Set the target Data Pool name, for example, flattened_dynamodb_events.
Select "Append-only data" as the data pool type, ideal for the DynamoDB event stream.
Configure the table engine and sorting key, focusing on the timestamp column for efficient ordering of events.

Step 3: Flattening individual tables

With the main envelope of the DynamoDB event flattened, the next step is to unpack each table individually into its own Data Pool. This process makes it easier to query data specific to each DynamoDB table, particularly if you have a more traditional multi-table design.

Start by identifying which tables you need to create separate Data Pools for:

SELECT
  tableName,
  COUNT() as numRecords
FROM 
  flattened_dynamodb_events
GROUP BY 
  tableName;

For each table identified, create another Materialized View with a SQL like this:

SELECT
  eventName,
  tableName,
  KeyPK,
  KeySK,
  parseDateTimeBestEffortOrNull("NewImage._ct.S") AS ct,
  "NewImage._deleted.BOOL" AS deleted,
  toString("NewImage._et.S") AS et,
  parseDateTimeBestEffortOrNull("NewImage._lastChangedAt.N") AS lastChangedAt,
  -- Additional columns you need to flatten
  ...
  NewImage,
  OldImage
FROM 
  flattened_dynamodb_events
WHERE 
  tableName = '<YOUR_TABLE>';

Repeat this process for each table, creating a Materialized View that writes the flattened records to a new Data Pool.

Step 4: Handling single-table design

If your DynamoDB setup uses a single-table design—a popular approach for optimizing data storage and access patterns—you will need to identify different entities within the single table.

You can use a query like the following to get started:

SELECT
  COALESCE(
    toString("NewImage._et.S"), 
    toString("OldImage._et.S")
  ) AS entity,
  tableName,
  MAX(KeyPK) as maxPK,
  MAX(KeySK) as maxSK,
  MIN(KeyPK) as minPK,
  MIN(KeySK) as minSK,
  MAX(NewImage),
  MAX(OldImage),
  COUNT() as numRecords
FROM 
  flattened_dynamodb_events
GROUP BY 
  tableName, 
  COALESCE(toString("NewImage._et.S"), toString("OldImage._et.S"));

For each entity identified, create a corresponding Materialized View to output it into its own Data Pool.

Conclusion

By flattening your DynamoDB JSON using Propel and ClickHouse, you can make the data easier to work with, especially for analytics and complex querying. This process involves creating multiple Materialized Views that help unpack the nested JSON structure and partition the data by table or entity. Whether you use single-table design or multiple tables in DynamoDB, flattening the structure can make a significant difference in your ability to handle the data effectively.

This approach ensures you have the flexibility to query your data efficiently and make the most of your event-driven data architecture.

Get started with Propel's Serverless ClickHouse® forever-free plan today. Propel is the only Serverless ClickHouse® with a true pay-per-query pricing and instant auto-scaling. Contact us to learn more about our volume-based discounts. Visit our pricing page for details.

What's new in ClickHouse® version 24.10

This is some text inside of a div block.

Heading 1

Heading 2

Heading 3

Heading 4

Heading 5

Heading 6

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.