Ingest Parquet files from an Amazon S3 bucket into Propel.

Get started with Amazon S3

Step-by-step instructions to ingest Parquet files from S3 into Propel.

Architecture

An Amazon S3 Data Pool connects to the Amazon S3 bucket you specify and automatically synchronizes Parquet files from that bucket into the Data Pool.

Features

Amazon S3 Parquet Data Pools support the following features:

| Feature name | Supported | Notes |
|---|---|---|
| Syncs new records | Yes | |
| Configurable sync interval | Yes | See the How Propel syncs section below. The sync interval can be set anywhere from every minute to every 24 hours. |
| Sync Pausing / Resuming | Yes | |
| Real-time updates | Yes | See the Real-time updates section. |
| Real-time deletes | Yes | See the Real-time deletes section. |
| Batch Delete API | Yes | See Batch Delete API. |
| Batch Update API | Yes | See Batch Update API. |
| API configurable | Yes | See API docs. |
| Terraform configurable | Yes | See Terraform docs. |

How Propel syncs Parquet files in Amazon S3

The Amazon S3-based Data Pool syncs Parquet files from your S3 bucket. You specify:

  1. Bucket name
  2. File path
  3. The schema of the data
  4. Sync interval (1 minute to 24 hours)
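As a rough illustration of the four settings above, the sketch below collects them in one place and checks the sync interval bounds. The class name `S3DataPoolSettings`, its field names, and the example bucket and schema are hypothetical and are not part of Propel's API.

```python
from dataclasses import dataclass
from datetime import timedelta

# Bounds taken from the list above: 1 minute to 24 hours.
MIN_SYNC_INTERVAL = timedelta(minutes=1)
MAX_SYNC_INTERVAL = timedelta(hours=24)


@dataclass
class S3DataPoolSettings:
    """Hypothetical container for the four settings; not a Propel type."""

    bucket: str
    path: str
    schema: dict  # column name -> Propel type name
    sync_interval: timedelta

    def __post_init__(self):
        if not self.bucket:
            raise ValueError("bucket name is required")
        if not MIN_SYNC_INTERVAL <= self.sync_interval <= MAX_SYNC_INTERVAL:
            raise ValueError("sync interval must be between 1 minute and 24 hours")


# Example values are made up for illustration.
settings = S3DataPoolSettings(
    bucket="example-bucket",
    path="sales/**/*.parquet",
    schema={"order_id": "STRING", "amount": "DOUBLE", "created_at": "TIMESTAMP"},
    sync_interval=timedelta(minutes=5),
)
```

Validating the interval locally like this catches an out-of-range value before any request is made.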

During each sync, Propel retrieves and synchronizes all new files that match the specified S3 bucket path. For example:

  1. To sync all Parquet files, use the following path:

    **/*.parquet
    
  2. To sync files in a specific directory (e.g., “sales”):

    sales/**/*.parquet
    

Use the *.parquet pattern to sync only Parquet files, excluding other file types.
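To preview which object keys a path pattern would select, the sketch below translates a pattern into a regular expression and filters a list of keys. It assumes common globstar semantics (`**/` matches zero or more directories, `*` stays within one path segment). Propel's exact matching rules may differ, and the helper names and sample keys are hypothetical.

```python
import re


def glob_to_regex(pattern):
    """Translate a glob using `**/` and `*` into a compiled regex (assumed semantics)."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append(r"(?:.*/)?")  # zero or more directories
            i += 3
        elif pattern[i] == "*":
            parts.append(r"[^/]*")  # any characters within one path segment
            i += 1
        else:
            parts.append(re.escape(pattern[i]))  # literal character
            i += 1
    return re.compile("".join(parts))


def matching_keys(pattern, keys):
    """Return the keys that match the whole pattern, in their original order."""
    regex = glob_to_regex(pattern)
    return [key for key in keys if regex.fullmatch(key)]


# Sample object keys, made up for illustration.
keys = [
    "events.parquet",
    "sales/2024/01/orders.parquet",
    "sales/readme.txt",
    "marketing/leads.parquet",
]

all_parquet = matching_keys("**/*.parquet", keys)
sales_parquet = matching_keys("sales/**/*.parquet", keys)
```

With these sample keys, `all_parquet` holds the three `.parquet` keys and `sales_parquet` holds only `sales/2024/01/orders.parquet`. The `.txt` file is excluded in both cases.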

New records and updates

How records are ingested depends on the table engine you select when you create your Data Pool.

  • MergeTree Data Pools (append-only data): Syncs inserts and ignores updates.
  • ReplacingMergeTree Data Pools (mutable records): Syncs inserts and updates records with the same sorting key.

Read our guide on Selecting table engine and sorting key for details.
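The difference between the two engines can be pictured with a simplified in-memory model. This is only a sketch of the semantics described above, not how Propel stores data. It assumes that an append-only pool keeps every synced row as its own record, while a mutable pool keeps the latest row per sorting key. The function names and sample rows are hypothetical.

```python
def sync_merge_tree(rows):
    """Append-only model: every synced row is kept as its own record."""
    return list(rows)


def sync_replacing_merge_tree(rows, sorting_key):
    """Mutable model: a later row replaces an earlier row with the same sorting key."""
    latest = {}
    for row in rows:
        key = tuple(row[column] for column in sorting_key)
        latest[key] = row  # last write wins for this key
    return list(latest.values())


# Sample rows, made up for illustration; order_id 1 arrives twice.
rows = [
    {"order_id": 1, "status": "pending"},
    {"order_id": 2, "status": "pending"},
    {"order_id": 1, "status": "shipped"},
]

appended = sync_merge_tree(rows)
replaced = sync_replacing_merge_tree(rows, sorting_key=["order_id"])
```

Here `appended` contains all three rows, while `replaced` contains two rows, with order 1 shown as `shipped`.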

Schema changes

You can add new columns to an Amazon S3 Data Pool with the AddColumnToDataPool job.

For breaking changes like column deletions or type modifications, recreate the Data Pool.

See our Changing Schemas section for more details.
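Before changing a schema, it can help to check whether the change is purely additive or breaking. The sketch below compares two schemas represented as plain dictionaries of column name to type. The function name, the dictionary representation, and the sample columns are assumptions for illustration only.

```python
def classify_schema_change(current, proposed):
    """Compare two {column: type} schemas and flag breaking changes."""
    current_columns = set(current)
    proposed_columns = set(proposed)
    added = sorted(proposed_columns - current_columns)
    removed = sorted(current_columns - proposed_columns)
    retyped = sorted(
        column
        for column in current_columns & proposed_columns
        if current[column] != proposed[column]
    )
    return {
        "added": added,  # candidates for the AddColumnToDataPool job
        "removed": removed,  # breaking: column deletion
        "retyped": retyped,  # breaking: type modification
        "requires_recreate": bool(removed or retyped),
    }


# Sample schemas, made up for illustration.
current = {"order_id": "STRING", "amount": "DOUBLE"}
additive = classify_schema_change(current, {**current, "currency": "STRING"})
breaking = classify_schema_change(current, {"order_id": "INT64"})
```

In this example `additive` reports one added column and no need to recreate the Data Pool, while `breaking` reports a removed column and a retyped column, so the Data Pool would need to be recreated.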

Data Types

The table below shows default Parquet to Propel data type mappings. When creating an Amazon S3 Parquet Data Pool, you can customize these mappings.

| Parquet Type | Propel Type |
|---|---|
| BOOLEAN | BOOLEAN |
| INT8 | INT8 |
| UINT8 | INT16 |
| INT16 | INT16 |
| UINT16 | INT32 |
| INT32 | INT32 |
| UINT32 | INT64 |
| INT64 | INT64 |
| UINT64 | INT64 |
| FLOAT | FLOAT |
| DOUBLE | DOUBLE |
| DECIMAL(p ≤ 9, s=0) | INT32 |
| DECIMAL(p ≤ 9, s>0) | FLOAT |
| DECIMAL(p ≤ 18, s=0) | INT64 |
| DECIMAL(p ≤ 18, s>0) | DOUBLE |
| DECIMAL(p ≤ 76, s) | DOUBLE |
| DATE | DATE |
| TIME (ms) | INT32 |
| TIME (µs, ns) | INT64 |
| TIMESTAMP | TIMESTAMP |
| INT96 | TIMESTAMP |
| BINARY | STRING |
| STRING | STRING |
| ENUM | STRING |
| FIXED_LENGTH_BYTE_ARRAY | STRING |
| MAP | JSON |
| LIST | JSON |
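The DECIMAL rows depend on both precision and scale, so they are easier to follow as a small function. The sketch below encodes only the default DECIMAL mappings from the table. The function name and the input validation are assumptions, not part of Propel.

```python
def propel_type_for_decimal(precision, scale):
    """Return the default Propel type for a Parquet DECIMAL(precision, scale)."""
    if not 1 <= precision <= 76:
        raise ValueError("precision must be between 1 and 76")
    if scale < 0:
        raise ValueError("scale must not be negative")
    if precision <= 9:
        # Fits in 32 bits: whole numbers map to INT32, fractional values to FLOAT.
        return "INT32" if scale == 0 else "FLOAT"
    if precision <= 18:
        # Fits in 64 bits: whole numbers map to INT64, fractional values to DOUBLE.
        return "INT64" if scale == 0 else "DOUBLE"
    # Wider decimals (up to precision 76) default to DOUBLE regardless of scale.
    return "DOUBLE"


# Sample lookups covering each DECIMAL row.
examples = {
    (9, 0): propel_type_for_decimal(9, 0),
    (9, 2): propel_type_for_decimal(9, 2),
    (18, 0): propel_type_for_decimal(18, 0),
    (18, 4): propel_type_for_decimal(18, 4),
    (38, 0): propel_type_for_decimal(38, 0),
}
```

Note that mapping a DECIMAL to FLOAT or DOUBLE can lose exactness for some values, which is one reason to customize the mapping when you create the Data Pool.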