Amazon S3 Parquet
The Amazon S3 Parquet Data Pool lets you synchronize Parquet files stored in your Amazon S3 bucket to Propel, providing an easy way to power your analytic dashboards, reports, and workflows with a low-latency data API on top of your data lake.
Consider using the Amazon S3 Parquet Data Pool when:
- You require sub-second query performance for dashboards or reports on your data lake.
- You need to support high-concurrency and high-availability data workloads, such as customer-facing or mission-critical applications.
- You require fast data access through an API for web and mobile apps.
- You are building B2B SaaS or consumer applications that require multi-tenant access controls.
Get started
Follow our step-by-step Amazon S3 Parquet setup guide to connect Parquet files stored in an Amazon S3 bucket to Propel.
Architecture overview
Amazon S3 Parquet Data Pools connect to a specified Amazon S3 bucket and automatically synchronize Parquet files from the bucket into your Data Pool in Propel.
Features
Amazon S3 Parquet Data Pools support the following features:
Feature name | Supported | Notes |
---|---|---|
Syncs new records | ✅ | |
Configurable sync interval | ✅ | See the How Propel syncs section below. It can be configured to occur at intervals ranging from every minute to every 24 hours. |
Sync Pausing / Resuming | ✅ | |
Real-time updates | ✅ | See the Real-time updates section. |
Real-time deletes | ❌ | See the Real-time deletes section. |
Batch Delete API | ✅ | See Batch Delete API. |
Batch Update API | ✅ | See Batch Update API. |
API configurable | ✅ | See Management API docs. |
Terraform configurable | ✅ | See Propel Terraform docs. |
How Propel syncs Parquet files in Amazon S3
Propel synchronizes Parquet files from your S3 bucket into the Data Pool. To set this up, you specify the bucket name, the path to the files, and a sync interval. The sync interval determines how frequently files are synchronized.
The sync interval can range from 1 minute to 24 hours. During each sync, Propel retrieves all new files in the S3 bucket and synchronizes them into the Data Pool.
Syncing all files in the Amazon S3 bucket
To sync all Parquet files in your S3 bucket across all paths, use the path value provided below:
**/*.parquet
Notice that the S3 paths only match Parquet files using the *.parquet wildcard pattern. This is important because we don't want to attempt to sync non-Parquet files.
Syncing files in a specific path
To sync all Parquet files in a specific path of your S3 bucket, use the path value for that specific directory.
For instance, consider an S3 bucket with “sales” and “maintenance” directories as shown below:
s3://tacosoft
├── sales
│   ├── metadata.txt
│   ├── orders_1.parquet
│   ├── orders_2.parquet
│   └── orders_3.parquet
└── maintenance
    ├── metadata.txt
    ├── schedule_1.parquet
    ├── schedule_2.parquet
    └── schedule_3.parquet
If you only want to sync the data in the “sales” directory to Propel, use the path value provided below:
sales/**/*.parquet
Notice that the S3 paths only match Parquet files using the *.parquet wildcard pattern. This is important because we don't want to attempt to sync non-Parquet files, like metadata.txt.
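To build intuition for how these path values select objects, here is a minimal, hypothetical Python sketch that mimics globstar-style matching against the example bucket above, where "**" can span directories and "*" stays within a single path segment. The helper name s3_glob_to_regex and its exact matching rules are assumptions for illustration only; Propel evaluates the path pattern on its side.

```python
import re

# Hypothetical helper (illustration only): translate an S3-style glob into a regex,
# where "**/" matches zero or more directory levels and "*" never crosses a "/".
def s3_glob_to_regex(pattern: str) -> re.Pattern:
    regex = re.escape(pattern)
    regex = regex.replace(r"\*\*/", "(?:.*/)?")  # "**/" -> zero or more directories
    regex = regex.replace(r"\*\*", ".*")         # bare "**" -> anything, including "/"
    regex = regex.replace(r"\*", "[^/]*")        # "*"  -> anything except "/"
    return re.compile(f"^{regex}$")

# Object keys from the example tacosoft bucket above.
keys = [
    "sales/metadata.txt",
    "sales/orders_1.parquet",
    "sales/orders_2.parquet",
    "maintenance/metadata.txt",
    "maintenance/schedule_1.parquet",
]

for pattern in ["**/*.parquet", "sales/**/*.parquet"]:
    matcher = s3_glob_to_regex(pattern)
    print(pattern, "->", [key for key in keys if matcher.match(key)])

# **/*.parquet        -> every .parquet key, in any directory
# sales/**/*.parquet  -> only the .parquet keys under sales/; metadata.txt is skipped
```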
How Propel handles new records and updates
Propel ingests records from your S3 bucket to your Data Pool. How records are ingested depends on the table engine you select when you create your Data Pool.
- MergeTree Data Pools (append-only data): Syncs inserts and ignores updates.
- ReplacingMergeTree Data Pools (mutable records): Syncs inserts and updates records with the same sorting key.
Read our guide on Selecting table engine and sorting key for more information.
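As a rough mental model of the two behaviors, the hypothetical Python sketch below keeps every incoming record for the append-only case and keeps only the latest record per sorting key for the mutable case. The record shape and the order_id sorting key are made up for illustration; the actual merging is handled by the Data Pool's table engine.

```python
# Records as they might arrive across several syncs; the last one shares the
# sorting key of the first and represents an update to that order (illustration only).
records = [
    {"order_id": 1, "status": "placed",    "amount": 20},
    {"order_id": 2, "status": "placed",    "amount": 35},
    {"order_id": 1, "status": "delivered", "amount": 20},
]

# MergeTree-style (append-only): rows are only ever added, never rewritten in place.
append_only = list(records)

# ReplacingMergeTree-style (mutable): the latest record wins for each sorting key.
sorting_key = "order_id"
latest_per_key = {}
for record in records:
    latest_per_key[record[sorting_key]] = record

print(len(append_only))               # 3 rows kept
print(list(latest_per_key.values()))  # one row per order_id; order 1 shows "delivered"
```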
Data syncing can be configured to occur at intervals ranging from every minute to every 24 hours. The frequency at which you sync data into Propel should depend on your data freshness requirements and your upstream pipeline. For instance, if you promised your customers a data freshness of 3 hours and you have a dbt job that runs every hour, it would not make sense to run syncs every minute since there will be no new data to sync. Running syncs every hour would be sufficient.
Data requirements
No requirements. You can ingest any type of data.
Data Types
The table below describes the default data type mappings from Parquet types to Propel types. When creating an Amazon S3 Parquet Data Pool, you can modify these default mappings. For instance, if you know that an INT64 column contains UNIX timestamps, you can convert it to a TIMESTAMP by changing the default mapping.
Parquet Type | Propel Type | Notes |
---|---|---|
BOOLEAN | BOOLEAN | |
INT8 | INT8 | |
UINT8 | INT16 | |
INT16 | INT16 | |
UINT16 | INT32 | |
INT32 | INT32 | |
UINT32 | INT64 | |
INT64 | INT64 | |
UINT64 | INT64 | |
FLOAT | FLOAT | |
DOUBLE | DOUBLE | |
DECIMAL(p ≤ 9, s=0) | INT32 | |
DECIMAL(p ≤ 9, s>0) | FLOAT | |
DECIMAL(p ≤ 18, s=0) | INT64 | |
DECIMAL(p ≤ 18, s>0) | DOUBLE | |
DECIMAL(p ≤ 76, s) | DOUBLE | |
DATE | DATE | |
TIME (ms) | INT32 | |
TIME (µs, ns) | INT64 | |
TIMESTAMP | TIMESTAMP | |
INT96 | TIMESTAMP | |
BINARY | STRING | |
STRING | STRING | |
ENUM | STRING | |
FIXED_LENGTH_BYTE_ARRAY | STRING | |
MAP | JSON | |
LIST | JSON | |
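As a concrete illustration of these defaults, the sketch below writes a small Parquet file with PyArrow (an assumed tool choice; any Parquet writer works) and annotates each column with the Propel type it would receive per the table above. The epoch_seconds column shows the kind of case mentioned earlier where you might override the default INT64 mapping and choose TIMESTAMP instead when creating the Data Pool.

```python
from datetime import date, datetime

import pyarrow as pa
import pyarrow.parquet as pq

# Each column's Parquet type and the default Propel type it maps to (see table above).
schema = pa.schema([
    ("is_active", pa.bool_()),           # BOOLEAN   -> BOOLEAN
    ("quantity", pa.int32()),            # INT32     -> INT32
    ("revenue", pa.float64()),           # DOUBLE    -> DOUBLE
    ("order_date", pa.date32()),         # DATE      -> DATE
    ("created_at", pa.timestamp("ms")),  # TIMESTAMP -> TIMESTAMP
    ("customer", pa.string()),           # STRING    -> STRING
    ("epoch_seconds", pa.int64()),       # INT64     -> INT64 by default; can be
                                         # remapped to TIMESTAMP at Data Pool creation
])

table = pa.table(
    {
        "is_active": [True, False],
        "quantity": [3, 7],
        "revenue": [49.5, 120.0],
        "order_date": [date(2024, 1, 1), date(2024, 1, 2)],
        "created_at": [datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 2, 8, 30)],
        "customer": ["acme", "tacosoft"],
        "epoch_seconds": [1704110400, 1704184200],
    },
    schema=schema,
)

# Write the file locally; uploading it to the synced S3 path makes it eligible for the next sync.
pq.write_table(table, "orders_1.parquet")
```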
Schema changes
Propel supports non-breaking schema changes for Amazon S3 Parquet Data Pools. You can add columns to your Data Pool. To add a column to an Amazon S3 Parquet Data Pool, go to the “Operations” tab and select “Add columns to Data Pool.”
Then you can specify the column to add by giving it a name and a data type.
Clicking “Add column” starts an asynchronous operation to add the column to the Data Pool. You can monitor the progress of the operation in the “Operations” tab.
Note that when you add a column, Propel does not backfill existing rows. To backfill them, you can run a batch update operation.
Column deletions, column modifications, and data type changes are not supported because they are breaking changes to the schema.
Management API
Below is the relevant API documentation for the Amazon S3 Parquet Data Pool.
Queries
Mutations
- Create Data Pool
- Modify Data Pool
- Delete Data Pool by ID
- Delete Data Pool by unique name
- Create an Amazon S3 Data Source
- Modify Amazon S3 Data Source
- Delete Data Source by ID
- Delete Data Source by unique name
Limits
No limits at this point.