Content on this page
Introduction
Partitioning can be configured for a table to improve data locality, provide data isolation, and reduce the number of bytes that need to be scanned to satisfy a query. To make use of this feature, input data must already be separated into multiple input files; all the rows in a particular input file are ingested into exactly one partition.
Partitioning works by capturing substrings of the input object path based on the configured input file pattern and inserting them as top-level fields in the emitted rows. The generated fields can be referenced in a query on those rows as if they were fields in the input data. A top-level field in the input data with the same name as the partition field will be overwritten by the inserted field. To avoid this, the field name chosen for the partition field should be unique.
Multiple partition fields can be configured for a table. Each distinct tuple of partition field values forms a different partition. Data in different partitions will always end up in separate output objects, thus providing data separation guarantees. Queries that include a condition that explicitly limits the partitions that need to be scanned will only scan the data associated with those partitions, reducing the number of bytes that need to be scanned.
Partitions are defined by setting the partitions
field in the table
definition. The partitions
field is a list of JSON objects that each define
one partition field. The following table definition defines a single partition
field called region
which references one segment of the input object path.
{
"input": [
{
"pattern": "s3://example-bucket/logs/{region}/*.json.zst"
}
],
"partitions": [
{
"field": "region"
}
]
}
If the contents of example-bucket
were as follows:
logs/eu-west-1/access-log.json.zst
logs/eu-west-1/error-log.json.zst
logs/us-east-1/access-log.json.zst
logs/us-east-1/error-log.json.zst
logs/us-west-2/access-log.json.zst
logs/us-west-2/error-log.json.zst
…then once the logs
table is fully ingested it will contain three partitions
(eu-west-1
, us-east-1
, us-west-2
) separated into distinct output files for
each partition. A query specifying a partition by name, for example:
SELECT COUNT(*) FROM logs WHERE region = 'us-west-2'
…will only end up scanning output files that are part of the us-west-2
partition.
Partitioning on dates
Sneller automatically indexes timestamp fields found while ingesting input files, therefore partitioning input data on dates found in the input object path is not recommended when input data already contains a timestamp field.
In cases where input data does not contain a timestamp field and the date can
only be determined from the input object path, a timestamp field can be derived
from date components in an object path by defining a partition of type date
in
a table definition as in the following example.
{
"input": [
{ "pattern": "s3://example-bucket/logs/{yyyy}/{mm}/{dd}/*.json.zst" }
],
"partitions": [
{ "field": "date", "type": "date", "value": "$yyyy-$mm-$dd" }
]
}