Content on this page
Introduction
This section describes how to create a definition.json
file to define a
database table and the schema of the JSON object inside of the definition file.
The definition.json
file should be placed in an AWS S3 bucket with an object
key conforming to the following pattern, where <db-name>
is the name of the
database and <table-name>
is the name of the table.
db/<db-name>/<table-name>/definition.json
Contents and schema
The contents of the definition.json
file should be a single JSON object with
the following schema.
inputs
(list, required) is the list of inputs that define the location and format of files that are used to build the table. Each item in the list is an object with the following schema:pattern
(string, required) is a URI glob pattern that specifies which files are fed into the table. Example:s3://bucket/dir/*.json
,s3://bucket/{tenant}/{region}/*.json.gz
format
(string, optional) is the format of the files in pattern. If this is blank or omitted, then the format will be inferred from the extension. Examples:json.zst
,json.gz
,json
,csv
,tsv
hints
(list or object, optional) are used to provide hints about the input data to the parser. These hints may be used to perform type-based coercion of certain fields in the input data, and may additionally eliminate some of the data as it is parsed. Hints data is format-specific. See the relevant section for details.
partitions
(list, optional) specifies synthetic fields that are generated from components of the input URI and used to partition table data. The partition definitions refer to named components of the input pattern; for example,tenant
andregion
ins3://bucket/{tenant}/{region}/*.json.gz
. Each item in this list is an object that has the following schema:field
(string, required) is the name of the partition field. If this field conflicts with a field in the input data, the partition field will override it.type
(string, optional) is the type of the partition field. Possible values for this field arestring
,int
,date
, ortimestamp
. If this field is blank or omitted, this defaults tostring
.value
(string, optional) is a template string that is used to produce the value for the partition field. The template may reference parts of the input URI specified in the input pattern by including a substring like$name
or${name}
. A literal$
can be inserted into the output by using$$
in the template. If this field is blank or omitted, the field name is used to determine the input URI part that will be used to determine the value. Examples:${tenant}_${region}
,$yyyy-$mm-$dd
retention_policy
(object, optional) is the expiration policy for table data. Data older than the expiration window will be periodically purged from the backing store during table updates. This object has the following schema:field
(string, required) is the path expression for the field used to determine the age of a record for the purpose of the data retention policy. Currently only timestamp fields are supported.valid_for
(string, required) is the validity window relative to the current time. This is a string with a format like<n>y<n>m<n>d
where<n>
is a number and any component can be omitted. Examples:6m
,1000d
,1y6m15d
beta_features
(list, optional) is a list of feature flags that can be used to turn on features for beta testing. Feature flags may be removed as features are rolled in general production. Unknown features are silently ignored.skip_backfill
(boolean, optional) if set to true, will cause this table to skip scanning the source bucket(s) for matching objects when the first objects are inserted into the table.
Here is a minimal example of a definition.json
file used to create a table
from a set of JSON input files in an AWS S3 bucket called example-bucket
sharing a common prefix logs/
.
{
"input": [
{
"pattern": "s3://example-bucket/logs/*.json",
"format": "json"
}
]
}
Here is a more complex example using logs produced by AWS VPC Flow Logs as input
data. This example shows how the hints
field can be used to configure the
parser for comma-separated and tab-separated tabular data. It also defines two
levels of partitioning (by region and date) using components of the input object
path, and defines a retention policy of 1 year and 6 months based on the end
field in the input data.
{
"input": [
{
"pattern": "s3://logs-bucket/vpcflowlogs/AWSLogs/*/vpcflowlogs/{region}/{yyyy}/{mm}/{dd}/*.log.gz",
"format": "csv.gz",
"hints": {
"skipRecords": 1,
"separator": " ",
"missingValues": [ "-" ],
"fields": [
{ "name": "version", "type": "int" },
{ "name": "account_id", "type": "string" },
{ "name": "interface_id", "type": "string" },
{ "name": "srcaddr", "type": "string" },
{ "name": "dstaddr", "type": "string" },
{ "name": "srcport", "type": "int" },
{ "name": "dstport", "type": "int" },
{ "name": "protocol", "type": "int" },
{ "name": "packets", "type": "int" },
{ "name": "bytes", "type": "int" },
{ "name": "start", "type": "datetime", "format": "unix_seconds" },
{ "name": "end", "type": "datetime", "format": "unix_seconds" },
{ "name": "action", "type": "string" },
{ "name": "log_status", "type": "string" }
]
}
}
],
"partitions": [
{ "field": "region" },
{ "field": "date", "value": "$yyyy/$mm/$dd" }
],
"retention_policy": {
"field": "end",
"valid_for": "1y6m"
}
}
Format-specific hints
This section provides examples of the hints
field that can be provided for
different input file types.
Hints for the JSON parser
The following is an example of the value the hints
field can take when used to
provide type information for json
type inputs.
[
{ "path": "path.to.field.a", "hints": "string" },
{ "path": "path.to.field.b", "hints": ["string", "bool"] }
]
The precedence of overlapping rules is determined by the order in which the
rules are written. The ?
/[?]
wildcard can be used to match all keys of the
current level. The *
/[*]
wildcard can be used to match all keys of the
current level and all following levels. The *
wildcard must be the last
segment in the path.
The hints
field must either be a string or a list of strings. The following
values are supported in the hints
field.
default
causes the type of the emitted value to be determined by the JSON type of the corresponding field. This is the default behavior if no type hint is included.string
indicates that the property is allowed to be a string value.number
indicates that the property is allowed to be either a floating point number or integer value.int
indicates that the property is allowed to be an integer value.bool
indicates that the property is allowed to be a boolean value.datetime
indicates that the property is allowed to be an RFC 3339 timestamp string with an optional nanosecond component. The value will be emitted as a timestamp truncated to microsecond precision.unix_seconds
indicates that the property is allowed to be an integer value which is interpreted as the number of seconds since the Unix epoch. The resulting field will be emitted as a timestamp.unix_milli_seconds
is as above but for milliseconds.unix_micro_seconds
is as above but for microseconds.unix_nano_seconds
is as above but for nanoseconds. The emitted value will be truncated to microsecond precision.ignore
causes the property to be ignored during parsing.no_index
is used to prevent timestamp properties from being automatically included in the index. This does not have any effect on non-timestamp fields.
Hints for the CSV and TSV parsers
The hints object provided to the CSV and TSV parsers adheres to the following schema.
skip_records
(number, optional) allows skipping the first N records, which can be useful when input files contain headers.separator
(string, optional) allows specifying a custom separator. This only applies to CSV input files. If this field is set, it must be a string containing a single character.missing_values
(list, optional) is a list of strings which represent missing values. Entries infields
may override this on a per-field basis.fields
(list, required) specifies the hint for each field. Each item in the list is an object following this schema:name
(string, required) is the name of the field.type
(string, optional) is the type of the field. This may be one of the following values:string
(default if blank or omitted)number
int
bool
datetime
ignore
default
(string, optional) is the default value if the column is an empty string.format
(string, optional) is the timestamp format if thetype
field is set todatetime
. This may be one of the following values:datetime
(default)unix_seconds
unix_milli_seconds
unix_micro_seconds
unix_nano_seconds
(the emitted field value will be truncated to microsecond precision)
allow_empty
(boolean, optional) is used to allow empty strings to be ingested. This is only valid for use with thestring
type. If missing or false, then a field with an empty value won’t be written for the record.no_index
if true specifies that the field shouldn’t be indexed. This is only valid for thedatetime
type.true_values
(list, optional) is a list of strings that represent a true boolean value. This is only valid for thebool
type.false_values
(list, optional) is a list of strings that represent a false boolean value. This is only valid for thebool
type.missing_values
(list, optional) is list of strings that represent a missing value for this particular column. This will cause any field matching this value to be omitted from the resulting object. If this is provided, it will overridemissing_values
in the top-level hints object.