Why We Built a Schemaless SQL Engine

by Phil Hofer | March 21, 2023

For better or worse, JSON has become one of the most common data interchange formats used in software engineering. Your logging and metrics pipelines probably produce it, your APIs consume and produce it, and you’re probably using it for configuration somewhere in your software stack.

JSON’s ubiquity is not a historical accident. The data model of “self-describing records composed of values, which themselves may be records or lists-of-values” is powerful enough to model most kinds of data, and simple enough that working with it is straightforward. In fact, this model is so ubiquitous that most SQL engines have grown JSON-parsing extensions, and a variety of “binary JSON” formats (ion, MessagePack, BSON, jsonb, and so forth) have gained moderate levels of adoption. Moreover, databases dedicated to storing and indexing JSON or JSON-like documents (like ElasticSearch or MongoDB or DynamoDB, collectively “document stores”) have also grown in popularity.

While static typing and schema-defined interchange formats are in vogue right now, there are enduring structural factors that make “flexibly-typed” data formats attractive to developers. “The Advantages of Flexible Typing” by the SQLite developers provides some great reasons why they made their on-disk format self-describing, and many of their arguments generalize to other applications and architectures.

Of course, JSON and, by extension, document stores are not without their problems. As the proliferation of “binary JSON” formats might suggest, JSON itself is not the friendliest format for machines to parse, and it is also not particularly space-efficient. In particular, neither “document stores” nor more traditional SQL databases with JSON extensions provide linear scan performance on JSON-like objects that is comparable to that of strictly-typed analytic columnar data-stores.

Sneller, our open source query engine, bridges the gap between distributed columnar data-stores and document stores: it provides a SQL-like interface for unstructured data (like billions of JSON records) with the flexibility of a document store database and the performance of a modern distributed columnar database. Unlike typical columnar databases, the Sneller SQL engine is agnostic to the layout of the records it ingests; every record in our on-disk format is fully self-describing. Consequently, we can ingest and query JSON documents without any configuration dictating the expected layout of the input data. Surprisingly, parsing all of these self-describing records during query execution is not cost-prohibitive; Sneller’s pure-assembly AVX-512 execution engine has such high computation density (true CPU “work” performed per instruction) that the overhead of transposing record fields into vectors is negligible.

To illustrate this point, let’s run some queries on 1 day of Github Archive JSON data using the command-line version of the Sneller query engine. I’ve downloaded all the data from January 4th, 2022 and packed it into a zion file with sdb pack. The compressed zion file contains 3.25 million rows, occupies 1.2GB of disk space, decompresses to 9.1GB of ion data, and was originally 13GB of JSON data. We can use sdb query to run the Sneller query engine directly on a file that we’ve got stored locally:

$ head /proc/cpuinfo # I'm on a 26-core Ice Lake CPU with AVX-512 support
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 106
model name      : Intel(R) Xeon(R) Gold 5320 CPU @ 2.20GHz
stepping        : 6
microcode       : 0xd000375
cpu MHz         : 800.000
cache size      : 39936 KB
physical id     : 0
$ ls -lh 1day.zion
-rw-------    1 phil     phil        1.2G Aug  2  2022 1day.zion
$ sdb query -v -fmt=json "select count(*), earliest(created_at), latest(created_at) from read_file('1day.zion')" | jq
{
  "count": 3259519,
  "min": "2022-01-04T00:00:00Z",
  "max": "2022-01-04T23:59:59Z"
}
9780068352 bytes (9.108 GiB) scanned in 76.534373ms 125GiB/s

Here’s a query that uses a regular expression to search through the records at close to DRAM speed:

$ sdb query -v -fmt=json "select distinct repo.name from read_file('1day.zion') where repo.name ~ 'torvalds/.*'" | jq
{
  "name": "torvalds/linux"
}
{
  "name": "jordy-torvalds/dailystack"
}
{
  "name": "torvalds/subsurface-for-dirk"
}
9780068352 bytes (9.108 GiB) scanned in 91.35611ms 105GiB/s

Here’s another query that performs an aggregation across all the records, again at close to DRAM speed:

$ sdb query -v -fmt=json "select count(*), type from read_file('1day.zion') group by type order by count(*) desc"
{"type": "PushEvent", "count": 1686536}
{"type": "CreateEvent", "count": 451280}
{"type": "PullRequestEvent", "count": 266838}
{"type": "IssueCommentEvent", "count": 192179}
{"type": "WatchEvent", "count": 184240}
{"type": "DeleteEvent", "count": 121968}
{"type": "PullRequestReviewEvent", "count": 101306}
{"type": "IssuesEvent", "count": 75571}
{"type": "ForkEvent", "count": 62597}
{"type": "PullRequestReviewCommentEvent", "count": 52689}
{"type": "ReleaseEvent", "count": 23119}
{"type": "CommitCommentEvent", "count": 15134}
{"type": "PublicEvent", "count": 10244}
{"type": "MemberEvent", "count": 8375}
{"type": "GollumEvent", "count": 7443}
9780068352 bytes (9.108 GiB) scanned in 128.617996ms 74.3GiB/s

I picked the two queries above as examples deliberately. The first example performs substring search, which in principle could be accelerated by building a text search index like ElasticSearch would. The second example performs a simple aggregation, which in principle could be accelerated by materializing aggregate statistics in advance like a time-series database would. However, Sneller provides excellent performance on both of these queries without using any special index or materialization tricks; our query engine is simply really good at doing table scans. You can read more about the design of our bytecode interpreter here.

I also chose Github Archive data for the example above because it illustrates another benefit of being schemaless: lots of real-world JSON data cannot be easily described using traditional SQL DDL semantics. Generally, when engineers are stuck putting unstructured data in a SQL database, they have to use VARCHAR columns (or native JSON-typed columns) and use extension functions for parsing JSON in their queries. In this case, the payload field for records in the Github Archive data is context-dependent; its structure is largely dictated by the type field. The public Google BigQuery dataset for Github Archive data works around this issue by defining payload as a string, which means your queries need to parse the payload field as a JSON object at runtime using the query engine’s JSON-parsing functions. On Sneller, the payload field behaves just as any other structure would. You can simply write an expression like payload.pull_request.head.repo.name = '...' and get the “native” level of performance from the query engine without any additional JSON-parsing overhead.
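As a sketch of what that looks like in practice (the payload field names below follow the Github Archive event schema; adapt them to your own data), a query can dereference payload directly with no JSON-parsing functions in sight:

```sql
-- Count the commits pushed per repository by reading payload fields
-- natively; payload.size is the commit count inside a PushEvent payload.
SELECT repo.name, SUM(payload.size) AS commits
FROM read_file('1day.zion')
WHERE type = 'PushEvent'
GROUP BY repo.name
ORDER BY commits DESC
LIMIT 10
```

In BigQuery, the equivalent query would first have to run a JSON-extraction function over the stringly-typed payload column for every row.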

AWS CloudTrail data is another example of common JSON data that frustrates strictly-typed data-stores. For example, the requestParameters field in CloudTrail logs contains all the request fields as a structure, which means that all the possible query parameters across the surface area of the entire AWS API can be keys in this structure. AWS recommends ingesting the requestParameters field as a string when using AWS Athena. (The responseElements, additionalEventData, and serviceEventDetails fields are some additional structure-typed fields that the CloudTrail documentation recommends you treat as strings.) Conversely, Sneller lets you dereference elements in these structures as you would any other sub-structure in a record. Here’s a real query from one of our Grafana dashboards that joins AWS CloudTrail data with our VPC Flow Logs data in order to display network traffic by EC2 Instance ID:

-- gather instanceId and networkInterfaceId from every instance in every RunInstances event in the last 5 days:
WITH instance_info AS (SELECT item.instanceId, item.networkInterfaceSet.items[0].networkInterfaceId interface_id
                       FROM "cloudtrail" c, c.responseElements.instancesSet.items item
                       WHERE c.eventTime > DATE_ADD(DAY, -5, UTCNOW())
                         AND c.eventName = 'RunInstances')
-- display time bucket, instance, and total bytes:
SELECT time, inst.instanceId AS instance_id, SUM(fl.bytes) AS bytes
FROM vpcflowlogs fl JOIN instance_info inst ON inst.interface_id = fl.interface_id
WHERE fl.start >= `${__from:date}` AND fl.start <= `${__to:date}`
GROUP BY TIME_BUCKET(fl.start, ($__interval_ms)/1000) AS time, inst.instanceId

(The ${...} items are special placeholders that Grafana fills from the time intervals on our dashboards.)

On top of the improved flexibility, there are other operational benefits to avoiding a schema in your observability pipeline. Your logs or metrics won’t suddenly stop ingesting because someone accidentally deployed code that produces de-normalized data, and you won’t have to worry about scheduling downtime because you need to run an expensive ALTER TABLE operation as part of a migration. Migrations become particularly troublesome when the tables in question have grown to petabytes in size; it often isn’t practical to re-write that much data. Instead, Sneller SQL provides plenty of functionality for structuring your queries so that they can run correctly on heterogeneous records. (For example, you can use CASE to switch on the type of a field, or COALESCE to pick the first of a set of possible fields.)
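For example, here is a sketch (using the Github Archive table from earlier; the field choices are illustrative, not prescriptive) of how CASE and COALESCE let a single query handle records of several different shapes:

```sql
-- Pick an "item number" whose location depends on the record's type,
-- and fall back across possible actor-name fields with COALESCE:
SELECT type,
       CASE type
         WHEN 'IssuesEvent'      THEN payload.issue.number
         WHEN 'PullRequestEvent' THEN payload.pull_request.number
         ELSE NULL
       END AS item_number,
       COALESCE(actor.login, actor.display_login) AS actor_name
FROM read_file('1day.zion')
LIMIT 5
```

Records that lack one of the referenced fields simply produce NULL (or the next COALESCE alternative) instead of failing the query, so a schema change upstream degrades gracefully rather than breaking ingestion or queries.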

Try Sneller for Free

You can try Sneller right now on your own data for free through our playground.

If you’re a developer interested in the details of how Sneller works under the hood, you’re in luck: Sneller is open source software!