by Frank Wessels | May 3, 2023
Querying terabytes of JSON per second
In this blog post we will be showing that Sneller is capable of querying terabytes of JSON per second on a medium-sized compute cluster.
By leveraging AVX-512 instructions, Sneller is able to process many records in parallel on a single core, giving it a massive performance advantage over competing solutions.
Sneller supports JSON data natively without the need to define any schemas, and it supports enormous datasets spanning from terabytes (TBs) to petabytes (PBs) by leveraging object storage like S3.
GitHub archive dataset
For testing we will be using the GitHub archive dataset. For the year 2021 it contains over 1 billion records, and our test dataset includes the full `payload` field for each event. Note that the structure of the `payload` field differs per event type and is fairly complex, containing 100+ fields.
The dataset is stored on S3 in compressed form, measures about 390 GB, and is split across several hundred objects.
As you can see in the list of queries below, Sneller makes it intuitive to work with JSON and natively supports “path expressions” for dereferencing nested fields. For example, `payload.comment.body` references the `body` field inside the `comment` field inside the `payload` field (for “IssueCommentEvent” events).
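Conceptually, a path expression just walks nested JSON fields. The following Python sketch (illustrative only, not Sneller's implementation; the sample record is hypothetical) shows what dereferencing `payload.comment.body` amounts to:

```python
import json

# Hypothetical GitHub event record, trimmed to the fields relevant here.
record = json.loads("""
{
  "type": "IssueCommentEvent",
  "payload": {
    "comment": { "body": "Looks good to me!" }
  }
}
""")

def deref(obj, path):
    """Follow a dotted path such as 'payload.comment.body',
    returning None if any intermediate field is missing."""
    for field in path.split("."):
        if not isinstance(obj, dict) or field not in obj:
            return None
        obj = obj[field]
    return obj

print(deref(record, "payload.comment.body"))  # Looks good to me!
print(deref(record, "payload.issue.title"))   # None (field absent)
```

Because missing fields simply yield "missing" rather than an error, heterogeneous records with different `payload` shapes can be queried without any schema.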
But before we get started, let’s query the total size and number of records in the test dataset:
curl -i -G 'https://play.sneller.ai/query?database=demo&json' \
--data-urlencode 'query=SELECT COUNT(*) FROM gha'
HTTP/2 200
date: Fri, 05 May 2023 16:50:04 GMT
...
x-sneller-max-scanned-bytes: 3193248415744
...
{ "count": 1020957952 }
We can see that the total size as reported by `x-sneller-max-scanned-bytes` is 3,193,248,415,744 bytes, or 2.9 TiB. Dividing by the 1,020,957,952 records gives an average of about 3,127 bytes per record.
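These figures can be checked with a little arithmetic (plain Python, values copied from the response above):

```python
# Check the reported figures with a little arithmetic.
scanned_bytes = 3_193_248_415_744   # x-sneller-max-scanned-bytes
records = 1_020_957_952             # COUNT(*) result

print(f"{scanned_bytes / 2**40:.1f} TiB")          # 2.9 TiB
print(f"{scanned_bytes // records} bytes/record")  # 3127 bytes/record
```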
Sneller makes it easy to do free “dry runs” to get insights into the behavior of your queries.
Queries
The following benchmarks were performed on Sneller’s public playground, which is scaled to run on 640 vCPUs (320 physical cores with 2 threads per core). Note that these are non-trivial queries that scan the full dataset, giving a balanced overview of the performance.
Let’s run the first query “by hand” via `curl` and measure the execution time with `time` (note: you can copy-paste the command into your terminal to test this yourself):
time curl -G 'https://play.sneller.ai/query?database=demo&json' \
--data-urlencode $'query=SELECT type, COUNT(*) FROM gha GROUP BY type ORDER BY COUNT(*) DESC'
{"count": 501471103, "type": "PushEvent"}
...
{"count": 2770538, "type": "GollumEvent"}
real 0m1.498s
So this query took 1.498 seconds. The table below shows the results of some other queries:
| Query | Took (sec) | TiB/sec | GiB/sec/vCPU |
|---|---|---|---|
| Overview of all event types [1] | 1.498 | 2.0 | 3.1 |
| Find Linus Torvalds’ repos [2] | 2.419 | 1.2 | 1.9 |
| Date histogram of events [3] | 1.914 | 1.6 | 2.4 |
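The per-vCPU column follows directly from the elapsed times, the 2.9 TiB dataset size, and the 640-vCPU cluster size quoted above (plain Python, just dividing bytes by seconds and cores):

```python
# Reproduce the per-vCPU throughput column: bytes scanned / wall time / vCPUs.
scanned_bytes = 3_193_248_415_744   # full dataset (x-sneller-max-scanned-bytes)
vcpus = 640

queries = {
    "Overview of all event types": 1.498,
    "Find Linus Torvalds' repos": 2.419,
    "Date histogram of events": 1.914,
}

for name, took in queries.items():
    gib_per_vcpu = scanned_bytes / 2**30 / took / vcpus
    print(f"{name}: {gib_per_vcpu:.1f} GiB/sec/vCPU")
# Prints 3.1, 1.9, and 2.4 GiB/sec/vCPU, matching the table.
```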
As you can see, the query performance is consistently well over 1 TiB/sec.
The last column of the table shows the amount of data that was scanned per second per core, and this is about 2 GiB/sec/vCPU or more.
Affordable performance
Sneller Cloud is priced at $50/PB scanned, which means each of the queries above would cost just $0.14.
This compares very favorably to alternative solutions such as BigQuery and Athena that are priced at $5 per terabyte and do not offer the same level of performance.
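The per-query cost follows from the scanned-bytes figure. A quick check (assuming binary prefixes, i.e. billing per PiB = 2^50 bytes, which reproduces the $0.14 figure above):

```python
# Cost of one full-dataset scan at $50 per petabyte scanned.
# Assumption: binary prefixes (1 PiB = 2**50 bytes), which reproduces
# the $0.14 figure quoted above.
scanned_bytes = 3_193_248_415_744
price_per_pib = 50.0

cost = scanned_bytes / 2**50 * price_per_pib
print(f"${cost:.2f}")   # $0.14
```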
Partitioned queries
Here are the measurements of some more advanced queries. These focus on a specific type of event and also touch the (large) payload
field for event-specific additional information.
| Query | Took (sec) | Scanned (TiB) |
|---|---|---|
| Search issues using regexp [4] | 1.781 | 0.33 |
| PRs taking over 180 days [5] | 4.594 | 1.45 |
Sneller still maintains an “interactive” level of performance on this large dataset due to Sneller’s AVX-512 assembly acceleration.
Try for yourself
You can repeat all these experiments (or tweak the queries, e.g. look for your own repos during 2021…) by either:

- copy-pasting the `curl` commands into your terminal, or
- copy-pasting the queries below into Sneller’s online playground to try them out in your browser (yes, even on your mobile phone!)
More background on Sneller
Sneller is a high-performance SQL engine built to analyze petabyte-scale unstructured logs and other event data. Here are a few major differentiators between Sneller and other SQL solutions:
- Sneller is designed to use cloud object storage as its only backing store.
- Sneller’s SQL VM is implemented in AVX-512 assembly. Medium-sized compute clusters provide throughput in excess of terabytes per second.
- Sneller is completely schemaless. No more ETL-ing your data! Heterogeneous JSON data can be ingested directly.
- Sneller uses a hybrid approach between columnar and row-oriented data layouts to provide lightweight ingest, low storage footprint, and super fast scanning speeds.
Explore further
There are three easy ways to get started with Sneller:
- Open source
- Online playground
- Sneller Cloud
or check out our documentation.
Appendix: queries
1. `SELECT type, COUNT(*) FROM gha GROUP BY type ORDER BY COUNT(*) DESC`
2. `SELECT DISTINCT repo.name FROM gha WHERE repo.name LIKE 'torvalds/%'`
3. `SELECT dayOfWeek, type, COUNT(*) FROM gha GROUP BY EXTRACT(DOW FROM created_at) AS dayOfWeek, type ORDER BY dayOfWeek, COUNT(*) DESC`
4. `SELECT payload.comment.body FROM gha WHERE type = 'IssueCommentEvent' AND payload.comment.body ~ 'Sherlock [A-Z]\\w+'`
5. `SELECT COUNT(*) FROM gha WHERE type = 'PullRequestEvent' AND DATE_DIFF(DAY, payload.pull_request.created_at, created_at) >= 180`
Try Sneller for Free
You can try Sneller right now on your own data for free through our playground.
If you’re a developer interested in the details of how Sneller works under the hood, you’re in luck: Sneller is open source software!