The Hard Way
“The Hard Way” isn’t the fastest way to install Sneller, but it is the way to go if you really want to understand how Sneller works. This series is intended for developers and operators who want to learn the details of Sneller, starting small on a local server or laptop and ending with a production-grade multi-node cluster spun up via Kubernetes and running on top of object storage.
Introduction
It’s best to learn Sneller from the ground up. Sneller is a full-featured petabyte-scale database service, but we have tried to keep it as simple as possible. This series guides you through each aspect step by step.
IMPORTANT: Your computer must support AVX-512, which requires at least an Intel Skylake-class processor, to run these queries. If you can’t run Sneller on your own machine, run these experiments on an AWS, Azure or GCP virtual machine (VM).
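On Linux you can check this up front by looking at the CPU flags; if the command below prints nothing, the machine lacks AVX-512 support:
grep -o 'avx512[a-z0-9]*' /proc/cpuinfo | sort -u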
Introducing SDB
JSON data is everywhere nowadays and data lakes are growing day by day. JSON is a great fit for a lot of data, because it supports nested data and is schemaless. Typical SQL databases struggle with JSON, and many proprietary extensions have been created to deal with this. Sneller is based on PartiQL, which adds standardized extensions to SQL to handle nested data natively.
Although JSON is great, the text-based format isn’t ideal for fast processing. That’s why Sneller requires JSON to be ingested before it can be queried. We use a custom format based on binary ION, with compression to save space.
Sneller ships with a tool called sdb (short for Sneller DataBase) that can be installed using:
go install github.com/SnellerInc/sneller/cmd/sdb@latest
This tool can be used to convert JSON to Sneller’s on-disk format and run queries.
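If your shell can’t find sdb after the install, Go’s bin directory is probably missing from your PATH; assuming a standard Go setup, this adds it for the current session:
export PATH="$(go env GOPATH)/bin:$PATH"
command -v sdb   # should now print the path to the sdb binary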
Static data
Let’s download some test data, ingest it, and run a query:
wget https://data.gharchive.org/2015-01-01-15.json.gz
sdb pack -o 2015-01-01-15.zion 2015-01-01-15.json.gz
sdb query -fmt json "SELECT COUNT(*) FROM read_file('2015-01-01-15.zion')"
Congratulations, you just ran your first Sneller query. This query only returns the number of rows in the dataset, so let’s make it a little more interesting and determine the top 10 repositories in this dataset:
sdb query -fmt json "SELECT repo.name, COUNT(*) FROM read_file('2015-01-01-15.zion') GROUP BY repo.name ORDER BY COUNT(*) DESC LIMIT 10"
Sneller’s SQL allows you to access the name field inside the repo object directly, without any nasty syntax or hard-to-remember functions. It treats nested data as a first-class citizen.
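The same dotted-path access works on the actor object in these events, so finding the most active users is just as easy (actor.login is a field in the GH Archive data, not something Sneller-specific):
sdb query -fmt json "SELECT actor.login, COUNT(*) FROM read_file('2015-01-01-15.zion') GROUP BY actor.login ORDER BY COUNT(*) DESC LIMIT 10"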
Suppose you get some more data and you would like to query that second file too. That can be done by converting the second file and querying both files at once:
wget https://data.gharchive.org/2015-01-01-16.json.gz
sdb pack -o 2015-01-01-16.zion 2015-01-01-16.json.gz
sdb query -fmt json "SELECT COUNT(*) FROM read_file('2015-01-01-15.zion') ++ read_file('2015-01-01-16.zion')"
Although this works, it doesn’t scale well to a lot of files. Directly querying the packed files can be useful for analyzing a static dataset, but it isn’t particularly useful in a more dynamic environment.
Dynamic data
The previous example showed how to deal with a static dataset, but often datasets are dynamic and new data arrives at regular intervals. For this purpose sdb has the sync command to automatically ingest data that has been added since the last sync.
The sync command expects to find a definition.json file (more details) that lists the files that should be ingested. We will create such a table definition and synchronize.
Note that Sneller expects all databases to be stored in the db folder relative to the root folder. The name of the database is tutorial and the table name is table. That’s why the definition.json should be in that folder.
mkdir -p db/tutorial/table
cat > db/tutorial/table/definition.json <<EOF
{
  "input": [
    { "pattern": "file://*.json.gz" }
  ]
}
EOF
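A stray comma or quote in definition.json is an easy mistake to make; if you have jq installed, a quick syntax check looks like this:
jq . db/tutorial/table/definition.json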
We will now download two files from the GitHub Archive and ingest them into this table:
wget "https://data.gharchive.org/2015-01-01-15.json.gz"
wget "https://data.gharchive.org/2015-01-01-16.json.gz"
Sneller maintains an index of all files that have been ingested. This index contains hashes of the ingested data to ensure integrity. The index file is protected using an index key, so we need to generate a 256-bit key and store it as a base64-encoded string in the SNELLER_INDEX_KEY environment variable:
export SNELLER_INDEX_KEY=$(dd if=/dev/urandom bs=32 count=1 | base64)
echo "Using index-key: $SNELLER_INDEX_KEY"
The index-key is automatically picked up by sdb from this environment variable, so make sure you use the proper name. Write down the index-key, because if you lose it, all data has to be ingested again. Note that you can unpack the individual packed files even without the index key. The key is only used for signing and doesn’t actually encrypt the data.1
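For this tutorial, the simplest way to “write it down” is to keep the key in a file with restrictive permissions (this is only a local convenience; in real deployments the key belongs in a proper secret store):
echo "$SNELLER_INDEX_KEY" > sneller-index-key.b64
chmod 600 sneller-index-key.b64
# later, in a fresh shell:
export SNELLER_INDEX_KEY=$(cat sneller-index-key.b64)
With the key in place, run the sync: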
sdb -root . sync tutorial table
The data is ingested and you should see some additional files in the db/tutorial/table folder:
ls -l db/tutorial/table
It shows both an index file and a packed-XXXXXXXXXXXXXXXXXXXXXXXXXX.zion file. The index file is the actual directory of all ingested files and the resulting packed files. The packed-XXX.zion file holds the actual ingested data.
Now that the data is ingested, the source files can be removed. If you need to get the JSON data back, you can simply run:
sdb unpack -fmt json db/tutorial/table/packed-*.zion
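For a quick spot check you don’t need the full dump; assuming jq is installed and the unpacked output is one JSON record per line, this pretty-prints just the first record:
sdb unpack -fmt json db/tutorial/table/packed-*.zion | head -n 1 | jq .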
To determine whether a file has been ingested, you can also ask sdb for the list of ingested files:
sdb -root . inputs tutorial table
Now that the data has been ingested, it can be queried through the table:
sdb -root . query -fmt json "SELECT COUNT(*) FROM tutorial.table"
Download some more files from the GitHub archive and ingest again:
wget https://data.gharchive.org/2015-01-01-{17..22}.json.gz
sdb -root . sync tutorial table
sdb -root . query -fmt json "SELECT COUNT(*) FROM tutorial.table"
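With several hours of data in the table, the top-10 query from earlier now runs over everything that has been ingested:
sdb -root . query -fmt json "SELECT repo.name, COUNT(*) FROM tutorial.table GROUP BY repo.name ORDER BY COUNT(*) DESC LIMIT 10"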
You can either choose to trigger this synchronization when new data arrives or run it at regular intervals.
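For example, a cron entry along these lines (the path is a placeholder for wherever your Sneller root lives) would re-run the sync every 15 minutes:
# m h dom mon dow  command
*/15 * * * * cd /path/to/sneller-root && sdb -root . sync tutorial table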
Next…
In this chapter you learned the fundamentals of Sneller, but it’s all running on a single machine and using local storage. This may be good enough for some simple scenarios, but it scales poorly and local disk storage may not be the most robust option. We’ll deal with that in part 2 of this series.
1. The index-key should be treated as a secret. In this local example we can’t do much to protect it, but in more advanced scenarios the key shouldn’t be exposed.