by Phil Hofer | March 27, 2023
Many modern data storage systems support transparent compression of your data. Although the most obvious reason to compress your data is to reduce its footprint in storage, data compression can also increase the end-to-end performance of your system once you take into account the available CPU, networking, and storage resources. Concretely, a query engine like Sneller that uses object storage as its primary storage back-end is frequently going to have to move data across a network, and network bandwidth is often a limiting factor for overall system performance.
As an example, imagine you are downloading a zstd-compressed file from S3 object storage. If we try to max out our available bandwidth by renting a 200Gbps-capable c6in.32xlarge instance, we can download data from S3 at a maximum rate of 200 Gbps, or 23.28 GB/s. But a c6in.32xlarge machine has 128 CPU cores, and we can decompress zstd data at over 1 GB/s/core (in decompressed bytes). That means in theory we can turn the 23.28 GB/s downloaded stream into 128 GB/s or more of decompressed data, assuming we’ve achieved a compression ratio of at least 5.49 (128 / 23.28).
This math extends to reads from local storage, too. If you can read 4GB/s of sequential
compressed data from an NVMe disk, and you can decompress the data blocks on multiple CPU cores,
you can likely multiply your realized read throughput by a nice constant factor.
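To put concrete numbers on that, here’s a quick back-of-the-envelope calculation in Go using the figures from the c6in.32xlarge example above; the constants are the quoted figures, not measurements:

```go
package main

import "fmt"

func main() {
	// Figures quoted above for a c6in.32xlarge instance (illustrative, not measured).
	const (
		networkGBps      = 23.28 // 200 Gbps of S3 bandwidth expressed in GB/s
		cores            = 128.0 // CPU cores on the instance
		decompGBsPerCore = 1.0   // zstd decompression speed in decompressed GB/s per core
	)

	// Upper bound on decompressed output if every core is busy decompressing.
	cpuCeiling := cores * decompGBsPerCore // 128 GB/s

	// Compression ratio needed for the 23.28 GB/s download to feed that ceiling.
	ratioNeeded := cpuCeiling / networkGBps // ≈ 5.49

	fmt.Printf("decompression ceiling: %.0f GB/s\n", cpuCeiling)
	fmt.Printf("compression ratio needed: %.2f\n", ratioNeeded)
}
```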
Given the I/O advantages of compressing your data, it’s not surprising that most columnar databases
(and columnar storage formats) have some kind of native support for data compression.
Strictly-typed columnar databases typically apply compression after striping your data into columns. Consequently, a clever query engine can determine statically which columns need to be fetched in order to evaluate a particular query, and thus it only has to bear the cost of decompressing some of the data. Of course, if you run a query like SELECT * FROM my_table, you’ll have to decompress every column.
Since Sneller’s query engine is fundamentally row-oriented, and since Sneller supports
arbitrary heterogeneous rows, we cannot employ exactly the same tricks
as a “pure” columnar storage format. (Consider: Sneller allows users to provide entirely
disjoint sets of fields in adjacent rows. One thousand rows with ten fields each that are
unique to the row would imply that there are ten thousand “columns” just for those one thousand rows!)
Moreover, decompression speed is critical to our end-to-end performance, as our SQL virtual machine can process more than 4GB/s/core of raw ion data, whereas zstd decompression typically runs at only about 1GB/s/core.
In order to provide some of the performance benefits of compressed columnar storage, we use a technique we call “compression tiling.” Each block of data (a group of rows) has its top-level fields hashed into one of sixteen buckets, and each of the sixteen buckets of data is compressed separately. We call this compression format zion, which is short for “zipped ion.”
Each zion “bucket” encodes both the field label (as an ion symbol) and the field value for each assigned field in each record. Prepended to the sixteen compressed buckets of data is a compressed “shape” bitstream that describes how to traverse the buckets to reconstruct the original ion records.
At query execution time, we can elide reconstruction of all the fields of each record that are not semantically important for the query,
which means that we can achieve up to a 16x reduction in the amount of time we spend decompressing data.
Right now we use the zstd general-purpose compression algorithm for compressing all the buckets and the “shape” bitstream, but this encoding technique is agnostic to the compression algorithm(s) used, so we may mix and match compression algorithms in the future.
As an optimization, the hash function that we use for assigning top-level fields to buckets has an adjustable seed that is encoded as part of the bitstream. Before actually splitting the rows into buckets, the encoder picks the seed that yields the lowest-variance distribution of decompressed sizes across the sixteen buckets. As a consequence, we do not typically have issues with the buckets becoming seriously unbalanced.
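To give a rough idea of what the bucketing and seed selection look like, here’s a small sketch in Go; the hash function and the brute-force seed search are illustrative stand-ins, not the actual routines zion uses:

```go
package zionsketch // hypothetical package name for these sketches

import "math"

// bucketOf maps an ion symbol ID to one of sixteen buckets using a seeded
// multiplicative hash. (Illustrative only; zion's real hash function differs.)
func bucketOf(symbol, seed uint32) int {
	h := (symbol ^ seed) * 0x9e3779b1
	return int(h >> 28) // keep the top four bits: a bucket in 0..15
}

// pickSeed tries a handful of candidate seeds and keeps whichever spreads
// the per-symbol byte counts most evenly across the sixteen buckets,
// mirroring the "lowest-variance" criterion described above.
func pickSeed(bytesPerSymbol map[uint32]int) uint32 {
	bestSeed, bestVar := uint32(0), math.MaxFloat64
	for seed := uint32(0); seed < 16; seed++ {
		var sizes [16]float64
		for sym, n := range bytesPerSymbol {
			sizes[bucketOf(sym, seed)] += float64(n)
		}
		var mean, variance float64
		for _, s := range sizes {
			mean += s / 16
		}
		for _, s := range sizes {
			variance += (s - mean) * (s - mean)
		}
		if variance < bestVar {
			bestSeed, bestVar = seed, variance
		}
	}
	return bestSeed
}
```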
An Example
Let’s consider how we would encode the following ion structure:
{ my_string: "hello", my_number: 3, my_bool: false }
If this record were the first record in a stream (i.e. if the symbol table was zero-initialized), then we’d probably have the ion symbols 0x0a, 0x0b, and 0x0c assigned to my_string, my_number, and my_bool, respectively, and for simplicity let’s assume those symbols end up being hashed to buckets 5, 3, and 1.
Ordinarily, the binary ion encoding of the structure above (in hex) would be dc8a8568656c6c6f8b21038c10 (the 0xdc header marks a struct with twelve bytes of content).
Instead, we’d write 8a8568656c6c6f into bucket 5 (the ion field 0xa plus the ion string 'hello'), then 8b2103 into bucket 3 (the ion field 0xb plus the ion integer 3), and finally 8c10 into bucket 1 (the ion field 0xc plus the ion boolean false). In the shape bitstream, we’d write 033501. The first byte 0x03 indicates that we encoded a row with three fields. The next two bytes encode the buckets as individual nibbles, lsb-first. (We always round the encoded size of an individual “shape” up to a whole number of bytes, so a three-field structure is encoded as two bytes rather than one-and-a-half bytes.)
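The nibble packing for that shape entry is simple enough to show directly. This sketch (continuing the illustrative Go from above) reproduces the 033501 bytes from the example; the real shape stream carries additional framing, so treat it as a sketch only:

```go
// encodeShape packs one record's bucket assignments as a field count byte
// followed by the bucket indices as nibbles, low nibble first, padded to a
// whole number of bytes. encodeShape([]int{5, 3, 1}) returns 0x03 0x35 0x01,
// matching the example above. (Sketch only; the real format differs in detail.)
func encodeShape(buckets []int) []byte {
	out := []byte{byte(len(buckets))}
	for i := 0; i < len(buckets); i += 2 {
		b := byte(buckets[i]) & 0x0f // first bucket of the pair goes in the low nibble
		if i+1 < len(buckets) {
			b |= byte(buckets[i+1]) << 4 // second bucket goes in the high nibble
		}
		out = append(out, b)
	}
	return out
}
```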
We’d repeat the process above for every new row of ion data in the block, and then compress the shape bitstream and the buckets and concatenate them to form a complete compressed zion block. We also include the ion symbol table in the shape bitstream, since both the shape bitstream and the symbol table need to be decompressed in order to unpack even a single structure field.
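Putting those pieces together, the block structure described above can be summarized with a small conceptual type; the real on-disk framing, length prefixes, and compression headers differ in detail:

```go
// zionBlockSketch is a conceptual view of a compressed zion block as described
// above; the literal on-disk layout is more involved.
type zionBlockSketch struct {
	// Compressed "shape" stream: the ion symbol table plus, for each record,
	// a field count and the packed bucket nibbles for its fields.
	shape []byte
	// Sixteen independently compressed buckets, each holding the label+value
	// bytes of the fields hashed into it.
	buckets [16][]byte
}
```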
A careful reader may have noticed that we’re increasing the amount of decompressed data that needs to be emitted in order to describe each record, since we have to encode not only the entirety of each record’s contents distributed across the buckets, but also the shape bitstream. Fortunately, this ends up being an improvement in practice: the shape bitstream is typically highly compressible, because adjacent structures tend to have identical or nearly-identical “shapes.” Moreover, the individual buckets tend to be more compressible than the raw ion encoding, because there tends to be less entropy (i.e. fewer discrete ion types, fewer field names) in an individual bucket than there would be in the regular ion stream. In practice, zion-compressed ion data with zstd-compressed buckets tends to be about 10% smaller than simply wrapping the ion data with zstd compression naïvely.
Decoding a zion block is just a matter of running all the encoding steps above in reverse. After decompressing the shape bitstream and symbol table, we can map any requested fields (e.g. my_string) to symbol IDs, and we can hash those to determine which bucket(s) need to be decompressed. Then, we iterate the shape bitstream one item at a time and emit an ion structure composed of the field/value pairs encoded in each of the bucket(s) that we decompressed, taking care to omit any fields that we aren’t interested in reconstructing. If we decompress all the buckets and copy out each field/value pair unconditionally, we end up with a result that is bit-identical to the original input ion data. If we only need to produce one field, then we only have to decompress one bucket, and consequently we do (approximately) 1/16th of the decompression work we would otherwise have to do if we just encoded the rows naïvely.
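The bucket-selection step on the read path looks roughly like the following, reusing the illustrative bucketOf hash from the earlier sketch; the symtab map and the bitmask are assumptions made for the sketch, not the engine’s real data structures:

```go
// bucketsFor computes a bitmask of the buckets that must be decompressed in
// order to materialize the requested top-level fields. symtab maps field names
// to ion symbol IDs (recovered from the block's symbol table), and seed is the
// hash seed the encoder stored alongside the shape stream.
func bucketsFor(fields []string, symtab map[string]uint32, seed uint32) uint16 {
	var mask uint16
	for _, f := range fields {
		if sym, ok := symtab[f]; ok {
			mask |= 1 << bucketOf(sym, seed) // same seeded hash the encoder used
		}
	}
	return mask
}
```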
Our SQL engine is quite sensitive to the performance of the “reconstruction” process for ion data from zion blocks, as it is in the critical path from compressed data to ion row data that can be fed to our SQL virtual machine. The implementation of zion.Decoder.Decode uses one of a handful of assembly routines to reconstruct the ion data based on the number of fields that need to be reconstructed. We’ve managed to make this reassembly process quite fast (many GB/s/core) in practice.
Learn More
You can read the source code for the zion format on GitHub.
Try Sneller for Free
You can try Sneller right now on your own data for free through our playground.
If you’re a developer interested in the details of how Sneller works under the hood, you’re in luck: Sneller is open source software!