Introduction
In the previous part of the walkthrough we learned how to use Sneller on a single computer with local storage. This walkthrough shows how to store the data in object storage (e.g. AWS S3) to ensure that the data is stored reliably and is highly available.
IMPORTANT: If you follow this walkthrough, please make sure you use exactly the same names for the environment variables. Both the AWS CLI and sdb use these variables to provide proper defaults.
Prerequisites
In this example we will use Minio to mimic AWS S3, so it can still be run on a single instance. If you prefer to use AWS S3, then you can skip the installation of Minio and use your AWS credentials to access your S3 bucket. Google Cloud Storage is also supported in S3 interoperability mode.
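If you go with AWS S3 directly, a minimal sketch of the equivalent setup looks like the following; the region and endpoint URL are examples and the placeholder credentials must be replaced with your own:
export AWS_REGION="us-east-1"
export AWS_ACCESS_KEY_ID="<your-access-key-id>"
export AWS_SECRET_ACCESS_KEY="<your-secret-access-key>"
export S3_ENDPOINT="https://s3.us-east-1.amazonaws.com"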
Install AWS CLI
The AWS CLI is used to run the commands. Technically it’s possible to invoke these commands via Docker, but it’s much more convenient to have the AWS CLI available on your system. Refer to the AWS documentation on how to install the AWS CLI.
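A quick way to check that the installation succeeded:
aws --version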
Install Minio
In this walkthrough we will deploy Minio using Docker, so make sure Docker is installed on your computer. Then run the following commands to generate random credentials and start Minio.
export AWS_REGION="us-east-1"
export AWS_ACCESS_KEY_ID=$(cat /dev/urandom | tr -dc '[:alpha:]' | fold -w 20 | head -n 1)
export AWS_SECRET_ACCESS_KEY=$(cat /dev/urandom | tr -dc '[:alpha:]' | fold -w 20 | head -n 1)
export S3_ENDPOINT=http://localhost:9000
docker pull quay.io/minio/minio # pull latest version
docker run -d \
  -e MINIO_ROOT_USER=$AWS_ACCESS_KEY_ID \
  -e MINIO_ROOT_PASSWORD=$AWS_SECRET_ACCESS_KEY \
  -e MINIO_REGION=$AWS_REGION \
  -p 9000:9000 -p 9001:9001 \
  quay.io/minio/minio server /data --console-address ":9001"
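To verify that Minio started correctly, you can probe its liveness endpoint (assuming the default MinIO health-check path):
curl -sf $S3_ENDPOINT/minio/health/live && echo "Minio is up and running"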
Let’s go
First we will need to create a bucket that holds all the Sneller data. In this example we will store both the source and ingested data in the same bucket, but you are free to store source data in another bucket.
export SNELLER_BUCKET=s3://sneller-test
aws s3 --endpoint $S3_ENDPOINT mb $SNELLER_BUCKET
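You can verify that the bucket exists by listing all buckets:
aws s3 --endpoint $S3_ENDPOINT ls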
We’ll use the same data as in the first walkthrough, so first we will download two
hours of the GitHub archive data and upload it to the S3 bucket in the source
folder:
wget https://data.gharchive.org/2015-01-01-{15..16}.json.gz
aws s3 --endpoint $S3_ENDPOINT cp 2015-01-01-15.json.gz $SNELLER_BUCKET/source/
aws s3 --endpoint $S3_ENDPOINT cp 2015-01-01-16.json.gz $SNELLER_BUCKET/source/
aws s3 --endpoint $S3_ENDPOINT ls $SNELLER_BUCKET/source/
Now we need to create the definition.json
file that is appropriate for this
configuration:
cat > definition.json <<EOF
{
  "input": [
    { "pattern": "$SNELLER_BUCKET/source/*.json.gz" }
  ]
}
EOF
aws s3 --endpoint $S3_ENDPOINT cp definition.json $SNELLER_BUCKET/db/tutorial/table/
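If you want to double-check what was uploaded, you can stream the definition back from the bucket (using - as the destination writes the object to stdout):
aws s3 --endpoint $S3_ENDPOINT cp $SNELLER_BUCKET/db/tutorial/table/definition.json -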
Sneller maintains an index of all files that have been ingested. This index
contains hashes of the ingested data to ensure integrity. This index file
is protected using an index key, so we need to generate a 256-bit key and
store it as a base-64 encoded string in the SNELLER_INDEX_KEY
environment
variable:
export SNELLER_INDEX_KEY=$(dd if=/dev/urandom bs=32 count=1 | base64)
echo "Using index-key: $SNELLER_INDEX_KEY"
Now everything is set up to ingest the data. Note that sdb uses the environment variables as defaults: based on the S3_ENDPOINT and SNELLER_BUCKET variables it knows where to find the table definition, and it uses the AWS_xxx variables to access the object storage.
sdb sync tutorial table
You can check the ingested data by running the following command:
aws s3 --endpoint $S3_ENDPOINT ls $SNELLER_BUCKET/db/tutorial/table/
As you can see, an index file has been created, along with a packed file that holds the ingested data.
All data has been ingested, so you can now start to run queries on the data in object storage:
sdb query -fmt json "SELECT COUNT(*) FROM tutorial.table"
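You are not limited to simple counts. For example, assuming the GitHub archive events carry a type field (as used in the first walkthrough), you can count the events per type:
sdb query -fmt json "SELECT type, COUNT(*) FROM tutorial.table GROUP BY type"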
When new data arrives, you can run sdb sync tutorial table again to ingest the new data.
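For example, you could download the next hour of the GitHub archive (the file name is assumed to follow the same pattern as above), upload it to the source folder and re-run the sync:
wget https://data.gharchive.org/2015-01-01-17.json.gz
aws s3 --endpoint $S3_ENDPOINT cp 2015-01-01-17.json.gz $SNELLER_BUCKET/source/
sdb sync tutorial table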
Next…
In this walkthrough you learned how to store the data in S3 object storage instead of local storage. Although this increases the availability of your data, the query engine itself is still limited to a single node. In part 3 we will show how to run the Sneller daemon and ensure that the query engine itself is also scalable and highly available.