Find scripts on GitHub
You can also find all the scripts in the https://github.com/SnellerInc/examples/tree/master/terraform/cloud-onboarding repository.
Details on Terraform
The Terraform provider requires the token that has been created in the console.
This script uses the following variables:
- region specifies the AWS region where the cluster is deployed. Make sure to specify a region that Sneller supports (us-east-1 is a safe choice).
- sneller_token specifies the Sneller token that is used to authenticate with Sneller. If the token isn’t valid (or has expired), a new token can be created in the Sneller web console.
- prefix specifies a prefix that is used for all global resources. Some resources (i.e. S3 buckets) need globally unique names. If you don’t specify a prefix, a random 4-character prefix is generated instead.
- database/table specify the name of the example database and table that is created during deployment.
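If you prefer not to pass these variables on the command line, you can place them in a terraform.tfvars file. The values below are placeholders for illustration only (the token shown is not a real token):

```hcl
# terraform.tfvars - example values (placeholders, not real credentials)
region        = "us-east-1"
sneller_token = "<your-sneller-token>"
prefix        = "demo"
database      = "tutorial"
table         = "table1"
```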
- main.tf
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
    }
    sneller = {
      source = "snellerinc/sneller"
    }
  }
}
provider "aws" {
  region = var.region
}

provider "sneller" {
  api_endpoint   = "https://api-production.${var.region}.sneller.ai/"
  default_region = var.region
  token          = var.sneller_token
}
variable "region" {
  type        = string
  description = "AWS region"
  default     = "us-east-1"
}

variable "sneller_token" {
  type        = string
  description = "Sneller token"
}

variable "prefix" {
  type        = string
  description = "Prefix for all resources (required to make resources unique)"
  default     = "" # a 4 character random prefix will be used, when left empty
}

variable "database" {
  type        = string
  description = "Database name"
  default     = "tutorial"
}

variable "table" {
  type        = string
  description = "Table name"
  default     = "table1"
}
To ensure that we always have a prefix, we need some “magic” to generate a random prefix when none was set. Note that Terraform stores the random prefix in its state, so it won’t change between runs.
- prefix.tf
resource "random_string" "random_prefix" {
  length  = 4
  special = false
  numeric = false
  upper   = false
}

locals {
  prefix = var.prefix != "" ? "${var.prefix}-" : "${random_string.random_prefix.id}-"
}
Creating the S3 buckets
Sneller uses two kinds of buckets:
- Source buckets that hold the data that will be ingested. The data can either stay in these buckets or it can be removed after ingestion.
- Ingestion bucket that holds the data that has been ingested by Sneller. The query engine always uses this data, so make sure it isn’t deleted (it’s not a cache). You can always export data back to the original JSON format.
Source bucket
First we’ll create the source bucket and make sure public access is denied. In this example we’ll also add three (small) ND-JSON encoded files to the bucket, so we have some sample data to work with.
- s3-source.tf
locals {
  sneller_source_prefix = "sample_data/"
}

resource "aws_s3_bucket" "sneller_source" {
  bucket = "${local.prefix}sneller-source"
  tags = {
    Name = "Source bucket for Sneller"
  }
}

resource "aws_s3_bucket_public_access_block" "sneller_source" {
  bucket                  = aws_s3_bucket.sneller_source.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_s3_object" "sneller_source_data" {
  for_each = fileset(path.module, "${local.sneller_source_prefix}*")
  key      = each.key
  bucket   = aws_s3_bucket.sneller_source.id
  source   = each.key
}
These are the three ND-JSON encoded data files:
- sample_data/test1.ndjson
{"value":1}
- sample_data/test2.ndjson
{"value":2}
{"value":3}
- sample_data/test3.ndjson
{"value":4}
{"value":5}
{"value":6}
Later in this walkthrough, we’ll show you how to ingest your own (existing) data into Sneller.
Ingestion bucket
The ingestion bucket should also disallow public access. It holds the table definition files and the actual ingested data.
- s3-ingest.tf
resource "aws_s3_bucket" "sneller_ingest" {
  bucket        = "${local.prefix}sneller-ingest"
  force_destroy = true
  tags = {
    Name = "Ingest bucket for Sneller"
  }
}

resource "aws_s3_bucket_public_access_block" "sneller_ingest" {
  bucket                  = aws_s3_bucket.sneller_ingest.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
Set up the Sneller IAM role
Sneller Cloud doesn’t store your data. All data is persisted in your own account and we never persist it on our side (although it may be cached in RAM for performance reasons). We do need to read the source data and manage your ingestion bucket.
To allow Sneller to work in these buckets, a custom IAM role should be defined in your AWS account and Sneller should be allowed to assume that role. For each tenant, Sneller creates an internal IAM role in our account that is used to assume your role.
Sneller creates a per-tenant IAM role to deal with the confused-deputy problem. AWS has its own solution for this as well, called the external ID. Sneller also supports the external ID, and it is used in the next example.
The following script creates an IAM role that:
- Can be assumed by the internal Sneller IAM role when the correct external ID is passed.
- Allows read-only access in the source bucket.
- Allows read/write access in the ingestion bucket (only for objects starting with the db/ prefix).
In the script below, the tenant ID is passed as the external ID (the script also creates a 12-character random string that could be used as a unique external ID instead).
- iam-role.tf
# Global tenant information
data "sneller_tenant" "tenant" {}

resource "aws_iam_role" "sneller" {
  name               = "${local.prefix}sneller"
  assume_role_policy = data.aws_iam_policy_document.sneller_assume_role.json
}

data "aws_iam_policy_document" "sneller_assume_role" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "AWS"
      identifiers = [data.sneller_tenant.tenant.tenant_role_arn]
    }
    # Only assume role when the proper tenant ID is passed
    condition {
      test     = "StringEquals"
      variable = "sts:ExternalId"
      values   = [data.sneller_tenant.tenant.tenant_id]
    }
  }
}

resource "aws_iam_role_policy" "sneller_source" {
  role   = aws_iam_role.sneller.id
  name   = "source"
  policy = data.aws_iam_policy_document.sneller_source.json
}

resource "aws_iam_role_policy" "sneller_ingest" {
  role   = aws_iam_role.sneller.id
  name   = "ingest"
  policy = data.aws_iam_policy_document.sneller_ingest.json
}

data "aws_iam_policy_document" "sneller_source" {
  # Read access for the source bucket
  statement {
    actions   = ["s3:ListBucket"]
    resources = [aws_s3_bucket.sneller_source.arn]
  }
  statement {
    actions   = ["s3:GetObject"]
    resources = ["${aws_s3_bucket.sneller_source.arn}/*"]
  }
}

data "aws_iam_policy_document" "sneller_ingest" {
  # Read/Write access for the ingest bucket
  statement {
    actions   = ["s3:ListBucket"]
    resources = [aws_s3_bucket.sneller_ingest.arn]
    condition {
      test     = "StringLike"
      variable = "s3:prefix"
      values   = ["db/*"]
    }
  }
  statement {
    actions   = ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"]
    resources = ["${aws_s3_bucket.sneller_ingest.arn}/db/*"]
  }
}

resource "random_string" "external_id" {
  length = 12
}
Note that the sneller_tenant data-source is used to obtain the IAM role ARN that Sneller uses to assume your role.
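If you want to inspect the values that the sneller_tenant data-source returns, you can add output blocks for the two attributes used in the assume-role policy above. This is an optional sketch, not part of the original script:

```hcl
# Expose the tenant attributes used in the assume-role policy
output "tenant_role_arn" {
  value = data.sneller_tenant.tenant.tenant_role_arn
}

output "tenant_id" {
  value = data.sneller_tenant.tenant.tenant_id
}
```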
Register the bucket and IAM role with Sneller Cloud
Sneller needs to know where to look for the databases in your account and which IAM role to assume to access them. This can be configured using the sneller_tenant_region resource, which specifies both the ingestion bucket and the IAM role that was created in the previous step.
- sneller-tenant-region.tf
resource "sneller_tenant_region" "sneller" {
  # Make sure not to set the IAM role, before the
  # role has been granted access to the S3 bucket
  depends_on = [aws_iam_role_policy.sneller_ingest]

  bucket   = aws_s3_bucket.sneller_ingest.bucket
  role_arn = aws_iam_role.sneller.arn
}
Note that this resource depends on the IAM role policy, because Sneller validates access to the bucket when it is registered. Without this dependency, Terraform might register the bucket before the role has been granted access, causing the first apply to fail (a subsequent apply would probably succeed).
Register table
Sneller now knows where to look for table definitions, but we don’t have any tables yet. A table can be added using the sneller_table resource.
This is a very simple table, so we will just point it to the correct S3 source bucket and path pattern.
- sneller-table.tf
resource "sneller_table" "test" {
  depends_on = [sneller_tenant_region.sneller]

  # Enable this for production to avoid trashing your table
  # lifecycle { prevent_destroy = true }

  database = var.database
  table    = var.table

  inputs = [
    {
      pattern = "s3://${aws_s3_bucket.sneller_source.bucket}/${local.sneller_source_prefix}*.ndjson"
      format  = "json"
    }
  ]
}
Note that the table definition can only be saved when the ingestion bucket is known. That’s why there is a dependency on the sneller_tenant_region.sneller resource.
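The inputs list can hold more than one entry, so a single table can ingest data from several patterns. As a hypothetical sketch (the extra_data/ prefix and the second table name do not exist in this walkthrough):

```hcl
# Hypothetical sketch: a table that ingests from two source prefixes.
# The "extra_data/" prefix is an assumption for illustration only.
resource "sneller_table" "multi" {
  depends_on = [sneller_tenant_region.sneller]

  database = var.database
  table    = "table2"

  inputs = [
    {
      pattern = "s3://${aws_s3_bucket.sneller_source.bucket}/sample_data/*.ndjson"
      format  = "json"
    },
    {
      pattern = "s3://${aws_s3_bucket.sneller_source.bucket}/extra_data/*.ndjson"
      format  = "json"
    }
  ]
}
```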
Setting up S3 event notifications
Source bucket
You can now query the table, but if you add new files to the source bucket, they won’t show up in the results. That’s because Sneller only ingests new data when it is asked to do so. The open-source version of Sneller uses sdb, which can scan the source bucket again and ingest the new files. This works fine, but it can result in higher latency when your source bucket contains a lot of files. Scanning the bucket isn’t free either, so it may be too slow and too expensive when there is a lot of data.
That’s why Sneller Cloud also supports an event-based method that relies on S3 event notifications. With the proper configuration, S3 will send a message to a queue whenever new data is written to the source bucket. Sneller provides an SQS queue, so the only thing you need to do is to set up the S3 event notifications.
- s3-source-events.tf
resource "aws_s3_bucket_notification" "sneller_source" {
  bucket = aws_s3_bucket.sneller_source.id

  queue {
    id        = "sneller-source"
    queue_arn = sneller_tenant_region.sneller.sqs_arn
    events    = ["s3:ObjectCreated:*"]
  }
}
Ingest bucket
Changes can be made to the definition.json file, either via Terraform or directly in S3. Sneller re-reads the table definitions every 5 minutes, but waiting that long can be annoying. That’s why we also add S3 event notifications on the ingest bucket. All updates to definition.json are sent to the ingestion queue, which invalidates the cached definitions right away.
- s3-ingest-events.tf
resource "aws_s3_bucket_notification" "sneller_ingest" {
  bucket = aws_s3_bucket.sneller_ingest.id

  queue {
    id            = "config-updates"
    queue_arn     = sneller_tenant_region.sneller.sqs_arn
    events        = ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"]
    filter_suffix = ".json"
  }
}