Prerequisites
Before ingesting AWS logging with these scripts, make sure that:
- You already have credentials to access your AWS account.
- You have registered with Sneller Cloud and have a bearer token.
- You have set up Sneller with a proper ingestion bucket and IAM roles.
If you followed the cloud onboarding, then you should be fine. You can download the Terraform scripts as follows:
git clone https://github.com/snellerinc/examples
cd examples/terraform/ingest-aws-logging
All examples are written for Linux and should also work on macOS. They have also been tested in WSL2 (Windows Subsystem for Linux).
Summary
The scripts have been written to work with the onboarding scripts, so you should be able to run them like this:
export TF_VAR_sneller_token=<your-bearer-token>
terraform init # only needed once
terraform apply
The Terraform scripts perform the following tasks:
- Create an S3 bucket for AWS logging and allow AWS to store logging in it. The script tries to detect the prefix that was used during onboarding and uses the same prefix for the logging bucket.
- Allow the Sneller IAM role read-only access to the bucket with AWS logging, so it can be ingested.
- Create a table definition (database: aws, table: cloudtrail) that ingests the CloudTrail logging.
- Create a table definition (database: aws, table: flow) that ingests the default VPC flow logging.
AWS batches the delivery of log events, so it may take a while before data shows up in Sneller. Browsing through the AWS console ensures that some API calls are invoked on your account, and spinning up an EC2 instance in the default VPC also generates some VPC flow activity.
If you haven't done so already, set the environment variables that are used to access the Sneller query engine:
export SNELLER_TOKEN=<your token here>
export SNELLER_ENDPOINT=https://snellerd-production.<region>.sneller.ai
Now run the following command to determine the number of events per service (via CloudTrail):
curl -H "Authorization: Bearer $SNELLER_TOKEN" \
-H "Accept: application/json" \
-s "$SNELLER_ENDPOINT/query?database=aws" \
--data-raw "SELECT eventSource, COUNT(*) FROM cloudtrail GROUP BY eventSource ORDER BY COUNT(*) DESC LIMIT 100"
The following command determines the number of flow records per interface ID (via VPC flow logs):
curl -H "Authorization: Bearer $SNELLER_TOKEN" \
-H "Accept: application/json" \
-s "$SNELLER_ENDPOINT/query?database=aws" \
--data-raw "SELECT interface_id, COUNT(*) FROM flow GROUP BY interface_id ORDER BY COUNT(*) DESC LIMIT 10"
Details
Setting up Terraform
These scripts depend on the AWS and Sneller providers. The AWS provider uses the current user’s AWS credentials, so make sure you have sufficient permissions.
This script uses the following variables:
- region specifies the AWS region of your Sneller instance (default: us-east-1).
- sneller_token should hold the Sneller bearer token. If it’s not set, Terraform will ask for it.
- prefix specifies a prefix that is used for the S3 bucket name. If you don’t specify a prefix, the script tries to autodetect the prefix used during onboarding or generates a new random prefix.
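As a minimal sketch, you could also set these variables via a terraform.tfvars file instead of the TF_VAR_... environment variables shown above (the values below are hypothetical placeholders):
# terraform.tfvars (example values, replace with your own)
region        = "us-east-1"
sneller_token = "<your-bearer-token>"
prefix        = "acme" # optional; leave it out to autodetect or generate a prefix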
- main.tf
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
}
sneller = {
source = "snellerinc/sneller"
}
}
}
provider "aws" {
region = var.region
}
provider "sneller" {
api_endpoint = "https://api-production.${var.region}.sneller.ai/"
default_region = var.region
token = var.sneller_token
}
variable "region" {
type = string
description = "AWS region"
default = "us-east-1"
}
variable "sneller_token" {
type = string
description = "Sneller token"
}
variable "database" {
type = string
description = "Database name for the AWS logging tables"
default = "aws"
}
variable "prefix" {
type = string
description = "Prefix for all resources (required to make resources unique)"
default = "" # a 4 character random prefix will be used, when left empty
}
data "aws_caller_identity" "current" {}
data "aws_partition" "current" {}
The next steps require the SQS queue and IAM role that Sneller uses, so these are obtained via the sneller_tenant_region data source, which provides this information.
- sneller-tenant-region.tf
data "sneller_tenant_region" "sneller" {
region = var.region
}
locals {
# Role name is the text after the slash in the IAM role ARN
sneller_iam_role_name = split("/", data.sneller_tenant_region.sneller.role_arn)[1]
}
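If you want to verify which IAM role and SQS queue the scripts will use, you could add output blocks like the sketch below. These are not part of the original scripts; the output names are arbitrary, but role_arn and sqs_arn are the data source attributes used elsewhere in the scripts:
# Optional: expose the values returned by the Sneller tenant data source
output "sneller_role_arn" {
  value = data.sneller_tenant_region.sneller.role_arn
}
output "sneller_sqs_arn" {
  value = data.sneller_tenant_region.sneller.sqs_arn
}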
Some “magic” is used to automatically derive the prefix from the current Sneller IAM role. When that’s not possible, a random prefix will be generated:
- prefix.tf
resource "random_string" "random_prefix" {
length = 4
special = false
numeric = false
upper = false
}
locals {
# If no prefix is set, then we first check if there is a
# prefix in the IAM role-name that we can use. If not,
# then a unique 4 character prefix is used instead.
_suggested_prefix = endswith(local.sneller_iam_role_name, "-sneller") ? trimsuffix(local.sneller_iam_role_name, "-sneller") : random_string.random_prefix.id
prefix = var.prefix != "" ? "${var.prefix}-" : "${local._suggested_prefix}-"
}
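To illustrate the derivation with a hypothetical role name:
# Hypothetical example: the Sneller IAM role is named "acme-sneller"
#   local.sneller_iam_role_name = "acme-sneller"
#   local._suggested_prefix     = "acme"   (the "-sneller" suffix is trimmed)
#   local.prefix                = "acme-"
# The logging bucket then becomes "acme-sneller-aws-logging".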
S3 bucket for AWS logging
All AWS logging is written into an S3 bucket with the following characteristics:
- Disallow public access.
- Add the bucket policy to allow the AWS services to write to the bucket.
- Add S3 event notification to notify Sneller when new AWS logging objects are available.
- s3-aws-logging.tf
# Cloudtrail delivers the log files to the following bucket
resource "aws_s3_bucket" "sneller_aws_logging" {
bucket = "${local.prefix}sneller-aws-logging"
force_destroy = true
tags = {
Name = "AWS logging data"
}
}
# Public access to the Cloudtrail log bucket is disabled
resource "aws_s3_bucket_public_access_block" "sneller_aws_logging" {
bucket = aws_s3_bucket.sneller_aws_logging.id
block_public_acls = true
block_public_policy = true
ignore_public_acls = true
restrict_public_buckets = true
}
# Enable S3 event notification to notify the Sneller ingestion
# pipeline to ingest new data as it arrives
resource "aws_s3_bucket_notification" "sneller_aws_logging" {
bucket = aws_s3_bucket.sneller_aws_logging.id
queue {
id = "sneller-aws-logging"
queue_arn = data.sneller_tenant_region.sneller.sqs_arn
events = ["s3:ObjectCreated:*"]
}
}
# Cloudtrail should be granted access to deliver log files to the bucket
resource "aws_s3_bucket_policy" "sneller_aws_logging" {
bucket = aws_s3_bucket.sneller_aws_logging.id
policy = data.aws_iam_policy_document.sneller_aws_logging_bucket_policy.json
}
data "aws_iam_policy_document" "sneller_aws_logging_bucket_policy" {
source_policy_documents = [
data.aws_iam_policy_document.sneller_cloudtrail_bucket_policy.json, # required for CloudTrail logging
data.aws_iam_policy_document.sneller_flow_bucket_policy.json, # required for VPC Flow logging
]
}
Note that this file doesn’t contain the actual bucket policy, but merges all the bucket policies for the individual logging services.
Enable IAM role to access AWS logging
The IAM role that is assumed by Sneller to read the source data should be granted access to the AWS log data:
- iam-role-aws-logging.tf
resource "aws_iam_role_policy" "sneller_aws_logging" {
role = local.sneller_iam_role_name
name = "aws-logging"
policy = data.aws_iam_policy_document.sneller_aws_logging.json
}
data "aws_iam_policy_document" "sneller_aws_logging" {
# Read access for the cloudtrail bucket
statement {
actions = ["s3:ListBucket"]
resources = [aws_s3_bucket.sneller_aws_logging.arn]
}
statement {
actions = ["s3:GetObject"]
resources = ["${aws_s3_bucket.sneller_aws_logging.arn}/*"]
}
}
CloudTrail logging
In this example, all service logging in all regions will be enabled, but this can be customized using event filtering (a sketch follows the aws-cloudtrail.tf listing below). The CloudTrail logging is stored in the logging bucket, so a policy is added that allows CloudTrail to write to this bucket.
The data is exposed via the cloudtrail Sneller table that is created here as well. The table is partitioned by the region of the CloudTrail data, which makes queries on a single region faster and more cost-efficient.
- aws-cloudtrail.tf
locals {
cloudtrail_name = "sneller"
}
# Table that holds all the ingested Cloudtrail log files
resource "sneller_table" "aws_cloudtrail" {
# Enable this for production to avoid trashing your table
# lifecycle { prevent_destroy = true }
database = var.database
table = "cloudtrail"
inputs = [
{
pattern = "s3://${aws_s3_bucket.sneller_aws_logging.bucket}/AWSLogs/${data.aws_caller_identity.current.account_id}/CloudTrail/{region}/*/*/*/*.json.gz"
format = "cloudtrail.json.gz"
}
]
partitions = [
{
field = "region"
}
]
}
# Enable CloudTrail in the AWS account
resource "aws_cloudtrail" "sneller" {
# The S3 bucket policy needs to be set before CloudTrail
# can write to the bucket
depends_on = [ aws_s3_bucket_policy.sneller_aws_logging ]
name = local.cloudtrail_name
s3_bucket_name = aws_s3_bucket.sneller_aws_logging.id
include_global_service_events = true # also log global events (i.e. IAM)
is_multi_region_trail = true # log from all AWS regions
# You can also filter which events should be logged. Refer to
# https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/cloudtrail
# for more detailed information
}
# AWS logging bucket policy that allows the CloudTrail service to
# write to the S3 bucket that holds all the source data.
data "aws_iam_policy_document" "sneller_cloudtrail_bucket_policy" {
# See https://docs.aws.amazon.com/awscloudtrail/latest/userguide/create-s3-bucket-policy-for-cloudtrail.html
statement {
sid = "AWSCloudTrailAclCheck"
effect = "Allow"
principals {
type = "Service"
identifiers = ["cloudtrail.amazonaws.com"]
}
actions = ["s3:GetBucketAcl"]
resources = [aws_s3_bucket.sneller_aws_logging.arn]
condition {
test = "StringEquals"
variable = "aws:SourceArn"
values = ["arn:${data.aws_partition.current.partition}:cloudtrail:${var.region}:${data.aws_caller_identity.current.account_id}:trail/${local.cloudtrail_name}"]
}
}
statement {
sid = "AWSCloudTrailWrite"
effect = "Allow"
principals {
type = "Service"
identifiers = ["cloudtrail.amazonaws.com"]
}
actions = ["s3:PutObject"]
resources = ["${aws_s3_bucket.sneller_aws_logging.arn}/AWSLogs/${data.aws_caller_identity.current.account_id}/*"]
condition {
test = "StringEquals"
variable = "s3:x-amz-acl"
values = ["bucket-owner-full-control"]
}
condition {
test = "StringEquals"
variable = "aws:SourceArn"
values = ["arn:${data.aws_partition.current.partition}:cloudtrail:${var.region}:${data.aws_caller_identity.current.account_id}:trail/${local.cloudtrail_name}"]
}
}
}
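For example, to record only management events that modify resources (skipping read-only API calls), you could add an event_selector block to the aws_cloudtrail resource. This is a hedged sketch based on the standard AWS provider arguments, not part of the scripts above:
# Sketch: add inside resource "aws_cloudtrail" "sneller" to record only
# management events that modify resources; read-only calls are skipped.
event_selector {
  read_write_type           = "WriteOnly" # one of "ReadOnly", "WriteOnly", "All"
  include_management_events = true
}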
VPC flow logging
In this example, VPC flow logs are captured for the region’s default VPC. The flow logs are stored in the logging bucket, so a policy is added that allows the log delivery service to write to this bucket.
The data is exposed via the flow Sneller table that is created here as well.
- aws-flow.tf
resource "aws_flow_log" "sneller" {
# The S3 bucket policy needs to be set before flow logging
# can write to the bucket
depends_on = [ aws_s3_bucket_policy.sneller_aws_logging ]
log_destination_type = "s3"
log_destination = aws_s3_bucket.sneller_aws_logging.arn
traffic_type = "ALL"
vpc_id = data.aws_vpc.default.id
}
data "aws_vpc" "default" {
default = true
}
# Table that holds all the ingested flow log files
resource "sneller_table" "aws_flow" {
# Enable this for production to avoid trashing your table
# lifecycle { prevent_destroy = true }
database = var.database
table = "flow"
inputs = [
{
pattern = "s3://${aws_s3_bucket.sneller_aws_logging.bucket}/AWSLogs/${data.aws_caller_identity.current.account_id}/vpcflowlogs/{region}/*/*/*/*.log.gz"
format = "csv.gz"
csv_hints = {
skip_records = 1
separator = " "
fields = [
{ name = "version", type = "int" },
{ name = "account_id", type = "string" },
{ name = "interface_id", type = "string" },
{ name = "srcaddr", type = "string" },
{ name = "dstaddr", type = "string" },
{ name = "srcport", type = "int" },
{ name = "dstport", type = "int" },
{ name = "protocol", type = "int" },
{ name = "packets", type = "int" },
{ name = "bytes", type = "int" },
{ name = "start", type = "datetime", format = "unix_seconds" },
{ name = "end", type = "datetime", format = "unix_seconds" },
{ name = "action", type = "string" },
{ name = "log_status", type = "string" },
]
}
}
]
partitions = [
{
field = "region"
}
]
}
# AWS logging bucket policy that allows the Flow logging delivery service to
# write to the S3 bucket that holds all the source data.
data "aws_iam_policy_document" "sneller_flow_bucket_policy" {
# See https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs-s3.html
statement {
sid = "AWSLogDeliveryWrite"
effect = "Allow"
principals {
type = "Service"
identifiers = ["delivery.logs.amazonaws.com"]
}
actions = ["s3:PutObject"]
resources = ["${aws_s3_bucket.sneller_aws_logging.arn}/*"]
condition {
test = "StringEquals"
variable = "aws:SourceAccount"
values = [data.aws_caller_identity.current.account_id]
}
condition {
test = "StringEquals"
variable = "s3:x-amz-acl"
values = ["bucket-owner-full-control"]
}
condition {
test = "ArnLike"
variable = "aws:SourceArn"
values = ["arn:${data.aws_partition.current.partition}:logs:${var.region}:${data.aws_caller_identity.current.account_id}:*"]
}
}
statement {
sid = "AWSLogDeliveryAclCheck"
effect = "Allow"
principals {
type = "Service"
identifiers = ["delivery.logs.amazonaws.com"]
}
actions = ["s3:GetBucketAcl","s3:ListBucket"]
resources = [aws_s3_bucket.sneller_aws_logging.arn]
condition {
test = "StringEquals"
variable = "aws:SourceAccount"
values = [data.aws_caller_identity.current.account_id]
}
condition {
test = "ArnLike"
variable = "aws:SourceArn"
values = ["arn:${data.aws_partition.current.partition}:logs:${var.region}:${data.aws_caller_identity.current.account_id}:*"]
}
}
}