jahed.dev

Enforcing Retention Policies on AWS S3

With the recent rush for GDPR compliance, services are becoming more aware of how much data they hold and whether it's really necessary to keep all of it.

Application logs contain a variety of historical data coming from both users and third parties, making them extremely useful for running reports and monitoring production behaviour. However, after a certain period, the burden of responsibility will begin to outweigh the usefulness of the data. Once that point is reached, it's best to shed that responsibility.

A common way to store logs is to put them on AWS S3. But, without the proper configuration, those logs will remain there indefinitely. You could manually delete objects or set an expiry when they're uploaded but there's an even more convenient solution built into S3: Lifecycle Rules.

At Unruly we use Terraform to provision our AWS resources. So, I'll be showing how you can do the same to enforce your retention policies. Before continuing, you'll need to familiarise yourself with Terraform's basics.

Prepare your S3 Bucket

You'll want to apply your retention policy to a bucket, so let's prepare one in Terraform. You have two options: create a new bucket or import an existing one.

Creating a Bucket in Terraform

To get things started, let's specify a new bucket in Terraform. Here's a private bucket called "my-logs". We'll be using this bucket as our main example.

resource "aws_s3_bucket" "my-logs" {
  bucket = "my-logs"
  acl    = "private"
}
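
The snippets in this post use the inline acl and lifecycle_rule arguments of aws_s3_bucket. Version 4 of the AWS provider deprecated those arguments in favour of separate resources, so if you want to follow along as written, one option is to pin the provider to a 3.x release. Here's a minimal sketch of that setup. The version constraint and the region are assumptions for illustration, not something this post's original setup specifies, and the required_providers syntax assumes Terraform 0.13 or later.

```hcl
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
      # Assumption: stay on 3.x, where inline acl and lifecycle_rule
      # on aws_s3_bucket are not deprecated.
      version = "~> 3.0"
    }
  }
}

provider "aws" {
  # Assumption: placeholder region, replace with your own.
  region = "eu-west-1"
}
```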

Using an Existing Non-Terraformed Bucket

If you want to use an existing bucket that isn't already in Terraform, use the terraform import command. Note that the command will only import your resource into the Terraform State and will not generate Terraform Configuration.

You'll need to manually recreate your resource as a Terraform Configuration, using terraform plan to repeatedly diff the provisioned resource against your configuration until there are no remaining differences. Don't forget to terraform apply your final configuration once you're happy with the diff.
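
If you're on Terraform 1.5 or later (an assumption, as this post was written against the terraform import command), the import can also be declared in configuration with an import block. The sketch below assumes the existing bucket is named "my-logs"; for S3 buckets the import id is the bucket name.

```hcl
# Declares that the existing bucket should be imported into state
# at this resource address on the next plan/apply.
import {
  to = aws_s3_bucket.my-logs
  id = "my-logs" # assumption: the name of your existing bucket
}

# The configuration the imported bucket is matched against.
# Adjust it until terraform plan reports no differences.
resource "aws_s3_bucket" "my-logs" {
  bucket = "my-logs"
}
```

With an import block in place, running terraform plan with the -generate-config-out flag can write a starting configuration for you, though the result is best treated as a draft to review rather than a finished resource.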

Add a Lifecycle Rule

Now that we have an S3 bucket in Terraform, let's add a Lifecycle Rule that applies to every object whose key is prefixed with logs. This rule isn't valid yet as we haven't added any behaviour.

resource "aws_s3_bucket" "my-logs" {
  ...
  lifecycle_rule {
    id      = "logs_6_month_retention"
    prefix  = "logs"
    enabled = true
  }
}

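By setting the prefix, the rule will apply to every object whose key begins with logs. A hypothetical key such as logs/app.log would match, but so would logs-archive/app.log, since S3 compares the prefix as a plain string. Use logs/ as the prefix if you only want objects inside that folder.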

Add an Expiration

Let's define an Expiration to enforce our retention policy. We're setting ours to 180 days, which means any object older than roughly 6 months will be removed.

resource "aws_s3_bucket" "my-logs" {
  ...
  lifecycle_rule {
    ...
    expiration {
      days = 180
    }
  }
}

AWS will now check the bucket once a day for expired objects. The exact time it does this is undocumented, but we've found it's around the time of day the policy was enabled. So AWS likely won't remove your expired objects immediately; you may have to wait around 24 hours.

If you're worried about misconfiguration, test it out on a safe prefix with some test objects before rolling this out. This can take some time as you'll have to wait a day for AWS to run its clean-up.
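
As a rough sketch of such a test, here's the same bucket with a throwaway rule scoped to a separate prefix and a 1-day expiration. The rule id and the lifecycle-test/ prefix are hypothetical names chosen for this example; pick any prefix that holds nothing you care about.

```hcl
resource "aws_s3_bucket" "my-logs" {
  bucket = "my-logs"
  acl    = "private"

  # Hypothetical test rule: only touches objects whose key
  # starts with "lifecycle-test/".
  lifecycle_rule {
    id      = "lifecycle_test_1_day_retention"
    prefix  = "lifecycle-test/"
    enabled = true

    # Expire test objects once they are more than a day old.
    expiration {
      days = 1
    }
  }
}
```

Upload a few dummy objects under lifecycle-test/, apply, and check back after AWS has had a chance to run its daily pass. Once you've seen the objects disappear, remove the test rule and apply again.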

Bonus: Add a Transition

If you want to save some budget, you can also transition objects you're likely to use less often to AWS Glacier. Glacier is a lot cheaper than S3 but comes with slower access times. Let's transition objects older than 3 months to Glacier.

resource "aws_s3_bucket" "my-logs" {
  ...
  lifecycle_rule {
    ...
    transition {
      days          = 90
      storage_class = "GLACIER"
    }
  }
}

Apply your changes

Make sure to terraform apply your changes if you haven't already and double check everything's correct in AWS's Web Console.

Conclusion

That's everything. With this configuration, you'll have an S3 bucket where any objects prefixed with logs will transition to Glacier after 3 months and be automatically deleted after a total of 6 months, that is, 3 months after being moved to Glacier.

Here's what the final resource looks like:

resource "aws_s3_bucket" "my-logs" {
  bucket = "my-logs"
  acl    = "private"

  lifecycle_rule {
    id      = "logs_6_month_retention"
    prefix  = "logs"
    enabled = true

    expiration {
      days = 180
    }

    transition {
      days          = 90
      storage_class = "GLACIER"
    }
  }
}
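
If you're on version 4 or later of the AWS provider, the inline lifecycle_rule block is deprecated and the same policy is expressed with a separate aws_s3_bucket_lifecycle_configuration resource. This is a sketch of an equivalent rule under that assumption, not the configuration this post was originally written with. ACL handling is left out here, as it also moved to its own resource in those provider versions.

```hcl
resource "aws_s3_bucket" "my-logs" {
  bucket = "my-logs"
}

resource "aws_s3_bucket_lifecycle_configuration" "my-logs" {
  bucket = aws_s3_bucket.my-logs.id

  rule {
    id     = "logs_6_month_retention"
    status = "Enabled" # replaces enabled = true

    # The prefix now lives inside a filter block.
    filter {
      prefix = "logs"
    }

    # Move objects to Glacier after 3 months.
    transition {
      days          = 90
      storage_class = "GLACIER"
    }

    # Delete objects after 6 months in total.
    expiration {
      days = 180
    }
  }
}
```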