AWS Backup is terrible but you'll still probably use it anyway...

I usually don’t write blog posts on things I’ve been doing at work; however, AWS Backup is just so truly bad that I want to warn people about some of the gotchas should you want to perform your own migration to it. Let’s start with what AWS Backup thinks it is (emphasis mine).

AWS Backup is a fully managed service that centralizes and automates data protection across AWS services and hybrid workloads. It provides core data protection features, ransomware recovery capabilities, and compliance insights and analytics for data protection policies and operations. AWS Backup offers a cost-effective, policy-based service with features that simplify data protection at exabyte scale across your AWS estate.

Now before I tear AWS Backup apart let me start with the parts that I think are good and why you’ll probably use it.

  • Automated and fairly straightforward copy jobs to other AWS regions and accounts
  • Very easy to demonstrate to auditors that your backups are happening and working
  • Can backup resources based on filters - no need to create a job for each resource
  • Various methods of protecting the backups from tampering
  • Continuous backup modes*

Cost effective

The AWS product page mentions the cost-effectiveness of AWS Backup, and this might be partly true depending on how you are currently doing your backups or what resources you are backing up. For example, say you had a script that copied from one S3 bucket to another as your backup. In us-east-1 you would pay $0.023 per GB for that backup. Now that same backup using AWS Backup would cost $0.05 per GB… wait, what. You are effectively paying 2.2 times as much as plain S3 to back up using AWS Backup. What. DocumentDB, Redshift and Aurora are all cheaper to back up with AWS Backup than S3 is. In the Sydney region, EBS volumes are cheaper to back up than S3. This makes no sense to me.
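
As a quick sanity check on that maths, here’s the back-of-envelope comparison (prices are the us-east-1 figures quoted above; check the current pricing pages before trusting them):

# Back-of-envelope comparison for 1 TB of S3 data, using the us-east-1
# prices quoted above. Always check the current AWS pricing pages.
S3_STANDARD_PER_GB = 0.023   # plain S3 Standard storage, per GB-month
AWS_BACKUP_S3_PER_GB = 0.05  # AWS Backup warm storage for S3, per GB-month

size_gb = 1024  # 1 TB

diy_copy = size_gb * S3_STANDARD_PER_GB
aws_backup = size_gb * AWS_BACKUP_S3_PER_GB

print(f"DIY bucket-to-bucket copy: ${diy_copy:.2f} per month")
print(f"AWS Backup:                ${aws_backup:.2f} per month")
print(f"Ratio: {aws_backup / diy_copy:.1f}x")  # roughly 2.2x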

But wait, there’s more. When you perform the initial backup for S3 (and for changes), AWS Backup performs the requests to S3 like a normal client, which means you get stung with KMS, CloudTrail, S3 request and GuardDuty API charges. There is no way to filter AWS Backup out of CloudTrail, so if you have a lot of objects you could be up for thousands of dollars for that initial sync. The AWS Backup team’s solution is to “turn off CloudTrail” or “don’t use AWS Backup”. Amazing. GuardDuty isn’t even listed in the support docs as a possible cost.

Now, before we leave the cost-effectiveness section of this blog, we should talk about budgeting. It is impossible to estimate how much AWS Backup will cost. I lodged two tickets to get an idea of how continuous backup mode works and how much it would cost to back up RDS and Aurora instances. The responses were both unhelpful and misleading. Copy jobs for Aurora and RDS are only snapshots, not PITR - this is not made clear in the docs. Trying to estimate this cost is near impossible because not even AWS knows how to do it.

Fully managed

There are a bunch of fundamentals that are just missing from AWS Backup compared to pretty much any other backup solution. For example, want to test that a backup plan works correctly? Guess what, you can’t trigger a manual start. You have to wait for the scheduled run time. Or say you want a list of jobs that were not successful - the UI only lets you filter by one state, and there are something like four different failed states.
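
If you fall back to the API, the least-bad workaround I’ve found is to query each unsuccessful state separately and merge the results yourself. A minimal sketch with boto3 (pick whichever states you count as “not successful”):

import boto3

backup = boto3.client("backup")

# The console only filters by a single state, so query each "unsuccessful"
# state separately and merge the results ourselves.
UNHAPPY_STATES = ["FAILED", "ABORTED", "EXPIRED", "PARTIAL"]

def unsuccessful_backup_jobs():
    jobs = []
    paginator = backup.get_paginator("list_backup_jobs")
    for state in UNHAPPY_STATES:
        for page in paginator.paginate(ByState=state):
            jobs.extend(page["BackupJobs"])
    return jobs

for job in unsuccessful_backup_jobs():
    print(job["BackupJobId"], job["ResourceArn"], job["State"], job.get("StatusMessage", ""))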

But it’s OK, because you have useful AWS features like sending SNS notifications when backups fail… except you can’t send that failure notification to an SNS topic in another AWS account, for some unknown reason…

Restore testing sounds great. You schedule it to test restores, and you can even run your own validation scripts as Lambda functions. However, it is clearly a hacked-on feature. For S3 you have to restore the entire bucket. Have a large bucket? That’s gonna cost you. For Aurora and DocumentDB the restore test doesn’t even start an instance. It just creates a cluster. What’s being tested? Then, to top it all off, restored S3 buckets linger around for days, because to clean up the bucket AWS Backup uses a lifecycle rule to delete all the objects (I know this is a good move for cost effectiveness, but that’s an internal AWS thing!).

If you are configuring restore testing, much like backups, there’s no way to trigger a test right now. Hope your IAM roles, Lambda functions and EventBridge rules are perfect. Oh, by the way, since you have to create instances yourself for DocumentDB and Aurora - hope you can handle waiting longer than the Lambda timeout to do the test. Restore testing doesn’t even try to restore with the same configuration, so you have to manually define the VPC and security groups for the databases as well - otherwise it will try to use the default VPC and SG.
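
For what it’s worth, the validation hook itself is just a Lambda that reports a pass/fail back to AWS Backup via put_restore_validation_result. A minimal sketch, assuming the function is wired up to the EventBridge restore job event for the restore testing plan - the event detail field names below are assumptions, so check the event you actually receive:

import boto3

backup = boto3.client("backup")

def handler(event, context):
    # Assumed to be triggered by the EventBridge event for a completed
    # restore test; the detail field names here are assumptions.
    detail = event.get("detail", {})
    restore_job_id = detail.get("restoreJobId")
    created_resource = detail.get("createdResourceArn")

    # Real validation goes here: connect to the restored database, head an
    # object in the restored bucket, etc. Remember that for Aurora and
    # DocumentDB only a cluster exists (no instance), so anything that needs
    # to connect has to create an instance first - and waiting for that can
    # easily exceed the 15 minute Lambda timeout.
    ok = created_resource is not None

    backup.put_restore_validation_result(
        RestoreJobId=restore_job_id,
        ValidationStatus="SUCCESSFUL" if ok else "FAILED",
        ValidationStatusMessage="checked by validation Lambda",
    )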

And how do you define those parameters for setting the security group and VPC? As a JSON object? As a string? Nope, a JSON object as a string. There’s no validation on this, so if you send the wrong object, or send an array instead, the UI explodes.
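
To make that concrete, here’s roughly what the restore testing selection looks like through boto3. The override values are plain strings, so a list of security group IDs has to be a JSON array encoded as a string. The override key names here are assumptions based on Aurora restore metadata - pull the real keys for your recovery points with get_recovery_point_restore_metadata before relying on them:

import json
import boto3

backup = boto3.client("backup")

# Values are strings, so the security group list is a JSON array *encoded
# as a string*. The key names are assumptions - check your own recovery
# point metadata for the real ones.
overrides = {
    "DBSubnetGroupName": "my-private-subnet-group",
    "VpcSecurityGroupIds": json.dumps(["sg-0123456789abcdef0"]),
}

backup.create_restore_testing_selection(
    RestoreTestingPlanName="my-restore-testing-plan",
    RestoreTestingSelection={
        "RestoreTestingSelectionName": "aurora-selection",
        "ProtectedResourceType": "Aurora",
        "IamRoleArn": "arn:aws:iam::123456789012:role/restore-testing-role",
        "ProtectedResourceArns": ["*"],
        "RestoreMetadataOverrides": overrides,
    },
)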

It’s OK though, restore testing will clean up your resources once the test is done… nope, if you created instances you need to delete them yourself before it will clean up the cluster.

Simplify data protection

Here’s an incomplete list of gotchas for AWS Backup:

  • Remember to exclude the CloudTrail logging bucket so you don’t make loops
  • Remember to exclude the server access logging bucket so you don’t make loops
  • Don’t remove S3 EventBridge notifications on buckets that have been configured with AWS Backup, otherwise the backup has to start again (Terraform does this by default when you have a notification policy configured - see the sketch after this list)
  • If the bucket is empty, or contains no files that will be backed up, the job will be marked as failed
  • Don’t configure AWS Backup with the same backup window as the RDS backup window, otherwise the backups will branch, creating conflicting backups
  • Can’t create overlapping schedules for continuous backups - so be careful with your resource selection
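
On the EventBridge gotcha above: as the post notes, AWS Backup’s S3 backups rely on the bucket sending events to EventBridge, and Terraform’s aws_s3_bucket_notification will happily remove that setting. A quick boto3 sketch to check whether a bucket still has EventBridge notifications enabled:

import boto3

s3 = boto3.client("s3")

def eventbridge_enabled(bucket: str) -> bool:
    # If EventBridge notifications have been stripped (e.g. by Terraform),
    # AWS Backup has to start the S3 backup from scratch.
    config = s3.get_bucket_notification_configuration(Bucket=bucket)
    return "EventBridgeConfiguration" in config

print(eventbridge_enabled("my-backed-up-bucket"))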

The restore testing UI doesn’t display what resource it restored, either - just what the restore point ARN is. This makes it hard to demonstrate to an auditor that you tested restoring that resource.
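
If you need to show an auditor which resource a restore test actually exercised, you can walk it back yourself: the restore job gives you the recovery point ARN, and describing that recovery point gives you the original resource. A rough boto3 sketch, assuming you know which vault the recovery points live in:

import boto3

backup = boto3.client("backup")

def restored_resources(vault_name: str):
    # Map each restore job's recovery point back to the resource that was
    # originally backed up, since the UI only shows the recovery point ARN.
    paginator = backup.get_paginator("list_restore_jobs")
    for page in paginator.paginate():
        for job in page["RestoreJobs"]:
            recovery_point = backup.describe_recovery_point(
                BackupVaultName=vault_name,
                RecoveryPointArn=job["RecoveryPointArn"],
            )
            yield job["RestoreJobId"], recovery_point["ResourceArn"], job["Status"]

for job_id, resource_arn, status in restored_resources("my-backup-vault"):
    print(job_id, resource_arn, status)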

Compliance insights

So after all of this you think, at least I have audit frameworks. Except, if you are doing restore testing, you quickly find out that the restore test resources are included in the report - suggesting you should back up your restores. And there isn’t a good way to exclude them, because it’s really just AWS Config in a trench coat.


Build Pipeline Security

This occurred on an AWS website (not a site hosted on AWS, but a site run by AWS). It shows that security is hard, even for a $51 billion business. This issue can occur not just on websites but also in SDKs and libraries.

Fox smelling the road

📸 Erik Mclean via unsplash

While developers have a keen nose for code smells, us operations types have a keen nose for infrastructure smells. When I opened this Git repository for the first time, it hit me: a buildspec.yml file.

The humble buildspec.yml

For those unfamiliar, buildspec.yml is used by a service called CodeBuild and defines the steps used to build a project, including running shell commands. It’s basically remote code execution as a service.

The presence of this file in a repository isn’t cause for alarm, but when it’s in a public repository it certainly raises red flags. The usual concern is that someone’s committed secret credentials into this file. In this case the file was clean of credentials.

All good right? Not so fast.

Fox sleeping

📸 Lachlan Gowen via unsplash

notices your deploy.sh

The buildspec.yml referenced a deploy.sh. This is when I verbally said “oh no”. Like before, no secrets committed. A good start. deploy.sh contains instructions to deploy the project - aws s3 sync and the like - so we can determine that when this gets run, it has access to upload to the production site.

Fox yelling

📸 Nathan Anderson via unsplash

The issue here is that the buildspec.yml and deploy.sh could be modified by a malicious user.

The pull request

However, a malicious user doesn’t have access to commit to the repository, and an admin isn’t going to merge malicious code, so this is no big deal, right? Let’s see what happens when we lodge a pull request.

Upon creation of the pull request, GitHub triggers a CodeBuild job. This is a fairly common practice to make sure nothing in the pull request breaks the build. What prevents the pull request build from deploying to production? Let’s check deploy.sh:

if [[ "$CODEBUILD_WEBHOOK_HEAD_REF" == "refs/heads/main" && ${CODEBUILD_SOURCE_VERSION:0:3} != "pr/" ]]; then

oh no.

So deployment is purely controlled by a script that can be changed in the pull request.

Fox in grass

📸 Scott Walsh via unsplash

One last chance

At this stage we’ve got remote code execution into the pipeline. Apart from mining some Bitcoin, this is pretty uneventful. What about the S3 sync we mentioned earlier? It’s possible that the role granted for pull requests is the same role used for deploying to production, so let’s check it out.

I edited the shell script to have my code right at the start …

echo "testing a security issue" > test.html
aws s3 cp test.html s3://target_bucket/test.html
aws cloudfront create-invalidation --distribution-id $CLOUDFRONT_DIST_ID --paths "/*"
exit 1

The target_bucket value was recovered from the original deploy.sh

… and lodged a pull request. I checked the website and sure enough my file was there. 😮

Fox licking lips

📸 Nathan Anderson via unsplash

It doesn’t end there

It’s quite possible that the role used for deployment might have access to lots of interesting things: a private subnet, IAM admin, CloudFormation. I didn’t check further than this and submitted a disclosure report to the security team immediately.

Prevention

If you still want pull requests to trigger builds on a public repository, there are a couple of things you can do to limit the risk.

Place build scripts in a separate repo. Some build tools let you specify a separate repo to use for the build pipeline. Be careful though as this doesn’t guarantee that the project build can’t execute commands, depending on the programming language and build tools.

For services like CodeBuild you can utilize a separate IAM role for pull requests which is limited to just the build requirements. Make sure the build agents for PRs aren’t within a trusted network.
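
As a rough illustration of “just the build requirements”, the PR-build role can be limited to little more than writing its own logs - no deploy bucket, no CloudFront invalidations. A sketch with boto3 (role and resource names are placeholders):

import json
import boto3

iam = boto3.client("iam")

# Minimal permissions for a PR build: write its own CloudWatch Logs and
# nothing else. The deploy bucket and CloudFront permissions live on a
# separate role that only the main-branch pipeline can use.
pr_build_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents",
            ],
            "Resource": "arn:aws:logs:us-east-1:123456789012:log-group:/aws/codebuild/pr-builds*",
        }
    ],
}

iam.put_role_policy(
    RoleName="codebuild-pr-build-role",
    PolicyName="pr-build-minimal",
    PolicyDocument=json.dumps(pr_build_policy),
)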


Even better free video streaming and storage on AWS

The below video is streamed from AWS practically for free. This is done very similarly to the original Big Buck AWS, but using OpsWorks logs instead. This exploit is a little more useful as you can store gigabytes of data many times over, and CORS is enabled.

(click play if video doesn’t auto play)

When you run a command in AWS OpsWorks (such as Setup or Configure) the logs are uploaded to S3 and viewed from the UI using presigned URLs. The bucket for this is opsworks-us-east-1-log - notice how it’s an AWS bucket and not your bucket!

So what we do is run a whole bunch of deployments to generate logs, and modify the OpsWorks agent’s lib/instance_agent/agent/process_command.rb file to print out the presigned URL that it uses to upload logs. Once we have the presigned URLs and the logs are uploaded, we re-upload whatever content we want - in this case, the MPEG-TS segments for Big Buck Bunny.

ffmpeg -y \
-i bbb_sunflower_1080p_30fps_normal.mp4 \
-codec copy \
-bsf:v h264_mp4toannexb \
-map 0 \
-f segment \
-segment_time 30 \
-segment_format mpegts \
-segment_list "bbb.m3u8" \
-segment_list_type m3u8 \
"bbb-%d.ts"

for i in (seq 0 21);  gzip bbb-$i.ts; end

curl -v --upload-file bbb-10.ts.gz "PRESIGNED URL HERE"

To make use of the files, all we have to do is get presigned URLs for the GET requests, so we write a simple Lambda function that performs:

boto3.client('opsworks', region_name="us-east-1").describe_commands(DeploymentId=ts[requested_ts])['Commands'][0]["LogUrl"]

to get the log URLs. To build this site, we dump the Lambda behind an API Gateway and serve up an HLS m3u8 file. More details can be found at the original Big Buck AWS GitHub repo.
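
The Lambda itself is tiny - roughly the sketch below, assuming an API Gateway proxy integration and a hard-coded map from segment names to the OpsWorks deployments whose logs were overwritten (names and IDs here are illustrative):

import boto3

opsworks = boto3.client("opsworks", region_name="us-east-1")

# Map of requested HLS segment -> the OpsWorks deployment whose log file was
# overwritten with that .ts segment. IDs here are illustrative.
ts = {
    "bbb-0.ts": "deployment-id-0",
    "bbb-1.ts": "deployment-id-1",
    # ...
}

def handler(event, context):
    # API Gateway (proxy integration) passes the requested segment name;
    # look up the matching deployment and redirect to a fresh presigned URL
    # for its "log" in the opsworks-us-east-1-log bucket.
    requested_ts = event["pathParameters"]["segment"]
    commands = opsworks.describe_commands(DeploymentId=ts[requested_ts])
    log_url = commands["Commands"][0]["LogUrl"]
    return {"statusCode": 302, "headers": {"Location": log_url}}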