Infrastructure as Code - which tool?
TL;DR - I think you’re probably best off with Terraform
Preface
These are highly opinionated pieces. While they are based on many years of professional experience in the industry, they don’t take individual nuance into account. Posts are mostly going to focus on AWS tooling, as that’s where most of my experience is. What better way to kick off than with CloudFormation?
CloudFormation
You might hear some very opinionated (wait, this blog post is opinionated…) people claim that CloudFormation sucks. And I kind of agree. However, if used in a very specific way, it can be pretty useful.
CloudFormation originally only supported JSON templates. I believe AWS intended that CloudFormation templates would only ever be written by tools (hence the later release of CDK). In practice that kind of sucked and people hand-crafted templates anyway, leading to CloudFormation eventually supporting YAML.
This concept of machines only ever managing CloudFormation is important, however. To understand why, we have to understand how CloudFormation works. When a template is deployed, CloudFormation spins up the resources and saves the state and configuration of each resource. This allows CloudFormation to perform its infamous rollback process. It also allows CloudFormation to know what steps to take to perform updates. Now, if users have been meddling with resources, this all falls apart. Subtle changes to resources can break CloudFormation updates and, even worse, CloudFormation can roll back a failed release to what was in its own state rather than what was actually running prior to the stack update!
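To make the rollback problem concrete, here is a toy Python sketch. It is not how CloudFormation is implemented: the `Stack` class, its fields and the resource properties are all hypothetical. It only models the idea that a rollback restores the last recorded state rather than whatever was actually running.

```python
from copy import deepcopy


class Stack:
    """Toy model of a stack that records the configuration it last deployed."""

    def __init__(self, template):
        self.recorded = deepcopy(template)  # what the tool believes is deployed
        self.live = deepcopy(template)      # what is actually running

    def manual_change(self, resource, key, value):
        # Someone edits the resource by hand; the recorded state is not updated.
        self.live[resource][key] = value

    def update(self, new_template, fails=False):
        previous = deepcopy(self.recorded)
        self.live = deepcopy(new_template)
        if fails:
            # Rollback restores the recorded state, not the pre-update live state.
            self.live = previous
            return "ROLLED_BACK"
        self.recorded = deepcopy(new_template)
        return "UPDATED"


stack = Stack({"web": {"instance_type": "t3.small"}})
stack.manual_change("web", "instance_type", "t3.large")  # hotfix by hand
status = stack.update({"web": {"instance_type": "t3.medium"}}, fails=True)
print(status, stack.live)
```

After the failed update the live resource is back on `t3.small`. The manual hotfix to `t3.large` is gone, even though that was what was running before the update.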
The ability to import resources into CloudFormation is new(ish) and support is very limited. If you have an existing stack - good luck.
This all sounds horrible, but if you think about what CloudFormation is trying to achieve you can make it work in your favour.
My CloudFormation rules:
- New AWS account
- Disable access to EVERYTHING except managing CloudFormation
- Engineers should never make manual changes (and shouldn’t be able to if you follow the previous point)
- CloudFormation is deployed by pipelines / CI/CD
- If you need a resource not supported by CloudFormation - make it supported using custom resources or wait
- Don’t be afraid to make large templates that cover an entire environment. If you hit the CloudFormation size limit you probably want another AWS account anyway.
- CloudFormation needs to be tested in a staging account first
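As a rough illustration of the "disable access to everything except CloudFormation" rule, the sketch below builds an IAM policy document in Python. The function name, account ID and role name are made up for the example, and the policy is an assumption about one way to do this (users manage stacks and pass a dedicated service role to CloudFormation). Treat it as a starting point to review, not a vetted policy.

```python
import json


def cloudformation_only_policy(service_role_arn):
    """Build an IAM policy document that only allows managing CloudFormation."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ManageStacks",
                "Effect": "Allow",
                "Action": "cloudformation:*",
                "Resource": "*",
            },
            {
                # Lets the user hand the deploy role to CloudFormation only.
                "Sid": "PassDeployRoleToCloudFormation",
                "Effect": "Allow",
                "Action": "iam:PassRole",
                "Resource": service_role_arn,
                "Condition": {
                    "StringEquals": {
                        "iam:PassedToService": "cloudformation.amazonaws.com"
                    }
                },
            },
        ],
    }


# Hypothetical account ID and role name, for illustration only.
policy = cloudformation_only_policy(
    "arn:aws:iam::123456789012:role/cfn-deploy-role"
)
print(json.dumps(policy, indent=2))
```

IAM denies anything that is not explicitly allowed, so a user holding only this policy cannot touch resources directly. The resources themselves are created with the permissions of the service role that CloudFormation assumes.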
This sounds like a lot of work - what’s the payoff?
- AWS supported and managed tooling
- Accurate change sets (these can be very very very important in controlled environments)
- Very simple and predictable rollback - in theory you will either move to the new state or return to the old one (provided the rollback doesn’t fail; manual resource changes are a common cause of failed rollbacks)
Terraform
I haven’t used OpenTofu in anger yet. It’s likely that everything I say here applies to OpenTofu as well and it’s worthwhile checking out - I just don’t have experience with it yet.
I’m going to write a whole post on using Terraform so I will keep this a little bit more brief. Terraform has a lot of pitfalls and issues (can we please not store secrets in the statefile and have dynamic state backend config kthnkz) but the advantage is that it is much easier for engineers to manage in an ad-hoc manner.
Unlike our CloudFormation scenario above we have:
- very good resource import support
- doesn’t rely only on saved state to plan changes; by default it refreshes against the real resources first (I’m oversimplifying here… I know)
- will try to correct manual changes
- a much richer language for defining resources, which means that engineers don’t need to use abstraction languages to generate it
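The "will try to correct manual changes" behaviour can be sketched with a toy planner in Python. This is a deliberately simplified model, not Terraform’s actual algorithm; `plan`, `desired` and `actual` are names invented for the example, and `actual` stands in for the refreshed view of the resources the tool tracks.

```python
def plan(desired, actual):
    """Compare the desired config with the refreshed real-world state."""
    actions = []
    for name, config in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != config:
            # Drift (for example a manual change) shows up as an update.
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions


desired = {"bucket": {"versioning": True}, "queue": {"fifo": False}}
# Someone switched versioning off by hand, and the queue was never created.
actual = {"bucket": {"versioning": False}}
print(plan(desired, actual))  # [('update', 'bucket'), ('create', 'queue')]
```

Because the comparison is made against what is really there, the hand-edited bucket is pulled back to the declared configuration on the next apply instead of silently breaking the deployment.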
It does come with its own drawbacks though:
- doesn’t support rolling back changes. If Terraform breaks during a deployment you have to fix it yourself. This isn’t as bad as it sounds, but you will need to factor it into your lifecycle / deployment plans.
- Using Terraform securely is hard. I don’t think any company actually has secure Terraform deployments.
If you have an environment with existing resources or engineers that are likely to poke things manually - Terraform is a good choice.
HCL, while not perfect, can lead to good grepability. You can still fall down a rabbit hole of module mess (this will be covered in my other post) but if you try to keep your Terraform flat then managing bulk changes can be ok.
Pulumi, CDK, CDK for terraform, other code generators
Unless you have a very good reason, just don’t. They seem like a good idea. Your software engineers will think it’s a good idea. However, from experience, every single implementation I’ve seen or worked with has turned into a mess. I think they can work, but they have too many footguns to remain manageable.
Software engineers will often abstract resources to the point that making changes is fragile, complex and time consuming. Finding out how a resource is created and configured requires following chains of abstractions (hope your IDE is configured correctly).
After all of this you still end up with a template, which means you still need to understand how to debug the template and you have to map the generated template back to the code. So not only do you need to know the tool / code abstractions, you still end up with all the pain of managing the output of templates anyway.
The other problem is grepability. When operating large infrastructure deployments there are some tasks where abstractions like this become frustrating. Some examples:
- Update all S3 buckets with a new security policy
- Change the TLS policy on all load balancers
- Find how/where a specific KMS key is configured
When using code generators it can sometimes be hard to perform these tasks. Sometimes easier. But often harder, as everyone will have created their own abstraction for their resources. Compare that to a DSL (like Terraform’s HCL or the YAML for CloudFormation), where we can search the entire company’s repos for something like aws_s3_bucket and possibly even script changes to the resource. Using a tool like multi-gitter suddenly means that it’s possible to update resources to a new company policy or standard quickly.
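As a small illustration of that kind of search, here is a Python sketch that walks a directory of checked-out repos and lists every Terraform resource block of a given type. The `find_resources` helper and the `./repos` path are made up for the example, and the regex assumes the common one-line `resource "type" "name"` block header, so treat it as a starting point rather than a robust HCL parser.

```python
import re
from pathlib import Path

# Matches block headers such as: resource "aws_s3_bucket" "logs" {
RESOURCE_RE = re.compile(r'^\s*resource\s+"(?P<type>[^"]+)"\s+"(?P<name>[^"]+)"')


def find_resources(root, resource_type):
    """Yield (path, line number, resource name) for matching resource blocks."""
    for path in sorted(Path(root).rglob("*.tf")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, 1):
            match = RESOURCE_RE.match(line)
            if match and match.group("type") == resource_type:
                yield path, lineno, match.group("name")


if __name__ == "__main__":
    # "./repos" is a hypothetical directory holding local clones.
    for path, lineno, name in find_resources("./repos", "aws_s3_bucket"):
        print(f"{path}:{lineno}: {name}")
```

With hand-written HCL the resource type appears literally in the file, so a search like this finds every bucket. With a code generator the same bucket may be hidden behind several layers of wrapper classes and never mention `aws_s3_bucket` at all.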