In the previous post (you can find it here) we talked about why we should automate.
Now I would like to show the rationale behind a new project and infrastructure we were responsible for: the motivation that drove us toward a pure CloudFormation setup, why we bet on AWS, and why we decided to drop Terraform, assuming that certain conditions would stay with us for a long time.
We evaluated Terraform and it did not fit our needs. That does not change our opinion about other tools created by HashiCorp. As users of their other tools (mostly Packer, Vagrant, Vault and Consul) we are grateful for the work they have done. But here is why Terraform did not match our expectations.
If you do not have any experience with CloudFormation or Terraform, please read either the excellent official documentation or any other introductory article, because I will rely on your basic knowledge of both tools.
Terraform is not a silver bullet
When we evaluated Terraform, it was before the 0.7 release. According to its own definition, it is:
… a tool for building, changing, and versioning infrastructure safely and efficiently. Terraform can manage existing and popular service providers as well as custom in-house solutions.
It is a tool built by practitioners to fulfill the requirements of the infrastructure as code approach. It is built as an orchestration tool, focused on visibility (execution plans and resource graphs) and change automation, with minimal human interaction.
Sounds great, and what is even more important, it is cloud / provider agnostic by design. That is a huge plus, especially when you need to consider either a hybrid cloud environment (mixing public cloud with on-premises) or a multiple cloud providers scenario (e.g. for disaster recovery or availability requirements). What is also very important: it does not focus on cloud APIs only – it incorporates various 3rd party APIs and cloud provider APIs in one place, enabling interesting scenarios – e.g. creating infrastructure inside a cloud provider, connecting it with your on-premises storage solution and combining it with an external DNS provider.
What is often mentioned as a benefit is its DSL – the HashiCorp Configuration Language (HCL). In my opinion it is not a revolution, but an evolution in the right direction. Still declarative, but more expressive and at the same time more concise than the verbose and hostile JSON / YAML formats. Internally, Terraform is compatible with the JSON format. I would not like to dive into details, because they are extensively covered in the really good documentation. One thing I want to point out – it is not a full programming language, and we will come back to that soon.
So far so good – the tool looked awesome to us. But before the final decision I wanted to check its state and how it “feels”. And that actually revealed a whole new perspective on it. First of all, from our perspective the tool is not as mature as we would like it to be. It is in its infancy stage – I scanned the GitHub repository just for bugs related to the AWS provider and it uncovered a long list. It is not a definitive list (probably many of them were neither confirmed nor triaged / prioritized), but it shows how rapidly the tool evolves. Second thing – which did not affect us directly, but we heard horror stories – there are some backward-incompatible changes from version to version. Again, that is totally understandable, taking into account that the tool is below 1.0, but it was not suitable for our needs – we wanted a stable and tested solution covering AWS services for us (so being cloud agnostic added nothing in our case; we discuss this later).
What struck me the most is how it works underneath. It does not use CloudFormation, but the AWS API directly – which has several consequences. One of them is the impossibility of performing a rollback when something goes crazy. The usual workflow is slightly different from the CloudFormation one: first we plan our changes, then we review them and decide whether to apply them or not. With CloudFormation it is possible to review changes as well, but if everything goes hazy for any reason (and it eventually will – trust me :wink:) it is able to roll back that change and return to the previous state.
That workflow also exposes one more disadvantage for us – Terraform is highly opinionated. It requires and assumes various things regarding your workflow. For us alone that would not make a difference, but the tool was not used inside the project, nor the company, in which we wanted to use it. That adds an additional impedance mismatch and requires effort around knowledge exchange and learning yet another tool. Last but not least – it is stateful. In the version we evaluated there was no easy way to share and lock state in a remote environment, which was critical from our point of view (and neither storing state on S3 nor inside the repository was an acceptable solution for us).
Up to that point we had not ditched Terraform yet, but we desperately searched for an alternative – we definitely needed to face CloudFormation.
CloudFormation is not a hostile environment
(nor a perfect one either)
We had already come across CloudFormation, as it was used by a client in a different project.
We had heard horror stories too, and we wanted to be prepared before starting the project. So I spent some time with the beast. And guess what? It is not as ugly as I initially thought.
The obvious advantage of this tool is better support for AWS services compared to 3rd party tools – no doubt about that. When a new AWS service is launched it is usually already supported by CloudFormation – in the worst case partially. However, there are some elements that either do not make sense in CloudFormation (e.g. registering a DNS name for a machine spawned inside an auto-scaling group) or are unsupported (e.g. ACM certificate import).
CloudFormation is stateless… except the time it is not
From your perspective, you do not need to bother with state management; however, that is not entirely true. CloudFormation preserves the stack state inside the service: the stack records the operations invoked inside your cloud (they are called events) and connects them with the resources. Updates are built on top of that state. Another point is exported outputs, which are globally shared within the same region and AWS account. This means that you cannot create the same stack twice from the same definition, because the declared exports will collide.
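As an illustration (the export name below is made up), a stack exporting an output looks like this – attempting to create a second stack that exports the same name in the same region and account will fail:

```yaml
# Hypothetical example: exporting a VPC id from a network stack.
Outputs:
  VpcId:
    Description: Id of the VPC created by this stack
    Value: !Ref Vpc
    # Export names are global per region and account - creating this
    # stack twice (or any other stack exporting the same name) fails.
    Export:
      Name: eu-west-1-core-vpc-id
```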
What about the learning curve?
Well, it turned out that it depends. It is definitely not as hostile an environment as it is advertised to be. If you google for it, you will see how much people hate it. Of course, those people did not lie, and in each accusation there is a grain of truth, but the situation changed drastically after one announcement.
The aforementioned announcement was the release of the YAML format for CloudFormation templates. Previously you could use only JSON. In my opinion it is a game changer. If you do not believe me, look at the screenshot posted above – the difference is at least noticeable. I have written templates in both formats and the difference is huge, starting from really basic stuff like support for comments (yes, JSON does not have comments at all) and support for multi-line strings (yay, no more string concatenation! This is really helpful for “AWS::CloudFormation::Init” sections and similar), down to smaller things like better support for invoking built-in functions. And last but not least – it is less verbose and less painful to modify and refactor. In JSON it is really hard to refactor huge chunks of a template while keeping it readable and still preserving syntax validation (the CloudFormation validation API accepts slightly malformed JSON, so you need two layers of checks).
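A minimal sketch of what YAML brings to the table (the resource and AMI id are placeholders):

```yaml
# Comments are finally possible.
Resources:
  AppInstance:
    Type: AWS::EC2::Instance
    Properties:
      ImageId: ami-12345678   # Placeholder AMI id.
      InstanceType: t2.micro
      UserData:
        # Short-form intrinsic functions (!Sub) and YAML block scalars
        # replace the painful JSON string concatenation.
        Fn::Base64: !Sub |
          #!/bin/bash
          echo "Running in region ${AWS::Region}" > /etc/motd
```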
Is that all? Not entirely. I told you that there are a couple of elements not supported directly in CloudFormation, which makes some operations really painful and semi-manual. It turned out that you can work around them either with some well-known hacks focused e.g. on the already mentioned (in a different context) “AWS::CloudFormation::Init”, or with a new and strongly advertised concept that we wanted to test in practice – automating operations with the AWS Lambda service. It turned out that with the help of SNS, SQS and the AWS SDK we were able to glue CloudFormation and other services together in an elegant way, in a serverless and event-driven fashion. But you have to do it deliberately, keeping all the pros and cons in mind.
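The usual pattern for such glue code is a Lambda-backed custom resource. A sketch, assuming a hypothetical `CertificateImporterFunction` Lambda (implemented with the AWS SDK, and responsible for signaling success or failure back to CloudFormation), could look like this:

```yaml
# Sketch of a Lambda-backed custom resource, e.g. for the unsupported
# ACM certificate import mentioned above. CertificateImporterFunction
# is a hypothetical Lambda function defined elsewhere in the template.
Resources:
  ImportedCertificate:
    Type: Custom::CertificateImport
    Properties:
      # ServiceToken tells CloudFormation which Lambda (or SNS topic)
      # handles the create / update / delete events for this resource.
      ServiceToken: !GetAtt CertificateImporterFunction.Arn
      CertificateBucket: my-certificates-bucket   # hypothetical
```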
We have already scratched the surface in the previous paragraphs. So far we are happy living with CloudFormation, but it has sharp edges too. In this section I would like to point out a couple of pitfalls that may be helpful for you to know about.
The most significant problem, covered in many discussions across the internet: there are horror stories regarding updates and change sets in CloudFormation. We did not have any problems with the list of changes generated by the service hiding actual operations or simply not showing everything (there are people claiming that this is still a thing). However, an update is harder than you think, in most cases because of replacements – it is sometimes hard to believe how a simple change can trigger a domino effect and cause seemingly unrelated elements to be rebuilt. And deleting elements is problematic too.
Another topic, already partially covered above, is exported outputs. Those globals in a stateless world are as helpful as they are annoying. On one hand, they are a neat way to connect stacks; on the other, it is really easy to tangle stacks together and introduce dependencies that are hard to reflect in the creation process.
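Cross-stack references through exports look innocent, which is exactly why they tangle stacks so easily. A hypothetical consumer stack (the export name matches nothing real):

```yaml
# Importing a value exported by another stack. From now on the exporting
# stack cannot delete or change that export while this consumer exists -
# an implicit dependency between the stacks is born.
Resources:
  AppSubnetAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      SubnetId: !ImportValue eu-west-1-core-app-subnet-id
      RouteTableId: !Ref AppRouteTable
  AppRouteTable:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId: !ImportValue eu-west-1-core-vpc-id
```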
Another painful element is naming (I will not repeat that old joke about hard things in computer science, but we all know it). A name should reflect the purpose, but in many cases also the location – for your convenience and better readability. In such cases you have to remember which services are globally available (Route53, IAM and S3) and which are not. And that is not the end of the story: for example, S3 is a global service, so at first glance you do not need to put the region in a bucket name. However, bucket location makes a huge difference in the scenario called cross-region replication, and having the region in the name is really helpful there. Decisions, decisions. Another naming-related pain point is limits. Strangely enough, the various name length limitations (e.g. the 64-character limit for an ELB name) are not validated via the API, only checked during stack execution – you will know them by heart after a couple of rounds of failures and rollbacks. Just be prepared. 😉
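One way to keep such names consistent is to build them from pseudo parameters instead of hardcoding the region (the naming scheme below is just an assumption):

```yaml
Resources:
  ReplicatedBucket:
    Type: AWS::S3::Bucket
    Properties:
      # Embedding the region via the AWS::Region pseudo parameter keeps
      # cross-region-replicated buckets distinguishable, producing e.g.
      # "myapp-eu-west-1-assets" without hardcoding the region.
      BucketName: !Sub "myapp-${AWS::Region}-assets"
```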
And last but not least: to start and stay small, you want to always be compliant with CloudFormation standards. Always use the parameter types provided by CloudFormation, which helps the validation API. Prepare your own naming convention (also for tags) and stick to it as much as you can. Keep stacks as small as possible and split them along two dimensions – by responsibility (e.g. component) and by common modification reason (e.g. elements that are often modified together). Use exports with caution and wisely (a naming convention also helps tremendously here).
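For example, AWS-specific parameter types let the validation step catch a wrong identifier before the stack even starts executing (the parameter names are made up):

```yaml
Parameters:
  TargetVpc:
    # AWS-specific type: the value is checked against VPCs that actually
    # exist in the account, unlike a plain String.
    Type: AWS::EC2::VPC::Id
  AppSubnets:
    Type: List<AWS::EC2::Subnet::Id>
    Description: Subnets the application will be deployed into
```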
In our case, where a deliberate vendor lock-in was available as an option, we chose CloudFormation and we do not regret that choice so far. We were able to build our infrastructure and tools on top of a solid and battle-tested service, fully compliant with our cloud computing provider of choice. However, I know that this is not a viable option for everyone – I hope that this deepened look at the topic and our rationale will help you make the right decision. Remember: there is no universal solution here. In case of any doubts, or if you would like to hear more about a certain topic, feel free to reach me in the comments below or directly at our contact email.