The infrastructure automation dilemma
There comes a time when every cloud computing project has to make a difficult choice – should we automate infrastructure or not?
There are pros and cons, but at a certain scale automation is something you cannot escape.
Immediately after you decide to automate the provisioning of your cloud infrastructure, another question pops up – how should we do it? Should we use a dedicated tool such as AWS CloudFormation, Azure Resource Manager templates, OpenStack Heat or Google Cloud Deployment Manager, or a provider-agnostic solution? And then comes a surprise. At first glance there is no tool on the market comparable to Terraform (in fact there is one – it is called Foreman). So we have a good tool, built by amazing people at HashiCorp, that does not rely on any particular cloud – problem solved? Not entirely.
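To make the comparison concrete, here is a minimal Terraform sketch of the kind of thing these tools describe. It is an illustrative assumption, not an example from any real system – the region, AMI ID and instance type are placeholders you would replace with your own values:

```hcl
# The provider block is the only place that names AWS explicitly;
# everything else is declarative resource configuration.
provider "aws" {
  region = "us-east-1" # placeholder region
}

# One web server instance; the AMI ID below is a placeholder.
resource "aws_instance" "web" {
  ami           = "ami-0abcdef1234567890"
  instance_type = "t2.micro"

  tags = {
    Name = "example-web-server"
  }
}
```

Note that while the workflow (`terraform plan` / `terraform apply`) and the configuration language are provider-agnostic, the resource types themselves (`aws_instance`) are still tied to AWS – which is exactly why Terraform alone does not make your system portable.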
Just as an ORM does not really let you swap databases on the fly, Terraform will not let you automatically switch cloud providers for your entire system (assuming no upfront investment in the design and architecture of your system). On the other hand, you are afraid of vendor lock-in, and you are surely considering how to design and prepare disaster recovery scenarios. A lot of unknowns, aren't there?
I would like to tackle the why first. Why should we care about automating infrastructure in the first place? Is scale the only reason to strive for automation?
Scale is not the only reason for automation – disaster is
One obvious reason why our infrastructure should be automated appears as soon as we do the very same thing (e.g. setting up the same web server) for the nth time in a row. Our lazy nature immediately kicks in. We feel that there has to be a better way to spend our time than GUI click-fests or ad-hoc scripting excellence challenges.
But that is neither the only nor the main reason for automating. The most important reasons are the ones we often overlook. Humans are pretty optimistic about how things will work in the future.
Recent events (on either the 1st or the 28th of February – the date varies depending on whether you are a DBA or an operations engineer using AWS infrastructure in the us-east-1 region) have shown how bad we are at predicting disasters. Up to that day, disaster recovery was the topic most often dodged in meetings – the "it will not happen to us" fallacy.
After such incidents, many of us dust off the disaster recovery manuals and start to report on the state of RPO and RTO for our businesses. Leaving aside backups (a really deep topic in itself), there is no better way to gracefully recover from infrastructure failures (including provider issues) than tried-and-tested infrastructure automation that is in daily use. Period.
Speaking of disasters and cloud environments: the decline of bare metal in favour of cloud computing has enabled many scenarios that were impossible back in the day. There is no need to think about racks and other hardware, secured premises and policies – you just need a credit card, and you are ready to use the seemingly endless capabilities of cloud computing.
We tend to forget that someone has to provide and maintain the cloud resources we utilise. There is no free lunch here either, and eventually and inevitably we will face hardware issues, even in our virtualised cloud environment. Of course, we will not see it as a cut-off rack or dead fans – our machines will simply disappear. And we need to be prepared for such situations as well.
What if it hits us so hard that we lose a Very Important Machine™ that is needed the next day for a demo? Do you have all those configuration files stored in the back of your head? You would be more than happy if you had a backup of that machine – but what if you don't? Automation is again the answer.
Operational Excellence
Who likes typing or clicking the same thing over and over again? As IT professionals, we are used to writing small scripts and leveraging computers for what they excel at – automation.
It won't come as a surprise that manual work is error-prone and mind-numbingly boring – it should be avoided at all costs! Of course, there is a level of work above which automation is too expensive, but we should still strive to automate as much as we can, taking cost efficiency into account.
Deciding when to stop should be straightforward: if the cost of recovering from failures caused by manual work exceeds the cost of introducing automation, you should automate. That simple heuristic is fine unless you are doing proper risk evaluation – but in that case, you do not need advice on when to stop your automation efforts.
There is one more aspect of eliminating manual work – it shows how mature we are in terms of operational excellence. It is all about sustainability and predictability. If we can deliver on our goals even in the face of unexpected trouble, so much the better for us and our business. But remember – operational excellence is not a goal, it is a process. You need to nurture and develop it. Motivation is for beginners; grit and routine are for professionals.
Vendor lock-in is a thing
In this section, I would like to point out a real threat related to infrastructure automation – vendor lock-in. It is neither a myth nor a boogeyman. Relying heavily on CloudFormation and adding glue on top of it in the form of AWS Lambda should be a deliberate choice. It is a considerable liability if you ever have to consider moving out of that cloud.
If you are not Snapchat and do not have 2 billion USD to spend over 5 years with a particular cloud provider, you have surely considered what happens if you want to use a different provider, or physical infrastructure, for your product. Moving between clouds, or between a cloud and physical hardware, is a big project in itself – sprinkling it with additional services provided by your cloud complicates it even more.
In most cases, if you were careful, it will be mostly an infrastructure and operations effort. However, some services look so shiny, so easy to use and so cheap (like AWS Lambda) that incorporating them looks like a really good choice – except when you consider moving out of a particular provider and have to dedicate additional development and operations effort to the migration.
In our client's case, the decision was actually simpler than you might think. We are moving out of physical data centers to the cloud of our choice – AWS was picked as the most mature one. There is no plan for migrating out of the cloud (we were not the first project to move there, and a huge amount of development work had already been done, including custom services on top of the AWS API), and physical infrastructure is neither elastic (obviously) nor as cheap as you may think. Tests performed in the cloud have already confirmed that, even for our demanding domain (ad tech), the chosen provider will be sufficient for our needs. And it gives us more than the physical world did.
So you can consider it a deliberate vendor lock-in, with all the consequences. We were in the physical world, we know how it works and how it feels, and we still need more – that is why we chose the cloud. But it is not the usual path to take – you need to evaluate it on your own. And whichever route you choose, automation will help you handle it like a boss.
Summary
Armed with the knowledge of why we should automate infrastructure, and eager to do it, we need to decide how to tackle the problem. It is not an easy choice – each project has different motivations, goals and rationale for solving it. In the next article, I would like to shed some light on that topic and show you one route that you may consider analysing for your project.