Intro
I would like to give an overview of how we manage our fleet of hundreds of physical servers: how to scale, what problems you might face, what is good and what can kick your ass. It is not another story about deployment automation. It is more about our learning path, which showed us that handling a few servers is a whole different game from rolling with a few hundred hosts processing billions of requests per day. Who knew, eh!
In this post, we will highlight some basic concepts that work for us. You can expect more articles from us covering these topics in greater detail.
So, what are we talking about…
We manage over 500 physical servers and a growing fleet in AWS.
On average, ~20,000 requests per second per region, and hundreds of thousands of dollars behind it.
To be clear: this is not our peak or some extraordinary situation, it is our everyday reality!
Who cares?
- In a relatively small team like ours, everybody needs to know what to do and how to do it, and to be a kind of firefighter when needed.
- At Appliscale, you deploy and take care of the fleet in your first week of work.
- You won't be able to manage such a task without the right tools. Yes, we have Jenkins and all the CI goodies, but everyone can run the same thing from localhost.
- Ops needs your help: they manage higher-level things like hardware and the network, but they may not be able to support your app to the required level.
- With a small team, you need to ensure everyone has experience with, and is responsible for, the platform. Otherwise you are lost. No, I didn't say you have a problem. That is: YOU ARE LOST!
Meat
As I said, don't ask why automation is a MUST. Let's dive into our stuff. I will present our tools along with some good practices which have worked for us.
Ansible
Why Ansible? The answer is simple: it is stateless. You don't need any additional agents or the like, just Python and an SSH connection.
Treat your scripts like any other code.
- One argument is that it's documentation for your infrastructure and deployments. I would even say that efficient and clean Ansible scripts are better than a plain old wiki page. If you keep them consistent across all of your services, then you have one playbook and don't need any additional training.
- Second: if it's code, review it like any other production implementation. In the end it goes to your servers and earns money for you. It is not as risky as it sounds, as long as you organise your deployment as an incremental process.
Split your playbooks.
- You start with one file containing all of the playbooks you have. That's completely OK. The point is to keep it easy to read and maintain. If you realise that you are wasting time scrolling through your playbooks looking for code, just split it into separate files. Simple and effective. Even if you are hearing about Ansible for the first time, such a split is easy to understand.
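The split itself can be as small as a top-level playbook that only imports the others. A minimal sketch for recent Ansible versions (file names are illustrative, not our actual layout):

```yaml
# site.yml – the top-level playbook does nothing but import the others,
# so each service keeps its plays in its own, easy-to-scan file.
- import_playbook: webservers.yml
- import_playbook: databases.yml
- import_playbook: monitoring.yml
```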
Triggers – consider twice.
- At first it sounds great: define a task, call it, and Ansible guarantees it will run once. But as your Ansible scripts grow and you add more and more plays reusing previously defined roles, it becomes more difficult to keep the triggers under control.
A few reasons for that:
- When composing your play from existing roles, you may not be aware that a trigger will be invoked, which may bring unexpected behaviour.
- You might add an intermediate task responsible for calling a trigger, but be careful: a so-called no-op (no operation) module won't invoke it. Specifically, a task must end up in changed status to call a trigger.
- When you find an unexpected server state and a trigger is suspected, debugging your logs might be difficult.
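For reference, what we call triggers here are Ansible's handlers, invoked via notify. A minimal sketch (host group, file and service names are illustrative) of the changed-status rule described above:

```yaml
# The handler runs at most once, at the end of the play, and only if a
# notifying task actually reported "changed". If the template renders
# the same file that is already on the host, no restart happens.
- hosts: appservers
  tasks:
    - name: Install application config
      template:
        src: app.conf.j2
        dest: /etc/app/app.conf
      notify: restart app

  handlers:
    - name: restart app
      service:
        name: app
        state: restarted
```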
SSH multiplexing.
- Not much to say about it: it gives you a speed-up for free! For us, it led to an improvement in the region of 30 to 50% in our deployment times. Good enough for three lines of configuration.
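Those lines live in ansible.cfg. A sketch of the kind of configuration meant here (values are illustrative; recent Ansible versions already pass ControlMaster/ControlPersist by default, so the usual tuning is a longer persist time):

```ini
# ansible.cfg – keep one SSH connection per host alive between tasks
# instead of paying the handshake cost for every single task.
[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=10m
pipelining = True
```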
Virtualenv.
- The beauty of Ansible is that you just need Python packages to run it. That being said, it's a good idea to manage the version using a Python virtual environment. Assuming you are using Jenkins or any other tool which is not under your control, you are always safe because you are completely independent of the host's Ansible version. Moreover, migration to a newer version is also easier: you test it locally, then you just update your deployment script. No changes on the CI host are needed, and there is no impact on the other scripts.
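A minimal sketch of the idea, assuming python3 with the venv module is available; the directory name and pinned version below are illustrative, not our actual setup:

```shell
# Create a project-local virtual environment: pip and python now come
# from the venv, not from whatever the CI host happens to have.
python3 -m venv .ansible-venv
.ansible-venv/bin/pip --version

# The deployment script would then pin its own Ansible, e.g.:
#   .ansible-venv/bin/pip install 'ansible==2.9.27'
#   .ansible-venv/bin/ansible-playbook deploy.yml
```

Upgrading Ansible then means changing one pinned version in the repository and testing it locally, while the CI host stays untouched.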
Incremental deployment.
- Two terms you need to know from Ansible's world:
- Forks – to simplify, the number of processes able to run your tasks simultaneously.
- Serial – the maximum number of hosts which can be handled simultaneously.
Use them for incremental deployment – obviously, you cannot pull down the whole fleet at the same time. That much is a common approach, but going deeper, you can detect places where a high serial value makes your scripts fail.
Example: you want to run the deployment on ten hosts at a time, but your script fails because of the artefact repository's bandwidth limitations. You can then lower the serial parameter to 5 for downloading the artefact, leaving the value of ten for the rest of the tasks.
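One way to sketch that example in a playbook, assuming a recent Ansible (the per-task throttle keyword needs 2.9+; URLs and names are illustrative):

```yaml
# Roll through the fleet ten hosts at a time, but let only five of them
# download the artefact simultaneously, because the repository's
# bandwidth is the bottleneck.
- hosts: appservers
  serial: 10
  tasks:
    - name: Download artefact
      get_url:
        url: https://repo.example.com/app.tar.gz
        dest: /opt/app/app.tar.gz
      throttle: 5

    - name: Unpack artefact
      unarchive:
        src: /opt/app/app.tar.gz
        dest: /opt/app
        remote_src: yes
```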
Monitoring.
- Probably you have some. And probably you are not the only one looking at it. Be a good guy and don't scare the others! We mute some of the alerts for the duration of a deployment. Our scripts send events (specific to DataDog) and messages to Slack, so we know what is happening, whether it was a manual investigation or an automated build.
- Example: in DataDog we see our servers going down, but at the same time we see the matching events from our Ansible scripts.
Jenkins
Yeah! Our good old Jenkins. It's well known, quite simple, and flexible enough. Just two tips which make our lives better.
Job DSL.
- There is nothing wrong with creating Jenkins jobs by hand. But if you have quite a lot of them, terrifying thoughts come to mind. What if you lose your Jenkins host one day? What if you want to migrate it? Yes, these cases are rare and may never happen. Then there is another thing: you want to change some parameter for all jobs. Well…
- Job DSL allows you to code your jobs in Groovy. Then you can keep them in a repo and version them. The entry barrier is really small. Just check out our post to get more details.
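A minimal Job DSL sketch of such a job (the job name, repository URL and playbook are illustrative assumptions, not our actual setup):

```groovy
// Seed script: running this through the Job DSL plugin (re)creates the
// job, so losing or migrating the Jenkins host is no longer terrifying.
job('deploy-app-production') {
    parameters {
        stringParam('TARGET_HOSTS', 'all', 'Ansible --limit pattern')
    }
    scm {
        git('https://git.example.com/ops/ansible-playbooks.git')
    }
    steps {
        // The build step is just the same ansible-playbook call you
        // could run from your own machine.
        shell('ansible-playbook -i production deploy.yml --limit "$TARGET_HOSTS"')
    }
}
```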
Simplicity.
- Having Ansible automation, it's good to keep your Jenkins jobs clear and simple. It's best if a job just fires some Ansible playbook, without any additional steps. That way you are always able to do the same from your local machine, and looking at the scripts you know exactly what goes to your servers.
Datadog
This is our main monitoring tool. It has a lot of features and can be easily integrated with many popular applications.
We are using two integration methods:
- SNMP – many applications already expose it, so this is the quickest way to integrate, but it requires static configuration files, which might not stay accurate in a dynamic environment.
- DogStatsD – you can send your metrics to the DataDog agent via UDP. A very flexible way, with no configuration required.
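The DogStatsD datagram format is simple enough to show in a few lines. A minimal sketch in Python using a raw UDP socket (in practice you would use the official datadog client library; metric and tag names are illustrative):

```python
import socket

def send_metric(name, value, metric_type="c", tags=(),
                host="127.0.0.1", port=8125):
    """Send one metric to a local DogStatsD agent over UDP.

    The datagram follows the DogStatsD line format:
    metric.name:value|type|#tag1:v1,tag2:v2
    """
    payload = f"{name}:{value}|{metric_type}"
    if tags:
        payload += "|#" + ",".join(tags)
    # UDP is fire-and-forget: no connection setup, and no error even
    # if no agent is listening, so the app never blocks on monitoring.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload.encode("ascii"), (host, port))
    sock.close()
    return payload  # returned only to make the format easy to inspect

# Count one handled request, tagged with region and environment.
send_metric("myapp.requests.handled", 1, "c", ("region:eu", "env:prod"))
```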
Events and alerts.
- Events are short messages sent from your application. In DataDog they look like an RSS feed. We use them to track server restarts: a quick check of why I see restarts, whether it was a manual investigation or an automated job.
Pager duty and muting.
- When you feel responsible for your application, you ask yourself how to improve your support. When you have too many statistics, all combined with alerts, events and other notifications, there is a risk that you will start ignoring them. This is something we wanted to avoid from the very beginning, which is why we integrated DataDog with PagerDuty. It helps us organise on-call support: when something goes wrong, DataDog triggers an alert which then notifies one of us through PagerDuty. It has impressive support for Slack and the mobile app. Of course, this path is used only for the most important problems.
As I mentioned earlier, some of the alerts are muted during the deployments.
Local statistics.
- From time to time there are small issues with DataDog – it might be a connection problem or a data inconsistency. It is rare, but it can happen. For that reason, we encourage keeping additional statistics close to your application. In our case, this statistics page is local to each of the servers. It is both a quick health check and a confirmation when some failure looks suspicious to us.
Summary
Working at such a large scale is a pleasure. The price of that pleasure is responsibility, and responsibility makes you much more creative. In many cases, we are working on a solution before the OPS team even notifies us about the failure.
I hope you will find this article helpful in your day-to-day work. It all comes from our experience, and we are still improving our processes. You can expect more details in further posts. In the meantime, share your experience! We are always open to suggestions and bright ideas!