AWS ECS Auto Scaling

AWS ECS (Elastic Container Service) is an orchestration service for containerized workloads in the AWS cloud. At the time of writing, ECS is almost 5 years old, so the core functionality of the service is familiar to the cloud community. Auto-scaling it, however, has not always been straightforward. Recent updates allow for a new approach, which we recently implemented. I would like to walk through some of the steps we took and hopefully save you some time in the process.

With ECS, we have 2 options. We can go the serverless path and pick AWS Fargate, letting AWS take care of the compute resources needed to run the cluster so that we can focus on building applications. However, when we would like more control over the scaling and deployment of the cluster’s underlying workers, the preferred approach is an EC2-powered cluster.

For this approach, we need to create an Auto Scaling group (ASG) with a Linux (or even Windows) AMI. AWS supplies ECS-optimized images, but we can also build a custom AMI with the ECS agent installed.

What is auto-scaling? Briefly, it is a set of activities executed in real time to find an equilibrium, a golden mean, between the ability to handle our application’s peak loads and the lowest possible cost of the underlying infrastructure. When we think about auto-scaling an EC2-powered ECS cluster, we need to consider 2 dimensions: compute resources scaling and service scaling.

Regarding compute resources scaling: of course, we can scale the ASG with native Auto Scaling mechanisms and adjust the desired/minimum/maximum number of instances, but this may not suit every use case. It may also require custom tooling (e.g. Lambda), which adds management overhead and may reduce our cost-effectiveness. To provide a better scaling experience for ECS users, AWS developed a feature named Cluster Auto Scaling, which relies on Capacity Providers. A Capacity Provider is a 1-to-1 link between an ASG and an ECS cluster, and it has 2 objectives:

Automatically scale out the ASG (add instances) if there is not enough compute capacity (CPU, memory, available ports, etc.) to run the tasks the customer is trying to run
Automatically scale in the ASG (remove instances) if it can be done without disrupting any running tasks
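
As a rough illustration of how such a link could be described through the SDK, the sketch below assembles the arguments for boto3's `ecs.create_capacity_provider` call. The provider name, the ASG ARN, and the numeric values are placeholders chosen for this example, not values from our setup, and the block only builds the request so it can be inspected without AWS credentials.

```python
from typing import Any, Dict


def build_capacity_provider_request(
    name: str,
    asg_arn: str,
    target_capacity: int = 100,
) -> Dict[str, Any]:
    """Build keyword arguments for boto3's ecs.create_capacity_provider.

    target_capacity is the utilization percentage (1-100) that managed
    scaling tries to keep; values below 100 leave spare room in the cluster.
    """
    if not 1 <= target_capacity <= 100:
        raise ValueError("target_capacity must be between 1 and 100")
    return {
        "name": name,
        "autoScalingGroupProvider": {
            "autoScalingGroupArn": asg_arn,
            # Let ECS scale the ASG out and in on our behalf.
            "managedScaling": {
                "status": "ENABLED",
                "targetCapacity": target_capacity,
            },
            # Avoid terminating instances that still run tasks; this needs
            # instance scale-in protection switched on in the ASG itself.
            "managedTerminationProtection": "ENABLED",
        },
    }


if __name__ == "__main__":
    # Placeholder identifiers, for illustration only.
    request = build_capacity_provider_request(
        name="example-capacity-provider",
        asg_arn="arn:aws:autoscaling:eu-west-1:123456789012:autoScalingGroup:example",
        target_capacity=90,
    )
    print(request)
    # With boto3 installed and credentials configured, the call would be:
    #   import boto3
    #   boto3.client("ecs").create_capacity_provider(**request)
```

The provider still has to be associated with a cluster afterwards, which is a separate API call and is left out of this sketch.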

To achieve these objectives, the Capacity Provider calculates a metric: the percentage ratio between how big the ASG needs to be to host the currently running and scheduled tasks and how big it actually is. To capture tasks that the customer is trying to run but that won’t fit on existing instances, the ECS task lifecycle was adapted: such tasks remain in a PROVISIONING state until they are placed on an ECS instance. We can also configure the Capacity Provider to maintain some spare capacity in the cluster, e.g. when we use blue/green deployments and would like to always be able to deploy a new version of a component without waiting for new instances to spin up. It is also worth mentioning that with Capacity Providers we still maintain full control of the ASGs: we can freely modify all parameters and use other scaling policies as well. This sounds cool; however, at the time of writing, Cluster Auto Scaling also has some not-so-cool limitations. They all seem related to its young age and should be fixed in time:

Capacity Providers are not covered by CloudFormation, so to have them defined as code, custom CLI/SDK code needs to be injected (Terraform supports them!).
Instance scale-in protection must be enabled on the ASG’s instances for the Capacity Provider’s managed termination protection to work, which means that simple scheduled actions to terminate instances for nights/weekends in development environments will not work.
It is not possible to configure the thresholds or the number of data points to alarm for the alarms related to Capacity Providers; they are set automatically, can’t be modified, and may not be right for all use cases.
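
To make the Capacity Provider metric more concrete, the ratio described earlier can be sketched as plain arithmetic. This is a simplified model based on my reading of the deep-dive post listed in the sources, where M is the number of instances needed and N the number currently running. The function names are made up for this example, and the real service applies the result through its own scaling policy rather than in one jump.

```python
def capacity_provider_reservation(needed: int, running: int) -> float:
    """Percentage ratio of instances needed (M) to instances running (N)."""
    if needed < 0 or running < 0:
        raise ValueError("instance counts must not be negative")
    if running == 0:
        # Special cases as described in the deep-dive post: an empty, idle
        # cluster reports 100, an empty cluster with pending tasks reports 200.
        return 200.0 if needed > 0 else 100.0
    return needed / running * 100.0


def desired_instances(needed: int, target_capacity: int = 100) -> int:
    """Smallest ASG size that keeps the reservation at or below the target."""
    if not 1 <= target_capacity <= 100:
        raise ValueError("target_capacity must be between 1 and 100")
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-needed * 100 // target_capacity)


if __name__ == "__main__":
    # Above 100: more instances are needed than are running (scale out).
    print(capacity_provider_reservation(needed=5, running=4))
    # Below 100: the cluster is larger than the load requires (scale in).
    print(capacity_provider_reservation(needed=3, running=4))
    # A target below 100 keeps spare capacity, e.g. for blue/green deployments.
    print(desired_instances(needed=9, target_capacity=90))
```

With a target of 100 the ASG is sized to exactly what the tasks need; lowering the target is how the spare capacity mentioned above is expressed.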

Regarding ECS Service scaling: sometimes it is necessary to dynamically change the number of tasks within a particular ECS Service, e.g. to start new replicas of a component during peak loads so that processing stays smooth. The feature that manages such actions is called Service Auto Scaling. We set the desired/minimum/maximum number of tasks within the Service and choose either a Target Tracking or a Step Scaling policy.

The Target Tracking policy can be considered the smarter one: it tries to keep a particular metric at a configured level. For example, if we set ALBRequestCountPerTarget = 100 and a Service running a single task gets a spike of 400 ALB requests, the Service will start 3 new tasks so that each task handles about 100 requests. After some time, when the load cools down to 100 ALB requests, 3 tasks will be stopped, as a single task is enough to handle the incoming requests.
The Step Scaling policy is not as smart as Target Tracking: we bind a CloudWatch alarm to it and define the actions to take when the alarm is triggered or the metric reaches a certain level. Possible actions are adding tasks, removing tasks, or setting the number of tasks to a particular value. If we would like to use this type of policy, we will need to manage the alarms ourselves, and it can be tricky to determine valid thresholds for advanced multi-level scaling.
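
The difference between the two policy types can be sketched with a little arithmetic. The functions below are a simplified model written for this article, not the algorithm AWS runs: real Target Tracking is alarm-driven and more cautious when scaling in, and the step definitions and limits here are hypothetical values.

```python
import math
from typing import List, Optional, Tuple

# (lower bound, upper bound, change in task count). Bounds are offsets from
# the alarm threshold; an upper bound of None means "no upper limit".
Step = Tuple[float, Optional[float], int]


def clamp(value: int, low: int, high: int) -> int:
    """Keep the task count within the Service's minimum and maximum."""
    return max(low, min(high, value))


def target_tracking_desired(
    current_tasks: int,
    metric_per_task: float,
    target: float,
    min_tasks: int = 1,
    max_tasks: int = 10,
) -> int:
    """Proportional estimate of the task count that brings the metric to target."""
    if target <= 0:
        raise ValueError("target must be positive")
    wanted = math.ceil(current_tasks * metric_per_task / target)
    return clamp(wanted, min_tasks, max_tasks)


def step_scaling_desired(
    current_tasks: int,
    metric: float,
    threshold: float,
    steps: List[Step],
    min_tasks: int = 1,
    max_tasks: int = 10,
) -> int:
    """Apply the first step whose interval contains the threshold breach."""
    breach = metric - threshold
    for lower, upper, change in steps:
        if breach >= lower and (upper is None or breach < upper):
            return clamp(current_tasks + change, min_tasks, max_tasks)
    return current_tasks  # no step matched, so nothing changes


if __name__ == "__main__":
    # The example from the text: 1 task, 400 requests per target, target 100.
    print(target_tracking_desired(1, 400, 100))
    # Later the same 100 requests are spread over 4 tasks (25 per target).
    print(target_tracking_desired(4, 25, 100))
    # Hypothetical two-level step policy on a CPU alarm with threshold 70.
    cpu_steps: List[Step] = [(0, 15, 1), (15, None, 3)]
    print(step_scaling_desired(2, 90, 70, cpu_steps))
```

The step variant shows why thresholds are harder to get right: every interval and adjustment is a number we have to pick and maintain ourselves.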

It is worth mentioning that the web console only lets us choose from some predefined metrics, but through the API or CloudFormation we can also use our own custom CloudWatch metrics! Such a feature is very useful but, at the time of writing, unfortunately not well documented. An example of a CloudFormation Auto Scaling policy of the Target Tracking type, based on a custom CloudWatch metric, is outlined below:

AutoScalingPolicy:
  Type: AWS::ApplicationAutoScaling::ScalingPolicy
  Properties:
    PolicyName: scaling-based-on-custom-metric-policy
    # Scalable target (the ECS Service) defined elsewhere in the template
    ScalingTargetId: !Ref ScalableTarget
    PolicyType: TargetTrackingScaling
    TargetTrackingScalingPolicyConfiguration:
      # Metric value that the policy tries to maintain
      TargetValue: 100
      # Cooldown periods, in seconds
      ScaleInCooldown: 120
      ScaleOutCooldown: 120
      # Custom CloudWatch metric, identified by name, namespace and dimensions
      CustomizedMetricSpecification:
        Dimensions:
          - Name: Environment
            Value: Development
          - Name: Region
            Value: !Ref AWS::Region
          - Name: Type
            Value: mean-rate per-second
        MetricName: /some/custom/metric
        Namespace: some/custom/namespace
        Statistic: Average
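
For the API route, a roughly equivalent request could be assembled for boto3's Application Auto Scaling `put_scaling_policy` call. The sketch below mirrors the template above; the cluster name, service name, and region are hypothetical placeholders, and the block only builds the request so that it can be inspected without AWS access.

```python
from typing import Any, Dict


def build_scaling_policy_request(cluster: str, service: str, region: str) -> Dict[str, Any]:
    """Build keyword arguments for application-autoscaling put_scaling_policy."""
    return {
        "PolicyName": "scaling-based-on-custom-metric-policy",
        "ServiceNamespace": "ecs",
        # ECS Services are addressed as service/<cluster name>/<service name>.
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": 100.0,
            "ScaleInCooldown": 120,
            "ScaleOutCooldown": 120,
            "CustomizedMetricSpecification": {
                "MetricName": "/some/custom/metric",
                "Namespace": "some/custom/namespace",
                "Statistic": "Average",
                "Dimensions": [
                    {"Name": "Environment", "Value": "Development"},
                    {"Name": "Region", "Value": region},
                    {"Name": "Type", "Value": "mean-rate per-second"},
                ],
            },
        },
    }


if __name__ == "__main__":
    # Placeholder names, for illustration only.
    request = build_scaling_policy_request("example-cluster", "example-service", "eu-west-1")
    print(request["ResourceId"])
    # With boto3 installed and credentials configured, the call would be:
    #   import boto3
    #   boto3.client("application-autoscaling").put_scaling_policy(**request)
    # The Service has to be registered as a scalable target beforehand.
```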

For both policy types, we can also set a cooldown period, which defines the minimum number of seconds between scaling actions.

Finally, to provide full transparency about the scaling actions that happen in an ECS cluster, we can easily configure a notification stream. We use Amazon EventBridge rules to detect scaling activities at both the compute resources scaling level and the Service Auto Scaling level. With these rules in place, we can either write a custom Lambda function that parses the events and forwards the details to e.g. Slack, or, even more straightforwardly, use the AWS Chatbot service to link the cloud with Slack/Amazon Chime and receive scaling notifications on a particular channel.
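
A custom Lambda function for the compute side could look roughly like the sketch below. It covers only EC2 Auto Scaling launch and terminate events; a rule for Service Auto Scaling activity would need its own pattern. The `SLACK_WEBHOOK_URL` environment variable and the message layout are assumptions made for this example, not part of our actual setup.

```python
import json
import os
import urllib.request
from typing import Any, Dict

# Event pattern for an EventBridge rule that matches instances being
# launched or terminated by an Auto Scaling group.
ASG_EVENT_PATTERN = {
    "source": ["aws.autoscaling"],
    "detail-type": [
        "EC2 Instance Launch Successful",
        "EC2 Instance Terminate Successful",
    ],
}


def format_message(event: Dict[str, Any]) -> str:
    """Turn an EC2 Auto Scaling event into a one-line chat message."""
    detail = event.get("detail", {})
    region = event.get("region", "unknown region")
    action = event.get("detail-type", "Scaling event")
    instance = detail.get("EC2InstanceId", "unknown instance")
    group = detail.get("AutoScalingGroupName", "unknown group")
    when = event.get("time", "unknown time")
    return f"[{region}] {action}: {instance} in {group} at {when}"


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, str]:
    """Format the event and post it to Slack if a webhook URL is configured."""
    payload = {"text": format_message(event)}
    webhook = os.environ.get("SLACK_WEBHOOK_URL")  # hypothetical variable name
    if webhook:
        request = urllib.request.Request(
            webhook,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request, timeout=5) as response:
            response.read()
    return payload
```

Keeping the formatting in its own function makes it easy to test with a sample event, while the handler stays a thin wrapper around the webhook call.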

To sum up, auto-scaling solutions for both dimensions of the ECS cluster are definitely worth considering for ECS users. They work well together and can reduce the AWS bill while preserving the ability to handle peak loads. The teething issues should be eliminated in the near future, so maybe it is a good idea to start experimenting with them right away?

Sources:
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/cluster-capacity-providers.html
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-auto-scaling.html
https://aws.amazon.com/blogs/containers/deep-dive-on-amazon-ecs-cluster-auto-scaling/
https://docs.aws.amazon.com/cli/latest/reference/application-autoscaling/put-scaling-policy.html
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs_cwe_events.html