Docker and Spot Instances: A Match Made in Heaven

Docker containers have been growing quickly in popularity as a deployment technology. Docker offers many advantages, including better portability, smoother CI/CD, and better isolation and environment control. Many articles cover these and more (links to a few at the end of this post), but this post is about how the characteristics of Docker play well with AWS Spot instances, and with Amazon’s EC2 Container Service (ECS) Docker orchestration offering in particular, to run a highly available cluster at 60-75% lower cost.

Container Startup Time

The first and probably most significant characteristic is the quick startup time. Most of the time spent starting a new instance (or any VM) typically goes to loading the operating system. Since Docker containers run on top of a shared OS, there is no need to initialize a new OS. A Docker container typically launches in only a second or two, sometimes sub-second. As we discussed in the Cloud Auto-Scaling: Simple Concept, Hard to Get Right post, anything you can do to shorten the time between detecting that you need more capacity and actually getting it makes a big difference in how well you can auto-scale. Launching more capacity in a couple of seconds rather than 5 minutes or more makes a huge difference.

But savvy readers are probably thinking: “but wait a second, those containers have to run on instances somewhere, so the capacity had to be there already for you to launch in under a second!” That is absolutely correct, but the key is that the extra level of abstraction creates an opportunity for better optimization and capacity planning. Let’s take the case of a system with two different container types, each one working on a separate task queue. If the load on one queue spikes disproportionately, you can spin up more containers of that type to work the queue, using some of the capacity the other task queue is not using. Without containers, you would need a separate auto-scaled cluster for each task queue, and neither could use the excess capacity of the other. The unit of scaling is reduced to containers, which start very quickly and can therefore adapt to changing loads very quickly.
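To make the two-queue case concrete, here is a minimal sketch of splitting a fixed pool of container slots by backlog. It is not how ECS itself schedules tasks: the function name `allocate_containers`, the queue names, the slot count, and the proportional split are all assumptions chosen for illustration.

```python
def allocate_containers(queue_depths, total_slots, min_per_queue=1):
    """Split a fixed pool of container slots across queues by backlog.

    queue_depths: dict mapping queue name -> number of pending messages.
    total_slots: how many containers the shared cluster can run (assumed fixed).
    min_per_queue: floor so an idle queue keeps at least one worker.
    """
    if total_slots < min_per_queue * len(queue_depths):
        raise ValueError("not enough slots to give every queue its minimum")

    # Start every queue at its floor, then hand out the remaining slots
    # in proportion to each queue's share of the total backlog.
    allocation = {name: min_per_queue for name in queue_depths}
    spare = total_slots - min_per_queue * len(queue_depths)
    backlog = sum(queue_depths.values())
    if backlog == 0 or spare == 0:
        return allocation

    shares = {name: spare * depth / backlog for name, depth in queue_depths.items()}
    for name, share in shares.items():
        allocation[name] += int(share)  # whole slots first

    # Slots lost to rounding go to the queues with the largest remainders.
    leftover = total_slots - sum(allocation.values())
    by_remainder = sorted(shares, key=lambda n: shares[n] - int(shares[n]), reverse=True)
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation


# Hypothetical queues sharing a cluster with room for 10 containers.
balanced = allocate_containers({"resize": 100, "transcode": 100}, total_slots=10)
spiked = allocate_containers({"resize": 900, "transcode": 100}, total_slots=10)
print(balanced)  # {'resize': 5, 'transcode': 5}
print(spiked)    # {'resize': 8, 'transcode': 2}
```

When the hypothetical "resize" queue spikes, it borrows slots that "transcode" is not using, which is exactly what two separate auto-scaled clusters could not do.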

Container Instance Startup Time

Ok, but what about the case where both queues spike and there is no more excess capacity? In this case, we are going to need another container instance to house the additional container instantiations required.

A quick definition of terms here, because the term “instance” has become a tad overused, and this can be confusing if you don’t understand the difference. An “instance” is a full running EC2 instance, such as one c4.large. A “container instance” is a full running EC2 instance that has the software installed on it (the ECS agent) to run multiple containers. The confusing part is what to call the instantiations of individual containers. Some people use the term “container instance” for this, hence the confusion. AWS chose the term “task” for this concept, and “task definition” for the definition of the container, so I will stick to these terms for the remainder of this post.

So back to our scenario of both queues spiking and needing more tasks (container instantiations) to work on them. In this case we will have to wait while the new container instance spins up before launching the new tasks. But even here there is still an advantage. AWS has ECS-optimized AMIs that launch and join the cluster faster than the average instances we see in normal Auto Scaling groups. Part of this is because the entire health check and registration process normally associated with an Auto Scaling group and Elastic Load Balancer can be bypassed, but some of it is the result of good AMI tuning by AWS engineers. It depends on the instance type you are using, of course, but we often see new container instances register themselves into ECS in as little as 2-3 minutes. That is considerably quicker than the approximately 5 minutes or more that we often see for traditional auto-scaled applications that have their own custom AMI. Once the instance is up and registered, a flurry of tasks can be started in seconds to work on both queues.

More Metrics

Another advantage is that you have more metrics available to help you predict when you are likely to need another container instance. Each task definition specifies how much memory and CPU it needs to run. ECS tracks this and provides CloudWatch metrics both for the actual values being used and for how much is “reserved” according to the task definitions and the number of tasks being run. So for each ECS cluster you have 4 key metrics:

• CPU Utilization
• CPU Reservation
• Memory Utilization
• Memory Reservation
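As a rough illustration of where the two reservation figures come from: a reservation is the total that running tasks have reserved, divided by the total registered by the container instances in the cluster. The sketch below assumes a hypothetical cluster of two instances and six identical tasks; the function name and all of the numbers are made up for the example.

```python
CPU_UNITS_PER_VCPU = 1024  # ECS expresses CPU in units; 1024 units = 1 vCPU


def reservation_percent(task_reservations, instance_capacities):
    """Percent of registered capacity that running tasks have reserved.

    task_reservations: one reserved amount per running task.
    instance_capacities: one registered amount per container instance.
    Both lists must use the same unit (CPU units, or MiB of memory).
    """
    registered = sum(instance_capacities)
    if registered == 0:
        return 0.0
    return 100.0 * sum(task_reservations) / registered


# Hypothetical cluster: two container instances with 2 vCPUs each and
# 3840 MiB of memory assumed registered per instance.
instance_cpu = [2 * CPU_UNITS_PER_VCPU, 2 * CPU_UNITS_PER_VCPU]
instance_memory = [3840, 3840]

# Six running tasks, each reserving 256 CPU units and 512 MiB.
task_cpu = [256] * 6
task_memory = [512] * 6

cpu_reservation = reservation_percent(task_cpu, instance_cpu)
memory_reservation = reservation_percent(task_memory, instance_memory)
print(cpu_reservation)     # 37.5
print(memory_reservation)  # 40.0
```

The utilization metrics are measured rather than declared, so they cannot be derived from task definitions this way.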

In traditional auto-scaled applications we typically have only CPU Utilization to use for scaling purposes, unless we build custom metrics. Utilization will typically run lower than reservation, but that obviously depends on how accurate your task definitions are. So one effective scaling strategy is to scale on whichever is higher, the utilization or the reservation. Since the reservation typically leads the actual utilization, you get a little extra predictive lead time, and you will be at least partway through the container instance initialization, if not fully through it, by the time additional tasks are being scheduled to handle the load.
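A minimal sketch of that "scale on the higher value" rule follows. The `ClusterMetrics` class, the function names, and the 75% threshold are assumptions for illustration; in practice the readings would come from the cluster's CloudWatch metrics.

```python
from dataclasses import dataclass


@dataclass
class ClusterMetrics:
    """The four per-cluster readings, each as a percentage (0-100)."""
    cpu_utilization: float
    cpu_reservation: float
    memory_utilization: float
    memory_reservation: float


def scaling_signal(metrics):
    """Return (name, value) of the highest of the four readings."""
    readings = {
        "cpu_utilization": metrics.cpu_utilization,
        "cpu_reservation": metrics.cpu_reservation,
        "memory_utilization": metrics.memory_utilization,
        "memory_reservation": metrics.memory_reservation,
    }
    driver = max(readings, key=readings.get)
    return driver, readings[driver]


def should_scale_out(metrics, threshold=75.0):
    """Scale out when the highest reading reaches the (assumed) threshold."""
    _, value = scaling_signal(metrics)
    return value >= threshold


# Hypothetical snapshot: reservation is running ahead of utilization.
snapshot = ClusterMetrics(
    cpu_utilization=42.0,
    cpu_reservation=78.0,
    memory_utilization=55.0,
    memory_reservation=61.0,
)
print(scaling_signal(snapshot))    # ('cpu_reservation', 78.0)
print(should_scale_out(snapshot))  # True
```

Here CPU utilization alone (42%) would not have triggered anything, while the reservation reading starts the new container instance early.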

Another advantage is that you have access to memory metrics, both reserved and actual utilization. Let’s say that the task processing for one queue is memory intensive, and the other is CPU intensive. Depending on the load and which metric is driving the scale-up, you can select a more cost-effective instance type, say from the r4 family when memory is the driver or the c4 family when CPU is, or, if the load is fairly balanced between the two, something in the m4 family.
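One way to express that choice in code is sketched below. The function name `pick_family` and the 10-point "balanced" band are assumptions for illustration, not a recommendation; each input is meant to be the higher of utilization and reservation for that resource.

```python
def pick_family(cpu_pct, memory_pct, balance_band=10.0):
    """Suggest an instance family from which resource is driving the scale-up.

    cpu_pct / memory_pct: the higher of utilization and reservation for
    each resource, as percentages.
    balance_band: how close the two must be to count as balanced (assumed).
    """
    if memory_pct - cpu_pct > balance_band:
        return "r4"  # memory-optimized family
    if cpu_pct - memory_pct > balance_band:
        return "c4"  # compute-optimized family
    return "m4"      # general purpose when the load is fairly balanced


print(pick_family(cpu_pct=80.0, memory_pct=40.0))  # c4
print(pick_family(cpu_pct=35.0, memory_pct=85.0))  # r4
print(pick_family(cpu_pct=60.0, memory_pct=65.0))  # m4
```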

Multiple and Widely Varying Instance Types

Which brings up another big advantage: the abstraction layer that Docker and ECS provide means that the instances in the cluster no longer need to be of fairly similar instance types. You can have a t2.micro running alongside an m4.16xlarge. You might have only 1 or 2 tasks running on the t2.micro and hundreds on the m4.16xlarge, but the point is you don’t really care. This opens up many more instance types as candidates to help power the cluster, which, as discussed in Strategies for Mitigating Risk of Using AWS Spot Instances, allows us to diversify across more spot markets and use spot instances safely and effectively.
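The "1 or 2 tasks versus hundreds" figure can be checked with a little arithmetic. The sketch below uses the nominal vCPU and memory sizes of the two instance types and a hypothetical task that reserves 256 CPU units and 512 MiB. It ignores the memory the OS and the ECS agent consume, so real counts will be somewhat lower.

```python
CPU_UNITS_PER_VCPU = 1024  # ECS expresses CPU in units; 1024 units = 1 vCPU

# Nominal sizes: (vCPUs, memory in MiB). Registered capacity on a real
# container instance is a bit less than this.
INSTANCE_SPECS = {
    "t2.micro": (1, 1024),
    "m4.16xlarge": (64, 262144),
}


def tasks_that_fit(instance_type, task_cpu_units, task_memory_mib):
    """How many identical tasks fit, limited by the scarcer resource."""
    vcpus, memory_mib = INSTANCE_SPECS[instance_type]
    by_cpu = (vcpus * CPU_UNITS_PER_VCPU) // task_cpu_units
    by_memory = memory_mib // task_memory_mib
    return min(by_cpu, by_memory)


# Hypothetical task size: 256 CPU units and 512 MiB.
small = tasks_that_fit("t2.micro", 256, 512)
large = tasks_that_fit("m4.16xlarge", 256, 512)
print(small)  # 2
print(large)  # 256
```

The small instance is limited by memory and the large one by CPU, yet the same task definition runs on both without changes.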

Termination Behavior

So let’s say we now have a nice blended ECS cluster powered by container instances, of which about 80% are spot instances spread across multiple instance types and multiple spot markets, and the inevitable event occurs: one of those spot markets spikes and we are about to lose an instance. What happens? First, the spot instance will be marked for termination and given a two-minute warning. Details on how to detect this are covered in Spot Instance Two Minute Warning. Once we detect the warning, we can tell ECS to put this container instance into a Draining state, as described in the ECS developer guide. The ECS scheduler will stop placing new tasks on this instance and schedule replacement tasks on other instances. The fact that Docker containers start so fast is another big plus here, because replacement tasks can usually be running well within the two-minute window, so you can get all the active tasks off the container instance before you lose it, providing better continuity and efficiency of processing. If you are really savvy, you can also use this time to start up a replacement instance in another spot market to help pick up the load that just shifted over to the remaining container instances in the cluster.
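A minimal sketch of the detect-and-drain step is shown below, under several assumptions. It polls the instance metadata path `spot/termination-time`, which returns 404 until the instance is marked for termination, and it assumes plain (IMDSv1-style) metadata access without a session token. The drain call uses boto3's `update_container_instances_state`. The helper names and the cluster and ARN values in the usage note are hypothetical.

```python
import urllib.error
import urllib.request

# Instance metadata path that holds the termination time once a
# two-minute warning has been issued.
TERMINATION_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"


def fetch_termination_time(url=TERMINATION_URL, timeout=2):
    """Return the termination timestamp, or None if no warning is posted."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode().strip()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # not marked for termination
            return None
        raise


def drain_container_instance(cluster, container_instance_arn):
    """Ask ECS to put one container instance into the DRAINING state."""
    import boto3  # imported lazily so the rest of the sketch loads without it

    ecs = boto3.client("ecs")
    ecs.update_container_instances_state(
        cluster=cluster,
        containerInstances=[container_instance_arn],
        status="DRAINING",
    )


def check_and_drain(fetch, drain):
    """Run one polling step; return True if a drain was requested.

    fetch: zero-argument callable returning a timestamp or None.
    drain: zero-argument callable that performs the drain.
    """
    if fetch() is None:
        return False
    drain()
    return True
```

On a container instance, this could run in a loop every few seconds, for example `check_and_drain(fetch_termination_time, functools.partial(drain_container_instance, "my-cluster", my_instance_arn))`, where the cluster name and ARN are placeholders. Passing the two steps in as callables keeps the polling logic testable without touching AWS.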


If you are using or planning on using ECS, spend some time thinking about how to leverage spot instances using some of the strategies outlined here and you can save a lot of operational cost. If it sounds like another development project your team just does not have time for, consider leveraging our AutoScalr service. We have all these strategies and more already implemented and ready to be applied to your ECS cluster to start saving you money in a matter of minutes.

Further Reading:
