Auto scaling cooldown periods are not cool. In fact, they have always annoyed me. I understand why we have them, but they have always felt like a hack. The whole idea of a closed-loop control system refusing to take any further action for a substantial period of time just feels wrong. It is like taking your hands off the steering wheel of a race car just because you pushed the accelerator down. Why do it?
Why have cooldowns?
The reason cooldowns exist is to prevent runaway scaling events. If your system is running high on CPU and your auto scaling rule adds an instance, it will take 5 minutes or so before that instance is fully spun up and helping with the load. Without a cooldown, the rule would keep firing and might add 4 or 5 instances before the CPU metrics came down, resulting in wasteful over-provisioning. Or, in the scale-down case, it would overshoot and leave you under-provisioned.
The challenge with cooldowns comes when you have a two-peak traffic burst. Take, for example, a system set to scale up by one instance at 80% CPU with a 6-minute cooldown period. It is currently running 8 instances when the first peak hits, driving the CPU to 81%. The rule fires, one instance is added, and the cooldown period starts. A minute later the second traffic peak arrives, driving CPU to 100%. A human watching this would know that even after the new instance comes online the system is still going to be over 80% CPU (it will likely be around 90%, or even pegged if traffic keeps increasing). But because of the cooldown period, it is going to take another 5 minutes before this is detected and another instance is launched, and another 5 minutes after that before the second instance is helping with the load.
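The timeline above can be sketched as a toy simulation. The helper names and numbers here are illustrative, not any real autoscaler's logic:

```python
# Toy simulation of the two-peak scenario: a scale-up rule with a
# 6-minute cooldown watches per-minute CPU samples.
COOLDOWN_MINUTES = 6
THRESHOLD = 80  # scale up when aggregate CPU exceeds 80%

def simulate(cpu_by_minute):
    """Return the minutes at which the scale-up rule fires, honoring the cooldown."""
    fired_at = []
    cooldown_until = -1
    for minute, cpu in enumerate(cpu_by_minute):
        if cpu > THRESHOLD and minute > cooldown_until:
            fired_at.append(minute)
            cooldown_until = minute + COOLDOWN_MINUTES
    return fired_at

# Minute 0: first peak (81%); minute 1 onward: second peak (100%).
cpu = [81, 100, 100, 100, 100, 100, 100, 100]
print(simulate(cpu))  # [0, 7] -- the rule fires, then stays blind until minute 7
```

Even though CPU is pegged at 100% from minute 1, the rule cannot react again until the cooldown expires.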
What annoys me is that the autoscaler knows it just launched an instance; it should be able to determine from current capacity and CPU metrics that one instance is not enough and launch a second or third, just like a human operator would. The cooldown approach prevents that.
Step Scaling Policies
AWS added support for step scaling policies for auto scaling in 2015 to address this problem. These let you define multiple bands of CPU, each with its own cooldown period (technically it is called a warm-up period for step policies, but the concept is the same). Because the bands fire independently, the system can respond more quickly. For example, you can configure one band to add one instance anytime CPU is between 80-90%, and another to add 2 instances anytime CPU is between 90-100%. At 81% one instance would be added, and a minute later at 100% a second would be added, so your system scales up faster. If your application has bursting traffic that you need to handle quickly, I highly recommend spending the time to learn and configure a good step scaling policy (links to further reading below). The only challenge is that it can be hard to determine the right bands and scaling values for a given application. In my experience it takes a lot of trial and error before you arrive at a well-tuned system that works for all your traffic patterns.
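At its core, a step policy is a lookup from the current metric value to an adjustment. A minimal sketch of that evaluation, with band boundaries mirroring the example above (AWS expresses the real thing as `StepAdjustments` with `MetricIntervalLowerBound`/`MetricIntervalUpperBound` offsets from the alarm threshold):

```python
# Sketch of how a step scaling policy maps a metric reading to an adjustment.
# Bands mirror the example: 80-90% adds 1 instance, 90-100% adds 2.
STEPS = [
    (80, 90, 1),    # (lower bound inclusive, upper bound exclusive, instances to add)
    (90, 101, 2),
]

def step_adjustment(cpu):
    """Return how many instances the matching band would add, or 0 if none match."""
    for lower, upper, add in STEPS:
        if lower <= cpu < upper:
            return add
    return 0

print(step_adjustment(81))   # 1
print(step_adjustment(100))  # 2
```

The tuning effort mentioned above is exactly the work of picking those band boundaries and adjustment sizes for your workload.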
Auto Scaling without Cooldowns
At AutoScalr we decided to take it to the next level and eliminate cooldown periods completely without requiring complicated multi-step rule definitions. Instead of specifying and tuning multiple bands of CPU actions you simply specify the target spare capacity you want to aim for and AutoScalr will constantly assess if any further action is needed to reach that target, taking into account recent launches and the expected impact they are going to have. Following the same example, if you have AutoScalr configured to give you 20% spare capacity it will scale up when CPU goes above 80%, but the difference is it will keep assessing the situation as new metrics come in and predicting where the metrics will be when all the instances spinning up are online. For example, since adding one instance is adding about 12% more capacity (1/8) you can estimate that aggregate CPU should go down by about 12% to something close to 70% given the same load.
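The back-of-the-envelope math works out like this (a sketch of the idea, not AutoScalr's production model):

```python
def predict_cpu(current_cpu, current_instances, target_instances):
    """Predict aggregate CPU once in-flight instances are online,
    assuming the same total load spread over more instances."""
    return current_cpu * current_instances / target_instances

# Going from 8 to 9 instances at ~80% CPU lands near 71%.
print(round(predict_cpu(80, 8, 9)))  # 71
```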
Mathematically, this prediction model is:

predicted CPU = current CPU × (current instance count ÷ instance count after in-flight launches)
This yields a 71% CPU prediction after the new instance brings us from 8 to 9 instances, so no further action is taken. But let's say the CPU now rises to 92%. The prediction now is that after the instance comes online the CPU metric will be 81%, which violates the spare capacity rule, so another instance is added immediately and your system avoids being under-provisioned. You get the same net effect as a well-designed set of step scaling policies, but with a much simpler specification and none of the tuning effort step policies require.
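Put together, the cooldown-free decision reduces to "keep adding capacity until the predicted CPU meets the spare-capacity target." A sketch with illustrative names (a real system would also cap the step size and handle scale-down):

```python
TARGET_SPARE = 0.20  # aim for 20% spare capacity, i.e. CPU at or below 80%

def instances_needed(current_cpu, current_instances):
    """Smallest instance count whose predicted CPU satisfies the spare-capacity target."""
    n = current_instances
    # Predicted CPU at n instances = current_cpu * current_instances / n
    while current_cpu * current_instances / n > (1 - TARGET_SPARE) * 100:
        n += 1
    return n

print(instances_needed(81, 8))  # 9  -- one more instance is enough
print(instances_needed(92, 8))  # 10 -- a second is added immediately, no cooldown
```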
Another advantage of the spare capacity approach to auto scaling is that you can account for different instance types with different numbers and types of vCPUs. As we discussed in the Strategies for Mitigating Risk of Using AWS Spot Instances blog post, using multiple instance types is very important when using spot instances: it lets you diversify your compute across enough different spot markets to minimize the amount of capacity subject to loss at any one moment in time. The capacity-modeling algorithm we use at AutoScalr is therefore a little more complicated than the example shown above, to account for things like multiple instance types, but the concept is the same. Essentially you are building a model to predict how the changes that are in flight will affect the system being auto scaled, and scaling based upon that constantly updating prediction instead of current CPU, thus eliminating the need for a cooldown period altogether.
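The same idea generalizes to mixed instance types if you model capacity in vCPU terms rather than raw instance counts. A simplified sketch (AutoScalr's actual model is more involved; the fleet composition here is illustrative):

```python
# Weight each instance type by its vCPU count and predict aggregate CPU
# from total vCPUs before and after the in-flight launches.
VCPUS = {"m5.large": 2, "m5.xlarge": 4, "c5.2xlarge": 8}

def total_vcpus(fleet):
    """fleet maps instance type -> running count."""
    return sum(VCPUS[t] * n for t, n in fleet.items())

def predict_cpu(current_cpu, fleet, in_flight):
    """Predict aggregate CPU once in-flight instances (by type) are online."""
    now = total_vcpus(fleet)
    later = now + total_vcpus(in_flight)
    return current_cpu * now / later

fleet = {"m5.large": 4, "m5.xlarge": 2}                  # 16 vCPUs total
print(round(predict_cpu(88, fleet, {"m5.xlarge": 1})))   # 4 more vCPUs -> 70
```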
If you are ready to Just Say No to your cooldown periods, try out our free trial. Our charter is to save you money and simplify your auto scaling operations at the same time. At the very least, make sure to set up and tune a couple of step scaling policies so your system doesn’t put on blinders after every scaling change.
Cooldown periods, rest in peace…