In our AWS Spot Instances and Spot Market Price Drivers post we explored how Spot Markets work and what drives the price in a particular Spot Market. In this post we are going to build on that information to explore techniques for mitigating the risk of using Spot Instances in production applications.
The cost savings potential from using Spot Instances on AWS is impressive: 75% to as high as 90% less than the equivalent On-Demand instance in some cases. So why aren’t they being used for nearly every AWS application? The answer is the one caveat that comes with the big discount: Amazon reserves the right to take the instance away from you if someone else outbids you for it, or if the capacity is needed to fulfill an On-Demand request. Designing and operating high availability cloud apps is hard enough without knowingly introducing a potential availability risk, so for this reason the vast majority of highly available production workloads on AWS for small to medium sized businesses still run on On-Demand (or Reserved) Instances.
Let’s start with a simple example, an application powered by a single m4.large instance.
Strategy 1: Bid On-Demand Price and tolerate outages
As the chart shows, if you used this strategy for a 3 month period starting 5/7/17 you would have paid an average of $0.028 per hour, 72% less than the On-Demand price of $0.10 per hour.
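As a quick check of that figure, the savings can be computed directly from the two hourly prices. This is only a sketch; the function name is an illustrative helper and not part of any AWS tooling.

```python
def spot_savings(on_demand_hourly: float, avg_spot_hourly: float) -> float:
    """Return the fraction saved by paying the average Spot price instead of On-Demand."""
    if on_demand_hourly <= 0:
        raise ValueError("on_demand_hourly must be positive")
    return (on_demand_hourly - avg_spot_hourly) / on_demand_hourly


# Prices quoted above for an m4.large over the 3 month period.
savings = spot_savings(on_demand_hourly=0.10, avg_spot_hourly=0.028)
print(f"Savings vs On-Demand: {savings:.0%}")  # prints "Savings vs On-Demand: 72%"
```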
If you had placed that instance in us-east-1a, you would have only lost the instance once in the entire 3 month period. If you had picked us-east-1e, you would have lost it 5 times.
So anyone working on designing highly available systems probably cringed at reading the “only lost the instance once” phrase, and rightly so! If it were for a development or QA system, then no problem, you could tolerate the intermittent outages, but production? Unlikely. So what are some strategies we can employ to improve this situation?
Strategy 2: Bid Over On-Demand Price to reduce outages
The next simple strategy modification is to increase your bid. The price rarely spikes above the On-Demand price, so why not bid 10 times the On-Demand price? You might pay a little more during those spikes, but you keep the instance, and the aggregate amount you pay would still be considerably less. A lot of people used this exact strategy in the early days of Spot Markets. It did work for a while, but as more and more people started using it, it led to the crazy price spikes where Spot Market prices hit $999 for small instances in some cases. Amazon then added a maximum bid limit of 10x the On-Demand price to keep things from getting out of hand. You can still bid 10x the On-Demand price, and some people do, and it can work OK for some instance types. In general, though, this strategy is not very effective anymore: eventually you will still lose an instance, and in some cases you will be paying 10x On-Demand for a considerable period of time, negating the advantage of using Spot in the first place.
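To get a feel for how quickly a sustained spike erases the discount, here is a rough break-even sketch. It assumes a simplified two-level market, which is our assumption and not how real Spot prices behave: the price is either at a "normal" level (we reuse the $0.028 average from Strategy 1 purely for illustration) or pinned at the 10x cap of $1.00. The function name is hypothetical.

```python
def break_even_spike_fraction(on_demand: float, normal_spot: float, spike_price: float) -> float:
    """Fraction of hours spent at spike_price at which the average Spot cost equals On-Demand.

    Solves f * spike_price + (1 - f) * normal_spot = on_demand for f.
    """
    if not normal_spot < on_demand < spike_price:
        raise ValueError("expected normal_spot < on_demand < spike_price")
    return (on_demand - normal_spot) / (spike_price - normal_spot)


# Assumed inputs: m4.large On-Demand price, illustrative normal Spot price, 10x cap.
fraction = break_even_spike_fraction(on_demand=0.10, normal_spot=0.028, spike_price=1.00)
print(f"Break-even time at the 10x price: {fraction:.1%}")
```

Under these assumptions, the savings are gone once the price sits at the cap for roughly 7% of the hours in the period.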
Single instance applications are conceptually easy but typically not what we see for highly available cloud applications. We typically want multiple instances running behind a load balancer that are auto-scaled based upon demand. This is the case where more sophisticated strategies for leveraging spot start to become attractive because it opens up the potential for risk mitigation via diversification.
Stock Market Diversification Strategies Applied to Spot Markets
Many analogies have been drawn between Spot Markets and financial Stock Markets, and it is a good way to think about leveraging Spot Markets. In a Stock Market portfolio you reduce your risk of market fluctuation by diversifying across multiple stocks, and even stocks and bonds. If one stock does poorly, there is usually another that does well that helps offset it. The same strategy applies to reducing risk in Spot Markets. Just as you would never put your entire investment portfolio into one stock because of the risk, you should never put all of your “compute portfolio” into one Spot Market. If you did, you would have all your eggs in one basket and would be taking too much risk. By spreading instances over multiple Spot Markets you can limit the amount of compute capacity vulnerable to any one Spot Market price spike.
To continue with the Stock Market analogy, putting your entire portfolio into cash or bonds would be analogous to the all On-Demand instances case. Very little risk at any one time, but you are missing out on the long term cost-savings / money-making potential that a diversified portfolio can deliver.
Strategy 3: Lower Risk by Diversification Across Multiple Spot Markets
Expanding on our previous example, suppose you ran two instances, one in us-east-1a and one in us-east-1b, both bidding the On-Demand price.
In this case, you still would have lost instances about 5 times, but never at the same time! One would have always been active to handle requests. When we did lose an instance, we could immediately start up a new instance in a different availability zone to help out. But we did lose 50% of capacity for a period of 5 minutes or so while the new instance was spinning up, and that might not be acceptable. So here is where you start to play the classic trade-off game of cost, performance, and availability. For instance, say normal load would use about 60% CPU on each of 2 instances. If you lost one instance, the CPU would peg on the other and performance would start to degrade. But by running a third instance, average CPU should drop to roughly 40%, and if one of the instances is lost, you have built-in redundancy. The other two should pick up the load and only be running at 60% for the 4-5 minutes while a replacement instance is started. Since the Spot price is so much cheaper than On-Demand, even running 3 Spot Instances vs 2 On-Demand is significantly cheaper! As your average number of running instances goes up, you can diversify over more Spot Markets, and the penalty you pay for the redundant capacity required to handle a single price spike without performance degradation tends to go down, increasing your savings even more.
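The utilization and cost arithmetic in this example can be sketched as follows. It assumes load spreads perfectly evenly across instances and reuses the Strategy 1 prices for an m4.large. The helper name is illustrative only.

```python
ON_DEMAND_PRICE = 0.10   # $/hour, from the Strategy 1 example
AVG_SPOT_PRICE = 0.028   # $/hour, average Spot price from the Strategy 1 example


def utilization(total_load: float, instances: int) -> float:
    """Average CPU utilization per instance, assuming load is spread evenly."""
    if instances < 1:
        raise ValueError("need at least one instance")
    return total_load / instances


# Normal load: 60% CPU on each of 2 instances, i.e. 1.2 instance-equivalents.
total_load = 2 * 0.60

for fleet_size in (2, 3):
    normal = utilization(total_load, fleet_size)
    after_loss = utilization(total_load, fleet_size - 1)
    # A value above 100% means demand exceeds capacity and the CPU is pegged.
    print(f"{fleet_size} instances: {normal:.0%} normally, {after_loss:.0%} after losing one")

spot_cost = 3 * AVG_SPOT_PRICE
on_demand_cost = 2 * ON_DEMAND_PRICE
print(f"3 Spot: ${spot_cost:.3f}/hour vs 2 On-Demand: ${on_demand_cost:.3f}/hour")
```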
Strategy 4: Diversify Across Multiple Instance Types
The challenge with Strategy 3 above is that if you limit yourself to only one instance type, you fairly quickly run out of Spot Markets to diversify over, which significantly drives up the over-provisioning required for availability during a spot price spike. This is because, for a given instance type and region, the number of Spot Markets can be at most the number of availability zones in that region, and will often be fewer if not every availability zone has excess capacity of that instance type. For many regions that means a maximum of 2 or 3 Spot Markets to diversify over, leaving 33% to 50% of capacity at risk even at large numbers of instances. The solution is to not limit yourself to one instance type.
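The relationship between the number of Spot Markets and the capacity at risk can be sketched with two small formulas. This assumes instances are spread evenly across markets and that only one market spikes at a time. Both function names are illustrative.

```python
def capacity_at_risk(num_markets: int) -> float:
    """Share of the fleet lost if a single Spot Market spikes (even spread assumed)."""
    if num_markets < 1:
        raise ValueError("need at least one market")
    return 1 / num_markets


def overprovision_factor(num_markets: int) -> float:
    """Fleet size multiplier so that losing one market still leaves 100% of needed capacity."""
    if num_markets < 2:
        raise ValueError("need at least two markets to survive the loss of one")
    return num_markets / (num_markets - 1)


for markets in (2, 3, 6, 12):
    print(
        f"{markets:>2} markets: {capacity_at_risk(markets):.0%} at risk, "
        f"{overprovision_factor(markets):.2f}x capacity needed"
    )
```

With 2 markets you need double the capacity to ride out a spike, with 3 you need 1.5x, and the multiplier keeps shrinking as markets are added.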
There are many different instance types in each family, so diversify over several, e.g. m4.2xlarge and m4.large, and even across families if your application can support it, e.g. c4.large, r4.large, etc. Each time you add a new instance type you get more Spot Markets to diversify over. By making sure you have enough Spot Markets to diversify over, and over-provisioning by the maximum amount you have at risk in any one Spot Market, you can effectively design in the level of fault tolerance to Spot Market price spikes that you need to meet your availability requirements, and still save money.
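One way to picture this is to treat every (instance type, availability zone) pair as its own Spot Market and size each one so the fleet survives the loss of any single market. The sketch below is an assumption-laden illustration: it uses vCPUs as a stand-in for capacity, assumes every type is offered in every listed zone (often not true in practice), and uses hypothetical helper names and a made-up capacity target.

```python
import math
from itertools import product

# vCPU counts per instance type, used here as a rough capacity unit.
VCPUS = {"m4.large": 2, "m4.2xlarge": 8, "c4.large": 2, "r4.large": 2}


def plan_fleet(required_vcpus: int, instance_types: list[str], zones: list[str]) -> dict:
    """Return {(instance_type, zone): instance_count}, over-provisioned for one market loss."""
    markets = list(product(instance_types, zones))
    if len(markets) < 2:
        raise ValueError("need at least two Spot Markets to diversify")
    # Each market carries enough that the remaining markets still cover the requirement.
    per_market = required_vcpus / (len(markets) - 1)
    return {(itype, zone): math.ceil(per_market / VCPUS[itype]) for itype, zone in markets}


def worst_case_vcpus(allocation: dict) -> int:
    """vCPUs left after losing the single largest market."""
    capacities = [count * VCPUS[itype] for (itype, _zone), count in allocation.items()]
    return sum(capacities) - max(capacities)


# Hypothetical target: 40 vCPUs of steady-state capacity.
allocation = plan_fleet(40, ["m4.large", "m4.2xlarge"], ["us-east-1a", "us-east-1b", "us-east-1e"])
for market, count in allocation.items():
    print(market, count)
print("vCPUs after worst single-market loss:", worst_case_vcpus(allocation))
```

Two instance types across three zones gives six markets here, so the fleet only needs about 20% extra capacity instead of the 50% a single instance type would require.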
Strategy 5: Lower Risk by Diversification Across Multiple Spot Markets and On-Demand Safety Net
But what about those really bad days? You know, the ones like in February 2017 when S3 in us-east-1 was having some issues, or the “Increased Spot Instance Launch Delays” issue on 6/19/17 we blogged about previously. Those days can push your availability strategies to the limit. They are fairly rare, and if you can tolerate some slow-downs or partial outages for those rare events, then lucky you. For the rest of us, we believe it is prudent to not put all your eggs into any one basket, including the Spot basket, even with diversification, and instead keep a safety net of capacity running in On-Demand.
Not a lot, but enough that your application could still struggle along while auto-scaling tries to recover. It might run a little slow for a while, but at least it will not go down on those “really bad days”. It obviously depends on your application, but we have found for many applications 10-25% of capacity in On-Demand makes for a good trade-off of cost savings and availability guarantee, and can still result in about a 50% reduction in operating costs compared to all On-Demand.
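To see how an On-Demand safety net affects the bill, here is a rough blended-cost sketch. The inputs are assumptions for illustration, not measurements: the m4.large prices from Strategy 1, a baseline of 10 instances, and 20% over-provisioning on the Spot portion. Your own numbers will differ, and the function name is hypothetical.

```python
def blended_hourly_cost(
    baseline_instances: float,
    on_demand_fraction: float,
    spot_overprovision: float,
    on_demand_price: float,
    spot_price: float,
) -> float:
    """Hourly cost of a fleet that keeps part of its baseline capacity in On-Demand."""
    if not 0 <= on_demand_fraction <= 1:
        raise ValueError("on_demand_fraction must be between 0 and 1")
    # Counts are treated as averages, so fractional instances are allowed.
    on_demand_count = baseline_instances * on_demand_fraction
    spot_count = baseline_instances * (1 - on_demand_fraction) * spot_overprovision
    return on_demand_count * on_demand_price + spot_count * spot_price


BASELINE = 10            # assumed steady-state instance count
ON_DEMAND_PRICE = 0.10   # $/hour
AVG_SPOT_PRICE = 0.028   # $/hour
all_on_demand = BASELINE * ON_DEMAND_PRICE

for fraction in (0.10, 0.25):
    mixed = blended_hourly_cost(BASELINE, fraction, 1.2, ON_DEMAND_PRICE, AVG_SPOT_PRICE)
    reduction = 1 - mixed / all_on_demand
    print(f"{fraction:.0%} On-Demand: {reduction:.0%} cheaper than all On-Demand")
```

With these assumed inputs, keeping 25% of capacity in On-Demand still comes out at roughly half the all On-Demand cost.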
So to sum up, diversification across multiple Spot Markets and On-Demand instances combined with a small amount of over-provisioning allows you to leverage Spot pricing to save money on high availability applications and limit your availability and performance risk to a level that you can define and manage.
In our next post, we will go into different approaches for how to implement these strategies for your AWS application.