One of the most common requests we received from customers of our flagship AutoScalr product went along the lines of:
"We love how AutoScalr is saving us money on our autoscaled apps, is there anything you can do to help us save money on our single EC2 instance applications too?"
We got tired of saying no to this request, and there was clearly a market need, so we started an R&D effort to see what we could do to fulfill it.
The challenge with leveraging Spot instances for cost savings on single-instance EC2 apps is that eventually you are going to lose the Spot instance. It is only a matter of time. Since there is only one instance, there is nothing to pick up the load, and you will have some downtime. The question is how much is tolerable and what can be done to keep it to an acceptable level. As we talked with customers, a pattern emerged, at least for many applications like build servers and employee-facing tools: "a little downtime here and there for a few minutes is OK if it's dramatically cheaper, but not more than 5 to 10 minutes, and not more than a few times a month."
After numerous customer discussions, we narrowed down the characteristics and requirements of the single-instance EC2 apps we were targeting:
- Has state stored on EBS volume(s) that needs to be preserved
- Has a network location (IP address) that needs to be preserved so that clients can reach it
- Can tolerate some downtime, but it needs to be bounded and not too frequent, and will vary app by app
At AutoScalr, we believe data is king, so everything we do starts with simulations run against large data sets of pricing and interruption data we have compiled for Spot markets across the world going back years. One of the first things that became apparent from analyzing the data from a single EC2 instance perspective is that Spot interruptions are a fairly rare occurrence. It depends on the instance type and region, with the GPU families being the notable exception, but for many commonly used instance types the chance of an interruption occurring over a month was only 4% or less. That's the good news. The bad news was when an interruption did occur, it was often over 10 times more likely than normal that another would occur sometime in the near future, which makes sense intuitively since it was at those times that a Spot market's capacity was limited.
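To make the statistics above concrete, here is a back-of-the-envelope version of that reasoning (illustrative numbers only, not AutoScalr's actual dataset): if the chance of at least one interruption in a roughly 720-hour month is 4%, the implied per-hour interruption probability is tiny, but multiplying that hazard by 10 right after an interruption makes a second one in the near term much more likely.

```python
# Illustrative math behind "interruptions are rare, but cluster."
# The 4% monthly figure is from the text; the 10x elevated hazard
# after an interruption is the observed clustering effect.

p_month = 0.04          # chance of >= 1 interruption over a month
hours_per_month = 720

# Solve 1 - (1 - p_hour)**720 = 0.04 for the implied hourly probability.
p_hour = 1 - (1 - p_month) ** (1 / hours_per_month)

# Chance of another interruption in the next 24 hours,
# at the normal hazard vs. a 10x elevated hazard.
p_next24_normal = 1 - (1 - p_hour) ** 24
p_next24_elevated = 1 - (1 - 10 * p_hour) ** 24

print(f"hourly: {p_hour:.6f}")
print(f"next 24h (normal):   {p_next24_normal:.4f}")
print(f"next 24h (elevated): {p_next24_elevated:.4f}")
```

The elevated case comes out roughly ten times the normal one, which is why it pays to get out of the market right after an interruption rather than re-rolling the dice immediately.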
Success = Handling Failures
It became clear that handling the inevitable interruptions in a way that minimizes the resulting downtime and keeps it bounded was the key to solving this problem. To keep it bounded, you need a plan to replace an instance when it is interrupted, rather than leaving your downtime in the hands of Spot market dynamics while you wait and hope. We devised a method to effectively move capacity from Spot to On-Demand and back again, preserving the EBS volumes and network location. On a Spot termination, we move the application to On-Demand and have it back up and running a few minutes later. To any clients, it simply looks like a few minutes of downtime.
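A minimal sketch of the failover idea, not AutoScalr's actual implementation: on a Spot interruption, launch an On-Demand twin in the same subnet (and therefore the same Availability Zone, so the EBS volumes can be reattached). The function name and the sample values are hypothetical; in practice the resulting parameters would be passed to boto3's `ec2.run_instances`, followed by reattaching the volumes and network interface.

```python
# Hypothetical sketch: build run_instances parameters for the
# On-Demand replacement of an interrupted Spot instance.

def build_on_demand_replacement(interrupted):
    """Given a describe_instances-style dict for the interrupted Spot
    instance, build launch parameters for an On-Demand twin in the
    same subnet/AZ so its EBS volumes and ENI can be reattached."""
    return {
        "ImageId": interrupted["ImageId"],
        "InstanceType": interrupted["InstanceType"],
        "SubnetId": interrupted["SubnetId"],  # same AZ as the EBS volumes
        "MinCount": 1,
        "MaxCount": 1,
        # No InstanceMarketOptions key => plain On-Demand capacity.
    }

params = build_on_demand_replacement({
    "ImageId": "ami-0abcdef1234567890",   # sample values for illustration
    "InstanceType": "m5.large",
    "SubnetId": "subnet-12345678",
})
```

The key detail is what is absent: omitting `InstanceMarketOptions` is what makes the replacement On-Demand rather than another Spot request.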
The next question became "do you move the application back to Spot? and if so when?" Based on the data analysis mentioned above, we decided, to use a gambling metaphor, "to not bet against the house when the deck is stacked against us." If a second interruption is far more likely after the first, do not even play the game during that time. Move to safety, move the application to On-Demand or Reserved, and ride the storm out there for a period of time. When it looks like the storm has subsided, move back to Spot. Statistically it ends up only being a small percentage of the monthly run time, so it doesn't cost you that much in savings, and it helps availability by preventing downtime from subsequent interruptions.
Moving to On-Demand on a Spot interruption bounds the amount of downtime per interruption, but not for the application itself. If interruptions keep occurring, the downtime is still unbounded. To solve this we added downtime tracking on a per instance basis to keep a running total of minutes offline due to interruptions. If it exceeds a threshold, we stop 'playing the game' and leave the application in On-Demand to preserve the Availability requirement.
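The two mechanisms just described, the post-interruption cooldown and the per-instance downtime budget, can be sketched as a small piece of state. The threshold and cooldown values below are made-up illustrative defaults, not AutoScalr's actual parameters.

```python
# Illustrative sketch of the cooldown + downtime-budget logic.
from dataclasses import dataclass

@dataclass
class DowntimeBudget:
    monthly_budget_min: float = 10.0    # max tolerated downtime per month (assumed)
    cooldown_hours: float = 6.0         # "ride out the storm" window (assumed)
    used_min: float = 0.0               # downtime consumed so far this month
    last_interruption_hour: float = float("-inf")

    def record_interruption(self, now_hour, downtime_min):
        """Log an interruption and the downtime it caused."""
        self.used_min += downtime_min
        self.last_interruption_hour = now_hour

    def may_return_to_spot(self, now_hour):
        """Decide whether it is safe to move back to Spot."""
        # Budget exhausted: stop playing the game, stay On-Demand.
        if self.used_min >= self.monthly_budget_min:
            return False
        # Still inside the post-interruption "storm" window.
        if now_hour - self.last_interruption_hour < self.cooldown_hours:
            return False
        return True
```

For example, right after an interruption `may_return_to_spot` says no; once the cooldown has passed and the budget still has headroom, it says yes again.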
Cost vs Availability
If it were free, of course we would all want high availability for our apps. The only reason to give a little on availability is cost. One of the key phrases we heard repeatedly from customers was "OK if it's dramatically cheaper". But how can you quantify "dramatically cheaper" on an app-by-app basis? How much downtime you are willing to tolerate depends directly on how much savings you expect to get for the inconvenience.
Intuitive ease-of-use is one of our mantras at AutoScalr, so as we were defining the UI for this product we tried to address this inherent relationship in a natural way. We realized that in order to make an informed decision about how much downtime was acceptable, you needed an estimate of how much savings would result, otherwise we all just would pick "no downtime". We have the data, so why not estimate the Cost vs Availability curve to help make the decision data driven?
It turns out to be a highly non-linear, roughly exponential relationship the vast majority of the time, which means you do not have to give up much availability to generate a lot of savings. The default availability of a single EC2 instance is limited to that of the Availability Zone it runs in: 99.95%. By lowering that just 0.05%, to 99.9%, you can lower costs by approximately 60% in most cases. That is not a bad trade-off for many apps, but the point is that you get to choose based on the estimate. Pick the point along the curve that feels like the right trade-off for your application, and AutoScalr will work in the background to generate as much savings as possible while still meeting that availability.
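The availability numbers above translate directly into a monthly downtime budget; the arithmetic is simple enough to show (this is plain math, not a quote of AutoScalr's model):

```python
# An availability target implies an allowed-downtime budget per month.

def allowed_downtime_minutes(availability, days=30):
    """Minutes of downtime per month permitted by an availability target."""
    return (1 - availability) * days * 24 * 60

print(allowed_downtime_minutes(0.9995))  # AZ baseline 99.95% -> 21.6 min/month
print(allowed_downtime_minutes(0.999))   # relaxed 99.9%     -> 43.2 min/month
```

At 99.9%, the 43-odd minutes of budget comfortably covers several of the few-minute transition outages described earlier, which is why such a small availability concession unlocks so much Spot usage.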
Simple Activation & Settings
We wanted to make it easy to use, so after subscribing to the AutoScalr service, you simply tag the EC2 instance with the name of the environment you want to place it in. AutoScalr will see the tag and start managing the instance for cost savings. The environment name allows availability settings and schedules to be applied to groups of instances, such as Development, QA, or Build Servers.
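As a sketch of what that activation step looks like programmatically: the tag key `autoscalr:environment` below is an assumed name for illustration (check the product documentation for the actual key), and the helper simply builds the parameters you would pass to boto3's `ec2.create_tags`.

```python
# Hypothetical sketch of opting an instance into a managed environment.

def environment_tag(instance_id, environment):
    """Build create_tags parameters to place an instance in an
    AutoScalr environment. The tag key is an assumed example."""
    return {
        "Resources": [instance_id],
        "Tags": [{"Key": "autoscalr:environment", "Value": environment}],
    }

# e.g. ec2.create_tags(**environment_tag("i-0123456789abcdef0", "Build Servers"))
params = environment_tag("i-0123456789abcdef0", "Build Servers")
```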
Track Availability and Interruptions
If you want to see how well Spot interruptions are being managed, you can track availability and interruption statistics on a per-environment basis.
Each transition from Spot to On-Demand or back takes time. The longer a transition takes, the fewer times you can move capacity back and forth before threatening the availability setting, so you want it to be as quick as possible. The most significant factor driving transition time is how long it takes to snapshot the EBS volume(s) in use. Since snapshots are incremental, only the blocks that have changed since the previous snapshot need to be copied, so it mainly comes down to the number of changed blocks and the IOPS rating of the EBS volume. If it is a small volume, or one that has been snapshotted within the past few hours, a snapshot takes just a few minutes. If it is a large volume with low IOPS that has never been snapshotted, it could take hours.
Hours of downtime is not acceptable based on the problem definition, so before ever moving an EC2 app to Spot, AutoScalr will first trigger a snapshot and make sure it has completed. Snapshots can be run while an instance is running so no downtime needs to occur for this activity. But you also need a mechanism to make sure you take periodic snapshots to maintain your ability to make fast transitions, so we added a snapshot scheduling capability where you can specify the frequency of snapshots you require based upon the volatility of data in your EC2 app.
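The scheduling check behind that capability can be sketched as a small freshness test (the frequency parameter is illustrative; the real service manages this for you):

```python
# Sketch: is a new snapshot due, so future transitions stay fast?
from datetime import datetime, timedelta, timezone

def snapshot_due(last_snapshot_at, frequency_hours, now=None):
    """True if the volume should be snapshotted again, keeping the
    number of changed blocks (and thus transition time) small."""
    now = now or datetime.now(timezone.utc)
    if last_snapshot_at is None:     # never snapshotted: always due
        return True
    return now - last_snapshot_at >= timedelta(hours=frequency_hours)
```

A volume that has never been snapshotted is always due, which mirrors the rule above: AutoScalr takes (and waits for) a first snapshot before ever moving an app to Spot.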
A subset of the applications we heard about did not have 24x7 requirements; often they fell into the category of "extended office hours." For those applications we added the capability to specify run schedules. There are many tools that accomplish this for On-Demand instances, and it is easy enough to script yourself, but those approaches do not work for Spot instances, since you cannot manually Stop a Spot instance. So we added support to automatically move the instance back to On-Demand at the end of the day and Stop it, then start it back up as a Spot instance the next day. Spot instances are cheap, but there is no sense paying for them when you do not need them.
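The schedule decision itself reduces to a simple time window. This sketch uses an assumed 07:00-21:00 weekday window; the two-step end-of-day action reflects the constraint above that a Spot instance must first be moved to On-Demand before it can be stopped.

```python
# Sketch of an "extended office hours" run schedule decision.

def desired_action(hour, weekday, start_hour=7, end_hour=21):
    """Return the desired action for the current hour (0-23) and
    weekday (0=Monday .. 6=Sunday). Window values are assumptions."""
    if weekday < 5 and start_hour <= hour < end_hour:
        return "run-as-spot"
    # Outside the window: move to On-Demand, then Stop (a Spot
    # instance cannot be stopped directly).
    return "failback-and-stop"
```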
We are excited to launch this product and help make lowering AWS costs for a class of single instance EC2 apps with modest availability requirements easier than ever.
Check out the product page for more information which includes a video demo showing it in action, or try it for free in your own environment and give us your feedback!