Just like in football, spot instances get a two-minute warning that they are about to be reclaimed by AWS. These are precious moments where you can have the instance deregister from accepting any new work, finish up any work in progress, and notify any interested entities that it is going away.
How to Detect
The most commonly discussed way to detect that the two-minute warning has been issued is to poll the instance metadata every few seconds. It is available on the instance at:
http://169.254.169.254/latest/meta-data/spot/termination-time
This field normally returns a 404 HTTP status code, but once the two-minute warning has been issued it returns the time at which shutdown will actually occur.
This can only be accessed from the instance itself, so you have to put this code on every spot instance that you are running. A simple curl to that address will return the value. You might be thinking of setting up a cron job, but do not go down that path. The smallest interval at which cron can run something is once a minute. If you miss the warning by a second or two, you will not detect it until the next minute, and you lose half of the time available to you. It is better to run a script that loops forever, checking that URL and then sleeping a few seconds. Here is a simple example:
URL=http://169.254.169.254/latest/meta-data/spot/termination-time
until [ "$(curl -s -o /dev/null -w '%{http_code}' "$URL")" = "200" ]; do
  # Still running fine, sleep and then check again
  sleep 5
done
# 2 minute warning received. Do all your cleanup work.
There is a second option that is not as widely discussed but works nicely if you would prefer not to customize the instance just because you want to run it as a spot instance. That option is an EC2 API call to DescribeSpotInstanceRequests, which is documented in the Amazon EC2 API Reference.
Part of the response you get back is a Status element whose Code field will contain the value ‘marked-for-termination’ if the two-minute warning has been given for the instance. The biggest advantage of this option is that it can be run from anywhere that has access to the AWS API, not just on the instance itself. You also get some more detailed information about the spot request that can be useful.
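As a rough illustration of the API approach, here is a minimal sketch using the AWS CLI. It assumes the CLI is installed and configured with credentials that are allowed to call DescribeSpotInstanceRequests. The function names list_marked_instances and check_once are made up for this example, and the handler you pass in is whatever cleanup command you supply.

```shell
#!/usr/bin/env bash
# Sketch only. Assumes the AWS CLI is installed and has credentials allowed
# to call ec2:DescribeSpotInstanceRequests. Function names are illustrative.

# Print the instance ID of every spot request whose status code is
# marked-for-termination, one ID per line.
list_marked_instances() {
  aws ec2 describe-spot-instance-requests \
    --query "SpotInstanceRequests[?Status.Code=='marked-for-termination'].InstanceId" \
    --output text | tr '\t' '\n' | sed '/^$/d'
}

# One polling pass: run the given handler command once per marked instance.
check_once() {
  local handler="$1" instance_id
  list_marked_instances | while read -r instance_id; do
    "$handler" "$instance_id"
  done
}
```

From a management host you could call check_once with your own handler inside a loop that sleeps a few seconds between passes. A real agent would also need to remember which instances it has already handled, since an instance stays marked until it is terminated.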
What to Do
The first obvious thing to do is to have the instance stop taking on any additional work. If it is part of a web server farm, have it deregister itself from the Elastic Load Balancer. For a better user experience you should also have connection draining enabled, but set it to a smaller value than the default of 300 seconds since you only have two minutes to play with. If the instance is working on a queue, stop pulling new items off the queue and have it finish any items it is currently working on, if time allows. If it is a long-running task that will not complete in time, put the request back on the queue for another instance to take. If you are using SQS you can also simply do nothing: the item's visibility timeout will eventually expire and it will be given to another worker.
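The steps above might look something like the following with the AWS CLI. This is a sketch under a few assumptions: the load balancer is a Classic Load Balancer, the CLI has permission for the elb and sqs calls shown, and the function names are invented for illustration. Setting the draining timeout is a one-time configuration step you would do ahead of time, not during the warning itself.

```shell
#!/usr/bin/env bash
# Sketch only. Assumes a Classic Load Balancer and an AWS CLI with the
# needed elb/sqs permissions. Function names are illustrative.

# One-time setup: enable connection draining with a timeout that fits
# inside the two-minute window (for example 60 or 90 seconds).
set_draining_timeout() {
  local lb_name="$1" seconds="$2"
  aws elb modify-load-balancer-attributes \
    --load-balancer-name "$lb_name" \
    --load-balancer-attributes "{\"ConnectionDraining\":{\"Enabled\":true,\"Timeout\":$seconds}}"
}

# On warning: take the instance out of the load balancer so it receives
# no new requests while in-flight ones drain.
deregister_from_elb() {
  local lb_name="$1" instance_id="$2"
  aws elb deregister-instances-from-load-balancer \
    --load-balancer-name "$lb_name" \
    --instances "$instance_id"
}

# On warning: hand an unfinished SQS message back right away by setting
# its visibility timeout to zero, rather than waiting for it to expire.
release_sqs_message() {
  local queue_url="$1" receipt_handle="$2"
  aws sqs change-message-visibility \
    --queue-url "$queue_url" \
    --receipt-handle "$receipt_handle" \
    --visibility-timeout 0
}
```

Releasing the message explicitly is optional, as noted above, but it lets another worker pick the item up immediately instead of after the remaining visibility timeout.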
The next thing to do is optional, but you could use this time to request another spot instance to replace the one being terminated. You could do that either on the instance itself or by notifying some managing agent that would initiate the request. A separate managing agent is the more robust of the two choices, since we have seen cases where the two-minute warning was not detected or arrived with less than two minutes to spare. If you do this, make sure that the new spot instance will be in a different spot market by selecting a different availability zone, a different instance type, or both. Otherwise you are likely to lose the replacement instance before you even get it spun up.
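To make the "different spot market" idea concrete, here is a small sketch. It represents a market as a hypothetical "zone:instance-type" string, picks the first candidate that differs from the market being reclaimed, and requests one replacement there. The helper names, the market string format, and the minimal launch specification are all assumptions for illustration. A real launch specification would normally also carry things like a key pair, security groups, and user data.

```shell
#!/usr/bin/env bash
# Sketch only. Markets are written as "zone:instance-type", a format
# invented for this example. Function names are illustrative.

# Print the first candidate market that differs from the one being
# reclaimed. Returns non-zero if every candidate is the same market.
pick_other_market() {
  local current="$1" candidate
  shift
  for candidate in "$@"; do
    if [ "$candidate" != "$current" ]; then
      echo "$candidate"
      return 0
    fi
  done
  return 1
}

# Request one spot instance in the given market from the given AMI.
request_replacement() {
  local market="$1" ami_id="$2"
  local zone="${market%%:*}" instance_type="${market##*:}"
  aws ec2 request-spot-instances \
    --instance-count 1 \
    --launch-specification "{\"ImageId\":\"$ami_id\",\"InstanceType\":\"$instance_type\",\"Placement\":{\"AvailabilityZone\":\"$zone\"}}"
}
```

A managing agent could combine the two, for example by passing the result of pick_other_market for the reclaimed market straight to request_replacement.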
The hardest part of an effective strategy for leveraging spot in a production system is gracefully handling the case where instances are taken away from you because of price, without affecting your performance or availability Service Level Agreements. As discussed in Strategies for Mitigating Risk of Using AWS Spot Instances, diversification across spot markets is one key component of keeping the amount of capacity at risk at any one time limited and quantifiable. Leveraging the two-minute warning is another important component. It gives you a two-minute head start on getting a replacement instance going and can therefore lower the amount of excess capacity that you need to keep around to absorb the loss of instances. The AutoScalr service uses the API approach described above to detect these two-minute warning notifications for your spot instances and automatically start a replacement instance, if necessary.