I find operational failure scenarios fascinating. Not only are they a chance to play modern-day Sherlock Holmes in a tech-arena whodunit, they also offer valuable lessons for building more resilient systems. Whenever I dig into one, the same questions come up:
- How did the issue first present itself?
- What were the cascading effects?
- How well did the system respond to the issue and adjust?
- Was manual intervention required? If so, can that be automated in the future?
- What was the root cause?
A few of my favorites, along with their root causes:
Apollo 13
Most people have seen the movie, but if not, rent it: it's a classic in operational problem solving. The root cause was an oxygen tank that was accidentally dropped about two inches during handling nearly two years before launch. The damage was undetectable by the tests of the time, but it led to the mid-flight explosion that nearly killed the crew.
Mars Climate Orbiter
A $125 million spacecraft crashed into Mars because one team used Imperial units and the other used metric. Orbital mechanics and gravity were not forgiving and turned it into a pile of Martian dust.
Large Unnamed Data Center
Taken offline when a drunk driver hit a power pole and the generators failed to pick up the load.
1991 Network Outage
A 1991 network outage that lasted many hours was traced back to a small code change one vendor had made months earlier, which triggered a cascading failure.
S3 outage in us-east-1
Thousands of web sites that assumed S3 “would always be there” were taken down for hours in February 2017 when one AWS admin mistyped one command.
(References with more details on several of these are at the end of the post.)
We try to design our systems to be resilient in all the “what if” scenarios, but strange edge cases are the true test of resiliency. They often expose assumptions you did not even know you had made, and those can bite you in a big way.
Monday, June 19th started off normally enough: coffee while reviewing system KPIs. But it quickly turned interesting, and educational, in the afternoon. Here is how it unfolded:
12:40 PM PDT
The “Increased Spot Instance Launch Delays” issue first presented itself to us as an alert that the amount of savings our service provides for a particular client had dropped to 0%. Normally it is closer to 65-70% so the deviation definitely warranted further investigation.
When we pulled up some of the status dashboards for this client, they showed the entire load running on normal On-Demand instances. One of the key savings drivers the AutoScalr service provides is saving money in a safe way through the blended use of Spot and On-Demand instances diversified over multiple instance types, but clearly something was not working correctly. First question: what had changed?
We had pushed a fairly minor code change to production on Friday, so of course that was suspect number one, but it was quickly exonerated. Next, we checked a few other clients' systems. The first two were running normally, but the third showed similar symptoms trending. And then the first clue: both affected applications were running in the AWS us-east-1 region. We checked the AWS status page, but it showed no issues in us-east-1 (yet).

Diving into system logs, we found that AutoScalr was trying to launch Spot instances, but the requests were hanging in the ‘pending-fulfillment’ state. Normally this state lasts only a few seconds while AWS locates the exact Spot instance to fulfill the request. You only reach this state if the ‘pending-evaluation’ state completes successfully; that earlier state checks things like whether capacity is available, whether your Spot bid is above the current Spot price, and other constraints.

After about 60 seconds of waiting, AutoScalr was timing out the request and going with plan B: launching an On-Demand instance. After all, the triggering event was the client’s application needing more computational power, and the end users are not going to wait! Keeping the application running takes precedence over saving money. And then the epiphany:
It’s not a bug, it’s a feature!
Pleasant surprises from computer systems are a rarity in my experience. When one comes along, enjoy it! (If anyone has other pleasant surprise operational stories, please comment and share them!) We had assumed there was a problem, but the system was doing exactly what it was designed to do for an edge case that we had not seen operationally before.
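For illustration, the timeout-and-fallback policy described above can be sketched like this. This is not AutoScalr's actual code: the AWS calls are injected as callables (with boto3 they would wrap `request_spot_instances`, `describe_spot_instance_requests`, `cancel_spot_instance_requests`, and `run_instances`), so the policy itself stays self-contained and testable.

```python
import time

# Sketch of a "Spot first, On-Demand as plan B" launch policy.
# Hypothetical helper, NOT AutoScalr's implementation: the AWS-facing
# operations are passed in as callables so the logic runs without AWS.
PENDING_STATES = ("pending-evaluation", "pending-fulfillment")

def launch_with_fallback(request_spot, poll_state, cancel_spot,
                         launch_on_demand, timeout=60, interval=5,
                         sleep=time.sleep):
    """Try Spot first; if the request hangs past `timeout` seconds,
    cancel it and fall back to On-Demand so the app keeps scaling."""
    request_id = request_spot()
    waited = 0
    while waited < timeout:
        state = poll_state(request_id)
        if state == "fulfilled":
            return ("spot", request_id)
        if state not in PENDING_STATES:
            break  # request failed outright (bad bid, constraint, ...)
        sleep(interval)
        waited += interval
    cancel_spot(request_id)  # give up on Spot; keeping the app up wins
    return ("on-demand", launch_on_demand())
```

On 6/19, every request simply sat in pending-fulfillment, so after 60 seconds a policy like this cancels the Spot request and launches On-Demand, which is exactly the behavior we observed.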
1:27 PM PDT
Armed with the knowledge that AutoScalr was working correctly, we shifted gears to gathering more information about this unique event and learning as much as we could. We quickly realized that other people had to be affected by this as well, and that we might be among the first to have detected the hanging in the pending-fulfillment state, so we immediately opened a support case with AWS letting them know what we were seeing to get them looking into the problem.
2:49 PM PDT
AWS posted the operational issue publicly as:
We are investigating increased Spot Instance launch delays in the US-EAST-1 Region.
We were curious just how long the delays were. AutoScalr was cancelling each Spot request after 60 seconds as it moved on to plan B, so all we knew was that they were longer than 60 seconds. We started manually requesting Spot instances via the console and saw the same behavior, for both individual instance requests and a Spot Fleet request. Every request we made went to pending-fulfillment and stayed there.
The longest one we measured was in the pending-fulfillment state for over three hours.
We also tested other regions. No noticeable delay was present in any other region we tested. It clearly seemed isolated to us-east-1.
2:55 PM PDT
We saw ONE spot request sneak through in under 60 seconds and make it into production. At about 3:32 PM PDT, a few more started to randomly come through in under 60 seconds.
3:39 PM PDT
AWS posted an update on the issue:
We have identified and resolved the root cause of the increased Spot Instance launch delays in the US-EAST-1 Region. We are currently processing the backlog of launch requests as we continue to work toward full recovery.
Our testing and operational systems concurred with that status and things continued to improve over the next hour.
4:54 PM PDT
AWS closed the issue as resolved with:
Between 12:55 PM and 4:46 PM PDT we experienced increased Spot Instance launch delays in the US-EAST-1 Region. The issue has been resolved and the service is operating normally.
This chart shows the number of requests that went past 60 seconds and timed out over that period.
The first timeout we saw was at 12:56 PM PDT and the last was at 4:42 PM PDT (the chart is in Central time, so offset by two hours), which aligns well with the posted start and end times of the issue.
This chart shows the On-Demand and individual Spot markets in use over the time period. You can see the partial recovery during mid-afternoon PDT and full recovery by about 4:30 PM PDT (all chart times CDT).
So what can we learn from this incident?
First, let's summarize what we know.
On the afternoon of 6/19/17, for a period of roughly 4 hours in us-east-1 region:
- Launching new Spot instances was delayed by as much as three hours
- Existing Spot instances ran just fine with normal spot market price caveats
- On-Demand Instances could still be launched
So let's look at how different deployment architectures would have been affected:
AutoScalingGroup with On-Demand
No impact. I suspect this is why the event was not a big news item: this pattern is probably the most commonly used production pattern, and the issue was contained to AWS users trying to leverage Spot instances for cost savings. If it had slowed down a large number of web sites running in us-east-1, it would likely have drawn press similar to the February 28th S3 issue.
AutoScalingGroup with Spot
The simplest way to leverage Spot is to specify a Spot instance in the launch configuration for your AutoScalingGroup. It is simple, but you only get Spot instances, and only of one instance type, so your risk profile is rather high; you really should not use it for anything beyond test/QA. This architecture would not have been able to scale up during this issue, and it would also have had significant capacity vulnerable to individual Spot market price spikes for the entire timeframe of the issue.
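For reference, the Spot-only pattern is just a launch configuration with a SpotPrice set. A CloudFormation sketch with placeholder values (AMI, bid, and sizes are hypothetical):

```yaml
# Sketch: a Spot-only AutoScalingGroup. Every instance comes from a
# single Spot market, so a launch delay like this one, or a price
# spike, leaves the group unable to scale.
SpotOnlyLaunchConfig:
  Type: AWS::AutoScaling::LaunchConfiguration
  Properties:
    ImageId: ami-12345678        # placeholder AMI
    InstanceType: m4.large       # one instance type = one Spot market
    SpotPrice: "0.10"            # max bid in USD/hour (hypothetical)
WebAutoScalingGroup:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    LaunchConfigurationName: !Ref SpotOnlyLaunchConfig
    MinSize: "2"
    MaxSize: "10"
    AvailabilityZones: !GetAZs ""
```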
Spot Fleet
Spot Fleet does a great job of diversifying over different instance types and availability zones to isolate you from big capacity hits if one Spot market spikes. But since all the instances it uses are Spot, in this case it was vulnerable as well. If more capacity was needed because of increased load, or if some instances were lost due to price, a Spot Fleet powered application would have remained under-provisioned during this issue.
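The diversification Spot Fleet provides comes from listing multiple launch specifications and the diversified allocation strategy. A minimal request-config sketch (AMI and role ARN are placeholders), of the kind you could pass to `aws ec2 request-spot-fleet --spot-fleet-request-config`:

```json
{
  "AllocationStrategy": "diversified",
  "TargetCapacity": 10,
  "IamFleetRole": "arn:aws:iam::123456789012:role/fleet-role",
  "LaunchSpecifications": [
    {"ImageId": "ami-12345678", "InstanceType": "m4.large"},
    {"ImageId": "ami-12345678", "InstanceType": "c4.large"},
    {"ImageId": "ami-12345678", "InstanceType": "r4.large"}
  ]
}
```

Note that every specification here is still Spot; there is no On-Demand fallback in the fleet itself, which is exactly the vulnerability this incident exposed.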
Spot with On-Demand
Which leads to how to plug this operational vulnerability: if you are going to run a highly available application and leverage Spot, you really need to design in some sort of fallback to On-Demand instances. That protects your application from both Spot market price fluctuations and issues like the one we experienced on 6/19/17. As we saw above, AutoScalr was designed to provide this level of protection, but even if you have built your own custom autoscaling solution to leverage Spot, take this as an opportunity to go back and assess how well your application did, or would have, handled this case, and what you can do to make it more resilient. Maybe you were just lucky: running in another region, or not needing to scale up during the issue. Next time you might not be so lucky.
So some parting questions for you:
- Who else was affected by this issue?
- How well did your system respond?
- What actions did you take during the issue? Did they work?
Please comment with your stories related to this issue. Let’s learn from each other!
If you have not seen it, this entire session with James Hamilton from re:Invent 2016 is worth watching. For the discussion of the power event that took down the US airline, jump to about 35 minutes in.