On Thursday, October 8th UTC, a database index operation resulted in approximately 90 minutes of increasingly degraded availability for the Stripe API and dashboard. In aggregate, about two thirds of all API operations failed during this window.
Hundreds of thousands of companies rely on Stripe to power their businesses, and we are proud of our track record of reliability. When failures happen, we put together a timeline of the events that transpired, think about what we’ve learned, and identify the remediations we can take. This was the longest period of unavailability for the Stripe API since we launched, and we want to be transparent with you about what happened and what we’ve learned from it.
On Tuesday, October 6 at 21:30 UTC, an application developer working on API performance improvements submitted a request to modify an existing database index. Our automated tooling filed this as two separate change requests: one to add a new database index and a second to remove the old index. Both change requests appeared in our dashboard for database operators, intermingled with many other alerts, and the dashboard did not indicate that the deletion request depended on the successful completion of the addition request.
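The deletion only makes sense once the replacement index exists. As a minimal sketch of that ordering, assuming a MongoDB-style document store accessed through pymongo (the post does not say which database or tooling was actually involved, and the collection, field, and index names below are hypothetical), the two steps belong in a single sequence with an explicit check between them:

```python
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")  # placeholder connection
charges = client.api.charges                        # hypothetical collection

OLD_INDEX = "merchant_id_1"  # hypothetical name of the index being retired

# Step 1: build the replacement index; create_index blocks on this connection
# until the build has completed.
new_index = charges.create_index(
    [("merchant_id", ASCENDING), ("created", ASCENDING)]
)

# Step 2: drop the old index only after confirming the new one exists.
# Filing these two steps as unlinked change requests loses this dependency.
if new_index in charges.index_information():
    charges.drop_index(OLD_INDEX)
else:
    raise RuntimeError("new index missing; refusing to drop %s" % OLD_INDEX)
```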
On Thursday, October 8 at 00:06 UTC, a database operator working through the queue of open change requests saw that an existing production index was marked as no longer necessary. The operator followed the established procedure for removing unused indexes from production, which removed the index from all of the database replicas simultaneously.
With this index missing from our production databases, requests to a set of API endpoints began to slow down and time out. These slow requests starved the pool of API workers, which caused almost all requests to every API endpoint to slow down, time out, and return errors. The unavailability of the API in turn made the Stripe dashboard unavailable.
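The starvation mechanism is easy to reproduce in miniature. The sketch below is purely illustrative (not Stripe's code, and the pool size, sleep duration, and endpoint names are made up): once every worker in a fixed-size pool is occupied by slow requests, even fast requests to unrelated endpoints queue up and time out.

```python
import concurrent.futures as cf
import time

POOL_SIZE = 4
pool = cf.ThreadPoolExecutor(max_workers=POOL_SIZE)

def slow_endpoint():
    time.sleep(30)       # stands in for a query that lost its index
    return "eventually"

def fast_endpoint():
    return "ok"          # normally returns in milliseconds

# Enough slow requests to occupy every worker in the pool...
for _ in range(POOL_SIZE):
    pool.submit(slow_endpoint)

# ...means an unrelated fast request now waits in the queue and times out.
future = pool.submit(fast_endpoint)
try:
    print(future.result(timeout=1))
except cf.TimeoutError:
    print("503: timed out waiting for an available API worker")
```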
By 00:08 UTC, our on-call engineer had been paged and had responded. At 00:10 UTC, we linked the API degradation to the removal of the index. At 00:17 UTC, eleven minutes after the start of the incident, the response team started rebuilding the removed index.
By 00:24 UTC, the response team had determined that the index would take over an hour to rebuild, and began working in parallel on several independent ways to restore service before the index build completed.
At 01:30 UTC, the availability of the API and the dashboard began to recover. One of the response teams had succeeded in modifying the API so that no code path attempted to use the missing index, allowing it to respond quickly, although in a slightly degraded state. By 01:45 UTC, the index rebuild had finished and all services had returned to normal operation. We have seen no further availability issues since.
The proximate cause of this incident was the deletion of a heavily used index in the API’s critical path. A large number of factors contributed to this failure and to the severity of the ensuing incident.
There was a breakdown in communication between the developer who requested the index migration and the database operator who deleted the old index. Instead of working on the migration together, they communicated implicitly through flawed tooling. The dashboard that surfaced the migration request was missing important context: the reason for the requested deletion, the dependency on the creation of the new index, and the criticality of the index for API traffic. The database operator also had no way to check whether the index had recently been used by any query.
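For context, a usage check of that kind is a small query against the database's own statistics. The post does not identify the database involved; the sketch below assumes PostgreSQL, whose pg_stat_user_indexes view records how many times each index has been scanned, and the index name and connection string are hypothetical.

```python
import psycopg2

conn = psycopg2.connect("dbname=api")  # placeholder connection string
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT idx_scan
        FROM pg_stat_user_indexes
        WHERE indexrelname = %s
        """,
        ("charges_by_merchant_idx",),  # hypothetical index name
    )
    row = cur.fetchone()

# idx_scan counts scans since statistics were last reset; a non-zero count
# means the index is still serving queries and should not be dropped.
if row is None or row[0] > 0:
    raise RuntimeError("index not found in stats or still in use; do not drop")
print("index unused since the last stats reset; candidate for removal")
```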
Alerting fired immediately when API requests began to slow down. However, because the index had been removed from the primary and all secondaries simultaneously, there was no replica left to roll back to, and we had to go through a slow rebuild process instead. While the rebuild was in flight, we deployed code to circumvent the slow query and return degraded responses. This ultimately shaved only minutes off our recovery, because it required writing and deploying new code rather than having the API degrade gracefully in the presence of a slow query.
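As a rough sketch of the kind of graceful degradation we mean (illustrative only, not our API code; the function names and the 200ms budget are hypothetical): give the affected query a fixed time budget and fall back to a reduced response when the budget is exceeded, rather than letting the request occupy a worker indefinitely.

```python
import concurrent.futures as cf

query_pool = cf.ThreadPoolExecutor(max_workers=8)
QUERY_BUDGET_SECONDS = 0.2  # hypothetical per-query time budget

def fetch_enriched_fields(charge_id):
    # Stand-in for the lookup that depended on the deleted index; with the
    # index gone this could take far longer than the budget allows.
    return {"enriched": True}

def get_charge(charge_id, base_record):
    """Serve the full response when the enrichment query is fast enough,
    otherwise fall back to a slightly degraded but immediate response."""
    future = query_pool.submit(fetch_enriched_fields, charge_id)
    try:
        extra = future.result(timeout=QUERY_BUDGET_SECONDS)
        return dict(base_record, **extra)
    except cf.TimeoutError:
        future.cancel()  # best effort; the underlying query may still run
        return dict(base_record, enriched=False)  # degraded but fast

print(get_charge("ch_123", {"id": "ch_123", "amount": 1000}))
```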
We’re currently focused on providing any assistance we can and on responding as quickly as possible to any questions our users have. Over the next few days we will begin to focus on three categories of remediations: actions we can take to prevent incidents like this from recurring, actions we can take to detect similar incidents sooner, and actions we can take to recover from similar incidents more quickly. This will involve improvements to our architecture, monitoring, tooling, and processes.