Stripe – Outage postmortem


On Thursday, October 8th UTC, a database index operation resulted in approximately 90 minutes of increasingly degraded availability for the Stripe API and dashboard. In aggregate, about two-thirds of all API operations failed during this window.

Hundreds of thousands of companies rely on Stripe to power their businesses, and we are usually proud of our reliability. When failures happen, we put together a timeline of the events that transpired, we think about what we’ve learned, and about what remediations we can take. This was the longest period of unavailability for the Stripe API since we launched. We want to be transparent with you about what happened and what we’ve learned from it.


On Tuesday, October 6 at 21:30 UTC, an application developer working on API performance improvements submitted a request to modify an existing database index. Our automated tooling filed it as two separate change requests: one to add a new database index and a second to remove the old database index. Both of these change requests were reflected in our dashboard for database operators, intermingled with many other alerts. The dashboard did not indicate that the deletion request depended on the successful completion of the addition request.
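
To illustrate the missing dependency information, here is a minimal sketch of a change-request queue in which the removal explicitly records that it depends on the addition. The class, field names, and request ids are hypothetical and are not taken from Stripe's tooling; the point is only that an operator-facing view could refuse to show a removal as actionable while its prerequisite is still open.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    """A hypothetical index change request as an operator queue might model it."""
    request_id: str
    action: str          # "add_index" or "drop_index"
    index_name: str
    depends_on: list = field(default_factory=list)  # ids that must complete first
    completed: bool = False


def blocked_by(request, queue):
    """Return the ids of unfinished requests that `request` depends on."""
    by_id = {r.request_id: r for r in queue}
    return [dep for dep in request.depends_on if not by_id[dep].completed]


# One logical "modify index" request, split into two linked change requests.
add = ChangeRequest("cr-1", "add_index", "new_index")
drop = ChangeRequest("cr-2", "drop_index", "old_index", depends_on=["cr-1"])
queue = [add, drop]

print(blocked_by(drop, queue))  # non-empty while the add is still open
add.completed = True
print(blocked_by(drop, queue))  # empty once the add has completed
```

With a check like this, the removal would only become actionable after the addition had finished, which is the ordering the original request implied.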

On Thursday, October 8 at 00:06 UTC, a database operator working through the queue of open change requests saw that an index existing in production was marked as no longer necessary. The database operator followed the established procedure for removing unused indexes from production, which resulted in the index being removed from all the database replicas simultaneously.

With the above index missing from our production databases, requests to a set of API endpoints began slowing down and timing out. In addition, the slowed down API requests resulted in starvation of the pool of API workers, which caused almost all requests to any API endpoint to slow down and time out, returning errors. The unavailability of our API also resulted in the unavailability of the Stripe dashboard.
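
The starvation effect described above can be approximated with Little's law (average requests in flight = arrival rate × time in system). The sketch below uses entirely hypothetical numbers (pool size, request rates, latencies, and timeout are assumptions, not figures from this incident) to show how a small share of very slow requests can demand far more workers than the pool has.

```python
def workers_needed(requests_per_second, latency_ms):
    """Little's law: average busy workers = arrival rate * time per request."""
    return requests_per_second * latency_ms / 1000


# All figures below are hypothetical and chosen only for illustration.
POOL_SIZE = 100       # API workers available
FAST_RPS = 950        # requests/s that never touch the missing index
SLOW_RPS = 50         # requests/s that relied on the missing index
NORMAL_MS = 50        # latency of a healthy request
TIMEOUT_MS = 30_000   # time a slow request holds a worker before timing out

healthy = workers_needed(FAST_RPS + SLOW_RPS, NORMAL_MS)
degraded = (workers_needed(FAST_RPS, NORMAL_MS)
            + workers_needed(SLOW_RPS, TIMEOUT_MS))

print(f"healthy:  {healthy} busy workers of {POOL_SIZE}")
print(f"degraded: {degraded} busy workers of {POOL_SIZE}")
```

Under these assumptions the slow 5% of traffic alone would occupy many times the whole pool, so requests to unrelated endpoints queue behind it and time out as well.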

At 00:08 UTC, our on-call engineer had been paged and had responded. At 00:10 UTC, we linked the API degradation to the removal of the index. At 00:17 UTC, eleven minutes after the start of the incident, the response team started rebuilding the removed index.

By 00:24 UTC, the response team had determined that the index would take over an hour to rebuild, and began work in parallel on several independent ways to recover before the conclusion of the index build.

At 01:30 UTC, the availability of the API and the dashboard began to recover. One of the response teams had succeeded in modifying the API so that no code paths would attempt to use the missing index, instead operating quickly although in a slightly degraded state. At 01:45 UTC, the index build had finished and all services returned to normal operation. Since then, we have seen no further availability issues.

Contributing factors

The proximate cause of this incident was the deletion of a heavily used index in the API’s critical path. A large number of factors contributed to this failure and to the severity of the ensuing incident.

There was a breakdown in communication between the developer who requested the index migration and the database operator who deleted the old index. Instead of working on the migration together, they communicated in an implicit way through flawed tooling. The dashboard that surfaced the migration request was missing important context: the reason for the requested deletion, the dependency on another index’s creation, and the criticality of the index for API traffic. Indeed, the database operator didn’t have a way to check whether the index had recently been used for a query.

Alerting fired immediately when API requests began to slow down. However, the fact that the index was simultaneously removed from the primary and secondaries made it too late to roll back. Instead, we had to go through a slow rebuild process. While that process was in flight, we deployed code to circumvent the slow query and return degraded responses. This ultimately only shaved minutes off our recovery because it required writing and deploying code rather than the API gracefully degrading in the presence of this slow query.
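
As a rough illustration of what graceful degradation in the presence of a slow query could look like, the sketch below wraps a query in a time budget and returns a degraded response when the budget is exceeded. The helper `with_fallback`, the stand-in query, and the response shape are all hypothetical; this is one common pattern, not a description of how the Stripe API is built.

```python
import concurrent.futures
import time

# Small shared pool used only to run queries under a time budget.
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def with_fallback(query, fallback, timeout_seconds):
    """Run `query`; if it does not finish in time, return `fallback()` instead."""
    future = _executor.submit(query)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        # Best effort only: a query that is already running is not interrupted.
        future.cancel()
        return fallback()


def slow_query():
    time.sleep(0.5)  # stands in for a query that lost its index
    return {"items": ["full", "result"], "degraded": False}


def degraded_response():
    return {"items": [], "degraded": True}


result = with_fallback(slow_query, degraded_response, timeout_seconds=0.1)
print(result)
```

Note that the abandoned query keeps running in the background here, so a real system would also need a timeout enforced by the database itself to actually free the resources.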


We’re currently focused on providing any assistance we can and ensuring we respond as quickly as possible to any questions our users have. Over the next few days we will begin to focus on three categories of remediations: actions we can take to prevent incidents like this from recurring, actions we can take to detect incidents like this sooner, and actions we can take so that we can more quickly recover from similar incidents in the future. This will involve improvements to our architecture, monitoring, tooling, and processes.

Deprecate jQuery? Nuts!

Background: I saw a comment from a developer I respect that jQuery should be deprecated.

Amazing thought. What could possibly be gained by this?

It’s so deeply ingrained in the codebase. It would be like undermining the US dollar and the full faith and credit of our treasury.

It then struck me that...

Developers attacking established, broadly deployed syntax is as dysfunctional as the House threatening to default on the US debt.

PS: This was originally posted on Facebook.

Pushing the Limits of Network Traffic With Open Source

An anonymous reader writes: CloudFlare’s content delivery network relies on their ability to shuffle data around. As they’ve scaled up, they’ve run into some interesting technical limits on how fast they can manage this. Last month they explained how the unmodified Linux kernel can only handle about 1 million packets per second, when easily available NICs can manage 10 times that. So, they did what you’re supposed to do when you encounter a problem with open source software: they developed a patch for the Netmap project to increase throughput. “Usually, when a network card goes into the Netmap mode, all the RX queues get disconnected from the kernel and are available to the Netmap applications. We don’t want that. We want to keep most of the RX queues back in the kernel mode, and enable Netmap mode only on selected RX queues. We call this functionality: ‘single RX queue mode.’” With their changes, Netmap was able to receive about 5.8 million packets per second. Their patch is currently awaiting review.

There Is No .bro In Brotli: Google/Mozilla Engineers Nix File Type As Offensive

theodp writes: Several weeks ago, Google launched Brotli, a new open source compression algorithm for the web. Since then, controversy broke out over the choice of ‘bro’ as the content encoding type. “We are hoping to establish a file ending .bro for brotli compressed files, a command line tool ‘bro’ for compressing and uncompressing brotli files, and a accept/content encoding type ‘bro’,” explained Google software engineer Jyrki Alakuijala. “Can I talk you out of it?,” replied Mozilla SW engineer Patrick McManus. “‘bro’ has a gender problem, even though the dual meaning is unintentional. It comes of[f] misogynistic and unprofessional due to the world it lives in.” Despite some pushback from commenters, a GitHub commit made by Google’s Zoltan Szabadka shows that there will be no ‘.bro’ in Brotli. “I have asked a feminist friend from the North American culture-sphere, and she advised against bro,” explained Alakuijala. “We have found a compromise that satisfies us, so we don’t need to discuss this further. Even if we don’t understand why people are upset from our cultural standpoint, they would be (unnecessarily) upset and this is enough reason not to use it.”

