Continuous Deployment at Instagram

At Instagram, we deploy our backend code 30-50 times a day, whenever engineers commit changes to master, with no human involvement in most cases. This may sound crazy, especially at our scale, but it works really well. This post talks about how we implemented this system and got it working smoothly.

Why do it?

Continuous deployment has a number of advantages for us:

  1. It lets our engineers move really fast. They aren’t limited to a few deployments per day at fixed times; instead, they can get code deployed whenever they want. This means that they waste less time and can iterate on changes very quickly.
  2. It makes it much easier to identify bad commits. Instead of having to dig through tens or hundreds of commits to find the cause of a new error, the pool is narrowed down to one, or at most two or three. This is also really useful when you identify a problem later and go back to debug. The metrics or data that indicate the problem can be used to identify an accurate start time, and from there we can find which commits were deployed at that time.
  3. Bad commits are detected and dealt with very quickly, which means that we don’t end up with an undeployable mess in master that causes significant delays for other, unrelated changes. We’re always in a state where we can get important fixes out quickly.


The success of this implementation can largely be attributed to the iterative way it was built. Instead of building the system on the side and suddenly switching over, we evolved our existing mechanisms until they became continuous deployment.

How it worked before

Before continuous deployment, engineers deployed changes on an ad-hoc basis. They’d land changes, and if they wanted them deployed soon, they’d run a rollout; otherwise they’d wait for another engineer to come along and do so. Engineers were expected to know how to do a small-scale test beforehand: they would do a rollout targeting one machine, log into that machine and check the logs, and then run a second rollout targeting the entire fleet. This was all implemented as a Fabric script, and we had a very basic database and UI called “Sauron” which stored a log of rollouts.

Canary and testing

The first step was adding canarying, which initially just scripted what engineers were already expected to do. Instead of running a separate rollout targeting one machine, the script deployed to the canary machine, tailed the logs for the user, and asked whether it should continue to the full deploy. Next came some basic analysis of the canary machine: a script collected the HTTP status codes for each request, categorized them, and applied hard-coded percentage thresholds (e.g. less than 0.5% 5xx, at least 90% 2xx). However, this analysis would only warn the user if the thresholds failed; it didn’t stop the deploy on its own.
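
As an illustration, a minimal sketch of this kind of threshold check might look like the following; the function name, signature, and exact percentages are assumptions for illustration, not the actual script:

  # Minimal sketch of a canary threshold check; the names and exact thresholds
  # here are assumptions, not Instagram's actual script.
  from collections import Counter

  def canary_ok(status_codes, min_2xx=0.90, max_5xx=0.005):
      """Return True if the canary's HTTP responses look healthy."""
      if not status_codes:
          return False  # no traffic reached the canary; treat as a failure
      counts = Counter(code // 100 for code in status_codes)
      total = len(status_codes)
      return counts[2] / total >= min_2xx and counts[5] / total <= max_5xx

  # Example: 97% 2xx with a single 5xx passes; a 10% 5xx spike fails.
  print(canary_ok([200] * 970 + [404] * 29 + [500]))  # True
  print(canary_ok([200] * 900 + [500] * 100))         # False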

We already had a test suite, but it was only run by engineers on their development machines. Code reviewers had to take the author’s word that the tests passed, and we didn’t know the test status of the resulting commit in master. So we set up Jenkins to run the tests on new commits in master and report the results to Sauron. Sauron kept track of the latest commit that had passed tests, and when doing a rollout, this commit would be suggested instead of the latest commit overall.

Facebook uses Phabricator for code reviews and has a continuous integration system called Sandcastle which integrates well with Phabricator. We got Sandcastle to run tests whenever a diff was created or updated, and report the results on the diff.


To get to automation, we first had to lay some groundwork. We added states to rollouts (running, done, error), and made the script warn if the previous rollout was not in “done” state. We added an abort button in the UI which would change the state to “abort,” and got the script to check the state occasionally and react. We also added full commit tracking; instead of Sauron only knowing the latest commit which had passed tests, it now had a record for every commit in master, and knew the test status of each specific one.
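
As a rough sketch, the abort handling could look like the loop below (Sauron’s real interface is internal, so the get_state/set_state calls and the check interval are hypothetical):

  # Hypothetical sketch of a rollout loop that honors the abort button.
  def run_rollout(sauron, rollout_id, hosts, deploy_to_host):
      """Deploy to each host, periodically checking whether an abort was requested."""
      for i, host in enumerate(hosts):
          if i % 10 == 0 and sauron.get_state(rollout_id) == "abort":
              return "abort"  # an operator hit the abort button in the UI
          deploy_to_host(host)
      sauron.set_state(rollout_id, "done")
      return "done"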

Then we automated the remaining decisions that humans needed to make. The first decision was which commit to roll out. The initial algorithm always selected a commit that had passed tests, and selected as few new commits as possible (never more than three). If every commit had passed tests, it selected one new commit each time; at most two consecutive commits with non-passing test runs could be included. The second decision was whether the rollout was successful: if more than 1% of hosts failed to deploy, the rollout was considered failed.
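
A simplified sketch of that selection logic, following the description above (the real implementation lives in our deploy tooling and isn’t published, so the data shapes here are assumptions):

  # Hypothetical sketch of the initial commit-selection logic.
  # `pending` is the list of undeployed master commits, oldest first;
  # each entry records whether its test run passed.
  MAX_COMMITS_PER_ROLLOUT = 3

  def select_commit(pending):
      """Return the commit to deploy up to, or None if nothing is deployable yet."""
      for commit in pending[:MAX_COMMITS_PER_ROLLOUT]:
          if commit["tests_passed"]:
              return commit  # deploying this also picks up any earlier skipped commits
      return None  # no passing commit within reach; wait for a fix or a revert

  # Example: two failing commits followed by a passing one -> deploy three at once.
  print(select_commit([{"id": "a1", "tests_passed": False},
                       {"id": "b2", "tests_passed": False},
                       {"id": "c3", "tests_passed": True}]))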

At this point, doing a rollout when things were normal simply consisted of answering “yes” a couple of times (accepting the suggested commit, starting the canary, and continuing to the full deploy). So we allowed those questions to be answered automatically and got Jenkins to run the rollout script. At first, the engineers implementing this only enabled the automation while they were at their desks supervising it, until they no longer needed to supervise it at all.


While we were doing continuous deployment at this stage, it wasn’t completely smooth yet. There were a couple of kinks to work out.

Test failures

Engineers would often land diffs that broke tests, which would cause all subsequent master commits to also fail tests, and thereby prevent anything from being deployed. The oncall would need to notice this, revert the offending commit, wait for tests to pass on the revert, and then manually roll out the entire backlog before the automation could continue. This defeated one of the main advantages of continuous deployment, which was deploying very few commits per rollout. The problem here was that tests were slow and unreliable. We made various optimizations to get tests running in five minutes instead of 12-15 minutes, and fixed the test infrastructure problems that were causing them to be unreliable.


Despite these improvements, we still regularly had a backlog of changes that needed to be deployed. The most common cause was canary failures (both false and true positives), but there were occasionally other breakages. Once the cause was resolved, the automation would pick up and deploy one commit at a time, so clearing the backlog took a while and caused significant delays for newly landed diffs. The oncall would usually step in and deploy the entire backlog at once, which defeated one of the main advantages of continuous deployment.

To improve this, we implemented backlog handling in the commit selection logic, which made the automation deploy multiple commits when there was a backlog. The algorithm is based on setting a time goal within which every commit should be deployed (30 minutes). For each commit in the queue, it calculates the time remaining to meet that goal, the number of rollouts that can be done in that time (using a hard-coded rollout duration), and the number of commits that would have to be deployed per rollout to hit the goal. It takes the maximum commits-per-rollout value across the queue, but caps it at three. This allows us to do as many rollouts as possible while still getting every commit out in a reasonable time.
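
A rough sketch of that calculation, assuming a hard-coded rollout duration and the 30-minute goal described above (the constants and names are illustrative assumptions):

  import math

  # Hypothetical sketch of the backlog-handling calculation.
  GOAL_MINUTES = 30            # every commit should be out within this window
  ROLLOUT_MINUTES = 10         # assumed hard-coded duration of one rollout
  MAX_COMMITS_PER_ROLLOUT = 3

  def commits_per_rollout(queue_ages_minutes):
      """queue_ages_minutes: how long each undeployed commit has waited, oldest first."""
      needed = 1
      for position, age in enumerate(queue_ages_minutes, start=1):
          time_left = max(GOAL_MINUTES - age, ROLLOUT_MINUTES)
          rollouts_left = max(time_left // ROLLOUT_MINUTES, 1)
          needed = max(needed, math.ceil(position / rollouts_left))
      return min(needed, MAX_COMMITS_PER_ROLLOUT)

  # Example: six queued commits, the oldest already waiting 25 minutes,
  # so the automation batches commits instead of deploying one at a time.
  print(commits_per_rollout([25, 20, 15, 10, 5, 0]))  # 3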

One specific cause of backlogs was that rollouts got slower as the size of our infrastructure increased. We got to a point where ssh-agent pegged an entire core authenticating SSH connections, and the fab master process also pegged a core managing all the tasks. The solution here was to switch to Facebook’s distributed SSH system.

Guiding principles

So what do you need in order to implement something similar to what we’ve done? There are a few key principles that make our system work well, and you can apply them to your own.

  1. Tests: The test suite needs to be fast. It needs to have decent coverage, but doesn’t necessarily have to be perfect. The tests need to be run often: during code review, before landing (and ideally blocking lands on failure), and after landing.
  2. Canary: You need an automated canary to prevent the really bad commits from being deployed to the entire fleet. It doesn’t need to be perfect, however – even a very simple set of stats and thresholds can be good enough.
  3. Automate the normal case: You don’t have to automate every situation; just automate the known, normal situations. If anything is abnormal, make the automation stop and let humans step in.
  4. Make people comfortable: I think that a big barrier to this kind of automation is when people feel disconnected and out of control. To address this, the system needs to provide good visibility into what it has done, is doing, and (preferably) is about to do. It also needs good stop mechanisms.
  5. Expect bad deploys: Bad changes will get out, but that’s okay. You just need to detect this quickly, and be able to roll back quickly.

This is something that many other companies can implement. Continuous deployment systems don’t need to be complex. Start with something simple that focuses on the principles above, and refine it from there.

What’s next

This system is working well for us at the moment, but there are further challenges which we will face and improvements we’d like to make.

  • Keeping it fast: Instagram is growing quickly, and as such the commit rate is going to continue to increase. We need to keep the rollout fast in order to maintain very few commits per rollout. One possibility here is to split the rollout into multiple stages and implement pipelining.
  • Adding canarying: As the commit rate increases, canary failures and backlogs are going to impact more and more developers. We want to stop more bad commits from getting into master and blocking deployment, and so we’re implementing canarying as part of Landcastle. After tests pass, Landcastle will test the change with production traffic and fail the land if it doesn’t pass the canary thresholds.
  • More data: We want to improve the canary’s detection capabilities, so we’re planning to collect and check more stats, like per-view-function response codes. We’re also experimenting with collecting stats from a set of control machines and comparing those to the canary stats, instead of using the current static thresholds.
  • Improving detection: It would be good to reduce the impact of bad commits not being caught by the canary. Instead of testing on one machine and then deploying to the entire fleet, we could add more stages in between (a cluster or a region), checking metrics at that level before continuing.

Original URL:

Original article

Facebook Messenger Platform

A Platform Designed for Businesses and Developers

Now you can interact with them where they already are

Discover all the ways bots on the Messenger Platform help you join the conversation.

The Send/Receive API must not be used to send marketing or promotional messages, such as sale or product announcements, brand advertising, branded content, newsletters or the up-selling or cross-selling of products or services.


Original URL:

Original article

Facebook’s Instant Articles is now open to all publishers

As previously announced, Facebook today opened its Instant Articles format to all publishers. Using Instant Articles, publishers can show Facebook mobile users a fast-loading and mostly distraction-free view of their posts while still showing them a limited amount of their own ads (or using Facebook’s Audience Network to monetize their content) and measuring pageviews through tools…

Original URL:

Original article

Intel to ship thumb-sized Compute Sticks with Skylake chips in late April

If you’ve been waiting patiently for Intel’s new Compute Sticks with Skylake chips, there’s good news: Those thumb-sized PCs will start shipping on April 29.

The three Compute Sticks, which have Intel’s Core M3 and M5 chips, can turn a TV or display with an HDMI port into a PC. All you need to do is plug the Compute Stick into the HDMI port.

The new Compute Sticks were announced in January at CES. Starting at $299, the Skylake-based Compute Sticks aren’t priced as low as older models but pack the processing power of lightweight laptops.

The benefits of Compute Sticks are still debated. These computers rate high in portability but have memory, storage, and port limitations. Users also need to lug around a wireless keyboard and mouse.


Original URL:

Original article

Ignition: A new CoreOS machine provisioning utility

We are excited to officially introduce Ignition, the next-generation machine provisioning utility from CoreOS. Those who follow along closely may have noticed that Ignition has been a part of CoreOS for the better part of a year. The project has had time to be tested and to mature, and the features and user interface are in a place where we are happy to encourage daily, heavy-duty use. It’s also a good time to welcome the community to test and help improve Ignition. Before diving into the details, let’s understand why we built Ignition in the first place.

Why build Ignition?

CoreOS Cloud-Init

It’s been more than three years since we originally shipped our flavor of Cloud-Init. This tool runs after boot, reading a user-provided configuration and fetching platform-specific metadata, then applying that configuration to the running machine. The vast majority of our users employ this utility to configure CoreOS machines, performing a variety of tasks from adding SSH keys to starting an etcd cluster.

CoreOS cloudinit uses a language extremely similar to YAML to describe its configuration, the cloud-config. One of the few differences from standard YAML is that the cloud-config requires a comment on the first line indicating that the remainder of the document is a cloud-config. This, coupled with YAML’s type-inferencing nature, makes it difficult to programmatically generate or manipulate a cloud-config. Many languages have a YAML-parsing library, but very few have a library for parsing the CoreOS cloud-config variant. This results in bugs like the leading comment being stripped, thereby invalidating the config, or octal file permissions being converted to decimal, yielding unintended and potentially invalid file modes on disk. Sometimes, a configuration’s “off” can even be rewritten as “false”, and many failures cascade from there.
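
For instance, a standard YAML 1.1 parser (here Python’s PyYAML, used purely as an illustration; this is not CoreOS code) will silently reinterpret values that a cloud-config author meant literally:

  # Demonstration of YAML 1.1 type inference; requires PyYAML (pip install pyyaml).
  # This is generic YAML, not a real cloud-config.
  import yaml

  doc = """
  write_files:
    - path: /etc/example.conf
      permissions: 0644    # the author means the octal mode 0644
      content: off         # the author means the literal string "off"
  """

  entry = yaml.safe_load(doc)["write_files"][0]
  print(entry["permissions"])  # 420   (0644 was silently parsed as an octal integer)
  print(entry["content"])      # False (off was silently inferred as a boolean)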

It is difficult for cloudinit to properly configure early system services, especially networking, because it runs later in the boot process. Most system services are already up and running by the time cloudinit starts, requiring a reconfiguration that is more prone to errors.


Ignition is a new machine provisioning utility designed to solve the same problems as coreos-cloudinit while adding a host of new capabilities with clearer semantics. At the most basic level, Ignition is a tool for manipulating disks during early boot. This includes partitioning disks, formatting partitions, writing files, and configuring users.

On the first boot, Ignition reads its configuration from a designated source, like a remote URL, a network or cloud provider metadata service, or a hypervisor bridge, and applies that configuration to the machine. Rather than a YAML variant, Ignition uses pure JSON for its configuration format. JSON’s type system eliminates the problems that arose with cloud-config YAML, and makes it very easy to write tools to generate new configs or manipulate existing ones.
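
For example, because the format is plain JSON, another tool can generate a valid config with nothing more than a standard JSON library (a generic sketch; the file path and URL below are hypothetical placeholders):

  # Generic sketch: emitting a minimal Ignition config with only the standard
  # library; the file path and source URL are hypothetical placeholders.
  import json

  config = {
      "ignition": {"version": "2.0.0"},
      "storage": {
          "files": [{
              "filesystem": "root",
              "path": "/etc/motd",
              "mode": 420,  # 0644; JSON has no octal literals, so modes are decimal
              "contents": {"source": "https://example.com/motd"},
          }]
      },
  }

  print(json.dumps(config, indent=2))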

Ignition runs very early in the boot process. In fact, it runs before systemd is invoked by the kernel as PID 1, before any permanent storage is mounted. This allows Ignition to configure foundational system services and features, like reformatting the root filesystem, creating a RAID array, or configuring the network to use bonded interfaces. Running before systemd helps ensure that all services established by Ignition are known to and can be managed by systemd when it subsequently starts. This allows systemd to do what it does best: concurrently start services as quickly as possible. The result is a simpler startup, a faster startup, and the ability to accurately inspect systemd’s unit dependency graphs.

Using Ignition

On first boot, Ignition fetches and evaluates a configuration describing the desired state of the machine. Much like with cloud-config, the configuration is provided in a file, with the cloud-config simply being replaced by an Ignition config. Unlike coreos-cloudinit, Ignition only runs on the first boot of a machine. It is safe for system administrators or other software tools to make subsequent modifications to the system without them being continually overwritten by the original configuration at every reboot.

Ignition’s configuration format is versioned to allow clear deprecation points for possible changes to the format in the future. The Ignition v0.4.0 executable accepts configurations in the v2.0.0 Ignition config format. It will also accept v1 configs and transparently convert them.

The simplest Ignition configuration is shown below:

  "ignition": {
    "version": "2.0.0"

This configuration isn’t very useful since it doesn’t actually declare any desired state. The next example is more interesting:

  "ignition": {
    "version": "2.0.0"
  "storage": {
    "files": [
        "filesystem": "root",
        "path": "/home/core/",
        "mode": 292,
        "contents": {
          "source": "",
          "verification": {
            "hash": "sha512-d7c87dad692a1948d729e44184feb2bdc5de24d26561c597a02263a96accfc2a88f87735c8fdccbfea5d90e63cddd986260da007b114d47b7ad4c5a5fc295d2b"
  "systemd": {
    "units": [
        "name": "unzip-pcr.service",
        "enable": true,
        "contents": "[Unit]nConditionFirstBoot=yesnn[Service]nType=oneshotnExecStart=/usr/bin/unzip /home/core/ -d /home/corenn[Install]"

This configuration will download and unzip a file containing the TPM validation hash values for CoreOS v1010.1.0, using the SHA-512 hash provided in the config to verify the contents of the validation hashes file itself.

Many more examples can be found in the Ignition documentation, including how to construct a RAID array and bond multiple network interfaces.

Supported platforms

Ignition v0.4.0 is included in CoreOS starting in version 1010.1.0, currently in the Alpha channel, and is currently supported on bare metal installations and PXE-booted environments. While older editions of CoreOS shipped with previous Ignition versions, Ignition v0.4.0 is the first to use the v2.0.0 configuration format.

PXE & bare metal

Ignition is particularly helpful on bare metal for network configuration and advanced disk operations because it runs even before PID 1 starts. A URL pointing to an Ignition configuration can be provided with the coreos.config.url kernel command-line parameter. The CoreOS install script also accepts Ignition configuration files, installing them into the OEM partition.

For PXE booting, supply the coreos.first_boot=1 parameter to trigger Ignition. This forces Ignition to run in PXE scenarios where a GPT disk GUID may not exist.

Cloud providers

If you’re feeling adventurous, Ignition has experimental support for Microsoft Azure, Amazon EC2, and VMware, with many more providers coming soon. The project is under active development, so expect the supported platforms list to expand in the coming months. Configuration is provided via the same mechanism as CoreOS Cloud-Init, so it’s easy to switch to Ignition. Instead of providing your machine with a cloud-config, you can provide an Ignition config.

For more information, check out the official documentation and join us on IRC at freenode/#coreos. Fire up Ignition and let us know what you think!

Original URL:

Original article
