I’ve been writing a lot about what I believe is important for the future of Drupal, but now it is your turn. After every major release of Drupal I do a “product management survey” to get your ideas and data on what to focus on for future releases of Drupal (8.x/9).
The last time we had such a survey was after the release of Drupal 7, six months into the development of Drupal 8. I presented the results at DrupalCon London in 2011. The results informed the Drupal community at large, but were also the basis for defining initiatives for Drupal 8. This time around, I’m hoping for similar impact, but also some higher-level strategic thinking about how Drupal should respond to various market trends.
It shouldn’t take more than 10-15 minutes to fill out the survey. We’d like to hear from everyone who cares about Drupal: content managers, site owners, site builders, module developers, front-end developers, people selling and marketing Drupal, etc. Whether you are a Drupal expert or just getting started with Drupal, every voice counts! Best of all, with Drupal 8’s new 6-month release cycle, we can act on the results of this survey much sooner than in the past.
I will be presenting the results during my DrupalCon New Orleans keynote (the video recording of the keynote, the presentation slides and the survey results will be downloadable from my blog afterwards). Please tell us what you think about Drupal; your feedback will shape future versions of Drupal.
At Instagram, we deploy our backend code 30-50 times a day… whenever engineers commit changes to master… with no human involvement in most cases. This may sound crazy, especially at our scale, but it works really well. This post talks about how we implemented this system and got it working smoothly.
Why do it?
Continuous deployment has a number of advantages for us:
It lets our engineers move really fast. They aren’t limited to a few deployments per day at fixed times; instead, they can get code deployed whenever they want. This means that they waste less time and can iterate on changes very quickly.
It makes it much easier to identify bad commits. Instead of having to dig through tens or hundreds of commits to find the cause of a new error, the pool is narrowed down to one, or at most two or three. This is also really useful when you identify a problem later and go back to debug. The metrics or data that indicate the problem can be used to identify an accurate start time, and from there we can find which commits were deployed at that time.
Bad commits get detected very quickly and dealt with, which means that we don’t end up with an undeployable mess in master and cause significant delays for other unrelated changes. We’re always in a state where we can get important fixes out quickly.
The success of this implementation can largely be attributed to the iterative way we built it. Instead of building this system on the side and suddenly switching over, we evolved the current mechanisms until they became continuous deployment.
How it worked before
Before continuous deployment, engineers deployed changes on an ad-hoc basis. They’d land changes, and if they wanted them deployed soon, they’d run a rollout. Otherwise they’d wait for another engineer to come along and do so. Engineers were expected to know how to do a small scale test beforehand: they would do a rollout targeting one machine, log into that machine and check the logs, and then run a second rollout targeting the entire fleet. This was all implemented as a Fabric script, and we had a very basic database and UI called “Sauron” which stored a log of rollouts.
Canary and testing
The first step was adding canarying, which was initially simply scripting what engineers were already expected to do. Instead of running a separate rollout targeting one machine, the script deployed to the canary machine, tailed the logs for the user, and asked whether it should continue to the full deploy. Next came some basic analysis of the canary machine: a script collected the HTTP status codes for each request, categorized them, and applied hard-coded percentage thresholds (e.g. less than 0.5% 5xx, at least 90% 2xx). However, this would only warn the user if a threshold was not met.
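The threshold check described above can be sketched roughly as follows. This is an illustration, not Instagram's script: the function and constant names are made up, and only the two example thresholds quoted in the text are used.

```python
from collections import Counter

# Example thresholds quoted in the post; the constant names are illustrative.
MAX_5XX_RATIO = 0.005  # 5xx responses must stay below 0.5%
MIN_2XX_RATIO = 0.90   # 2xx responses must make up at least 90%


def categorize(status_codes):
    """Count responses per status class, e.g. {"2xx": 990, "5xx": 10}."""
    return Counter(f"{code // 100}xx" for code in status_codes)


def canary_warnings(status_codes):
    """Return one warning per violated threshold; an empty list means healthy."""
    total = len(status_codes)
    if total == 0:
        return ["canary served no requests"]
    counts = categorize(status_codes)
    warnings = []
    ratio_5xx = counts["5xx"] / total
    ratio_2xx = counts["2xx"] / total
    if ratio_5xx >= MAX_5XX_RATIO:
        warnings.append(f"5xx ratio {ratio_5xx:.2%} is not below {MAX_5XX_RATIO:.2%}")
    if ratio_2xx < MIN_2XX_RATIO:
        warnings.append(f"2xx ratio {ratio_2xx:.2%} is below {MIN_2XX_RATIO:.2%}")
    return warnings
```

In the early version described here, a caller would simply print these warnings before asking the engineer whether to continue; nothing was blocked automatically yet.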
We already had a test suite, but it was only run by engineers on their development machines. Code reviewers had to take the author’s word that the tests passed, and we didn’t know the test status of the resulting commit in master. So we set up Jenkins to run tests on new commits in master and report the result to Sauron. Sauron would keep track of the latest commit which had passed tests, and when doing a rollout this commit would be suggested instead of the latest commit.
Facebook uses Phabricator (http://phabricator.org/) for code reviews, and has a Continuous Integration system called Sandcastle which integrates well with Phabricator. We got Sandcastle to run tests whenever a diff was created or updated, and report the result to the diff.
To get to automation, we first had to lay some groundwork. We added states to rollouts (running, done, error), and made the script warn if the previous rollout was not in “done” state. We added an abort button in the UI which would change the state to “abort,” and got the script to check the state occasionally and react. We also added full commit tracking; instead of Sauron only knowing the latest commit which had passed tests, it now had a record for every commit in master, and knew the test status of each specific one.
Then we automated the remaining decisions which humans needed to make. The first decision was which commit to roll out. The initial algorithm was to always select a commit which had passed tests and select as few commits as possible – never more than three. If every commit had passed tests, it would select one new commit each time, and there could be at most two consecutive commits with non-passing test runs. The second decision was whether the rollout was successful. If more than 1% of hosts failed to deploy, it would be considered failed.
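A minimal sketch of these two decisions, under one reading of the rules above: look at no more than the three oldest undeployed commits, roll out to the first one that passed tests, and treat a rollout as failed when more than 1% of hosts fail. The data shapes and names are assumptions for illustration, not the actual Sauron logic.

```python
MAX_COMMITS_PER_ROLLOUT = 3  # "never more than three"


def select_commit(pending):
    """Pick the commit to roll out to.

    pending is a list of (commit_id, passed_tests) tuples, oldest first,
    covering commits in master that have not been deployed yet.
    Returns the commit id to deploy up to, or None if the automation
    should stop and wait for a human.
    """
    for commit_id, passed_tests in pending[:MAX_COMMITS_PER_ROLLOUT]:
        if passed_tests:
            # Earliest passing commit: deploys as few commits as possible.
            return commit_id
    return None


def rollout_succeeded(total_hosts, failed_hosts):
    """A rollout fails when more than 1% of hosts failed to deploy."""
    # Integer arithmetic avoids floating-point edge cases at exactly 1%.
    return failed_hosts * 100 <= total_hosts
```

With every commit passing, select_commit returns the oldest pending commit, so each rollout carries exactly one new commit. Two failing commits followed by a passing one are deployed together, and three failing commits in a row halt the automation.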
At this point, doing a rollout when things were normal simply consisted of answering “yes” a couple times (accepting the suggested commit, starting the canary, and continuing to the full deploy). So we allowed these questions to be answered automatically, and got Jenkins to run the rollout script. At first engineers implementing this only enabled Jenkins when they were at their desks supervising, until they didn’t need to supervise it anymore.
While we were doing continuous deployment at this stage, it wasn’t completely smooth yet. There were a couple kinks to work out.
Engineers would often land diffs that broke tests, which would cause all subsequent master commits to also fail tests, and thereby prevent anything from being deployed. The oncall would need to notice this, revert the offending commit, wait for tests to pass on the revert, and then manually roll out the entire backlog before the automation could continue. This defeated one of the main advantages of continuous deployment, which was deploying very few commits per rollout. The problem here was that tests were slow and unreliable. We made various optimizations to get tests running in five minutes instead of 12-15 minutes, and fixed the test infrastructure problems that were causing them to be unreliable.
Despite these improvements, we still regularly have a backlog of changes that need to be deployed. The most common cause is canary failures (both false and true positives), but there are occasionally other breakages. When the cause was resolved, the automation would pick up and deploy one commit at a time, so it would take a while to clear the backlog and cause significant delays for newly landed diffs. The oncall would usually step in and deploy the entire backlog at once, which defeats one of the main advantages of continuous deployments.
To improve this, we implemented backlog handling in the commit selection logic, which made the automation deploy multiple commits when there was a backlog. The algorithm is based on setting a time goal in which to deploy every commit (30min). For each commit in the queue, it calculates the time remaining to meet the goal, the number of rollouts that can be done in that time (using a hard-coded value), and the number of commits that would have to be deployed per rollout. It takes the maximum commits/rollout value, but caps it at three. This allows us to do as many rollouts as possible, while getting every commit out in a reasonable time.
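One way to read this calculation is sketched below. The 30-minute goal and the cap of three come from the text; the 10-minute rollout duration stands in for the hard-coded value and is purely an assumption, as are the function name and the use of landing timestamps in seconds.

```python
import math

DEPLOY_GOAL_SECONDS = 30 * 60  # time goal from the post
ROLLOUT_SECONDS = 10 * 60      # assumed stand-in for the hard-coded rollout time
MAX_COMMITS_PER_ROLLOUT = 3    # cap from the post


def commits_for_next_rollout(landed_times, now):
    """Decide how many queued commits the next rollout should include.

    landed_times holds the landing timestamp (in seconds) of each queued
    commit, oldest first. For the commit at queue position p, every commit
    up to p must ship within the rollouts that still fit before its goal.
    """
    needed = 1
    for position, landed in enumerate(landed_times, start=1):
        remaining = DEPLOY_GOAL_SECONDS - (now - landed)
        # At least one rollout is always assumed, even if the goal is already missed.
        rollouts_left = max(1, int(remaining // ROLLOUT_SECONDS))
        needed = max(needed, math.ceil(position / rollouts_left))
    return min(needed, MAX_COMMITS_PER_ROLLOUT, len(landed_times))
```

With a single fresh commit this yields one commit per rollout, as in the normal case. With a deep, old backlog the per-rollout count rises until it hits the cap of three.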
One specific cause of backlogs was that rollouts got slower as the size of our infrastructure increased. We got to a point where ssh-agent pegged an entire core authenticating SSH connections, and the fab master process also pegged a core managing all the tasks. The solution here was to switch to Facebook’s distributed SSH system.
So what do you need in order to implement something similar to what we’ve done? There are a few key principles which make our system work well, which you can apply to your own.
Tests: The test suite needs to be fast. It needs to have decent coverage, but doesn’t necessarily have to be perfect. The tests need to be run often: during code review, before landing (and ideally blocking lands on failure), and after landing.
Canary: You need an automated canary to prevent the really bad commits from being deployed to the entire fleet. It doesn’t need to be perfect, however – even a very simple set of stats and thresholds can be good enough.
Automate the normal case: You don’t have to automate every situation; just automate the known, normal situations. If anything is abnormal, make the automation stop and let humans step in.
Make people comfortable: I think that a big barrier to this kind of automation is when people feel disconnected and out of control. To address this, the system needs to provide good visibility into what it has done, is doing, and (preferably) is about to do. It also needs good stop mechanisms.
Expect bad deploys: Bad changes will get out, but that’s okay. You just need to detect this quickly, and be able to roll back quickly.
This is something that many other companies can implement. Continuous deployment systems don’t need to be complex. Start with something simple that focuses on the principles above, and refine it from there.
This system is working well for us at the moment, but there are further challenges which we will face and improvements we’d like to make.
Keeping it fast: Instagram is growing quickly, and as such the commit rate is going to continue to increase. We need to keep the rollout fast in order to maintain very few commits per rollout. One possibility here is to split the rollout into multiple stages and implement pipelining.
Adding canarying: As the commit rate increases, canary failures and backlogs are going to impact more and more developers. We want to stop more bad commits from getting into master and blocking deployment, and so we’re implementing canarying as part of Landcastle. After tests pass, Landcastle will test the change with production traffic and fail the land if it doesn’t pass the canary thresholds.
More data: We want to improve the canary’s detection capabilities, so we’re planning to collect and check more stats like per-view function response codes. We’re also experimenting with collecting stats from a set of control machines and comparing those to the canary stats, instead of the current static thresholds.
Improving detection: It would be good to reduce the impact of bad commits not being caught by the canary. Instead of testing on one machine and then deploying to the entire fleet, we could add more stages in between (a cluster or a region), checking metrics at that level before continuing.
The Send/Receive API must not be used to send marketing or promotional messages, such as sale or product announcements, brand advertising, branded content, newsletters or the up-selling or cross-selling of products or services.
As previously announced, Facebook today opened its Instant Articles format to all developers. Using Instant Articles, publishers can show Facebook mobile users a fast-loading and mostly distraction-free view of their posts while still also showing them a limited amount of their own ads (or use Facebook’s Audience Network to monetize their content) and measure pageviews through tools…
We are excited to officially introduce Ignition, the next-generation machine provisioning utility from CoreOS. Those who follow along closely may have noticed that Ignition has been a part of CoreOS for the better part of a year. The project has had time to be tested and to mature, and the features and user interface are in a place where we are happy to encourage daily, heavy duty use. It’s also a good time to welcome the community to test and help improve Ignition. Before diving into the details, let’s understand why we built Ignition in the first place.
Why build Ignition?
It’s been more than three years since we originally shipped our flavor of Cloud-Init. This tool runs after boot, reading a user-provided configuration and fetching platform-specific metadata, then applying that configuration to the running machine. The vast majority of our users employ this utility to configure CoreOS machines, performing a variety of tasks from adding SSH keys to starting an etcd cluster.
CoreOS cloudinit uses a language extremely similar to YAML to describe its configuration, the cloud-config. One of the few differences from standard YAML is that the cloud-config requires a comment on the first line indicating that the remainder of the document is a cloud-config. This, coupled with YAML’s type-inferencing nature, makes it difficult to programmatically generate or manipulate a cloud-config. Many languages have a YAML-parsing library, but very few have a library for parsing the CoreOS cloud-config variant. This results in bugs like the leading comment being stripped, thereby invalidating the config, or octal file permissions being converted to decimal, yielding unintended and potentially invalid file modes on disk. Sometimes, a configuration’s “off” can even be rewritten as “false”, and many failures cascade from there.
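The octal pitfall is easy to reproduce outside of any YAML library. The following standard-library sketch (an illustration only, not coreos-cloudinit code) shows how the digits 0644 describe very different permission bits depending on whether a tool reads them as octal or as decimal.

```python
def permission_bits(mode):
    """Render an integer file mode as four octal digits, e.g. 420 -> "0644"."""
    return format(mode, "04o")


# What the author of the config means by 0644: rw-r--r--.
intended_mode = int("0644", 8)

# What a tool ends up with if it treats the same digits as a decimal number.
misread_mode = int("0644", 10)

print(intended_mode, permission_bits(intended_mode))
print(misread_mode, permission_bits(misread_mode))
```

The misread value sets the sticky bit and drops the owner's read permission, which is the kind of unintended mode described above.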
It is difficult for cloudinit to properly configure early system services, especially networking, because it runs later in the boot process. Most system services are already up and running by the time cloudinit starts, requiring a reconfiguration that is more prone to errors.
Ignition is a new machine provisioning utility designed to solve the same problems as coreos-cloudinit while adding a host of new capabilities with clearer semantics. At the most basic level, Ignition is a tool for manipulating disks during early boot. This includes partitioning disks, formatting partitions, writing files, and configuring users.
On the first boot, Ignition reads its configuration from a designated source, like a remote URL, a network or cloud provider metadata service, or a hypervisor bridge, and applies that configuration to the machine. Rather than a YAML variant, Ignition uses pure JSON for its configuration format. JSON’s type system eliminates the problems that arose with cloud-config YAML, and makes it very easy to write tools to generate new configs or manipulate existing ones.
Ignition runs very early in the boot process. In fact, it runs before systemd is invoked by the kernel as PID 1, before any permanent storage is mounted. This allows Ignition to configure foundation system services and features, like reformatting the root filesystem, creating a RAID array, or configuring the network to use bonded interfaces. Running before systemd helps ensure that all services established by Ignition are known to and can be managed by systemd when it subsequently starts. This allows systemd to do what it does best: concurrently start services as quickly as possible. This results in a simpler startup, a faster startup, and the ability to accurately inspect systemd’s unit dependency graphs.
On first boot, Ignition fetches and evaluates a configuration describing the desired state of the machine. Much like cloud-config, configuration is provided in a file, with the cloud-config simply being replaced by an Ignition config. Unlike coreos-cloudinit, Ignition only runs on the first boot of a machine. It is safe for system administrators or other software tools to make subsequent modifications to the system without them being continually overwritten by the original configuration at every reboot.
Ignition’s configuration format is versioned to allow clear deprecation points for possible changes to the format in the future. The Ignition v0.4.0 executable accepts configurations in the v2.0.0 Ignition config format. It will also accept v1 configs and transparently convert them.
The simplest Ignition configuration is shown below:
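As a rough sketch, assuming a v2.0.0 config needs nothing more than an "ignition" object carrying the format version (field names as recalled from the published v2.0.0 specification, so check them against it), such a config can be generated with a few lines of Python:

```python
import json


def minimal_ignition_config(version="2.0.0"):
    """Build a config that declares only the format version, no desired state."""
    return {"ignition": {"version": version}}


print(json.dumps(minimal_ignition_config(), indent=2))
```

Because the format is plain JSON, any language with a JSON library can produce or rewrite it in the same way.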
This configuration isn’t very useful since it doesn’t actually declare any desired state. The next example is more interesting:
This configuration will download and unzip a file containing the TPM validation hash values for CoreOS v1010.1.0, using the SHA-512 hash provided in the config to verify the contents of the validation hashes file itself.
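A config of that shape could be assembled along these lines. Treat this as a hedged sketch: the field names (storage.files, contents.source, compression, verification.hash) follow the v2.0.0 specification as best recalled, while the path, URL and sample bytes are hypothetical placeholders rather than the values from the original example. Which bytes Ignition hashes (compressed or decompressed) is not stated here and should be checked against the specification.

```python
import hashlib
import json


def verified_file_entry(path, source_url, expected_bytes):
    """Describe a gzip-compressed remote file plus the SHA-512 it must match."""
    digest = hashlib.sha512(expected_bytes).hexdigest()
    return {
        "filesystem": "root",
        "path": path,
        "contents": {
            "source": source_url,
            "compression": "gzip",
            "verification": {"hash": f"sha512-{digest}"},
        },
    }


config = {
    "ignition": {"version": "2.0.0"},
    "storage": {
        "files": [
            verified_file_entry(
                "/etc/pcrs/example.pcrs",               # hypothetical path
                "https://example.com/example.pcrs.gz",  # hypothetical URL
                b"placeholder contents",                # stand-in for the real file
            )
        ]
    },
}

print(json.dumps(config, indent=2))
```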
Ignition v0.4.0 is included in CoreOS starting in version 1010.1.0, currently in the Alpha channel, and is supported on bare metal installations and PXE-booted environments. While older editions of CoreOS shipped with previous Ignition versions, Ignition v0.4.0 is the first to use the v2.0.0 configuration format.
PXE & bare metal
Ignition is particularly helpful on bare metal for network configuration and advanced disk operations because it runs even before PID 1 starts. A URL pointing to an Ignition configuration can be provided with the coreos.config.url kernel command-line parameter. The CoreOS install script also accepts Ignition configuration files, installing them into the OEM partition.
For PXE booting, supply the coreos.first_boot=1 parameter to trigger Ignition. This forces Ignition to run in PXE scenarios where a GPT disk GUID may not exist.
If you’re feeling adventurous, Ignition has experimental support for Microsoft Azure, Amazon EC2, and VMware, with many more providers coming soon. The project is under active development, so expect the supported platforms list to expand in the coming months. Configuration is provided via the same mechanism as CoreOS Cloud-Init, so it’s easy to switch to Ignition. Instead of providing your machine with a cloud-config, you can provide an Ignition config.