A few weeks ago we upgraded a lot of the core infrastructure in our New York (okay, it’s really in New Jersey now – but don’t tell anyone) data center. We love being open with everything we do (including infrastructure), and really consider it one of the best job perks we have. So here’s how and why we upgrade a data center. First, take a moment to look at what Stack Overflow started as. It’s 5 years later and hardware has come a long way.
Up until 2 months ago, we hadn’t replaced any servers since upgrading from the original Stack Overflow web stack. There just hasn’t been a need since we first moved to the New York data center (Oct 23rd, 2010 – over 4 years ago). We’re always reorganizing, tuning, checking allocations, and generally optimizing code and infrastructure wherever we can. We mostly do this for page load performance; the lower CPU and memory usage on the web tier is usually a (welcomed) side-effect.
So what happened? We had a meetup. All of the Stack Exchange engineering staff got together at our Denver office in October last year and we made some decisions. One of those decisions was what to do about infrastructure hardware from a lifecycle and financial standpoint. We decided that from here on out: hardware is good for approximately 4 years. After that we will: retire it, replace it, or make an exception and extend the warranty on it. This lets us simplify a great many things from a management perspective, for example: we limit ourselves to 2 generations of servers at any given time and we aren’t in the warranty renewal business outside of the occasional exception. We can order all hardware up front with the simple goal of 4 years of life and with a 4 year warranty.
Why 4 years? It seems pretty arbitrary. Spoiler alert: it is. We were running on 4 year old hardware at the time and it had worked out pretty well so far. Seriously, that’s it: do what works for you. Most companies depreciate hardware across 3 years, making questions like “what do we do with the old servers?” much easier. For those unfamiliar, depreciated hardware effectively means “off the books.” We could re-purpose it outside production, donate it, let employees go nuts, etc. If you haven’t heard, we raised a little money recently. While the final amounts weren’t decided when we were at the company meetup in Denver, we did know that we wanted to make 2015 an investment year and beef up hardware for the next 4.
Over the next 2 months, we evaluated what was over 4 years old and what was getting close. It turns out almost all of our Dell 11th generation hardware (including the web tier) fits these criteria – so it made a lot of sense to replace the entire generation and eliminate a slew of management-specific issues with it. Managing just 12th and 13th generation hardware and software makes life a lot easier – and the 12th generation hardware will be mostly software upgradable to near equivalency to 13th gen around April 2015.
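For readers who want to run the same kind of audit, here is a minimal sketch of the "over 4 years old or getting close" check. It is an illustration, not the tooling we used: the 180-day "getting close" window, the inventory entries, and the evaluation date are all assumptions; only the 4 year lifetime and the May 2010 purchase date come from this post.

```python
from datetime import date

LIFETIME_YEARS = 4        # the lifecycle policy described above
CLOSE_WINDOW_DAYS = 180   # assumption: what counts as "getting close"


def lifecycle_status(purchased: date, today: date) -> str:
    """Classify one piece of hardware as 'past', 'close', or 'ok'."""
    limit_days = LIFETIME_YEARS * 365
    age_days = (today - purchased).days
    if age_days >= limit_days:
        return "past"
    if age_days >= limit_days - CLOSE_WINDOW_DAYS:
        return "close"
    return "ok"


# Hypothetical inventory: only the May 2010 date is taken from the post.
inventory = {
    "11th gen web server": date(2010, 5, 1),
    "example server nearing 4 years": date(2011, 3, 1),
    "example 12th gen server": date(2012, 9, 1),
}

today = date(2014, 12, 1)  # assumption: roughly when the evaluation happened
for name, purchased in inventory.items():
    print(f"{name}: {lifecycle_status(purchased, today)}")
```

Anything that comes back as "past" or "close" is a candidate for one of the three outcomes in the policy: retire, replace, or make an exception and extend the warranty.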
What Got Love
In those 2 months, we realized we were running on a lot of old servers (most of them from May 2010):
- Web Tier (11 servers)
- Redis Servers (2 servers)
- Second SQL Cluster (3 servers – 1 in Oregon)
- File Server
- Utility Server
- VM Servers (5 servers)
- Tag Engine Servers (2 servers)
- SQL Log Database
We also could use some more space, so let’s add on:
- An additional SAN
- An additional DAS for the backup server
I know what you’re thinking: “Nick, how do you go about making such a fancy pile of servers?” I’m glad you asked. Here’s how a Stack Exchange infrastructure upgrade happens in the live data center. We chose not to failover for this upgrade; instead we used multiple points of redundancy in the live data center to upgrade it while all traffic was flowing from there.
Day -3 (Thursday, Jan 22nd): Our upgrade plan was finished (this took about 1.5 days total), including everything we could think of. We had limited time on-site, so to make the best of that we itemized and planned all the upgrades in advance (most of them successfully, read on). You can read the full upgrade plan here.
Day 0 (Sunday, Jan 25th): The on-site sysadmins for this upgrade were George Beech, Greg Bray, and Nick Craver (note: several remote sysadmins were heavily involved in this upgrade as well: Geoff Dalgas online from Corvallis, OR, Shane Madden, online from Denver, CO, and Tom Limoncelli who helped a ton with the planning online from New Jersey). Shortly before flying in we got some unsettling news about the weather. We packed our snow gear and headed to New York.
Day 1 (Monday, Jan 26th): While our office is in lower Manhattan, the data center is now located in Jersey City, across the Hudson River. We knew there was a lot to get done in the time we had allotted in New York, weather or not. The thought was that if we skipped Monday we likely couldn’t get back to the data center Tuesday if the PATH (mass transit to New Jersey) shut down. This did end up happening. The team decision was: go time. We got overnight gear then headed to the data center. Here’s what was there waiting to be installed:
Yeah, we were pretty excited too. Before we got started with the server upgrade though, we first had to fix a critical issue with the redis servers supporting the launching-in-24-hours Targeted Job Ads. These machines were originally for Cassandra (we broke that data store), then Elasticsearch (broke that too), and eventually redis. Curious? Jason Punyon and Kevin Montrose have an excellent blog series on Providence, you can find Punyon’s post on what broke with each data store here.
The data drives we ordered for these then-redundant systems were the Samsung 840 Pro drives which turned out to have a critical firmware bug. This was causing our server-to-server copies across dual 10Gb network connections to top out around 12MB/s (ouch). Given the hundreds of gigs of memory in these redis instances, that doesn’t really work. So we needed to upgrade the firmware on these drives to restore performance. This needed to be online, letting the RAID 10 arrays rebuild as we went. Since you can’t really upgrade firmware over most USB interfaces, we tore apart this poor, poor little desktop to do our bidding:
Once that was kicked off, it ran in parallel with other work (since RAID 10s with data take tens of minutes to rebuild, even with SSDs). The end result was much improved 100-200MB/s file copies (we’ll see what new bottleneck we’re hitting soon – still lots of tuning to do). Now the fun begins. In Rack C (we have high respect for our racks, they get title casing), we wanted to move from the existing SFP+ 10Gb connectivity combined with 1Gb uplinks for everything else to a single dual 10Gb BASE-T (RJ45 connector) copper solution. This is for a few reasons: The SFP+ cabling we use is called twinaxial which is harder to work with in cable arms, has unpredictable girth when ordered, and can’t easily be gotten natively in the network daughter cards for these Dell servers. The SFP+ FEXes also don’t allow us to connect any 1Gb BASE-T items that we may have (though that doesn’t apply in this rack, it does when making it a standard across all racks like with our load balancers). So here’s what we started with in Rack C:
What we want to end up with is:
The plan was to simplify network config, cabling, overall variety, and save 4U in the process. Here’s what the top of the rack looked like when we started: …and the middle (cable management covers already off):
Let’s get started. First, we wanted the KVMs online while working so we, ummm, “temporarily relocated” them: Now that those are out of the way, it’s time to drop the existing SFP+ FEXes down as low as we could to install the new 10Gb BASE-T FEXes in their final home up top: The nature of how the Nexus Fabric Extenders work allows us to allocate between 1 and 8 uplinks to each FEX. This means we can unplug 4 ports from each FEX without any network interruption, take the 4 we find dead in the VPC (virtual port channel) out of the VPC and assign them to the new FEX. So we go from 8/0 to 4/4 to 0/8 overall as we move from old to new through the upgrade. Here’s the middle step of that process: With the new network in place, we can start replacing some servers. We yanked several old servers already, one we virtualized and 2 we didn’t need anymore. Combine this with evacuating our NY-VM01 & NY-VM02 hosts and we’ve made 5U of space through the rack. On top of NY-VM01&02 was 1 of the 1Gb FEXes and 1U of cable management. Luckily for us, everything is plugged into both FEXes and we could rip one out early. This means we could spin up the new VM infrastructure faster than we had planned. Yep, we’re already changing THE PLAN™. That’s how it goes. What are we replacing those aging VM servers with? I’m glad you asked. These bad boys:
There are 2 of these Dell PowerEdge FX2s Blade Chassis each with 2 FC630 blades. Each blade has dual Intel E5-2698v3 18-core processors and 768GB of RAM (and that’s only half capacity). Each chassis has 80Gbps of uplink capacity as well via the dual 4x 10Gb IOA modules. Here they are installed:
The split with 2 half-full chassis gives us 2 things: capacity to expand by double, and avoiding any single points of failure with the VM hosts. That was easy, right? Well, what we didn’t plan on was the network portion of the day: it turns out those IO Aggregators in the back are pretty much full switches with 4 external 10Gbps ports and 8 internal 10Gbps (2 per blade) ports each. Once we figured out what they could and couldn’t do, we got the bonding in place and the new hosts spun up.
It’s important to note here it wasn’t any of the guys in the data center spinning up this VM architecture after the network was live. We’re set up so that Shane Madden was able to do all this remotely. Once he had the new NY-VM01 & 02 online (now blades), we migrated all VMs over to those 2 hosts and were able to rip out the old NY-VM03-05 servers to make more room. As we ripped things out, Shane was able to spin up the last 2 blades and bring our new beasts fully online. The net result of this upgrade was substantially more CPU and memory (from 528GB to 3,072GB overall) as well as network connectivity. The old hosts each had 4x 1Gb (trunk) for most access and 2x 10Gb for iSCSI access to the SAN. The new blade hosts each have 20Gb of trunk access to all networks to split as they need.
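As a quick sanity check on the VM host numbers above, the totals can be recomputed from the per-blade and per-chassis figures. Every constant below is taken from this post; the core count is physical cores only and ignores Hyper-Threading.

```python
# Figures from the post.
CHASSIS = 2
BLADES_PER_CHASSIS = 2
RAM_PER_BLADE_GB = 768
SOCKETS_PER_BLADE = 2
CORES_PER_SOCKET = 18          # Intel E5-2698v3
IOA_PER_CHASSIS = 2
EXTERNAL_PORTS_PER_IOA = 4
PORT_SPEED_GBPS = 10
OLD_TOTAL_RAM_GB = 528

blades = CHASSIS * BLADES_PER_CHASSIS
total_ram_gb = blades * RAM_PER_BLADE_GB
total_cores = blades * SOCKETS_PER_BLADE * CORES_PER_SOCKET
uplink_per_chassis_gbps = IOA_PER_CHASSIS * EXTERNAL_PORTS_PER_IOA * PORT_SPEED_GBPS
ram_growth = total_ram_gb / OLD_TOTAL_RAM_GB

print(f"blades:             {blades}")
print(f"total RAM:          {total_ram_gb} GB")
print(f"physical cores:     {total_cores}")
print(f"uplink per chassis: {uplink_per_chassis_gbps} Gbps")
print(f"RAM growth:         {ram_growth:.1f}x")
```

The RAM total matches the 3,072GB quoted above, and the uplink figure matches the 80Gbps per chassis, which works out to a bit under 6 times the memory of the old hosts.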
But we’re not done yet. Here’s the new EqualLogic PS6210 SAN that went in below (that’s NY-LOGSQL01 further below going in as well):
Our old SAN was a PS6200 with 24x 900GB 10k drives and SFP+ only. This is a newer 10Gb BASE-T 24x 1.2TB 10k version with more speed, more space, and the ability to go active/active with the existing SAN. Along with the SAN we also installed this new NY-LOGSQL01 server (replacing an aging Dell R510 never designed to be a SQL server – it was purchased as a NAS):
The additional space freed by the other VM hosts let us install a new file and utility server:
Of note here: the NY-UTIL02 utility server has a lot of drive bays so we could install 8x Samsung 840 Pros in a RAID 0 in order to restore and test the SQL backups we make every night. It’s RAID 0 for space because all of the data is literally loaded from scratch nightly – there’s nothing to lose. An important lesson we learned last year was that the 840 Pros do not have capacitors in there and power loss will cause data loss if they’re active since they have a bit of DIMM for write cache on board. Given this info – we opted to stick some Intel S3700 800GB drives we had from the production SQL server upgrades into our NY-DEVSQL01 box and move the less resilient 840s to this restore server where it really doesn’t matter.
Okay, let’s snap back to blizzard reality. At this point mass transit had shut down and all hotels in (blizzard) walking distance were booked solid. Though we started checking accommodations as soon as we arrived on site, we had no luck finding any hotels. Though the blizzard did far less than predicted, it was still stout enough to shut everything down. So, we decided to go as late as we could and get ahead of schedule. To be clear: this was the decision of the guys on site, not management. At Stack Exchange employees are trusted to get things done, however they best perceive how to do that. It’s something we really love about this job.
If life hands you lemons, ignore those silly lemons and go install shiny new hardware instead.
This is where we have to give a shout out to our data center, QTS. These guys had the office manager help us find any hotel we could, set out extra cots for us to crash on, and even ordered extra pizza and drinks so we didn’t go starving. This was all without asking – they are always fantastic and we’d recommend them to anyone looking for hosting in a heartbeat.
After getting all the VMs spun up, the SAN configured, and some additional wiring ripped out, we ended around 9:30am Tuesday morning when mass transit was spinning back up. To wrap up the long night, this was the near-heart attack we ended on, a machine locking up at: Turns out a power supply was just too awesome and needed replacing. The BIOS did successfully upgrade with the defective power supply removed and we got a replacement in before the week was done. Note: we ordered a new one rather than RMA the old one (which we did later). We keep a spare power supply for each wattage level in the data center, and try to use as few different levels as possible.
Day 2 (Tuesday, Jan 27th): We got some sleep, got some food, and arrived on site around 8pm. Starting the web tier (a rolling build out) was kicked off first:
While we rotated 3 servers at a time out for rebuilds on the new hardware, we also upgraded some existing R620 servers from 4x 1Gb network daughter cards to 2x 10Gb + 2x 1Gb NDCs. Here’s what that looks like for NY-SERVICE03:
The web tier rebuilding gave us a chance to clean up some cabling. Remember those 2 SFP+ FEXes? They’re almost empty: The last 2 items were the old SAN and that aging R510 NAS/SQL server. This is where the first major hiccup in our plan occurred. We planned to install a 3rd PCIe card in the backup server pictured here: We knew it was a Dell R620 10 bay chassis that has 3 half-height PCIe cards. We knew it had a SAS controller for the existing DAS and a PCIe card for the SFP+ 10Gb connections it has (it’s in the network rack with the cores in which all 96 ports are 10Gb SFP+). Oh hey look at that, it’s hooked to a tape drive which required another SAS controller we forgot about. Crap. Okay, these things happen. New plan.
We had extra 10Gb network daughter cards (NDCs) on hand, so we decided to upgrade the NDC in the backup server, remove the SFP+ PCIe card, and replace it with the new 12Gb SAS controller. We also forgot to bring the half-height mounting bracket for the new card and had to get creative with some metal snips (edit: turns out it never came with one – we feel slightly less dumb about this now). So how do we plug that new 10Gb BASE-T card into the network core? We can’t. At least not at 10Gb. Those 2 last SFP+ items in Rack C also need a home – so we decided to make a trade. The whole backup setup (including the new MD1400 DAS) just loves its new Rack C home:
Then we could finally remove those SFP+ FEXes, bring those KVMs back to sanity, and clean things up in Rack C:
See? There was a plan all along. The last item to go in Rack C for the day is NY-GIT02, our new GitLab and TeamCity server:
Note: we used to run TeamCity on Windows on NY-WEB11. Geoff Dalgas threw out the idea during the upgrade of moving it to hardware: the NY-GIT02 box. Because they are such intertwined dependencies (for which both have an offsite backup), combining them actually made sense. It gave TeamCity more power, even faster disk access (it does a lot of XML file…stuff), and made the web tier more homogeneous all at the same time. It also made the downtime of NY-WEB11 (which was imminent) have far less impact. This made lots of sense, so we changed THE PLAN™ and went with it. More specifically, Dalgas went with it and set it all up, remotely from Oregon. While this was happening, Greg was fighting with a DSC install hang regarding git on our web tier: Wow that’s a lot of red, I wonder who’s winning. And that’s Dalgas in a hangout on my laptop, hi Dalgas! Since the web tier builds were a relatively new process fighting us, we took the time to address some of the recent cabling changes. The KVMs were installed hastily not long before this because we knew a re-cable was coming. In Rack A for example we moved the top 10Gb FEX up a U to expand the cable management to 2U and added 1U of management space between the KVMs. Here’s that process:
Since we had to re-cable from the 1Gb middle FEXes in Rack A & B (all 4 being removed) to the 10Gb Top-of-Rack FEXes, we moved a few things around. The CloudFlare load balancers down below the web tier at the bottom moved up to spots freed by the recently virtualized DNS servers to join the other 2 public load balancers. The removal of the 1Gb FEXes as part of our all-10Gb overhaul meant that the middle of Racks A & B had much more space available, here’s the before and after:
After 2 batches of web servers, cable cleanup, and network gear removal, we called it quits around 8:30am to go grab some rest. Things were moving well and we only had half the web tier, cabling, and a few other servers left to replace.
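The rolling build out is easy to reason about as batches. Here is a minimal sketch of the arithmetic, assuming the 11 web servers are named NY-WEB01 through NY-WEB11 (only NY-WEB11 is named in this post, the rest are a guess at the pattern) and that batches are taken in simple list order, which may not match the order we actually used.

```python
from typing import Iterator, List


def rolling_batches(servers: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of servers to pull out of rotation together."""
    for start in range(0, len(servers), batch_size):
        yield servers[start:start + batch_size]


# Assumption: sequential names; the post only confirms 11 servers and NY-WEB11.
web_tier = [f"NY-WEB{n:02d}" for n in range(1, 12)]
batches = list(rolling_batches(web_tier, batch_size=3))

done = 0
for number, batch in enumerate(batches, start=1):
    still_serving = len(web_tier) - len(batch)
    done += len(batch)
    print(f"batch {number}: rebuild {batch}, {still_serving} servers keep serving, "
          f"{len(web_tier) - done} left to rebuild afterwards")
```

With 3 servers per batch, 11 servers take 4 batches (3, 3, 3, 2), and after 2 batches there are 5 left, which lines up with the "half the web tier" remaining at the end of the night.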
Day 3 (Wednesday, Jan 28th): We were back in the data center just before 5pm, set up and ready to go. The last non-web servers to be replaced were the redis and “service” (tag engine, elasticsearch indexing, etc.) boxes:
We have 3 tag engine boxes (purely for reload stalls and optimal concurrency, not load) and 2 redis servers in the New York data center. One of the tag engine boxes was a more-recent R620 (this one got the 10Gb upgrade earlier) and wasn’t replaced. That left NY-SERVICE04, NY-SERVICE05, NY-REDIS01 and NY-REDIS02. On the service boxes the process was pretty easy, though we did learn something interesting: if you put both of the drives from the RAID 10 OS array in an R610 into the new R630…it boots all the way into Windows 2012 without any issues. This threw us for a moment because we didn’t remember building it in the last 3 minutes. Rebuild is simple: lay down Windows 2012 R2 via our image + updates + DSC, then install the jobs they do. StackServer (from a sysadmin standpoint) is simply a Windows service – and our TeamCity build handles the install and such, it’s literally just a parameter flag. These boxes also run a small IIS instance for internal services but that’s also a simple build out. The last task they do is host a DFS share, which we wanted to trim down and simplify the topology of, so we left them disabled as DFS targets and tackled that the following week – we had NY-SERVICE03 in rotation for the shares and could do such work entirely remotely. For redis we always have a slave chain happening. It looks like this: This means we can do an upgrade/failover/upgrade without interrupting service at all. After all those buildouts, here’s the super fancy new web tier installed:
To get an idea of the scale of hardware difference, the old web tier was Dell R610s with dual Intel E5640 processors and 48GB of RAM (upgraded over the years). The new web tier has dual Intel 2687W v3 processors and 64GB of DDR4 memory. We re-used the same dual Intel 320 300GB SSDs for the OS RAID 1. If you’re curious about specs on all this hardware – the next post we’ll do is a detailed writeup of our current infrastructure including exact specs.
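Going back to the redis boxes for a moment: the upgrade/failover/upgrade sequence mentioned above is worth spelling out, since the ordering is what keeps the service up. This is a toy simulation, not our tooling and not the redis API. It assumes a simple two-node pair with NY-REDIS01 starting as master, and the `RedisNode` class and function names are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RedisNode:
    name: str
    role: str              # "master" or "replica"
    upgraded: bool = False
    online: bool = True


def serving_masters(nodes: List[RedisNode]) -> List[RedisNode]:
    """Masters that are currently online and able to take traffic."""
    return [n for n in nodes if n.role == "master" and n.online]


def upgrade_replica(node: RedisNode, nodes: List[RedisNode], log: List[str]) -> None:
    """Take a replica offline, swap the hardware, and bring it back."""
    if node.role != "replica":
        raise ValueError(f"{node.name} is the master; fail over first")
    node.online = False
    if len(serving_masters(nodes)) != 1:
        raise RuntimeError("no master left serving traffic")
    node.upgraded = True
    node.online = True
    log.append(f"upgrade {node.name}")


def fail_over(old_master: RedisNode, new_master: RedisNode, log: List[str]) -> None:
    """Promote the (already upgraded) replica and demote the old master."""
    if not new_master.online:
        raise RuntimeError(f"{new_master.name} is offline, cannot promote")
    new_master.role = "master"
    old_master.role = "replica"
    log.append(f"failover {old_master.name} -> {new_master.name}")


def rolling_upgrade(master: RedisNode, replica: RedisNode) -> List[str]:
    nodes = [master, replica]
    log: List[str] = []
    upgrade_replica(replica, nodes, log)   # 1. upgrade the replica
    fail_over(master, replica, log)        # 2. swap roles
    upgrade_replica(master, nodes, log)    # 3. upgrade the old master
    return log


# Assumption: which box was master at the start is not stated in the post.
redis01 = RedisNode("NY-REDIS01", "master")
redis02 = RedisNode("NY-REDIS02", "replica")
steps = rolling_upgrade(redis01, redis02)
print("\n".join(steps))
```

In a real deployment the promotion step would map to something like redis's `SLAVEOF NO ONE` on the new master plus repointing clients; the sketch only shows why a master is serving at every step.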
Day 4 (Thursday, Jan 29th): I picked a fight with the cluster rack, D. Much of the day was spent giving the cluster rack a makeover now that we had most of the cables we needed in. When it was first racked, the pieces we needed hadn’t arrived by go time. It turns out we were still short a few cat and power cables as you’ll see in the photos, but we were able to get 98% of the way there.
It took a while to whip this rack into shape because we added cable arms where they were missing, replaced most of the cabling, and are fairly particular about the way we do things. For instance: how do you know things are plugged into the right port and where the other end of the cable goes? Labels. Lots and lots of labels. We label both ends of every cable and every server on both sides. It adds a bit of time now, but it saves both time and mistakes later.
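If you want to generate labels rather than type them, the "both ends of every cable" rule is simple to sketch. The label format, the `Endpoint` type, and the device and port names below are all hypothetical; our actual label format isn't described in this post.

```python
from typing import NamedTuple, Tuple


class Endpoint(NamedTuple):
    device: str
    port: str


def cable_labels(a: Endpoint, b: Endpoint) -> Tuple[str, str]:
    """Return one label per cable end: local port first, then where it goes."""
    def fmt(local: Endpoint, remote: Endpoint) -> str:
        return f"{local.device}:{local.port} -> {remote.device}:{remote.port}"
    return fmt(a, b), fmt(b, a)


# Hypothetical devices and ports, purely for illustration.
server_end, switch_end = cable_labels(
    Endpoint("NY-WEB01", "NIC1"),
    Endpoint("RACK-A-FEX1", "Eth1/1"),
)
print(server_end)   # label for the end plugged into the server
print(switch_end)   # label for the end plugged into the FEX
```

Each label answers both questions from above: which port this end belongs in, and where the other end of the cable goes.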
Here’s what the racks ended up looking like when we ran out of time this trip:
It’s not perfect since we ran out of several cables of the proper color and length. We have ordered those and George will be tidying the last few bits up.
I know what you’re thinking. We don’t think that’s enough server eye-candy either.
What Went Wrong
- We’d be downright lying to say everything went smoothly. Hardware upgrades of this magnitude never do. Expect it. Plan for it. Allow time for it.
- Remember when we upgraded to those new database servers in 2010 and the performance wasn’t what we expected? Yeah, that. There is a bug we’re currently helping Dell track down in their 1.0.4/1.1.4 BIOS for these systems that seems to not respect whatever performance setting you have. With Windows, a custom performance profile disabling C-States to stay at max performance works. In CentOS 7, it does not – but disabling the Intel PState driver does. We have even ordered and just racked a minimal R630 to test and debug issues like this as well as test our deployment from bare metal to constantly improve our build automation. Whatever is at fault with these settings not being respected, our goal is to get that vendor to release an update addressing the issue so that others don’t get the same nasty surprise.
- We ran into an issue deploying our web tier where DSC got locked up in an endless cycle: it thought it needed a reboot to finish, but came up in the same state after every reboot. We also hit issues with our deployment of the git client on those machines.
- We learned that accidentally sticking a server with nothing but naked IIS into rotation is really bad. Sorry about that one.
- We learned that if you move the drives from a RAID array from an R610 to an R630 and don’t catch the PXE boot prompt, the server will happily boot all the way into the OS.
- We learned the good and the bad of the Dell FX2 IOA architecture and how they are self-contained switches.
- We learned the CMC (management) ports on the FX2 chassis are effectively a switch. We knew they were suitable for daisy chaining purposes. However, we promptly forgot this, plugged them both in for redundancy and created a switching loop that reset Spanning Tree on our management network. Oops.
- We learned the one guy on Twitter who was OCD about the one upside down box was right. It was a pain to flip that web server over after opening it upside down and removing some critical box supports.
- We didn’t mention this was a charge-only cable. Wow, that one riled Twitter up. We appreciate the #infosec concern though!
- We drastically underestimated how much Twitter loves naked servers. It’s okay, we do too.
- We learned that Dell MD1400 (13g and 12Gb/s) DAS (direct attached storage) arrays do not support hooking into their 12g servers like our R620 backup server. We’re working with them on resolving this issue.
- We learned Dell hardware diagnostics don’t even check the power supply, even when the server has an orange light on the front complaining about it.
- We learned that blizzards are cold, the wind is colder, and sleep is optional.
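The CMC switching loop in the list above is the kind of mistake a cabling plan can catch before anyone plugs anything in. Here is a minimal sketch using union-find to spot the first link that would close a loop. The device names and the exact topology are assumptions (the post only says both CMC ports were plugged in for redundancy), and real networks may intentionally have redundant links that Spanning Tree or port channels handle.

```python
from typing import Dict, Iterable, Optional, Tuple

Link = Tuple[str, str]


def first_loop_link(links: Iterable[Link]) -> Optional[Link]:
    """Return the first link that connects two already-connected devices."""
    parent: Dict[str, str] = {}

    def find(device: str) -> str:
        parent.setdefault(device, device)
        while parent[device] != device:
            parent[device] = parent[parent[device]]   # path halving
            device = parent[device]
        return device

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return (a, b)          # this cable closes a loop
        parent[root_a] = root_b    # otherwise merge the two groups
    return None


# Hypothetical names. Daisy chain as intended: switch -> CMC A -> CMC B.
daisy_chain = [("mgmt-switch", "fx2-a-cmc"), ("fx2-a-cmc", "fx2-b-cmc")]
# What "plugging them both in" might look like: two cables to the same CMC.
both_plugged_in = [("mgmt-switch", "fx2-a-cmc"), ("mgmt-switch", "fx2-a-cmc")]

print("daisy chain loop:", first_loop_link(daisy_chain))
print("both plugged in loop:", first_loop_link(both_plugged_in))
```

The daisy chain comes back clean, while the second cable to the same CMC is flagged, because the CMC ports behave like a switch rather than like two independent NICs.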
Here’s what the average render time for question pages looks like, if you look really closely you can guess when the upgrade happened: The decrease on question render times (from approx 30-35ms to 10-15ms) is only part of the fun. The next post in this series will detail many of the other drastic performance increases we’ve seen as the result of our upgrades. Stay tuned for a lot of real world payoffs we’ll share in the coming weeks.
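To put the render time drop in percentage terms, here is a small worked calculation using only the approximate ranges quoted above (30-35ms before, 10-15ms after). The midpoint comparison is just one reasonable way to summarize two ranges, not a measured average.

```python
def pct_reduction(before_ms: float, after_ms: float) -> float:
    """Percentage decrease going from before_ms to after_ms."""
    return (before_ms - after_ms) / before_ms * 100


before_range = (30.0, 35.0)   # approximate ms, from the post
after_range = (10.0, 15.0)    # approximate ms, from the post

smallest = pct_reduction(before_range[0], after_range[1])   # 30ms -> 15ms
largest = pct_reduction(before_range[1], after_range[0])    # 35ms -> 10ms
midpoint = pct_reduction(sum(before_range) / 2, sum(after_range) / 2)

print(f"smallest reduction: {smallest:.1f}%")
print(f"largest reduction:  {largest:.1f}%")
print(f"midpoint reduction: {midpoint:.1f}%")
```

So depending on where in each range you land, question pages render somewhere between half and roughly 70% faster than before.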
Does all this sound like fun?
To us, it is fun. If you feel the same way, come do it with us. We are specifically looking for sysadmins preferably with data center experience to come help out in New York. We are currently hiring 2 positions:
If you’re curious at all, please ask us questions here, on Twitter, or wherever you’re most comfortable. Really. We love Q&A.