Hardware Fault Detection and Planning

The Flurry Operations Team manages 4,428 spinning disks across 1,107 servers with a team of 6 awesome operations engineers. Since Flurry is a startup, we don't have an on-site tech team to handle all of the hardware issues that happen at the datacenter. As we've grown from 400 disks to over 4,000, we've improved our process for handling servers experiencing disk hardware failures.

The most common hardware alerts we receive come from Self-Monitoring, Analysis and Reporting Technology, better known as SMART. This tool tests and monitors disks, detecting and reporting potential disk issues in the hope of warning the admin before a disastrous disk failure appears. (Find out more about SMART errors.)
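If you haven't used it, smartctl (the command-line companion to the smartd daemon) is the quickest way to inspect a drive by hand; the device name below is just an example:

    # Overall health self-assessment for one drive (device name is an example).
    smartctl -H /dev/sda

    # Vendor attributes: reallocated sectors, read error rates, temperature, etc.
    smartctl -A /dev/sda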

Flurry lives and dies by the data stored in our Hadoop and HBase cluster, so when a disk issue happens we need to respond quickly and decisively to prevent data loss and/or performance impacts. We generally receive critical and non-critical alerts on around 1% of active cluster disks each month (roughly 44 disks at today's scale), not all of which need immediate attention.

Monitoring 400 disks: SMART error detected on host

When we were a humble cluster of 100 servers, it was easy to log into a box, gracefully stop the Hadoop services, unmount the drive, and start the Hadoop daemons back up. Most of the actionable alerts we saw were High Read Errors or Uncorrectable Sectors, which tend to indicate a potentially fatal disk issue.
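For a single box, that manual routine looked roughly like the following. The daemon names match a Hadoop 1.x-era cluster; the paths and mount point are illustrative, not our exact layout:

    # Gracefully stop the Hadoop daemons on the affected server (paths illustrative).
    /usr/lib/hadoop/bin/hadoop-daemon.sh stop tasktracker
    /usr/lib/hadoop/bin/hadoop-daemon.sh stop datanode

    # Unmount the failing drive and drop its mount point from dfs.data.dir
    # (and mapred.local.dir) in the Hadoop config.
    umount /data/3

    # Bring the daemons back with one less data directory.
    /usr/lib/hadoop/bin/hadoop-daemon.sh start datanode
    /usr/lib/hadoop/bin/hadoop-daemon.sh start tasktracker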

Hadoop tends to let the filesystem handle marking sectors as bad and/or unreadable, forcing the read to occur on another replica. Hadoop is pretty good about moving the block mapping, but doing so can increase read latency and generally degrades the overall performance of the cluster. Did I already mention that we don't like performance degradation?
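After pulling a disk or node, a quick fsck shows whether HDFS has caught up on re-replication; this is a standard Hadoop command and can be run from any cluster node:

    # Summarize block health; under-replicated, corrupt, and missing block counts
    # should trend back to zero as HDFS re-replicates onto healthy servers.
    hadoop fsck / | tail -n 20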

Monitoring 1200 disks: Find those bad drives fast

Our first datacenter expansion in 2011 consisted of a buildout of an additional 200 servers. Each server has 4 x 1TB drives used in the cluster, so that's 800 disks in this buildout. During pre-production diagnostic tests, we had a 0.5% failure rate on the new disks.

Once the initial issues were resolved, we deployed the servers into production. The 200 new servers averaged 2.67 disks going bad per month for the period before our next datacenter buildout. Our original 400 disks started reporting 2 new issues a month. That's a jump from 0.3% to 0.6% disk issues a month, possibly because those older disks were degrading with age.

Monitoring 2400 disks: Throwing more servers in the mix

Four months later, we needed to double our infrastructure to handle all of the new data we were processing for our analytics. This time we added 1,200 new disks to the cluster and ran into a similar level of issues. The pre-production diagnostic tests only shook out 0.02% of the new disks as bad.

Around this time, our monthly SMART failure rate started climbing from under 1% to 1.3%. This was also during the Holiday App Boom, as seen here and here. We were spending too much time ferrying drives back and forth between the office and the datacenter, and started questioning our diagnostics, how urgently we responded to SMART errors, and our steps for replacing a drive.

Our servers have temperature sensors that we started monitoring manually, and we noticed the new servers were running around 84°F at idle, a range where we tend to see higher hardware failure rates. We started graphing the temperatures and watched them rise to 89°F as we brought the servers into production. There was a lot we needed to do and not enough resources to do it, other than bugging the NOCs to come up with strategies to bring us down to 77°F.
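The checks themselves were nothing fancy; something along these lines, fed into our graphing, is enough to spot a trend (sensor names and output formats vary by drive vendor and chassis):

    # Drive temperature as reported by SMART (attribute name varies by vendor).
    smartctl -A /dev/sda | grep -i temperature

    # Chassis and ambient temperatures via the BMC, if the server has one.
    ipmitool sdr type temperature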

Monitoring 4800 disks: Finally some cooling relief

10 months later, we once again doubled our infrastructure and migrated all of our servers into a new space where we now benefit from more efficient datacenter cooling. Where we had averaged 77°F, we were now running between 60°F and 65°F. Guess what happened to our average monthly SMART errors. They have dropped to 0.4% since the move. There may be several factors at play here:

  1. higher temperatures certainly seemed to contribute to higher failure rates
  2. the first 2400 disks had already been through their burn-in period
  3. the load across the disks had lightened after such a large expansion

Monitoring N disks: Scripts and processes to automate our busy lives

We've also improved our process for taking servers with SMART alerts out of rotation by creating a script that smartd calls when there's an issue. To automate this, we allow the smartd check to pull servers out of production at will. With a small modification to smartd.conf, the check calls our shell script, which runs a few checks and then gracefully stops the Hadoop and HBase processes, spreading the data on the affected disks out to healthy servers. The script also verifies that the number of servers we take down does not exceed our Hadoop HDFS replication factor, so we don't risk removing every replica of a block at once. Once all is complete, the script notifies the Operations team of the tasks performed or skipped. We have open sourced this script on our Github account here, so you can fork and use it yourself.
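As a rough illustration of the mechanism (this is not the script we open sourced; the paths, addresses, and down-node bookkeeping are assumptions), smartd's -M exec directive hands alert details to a script through SMARTD_* environment variables:

    # smartd.conf entry: monitor the drive, and run our handler instead of mailing.
    /dev/sda -a -m ops@example.com -M exec /usr/local/bin/handle-smart-alert.sh

    #!/bin/bash
    # handle-smart-alert.sh -- hypothetical sketch of a smartd exec handler.
    DEVICE="${SMARTD_DEVICE:-unknown}"       # set by smartd for exec handlers
    MESSAGE="${SMARTD_MESSAGE:-no details}"

    REPLICATION_FACTOR=3                     # assumed HDFS dfs.replication setting

    # Don't exceed the replication factor: count nodes already pulled from service
    # (hypothetical shared state; the real script needs cluster-wide bookkeeping).
    DOWN=$(cat /var/run/cluster/nodes_down.count 2>/dev/null || echo 0)
    if [ "$DOWN" -ge "$REPLICATION_FACTOR" ]; then
        echo "$DOWN nodes already down; skipping $DEVICE" \
            | mail -s "SMART alert on $(hostname): action skipped" ops@example.com
        exit 0
    fi

    # Gracefully stop HBase and Hadoop daemons so blocks re-replicate elsewhere.
    /usr/lib/hbase/bin/hbase-daemon.sh stop regionserver
    /usr/lib/hadoop/bin/hadoop-daemon.sh stop tasktracker
    /usr/lib/hadoop/bin/hadoop-daemon.sh stop datanode

    echo "$MESSAGE" | mail -s "SMART alert on $(hostname) $DEVICE: node pulled" ops@example.com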

What about the physical disks? Instead of having an engineer go down and pull disks from each server, we plan on utilizing our Remote Hands service to perform that task for us, so we can focus on handling the broader-reaching issues. There were times when we batched up disk collection and engineers would carry 30 drives back to the office (walking barefoot, uphill both ways).

As always, we're trying to do things more efficiently.  A few improvements we have in the plans include:

  1. Having the script unmount the bad disk and bring the server back into production.
  2. Having the script email Remote Hands with the server, disk, location, and issue so they can swap the bad drive (a rough sketch of steps 1 and 2 follows this list).
  3. Once the disk is swapped, mounting the new drive and returning the server to production.
  4. Adapting the script to handle other hardware alerts and issues (network cards, cables, memory, mainboard).
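Here's roughly what steps 1 and 2 could look like as a follow-up to the alert handler above; the inventory file, addresses, and paths are all assumptions for illustration:

    #!/bin/bash
    # request-drive-swap.sh -- hypothetical sketch of the planned follow-up automation.
    BAD_DISK="$1"                       # e.g. /dev/sdc, passed in by the alert handler
    MOUNT_POINT=$(awk -v d="$BAD_DISK" '$1 == d {print $2}' /proc/mounts)

    # Take the failing disk out of service so the server can rejoin the cluster without it.
    [ -n "$MOUNT_POINT" ] && umount "$MOUNT_POINT"

    # Rack/slot location kept in a local inventory file (hypothetical).
    LOCATION=$(cat /etc/flurry/rack-location 2>/dev/null || echo "unknown")

    # Hand the swap off to Remote Hands with everything they need.
    {
        echo "Server:   $(hostname)"
        echo "Disk:     $BAD_DISK"
        echo "Location: $LOCATION"
        echo "Issue:    SMART failure reported by smartd"
    } | mail -s "Drive swap request: $(hostname) $BAD_DISK" remotehands@example.com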

We've learned from those grueling earlier days, and continue to make hardware management a priority. With a small team managing a large cluster, it's important to automate simple, repetitive tasks and to make use of the services you are already paying for. I, for one, welcome our new robotic overlords.

Scaling @ Flurry: Measure Twice, Plan Once

Working in web operations can be quite exciting when you get paged in the middle of the night to debug a problem, then burn away the night hours formulating a solution on the fly using only the resources you have at hand. It’s thrilling to make that sort of heroic save, but the business flows much more smoothly when you can prepare for the problems before they even exist. At Flurry we watch our systems closely, find patterns in traffic and systems’ behavioral changes, and generally put solutions in place before we encounter any capacity-related problems.

A good example of this sort of capacity planning took place late last year. During the Christmas season Flurry usually sees an increase in traffic as people fire up their new mobile phones, so we knew in advance of December 2011 that we'd need to accommodate a jump in traffic. But how much? Fortunately, we had retained bandwidth and session data from earlier years, so we were able to estimate our maximum bandwidth draw based on the increases we experienced previously, and our estimate was within 5% of our actual maximum throughput. There are probably some variables we still need to account for in our model, but we were able to get sufficient resources in place to make it through the holiday season without any serious problems. Having worked in online retail, I can tell you that not getting paged over a holiday weekend is something wonderful.
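The projection itself is simple arithmetic; here's the shape of it with made-up numbers (our real baselines and growth ratios aren't something we publish):

    # Hypothetical projection: apply the December-peak-to-November-average ratio
    # observed in prior years to this year's November baseline.
    nov_avg_gbps=4.0     # hypothetical November average outbound bandwidth
    dec_ratio=2.0        # hypothetical Dec-peak / Nov-average ratio from earlier years
    echo "projected December peak: $(echo "$nov_avg_gbps * $dec_ratio" | bc) Gbps"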

(Graph: doubling of outbound bandwidth from Nov to Dec 2011, Dec 24th in yellow)

Outside of annual events, we also keep a constant eye on our daily peak traffic rates. For example, we watch bandwidth to ensure we aren't hitting limits on any networking choke points, and requests-per-second is a valuable metric since it helps us determine scaling requirements like overall disk capacity (each request taking up a small amount of space in our data stores) and the total processing throughput our systems can achieve. That throughput includes elements like networking devices (switches, routers, load balancers) as well as the CPU time spent handling and analyzing the incoming data.
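As a concrete (and entirely made-up) example of how requests-per-second feeds into disk capacity planning:

    # Back-of-the-envelope storage growth from a request rate; the numbers are
    # made up, only the arithmetic is the point.
    req_per_sec=20000      # hypothetical average incoming requests per second
    bytes_per_req=512      # hypothetical stored bytes per request
    replication=3          # HDFS replication factor
    echo "daily growth: $(echo "scale=2; $req_per_sec * $bytes_per_req * 86400 * $replication / 1024^4" | bc) TiB"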

Other metrics we watch on a regular basis include disk utilization, number of incoming requests, latency for various operations (time to complete service requests, but also metrics like software deployment speed, mapreduce job runtimes, etc.), CPU, memory, and bandwidth utilization, as well as service-level metrics for MySQL, nginx, and haproxy, plus Flurry-specific application metrics. Taken altogether, these measurements allow us to gauge the overall health and trends of our systems' traffic patterns, from which we can extrapolate when certain pieces of the system might be nearing capacity limits.

Changes in traffic aren't the only source of capacity flux, though. Because the Flurry software is a continually changing creature, Flurry operations regularly coordinates with the development teams about upcoming releases that might increase database connections, add processing time per incoming request, or have other similar effects. Working closely with our developers also allows Flurry to achieve operational improvements like bandwidth offloading by integrating content delivery networks into our data delivery mechanisms.

One area I think we could improve is in understanding what our various services are capable of when going all-out. We've done some one-off load tests to get an idea of upper limits for requests per second per server, and have used that research as a baseline for rough determinations of hardware requirements, but the changing nature of our software makes that a moving target. More automated capacity tests would be handy both for planning hardware requirements and for surfacing performance-impacting software changes.

Overall, though, I think we’ve done pretty well. Successful capacity planning doesn’t prevent every problem, but paying attention to our significant metrics allows us to grow our infrastructure to meet upcoming demand, saving our urgent-problem-solving resources for the truly emergent behavior instead of scrambling to keep up with predictable capacity issues.