Hardware Fault Detection and Planning

The Flurry Operations Team handles 4,428 spinning disks across 1,107 servers with a team of 6 awesome operations engineers. Since Flurry is a startup, we don't have an on-site tech team to handle all of the hardware issues that happen at the datacenter. As we've grown from 400 disks to over 4,000, we've improved our process for handling servers with disk hardware failures.

The most common hardware alerts we receive come from Self-Monitoring, Analysis and Reporting Technology, better known as SMART. This tool tests and monitors disks, detecting and reporting potential disk issues in the hope of warning the admin before a disastrous failure appears. (Find out more about SMART errors).
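If you want to poke at what SMART is reporting on a given drive by hand, smartctl (from the same smartmontools package as smartd) will do it; the device path below is just an example.

```
# Quick pass/fail health check on one drive (device path is an example)
smartctl -H /dev/sda

# Full attribute dump: reallocated sectors, pending sectors, read error rates, etc.
smartctl -a /dev/sda
```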

Flurry lives and dies by the data stored in our Hadoop and HBase cluster, so when a disk issue happens we need to respond quickly and decisively to prevent data loss and/or performance impacts. We generally receive critical and non-critical alerts on around 1% of active cluster disks each month, and not all of them need immediate attention.

Monitoring 400 disks: SMART error detected on host

When we were a humble cluster of 100 servers, it was easy to log into a box, gracefully stop the Hadoop services, unmount the drive, and start the Hadoop daemons back up. Most of the actionable alerts we saw were High Read Errors or Uncorrectable Sectors, which tend to indicate a potentially fatal disk issue.
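For reference, the manual drill looked roughly like the following. The daemon names, mount point, and config keys are illustrative; the exact commands depend on your Hadoop version and disk layout.

```
# Roughly the manual routine on a node with a failing disk (names/paths illustrative)
hadoop-daemon.sh stop tasktracker
hadoop-daemon.sh stop datanode

umount /data/3    # the mount point backing the failing disk
# ...remove the disk from dfs.data.dir / mapred.local.dir before restarting...

hadoop-daemon.sh start datanode
hadoop-daemon.sh start tasktracker
```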

Hadoop tends to let the filesystem handle marking sectors as bad and/or unreadable, forcing the read to occur on another replica. Hadoop is pretty good about moving the block mapping, but it can increase read latency and generally degrades the overall performance of the cluster. Did I already mention that we don't like performance degradation?
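When we suspect a disk is dropping blocks, a quick fsck run from any cluster node shows whether HDFS is keeping up with re-replication; the exact output varies by Hadoop version.

```
# Cluster-wide block health summary: corrupt, missing, and under-replicated block counts
hadoop fsck /
```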

Monitoring 1200 disks: Find those bad drives fast

Our first datacenter expansion in 2011 consisted of a buildout of an additional 200 servers. Each server has 4 x 1TB drives used in the cluster, so that's 800 disks in this buildout. During pre-production diagnostic tests, we had a 0.5% failure rate on the new disks.

Once the initial issues were resolved, we deployed the servers into production. The 200 new servers averaged 2.67 disks going bad per month for the period before our next datacenter buildout, and our original 400 disks started reporting 2 new issues a month. That's a jump from 0.3% to 0.6% disk issues a month, possibly a sign of the drives degrading with age.

Monitoring 2400 disks: Throwing more servers in the mix

Four months later, we needed to double our infrastructure to handle all of the new data we were processing for our analytics. This time we were adding 1,200 new disks to the cluster, with about the same rate of issues as before. The pre-production diagnostic tests only shook out 0.02% of the new disks as bad.

Around this time, our monthly SMART failure rate increased from under 1% to 1.3% of drives. This was also during the Holiday App Boom, as seen here and here. We were spending too much time ferrying drives back and forth between the office and the datacenter, and we started questioning our diagnostics, the urgency of our response to SMART errors, and our steps for replacing a drive.

Our servers have temperature sensors that we started monitoring manually, and we noticed the new servers were running around 84°F at idle, a range where we tend to see higher hardware failure rates. We started graphing the temperatures and saw them climb to 89°F as we brought servers into production. There was a lot we needed to do and not enough resources to do it, other than bug the NOC staff to come up with strategies to bring us down to 77°F.
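Before we had proper graphs, ad hoc checks like these were enough to spot the trend; sensor names and availability vary by drive model and chassis.

```
# Per-drive temperature from SMART attributes (usually attribute 194)
smartctl -A /dev/sda | grep -i temperature

# Chassis and ambient sensors via the BMC, if IPMI is available
ipmitool sdr type temperature
```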

Monitoring 4800 disks: Finally some cooling relief

10 months later, we once again doubled our infrastructure and migrated all of our servers into a new space where we now benefit from more efficient datacenter cooling. Where we had averaged 77°F, we were now running between 60°F and 65°F. Guess what happened to our average monthly SMART errors: it has dropped to 0.4% since the move. There may be several factors at play here:

  1. higher temperatures definitely seemed to contribute to higher failure rates
  2. we had a burn in time for those first 2400 disks
  3. the load across the disks had lightened after such a large expansion

Monitoring N disks: Scripts and processes to automate our busy lives

We've also improved our process for taking out servers with SMART alerts by creating a script that smartd calls when there's an issue. In order to automate this, we've allowed the smartd check to take out servers at will. By modifying the smartd.conf file a bit, we have the check call our shell script, which runs a few sanity checks and gracefully stops the Hadoop and HBase processes, spreading the data from the affected disks out to healthy servers. We've also included a check to make sure the number of servers we take down does not exceed our HDFS replication factor, which limits the risk of taking multiple replicas of the same blocks offline at once. Once all is complete, the script notifies the Operations team of the tasks performed or skipped. We have open sourced this script on our Github account here so you can fork and use it yourself.
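To give a feel for how the pieces fit together, here's a minimal sketch of the wiring, not the published script itself; the email address, script path, and the two helper functions are placeholders.

```
# /etc/smartd.conf -- monitor everything and hand SMART failures to our hook
DEVICESCAN -a -m ops@example.com -M exec /usr/local/bin/smart-decommission.sh
```

```bash
#!/bin/bash
# Minimal sketch of the hook; the real, fuller version is the one on our Github.
# smartd exports SMARTD_DEVICE and SMARTD_MESSAGE to scripts run via -M exec.

MAX_DOWN=2   # stay below the HDFS replication factor (3 for us)

down=$(servers_currently_down)   # placeholder helper: count nodes already out of service
if [ "$down" -ge "$MAX_DOWN" ]; then
  notify_ops "Skipping ${SMARTD_DEVICE}: too many nodes already out"   # placeholder helper
  exit 0
fi

# Gracefully drain the node so HDFS re-replicates its blocks onto healthy servers
hbase-daemon.sh stop regionserver
hadoop-daemon.sh stop tasktracker
hadoop-daemon.sh stop datanode

notify_ops "Node drained for ${SMARTD_DEVICE}: ${SMARTD_MESSAGE}"
```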

What about the physical disks? Instead of having an engineer go down and pull disks from each server, we plan on utilizing our Remote Hands service to perform that task for us, so we can focus on handling the broader-reaching issues. There were times when we batched up disk collection and engineers would carry 30 drives back to the office (walking barefoot, uphill both ways).

As always, we're trying to do things more efficiently. A few improvements we have planned (a rough sketch of steps 1 and 3 follows the list):

  1. Have the script unmount the bad disk and bring the server back into production.
  2. Have the script email Remote Hands with the server, disk, location, and issue so they can swap the bad drive.
  3. Once the disk is swapped, mount the new drive and return the server to production.
  4. Adapt the script to handle other hardware alerts/issues (network cards, cables, memory, mainboard).
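As a rough illustration of what steps 1 and 3 might look like, with device names, mount points, and filesystem choice as assumptions rather than finished tooling:

```bash
# Step 1: drop the failed disk and rejoin the cluster on the remaining drives
umount /data/3
# ...remove /data/3 from dfs.data.dir / mapred.local.dir before restarting...
hadoop-daemon.sh start datanode
hadoop-daemon.sh start tasktracker

# Step 3: after Remote Hands swaps the drive, format and mount the replacement,
# restore the config entries, and bounce the daemons
mkfs -t ext3 /dev/sdd1
mount /dev/sdd1 /data/3
```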

We've learned from those grueling earlier days, and we continue to make hardware management a priority. With a small team managing a large cluster, it's important to lean on automation for simple, repetitive tasks and to make use of the services you're already paying for. I, for one, welcome our new robotic overlords.