Scaling @ Flurry: Measure Twice, Plan Once

Working in web operations can be quite exciting when you get paged in the middle of the night to debug a problem, then burn away the night hours formulating a solution on the fly using only the resources you have at hand. It’s thrilling to make that sort of heroic save, but the business flows much more smoothly when you can prepare for the problems before they even exist. At Flurry we watch our systems closely, find patterns in traffic and systems’ behavioral changes, and generally put solutions in place before we encounter any capacity-related problems.

A good example of this sort of capacity planning took place late last year. During the Christmas season Flurry usually sees an increase in traffic as people fire up their new mobile phones, so we knew in advance of December 2011 that we’d need to accommodate a jump in traffic, but how much? Fortunately, we had retained bandwidth and session data from earlier years, so we were able to estimate our maximum bandwidth draw based on the increases we had experienced previously, and that estimate came within 5% of our actual maximum throughput. There are probably some variables we still need to account for in our model, but we were able to get sufficient resources in place to make it through the holiday season without any serious problems. Having worked in online retail, I can tell you that not getting paged over a holiday weekend is something wonderful.

[Chart: doubling of outbound bandwidth from November to December 2011, with December 24th highlighted in yellow]
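To give a rough idea of how that kind of estimate can be made, here is a minimal sketch that projects a holiday peak by applying the growth ratio observed in prior years to the current pre-holiday baseline. The numbers and the helper function are purely hypothetical, not the actual model we used.

```python
# Hypothetical sketch: project a holiday bandwidth peak from prior years' growth.
# All figures below are made up for illustration; they are not real measurements.

def seasonal_growth_ratio(history):
    """Average ratio of December peak to November peak across prior years."""
    ratios = [dec_peak / nov_peak for nov_peak, dec_peak in history]
    return sum(ratios) / len(ratios)

# (november_peak_mbps, december_peak_mbps) pairs from earlier years
history = [(310.0, 540.0), (450.0, 820.0)]

current_november_peak_mbps = 620.0
projected_peak = current_november_peak_mbps * seasonal_growth_ratio(history)

print(f"Projected December peak: {projected_peak:.0f} Mbps")
```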

Outside of annual events, we also keep a constant eye on our daily peak traffic rates. For example, we watch bandwidth to ensure we aren’t hitting limits at any networking choke points, and requests-per-second is a valuable metric since it helps us determine scaling requirements like overall disk capacity (each request takes up a small amount of space in our data stores) and the overall processing throughput our systems can achieve. That throughput depends on elements like networking devices (switches, routers, load balancers) as well as the CPU time spent handling and analyzing the incoming data.
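As a back-of-the-envelope illustration of how requests-per-second feeds into disk planning, the snippet below converts a sustained request rate and an assumed per-request footprint into monthly storage growth. The constants are illustrative assumptions, not measured values.

```python
# Hypothetical back-of-the-envelope: translate request rate into storage growth.
# The rate, per-request size, and replication factor are illustrative assumptions.

requests_per_second = 20_000          # sustained average across the day
bytes_per_request = 512               # assumed on-disk footprint per request
replication_factor = 3                # copies kept in the data store

seconds_per_month = 60 * 60 * 24 * 30
monthly_bytes = (requests_per_second * bytes_per_request *
                 replication_factor * seconds_per_month)

print(f"Approx. storage growth: {monthly_bytes / 1024**4:.1f} TiB per month")
```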

Other metrics we watch on a regular basis include disk utilization, the number of incoming requests, latency for various operations (time to complete service requests, but also things like software deployment speed and MapReduce job runtimes), CPU, memory, and bandwidth utilization, and application-level metrics for services like MySQL, nginx, and haproxy as well as for Flurry’s own software. Taken together, these measurements allow us to gauge the overall health and trends of our systems’ traffic patterns, from which we can then extrapolate when certain pieces of the system might be nearing capacity limits.
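One simple way to turn those trends into an early warning, sketched below with made-up numbers, is a least-squares fit over recent disk-utilization samples to estimate how many days remain before a volume fills up. This is only an illustration of the idea, not the tooling we actually run.

```python
# Hypothetical sketch: fit a line to recent disk-utilization samples and
# estimate days remaining until the volume is full. Sample data is made up.

def days_until_full(samples, capacity_pct=100.0):
    """samples: list of (day_number, used_pct); returns None if usage is flat or falling."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in samples) /
             sum((x - mean_x) ** 2 for x, _ in samples))
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    last_day = samples[-1][0]
    return (capacity_pct - intercept) / slope - last_day

usage = [(0, 61.0), (7, 64.5), (14, 67.8), (21, 71.2)]   # weekly samples, % used
print(f"Roughly {days_until_full(usage):.0f} days until full at the current rate")
```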

Changes in traffic aren’t the only source of capacity flux, though. Because the Flurry software is a continually evolving creature, Flurry operations regularly coordinates with the development teams on upcoming releases that might increase database connections, add processing time to each incoming request, or have other similar effects. Working closely with our developers also allows Flurry to achieve operational improvements like bandwidth offloading by integrating content delivery networks into our data delivery mechanisms.

One area I think we could improve is in understanding what our various services are capable of when going all-out. We’ve done some one-off load tests to get an idea of upper limits for requests per second per server, and have used that research as a baseline for rough determinations of hardware requirements, but the changing nature of our software makes that a moving target. More automated capacity tests would be handy both for planning hardware requirements and for surfacing performance-impacting software changes.
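For the curious, something as small as the sketch below is enough for a one-off ceiling check: hammer a single endpoint from a pool of threads for a fixed window and report the requests-per-second achieved. The URL, thread count, and duration are placeholders, and a test like this should only ever point at a non-production host.

```python
# Hypothetical one-off load-test sketch: measure requests/sec against one endpoint.
# URL, concurrency, and duration are placeholders; point this at a test host only.
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

TARGET = "http://test-host.example.com/healthcheck"
THREADS = 32
DURATION_SECONDS = 30

def worker(deadline):
    count = 0
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(TARGET, timeout=5) as resp:
                resp.read()
                count += 1
        except OSError:
            pass  # count only successful requests
    return count

deadline = time.time() + DURATION_SECONDS
with ThreadPoolExecutor(max_workers=THREADS) as pool:
    totals = list(pool.map(worker, [deadline] * THREADS))

print(f"~{sum(totals) / DURATION_SECONDS:.0f} requests/sec sustained")
```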

Overall, though, I think we’ve done pretty well. Successful capacity planning doesn’t prevent every problem, but paying attention to our significant metrics allows us to grow our infrastructure to meet upcoming demand, saving our urgent-problem-solving resources for the truly emergent behavior instead of scrambling to keep up with predictable capacity issues.