Flurry Tech Blog

APNS Test Harness (February 14, 2013)

As Dr. Evil found out after spending a few decades in a cryogenic freezer, perspectives on quantity change very quickly. With the explosion of mobile apps and the growing number of services that handle humongous amounts of data, we need to redefine what 'Big Data' means. This goes for developers who want to write the Next Great App, as well as those who want to write the services to support it.

One of the best ways of connecting with mobile application customers is via remote push notifications, available on both Android (GCM) and iOS (APNS). These services allow developers to send messages directly to their users, which is an extremely valuable tool for announcing updates, sending personalized messages and engaging directly with the audience. Google and Apple each run a service that developers send push messages to, and that service in turn delivers the messages to users' devices.

The Problem

It's not unusual for apps these days to have on the order of millions and even tens of millions of users. Testing a push notification backend can be extremely hard. Sure, you can set up a few test devices to receive messages, but how do you know how long your backend would take to send a large number of push messages to the Google and Apple servers? Also, you don't want to risk being throttled or completely blacklisted by either of those services by sending a ton of test data their way.

The Solution

The solution is to write a mock server that’s able to simulate the Google/Apple Push Notification Service, and a test client to hit it with requests.

The Google service is completely REST based, so a script that executes a lot of curls in a loop can do that job. It's also fairly straightforward to write a simple HTTP server that accepts POSTs and sends back either a 200 or some sort of error code.
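For illustration, a minimal Node.js sketch of such a mock HTTP endpoint might look like this (the port and the response body are assumptions, not GCM's actual contract):

// mock_http_push.js -- minimal sketch of an HTTP mock that accepts POSTs
var http = require('http');

http.createServer(function (req, res) {
  if (req.method !== 'POST') {
    res.writeHead(405);
    return res.end();
  }
  var body = '';
  req.on('data', function (chunk) { body += chunk; });
  req.on('end', function () {
    // Always answer 200 here; flip this to a 4xx/5xx to exercise error handling.
    res.writeHead(200, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify({ received: body.length }));
  });
}).listen(8081, function () {
  console.log('mock HTTP push endpoint listening on 8081');
});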

Apple's APNS, however, presents a few challenges: it uses a binary protocol, documented in Apple's push notification guide. Since the protocol is binary, you need to write some sort of mock client that can generate messages in the specified format. At Flurry, we've been playing around with Node.js to build scalable services, and it's fairly straightforward to set up an APNS test client and server with it.

The Client

The client.connect() method connects to the mock server and generates test data. Node's Buffer object is used to pack the data into a binary format to send over the wire. Although the protocol lets you specify a token size, the token size is fixed at 64 bytes in the client, since that's typically the token length that gets generated; in our experience, the real APNS server actually rejects tokens that aren't exactly 64 bytes long. The generateToken() method generates random 64-byte hex tokens. The payload is simple and static in this example. The createBuffer() method can generate data in both the simple and enhanced formats.
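A minimal sketch of such a client follows; the host, port, payload contents and the number of messages sent are assumptions, and the method names mirror the ones described above:

// client.js -- sketch of the mock-APNS test client
var net = require('net');

var HOST = '127.0.0.1';   // assumption: where the mock server runs
var PORT = 2195;          // assumption: real APNS gateway port, reused for the mock
var TOKEN_LENGTH = 64;    // 64-byte hex tokens, as described above

// generateToken(): random 64-character hex token, sent as raw bytes
function generateToken() {
  var hex = '0123456789abcdef', token = '';
  for (var i = 0; i < TOKEN_LENGTH; i++) {
    token += hex[Math.floor(Math.random() * hex.length)];
  }
  return token;
}

// createBuffer(): packs one notification in the simple (command 0) or
// enhanced (command 1) binary format using a Node Buffer
function createBuffer(token, payload, enhanced) {
  var payloadLength = Buffer.byteLength(payload, 'utf8');
  var size = (enhanced ? 9 : 1) + 2 + TOKEN_LENGTH + 2 + payloadLength;
  var buf = Buffer.alloc(size);
  var offset = 0;

  if (enhanced) {
    buf.writeUInt8(1, offset); offset += 1;                                        // command
    buf.writeUInt32BE(Date.now() & 0x7fffffff, offset); offset += 4;               // identifier
    buf.writeUInt32BE(Math.floor(Date.now() / 1000) + 3600, offset); offset += 4;  // expiry
  } else {
    buf.writeUInt8(0, offset); offset += 1;                                        // command
  }
  buf.writeUInt16BE(TOKEN_LENGTH, offset); offset += 2;        // token length
  buf.write(token, offset, 'ascii'); offset += TOKEN_LENGTH;   // token bytes
  buf.writeUInt16BE(payloadLength, offset); offset += 2;       // payload length
  buf.write(payload, offset, 'utf8');                          // payload
  return buf;
}

var client = {
  connect: function (count) {
    var socket = net.connect(PORT, HOST, function () {
      var payload = JSON.stringify({ aps: { alert: 'test', badge: 1 } });
      for (var i = 0; i < count; i++) {
        socket.write(createBuffer(generateToken(), payload, true));
      }
      socket.end(); // send FIN so the mock server starts processing
    });
    socket.on('data', function (data) {
      console.log('server replied:', data);
    });
  }
};

client.connect(1000);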
 
What good is a client without a server, you ask? So without further ado, here’s the mock server to go with the test client.

The Server

After accepting a connection, the server buffers everything into an array and then reads the buffers one by one. APNS has an error protocol, but this server only sends a 0 on success and a 1 otherwise. Quick caveat: since the server stores data in a variable until it gets a FIN from the client (on 'end') and only then processes it, the {allowHalfOpen: true} option is required on createServer so that the server's side of the socket isn't automatically closed when the client finishes sending, leaving it free to write the status byte back.
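A matching minimal sketch of the mock server, with the port and the parsing details as assumptions:

// server.js -- sketch of the mock APNS server
var net = require('net');

var PORT = 2195; // assumption: same port the client above connects to

// allowHalfOpen keeps our side of the socket open after the client's FIN,
// so we can still write the status byte back once processing is done.
var server = net.createServer({ allowHalfOpen: true }, function (socket) {
  var chunks = [];

  socket.on('data', function (chunk) {
    chunks.push(chunk); // buffer everything until the client is done sending
  });

  socket.on('end', function () {
    var data = Buffer.concat(chunks);
    var status = 0; // 0 = success, 1 = error (simplified error protocol)
    try {
      parseNotifications(data);
    } catch (e) {
      status = 1;
    }
    socket.end(Buffer.from([status]));
  });
});

// Walk the buffered bytes and count well-formed notifications.
function parseNotifications(data) {
  var offset = 0, count = 0;
  while (offset < data.length) {
    var command = data.readUInt8(offset); offset += 1;
    if (command === 1) offset += 8;                        // identifier + expiry
    var tokenLength = data.readUInt16BE(offset); offset += 2;
    offset += tokenLength;
    var payloadLength = data.readUInt16BE(offset); offset += 2;
    offset += payloadLength;
    count++;
  }
  console.log('received ' + count + ' notifications');
}

server.listen(PORT, function () {
  console.log('mock APNS server listening on ' + PORT);
});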

This setup is fairly basic, but it is useful for many reasons. Firstly, the client can be used to generate fake tokens and send them to any server that will accept them (just don't do it to the real APNS server, even in sandbox mode). The payload data in the example above is static, but playing around with the size of the data, as well as the number of blocks sent per request, helps identify the optimal amount of data to send over the wire. At the moment the server does nothing with the data, but saving it to a database, or simply adding a sleep in the server, would give a good indicator of the estimated time to send a potentially large number of push messages. There are a number of variables that can be changed to estimate the performance of the system and set a benchmark for how long it would take to send a large batch of messages.

Happy testing!

Tech Women (January 22, 2013)

[Photo: 4th grade budding girl geek in the making; 2nd row, 2nd girl from the left]

 
I grew up in a small suburb of New York fascinated with math and science. 3-2-1 Contact was my all-time favorite show then, and getting their magazine was such a joy. As a young girl it was fun to try out the BASIC programs they published, programming with a joystick and running them on my Atari system. (Yes, programming with a joystick or paddle is just as useful as the MacBook Wheel.) It seemed like a no-brainer to dive into computers when I started college. Women in my family were commonly in the sciences, so entering my college CS program was a bit of a culture shock for me; I could actually count all the women in my class year on one hand!
 
After graduating and working at a range of tech companies as a Quality Assurance Engineer, from big players to small startups, I've always had the desire to give back to the tech community. Only recently, however, did I find the right avenue. One day a co-worker of mine shared a link with me about the TechWomen program. From their website:
TechWomen brings emerging women leaders in Science, Technology, Engineering and Mathematics  from the Middle East and Africa together with their counterparts in the United States for a professional mentorship and exchange program. TechWomen connects and supports the next generation of women entrepreneurs in these fields by providing them access and opportunity to advance their careers and pursue their dreams.
http://www.techwomen.org/
 As soon as I read that, I applied right away.  This was exactly the type of program I was looking for to help share what I've learned.

 
It must have been written in the stars, as I was accepted as a mentor in the program. I was matched with Heba Hosny, an emerging leader from Egypt. She works as a QA Engineer at Vimov, an Alexandria-based mobile application company. During her three-week internship at Flurry she was involved in testing the full suite of Flurry products.

During Heba’s stay with us she was like a sponge, soaking up the knowledge to learn what it takes to build and run a fast-paced, successful company in Silicon Valley. In her own words,

“EVERYBODY likes to go behind the scenes. Getting backstage access to how Flurry manages their analytics business was an eye opening experience for me. I was always curious to see how Flurry makes this analytics empire, being behind the curtains with them for just a month has been very inspiring for me to the extent that some of what Flurry does has became pillars of how I daily work as a tester for Flurry analytics product used by the company I work for.
In a typical Internship, you join one company and at the end of the day you find yourself sitting in the corner with no new valuable information. You have no ability to even contact a senior guy to have a chat with him. Well, my internship at Flurry was the total OPPOSITE of that.
The Flurry workplace is different. In Flurry managers, even C levels, are sitting alongside engineering, business development, marketing, sales, etc. This open environment allowed me to meet with company CEO, CTO, managers, and even sitting next to the analytics manager.
In short, an internship at Flurry for me was like a company wide in-depth journey of how you can run a superb analytics shop and what it's like to deal with HUGE amounts of data like what Flurry works with.”

Working with Heba during her internship was a great experience, and hosting an emerging leader was very fruitful. In QA we were able to implement some of the new tools Heba introduced to us, such as the test case management tool Tarantula. Heba also gave us the opportunity to learn more about her culture and gave members of our staff a chance to practice their Arabic. The San Francisco Bay Area is a very diverse place, but this was the first chance many of us had gotten to hear a first-hand account of the Arab Spring.

From our experience in the tech field, it's obvious that the industry suffers from a noticeable lack of strong female leadership at the top. It's time that women who value both a rich home life and a fulfilling career explore the tech startup world. Participating in programs such as TechWomen helps in this regard. These programs benefit not only the mentee and mentor, but the industry as a whole. Mentees who gain experience in Silicon Valley tech companies will pay it forward to the next generation of tech women in their communities by sharing their experiences. Mentors not only learn from their mentees but help create a sense of community that ensures the mentee has a successful internship. Company-wise, participating in programs like TechWomen brings tremendous exposure to Flurry outside of the mobile community. As we enrich more women's lives in the tech field, we can share even more experiences that inspire young women and girls to know it's possible to touch the Silicon Valley dream, no matter where in the world they are.

For more information:

http://www.techwomen.org/

http://www.vimov.com/
The Benefits of Good Cabling Practices (December 20, 2012)

An organized rack makes a world of difference in tracing and replacing cables, easily removing hardware, and, most importantly, increasing airflow. By adopting good cabling habits, you help your hardware run cooler and more efficiently, protect the health and longevity of your cables, and prevent premature hardware failures caused by heat retention. Good cabling practices don't sound important, but they make a real difference. A tidy rack is also nice to look at, or to show off to your friends/enemies.

When cabling, here are some practices Flurry lives by:

Label everything

There has never been a situation where you've heard someone say, "I wish I hadn't labeled this." Labeling just makes sense. Spend the extra time to label both ends of the network and power cables. Your sanity will thank you. If you're really prepared, print out the labels on a sheet ahead of time so they'll be ready to use.

Cable length

When selecting cable length, there are two schools of thought: those who want exact lengths and those who prefer a little extra slack. The majority of messy cabling jobs come from selecting improper cable lengths, so use shorter cables where possible. One option is custom-made cables: you get exactly the length you need without any excess, but this is usually expensive in either time or money. The other option is to purchase standard-length cables. Assuming you have a 42U rack, the furthest distance between two network ports is a little over six feet. In our rack build-outs, we've had great results using standard five-foot network cables for our server-to-switch connections.

Cable management arms

When purchasing servers, some manufacturers provide a cable management arm with your purchase. These arms allow you to pull out a server without unplugging any cables; in exchange, they add bulk, retain heat, and reduce airflow. If you have them, we suggest that you don't use them. Under most circumstances you would unplug all cables before pulling out a server anyway.

No sharp bends

Cables do require a bit of care when being handled. A cable's integrity can suffer from any sharp bend, so try to avoid them. In the past, we have seen port speed negotiation problems and intermittent network issues caused by damaged network cables.

Use mount points

As you group cables together, utilize anchor points inside the rack to minimize stress on cable ends. Prolonged stress on the ends can break both the cable and the socket it's plugged into. Power connectors are also known to work loose: the weight of bundled power cables can gradually pull a plug out. Using anchor points helps relieve the stress directed at the outlet.

 

Less sharing

Isolate different types of cables (power, network, KVM, etc.) into different runs. Separating cable types allows for easy access and changes. Bundled power cables can cause electromagnetic interference on surrounding cables, so it's wise to separate power from network cables. If you must keep copper network and power cables close together, try to keep them at right angles. In our setup, standing at the back of the rack, network cables run down the left hand side while power cables generally run down the right.

Lots and lots of velcro

We saw the benefits of velcro cable ties very early on. They have a lot of favorable qualities that plastic zip ties do not: they're easy to add, remove and retie, and they're great for mounting bundled cables to anchor points inside the racks. If your velcro ties come with a slotted end, don't give in to the urge to thread the velcro through it; it's annoying to unwrap and rethread. Don't be shy about cutting the velcro to length, either; using just the right length of velcro makes it easier to bundle and re-bundle cables.

Now that you have these tips in mind, let's get started on cabling a Flurry rack.

1. Facing the back of a 42U rack, add a 48-port switch at about the middle of the rack (position 21U, the 21st from the bottom). Once you have all your servers racked, the fun part begins: cabling. Let's start with the network.

2. From the topmost server, connect the network cable to the top left port of your switch, which should be port 1.

3. As you go down the rack, connect the network cables on the top row of ports from left to right on the switch (usually odd numbered ports). Stop when you've reached the switch.

4. Using the velcro cable ties, gather the cables into one group of ten and one group of eleven and bundle each group. Keep the bundles on the left hand side of the rack; together they form one bundled run.

5. For the bottom set of servers, start with the lowest server (rack position 1U) and connect the network cable to the bottom left most port on the switch.

6. Starting from the bottom up, connect the network cables on the bottom row of ports from left to right on the switch (usually even numbered ports).

7. Doing the same as for the top half of the rack, gather the cables into groups of ten and bundle them with cable ties. Keep these bundles on the left hand side of the rack. You'll end up with two bundles of ten that form one bundled run. Looking pretty decent?

8. Now, let's get to power cabling. In this scenario we have three power distribution units (PDUs): one on the left and two on the right side of the rack. Starting from the top of the rack, velcro together five power cables and plug them into the PDU strip on the left side of the rack, working from the top down.

9. Take another two sets of four bundled power cables and plug them into the two PDU strips on the right hand side, also following the top-to-bottom convention. You should end up with a balanced distribution of power plugs.

10. Take a bundle of six power cables and plug them into the PDU strip on the left hand side.

11. Take another two sets of four power cables and plug them into the two PDU strips on the right hand side.

12. Starting from the bottom up, bundle the power cables in groups of five. You will end up with two sets of five power cables and a bundle of four.

13. Plug the bundle of four power cables into the PDU on the left hand side.

At this point, you can take a step back and admire your work. Hopefully, it looks sort of like this:

Good cabling can be an art form. As in any artistic endeavor, it takes a lot of time, patience, skill, and some imagination. There is no one-size-fits-all solution, but hopefully this post will provide you with some great ideas for your next rack build-out.

Exploring Dynamic Loading of Custom Filters in HBase (December 6, 2012)

Any program that pulls data from a large HBase table containing terabytes of data spread over many nodes needs to put a bit of thought into how it retrieves that data. Failure to do so may mean waiting for, and subsequently processing, a lot of unnecessary data, to the point where it renders the program (whether a single-threaded client or a MapReduce job) useless. HBase's Scan API helps in this respect: it configures the parameters of the data retrieval, including the columns to include, the start and stop rows, and batch sizing.
 
The Scan can also include a filter, which is often the most impactful way to improve the performance of scans against an HBase table. The filter is applied on the regionservers and screens unwanted rows out of the result set. A well-designed filter is performant and minimizes the data scanned and returned to the client. Many useful Filters come standard with HBase, but sometimes the best solution is to use a custom Filter tuned to your HTable's schema.

Before your custom filter can be used, it has to be compiled, packaged in a jar, and deployed to all the regionservers. Restarting the HBase cluster is necessary for the regionservers to pick up the code in their classpaths. Therein lies the problem: an HBase restart takes a non-trivial amount of time (although rolling restarts mitigate that somewhat), and the downtime is significant with a cluster as big as Flurry's.

This is where dynamic filters come in. The word 'dynamic' refers to the on-demand loading of these custom filters, just like loading external modules at runtime in a typical application server or web server. In this post, we will explore an approach that makes this possible in the cluster.

How It Works Now
Before we dive into the workings of dynamically loading filters, let's see how regular custom filters work.

Assuming the custom filter has already been deployed to a jar in the regionservers' classpath, the client can simply use the filter, e.g. in a full table scan, like this:
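A minimal sketch of that kind of usage, where the table name, column family and the CustomFilter constructor argument are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class CustomFilterScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");            // table name is a placeholder
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("d"));                    // column family is a placeholder
    scan.setFilter(new CustomFilter(Bytes.toBytes("x")));  // the custom filter under discussion
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result result : scanner) {
        // process each row that passed the server-side filter
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}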

This filter will have to be pushed to the regionservers to be run server-side. The sequence of how the custom filter gets replicated on the regionservers is as follows:

  1. The client serializes the Filter class name and its state into a byte stream by calling the CustomFilter's write(DataOutput) method.
  2. The client directs the byte array to the regionservers that will be part of the scan.
  3. Server-side, the RegionScanner re-creates the Scan, including the filter, using the byte stream. Part of the stream is the filter’s class name, and the default classloader uses this fully qualified name to load the class using Class.forName().
  4. The filter is then instantiated using its empty constructor and configured using the rest of the filter byte array, via the filter's readFields(DataInput) method (see org.apache.hadoop.hbase.client.Scan for details).
  5. The filter is then applied as a predicate on each row.

Figure: (1) myfilter.jar containing our custom filter resides locally in the regionservers' classpath; (2) the custom filter is instantiated using the default ClassLoader.

Once deployed, this definition of our custom filter is static. We can make ad hoc queries using combinations of the deployed filters, but if we need to add, extend or replace a custom filter, it has to be added to the regionservers' classpath, and we have to wait for the next restart before those filters can be used.

There is a faster way.

Dynamic Loading
A key takeaway from the previous section is that Filters are Writables – they are instantiated using the name of the class and then configured by a stream of bytes that the Filter understands. This makes the filter configuration highly customizable and we can use this flexibility to our advantage.

Rather than create a regular Filter, we introduce a ProxyFilter, which acts as the extension point through which we can load our custom Filters on demand. At runtime, it loads the custom filter class itself.

Let’s look at some example code. To start with, there is just a small change we have to make on the client; the ProxyFilter now wraps the Filter or FilterList we want to use in our scan.
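Sketched as a fragment of the earlier scan example (CustomFilter and its argument remain placeholders):

// Before: scan.setFilter(new CustomFilter(Bytes.toBytes("x")));
// After:  wrap the same filter (or a FilterList) in a ProxyFilter
Scan scan = new Scan();
Filter wrapped = new CustomFilter(Bytes.toBytes("x"));   // placeholder custom filter
scan.setFilter(new ProxyFilter(wrapped));
ResultScanner scanner = table.getScanner(scan);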

The ProxyFilter passes its own class name to be instantiated on the server side, and then serializes the wrapped custom filter after it.

On the regionserver the ProxyFilter is initialized in the same way as described in the previous section. The byte stream that follows should minimally contain the filter name and its configuration byte array. In the ProxyFilter's readFields method, the relevant code looks like this.
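A sketch of that logic against the 0.92-era Writable API; the filterModule helper and its loadFilterClass method are assumed names, and the real implementation is in the example code linked in footnote [5]:

// Inside ProxyFilter (a sketch, not the exact Flurry implementation):
public void readFields(DataInput in) throws IOException {
  String filterClassName = Text.readString(in);       // wrapped filter's class name
  byte[] filterConfig = Bytes.readByteArray(in);       // wrapped filter's serialized state

  // filterModule knows how to locate the class definition (e.g. from a jar in HDFS)
  Class<? extends Filter> filterClass = filterModule.loadFilterClass(filterClassName);
  try {
    filter = filterClass.newInstance();
    filter.readFields(new DataInputStream(new ByteArrayInputStream(filterConfig)));
  } catch (InstantiationException e) {
    throw new IOException("Unable to instantiate " + filterClassName, e);
  } catch (IllegalAccessException e) {
    throw new IOException("Unable to instantiate " + filterClassName, e);
  }
}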

This is very much like how the default Scan re-creates the Filter on the regionserver with one critical difference – it uses a filterModule object to obtain the Class definition of the custom filter. This module retrieves the custom filter Class and returns it to ProxyFilter for instantiation and configuration.

There can be different strategies for retrieving the custom filter class. Our example code copies the jar file from the Hadoop filesystem to a local directory and delegates the loading of the Filter classes from this jar to a custom classloader [3].

To configure the location of the directory the module searches for the filters.jar in HDFS, add the following property in hbase-site.xml.

<property>
  <name>flurry.hdfs.filter.jar</name>
  <value>/flurry/filters.jar</value>
</property>
Figure: (1) the custom filter jar resides in a predefined directory in HDFS; (2) the proxyfilter.jar containing the extension point needs to reside locally in the regionserver's classpath; (3) the ProxyFilter is instantiated using the default ClassLoader; (4) if necessary, the rowfilter.jar is downloaded from a preset Path in HDFS, and a custom classloader in ProxyFilter instantiates the correct filter; Filter interface calls are then delegated to the enclosed filter.

With the ProxyFilter in place, it is now simply a matter of placing or replacing the jar in the Hadoop FS directory to pick up the latest filters.

Reloading Filters
When a new Scan is requested on the server side, this module first checks the filter jar. If the jar is unchanged, the previously loaded Classes are returned. If the jar has been updated, the module repeats the process of downloading it from HDFS, creating a new instance of the classloader and reloading the classes from the modified jar. The previous classloader is dereferenced and left to be garbage collected. Restarting the HBase cluster is not required.
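A sketch of that check; the field names (conf, localDir, classLoader, lastModTime) and the ChildFirstClassLoader class are assumptions, with the real version in the example code in footnote [5]:

// Inside the jar-loading module (a sketch):
public synchronized Class<? extends Filter> loadFilterClass(String className) throws IOException {
  Path jarPath = new Path(conf.get("flurry.hdfs.filter.jar", "/flurry/filters.jar"));
  FileSystem fs = jarPath.getFileSystem(conf);
  long modTime = fs.getFileStatus(jarPath).getModificationTime();

  if (classLoader == null || modTime != lastModTime) {
    // The jar changed: copy it down and build a fresh child-first classloader for it.
    File localJar = new File(localDir, "filters-" + modTime + ".jar");
    fs.copyToLocalFile(jarPath, new Path(localJar.getAbsolutePath()));
    classLoader = new ChildFirstClassLoader(new URL[] { localJar.toURI().toURL() },
                                            getClass().getClassLoader());
    lastModTime = modTime;
  }
  try {
    return (Class<? extends Filter>) Class.forName(className, true, classLoader);
  } catch (ClassNotFoundException e) {
    throw new IOException("Filter class not found: " + className, e);
  }
}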

The HdfsJarModule keeps track of the latest custom filter definitions, using a separate classloader for each jar version.

Custom classloading and reloading can be a class-linking, ClassCastException minefield, but the risk here is mitigated by the highly specialized use case of filtering. The filter is instantiated and configured per scan, and its object lifecycle is limited to the time it takes to do the scan on the regionserver. The example uses the child-first classloader mentioned in a previous post on ClassLoaders, which searches a configured set of URLs before delegating to its parent classloader [2].

Things to watch out for

  • The example code has a performance overhead as it makes additional calls to HDFS to check for the modification time of the filter jar when a filter is first instantiated. This may be a significant factor for smaller scans. If so, the logic can be changed to check the jar less frequently.
  • The code is also very naïve at this point. Updating the filter.jar in the Hadoop FS while a table scan is happening can have undesired results if the updated filters are not backward compatible. Different versions of the jar can be picked up by the RegionServers for the same scan as they check and instantiate the Filters at different times.
  • Mutable static variables are discouraged in the custom Filter because they will be reinitialized when the class is reloaded dynamically.

Extensions
The example code is just a starting point for more interesting functionality tailored to different use cases. Scans using filters can also be used in MapReduce jobs and coprocessers. A short list of possible ways to extend the code:

  • The most obvious weakness in the example implementation is that the ProxyFilter only supports one jar. Extending it to include all jars in a filter directory would be a good start. [4]
  • Different clients may expect certain versions of Filters. Some versioning and bookkeeping logic will be necessary to ensure that the ProxyFilter can serve up the correct filter to each client.
  • Generalize the solution to include MapReduce scenarios that use HBase as the input source. The module can load the custom filters at the start of each map task from the MR job library instead, unloading the filters after the task ends.
  • Support other JVM languages for filtering. We have tried serializing Groovy 1.6 scripts as Filters but performance was several times slower.


Using the ProxyFilter as a generic extension point for custom filters allows us to improve our performance without the hit of restarting our entire HBase cluster.

Footnotes
[1] Class Reloading Basics http://tutorials.jenkov.com/java-reflection/dynamic-class-loading-reloading.html
[2] See our blog post on ClassLoaders for alternate classloader delegation http://tech.flurry.com/using-the-java-classloader-to-write-efficient
[3] The classloader in our example resembles the one described in this ticket https://issues.apache.org/jira/browse/HBASE-1936
[4] A new classloader has just been introduced in hbase-0.92.2 for coprocessors, and it seems to fit perfectly for our dynamic filters https://issues.apache.org/jira/browse/HBASE-6308
[5] Example code https://github.com/vallancelee/hbase-filter/tree/master/customFilters

Squashing bugs in multithreaded Android code with CheckThread (November 19, 2012)

Writing correct multithreaded code is difficult, and writing Android apps is no exception. Like many mobile platforms, Android's UI framework is single-threaded and requires the application developer to manage threads without any thread-safety guarantees. If your app is more complicated than "Hello, World!" you can't escape writing multithreaded code. For example, to build a smooth and responsive UI, you have to move long-running operations like network and disk IO to background threads and then return to the UI thread to update the UI.

Thankfully, Android provides some tools to make this easier such as the AsyncTask utility class and the StrictMode API. You can also use good software development practices such as adhering to strict code style and requiring careful code review of code that involve the UI thread. Unfortunately, these approaches require diligence, are prone to human error, or only catch errors at runtime.

CheckThread for Android

CheckThread is an open source project authored by Joe Conti that provides annotations and a simple static analysis tool for verifying certain contracts in multithreaded programs. It's not a brand new project and it's not very difficult to use, but it hasn't seen much adoption for Android apps. It offers an automated alternative to relying exclusively on comments and code review to ensure that no bugs related to the UI thread are introduced in your code. The annotations provided by CheckThread are @ThreadSafe, @NotThreadSafe and @ThreadConfined.

ThreadSafe and NotThreadSafe are described in Java Concurrency in Practice, and CheckThread enforces the same semantics that book defines. For the purposes of this blog post, the only annotation that we'll be using is ThreadConfined.

Thread confinement is a general property of restricting data or code to access from only a single thread. A data structure confined to the stack is inherently thread confined. A method that is only ever called by a single thread is also thread confined. In Android, updates to the UI must be confined to the UI thread. In very concrete terms, this implies that any method that mutates the state of a View should only be called from the UI thread. If this policy is violated, the Android framework may throw a RuntimeException, but also may simply produce undefined behavior, depending on the specific nature of the update to the UI.

CheckThread supports defining thread policies in XML files, so while it would be possible, it's not necessary to download the source of the Android framework code and manually add annotations to it. Instead, we can simply define a general thread policy to apply to Android framework classes.

Time for an Example

The following example demonstrates how to declare a thread policy in XML, annotate a simple program and run the CheckThread analyzer to catch a couple of bugs.

CheckThread’s XML syntax supports patterns and wildcards which allows you to concisely define policies for Android framework code. In this example we define two Android specific policies. In general this file would contain more definitions for other Android framework classes and could also contain definitions for your own code.

The first policy declares that all methods in Activity and its subclasses that begin with the prefix “on” should be confined to the main thread. Since CheckThread has no built-in concept of the Android framework or of the Activity class we need to inform the static analyzer which thread will call these methods.

The second policy declares that all methods in classes ending with "View" should be confined to the main thread. The intention is to prevent calling any code that modifies the UI from any thread other than the UI thread. This is a little conservative, since some methods in Android View classes have no side effects, but it will do for now.

Having defined the thread policy, we can turn to our application code. The example app is the rough beginnings of a Hacker News app. It fetches the RSS feed for the front page, parses the titles and displays them in a LinearLayout.

This first version is naive; it does network IO and XML parsing in Activity.onCreate. This error will definitely be caught by StrictMode, and will likely just crash the app on launch, so it would be caught early in development, but it would be even better if it were caught before the app was even run.

In this code, we make a call to the static method getHttp in the IO utility class. The details of this class and method are not important, but since it does network IO, it should be called from off the UI thread. We’ve annotated the entire class as follows:
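Roughly like this; the class body is only a placeholder, and the annotation package and the exact form of @ThreadConfined's argument are assumptions about CheckThread's syntax:

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.Scanner;

import org.checkthread.annotations.ThreadConfined;   // package name assumed

@ThreadConfined("other")   // every method in this class must be called off the UI thread
public class IO {
    public static String getHttp(String url) {
        try {
            InputStream in = new URL(url).openStream();   // blocking network IO
            try {
                return new Scanner(in, "UTF-8").useDelimiter("\\A").next();
            } finally {
                in.close();
            }
        } catch (IOException e) {
            return "";
        }
    }
}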

This annotation tells CheckThread that all the methods in this class should be called from the “other” thread.

Finally, we’re ready to run the static analyzer. CheckThread provides several ways to run the analysis tool, including Eclipse and Intellij plugins, but we’ll just use the Ant tasks on the command line. This is what CheckThread outputs after we run the analyzer:

As you can see, CheckThread reports an error: we're calling a method that should be confined to the "other" thread from a method that's confined to "MAIN". One solution to this problem is to start a new thread to do the network IO on. We create an anonymous subclass of java.lang.Thread and override run, inside of which we fetch the RSS feed, parse it and update the UI.
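A sketch of that version; MainActivity, itemContainer (the LinearLayout from the layout), parseTitles and the feed URL are hypothetical, Android imports are omitted, and the annotation value strings follow the thread names used in this post rather than CheckThread's exact constants:

// Inside MainActivity.onCreate, after setContentView (a sketch):
new Thread() {
    @ThreadConfined("other")
    @Override
    public void run() {
        String rss = IO.getHttp("http://news.ycombinator.com/rss");  // ok: off the UI thread
        for (String title : parseTitles(rss)) {                      // parseTitles is hypothetical
            TextView view = new TextView(MainActivity.this);
            view.setText(title);            // bug: mutating a View off the UI thread
            itemContainer.addView(view);    // itemContainer is the LinearLayout from the layout
        }
    }
}.start();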

We’ve annotated the run method to be confined to the “other” thread. CheckThread will use this to validate the call to IO.getHttp. After running the analyzer again, we discover that there’s a new error reported:

This time, the error is caused by calling the method setText on a TextView from a different thread than the UI thread. This error is generated by the combination of our thread policy defined in XML and the annotation on the run method.

Instead, we could call the Activity.runOnUiThread with a new Runnable. The Runnable’s run method is annotated to be confined to the UI thread, since we’re passing it to a framework method that will call it from the UI thread.
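A sketch of the corrected version, under the same assumptions as the previous snippet:

new Thread() {
    @ThreadConfined("other")
    @Override
    public void run() {
        String rss = IO.getHttp("http://news.ycombinator.com/rss");
        final List<String> titles = parseTitles(rss);     // parseTitles is hypothetical
        runOnUiThread(new Runnable() {
            @ThreadConfined("MAIN")
            @Override
            public void run() {
                for (String title : titles) {
                    TextView view = new TextView(MainActivity.this);
                    view.setText(title);          // ok: we are back on the UI thread
                    itemContainer.addView(view);
                }
            }
        });
    }
}.start();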

Finally, CheckThread reports no errors to us. Of course that doesn’t mean that the code is bug free, static analysis of any kind has limits. We’ve just gotten some small assurance that the contracts defined in the XML policy and annotations will be held. While this example is simple, and the code we’re analyzing would be greatly simplified by using an AsyncTask, it does demonstrate the class of errors that CheckThread is designed to catch. The complete sample project is published on Github.

The Pros and Cons of Annotations

One drawback that is probably immediately obvious is the need to add annotations to a lot of your code. Specifically, CheckThread's static analysis is relatively simple, and doesn't construct a complete call graph of your code. This means that the analyzer will not generate a warning for the code below:
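As a hypothetical illustration of that gap, an unannotated helper method sitting between a MAIN-confined caller and an "other"-confined callee goes unchecked:

public class MainActivity extends Activity {
    @Override
    protected void onCreate(Bundle savedInstanceState) {   // confined to MAIN by our XML policy
        super.onCreate(savedInstanceState);
        loadFrontPage();
    }

    private void loadFrontPage() {
        // No annotation here, so there is no contract for the analyzer to check,
        // and the indirect call from the UI thread into IO.getHttp goes unnoticed.
        IO.getHttp("http://news.ycombinator.com/rss");
    }
}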

While this may appear to be a significant problem, it's my opinion that in practice it is not actually a deal breaker. Java already requires that the programmer write most types in code. This is seen by some as a drawback of Java (and is often cited incorrectly as a drawback of static typing in general). However there are real advantages to annotating code with type signatures, and even proponents of languages with powerful type inference will admit this, since it's usually recommended to write the type of "top-level" or publicly exported functions even if the compiler can infer the type without any hint.

The annotations that CheckThread uses are similar; they describe an important part of a method's contract, that is whether it is thread safe or should be confined to a specific thread. Requiring the programmer to write those annotations elevates the challenge of writing correct multithreaded code to be at the forefront of the programmer's mind, requiring that some thought be put into each method's contract. The use of automated static analysis makes it less likely that a comment will become stale and describe a method as thread safe when it is not.

The Future of Static Analysis

The good news is that the future of static analysis tools designed to catch multithreaded bugs is looking very bright. A recent paper published by Sai Zhang, Hao Lü, and Michael D. Ernst at the University of Washington describes a more powerful approach to analyzing multithreaded GUI programs. Their work targets Android applications as well as Java programs written using other GUI frameworks. The analyzer described in their paper specifically does construct a complete call graph of the program being analyzed. In addition, it doesn't require any annotations by the programmer and also addresses the use of reflection in building the call graph, which Android specifically uses to inflate layouts from XML. This work was published only this past summer, and the tool itself is underdocumented at the moment, but I recommend that anyone interested in this area read the paper which outlines their work quite clearly.

Write Once, Compile Twice, Run Anywhere (November 8, 2012)

Many Java developers use a development environment different from the target deployment environment.  At Flurry, we have developers running OS X, Windows, and Linux, and everyone is able to contribute without thinking much about the differences of their particular operating system, or the fact that the code will be deployed on a different OS.

The secret behind this flexibility is how developed the Java toolchain has become. One tool in particular, Eclipse, has eclipsed the rest to become the dominant IDE for Java developers. Eclipse is free, with integrations like JUnit support and a host of really great plugins, making it the de facto standard in Java development, displacing IntelliJ and other options. In fact, entry-level developers rarely even think about the compilation step, because Eclipse's autocompilation recompiles your code every time you save a file.

There's Always a Catch

Unfortunately no technology is magical, and while this setup rarely causes issues, it can. One interesting case arises when the developer is using the Eclipse 1.6 compiler compliance and the target environment uses Sun's 1.6 JDK compiler. For example, at Flurry, during development we rely on Eclipse's JDT compiler, but the code we ship gets compiled for deployment on a Jenkins server by Ant using Sun's JDK compiler. Note that both the developer and the continuous integration environment are building for Java 6, but using different compilers.

Until recently this never came up as an issue, as the Eclipse and Sun compilers, even when running on different operating systems, behave almost identically. However, we have been running into some interesting (and frustrating) compiler issues that are essentially changing "Write Once, Run Anywhere" into "Write Once, Compile Using Your Target JDK, Run Anywhere." We have valid Java 6 code using generics which compiles fine under Eclipse but won't compile using Sun's javac.

Let's See an Example

An example of the code in question is below. Note that it meets the Java specification and should be a valid program. In fact, in Eclipse using Java 1.6 compiler compliance the code compiles, but won't compile using Sun's 1.6 JDK javac.

Compiling this code using javac in the Sun 1.6 JDK gives this compiler error:

"Write Once, Run Anywhere" never made any promises about using different compilers, but the fact that our toolchain was using a different compiler than our build server never bore much thought until now.

Possible Solutions

The obvious solution is to have all developers work in the same environment as where the code will be deployed, but this would deter developers from using their preferred environment and impact productivity by constraining our development options. Possible solutions we have kicked around:

  1. Have Ant compile using the Eclipse incremental compiler (using the flags -Dbuild.compiler=org.eclipse.jdt.core.JDTCompilerAdapter and, of course, -Dant.build.javac.target=1.6); see the example invocation after this list. This sidesteps the problem by forcing the continuous integration system to use the same compiler as developer laptops, but is not ideal, as this was never an intended use of the Eclipse compiler.
  2. Move to the 1.7 JDK for compilation, using a target 1.6 bytecode. This solves this particular issue, but what happens in the future?
  3. Change the code to compile under Sun's JDK. This is not a bad option, but it will cost some development speed found in the simplicity of Eclipse's built-in system.
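For option 1, the invocation would look roughly like this; the dist target name and the jar path are placeholders, and the Eclipse batch compiler jar has to be made available to Ant, here via -lib:

ant -lib /path/to/ecj.jar \
    -Dbuild.compiler=org.eclipse.jdt.core.JDTCompilerAdapter \
    -Dant.build.javac.target=1.6 \
    dist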

My experience has been that Eclipse is a well-worn path in the Java world, and it's a little surprising that this hasn't come up before for us given our heavy use of generics (although there are lots of generics issues which have been logged over at bugs.sun.com, like http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6302954, which has come up for us as a related issue: the "no unique maximal instance exists" error message).

Switching to use the Eclipse compiler for our deployable artifacts would be an unorthodox move, and I'm curious if anyone out there reading this has done that, and if so, with what results.

We had a discussion internally, and the consensus was that moving to 1.7 for compilation with a target of 1.6 bytecode (#2) should work, but would potentially open us up to bugs in 1.7 (and would require upgrading some internal utility machines to support this). We aren't quite ready to make the leap to Java 7, although it's probably time to start considering the move.

For now, we coded around the compiler issue, but it's coming up for us regularly now, and we are kicking around ideas on how to resolve it. In the near term, for the projects that run into this generics compile issue, developers are back to using good ole Ant to check whether their code will "really" compile. It's easy to forget how convenient autocompilation has become, and the fact that it isn't really the same as the build server's compiler.
Regression Testing NodeJS Services with Restify and Mocha (October 3, 2012)

Here at Flurry we are big proponents of unit testing and TDD (Test-Driven Development); we believe it makes for tighter code that is less prone to behaving in unexpected ways. We also believe in extensive regression testing: making sure the changes and code we add don't break existing functionality.

Recently we have begun moving parts of the Flurry backend over to a new set of services running on NodeJS. NodeJS is a fast, flexible platform that is especially suited to developing RESTful API services.

Restify

To help us build these new backend components we’re using the NodeJS “Restify” framework.  Restify provides an extensive amount of low-level functionality to help you correctly implement RESTful services - much in the same way the “Express” NodeJS module helps you rapidly develop web-based applications - for those of you familiar with Express, you’d feel right at home in Restify.  We’ve found Restify to be a fantastic framework for development - easy to use, understand and implement.

Mocha

To facilitate unit testing, we’re using the Mocha Javascript unit testing framework for everything we do and have found it to be flexible and easy to use.  Mocha makes it really easy to write unit tests for your NodeJS code - so easy in fact, we decided we wanted to use Mocha for our regression testing.

After some trial and error, we have settled on a fairly simple setup that works well and is easy to implement. The following steps outline the process; for this small tutorial we'll build the requisite "Hello World" API that simply returns "Hello World" when called.

Before we get started, we first want to make sure that both Restify and Mocha are installed for use in our new Node project:
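Something along these lines; installing Mocha globally provides the mocha command used later (a local install plus ./node_modules/.bin/mocha works too):

npm install restify
npm install -g mocha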

Once those are installed, we’re ready to create our sample “Hello World” API service, as well as setup the Mocha regression test cases.

Here’s the contents of the app.js file that we will be using for testing:
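A minimal sketch of that file:

// app.js (a sketch)
var startServer = require('./start_server');

startServer.StartServer();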

You can see that unlike other app.js examples you may have seen, this one is very small: it simply calls the StartServer() function exported from the start_server.js file. That function starts the server; we'll cover it below.

Start up the NodeJS service in Mocha

Before we can do any regression testing against our “Hello World” service, we must first start up the Hello World service.  To do this, we are going to create a special “before” Mocha hook - this will run before any of the other regression test files are run by Mocha, and will allow us to start the service so it can be tested.

Within your directory, create a sub directory called “test”.  All of our regression test and unit test files are going to be located inside.  Create a new file called “01-start-server.js” with the following contents:
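A minimal sketch of what those contents might look like:

// test/01-start-server.js (a sketch)
var startServer = require('../start_server');

before(function () {
  // Runs before the rest of the test files: bring the Hello World service up
  startServer.StartServer();
});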

This file will be picked up as the first file in the directory (that's why we started the filename with 01), and its before() hook will be executed, which in turn requires and runs our StartServer() function. The StartServer() function is defined in the start_server.js file:
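A minimal sketch of start_server.js, with the port number as an assumption:

// start_server.js (a sketch)
var restify = require('restify');

function StartServer() {
  var server = restify.createServer({ name: 'hello-world' });

  // GET /hello simply returns "Hello World"
  server.get('/hello', function (req, res, next) {
    res.send(200, { message: 'Hello World' });
    return next();
  });

  server.listen(8080, function () {
    console.log('%s listening at %s', server.name, server.url);
  });
}

exports.StartServer = StartServer;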

Its purpose is to actually initialize the Restify listeners and start the service.

Create a Mocha regression test

Now that we have a way to automatically start the service before we need to test it, we can go about the business of writing our regression test cases.  Since our Hello World service is so simple, we’re just going to have one test - we’re going to test to be sure our call to the /hello service returns an HTTP response code of “200”, indicating our request was “OK”:
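A minimal sketch of that test file, using the filename and port from the rest of this walkthrough:

// test/service_hello_tests.js (a sketch)
var assert = require('assert');
var restify = require('restify');

// Restify's JSON client wraps the http module and JSON.parses responses for us
var client = restify.createJsonClient({
  url: 'http://localhost:8080'   // must match the port used in start_server.js
});

describe('GET /hello', function () {
  it('responds with a 200 status code', function (done) {
    client.get('/hello', function (err, req, res, data) {
      assert.ifError(err);                  // connection-level errors
      assert.equal(res.statusCode, 200);    // the check described above
      done();
    });
  });
});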

Initializing an HTTP Client

At the very top of the file that contains the test cases for our Hello World service, you can see we are using a feature of Restify: the Restify JSON client. The JSON client is a small wrapper around NodeJS's http module that makes it easy to call URL-based web services that return their content in JSON format. The JSON client automatically JSON.parses the response body and makes it available for your use (as the "data" parameter in our callback function).

Once we’ve created our client, we can then use the client to make a GET call to our /hello service URL.  If we encounter an error connecting to the service, our “err” parameter will contain the error.  The “data” parameter will contain the data returned from the call, so we will want to check that to be sure it contains the data we requested. 

Running the Regression Test

Now that we have our test in place, the next step is to actually run it, which is as easy as typing "mocha" in your project directory:

Mocha will first run the "01-start-server.js" file in the test directory - this starts our service.  Next, it will move on to the service_hello_tests.js file and run our solitary regression test.  If the service responds as we have outlined in our test, the test will be marked as passed by Mocha.

Using this simple setup we can add as many additional tests as needed - either extending our "hello" service, or writing additional test cases for new services.

Using Mocha for both unit testing and regression testing allows us to save time by only having to deal with one testing framework - Mocha is flexible enough to make running unit tests in more complex scenarios fairly straight forward and easy to do.  Now if only it could write your unit tests for you :)

Scaling Build and Deployment Systems (September 18, 2012)

Flurry is a rapidly growing company in every sense of the word: customers, teams, and applications. We see 650 million unique users per month from mobile applications that embed our SDKs, which generate over 1.6 billion transactions per day. Our HBase cluster of over 1,000 nodes holds many petabytes of data and is growing rapidly.

While Flurry's growth is great news for our customers, employees, and investors, it creates challenges for Release Management in supporting the growing number of applications, developers, and development and test environments. We need to manage all of that well, and do so quickly and reliably.

In short, we need continuous integration to rapidly build, deploy, and test our applications.

Deployment Tools @ Flurry
 

To support our continuous integration, we setup three core tools.

  • Source Control: We use Github Enterprise to manage our source code and various configuration files. We use a variation of the Git Flow development process where all features are developed on individual branches and every merge is in the form of a pull request (which is also code reviewed).
  • Continuous Build: We use Jenkins to build code, deploy to our QA and test environments, and run our JUnit tests. Jenkins is set up to automatically run JUnit tests when developers check in code and when they create pull requests for their branches. Jenkins also runs Selenium tests with SauceLabs every night against the QA environments and Production.
  • Task Tracking: We use Jira (with the Greenhopper agile plugin) for ticket management for planning and tracking enhancements and bug fixes.  All three tools are well integrated with various plug-ins that allow them to share information and to trigger actions.

Challenges at Scale
 
Our setup for continuous integration has served us well but has some challenges.

  • Too Many Jobs: We have more than 50 Jenkins jobs.  We have over 130 deployment scripts and more than 1,600 configuration files for the CI tools and applications.  Each new application and each new QA environment adds to the pile.  While we are whizzes at writing bash scripts and Java programs, this is clearly not scalable in the long term.
  • Slow Deployments: For security reasons, our Jenkins server cannot deploy war files and config files directly to Production servers.  For Production deployments, we run a Jenkins job that copies the files built by other jobs to a server in the data center over a locked-down, one-way secure tunnel.  Ops staff then manually runs various scripts to push the files to the Production servers and restart them.  This is inefficient in terms of time and people resources.
  • Test Overrun: Our JUnit test suite has over 1,000 test cases which take about an hour to run.  With the increase in the number of developers, the number of test runs triggered by their pull requests is clogging the Jenkins build server. We have biweekly releases to Production which we would like to be able to cut down to a daily basis or at least every few days.  The blocker to this is that the build, deploy, test, analyze, and fix cycle is too long to allow that.

Improving the Process: Design for Speed

The speed of an engineering team is directly related to the speed of release and deployment so we needed to get faster. We have taken a number of steps to address the challenges.

  • We optimized our JUnit test cases by removing unnecessary sleep statements and stubbed out the deployments to our test CDN which reduces network wait time.
  • We upgraded the build system to bigger, faster hardware and parallelized the JUnit test runs so that we can run multiple test jobs at the same time.
  • We added dedicated Jenkins slave servers that can share the burden during times of heavy parallel building. 

Overall we have reduced the time to run the entire test suite to 15 minutes.

To make it easier to manage the Jenkins jobs, we removed a lot of old jobs and combined others using parameterized builds.  We renamed the remaining Jenkins jobs to follow a standard naming convention and organized them into tabbed views.  We now have a dozen jobs laid out where people can find them.

[Screenshot: Jenkins]

All of the improvement steps have helped, but we needed more fundamental changes.

Improving the Process: RPM Based Software Deployments
 

We changed our build and deployment process to use RPM repositories where every environment has its own local repository of RPMs.  In the new process, the Jenkins job builds the war files then bundles up each war file along with its install script.  The job also builds RPMs for each application's config files, email templates, data files and the config files for HBase and Hadoop.  Once all of the RPMs are built, the job rsyncs the RPMs to the target environment's repo.  It then runs ssh yum install against each server in the environment to get it to update itself.  Once all the servers are updated, the job restarts all of the servers at once.  The job is parameterized so that users can build and deploy a single app, a set of apps, all apps, or just some config files.
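A rough sketch of that deploy step; the hostnames, paths and package name are made up for illustration, and the real job is parameterized as described above:

# Sketch of the per-environment deploy step run by the Jenkins job.
RPM_DIR=build/rpms
REPO_HOST=repo01.qa.example.com
REPO_PATH=/var/www/yum/qa

rsync -av "$RPM_DIR/" "$REPO_HOST:$REPO_PATH/"
ssh "$REPO_HOST" "createrepo --update $REPO_PATH"

# Have each server in the environment update itself from its local repo.
for host in $(cat qa-servers.txt); do
  ssh "$host" "sudo yum clean metadata && sudo yum install -y flurry-webserver"
done

# Restart everything once all servers have the new bits.
for host in $(cat qa-servers.txt); do
  ssh "$host" "sudo service jetty restart" &
done
wait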

The developers have access to Jenkins so that they can update the QA environments at will without having to involve Release Engineering.

The RPM-based build and deployment process gives us several advantages.  The install scripts are embedded into the RPMs which reduces the cluttered hierarchy of scripts called by the Jenkins jobs.  The same tools and processes for deployments in the Dev and QA environments can now be safely used in Production.

By having a repo for each environment, we only have to deploy the RPMs once, to that repo. Each server in the environment then pulls the RPMs from its repo. This saves a lot of time and network bandwidth for our remote environments, whose servers used to get files directly from Jenkins.

RPMs support dependencies, which instruct yum to deploy a group of other RPMs before deploying the given RPM. For example, we can make an application's RPM dependent on the application's config file RPM, so that when we install the application, yum automatically installs the config file RPM. The dependency feature also allows us to set up a parent RPM for each class of server, where the parent RPM depends on all of the application RPMs that run on that class of server. We simply execute yum install with the parent RPM, and yum downloads and installs all of the application RPMs and their dependent config file RPMs needed for that server. In the future we will add dependencies for Java, Jetty, and various OS packages to the parent RPMs. This will allow us to kickstart a new server and fully provision it at the push of a button.
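As an illustration, the dependency declarations in such a parent RPM's spec file might look like this (all package names are hypothetical):

# flurry-webserver.spec (sketch of a meta-package for one class of server)
Name:      flurry-webserver
Version:   1.0
Release:   1
Summary:   Pulls in every application and config RPM this server class needs
License:   Proprietary
BuildArch: noarch

# yum resolves and installs these along with the parent package
Requires:  flurry-api-app
Requires:  flurry-api-config
Requires:  flurry-portal-app
Requires:  flurry-portal-config

%description
Meta-package: installing it provisions all application RPMs (and, through their
own dependencies, the matching config RPMs) for this class of server.

%files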

Conclusion

As with any change in process and tools, there were a few gotchas. The Jenkins slave server was easy to set up, but a lot of tools and configurations needed to support our JUnit test runs had to be copied from the Jenkins master server. We also found a few places where concurrent JUnit test runs stepped on common files.

Overall, the changes have sped up and cleaned up our build and deployments.  They have allowed us to better manage what we have and to handle our future growth.

Hardware Fault Detection and Planning (September 5, 2012)

The Flurry Operations Team, six awesome operations engineers, handles 4,428 spinning disks across 1,107 servers. Since Flurry is a startup, we don't have an on-site tech team to handle all of the hardware issues that happen at the datacenter. As we've grown from 400 disks to over 4,000, we've improved our process for handling servers experiencing disk hardware failures.

The most common hardware alerts we receive come from "Self-Monitoring, Analysis and Reporting Technology", better known as SMART. This tool tests and monitors disks and will detect and report potential disk issues, hoping to warn the admin before a disastrous failure appears. (Find out more about SMART errors.)

Flurry lives and dies by the data stored in our Hadoop and HBase cluster, so when a disk issue happens we need to respond quickly and decisively to prevent data loss and/or performance impacts. We generally find that we receive critical and non-critical alerts on around 1% of active cluster disks each month, not all of which need immediate attention.

Monitoring 400 disks: SMART error detected on host

When we were a humble cluster of 100 servers, it was easy to log into a box, gracefully stop the Hadoop services, unmount the drive, and start the Hadoop daemons back up. Most of the actionable alerts we saw were High Read Errors or Uncorrectable Sectors, which tend to indicate a potentially fatal disk issue.

Hadoop tends to let the filesystem handle marking the sectors as bad and/or unreadable, forcing a read to occur on another replica. Hadoop is pretty good about moving the block mapping, but it can increase read latency and generally degrades the overall performance of the cluster. Did I already mention that we don't like performance degradation?

Monitoring 1200 disks: Find those bad drives fast

Our first datacenter expansion in 2011 consisted of a build-out of an additional 200 servers. Each server has 4 x 1TB drives which are utilized in the cluster; that's 800 disks in this build-out. During pre-production diagnostic tests, we had a 0.5% failure rate on the new disks.

Once the initial issues were resolved, we deployed the servers into production.  The 200 new servers had an average of 2.67 disks going bad per month for the period before our next data center buildout.  Our original 400 disks started reporting 2 new issues a month.  That’s jumping from 0.3% to 0.6% disk issues a month, possibly degrading due to their age.

Monitoring 2400 disks: Throwing more servers in the mix

Four months later, we needed to double our infrastructure to handle all of the new data that we were processing for our analytics.  This time, we were adding in 1200 new disks to the cluster with the same amount of issues.  The pre-production diagnostics tests only shook out 0.02% of the bad disks.

At this time, we started seeing our drive SMART check failures increase from under 1% to 1.3% a month. This was also during the Holiday App Boom, as seen here and here. We were spending too much time ferrying drives back and forth between the office and the datacenter, and started questioning our diagnostics, our urgency and response to SMART errors, and our steps to replace a drive.

Our servers have temperature indicators that we began monitoring manually, and we noticed the new servers were running around 84°F at idle, a range at which we tend to see higher hardware failure rates.  We started graphing the temperatures and noticed they increased to 89°F as we brought servers into production.  There was a lot we needed to do and not enough resources to do it, beyond bugging the NOCs to come up with strategies to bring us down to 77°F.

Monitoring 4800 disks: Finally some cooling relief

10 months later, we once again doubled our infrastructure and migrated all of our servers into a new space where we now benefit from more efficient datacenter cooling.  Where we had previously averaged 77°F, we were now running between 60°F and 65°F.  Guess what happened to our average monthly SMART errors: the rate has dropped to 0.4% since the move.  There may be several factors at play here:

  1. higher temperatures definitely seemed to contribute to higher failure rates
  2. we had a burn-in time for those first 2400 disks
  3. the load across the disks had lightened after such a large expansion

Monitoring N disks: Scripts and processes to automate our busy lives

We've also improved our process for taking servers with SMART alerts out of service by creating a script that smartd calls when there's an issue.  In order to automate this, we've allowed the smartd check to take out servers at will. By modifying smartd.conf a bit, we use the check to call our shell script, which runs a few checks and gracefully stops the Hadoop and HBase processes. This spreads the existing data on the affected disks out to healthy servers. We've also included a check to make sure the number of servers we take down does not exceed our Hadoop HDFS replication factor, which keeps us from taking down multiple replicas of the same data blocks at once. Once all is complete, the script notifies the Operations team of the tasks performed or skipped. We have open sourced this script on our Github account here so you can fork and use it yourself.
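For illustration only (the script path below is a placeholder, not the path we actually use), a smartd.conf directive along these lines tells smartd to run a script instead of sending mail when it sees a problem:

DEVICESCAN -a -m root -M exec /usr/local/bin/hadoop-smart-check.sh

When the directive fires, smartd passes details about the failing device to the script in environment variables such as SMARTD_DEVICE and SMARTD_MESSAGE, which the script can include in its notification to the Operations team.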

What about the physical disks? Instead of having an engineer go down and take out disks from each server, we plan on utilizing our Remote Hands to perform that task for us, so we can focus on handling the broader-reaching issues. There were times where we batched up disk collection and engineers would carry 30 drives back to the office (walking barefoot, uphill both ways).  

As always, we're trying to do things more efficiently.  A few improvements we have in the plans include:

  1. Having the script unmount the bad disk and bring the server back into production.
  2. Having the script email Remote Hands with the server, disk, location and issue so they can swap the bad drive.
  3. Once the disk is swapped, mounting the new drive and returning the server to production.
  4. Adapting the script to handle other hardware alerts/issues (network cards, cables, memory, mainboard).

We've learned from those grueling earlier days, and continue to make hardware management a priority.  With a small team managing a large cluster, it's important to lean on automating simple, repetitive tasks as well as utilizing the services you are already paying for.  I, for one, welcome our new robotic overlords.

]]>
tag:flurrytech.posthaven.com,2013:Post/101997 2012-08-24T02:25:00Z 2013-10-08T15:43:12Z What is Hadoop? Part I

Prologue: The Blind Men and the Elephant
As members of the Hadoop community, we here at Flurry have noticed that at many Hadoop-related gatherings, it is not atypical for about half of the attendees to be completely new to Hadoop. Perhaps someone asked them “Should we use Hadoop for this?”, and they didn’t know how to answer. Maybe they had heard of Hadoop and planned to research it, but never found the time to follow up. Out in the Hadoop community at large, some of the answers I have heard to the question “What is Hadoop?” include the comparison of Hadoop to a tape drive, and that the most important thing about Hadoop is that it is an accelerator for Moore’s Law.

I am reminded of the parable of the blind men and the elephant (which here is named Hadoop). Perhaps you have heard some variation of the story: five blind men come upon an elephant. Each touches a different part of the elephant and so describes the elephant in a manner that is partially correct yet misses the whole picture. One man touches the elephant’s leg - “An elephant is like a tree trunk” - another touches the elephant’s tail - “An elephant is like a snake”, and so on. In the end none of the men agree as they all see only parts of the whole.

Keep in mind that this is just one blind man’s description of the elephant as I answer “What is Hadoop?”

What is Hadoop?
Hadoop is a software library for distributed computing (using the map/reduce paradigm). Hence, if you have a distributed computation problem, then Hadoop may be your best solution. There are a wide variety of distributed computing problems; how do you know if Hadoop meets your needs?

While there are many possible answers to this question based on a variety of criteria, this post explores the following qualified answer: “If the problem you are trying to solve is amenable to treatment with the Map/Reduce programming model, then you should consider using Hadoop.” Our first stop, then, is the question, “What is Map/Reduce?”.

Ok, What is Map/Reduce?
Map/Reduce (or MapReduce, etc.) is a programming strategy/model for distributed computing that was first detailed in a research paper published by Google in 2004. The basic strategy is to assemble a variable number of computing nodes which run jobs in two steps. The structure of the data inputs and outputs to each step, along with the communication restrictions imposed on the steps, enable a simple programming model that is easily distributed.

The first step, Map, can be run completely independently on many nodes. A set of input data is divided up into parts (one for each node) which are mapped from the input to some results space. The second step, Reduce, involves aggregating these results from the independent processing jobs into the final result.

Classic examples of Map/Reduce style jobs include evaluating large aggregate functions and counting word frequencies in a large text corpus. In the former, data is partitioned and partial sums are calculated (Map), and then globally aggregated (Reduce). In the latter, data is partitioned and words are counted (Map), then summed (Reduce).

Map operations take as input a set of key/value pairs and produce as output a list of key/value pairs. The input and output keys and values may (and generally do) differ in type. An important fact to notice is that, at the end of the Map step, there may be values associated with a single output key living on multiple Map compute nodes. In the Reduce step, all values with the same output key are passed to the same reducer (independent of which Map node they came from). Note that this may require moving intermediate results between nodes prior to the Reduce step; however, this is the only inter-node communication step in Map/Reduce. The Reduce step then takes as input a map-output key and a list of values associated with that key, and produces as output a list of resultant values.
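In Hadoop's older org.apache.hadoop.mapred Java API (the one referenced later in this post), this structure shows up directly in the Mapper and Reducer method signatures:

void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter) throws IOException;
void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter) throws IOException;

Here K1/V1 are the input key and value types, K2/V2 the intermediate (map-output) types, and K3/V3 the final output types.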

To clarify, all Map tasks are independent of each other. All Reduce tasks are independent of each other. It is possible to have no-op tasks for either the Map or Reduce phase.

What does this have to do with Hadoop?
Hadoop provides a framework that is specially designed for Map/Reduce computing, from the data storage up. The two foundational components of this framework are the Hadoop Distributed File System and the Hadoop Map/Reduce framework. These are probably the two parts of Hadoop that most people have heard about, and most of the rest of the framework is built on top of them.

The Hadoop Distributed File System (HDFS)
HDFS is the foundation of much of the performance benefit and simplicity of Hadoop. It is arguably also the component (technically an Apache sub-project) of Hadoop that is most responsible for Hadoop’s overall association with “big data”.

HDFS allows for reliable storage of very large files across machines in a cluster comprised of commodity computing nodes. HDFS is a special-purpose file system and as such is not POSIX-compliant. For example, files, once written, cannot be modified. Files in HDFS are stored as a sequence of replicated blocks, all of which are the same size except the last one. The default block size is 64MB, but the block size and replication factor can be set at the file level.
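As a hedged illustration of that per-file control (the path and values here are arbitrary, not a recommendation), the Java FileSystem API lets a writer choose both settings when a file is created:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCreateExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Create a file with a replication factor of 3 and a 128MB block size.
    FSDataOutputStream out = fs.create(
        new Path("/data/example"),
        true,                  // overwrite if the file already exists
        4096,                  // io buffer size, in bytes
        (short) 3,             // replication factor for this file
        128L * 1024 * 1024);   // block size for this file, in bytes
    out.writeBytes("hello hdfs\n");
    out.close();
  }
}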

HDFS has a master/slave architecture. The master node is called the Namenode, and it manages all of the filesystem metadata (names, permissions, and block locations). The Namenode hosts an interface for file system operations and determines the mapping of blocks to the slaves, called Datanodes.

There can be one Datanode per node in the Hadoop cluster, and they manage local data block storage. The Datanodes serve data reads and writes directly, and they create, delete, and replicate blocks in response to commands from the Namenode.

When files have been written to HDFS, their blocks are spread over the Datanodes of the Hadoop cluster. Hadoop achieves high-performance parallel computing by taking the computation to the data on the Datanodes.

Hadoop Map/Reduce

Hadoop achieves high performance for parallel computation by dividing up Map tasks so that individual Map tasks can operate on data that is local to the node on which they are running. In simple terms, this means that the Map task runs on the same server that is running a DataNode that stores the data to be mapped. The overall number of Map tasks is determined by the number of blocks taken up by the input data.

A Map/Reduce job generally proceeds as follows:
  1. Split the input files and prepare input key/values from them.
  2. Run one Map task per input split unit. There is no communication between Map tasks.
  3. “Shuffle”: Group the output according to output keys and distribute to nodes that will run Reduce task(s). This is the only communication step.
  4. Run Reduce tasks.
  5. Write output.


Like HDFS, Hadoop Map/Reduce has a master/slave architecture. The master node is called the Jobtracker, and the slave nodes are called Tasktrackers. The Jobtracker distributes Map and Reduce tasks to Tasktrackers. Tasktrackers execute the Map and Reduce tasks and handle the communication tasks required to move data around between the Map and Reduce phases.

Specifically, for Hadoop Map/Reduce, jobs proceed as follows:
  1. Files are loaded from HDFS
  2. An implementation of the Hadoop interface InputFormat determines which files will be read for input. It breaks files into pieces which are implementations of the Hadoop interface InputSplit (typically FileSplit instances). A very large file may be split into several such pieces. A single Map task is created per split.
  3. Each Map task generates input key-value pairs by reading its FileSplit using the  RecordReader instance provided by the InputFormat.
  4. Each Map task performs computation on the input key-value pairs. There is no communication between Map tasks.
  5. Each Map task calls OutputCollector.collect to forward its results to the Reduce task(s).
  6. (Optional) Run a Combiner on the Map node - this is a lot like a Reduce step, but it only runs on the data on one node. It is an optimization.
  7. Shuffle: outputs from Map tasks are moved to Reducers. This can be user-controlled via a custom Partitioner class.
  8. Sort: the map results are sorted before passing to the Reduce step.
  9. The Reduce step is invoked. The reducer receives a key and an iterator over all map-output values associated with the key.
  10. Reduce invokes OutputCollector.collect to send final values to the output file.
  11. A RecordWriter instance outputs the results to files in HDFS.


Note here that the storage architecture of HDFS drives the parallelization of the Map/Reduce job. The alignment of the data distribution and computing distribution enables the work in the Map phase to run locally to the data, which in turn makes Hadoop perform very well for this type of work. Note further that Hadoop Map/Reduce does not require that its input files reside in HDFS, nor is HDFS only usable by Map/Reduce programs. The two combine to produce a powerful and simple model for parallel computing.

Example
Now let’s put it all together with a simple example of a Hadoop MapReduce job. Both the Map and Reduce tasks are shown below:
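(The class names and the 212.77.0.0/19 address check in this sketch are illustrative assumptions rather than the exact original code, and it uses the older org.apache.hadoop.mapred API.)

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class VaticanIpMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);

  public void map(LongWritable offset, Text line,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String address = line.toString().trim();
    String[] octets = address.split("\\.");
    if (octets.length != 4) {
      return; // skip malformed lines
    }
    try {
      int first = Integer.parseInt(octets[0]);
      int second = Integer.parseInt(octets[1]);
      int third = Integer.parseInt(octets[2]);
      Integer.parseInt(octets[3]); // validate the last octet as well
      // Illustrative range check: 212.77.0.0/19 is the block commonly listed for Vatican City.
      if (first == 212 && second == 77 && third < 32) {
        output.collect(new Text(address), ONE);
      }
    } catch (NumberFormatException e) {
      // skip lines that are not valid IPv4 addresses
    }
  }
}

class VaticanIpReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text ip, Iterator<IntWritable> counts,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    int sum = 0;
    while (counts.hasNext()) {
      sum += counts.next().get();
    }
    output.collect(ip, new IntWritable(sum));
  }
}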

Those of you familiar with the Hadoop word count example will recognize that this code is similar. This example expects input files that are lists of IP addresses, one per line. It processes all the files in the (specified) input directory in HDFS using FileInputFormat. The Map task parses each line (passed in as input values) into the four numbers of an IPv4 address, and checks to see if the address is in the major IP address block for Vatican City. If the address is in the Vatican City IP block, the Mapper outputs the IP address as the output key and 1 as the output value. The Reducer receives all of the Map output values for a given IP address in the Vatican City IP block and sums up the counts for each.

Conclusion
HDFS and Hadoop Map/Reduce combine in a supportive architecture to form a foundation for high-performance parallel computing on clusters of commodity machines. If you are trying to decide whether or not to use Hadoop for a project, asking yourself “Can we use Map/Reduce to attack this problem?” is a fruitful line of inquiry that can provide a compelling and defensible answer.

Epilogue: the Elephants and the Blind Man
Five elephants come across a blind man. The first elephant touches the man and declares, “Men are very flat”. All the other elephants agree. ]]>
tag:flurrytech.posthaven.com,2013:Post/101998 2012-08-09T16:42:00Z 2013-10-08T15:43:12Z Lyric: Linear Regression in Pure Javascript

Lyric is an Open Source implementation of Linear Regression in pure JavaScript, built on the great JavaScript matrix library Sylvester by James Coglan and the popular jQuery library. Feel free to use it to implement your own trending and prediction charts, or read more below on how it's built and how to use it. You can download or fork it on Github (Lyric) in the Flurry Github account.

What is Linear Regression?

Given a dataset including a dependent and an explanatory variable, linear regression is a statistical technique used to create a model (equation) that best represents the relationship between the variables. This model can then be used to predict additional values that will follow the same pattern. It is a technique commonly used to add trend lines to charts of time series data and in Machine Learning to predict future values from a given training set of data.

For example, given the following set of noisy timeseries data:

It might be hard to tell from this sample if the values are increasing or decreasing over time. Applying linear regression can yield a trendline to make the pattern clear, such as the following (in green):

Typically, linear regression is implemented using an optimization approach such as Gradient Descent, which starts with a rough approximation and improves its accuracy over a large number of iterations. While such an approach will optimize the model, it can be slow depending on the number of iterations required. In some cases the problem can be greatly simplified and solved in closed form using a derivation called the Normal Equation.

Lyric uses the Normal Equation, which makes it fast and efficient and should work for most applications. 

Using Lyric

First, make sure your data is represented in the form of a 2xN Array composed of elements with an 'x' and a 'y' value. The x value should be the explanatory variable and the y value the dependent variable.

Then you need to have Lyric determine the best equation to represent this data. The equation is known as the model and you can build it using the following:

Now that you have your model, you will want to apply it to a set of inputs. The newInput should be a 1xN array containing only the explanatory variable values for which you would like to calculate the dependent values. This will result in a new 2xN array which will include the resulting series. 

The following is a complete example which, given some values for the explanatory variable 1 through 5, estimates the dependent values for 6 through 8:

If you wanted to create a trend line, such as in the example above, you would simply apply the model to the same x values you provided in the input to build the model. 

For more information on using Lyric (and more advanced options) please refer to the Lyric README.

How is Lyric implemented?

Lyric implements the normal equation using a series of matrix operations implemented by Sylvester. However, before the matrix operations can be applied, the input series x (explanatory values) must be modified to represent the degree of polynomial we want to use to fit the data. To do this, we simply take the vector of x values and create a matrix where each row is the values of x raised to a given power. For example, given the power = 3 (for a 3rd degree polynomial) the output O will be of the form:

O[0] = x^0  i.e. all ones

O[1] = x^1  i.e. the same as x

O[2] = x^2  i.e. all elements of x squared

O[3] = x^3  i.e. all elements of x cubed

If you are familiar with linear algebra, you'll recognize that this represents an equation of the form:

Ax^3+Bx^2+Cx+D

Once the input is in this form, the actual matrix operations are fairly simple, following the normal equation steps. 
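For reference, this is the standard closed-form solution. Writing X for the design matrix whose rows correspond to the data points (the transpose of O as laid out above) and y for the vector of dependent values, the parameters are:

theta = (X^T * X)^-1 * X^T * y

where X^T is the transpose of X and ^-1 denotes the matrix inverse.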

The resulting theta matrix contains the values of the constants A, B, C and D from the above equation. Then, by multiplying future values of x by the same theta matrix, we can predict y values. 

Learn More

If you'd like to learn more about Linear Regression, the Machine Learning class offered by Coursera covers it in detail (as well as many other machine learning topics). 

]]>
tag:flurrytech.posthaven.com,2013:Post/101999 2012-07-12T19:35:00Z 2013-10-08T15:43:12Z Apache Avro at Flurry

Last week we released a major update to our ad serving infrastructure, which provides enhanced targeting and dynamic control of inventory. As part of this effort we decided it was time to move away from our custom binary protocol to Apache Avro for our data serialization needs. Avro provides a large feature set out of the box, however, we settled on the use of a subset of its functionality and then made a few adjustments to best suit our needs in mobile. 

Why Avro?
 
There are many good alternatives for serialization frameworks and just as many good articles that compare them to one another across several dimensions. Given that, our focus will be on what made Avro a good fit for us rather than on our research into other frameworks.

Perhaps the most critical element was platform support. Even though we are currently only applying Avro for our ad serving through iOS and Android, we wanted an encoder that provides uniform support across all of our SDKs  (Windows Phone 7, HTML5, Blackberry, and JavaME) in the event we extend its use to all services. We receive over 1.4 billion analytics sessions per day, so it was critical that we maintain a binary protocol that is small and efficient. While this binary protocol is a necessity, Avro also supports a JSON encoder. This has dual benefits for us as it can be used for our HTML5 SDK and as a testing mechanism (we’ll show later how easy it is to curl an endpoint). Lastly, Avro provides the rich data structures we need with records, enums, lists, and maps.  

Our Subset

We are using Avro-C for our iOS SDK and Avro Java for our Android SDK and server. As mentioned earlier Avro has a large feature set. The first thing we had to decide was what components made sense for our environment. All language implementations of Avro support core, which includes Schema parsing from JSON and the ability to read/write binary data. The ability to write data files, compress information, and send it over embedded RPC/HTTP channels is not as uniform. Our SDKs have always stored information separately from the transport mechanism and likewise our server parses that data and stores in a format efficient for backend processing. Therefore, we did not have the need to write data files.  In addition to this we have a sophisticated architecture for the reliable transport of data and using Avro’s embedded RPC mechanisms would affect our deployment and failover strategy.  Given our environment we integrated only core from Avro.

Integrating core provided us a great deal of flexibility in how to use this efficient data serialization protocol. One of the diversions we made was in our treatment of schemas. Schemas are defined on the Avro site as:

Avro relies on schemas. When Avro data is read, the schema used when writing it is always present. This permits each datum to be written with no per-value overheads, making serialization both fast and small. This also facilitates use with dynamic, scripting languages, since data, together with its schema, is fully self-describing.  

We maintain the invariant that the schema is always present when reading Avro data, however, unlike Avro’s RPC mechanism we never transport the schema with the data. While the Avro docs note this can be optimized to reduce the need to transport the schema in a handshake, such optimization could prove difficult and would likely not provide much added benefit for us. In mobile, we deal with many small requests from millions of devices that have transient connections. Fast, reliable transmissions with as little overhead as possible are our driving factors.
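As a small sketch of what core-only use looks like in Java (the Ping schema and its message field are invented for illustration, not one of our ad schemas), both sides hold the schema and only the binary datum crosses the wire:

import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroCoreExample {
  // Both writer and reader compile this schema in; it is never sent with the data.
  private static final String SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"Ping\",\"fields\":"
      + "[{\"name\":\"message\",\"type\":\"string\"}]}";

  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

    // Writer side: encode a datum to bytes with no schema or field names attached.
    GenericRecord datum = new GenericData.Record(schema);
    datum.put("message", "hello");
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(schema).write(datum, encoder);
    encoder.flush();

    // Reader side: decode the bytes using its own copy of the same schema.
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    GenericRecord decoded = new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
    System.out.println(decoded.get("message"));
  }
}

The writer's output here contains just the encoded field values; nothing in the bytes identifies the schema, so the reader must already know it.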

As an illustration of how this makes a large difference at scale, our AdRequest schema is roughly 1.5K while the binary-encoded datum is approximately 230 bytes on any request. At 50 million requests per day (we are actually well above that figure), shipping the schema with each request would add about 70GB of daily traffic across our network.

To meet our network goals we version our endpoints to correspond to the protocol. We feel it is more transparent and maintainable to have a one-to-one schema mapping instead of relying on Avro to resolve the differences between a varied reader and writer schema. The result is a single call, with no additional communication for schema negotiation, that transports only the datum with no per-value overhead of field identifiers.

Mobile Matters

As a 3rd party library on mobile we had some additional areas we needed to address. First, on Android, we needed to maintain support for Android 1.5+. Avro uses a couple of methods that were not available until Android 2.3, namely Arrays.copyOf() and String.getBytes(Charset). Those were easily changed to use System.arraycopy() and String.getBytes(String), but it did require us to build from source and repackage.

On iOS, we need to avoid naming conflicts in case someone else integrates Avro in their library. We are able to do this by stripping the symbols. See this guide from Apple on that process.  On Android, we use ProGuard, which solves conflict issues through obfuscation. Alternatively, you can use the jarjar tool to change your package naming.

Sample Set Up

We’ve talked a lot about our Avro set up so now we’ll provide a sample of its use. These are just highlights, but we have made the full source available on our GitHub account at: https://github.com/flurry/avro-mobile. This repo includes a sample Java server, iOS client, and Android client.

1. Set up your schema. We use a protocol file (avpr) vs a schema file (avsc) simply because we can define all schemas for our protocol in a single file. The protocol file is typically used with Avro’s embedded RPC mechanisms, but it doesn’t have to be used exclusively in that manner.  Here are our sample schemas outlining a simple AdRequest and AdResponse:

AdRequest:

AdResponse:
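The real schema files live in the avro-mobile repo linked above; purely to illustrate the avpr format (the field names below are invented, not the actual protocol), a protocol file defining both records looks something like this:

{
  "protocol": "AdServing",
  "namespace": "com.example.ads",
  "types": [
    {"type": "record", "name": "AdRequest", "fields": [
      {"name": "apiKey", "type": "string"},
      {"name": "adSpaceName", "type": "string"}
    ]},
    {"type": "record", "name": "AdResponse", "fields": [
      {"name": "adSpaceName", "type": "string"},
      {"name": "adUrl", "type": ["null", "string"]}
    ]}
  ],
  "messages": {}
}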

2. Pre-compile the schemas into Java objects with the avro-tools jar. We have included an ant build file that will perform this action for you.

3. Build our sample self contained Avro Jetty server. The build file puts all necessary libs in a single jar (flurry-avro-server-1.0.jar) for ease of running. Run this from the command line with one argument for the port:
java -jar flurry-avro-server-1.0.jar 8080

4. Test. One of the things that distinguishes Avro from other serialization protocols is its support for dynamic typing. Since no code generation is required, we can construct a simple curl to verify our server implements the protocol as expected. This has proven useful for us since we have an independent verification measure for any issue that occurs in communication between our mobile clients and servers. We simply construct the identical request in JSON, send to our servers via curl, and hope we can blame the server guys for the issue. This also allows anyone working on the server to simulate requests without needing to instantiate mobile clients.

Sample Curl:

Note: To simulate an error, type the AdSpace Name as “throwError”.

5. Setting up the iOS app takes a little more work than on the Java side simply because Avro-C does not support static compilation. The easiest way we have found to define the schemas is to go to the Java generated schema file and copy the JSON text from the SCHEMA$ definition:

Paste that text as a definition into your header file, then get the avro schema from the provided method avro_schema_from_json. From that point write/read from the fields in the schema using the provided Avro-C methods as done in AdNetworking.m.

In the sample app provided, run AvroClient > Sim or Device and send your ad request to your server (defaults to calling http://localhost:8080). To simulate an error, type the AdSpace Name as “throwError”.

6. Since Android is based on Java, moving from the Avro Java Server to Android is a simple process. The same avpr and buildProtocol.xml files can be used to generate the Java objects. Constructing objects is equally easy, as seen here:

Conclusion

We’re excited about the use of Apache Avro on our new Ad Serving platform. It provides consistency through structured schemas, but also allows for independent testing due to its support for dynamic typing and JSON. Our protocol is much easier to understand internally since the Avpr file is standard JSON and attaches data types to our fields. Once we understood the initial setup across all platforms it became very easy to iterate over new protocol versions. We would recommend anyone looking to simplify communication over multiple environments while maintaining high performance give Avro strong consideration.

 

]]>
tag:flurrytech.posthaven.com,2013:Post/102000 2012-06-21T18:23:04Z 2013-10-08T15:43:12Z Flurry APIs now support CORS

We recently upgraded all of our REST APIs to support Cross-Origin Resource Sharing (CORS). If you're not familiar with CORS, this means that you can now access Flurry APIs via Javascript on web pages other than flurry.com. Get started now or read on to learn more about CORS.

Why do we need CORS?

Modern Browsers implement something called a same origin policy when processing and rendering the data associated with a website. This policy says that the web page can only load resources that come from the same host (origin) as the web page itself. For example, if you load flurry.com your browser will only let the Javascript in the page load resources from flurry.com. 

Why is a same origin policy important? Cross-site scripting attacks, which take advantage of cross-origin interactions, are one of the more common methods of stealing personal information these days. A basic cross-site scripting attack would show a user a webpage that looks just like the login page for your website. When the user enters their credentials into the malicious website, believing it to be yours, the malicious site takes the credentials and uses JavaScript to log them into your website using AJAX. After being logged in through JavaScript, the attacker can steal data, manipulate the account and change the password - all using AJAX behind the scenes without the user being aware.

Such attacks appear to the attacked service as legitimate traffic since they originate from a normal computer browser - complete with the cookies you have set. By preventing access to resources not hosted on the origin, and hence preventing AJAX from reaching another host, the browser is protecting you from this kind of attack. 

However, with the rise of HTML5, more and more web content is loaded dynamically through javascript and rendered in the browser. There are now very legitimate uses for cross-origin resource access in javascript, including widgets, applications and content management.

What is CORS?

CORS bridges the gap between security and flexibility by allowing a host to specify which resources are available from non-origin domains. This allows you to make REST APIs available for access from other domains in the browser, but not your login page. 

Adding CORS support is as simple as adding an extra HTTP response header that specifies what origins can access a given resource. To allow any domain to access a resource, you would include the following HTTP header in responses to requests for that resource:

Access-Control-Allow-Origin: *

Or, to only allow access from Flurry's website domain you would use the following:

Access-Control-Allow-Origin: http://flurry.com
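On the server side this just means adding the header to API responses. A minimal sketch of how that might be done in a Java servlet filter (not Flurry's actual implementation; the allowed origin is read from an init parameter):

import java.io.IOException;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;

// Adds the CORS header to every response passing through the filter.
public class CorsFilter implements Filter {
  private String allowedOrigin = "*";

  public void init(FilterConfig config) throws ServletException {
    String configured = config.getInitParameter("allowedOrigin");
    if (configured != null) {
      allowedOrigin = configured;
    }
  }

  public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
      throws IOException, ServletException {
    if (response instanceof HttpServletResponse) {
      ((HttpServletResponse) response).setHeader("Access-Control-Allow-Origin", allowedOrigin);
    }
    chain.doFilter(request, response);
  }

  public void destroy() {
  }
}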

Note that since the CORS header is in the response of the HTTP request, the request has already been made before your browser evaluates whether to allow access to the result. It's important to keep that in mind since even if the browser detects a CORS violation, the request will have already been processed on your servers. 

Not all browsers support CORS right now but most modern browsers do. You can read more on the CORS Wikipedia page.

]]>
tag:flurrytech.posthaven.com,2013:Post/102001 2012-06-12T16:46:00Z 2013-10-08T15:43:12Z The Delicate Art of Organizing Data in HBase

Here at Flurry we make extensive use of HBase, a distributed column-store database system built on top of Apache Hadoop and HDFS. For those used to relational database systems, HBase may seem quite limited in the ways you can query data. Rows in HBase can have unlimited columns, but can only be accessed by a single row key. This is kind of like declaring a primary key on a SQL table, except you can only query the table by this key. So if you wish to look up a row by more than one column, as in SQL where you would perform a select * from table where col1 = val1 and col2 = val2, you would have to read the entire table and then filter it. As you can probably tell, reading an entire table can be an extremely costly operation, especially on the data sizes that HBase usually handles. One of the solutions to this is to use a composite key, combining multiple pieces of data into a single row key.

What’s a composite key?

HBase uses basic byte arrays to store its data, in both row keys and columns. However it would be tedious to try and manipulate those in our code, so we store our keys in a container class that knows how to serialize itself. For example:
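A minimal sketch of such a container class (the field names and types are illustrative, using an int userId and a String applicationId):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Composite row key: userId first, then applicationId.
public class UserAppKey {
  private final int userId;
  private final String applicationId;

  public UserAppKey(int userId, String applicationId) {
    this.userId = userId;
    this.applicationId = applicationId;
  }

  // Serialize the key fields, in order, into the byte array HBase will sort on.
  public byte[] toBytes() throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    out.writeInt(userId);          // assumes userId >= 0 (see the gotchas below)
    out.writeUTF(applicationId);   // 2-byte length prefix, then the characters
    out.flush();
    return bytes.toByteArray();
  }
}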

Here we have two fields forming a single key with a method to serialize it to a byte array. Now any time we want to look up or save a row, we can use this class to do so. 

While this allows us to instantly get a row if we know both the userId and applicationId, what if we want to look up multiple rows using just a subset of the fields in our row key? Because HBase stores rows sorted lexicographically by comparing the key byte arrays, we can perform a scan to effectively select on one of our index fields instead of requiring both. Using the above index as an example, if we wanted to get all rows with a specific userId:
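A hedged sketch using the pre-1.0 HBase client API, reusing the hypothetical UserAppKey class from the sketch above:

import java.io.IOException;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class UserScans {
  // Scan every row whose key begins with the given userId.
  public static void scanByUser(HTable table, int userId) throws IOException {
    byte[] startRow = new UserAppKey(userId, "").toBytes();      // (userId, "")
    byte[] stopRow = new UserAppKey(userId + 1, "").toBytes();   // first row of the next userId
    Scan scan = new Scan(startRow, stopRow);
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result row : scanner) {
        // process each matching row here
      }
    } finally {
      scanner.close();
    }
  }
}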

Since we’ve set the start and stop rows of the Scanner, HBase can give us all the rows with the specified userId without having to read the whole table. However there is a limitation here. Since we write out the userId first and the applicationId second, our data is organized such that all rows with the same userId are adjacent to each other, and rows with the same applicationId but different userIds are not. Thus if we want to query by just the applicationId in this example, we need to scan the entire table. 

There are two “gotchas” to this serialization approach that make it work as we expect. First, we assume here that userId is greater than 0. The binary representation of a negative 2’s complement integer would be lexicographically after the largest positive number. So if we intend to have negative userIds, we would need to change our serialization method to preserve ordering. Secondly, DataOutput.writeUTF specifically serializes a string by first writing a short (2 bytes) for its length, then all the characters in the string. Using this serialization method, the empty string is naturally first lexicographically. If we serialized it using a different method, such that the empty string were not first, then our scan would stop somewhere in the next userId. 

As you can see, just putting our index data in our row key is not enough. The serialization format of our components determines how well we can exploit the properties of HBase lexicographic key ordering, and the order of the fields in our serialization determines what queries we are able to make. Deciding what order to write our row key components in needs to be heuristically driven by the reads we need to perform quickly. 

How to organize your key

The primary limitation of composite keys is that you can only query efficiently by known components of the composite key in the order they are serialized. Because of this limitation I find it easiest to think of your key like a funnel. Start with the piece of data you always need to partition on, and narrow it down to the more specific data that you don’t often need to distinguish. Using the above example, if we almost always partition our data by userId, putting it in our index first is a good idea. That way we can easily select all the applicationIds for a userId, and also select a specific applicationId for a userId when we need to. However, if we are often looking up data by the applicationId for all users, then we would probably want to put the applicationId first.

As a caveat to this process, keep in mind that HBase partitions its data across region servers based on the same lexicographic ordering that gets us the behavior we’re exploiting. If your reads/writes are heavily concentrated into a few values for the first (or first few) components of your key, you will end up with poorly distributed load across region servers. HBase functions best when the distribution of reads/writes is uniform across all potential row key values. While a perfectly uniform distribution might be impossible, this should still be a consideration when constructing a composite key.

When your key gets too complex

If your queries go beyond the simple funnel model, then it’s time to consider adding another table. Those used to heavily normalized relational databases will instinctively shy away from repeating data in multiple tables, but HBase is designed to handle large amounts of data, so we need to make use of that to overcome the limitations of a single index. Adding another table that stores the same cells with a differently organized key can reduce your need to perform full table scans, which are extremely expensive time-wise, at the cost of space, which is significantly cheaper if you’re running at Hadoop scale. Using the above example, if we routinely needed to query by just userId or applicationId, we would probably want to create a second table. Assuming we still want to be able to query by both userId and applicationId, the field we use as a key for the second table would depend on the distribution of the relationship of user to application and vice versa. If a user has more applications than an application has users, then that would mean scanning a composite key of (userId, applicationId) would take longer than scanning it in (applicationId, userId) order, and vice versa.

The downside to this approach is the added complexity of ensuring that the data in both of your tables is the same. You have to ensure that whenever you write data to one table, you write it to both tables simultaneously.  It helps to have all of your HBase reading and writing encapsulated so that individual producers or consumers are not accessing the HBase client code directly, and the use of multiple tables to represent a single HBase-backed entity is opaque to its users.

If you’ve got a lot of rows

Sometimes storing large parts of your data in the key can be hazardous. Scanning multiple rows is usually more costly than reading a single row, even if that single row has many columns. So if performing a scan on the first part of your composite key often returns many rows, then you might be better off reorganizing your table to convert some of the parts of your composite key to column names. The tradeoff there is that HBase reads entire rows at a time, so while you can instruct the HBase client to only return certain columns, this will not speed up your queries. In order to do that, you need to use column families. 

Getting it right the first time

Finally, always take extreme care to pick the right row key structure for your current and future querying needs. While most developers are probably used to throwing out most of the code they write, and refactoring many times as necessary for evolving business requirements, changing the structure of an HBase table is very costly, and should be avoided at all costs. Unlike with most relational databases, where adding a new index is a simple operation with a mild performance hit, changing an HBase table’s structure requires a full rewrite of all the data in the table. Operating on the scale that you’d normally use Hadoop and HBase for, this can be extremely time consuming. Furthermore, if you can’t afford your table to be offline during the process, you need to handle the migration process while still allowing reads from the old and new tables, and writes to the new table, all while maintaining data integrity.

HBase or not HBase

All of this can make using HBase intimidating to first time users. Don't be afraid! These techniques are common to most NoSQL systems which are the future of large scale data storage. Mastering this new world allows you to unlock a massively powerful system for large data sets and perform analysis never possible before. While you need to spend more time up front on your data schema, you gain the ability to easily work with Petabytes of data and tables containing hundreds of Billions of rows. That's big data. We'll be writing more about HBase in the future, let us know if there's something you'd like us to cover. 

Here at Flurry, we are constantly working to evolve our big data handling strategies. If that sounds interesting, we are hiring engineers! Please check out http://flurry.com/jobs for more information.

]]>
tag:flurrytech.posthaven.com,2013:Post/102002 2012-06-04T19:32:00Z 2013-10-08T15:43:12Z Scaling @ Flurry: Measure Twice, Plan Once

Working in web operations can be quite exciting when you get paged in the middle of the night to debug a problem, then burn away the night hours formulating a solution on the fly using only the resources you have at hand. It’s thrilling to make that sort of heroic save, but the business flows much more smoothly when you can prepare for the problems before they even exist. At Flurry we watch our systems closely, find patterns in traffic and systems’ behavioral changes, and generally put solutions in place before we encounter any capacity-related problems.

A good example of this sort of capacity planning took place late last year. During the Christmas season Flurry usually sees an increase in traffic as people fire up their new mobile phones, so we knew in advance of December 2011 that we’d need to accommodate a jump in traffic—but how much?  Fortunately, we had retained bandwidth and session data from earlier years, so we were able to estimate our maximum bandwidth draw based on the increases we experienced previously, and our estimate was within 5% of our actual maximum throughput. There are probably some variables we still need to account for in our model, but we were able to get sufficient resources in place to make it through the holiday season without any serious problems. Having worked in online retail, I can tell you that not getting paged over a holiday weekend is something wonderful.

Doubling of outbound bandwidth from Nov-Dec 2011, Dec 24th yellow

Outside of annual events, we also keep a constant eye on our daily peak traffic rates. For example, we watch bandwidth to ensure we aren’t hitting limits on any networking choke points, and requests-per-second is a valuable metric since it helps us determine scaling requirements like overall disk capacity (each request taking up a small amount of space in our data stores) and the processing throughput our systems can achieve overall. The overall throughput includes elements like networking devices (switches, routers, load balancers) or CPU time spent handling and analyzing the incoming data.

Other metrics we watch on a regular basis include disk utilization, number of incoming requests, latency for various operations (time to complete service requests, but also metrics like software deployment speed, mapreduce job runtimes, etc.), CPU, memory, and bandwidth utilization, as well as application metrics for services like MySQL, nginx, haproxy, and Flurry-specific application metrics. Taken altogether, these measurements allow us to gauge the overall health and trends of our systems’ traffic patterns, from which we can then extrapolate when certain pieces of the system might be nearing capacity limits.

Changes in traffic aren’t the only source of capacity flux, though—because the Flurry software is a continually-changing creature, Flurry operations regularly coordinates with the development teams regarding upcoming changes that might increase database connections, add time spent processing each incoming request, or have other similar effects. Working closely with our developers also allows Flurry to achieve operational improvements like bandwidth offloading by integrating content delivery networks into our data delivery mechanisms.

One area I think we could improve is in understanding what our various services are capable of when going all-out. We’ve done some one-off load tests to get an idea of upper limits for requests per second per server, and have used that research as a baseline for rough determinations of hardware requirements, but the changing nature of our software makes that a moving target. Getting more automated capacity tests would be handy in both planning hardware requirements and for potentially surfacing performance-impacting software changes.

Overall, though, I think we’ve done pretty well. Successful capacity planning doesn’t prevent every problem, but paying attention to our significant metrics allows us to grow our infrastructure to meet upcoming demand, saving our urgent-problem-solving resources for the truly emergent behavior instead of scrambling to keep up with predictable capacity issues.

]]>
tag:flurrytech.posthaven.com,2013:Post/102003 2012-05-17T06:02:00Z 2013-10-08T15:43:12Z Using the Java ClassLoader to write efficient, modular applications

Java is an established platform for building applications that run in a server context. Writing fast, efficient, and stable code requires effective use of algorithms, data structures, and memory management, all of which are well supported and documented by the Java developer community. However, some applications need to leverage a core feature of the JVM whose nuances are not as accessible: the Java ClassLoader.

How Does Class Loading Work?

When a class is first needed, either through access to a static field or method or a call to a constructor, the JVM attempts to load its Class instance from the ClassLoader that loaded the referencing Class instance (see note 1). ClassLoaders can be chained together hierarchically, and the default strategy is to delegate to the parent ClassLoader before attempting to load the class from the child. After being loaded, the Class instance is tracked by the ClassLoader that loaded it; it persists in memory for the life of the loader. Any attempt to load the same class again from the same loader or its children will produce the same Class instance; however, attempts to load the class from another ClassLoader (that does not delegate to the first one) can produce a second instance of a Class with the same fully qualified name. This has the potential to cause confusing ClassCastExceptions, as shown below.

Why Not Use The Default ClassLoader?

The default ClassLoader created by the Java launcher is usually sufficient for applications that are either short-lived or have a relatively small, static set of classes needed at runtime (see note 2). Applications with a large or dynamic set of dependencies can, over time, fill up the memory space allocated for Class instances – the so-called “permanent generation.” This manifests as OutOfMemory errors in the PermGen space. Increasing the PermGen allocation may work temporarily; leaking Class memory will eventually require a restart. Fortunately, there are ways to solve this problem.

Class Unloading: Managing The Not-So-Permanent Generation

The first step to using Class memory efficiently is a modular application design. This should be familiar to anyone who has investigated object memory leaks on the heap. With heap memory, object references must be partitioned so the garbage collector can free clusters of inter-related objects when they are no longer referenced from the rest of the application. Similarly, Class memory in the so-called “permanent” generation can also be reclaimed by the garbage collector when it finds clusters of inter-related Class instances with no incoming Class references (see note 3).

To demonstrate, let's consider two Java projects with one class each: a container, and a module (see note 4).

For extremely modular applications where no communication is required between the container and its modules, there is an apparently easy solution: load the module in a new ClassLoader, release references to the ClassLoader when the module is no longer needed, and let the garbage collector do its thing. The following test demonstrates this approach:
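A rough sketch of that approach (the jar path and class name are placeholders, and a real test would assert on collection rather than print it):

import java.io.File;
import java.lang.ref.WeakReference;
import java.net.URL;
import java.net.URLClassLoader;

public class ModuleLifecycleSketch {
  public static void main(String[] args) throws Exception {
    URL moduleJar = new File("module.jar").toURI().toURL();

    // Load the module's classes in their own loader, isolated from the container.
    URLClassLoader moduleLoader = new URLClassLoader(new URL[] { moduleJar });
    Class<?> moduleClass = moduleLoader.loadClass("com.example.module.Module");
    Object module = moduleClass.newInstance();
    // ... use the module via reflection ...

    // Drop every reference so the loader and its Class instances can be collected.
    WeakReference<ClassLoader> loaderRef = new WeakReference<ClassLoader>(moduleLoader);
    module = null;
    moduleClass = null;
    moduleLoader = null;

    System.gc();
    System.out.println("module loader collected: " + (loaderRef.get() == null));
  }
}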

Success!

When Third Party Code Attacks

So you've been diligent about modularizing your own code, and each module runs in its own ClassLoader with no communication with the container. How could you have a Class memory leak? The answer could lie in third party code used by both the container and module:

Loading a module in its own ClassLoader is not enough to prevent Class memory leaks when using a ClassLoader with the default delegation strategy. In this case, the module's ResourceLibrary Class instance is the same as the Container's, so the ResourceLibrary's HashMap holds a reference to the module's Class instance – which references its ClassLoader, which references all other Class instances in the module. The following test demonstrates the problem, and a possible solution:

The result:

Although the test with the default ClassLoader fails due to a memory leak, the test with a stand-alone ClassLoader succeeds. Creating a ClassLoader with no parent forces it to load all Class instances itself – even for classes already loaded by another ClassLoader. The (leaky) ResourceLibrary Class instance referenced by the module is different from the one used by the container, so it gets garbage collected when its ClassLoader is released by the container – along with the rest of the module's Class instances. This fixes the Class memory leak; but what happens if you need some communication between the container and the module?

The stand-alone ClassLoader approach won't work now, because the IModule Class instance loaded in the container is different from the IModule Class instance loaded by the module. The result is a confusing ClassCastException when casting a Module to an IModule, as seen in this test:

The result:

This test introduces two new strategies for loading classes: post-delegation and conditional delegation. Post-delegation simply loads classes from the child first, then from the parent if not found. Unfortunately, that doesn't work if the module's classpath includes any of the classes shared with the container, as the test shows. This makes configuring the classpath a chore, especially if shared classes have dependencies on utility classes that both the container and module require. In this test, conditional delegation works best: it allows shared classes to be loaded from the parent ClassLoader, but sandboxes all other classes (like potentially leaky third-party code) to the child ClassLoader.
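A hedged sketch of a conditional-delegation loader (the package prefixes and fallback behavior are illustrative, not the exact code from the examples): classes under the shared API packages go to the parent, everything else is loaded locally so it can be unloaded with the module.

import java.net.URL;
import java.net.URLClassLoader;

// Delegates to the parent only for shared classes; all other classes
// (including potentially leaky third-party code) live and die with this loader.
public class ConditionalDelegationClassLoader extends URLClassLoader {
  private final String[] sharedPrefixes;

  public ConditionalDelegationClassLoader(URL[] urls, ClassLoader parent, String... sharedPrefixes) {
    super(urls, parent);
    this.sharedPrefixes = sharedPrefixes;
  }

  @Override
  protected synchronized Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
    for (String prefix : sharedPrefixes) {
      if (name.startsWith(prefix)) {
        return super.loadClass(name, resolve);   // normal parent-first delegation for shared types
      }
    }
    Class<?> loaded = findLoadedClass(name);
    if (loaded == null) {
      try {
        loaded = findClass(name);                // sandbox everything else in this loader
      } catch (ClassNotFoundException e) {
        loaded = super.loadClass(name, resolve); // fall back for JDK and other parent-only classes
      }
    }
    if (resolve) {
      resolveClass(loaded);
    }
    return loaded;
  }
}

A module loader built this way might be constructed as new ConditionalDelegationClassLoader(moduleUrls, containerLoader, "com.example.shared.") so that only the IModule-style interfaces are resolved by the container's loader.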

Conclusion

When using custom ClassLoaders in a Java application, there are many pitfalls to watch out for.  Dealing with ClassCastExceptions and ClassNotFoundExceptions can be frustrating if the overall design is not well planned.  Even worse, memory leaks and PermGen errors can be very difficult to reproduce and fix, especially if they only happen in long-running production systems.  Although there's not one right answer to using custom ClassLoaders, the techniques here can address most of the issues one might encounter (see note 5).

Creating a clear design and strategy for loading classes is not easy, but the reward is efficient, stable, robust server code.  The time spent planning is a small price for the time saved debugging, looking through memory dumps, and dealing with outages.  Please check out my examples on GitHub, find other ways to leak PermGen space, and post comments!

Note 1: in this article, “class” refers to a Java type; the term “Class instance” refers to a runtime object that models a class and is an instance of the type java.lang.Class. The difference is important, as one class can have many Class instances at runtime, if the application uses multiple ClassLoaders.

Note 2: There are other reasons to use custom ClassLoaders. Some common uses are in applications that dynamically update their code without restarting, applications that load classes from non-standard sources like encrypted jars, and applications that need to sandbox untrusted 3rd party code. This article focuses on a more general scenario, but the lessons are applicable to those cases as well.

Note 3: PermGen Garbage collection was a bit tricky in older versions of the JRE. By default it was never reclaimed (hence the name). Sun used to provide an alternate garbage collection strategy (concurrent mark-sweep) that could be configured to reclaim PermGen space; support from other vendors varied. However, in recent versions of Java, PermGen collection works quite well using the default configuration.

Note 4: the complete source code for these examples is posted in the public git repository here.

Note 5: for more in-depth analysis on how to find and fix PermGen leaks, see Frank Kieviet's excellent blog series on the topic: here, here, and here.

]]>
tag:flurrytech.posthaven.com,2013:Post/102004 2012-04-19T22:01:00Z 2013-10-08T15:43:12Z Modular Libraries in iOS
The Flurry SDK uses a dynamically-linked modular library design which breaks up features into different archives. This helps developers with large applications make sure that their application size is as small as possible by avoiding unnecessary libraries. We frequently get asked about how to implement dynamically-linked modular libraries in iOS so here is how it works.

Why modular libraries?

The Flurry SDK has grown considerably since its first release. New analytics and new advertising products were introduced with AppCircle and AppCircle Clips. Keeping developers in mind, we want to provide you with the functionality necessary to accomplish your goals without requiring you to carry the weight of functionality that would go unused. For example, some developers use Analytics while some use Analytics and AppCircle. Others may use Analytics, AppCircle, and AppCircle Clips. The solution was to partition the existing library into smaller modular libraries that could be layered onto a core Analytics library.

Analytics is our core SDK library and it is the library responsible for communicating session data and ad data between our servers and apps. A request/response action takes place upon start and optionally upon pause or close of an app. Each request reports session data and the response contains a shared buffer filled with ad data which is relayed to each enabled ad library. Our servers currently receive 25,000 requests per second and growing, so this communication design decision (through a single communications point in the Analytics library) was made to keep traffic between our servers and apps efficient. This also creates asymmetric library dependencies since, for instance, Analytics can operate with or without AppCircle but AppCircle cannot operate without Analytics.

How modularization was accomplished

To use AppCircle, the AppCircle library must be enabled.

The call above serves two purposes. First, it tells Analytics that AppCircle is being used, so Analytics will request ad data on behalf of AppCircle. Second, it provides Analytics with a delegate, giving it access into the AppCircle library. The delegate is used as a callback to relay the ad data buffer into AppCircle to create the ads it provides through its API.

In Analytics, we maintain the delegate callback into AppCircle. The delegate is purposely declared as an NSObject. It is set by AppCircle's setAppCircleEnabled method, described above.

Then in Analytics, the delegate to AppCircle is tested and its callback is invoked with the ad data buffer as shown below. If AppCircle is not enabled then this collaboration with AppCircle is skipped altogether.

Declaring the delegate as NSObject and testing for the callback selector avoids any hard dependency to AppCircle, allowing Analytics to operate with or without AppCircle present. This pattern is also used for AppCircle Clips and future modular libraries in the works.

Modularizing our libraries significantly reduced the size of the Analytics library from a previous 6MB file to 1.7MB currently, and allows layering any or all of our ad libraries on top of the Analytics library, so developers can optimize the amount of Flurry library code they are linking into their applications. And because server communication remains the responsibility of Analytics during start, pause, or close of an app, our service can continue to perform well as traffic scales upward.
]]>
tag:flurrytech.posthaven.com,2013:Post/102005 2012-04-17T20:33:00Z 2013-10-08T15:43:12Z Flurry, by the numbers

Welcome to the Flurry Technology Blog!

We here at Flurry are lucky to be at the center of the mobile application revolution. Flurry Analytics is currently used by over 65,000 developers across over 170,000 mobile applications and tracks over 1.2 billion application sessions a day. Flurry AppCircle, our application recommendation network,  currently reaches over 300 million users every month. We spend our days working with the developers of apps and games you probably use everyday. 

Up until now we haven’t shared much about how we have built our platform to handle the vast scale and amazing rate of growth seen by our products and the mobile application ecosystem. Let’s start with some statistics about the platform:

  • > 26,000 requests per second
  • 2 TB of new data processed per day
  • > 20 billion custom events tracked per day

Along the way we’ve had to find solutions to problems that many companies that are growing quickly have to face. This blog is our way of sharing what we’ve learned and opening a conversation with you about how to build Big Data services and mobile applications.

This blog will be technical, focusing on optimizing linux servers, efficiently writing Map Reduce jobs and advanced areas of  mobile platforms like iOS. If you’re interested in learning more about the mobile application market in general, please visit the Flurry Blog where we present market research and examine trends across the entire network of Flurry products.

We’ll be writing regular blog posts in the coming months. If there is something that you’d like to learn about or something that you think we can do better just let us know by posting here in the comments.

Enjoy. 

 

]]>