Scaling Build and Deployment Systems

Flurry is a rapidly growing company in every sense of the word: customers, teams, and applications. We have 650 million unique users per month from mobile applications that embed our SDKs, which generate over 1.6 billion transactions per day. Our HBase cluster of over 1,000 nodes holds many petabytes of data and is growing rapidly.

While Flurry's growth is great news for our customers, employees, and investors, it creates challenges for Release Management: we have to support a growing number of applications, developers, and development and test environments, and we need to manage all of that well, quickly, and reliably.

In short, we need continuous integration to rapidly build, deploy, and test our applications.

Deployment Tools @ Flurry

To support our continuous integration, we set up three core tools.

  • Source Control: We use Github Enterprise to manage our source code and various configuration files. We use a variation of the Git Flow development process where all features are developed on individual branches and every merge is in the form of a pull request (which is also code reviewed); a rough sketch of the branch-and-pull-request flow is shown below.
  • Continuous Build: We use Jenkins to build code, deploy to our QA and test environments, and run our JUnit tests. Jenkins is set up to automatically run JUnit tests when developers check in code and when they create pull requests for their branches. Jenkins also runs Selenium tests with SauceLabs every night against the QA environments and Production.
  • Task Tracking: We use Jira (with the Greenhopper agile plugin) for ticket management, planning, and tracking enhancements and bug fixes.

All three tools are well integrated through various plug-ins that allow them to share information and trigger actions.
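To make the branching model concrete, here is a rough sketch of how a feature typically moves from a local branch to a pull request under our Git Flow variant; the branch names and Jira key are illustrative, not a prescribed workflow.

```bash
# Rough sketch of our Git Flow variant; branch names and ticket key are illustrative.
git checkout develop
git pull origin develop

# Each feature gets its own branch.
git checkout -b feature/FLURRY-1234-new-report

# ...do the work, then commit...
git add -A
git commit -m "Add new report endpoint (FLURRY-1234)"

# Push the branch and open a pull request against develop on Github Enterprise;
# the merge happens only after code review and a green Jenkins run.
git push -u origin feature/FLURRY-1234-new-report
```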

Challenges at Scale
 
Our setup for continuous integration has served us well but has some challenges.

  • Too Many Jobs: We have more than 50 Jenkins jobs.  We have over 130 deployment scripts and more than 1,600 configuration files for the CI tools and applications.  Each new application and each new QA environment adds to the pile.  While we are whizzes at writing bash scripts and Java programs, this is clearly not scalable in the long term.
  • Slow Deployments: For security reasons, our Jenkins server cannot deploy war files and config files directly to Production servers.  For Production deployments, we run a Jenkins job that copies the files built by other jobs to a server in the data center over a locked-down, one-way secure tunnel.  Ops staff then manually runs various scripts to push the files to the Production servers and restart them.  This is inefficient in terms of time and people resources.
  • Test Overrun: Our JUnit test suite has over 1,000 test cases, which take about an hour to run.  With the increase in the number of developers, the test runs triggered by their pull requests are clogging the Jenkins build server. We release to Production biweekly and would like to cut that down to daily, or at least every few days, but the build, deploy, test, analyze, and fix cycle is too long to allow it.

Improving the Process: Design for Speed

The speed of an engineering team is directly related to the speed of its release and deployment process, so we needed to get faster. We have taken a number of steps to address these challenges.

  • We optimized our JUnit test cases by removing unnecessary sleep statements and by stubbing out deployments to our test CDN, which reduces network wait time.
  • We upgraded the build system to bigger, faster hardware and parallelized the JUnit test runs so that we can run multiple test jobs at the same time (a rough sketch of the chunked runs follows below).
  • We added dedicated Jenkins slave servers that can share the burden during times of heavy parallel building. 

Overall we have reduced the time to run the entire test suite to 15 minutes.
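To illustrate the parallel test runs mentioned above, here is a minimal sketch that splits the suite round-robin into a few chunks and runs each chunk in its own checkout so parallel builds do not clobber shared files. The chunk count, paths, and Maven command are assumptions, not our actual Jenkins wiring.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: run the JUnit suite as N parallel chunks.
# Assumes a Maven layout (src/test/java) and that -Dtest accepts a
# comma-separated class list; our real jobs are driven by Jenkins.
set -euo pipefail

CHUNKS=${1:-4}
REPO_URL=${2:-.}   # source to clone; "." clones the current checkout

# Collect test class names (FooTest.java -> FooTest).
mapfile -t TESTS < <(find src/test/java -name '*Test.java' -printf '%f\n' | sed 's/\.java$//' | sort)

pids=()
for ((i = 0; i < CHUNKS; i++)); do
  # Round-robin assignment of test classes to chunk i.
  chunk=$(printf '%s\n' "${TESTS[@]}" | awk -v n="$CHUNKS" -v i="$i" 'NR % n == i' | paste -sd, -)
  [[ -z "$chunk" ]] && continue

  # Give each chunk its own working copy so parallel builds don't clobber target/.
  workdir=$(mktemp -d)
  git clone -q "$REPO_URL" "$workdir"
  (cd "$workdir" && mvn -q test -Dtest="$chunk") &
  pids+=("$!")
done

# Fail the run if any chunk fails.
for pid in "${pids[@]}"; do wait "$pid"; done
```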

To make it easier to manage the Jenkins jobs, we removed a lot of old jobs and combined others using parameterized builds.  We renamed the remaining Jenkins jobs to follow a standard naming convention and organized them into tabbed views.  We now have a dozen jobs laid out where people can find them.
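As an example of how the parameterized jobs are used, a deploy can be kicked off from the command line with the Jenkins CLI; the job name, parameters, and server URL below are made up for illustration.

```bash
# Hypothetical: trigger a parameterized deploy job from the shell.
# Job name, parameter names, and the Jenkins URL are illustrative.
java -jar jenkins-cli.jar -s http://jenkins.example.internal:8080/ \
  build deploy-apps -p APPS=dashboard,api -p TARGET_ENV=qa2
```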

All of the improvement steps have helped, but we needed more fundamental changes.

Improving the Process: RPM-Based Software Deployments

We changed our build and deployment process to use RPM repositories, where every environment has its own local repository of RPMs. In the new process, the Jenkins job builds the war files, then bundles each war file into an RPM along with its install script. The job also builds RPMs for each application's config files, email templates, and data files, as well as the config files for HBase and Hadoop. Once all of the RPMs are built, the job rsyncs them to the target environment's repo. It then runs yum install over ssh against each server in the environment so that the server updates itself. Once all of the servers are updated, the job restarts them all at once. The job is parameterized so that users can build and deploy a single app, a set of apps, all apps, or just some config files.
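A stripped-down sketch of what that deploy stage might look like is below; the repo host, paths, package names, server list, and restart command are illustrative stand-ins for the values the real job takes as parameters.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the deploy stage for one environment.
# Hostnames, paths, and package names are illustrative.
set -euo pipefail

REPO_HOST="repo.qa1.example.internal"     # the environment's local yum repo
REPO_PATH="/var/www/yum/flurry"
SERVERS=(app01 app02 app03)               # servers in the target environment
PACKAGES="flurry-api flurry-dashboard"    # in practice, a job parameter

# 1. Push the freshly built RPMs to the environment's repo and reindex it.
rsync -av build/rpms/ "$REPO_HOST:$REPO_PATH/"
ssh "$REPO_HOST" "createrepo --update $REPO_PATH"

# 2. Each server pulls its own updates from the local repo.
for host in "${SERVERS[@]}"; do
  ssh "$host" "sudo yum clean expire-cache && sudo yum -y install $PACKAGES"
done

# 3. Once every server is updated, restart them all at once.
for host in "${SERVERS[@]}"; do
  ssh "$host" "sudo service jetty restart" &
done
wait
```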

The developers have access to Jenkins so that they can update the QA environments at will without having to involve Release Engineering.

The RPM-based build and deployment process gives us several advantages. The install scripts are embedded in the RPMs, which reduces the cluttered hierarchy of scripts called by the Jenkins jobs. The same tools and processes we use for deployments in the Dev and QA environments can now be safely used in Production.

By having a repo for each environment, we only have to deploy the RPMs once to that repo. Each server in the environment then pulls the RPMs from its repo. This saves a lot of time and network bandwidth for our remote environments, whose servers used to get files directly from Jenkins.
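On the server side, pointing a box at its environment's repository only takes a small yum repo definition; the file below is a hedged guess with a made-up URL and name, not our actual config.

```bash
# Hypothetical: each server carries a repo file pointing at its own
# environment's local RPM repository (URL and repo id are illustrative).
cat <<'EOF' | sudo tee /etc/yum.repos.d/flurry.repo
[flurry]
name=Flurry QA1 local RPMs
baseurl=http://repo.qa1.example.internal/yum/flurry/
enabled=1
gpgcheck=0
EOF
```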

RPMs support dependencies, which instruct yum to deploy a group of other RPMs before deploying the given RPM. For example, we can make an application's RPM dependent on the application's config file RPM, so that when we install the application, yum automatically installs the config file RPM. The dependency feature also allows us to set up a parent RPM for each class of server, where the parent RPM depends on all of the application RPMs that run on that class of server. We simply execute yum install with the parent RPM, and yum downloads and installs all of the application RPMs and their dependent config file RPMs needed for that server. In the future we will add dependencies for Java, Jetty, and various OS packages to the parent RPMs. This will allow us to kick-start a new server and fully provision it at the push of a button.
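To make the dependency idea concrete, a parent RPM can be little more than a metapackage whose spec lists the application RPMs it requires. The sketch below is a guess at what such a spec might look like, with made-up package names; it is not our actual spec.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: build a "parent" metapackage for one server class.
# Package names and versions are illustrative.
set -euo pipefail

cat > flurry-webclass.spec <<'EOF'
Name:           flurry-webclass
Version:        1.0
Release:        1
Summary:        Metapackage for the web server class (illustrative)
License:        Proprietary
BuildArch:      noarch
# yum resolves these before installing the parent; each app RPM in turn
# requires its own config-file RPM.
Requires:       flurry-api, flurry-dashboard, flurry-report

%description
Installs every application RPM that runs on a web-class server.

%files
EOF

# The binary RPM lands under ~/rpmbuild/RPMS/noarch/ by default.
rpmbuild -bb flurry-webclass.spec

# Provisioning a web-class server then becomes a single command:
#   sudo yum -y install flurry-webclass
```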

Conclusion

As with any change in process and tools, there were a few gotchas. The Jenkins slave servers were easy to set up, but a lot of the tools and configuration needed to support our JUnit test runs had to be copied over from the Jenkins master. We also found a few places where concurrent JUnit test runs stepped on common files.

Overall, the changes have sped up and cleaned up our builds and deployments. They have allowed us to better manage what we have and to handle our future growth.