XCKnife: faster distributed tests for iOS

Simple tool for optimizing XCTest runs across machines


Written by Daniel Ribeiro.

At Square, we highly value the efficiency of our engineers. As our codebases grow and our test suite reaches tens of thousands of tests, our engineers wait longer and longer for their builds to be green in our Continuous Integration system.

We recently optimized our internal testing infrastructure to allow our iOS test suite to be easily and efficiently distributed across dozens of machines. In order to provide our engineers with the fastest results possible for their pull request tests, we also had to properly balance the test shards, as your test suite will always be as fast as its slowest component.

Balancing Test Shards

A perfectly balanced test suite would look like this:

Generating a perfectly balanced sharded test is equivalent to a known hard problem of scheduling jobs and machines. Approximating it is fast enough, and we empirically observed it to be about 10% off the optimal partition.

In order to compute the approximated balanced shards we created a project called XCKnife, which we are open sourcing today. XCKnife partitions XCTestCase tests in a way that minimizes total execution time (aka *makespan*). After employing it we saw 30% faster CI test builds on our largest project.

XCKnife works by leveraging Xctool’s json-stream reporter output for both historical timing information and for current test suite definition. Xctool is an open source tool created by Facebook, that serves as a replacement for Apple’s xcodebuild and makes it easier to test iOS and Mac products.

XCKnife generates a list of only arguments meant to be passed to Xctool’s -only test arguments, but alternatively could be used to generate multiple xcschemes with the proper test partitions.

XCKnife also automatically handles deleted tests and test targets, and extrapolates the expected time for new ones.

Two Level Balancing

It is common for our test suites to contain multiple partition sets, or multiple lists of test targets that are run in different configurations. For example, given the following test structure (from this sample project):

The three test targets (CommonTestTarget, iPadTestTarget and iPhoneTestTarget) each have a list of test classes, each of which take different timing data. We can group these test targets into three partition sets, for example:

Using XCKnife to split the iPad and the Basic partition sets within 3 machines, we get a json representing following partition:

XCKnife will partition at most classes rather than methods, as otherwise it would the call the class’ initialization more than once, which is not always safe if, for instance, you use network connections that mutate state (like making Rest calls to an external stateful web service).

How to use XCKnife

XCKnife is available both as an executable and as a Ruby gem. You can install it with:

$ gem install xcknife

As a tool:

$ xcknife -p CommonTestTarget -p CommonTestTarget,iPadTestTarget 6 example/xcknife-exemplar-historical-data.json-stream example/xcknife-exemplar.json-stream

Which yields the partitioning information as json:

As a Ruby gem:

require 'xcknife'

def compute_shards(historical_file, current_file)
  knife = XCKnife::StreamParser.new(6, [["CommonTestTarget"],
    ["CommonTestTarget", "iPadTestTarget"]])
  knife.compute_shards_for_file(historical_file, current_file)

For more detailed examples on using XCKnife, refer to the docs.

How it works

The 2-approximation for balancing a single partition set is a greedy algorithm that is O(N) after a sorting step O(N logN), where all the test classes are sorted by their computed time and put in a queue with the longest time classes in the front. Each iteration step takes the a test class from the queue and assigns it to the machine that is the least busy.

The 2-approximation means the final makespan for each partition set looks like this:

Of course, test time variance and extrapolations play a key role here as well.

Understanding Imbalances

XCKnife will report the test imbalances for both partition sets amongst themselves and for its internal partitions. These are always reported against an ideal scenario (average time), and cannot be necessarily attainable. For example, if you have two partitions of two test classes, each with one test method, with expected times of 1 second and 11 seconds, the average time is 6 seconds, but since we can’t break the test method in two, the ideal scenario is not attainable.

The chart below exemplifies a situation where the ideal time would be 1 second, but the total computed times for the partitions are 0.5, 1.1, 0.8 and 1.3:


Thanks to Dimitris Koutsogiorgas and Justin Martin for design and code reviews throughout the project.


To send contributions, go to XCKnife’s GitHub page.

Square’s iOS engineering team invites you to join us for an evening of lightning talks, plus food and drinks with your fellow iOS engineers during WWDC. Join us at The Box SF on Tuesday, June 14th — doors will open at 5:30pm. RSVP here to reserve your spot! Daniel Ribeiro (@metaphysicaldev) | Twitter The latest Tweets from Daniel Ribeiro (@metaphysicaldev). Engineer @square. San Francisco - CAtwitter.com