Herding Elephants

Wrangling a 3,500-module Gradle project

Note: please do not attempt to herd any actual elephants

Every day in Square's Seller organization, dozens of Android engineers run close to 2,000 local builds, with a cumulative cost of nearly 3 days per day on those builds. Our CI system runs more than 11,000 builds per day, for a cumulative total of over 48 days per day. (All data pulled from Gradle Enterprise.) To our knowledge, Square's Android mega-repo is one of the largest such in the world, and holds the source code for all of our Point of Sale apps, and more. The repository is a mix of Android applications, Android libraries, and JVM libraries, all in a mix of Kotlin (2 million LOC) and Java (1 million LOC), spread across more than 3,500 Gradle modules. We also have a handful of tooling modules written in Kotlin, Java, and Groovy. Not to mention innumerable scripts written in bash, Ruby, and Python. It doesn't bear thinking about the thousands and thousands of lines of YAML.

Suffice it to say, ensuring all of this remains buildable day-to-day is a full-time job. Several full-time jobs, in fact. (Obligatory: we're hiring.) Roughly seven months ago, in April, the Mobile Developer Experience Android (MDXA) team embarked on a journey to modernize the code responsible for building everything: that is, the build logic. Prior to that effort, we had a mix of code in buildSrc, the root build.gradle, and many script plugins. Some of the script plugins were applied to the entire build, some were applied to subtrees, and some were applied in an ad hoc fashion to one or more modules. In other words, there was no "single source of truth" for the build, responsibilities were muddled, and it required considerable obstinacy to want to interact with Gradle directly.

Here were some of our goals when we began the project:

  • Enhance maintainability of our build logic.
  • Make it easier to upgrade to newer versions of Gradle, as well as core ecosystem plugins like Android Gradle Plugin (AGP) and Kotlin Gradle Plugin (KGP).
  • Eliminate cross-project configuration (think: allprojects and subprojects) so that every module (or "project", to use Gradle terminology) could be configured independently from every other. This should improve the configuration phase of the build, since those APIs defeat configuration on demand; and it also sets us up to be compatible with a new experimental Gradle feature known as project isolation, which holds the promise of making slow Android Studio syncs a thing of the past.
  • Regularize our build scripts to such an extent that they could be parsed with build-system-agnostic tools, including regex and AST parsers. This lets us, for certain narrow use-cases, build Very Fast Tools ™ that don't have Gradle's configuration bottleneck. One of these tools will be a Gradle-to-Bazel translator, which we'll use to run Gradle vs Bazel experiments as one input to that perennial question: which build system is best for us?
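To make the third goal concrete, here is a minimal sketch of the kind of cross-project configuration being eliminated. The compiler setting is purely illustrative and is an assumption, not something taken from Square's actual root script.

```groovy
// Root build.gradle: the pattern being removed (illustrative only)
subprojects {
  // This block reaches into every module from the outside, so no module
  // can be understood or configured by reading its own build script alone.
  tasks.withType(JavaCompile).configureEach {
    options.encoding = 'UTF-8' // hypothetical shared setting
  }
}
```

Once a shared setting like this moves into a plugin that each module applies for itself, every build script becomes self-describing and no project needs to be configured from another.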

What to expect if you keep reading

This is not a technical deep dive. Instead, it is a high-level overview of what we did, why we did it, and what we gained from the effort.

All code is production code

Gradle is famously very flexible. This flexibility is convenient for spiking code and experimentation, but all too often such experiments end up becoming the (very rickety) foundation for a critical piece of software—the software that builds the other software! The cure is to treat the build code as an important piece of software in its own right, and to apply the same rigor you would to your "production", consumer-facing code. It just so happens that the consumer of the build logic is other engineers, rather than the general public.

In practice, this means we do our best to follow software design and development best practices: the single responsibility principle, intention-revealing interfaces, a preference for composition over inheritance, an extensive suite of unit and integration tests, and common object-oriented patterns such as facades and adapters, which let us abstract over ecosystem plugins such as AGP. This last point is particularly important, as it helps us write future-proof code that is guaranteed to work™ for upcoming releases.

The build domain is its own domain. Treat it with rigor and respect, and it will pay dividends.

Model your build

Every build follows a model, even if it is an implicit model named "hairball."

One of our top objectives was to dramatically reduce, and even eliminate, all Gradle scripts other than those that are strictly necessary. That is, we wanted only settings.gradle and build.gradle scripts (also referred to as settings scripts and build scripts). No more script plugins (apply from: 'complicated_thing.gradle') with hundreds of lines of imperative build logic that had to be interpreted dynamically by Gradle at run time, and which consumed nearly a gigabyte of additional heap. Critically, we wanted to eliminate all conditional logic, as well as any injection of configuration external to the module itself (e.g., from allprojects or subprojects blocks). We sought a kind of Platonic ideal where every build script had, at most, three blocks: a plugins block declaring what kind of module it defined, a dependencies block declaring its dependencies, and a custom extension block (named, straightforwardly enough, square) that configured various optional characteristics that corresponded with our model of how we build our software. You should be able to look at the build script for a module and know what that module is all about from just those three declarations.

That last part is worth emphasizing: we created a model of how we build our software. We then formalized that model with so-called convention plugins. Your model for how you build your software would likely be different, but the concept, or pattern, of convention plugins is powerful enough that we think you'll find it useful, even with a different model.
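As a rough sketch, a build script following the three-block ideal described above might look like the following. The plugin id is one named later in this post; the dependency path and the option inside the square block are hypothetical, since the post does not list the extension's real properties.

```groovy
// build.gradle for a single module: three declarations, no conditional logic
plugins {
  // What kind of module this is
  id 'com.squareup.android.lib'
}

dependencies {
  // What this module depends on (hypothetical project path)
  implementation project(':common:money')
}

square {
  // Optional characteristics from the build model.
  // 'owner' is a hypothetical option used only for illustration.
  owner = 'payments'
}
```

Because every script has this same shape, it can be read (or parsed) without evaluating any imperative logic.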

Convention plugins

Convention plugins apply conventions to a build. They do this by applying ecosystem plugins such as AGP and KGP, and then configuring those plugins according to the convention used by your build. A simple example may help illustrate the change from the client (feature engineer) perspective. Consider this un-conventional build script:

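import org.jetbrains.kotlin.gradle.tasks.KotlinCompile

apply plugin: 'com.android.library'
apply plugin: 'kotlin-android'

android {
  // lots of boilerplate
}

// Configure every Kotlin compile task in this module to target JVM 11
tasks.withType(KotlinCompile).configureEach {
  kotlinOptions {
    jvmTarget = JavaVersion.VERSION_11.toString() // jvmTarget is a String
  }
}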

Now consider the new, conventional version of this script:

plugins {
  id 'com.squareup.android.lib'
}

In many cases, this is precisely what we've achieved. The boilerplate disappears, and the common conventions (such as jvmTarget for compilation) are entirely baked into the new plugin where, critically, they can be tested. We've also dramatically reduced the cognitive load on feature engineers. The new convention plugins are a kind of "headline", indicating at a glance the kind of module this is. The above example is for a "Square-flavored Android library" module. We also have convention plugins for com.squareup.android.app, com.squareup.jvm.lib, and more.
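For readers who have not written one, a convention plugin along these lines could be sketched as below. This is an assumption-laden illustration, not Square's implementation: the class name and the compileSdk value are hypothetical, and it presumes AGP and KGP are on the plugin project's compile classpath.

```groovy
import org.gradle.api.Plugin
import org.gradle.api.Project
import org.jetbrains.kotlin.gradle.tasks.KotlinCompile

// Hypothetical class backing a plugin id such as 'com.squareup.android.lib'
class AndroidLibConventionPlugin implements Plugin<Project> {
  @Override
  void apply(Project project) {
    // 1. Apply the ecosystem plugins so that modules don't have to
    project.pluginManager.apply('com.android.library')
    project.pluginManager.apply('org.jetbrains.kotlin.android')

    // 2. Configure them according to the build's conventions
    project.extensions.configure('android') { android ->
      android.compileSdk = 34 // hypothetical value; the former boilerplate lives here
    }
    project.tasks.withType(KotlinCompile).configureEach {
      it.kotlinOptions.jvmTarget = '11' // mirrors the un-conventional script above
    }
  }
}
```

Since the conventions now live in an ordinary class, they can be exercised by tests like any other code.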

Cognitive load

It's not just an idle claim to say we've reduced the cognitive load when it comes to interpreting our Gradle scripts—we have metrics to prove it! We used the CodeNarc CLI tool to measure the cyclomatic complexity of those scripts. According to this tool, we've reduced the complexity of our root build script from 41 to 16, a 61% reduction. Another very complex script, square_gradle_module.gradle, which embedded many of the rules for how all modules were configured, was reduced from 75 to 0 (we deleted it!).

I don't want to minimize the fact that some of this complexity has clearly been pushed elsewhere, but the elsewhere in question is a place clearly designated as part of the build domain, separated from the feature domain, with a different context and many tests. The prior Groovy-based Gradle scripts had no tests, other than the implicit integration test named "does my build still work?".

Tests

I've mentioned tests several times now. Our convention plugins have a thorough suite of integration tests that use Gradle TestKit. The tests are all data-driven specifications written with the Spock testing framework, which makes it very easy to run tests against a matrix of important third-party dependencies. For example, most of our tests run against a matrix of Gradle and AGP versions (including unreleased versions), which gives us confidence that everything will "just work" when there's a new release. We also plan to add dimensions for Kotlin and various JDKs, but those are "nice to haves" at the moment.
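A data-driven specification of this kind might look roughly like the sketch below. It assumes Spock 2.x and Gradle TestKit on the test classpath, and a plugin project that uses java-gradle-plugin so that withPluginClasspath() can find the plugin under test. The class name, the Gradle versions, and the assumption that the plugin provides the standard assemble task are all illustrative; an AGP dimension would be an additional column in the where: block.

```groovy
import org.gradle.testkit.runner.GradleRunner
import spock.lang.Specification
import spock.lang.TempDir

class JvmLibConventionSpec extends Specification {

  @TempDir
  File projectDir

  def "module assembles with Gradle #gradleVersion"() {
    given: 'a minimal single-module fixture that applies the convention plugin'
    new File(projectDir, 'settings.gradle') << "rootProject.name = 'fixture'\n"
    new File(projectDir, 'build.gradle') << "plugins { id 'com.squareup.jvm.lib' }\n"

    when: 'a normal Gradle task is invoked, as a feature engineer would'
    def result = GradleRunner.create()
      .withProjectDir(projectDir)
      .withGradleVersion(gradleVersion)
      .withPluginClasspath()
      .withArguments('assemble')
      .build() // throws if the build fails

    then: 'the full build succeeded'
    result.output.contains('BUILD SUCCESSFUL')

    where: 'each entry becomes one iteration of the test'
    gradleVersion << ['7.6', '8.5']
  }
}
```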

One of the advantages of integration tests is that we don't have to worry about the minutiae of implementation details for our suite of plugins. Instead, we just invoke normal Gradle tasks and verify the outputs from the full build are as expected. Nevertheless, we also have many unit tests for complex logic; most of these are written in Kotlin and use JUnit 5.

From buildSrc to build-logic

It is clear that having build logic in buildSrc is an improvement over imperative code directly embedded in a build script or script plugin. Nevertheless, it's not ideal for a few reasons. To start, buildSrc is on the hot path for every build: Gradle must compile it and run its check task (which includes tests) on every single build. While it's true that these tasks are generally pulled from the cache, even that can impose a non-negligible cost. buildSrc is also a bit too magical and prone to becoming a hairball of code that Gradle conveniently places on the build classpath so you don't have to think about it. This is fine for smaller projects, but for large projects, we want to think about it very carefully. One of the costs of this magic is that any change to buildSrc, no matter how seemingly trivial, invalidates the configuration of every module in your build, because it changes the classpath for the entire build—resulting in an awful developer experience.

By cleanly separating our build logic from the rest of the build via the included build facility, we break this chain of misery and achieve a separation of concerns that is simply impossible with buildSrc. As a side benefit, build authors can import the build-logic build into an IDE separately from the main project, dramatically improving their productivity.
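For reference, wiring an included build that contributes plugins takes only a few lines in the settings script. This sketch assumes Gradle 7.0 or newer and a directory named build-logic at the repository root.

```groovy
// settings.gradle
pluginManagement {
  // Plugins requested by any module are first looked up in this build
  includeBuild 'build-logic'
}
```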

A wild performance regression appeared! From included builds to publishing our build logic

When we initially migrated from buildSrc to an included build, it seemed an elegant solution to the problems outlined above. We were dropping buildSrc, with all its problems, and using the modern facility that Gradle considers the preferred, idiomatic replacement for buildSrc. (For a very elaborate example of this, see the idiomatic-gradle repo by former Gradle engineer Jendrik Johannes.) Included builds are indeed the best of both worlds in many ways: they can exist under version control in the same repository as the build that uses them; they can be easily opened in a separate IDE instance, improving build engineer productivity; they don't have the automatic test behavior of buildSrc, which improves feature engineer performance; and they're generally very well-behaved, without any of the magic of buildSrc.

It turns out, however, that they come with a cost. When an included build provides plugins used by the main build, Gradle must do some dependency resolution very early in the configuration process, which is unfortunately single-threaded. Essentially, for every plugin request, Gradle needs to check the included build(s) for plugins that might be substituted in. In the case of our Very Large Build, this added 30s (about 33%) to the configuration phase for every single build! That kind of regression is totally unacceptable, but so was a revert to buildSrc. We ended up making the decision to publish our plugins to our Artifactory instance and resolve them like any other third-party Gradle plugin. This eliminated the regression and even improved over the original situation, since resolving a binary plugin is measurably faster than compiling (or pulling from the build cache) a plugin that lives in buildSrc. We also maintained the ability to use build-logic (a collection of 32 Gradle modules at time of writing) as an included build for rapid prototyping by build engineers. Our feature engineers only use the binary plugins, however, which is a boon for their productivity.
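Resolving published plugins instead can be sketched as follows. The repository URL and the version number are placeholders, not Square's real values.

```groovy
// settings.gradle
pluginManagement {
  repositories {
    maven {
      name = 'internalArtifactory'
      url = 'https://artifactory.example.com/artifactory/gradle-plugins' // placeholder URL
    }
    gradlePluginPortal()
    google()
  }
  plugins {
    // Pin the version once here so module build scripts can omit it
    id 'com.squareup.android.lib' version '1.0.0' // hypothetical version
  }
}
```

With this in place, module scripts keep the same one-line plugins block, and the plugin arrives as a prebuilt binary rather than being substituted from an included build.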

Faster tools

Just because you build your app with Gradle doesn't mean you have to do everything with Gradle! One benefit of having a regularized set of build scripts that follow a strict convention is that you can, for example, write a tiny Kotlin app that parses project dependencies recursively and builds a trimmed-down list of projects for use by a settings script. I alluded to that in this post when I mentioned that we had replaced a Gradle task that took an average of 2 minutes to run with a Kotlin app that took about 300 ms, for an estimated savings in recovered developer productivity of over $100,000 per year.
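To give a flavor of the idea, here is a minimal standalone sketch. It is written as a plain Groovy script rather than the Kotlin app mentioned above, and it is not Square's tool: the regex, the in-memory map standing in for build files on disk, and all project paths are assumptions made for illustration.

```groovy
// Matches declarations such as: implementation project(':feature:cart')
def projectDep = ~/project\(\s*['"](:[^'"]+)['"]\s*\)/

// Extracts the project paths that one build script declares as dependencies.
def directDeps = { String script ->
  def found = new LinkedHashSet<String>()
  def matcher = projectDep.matcher(script)
  while (matcher.find()) {
    found << matcher.group(1)
  }
  found
}

// Walks the dependency graph breadth-first from a root project and returns
// every project that must be included for the root to build.
def requiredProjects = { String root, Map<String, String> scripts ->
  def seen = new LinkedHashSet<String>()
  def queue = new ArrayDeque<String>([root])
  while (!queue.isEmpty()) {
    def path = queue.poll()
    if (seen.add(path)) {
      queue.addAll(directDeps(scripts.getOrDefault(path, '')))
    }
  }
  seen
}

// Hypothetical build scripts keyed by project path (stand-ins for files on disk)
def scripts = [
  ':app'         : "dependencies { implementation project(':feature:cart') }",
  ':feature:cart': "dependencies { api project(':lib:money') }",
  ':lib:money'   : "plugins { id 'com.squareup.jvm.lib' }",
  ':lib:unused'  : "plugins { id 'com.squareup.jvm.lib' }",
]

println requiredProjects(':app', scripts)
```

The unused library never appears in the result, so a settings script driven by this list would configure three projects instead of four. This only works because the scripts are regular enough to parse without running Gradle at all.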

Herding elephants

The observant reader will have noticed that this post is about migrating over 3,500 Gradle modules to a new structure, which sounds like a lot of work just updating build.gradle files. And they'd be right. While we did migrate several hundred modules manually during the earliest stages of the project, mainly as a proof of concept, this was incredibly tedious and error-prone. Ultimately, we wrote a tool based on Groovy AST transforms to parse and rewrite our Gradle scripts in place, translating the old way of doing things into the new way. We're now working to productionize this tool to make future migrations even easier.

🎉

What we've gained

The above barely scratches the surface of what we've done over the past half year, and the effort involved to do it for a project as large as ours. The gains range from the abstract to the very particular: we're modeling our build and embracing software development best practices throughout our codebase; we've reduced complexity and cognitive overhead for our feature engineers; we have built a thorough test suite that protects against regressions and enables quicker adoption of new releases of third-party software; we've reduced memory pressure and brought build times down even in the face of a growing codebase; and we've made it possible (through our build model) to express our build in various ways not necessarily tied to Gradle.

Special thanks to Roger Hu, Corbin McNeely-Smith, and Pierre-Yves Ricau for reviewing early drafts, and to everyone else for reviewing the many, many PRs it took to get to this point.