Evolution of Developer Productivity at Square - Part One
Unpacking Square's developer productivity approach
Building on a decade of innovation and scaling, Square's growth story has been nothing short of remarkable. From the early days of a simple Square card reader connecting via an iPhone's headphone jack to an expansive ecosystem of business solutions spanning Restaurants, Appointments, Payroll, Invoices, Retail, and Banking, our ambitions have grown, and so have our challenges. Managing a rapidly expanding codebase across multiple platforms and services brought forth a unique set of complications. That's why we're launching this multi-part blog series to delve into the nuances of how our development challenges have evolved and the strategies we've employed to address them.
In this first part, we will focus on our Continuous Integration (CI) infrastructure strategy. We'll outline how a dedicated Developer Productivity team revolutionized our approach, making focused investments to speed up development across our expansive digital landscape. Whether you're grappling with issues in a large monorepo or looking to increase development velocity, the insights from our journey may offer you valuable guidance.
Let me take you back to 2019. Our codebase and the number of Square customers had been growing at an impressive rate. At that time, getting anything done in the mobile repos was an enormous challenge. The scale was big and it kept getting bigger. A lot of the tooling that existed for building and testing wasn’t developed with us in mind. So, we responded by forming a small mobile developer experience team. This small team delivered immediate quality of life improvements and proved that, given our growth trajectory, we needed more dedicated engineers to enable the product engineering team to get things done quickly and easily. In 2020, we founded the Developer Productivity team and have been making more focused investments in this space since.
Fast forward to today, we have dedicated developer experience teams for mobile, SQUID (Square Android), backend, and web platforms, as well as a developer tooling team, an internal developer portal team, a CI Infrastructure team, and a Continuous Deployment team. We oversee a vast digital landscape that includes 11 mobile apps and 2 public SDKs. Our microservices are written in a variety of languages, including Ruby, Java, and Go. Monorepos are an integral part of our ecosystem, spanning iOS, Android, Java, and Go platforms, each containing multiple millions of lines of code. When it comes to web applications, we're harnessing the power of modern frameworks like React, Vue, and Ember, spread across more than 600 poly repos.
And here’s a fun fact: Our engineers submit 22K+ pull requests weekly! We’re not in it alone–we collaborate with sister teams like Frameworks, Deploy, Cloud Foundations, Observability, Online Data Stores, and partner closely with teams like XPOS Platform, Release and Automation Engineering, Web Foundation, Hardware Quality Systems, Hardware Developer Infrastructure, Corporate Engineering, and Security. And of course, we are backed by a vibrant developer community.
Our productivity strategy rests on four cornerstones:
- Build a dependable and fast CI infrastructure.
- Increase development velocity through platform-specific investments.
- Equip our teams with the best tools available.
- Invest in reliability and test engineering.
Our CI infrastructure evolution deserves special mention. You might be surprised to find out that we have relied on the same CI solution for more than a decade. The thinking at the time was that there “wasn’t a viable alternative,” but our team thought differently and decided to develop one ourselves. We called it Kochiku (“construction” in Japanese), and it is our pragmatic approach to fragmenting a test suite, running it across a large number of machines, and then aggregating the results. Today, our customer base includes 5K+ engineers and over 1K teams. And what's powering all this? A scalable CI infrastructure that runs on an impressive 6K workers, processing more than 2.7M build parts weekly. It’s not an exaggeration to say that we have developed and now strategically manage one of the largest CI infrastructures in our industry. And we do this with an eye toward the future. But getting this right requires a lot of strategic thinking around scale, reliability, security, performance, and more. So, I’d love to break down some of the steps we have taken to get this right.
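To make the "fragment, distribute, aggregate" idea concrete, here is a minimal sketch in Python. It is not Kochiku's actual implementation: the round-robin partitioning, the thread pool standing in for remote workers, and every name in it (`partition`, `run_part`, `run_build`, `PartResult`) are assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class PartResult:
    """Outcome of one build part (one shard of the test suite)."""
    part_index: int
    passed: list = field(default_factory=list)
    failed: list = field(default_factory=list)


def partition(tests, num_parts):
    """Fragment the suite into num_parts shards, round-robin.

    Round-robin is an assumption; a real system might balance by test duration.
    """
    return [tests[i::num_parts] for i in range(num_parts)]


def run_part(part_index, tests, run_test):
    """Run one shard; run_test is a hypothetical callable returning True on pass."""
    result = PartResult(part_index)
    for name in tests:
        (result.passed if run_test(name) else result.failed).append(name)
    return result


def run_build(tests, num_parts, run_test):
    """Fan the shards out to workers, then aggregate into one build summary."""
    parts = partition(tests, num_parts)
    # Threads stand in for the fleet of CI worker machines.
    with ThreadPoolExecutor(max_workers=num_parts) as pool:
        futures = [
            pool.submit(run_part, index, shard, run_test)
            for index, shard in enumerate(parts)
        ]
        results = [future.result() for future in futures]
    failed = sorted(name for r in results for name in r.failed)
    total = sum(len(r.passed) + len(r.failed) for r in results)
    return {
        "parts": len(results),
        "total": total,
        "failed": failed,
        "succeeded": not failed,
    }


if __name__ == "__main__":
    suite = [f"test_{i}" for i in range(10)]
    # Pretend exactly one test fails.
    print(run_build(suite, 3, lambda name: name != "test_4"))
```

The build succeeds only if every part succeeds, which is why a small per-part infrastructure failure rate matters so much at the volumes described above.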
When it comes to preparing for the future and scaling, several key decisions played pivotal roles. The first major initiative was transitioning our CI workers from our data centers to AWS. By doing so, we were able to use AWS’s auto-scaling capabilities and access a diverse pool of compute- and memory-optimized EC2 instances. Together, these enabled us to scale up to 6K EC2 instances at peak usage. The immediate impact was a 47% improvement in queueing times even as the number of builds increased by 60%! Next, we adopted the Artifactory Content Delivery Network feature. This was a significant improvement that led to 12x faster download operations for our engineers outside of California. Lastly, we made a strategic switch from the Elastic File System (EFS) to AWS S3 storage for our build cache. This decision was crucial, especially during periods of high-volume data writes, as it helped us avoid throttling on the AWS side.
While scaling is undeniably essential, it becomes moot without the bedrock of reliability to support it. To this end, we engineered a failover mechanism between availability zones to help ensure continuous uptime. We tackled and resolved pressing issues related to AWS network bandwidth. Transitioning from Git NFS to using Git archives in S3 not only streamlined our processes but also enhanced system stability. We made a concerted effort to refine worker health checks, offering real-time insights into system health. Furthermore, we established a comprehensive end-to-end staging environment for our CI platform. This addition not only facilitates accelerated testing but also ensures that feature rollouts are both fast and secure. Collectively, these approaches have dramatically improved the reliability of our CI platform: infrastructure failure rates have improved 100-fold and are now trending below 0.03%.
For years, we have operated a separate CI infrastructure for our iOS applications, which consists of 500+ on-prem Mac minis. Operating such physical infrastructure comes with its own set of challenges. Maintaining it is not only expensive but also demands constant support. To illustrate, we migrated our iOS CI machines from San Francisco to San Jose. This operation alone took us six months from start to finish. The task was even more challenging because it happened during the peak of the pandemic in 2020, which restricted us to no more than two people at the data center at any given time. However, there's a silver lining. With macOS EC2 instances becoming available on AWS, we're taking our operations to the cloud! This shift has been tremendously beneficial so far. Since the beginning of this migration, we've noticed a significant decrease in our operational workload. More importantly, our developers are now experiencing faster CI build times, a win-win for everyone involved.
At Square, maintaining strong and proactive security practices is a fundamental aspect of our operational strategy, and we are continuously evolving our infrastructure and technology stack to align with the best practices in the industry. This involves working closely with our expert security teams to implement robust and effective security measures. Our commitment to security is an ongoing journey, and we regularly review and enhance our systems to ensure we remain resilient against emerging threats.
One of Square's foundational principles is fostering an "ecosystem of startups". It's a vibrant approach that celebrates letting a thousand diverse ideas flourish and has led to a rich tapestry of innovation. On the flip side, it resulted in a number of different CI platforms, and managing them is like taking care of a large garden. The main challenge is getting everyone on the same page. Creating a common path and getting everyone to use it isn't easy. Still, we are seeing a change. Teams at Square are starting to see the benefits of a single CI platform. For our team, this isn't just a task; it's an important goal. We aim to understand each team's needs and bring them together to unlock new efficiencies.
Our transition narrative, especially with regard to the cloud, is filled with strategic choices. Moving to the cloud is a clear win for most customer-facing services. The scalability it offers, especially as we pursue international growth and comply with different regulations, makes it really important to us. It redefined our infrastructure's blueprint. The iOS journey, however, charted a different course. We held onto our on-prem setup for a while because the early "cloud" options were pretty much like leasing dedicated servers – you can imagine how expensive that could get. But all that changed with the emergence of AWS macOS EC2 instances. This offering is a true cloud solution: it gives us the flexibility to scale up or down according to our needs while letting us choose precisely the configurations that suit us. Because of this, we can now make our iOS services even better than before.
In our line of work, it's tempting to keep refining what we know best. As we become more skilled, it feels natural to delve deeper, optimizing further. Take our CI infrastructure as an example. A 3% error rate meant 45K build jobs were failing each week. With hard work, we managed to bring that down to just 0.03%. That's a huge win! But there's a nuanced inflection point we often overlook: the point of diminishing returns. When incremental improvements no longer translate to substantial real-world benefits, it's a sign. A sign to pivot, reevaluate, and perhaps target more impactful challenges. This is where metrics don't just measure; they guide and inspire. They provide the empirical assurance to steer the ship towards new horizons, even when the waters are uncharted.
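The arithmetic behind those failure-rate figures can be worked through in a few lines of Python. The only inputs taken from the text are the 3% rate, the 45K weekly failures, and the 0.03% rate. The weekly job volume is derived from them, and holding that volume constant for the "after" figure is an assumption made purely for comparison.

```python
from fractions import Fraction

# Figures stated in the post.
failures_before = 45_000             # failing build jobs per week at the old rate
rate_before = Fraction(3, 100)       # 3%
rate_after = Fraction(3, 10_000)     # 0.03%

# Derived: the weekly job volume implied by "3% meant 45K failures".
implied_weekly_jobs = failures_before / rate_before

# Assumption: same weekly volume, new failure rate.
failures_after = implied_weekly_jobs * rate_after

# Relative improvement in the failure rate.
improvement = rate_before / rate_after

print(f"implied weekly jobs: {int(implied_weekly_jobs):,}")
print(f"weekly failures at 0.03%: {int(failures_after):,}")
print(f"improvement factor: {int(improvement)}x")
```

Under that assumption, the remaining failures number in the hundreds per week rather than the tens of thousands, which illustrates why further reductions start to yield diminishing returns.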
In this first part of our blog series, we've delved into Square's journey of evolving our CI infrastructure, adapting to the challenges of a rapidly expanding codebase and diverse services. We've touched on our shift to a cloud-based system and the strides we've made in enhancing developer productivity in a complex technological landscape. Stay tuned for our next post, where we'll dive into how we have accelerated development velocity and share actionable lessons we learned along the way.