The Road to an Envoy Service Mesh

How Square migrated to an Envoy-based service mesh

At Square we’ve been running a microservices architecture for years, primarily using three different languages: Java, Ruby and Go. Running such a diverse stack can make interoperability between the different languages and frameworks challenging. In this post I’ll talk about how Square has handled this in the past and where we are today: actively migrating towards a full service mesh.

The Old World

Square started off, like many other companies, running one large monolith. For Square this was a big Ruby on Rails service that handled everything Square did. A monolith makes service to service communication completely unnecessary: everything runs in the same code base on a single database, so everything can be implemented without making any outbound network calls (except to third parties outside of Square).

After a while, Square decided to move towards a service oriented architecture to reduce the reliance on a single, large application. To do this Square decided to build its own Protobuf-based RPC framework, inspired by Google’s Stubby. This was years before gRPC was created, so there weren’t any open source options at the time. This RPC framework, called Sake, was implemented for Java and Go, as these were the languages used for higher availability services, while regular REST HTTP was used between services that didn’t understand Sake. (Incidentally, Sake became a part of gRPC’s DNA.)

Over time this framework became fairly sophisticated, with many features not generally available:

  • Smart retries with automatic failover

  • Prioritized routing based on upstream health

  • Traffic shaping and service discovery
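To make the first item concrete, here is a minimal sketch of what retrying with failover can look like. It is not Sake's implementation, which this post does not describe in detail; the `hosts` list, the `send` callable and the `max_attempts` default are all hypothetical.

```python
def call_with_failover(hosts, send, max_attempts=3):
    """Try up to max_attempts distinct hosts, returning the first success.

    hosts: upstream addresses, ordered by preference (hypothetical input).
    send:  callable that performs the request against one host and raises
           ConnectionError on failure (hypothetical transport hook).
    """
    last_error = None
    for host in hosts[:max_attempts]:
        try:
            return send(host)
        except ConnectionError as exc:
            # Fail over: remember the error and move on to the next host
            # instead of retrying the one that just failed.
            last_error = exc
    raise RuntimeError("no upstream host succeeded") from last_error
```

The point of the sketch is that a retry goes to a different host than the failed attempt, which is what distinguishes failover from a plain retry loop.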

Sake ended up working well, but it relied on fat client/server libraries, so extending support to other languages was hard. Attempts were made to bring feature parity between languages in a variety of ways:

  • Sidecar process that bridges HTTP to Sake

  • Centralized L4 proxy coupled with smart client libraries for L7 features

Neither of these fully solved the problem: both narrowed the feature gap, but neither was an exact match. App owners were still forced to be aware of which mechanism was used when debugging issues, leading to a lot of confusion.

A migration from Sake to gRPC was attempted, but Sake’s large feature set proved challenging to carry over: it required implementing custom load balancing and name resolvers to work with grpc-go and grpc-java. After partially implementing the necessary features, the migration was put on ice for various reasons, in part because being a drop-in replacement for Sake meant little incentive for app owners to migrate.

On top of this, we were leveraging internal hardware load balancers to load balance service to service calls. While we had partial support for client side load balancing in Sake, a lot of Sake traffic and all HTTP traffic still relied on these load balancers. As the company kept growing, so did the amount of load on the hardware load balancers.

Rethinking service to service

After a series of issues with our hardware load balancers, we decided that we needed to revamp our service to service setup. The existing setup required creating new virtual IPs on the hardware load balancers for each new service, so the load was only going to increase as the number of services grew. We had to either add more hardware load balancers or move to a more scalable infrastructure.

This proved to be a great opportunity to simplify: instead of running Sake alongside lots of other infrastructure, we’d standardize on a single service to service mechanism. This has the added benefit of simplifying things for app owners: they no longer need to worry about multiple load balancers, connection pools, etc. They can focus on understanding the semantics of a single implementation, making everything from day to day operations to disaster recovery easier to reason about.

Looking around, Envoy stood out as a great candidate for several reasons:

  • Similarity between the Envoy load balancer data structures and Sake’s

  • Streaming configuration API

  • First class support for gRPC

  • Out of process architecture

In retrospect, mapping the load balancer implementations required quite a bit of additional upstream work (smart retries, degraded health checks and better per try timeouts to name a few) before we could reconcile Sake and Envoy. Thankfully the Envoy maintainers were extremely helpful and receptive to feature additions, allowing us to build in everything we needed.

The configuration API meant that we could easily build this on top of our existing Zookeeper based service discovery system: a centralized control plane would listen for Zookeeper changes and push those to Envoy using the configuration API.
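The shape of that control plane loop can be sketched in a few lines. This is only an illustration under stated assumptions: it uses neither the real Zookeeper client nor Envoy's configuration API, and `DiscoverySnapshot`, `subscribe` and `on_discovery_change` are hypothetical names. Each subscriber callable stands in for one connected Envoy.

```python
class DiscoverySnapshot:
    """Tracks the latest endpoints per service and notifies subscribers."""

    def __init__(self):
        self.version = 0
        self._endpoints = {}      # service name -> frozenset of "host:port"
        self._subscribers = []    # callables standing in for connected proxies

    def subscribe(self, push):
        """Register a proxy and send it the current state right away."""
        self._subscribers.append(push)
        push(self.version, self._view())

    def on_discovery_change(self, service, endpoints):
        """Handle one change event from service discovery."""
        updated = frozenset(endpoints)
        if self._endpoints.get(service) == updated:
            return False  # nothing changed, so avoid a needless push
        self._endpoints[service] = updated
        self.version += 1
        for push in self._subscribers:
            push(self.version, self._view())
        return True

    def _view(self):
        # Sorted lists make the pushed state deterministic and easy to compare.
        return {name: sorted(eps) for name, eps in self._endpoints.items()}
```

Versioning the state and skipping no-op updates keeps the proxies from being flooded when discovery reports a change that does not alter the endpoint set.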

Support for gRPC meant that we could pick up the gRPC work that had been partially rolled out, but rely on Envoy for the load balancer logic. Even more important for us, this meant that we’d be able to use the exact same load balancer for RPC and HTTP traffic. This is also why the out of process architecture is so great: the implementation is the same no matter what application is using it, because the same binary is used everywhere.

Architecting the service mesh

To move to Envoy we ended up building out a control plane based on https://github.com/envoyproxy/java-control-plane which integrated with our existing service discovery infrastructure. This let us consume the same data that was used for shaping Sake traffic, allowing app owners to not have to worry about where traffic was coming from.

Envoy processes were deployed next to each app, and a very thin client library was provided for each language to communicate with this process over a unix socket. The library was intentionally kept thin to ensure that we kept as much of the logic centralized in the control plane or Envoy, as this ensures that the behavior is consistent between languages.

We decided to use a unix socket due to our apps being deployed on multitenant machines, with access to TLS certs protected by file permissions. Using a unix socket allowed us to use similar permissioning to restrict access to a given app’s Envoy instance, which let us use plain HTTP without opening up access to secrets to all other apps on the same host. Further, this provided a performance optimization in not having to encrypt and decrypt traffic multiple times. The primary form of routing relies simply on the Host header, which indicates what service the request should be routed to.

How Envoy selects a backend host based on the Host header
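As a simplified model of that flow, the sketch below maps a Host header to a set of endpoints and picks one of them. It is not how Envoy is implemented: real route matching and load balancing are far richer and driven by configuration. The `EgressRouter` name, the round-robin policy and the example hostnames and addresses are assumptions made for illustration.

```python
import itertools


class EgressRouter:
    """Maps a Host header to a cluster and picks one endpoint from it."""

    def __init__(self, clusters):
        # clusters: {host header value: [endpoint, ...]}
        # Hosts with no endpoints are skipped, so they behave like unknown hosts.
        self._pickers = {
            host.lower(): itertools.cycle(endpoints)
            for host, endpoints in clusters.items()
            if endpoints
        }

    def pick(self, host_header):
        # Ignore an optional ":port" suffix and compare case-insensitively.
        host = host_header.split(":", 1)[0].strip().lower()
        picker = self._pickers.get(host)
        if picker is None:
            return None  # no matching route for this Host header
        return next(picker)  # round robin over the cluster's endpoints


router = EgressRouter({"user.example.internal": ["10.0.0.1:443", "10.0.0.2:443"]})
first = router.pick("user.example.internal")
second = router.pick("user.example.internal")
```

Here `first` and `second` are different endpoints of the same service, and the caller only ever had to name the service.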

Since we kept the client very thin, it then became extremely easy for anyone to build an Envoy client. If nothing else, a simple curl command would give you all the same traffic shaping, retry behavior, etc. that a “full” client implementation would:

curl --unix-socket egress.sock production.web.gns.square

This has allowed developers to prototype new languages at Square in a way that integrated with our service mesh within minutes instead of months.
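To show how little a client needs, here is a rough Python equivalent of the curl command above, using only the standard library. The class and function names are hypothetical and this is not one of Square's client libraries; the socket path and hostname are simply the ones from the curl example.

```python
import http.client
import socket


class UnixSocketHTTPConnection(http.client.HTTPConnection):
    """HTTP connection that dials a unix socket instead of TCP."""

    def __init__(self, socket_path, service_host, timeout=5.0):
        # service_host only fills in the Host header; it is never resolved.
        super().__init__(service_host, timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


def get(socket_path, service_host, path="/"):
    """Send a GET through the local proxy and return (status, body)."""
    conn = UnixSocketHTTPConnection(socket_path, service_host)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        return response.status, response.read()
    finally:
        conn.close()


# Usage, assuming a proxy is listening on egress.sock:
# status, body = get("egress.sock", "production.web.gns.square")
```

Everything else (retries, timeouts, load balancing) stays in the proxy, which is why the client can be this small.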

Migration

Once this was all set up, the migration work could start. Our primary focus has been to get rid of alternative service to service infrastructure, deprecating other proxies and moving apps away from them. This is a very slow process, as it relies on coordinating config changes and app deploys with various teams. The slowness is in large part due to the manual routing that our clients do to reach Envoy: unlike systems like Istio, which can assume control of the network and act as a transparent proxy, we have to make code changes to make clients route to Envoy. Unfortunately, due to app multitenancy and a lack of network namespacing, the approaches used by such transparent proxies were not an option for us.

To help facilitate the migration, we released an internal CLI tool named tcli (short for traffic CLI) to simplify the operations necessary for an app to make the switch to Envoy.

For an app wanting to move its client calls to go through Envoy, the steps would look something like this:

To allocate ports, add Envoy to the deployment and request access to existing service to service dependencies, run a single command from a developer laptop:

tcli add-envoy -a <app> -d <datacenter> -e <environment>

Once access requests are approved, update client config in Git to use Envoy:

userService:
  url: user.global.square # CNAME to a hardware load balancer VIP

would need to be changed to

userService:
  envoy: { app: user }

With these two simple changes, traffic would start to be routed through the Envoy sidecar. Making this as simple as possible for app owners has been key in facilitating the rollout: most app owners are not interested in dealing with the details of the migration, so abstracting this away and making it self-serve has been incredibly helpful.
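A thin client library could interpret the two config shapes above roughly as follows. This is a sketch only: the `resolve_target` function, its return format and the default socket path are hypothetical, and the config is assumed to be already parsed into a dictionary.

```python
def resolve_target(client_config, egress_socket="egress.sock"):
    """Decide how to reach a dependency from its parsed client config.

    client_config: one entry such as {"url": ...} or {"envoy": {"app": ...}}.
    egress_socket: path of the local Envoy unix socket (assumed default).
    """
    envoy = client_config.get("envoy")
    if envoy is not None:
        # New style: hand the request to the local sidecar and let it route.
        return {"transport": "unix", "socket": egress_socket, "app": envoy["app"]}
    # Legacy style: connect directly to the configured URL.
    return {"transport": "tcp", "url": client_config["url"]}


legacy = resolve_target({"url": "user.global.square"})
migrated = resolve_target({"envoy": {"app": "user"}})
```

How the app name turns into the Host header sent to Envoy is left out here, since the naming scheme is not spelled out in this post.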

Current Status

The rollout is still far from done, but in general it’s been very smooth, with most of the issues stemming from subtle differences in how Envoy works compared to legacy systems. We’re excited to keep working on rolling out Envoy for use at Square, hopefully fully deprecating Sake in favor of gRPC over Envoy in 2019.

We’re also looking forward to exploring Envoy for other parts of our traffic stack, and to leveraging its increasing presence in our service mesh to improve security at Square and make the job of app owners easier.