August 9, 2016

Upgrading a Reverse Proxy from Netty 3 to 4

Tracon is our reverse HTTP proxy powered by Netty. We recently completed an upgrade to Netty 4 and wanted to share our experience.

Written by Chris Conroy and Matt Davenport.

Tracon: Square’s reverse proxy

Tracon is our reverse HTTP proxy powered by Netty. Several years ago, as we started to move to a microservice architecture, we realized that we needed a reverse proxy to coordinate the migration of APIs from our legacy monolith to our rapidly expanding set of microservices.

We chose to build Tracon on top of Netty in order to get efficient performance coupled with the ability to make safe and sophisticated customizations. We are also able to leverage a lot of shared Java code with the rest of our stack in order to provide rock-solid service discovery, configuration and lifecycle management, and much more!

Tracon was written using Netty 3 and has been in production for three years. Over its lifetime, the codebase has grown to 20,000 lines of code and tests. Thanks in large part to the Netty library, the core of this proxy application has proven so reliable that we’ve expanded its use into other applications. The same library powers our internal authenticating corporate proxy. Tracon’s integration with our internal dynamic service discovery system will soon power all service-to-service communication at Square. In addition to routing logic, we can capture a myriad of statistics about the traffic flowing into our datacenters.

Why upgrade to Netty 4 now?

Netty 4 was released three years ago. Compared to Netty 3, the threading and memory models have been completely revamped for improved performance. Perhaps more importantly, it also provides first-class support for HTTP/2. Although we’ve been interested in migrating to this library for quite a while, we delayed because it is a major upgrade that introduces significant breaking changes.

Now that Netty 4 has been around for a while and Netty 3 has reached the end of its life, we felt that the time was ripe for an overhaul of this mission-critical piece of infrastructure. We want to allow our mobile clients to use HTTP/2, and we are retooling our RPC infrastructure to use gRPC, which will require our infrastructure to proxy HTTP/2. We knew this would be a multi-month effort and that there would be bumps along the way. Now that the upgrade is complete, we want to share some of the issues we encountered and how we solved them.

Issues encountered

Single-threaded channels: this should be simple!

Unlike Netty 3, in Netty 4, outbound events happen on the same single thread as inbound events. This allowed us to simplify some of our outbound handlers by removing code that ensured thread safety. However, we also ran into an unexpected race condition because of this change.

Many of our tests run with an echo server, and we assert that the client receives exactly what it sent. In one of our tests involving chunked messages, we found that we would occasionally receive all but one chunk back. The missing chunk was never at the beginning of the message, but it varied from the middle to the end.

In Netty 3, all interactions with a pipeline were thread-safe. However, in Netty 4, all pipeline events must occur on the event loop. As a result, events that originate outside of the event loop are scheduled asynchronously by Netty.

In Tracon, we proxy traffic from an inbound server channel to a separate outbound channel. Since we pool our outbound connections, the outbound channels aren’t tied to the inbound event loop. Events from each event loop caused this proxy to try to write concurrently. This code was safe in Netty 3 since each write call would complete before returning. In Netty 4, we had to control more carefully which event loop could call write in order to prevent out-of-order writes.

When upgrading an application from Netty 3, carefully audit any code for events that might fire from outside the event loop: these events will now be scheduled asynchronously.
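
The ordering idea can be modeled without Netty at all. The sketch below is a stand-in, not Tracon's code or Netty's API: a single-threaded executor plays the role of the event loop, and the hypothetical SingleLoopWriter queues every write onto it, so chunks reach the "wire" in the order they were submitted, regardless of which thread called write.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Stand-in for a channel whose writes must all run on one "event loop" thread. */
final class SingleLoopWriter {
  // One thread plays the role of the channel's event loop.
  private final ExecutorService loop = Executors.newSingleThreadExecutor();
  // Only ever touched from the loop thread, so it needs no locking.
  private final List<String> wire = new ArrayList<>();

  /** Safe to call from any thread: the write is always queued, never run inline. */
  void write(String chunk) {
    loop.execute(() -> wire.add(chunk));
  }

  /** Waits for the queued writes, stops the loop, and returns what reached the "wire". */
  List<String> drainAndClose() {
    // The snapshot task is queued behind every write submitted before it.
    Callable<List<String>> snapshot = () -> new ArrayList<>(wire);
    try {
      return loop.submit(snapshot).get();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } catch (ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      loop.shutdown();
    }
  }
}
```

The point of the model is that no write ever bypasses the queue. Reordering becomes possible as soon as some writes run inline while others are scheduled.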

When is a channel really connected?

In Netty 3, the SslHandler “redefines” a channelConnected event to be gated on the completion of the TLS handshake instead of the TCP handshake on the socket. In Netty 4, the handler does not block the channelConnected event and instead fires a finer-grained user event: SslHandshakeCompletionEvent. Note that Netty 4 replaces channelConnected with channelActive.

For most applications, this would be an innocuous change, but Tracon uses mutually authenticated TLS to verify the identity of the services it is speaking to. When we first upgraded, we found that we lacked the expected SSLSession in the mutual authentication channelActive handler. The fix is simple: listen for the handshake completion event instead of assuming the TLS setup is complete on channelActive:

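@Override public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception {
  // SUCCESS is only fired once the TLS handshake has completed successfully.
  if (evt.equals(SslHandshakeCompletionEvent.SUCCESS)) {
    // engine is assumed to be the SSLEngine used by this channel's SslHandler.
    Principal peerPrincipal = engine.getSession().getPeerPrincipal();
    // Validate the principal
    // ...
  }
  super.userEventTriggered(ctx, evt);
}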

Recycled Buffers Leaking NIO Memory

In addition to our normal JVM monitoring, we added monitoring of the size and number of NIO allocations by exporting the JMX bean java.nio:type=BufferPool,name=direct, since we want to be able to understand and alert on the direct memory usage of the new pooled allocator.
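
For readers who want to look at the same numbers in-process, the JDK exposes that bean through java.lang.management. The following is a minimal sketch rather than our monitoring code; the DirectBufferStats class and its method names are made up for illustration.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

/** Reads the JVM's "direct" buffer pool statistics (java.nio:type=BufferPool,name=direct). */
final class DirectBufferStats {
  /** Returns the platform MXBean that tracks direct ByteBuffer allocations. */
  static BufferPoolMXBean directPool() {
    for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
      if ("direct".equals(pool.getName())) {
        return pool;
      }
    }
    throw new IllegalStateException("no direct buffer pool MXBean found");
  }

  /** Formats the current buffer count and byte usage for logging or export. */
  static String describe() {
    BufferPoolMXBean pool = directPool();
    return String.format("direct buffers: count=%d used=%d bytes capacity=%d bytes",
        pool.getCount(), pool.getMemoryUsed(), pool.getTotalCapacity());
  }
}
```

Calling DirectBufferStats.describe() periodically and feeding the values into a metrics system is one way to alert on growth in direct memory.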

In one cluster, we were able to observe an NIO memory leak using this data. Netty provides a leak detection framework to help catch errors in managing the buffer reference counts. We didn’t get any leak detection errors because this leak was not actually a reference count bug!

Netty 4 introduces a thread-local Recycler that serves as a general purpose object pool. By default, the recycler is eligible to retain up to 262k objects. ByteBufs are pooled by default if they are less than 64kb: that translates to a maximum of 17GB of NIO memory per buffer recycler.
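
The 17GB figure is simple multiplication, spelled out below as a sketch. It assumes that “262k” means 262,144 (2^18) objects and that “64kb” means 65,536 bytes; the RecyclerWorstCase class is purely illustrative.

```java
/** Worst-case arithmetic for how much buffer memory one recycler can retain. */
final class RecyclerWorstCase {
  /** Upper bound in bytes: every retained slot holds a buffer of the largest pooled size. */
  static long maxRetainedBytes(long maxObjects, long maxPooledBufferBytes) {
    return Math.multiplyExact(maxObjects, maxPooledBufferBytes);
  }

  public static void main(String[] args) {
    // Assumed defaults: 262,144 objects per recycler, buffers pooled below 64 * 1024 bytes.
    long bytes = maxRetainedBytes(262_144, 64 * 1024);
    System.out.println(bytes + " bytes per recycler");
  }
}
```

Under those assumptions the bound is 262,144 × 65,536 = 17,179,869,184 bytes, which is 16 GiB or roughly 17 GB. Because the recycler is thread-local, the bound applies to each thread that allocates buffers.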

Under normal conditions, it’s rare to allocate enough NIO buffers to matter. However, without adequate back-pressure, a single slow reader can balloon memory usage. Even after the buffered data for the slow reader is written, the recycler does not expire old objects: the NIO memory belonging to that thread will never be freed for use by another thread. We found the recyclers completely exhausted our NIO memory space.

We’ve notified the Netty project of these issues, and there are several upcoming fixes to provide saner defaults and limit the growth of objects.

We encourage all Netty users to configure their recycler settings based on available memory, the number of threads, and profiling of the application. The number of objects per recycler is set with -Dio.netty.recycler.maxCapacity, and the maximum buffer size to pool is set with -Dio.netty.threadLocalDirectBufferSize. It’s safe to disable the recycler completely by setting -Dio.netty.recycler.maxCapacity to 0; in our applications, we have not observed any performance advantage from using the recycler.

We made another small but very important change in response to this issue: we modified our global UncaughtExceptionHandler to terminate the process if it encounters an error since we can’t reasonably recover once we hit an OutOfMemoryError. This will help mitigate the effects of any potential leaks in the future.

class LoggingExceptionHandler implements Thread.UncaughtExceptionHandler {

 private static final Logger logger = Logger.getLogger(LoggingExceptionHandler.class);

 /** Registers this as the default handler. */
 static void registerAsDefault() {
   Thread.setDefaultUncaughtExceptionHandler(new LoggingExceptionHandler());
 }

 @Override public void uncaughtException(Thread t, Throwable e) {
   if (e instanceof Exception) {
     logger.error("Uncaught exception killed thread named '" + t.getName() + "'.", e);
   } else {
     logger.fatal("Uncaught error killed thread named '" + t.getName() + "'." + " Exiting now.", e);
     System.exit(1);
   }
 }
}

Limiting the recycler fixed the leak, but this also revealed how much memory a single slow reader could consume. This isn’t new to Netty 4, but we were able to easily add backpressure using the channelWritabilityChanged event. We simply add this handler whenever we bind two channels together and remove it when the channels are unlinked.

/**
 * Observe the writability of the given inbound pipeline and set the {@link ChannelOption#AUTO_READ}
 * of the other channel to match. This allows our proxy to signal to the other side of a proxy
 * connection that a channel has a slow consumer and therefore should stop reading from the
 * other side of the proxy until that consumer is ready.
 */
public class WritabilityHandler extends ChannelInboundHandlerAdapter {

 private final Channel otherChannel;

 public WritabilityHandler(Channel otherChannel) {
   this.otherChannel = otherChannel;
 }

 @Override public void channelWritabilityChanged(ChannelHandlerContext ctx) throws Exception {
   boolean writable = ctx.channel().isWritable();
   otherChannel.config().setOption(ChannelOption.AUTO_READ, writable);
   super.channelWritabilityChanged(ctx);
 }
}

A channel becomes unwritable once its outbound buffer grows past the high water mark, and it is not marked writable again until the buffer drains below the low water mark. By default, the high water mark is 64 KB and the low water mark is 32 KB. Depending on your traffic patterns, you may need to tune these values.
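
The gap between the two marks keeps writability from flapping. The sketch below is a standalone model of that behavior, not Netty's implementation: WatermarkModel and its methods are hypothetical, and it assumes strict comparisons against both marks.

```java
/** Standalone model of high/low water mark writability; not Netty's implementation. */
final class WatermarkModel {
  private final long lowWaterMark;
  private final long highWaterMark;
  private long pendingBytes;
  private boolean writable = true;

  WatermarkModel(long lowWaterMark, long highWaterMark) {
    if (lowWaterMark < 0 || highWaterMark < lowWaterMark) {
      throw new IllegalArgumentException("need 0 <= low <= high");
    }
    this.lowWaterMark = lowWaterMark;
    this.highWaterMark = highWaterMark;
  }

  /** Bytes were queued for sending; crossing the high mark flips the channel to unwritable. */
  void enqueue(long bytes) {
    pendingBytes += bytes;
    if (pendingBytes > highWaterMark) {
      writable = false;
    }
  }

  /** Bytes reached the socket; only dropping below the low mark restores writability. */
  void flushed(long bytes) {
    pendingBytes = Math.max(0, pendingBytes - bytes);
    if (pendingBytes < lowWaterMark) {
      writable = true;
    }
  }

  boolean isWritable() {
    return writable;
  }
}
```

With marks of 32 KB and 64 KB, queuing 65 KB makes the model unwritable, draining to 45 KB leaves it unwritable, and draining to 25 KB makes it writable again. Each of those transitions is where a handler like the WritabilityHandler above would toggle AUTO_READ on the other channel.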

If a promise breaks, and there’s no listener, did you just build /dev/null as a service?

While debugging some test failures, we realized that some writes were failing silently. Outbound operations notify their futures of any failures, so each caller could attach its own listener. When every write shares the same failure handling, though, it is simpler to install a single handler that covers all writes. We added a simple handler to log any failed writes:

@Singleton
@Sharable
public class PromiseFailureHandler extends ChannelOutboundHandlerAdapter {

 private final Logger logger = Logger.getLogger(PromiseFailureHandler.class);

 @Override public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise)
     throws Exception {
   promise.addListener(future -> {
     if (!future.isSuccess()) {
       logger.info("Write on channel %s failed", promise.cause(), ctx.channel());
     }
   });

   super.write(ctx, msg, promise);
 }
}

HTTP codec changes

Netty 4 has an improved HTTP codec with a better API for managing chunked message content. We were able to remove some of our custom chunk handling code, but we also found a few surprises along the way!

In Netty 4, every HTTP message is converted into a chunked message. This holds true even for zero-length messages. While it’s technically valid to have a 0 length chunked message, it’s definitely a bit silly! We installed object aggregators to convert these messages to non-chunked encoding. Netty only provides an aggregator for inbound pipelines: we added a custom aggregator for our outbound pipelines and will be looking to contribute this upstream for other Netty users.

There are a few nuances with the new codec model. Of note, LastHttpContent is also an HttpContent. This sounds obvious, but if you aren’t careful you can end up handling a message twice! Additionally, a FullHttpResponse is also an HttpResponse, an HttpContent, and a LastHttpContent. We found that we generally wanted to handle this as both an HttpResponse and a LastHttpContent, but we had to be careful to ensure that we didn’t forward the message through the pipeline twice.

Don’t do this:

if (msg instanceof HttpResponse) {
  ...
}

if (msg instanceof HttpContent) {
  ...
}

if (msg instanceof LastHttpContent) {
  ... // Duplicate handling! This was already handled above!
}
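
One way to avoid the double handling is to test the most specific content type first and forward exactly once. The sketch below uses stand-in interfaces that only mirror the type relationships described above. They are not Netty's classes, and the counters stand in for real handling logic.

```java
/** Dispatch sketch using stand-in types that mirror the codec's type relationships. */
final class HttpDispatchSketch {
  // Hypothetical stand-ins for HttpResponse, HttpContent, LastHttpContent, FullHttpResponse.
  interface Response {}
  interface Content {}
  interface LastContent extends Content {}
  interface FullResponse extends Response, LastContent {}

  int responses;
  int contents;
  int lastContents;
  int forwarded;

  /** Handles each aspect of the message once, then forwards it once. */
  void channelRead(Object msg) {
    if (msg instanceof Response) {
      responses++; // status line and headers
    }
    // Check the subtype first so a last chunk is not also treated as a plain chunk.
    if (msg instanceof LastContent) {
      lastContents++; // end-of-message handling
    } else if (msg instanceof Content) {
      contents++; // an intermediate chunk
    }
    forwarded++; // a single forward, no matter how many branches matched
  }
}
```

A full response then counts once as a response and once as last content, never as a plain chunk, and it moves down the pipeline a single time.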

Another nuance we discovered in some test code: LastHttpContent may fire after the receiving side has already received the complete response if there is no body. In this case, the last content is serving as a sentinel, but the last bytes have already gone out on the wire!

Replacing the engine while the plane is in the air

In total, our change to migrate to Netty 4 touched 100+ files and 8k+ lines of code. Such a large change coupled with a new threading and memory model is bound to encounter some issues. Since 100% of our external traffic flows through this system, we needed a process to validate the safety of these changes.

Our large suite of unit and integration tests was invaluable in validating the initial implementation.

Once we established confidence in the tests, we began with a “dark deploy” where we rolled out the proxy in a disabled state. While it didn’t take any traffic, we were able to exercise a large amount of the new code by running health checks through the Netty pipeline to check the status of downstream services. We highly recommend this technique for safely rolling out any large change.

As we slowly rolled out the new code to production, we also relied on a wealth of metrics in order to compare the performance of the new code. Once we addressed all of the issues, we found that Netty 4 performance using the UnpooledByteBufAllocator is effectively identical to Netty 3. We’re looking forward to using the pooled allocator in the near future for even better performance.

Thanks

We’d like to thank everyone involved in the Netty project. We’d especially like to thank Norman Maurer / @normanmaurer for being so helpful and responsive!