Faster App Recovery With Bounded Queues

Get your app back up more rapidly using bounded queues

A problem we’ve encountered in the past is when a Ruby app becomes unresponsive due to downstream issues. After the issue is resolved, the app continues to be unresponsive or very slow to process requests. While a restarts clear it up immediately, but the less manual steps we have to take to recover an app, the better.

Because testing on production is bad, we need a simple way of replicating this locally.


At Square, our Ruby apps use NGINX and Puma to serve requests. We write out a simple server side test case, which we can run by using nginx -v ./nginx.conf and puma -C puma.rb. Then we set up the client side test case:

require 'http'

started_at =
threads = do |i|
  sleep 0.1 do
      started_at =
      response = HTTP.get("http://localhost:8001/#{i}")
      [response.respond_to?(:code) ? response.code : -1, ( - started_at), response.body.to_s]
    rescue HTTP::ConnectionError => ex
    rescue => ex
      puts "#{ex.class}: #{ex.message}"

time_taken = []
threads.each_with_index do |thread, id|
  code, results, body = thread.value
  puts "[#{id}] #{code} - #{results} seconds elapsed"

  time_taken << results if results

puts "Took #{( - started_at).round(4)} total, avg of #{(time_taken.inject(:+) / time_taken.length).round(4)} in thread time, max #{time_taken.max}, min #{time_taken.min}"

Now we have a test endpoint, which sleeps for 5 seconds before responding, and the benchmark which makes 10 requests. With our initial configuration, the benchmark looks like:

[0] 200 - 4.13 seconds elapsed
[1] 200 - 9.13 seconds elapsed
[2] 200 - 14.13 seconds elapsed
[3] 200 - 19.13 seconds elapsed
[4] 200 - 24.14 seconds elapsed
[5] 200 - 29.14 seconds elapsed
[6] 504 - 29.70 seconds elapsed
[7] 504 - 29.80 seconds elapsed
[8] 504 - 29.90 seconds elapsed
[9] 504 - 30 seconds elapsed
Took 30.0102 total, avg of 21.9242 in thread time, max 30, min 4.13

Requests #0 through #5 returned HTTP 200 Success and #6 through #9 returned HTTP 504 Gateway Timeout. This is expected since our test case for NGINX has proxy_read_timeout 30s set, and Puma is configured to process one request at a time, it takes about 30 seconds to process the first 6 requests, and the remaining 4 timeout in NGINX.

Comparing the Puma logs, we see that Puma saw 10 requests, all of which returned HTTP 200s, NGINX saw the same as the client, 6 requests with HTTP 200 and 4 with HTTP 504.

This is one of those fun gotchas. The client spent 30 seconds waiting for responses, but the server took 50 seconds to process it. Here’s what’s going on:

00:00 — [A] Client sends request A to Server
00:01 — [A] NGINX buffers the HTTP request into memory and sends it to Puma
00:01 — [A] Puma accepts the connection, reads the request into memory and queues it
00:03 — [B] Client times out, sends request B to Server
00:04 — [B] NGINX buffers the HTTP request into memory and sends it to Puma
00:04 — [B] Puma accepts the connection, reads the request into memory and queues it
00:05 — [A] Puma starts processing the request, and responds with it to NGINX
00:06 — [A] NGINX throws the response away since the client closed the connection for A
00:06 — [B] Puma starts processing the request, responds with it to NGINX
00:07 — [B] Client receives the response for B

Even better is if the client has automated retries. If a client has a timeout of 500ms on the endpoint /api/v1/update-user, and that endpoint now takes 1,000ms, the server has to spend 1,000ms of processing time for every 500ms the client spends. If the client retries 3 times, your server now spends 3,000ms for every 1,500ms the client spends.

Especially in a microservice architecture, that one endpoint could call another service, which calls another and so on. We’d quickly end up DoSing ourselves, and making an outage worse than it was originally due to requests being backlogged.

Ideally, our app is configured to reject requests when it’s overloaded. It prevents the above scenario of cascading failures, and means the app can recover without having to restart and flush out all the long running requests.

In this case, we have a few levers we can tweak. We’ll start with the obvious one, which is Puma.

Restricting Requests In Puma

Digging through the Puma code, we found that by default, Puma has an unbounded queue, and accepts a connection from the socket then queues it up for processing in server.rb:

if io = sock.accept_nonblock
  client = io, @binder.env(sock)
  if remote_addr_value
    client.peerip = remote_addr_value
  elsif remote_addr_header
    client.remote_addr_header = remote_addr_header

  pool << client
  pool.wait_until_not_full unless queue_requests

Double-checking the thread_pool.rb in Puma, we find:

def <<(work)
  @mutex.synchronize do
    if @shutdown
      raise "Unable to add work while shutting down"

    @todo << work

    if @waiting < @todo.size and @spawned < @max


The variable @todo is an unbounded array which stores the HTTP requests in memory until a thread is available to process. If our app is unable to keep up, we could have a queue with hundreds/thousands of requests that timed out a long time ago. The app may eventually catch up, but we don’t want to wait 10 minutes for it to recover if 95% of the requests it is processing timed out already.

Looking at the code though for server.rb, the pool.wait_until_not_full unless queue_requests option looked promising. Turns out, it’s an option that does exactly what we want! We can make Puma only accept connections when it has a thread available to do so.

As a note, Puma 3.9.0, which was released at the start of June 2017, now only accepts a connection if it can immediately process it. The timing was unrelated to our digging into it, and was a happy coincidence.

After all that, we add queue_requests false to our Puma configuration and rerun the test… and get exactly the same results. Puma processed 10 requests, and the benchmark saw 6 successes and 4 failures.

From prior experience in working with sockets, we know that eventually a TCP socket ends up using the syscall listen (in Linux). Looking at the documentation, we see the backlog argument which “defines the maximum length to which the queue of pending connections for sockfd may grow”.

Queues All The Way Down

One of the core parts of TCP is buffering. When clients send HTTP requests, they don’t first connect, wait for the server to accept, and send the HTTP request. They attempt to send off everything at once and parts of the request will sit in buffers between the client and the server until the server accepts the connection or times out.

Good for performance, not so when we want predictable request queuing. At this point, we pinged Evan Miller, a fellow Square engineer, on the semantics of backlog and queues, which led to learning that Linux has a backlog for post-handshake sockets (the backlog we’re looking at) and then also a pre-handshake socket. This gets into semantics of how TCP works, which I won’t cover here but Wikipedia has an article explaining the handshake process.

Because we’re putting NGINX in front of Puma, we actually have four additional queues: one pre and post handshake in Puma, and another in NGINX. Looking at our server configuration and the documentation, we found the max backlog is 128 for the post-handshake queue.

We configured Puma to use a backlog of 2x what a single instance could process. For example, if one of our smaller apps can handle 10 requests simultaneously, we set the backlog to 20. Given there are multiple TCP queues, this isn’t perfect, but it helps narrow the funnel of where requests can queue up.

We’ve done what we can for the Puma side, but we still haven’t looked at NGINX.

Restricting Requests in NGINX

After chatting with other engineers, they mentioned max_conns, a NGINX option that became available in 1.11.5 (the paid version had it earlier).

The documentation looked promising, “limits the maximum number of simultaneous active connections to the proxied server”, though it has some caveats further down noting it can exceed the limit due to keep alive connections.

We used the same defaults we settled on for the backlog: each instance can queue up to 2x the requests it can handle at once. For our test, the server handles 1 request at a time, and we set max_conns=3. Rerun the test, and we see:

[0] 200 - 4.13
[1] 200 - 9.14
[2] 200 - 14.14
[3] 502 - 0.01
[4] 502 - 0
[5] 502 - 0
[6] 502 - 0
[7] 502 - 0
[8] 502 - 0
[9] 502 - 0
Took 14.14 total, avg of 2.74 in thread time, max 14.14, min 0

Puma logs only show 3 requests, and nginx rejected the other 7 because we exceeded the maximum number of connections.


Queues are fun, and trying to ensure everything between the client and the server is complicated. The NGINX max_conns option is good enough for most cases, and has the added advantage of rejecting requests before it hits the Ruby apps. The TCP backlog and nginx queuing changes are an extra layer of protection, and reduce the chance of long-lived requests hiding out in a queue between NGINX and Puma.

With these changes, we’re expecting to see more deterministic failure modes when apps slow down due to downstream issues. Instead of queuing up more requests than the app can realistically process, they’ll start responding with errors more quickly.

These aren’t perfect, but the faster an individual host goes unhealthy, the faster it can be taken out of rotation and given a chance of catching up and becoming healthy again.

Table Of Contents
View More Articles ›