Anatomy of an HAProxy <-> Java TLS bug
Adding SSL health checks to HAProxy uncovered an interesting interaction with Java servers.
TL;DR: Java interprets the TLS specification strictly and does not allow session resumption when the connection is closed uncleanly.
When an internal load balancing project based on HAProxy was deployed to our staging environment, with a configuration that included hundreds of services and backends, HAProxy immediately pegged CPU usage at 100%. Since no traffic was flowing through HAProxy yet, the only possible culprit was the health checking performed for each service backend.
Square uses TLS for all service-to-service communication, including health checks, so HAProxy was configured to use SSL without certificate verification. Assuming the CPU spike was caused by the cost of continuously creating new connections to the backend servers, a patch was written to add persistent-connection support to HAProxy's health checks. This resolved the problem, but since HAProxy is open source, getting the patch integrated upstream would be the best long-term solution.
Once the patch was submitted to the mailing list, Willy Tarreau, HAProxy's creator, pointed out that new TLS sessions, which involve CPU-intensive key generation, should only be created on the initial connection; subsequent connections should use session resumption. Lightbulb, head-desk, duh.
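From the client's side, resumption amounts to offering the previous session on the next connection and checking whether the server accepted it. The sketch below is not HAProxy's code, just a minimal illustration using the OpenSSL API; ctx is an existing SSL_CTX and fd a freshly connected socket:

#include <openssl/ssl.h>

static SSL_SESSION *saved_session;           /* remembered between checks */

/* One "health check" connection over an already-connected socket.
 * Returns 1 if the server resumed the previous session, 0 if it forced
 * a full key exchange, -1 on handshake failure. */
static int check_once(SSL_CTX *ctx, int fd)
{
    SSL *ssl = SSL_new(ctx);
    SSL_set_fd(ssl, fd);

    if (saved_session)
        SSL_set_session(ssl, saved_session); /* offer the old session */

    if (SSL_connect(ssl) != 1) {
        SSL_free(ssl);
        return -1;
    }

    int resumed = SSL_session_reused(ssl);   /* abbreviated handshake? */

    if (saved_session)
        SSL_SESSION_free(saved_session);
    saved_session = SSL_get1_session(ssl);   /* keep it for the next check */

    SSL_shutdown(ssl);                       /* clean close: sends close_notify */
    SSL_free(ssl);
    return resumed;
}

Only the first connection should pay for the key exchange; every subsequent check should report a reused session, which is exactly what was not happening here.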
Thankfully, HAProxy exposes a stats socket that reports, among other things, the rate of SSL key computations for backend connections:
socat -t120 ./stats.sock stdio <<< "show info" | grep SslBackendKeyRate
SslBackendKeyRate was continuously over 300, meaning that HAProxy was establishing new (expensive) SSL sessions 300 times a second. Since HAProxy is a single-threaded, event-driven server, this was saturating the CPU, and it was having trouble even returning the stats information.
By slowly paring the configuration file down to a bare minimum while monitoring SslBackendKeyRate, the problem was eventually narrowed down to the Java services.
Using tcpdump and Wireshark, a snapshot of the SSL traffic between HAProxy and the backend was taken:
sudo tcpdump -w out -i any -s 1600 '(tcp[((tcp[12:1] & 0xf0) >> 2):1] = 0x16)'
In Wireshark, selecting Analyze -> Decode As -> SSL for the relevant ports decodes the unencrypted SSL packet information.
In the packet dump, it appeared that HAProxy would receive a Session ID from the backend and reuse it in the next connection, but the backend would still insist on a full key exchange for that new connection.
The OpenSSL client was able to confirm that session resumption still worked properly on the backend; -reconnect reconnects five times reusing the session, so seeing the same Session-ID on every connection confirms resumption:
echo "Q" | openssl s_client -connect {HOST}:{PORT} -reconnect | grep Session-ID
Finally, the smoking gun was found once a small reproduction case was created.
With Java's SSL debugging enabled (-Djavax.net.debug=all), the logs showed that sessions were being invalidated when the connection was closed:
qtp1952751122-12, fatal error: 80: Inbound closed before receiving peer's close_notify: possible truncation attack?
javax.net.ssl.SSLException: Inbound closed before receiving peer's close_notify: possible truncation attack?
%% Invalidated: [Session-2, TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384]
Looking through HAProxy's health check code, there are a few places where the connection is shut down, but the most notable were these lines:
/* Close the connection... We absolutely want to perform a hard close
 * and reset the connection if some data are pending, otherwise we end
* up with many TIME_WAITs and eat all the source port range quickly.
* To avoid sending RSTs all the time, we first try to drain pending
* data.
*/
__conn_data_stop_both(conn);
conn_data_shutw_hard(conn);
conn_data_shutw_hard in turn calls a shutw function on the SSL session with the clean flag unset:
if (!clean)
/* don't sent notify on SSL_shutdown */
SSL_set_quiet_shutdown(conn->xprt_ctx, 1);
SSL_set_quiet_shutdown sets a flag so that, when the SSL session is shut down, no "close notify" alert is sent to the server.
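Concretely, the two shutdown paths differ only in whether close_notify goes out before the socket is torn down. A minimal sketch (again not HAProxy's code; ssl and fd are an established OpenSSL connection and its underlying socket):

#include <unistd.h>
#include <openssl/ssl.h>

/* Shut down an established TLS connection.  With clean == 0 the peer never
 * receives close_notify and only sees the TCP connection drop, which is
 * what Java flags as a possible truncation attack. */
static void shut_conn(SSL *ssl, int fd, int clean)
{
    if (!clean)
        SSL_set_quiet_shutdown(ssl, 1);  /* suppress the close_notify alert */

    SSL_shutdown(ssl);                   /* sends close_notify unless quiet */
    SSL_free(ssl);
    close(fd);
}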
Skipping close_notify has been permitted since the TLS 1.1 specification (RFC 4346):
close_notify
This message notifies the recipient that the sender will not send
any more messages on this connection. Note that as of TLS 1.1,
failure to properly close a connection no longer requires that a
session not be resumed. This is a change from TLS 1.0 to conform
with widespread implementation practice.
The Java log messages indicate that the SSL session is invalidated to defend against a "truncation attack," an attack against TLS described in 2013 in which an attacker cuts a connection short so the victim never notices the missing data. Java mitigates it by requiring a complete TLS close sequence before allowing session resumption.
In the end, the patch was essentially a five character change:
diff --git a/src/checks.c b/src/checks.c
index 0668a76..dba45f0 100644
--- a/src/checks.c
+++ b/src/checks.c
@@ -1349,14 +1349,15 @@ static void event_srv_chk_r(struct connection *conn)
*check->bi->data = '\0';
check->bi->i = 0;
- /* Close the connection... We absolutely want to perform a hard close
- * and reset the connection if some data are pending, otherwise we end
- * up with many TIME_WAITs and eat all the source port range quickly.
- * To avoid sending RSTs all the time, we first try to drain pending
- * data.
+ /* Close the connection... We still attempt to nicely close if,
+ * for instance, SSL needs to send a "close notify." Later, we perform
+ * a hard close and reset the connection if some data are pending,
+ * otherwise we end up with many TIME_WAITs and eat all the source port
+ * range quickly. To avoid sending RSTs all the time, we first try to
+ * drain pending data.
*/
__conn_data_stop_both(conn);
- conn_data_shutw_hard(conn);
+ conn_data_shutw(conn);
/* OK, let's not stay here forever */
if (check->result == CHK_RES_FAILED)
By at least attempting to shut the SSL session down cleanly, a "close notify" will (almost) always be sent, and the session can be resumed. CPU usage dropped back to normal. The patch was accepted and released in HAProxy 1.7.4.
You can find the entire thread here:
https://www.mail-archive.com/haproxy@formilux.org/msg25078.html