Command Line Observability with Semantic Exit Codes
Drawing inspiration from HTTP Status Codes to improve our Command Line tools
At Square, many platform teams publish Command Line (CL) tools for interacting with their services. Their customers (Square engineers) use them to search logs, connect to databases, generate code, and more. And because CL tools are particularly easy to create, many engineers create them to automate repetitive tasks.
Over 14 years, Squares have written a huge number of these tools. They range from one-off scripts to workhorses used daily, from tools that have traction only within a few teams to others in most Squares’ toolkits, and from rough to polished.
Last year, the Developer Tools team started a project to improve the discoverability and quality of Square’s internal CL tools. We had an idea of where to apply our effort, but we wanted to understand the landscape first. We wondered:
- Which tools not in our own toolkits were nevertheless used heavily by others?
- Which tools were unused and could be cleaned up?
- How could we notify ourselves and a tool’s authors if its error rate spiked?
Here is part of the dashboard for a CLI called bootstrap that every engineer runs.
By capturing just the exit code and user of each tool, we could answer questions like:
- How many Daily Active Users does it have?
- How many of its users are sticky? (i.e. how many used it at least 5 days out of a month?)
- How many attempts did each user take to succeed? (i.e. how many failed executions preceded each successful execution?)
- What’s its failure rate?
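As a rough illustration of how these questions reduce to simple aggregations, here is a minimal sketch in Python. It assumes each execution is recorded as a (user, day, exit code) tuple; the `Run` record and all function names are hypothetical and do not describe Square's actual pipeline.

```python
from collections import defaultdict
from datetime import date
from typing import NamedTuple


class Run(NamedTuple):
    """One recorded execution of a tool (hypothetical record shape)."""
    user: str
    day: date
    exit_code: int


def daily_active_users(runs):
    """Map each day to the number of distinct users who ran the tool."""
    users_by_day = defaultdict(set)
    for run in runs:
        users_by_day[run.day].add(run.user)
    return {day: len(users) for day, users in users_by_day.items()}


def sticky_users(runs, min_days=5):
    """Users who ran the tool on at least `min_days` distinct days.

    Assumes `runs` has already been limited to a single month.
    """
    days_by_user = defaultdict(set)
    for run in runs:
        days_by_user[run.user].add(run.day)
    return {user for user, days in days_by_user.items() if len(days) >= min_days}


def failure_rate(runs):
    """Fraction of runs that exited with a nonzero code."""
    if not runs:
        return 0.0
    return sum(run.exit_code != 0 for run in runs) / len(runs)


def failures_before_each_success(runs):
    """For each user, count the failed runs preceding each successful run.

    Assumes `runs` is in chronological order.
    """
    pending = defaultdict(int)    # failures since the user's last success
    attempts = defaultdict(list)  # one entry per successful run
    for run in runs:
        if run.exit_code == 0:
            attempts[run.user].append(pending[run.user])
            pending[run.user] = 0
        else:
            pending[run.user] += 1
    return dict(attempts)
```

Note that `failure_rate` counts every nonzero exit as a failure, which is exactly the limitation described next.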
We hit a snag, though, as soon as we started trying to set SLOs on failure rate.
When CL tools finish, they report a numerical status called an exit code. The number 0 indicates success and any other number indicates failure. Most CL tools just exit with 0 or 1. But if a tool prints an error message saying that a user-supplied argument was invalid, or that the user isn’t authorized to use a feature, should we say that the tool succeeded or failed? Exiting “successfully” would be dishonest to scripts composed with that tool. But these “failures” aren’t bugs either. We needed more nuance. We especially wanted to separate user errors from software failures in order to set an SLO on the latter.
The system call exit accepts values between 0 and 255. Values above 128 are reserved for signals. (When a program is terminated by a signal, its exit code is 128 + the signal's numeric value. When you terminate a program with Ctrl+C, for example, you send it the signal SIGINT, whose value is 2, and the program exits with 130.) Bash reserves a few values (2 and 126–128), and sysexits.h defines 15 exit codes extracted from sendmail that have been reused elsewhere.
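The 128 + signal convention can trip up anything that collects exit codes outside a shell. Python's `subprocess` module, for instance, reports a child killed by signal N as a return code of -N rather than 128 + N. The helper below is a small sketch of normalizing that value to the shell-style number; the function name is hypothetical.

```python
import signal


def shell_style_status(returncode: int) -> int:
    """Convert a subprocess return code to the status a shell would report.

    subprocess reports termination by signal N as -N (POSIX only);
    shells report the same event as 128 + N.
    """
    if returncode < 0:
        return 128 + (-returncode)
    return returncode


# A child interrupted with Ctrl+C (SIGINT, value 2) maps to 130.
interrupted = shell_style_status(-signal.SIGINT)
```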
Several of the sysexits.h codes are relevant to us and broadly applicable to CL tools (like Usage Error (64), Internal Software Error (70), and Permission Denied (77)). Others are over-fit to sendmail (like No User (67) and No Host (68)). Also, sysexits.h interleaves user errors with software errors. Because we wanted to easily separate the two, we started a new list.
We took inspiration from HTTP’s Status Codes and defined exit codes in two unreserved ranges: 80–99 for user errors and 100–119 for software errors. Also following HTTP’s example, we allowed the first code in each range to be a catch-all for its range. Here are the codes we’ve established so far:
| Exit Code | Name | Meaning | Examples |
| --- | --- | --- | --- |
| 0 | OK | The tool exited successfully. | |
| 1 | Not OK | The tool exited unsuccessfully but we have no insight into the error; we can’t even say whether it was a user error or software error. | |
| 80 | Usage Error | The tool was used incorrectly. | A required argument was omitted or an invalid value was supplied for a flag. |
| 81 | Unknown Subcommand | An unrecognized subcommand was invoked. | As in git colne: git exists but colne does not. |
| 82 | Requirement Not Met | A prerequisite of the tool wasn’t met. | The tool must be used on a minimum version of the user’s OS. |
| 83 | Forbidden | The user isn't authorized to perform the requested action. | |
| 84 | Moved Permanently | The tool has been migrated to a new location. | The tool “foo” was replaced with “bar”. Now running foo tells you to run bar instead. |
| 85–99 | | Reserved for future user error codes | |
| 100 | Internal Error | The tool failed because of a bug, or for any reason that wasn’t a user error. | |
| 101 | Unavailable | A service the tool depends on was not available. | A local daemon or remote service did not respond, a connection was closed unexpectedly, an HTTP service responded with 503. |
| 102–119 | | Reserved for future system error codes | |
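To show how a tool might use codes like these, here is a minimal Python sketch. Only the two ranges (80–99 and 100–119) come from the text above. The Python identifiers are hypothetical, the numbers within each range assume codes are allocated sequentially, and none of this describes the API of github.com/square/exit. The `greet` tool is invented for the example.

```python
import sys
from enum import IntEnum


class ExitCode(IntEnum):
    """Semantic exit codes (names and in-range values are assumptions)."""
    OK = 0
    NOT_OK = 1
    # User errors: 80-99
    USAGE_ERROR = 80
    UNKNOWN_SUBCOMMAND = 81
    REQUIREMENT_NOT_MET = 82
    FORBIDDEN = 83
    MOVED_PERMANENTLY = 84
    # Software errors: 100-119
    INTERNAL_ERROR = 100
    UNAVAILABLE = 101


def is_user_error(code: int) -> bool:
    """True if the code falls in the user-error range."""
    return 80 <= code <= 99


def is_software_error(code: int) -> bool:
    """True if the code falls in the software-error range."""
    return 100 <= code <= 119


def main(argv):
    """A toy 'greet' tool that returns its exit code instead of exiting."""
    if len(argv) < 2:
        print("usage: greet NAME", file=sys.stderr)
        return ExitCode.USAGE_ERROR
    print(f"hello, {argv[1]}")
    return ExitCode.OK
```

A real tool would pass the result to the process exit, for example `sys.exit(int(main(sys.argv)))`. Because the check is a range comparison, codes added later in either range are classified without changing the helpers.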
These error ranges enabled us to separate user errors from software errors and to establish SLOs on our most critical tools. Querying our metrics for Unknown Subcommand has helped us to spot common typos and desire lines. And when we stop seeing Moved Permanently errors after a migration, we know we can finally clean up the breadcrumbs we left.
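A minimal sketch of both kinds of query, in Python, might look like the following. The 1% threshold, the value 81 for Unknown Subcommand, and the idea that the attempted subcommand is logged alongside the exit code are all assumptions made for illustration.

```python
from collections import Counter

SOFTWARE_ERRORS = range(100, 120)  # 100-119, per the ranges above
UNKNOWN_SUBCOMMAND = 81            # assumed value within the user-error range


def software_failure_rate(exit_codes):
    """Fraction of runs that ended in the software-error range."""
    codes = list(exit_codes)
    if not codes:
        return 0.0
    return sum(code in SOFTWARE_ERRORS for code in codes) / len(codes)


def meets_slo(exit_codes, max_failure_rate=0.01):
    """Check runs against a hypothetical SLO on software failures only."""
    return software_failure_rate(exit_codes) <= max_failure_rate


def common_unknown_subcommands(runs, top=3):
    """Tally the subcommands most often rejected as unknown.

    `runs` is an iterable of (exit_code, attempted_subcommand) pairs.
    """
    tally = Counter(sub for code, sub in runs if code == UNKNOWN_SUBCOMMAND)
    return tally.most_common(top)
```

User errors such as 80 or 81 still count as runs in the denominator, but they never count against the SLO.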
After copy-pasting these values into several repos, we extracted and open-sourced them. github.com/square/exit implements the exit codes and several helper functions in Go. We plan to add implementations in other languages and to extend the two ranges as use cases arise. Contributions are welcome!
What kind of visibility do you have into your CL tools? Would adopting semantic exit codes help?