Incident Summary: 2023-09-07
A timeline of the events of the outage and steps for remediation
Last week, Square experienced a multi-hour outage across our services. We understand that you rely on our systems to power your business and that’s a responsibility we take seriously. We apologize for letting you down and for the length of time it took for us to get our systems back up and running.
Beginning at 1:54 PM ET on September 7, 2023, Square products and services were unavailable. At 2:05 AM ET on September 8, systems began to recover, and merchants were able to access restored payment services by 5:19 AM ET. For sellers on a supported configuration that utilized offline mode, Square completed processing offline payments by 1:57 PM ET on September 8 or, if the device came online at a later time, shortly after the device came online. Square Online websites remained available; however, Square Online customers were unable to process payments during the outage.
As we previously shared, this outage was caused by the failure of a key part of our infrastructure: our DNS servers. Now that we’ve completed a root cause analysis, we want to share an overview of the incident and steps for remediation.
We’re going to start with an overview of how Square’s systems work together. Square operates in multiple data center regions. Square services use DNS and mesh-based routing infrastructure to find service dependencies and serve requests. Without DNS, Square products, internal tools, and services cannot communicate, which results in service disruption. In this incident, an unrelated change to our host-based firewalls, combined with a DNS service upgrade, put unexpected load on our internal DNS servers and caused them to fail. Once node-based DNS caches expired, services could no longer communicate with their dependencies, which caused external requests to fail.
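To illustrate why the impact appears only once cached answers expire, here is a minimal sketch of a node-local cache with a time-to-live (TTL). It is not Square's implementation: the class name `TtlCache`, the 30-second TTL, the hostname, and the address are all hypothetical and chosen only for the example.

```python
from typing import Callable, Dict, List, Tuple


class ResolverDown(Exception):
    """Raised when the (simulated) upstream DNS server cannot answer."""


class TtlCache:
    """Toy node-local DNS cache: an answer is reused until its TTL runs out."""

    def __init__(self, upstream: Callable[[str], str], ttl: float,
                 clock: Callable[[], float]) -> None:
        self._upstream = upstream
        self._ttl = ttl
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}  # name -> (address, expiry)

    def resolve(self, name: str) -> str:
        now = self._clock()
        hit = self._entries.get(name)
        if hit is not None and now < hit[1]:
            return hit[0]  # still fresh: no upstream query needed
        address = self._upstream(name)  # may raise ResolverDown
        self._entries[name] = (address, now + self._ttl)
        return address


def simulate() -> List[Tuple[int, str]]:
    """Upstream DNS fails at t=10, but lookups only start failing after t=30."""
    now = [0.0]
    healthy = [True]

    def upstream(name: str) -> str:
        if not healthy[0]:
            raise ResolverDown(name)
        return "10.0.0.7"  # hypothetical address

    cache = TtlCache(upstream, ttl=30.0, clock=lambda: now[0])
    results = []
    for t, up in [(0, True), (10, False), (29, False), (31, False)]:
        now[0], healthy[0] = float(t), up
        try:
            results.append((t, cache.resolve("payments.internal.example")))
        except ResolverDown:
            results.append((t, "FAILED"))
    return results


if __name__ == "__main__":
    for t, outcome in simulate():
        print(f"t={t:>2}s -> {outcome}")
```

In this sketch the upstream server is already down at t=10, yet lookups keep succeeding from the cache until the entry expires at t=30. The lookup at t=31 is the first one to fail.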
Square’s host-based firewall policy is managed by a central service that pushes firewall policies to nodes in Square data centers, which then expand the policy into firewall rules. This service uses an accelerated rollout strategy to quickly adapt to changing environment state. In this case, however, a small policy change expanded into a much larger ruleset. The large ruleset caused node instability and, when combined with the traffic pattern of DNS, caused DNS to start failing requests.
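The way a small policy change can expand into a large ruleset can be sketched as follows. This is an illustration under assumed conditions, not Square's policy format: the group names, host counts, ports, and the `expand_policy` function are hypothetical. The sketch assumes a policy lists pairs of host groups that may talk to each other, and that expansion emits one rule per host pair and port.

```python
from itertools import product
from typing import Dict, List, Tuple

Rule = Tuple[str, str, int]  # (source host, destination host, port)


def expand_policy(
    allowed_pairs: List[Tuple[str, str]],
    ports: List[int],
    inventory: Dict[str, List[str]],
) -> List[Rule]:
    """Expand group-level allow entries into one rule per host pair and port."""
    rules: List[Rule] = []
    for src_group, dst_group in allowed_pairs:
        for src, dst in product(inventory[src_group], inventory[dst_group]):
            if src == dst:
                continue  # a host needs no rule to reach itself
            rules.extend((src, dst, port) for port in ports)
    return rules


# Hypothetical inventory: two regions with 50 hosts each.
INVENTORY = {
    "region-a": [f"a{i}" for i in range(50)],
    "region-b": [f"b{i}" for i in range(50)],
}
PORTS = [53, 443]

POLICY_BEFORE = [("region-a", "region-a"), ("region-b", "region-b")]
# The "small" change: two extra entries allowing cross-region traffic.
POLICY_AFTER = POLICY_BEFORE + [("region-a", "region-b"), ("region-b", "region-a")]

if __name__ == "__main__":
    before = len(expand_policy(POLICY_BEFORE, PORTS, INVENTORY))
    after = len(expand_policy(POLICY_AFTER, PORTS, INVENTORY))
    print(f"rules before: {before}, rules after: {after}")
```

With these made-up numbers, two added policy entries grow the expanded ruleset from 9,800 to 19,800 rules. The growth scales with the product of the group sizes, not with the size of the policy diff, so the diff alone gives little indication of the expanded result.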
Square uses a microservices environment for the services that handle external requests, along with many internal systems that manage those services. In this case, many of the services used for troubleshooting and recovery were also impacted, which extended the outage.
Based on a forensic analysis of the incident, we’ve ruled out a cyberattack as the cause of this incident, and there’s no evidence of a data breach or loss.
September 7, 2023
- 11:04 AM ET - Host-based firewall rule change deployed to enable region communication, increasing on-node firewall rule size.
- 1:56 PM ET - DNS zone change.
- 2:02 PM ET - Engineers are notified of infrastructure issues and incident response begins, starting with a DNS investigation.
- 2:47 PM ET - issquareup.com incident created.
- 2:52 PM ET - Work begins to recover internal access and tooling.
- 3:56 PM ET - Shed networking traffic to our DNS servers. Started manual work to bring up new DNS servers.
- 6:00 PM ET - DNS service capacity is increased, but this does not help. Started manual deployment of networking changes to re-enable our authorization and access services.
- 6:29 PM ET - Internal access services recover. This allows engineers to start working in parallel to recover the authorization and control plane services.
- 7:00 PM ET - Started manual deployment of networking changes to all data centers.
- 8:36 PM ET - Square deployment pipeline recovers.
- 10:06 PM ET - Rebuild of our DNS servers.
- 11:52 PM ET - New configuration based on reverted ruleset is built and configuration begins to be pushed to DNS hosts.
September 8, 2023
- 12:06 AM ET - Some DNS hosts are healthy and more internal tooling recovers.
- 12:55 AM ET - All DNS servers are healthy.
- 1:30 AM ET - Partial recovery of internal service to service connectivity. Partial recovery of our edge routing infrastructure.
- 2:05 AM ET - Some Square systems begin recovery.
- 2:40 AM ET - Payment traffic has fully recovered.
- 3:12 AM ET - Edge routing infrastructure fully recovered.
- 4:18 AM ET - Majority of Square products and services are recovered. issquareup.com incident is updated to note that we’ve implemented a series of fixes.
- 5:19 AM ET - issquareup.com incident is resolved.
- 6:59 AM ET - Additional DNS capacity is added.
- 9:52 AM ET - Background processing of offline payments begins.
- 1:57 PM ET - Uploaded offline payments have been fully processed.
The incident has highlighted a number of opportunities to improve our infrastructure, and we're working on making these changes, which are designed to prevent future incidents:
- Transitioning our DNS infrastructure to isolated infrastructure.
- Additional monitoring and optimizations for critical networking infrastructure.
- Optimizing dependencies between our deploy and platform infrastructure where feasible.
Many sellers utilized Offline Mode in order to continue accepting payments. As a precautionary measure, we deferred processing offline payments for a number of hours. We are expanding support for and improving our communication regarding the availability of Offline Mode.
We apologize for the disruption our outage might have created for you, your customers, and your employees. We know this situation was made more difficult by our communication frequency and the delayed support response some of you experienced. We will learn from this event and improve our systems and processes.
We appreciate your business and we are committed to doing better to regain your trust.