Adopting AWS VPC Endpoints at Square

Secure communication between data centers and the cloud

In this post, we share our experiences with adopting AWS VPC Endpoints at Square. We want strong security guarantees in our communication with managed AWS services, and to that end we designed a solution that leverages VPC Endpoints with IAM policies. In a later section, we also highlight some of the issues we faced in our setup and usage of these endpoints.

Background

As applications at Square increasingly embrace the cloud, teams are extending their existing applications with fully managed AWS services such as SQS. Previously, applications connected to these managed services through the public API endpoints offered by AWS. For instance, an SQS queue in the us-west-2 region can be accessed using the public endpoint https://sqs.us-west-2.amazonaws.com. The applications themselves normally run in one of Square’s data centers (DCs) or in one of the AWS Shared VPCs that are connected to the DC through AWS Direct Connect. At Square we use Shared VPCs because they provide a clear separation of duties and minimize VPC and network management costs. The rest of this writeup, however, applies to any AWS VPC setup that is connected to the DC through Direct Connect.

For security purposes, hosts in the DC and Shared VPC do not have direct internet access and we want to audit any public endpoints reachable from these hosts. So, to access the public AWS endpoints, teams need to proxy requests through an Egress Squid Proxy installation in the DC called CloudProxy. To use CloudProxy, teams need to submit an integration whitelist request for each endpoint and get it approved (for details on how these integration request approvals work, look out for this upcoming talk by Sarah Harvey).

Figure: Applications from Square DC and Shared VPC accessing AWS Managed Services such as SQS through CloudProxy.

While accessing AWS managed endpoints through CloudProxy works, the approach has some issues. CloudProxy is a potential point of failure in our communication with AWS. While it is mostly stable, it has been blamed in the past for slow responses from managed AWS services, and we would ideally like to move to a solution managed by AWS. More importantly, CloudProxy cannot prevent data exfiltration to AWS accounts that Square does not own.

Problem

Public API endpoints offered by AWS (such as sqs.us-west-2.amazonaws.com) are not restricted to AWS resources belonging to Square AWS accounts. This means that data from production boxes in the DC or Shared VPC can be exfiltrated to non-Square AWS accounts through the endpoints whitelisted in CloudProxy. For example, an outsider with access to one of our hosts could use the application’s credentials, or their own, to write sensitive data to an SQS queue that they own. Additionally, AWS requests through CloudProxy cannot be restricted to requests signed by an IAM Principal belonging to a Square AWS account. Ideally, we want our proxy to have the following properties:

  • It should only be able to communicate with resources in Square owned AWS accounts.
  • It should only pass through requests that are signed by AWS Principals (IAM roles, IAM users, etc.) that belong to a Square owned AWS account.
  • It should be accessible via our internal network.

Solution - VPC Endpoints

To solve the problems listed above, we use VPC Endpoints with IAM policies to communicate with supported AWS services.

Not all AWS services have VPC Endpoints, and even among those that do, not all support setting IAM policies. As a result, we restricted our initial launch to VPC Endpoints for the following services:

  • DynamoDB
  • SQS
  • STS
  • SNS
  • Secrets Manager
  • Kinesis Firehose
  • Kinesis Streams

Here is an example of an IAM policy on an SQS VPC Endpoint that can give us the aforementioned guarantees.

{
    "Statement": [
        {
            "Action": "sqs:*",
            "Effect": "Allow",
            "Resource": [
                "*:*:*:*:<SQUARE_AWS_ACCOUNT_1_ID>:*",
                "*:*:*:*:<SQUARE_AWS_ACCOUNT_2_ID>:*",
                "*:*:*:*:<SQUARE_AWS_ACCOUNT_n_ID>:*"
            ],
            "Principal": "*",
            "Condition": {
                "StringEquals": {
                    "aws:PrincipalOrgID": "<SQUARE_AWS_ORGANIZATION_ID>"
                }
            }
        }
    ]
}

Some key takeaways from this IAM policy:

  • Action: Only sqs actions are allowed. This is somewhat redundant, since only SQS resources are accessible via SQS VPC Endpoints, and the action could just as well have been *:*.
  • Resource: The Resource section lists the AWS accounts that are reachable from this VPC Endpoint. Here, we enumerate the Square AWS accounts whose SQS resources we want to make reachable. This guarantees that calls made through the VPC Endpoint cannot reach a resource outside of a Square AWS account.
  • Condition: The policy includes the condition context key aws:PrincipalOrgID, which is set to Square’s AWS Organization ID. Thus, only requests signed by IAM Principals that belong to Square’s AWS Organization can go through these endpoints.
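A policy of this shape is easy to generate from a list of account IDs. The following Ruby sketch is illustrative only: the function name build_endpoint_policy, the placeholder account IDs, and the organization ID are assumptions, not part of Square's tooling.

```ruby
require 'json'

# Builds a VPC Endpoint policy document with the same shape as the example above.
# `service` is the IAM action prefix (for example "sqs"), `account_ids` lists the
# accounts whose resources should be reachable, `org_id` is the AWS Organization ID.
# The function name and arguments are hypothetical.
def build_endpoint_policy(service:, account_ids:, org_id:)
  {
    'Statement' => [
      {
        'Action' => "#{service}:*",
        'Effect' => 'Allow',
        # One wildcard ARN pattern per allowed account; deduplicated and sorted
        # so that the generated document is stable between runs.
        'Resource' => account_ids.uniq.sort.map { |id| "*:*:*:*:#{id}:*" },
        'Principal' => '*',
        'Condition' => {
          'StringEquals' => { 'aws:PrincipalOrgID' => org_id }
        }
      }
    ]
  }
end

policy = build_endpoint_policy(
  service: 'sqs',
  account_ids: %w[111122223333 444455556666], # placeholder account IDs
  org_id: 'o-exampleorgid'                    # placeholder organization ID
)
puts JSON.pretty_generate(policy)
```

Generating the document with a JSON library, rather than by string concatenation, also avoids syntax slips such as a trailing comma after the last Resource entry.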

For the VPC Endpoints to be accessible from both the DC and the Shared VPC, we provisioned them in the Shared VPC. We also turned on the Private DNS setting for the VPC Endpoints. This means that requests to the public AWS API hostnames from within the Shared VPC resolve to the VPC Endpoints through Route53 records.
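One way to sanity-check that Private DNS is in effect on a host is to confirm that the public AWS hostname resolves only to private addresses. The Ruby sketch below is an assumption about how such a check could look, not something described in our setup; the function name resolves_privately? is hypothetical, and it assumes the VPC uses standard private address ranges.

```ruby
require 'ipaddr'
require 'resolv'

# Returns true when `hostname` resolves and every returned address is in a
# private range. `resolver` only needs a `getaddresses(name)` method; it
# defaults to Ruby's standard Resolv module. (Hypothetical helper.)
def resolves_privately?(hostname, resolver: Resolv)
  addresses = resolver.getaddresses(hostname)
  !addresses.empty? && addresses.all? { |address| IPAddr.new(address).private? }
end

# Example (performs a real DNS lookup, so it is left commented out):
# puts resolves_privately?('sqs.us-west-2.amazonaws.com')
```

Run from a host inside the Shared VPC, the check would be expected to pass for services that have an endpoint with Private DNS enabled, and to fail from a host that still resolves the public addresses.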

Figure: Applications from Square DC and Shared VPC accessing AWS Managed Services such as SQS through VPC Endpoints.

We additionally built a section (screenshot shown below) in our in-house Cloud self-service tool (called Cloud Portal), where AWS account owners can have their AWS account resources opted in to be reachable through these VPC Endpoints. This allows us to have a definitive snapshot of which teams are using VPC Endpoints. We have a cron job that runs every 15 minutes and updates the IAM policies on the VPC endpoints, based on which accounts need to be whitelisted.
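The core of such a periodic job can be sketched as follows. This is a minimal sketch, not our actual cron job: sync_endpoint_policy is a hypothetical name, and the `ec2` argument is assumed to behave like Aws::EC2::Client from the AWS SDK for Ruby, whose describe_vpc_endpoints and modify_vpc_endpoint calls are used here.

```ruby
require 'json'

# Pushes `desired_policy` (a Hash) to the endpoint unless the endpoint already
# carries an equal document. Returns true when an update was sent.
# `ec2` is assumed to respond like Aws::EC2::Client.
def sync_endpoint_policy(ec2, vpc_endpoint_id:, desired_policy:)
  response = ec2.describe_vpc_endpoints(vpc_endpoint_ids: [vpc_endpoint_id])
  current = JSON.parse(response.vpc_endpoints.first.policy_document)
  return false if current == desired_policy

  ec2.modify_vpc_endpoint(
    vpc_endpoint_id: vpc_endpoint_id,
    policy_document: JSON.generate(desired_policy)
  )
  true
end

# Example (needs the aws-sdk-ec2 gem and credentials, so it is commented out):
# ec2 = Aws::EC2::Client.new(region: 'us-west-2')
# sync_endpoint_policy(ec2, vpc_endpoint_id: 'vpce-...', desired_policy: policy)
```

Comparing parsed documents rather than raw strings ignores whitespace differences. AWS may still normalize the stored policy in other ways, so a real job might need a looser comparison to avoid redundant updates.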

Figure: Section in our internal Cloud Portal tool where users can enable their resources to be reachable from VPC Endpoints.

Issues we ran into

In this section, we share issues that we faced with the set up and usage of VPC Endpoints with IAM policies.

Limited Service Support

As mentioned previously, not all AWS services support VPC Endpoints, and amongst those that do, not all support setting IAM policies on them. The list of services supported is indicated in this AWS page.

Many AWS Actions do not support setting Account ARNs in Resources

Many AWS resource types do not support Account IDs in their ARN. As a result, many AWS actions cannot be performed through our VPC Endpoints since we restrict resources in the IAM policy based on Account IDs.

This was the biggest drawback of using VPC Endpoint policies. Even though S3 is the most widely used AWS service at Square, we could not support S3 VPC Endpoints with an IAM policy because of it. The S3 Actions and Resources page documents that the most common S3 resources, bucket and object, do not have account IDs in their ARNs. As a result, most S3 calls would fail with an IAM authorization error under our VPC Endpoints setup.
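The effect can be illustrated with a simplified model of wildcard matching. The Ruby sketch below is not how IAM evaluates policies internally: it only handles `*`, and the account ID, queue name, and bucket name are placeholders.

```ruby
# Converts an IAM-style wildcard pattern into an anchored Regexp.
# Simplification: only `*` is handled here.
def arn_pattern_to_regexp(pattern)
  body = pattern.split('*', -1).map { |part| Regexp.escape(part) }.join('.*')
  Regexp.new("\\A#{body}\\z")
end

# True when the ARN matches at least one of the Resource patterns.
def resource_allowed?(arn, patterns)
  patterns.any? { |pattern| arn_pattern_to_regexp(pattern).match?(arn) }
end

patterns = ['*:*:*:*:111122223333:*'] # placeholder account ID

sqs_arn = 'arn:aws:sqs:us-west-2:111122223333:orders' # account ID in the fifth field
s3_arn  = 'arn:aws:s3:::example-bucket/report.csv'    # region and account fields are empty

puts resource_allowed?(sqs_arn, patterns) # true
puts resource_allowed?(s3_arn, patterns)  # false
```

Because the S3 object ARN carries no account ID, no account-based Resource pattern can match it, regardless of which account owns the bucket.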

VPC Endpoint Policy character limit

There is a limit of 20,480 characters on VPC Endpoint Policies. While this may suffice for most use cases, at Square we currently have close to 200 AWS accounts and expect to add hundreds more as more teams build in the cloud. We calculated that, with our policy specification, we cannot list more than approximately 800 AWS accounts. To monitor this, we graph the VPC Endpoint Policy text sizes (in number of characters) and have an alarm set to fire if any of them reaches 10,000 characters.
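The estimate of roughly 800 accounts can be reproduced with a short calculation. The Ruby sketch below assumes that the limit applies to the minified JSON text, uses generated 12-digit placeholder account IDs, and uses a placeholder organization ID, so the exact result may differ slightly from a real policy.

```ruby
require 'json'

POLICY_CHAR_LIMIT = 20_480 # limit quoted in the text above

# Length in characters of a minified policy listing `account_count` accounts.
def policy_size(account_count)
  # Placeholder IDs; real AWS account IDs are also 12 digits long.
  ids = (1..account_count).map { |i| format('%012d', i) }
  policy = {
    'Statement' => [{
      'Action' => 'sqs:*',
      'Effect' => 'Allow',
      'Resource' => ids.map { |id| "*:*:*:*:#{id}:*" },
      'Principal' => '*',
      'Condition' => { 'StringEquals' => { 'aws:PrincipalOrgID' => 'o-exampleorgid' } }
    }]
  }
  JSON.generate(policy).length
end

# Largest number of accounts whose policy still fits within the limit.
max_accounts = (1..2000).take_while { |n| policy_size(n) <= POLICY_CHAR_LIMIT }.last
puts "#{max_accounts} accounts fit in #{POLICY_CHAR_LIMIT} characters"
```

Each additional account costs about 25 characters (the quoted pattern plus a comma), which is why the ceiling lands near 20,480 / 25.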

If AWS offered an aws:ResourceOrgID IAM condition context key, similar to the aws:PrincipalOrgID key, we would not have to manually list AWS accounts in the Resource section.

Figure: Graph showing the VPC Endpoint Policy text sizes. We have an alarm set up for when a policy text size reaches 10K characters (the limit is 20,480 characters).

Specifying Custom Endpoint for AWS SDK calls from the DC

While the Private DNS setting automatically routes SDK calls through the VPC Endpoints (using Route53 entries), this only applies if the request originates from the same VPC in which the VPC Endpoints are located. Applications in the DC need to explicitly specify a custom endpoint, pointing to the appropriate VPC Endpoint, in their AWS SDK client initialization code. The following Ruby example sets this custom endpoint.

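# aws-sdk-sqs provides Aws::SQS::Client (assumes version 3 of the AWS SDK for Ruby).
require 'aws-sdk-sqs'
# sq-aws is a custom internal gem that returns the appropriate VPC Endpoint URL.
require 'sq-aws'

queue_url = "....."
region = 'us-west-2'

# Point the client at the VPC Endpoint instead of the public SQS endpoint.
sqs_client = Aws::SQS::Client.new(
  region: region,
  endpoint: Sq::Aws.vpc_endpoint_for(environment: 'production', service: 'sqs', region: region),
)

# Fetch all attributes of the queue through the VPC Endpoint.
sqs_client.get_queue_attributes({
  queue_url: queue_url,
  attribute_names: ["All"],
})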

We created helper libraries in Ruby, Java and Go that return the appropriate VPC Endpoint for users.
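Such a helper can be as simple as a lookup table. The following Ruby sketch is hypothetical and does not show the internals of our libraries: the table contents, the endpoint hostname, and the error handling are all assumptions made for illustration.

```ruby
# Hypothetical lookup table: [environment, service, region] => VPC Endpoint DNS name.
# The hostname below is a made-up placeholder in the usual vpce-... format.
VPC_ENDPOINT_HOSTS = {
  ['production', 'sqs', 'us-west-2'] =>
    'vpce-0123456789abcdef0-abcd1234.sqs.us-west-2.vpce.amazonaws.com'
}.freeze

# Returns the HTTPS URL of the VPC Endpoint for the given combination, or
# raises ArgumentError when no endpoint is registered for it.
def vpc_endpoint_for(environment:, service:, region:)
  host = VPC_ENDPOINT_HOSTS.fetch([environment, service, region]) do
    raise ArgumentError, "no VPC Endpoint registered for #{service} in #{region} (#{environment})"
  end
  "https://#{host}"
end

puts vpc_endpoint_for(environment: 'production', service: 'sqs', region: 'us-west-2')
```

Failing loudly on an unknown combination, instead of falling back to the public endpoint, keeps a misconfigured application from silently bypassing the VPC Endpoint.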

Accessing VPC Gateway Endpoints from the DC

As indicated in the VPC Endpoints page, S3 and DynamoDB Endpoints are a special kind - called Gateway Endpoints. Gateway Endpoints do not actually launch an addressable Network Interface in the VPC, but they have to be specified as a target for a route in the VPC’s Route Table. So the only way to access S3 and DynamoDB through VPC Endpoints from the DC would be to first funnel requests into the Shared VPC by using a Proxy that lives in the Shared VPC itself. Since we do not want to maintain another Proxy service, we decided not to offer support for Gateway Endpoints from the DC.

VPC Endpoints in a Shared VPC

Square uses a Shared VPC model, and our Network Engineering team manages and shares everyone’s VPC resources and subnets from a centrally managed AWS account. Initially, we wanted to create the VPC Endpoints in a separate AWS account (which has the shared subnets) owned by our team. However, we discovered that VPC Endpoints can only be created in the master account of the Shared VPC. As a result, we now have to manage the VPC Endpoints in an AWS account belonging to another team, which is not ideal from a separation-of-concerns perspective.

SQS AWS SDK Custom Endpoint Bug in Java and Ruby

While using the SQS AWS SDK, we discovered that the Ruby and Java SDKs include a plugin that, by default, overwrites the custom endpoint URL specified in the client whenever an action specifies the queue URL of the SQS queue. We submitted issues to the GitHub repos of both the Ruby and Java SDKs. While the Ruby issue has been fixed, the Java issue has not been addressed yet.

Conclusion

AWS VPC Endpoints with IAM policies provide the guarantees we are looking for in our communication with managed AWS services. However, some of the issues we highlighted limit the range of AWS services for which we can use VPC Endpoints.

We have shared this feedback with our AWS support team. We hope that as these issues get resolved in the future we can completely replace CloudProxy with VPC Endpoints for communicating with managed AWS services.