How we reduced our AWS bill by seven figures

By Ben Whaley

Chime handles a lot of data — petabytes of data — on behalf of our members. We use the data to make decisions about things like fraud detection and prevention, bill processing, and SpotMe overdraft protection, to name just a few use cases.

Part of helping to ensure that our members have financial peace of mind entails being good stewards of company resources. When we noticed that our data transfer bills were climbing, we decided to do something about it. We also decided to open source our solution so that anyone in a similar position can benefit.

Read on to learn more, or jump straight to the code and see for yourself.

Problem

Chime’s backend infrastructure runs primarily on Amazon Web Services, but we also rely on numerous third party vendors for specific data processing needs. As a result, we transfer large volumes of data back and forth between our primary network on AWS — our source of truth — and these third parties.

On AWS, NAT devices are required for accessing the Internet from private VPC subnets. Usually, the best option is a NAT Gateway, a fully managed NAT service. NAT Gateway pricing includes a charge of $0.045 per hour per NAT Gateway, plus $0.045 per GB processed. The hourly charge is negligible at about $32.40 per month (720 hours). However, the data processing costs can be extremely high for large traffic volumes.

In addition to the direct NAT Gateway charges, there are also Data Transfer charges for outbound traffic leaving AWS (known as egress traffic). The cost varies depending on destination and volume, ranging from $0.09 per GB down to $0.01 per GB. That’s right: data traversing the NAT Gateway is first charged for processing, then charged again as egress traffic. It’s a one-two punch.
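To make the two-part charge concrete, here is a minimal sketch of the cost model described above. The hourly and per-GB processing rates come from the text; the function name, the 720-hour month, and the flat `egress_rate_per_gb` parameter are assumptions for illustration. Real egress pricing is tiered, so this will not reproduce the AWS Pricing Calculator exactly.

```python
NAT_GATEWAY_HOURLY_USD = 0.045   # per NAT Gateway per hour, from the text above
NAT_GATEWAY_PER_GB_USD = 0.045   # data processing charge per GB, from the text above
HOURS_PER_MONTH = 720            # assumption: 30-day month, matching the ~$32.40 figure


def nat_gateway_monthly_cost(processed_gb, egress_gb, egress_rate_per_gb, gateways=1):
    """Return (hourly, processing, egress, total) in USD for one month.

    egress_rate_per_gb is a single flat rate here (a simplification);
    actual AWS egress pricing is a sliding scale that depends on volume.
    """
    hourly = NAT_GATEWAY_HOURLY_USD * HOURS_PER_MONTH * gateways
    # Every GB through the gateway, in either direction, is charged for processing.
    processing = NAT_GATEWAY_PER_GB_USD * processed_gb
    # Only the outbound portion is charged again as egress.
    egress = egress_rate_per_gb * egress_gb
    return hourly, processing, egress, hourly + processing + egress


if __name__ == "__main__":
    # Hypothetical workload: 2,000 GB processed, of which 1,000 GB is egress.
    hourly, processing, egress, total = nat_gateway_monthly_cost(2000, 1000, 0.09)
    print(f"hourly={hourly:.2f} processing={processing:.2f} "
          f"egress={egress:.2f} total={total:.2f}")
```

Note that the processing charge applies to all traffic through the gateway, while the egress charge applies only to the outbound share. That is why removing the processing charge matters so much at high volume.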

Plug the numbers into the AWS Pricing Calculator and you may well be flabbergasted. Let’s use a nice round number as an example: say, 2PB. The cost of processing 2PB (1PB ingress, 1PB egress) through NAT Gateway works out to an eye-popping $147,191 per month, or $1,766,292 per year.

In the long term, we have plans to manage the data more efficiently and ultimately reduce data transfer, but those ideas will take time to implement. We developed this NAT instance solution to lower our costs immediately.

NAT Gateways, meet NAT instances

NAT instances are a self-managed alternative to NAT Gateways. Unlike NAT Gateways, NAT instances do not incur data processing charges. With NAT instances, you pay for:

  1. The cost of the EC2 instances, which varies depending on the instance type used.
  2. Data transfer out of AWS (egress traffic, the same as with NAT Gateways).
  3. The operational expense of maintaining EC2 instances.

As with NAT Gateways, outbound data transfer is the most significant cost at large volumes. Outbound data transfer is priced on a sliding scale based on the amount of traffic. Inbound data transfer is free. It is this asymmetry that we leveraged to avoid the exorbitant data processing charges of NAT Gateway. We also used automation to limit the ongoing operational expenses of NAT instances, leading to a highly available, secure, performant, and substantially less expensive solution.

Consider the cost of transferring that same 2PB through a NAT instance. Remember, ingress is free, but egress is charged on a sliding scale. Once again using the handy pricing calculator, we find that this amounts to $54,690 per month, about 62.84% less than NAT Gateway, or $1,110,012 per year in savings. Not too shabby!
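The savings figures follow directly from the two monthly totals quoted above. A quick check of the arithmetic, using only those numbers (the variable names are ours):

```python
nat_gateway_monthly = 147_191   # USD per month via NAT Gateway, from the text
nat_instance_monthly = 54_690   # USD per month via NAT instances, from the text

monthly_savings = nat_gateway_monthly - nat_instance_monthly
annual_savings = monthly_savings * 12
percent_savings = round(100 * monthly_savings / nat_gateway_monthly, 2)

print(f"monthly savings: ${monthly_savings:,}")   # $92,501
print(f"annual savings:  ${annual_savings:,}")    # $1,110,012
print(f"percent savings: {percent_savings}%")     # 62.84%
```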

Without disclosing actual numbers, Chime was processing considerably more than 2PB through the NAT Gateway, and the savings were even greater. But users processing far less traffic (as little as 10TB) can also save on AWS bills by using this approach.

The devil is in the details

It isn’t enough to identify the potential for savings. We also needed to execute. After some discussion among the engineering team, we arrived at the design depicted below.

An AWS VPC configured with AlterNAT

The diagram shows an AWS VPC with public and private subnets in multiple availability zones. We created an Auto Scaling Group containing one NAT instance in each public subnet. We also created a NAT Gateway as a standby system in the event that a NAT instance fails.

We wanted the systems to be self-healing and self-maintaining, requiring little or no management by our team. To accomplish this, we wrote a Lambda function that, when invoked, can automatically replace the route to the Internet, swapping the NAT instance for the NAT Gateway. We use the function in response to two events:

  1. Every 14 days, each NAT instance is terminated and replaced with a fresh, newly patched instance. During the swap, an Auto Scaling lifecycle hook invokes the Lambda function, which updates the route table to the NAT Gateway, ensuring that clients in the private subnets have continuous access to the Internet. Auto Scaling then launches a new NAT instance, which automatically reclaims the route, thus completing the cycle.
  2. Every minute, in every private subnet, the Lambda function tests connectivity to the Internet through the NAT instance. If the request succeeds, the function exits, but if it fails, the NAT instance is presumably broken, and the route is updated to use the NAT Gateway.
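The connectivity check in the second event can be sketched roughly as follows. This is an illustration of the idea, not the code from our repository: the function names, the check URL, and the resource IDs are hypothetical. The EC2 client is passed in so the logic can be exercised without AWS access; in a real Lambda function it would typically be `boto3.client("ec2")`, whose `replace_route` call accepts the parameters shown.

```python
import urllib.request

DEFAULT_ROUTE = "0.0.0.0/0"


def internet_reachable(url, timeout_seconds=5):
    """Return True if a request to url completes within the timeout.

    Any failure (DNS, timeout, connection refused, HTTP error status)
    is treated as "not reachable" in this simplified sketch.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds):
            return True
    except OSError:  # URLError, HTTPError, and socket timeouts are all OSError subclasses
        return False


def fail_over_to_nat_gateway(ec2_client, route_table_id, nat_gateway_id):
    """Point the default route of a private route table at the standby NAT Gateway."""
    ec2_client.replace_route(
        RouteTableId=route_table_id,
        DestinationCidrBlock=DEFAULT_ROUTE,
        NatGatewayId=nat_gateway_id,
    )


def check_and_fail_over(ec2_client, route_table_id, nat_gateway_id, check_url,
                        is_reachable=internet_reachable):
    """Run one connectivity check; return True if the route was switched."""
    if is_reachable(check_url):
        return False  # NAT instance is healthy, nothing to do
    fail_over_to_nat_gateway(ec2_client, route_table_id, nat_gateway_id)
    return True
```

Because the function runs inside the private subnet, its outbound request follows that subnet's current default route, so a failed request is a reasonable signal that the NAT instance is unhealthy.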

One nuance of this approach is that the route replacement functionality depends on a connection to the EC2 and Lambda APIs. If a NAT instance is down, the function doesn’t have a route to the Internet. To mitigate this, we use interface VPC endpoints to allow the function to access those APIs.

A drawback of this solution is that established connections are lost during the automated route changes. This tradeoff is acceptable for our use case. During testing, we found that our jobs are resilient to connection loss errors so long as connectivity to the Internet is always available. However, this may not be acceptable for some use cases. One mitigation for those use cases could be scheduled maintenance windows for NAT instance patching.
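What "resilient to connection loss" can look like on the client side is sketched below: a retry wrapper with exponential backoff around an operation that may lose its connection during a route change. This is a generic pattern, not a description of how our jobs are actually implemented; the function name and default values are assumptions.

```python
import time


def with_retries(operation, attempts=5, base_delay_seconds=1.0, sleep=time.sleep):
    """Call operation(), retrying on dropped connections with exponential backoff.

    Retries only on ConnectionError (reset, aborted, refused); other
    exceptions propagate immediately. The last failure is re-raised.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            # Wait 1s, 2s, 4s, ... before reconnecting through the new route.
            sleep(base_delay_seconds * 2 ** attempt)
```

A job would wrap each transfer step, for example `with_retries(lambda: upload_chunk(chunk))`, where `upload_chunk` is a hypothetical function that opens a fresh connection on each call.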

Alternatives

NAT instances weren’t our first choice. Although the solution we built is reasonably low maintenance and robust, it is ultimately still self-managed.

Our preferred solution would have been to use AWS PrivateLink to connect with our third party vendors. Although the pricing is effectively the same, PrivateLink is a fully managed option without the connection loss drawback above, and without the cost of EC2 instances. Our vendor does offer PrivateLink, but they have chosen to monetize it, charging so much for access to the feature that it was not a viable option.

In the future, we’ll refactor our data management systems to be more efficient and less reliant upon data transfer. This addresses the root cause of the expense as opposed to the symptoms, so it feels like the best direction for us in the long run. Perhaps it’s the topic of a future blog post.

Final words

This was an interesting project to build. NAT instances are considered a legacy technology at AWS and have largely been ignored since the release of NAT Gateways. Many of the features this solution relies upon, such as VPC endpoints, the latest generation of network-optimized instance types, maximum instance lifetime, termination lifecycle hooks, and Lambda functions, were released long after NAT instances were considered a legacy option. We were able to use these more recent AWS features to breathe new life into an older technology.

We also developed a begrudging respect for the NAT Gateway. It really is quite the engineering feat. It can handle up to 100Gbps, and, in our experience, it has not failed. The implementation details of NAT Gateway are opaque. Theoretically, AWS must update and maintain the underlying NAT Gateway, but they manage to do so without ever losing the NAT translation table or dropping traffic. Our hats are off to the engineering team behind NAT Gateway.

Our hope is that one day AWS will lower the NAT Gateway data processing fees to make this project irrelevant. In the meantime, we are excited to open source our high availability NAT instance solution to the AWS community. If you’re interested, head over to the GitHub repository to check it out.