At Chime, we're committed to GitOps, and all of our application and infrastructure changes are carried out with reviewed and approved pull requests. We're also committed to safety: every change has the potential for unanticipated consequences, and we need to reduce the likelihood of an availability incident when making changes.
Our commitment as the Infrastructure Engineering team at Chime to prevent outages is deeply rooted in our member-obsessed culture. We recognize that our work has a direct impact on the experiences of our members, and we strive to ensure that our infrastructure is reliable and stable.
In this article, we'll share a tool we've built to easily preview changes to our Kubernetes (k8s) resources before they are applied.
This tool enables Chime to merge changes confidently, resulting in an average of a thousand deployments a day.
Making Changes to K8s Resources in the Dark
All of Chime's k8s resources are managed via Helm charts that are applied by Argo CD on our EKS clusters. The Helm charts are centralized in a dedicated repository. To make changes to services, Chime engineers find themselves editing both charts and override files, and submitting PRs, and when the changes are approved and merged, they are picked up and automatically reconciled by ArgoCD.
In this scenario, there is a heavy responsibility on both the author and the PR reviewer to make sure they understand and can accurately foresee what the changes will actually accomplish. This sometimes requires detailed knowledge of Kubernetes, Helm, and the implementation of our charts, making PR reviews time-consuming and risky. For example, a PR can be risky if engineers deploy a change that results in a Chime service becoming unavailable, which may or may not result in members losing access to certain features and will almost certainly disrupt Chime's internal operations.
In other words, this is reasonable for simple changes like, say, incrementing a replicaCount. When it comes to making broad changes, like updates to the shared application charts that are used by many Chime services, engineers have to be certain that they are not introducing unexpected changes that may have consequences.
This problem is made particularly acute by the fact that some of the charts we use have grown over time to include a large number of branching and parameterization.
Some of this heavy parameterization is directly tied to the lack of visibility into the changes an engineer might attempt to make. When the stakes are high, and it's difficult to preview the "net result" of the changes one is making, it's only natural that the engineers act cautiously by making use of feature flags to isolate and minimize the blast radius of their changes. As you may have guessed, this impacts the team's velocity, and large refractors are rendered practically impossible.
This comes in stark contrast to the workflow of another tool Chime's infrastructure team relies on to make changes to services: Terraform. Indeed Terraform's VCS-driven workflow automatically triggers terraform plan based on changes to our infrastructure-as-code repository. The result is visible directly from the pull request, making seeing and reviewing the "net result" of code changes straightforward.
As engineers got used to this convenient access to a "crystal ball" for their Terraform changes, it became clear that our workflow for k8s changes would be dramatically improved by having a similar feature.
We got to thinking: Wouldn't it be great if there was a way to preview the outcome of changes before they were released? What if fully rendered manifests were automatically visible directly in the pull request? The author could know the exact outcome of their changes before even submitting the PR for review. It would also make the PR reviewer's job immensely easier since there would be less guesswork involved.
Thanks to 'mani-diffy', any pull request that results in changes to Kubernetes resources is accompanied by a fully rendered manifest illustrating exactly what is changing.
Before we explain how "mani-diffy" works, it'll be helpful to review how our k8s resources are organized using the app of apps pattern.
AOA pattern primer and how we use it at Chime
If you are already familiar with the concepts of AOA and Argo CD, you may skip over this section. I'll be using the repo's example "demo" to showcase how the AOA pattern can be used.
The root and the tree of apps
The App of Apps Pattern lets us define a root ArgoCD Application (or bootstrap). Rather than point to an application manifest, the Root App points to the directory in the repository that contains Application manifests.
In this example the root app points to "/demo/bootstrap". In this directory, we have 2 Application manifests, a prod and a test cluster Ex : demo/bootstrap/prod-cluster.yaml
These applications point to the "app-of-apps" chart via the spec.source.path key.
When rendered this chart in turn will create more ArgoCD applications, which point to more charts.
In this example, we have only 2 charts, so it will be either to the "app-of-apps" chart or to the "service" chart which contains all the necessary k8s resources for a service to be created (ie. deployments, etc.).
So, from the root app all the way to the service, we have a tree of ArgoCD applications.





