Redesigning Large File Processing at Chime®
By Ajinkya Tarkunde, Engineer on the Data Platform Team
Chime® is a financial technology company that partners with The Bancorp Bank, N.A. and Stride Bank, N.A. (Chime is a financial technology company, not a bank; banking services are provided by The Bancorp Bank, N.A. or Stride Bank, N.A., Members FDIC) to provide banking services and process millions of transactions every day with the help of our financial services API partner, Galileo. Chime receives transaction information from Galileo in real time via HTTPS; sometimes, however, corrections need to be made after the fact.
Chime receives RDF (Raw Data Format) files from Galileo, which include all corrected data and serve as the ultimate source of truth, making them extremely important for our business. We process the files sent by Galileo and reconcile their transactions with the data we receive in real time.
Galileo produces 5 types of RDF files for Chime. These files contain finalized records of posted transactions, authorized transactions, expired authorizations, accounts, and cards. Chime downloads these files and processes them for various reasons, such as reconciliation, analytics, or daily balances.
The two largest RDF files we receive contain posted transactions and authorized transactions (6–8 GB). We used to process these files with Ruby, which served us well initially, but as Chime grew, so did the number of transactions and the size of the files. At that scale, we had to rethink our tooling to process these files as efficiently as possible. Here’s a look at the journey we took to make it happen!
Identifying our operational overhead
Initially, the RDF files were parsed row by row. For each row parsed, a Sidekiq® (Ruby background job processor) job was spawned, which fired a query to MySQL®. The query fetched information that enriched the parsed row with more data, such as user ID, account ID, and merchant ID. The enriched row was then written to MySQL, and once that completed, a post-processing step was kicked off. This step wrote to other MySQL tables and to DynamoDB® tables for application use cases.
Because the data produced from the RDF files was also required for analytical use cases, we ran a sync job that copied these MySQL tables to our data warehouse, Snowflake.
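The sketch below illustrates the shape of that legacy flow. It is schematic Python rather than our actual Ruby/Sidekiq code, and the table, column, and file names are purely illustrative; the point is the one-lookup-query-per-row pattern that made the largest files take so long.

```python
# Schematic of the legacy per-row pattern (the real pipeline was Ruby +
# Sidekiq; table, column, and file names here are illustrative only).
import csv
import mysql.connector  # assumes the mysql-connector-python package

conn = mysql.connector.connect(host="mysql-host", database="ledger",
                               user="loader", password="...")

def process_row(row):
    """Enrich a single RDF row via a lookup query, then write it back."""
    cur = conn.cursor(dictionary=True)
    # One lookup round trip per row -- multiplied by millions of rows,
    # this is a big part of why the largest file took 12 to 16 hours.
    cur.execute("SELECT user_id, account_id FROM accounts WHERE card_id = %s",
                (row["card_id"],))
    enriched = {**row, **(cur.fetchone() or {})}
    cur.execute("INSERT INTO posted_transactions (card_id, user_id, amount) "
                "VALUES (%s, %s, %s)",
                (enriched["card_id"], enriched.get("user_id"), enriched["amount"]))
    conn.commit()
    # ...followed by a post-processing step that wrote to other MySQL
    # tables and to DynamoDB for application use cases.

with open("posted_transactions.rdf") as f:
    for row in csv.DictReader(f, delimiter="|"):  # one background job per row in the old design
        process_row(row)
```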
This process had several challenges, the biggest of which was operational overhead:
- It was time-consuming: MySQL was queried for every row, and the enriched data was batch-inserted back into MySQL, which took a long time; the largest file could take anywhere from 12 to 16 hours.
- It involved a lot of tasks: Every row imported from an RDF file triggered several follow-up tasks, some executed synchronously inline and others asynchronously by Sidekiq.
- We had to find workarounds: Adding new columns to MySQL was impossible due to the size of the tables populated by parsing RDF files, which led us to create multiple additional tables in MySQL to work around the problem.
- It didn’t support our growth: Processing time kept increasing as Chime grew and the number of transactions rose organically.
- There was no asynchronicity to the process: A single error meant re-triggering the whole pipeline, requiring another 16 hours; sometimes it could take an entire week to catch up.
- The operational overhead to keep the system running was quite large.
In the end, it was clear this was a data engineering problem that needed data engineering tools to solve, and that’s exactly what we did.
The New Way
We knew that parsing and enriching such large files would require a Big Data processing tool. Since we were already using Apache Spark™ for several other use cases at Chime, it was a simple decision to use it to process these files, too.
To solve the asynchronicity issue, we decided to use our job orchestration tool, Apache Airflow™. It lets us break the process down into smaller tasks and retry only the tasks that fail. One major design change we made was to write the fully processed data to Snowflake first and then propagate it to MySQL. We decided to use AWS® S3 to store the raw RDF files and the output used for post-processing (more on that later). A high-level architecture diagram is below:
Let’s take a deeper look at the architecture now that we have a broad understanding of how we redesigned it.
Technical Design
We use Airflow as a job orchestrator, which helps us make the whole process modular and easily retriable. It’s the central point of control and makes administration very easy through its UI. Here are the steps our pipeline goes through (a minimal orchestration sketch follows the list):
1. Downloading files
- The first task downloads the encrypted RDF files from the Galileo SFTP server, decrypts them, and uploads both versions to an encrypted S3 bucket for permanent storage.
2. Data enrichment
- As mentioned above, we need to enrich the data with context (e.g., user ID, merchant ID); we use Spark to fulfill this task.
- Airflow kicks off a Spark job, written using an internal service called Data Transformation Service. The job downloads the file from S3, parses it, and enriches it using data present in Snowflake. The enrichment data is synced from MySQL to Snowflake by a separate sync job. (A minimal sketch of the enrichment job appears at the end of this section.)
3. Data forwarding
- The enriched data is then written to Snowflake to be used by the Analytics, Risk, and Data Science teams, among other functions at Chime.
- The same data is also written to S3, partitioned by a configurable number of lines per file. These files are then consumed by a post-processing service that populates MySQL and DynamoDB.
- Airflow calls the post-processing service with the file names; the service then downloads the files and processes them in parallel.
4. Data quality checks
- We have implemented many data quality checks, and the job fails if any of them fail. This matters because the data is used widely across Chime, and we need to make sure it’s accurate.
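To make the orchestration concrete, here is a minimal Airflow DAG sketch of the stages above. The task names, operator choices, callables, and dependency ordering are illustrative assumptions, not our production code.

```python
# A minimal sketch of the orchestration, assuming illustrative task names,
# callables, and scheduling; the production DAG has more tasks and config.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def download_and_decrypt(**context):
    """Fetch the encrypted RDF files from SFTP, decrypt, and upload to S3."""

def run_spark_enrichment(**context):
    """Submit the Spark job that parses the file and enriches it from Snowflake."""

def call_post_processing(**context):
    """Call the post-processing service with the partitioned S3 file names."""

def run_data_quality_checks(**context):
    """Fail the run if any data quality check on the output fails."""

with DAG(
    dag_id="rdf_processing",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=10)},
) as dag:
    download = PythonOperator(task_id="download_rdf_files",
                              python_callable=download_and_decrypt)
    enrich = PythonOperator(task_id="enrich_with_spark",
                            python_callable=run_spark_enrichment)
    post_process = PythonOperator(task_id="post_process_to_mysql_and_dynamodb",
                                  python_callable=call_post_processing)
    dq_checks = PythonOperator(task_id="data_quality_checks",
                               python_callable=run_data_quality_checks)

    # Each task can be retried independently, so a transient failure no
    # longer forces a 16-hour re-run of the whole pipeline.
    download >> enrich >> [post_process, dq_checks]
```

Because each stage is its own task, retries and backfills operate at the task level rather than restarting everything from the raw file download.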
The diagram below shows the above process in greater detail:
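For the enrichment job itself (steps 2 and 3 above), the sketch below shows the general shape in PySpark. The bucket paths, table names, join key, and Snowflake connection options are assumptions for illustration; the real job is built with our internal Data Transformation Service.

```python
# A minimal sketch of the enrichment job; paths, table names, the join key,
# and connection options are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdf-enrichment").getOrCreate()

SNOWFLAKE_SOURCE = "net.snowflake.spark.snowflake"
sf_options = {
    "sfURL": "<account>.snowflakecomputing.com",
    "sfDatabase": "CORE",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "ETL_WH",
    "sfUser": "etl_user",
    "sfPassword": "...",
}

# 1. Read the decrypted RDF file that the download task placed in S3.
raw = spark.read.csv("s3://rdf-bucket/posted_transactions/2024-01-01/",
                     sep="|", header=True)

# 2. Enrich with context previously synced from MySQL to Snowflake.
accounts = (spark.read.format(SNOWFLAKE_SOURCE)
            .options(**sf_options)
            .option("dbtable", "ACCOUNTS")
            .load())
enriched = raw.join(accounts, on="card_id", how="left")

# 3a. Write the enriched data to Snowflake for analytics, risk, and data science.
(enriched.write.format(SNOWFLAKE_SOURCE)
 .options(**sf_options)
 .option("dbtable", "POSTED_TRANSACTIONS_ENRICHED")
 .mode("append")
 .save())

# 3b. Also write it back to S3, split into files of a configurable size, for
#     the post-processing service that populates MySQL and DynamoDB.
(enriched.write
 .option("maxRecordsPerFile", 50_000)
 .mode("overwrite")
 .json("s3://rdf-bucket/post-processing/2024-01-01/"))
```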
Alternatives
1. We considered using just Python code to process the files, since they weren’t too large, but that would have come with the following tradeoffs:
- Python would have been less performant for enriching data of this size.
- We would have had to rewrite the post-import Ruby code in Python; that code also depends on other Ruby libraries, further complicating the rewrite.
2. We also considered doing everything in Ruby and parallelizing only the post-processing, but the enrichment time would have remained the same, so we wouldn’t have seen nearly the performance improvement that using Spark for enrichment provides.
3. The communication between Airflow and the post-processing API was debated as well. Our first thought was to fire and forget, but that would have meant losing Airflow’s retry feature, and there wouldn’t have been a closed loop.
4. We considered having the API return a job ID and having Airflow poll its status periodically, but that would have meant a lot of network traffic, since there would be thousands of these calls.
Performance Improvements
We were able to bring data processing time down from 16 hours to 20 minutes, a huge improvement! Fresh data is now available in Snowflake within 20 minutes, and many other processes at Chime that depend on this data complete much faster. We also parallelized the post-processing as part of the redesign, so writes to MySQL and DynamoDB are much faster; post-processing still takes about 4 hours because we throttle the writes to manage load on MySQL. And because there is no longer a sync back from Snowflake, MySQL has more free cycles to serve data to real-time applications. As we dug deeper into the process, we found that the second-largest table in MySQL could be deprecated over time, and we recently completed that deprecation.
Operational Improvements
Galileo occasionally adds more data to the RDF files than our business requires. For many years, we weren’t able to add new columns to the MySQL tables to store and use this data because those tables were too large to support such a change. Now that the table lives in Snowflake, we can easily add extra columns; in fact, we have already added multiple columns since the new process was introduced in production. Retries and backfills are not only possible but easy: the asynchronous design lets us retry only the failed tasks, saving a lot of time, and the Airflow UI helps us visualize the process and makes job administration very simple.
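As a hypothetical illustration of what that unlocks (the account, table, and column names below are made up), adding a column for a new field is now a quick, metadata-only change in Snowflake, something we could not do on the old, very large MySQL tables:

```python
# Hypothetical example of adding a column for a new RDF field using the
# Snowflake Python connector; account, table, and column names are illustrative.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account>", user="etl_user", password="...",
    warehouse="ETL_WH", database="CORE", schema="PUBLIC",
)

with conn.cursor() as cur:
    # Adding a nullable column in Snowflake is a metadata operation, so it
    # completes quickly even on a very large table.
    cur.execute(
        "ALTER TABLE POSTED_TRANSACTIONS_ENRICHED "
        "ADD COLUMN network_reference_id VARCHAR"
    )
```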
Data Quality Improvements
During the migration from Ruby to Spark, we encountered challenges due to differences between the regex engines the two platforms use. As a result, we developed complex Snowflake queries to verify data parity with the pre-redesign state. After rectifying the regex mismatches, we observed more regex matches and, consequently, a higher volume of generated data. The migration underscored that transitioning from Ruby to Spark is not trivial, precisely because of disparities like these regex engines.
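One concrete example of the kind of mismatch we mean (the pattern below is hypothetical, not one of our production rules): Ruby’s regex engine treats \h as a hexadecimal digit, while the Java regex engine that Spark uses treats \h as horizontal whitespace, so the same pattern can match different rows on the two platforms.

```python
# Hypothetical illustration of a Ruby-vs-Spark regex difference: in Ruby,
# \h matches a hex digit ([0-9a-fA-F]); in the Java engine Spark uses,
# \h matches horizontal whitespace, so the same pattern behaves differently.
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract

spark = SparkSession.builder.appName("regex-mismatch-demo").getOrCreate()
df = spark.createDataFrame([("auth_1a2b3c",)], ["memo"])

# Ruby would extract "1a2b3c" here; Spark returns an empty string because
# \h+ does not match hex digits under the Java engine.
df.select(regexp_extract("memo", r"auth_(\h+)", 1).alias("code")).show()

# Spelling the character class out explicitly behaves the same on both engines.
df.select(regexp_extract("memo", r"auth_([0-9a-fA-F]+)", 1).alias("code")).show()
```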
Previously, Chime employees had to query multiple tables with complex joins to get the data they needed. With the new process, that data is available in a single table, making for a much more pleasant user experience.
Impact of adding new columns
Over the years, the RDF files came to include more information, but because it was stored across multiple tables and storage technologies, query performance suffered when retrieving it. We have added 7 to 8 columns to our Snowflake table since the new process went live. Various teams at Chime are now empowered with far more easily accessible data to make decisions that shape the future of our company, and creating dashboards for executive reviews has become a lot easier.
Cross-team collaboration and the spirit of Chime
Several teams worked closely together to make this project a success: the Data Ingestion, Data Transformation Service, and Transactions teams were the main stakeholders, and support from leadership was vital.
With data available much sooner, we’re better able to generate the features that power our machine learning models to predict fraud and stop malicious actors who would harm our members. The analytics team is now equipped with information that helps them better understand our members and build features that help them achieve financial peace of mind.
This project was truly a show of how well Chime employees can team up (one of our company values) across teams to make it happen (another of our values). Our work had a Chime-wide impact and continues to bear fruit today.
Acknowledgments 🙏
This fundamental shift in how we process RDF files would not have been possible without my amazing team — Kate Wang, Nivetha Siva, Jefferey Chen, and Jiaqi Guo for helping with implementation, DQCs, Architecture, and everything in between.
We had a great support system from a lot of Chimers, including Khandu Shinde, Saurabh Sathaye, Radhika, Manpreet Khanna, and Chenyu Li.
Thanks to our leadership team, who encouraged and inspired us — Andrew Trott, Bing Liu, Dinesh Veluvolu, and Sai Sundar.