From redundancy to efficiency: Transforming machine learning operations at Chime

Early challenges with scaling Machine Learning
In the early days of ML at Chime, each ML model was built from scratch with a bespoke code repository for data preparation, training, and deployment. At first, this approach helped us quickly prove the value of machine learning, but as we scaled, it slowed us down.
As Chime grew, our machine learning footprint expanded to over 40 models in production, each with unique complexities. These models tackled different prediction problems, ranging from fraud detection to marketing optimization. While they shared similar goals, every new feature or bug fix had to be implemented independently across dozens of different codebases. We were spending more time maintaining these models than building new ones. With over 100,000 lines of duplicated code, updating each model individually became time-consuming and error-prone.
Moreover, onboarding new data scientists became challenging. Each model's specific implementation required a considerable effort, making collaboration difficult and slowing down innovation. We needed a better approach—one that streamlined model development, reduced duplication, and fostered collaboration. This need led to the development of MLKit.
What is MLKit?
MLKit is an internally developed, configuration-driven machine learning framework designed to address the challenges of creating and maintaining multiple machine learning models within our organization. By centralizing the model creation process and abstracting away much of the repetitive and error-prone work, MLKit enables data scientists and engineers to quickly build, train, and update machine learning models with minimal hands-on effort.
The key innovation of MLKit is that it allows models to be defined primarily through configuration files, instead of requiring custom-written code for every new model. This simplifies the workflow for creating machine learning models, while also standardizing the process across teams and use cases. With MLKit, much of the technical complexity—such as data generation, model training, and inference—is handled behind the scenes, reducing the need for developers to maintain large, bespoke repositories of code.
Core design goals:
- Centralization: Consolidating all Machine Learning models code into a single repository simplifies maintenance and management. This approach ensures consistency, reduces fragmentation, and makes it easier for teams to collaborate, update, and maintain models efficiently.
- Standardization: Many of our ML models contain similar steps that were implemented in various ways across projects. By establishing a unified approach and standardizing these steps, we can ensure consistency, reduce redundancy, and simplify the adoption of best practices across all models. This leads to more maintainable, reliable, and efficient model development.
- Extensibility: Users should have the flexibility to extend the functionality provided by the ML framework to meet their specific needs. This includes adding custom pre-processing or post-processing steps, integrating their own algorithms, and defining custom validation techniques. The framework should be versatile enough to support these customizations, but for most common use cases, the built-in defaults should be sufficient and work effectively out of the box.
- Interoperability: The framework should seamlessly integrate with existing systems, such as the deployment system, without requiring significant refactoring. While some systems may need to pass or accept new arguments, the integration should remain largely consistent with how they interact with existing model repositories—primarily using SageMaker-compatible container images. This ensures that the adoption of the new framework does not disrupt existing workflows.
- Decoupling the Framework from Models: The framework should be developed independently of the models it supports, allowing the framework to evolve without impacting individual model implementations. This separation ensures flexibility, easier updates, and scalability for both the framework and the models, minimizing interdependencies that could hinder development.
- Ease of Use: Models should primarily leverage existing code by reusing standardized components. Model creation should be driven through configuration files, enabling straightforward setup and minimizing the need for custom coding, thus making the process more efficient and user-friendly.
- Well-Tested Code: Currently, most of the model code lacks comprehensive testing, resulting in minimal unit test coverage across model repositories. This new framework provides an opportunity to address these gaps by ensuring that both the framework and model code are well-tested, leading to improved reliability and maintainability of the models.
How does MLKit work?
MLKit simplifies machine learning model development and maintenance through standardized, reusable workflows. These workflows are configuration-driven, meaning that instead of writing custom code for every model or workflow, users can define parameters in configuration files, reducing the complexity and duplication commonly found in machine learning projects.
Here’s a breakdown of the core components of MLKit and how they work together:
Workflows
In MLKit, a Workflow represents a specific process, such as data preparation, model training, or inference. Each workflow consists of a series of steps that are executed sequentially. Workflows are reusable across multiple models, making it easy to standardize processes and avoid repetitive development tasks.
For example, a training workflow typically involves steps for:
- Reading and processing the input data.
- Training the model using a specific algorithm.
- Evaluating the model’s performance based on metrics.
- Saving the trained model for future inference.
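The steps above can be sketched as a small workflow runner. This is purely illustrative: the class and function names here are assumptions for the sake of the example, not MLKit's actual internal API.

```python
# Illustrative sketch of a workflow as an ordered sequence of steps.
# Names (Workflow, run, the context dict) are assumptions, not MLKit's API.

class Workflow:
    def __init__(self, name, steps):
        self.name = name
        self.steps = steps  # ordered list of callables

    def run(self, context):
        # Execute each step sequentially; every step reads from and
        # writes to a shared context dictionary.
        for step in self.steps:
            context = step(context)
        return context


def read_data(ctx):
    ctx["data"] = [1.0, 2.0, 3.0]  # stand-in for real data loading
    return ctx

def train_model(ctx):
    # Stand-in "training": compute the mean of the data.
    ctx["model"] = sum(ctx["data"]) / len(ctx["data"])
    return ctx


train = Workflow("train", [read_data, train_model])
result = train.run({})
```

The key idea is that the runner is generic: the same `run` loop serves any workflow, and only the list of steps changes from model to model.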
Steps
A Step is an individual task within a workflow. Steps can include tasks like reading data, transforming features, training the model, or evaluating performance.
Example steps in a training workflow could be:
- ReadData: This step ingests data from a source (e.g., a database or file).
- TrainModel: This step trains the model using the specified algorithm and training parameters.
- EvaluateModelPerformance: After training, this step evaluates the model using metrics such as accuracy or AUC.
- SaveTrainArtifacts: Finally, this step saves the trained model and relevant metadata.
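One plausible shape for such steps is a small base class that each concrete step implements. The step names below mirror the examples above, but the interface itself is an assumption for illustration, not MLKit's real one.

```python
# Hypothetical sketch of MLKit-style steps; class names follow the
# examples in the text, but the interface is an assumption.

class Step:
    def run(self, context: dict) -> dict:
        raise NotImplementedError


class ReadData(Step):
    def run(self, context):
        # In practice this would query a database or read a file;
        # here we use (score, label) pairs as stand-in data.
        context["data"] = [(0.2, 0), (0.9, 1), (0.8, 1), (0.1, 0)]
        return context


class EvaluateModelPerformance(Step):
    def run(self, context):
        # Toy "accuracy": threshold scores at 0.5 and compare to labels.
        correct = sum(1 for score, label in context["data"]
                      if (score >= 0.5) == bool(label))
        context["accuracy"] = correct / len(context["data"])
        return context


ctx = {}
for step in (ReadData(), EvaluateModelPerformance()):
    ctx = step.run(ctx)
```

Because every step shares the same `run(context)` contract, steps can be freely reused and recombined across workflows.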
Configurations
MLKit is configuration-driven, meaning that workflows and models are defined by editing configuration files.
- Model Configuration (model_cfg.yml): Defines global parameters for the model, such as its name, version, and prediction type.
- Feature Configuration (feature_cfg.py): Defines the features, their data types, any preprocessing steps (such as normalization or encoding), and derived features (features that are computed from other existing features).
- Workflow Configuration (<workflow_name>_cfg.yml): Defines specific parameters for a specific workflow, such as the dataset to use, the training algorithm, and evaluation metrics.
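A feature configuration might look roughly like the following. All field names, preprocessing labels, and the derived-feature mechanism shown here are illustrative assumptions, not MLKit's actual schema.

```python
# feature_cfg.py - illustrative sketch only; the real MLKit feature
# schema is internal, so the names and fields here are assumptions.

FEATURES = {
    "transaction_amount": {"dtype": "float", "preprocess": "normalize"},
    "merchant_category":  {"dtype": "str",   "preprocess": "one_hot"},
    "account_age_days":   {"dtype": "int",   "preprocess": None},
}

# Derived features are computed from existing features.
DERIVED_FEATURES = {
    "amount_per_day": lambda row: row["transaction_amount"]
                                  / max(row["account_age_days"], 1),
}

row = {"transaction_amount": 120.0, "account_age_days": 30}
amount_per_day = DERIVED_FEATURES["amount_per_day"](row)
```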
Example of a configuration file for the train workflow (train_cfg.yml):
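A plausible sketch of such a file is shown below; all keys and values are illustrative assumptions rather than MLKit's actual configuration schema.

```yaml
# train_cfg.yml - illustrative sketch; keys are assumptions, not
# MLKit's actual schema.
dataset: fraud_detection_train_v1
algorithm: xgboost
hyperparameters:
  max_depth: 6
  eta: 0.1
  num_round: 200
evaluation:
  metrics: [auc, accuracy]
  validation_split: 0.2
```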
Hooks
Hooks provide a powerful way to extend or modify workflows without altering the core functionality of MLKit. Hooks allow you to inject custom code before or after a workflow step is executed. This flexibility is essential when you need to perform additional tasks such as custom preprocessing, feature transformations, or logging post-training metrics.
For example, you might use a hook to clean the data before the model training step begins, or to log custom evaluation metrics after the training is complete.
Example of a Hook:
Let’s say you need to add a custom data cleaning function before the training step. You can create a BeforeStepHook that cleans the data before model training:
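A sketch of what that might look like follows. The BeforeStepHook name comes from the text above; everything else, including the hook interface and the CleanDataHook subclass, is an assumption for illustration.

```python
# Illustrative sketch of a before-step hook. "BeforeStepHook" follows
# the text above; the interface details are assumptions.

class BeforeStepHook:
    """Base class for hooks that run before a workflow step."""
    def run(self, context: dict) -> dict:
        raise NotImplementedError


class CleanDataHook(BeforeStepHook):
    """Drops rows with missing values before training."""
    def run(self, context):
        context["data"] = [row for row in context["data"]
                           if None not in row.values()]
        return context


ctx = {"data": [{"amount": 10.0, "label": 0},
                {"amount": None, "label": 1},
                {"amount": 25.0, "label": 1}]}
ctx = CleanDataHook().run(ctx)
```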
In your workflow configuration, you would reference this hook:
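For instance, it might be wired up along these lines; the key names, the hook class name, and the module path are hypothetical, shown only to convey the pattern.

```yaml
# train_cfg.yml - hypothetical hook registration; key names and the
# module path are assumptions for illustration.
hooks:
  before_step:
    TrainModel:
      - hooks.clean_data.CleanDataHook
```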
Custom workflows
For scenarios where the existing workflows and the hooks mechanism are not sufficient, users can create Custom Workflows. These work the same way as the standardized workflows but are available to a single model only.
MLKit CLI
MLKit comes equipped with a command-line interface (CLI) that simplifies the interaction with the platform. The CLI allows users to quickly initialize models, configure workflows, run training or inference tasks, and perform feature engineering tasks without needing to write custom scripts. This significantly speeds up the development process and reduces manual intervention, making it easier to manage machine learning models and workflows.
The most important commands are:
- mlkit init-model: Creates a new model or adds a new workflow configuration to an existing model;
- mlkit run-workflow: Runs a given workflow for a given model.
Sequence diagram of running a workflow
How To Use MLKit?
Using MLKit involves a structured process that includes configuring workflows and running them iteratively to refine the model. Below are the steps to create a model and execute workflows using MLKit.
Initialize a new model
The first step is to initialize a new model using the MLKit command-line interface (CLI). This command creates the necessary directories and configuration files for the model.
Example:
mlkit init-model -m fraud_detection -w TrainDataGeneration
This generates:
- model_cfg.yml: General configuration for the model.
- feature_cfg.py: Feature configuration for the model.
- train_data_generation_cfg.yml: Configuration for generating training data.
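As a sketch, the generated model_cfg.yml might contain something like the following, based on the fields described earlier (name, version, prediction type); the exact schema is an assumption.

```yaml
# model_cfg.yml - illustrative sketch based on the fields described
# in the text; the exact schema is an assumption.
name: fraud_detection
version: 0.1.0
prediction_type: binary_classification
```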
Configure and run workflows (development)
Once the model is initialized, you can configure and run each workflow. For each workflow (e.g., Train Data Generation, Train, Inference), follow the steps below.
- Step 1: Configure Workflow. Edit the workflow configuration file to define its parameters. For example, for the data generation workflow, these parameters include SQL logic and data splits.
- Step 2: Run Workflow. Run the training data generation workflow using the CLI.
Example:
mlkit run-workflow -m fraud_detection -w TrainDataGeneration
- Step 3: Iterate and Refine. After running each workflow, review the results and refine the configurations as needed. For instance, you may want to adjust hyperparameters in train_cfg.yml or add more features in feature_cfg.py. Once changes are made, re-run the workflow to improve the model’s performance.
Below is a simplified view of how MLKit is used to develop models:
Deploying an MLKit model
When a new model is created and configured, the model development process includes the generation of a Pull Request (PR). Once the PR is reviewed and merged, MLKit automatically builds a Docker image that contains the model code, configuration, and all required dependencies. This Docker image is then used in SageMaker pipelines to train the model based on the configurations provided in MLKit. After a successful training pipeline execution, the resulting model is registered as a challenger model. This allows the model to be evaluated and compared against other models for further validation and monitoring.
After the training pipeline succeeds, and if the new model artifact meets certain performance criteria, the model can be deployed. Batch Models run on SageMaker pipelines and are triggered on a schedule. Real Time Models run on SageMaker Endpoints, allowing the model to serve predictions in real-time applications and providing a quick and scalable solution for integrating machine learning into production systems.
In future blog posts, we’ll explore more details on setting up and managing SageMaker pipelines, as well as how to deploy models to real-time endpoints for low-latency predictions. Stay tuned for a deeper dive into these critical components!
Key features and benefits of MLKit
Low-code development
MLKit enables low-code development, allowing users to configure machine learning workflows through YAML or Python configuration files, as demonstrated in previous sections. This eliminates the need to rewrite code for workflows such as data generation, pre-processing, and batch and real-time inference across multiple models. This saves time for data scientists looking to quickly bootstrap a new model, and non-experts can also leverage MLKit to build models without extensive knowledge of the underlying workflows. The simplicity of configuration-driven development allows users to focus on solving business problems rather than implementation details, and to move quickly from concept to production.
Extensibility
With support for a variety of machine learning frameworks, users can specify a model type from a particular package out of the box via their configuration. Users do not need to write custom code to account for differences in the underlying frameworks or libraries; they simply fill in the expected parameters via config. The extensibility of MLKit will also allow us to support other ML frameworks as we look to experiment with new types of models in the future. The same applies to reusable operations for derived feature creation, encoders, and evaluation metrics. Users can add new operations to be shared among models or, alternatively, add custom code via hooks to experiment with an approach before building it into MLKit.
Scalability
MLKit consolidates our models into a mono-repo as opposed to our former state of having one repository per model. This centralization enables better scalability as the number of models being developed at Chime grows for different business use cases.
Data scientists can also more easily work on a number of different models rather than being limited to the scope of models they developed themselves. MLKit eliminates the steep learning curve that previously existed with siloed model repositories consisting of thousands of lines of custom code that limited the ability of others to understand and contribute to the model.
Integrations
MLKit allows seamless integration with third-party vendors for key capabilities such as model observability and feature lineage tracking. This unified approach of interacting with third-party vendors in shared workflow code means that when a new integration is introduced, it can be easily incorporated without the need for duplicative migration efforts or custom adjustments per model. Moreover, the shared workflow code makes it easier to maintain and update integrations over time, reducing the risk of errors or inconsistencies.
Streamlined maintenance and enhancements
One of the most significant advantages of MLKit is the ease with which new functionalities can be added to the framework and maintenance tasks can be completed. This includes streamlining routine tasks such as patching vulnerabilities, applying security updates, upgrading CI/CD infrastructure, and ensuring consistent dependency management. Since MLKit consolidates all the machine learning models into a centralized framework, any new functionalities or improvements only need to be implemented once. This drastically reduces the time and effort required for maintenance and enhancements, as changes propagate across all models automatically. Machine learning engineers can focus on innovation rather than tedious repetitive tasks, knowing that improvements to the framework will benefit every model.
For example, adding a new pre-processing step, integrating a new evaluation metric, or updating dependencies only needs to be done in one place, and these updates will then be available for all models that use MLKit. This centralized approach not only ensures consistency across models but also significantly reduces the likelihood of bugs and version mismatches. By eliminating duplicated effort, MLKit makes the process of maintaining and evolving machine learning models far more efficient, allowing engineers to spend more time solving challenging problems and less time on routine maintenance work.
Secure data access
MLKit is designed with data security in mind by adhering to Chime’s role-based access control policies for data assets. Users can only access and use data they are authorized to, based on the permissions assigned to their team or role. This seeks to prevent unauthorized access to information in compliance with data governance standards. If additional access to a different role or data is required, such requests are subject to a formal review and approval process enabling controlled and auditable data access.
Why MLKit is a game changer for our teams
MLKit has vastly improved Chime's machine learning operations, which span more than 40 models. These models power critical use cases, from fraud prevention to enhancing customer experience. New models can be created in less than a day, and overall time to market has been reduced from 2 months to 3 weeks, effectively more than doubling our model shipping velocity. Data scientists are freed from the routine work of implementing and maintaining custom workflow code for their models. They can now focus on solving business problems without the cognitive overhead of the complex implementation details that MLKit has abstracted away.
For our platform engineers, MLKit eliminates the repetitive work that would normally be done per model repository in order to add new features or integrations. Now, new functionality is immediately available to all models when developed as part of MLKit. Furthermore, consolidating all backend model code significantly reduces the maintenance overhead of resolving bugs, updating dependencies, and other changes that should be applied consistently across models. This has reduced time spent on platform integration and maintenance work from 10 hours/week to 1.5 hours/week, an 85% reduction.¹
By centralizing our machine learning workflows and eliminating the redundancy of dozens of model repositories, MLKit has streamlined model development with less duplication, simplified maintenance, greater reliability of model code, and better collaboration across models.
Final thoughts: The future of Machine Learning with MLKit
MLKit is fundamentally transforming how we innovate with machine learning at Chime, enabling faster development cycles while minimizing duplicated effort. One of MLKit's core promises is to allow our teams to innovate at a higher velocity. Moving forward, we plan to extend MLKit's capabilities by integrating state-of-the-art machine learning frameworks for deep learning and graph-based models, enhancing the accuracy of our ML models. Additionally, integrating new features such as ML observability tools into MLKit will provide built-in support for monitoring live model performance and detecting issues like feature drift, ensuring that our models stay reliable over time.
Another key aspect of MLKit's future is its role in democratizing machine learning development beyond the data science team. By providing a configuration-driven framework, MLKit lowers the barriers to entry, empowering analysts from other teams, such as Risk and Product, to participate in building and iterating on models. Ultimately, MLKit is paving the way for a more inclusive approach to machine learning, where the power of data-driven insights can be harnessed by a diverse range of contributors across Chime.
References
Amazon SageMaker: Amazon SageMaker is a managed machine learning service provided by AWS (Amazon Web Services). Any mention of SageMaker in this post refers to features and services offered by AWS. Learn more about SageMaker here.
Docker: Docker is an open platform for developing, shipping, and running applications inside lightweight containers. Learn more about Docker here.
Join Our Team
Does this work sound interesting to you? Check out open positions on our careers page.
Credits
A big thank you to everyone who contributed to the development of MLKit and made this blog post possible! Special acknowledgment to Bruno Lima, Parin Shah, Aishwarya Joshi, Akshay Jain, Peter Zawadzki, and the Data Science & ML Platform teams for their expertise and collaboration. Your hard work drives our innovation! 🚀
¹ Estimated based on comparing past model maintenance timeframes to those of models created using MLKit.