Did you know that every 40 minutes we (as in the entire world) create as much data as we did from the beginning of time up until 2003? I did a double take when I first read that sentence. We live in an increasingly data-driven world where everyone is using data to derive insights and make decisions. The problem has shifted from technological limitations to an abundance of data. In the last few years the data landscape has changed considerably. Gone are the days of number crunching in Excel sheets; even batch processing is old news now. Streaming and low-latency processing are now the bread and butter.
At the same time, a few things haven't changed. A "data gap" still exists between producers (product engineers) and consumers (data scientists), as these groups often operate in silos. While data processing technology is getting better and better, we cannot derive maximum value unless this data gap is filled.
In order to unlock the full power of streaming for all of Chime, we recently launched a data streaming platform we call "Event Bus Pro". The main goal of Event Bus Pro is to enable rapid prototyping and self-serve data pipelining, with a focus on low-latency computation. If you've ever built a platform, you know what an undertaking it is to design something that works for everyone while also being fast, easy to use, secure, and reliable.
This is the first in a series of posts where we discuss how this Streaming Platform was designed. This post focuses on decentralization, or how we managed to bridge the data gap. So sit back, relax, and enjoy. If you're in a hurry and would rather just get to the crux of the matter, I have summarized the key takeaways at the end of this post.
In the Beginning There Were Data Engineers…
Data originates from within the product code, usually from databases and event logs. It is used for analytics, driving product insights, machine learning, and fraud detection.
Consumers of this data are very rarely the same people producing it.
Data science workflows require a different skill set, one with very little overlap with product engineering; hence the existence of Data Engineering teams.

Data Engineers (DEs) curate product data into a format suitable for data science workflows by ingesting it into a data lake, a data warehouse, or, for real-time workflows, into streams. Their goal is to transform data into a format that works well with data science tools. This way, Data Scientists can focus on what they do best while DEs manage the nitty-gritty of ingestion and pre-processing.
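To make the idea concrete, here is a minimal sketch of what one such curation step might look like: a nested product event is flattened into a row that warehouse queries and data science tools can consume directly. The event shape and field names are hypothetical, not Chime's actual schema.

```python
import json
from datetime import datetime, timezone

def curate_event(raw: bytes) -> dict:
    """Flatten a raw product event into a warehouse-friendly row.

    Field names are invented for illustration.
    """
    event = json.loads(raw)
    return {
        "event_name": event["name"],
        "user_id": event["payload"].get("user_id"),
        "amount_cents": event["payload"].get("amount_cents"),
        "event_ts": event["timestamp"],
        # Record when the row was curated, in UTC.
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
```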
DEs typically form the bridge between data producers and consumers. However, DEs often lack context about the datasets they curate (they don't own them), which makes curation challenging. They also have to contend with changing requirements from consumers, and often find themselves re-curating datasets in response. This can be a time-consuming and frustrating process.
Moreover, there is the problem of duplicated effort. In the absence of a platform, a typical workflow looks like this (a rough sketch of the steps follows the list):
1. Product code produces events using one of the event logger libraries at Chime.
2. Consumers contact the DE team and ask for the data to be ingested, in either real time or batch.
3. The DE team provisions the destination infrastructure for the event and sets up ingestion to it. For batch workflows, the destination is normally Snowflake or AWS S3; for real-time workflows, data is ingested into Kinesis streams.
4. Next comes transformation. Due to a lack of proper self-serve tooling, DE teams are responsible for deploying and maintaining consumer logic.
5. Rinse, repeat.
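To give a feel for steps 1 and 3, here is a rough Python sketch. Everything named below (the `event_logger` interface, the stream name, the partitioning scheme) is an assumption for illustration, not Chime's actual tooling.

```python
import json
import boto3

# Step 1: product code emits an event through a logger library.
# `event_logger` is a stand-in; Chime's real logger APIs are not shown here.
def log_signup(event_logger, user_id: str) -> None:
    event_logger.emit("user_signed_up", {"user_id": user_id})

# Step 3: the DE team wires ingestion into a provisioned Kinesis stream.
kinesis = boto3.client("kinesis")

def ingest(event: dict) -> None:
    kinesis.put_record(
        StreamName="product-events",  # illustrative stream name
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event["payload"]["user_id"],
    )
```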
There are a few problems with this. The biggest is events continuing to flow without being collected anywhere, because the destination, and the ingestion into it, is set up for each consumer use case only as the need emerges. It is kind of like a running faucet: water keeps running until someone holds a glass under it.

This is also not scalable. Consumers can move only as fast as the DE team, and we can't keep hiring more and more people to build pipelines. It's both expensive and untenable.
Furthermore, history is lost, much like the water that never makes it into a glass. One may end up using data from a different source (usually databases) to backfill state, which can lead to inconsistencies; in some cases, such a backup option may not be available at all. We need a way to maintain history from day one: a reservoir or, ahem, a *data lake* that collects water from the faucet from the moment it is turned on. More on this later.
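As a first taste of that reservoir idea, here is a minimal sketch of an archiver that writes every event to object storage as soon as it is produced, independent of any consumer, so history accrues from day one. In practice you would reach for a managed delivery service rather than hand-rolling this; the bucket name and key layout below are invented.

```python
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def archive_event(event: dict) -> None:
    # Partition the lake by event name and date so history is easy to
    # backfill from later. Bucket and key layout are hypothetical.
    ts = datetime.now(timezone.utc)
    key = f"raw/{event['name']}/dt={ts:%Y-%m-%d}/{ts.timestamp()}.json"
    s3.put_object(
        Bucket="event-data-lake",
        Key=key,
        Body=json.dumps(event).encode("utf-8"),
    )
```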