When I attended this year’s Strata Conference in Silicon Valley I made two observations. The first was the energy around AI/ML use cases that showcased complex applied mathematics with Amazon and demonstrated real world continuous machine learning with Kinesis, real-time image recognition from Pinterest, and examples by Google and S&P Global to capture semantic relationships between words, images and data. The second was a resurgence of enthusiasm for SQL, proving everything old is new again, with discussion of projects like Uber’s Flink SQL, Google’s Beam SQL, Confluent’s Kafka SQL, and Yahoo/eBay/Twitter Pulsar’s SQL.
The pain of forcing one size to fit all
A number of sessions focused on how they worked around issues in pub/sub event buses like Kinesis and Kafka for transactional systems. Complexities around consistent cloning of databases, transformations, selective replication and consistency enforcement was left as an exercise to the “stream application.” Let’s dig a little deeper into a specific example presented by Pinterest.
Pinterest data pipeline
Pinterest presented their highly scalable system that dynamically integrates user pins into periodic snapshots. To understand this better, we need to generalize data operations into interactions between systems of record, systems of insight and systems of engagement.
Figure 1 explains a typical data movement pipeline. A system of record is typically a RDBMS. In the case of Pinterest, the system of record is a highly sharded MySQL database with nearly 100TB of data. Figure 2 explains the Pinterest data ingestion pipeline, conceptually. The system of engagement collects pins and stores it into a system called Singer (step 1). From there the data is sent to their analytics backbone using Kafka to S3 and ultimately to Hadoop/Spark/TensorFlow (step 3). Since the pins only represent point-in-time changes, the analytics jobs in step 3 also need the full copy of system-of-record (step 2). Hence the arc between (2) and (3) requires periodic snapshots of the system of record merged with recent pin activity – this is called incremental data ingestion.
Pinterest incremental data ingestion
To solve the incremental data ingestion problem, Pinterest designed a periodic snapshotting with delta merging of incoming pins that is sent as input to the analytics jobs. This is explained in Figure 3 below.
A more detailed diagram that zooms into how periodic & incremental DB ingestion is taken and then merged with a System of Record snapshot is explained in Figure 4. In the Pinterest data ingestion pipeline this is called Watermill.
Maxwell is an open source project that mines redo logs from MySQL. The Pinterest architecture creates two sets of snapshots (previous bootstrap and newly merged). The previous bootstrap can be either the newly merged or from a daily dump (which is an additional copy of the system of record.) The total process of creating a consistent merged snapshot takes about an hour.
While this pipeline enables Pinterest to do amazing things, it can also create significant challenges for an enterprise environment. Firstly, Pinterest performs multiple ingests of the system of record (per day) from snapshots of the shards. For this, each shard is copied multiple times which, for Pinterest, is multiplied by thousands of shards creating both significant WAN and computation costs as well as potential regulatory risks. Secondly, Kafka and Maxwell create problems with duplicate events and hard/soft deletes. This forces the need for a staging area to solve the duplication problem which results in more than an hour lag of operational data. While this long lag may work for Pinterest, it does not scale for the enterprise where automated workloads like fraud detection need near instant access to transactionally consistent data. Thirdly, there is considerable operational complexity from several moving parts in their deep pipeline.
In part 2 of Strata post, I’ll share how the Griddable.io approach simplifies these challenges by using a data pipeline technology that is purpose-designed for transactional data.
This drastically simplifies the database ingestion with the following advantages:
- Periodic snapshots are not necessary (you save on space for two copies and the processing to create snapshots) because our grid is not lossy and handles hard deletes.
- Griddable.io can use either a database offline/online snapshot or a policy-based clone if you want a selective snapshot. A policy-based clone is a highly parallel extractor from the source database.
- Updates are continuous so there is no lag for the systems of insight.
- Failure cases are handled by the Griddable.io platform.
- No need to de-duplicate records since Griddable.io provides exactly-once, end-to-end at a transactional level. The Griddable.io architecture honors the source DB transactions and ensures events are not applied twice.
- No issues of consistency between multiple bootstraps and incremental streams. This is because the Griddable.io architecture implicitly follows Databus’s timeline consistency model.
- Policies on the consumer can implement filtering of personal information (PII) that is honored by both the initial bootstrap (policy based clone) and the incremental changes (relay).