In Part 1 of the Musings from Strata series, we discussed how a data ingestion pipeline was constructed for Pinterest. In this post, we will discuss how to simplify the implementation with the Griddable.io approach. To read Part 1 of this series, click here.
Using Griddable.io to do Pinterest incremental data ingestion
Griddable.io is a new system where a bootstrap is integrated into a transactional change data system. A detailed description of Griddable.io architecture is here.
With Griddable.io (Figure 5), the relay mines redo logs (like Maxwell) and captures recent change-records into memory, retiring older change-records to a change history server. This change history can be kept for as long as you wish. The system works with either single instance, clustered or sharded databases.
To setup the initial copy, there are two choices, 1) using daily dump (as in Pinterest) or 2) more interestingly, doing a policy based clone. With a policy based clone, a consumer can be selective about how they want to create a clone of the system of record. A clone need not be a full copy of the database as what you get from a daily dump. The policy for the clone can be a “selection” based on schema/table/row level.
Figure 6 implements the complete Pinterest use case, where we use two consumer plugins, one for the initial copy using a policy based clone and the other for the incremental change.
To implement the Pinterest use case with the Griddable.io system, the user will directly write to HBase or HDFS which will be the entry point into Hadoop or Spark. To achieve in-place updates without an external compaction system, Hive-ACID or HBase can be used as the target for the consumer.
This drastically simplifies the database ingestion with the following advantages:
- Periodic snapshots are not necessary (you save on space for two copies and the processing to create snapshots) because our grid is not lossy and handles hard deletes.
- Griddable.io can use either a database offline/online snapshot or a policy-based clone if you want a selective snapshot. A policy-based clone is a highly parallel extractor from the source database.
- Updates are continuous so there is no lag for the systems of insight.
- Failure cases are handled by the Griddable.io platform.
- No need to de-duplicate records since Griddable.io provides exactly-once, end-to-end at a transactional level. The Griddable.io architecture honors the source DB transactions and ensures events are not applied twice.
- No issues of consistency between multiple bootstraps and incremental streams. This is because the Griddable.io architecture implicitly follows Databus’s timeline consistency model.
- Policies on the consumer can implement filtering of personal information (PII) that is honored by both the initial bootstrap (policy based clone) and the incremental changes (relay).
Griddable.io is an end-to-end system for database continuous ingestion
At Griddable.io, we believe there is a need for a new architecture enabling continuous database ingestion for real-time analytics. We simplify and automate the process of creating initial clones, the selective replication for incremental changes and policy based transformations of data across various databases.
The work we’ve done to harden and extend the open core Databus project from LinkedIn provides a solution for the enterprise to create continuous ingestion for operational data without copy sprawl or extensive custom engineering to ensure transactional consistency and reliability. If you are intrigued by our approach, register for our beta at https://www.griddable.io/.