Most data scientists and database experts know that transactional data is unique. Transactional data is valuable only when it exists in an ecosystem of transactional semantics. Yet, transactional data is still sometimes shared using buses or queues that treat it like click data or asynchronous messages.
Griddable is unique. Its data pipeline architecture preserves transaction semantics while maximizing performance and scalability for transactional data.
Each Griddable data pipeline is built from one or more instances of three basic components: relays, consumers, and change history servers.
First, relays pull change data using a source-dependent protocol: JDBC with LogMiner for Oracle, or the binlog protocol for MySQL and MariaDB. The relay applies replication policies to each change event to determine which events to transmit to consumers. Finally, the relay publishes the selected events to a circular in-memory buffer, from which consumers pull them for entry into downstream target databases.
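The relay's filter-and-publish behavior can be sketched as follows. This is a minimal illustration, not Griddable's actual API: the `Relay` class, the event dictionaries, and the policy predicate are all hypothetical stand-ins.

```python
from collections import deque

class Relay:
    """Sketch of a relay: applies a replication policy to each change
    event and publishes accepted events to a bounded circular buffer
    that consumers pull from. Names are illustrative only."""

    def __init__(self, policy, capacity=1024):
        self.policy = policy                  # predicate: event -> bool
        self.buffer = deque(maxlen=capacity)  # circular: oldest events drop off

    def publish(self, event):
        # Only events that pass the replication policy reach consumers.
        if self.policy(event):
            self.buffer.append(event)

    def pull(self, since_scn):
        # Consumers request events after their last processed change number.
        return [e for e in self.buffer if e["scn"] > since_scn]
```

Because the buffer is bounded, a slow consumer can fall behind it; that is exactly the case the change history server covers, as described next.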
Consumers continuously pull replication events from relays over HTTP, storing them in consumer event buffers until they are processed and written to the target database. The consumer maintains the original order of transactions as it pulls them from the relay. When a relay no longer contains a required event, the consumer pulls the change from a persistent change history server instead. As it pulls and processes transactions, the consumer tracks its own state as the last successfully processed transaction.
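The consumer loop described above can be sketched like this. The `relay` and `history` objects, their shared `pull` interface, and the event shape are assumptions for illustration; only the overall flow (pull in order, fall back to history, advance state after success) reflects the text.

```python
class Consumer:
    """Sketch of a consumer: pulls ordered events from the relay, falls
    back to the persistent change history server when the relay no
    longer holds a required event, and tracks its state as the last
    successfully processed transaction. Interfaces are hypothetical."""

    def __init__(self, relay, history, apply_fn):
        self.relay = relay
        self.history = history
        self.apply_fn = apply_fn  # writes one transaction to the target DB
        self.last_scn = 0         # state: last successfully processed SCN

    def poll_once(self):
        events = self.relay.pull(since_scn=self.last_scn)
        if not events:
            # The relay buffer may have rolled past our position;
            # catch up from the change history server instead.
            events = self.history.pull(since_scn=self.last_scn)
        for event in events:      # events arrive in commit order
            self.apply_fn(event)
            self.last_scn = event["scn"]  # advance only after success
```

Note that `last_scn` advances only after `apply_fn` succeeds, so a crash mid-batch simply causes the consumer to re-pull from its last good position.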
Preserving transaction consistency
All components in Griddable.io’s data pipeline architecture follow the commit timeline defined by the source database. This means that they see changes in the source database commit order and preserve transaction boundaries. Pull-based data pipeline architectures have several advantages:
- Resilience to unavailable, slow or faulty components
- Easy sharing of state with downstream components
- Dramatic scalability by adding additional components where capacity is needed
Following this model, every component in the data pipeline architecture expresses its state simply as the last transaction it successfully processed. Each database assigns change numbers slightly differently; Oracle databases, for example, assign a sequential System Change Number (SCN) to each transaction. Thus, a single SCN succinctly describes a component's progress in processing the incoming stream of change data.
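Because a component's state is just one number, checkpointing and restart become trivial. A minimal sketch, assuming a JSON file as the checkpoint store (the file path and format are illustrative, not Griddable's mechanism):

```python
import json
import os
import tempfile

def save_checkpoint(path, last_scn):
    # Persist the single number that fully describes this
    # component's progress through the change stream.
    with open(path, "w") as f:
        json.dump({"last_scn": last_scn}, f)

def load_checkpoint(path):
    # On restart, resume pulling immediately after the last
    # successfully processed transaction; 0 means "from the start".
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["last_scn"]
```

On restart, the component hands `load_checkpoint(path)` to its pull loop as `since_scn` and continues exactly where it left off.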
Very low latency
By isolating components, the Griddable data pipeline architecture serves most consumer pull requests from memory, making data transfer very fast. Consumers contribute to performance by batching multiple transactions into a single write operation when updating target databases. Some consumers also parallelize event processing to hide the latency of writing to remote data stores.
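The batching idea can be sketched as below. Here `write_batch` stands in for a single bulk write against the target database (for example, an `executemany` call); it is a hypothetical callback, not a Griddable API.

```python
def flush_batches(events, write_batch, batch_size=100):
    """Sketch: group ordered replication events into batches and issue
    one write per batch, preserving commit order within and across
    batches. `write_batch` is an assumed bulk-write callback."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) >= batch_size:
            write_batch(batch)  # one round trip for many transactions
            batch = []
    if batch:
        write_batch(batch)      # flush the final partial batch
```

Trading a little latency per event for far fewer round trips is what lets a single consumer keep up with a busy source database.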
Scalable and elastic
All components in the Griddable data pipeline architecture are loosely coupled and share nothing. Thus, the architecture scales on demand simply by adding relays or consumers and partitioning traffic across them.
The Griddable data pipeline architecture evolves gracefully as scalability and availability requirements change. Additional relay instances work together to replicate high transaction volumes from larger and faster source databases. Likewise, additional consumer instances work together, each processing a portion of the incoming replicated events. The architecture scales by adding consumers because each consumer is fully decoupled from the relays.
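One simple way to partition traffic across consumer instances is to hash a stable key, so each partition still sees its events in commit order. This is a generic sketch of the idea, not Griddable's actual routing scheme; the choice of the table name as the key is an assumption.

```python
import zlib

def partition_for(event, num_consumers):
    """Sketch: route each event to a consumer instance by hashing a
    stable key (here, the table name). The same key always maps to the
    same consumer, so per-table commit order is preserved."""
    key = event["table"].encode()
    # crc32 is deterministic across runs, unlike Python's built-in hash().
    return zlib.crc32(key) % num_consumers
```

Adding a consumer changes `num_consumers` and thus remaps some keys, which is why each consumer's restart-from-last-SCN state matters: a newly assigned partition can simply resume from its checkpoint.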
To deploy additional relays or consumers on demand, the Griddable data pipeline architecture runs on an elastic Kubernetes (K8s) infrastructure. Kubernetes resizes the infrastructure automatically as needed and makes it portable across all major public clouds. Further, the Kubernetes graphical management dashboard shows all nodes in the cluster along with their operational state.
Push the “Live demo” button to see the Griddable data pipeline architecture in action.