Written by Aneeq Yusuf
Big Data & Machine Learning Technical Lead at Royal Cyber
Data today is maintained in large data warehouses where millions of records are stored and updated, often at regular intervals. This new information needs to be analyzed to support different business processes and to derive value from it. While databases provide a historical data store, analytical systems derive their real value from the most recent database changes and the analysis built on them. Change Data Capture (CDC) is a technique that allows users to gather updates from databases in real time and record them in an accessible manner. Using the most recent data, such as sales and expenditures, we can support decisions such as customer targeting with the most up-to-date information.
While CDC is an efficient and valuable process, the underlying difficulty is that each server has only limited processing power, and running analytical queries on the same datastore slows down both workloads. The need of the hour is to offload these processes so that they do not interfere with each other. CDC enables organizations to see in real time what has changed and what new data has been gathered, whereas plain analytical querying would require scanning the complete database every time a fresh decision has to be made, which is cumbersome when dealing with millions of records. Let us find out, at a fundamental level, how a CDC system can help make proactive business decisions.
OLAP systems perform analytical processing in real time on multidimensional data to derive value from it. CDC acts as a critical player, feeding new data to these OLAP systems in an uninterrupted way so that decisions can be taken proactively. In contrast, traditional systems upload data from the production servers to the OLAP system for further analysis. This is a tedious task, as one needs to determine what has changed since the last update and upload data from that point onwards. Such traditional methods use table differencing techniques to assess the effective changes between two backup points. These days, operational data can range from a few gigabytes to hundreds of terabytes, which renders the old procedure unreliable in production scenarios.
CDC enables you to perform this task more effectively by capturing recent changes in the database and saving them to the data sink from which the OLAP system can work. There are two approaches to CDC: batch and streaming.
The batch approach involves segregating a batch of new transactions and packaging them for analytical processing. This process is time-consuming, as the system needs to determine the most recent changes from the changelog, create a batch out of them, and pre-process it for further storage.
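As a minimal sketch of the batch approach, using a hypothetical in-memory changelog (the field names and offsets are illustrative, not from any specific database), a batch job periodically collects every entry recorded after the last offset it processed:

```python
# Minimal batch-CDC sketch: a hypothetical in-memory changelog.
# Each entry is (offset, operation, row). A batch job periodically
# packages everything after the last offset it processed.

changelog = [
    (1, "INSERT", {"id": 1, "total": 40}),
    (2, "INSERT", {"id": 2, "total": 15}),
    (3, "UPDATE", {"id": 1, "total": 55}),
]

def collect_batch(changelog, last_offset):
    """Package all changes recorded after last_offset."""
    batch = [entry for entry in changelog if entry[0] > last_offset]
    new_offset = batch[-1][0] if batch else last_offset
    return batch, new_offset

# Suppose offset 1 was already processed: only entries 2 and 3 are batched.
batch, offset = collect_batch(changelog, last_offset=1)
```

The cost referred to above lies in scanning the changelog, building the batch, and pre-processing it on every run, which grows with the volume of accumulated changes.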
Streaming CDC is a non-traditional method that has gained popularity in recent years. It extracts the newest data from the production database and streams it directly to the OLAP system for further processing. This method is faster because there is no need to separate old data from new, so no bulk processing is required.
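By contrast, streaming CDC forwards each change the moment it is committed. A minimal sketch, with a simple in-process queue standing in for the message broker:

```python
from queue import Queue

# Minimal streaming-CDC sketch: each committed change is pushed to a
# queue (standing in for a message broker such as Kafka) as soon as it
# happens, so the consumer never has to separate old data from new.

stream = Queue()

def on_commit(operation, row):
    """Hypothetical hook called by the source database on every commit."""
    stream.put((operation, row))

# The production side commits changes...
on_commit("INSERT", {"id": 7, "total": 99})
on_commit("DELETE", {"id": 3, "total": 12})

# ...and the OLAP side consumes them as they arrive.
received = [stream.get() for _ in range(stream.qsize())]
```

Because each change is delivered individually, there is no batch to assemble and no changelog scan on the consuming side.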
Now, let us see how Kafka fits into the big picture and aids our process.
Kafka is an event streaming platform maintained by the Apache Software Foundation, with commercial development led by Confluent. It provides a framework that can store, read, and analyze data streams while connecting multiple components, essentially passing messages between distinct elements such as processes and applications. Kafka can provide a better CDC streaming system by sitting between the data store and the OLAP system. It can handle trillions of events per day, an ideal benchmark for a live system-to-OLAP connection. It started as a messaging queue and has since become a full-fledged event streaming service. Another advantage of Kafka is that you can process the data and make small changes as it arrives before sending it on to the OLAP server. We can approach building a CDC pipeline with Kafka in two ways:
Databases such as MySQL and ODB maintain a transactional log of every transaction in a binary format. We can enable Kafka to access that binary data, extract the necessary metadata from it, and stream the resulting information to the OLAP servers.
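To illustrate the idea of tailing a binary transaction log, here is a sketch using a toy record format (this is not MySQL's actual binlog layout): each fixed-size record is decoded, its metadata extracted, and only changes past the last processed offset are forwarded.

```python
import struct

# Toy binary changelog format (NOT MySQL's real binlog): each record is
# a 4-byte offset, a 1-byte operation code, and a 4-byte row id.
OPS = {0: "INSERT", 1: "UPDATE", 2: "DELETE"}
RECORD = struct.Struct(">IBI")

def encode(offset, op_code, row_id):
    """Write one record in the toy binary format."""
    return RECORD.pack(offset, op_code, row_id)

def tail_log(data, last_offset):
    """Decode records and yield metadata for changes past last_offset."""
    for pos in range(0, len(data), RECORD.size):
        offset, op_code, row_id = RECORD.unpack_from(data, pos)
        if offset > last_offset:
            yield {"offset": offset, "op": OPS[op_code], "row_id": row_id}

log = encode(1, 0, 10) + encode(2, 1, 10) + encode(3, 2, 11)
changes = list(tail_log(log, last_offset=1))
```

In practice, a log-based connector performs this decoding against the database's real log format and publishes each extracted change to a Kafka topic.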
Triggers are automatic actions that perform a designated operation whenever the database is updated. We can configure the database to contact and communicate with Kafka whenever a record is created, updated, or deleted.
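Trigger-based capture can be demonstrated with SQLite, whose triggers behave much like those in MySQL or PostgreSQL: a trigger records every insert into a changes table, which a producer process could then poll and publish to Kafka. A minimal sketch (the table and column names are illustrative):

```python
import sqlite3

# Trigger-based CDC sketch using SQLite: every INSERT on `orders` is
# recorded in a `changes` table, which a separate producer process
# could poll and publish to a Kafka topic.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL);
    CREATE TABLE changes (op TEXT, order_id INTEGER, total REAL);
    CREATE TRIGGER orders_insert AFTER INSERT ON orders
    BEGIN
        INSERT INTO changes VALUES ('INSERT', NEW.id, NEW.total);
    END;
""")

db.execute("INSERT INTO orders VALUES (1, 49.99)")
db.execute("INSERT INTO orders VALUES (2, 15.00)")

# The trigger has captured both inserts without any table scan.
captured = db.execute("SELECT op, order_id, total FROM changes").fetchall()
```

Corresponding `AFTER UPDATE` and `AFTER DELETE` triggers would capture the other two operation types in the same way.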
To implement trigger-based or transaction-log-based CDC, one needs to determine whether to go with the streaming method or the batch method. Let us compare these two methods.
For any organization looking to implement CDC in their data stores, it is crucial to determine which of the two methods will help them most. We have outlined the advantages of streaming CDC over batch CDC as a guide:
| Streaming CDC | Batch CDC |
| --- | --- |
| Faster when dealing with large amounts of data. | Takes time to process extensive segmented data. |
| Data can be modified within the message queue buffer so that it arrives appropriately shaped at the destination. | No modification of data is possible in transit; it needs to be appropriately shaped before entry into the sink database. |
| Techniques such as table differencing are not required, as the most recent data is recognized through triggers and commit logs. | Log differencing and table differencing techniques are needed to find the most recent transactions. |
| Connectors are available for a wide range of databases. | Setting up the architecture is more complicated. |
For example, an e-shopping organization maintains a production database that keeps a record of all transactions. Streaming CDC using Kafka can help in this situation by delivering the most recent transactions to the OLAP system within a fixed period. The OLAP system can then use this data to determine which region shops most at what time, or when most shoppers are active. The organization can then leverage this insight in real time to provide offers tailored to that situation.
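Continuing the e-shopping example, the OLAP side could aggregate the streamed transactions by region and hour to find where and when shoppers are most active. A minimal sketch with hypothetical field names:

```python
from collections import Counter

# Aggregate streamed transactions by (region, hour) to find where and
# when shoppers are most active. Field names are illustrative.
transactions = [
    {"region": "north", "hour": 20, "total": 30},
    {"region": "north", "hour": 20, "total": 12},
    {"region": "south", "hour": 9, "total": 55},
]

activity = Counter((t["region"], t["hour"]) for t in transactions)
busiest_region, busiest_hour = activity.most_common(1)[0][0]
```

With a streaming pipeline, this counter can be updated incrementally as each transaction arrives, rather than recomputed over the whole history.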
We can see that a large amount of data is deposited into production servers and databases, and processing it at the right time can mean a lot to organizations. The easiest way to do that is streaming CDC, which transfers new data to the data sink as soon as it is available. A straightforward way to apply this technique is through Kafka, an enhanced messaging queue that can transport recent data, and the results through such a system get incrementally better. To know more, talk to our Royal Cyber experts today.