A popular use case for CDC is replicating databases across datacenters. This is becoming increasingly common as companies look to adopt cloud and a hybrid-cloud environment. In this way CDC is closely linked to event-driven architecture (EDA) which is why CDC is becoming more common as businesses realize the importance of event-enabling their architecture to meet customer demand and improve overall customer experience with EDA. It is in their interest to ensure data is captured, analyzed, and acted upon in real-time. Gone are the days of batch processing where valuable data would sit for hours or days and lose its value over time. In this brave new world, data is analyzed in real-time while it is most valuable. A flight passenger has no use for an alert notifying them of a flight delay 5 hours later.
Coming back to CDC: It is a design pattern used to identify changes in data so that actions can be taken on those changes in real time. Databases are everywhere and we are all used to storing plenty of data in databases where it rests. We frequently query this data as well as watch it change over time. Records get inserted, deleted and updated. And while the new information provides very valuable information, there is additional value in tracking changes to these database objects in real-time.
With events, we can respond to them in real time and build downstream pipeline that can react to these events. For example, as soon as a new customer is added, we would like to send a thank-you email to them. This can be done via CDC. A new row is inserted for the new customer which is turned into an event, published to an event broker, subscribed to by some number of downstream processes, one of which is responsible for sending a thank-you email.
Once you have the docker container running, you will need to install MS SQL CLI. There are different ones available for the OS you are using. For an EC2 instance, you can install the CLI using this command:
Now that the db and table have been created, we need to enable them for CDC so that MS SQL Server knows we want changes to be tracked. You have to enable CDC at database level and then at table level.
There are different ways to deploy StreamSets. For this demo, I am using their free-tier SaaS offering which you can sign up for here. There is a lot of tutorials available to learn how you can build a pipeline using different input sources and destinations. You can even use components to transform payloads. For our demo, though, we will keep it simple and publish the event as it is to Solace PubSub+.
Click on Set Up >>Deployments and create a new Deployment with the following additional libraries: JMS, and SQL Server 2019 Big Data Cluster Enterprise Library. Follow rest of the instructions to get to the docker installation instructions. For example:
PubSub+ supports dynamic hierarchical topics which allow subscribers to filter events using wildcards. For example, I might have multiple CDC sources (Oracle, MS SQL, etc.) publishing to different topics such as cdc/demo/. I might have one downstream subscriber subscribing to cdc/*/sqlserver and another subscribing to cdc/*/oracle. You can learn more about Solace topics here.
As you can see in the payload, it picked up the new row that was inserted into the database. You can use this event as a trigger for numerous downstream processes. One such process can be one that sends a Welcome email to new customers. I am sure James Bond would love that!
As one of Solace's solutions architects, Himanshu is an expert in many areas of event-driven architecture, and specializes in the design of systems that capture, store and analyze market data in the capital markets and financial services sectors. This expertise and specialization is based on years of experience working at both buy- and sell-side firms as a tick data developer where he worked with popular time series databases kdb+ and OneTick to store and analyze real-time and historical financial market data across asset classes.
In addition to writing blog posts for Solace, Himanshu publishes two blogs of his own: enlist[q] focused on time series data analysis, and a bit deployed which is about general technology and latest trends. He has also written a whitepaper about publish/subscribe messaging for KX, publishes code samples at GitHub and kdb+ tutorials on YouTube!
Himanshu holds a bachelors of science degree in electrical engineering from City College at City University of New York. When he's not designing real-time market data systems, he enjoys watching movies, writing, investing and tinkering with the latest technologies.
Change Data Capture (CDC) is quite the buzz these days. I have been in numerous client meetings where CDC has come up lately, way more than before. So what is CDC and why might you be interested in it?
CDC is closely linked to Event-Driven Architecture (EDA) and with the recent rise in the popularity of EDA, CDC has become more common as well. As businesses realize the importance of event-enabling their architecture to meet customer demand and improve overall customer experience, they have embraced EDA with open arms. It is in their interest to ensure data is captured, analyzed, and acted upon in real-time. Gone are the days of batched processing where valuable data would sit for hours or days and lose its value over time. In this brave new world, data is analyzed in real-time instead of batches while it is most valuable. A flight passenger has no use for an alert notifying them of a flight delay 5 hours later.
Imagine a Customers table which tracks all of the active customers who are currently using a software company's products. As more and more customers purchase more products over time, new rows are inserted and this table grows. Occasionally, customer information needs to be updated, such as their contact information, which leads to rows being updated in the Customers table.
While this is all very valuable information, it is still static and shows you a point-in-time view of the. You can query the table to get the contact information for a customer but it doesn't tell you when and how this information has changed over time. This is where CDC comes in. Its goal is to event-enable databases by turning all databases changes into events. Bingo!
In this diagram, we have different databases where our records might be stored. We can event-enable these databases using CDC connectors provided by StreamSets. StreamSets is an enterprise data integration platform with multiple CDC connectors to databases such as Microsoft SQL Server, Oracle, MySQL, and PostgreSQL. StreamSets will generate events and publish them to Solace's PubSub+ broker.
PubSub+ is an enterprise-grade event broker widely deployed by companies across industries to event-enable their architecture. It supports open APIs and multiple protocols which makes it extremely easy to integrate it with other technologies natively (without any additional proxies). Once the event is published to PubSub+ broker, it can be consumed by multiple consumers independently. You can have a microservice using Java API to process the event, enrich it and then write it to Salesforce. Events can also be pushed out to downstream services such as AWS's API Gateway via REST webhooks (all natively built into Solace PubSub+) and then consumed by additional AWS services such as Lambda. Finally, you can leverage Apache Spark for analytics as well.
Now that we know how the different components work, let's test it out! We will show how you can event-enable your Microsoft SQL Server leveraging Streamsets' CDC connector and Solace PubSub+ event broker.
Now that the DB and table have been created, we need to enable them for CDC so that MS SQL Server knows we want changes to be tracked. You have to enable CDC at the database level and then at the table level.
Click on that engine and click on External Resources on the following page. This is where you can upload additional libraries/resources for your data collector. We will need to upload Solace's JMS library here to be able to use StreamSet's JMS Producer.
Just like StreamSets, there are numerous ways to deploy PubSub+ event broker. Pick the one that's the easiest for you. I prefer to spin up a free instance using my trial account on Solace Cloud. You can sign up for an account here.
Once you have a service running, go to the Try Me! tab and connect to the Subscriber app. Here we will subscribe to the topic that our JMS Producer was configured to publish events to in the previous step: cdc/demo/sqlserver.
I am using flink checkpointing to restore states of my job. I am using unaligned checkpointing with 100 ms as the checkpointing interval. I see few events getting dropped that were sucessfully processed by the operators or were in-flight that were yet to be captured by checkpoint. That is these were new events which came into the pipeline between the previously captured checkpoint state and the failure.
My project acknowledges(commits) back to the topic after the event read and mongo ingestion. But the pipeline has transformation, enrichment and sink operators after that. These missing events were read, ack'd back to the topic and transformed successfully before failure and were not yet checkpointed (withing the 100 ms interval between checkpoints) were dropped.
I see that whenever runtime exception is thrown it triggers the close method in each of the functions one by one. Do we have to store the states which were not yet captured by the checkpoint before failure? What happens Network failures or task manager crash or any other abrupt failure?
Or do we have to shift the source topic acknowledgment to the last ( but we will have to chain all this operators to run in a single thread and carry the bytearray message object from solace queue to do ack at the end).
His songs, about loneliness and isolation, provided a profound sense of connection. I found solace, especially during lockdown, in his song Afterhours, which explores thoughts about painful longing and introspection. It had a big impact on my life. His soulful vocals and relatable lyrics captured the essence of how I was feeling.
c80f0f1006