Hello,
I hope you're all doing well. We're working on real-time replication in production from several databases, but we're encountering significant lag from one particular source. The lag typically ranges from 5 to 10 minutes, occasionally reaching up to 200 minutes, depending on the load.
Our cluster is dedicated to Debezium replication and consists of three broker nodes, each running a Kafka broker, Kafka Connect, and Schema Registry. Each node has 8 vCPUs and 16 GB of RAM, with a 6 GB heap for the broker and a 4 GB heap for Connect. We're running Confluent Kafka/Connect/Schema Registry 7.5 on Zulu JDK 11, with the Debezium Oracle Connector version 2.5.2.
The problematic connector captures updates on 75 tables, averaging about 5 million events per day and peaking around midday, which is when the lag is at its worst. I've tried increasing the heap for Kafka Connect, but it wasn't fully utilized. I also split the tables across three connectors, but this hasn't improved performance.
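In case it helps to see what the split looks like, each of the three connectors is shaped roughly like this; only the keys relevant to the split are shown, and the connector name and topic prefix here are made up for illustration (each connector gets a disjoint slice of the table list plus its own topic prefix and schema history topic):

{
  "name": "oracle-tia-part1",
  "config": {
    "connector.class": "io.debezium.connector.oracle.OracleConnector",
    "topic.prefix": "tia_part1",
    "schema.history.internal.kafka.topic": "schema-history.tia_part1",
    "table.include.list": "TIA.POLICY,TIA.CLA_CASE,TIA.CLA_EVENT"
  }
}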
After monitoring the production load with JMX, I noticed that QueueRemainingCapacity rarely drops below 7,000, while QueueTotalCapacity is set to the default of 8,192.
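For reference, QueueTotalCapacity corresponds to the connector's max.queue.size, so if I'm reading the metric right, the internal event queue stays almost empty and the bottleneck would be on the log-mining side rather than in handing events off to Kafka. As far as I understand from the Debezium docs, the queue-related defaults are:

{
  "max.queue.size": "8192",
  "max.batch.size": "2048",
  "poll.interval.ms": "500"
}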
Any insights or suggestions would be appreciated.
Thank you!
Hi Chris,
I've run several iterations with filter mode enabled and adjusted the parameters along the way; so far this has significantly improved performance, though a few issues remain.
Log Mining Batch Size: I experimented with the log.mining.batch.size.* settings, as the defaults were causing a bottleneck: the mining batches kept maxing out at the default 100k ceiling, even after enabling filter mode.
Producer Configuration: I also updated the Kafka Connect producer settings (batch size, max request size, linger, and compression). These settings were previously set to the default Confluent values, which seemed to cause a bottleneck between Connect and the broker.
Batch and Queue Sizes: Finally, I tweaked the max.batch.size and max.queue.size settings, which are now somewhat oversized and not fully utilized; a rough sketch of all of these settings follows below.
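The property names below are the Debezium Oracle connector and Kafka Connect producer-override settings as I understand them; the values are illustrative placeholders, not my exact production numbers (the producer settings can equally be set at the worker level as plain producer.* keys; here they're shown as per-connector producer.override.* entries):

{
  "log.mining.query.filter.mode": "in",
  "log.mining.batch.size.min": "10000",
  "log.mining.batch.size.default": "50000",
  "log.mining.batch.size.max": "300000",
  "max.batch.size": "8192",
  "max.queue.size": "32768",
  "producer.override.batch.size": "524288",
  "producer.override.linger.ms": "50",
  "producer.override.compression.type": "lz4",
  "producer.override.max.request.size": "10485760"
}

If I remember correctly, the producer.override.* keys only take effect when the Connect worker is started with connector.client.config.override.policy=All.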
Here’s the configuration I'm working with:
{
Hi Chris,
I wanted to share a quick update after our last meeting. Upon reviewing the Oracle alert logs, I've identified a possible culprit for the periodic source lag: the issue seems to occur after every second or third log switch. I've attached two alert logs from yesterday for your reference; one shows a seamless log switch, while the other, 15 minutes later, introduced an 8-minute lag.
We haven't been able to tie this behavior to CPU or disk I/O usage on the database. As we discussed earlier, the database runs on ASM with fibre-optic connectivity and enterprise-grade disks. The bandwidth between the database and Debezium doesn't appear to be the culprit either.
Additionally, there have been a few recent changes to the database log setup:
Despite these changes, there hasn't been any noticeable improvement in lag. This might be because, during working hours, the log switches typically happen when the logs reach around 2.5 GB, rather than when the full log size is used.
If you have any further recommendations or areas we should review with the DBA, please let me know.
Thank you,
Michael
Hi Chris,
Thank you for getting back to me so quickly. I've compiled an Excel file with two lists: one containing the log mining results for the no-lag scenario, and another with the results from when the lag was significant.
Due to file size limitations on Google Groups, I’ve uploaded the zip file to Google Drive for easier access - https://drive.google.com/file/d/13sewTpcQ3AXDwdyQNaauEZMQDlh9bN7c/view?usp=sharing
In case it's needed, these are the tables we capture: "table.include.list": "TIA.ACC_ITEM,TIA.ACC_PAYMENT_DETAILS,TIA.AGREEMENT_LINE,TIA.CASE_ITEM,TIA.CASE_ITEM_ATTACHMENTS,TIA.CLA_CASE,TIA.CLA_CODE,TIA.CLA_CUBE,TIA.CLA_EVENT,TIA.CLA_EVENT_LOG,TIA.CLA_ITEM,TIA.CLA_PAYMENT_ITEM,TIA.CLA_QUESTION,TIA.CLA_SUBCASE,TIA.CLA_THIRD_PARTY,TIA.HISTORY_LOG,TIA.INTERESTED_PARTY_IN_OBJ,TIA.NAME,TIA.NAME_TELEPHONE,TIA.TCP_CLA_E_COMMUNICATION,TIA.OBJECT,TIA.OBJ_RISK_SLAVE,TIA.POLICY,TIA.POLICY_ENTITY,TIA.PRINT_ARGUMENT,TIA.PRINT_REQUEST,TIA.PRODUCT_LINE,TIA.RELATION,TIA.TARIFF_CODES,TIA.TARIFF_RATING,TIA.TARIFF_STRUCTURE,TIA.TCP_ANU_CLAIM_RESERVE,TIA.TCP_ANU_CNU,TIA.TCP_ANU_MODEL_VOUCHER,TIA.TCP_CLAIM_CONTACT,TIA.TCP_CLAIM_NAME_ROLES,TIA.TCP_CLA_ACCESS_AUDIT,TIA.TCP_CLA_COINSURANCE,TIA.TCP_CLA_COINSURANCE_SHARES,TIA.TCP_CLA_CRS_COMMUNICATION,TIA.TCP_CLA_CRS_DATA,TIA.TCP_CL_RECEIVABLES_OUT,TIA.TCP_DELEGATE_USER,TIA.TCP_EMAIL_REQUEST,TIA.TCP_HOME_SERVICE,TIA.TCP_INET_NONTASK_INSP,TIA.TCP_INET_STAT_PLACE,TIA.TCP_KCC_H_VAZBALPRIZIKOTIA,TIA.TCP_KCC_RISK_XREF,TIA.TCP_PL_PLACE_OF_INSURANCE,TIA.TCP_POLICY_NAME_ROLES,TIA.TCP_POL_AGENT_COMMISSION_NO,TIA.TCP_THIRD_PARTY_POV,TIA.TCP_USER_QUALIF_FA,TIA.TCP_VEHICLE_CARD,TIA.TIA_USER_PROFILE,TIA.TOP_USER,TIA.WORK_GROUP_MEMBER,TIA.TCP_CLA_COINSURANCE_REI,TIA.TCP_QUALIFICATION,TIA.TCP_DOCS_EXPECTED,TIA.TCP_POLICY_CANCELLATION,TIA.TCP_KCC_H_VAZBABALICEKLP,TIA.TCP_USER_AGENT,TIA.POL_REFERRAL,TIA.TCP_GIS_DETAILS,TIA.TCP_B1_GIS_DETAILS,TIA.POST_CODE,TIA.TCP_EMAIL_ARGUMENT,TIA.TCP_PROD_COV_ADDRESS,TIA.TCP_SEG_VIP,TIA.XLA_PE_REFERENCE,TIA.ACC_ITEM_LOG,TIA.TCP_QUALIFICATION_HIST,TIA.TCP_USER_QUALIF_FA_HIST"
Let me know if there’s anything else you need!
Thank you,
Michael
Hi Chris,
Just checking in—any updates regarding the suspicious log switching and rollbacks?
Thanks,
Michael