Based on the following metrics, what can be done to reduce lag from source?

Lucas Rangel Gazire

Jun 10, 2024, 2:37:33 PM
to debezium
Hi all,
Thanks to Chris, I have optimized the embedded Debezium engine without impacting throughput. I'm using the following configuration:
DEBEZIUM_FIXED_INCOME_MIN_BATCH_SIZE=50000
DEBEZIUM_FIXED_INCOME_MAX_BATCH_SIZE=20000000
DEBEZIUM_FIXED_INCOME_DEFAULT_BATCH_SIZE=500000
DEBEZIUM_FIXED_INCOME_MIN_SLEEP_TIME=1000
DEBEZIUM_FIXED_INCOME_MAX_SLEEP_TIME=3000
DEBEZIUM_FIXED_INCOME_SLEEP_TIME_DEFAULT=1500
DEBEZIUM_FIXED_INCOME_POLL_INTERVAL_MS=5000
DEBEZIUM_FIXED_INCOME_CACHE_SIZE=100000
DEBEZIUM_FIXED_INCOME_MAX_QUEUE_SIZE=49152
DEBEZIUM_FIXED_INCOME_STREAM_MAX_BATCH_SIZE=16384
DEBEZIUM_FIXED_INCOME_FETCH_SIZE=50000
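
For context, this is roughly how we wire these variables into the engine's Oracle connector properties. A minimal sketch, assuming the standard Debezium log-mining property names; CACHE_SIZE is omitted since its mapping isn't obvious from the name alone:

import java.util.Properties;

Properties props = new Properties();
props.setProperty("log.mining.batch.size.min", System.getenv("DEBEZIUM_FIXED_INCOME_MIN_BATCH_SIZE"));            // 50000
props.setProperty("log.mining.batch.size.max", System.getenv("DEBEZIUM_FIXED_INCOME_MAX_BATCH_SIZE"));            // 20000000
props.setProperty("log.mining.batch.size.default", System.getenv("DEBEZIUM_FIXED_INCOME_DEFAULT_BATCH_SIZE"));    // 500000
props.setProperty("log.mining.sleep.time.min.ms", System.getenv("DEBEZIUM_FIXED_INCOME_MIN_SLEEP_TIME"));         // 1000
props.setProperty("log.mining.sleep.time.max.ms", System.getenv("DEBEZIUM_FIXED_INCOME_MAX_SLEEP_TIME"));         // 3000
props.setProperty("log.mining.sleep.time.default.ms", System.getenv("DEBEZIUM_FIXED_INCOME_SLEEP_TIME_DEFAULT")); // 1500
props.setProperty("poll.interval.ms", System.getenv("DEBEZIUM_FIXED_INCOME_POLL_INTERVAL_MS"));                   // 5000
props.setProperty("max.queue.size", System.getenv("DEBEZIUM_FIXED_INCOME_MAX_QUEUE_SIZE"));                       // 49152
props.setProperty("max.batch.size", System.getenv("DEBEZIUM_FIXED_INCOME_STREAM_MAX_BATCH_SIZE"));                // 16384
props.setProperty("query.fetch.size", System.getenv("DEBEZIUM_FIXED_INCOME_FETCH_SIZE"));                         // 50000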

From this point on things start getting hard. We have a high-volume window from 11 am to 3 pm during which the lag from source keeps growing until it hits 2 hours. Sometimes it recovers after a while, but on most days the lag persists until the end of the day.
I exported the metrics, and these are the values during this time interval:

last_duration_of_fetch_query_in_milliseconds = 49 seconds
average_batch_processing_throughput = 84
total_parse_time_in_milliseconds = 5.84 min
queue_remaining_capacity = min of 48475, max of 49152
max_duration_of_fetch_query_in_milliseconds = 3.44 min
last_duration_of_fetch_query_in_milliseconds = 52 seconds

Is there a way to improve the configuration without interfering with throughput? As far as I can see, the queue is never filling up; is that a problem? Or should I increase the batch size?




Chris Cranford

Jun 10, 2024, 3:13:31 PM
to debe...@googlegroups.com
Hi Lucas -

What TPS and average transaction size do you see during this 11am-3pm window?
I'd also be curious to know what your redo log switch trend looks like during this window, as well as the BatchSize trend.

Looking at the metrics, the first concern for me is the fetch query max value: 3.5 minutes seems to indicate that there may have been a substantial number of log switches at one point, which caused a tremendous number of logs to require reading. Paired with the extremely large upper bound of the batch size, this can really strain the database's IOPS, the SGA memory used, network bandwidth, and the memory requirements of the connector. Maybe 20M makes sense given the volume of your window, but it's really not something I would expect as an upper bound, for the aforementioned reasons.

Are you also able to provide TotalBatchProcessingTimeInMilliseconds, MinBatchProcessingTimeInMilliseconds, MaxBatchProcessingTimeInMilliseconds, and LastBatchProcessingTimeInMilliseconds? I'd also suggest adding TotalResultSetNextTimeInMilliseconds to see what the read delay is between database fetch loops.
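
If it helps, since you're running the embedded engine in-process, a small sketch like this can dump those attributes from the streaming metrics MBean (the server key "fixedincome" is just a placeholder; use your connector's actual server name):

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class DumpStreamingMetrics {
    public static void main(String[] args) throws Exception {
        MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
        // "fixedincome" is a placeholder -- match your connector's server name
        ObjectName name = new ObjectName(
                "debezium.oracle:type=connector-metrics,context=streaming,server=fixedincome");
        String[] attrs = {
                "TotalBatchProcessingTimeInMilliseconds",
                "MinBatchProcessingTimeInMilliseconds",
                "MaxBatchProcessingTimeInMilliseconds",
                "LastBatchProcessingTimeInMilliseconds",
                "TotalResultSetNextTimeInMilliseconds" };
        for (String attr : attrs) {
            System.out.println(attr + " = " + mbs.getAttribute(name, attr));
        }
    }
}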

I think what is most concerning, though, is the throughput of 84, but let's take a look at these other metrics first.

Chris
--
You received this message because you are subscribed to the Google Groups "debezium" group.
To unsubscribe from this group and stop receiving emails from it, send an email to debezium+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/debezium/77c24984-ec1f-4d5f-a032-e79ead8599a1n%40googlegroups.com.

Lucas Rangel Gazire

Jun 11, 2024, 8:39:25 AM
to debezium
We have 83 transactions per second. I don't have the average transaction size right now; I will try to get it from our Oracle team.
These are the metric values (I didn't find TotalBatchProcessingTimeInMilliseconds, though):
MinBatchProcessingTimeInMilliseconds: 625 ms
MaxBatchProcessingTimeInMilliseconds: 14.5 min (yes, minutes)
LastBatchProcessingTimeInMilliseconds: 1.89 min
TotalResultSetNextTimeInMilliseconds: 23 hours

Chris Cranford

Jun 11, 2024, 9:09:41 AM
to debe...@googlegroups.com
Hi -

OK, so this explains a bit.

A batch requiring 14 minutes to process sounds suspicious. Can we dig into the metrics at this precise point to see whether there are any obvious outliers here? Batch processing shouldn't take more than a minute or two at most, depending on the batch size. This is probably where we need to spend half our time investigating.

But I think above all, we need to understand why ResultSetNext is 23 hours. I wouldn't be surprised if this is related to the batch size & fetch size configurations. This metric basically tracks how much time is spent calling "resultSet.next()", which should be minimal in most cases, only taking a hit during the next database round-trip to fetch the next "fetchSize" worth of data. Do you have trends of this metric so we can identify why it is so high? If the connector has been up for many days, then it makes sense, but if this is a window of, let's say, 2 days, then "Houston, we have a problem" :)
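
To illustrate what that counter measures, here's a rough sketch of the timing (not the connector's actual mining loop; the query and fetch size are stand-ins): with a fetch size of N, most next() calls return instantly from the driver's row buffer, and roughly every Nth call pays a full round-trip to refill it.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Rough illustration only -- not the connector's actual mining loop.
static long timeResultSetNextMillis(Connection conn) throws SQLException {
    long totalNanos = 0;
    try (PreparedStatement stmt = conn.prepareStatement(
            "SELECT scn, operation, sql_redo FROM v$logmnr_contents")) {
        stmt.setFetchSize(50000); // rows buffered per database round-trip
        try (ResultSet rs = stmt.executeQuery()) {
            while (true) {
                long start = System.nanoTime();
                boolean hasNext = rs.next(); // cheap from the buffer, slow on a refill
                totalNanos += System.nanoTime() - start;
                if (!hasNext) {
                    break;
                }
                // ... process the row ...
            }
        }
    }
    return totalNanos / 1_000_000;
}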

Chris

Lucas Rangel Gazire

Jun 11, 2024, 9:21:13 AM
to debezium
There's a weird behavior with this metric: it only ever goes up, never down. On Jun 03 I restarted the service to collect the metrics, and it kept climbing until Jun 05, reaching 15.5 hours. The same thing happened after I had to restart again on Jun 06: from then until today it has reached 23 hours.
I think we have a problem, Houston

Chris Cranford

Jun 11, 2024, 11:28:32 AM
to debe...@googlegroups.com
Lucas -

Would it be possible to enable DEBUG logging on the connector and share the logs, and perhaps even the database alert logs if possible? It doesn't necessarily need to cover the entire 11am-3pm window, although that would be extremely valuable, but at least a reasonable portion of it so we can see what's going on from both the DB side and the connector side. You can send those privately to my email if you'd prefer.

Thanks,
Chris

Lucas Rangel Gazire

Jun 11, 2024, 11:42:23 AM
to debezium
Yes, this can be done, but DEBUG logging produces a large volume of data. What are we looking for here?

Chris Cranford

Jun 11, 2024, 1:23:01 PM
to debe...@googlegroups.com
Hi Lucas -

If you are on Debezium 2.6, it would be the output from the io.debezium.connector.oracle.logminer.LogFileCollector class specifically.
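
If you're using a Log4j 1.x-style properties file, that would be something along the lines of the following (adjust for Logback or whatever backend your embedded engine uses):

log4j.logger.io.debezium.connector.oracle.logminer.LogFileCollector=DEBUG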

Thanks,
Chris

Niels Berglund

Jun 12, 2024, 7:47:56 AM
to debe...@googlegroups.com
Hmm, coming late to the party here. But we are using the SQL Server connector, and after a couple of false starts we are easily doing 1k tx/sec, and have load tested up to 10k/sec with no lag. Our payload size is ~4k. What made a difference for us was setting "producer.override.batch.size": 500000.
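
For reference, that override lives in the connector config, and if I recall correctly the producer.override.* settings are only honored when the Connect worker permits them. In the connector config:

  "producer.override.batch.size": "500000"

And on the Connect worker (required for producer.override.* to take effect):

  connector.client.config.override.policy=All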

Thanks!

Niels
