Performance LogMiner vs XStream

Carlos Fernando Villacorta Padilla

Jan 28, 2021, 12:15:34 PM
to debezium
Hi guys,

Is there any real difference in performance between LogMiner and XStream?

I recently tested 5 million change events using LogMiner and it took up to 16 hours to replicate the changes into a Snowflake database.

Is XStream the go-to option when we talk about near real-time changes?

Or is it that the real bottleneck is usually the destination database processing the queries?

Thank you

Chris Cranford

Jan 29, 2021, 9:54:27 AM
to debe...@googlegroups.com, Carlos Fernando Villacorta Padilla
Hi Carlos -

See my comments inline

On 1/28/21 12:15 PM, Carlos Fernando Villacorta Padilla wrote:
> Hi guys,
>
> Is there any real difference in performance between LogMiner and XStream?
We haven't done any official performance benchmarks, so there could be advantages with both depending on one's environment.


> I recently tested 5 million change events using LogMiner and it took up to 16 hours to replicate the changes into a Snowflake database.
To be clear, was this test doing a snapshot or actually streaming 5 million new change events?


> Is XStream the go-to option when we talk about near real-time changes?
Both adapters essentially do the same thing: both mine the redo/archive logs in order to emit change events. If one is faster at emitting changes than the other, any details, input, and logs that help diagnose whether it's something that can be adjusted are certainly most welcome. But in general, both should be more than satisfactory at performing real-time change emission tasks.


> Or is it that the real bottleneck is usually the destination database processing the queries?
Performing the LogMiner queries can take a few seconds on each poll because the mining session has to be started, the logs loaded from disk into memory by the database, and then the client query can execute. You might be able to improve performance by using the continuous mining option with LogMiner; however, be advised that this option was deprecated and removed in Oracle 19.
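Roughly, each poll does something like the following (an illustrative JDBC sketch, not the connector's actual code; the connection details, log file path, and SCN range are placeholders):

    import java.sql.CallableStatement;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class LogMinerPollSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1", "c##dbzuser", "dbz")) {

                // 1. Register the log to mine and start the session for an SCN window.
                //    (Older databases could also add DBMS_LOGMNR.CONTINUOUS_MINE to the
                //    options; it is deprecated and gone in Oracle 19.)
                try (CallableStatement start = conn.prepareCall(
                        "BEGIN "
                      + "DBMS_LOGMNR.ADD_LOGFILE(LOGFILENAME => ?, OPTIONS => DBMS_LOGMNR.NEW); "
                      + "DBMS_LOGMNR.START_LOGMNR(STARTSCN => ?, ENDSCN => ?, "
                      + "  OPTIONS => DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG); "
                      + "END;")) {
                    start.setString(1, "/u01/oradata/ORCL/redo01.log");  // placeholder path
                    start.setLong(2, 16771728401175L);                   // placeholder start SCN
                    start.setLong(3, 16771728421175L);                   // placeholder end SCN
                    start.execute();
                }

                // 2. Only after the session is up and the logs have been loaded can the
                //    client query against V$LOGMNR_CONTENTS actually run.
                try (PreparedStatement q = conn.prepareStatement(
                        "SELECT SCN, OPERATION, SEG_OWNER, TABLE_NAME, SQL_REDO "
                      + "FROM V$LOGMNR_CONTENTS "
                      + "WHERE OPERATION IN ('INSERT','UPDATE','DELETE','COMMIT','ROLLBACK')");
                     ResultSet rs = q.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getLong("SCN") + " " + rs.getString("SQL_REDO"));
                    }
                }

                // 3. End the session; the next poll repeats the whole cycle.
                try (CallableStatement end = conn.prepareCall("BEGIN DBMS_LOGMNR.END_LOGMNR; END;")) {
                    end.execute();
                }
            }
        }
    }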

Keep in mind, this connector is still in an incubating state and is being actively developed, and any input/contributions are most welcome.
If you identify anything that points to a performance issue, or if you have a suggestion on where you see performance bottlenecks, please feel free to open a JIRA [1] with all the details.

Until then,
Chris

[1]: https://issues.redhat.com/projects/DBZ

Milo van der Zee

Jan 29, 2021, 10:07:18 AM
to debe...@googlegroups.com, Carlos Fernando Villacorta Padilla
Hello Carlos,

I am using LogMiner in an environment with over 20k changes/second and the LogMiner part is perfectly able to handle that volume of changes. So 16 hours for 5 million events seems extremely long.

Debezium, though, is not optimized for those numbers. Parsing the SQL statements is very expensive with the library it uses. I shared my implementation of the parsing with the Debezium project and am in the process of open-sourcing our Kubernetes-based setup.

When you run the poll often enough, the number of log files you have to open becomes very low, probably only the current redo log. In my case the setup to start mining takes on the order of milliseconds.

GoldenGate is nice but very expensive if you use it in a large environment. Hiring a couple of engineers to maintain a solution like Debezium is often cheaper, and the near-real-time behavior is not that different. Even with polling, my delay stays well within a couple of seconds. This should be possible with Debezium as well.

MAG,
Milo


On Fri, Jan 29, 2021 at 15:54, Chris Cranford <cran...@gmail.com> wrote:

Gunnar Morling

Jan 29, 2021, 10:13:19 AM
to debe...@googlegroups.com, Carlos Fernando Villacorta Padilla
Hi,

On Fri, Jan 29, 2021 at 16:07, Milo van der Zee <mi...@vanderzee.org> wrote:
> Hello Carlos,
>
> I am using LogMiner in an environment with over 20k changes/second and the LogMiner part is perfectly able to handle that volume of changes. So 16 hours for 5 million events seems extremely long.

100% agreed; something seems not quite right.

> Debezium, though, is not optimized for those numbers. Parsing the SQL statements is very expensive with the library it uses. I shared my implementation of the parsing with the Debezium project and am in the process of open-sourcing our Kubernetes-based setup.

When you say shared, where was that (if for instance on the chat, I easily might have missed it)? Can you elaborate a bit on how your approach works? Is it using a custom parser implementation (because I don't think there's a way around parsing DML statements)? If you'd like to contribute this work to Debezium, that'd be more than welcome.

> When you run the poll often enough, the number of log files you have to open becomes very low, probably only the current redo log. In my case the setup to start mining takes on the order of milliseconds.

Yes, that's how it should work ideally. Only the current online log should be required to be mined.

> GoldenGate is nice but very expensive if you use it in a large environment. Hiring a couple of engineers to maintain a solution like Debezium is often cheaper, and the near-real-time behavior is not that different. Even with polling, my delay stays well within a couple of seconds. This should be possible with Debezium as well.
 
Yes, that's my expectation too, and we'd love to collaborate with folks like you for making that happen!

> MAG,
> Milo

Best,

--Gunnar
 

Milo van der Zee

Jan 29, 2021, 10:43:22 AM
to debe...@googlegroups.com
Hello Gunnar,

I contributed it during PRs for the Oracle part and tried to discuss it in the forum. But it would be much easier to have a separate Oracle channel, or maybe just a discussion with a couple of us. I can share everything, at least the knowledge. My day job for the last year has mainly been to get a CDC pipeline based on LogMiner working.
There are a lot of pitfalls in getting this to work in a distributed environment like Kubernetes. The problem with Debezium, to me, seems to be that it does not scale very well, or you end up installing many configurations, which makes maintaining it harder.

Parsing based on the library used in Debezium is slow and creates a lot of GC pressure. The nice thing is that the statements coming from LogMiner always have the same shape, so a generic parser library is overkill. I first tried regular expressions, but that is also based on string manipulation and still creates way too much garbage.
I now read the char array of the SQL statement and go through it once. It is very unreadable code that requires a lot of unit tests to keep it reliable during changes, but it is fast, and in this case that is the primary goal.
For example:
UPDATE 'schema'.'table' SET 'name' = 'whoknows' WHERE 'name' = 'whoknew'
- U
- P
- D
- ...
- E => set flag updateFound
-
- ' => set flag insideQuotes
- etc...

This avoids copying the string and so saves a lot of GC time.
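Very much simplified, the scan looks something like this (a toy sketch, not our production code; it only pulls out the quoted literals and flags the UPDATE keyword, and skips real-world cases such as escaped quotes):

    public final class SinglePassUpdateScan {

        public static void main(String[] args) {
            String redoSql =
                "UPDATE \"SCHEMA\".\"TABLE\" SET \"NAME\" = 'whoknows' WHERE \"NAME\" = 'whoknew'";
            scan(redoSql.toCharArray());
        }

        // Walks the char array exactly once, tracking a quote flag instead of using
        // substring()/split(), so almost no intermediate String objects are created.
        static void scan(char[] sql) {
            boolean insideQuotes = false;
            int literalStart = -1;
            for (int i = 0; i < sql.length; i++) {
                char c = sql[i];
                if (c == '\'') {
                    if (!insideQuotes) {
                        insideQuotes = true;
                        literalStart = i + 1;
                    } else {
                        // the literal is sql[literalStart .. i); copied here only to print it
                        System.out.println("literal: " + new String(sql, literalStart, i - literalStart));
                        insideQuotes = false;
                    }
                } else if (!insideQuotes && c == 'E' && i >= 5
                        && sql[i - 5] == 'U' && sql[i - 4] == 'P' && sql[i - 3] == 'D'
                        && sql[i - 2] == 'A' && sql[i - 1] == 'T') {
                    // reached the 'E' of UPDATE: set a flag instead of comparing substrings
                    System.out.println("updateFound");
                }
            }
        }
    }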

And obviously there is the matter of transactions in a distributed environment. I'm storing open transactions in Hazelcast so all nodes can access them.
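In its most stripped-down form the shared state is just a distributed map keyed by transaction id, along the lines of this sketch (the map name and String values are made up; a real setup stores proper event objects and configures the cluster explicitly):

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    public class SharedOpenTransactionsSketch {
        public static void main(String[] args) {
            // Every processing node joins the same Hazelcast cluster and sees the same map.
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();

            // transaction id (XID) -> redo SQL buffered so far for that transaction
            Map<String, List<String>> openTransactions = hz.getMap("open-transactions");

            // A node that reads a DML row for some transaction buffers it; Hazelcast hands
            // out copies, so the updated list has to be put back explicitly.
            String xid = "0A0020001C1D0000";                        // placeholder XID
            List<String> buffer = openTransactions.getOrDefault(xid, new ArrayList<>());
            buffer.add("UPDATE ... SET ... WHERE ...");             // placeholder redo SQL
            openTransactions.put(xid, buffer);

            // Whichever node sees the COMMIT emits the buffered rows and clears them;
            // on a ROLLBACK it would simply remove them without emitting anything.
            List<String> committed = openTransactions.remove(xid);
            System.out.println("events to emit: " + (committed == null ? 0 : committed.size()));

            hz.shutdown();
        }
    }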

MAG,
Milo

On Fri, Jan 29, 2021 at 16:13, Gunnar Morling <gun...@hibernate.org> wrote:

Gunnar Morling

Jan 29, 2021, 12:41:18 PM
to debe...@googlegroups.com
Ok, this sounds very interesting :)

Re communication, agreed it's not always easy; often there are multiple conversations in parallel, the core team is spread rather thin across the multiple connectors, etc. I try to stay on top of all the conversations here on the ML; I think it's the best avenue, considering folks are distributed around the world and work in different TZs. That said, would you be up for a call some time next week? If so, I could send you an invite for Tuesday afternoon, for instance, and we could talk about those parser improvements and other things. WDYT?

--Gunnar


Carlos Fernando Villacorta Padilla

Jan 29, 2021, 4:19:30 PM
to debezium
Thank you guys for your quick answers.

To answer your questions: my first test was inserting 5 million rows, then running one UPDATE query without a WHERE condition, which caused 5 million change events.

The destination is a SaaS data warehouse called Snowflake. Think of it as a remote database, with 100 ms latency in the worst cases.

Also, I'm new here and eager to contribute to the project.
I don't know many parts of the code yet, but I'm already reading through this connector and starting to make some local changes and tests.
If it is not an issue for you guys, maybe I can join the meeting too, to learn more about this topic and contribute ASAP.

Carlos Villacorta

Carlos Fernando Villacorta Padilla

Feb 4, 2021, 1:11:22 PM
to debezium
Hi everyone,

It was nice to meet you guys.
I was wondering if you could explain a little further the issues related to transactions during a LogMiner session,
and how this causes a problem when you start multiple log mining sessions.

Why was there a need to store open transactions in an external system like Hazelcast?

Carlos Villacorta

Chris Cranford

Feb 4, 2021, 3:44:49 PM
to debe...@googlegroups.com, Carlos Fernando Villacorta Padilla
Hi Carlos -

First, let's set aside the multiple log mining sessions for a moment and just look at the redo logs at their very core.

First, redo logs contain interleaved records that belong to different transactions, and these entries are written even if their associated transaction hasn't been committed or rolled back yet. So as a consumer of V$LOGMNR_CONTENTS iterates the result set, it needs to maintain the relationship between each redo log row it has read and the transaction that row is part of. The reason you need to do this is that we have no idea whether the transaction will end up being committed or rolled back until we read a COMMIT or ROLLBACK entry linked to that transaction. This means the consumer needs some type of buffer to manage the relationship between the redo log rows it has "parsed" and their transactions, so that when a COMMIT is detected it knows it is safe to emit messages for all the redo log rows associated with that transaction, and when a ROLLBACK is detected it is safe to discard all those rows entirely, as nothing needs to be emitted.
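In its simplest form that buffer is just a map keyed by transaction id; a stripped-down sketch purely for illustration (the String event type and the emit step are placeholders, not Debezium classes):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Buffers rows read from V$LOGMNR_CONTENTS per transaction (XID) until the
    // transaction's fate is known: emit everything on COMMIT, drop it on ROLLBACK.
    public class TransactionBufferSketch {

        private final Map<String, List<String>> bufferByXid = new HashMap<>();

        public void onRow(String xid, String operation, String redoSql) {
            switch (operation) {
                case "INSERT":
                case "UPDATE":
                case "DELETE":
                    bufferByXid.computeIfAbsent(xid, k -> new ArrayList<>()).add(redoSql);
                    break;
                case "COMMIT":
                    List<String> rows = bufferByXid.remove(xid);
                    if (rows != null) {
                        rows.forEach(this::emit);   // now it is safe to publish change events
                    }
                    break;
                case "ROLLBACK":
                    bufferByXid.remove(xid);        // nothing should ever be emitted
                    break;
                default:
                    // other operations (DDL, internal) are ignored in this sketch
            }
        }

        private void emit(String redoSql) {
            System.out.println("emit change event for: " + redoSql);
        }

        public static void main(String[] args) {
            TransactionBufferSketch buffer = new TransactionBufferSketch();
            buffer.onRow("XID-1", "INSERT", "INSERT INTO ...");
            buffer.onRow("XID-1", "UPDATE", "UPDATE ...");
            buffer.onRow("XID-1", "COMMIT", null);   // emits both buffered rows
        }
    }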

With multiple LogMiner sessions, a certain amount of care must be taken in how each session handles its events. In particular, if each session processes a subset of tables but those tables participate in the same transaction, you need to be absolutely sure that all sessions have processed all pertinent redo log events and that the transaction was committed before emitting, in correct chronological order, any events for the changes you've read across all sessions. External systems like Infinispan make managing this concurrent state easy and have the benefit that the state survives restarts of the log mining consumer processes.

The big questions for me are:

- What's the gain of using multiple sessions over a single session that processes events sequentially?
- What are the bookkeeping requirements for a multi-session setup versus a single session?
- What's the effective load on the database using multiple sessions versus a single session?

Let me know if you have any other questions.
Chris

Milo van der Zee

Feb 4, 2021, 4:06:43 PM
to debe...@googlegroups.com
Hello Chris,

Nice explanation about the need for transaction storage.

> What's the gain of using multiple sessions over a single session that processes events sequentially?
=> We have hundreds of tables, some with a throughput of many thousands of events per second. I'm not able to handle all those events from a single instance. By partitioning over multiple processors I can distribute the load of handling the events.

> What are the bookkeeping requirements for a multi-session setup versus a single session?
=> Annoyingly much :) I have to keep track of transactions across multiple tables, and I can only handle a commit once all tables have reached that SCN as well.
I also have to distribute the transactions between the processors with some kind of shared-memory solution. I'm using Hazelcast, but Red Hat also has a solution for shared state.

> What's the effective load on the database using multiple sessions versus a single session?
=> The load increases, but we did not notice a serious impact on the database. Our DBAs were not impressed and did not care about our sessions. They did care about the need to enable the ALL and PK supplemental logging flags for the log mining; in our case this increased the size of the log files by about 10%. I did some extensive testing with a database on my local machine and the results helped estimate the impact on our production database.

MAG,
Milo


On Thu, Feb 4, 2021 at 21:44, Chris Cranford <cran...@gmail.com> wrote:

Chris Cranford

Feb 5, 2021, 12:23:08 PM
to debe...@googlegroups.com, Milo van der Zee
Milo -

I've been looking at the connector's performance the last few days and I have a few questions in particular about your setup as it relates to what we're doing. 

You mentioned on the call that DDL changes aren't a concern for your use case, so are you using DICT_FROM_ONLINE_CATALOG rather than DICT_FROM_REDO_LOGS? As I understand it, the former option makes LogMiner generate redo SQL statements based on the current structure of the tables, whereas the latter uses the structure of the table at the time the event occurred. I've found that the time it takes to start a LogMiner session with the latter is significantly higher than with the former. Are you aware of any setting or configuration that could help reduce the dictionary build times when using DICT_FROM_REDO_LOGS? The only real saving grace here seems to be that the dictionary doesn't have to be rebuilt except when we end the session and start a new one after a log switch, so generally we only take this hit once in my benchmarks, but in real-world production environments I would expect it to occur more often.
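For reference, the difference comes down to how the mining session is started; a rough sketch of the two modes (placeholder SCNs, not the connector's actual code):

    public class DictionaryModeSketch {
        // Online catalog: redo SQL is generated against the CURRENT table structure;
        // nothing has to be built or extracted, so the session starts quickly.
        static final String START_WITH_ONLINE_CATALOG =
            "BEGIN DBMS_LOGMNR.START_LOGMNR("
          + " STARTSCN => 16771728401175, ENDSCN => 16771728421175,"   // placeholder SCNs
          + " OPTIONS => DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG); END;";

        // Redo-log dictionary: a dictionary snapshot has to be written into the redo
        // stream first (and kept current for DDL via DDL_DICT_TRACKING); LogMiner must
        // locate and load it at session start, which is where the extra time goes.
        static final String BUILD_DICTIONARY =
            "BEGIN DBMS_LOGMNR_D.BUILD(OPTIONS => DBMS_LOGMNR_D.STORE_IN_REDO_LOGS); END;";

        static final String START_WITH_REDO_DICTIONARY =
            "BEGIN DBMS_LOGMNR.START_LOGMNR("
          + " STARTSCN => 16771728401175, ENDSCN => 16771728421175,"   // placeholder SCNs
          + " OPTIONS => DBMS_LOGMNR.DICT_FROM_REDO_LOGS + DBMS_LOGMNR.DDL_DICT_TRACKING); END;";

        public static void main(String[] args) {
            System.out.println(START_WITH_ONLINE_CATALOG);
            System.out.println(BUILD_DICTIONARY);
            System.out.println(START_WITH_REDO_DICTIONARY);
        }
    }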

Secondly, most articles I've read, including the Oracle docs, state that queries against V$LOGMNR_CONTENTS are slow because the rows are generated dynamically by sequentially reading the redo logs. I decided to see what the impact would be if we used a real staging table with logging disabled; there was still the performance hit waiting for the insert-select to complete, but the per-row advancement when the connector iterated the results was magnitudes faster. Does all this point to the disk speed of where the archive/redo logs are stored? Do you know if this can be improved somehow?

Thanks
Chris

Carlos Fernando Villacorta Padilla

Feb 5, 2021, 4:33:05 PM
to debe...@googlegroups.com, Milo van der Zee
Offtopic 1: I noticed the solution we are currently working with actually captures the ROWID value and renames the column to _[CUSTOM]_ID varchar. This solution also creates two more columns, named _[CUSTOM]_DELETED bool and _[CUSTOM]_SYNCED timestamp.
I know the ROWID will not be the same on a different Oracle instance, which is why they rename it and use it as the PK. It would also be very useful when there is no primary key defined on a table.

Offtopic 2: Sometimes I'm getting this error and I have to restart the connector.

OriginalSql = BEGIN sys.dbms_logmnr.start_logmnr(startScn => '16771728401175', endScn => '16771728421175', OPTIONS => DBMS_LOGMNR.DICT_FROM_REDO_LOGS + DBMS_LOGMNR.DDL_DICT_TRACKING  + DBMS_LOGMNR.NO_ROWID_IN_STMT);END;
, Error Msg = ORA-01333: failed to establish Logminer Dictionary

Which seemed a bit odd. Does it require me to build a dictionary before the mining session starts?

And today when I booted the containers again, I noticed the error ORA-01291: missing logfile.
Maybe that is because the last SCN is not found in the logs.

Any idea why this could happen?

Best,
Carlos


Milo van der Zee

Feb 5, 2021, 6:23:57 PM
to Carlos Fernando Villacorta Padilla, debe...@googlegroups.com
Hello Carlos,

1) ROWID as a PK... Hmmm, might be useful. I like to do partial 'full load' actions and using the ROWID might help there. Our primary keys are not consecutive. Is the ROWID consecutive over inserts?

2) Yes, this is an option. You either get the dictionary from the logs or from the database. Because the number of schema changes is very limited, I chose to take it from the database.

3) Missing logfile means you requested an SCN from before the oldest available log file. This often happens when the backup procedure removes the archive logs as soon as the backup is done.
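A quick way to check whether a requested SCN is still covered by the logs that are left is to look at V$ARCHIVED_LOG (and V$LOG for the online logs). A rough sketch, reusing the SCN from the earlier error and placeholder connection details:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class ScnCoverageCheckSketch {
        public static void main(String[] args) throws Exception {
            long requestedScn = 16771728401175L;   // SCN the connector wants to resume from
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1", "c##dbzuser", "dbz");
                 // Archived logs still on disk that contain the requested SCN; if nothing
                 // comes back (and the SCN is older than the online redo logs in V$LOG),
                 // the backup job has likely already deleted the file, hence ORA-01291.
                 PreparedStatement ps = conn.prepareStatement(
                     "SELECT NAME, FIRST_CHANGE#, NEXT_CHANGE# "
                   + "FROM V$ARCHIVED_LOG "
                   + "WHERE DELETED = 'NO' AND ? >= FIRST_CHANGE# AND ? < NEXT_CHANGE#")) {
                ps.setLong(1, requestedScn);
                ps.setLong(2, requestedScn);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("NAME")
                            + " covers " + rs.getLong("FIRST_CHANGE#")
                            + " .. " + rs.getLong("NEXT_CHANGE#"));
                    }
                }
            }
        }
    }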

MAG,
Milo

On Fri, Feb 5, 2021 at 22:33, Carlos Fernando Villacorta Padilla <car...@gmail.com> wrote:

Milo van der Zee

Feb 5, 2021, 6:37:00 PM
to Chris Cranford, debe...@googlegroups.com
Hello Chris,

I do read the dictionary from the database. I had a lot of issues when using the log files, partly because our log files are removed after the backup, so the set of available logs is sometimes very limited. I did not put much effort into this. If I read your comment correctly, I assume it would be problematic in our situation: we have very frequent log switches, even though the logs are 1 GB each.

It makes sense that the disk speed influences the speed of reading the LogMiner results. In our situation we did not see any impact on the database and reading is fast enough.
What do you mean by disabling logging and still parsing it with the connector? Then you won't get much data... I enabled the ALL and PK supplemental logging flags and obviously that creates much more data and is therefore slower.
But I might be misunderstanding your comment?

I wonder if supporting DDL changes is worth the effort. A change procedure like:
- stop changes
- finish reading archives
- do DDL change
- start changes
- start mining
Something like this would work even without supporting DDL changes while reading the dictionary from the online catalog.

MAG,
Milo

On Fri, Feb 5, 2021 at 18:23, Chris Cranford <cran...@gmail.com> wrote:

Chris Cranford

Feb 8, 2021, 8:48:08 AM
to Milo van der Zee, debe...@googlegroups.com
Hi, see inline.


On 2/5/21 6:36 PM, Milo van der Zee wrote:
> Hello Chris,
>
> I do read the dictionary from the database. I had a lot of issues when using the log files, partly because our log files are removed after the backup, so the set of available logs is sometimes very limited. I did not put much effort into this. If I read your comment correctly, I assume it would be problematic in our situation: we have very frequent log switches, even though the logs are 1 GB each.
It's not that it would be problematic, but rather that the streaming of events would have this small 25-30s blip as the current mining session is swapped out for a new one when using redo log dictionaries. When using the database for the dictionary, this took a matter of 1-2s. Worst case, I would expect the connector to require several of the most recent redo logs, or perhaps an archive log in conjunction with those; however, that shouldn't be an issue.


> It makes sense that the disk speed influences the speed of reading the LogMiner results. In our situation we did not see any impact on the database and reading is fast enough.
I expect that in the case of the database, a lot of that information is likely cached. I also agree; in my tests using the online catalog rather than the redo log dictionary, the speed difference was massive. We were able to start the streaming phase considerably faster, and any time a log switch occurred, the restart of the mining session was nearly instantaneous.

> What do you mean by disabling logging and still parsing it with the connector? Then you won't get much data... I enabled the ALL and PK supplemental logging flags and obviously that creates much more data and is therefore slower.
> But I might be misunderstanding your comment?
One area where I noticed a performance drop was fetching the results from V$LOGMNR_CONTENTS. This comes back to the fact that the rows in this view are populated in real time as the query against the view runs. Since they are populated in real time, the traditional ResultSet#next() call had disk IO overhead that, across a relatively large result set, led to 15-20s of total time spent. So as a test I took the query we run against V$LOGMNR_CONTENTS and used it to run an "INSERT INTO staging_table (SELECT ... FROM V$LOGMNR_CONTENTS ...)". The result-set fetch time dropped drastically, to sub-1s values for the same result set. Unfortunately, in the larger picture that didn't help, as we simply traded the disk IO during fetching of the results for the bulk insert, which consumed roughly the same time in the end.
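Concretely, the test was along these lines (a sketch reconstructed from the description above; the staging table name and columns are invented, and it assumes a mining session is already active on the connection):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class StagingTableSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1", "c##dbzuser", "dbz");
                 Statement stmt = conn.createStatement()) {

                // One-time setup: a NOLOGGING heap table shaped like the mining query.
                stmt.execute("CREATE TABLE logmnr_staging ("
                           + " scn NUMBER, operation VARCHAR2(31), seg_owner VARCHAR2(128),"
                           + " table_name VARCHAR2(128), sql_redo VARCHAR2(4000)) NOLOGGING");

                // Per mining window (with DBMS_LOGMNR.START_LOGMNR already called on this
                // connection): pay the sequential redo-read cost once, up front...
                stmt.execute("INSERT INTO logmnr_staging "
                           + "SELECT SCN, OPERATION, SEG_OWNER, TABLE_NAME, SQL_REDO "
                           + "FROM V$LOGMNR_CONTENTS "
                           + "WHERE OPERATION IN ('INSERT','UPDATE','DELETE','COMMIT','ROLLBACK')");

                // ...after which iterating logmnr_staging is fast, because ResultSet#next()
                // no longer has to wait on redo being read from disk row by row.
            }
        }
    }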


> I wonder if supporting DDL changes is worth the effort. A change procedure like:
> - stop changes
> - finish reading archives
> - do DDL change
> - start changes
> - start mining
> Something like this would work even without supporting DDL changes while reading the dictionary from the online catalog.
It's an idea, but I don't think every environment and situation can support the notion of "stop changes", particularly if the table(s) being monitored are part of some customer/user-facing application that cannot be taken down completely, though it could perhaps be handled via a backward-compatible blue/green deployment when schema changes are required. Whether such a limitation is acceptable is something the team would need to discuss.

Ignatenko, Andrey

Feb 8, 2021, 9:54:52 AM
to debe...@googlegroups.com, Milo van der Zee
Hi, Chris.
Also using the "select into temp table"  approach has another serious disadvantage:  ordering.
It is unclear which columns to sort in which order.
Andrey





Carlos Fernando Villacorta Padilla

Feb 8, 2021, 5:27:50 PM
to debezium
@milo

I noticed this happens after starting the connector with "initial" snapshot mode.

Apparently after the initial snapshot ends, the connector detects that the next SCN for the table is 16771728401175, but this SCN is not in any of the log files.

Probably for the reason you pointed out.

This is also an issue because the dictionary can't be built from the logs; I have to take it from the online catalog.

I'll try to make some changes to improve this.

Chris Cranford

Feb 9, 2021, 3:25:16 PM
to debe...@googlegroups.com, Carlos Fernando Villacorta Padilla
Carlos -

Is this with the Debezium Oracle connector? That SCN value looks like it could be the NEXT_CHANGE# of the redo log that is marked CURRENT.

Chris

Chris Cranford

Feb 10, 2021, 9:06:54 AM
to debe...@googlegroups.com, Carlos Fernando Villacorta Padilla
Carlos -

Can you send me the log entries where you observed this with the error?  I'd like to take a look.

Thanks,
Chris

Gunnar Morling

Feb 10, 2021, 9:59:51 AM
to debe...@googlegroups.com, Carlos Fernando Villacorta Padilla
It would be best to capture this in a new Jira issue, to make sure we can keep track of the related discussions, logs, etc.

--Gunnar