Is it possible to disable the initial data snapshot?


Moran Shemesh

Oct 11, 2016, 5:39:07 AM
to debezium
Hi,

First, I would like to emphasize how awesome your product is!

We are evaluating Debezium as a tool for transferring data changes from our MySQL binary logs into Kafka, and then into Hadoop.

We are very excited to use it, but unfortunately the amount of data in our MySQL instances is too large for Debezium to take the initial snapshot (terabytes of data per server).

Since we already have all the current data in Hadoop, we don't really need the initial data snapshot.

Is there some kind of configuration that will allow us to skip the initial snapshot and only use Debezium's change data capture feature?

Thanks,
Moran

Prannoy Mittal

Oct 11, 2016, 7:08:57 AM
to debezium
I also have the same requirement.

There is a setting, snapshot.mode=never, but according to the documentation it can lead to missing CREATE DDL statements for the table schemas (since old binlog files are usually rotated away), and hence to missing events, as far as I understand.

But I think @randall can explain it better.

Thanks,
Prannoy Mittal.

Randall Hauch

Oct 11, 2016, 10:42:47 AM
to Prannoy Mittal, debezium
Hi, Moran.

Thanks for the compliment!

Debezium's `snapshot.mode=never` means that it will connect to the database and start reading the MySQL binlog from the earliest point in time that's still in the binlog. And Prannoy is correct that this will be problematic if the binlog no longer contains the CREATE DDL statements for the tables you might be interested in. If you're lucky, the tables are newer than the oldest binlog position and this would work, but I'm guessing that's not the case.

However, there is a use case that I think makes perfect sense that Debezium doesn't yet handle, and that is to begin capturing database changes starting now while ignoring the previous history. Essentially, it would capture the current schema of the database and then start reading the binlog at the latest position (i.e., the most recent change). This would actually be pretty easy to do, since it really is just a subset of the existing snapshot feature. We might even enable it with `snapshot.mode=schema`.
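For concreteness, the snapshot mode is set in the connector configuration registered with Kafka Connect. A minimal sketch of a MySQL connector registration using the existing `never` mode; hostnames, credentials, IDs, and topic names are placeholders:

```json
{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql.example.com",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz-password",
    "database.server.id": "184054",
    "database.server.name": "dbserver1",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "dbhistory.dbserver1",
    "snapshot.mode": "never"
  }
}
```

Registering it is a POST of this JSON to the Kafka Connect REST API's /connectors endpoint.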

If this sounds useful, please log an enhancement in our JIRA (https://issues.jboss.org/projects/DBZ), and we’ll schedule it for an upcoming release.


Best regards,

Randall
--
You received this message because you are subscribed to the Google Groups "debezium" group.
To unsubscribe from this group and stop receiving emails from it, send an email to debezium+u...@googlegroups.com.
To post to this group, send email to debe...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/debezium/aa3d7cb7-ee12-472c-91c2-cb0645a3895c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Prannoy Mittal

Oct 11, 2016, 3:46:53 PM
to debezium


https://issues.jboss.org/browse/DBZ-133

Adrien Quentin

Aug 1, 2017, 3:22:03 AM
to debezium
Hi,

I'm trying to use snapshot.mode="never" with the Postgres connector. Is DBZ-133 specific to MySQL, or could it work with Postgres?

Thank you.

satyajit vegesna

Aug 3, 2017, 2:46:55 PM
to debezium
Hi Prannoy/Randall,

I would like to know whether the schema_only option reads the complete binlog to capture the schema information. At least in my experience, I can see the connector trying to read the complete binlog to capture the schemas.
Is there a way to skip reading the binlogs and get the current schema information using standard MySQL schema-related commands, such as DESCRIBE or SHOW CREATE TABLE?

Regards,
Satyajit.

Randall Hauch

Aug 3, 2017, 5:13:35 PM
to debezium
The snapshot modes get the current schema by using SHOW CREATE TABLE and other metadata queries, and they do NOT attempt to rebuild the schema from the binlog. In fact, none of the snapshot modes do anything with the binlog; only *after* the snapshot completes does the connector start reading the binlog.
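The metadata queries in question are ordinary MySQL statements; for a hypothetical inventory.customers table they look like:

```sql
-- Capture the table's full DDL definition
SHOW CREATE TABLE inventory.customers;
-- Enumerate the databases and tables to include
SHOW DATABASES;
SHOW TABLES FROM inventory;
```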

The "schema_only" snapshot mode differs from the other modes in that the snapshot captures the current schema but skips reading the rows in the tables. It is useful when you want to start capturing changes from that point forward and don't care about the rows that were already in the database prior to starting the connector.

Hope this answers your questions.

Randall

satyajit vegesna

Aug 3, 2017, 5:59:40 PM
to debezium
Thank you very much for the quick reply, Randall.

I would highly appreciate it if you could also answer the questions below.

1. When I try to enable multiple connectors. For example, I have two databases in my MySQL instance:
     a. inventory
     b. customers
   First I create a connector for inventory, and I immediately add another connector for the customers DB. When I do this, I get the errors below in the status, and I am not sure whether the connector is still running or down. Error log from offset-status:
 {"state":"FAILED","trace":"org.apache.kafka.connect.errors.ConnectException: io.debezium.text.ParsingException: Expecting token type 128 at line 1, column 1 but found 'UPDATE':  ===>> UPDATE `logged_cronj\n\tat io.debezium.connector.mysql.MySqlConnectorTask.start(MySqlConnectorTask.java:192)\n\tat org.apache.kafka.connect.runtime.WorkerSourceTask.execute(WorkerSourceTask.java:141)\n\tat org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:139)\n\tat org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:182)\n\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\nCaused by: io.debezium.text.ParsingException: Expecting token type 128 at line 1, column 1 but found 'UPDATE':  ===>> UPDATE `logged_cronj\n\tat io.debezium.text.TokenStream.consume(TokenStream.java:737)\n\tat io.debezium.relational.ddl.DdlParser.consumeStatement(DdlParser.java:568)\n\tat io.debezium.relational.ddl.DdlParser.parseUnknownStatement(DdlParser.java:376)\n\tat io.debezium.connector.mysql.MySqlDdlParser.parseNextStatement(MySqlDdlParser.java:156)\n\tat io.debezium.relational.ddl.DdlParser.parse(DdlParser.java:286)\n\tat io.debezium.relational.ddl.DdlParser.parse(DdlParser.java:267)\n\tat io.debezium.relational.history.AbstractDatabaseHistory.lambda$recover$0(AbstractDatabaseHistory.java:57)\n\tat io.debezium.relational.history.KafkaDatabaseHistory.recoverRecords(KafkaDatabaseHistory.java:202)\n\tat io.debezium.relational.history.AbstractDatabaseHistory.recover(AbstractDatabaseHistory.java:52)\n\tat io.debezium.connector.mysql.MySqlSchema.loadHistory(MySqlSchema.java:312)\n\tat io.debezium.connector.mysql.MySqlTaskContext.loadHistory(MySqlTaskContext.java:116)\n\tat 
io.debezium.connector.mysql.MySqlConnectorTask.start(MySqlConnectorTask.java:80)\n\t... 8 more\n","worker_id":"10.33.8.17:8083","generation":13}

    b. (Second question) There was another time when I created a single connector for all databases in MySQL, after which only the history topic and the database server name topic were created; I did not see a single topic created for the tables.
        I would like to know if you or someone else has come across a similar situation.

    c. (Third question) For some databases with individual connectors, the table topics show up immediately, while others take forever, and I am not even sure whether the connectors are working in such situations when I do not see the corresponding topics. Is there something I should check here?

    d. (Fourth question) Should I create separate connect-offsets, connect-status, and connect-configs topics for each connector that I create?

Regards,
Satyajit. 



Randall Hauch

Aug 3, 2017, 6:34:49 PM
to debezium
On Thu, Aug 3, 2017 at 4:59 PM, satyajit vegesna <varma.s...@gmail.com> wrote:
Thank you very much for the quick reply, Randall.

I would highly appreciate it if you could also answer the questions below.

1. When I try to enable multiple connectors. For example, I have two databases in my MySQL instance:
     a. inventory
     b. customers
   First I create a connector for inventory, and I immediately add another connector for the customers DB. When I do this, I get the errors below in the status, and I am not sure whether the connector is still running or down. Error log from offset-status:
  
 {"state":"FAILED","trace":"org.apache.kafka.connect.errors.ConnectException: io.debezium.text.ParsingException: Expecting token type 128 at line 1, column 1 but found 'UPDATE':  ===>> UPDATE `logged_cronj ... (full stack trace quoted above)","worker_id":"10.33.8.17:8083","generation":13}

This error should stop the connector, but it shouldn't have anything to do with running two connectors. The connectors are independent, and it's a common pattern to run multiple connectors against the same database instance.

The error is often caused by one of two things: improperly configuring the MySQL binlog row format (see http://debezium.io/docs/connectors/mysql/#setting-up-mysql for specifics) such that UPDATE and other DML statements appear in the binlog, or a bug in the connector's DDL parser that can't handle a DDL statement. The latter has been much rarer lately, so make sure that you're using the latest version.
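For reference, the MySQL server settings that the Debezium setup documentation calls for look roughly like the following in my.cnf (the values shown here are illustrative):

```ini
[mysqld]
server-id        = 184054      # any unique, non-zero server ID
log_bin          = mysql-bin   # enable the binary log
binlog_format    = ROW         # required: STATEMENT/MIXED put raw DML text into the binlog
binlog_row_image = FULL        # record complete before/after row images
expire_logs_days = 10          # keep enough binlog history for the connector to resume
```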
 

    b. (Second question) There was another time when I created a single connector for all databases in MySQL, after which only the history topic and the database server name topic were created; I did not see a single topic created for the tables.
        I would like to know if you or someone else has come across a similar situation.

Yes, this can happen for a couple of reasons. Most of the time it is an invalid connector configuration: either there actually aren't any tables in the databases that are included, or the table and database whitelists are incorrect and do not match the actual tables. Turn on debug logging (http://debezium.io/docs/logging) for more detail about what the connector is actually doing; it will log messages when it skips tables for any reason.
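When double-checking those filters, the relevant connector properties (in the Debezium versions of this era) are the whitelist settings; the database and table names below are hypothetical:

```json
{
  "database.whitelist": "inventory",
  "table.whitelist": "inventory.customers,inventory.orders"
}
```

Note that table.whitelist entries must be fully qualified as database.table, and an entry that doesn't match any real table silently filters everything out.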

Another possibility, especially when this happens after restarting, is that the topic where the database history is recorded is not actually configured with infinite retention. When Kafka cleans up old history records, the connector on restart can't find a valid history of DDL statements, and will thus exclude all tables for which it doesn't know the schema.
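One way to guard against that, assuming the kafka-topics.sh CLI of this era and a ZooKeeper at localhost:2181, is to create the history topic up front with a single partition and unlimited retention:

```shell
bin/kafka-topics.sh --create --zookeeper localhost:2181 \
  --topic dbhistory.dbserver1 \
  --partitions 1 --replication-factor 1 \
  --config retention.ms=-1 --config retention.bytes=-1
```

The topic name is a placeholder; it must match the connector's database.history.kafka.topic setting.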

A fourth possibility, especially when there isn't much data in the tables, is that the Kafka Connect worker's producers are configured with more buffering than there is data. So, the worker's producer will essentially buffer all of the data it finds, and only after that buffer fills up will the producer send the batch of messages to Kafka. See the "batch.size" producer property that defaults to 16K (https://kafka.apache.org/documentation/#producerconfigs), which is configured on the worker with "producer.batch.size".
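In a Kafka Connect worker configuration, that producer override would look like the following (batch.size is shown at its default; the linger.ms value is illustrative):

```properties
# Settings with the "producer." prefix are passed through to the worker's producers
producer.batch.size=16384
# Optionally force batches out sooner, even when they aren't full
producer.linger.ms=100
```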
 

    c. (Third question) For some databases with individual connectors, the table topics show up immediately, while others take forever, and I am not even sure whether the connectors are working in such situations when I do not see the corresponding topics. Is there something I should check here?

See the fourth possibility mentioned in the answer to question b.
 
 
    d. (Fourth question) Should I create separate connect-offsets, connect-status, and connect-configs topics for each connector that I create?

No. Each separate cluster of Kafka Connect workers should have its own "internal" topics for offsets, status, and configs, but it's important that every worker within a cluster shares the same internal topics.
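In distributed mode, those internal topics are named in the worker configuration, and every worker in the same cluster must use identical values (the names below are conventional examples):

```properties
group.id=connect-cluster
offset.storage.topic=connect-offsets
config.storage.topic=connect-configs
status.storage.topic=connect-status
```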

Best regards,

Randall
 

satyajit vegesna

Aug 4, 2017, 6:06:20 PM
to debezium
Thank you very much Randall. 