HDFS connector write table to Hive with schema


justi...@hackerrank.com

Jun 7, 2018, 9:52:09 PM
to Confluent Platform
Hi, I'm trying to replicate data from MySQL to Hive using binlogs. My desired flow is MySQL -> Debezium -> Kafka -> Confluent HDFS Connector -> Hive. I am also using Schema Registry.

Everything seems to work, but when I check the data in the Hive table, I get this instead of my table:
hive> select * from test;

+------+--------------------------+--------------------+---+-------------+---------+
|before|                     after|              source| op|        ts_ms|partition|
+------+--------------------------+--------------------+---+-------------+---------+
|  null|[some,things,from,table...|[0.7.3,cool_thing...|  c|1528421143809|        0|
+------+--------------------------+--------------------+---+-------------+---------+

Am I mistaken to think that the HDFS connector will write a proper table into Hive for me? Or is it because the messages Debezium writes to Kafka are in this format? Is this something I cannot do by ingesting binlogs from MySQL?
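For context, the columns in the Hive output above match Debezium's change-event envelope: each Kafka message value wraps the row data inside "before"/"after" fields plus metadata. A value message looks roughly like this (field names taken from the output above; the values and row columns here are purely illustrative, and the exact envelope depends on the Debezium version):

```json
{
  "before": null,
  "after": { "id": 1, "name": "some_value" },
  "source": { "version": "0.7.3", "name": "cool_thing" },
  "op": "c",
  "ts_ms": 1528421143809
}
```

Since the sink connector maps the top-level fields of the value schema to table columns, Hive ends up with before/after/source/op/ts_ms columns instead of the captured table's own columns.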

Thanks for your help...

Here's my config for reference:
{
  "name": "test",
  "config": {
    "name": "test",
    "logs.dir": "/logs",
    "topics": "topic",
    "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
    "format.class": "io.confluent.connect.hdfs.parquet.ParquetFormat",
    "tasks.max": "1",
    "flush.size": "3",
    "hadoop.conf.dir": "/etc/hadoop/conf",
    "hdfs.url": "hdfs_url",
    "hive.integration": true,
    "hive.metastore.uris": "hive_metastore",
    "hive.conf.dir": "/etc/hadoop/conf",
    "schema.compatibility": "BACKWARD",
    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "key.converter.schemas.enable": "true",
    "key.converter.schema.registry.url": "url",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schemas.enable": "true",
    "value.converter.schema.registry.url": "url",
    "auto.offset.reset": "earliest"
  }
}


justi...@hackerrank.com

Jun 7, 2018, 9:54:31 PM
to Confluent Platform
Actually, I believe this is a problem with the way Debezium writes its messages. Does anyone know of any alternatives?

Gunnar Morling

Jul 27, 2018, 4:43:19 AM
to Confluent Platform
Hi,

Indeed, that table structure resembles Debezium's complex CDC message envelope. You can use the provided "unwrapping" SMT, which propagates only the "after" state of the events. That way the table in the sink will have the columns of the captured table.
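As a sketch, the unwrap SMT can be added to the HDFS sink connector's config like this, assuming the Debezium SMT jar is on the Connect worker's classpath (the class name io.debezium.transforms.UnwrapFromEnvelope matches the Debezium 0.7.x line used in this thread; later releases renamed it to ExtractNewRecordState):

```json
{
  "transforms": "unwrap",
  "transforms.unwrap.type": "io.debezium.transforms.UnwrapFromEnvelope"
}
```

With this in place, the record value reaching the HDFS connector is just the "after" row state, so the Hive table gets the source table's columns.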

Hth,

--Gunnar