HDFS source into Kafka


Zhu Wayne

Oct 27, 2016, 2:21:09 PM
to gobblin-users
Is there any way to ingest HDFS data as a source into a Kafka topic?

Issac Buenrostro

Oct 27, 2016, 3:15:43 PM
to Zhu Wayne, gobblin-users
Hi Zhu,

What format are your files in?
Gobblin has a Kafka writer that can write the data into Kafka. It also has most of the functionality for reading from a file system, and it can fully read records from Avro files in HDFS. What it is unfortunately missing is a plain-text reader for HDFS, but this should be very easy to implement. At that point, you could combine a file system reader and the Kafka writer to achieve what you want. If you need a non-Avro file system reader, would you like to implement it and contribute it to the project? We can of course guide you on where to start.

Best,
Issac

--
You received this message because you are subscribed to the Google Groups "gobblin-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gobblin-users+unsubscribe@googlegroups.com.
To post to this group, send email to gobbli...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gobblin-users/1d872cba-c2bd-4aa2-a262-bc69b24da61b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Abhishek Tiwari

Oct 27, 2016, 7:01:02 PM
to Issac Buenrostro, Zhu Wayne, gobblin-users
Hi Zhu, 

If your use case is to read Avro files from HDFS and write them to Kafka, you can try the attached pull file; it should mostly work out of the box (just make sure to replace the server URLs).
For plain text and other file formats, it will take a bit of extra work, as Issac suggested.

Regards
Abhishek

hadoop-to-confluent-kafka.pull

Zhu Wayne

Oct 28, 2016, 4:32:46 PM
to Abhishek Tiwari, Issac Buenrostro, gobblin-users
Abhishek,
Thanks for the HDFS Avro-to-Kafka pull configuration. I got an exception pulling an Avro file into Kafka. Where can I find more information on the configuration? What is the difference between these two properties?

# Source to read from HDFS / local file system
source.filebased.fs.uri=

# Source will read the Avro data within this directory
#source.filebased.data.directory=<source directory on HDFS or Local FS>

Best,

Wayne
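Going by the comments in the pull file, `source.filebased.fs.uri` names the file system to connect to, while `source.filebased.data.directory` is the directory inside that file system to read from. The following is a minimal sketch of how the two values combine into one location. It uses only `java.net.URI`, not Gobblin's own code, and the host name, port, and paths are hypothetical placeholders.

```java
import java.net.URI;

public class SourceLocationSketch {
    /**
     * Joins a file system URI (the kind of value given to source.filebased.fs.uri)
     * with an absolute directory path (the kind of value given to
     * source.filebased.data.directory). Illustration only; this is not Gobblin code.
     */
    public static URI resolveDataDirectory(String fsUri, String dataDirectory) {
        return URI.create(fsUri).resolve(dataDirectory);
    }

    public static void main(String[] args) {
        // Hypothetical HDFS name node and directory, not taken from the thread.
        System.out.println(resolveDataDirectory("hdfs://namenode.example.com:8020", "/data/avro/input"));
        // The same directory property pointed at the local file system instead.
        System.out.println(resolveDataDirectory("file:///", "/tmp/avro/input"));
    }
}
```

Read this way, changing only the file system URI switches between HDFS and the local file system, while the directory setting selects which files are pulled.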


Zhu Wayne

Oct 28, 2016, 4:41:28 PM
to Abhishek Tiwari, Issac Buenrostro, gobblin-users
Abhishek,
I built Gobblin from the master branch. I ran the example wiki pull and it ran fine. However, I ran into an exception when pulling from HDFS. Could you take a look?
$ cat  nohup.out
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/cloudera/gobblin-dist/lib/avro-tools-1.8.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/cloudera/gobblin-dist/lib/slf4j-log4j12-1.7.21.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.createUnstarted()Lcom/google/common/base/Stopwatch;
    at com.google.common.util.concurrent.ServiceManager$ServiceListener.<init>(ServiceManager.java:593)
    at com.google.common.util.concurrent.ServiceManager.<init>(ServiceManager.java:177)
    at gobblin.runtime.app.ServiceBasedAppLauncher.start(ServiceBasedAppLauncher.java:125)
    at gobblin.scheduler.SchedulerDaemon.main(SchedulerDaemon.java:65)
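
A `NoSuchMethodError` on `Stopwatch.createUnstarted()` usually suggests that an older Guava jar is being picked up ahead of the one the code was compiled against, although the thread does not confirm the cause here. A minimal diagnostic sketch, assuming it is run with the same classpath as the failing job, reports where a class is loaded from and whether it has the expected static method. The class and method names `ClasspathProbe`, `describeOrigin`, and `hasStaticNoArgMethod` are hypothetical helpers, not part of Gobblin or Guava.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.security.CodeSource;

public class ClasspathProbe {
    /** Returns the location a class was loaded from, or a note if it cannot be determined. */
    public static String describeOrigin(String className) {
        try {
            Class<?> cls = Class.forName(className);
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            // Classes that ship with the JDK usually have no code source.
            return source == null
                ? "(no code source, likely a JDK class)"
                : String.valueOf(source.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not on classpath)";
        }
    }

    /** Returns true if the class loads and exposes a public static no-argument method with this name. */
    public static boolean hasStaticNoArgMethod(String className, String methodName) {
        try {
            Method method = Class.forName(className).getMethod(methodName);
            return Modifier.isStatic(method.getModifiers());
        } catch (ClassNotFoundException | NoSuchMethodException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The class and method named in the stack trace above.
        String cls = "com.google.common.base.Stopwatch";
        System.out.println(cls + " loaded from: " + describeOrigin(cls));
        System.out.println("createUnstarted() present: " + hasStaticNoArgMethod(cls, "createUnstarted"));
    }
}
```

If the reported jar is not the Guava version the build expects, the classpath ordering or a conflicting dependency is the likely place to look.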



Zhu Wayne

Oct 28, 2016, 6:57:43 PM
to gobblin-users, abhishekti...@gmail.com, ibue...@linkedin.com
Resolved by using the 0.8 stable release.




Zhu Wayne

Oct 28, 2016, 7:26:51 PM
to gobblin-users, ibue...@linkedin.com, zhuw.c...@gmail.com
Abhishek,
Thanks for the pull config. I finally got started. However, I got exceptions writing data into Kafka. Could you give some guidance? I tested my Avro file and it was fine. The Avro table has a few string columns and one double column.

2016-10-28 16:23:41 PDT WARN  [ForkExecutor-0] gobblin.writer.RetryWriter$1  83 - Caught exception. This may be retried.
org.apache.kafka.common.errors.SerializationException: Can't convert value of class org.apache.avro.generic.GenericData$Record to class org.apache.kafka.common.serialization.ByteArraySerializer specified in value.serializer






Shirshanka Das

Nov 1, 2016, 12:38:00 AM
to Zhu Wayne, gobblin-users, Issac Buenrostro
Hi Zhu,
  The error occurs because you are passing an Avro record while the Kafka producer has been configured to serialize values as byte arrays (ByteArraySerializer). Are you using a schema registry for your Kafka cluster?

  I recommend going through this page first to understand how to configure the Kafka sink. 


Shirshanka
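
The mismatch described above can be sketched without Kafka or Avro on the classpath. In the example below, `ValueSerializer`, `PassThroughBytes`, `RecordToBytes`, and `Record` are hypothetical stand-ins for a Kafka value serializer, a byte-array serializer, a record-aware serializer, and an Avro record. None of them are the real library classes, and the comma-separated encoding is only a placeholder for a real Avro encoding.

```java
import java.nio.charset.StandardCharsets;

public class SerializerMismatchSketch {
    /** Stand-in for a Kafka-style value serializer; not the real Kafka interface. */
    public interface ValueSerializer<T> {
        byte[] serialize(T value);
    }

    /** Accepts only byte arrays and passes them through unchanged. */
    public static final class PassThroughBytes implements ValueSerializer<byte[]> {
        @Override
        public byte[] serialize(byte[] value) {
            return value;
        }
    }

    /** Stand-in for a structured record such as an Avro record. */
    public static final class Record {
        final String name;
        final double score;

        public Record(String name, double score) {
            this.name = name;
            this.score = score;
        }
    }

    /** Knows how to turn a Record into bytes (placeholder encoding, not Avro). */
    public static final class RecordToBytes implements ValueSerializer<Record> {
        @Override
        public byte[] serialize(Record value) {
            return (value.name + "," + value.score).getBytes(StandardCharsets.UTF_8);
        }
    }

    /** Applies a serializer chosen at runtime, as if from config, and reports a type mismatch. */
    @SuppressWarnings({"unchecked", "rawtypes"})
    public static byte[] send(ValueSerializer serializer, Object value) {
        try {
            return serializer.serialize(value);
        } catch (ClassCastException e) {
            throw new IllegalArgumentException("Can't convert value of class "
                + value.getClass().getName() + " with " + serializer.getClass().getName(), e);
        }
    }

    public static void main(String[] args) {
        Record record = new Record("example", 1.5);
        // A serializer that matches the value type succeeds.
        System.out.println(send(new RecordToBytes(), record).length + " bytes written");
        // A byte-array serializer handed a record fails, as in the reported error.
        try {
            send(new PassThroughBytes(), record);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The practical point is that the serializer configured for the producer has to match the type of the records the writer emits, which is presumably why the schema registry question matters for Avro data.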

 
