We are in the middle of setting up a tracking pipeline that sends various analytics data through Kafka. During the original design phase we didn't take Gobblin into account at all (of course, it didn't exist yet); we were planning on just writing our own Python-based Kafka consumer to pull events out of the Kafka topics, buffer them, and then dump them onto S3 or HDFS.
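For context, what we had in mind is nothing fancy, basically a consume/buffer/flush loop like the sketch below (in Java here to match the Gobblin sketch further down, even though our plan was Python; the broker list, topic name, batch size, and flushToS3 helper are all placeholders):

import java.time.Duration;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TrackingEventSink {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker list
        props.put("group.id", "tracking-event-sink");
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        // Commit offsets manually, only after a successful flush, so a crash
        // between flushes re-reads events rather than dropping them.
        props.put("enable.auto.commit", "false");

        List<String> buffer = new ArrayList<>();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("tracking-events"));  // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    buffer.add(record.value());
                }
                if (buffer.size() >= 10_000) {  // placeholder batch size
                    flushToS3(buffer);          // hypothetical helper: write the batch as one file
                    consumer.commitSync();
                    buffer.clear();
                }
            }
        }
    }

    // Placeholder for the S3/HDFS write; the real version would handle
    // serialization, file naming, and retries.
    private static void flushToS3(List<String> events) {
        // ...
    }
}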
However, looking at the docs, it seems like we could build out a KafkaSource, Extractor, etc., and, say, kick off a job every 30 minutes to do the same thing. The upside is that we'd have Gobblin handling both the data from external partners and the data from the tracking pipeline, so we'd probably get benefits from code reuse, scheduling, and so on.

Aside from the short-term timeline hit we'd take from writing code against an unfamiliar ecosystem, are there any architectural downsides in Gobblin that anybody knows of that would make this a bad idea later? I know there is no pre-existing Kafka source, but I assume that's because LinkedIn already has Camus, so there was no need to write one.
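In case it helps frame the question, here's roughly the shape of what I think we'd end up writing, going off the Source and Extractor interfaces as best I can read them from the repo (the method bodies are stubs with comments describing the intent, I may well have API details wrong, and KafkaTrackingSource/KafkaTrackingExtractor are just names I made up). The idea would be one work unit per topic-partition, with each run resuming from the previous run's high watermark:

import java.io.IOException;
import java.util.Collections;
import java.util.List;

import gobblin.configuration.SourceState;
import gobblin.configuration.WorkUnitState;
import gobblin.source.Source;
import gobblin.source.extractor.DataRecordException;
import gobblin.source.extractor.Extractor;
import gobblin.source.workunit.WorkUnit;

public class KafkaTrackingSource implements Source<String, String> {

    @Override
    public List<WorkUnit> getWorkunits(SourceState state) {
        // Real version: for each topic-partition, read the previous run's
        // high watermark out of `state`, fetch the current latest offset
        // from the brokers, and emit one WorkUnit covering that range.
        return Collections.emptyList();
    }

    @Override
    public Extractor<String, String> getExtractor(WorkUnitState state) throws IOException {
        return new KafkaTrackingExtractor(state);
    }

    @Override
    public void shutdown(SourceState state) {
        // Close any broker connections opened in getWorkunits().
    }

    static class KafkaTrackingExtractor implements Extractor<String, String> {

        KafkaTrackingExtractor(WorkUnitState state) {
            // Real version: open a consumer pinned to this work unit's
            // partition and seek to its start offset.
        }

        @Override
        public String getSchema() throws IOException {
            return "";  // schema of the tracking events, e.g. an Avro schema string
        }

        @Override
        public String readRecord(String reuse) throws DataRecordException, IOException {
            // Real version: return the next event in the offset range;
            // returning null tells Gobblin this work unit is exhausted.
            return null;
        }

        @Override
        public long getExpectedRecordCount() {
            return 0;  // end offset minus start offset
        }

        @Override
        public long getHighWatermark() {
            return 0;  // last offset actually read, persisted for the next run
        }

        @Override
        public void close() throws IOException {
            // Close the consumer.
        }
    }
}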
thanks
Eric