Hadoop (or S3) to Kafka Pipeline

Sean Hermany

Apr 28, 2015, 5:46:56 PM
to camu...@googlegroups.com
Camus is great for doing partitioned ETL on a Kafka topic into HDFS.

Question is...does anyone know of a tool or way to perform the opposite? That is, "replay" data from HDFS back to a Kafka topic?

We apply stream transformations on Kafka feeds that don't translate very well to a Hadoop/MR type job. A common case is that:

1. Data comes in
2. We put it on a topic that Camus then stores, partitioned by ingest date
3. We process that same feed in real time and output it to another topic.

We currently don't have a great solution for pumping the data from HDFS back to Kafka (so that we can re-apply the transformation when we make modifications to that process.)
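To make the replay idea concrete, here is a minimal sketch, not an existing tool. It assumes a `root/YYYY/MM/DD` partition layout, newline-delimited records, and a local filesystem standing in for HDFS; all of those are illustrative assumptions, as are the names `date_partitions` and `replay`. The `send` argument is a placeholder for whatever Kafka producer call is actually used.

```python
from datetime import date, timedelta
from pathlib import Path
from typing import Callable, Iterator


def date_partitions(root: Path, start: date, end: date) -> Iterator[Path]:
    """Yield existing partition directories from start to end, inclusive.

    The root/YYYY/MM/DD layout is an assumption made for illustration.
    """
    day = start
    while day <= end:
        partition = root / f"{day:%Y}" / f"{day:%m}" / f"{day:%d}"
        if partition.is_dir():
            yield partition
        day += timedelta(days=1)


def replay(root: Path, start: date, end: date,
           send: Callable[[bytes], None]) -> int:
    """Pass every non-empty record in the date range to `send`.

    `send` is a hypothetical hook; in practice it would wrap a Kafka
    producer. Returns the number of records handed to it.
    """
    sent = 0
    for partition in date_partitions(root, start, end):
        # Sort file names so repeated replays visit files in the same order.
        for data_file in sorted(partition.iterdir()):
            if not data_file.is_file():
                continue
            with data_file.open("rb") as handle:
                for line in handle:
                    record = line.rstrip(b"\n")
                    if record:
                        send(record)
                        sent += 1
    return sent
```

Calling `replay(Path("/data/mytopic"), date(2015, 4, 1), date(2015, 4, 28), producer_send)` would push that date range through `producer_send`, where both the path and `producer_send` are made-up names. At real data volumes this would more likely run as a distributed job than as a single process.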

I wanted to see if a good solution for that existed, so we don't end up re-inventing the wheel.

- Sean

Félix GV

Apr 28, 2015, 7:25:26 PM
to camu...@googlegroups.com
Hi,

At my previous job, we wrote a tool for exactly that purpose: https://github.com/mate1/camus2kafka

You would likely need to provide a custom implementation of this class: https://github.com/mate1/camus2kafka/blob/master/src/main/scala/com/mate1/camus2kafka/AbstractC2KReducer.scala so that the serialization format written to Kafka matches what your consumers expect (e.g. a magic byte, schema ID bytes, then the payload in whatever format you use). At the time we wrote this tool, the data was all just JSON-encoded Avro with no magic byte or version info. If you have some flavor of schema repo running, you should definitely hook the reducer up to it; that will make your life 100x easier down the line.
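As a rough illustration of the kind of framing mentioned above (magic byte, schema ID bytes, payload), here is a minimal Python sketch. It is not camus2kafka code: the one-byte magic value, the 4-byte big-endian schema ID, and the names `frame_message` and `unframe_message` are all assumptions chosen for the example, so adjust them to whatever your consumers actually expect.

```python
import struct
from typing import Tuple

# Hypothetical layout: 1 magic byte, then a 4-byte big-endian schema ID.
MAGIC_BYTE = 0
HEADER = struct.Struct(">BI")


def frame_message(schema_id: int, payload: bytes) -> bytes:
    """Prefix the payload with the magic byte and the schema ID."""
    return HEADER.pack(MAGIC_BYTE, schema_id) + payload


def unframe_message(message: bytes) -> Tuple[int, bytes]:
    """Split a framed message back into (schema_id, payload)."""
    if len(message) < HEADER.size:
        raise ValueError("message is shorter than the header")
    magic, schema_id = HEADER.unpack_from(message)
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id, message[HEADER.size:]
```

With this layout, a consumer can look up the schema by ID before decoding the payload, which is what makes a schema repo useful here.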

The camus2kafka code base is pretty small so it should be easy to wrap one's mind around it. The README and comments in the config file are probably worth reading as well...

Hopefully that helps.

--
 
Felix GV
Data Infrastructure Engineer
Distributed Data Systems
LinkedIn
 
f...@linkedin.com
linkedin.com/in/felixgv


Sean Hermany

Apr 28, 2015, 8:31:38 PM
to Félix GV, camu...@googlegroups.com
Thanks so much! That's pretty much exactly what we wanted.

Have to say though, you missed out on a great opportunity to name the project "Sumac" (our anticipated name if we didn't find anything suitable and ended up rolling our own.)

Cheers,
Sean


--
You received this message because you are subscribed to a topic in the Google Groups "Camus - Kafka ETL for Hadoop" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/camus_etl/hIi1cbOCFvU/unsubscribe.
To unsubscribe from this group and all its topics, send an email to camus_etl+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Félix GV

Apr 28, 2015, 8:35:51 PM
to Sean Hermany, camu...@googlegroups.com
Hehe cool...!

The people who still own the project should be pretty open to pull requests if you need any. But if that ends up not being the case, you can also fork and rename to Sumac I guess ;) ... It's all Apache-licensed (: ...

Good luck (:

-F

Hisham Mardam-Bey

May 1, 2015, 6:52:56 PM
to Félix GV, Sean Hermany, camu...@googlegroups.com, Boris Fersing
On Tue, Apr 28, 2015 at 8:35 PM, Félix GV <fel...@gmail.com> wrote:
> Hehe cool...!
>
> The people who still own the project should be pretty open to pull requests if you need any.

*raises hand*

> Thanks so much! That's pretty much exactly what we wanted.

Awesome (=

> Have to say though, you missed out on a great opportunity to name the project "Sumac" (our anticipated name if we didn't find anything suitable and ended up rolling our own.)

No reason this can't be done. We'll look into it!

hmb.
 



--
Hisham Mardam-Bey
-=[ CTO ]-=-[ Mate1 Inc. ]=-

A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing in e-mail?

-=[ Codito Ergo Sum ]=-