How to make Druid read Avro files?


Konrad Dziedzic

Nov 20, 2014, 4:32:39 AM
to druid-de...@googlegroups.com
Hi !

I want my Druid installation to use files in .avro format. Unfortunately, I haven't found any information on how to do that (apart from a hint that I should perhaps use ProtoBufInputRowParser).
Could you give me some advice?
I have a Hadoop cluster full of data compressed into .avro files. I would like Druid to index it and to run queries over it. Thank you in advance for your help!

Best regards
Konrad 

Eric Tschetter

Nov 20, 2014, 11:52:13 AM
to druid-de...@googlegroups.com
Konrad,

Reading from Avro files directly will require some code changes. Specifically, you will need to swap out the InputFormat classes on the Hadoop jobs that read the input data.

Depending on whether you are just kicking the tires on the project or trying to do a production installation, it might make sense to start by converting the Avro files into one of the supported text formats, loading that up, and making sure you are comfortable with Druid's feature set and operations.

--Eric

--
You received this message because you are subscribed to the Google Groups "Druid Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-developm...@googlegroups.com.
To post to this group, send email to druid-de...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-development/72a78b7e-32c4-4f92-8d22-a54b290d541c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Fangjin Yang

Nov 20, 2014, 4:45:46 PM
to druid-de...@googlegroups.com
FWIW, some other folks in the community have already started working on this:


We should probably make this the main thread for discussing Avro, since another thread got sidetracked into the topic.



PE Montabrun

Nov 21, 2014, 4:05:26 AM
to druid-de...@googlegroups.com
Hi Konrad,

The version below is a modified stable version of druid-0.6-160:

I have modified it to ingest Avro files with the Hadoop indexing service.
It doesn't affect Druid's existing functionality; it just adds Avro ingestion capabilities.
For now, you just need two more parameters in your task:
{
    "type" : "index_hadoop",
    "hadoopDependencyCoordinates" : ["org.apache.hadoop:hadoop-client:2.4.0"],
    "config" : {
        ...
        "avro" : "true",
        "avroSchema" : "/Path/To/AvroSchema.avsc" // not mandatory
    }
}

We've been testing it this week without any issues, but if bugs or difficulties occur, please share them here so I can fix them.

Pierre-Edouard Montabrun


Konrad Dziedzic

Nov 24, 2014, 8:29:31 AM
to druid-de...@googlegroups.com
Hi,

Thanks a lot, guys. I will check the version recommended by PE Montabrun and let you know if I find any problems.

Best regards
Konrad 

zhao weinan

Nov 25, 2014, 10:55:39 AM
to druid-de...@googlegroups.com
It's really cool! Since I'm planning to use Camus with the AVRO-1124 Avro schema repo, I think I may need to write a KafkaAvroMessageParser, similar to KafkaAvroMessageDecoder, so the realtime node can ingest Avro Kafka messages. Or does one already exist?

Message has been deleted

Pierre-Edouard Montabrun

Dec 10, 2014, 3:50:19 AM
to druid-de...@googlegroups.com
Hi Rohit,

Have you fixed the error yet? I haven't modified any file in the docs, so my first guess would be a typo introduced when you checked out from Git.

Pierre-Edouard


On Monday, December 8, 2014 10:24:40 AM UTC+1, Rohit Jha wrote:
Invalid path (contains separator ':'): docs/content/Tutorial:-A-First-Look-at-Druid.md
Invalid path (contains separator ':'): docs/content/Tutorial:-A-First-Look-at-Druid.md


I am getting this error while trying to check the code out from "https://github.com/Accengage/druid.git".

Pierre-Edouard Montabrun

Dec 10, 2014, 3:50:40 AM
to druid-de...@googlegroups.com
Hi Zhao,

I don't think such a solution exists, but I would definitely be interested in such a parser, since I'll have to parse Avro directly from Kafka too. Please keep us informed about this project if you decide to start it.

Pierre-Edouard

Zhao Weinan

Dec 18, 2014, 8:03:06 AM
to druid-de...@googlegroups.com
Hi Pierre-Edouard,

Sorry for the late reply. Barring any accident, I'll work on the Avro parsing in the realtime node this weekend. Let me know if by any chance you have already finished it :)


Zhao Weinan

Dec 22, 2014, 3:41:34 AM
to druid-de...@googlegroups.com
Hi,

We are currently using Camus and avro-schema-repo for the Avro stream.

The codec logic is already implemented in com.linkedin.camus.etl.kafka.coders.KafkaAvroMessageDecoder/KafkaAvroMessageEncoder, but the trouble is that Camus is not currently in a public Maven repository, and I don't know whether it would be weird to have druid-kafka depend on Camus.

On the other hand, copying or rewriting the codec is not a good way to go.

Any suggestions?