Has anyone gotten compression to work with JSON data?

b...@tokbox.com

Mar 12, 2017, 4:27:49 PM
to gobblin-users
Hi Everyone,

I am hoping that maybe someone has already done this, so I am putting it out there.

I was hoping to be able to compress my JSON data as I extract it from Kafka.  
This is what we had been doing with Camus and it worked well for us, so I want to continue that practice.

I was looking at the source for the SimpleDataWriterBuilder and AvroHdfsDataWriter.  
It seems like it might be possible to include the code for a CodecFactory.  
Or is there more to it?  Maybe some side effect or something...
Has anyone given it a try and gotten it to work?

Thanks in advance,
Bob

Eric Ogren

Mar 13, 2017, 8:24:39 PM
to gobblin-users
Hey Bob -

I added gzip support to SimpleDataWriter about 10 days ago (https://github.com/linkedin/gobblin/commit/39930f3e1886ade157a078819ea0f72d9687d876), so if you are running off of a trunk build you should be able to just specify "writer.codec.type=gzip" in your pull file. I'm assuming that's the writer you're using based on the other question you posted. The writer is also hardcoded to append '.gzip' to the file when you do this, so that may be the only unexpected side-effect.

If you are interested in a different compression algorithm, it should be pretty straightforward to add one - the interface is different from CodecFactory (we deal with streams, whereas CodecFactory seems to be more block-based), but the underlying concepts are pretty similar.
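
For a rough feel of the stream-oriented approach, here is a minimal Java sketch that wraps an output stream in `java.util.zip.GZIPOutputStream` and writes newline-delimited JSON through it. This is not Gobblin's actual encoder code; the class name `GzipStreamSketch` and its methods are hypothetical and only illustrate the concept.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration only; not part of Gobblin.
final class GzipStreamSketch {

    // Writes each JSON record as one line through a gzip stream wrapper.
    static byte[] compress(List<String> jsonRecords) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(sink)) {
            for (String record : jsonRecords) {
                gzip.write(record.getBytes(StandardCharsets.UTF_8));
                gzip.write('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Closing the gzip stream above flushed the trailer into the sink.
        return sink.toByteArray();
    }

    // Reads the gzip bytes back into a string to check the round trip.
    static String decompress(byte[] gzipped) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = gzip.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] gz = compress(Arrays.asList("{\"event\":\"join\"}", "{\"event\":\"leave\"}"));
        System.out.print(decompress(gz));
    }
}
```

Another algorithm would presumably follow the same shape, with a different stream wrapper in place of `GZIPOutputStream`.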

Eric

b...@tokbox.com

Mar 13, 2017, 10:06:49 PM
to gobblin-users
Hi Eric,

Thanks for answering so fast!  Yes, I am using the SimpleDataWriter.  I had tried the Avro writer but it didn't quite work out for me.  

Yes, I did do a git pull on Mar 9, so I should have that code.  You know, I saw buildEncoders() in the superclass and thought that was interesting.  Should have looked more closely...
I will give it a whirl as soon as the current run finishes.  I was testing to see what happens when Gobblin gets behind in the extraction by a few hours.

No news will be good news.

Thanks,
Bob

b...@tokbox.com

Mar 13, 2017, 11:04:47 PM
to gobblin-users
Okay, looks like gzip does indeed work.  I will let it run for a few cycles and compare the times to see what I am getting.

Thanks Eric!

Eric Ogren

Mar 14, 2017, 12:02:03 AM
to gobblin-users
Great!
Eric

b...@tokbox.com

Mar 25, 2017, 6:37:26 PM
to gobblin-users
Hi Eric,

One little gotcha that I found with what Gobblin is doing here.  Since it is using the .gzip extension, this presents a bit of a conflict when using the compressed data downstream in another MR job.  Hadoop uses the .gz extension by default for the GzipCodec, so it will be necessary to either rename the files or override how Hadoop selects the codec to use.
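
As a sketch of the rename option, assuming the output has been copied to a local directory (on HDFS the same idea would go through Hadoop's `FileSystem.rename` instead), something like the following could swap the suffix. `GzipSuffixRenamer` and its methods are made-up names for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; not part of Gobblin or Hadoop.
final class GzipSuffixRenamer {

    // Maps "name.gzip" to "name.gz"; any other name is returned unchanged.
    static String toGzName(String fileName) {
        String suffix = ".gzip";
        if (fileName.endsWith(suffix)) {
            return fileName.substring(0, fileName.length() - suffix.length()) + ".gz";
        }
        return fileName;
    }

    // Renames every *.gzip file directly inside dir and returns how many were renamed.
    // Files.move fails if a file with the target name already exists.
    static int renameAll(Path dir) {
        int renamed = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.gzip")) {
            for (Path file : files) {
                Path target = file.resolveSibling(toGzName(file.getFileName().toString()));
                Files.move(file, target);
                renamed++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return renamed;
    }

    public static void main(String[] args) {
        // The file name here is an invented example.
        System.out.println(toGzName("part.task_0.txt.gzip"));
    }
}
```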

Bob

Eric Ogren

Mar 27, 2017, 9:15:32 PM
to b...@tokbox.com, gobblin-users
Ah, for some reason I thought the two suffixes were interchangeable. Will fix this week -- thanks for the report!

Eric


--
You received this message because you are subscribed to a topic in the Google Groups "gobblin-users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/gobblin-users/NWe8hIC04ZM/unsubscribe.
To unsubscribe from this group and all its topics, send an email to gobblin-users+unsubscribe@googlegroups.com.
To post to this group, send email to gobbli...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gobblin-users/93d31ec7-7adc-4534-8ae9-d99cf80d6044%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

b...@tokbox.com

Mar 27, 2017, 10:40:48 PM
to gobblin-users, b...@tokbox.com
NP, not too big of a deal. I just created a custom RecordReader and InputFormat to tell Hadoop to use the GzipCodec with .gzip files, which was not too bad.

Thanks for getting the code started in the first place!