LZO Combine File Input Format.. Again

Jeremy Davis

May 12, 2014, 4:39:48 PM
to elephant...@googlegroups.com
Hello again,
We've been trying to get LZO and CFIF working, but are running into a problem.
We can combine small files successfully, but the same code has trouble splitting a larger file (which obviously works in the non-CFIF case).
Wondered if anyone has made any progress on this, or can point to some example code we can crib from?

-JD

Dmitriy Ryaboy

May 12, 2014, 5:18:26 PM
to elephant...@googlegroups.com
My conclusion on CFIF is that it's borked by design and won't work for non-trivial file input format implementations. The thing doesn't delegate half the calls you expect when implementing an InputFormat, with predictable results.


--
Dmitriy V Ryaboy
Data Platform @ Twitter
http://twitter.com/squarecog

Jeremy Davis

May 13, 2014, 2:56:27 PM
to elephant...@googlegroups.com
Yeah, I saw that. I'm trying to hack together something specific to LZO, so it's not so much a matter of delegating to the underlying format.

Raghu Angadi

May 13, 2014, 4:16:44 PM
to elephant...@googlegroups.com, Jeremy Davis
Can you share your code? The issue is not very clear to me: you are saying you are not able to split the larger files with your patch. Assuming you have the LZO index file, I don't know what could be wrong without looking at the changes you made.

Sean Gottschalk

May 15, 2014, 7:31:38 PM
to elephant...@googlegroups.com
Hi guys,
I've been part of the "we" that Jeremy referred to above. I've extracted our code into a small project and put it here: https://github.com/sgottschalk/lzo-split

I created a word count example, and I've found that it fails whenever an LZO file is larger than 256MB, because it naively tries to split the file at 256MB even when there's an LZO index file. When I disable "cascading.hadoop.hfs.combine.files", it splits the file intelligently.
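
To make the difference concrete, here is a minimal sketch of what "splitting on the index" means, as opposed to cutting at a fixed byte offset. It is not the code from the lzo-split project and not the hadoop-lzo API: the class name, the method names, and the sample offsets are all hypothetical, and it assumes the index has already been loaded into a sorted array of compressed-block start offsets. Each naive window boundary is simply pushed forward to the next block start, so no split begins or ends in the middle of a compressed block.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; plain Java, no Hadoop dependencies.
public class LzoAlignedSplits {

    // Returns the first block offset >= pos, or -1 if there is none.
    // blockOffsets must be sorted in ascending order.
    static long nextBlockOffset(long[] blockOffsets, long pos) {
        int i = Arrays.binarySearch(blockOffsets, pos);
        if (i < 0) {
            i = -i - 1; // not found: convert to the insertion point
        }
        return i < blockOffsets.length ? blockOffsets[i] : -1;
    }

    // Walks the file in fixed-size windows (the "naive" splits) and moves
    // each window's start and end forward to the next block boundary.
    // Each returned element is {start, end}, with end exclusive.
    static List<long[]> alignedSplits(long[] blockOffsets, long fileLength, long maxSplitSize) {
        List<long[]> splits = new ArrayList<>();
        for (long naiveStart = 0; naiveStart < fileLength; naiveStart += maxSplitSize) {
            long naiveEnd = Math.min(naiveStart + maxSplitSize, fileLength);
            // The first split keeps offset 0 so the file header is included.
            long start = naiveStart == 0 ? 0 : nextBlockOffset(blockOffsets, naiveStart);
            if (start < 0 || start >= naiveEnd) {
                continue; // no block begins in this window; the previous split covers it
            }
            long end = nextBlockOffset(blockOffsets, naiveEnd);
            if (end < 0) {
                end = fileLength; // no later block: run to the end of the file
            }
            splits.add(new long[] {start, end});
        }
        return splits;
    }

    public static void main(String[] args) {
        // Made-up block offsets and sizes, scaled down for readability.
        long[] blockOffsets = {40, 300, 520, 790, 1010};
        for (long[] s : alignedSplits(blockOffsets, 1200, 256)) {
            System.out.println(s[0] + " - " + s[1]);
        }
    }
}
```

The behaviour described above looks like the combine path handing out the naive windows unchanged, which is what leaves a reader starting mid-block on anything past the first 256MB.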

I think the code is pretty straightforward, but let me know if anything's confusing. The main class is LzoCFIFTest.java

Thanks!

Sean Gottschalk

May 28, 2014, 4:27:22 PM
to elephant...@googlegroups.com
Hi guys,

Just checking in to see if anyone's had a chance to take a look at this. Let me know if there's anything that I can clarify.

Dmitriy Ryaboy

May 28, 2014, 5:14:34 PM
to elephant...@googlegroups.com