How to process binary file?

Dwinanda Prayudi

Oct 26, 2009, 3:52:10 AM

to cascading-user

Dear All,

I want to process a binary file using Cascading.
I think the right way to do this is to write a custom source Tap that
does something like this:
- read a binary file
- pass the file name to an external application (via shell execution
or JNI, to be decided later)
- the external application produces a text file as its result
- open the output file, parse it, and build Tuples from it
In my understanding I have to extend the Tap and Scheme classes. But I
have no clue which method should hold the code that reads the binary
file and puts it into a Tuple. Can someone give me a clue how to do it?

I also realize that I must process the binary file locally, since
Hadoop can't split the file across data nodes. Is that right?
I would really appreciate any help on this matter.

Thanks!

Dwinanda Prayudi

Chris K Wensel

Oct 26, 2009, 11:51:17 AM

to cascadi...@googlegroups.com

Hi Dwinanda

Taps only interface to FileSystems. Schemes only know how to process a
given data format.

So if your binary file is in HDFS, you do not need a custom Tap.

But you do need to write your own Hadoop InputFormat that knows how to
parse your binary data and wrap that in a custom Cascading Scheme.
Looking at the code of any of the built-in Scheme classes should make
this all very obvious (see SequenceFile).
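
As a rough illustration of the parsing half of that advice, here is a
minimal sketch in plain Java of the kind of record-reading logic a
custom InputFormat's reader would wrap. It is not taken from Cascading
or Hadoop. The record layout (a 4-byte int id followed by an 8-byte
double), the class name BinaryRecordParser, and its method names are
all assumptions made up for the example, and the Hadoop and Cascading
classes are deliberately left out so the snippet stands alone. It
needs Java 9 or later for InputStream.readNBytes.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for fixed-length binary records.
class BinaryRecordParser {
    // Assumed layout: 4-byte big-endian int id, then 8-byte big-endian double value.
    static final int RECORD_LENGTH = 4 + 8;

    // Reads one record; returns null at a clean end of stream,
    // and throws EOFException if the stream ends in the middle of a record.
    static Object[] readRecord(InputStream in) throws IOException {
        byte[] buf = new byte[RECORD_LENGTH];
        int n = in.readNBytes(buf, 0, RECORD_LENGTH); // Java 9+
        if (n == 0) {
            return null;
        }
        if (n < RECORD_LENGTH) {
            throw new EOFException("truncated record: got " + n + " bytes");
        }
        ByteBuffer bb = ByteBuffer.wrap(buf); // ByteBuffer is big-endian by default
        return new Object[] { bb.getInt(), bb.getDouble() };
    }

    // Reads every record from the stream; each Object[] stands in for one tuple.
    static List<Object[]> readAll(InputStream in) throws IOException {
        List<Object[]> records = new ArrayList<>();
        Object[] record;
        while ((record = readRecord(in)) != null) {
            records.add(record);
        }
        return records;
    }

    // Convenience wrapper for in-memory data; rethrows I/O errors unchecked.
    static List<Object[]> parse(byte[] data) {
        try {
            return readAll(new ByteArrayInputStream(data));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In a real job the loop body would live in the reader that the
InputFormat hands back, and the Scheme would turn each parsed record
into a Tuple.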

cheers,
chris

--
Chris K Wensel
ch...@concurrentinc.com
http://www.concurrentinc.com

Dwinanda Prayudi

Oct 27, 2009, 12:18:44 AM

to cascading-user

Hi Chris.
Many thanks for the answer.
My next question: can I put the binary file into HDFS, or must I keep
it locally? In my understanding Hadoop won't be able to split a binary
file since it has no row delimiter (newline). Or does Hadoop just
split the binary file into slices and then merge the slices back into
one stream when I read it via a Tap? Correct me if I'm wrong.

Chris K Wensel

Oct 27, 2009, 11:02:38 AM

to cascadi...@googlegroups.com

You can stuff anything you want in HDFS. Blocks are based on length,
not delimiters. But your InputFormat (and supporting classes) will be
responsible for determining splits based on whatever boundaries you use.

This is all completely independent of Cascading.
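
To make the point about splits concrete, here is a small sketch in
plain Java of how split boundaries could be chosen for a file of
fixed-length records so that no record straddles two splits. It
assumes fixed-length records, which the thread does not state. The
class name RecordAlignedSplits, the method computeSplits, and its
parameters are hypothetical, and no Hadoop classes are used. A real
InputFormat would do the equivalent work when it computes its splits.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: computes record-aligned byte ranges for a file.
class RecordAlignedSplits {
    // Returns {offset, length} pairs covering the whole file.
    // Every split starts and ends on a record boundary.
    static List<long[]> computeSplits(long fileLength, int recordLength, long targetSplitSize) {
        if (recordLength <= 0 || targetSplitSize < recordLength) {
            throw new IllegalArgumentException("split size must hold at least one record");
        }
        if (fileLength < 0 || fileLength % recordLength != 0) {
            throw new IllegalArgumentException("file length is not a whole number of records");
        }
        // Round the target size down to a whole number of records.
        long splitBytes = (targetSplitSize / recordLength) * recordLength;
        List<long[]> splits = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += splitBytes) {
            long length = Math.min(splitBytes, fileLength - offset);
            splits.add(new long[] { offset, length });
        }
        return splits;
    }
}
```

For example, a 120-byte file of 12-byte records with a 50-byte target
yields splits of 48, 48, and 24 bytes. Formats with variable-length
records need a different strategy, such as sync markers, which this
sketch does not cover.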

ckw
