Re: your hadoop typedbyte


Klaas Bosteels

May 13, 2012, 4:50:22 AM
to Qiming He, dumbo...@googlegroups.com
You can't just take plain typed bytes files as input for a Dumbo job like that. Typed bytes are not really a file format; they are just a way of encoding values. Dumbo mainly uses typed bytes for sending data to and from the Hadoop Java processes. It also uses them for storing its output and intermediate files, but those files are sequence files containing TypedBytesWritable objects, not plain files containing typed bytes as in your case here.
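To make the "encoding, not file format" point concrete, here is a minimal sketch of what a plain typed bytes file contains. It hand-encodes one key/value pair with `struct` instead of using the `typedbytes` module, assuming the standard type codes (0 for bytes, 7 for string) followed by a 4-byte big-endian length. The helper names are made up for illustration.

```python
import struct

# Type codes from the Hadoop typed bytes encoding.
TYPE_BYTES = 0
TYPE_STRING = 7


def encode_bytes(data):
    """Type code 0, a 4-byte big-endian length, then the raw bytes."""
    return struct.pack(">bi", TYPE_BYTES, len(data)) + data


def encode_string(text):
    """Type code 7, a 4-byte big-endian length, then the UTF-8 bytes."""
    raw = text.encode("utf-8")
    return struct.pack(">bi", TYPE_STRING, len(raw)) + raw


# One key/value pair, laid out the way the convert script writes it:
# the filename as a string, then the file content as bytes.
pair = encode_string("input.bin") + encode_bytes(b"\x00\x01\n\x02")
print(repr(pair))
```

The result is just the two encoded values back to back. There is no file header, no record boundary marker and no sync point, which is what a sequence file adds on top.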

You could write a custom input format for reading plain typed bytes files, but such files won't be easily splittable, so running more than one mapper on the same file will probably not be possible. You might want to try writing your input as sequence files instead.
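A likely explanation for the increasing numbers in the output quoted below, assuming the job fell back to Hadoop's default line-oriented text input format because no input format was specified: the binary file is cut at every newline byte, and each piece is keyed by its byte offset in the file. The sketch below only simulates that behaviour locally; `text_records` is a hypothetical helper, not part of Hadoop or Dumbo.

```python
import io


def text_records(stream):
    """Yield (byte_offset, line) pairs, splitting on newline bytes only."""
    offset = 0
    for line in stream:
        yield offset, line.rstrip(b"\n")
        offset += len(line)


# A typed bytes pair: string key "input.bin", then a 7-byte binary value
# that happens to contain two newline bytes.
key = b"\x07" + b"\x00\x00\x00\x09" + b"input.bin"
value = b"\x00" + b"\x00\x00\x00\x07" + b"ab\ncd\ne"
blob = key + value

# A mapper that emits (key, 1) for each record now sees offsets as keys.
for offset, line in text_records(io.BytesIO(blob)):
    print(offset, 1)
```

A real binary file contains many newline bytes, so the mapper receives hundreds of such records instead of one filename/content pair.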

-K

On Fri, May 11, 2012 at 10:37 PM, Qiming He <qimi...@openresearchinc.com> wrote:
Klaas,

I have a question regarding typedbytes file processing using

Firstly, your tb format is different from hadoopy's tb format (judging from a hex view).
hadoopy's tb (https://github.com/bwhite/hadoopy) is more like Hadoop's sequence file.
Is there a standard (binary compatible) Hadoop typedbytes file?

Here is what I want to do with your code: create a tb file from a binary file, using its filename (string) as key, and its content (binary) as value; put the tb file on HDFS; and use hadoop-streaming to output key (i.e., filename) only:
-----------------------------
# Script to convert one binary file into test.tb, which is then put on HDFS as /tmp/test.tb
import sys
import typedbytes

# Both arguments are required: sys.argv[1] is the input, sys.argv[2] the output.
if len(sys.argv) < 3:
    sys.stderr.write('Usage: convert.py input.bin output.tb\n')
    sys.exit(1)
with open(sys.argv[1], 'rb') as src, open(sys.argv[2], 'wb') as fp:
    a = typedbytes.Output(fp)
    a.write_string(sys.argv[1])  # filename as key
    a.write_bytes(src.read())    # file content as value
--------------------------------
# Here is the streaming mapper.py
import sys
import typedbytes

input = typedbytes.PairedInput(sys.stdin)
output = typedbytes.PairedOutput(sys.stdout)
# Emit each key with a count of 1, discarding the value.
for (key, value) in input:
    output.write((key, 1))
--------------------------------
$hadoop fs -rmr /output; hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u3.jar -input /tmp/test.tb -output /output -mapper mapper.py -io typedbytes

I expect to see a single line (there should be only one key/value pair), i.e.,
test.tb 1
Instead I see hundreds of lines of the following form (with no Hadoop error):
<increasing numbers> 1

What is the problem from what you can see?

Thanks in advance.

-Qiming

On Fri, Feb 10, 2012 at 11:53 AM, Klaas Bosteels <klaas.b...@gmail.com> wrote:
Hey Qiming,

CDH supports Dumbo out of the box, but that doesn't mean it includes it. It has all the Hadoop patches that Dumbo requires but Dumbo itself is not part of the CDH distribution. 

So you indeed have to install Dumbo separately and you seem to have done that correctly, but you're passing an incorrect path via the -hadoop option. You have to point to Hadoop's home directory, not the "hadoop" binary/script:

dumbo cat /path/to/some/file -hadoop /usr/lib/hadoop-0.20
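As a quick sanity check before running Dumbo, a sketch like the following can confirm that a candidate Hadoop home directory actually contains a streaming jar. The `contrib/streaming` layout is taken from the jar path quoted in this thread; `find_streaming_jar` is a hypothetical helper and is not how Dumbo itself locates the jar.

```python
import glob
import os


def find_streaming_jar(hadoop_home):
    """Return the first streaming jar under hadoop_home, or None."""
    # A path to the "hadoop" script (a file, not a directory) fails here.
    if not os.path.isdir(hadoop_home):
        return None
    pattern = os.path.join(
        hadoop_home, "contrib", "streaming", "hadoop-streaming*.jar"
    )
    matches = sorted(glob.glob(pattern))
    return matches[0] if matches else None


# Prints the jar path on a machine with this layout, otherwise None.
print(find_streaming_jar("/usr/lib/hadoop-0.20"))
```

Passing `/usr/bin/hadoop` returns None on a typical install, because that path is the launcher script rather than a directory.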


Hope this helps,
-Klaas 



On Thu, Feb 9, 2012 at 5:09 PM, Qiming He <qimi...@openresearchinc.com> wrote:
Klaas,

According to https://github.com/klbostee/dumbo/wiki/Building-and-installing, "...Alternatively you can use Cloudera’s Hadoop distribution, which supports Dumbo out of the box from version 2 (CDH2) onwards...."

However, in my installation of Hadoop cdh3u3, I cannot find Dumbo installed anywhere (searching with find / -name dumbo).
I tried to install it manually:
$wget -O ez_setup.py http://bit.ly/ezsetup
$python ez_setup.py dumbo
When I run "dumbo cat </path> -hadoop /usr/bin/hadoop", I get the following error:
ERROR: Streaming jar not found

I do have /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u3.jar
Putting it into CLASSPATH does not help either.

Could you please advise on either the Cloudera installation instructions (if Dumbo is not included out of the box),
or a workaround for the missing streaming jar issue, e.g., copying the jar to a specific location?

Thanks

-Qiming




--
Dr. Qiming He
Qimi...@openresearchinc.com
301-525-6612 (Phone)
815-327-2122 (Fax)


