Klaas,
I have a question regarding typedbytes file processing using
Firstly, your tb format is different from hadoopy's tb format (when compared in a hex viewer).
Is there a standard (binary compatible) hadoop typedbytes file?
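For comparing hex dumps from the two libraries, here is a minimal sketch of the byte layout of a single (string key, bytes value) pair. It assumes the type codes documented for Hadoop's typedbytes package: 0 for bytes and 7 for string, each followed by a 4-byte big-endian length and then the payload. The helper names encode_string, encode_bytes and encode_pair are hypothetical and are not part of the typedbytes or hadoopy modules.

```python
import struct

# Assumed type codes from the Hadoop typedbytes package documentation.
TYPE_BYTES = 0
TYPE_STRING = 7


def encode_bytes(payload):
    """Type code 0, 4-byte big-endian length, then the raw bytes."""
    return struct.pack(">bi", TYPE_BYTES, len(payload)) + payload


def encode_string(text):
    """Type code 7, 4-byte big-endian length, then the UTF-8 bytes."""
    data = text.encode("utf-8")
    return struct.pack(">bi", TYPE_STRING, len(data)) + data


def encode_pair(key, value):
    """One record: a string key immediately followed by a bytes value."""
    return encode_string(key) + encode_bytes(value)


if __name__ == "__main__":
    # Print the record as hex so it can be compared with a hex view of a .tb file.
    print(encode_pair("a", b"\x01").hex())
```

If the first bytes of a .tb file do not follow this pattern, the file is probably wrapped in some other container rather than being a raw typedbytes stream.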
Here is what I want to do with your code: create a tb file from a binary file, using its filename (string) as key, and its content (binary) as value; put the tb file on HDFS; and use hadoop-streaming to output key (i.e., filename) only:
-----------------------------
# Script to convert one binary file into test.tb, which is then put on HDFS at /tmp/test.tb
import sys
import typedbytes

# Both the input and output paths are required (argv[1] and argv[2]).
if len(sys.argv) < 3:
    sys.stderr.write('Usage: convert.py input.bin output.tb\n')
    sys.exit(1)

with open(sys.argv[2], "wb") as fp:
    a = typedbytes.Output(fp)
    a.write_string(sys.argv[1])  # filename as key
    a.write_bytes(open(sys.argv[1], 'rb').read())  # file content (bytes) as value
--------------------------------
# Here is the streaming mapper.py
import sys
import typedbytes

input = typedbytes.PairedInput(sys.stdin)
output = typedbytes.PairedOutput(sys.stdout)
for (key, value) in input:
    output.write((key, 1))  # emit the key (filename) with a count of 1
--------------------------------
$hadoop fs -rmr /output; hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u3.jar -input /tmp/test.tb -output /output -mapper mapper.py -io typedbytes
I expect to see exactly one line, since there should be only one (key, value) pair, i.e.,
test.tb 1
Instead, I see hundreds of lines in the following format (with no Hadoop error):
<increasing numbers> 1
Can you see what the problem is?
Thanks in advance.
-Qiming
On Fri, Feb 10, 2012 at 11:53 AM, Klaas Bosteels
<klaas.b...@gmail.com> wrote:
Hey Qiming,
CDH supports Dumbo out of the box, but that doesn't mean it includes it. It has all the Hadoop patches that Dumbo requires but Dumbo itself is not part of the CDH distribution.
So you indeed have to install Dumbo separately and you seem to have done that correctly, but you're passing an incorrect path via the -hadoop option. You have to point to Hadoop's home directory, not the "hadoop" binary/script:
dumbo cat /path/to/some/file -hadoop /usr/lib/hadoop-0.20
Hope this helps,
-Klaas
On Thu, Feb 9, 2012 at 5:09 PM, Qiming He
<qimi...@openresearchinc.com> wrote:
Klaas,
According to https://github.com/klbostee/dumbo/wiki/Building-and-installing, "...Alternatively you can use Cloudera’s Hadoop distribution, which supports Dumbo out of the box from version 2 (CDH2) onwards...."
However, in my installation of Hadoop cdh3u3, I cannot find Dumbo installed (searching with find / -name dumbo).
I tried to install it manually:
$wget -O ez_setup.py http://bit.ly/ezsetup
$python ez_setup.py dumbo
When I run "dumbo cat </path> -hadoop /usr/bin/hadoop", I get an error like
ERROR: Streaming jar not found
I do have /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u3.jar
Putting it into CLASSPATH does not help either.
Could you please point me to either Cloudera installation instructions (if Dumbo is not included out of the box),
or a workaround for the missing streaming jar issue, e.g., copying the jar to a specific location?
Thanks
-Qiming