Hi,
I have no problem running dumbo from my cluster master node.
I was looking into running dumbo from my local machine (OS X client).
I got Hadoop itself working from there by adjusting my local
core-site.xml etc. to point to the master node's HDFS.
I can successfully run a Hadoop job from my local machine:
./bin/hadoop jar ./hadoop-examples-0.20.2-cdh3u2.jar wordcount eno.txt test_output
Unfortunately, this does not work with dumbo. I am able to ls, cat and
put against the remote HDFS, but running a job from a virtualenv gives me:
dumbo start wordcount.py -input eno.txt -output outdumbo -hadoop starcluster
EXEC: HADOOP_CLASSPATH=":$HADOOP_CLASSPATH" /Volumes/Locodrive/hadoop/clouderahadoop-0.20.2-CDH3u2-src/bin/hadoop jar /Volumes/Locodrive/hadoop/clouderahadoop-0.20.2-CDH3u2-src/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar -mapper 'python -m wordcount map 0 262144000' -outputformat 'org.apache.hadoop.mapred.SequenceFileOutputFormat' -inputformat 'org.apache.hadoop.streaming.AutoInputFormat' -reducer 'python -m wordcount red 0 262144000' -file '/Volumes/Locodrive/Dev/.virtualenvs/cloudcomp/lib/python2.7/site-packages/dumbo/backends/common.pyc' -file '/Volumes/Locodrive/Dev/cloud/dumbo_test/wordcount.py' -output 'outdumbo' -jobconf 'mapred.job.name=wordcount.py (1/1)' -jobconf 'stream.map.input=typedbytes' -jobconf 'stream.map.output=typedbytes' -jobconf 'stream.reduce.input=typedbytes' -jobconf 'stream.reduce.output=typedbytes' -jobconf 'tmpfiles=/Volumes/Locodrive/Dev/.virtualenvs/cloudcomp/lib/python2.7/site-packages/typedbytes.pyc' -input 'eno.txt' -cmdenv 'PYTHONPATH=common.pyc' -cmdenv 'dumbo_jk_class=dumbo.backends.common.JoinKey' -cmdenv 'dumbo_mrbase_class=dumbo.backends.common.MapRedBase' -cmdenv 'dumbo_runinfo_class=dumbo.backends.streaming.StreamingRunInfo'
12/04/30 12:26:18 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [/Volumes/Locodrive/Dev/.virtualenvs/cloudcomp/lib/python2.7/site-packages/dumbo/backends/common.pyc, /Volumes/Locodrive/Dev/cloud/dumbo_test/wordcount.py] [] /var/folders/y3/qz5f8njs7nx8sjt_dp2xjqrr0000gn/T/streamjob4912208527955281463.jar tmpDir=null
12/04/30 12:26:20 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
12/04/30 12:26:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12/04/30 12:26:20 INFO mapred.JobClient: Cleaning up the staging area file:/tmp/hadoop/hadoop-locojay/mapred/staging/locojay-2135596361/.staging/job_local_0001
12/04/30 12:26:20 ERROR streaming.StreamJob: Error launching job , bad input path : File does not exist: /Volumes/Locodrive/Dev/.virtualenvs/cloudcomp/lib/python2.7/site-packages/typedbytes.pyc
Streaming Command Failed!
The file
/Volumes/Locodrive/Dev/.virtualenvs/cloudcomp/lib/python2.7/site-packages/typedbytes.pyc
exists locally but does not exist on the master node.
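To double-check which of the shipped files really exist on the client, the
EXEC line can be fed through a small script like the one below. This is only
a rough sketch: shipped_paths and check_paths are hypothetical helper names,
not part of dumbo, and it assumes that tmpfiles holds a comma-separated list
of paths.

```python
import os
import shlex


def shipped_paths(command):
    """Collect paths given via -file or -jobconf tmpfiles=... in a streaming command."""
    tokens = shlex.split(command)
    paths = []
    # Walk over (option, value) pairs of adjacent tokens.
    for flag, value in zip(tokens, tokens[1:]):
        if flag == "-file":
            paths.append(value)
        elif flag == "-jobconf" and value.startswith("tmpfiles="):
            # Assumption: tmpfiles is a comma-separated list of paths.
            paths.extend(value[len("tmpfiles="):].split(","))
    return paths


def check_paths(paths):
    """Map each path to whether it exists on the machine running this script."""
    return {p: os.path.exists(p) for p in paths}


if __name__ == "__main__":
    import sys
    # Paste everything after "EXEC: " on stdin.
    command = sys.stdin.read()
    for path, found in check_paths(shipped_paths(command)).items():
        print(("ok      " if found else "MISSING ") + path)
```

It only looks at the local filesystem of whichever machine it runs on, so it
has to be run on the client and on the master node separately to compare.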
Any idea how I can get dumbo working from my client machine? I would
like to have a workflow where I don't need to move my dev files to
the master node.
Thanks