Hadoop Streaming


Hari

Jan 5, 2013, 4:04:32 AM
to dumbo...@googlegroups.com
Hi, has anyone in this forum been able to use dumbo with Hadoop Streaming?

I am using hadoop-1.0.4 but am getting the error 'Streaming jar not found' when trying to use dumbo. I used the command
$dumbo start wordcount.py -hadoop /usr/local/hadoop-1.0.4/bin/hadoop -input excludes.txt -output out.txt

After reading a related post in this forum I modified the file dumbo/util.py to include os.path.join(hadoop, 'share', 'hadoop', 'contrib', name). Even after that I am getting the same error...
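For reference, the lookup I am patching essentially searches a few candidate directories for the streaming jar, roughly along these lines (the function name and search order here are only an illustration, not dumbo's actual code):

    import glob
    import os

    def find_streaming_jar(hadoop_home):
        # Candidate locations for the streaming jar in different Hadoop layouts;
        # 'share/hadoop/contrib' is the extra entry mentioned above. This list is
        # a hypothetical illustration, not dumbo's real search order.
        candidates = [
            os.path.join(hadoop_home, 'contrib', 'streaming'),
            os.path.join(hadoop_home, 'contrib'),
            os.path.join(hadoop_home, 'share', 'hadoop', 'contrib'),
        ]
        for directory in candidates:
            matches = glob.glob(os.path.join(directory, 'hadoop-*streaming*.jar'))
            if matches:
                return matches[0]
        return None  # the caller then reports 'Streaming jar not found'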

Any suggestions?

Thanks in Advance
-Hari

xpjian

Jan 5, 2013, 8:44:07 AM
to dumbo...@googlegroups.com
I think the problem is that "/usr/local/hadoop-1.0.4/bin/hadoop" does not contain a "contrib" dir, so the program cannot find the streaming jar. I guess the right place should be "/usr/local/hadoop-1.0.4" (of course you should confirm the contrib dir is there). In my case, my hadoop is /home/hadoop/hadoop, and this dir has the bin dir, contrib dir, lib dir and so on. Hope this helps ^_^
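In other words, point -hadoop at the Hadoop home directory rather than at the bin/hadoop script, e.g. something like:

    $ dumbo start wordcount.py -hadoop /usr/local/hadoop-1.0.4 -input excludes.txt -output out.txt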

Dr. Hari Koduvely

Jan 5, 2013, 12:22:07 PM
to dumbo...@googlegroups.com

Thanks, xpjian. It is working fine now after using /usr/local/hadoop-1.0.4

Regards
Hari


xpjian

Jan 5, 2013, 9:39:36 PM
to dumbo...@googlegroups.com
that is great. haha~

sudhee...@gmail.com

Oct 1, 2013, 9:23:39 AM
to dumbo...@googlegroups.com

Apache Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license. It supports the running of applications on large clusters of commodity hardware. Hadoop was derived from Google's MapReduce and Google File System (GFS) papers. The Hadoop framework transparently provides both reliability and data motion to applications. Hadoop implements a computational paradigm named MapReduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system that stores data on the compute nodes, providing high aggregate bandwidth across the cluster. Both MapReduce and the distributed file system are designed so that node failures are automatically handled by the framework. This allows applications to work with thousands of computation-independent computers and petabytes of data. The whole Apache Hadoop "platform" is now commonly considered to consist of the Hadoop kernel, MapReduce and the Hadoop Distributed File System (HDFS), plus a variety of related projects including Apache Hive, Apache HBase, and others.

Hadoop is written in the Java programming language and is an Apache top-level project being built and used by a worldwide community of contributors. Hadoop and its related projects (Hive, HBase, ZooKeeper, and so on) have many contributors from across the ecosystem. Though Java code is most common, any programming language can be used with "streaming" to implement the "map" and "reduce" parts of the program.
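To make the "streaming" part concrete, a word count can be written as two small scripts that read lines from stdin and write tab-separated key/value pairs to stdout. This is a minimal, generic Hadoop Streaming sketch in Python (the file names mapper.py and reducer.py are just illustrative), not dumbo-specific code:

    # mapper.py - emit "word<TAB>1" for every word read from stdin
    import sys

    for line in sys.stdin:
        for word in line.split():
            sys.stdout.write('%s\t1\n' % word)

    # reducer.py - sum the counts for each word; streaming sorts the mapper
    # output by key, so all counts for one word arrive consecutively
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip('\n').split('\t', 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                sys.stdout.write('%s\t%d\n' % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        sys.stdout.write('%s\t%d\n' % (current_word, current_count))

Scripts like these are passed to the streaming jar via its -mapper and -reducer options; dumbo essentially generates that plumbing for you, so a single module such as wordcount.py can define both the map and reduce steps.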

 

Hadoop enables a computing solution that is:

 

Scalable: New nodes can be added as needed, without needing to change data formats, how data is loaded, how jobs are written, or the applications on top.

Cost effective: Hadoop brings massively parallel computing to commodity servers. The result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all of your data.

Flexible: Hadoop is schema-less and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any single system can provide.

Fault tolerant: When you lose a node, the system redirects work to another copy of the data and continues processing without missing a beat.

 

Apache Hadoop is 100% open source and pioneered a fundamentally new way of storing and processing data. Instead of relying on expensive, proprietary hardware and different systems to store and process data, Hadoop enables distributed parallel processing of huge amounts of data across inexpensive, industry-standard servers that both store and process the data, and it can scale without limits. With Hadoop, no data is too big. And in today's hyper-connected world where more and more data is being created every day, Hadoop's breakthrough advantages mean that businesses and organizations can now find value in data that was recently considered useless.

sudheer attain

Oct 12, 2013, 5:24:52 AM
to dumbo...@googlegroups.com

It is good and it is very helpful for us. Hadoop online trainings provides the best online training for Hadoop.

G Suresh

Oct 24, 2013, 6:08:50 AM
to dumbo...@googlegroups.com



It is good and it is very helpful for us. Hadoop online trainings provides the best online training for Hadoop.