Run jobs on a remote Hadoop cluster without local Hadoop?

Ross Donaldson

Mar 20, 2014, 3:59:21 PM3/20/14
to mr...@googlegroups.com
Hello all --

My googling has yet to reveal an answer to this config question: is it possible to have a local install of mrjob, no local Hadoop, and submit Hadoop jobs to a remote cluster? The remote cluster is a set of internal servers at my company, not, say, EMR. Ideally, we want to configure a set of VMs with mrjob -- but no local Hadoop -- that can submit to a client node on a different server.

This doesn't seem like it's going to work. Is there a config I'm missing?

Thanks!
--Ross

Jeffrey Quinn

Apr 3, 2014, 8:34:44 PM4/3/14
to mr...@googlegroups.com
If you use the hadoop job runner, mrjob is going to spawn subprocesses that call the `hadoop` binary: `hadoop fs -put`, `hadoop fs -get`, etc. That is why you need to set the location of your local Hadoop installation when you use the hadoop job runner. There is no way for mrjob to interact with a remote Hadoop cluster if the `hadoop` CLI is not available on the local machine.
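To make that concrete, here's a minimal sketch of the kind of subprocess call involved. This is not mrjob's actual code; `hdfs_put` and its `dry_run` flag are names I made up for illustration:

```python
import subprocess

def hdfs_put(local_path, hdfs_path, hadoop_bin="hadoop", dry_run=False):
    """Copy a local file into HDFS by shelling out to the `hadoop` CLI,
    roughly the way a hadoop job runner does.  `hadoop_bin` must point
    at a working *local* Hadoop install -- there is no network protocol
    here, which is why a remote-only cluster doesn't help."""
    cmd = [hadoop_bin, "fs", "-put", local_path, hdfs_path]
    if dry_run:
        # Just return the command that would be executed.
        return cmd
    subprocess.check_call(cmd)
    return cmd
```

The point is that everything bottoms out in an `exec` of a local `hadoop` binary, which in turn needs Java and the Hadoop client jars on that machine.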

With EMR, mrjob gets away with routing all of its interactions through boto, which in turn relies on the nice REST APIs that AWS exposes.

Unfortunately, no equivalent REST API exists for a plain Hadoop cluster. There is a project called snakebite that seems to be working on one, however (disclaimer: I haven't tried it): http://labs.spotify.com/2013/05/07/snakebite/

tl;dr: as far as I can tell, you need Java, a local Hadoop install, the ability to start the JVM -- the whole nine yards -- to use the hadoop job runner. No way around it.
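For reference, once Hadoop *is* installed on the submitting machine, pointing mrjob at it looks something like this in mrjob.conf (the path is illustrative -- use wherever your install actually lives):

```yaml
runners:
  hadoop:
    # Must be a working hadoop binary on the machine running mrjob;
    # it is what mrjob shells out to for every HDFS and job operation.
    hadoop_bin: /usr/local/hadoop/bin/hadoop
```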