Streaming to Hadoop with Node.js


tom lurge

Nov 19, 2013, 5:56:49 PM
to mongod...@googlegroups.com
Hi all,

I've written a rather involved mapReduce job in JavaScript and ran it on MongoDB 2.4.6. As was to be expected, the performance is not good enough. I'd like to run this job with Node.js streaming on Hadoop to get an idea of how much performance is gained by skipping the de-serialization step from BSON to JSON (and back again). I think (hope) I have set everything up correctly so far, but I don't know how to start the job. The only documentation I can find is the streaming readme [1], and it is rather sparse on the topic. Probably I'm missing something obvious, since I have no experience at all with Node.js. Can somebody give me a hint on how to proceed?

Regards,
Thomas


[1] https://github.com/mongodb/mongo-hadoop/blob/master/streaming/README.md
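
(Note for later readers: as far as the script itself goes, a streaming mapper is just an executable that reads records from stdin and writes key/value pairs to stdout. The sketch below uses the plain tab-separated text convention of Hadoop streaming; with -io mongodb and the MongoIdentifierResolver the framing is actually BSON, so treat this only as an illustration of the general contract. The "country" field is a made-up placeholder, not something from the original job.)

    #!/usr/bin/env node
    // Minimal Hadoop-streaming-style mapper: read one record per line from
    // stdin and emit "key<TAB>value" lines on stdout. The "country" field is
    // a placeholder, not taken from the original job.
    var readline = require('readline');

    var rl = readline.createInterface({ input: process.stdin, terminal: false });

    rl.on('line', function (line) {
      var doc;
      try {
        doc = JSON.parse(line);   // assumes one JSON document per input line
      } catch (e) {
        return;                   // skip anything that doesn't parse
      }
      if (doc.country !== undefined) {
        process.stdout.write(doc.country + '\t1\n');
      }
    });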

tom lurge

Nov 20, 2013, 1:01:17 PM
to mongod...@googlegroups.com
I may have found the answer to my question, but now I'm running into more problems. This is the streaming command I use (broken into lines for readability):

hadoop jar /usr/local/Cellar/hadoop/1.1.2/libexec/contrib/streaming/hadoop-streaming-1.1.2.jar \
    -libjars      /usr/local/Cellar/hadoop/1.1.2/libexec/contrib/streaming/mongo-hadoop-streaming-assembly-1.1.0.jar \
    -input        /tmp/in \
    -output       /tmp/out \
    -inputformat  com.mongodb.hadoop.mapred.MongoInputFormat \
    -outputformat com.mongodb.hadoop.mapred.MongoOutputFormat \
    -jobconf      mongo.input.uri=mongodb://127.0.0.1:27017/visionion.import?readPreference=primary \
    -jobconf      mongo.output.uri=mongodb://127.0.0.1:27017/visionion.hadoopfacts \
    -jobconf      stream.io.identifier.resolver.class=com.mongodb.hadoop.streaming.io.MongoIdentifierResolver \
    -io           mongodb \
    -mapper       /Users/me/aggregation/hadoop/mapper.js \
    -reducer      /Users/me/aggregation/hadoop/reducer.js \
    -jobconf      mongo.input.query={_id:{\\$date:1365030000000}}
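
(Aside: the -mapper and -reducer flags above point at plain Node scripts. In the tab-separated text protocol of Hadoop streaming, a matching reducer receives the mapper output sorted by key and aggregates each run of identical keys. Again, this is only a sketch of the generic contract, not of the BSON framing used with -io mongodb.)

    #!/usr/bin/env node
    // Minimal Hadoop-streaming-style reducer: input arrives as sorted
    // "key<TAB>value" lines; sum the values for each run of identical keys.
    var readline = require('readline');

    var rl = readline.createInterface({ input: process.stdin, terminal: false });
    var currentKey = null;
    var sum = 0;

    function flush() {
      if (currentKey !== null) {
        process.stdout.write(currentKey + '\t' + sum + '\n');
      }
    }

    rl.on('line', function (line) {
      var parts = line.split('\t');
      var key = parts[0];
      var value = parseInt(parts[1], 10) || 0;
      if (key !== currentKey) {
        flush();          // emit the total for the previous key
        currentKey = key;
        sum = 0;
      }
      sum += value;
    });

    rl.on('close', flush);  // emit the total for the last key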


But I'm getting this:


13/11/20 18:53:45 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
13/11/20 18:53:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/11/20 18:53:45 WARN mapred.JobClient: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
13/11/20 18:53:46 INFO mapred.MongoInputFormat: Using com.mongodb.hadoop.splitter.StandaloneMongoSplitter@102f729e to calculate splits. (old mapreduce API)
13/11/20 18:53:46 INFO splitter.StandaloneMongoSplitter: Running splitvector to check splits against mongodb://127.0.0.1:27017/visionion.import?readPreference=primary
13/11/20 18:53:51 INFO filecache.TrackerDistributedCacheManager: Creating mongo-hadoop-streaming-assembly-1.1.0.jar in /tmp/hadoop-tl/mapred/local/archive/4825921399136019333_-2111303317_1891993332/file/usr/local/Cellar/hadoop/1.1.2/libexec/contrib/streaming/mongo-hadoop-streaming-assembly-1.1.0.jar-work--1602227508575918435 with rwxr-xr-x
13/11/20 18:53:51 INFO filecache.TrackerDistributedCacheManager: Extracting /tmp/hadoop-tl/mapred/local/archive/4825921399136019333_-2111303317_1891993332/file/usr/local/Cellar/hadoop/1.1.2/libexec/contrib/streaming/mongo-hadoop-streaming-assembly-1.1.0.jar-work--1602227508575918435/mongo-hadoop-streaming-assembly-1.1.0.jar to /tmp/hadoop-tl/mapred/local/archive/4825921399136019333_-2111303317_1891993332/file/usr/local/Cellar/hadoop/1.1.2/libexec/contrib/streaming/mongo-hadoop-streaming-assembly-1.1.0.jar-work--1602227508575918435
13/11/20 18:53:51 INFO filecache.TrackerDistributedCacheManager: Cached file:///usr/local/Cellar/hadoop/1.1.2/libexec/contrib/streaming/mongo-hadoop-streaming-assembly-1.1.0.jar as /tmp/hadoop-tl/mapred/local/archive/4825921399136019333_-2111303317_1891993332/file/usr/local/Cellar/hadoop/1.1.2/libexec/contrib/streaming/mongo-hadoop-streaming-assembly-1.1.0.jar
13/11/20 18:53:51 INFO filecache.TrackerDistributedCacheManager: Cached file:///usr/local/Cellar/hadoop/1.1.2/libexec/contrib/streaming/mongo-hadoop-streaming-assembly-1.1.0.jar as /tmp/hadoop-tl/mapred/local/archive/4825921399136019333_-2111303317_1891993332/file/usr/local/Cellar/hadoop/1.1.2/libexec/contrib/streaming/mongo-hadoop-streaming-assembly-1.1.0.jar
13/11/20 18:53:51 WARN mapred.LocalJobRunner: LocalJobRunner does not support symlinking into current working dir.
13/11/20 18:53:51 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-tl/mapred/local]
13/11/20 18:53:51 INFO streaming.StreamJob: Running job: job_local_0001
13/11/20 18:53:51 INFO streaming.StreamJob: Job running in-process (local Hadoop)
13/11/20 18:53:51 INFO mapred.Task:  Using ResourceCalculatorPlugin : null
13/11/20 18:53:51 INFO mapred.MapTask: numReduceTasks: 1
13/11/20 18:53:51 INFO mapred.MapTask: io.sort.mb = 100
13/11/20 18:53:51 INFO mapred.MapTask: data buffer = 79691776/99614720
13/11/20 18:53:51 INFO mapred.MapTask: record buffer = 262144/327680
13/11/20 18:53:51 INFO streaming.PipeMapRed: PipeMapRed exec [/Users/me/aggregation/hadoop/mapper.js]
java.io.IOException: Cannot run program "/Users/me/aggregation/hadoop/mapper.js": error=13, Permission denied
    at java.lang.ProcessBuilder.processException(ProcessBuilder.java:478)
...
13/11/20 18:53:51 ERROR streaming.PipeMapRed: configuration exception
java.io.IOException: Cannot run program "/Users/me/aggregation/hadoop/mapper.js": error=13, Permission denied
...
13/11/20 18:53:51 WARN mapred.LocalJobRunner: job_local_0001
java.lang.RuntimeException: Error in configuring object
...
...
...
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
...Caused by: java.io.IOException: Cannot run program "/Users/me/aggregation/hadoop/mapper.js": error=13, Permission denied
    at java.lang.ProcessBuilder.processException(ProcessBuilder.java:478)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:457)
    at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:214)
    ... 19 more
Caused by: java.io.IOException: error=13, Permission denied
    at java.lang.UNIXProcess.forkAndExec(Native Method)
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:53)
    at java.lang.ProcessImpl.start(ProcessImpl.java:91)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
    ... 20 more
13/11/20 18:53:52 INFO streaming.StreamJob:  map 0%  reduce 0%
13/11/20 18:53:52 INFO streaming.StreamJob: Job running in-process (local Hadoop)
13/11/20 18:53:52 ERROR streaming.StreamJob: Job not successful. Error: NA
13/11/20 18:53:52 INFO streaming.StreamJob: killJob...
Streaming Command Failed!


I already changed the permissions for mapper.js to 755 just to be sure, but to no avail. Can somebody please shed some light on where I should look for the error?

Regards,
Thomas
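
(A note on the error=13 above: besides the mode bits on mapper.js itself, it is worth checking that every parent directory on the path is traversable by the user the task runs as, and that the script starts with a shebang line, since the task tracker execs the file directly rather than invoking node on it. Roughly, the first lines of the script need to look like this; the body here is only a placeholder identity mapper.)

    #!/usr/bin/env node
    // The shebang above is required because Hadoop streaming execs the script
    // directly. It also needs the execute bit (chmod 755 mapper.js) and
    // execute/search permission on every parent directory for the task's user.
    process.stdin.pipe(process.stdout);   // placeholder: copy input through unchanged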

tom lurge

Nov 22, 2013, 8:56:10 AM
to mongod...@googlegroups.com
Sorry to bump the question, but there has been no reply for three days. Any help would be appreciated!

Do people actually use Node.js streaming? Is it ready for not-so-skilled people like me, and is it supported? I'm asking because it's hard to find any questions, let alone answers, about it here or on Stack Overflow.

Cheers,

Thomas



tom lurge

Nov 22, 2013, 10:04:03 AM
to mongod...@googlegroups.com
Hmm, I found the reason for that error: it really was a permissions problem after all. Silly me, sorry for the inconvenience!
I'm getting other errors now which I'm examining... but these are "good", "real" errors: script errors.
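
(For chasing script errors like these, it can help to run the mapper completely outside Hadoop and feed it a few sample records by hand; a small sketch, assuming the line-oriented text protocol from the earlier mapper sketch and an executable ./mapper.js in the current directory:)

    // test-mapper.js -- pipe a few hand-made sample documents through mapper.js
    // locally, outside Hadoop, to surface script errors quickly. The sample
    // fields and the ./mapper.js path are assumptions for illustration.
    var spawn = require('child_process').spawn;

    var samples = [
      { _id: 1, country: 'DE' },
      { _id: 2, country: 'FR' },
      { _id: 3, country: 'DE' }
    ];

    var mapper = spawn('./mapper.js');

    mapper.stdout.pipe(process.stdout);   // show the mapper's output
    mapper.stderr.pipe(process.stderr);   // and any script errors

    samples.forEach(function (doc) {
      mapper.stdin.write(JSON.stringify(doc) + '\n');
    });
    mapper.stdin.end();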