mongo-hadoop and hadoop-0.23.1


Mark Lewandowski

Mar 9, 2012, 3:36:26 PM
to mongod...@googlegroups.com
I'm currently trying to get mongo-hadoop working with hadoop-0.23.1 and streaming.  From the little documentation that exists on the web, I'm pretty certain that this is possible.

After installing hadoop and writing a quick test MR job, I tried running it with mongo-hadoop. The hadoop job reports failure (output pasted below), but when I look in mongo, the correct output is sitting in a new collection.
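
For reference, I'm checking the output collection with a quick pymongo snippet along these lines (pymongo 2.x API; the database and collection names match the job command below):

from pymongo import Connection

db = Connection('127.0.0.1')['path_production']
for doc in db['mr_usercount'].find():
    print doc   # shows the expected result, e.g. {'_id': 'user', 'count': 27}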

Any ideas?

Here's the hadoop output:

╰─➤  $HADOOP_COMMON_HOME/bin/hadoop jar /home/mark/workspace/mongo-hadoop/streaming/target/mongo-hadoop-streaming-assembly-1.0.0-rc1-SNAPSHOT.jar -mapper pymapper.py -reducer pyreducer.py -inputURI mongodb://127.0.0.1/path_production.users -outputURI mongodb://127.0.0.1/path_production.mr_usercount -file pymapper.py -file pyreducer.py
12/03/09 12:28:42 INFO streaming.MongoStreamJob: Running
12/03/09 12:28:42 INFO streaming.MongoStreamJob: Init
12/03/09 12:28:42 INFO streaming.MongoStreamJob: Process Args
12/03/09 12:28:42 INFO streaming.StreamJobPatch: Setup Options'
12/03/09 12:28:42 INFO streaming.StreamJobPatch: PreProcess Args
12/03/09 12:28:42 INFO streaming.StreamJobPatch: Parse Options
12/03/09 12:28:42 INFO streaming.StreamJobPatch: Arg: '-mapper'
12/03/09 12:28:42 INFO streaming.StreamJobPatch: Arg: 'pymapper.py'
12/03/09 12:28:42 INFO streaming.StreamJobPatch: Arg: '-reducer'
12/03/09 12:28:42 INFO streaming.StreamJobPatch: Arg: 'pyreducer.py'
12/03/09 12:28:42 INFO streaming.StreamJobPatch: Arg: '-inputURI'
12/03/09 12:28:42 INFO streaming.StreamJobPatch: Arg: 'mongodb://127.0.0.1/path_production.users'
12/03/09 12:28:42 INFO streaming.StreamJobPatch: Arg: '-outputURI'
12/03/09 12:28:42 INFO streaming.StreamJobPatch: Arg: 'mongodb://127.0.0.1/path_production.mr_usercount'
12/03/09 12:28:42 INFO streaming.StreamJobPatch: Arg: '-file'
12/03/09 12:28:42 INFO streaming.StreamJobPatch: Arg: 'pymapper.py'
12/03/09 12:28:42 INFO streaming.StreamJobPatch: Arg: '-file'
12/03/09 12:28:42 INFO streaming.StreamJobPatch: Arg: 'pyreducer.py'
12/03/09 12:28:42 INFO streaming.StreamJobPatch: Add InputSpecs
12/03/09 12:28:42 INFO streaming.StreamJobPatch: Setup output_
12/03/09 12:28:42 INFO streaming.StreamJobPatch: Post Process Args
12/03/09 12:28:42 INFO streaming.MongoStreamJob: Args processed.
12/03/09 12:28:43 INFO io.MongoIdentifierResolver: Resolving: bson
12/03/09 12:28:43 INFO io.MongoIdentifierResolver: Resolving: bson
12/03/09 12:28:43 INFO io.MongoIdentifierResolver: Resolving: bson
12/03/09 12:28:43 INFO io.MongoIdentifierResolver: Resolving: bson
12/03/09 12:28:43 INFO streaming.MongoStreamJob: Input Format: com.mongodb.hadoop.mapred.MongoInputFormat@d0721b0
12/03/09 12:28:43 INFO streaming.MongoStreamJob: Output Format: com.mongodb.hadoop.mapred.MongoOutputFormat@4f34b07e
12/03/09 12:28:43 INFO streaming.MongoStreamJob: Key Class: class com.mongodb.hadoop.io.BSONWritable
12/03/09 12:28:43 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
12/03/09 12:28:43 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
12/03/09 12:28:43 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12/03/09 12:28:43 WARN conf.Configuration: fs.default.name is deprecated. Instead, use fs.defaultFS
12/03/09 12:28:43 WARN conf.Configuration: mapred.used.genericoptionsparser is deprecated. Instead, use mapreduce.client.genericoptionsparser.used
12/03/09 12:28:43 INFO util.MongoSplitter:  Calculate Splits Code ... Use Shards? false, Use Chunks? true; Collection Sharded? false
12/03/09 12:28:43 INFO util.MongoSplitter: Creation of Input Splits is enabled.
12/03/09 12:28:43 INFO util.MongoSplitter: Using Unsharded Split mode (Calculating multiple splits though)
12/03/09 12:28:43 INFO util.MongoSplitter: Calculating unsharded input splits on namespace 'path_production.users' with Split Key '{ "_id" : 1}' and a split size of '8'mb per
12/03/09 12:28:43 WARN util.MongoSplitter: WARNING: No Input Splits were calculated by the split code. Proceeding with a *single* split. Data may be too small, try lowering 'mongo.input.split_size' if this is undesirable.
12/03/09 12:28:43 INFO input.MongoInputSplit: Creating a new MongoInputSplit for MongoURI 'mongodb://127.0.0.1/path_production.users', query: '{ "$query" : { }}', fieldSpec: '{ }', sort: '{ }', limit: 0, skip: 0 .
12/03/09 12:28:43 INFO mapreduce.JobSubmitter: number of splits:1
12/03/09 12:28:43 WARN mapred.LocalDistributedCacheManager: LocalJobRunner does not support symlinking into current working dir.
12/03/09 12:28:43 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
12/03/09 12:28:43 INFO mapred.LocalJobRunner: OutputCommitter set in config null
12/03/09 12:28:43 INFO mapreduce.Job: Running job: job_local_0001
12/03/09 12:28:43 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter
12/03/09 12:28:44 INFO mapred.LocalJobRunner: Waiting for map tasks
12/03/09 12:28:44 INFO mapred.LocalJobRunner: Starting task: attempt_local_0001_m_000000_0
12/03/09 12:28:44 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.yarn.util.LinuxResourceCalculatorPlugin@19381960
12/03/09 12:28:44 INFO mapred.MapTask: numReduceTasks: 1
12/03/09 12:28:44 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
12/03/09 12:28:44 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
12/03/09 12:28:44 INFO mapred.MapTask: soft limit at 83886080
12/03/09 12:28:44 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
12/03/09 12:28:44 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
12/03/09 12:28:44 INFO streaming.PipeMapRed: PipeMapRed exec [/home/mark/workspace/tmp/mongo-hadoop/./pymapper.py]
12/03/09 12:28:44 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
12/03/09 12:28:44 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
12/03/09 12:28:44 INFO mapreduce.Job: Job job_local_0001 running in uber mode : false
12/03/09 12:28:44 INFO mapreduce.Job:  map 0% reduce 0%
12/03/09 12:28:45 INFO input.MongoRecordReader: Cursor exhausted.
Done Mapping.
12/03/09 12:28:45 INFO streaming.PipeMapRed: Records R/W=27/1
12/03/09 12:28:45 INFO streaming.PipeMapRed: MRErrorThread done
12/03/09 12:28:45 INFO io.BSONWritable: No Length Header available.java.io.EOFException
12/03/09 12:28:45 INFO streaming.PipeMapRed: mapRedFinished
12/03/09 12:28:45 INFO mapred.LocalJobRunner:
12/03/09 12:28:45 INFO mapred.MapTask: Starting flush of map output
12/03/09 12:28:45 INFO mapred.MapTask: Spilling map output
12/03/09 12:28:45 INFO mapred.MapTask: bufstart = 0; bufend = 1323; bufvoid = 104857600
12/03/09 12:28:45 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214292(104857168); length = 105/6553600
12/03/09 12:28:45 INFO mapred.MapTask: Finished spill 0
12/03/09 12:28:45 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
12/03/09 12:28:45 INFO mapred.LocalJobRunner: Records R/W=27/1
12/03/09 12:28:45 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
12/03/09 12:28:45 INFO mapred.LocalJobRunner: Finishing task: attempt_local_0001_m_000000_0
12/03/09 12:28:45 INFO mapred.LocalJobRunner: Map task executor complete.
12/03/09 12:28:45 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.yarn.util.LinuxResourceCalculatorPlugin@8497904
12/03/09 12:28:45 INFO mapred.Merger: Merging 1 sorted segments
12/03/09 12:28:45 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 1358 bytes
12/03/09 12:28:45 INFO mapred.LocalJobRunner:
12/03/09 12:28:45 INFO streaming.PipeMapRed: PipeMapRed exec [/home/mark/workspace/tmp/mongo-hadoop/./pyreducer.py]
12/03/09 12:28:45 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
12/03/09 12:28:45 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
12/03/09 12:28:45 INFO streaming.PipeMapRed: MRErrorThread done
12/03/09 12:28:45 INFO streaming.PipeMapRed: Records R/W=27/1
12/03/09 12:28:45 INFO io.BSONWritable: No Length Header available.java.io.EOFException
12/03/09 12:28:45 INFO streaming.PipeMapRed: mapRedFinished
12/03/09 12:28:45 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
12/03/09 12:28:45 INFO mapred.LocalJobRunner: Records R/W=27/1 > reduce
12/03/09 12:28:45 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.
12/03/09 12:28:45 WARN mapred.LocalJobRunner: job_local_0001
java.io.FileNotFoundException: File file:/tmp/_temporary/0 does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:315)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1249)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1289)
    at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:540)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1249)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1289)
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.getAllCommittedTaskPaths(FileOutputCommitter.java:262)
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:302)
    at org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136)
    at org.apache.hadoop.mapred.OutputCommitter.commitJob(OutputCommitter.java:208)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:455)
12/03/09 12:28:46 INFO mapreduce.Job:  map 100% reduce 100%
12/03/09 12:28:46 INFO mapreduce.Job: Job job_local_0001 failed with state FAILED due to: NA
12/03/09 12:28:46 INFO mapreduce.Job: Counters: 27
    File System Counters
        FILE: Number of bytes read=902093
        FILE: Number of bytes written=1105556
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
    Map-Reduce Framework
        Map input records=27
        Map output records=27
        Map output bytes=1323
        Map output materialized bytes=1383
        Input split bytes=128
        Combine input records=0
        Combine output records=0
        Reduce input groups=1
        Reduce shuffle bytes=0
        Reduce input records=27
        Reduce output records=1
        Spilled Records=54
        Shuffled Maps =0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=88
        CPU time spent (ms)=0
        Physical memory (bytes) snapshot=0
        Virtual memory (bytes) snapshot=0
        Total committed heap usage (bytes)=351928320
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=0
12/03/09 12:28:46 ERROR streaming.StreamJob: Job not Successful!
MongoDB Streaming Command Failed!

-Mark

Brendan W. McAdams

Mar 9, 2012, 4:08:18 PM
to mongod...@googlegroups.com
Hadoop streaming seems to insist on working against a file on the filesystem, and we work around that… I'm wondering why it's tossing an exception here.

Are you running this in local mode? pseudo-distributed?


Mark Lewandowski

Mar 9, 2012, 4:12:59 PM
to mongod...@googlegroups.com
This is running in local mode, trying to get an idea of what the mongo-hadoop package is capable of before I develop against it in production.

Brendan W. McAdams

Mar 9, 2012, 4:26:53 PM
to mongod...@googlegroups.com
Local mode should definitely work without issues. Can you send me your mapper/reducer so I can take a look?


Mark Lewandowski

Mar 9, 2012, 4:38:36 PM
to mongod...@googlegroups.com
pymapper.py
-----------------------------------------------------------

#!/usr/bin/env python

import sys
sys.path.append(".")

from pymongo_hadoop import BSONMapper

def mapper(documents):
    for doc in documents:
        yield {'_id': 'user', 'count': 1}

BSONMapper(mapper)
print >> sys.stderr, "Done Mapping!!!"


pyreducer.py
-----------------------------------------------------------

#!/usr/bin/env python

import sys
sys.path.append('.')

from pymongo_hadoop import BSONReducer

def reducer(key, values):
    _count = 0
    for v in values:
        _count += v['count']
    return {'_id': key, 'count': _count}

BSONReducer(reducer)
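
(For anyone who wants to sanity-check the logic outside Hadoop: a throwaway harness along these lines drives the generators directly, bypassing BSONMapper/BSONReducer and the BSON framing entirely.)

#!/usr/bin/env python
# Hand-rolled test harness (not part of mongo-hadoop): feed the mapper
# some stand-in documents and reduce the result by hand.

def mapper(documents):
    for doc in documents:
        yield {'_id': 'user', 'count': 1}

def reducer(key, values):
    _count = 0
    for v in values:
        _count += v['count']
    return {'_id': key, 'count': _count}

docs = [{'name': 'user%d' % i} for i in range(27)]  # stand-in for the 27 input records
print reducer('user', list(mapper(docs)))           # {'_id': 'user', 'count': 27}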

Brendan W. McAdams

Mar 14, 2012, 3:24:39 PM
to mongodb-user
Mark,

I have been unable to reproduce this issue despite several
configurations and test beds. Are you still seeing problems?


Mark Lewandowski

Mar 19, 2012, 2:23:58 PM
to mongod...@googlegroups.com
Brendan,

The same issue is still occurring, but it does not seem to stop hadoop from finishing correctly.  I've just learned to ignore this for the time being, since I'm still in an experimentation phase.  When I begin rolling this out to a production environment I'll probably revisit this issue.

Thanks for looking into this.

-Mark

robee

Apr 27, 2012, 6:39:00 AM
to mongod...@googlegroups.com
I have the same problem.

Here are the mapper and reducer, run in local mode on a single-node hadoop 0.23.1 install on OS X Lion.

mapper :
#!/usr/bin/env python

import sys
sys.path.append(".")

from pymongo_hadoop import BSONMapper

def mapper(documents):
    """docstring for mapper"""
    for doc in documents:
        yield {'_id': doc['user']['time_zone'], 'count': 1}

BSONMapper(mapper)
print >> sys.stderr, "Done Mapping."

reducer:
#!/usr/bin/env python

import sys
sys.path.append(".")

from pymongo_hadoop import BSONReducer

def reducer(key, values):
    """docstring for reducer"""
    print >> sys.stderr, "Processing Timezon %s" % key
    _count = 0
    for v in values:
        _count += v['count']
    return {'_id': key, 'count': _count}

BSONReducer(reducer)

Has anyone solved this problem?

thanks.

Brendan W. McAdams

Apr 27, 2012, 10:46:32 AM
to mongod...@googlegroups.com
What exactly is the problem you are seeing when you run this?

robee

Apr 27, 2012, 10:50:53 AM
to mongod...@googlegroups.com
Here's the output from hadoop

hadoop jar mongo-hadoop-streaming-assembly-1.0.0.jar -mapper twit_map.py -reducer twit_reduce.py -inputURI mongodb://127.0.0.1/test.live -outputURI mongodb://127.0.0.1/test.twit_reduction -file twit_map.py -file twit_reduce.py

12/04/27 17:21:17 INFO streaming.MongoStreamJob: Running
12/04/27 17:21:17 INFO streaming.MongoStreamJob: Init
12/04/27 17:21:17 INFO streaming.MongoStreamJob: Process Args
12/04/27 17:21:17 INFO streaming.StreamJobPatch: Setup Options'
12/04/27 17:21:17 INFO streaming.StreamJobPatch: PreProcess Args
12/04/27 17:21:17 INFO streaming.StreamJobPatch: Parse Options
12/04/27 17:21:17 INFO streaming.StreamJobPatch: Arg: '-mapper'
12/04/27 17:21:17 INFO streaming.StreamJobPatch: Arg: 'twit_map.py'
12/04/27 17:21:17 INFO streaming.StreamJobPatch: Arg: '-reducer'
12/04/27 17:21:17 INFO streaming.StreamJobPatch: Arg: 'twit_reduce.py'
12/04/27 17:21:17 INFO streaming.StreamJobPatch: Arg: '-inputURI'
12/04/27 17:21:17 INFO streaming.StreamJobPatch: Arg: 'mongodb://127.0.0.1/test.live'
12/04/27 17:21:17 INFO streaming.StreamJobPatch: Arg: '-outputURI'
12/04/27 17:21:17 INFO streaming.StreamJobPatch: Arg: 'mongodb://127.0.0.1/test.twit_reduction'
12/04/27 17:21:17 INFO streaming.StreamJobPatch: Arg: '-file'
12/04/27 17:21:17 INFO streaming.StreamJobPatch: Arg: 'twit_map.py'
12/04/27 17:21:17 INFO streaming.StreamJobPatch: Arg: '-file'
12/04/27 17:21:17 INFO streaming.StreamJobPatch: Arg: 'twit_reduce.py'
12/04/27 17:21:17 INFO streaming.StreamJobPatch: Add InputSpecs
12/04/27 17:21:17 INFO streaming.StreamJobPatch: Setup output_
12/04/27 17:21:17 INFO streaming.StreamJobPatch: Post Process Args
12/04/27 17:21:17 INFO streaming.MongoStreamJob: Args processed.
2012-04-27 17:21:17.614 java[21100:1903] Unable to load realm info from SCDynamicStore
2012-04-27 17:21:17.752 java[21100:1903] Unable to load realm info from SCDynamicStore
12/04/27 17:21:18 INFO io.MongoIdentifierResolver: Resolving: bson
12/04/27 17:21:18 INFO io.MongoIdentifierResolver: Resolving: bson
12/04/27 17:21:18 INFO io.MongoIdentifierResolver: Resolving: bson
12/04/27 17:21:18 INFO io.MongoIdentifierResolver: Resolving: bson
packageJobJar: [twit_map.py, twit_reduce.py] [] /var/folders/wz/vmr658f56dn5ly79j7vzxy2c0000gq/T/streamjob668124738975197009.jar tmpDir=null
12/04/27 17:21:18 INFO streaming.MongoStreamJob: Input Format: com.mongodb.hadoop.mapred.MongoInputFormat@a50a649
12/04/27 17:21:18 INFO streaming.MongoStreamJob: Output Format: com.mongodb.hadoop.mapred.MongoOutputFormat@34d507e9
12/04/27 17:21:18 INFO streaming.MongoStreamJob: Key Class: class com.mongodb.hadoop.io.BSONWritable
12/04/27 17:21:18 WARN conf.Configuration: session.id is deprecated. Instead, use dfs.metrics.session-id
12/04/27 17:21:18 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
12/04/27 17:21:18 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
12/04/27 17:21:18 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12/04/27 17:21:18 WARN conf.Configuration: fs.default.name is deprecated. Instead, use fs.defaultFS
12/04/27 17:21:18 WARN conf.Configuration: mapred.used.genericoptionsparser is deprecated. Instead, use mapreduce.client.genericoptionsparser.used
12/04/27 17:21:18 INFO util.MongoSplitter:  Calculate Splits Code ... Use Shards? false, Use Chunks? true; Collection Sharded? false
12/04/27 17:21:18 INFO util.MongoSplitter: Creation of Input Splits is enabled.
12/04/27 17:21:18 INFO util.MongoSplitter: Using Unsharded Split mode (Calculating multiple splits though)
12/04/27 17:21:18 INFO util.MongoSplitter: Calculating unsharded input splits on namespace 'test.live' with Split Key '{ "_id" : 1}' and a split size of '8'mb per
12/04/27 17:21:18 INFO util.MongoSplitter: Calculated 5 splits.
12/04/27 17:21:18 INFO input.MongoInputSplit: Creating a new MongoInputSplit for MongoURI 'mongodb://127.0.0.1/test.live', query: '{ "$query" : { } , "$max" : { "_id" : { "$oid" : "4f981eb458407f60e02ec0ab"}}}', fieldSpec: '{ }', sort: '{ }', limit: 0, skip: 0 .
12/04/27 17:21:18 INFO input.MongoInputSplit: Creating a new MongoInputSplit for MongoURI 'mongodb://127.0.0.1/test.live', query: '{ "$query" : { } , "$min" : { "_id" : { "$oid" : "4f981eb458407f60e02ec0ab"}} , "$max" : { "_id" : { "$oid" : "4f9916fa6ada108fa2771c7f"}}}', fieldSpec: '{ }', sort: '{ }', limit: 0, skip: 0 .
12/04/27 17:21:18 INFO input.MongoInputSplit: Creating a new MongoInputSplit for MongoURI 'mongodb://127.0.0.1/test.live', query: '{ "$query" : { } , "$min" : { "_id" : { "$oid" : "4f9916fa6ada108fa2771c7f"}} , "$max" : { "_id" : { "$oid" : "4f9917486ada108fa27723e2"}}}', fieldSpec: '{ }', sort: '{ }', limit: 0, skip: 0 .
12/04/27 17:21:18 INFO input.MongoInputSplit: Creating a new MongoInputSplit for MongoURI 'mongodb://127.0.0.1/test.live', query: '{ "$query" : { } , "$min" : { "_id" : { "$oid" : "4f9917486ada108fa27723e2"}} , "$max" : { "_id" : { "$oid" : "4f9917b16ada108fa2772b45"}}}', fieldSpec: '{ }', sort: '{ }', limit: 0, skip: 0 .
12/04/27 17:21:18 INFO input.MongoInputSplit: Creating a new MongoInputSplit for MongoURI 'mongodb://127.0.0.1/test.live', query: '{ "$query" : { } , "$min" : { "_id" : { "$oid" : "4f9917b16ada108fa2772b45"}} , "$max" : { "_id" : { "$oid" : "4f9918596ada108fa27732a8"}}}', fieldSpec: '{ }', sort: '{ }', limit: 0, skip: 0 .
12/04/27 17:21:18 INFO input.MongoInputSplit: Creating a new MongoInputSplit for MongoURI 'mongodb://127.0.0.1/test.live', query: '{ "$query" : { } , "$min" : { "_id" : { "$oid" : "4f9918596ada108fa27732a8"}}}', fieldSpec: '{ }', sort: '{ }', limit: 0, skip: 0 .
12/04/27 17:21:18 INFO mapreduce.JobSubmitter: number of splits:6
12/04/27 17:21:19 WARN mapred.LocalDistributedCacheManager: LocalJobRunner does not support symlinking into current working dir.
12/04/27 17:21:19 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
12/04/27 17:21:19 INFO mapred.LocalJobRunner: OutputCommitter set in config null
12/04/27 17:21:19 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter
12/04/27 17:21:19 INFO mapreduce.Job: Running job: job_local_0001
12/04/27 17:21:19 INFO mapred.LocalJobRunner: Waiting for map tasks
12/04/27 17:21:19 INFO mapred.LocalJobRunner: Starting task: attempt_local_0001_m_000000_0
12/04/27 17:21:19 INFO mapred.Task:  Using ResourceCalculatorPlugin : null
12/04/27 17:21:19 INFO mapred.MapTask: numReduceTasks: 1
12/04/27 17:21:19 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
12/04/27 17:21:19 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
12/04/27 17:21:19 INFO mapred.MapTask: soft limit at 83886080
12/04/27 17:21:19 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
12/04/27 17:21:19 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
12/04/27 17:21:19 INFO streaming.PipeMapRed: PipeMapRed exec [/Users/tkid/Projects/mongo-hadoop/target/./twit_map.py]
12/04/27 17:21:19 WARN conf.Configuration: fs.default.name is deprecated. Instead, use fs.defaultFS
12/04/27 17:21:20 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
12/04/27 17:21:20 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
12/04/27 17:21:20 INFO mapreduce.Job: Job job_local_0001 running in uber mode : false
12/04/27 17:21:20 INFO mapreduce.Job:  map 0% reduce 0%
12/04/27 17:21:20 INFO streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]
12/04/27 17:21:21 INFO streaming.PipeMapRed: Records R/W=518/1
12/04/27 17:21:22 INFO streaming.PipeMapRed: R/W/S=1000/728/0 in:500=1000/2 [rec/s] out:364=728/2 [rec/s]
12/04/27 17:21:23 INFO input.MongoRecordReader: Cursor exhausted.
Done Mapping.
12/04/27 17:21:23 INFO streaming.PipeMapRed: MRErrorThread done
12/04/27 17:21:23 INFO io.BSONWritable: No Length Header available.java.io.EOFException
12/04/27 17:21:23 INFO streaming.PipeMapRed: mapRedFinished
12/04/27 17:21:23 INFO mapred.LocalJobRunner: 
12/04/27 17:21:23 INFO mapred.MapTask: Starting flush of map output
12/04/27 17:21:23 INFO mapred.MapTask: Spilling map output
12/04/27 17:21:23 INFO mapred.MapTask: bufstart = 0; bufend = 109200; bufvoid = 104857600
12/04/27 17:21:23 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26206840(104827360); length = 7557/6553600
12/04/27 17:21:24 INFO mapred.MapTask: Finished spill 0
12/04/27 17:21:24 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
12/04/27 17:21:24 INFO mapred.LocalJobRunner: Records R/W=518/1
12/04/27 17:21:24 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
12/04/27 17:21:24 INFO mapred.LocalJobRunner: Finishing task: attempt_local_0001_m_000000_0
12/04/27 17:21:24 INFO mapred.LocalJobRunner: Starting task: attempt_local_0001_m_000001_0
12/04/27 17:21:24 INFO mapred.Task:  Using ResourceCalculatorPlugin : null
12/04/27 17:21:24 INFO mapred.MapTask: numReduceTasks: 1
12/04/27 17:21:24 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
12/04/27 17:21:24 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
12/04/27 17:21:24 INFO mapred.MapTask: soft limit at 83886080
12/04/27 17:21:24 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
12/04/27 17:21:24 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
12/04/27 17:21:24 INFO streaming.PipeMapRed: PipeMapRed exec [/Users/tkid/Projects/mongo-hadoop/target/./twit_map.py]
12/04/27 17:21:24 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
12/04/27 17:21:24 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
12/04/27 17:21:25 INFO streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]
12/04/27 17:21:25 INFO mapreduce.Job:  map 100% reduce 0%
12/04/27 17:21:26 INFO streaming.PipeMapRed: Records R/W=516/1
12/04/27 17:21:26 INFO streaming.PipeMapRed: R/W/S=1000/801/0 in:1000=1000/1 [rec/s] out:801=801/1 [rec/s]
12/04/27 17:21:27 INFO input.MongoRecordReader: Cursor exhausted.
Done Mapping.
12/04/27 17:21:27 INFO streaming.PipeMapRed: MRErrorThread done
12/04/27 17:21:27 INFO io.BSONWritable: No Length Header available.java.io.EOFException
12/04/27 17:21:27 INFO streaming.PipeMapRed: mapRedFinished
12/04/27 17:21:27 INFO mapred.LocalJobRunner: 
12/04/27 17:21:27 INFO mapred.MapTask: Starting flush of map output
12/04/27 17:21:27 INFO mapred.MapTask: Spilling map output
12/04/27 17:21:27 INFO mapred.MapTask: bufstart = 0; bufend = 105925; bufvoid = 104857600
12/04/27 17:21:27 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26206836(104827344); length = 7561/6553600
12/04/27 17:21:27 INFO mapred.MapTask: Finished spill 0
12/04/27 17:21:27 INFO mapred.Task: Task:attempt_local_0001_m_000001_0 is done. And is in the process of commiting
12/04/27 17:21:27 INFO mapred.LocalJobRunner: Records R/W=516/1
12/04/27 17:21:27 INFO mapred.Task: Task 'attempt_local_0001_m_000001_0' done.
12/04/27 17:21:27 INFO mapred.LocalJobRunner: Finishing task: attempt_local_0001_m_000001_0
12/04/27 17:21:27 INFO mapred.LocalJobRunner: Starting task: attempt_local_0001_m_000002_0
12/04/27 17:21:27 INFO mapred.Task:  Using ResourceCalculatorPlugin : null
12/04/27 17:21:27 INFO mapred.MapTask: numReduceTasks: 1
12/04/27 17:21:28 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
12/04/27 17:21:28 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
12/04/27 17:21:28 INFO mapred.MapTask: soft limit at 83886080
12/04/27 17:21:28 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
12/04/27 17:21:28 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
12/04/27 17:21:28 INFO streaming.PipeMapRed: PipeMapRed exec [/Users/tkid/Projects/mongo-hadoop/target/./twit_map.py]
12/04/27 17:21:28 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
12/04/27 17:21:28 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
12/04/27 17:21:28 INFO streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]
12/04/27 17:21:29 INFO streaming.PipeMapRed: Records R/W=557/1
12/04/27 17:21:29 INFO streaming.PipeMapRed: R/W/S=1000/505/0 in:1000=1000/1 [rec/s] out:505=505/1 [rec/s]
12/04/27 17:21:30 INFO input.MongoRecordReader: Cursor exhausted.
Done Mapping.
12/04/27 17:21:30 INFO streaming.PipeMapRed: MRErrorThread done
12/04/27 17:21:30 INFO io.BSONWritable: No Length Header available.java.io.EOFException
12/04/27 17:21:30 INFO streaming.PipeMapRed: mapRedFinished
12/04/27 17:21:30 INFO mapred.LocalJobRunner: 
12/04/27 17:21:30 INFO mapred.MapTask: Starting flush of map output
12/04/27 17:21:30 INFO mapred.MapTask: Spilling map output
12/04/27 17:21:30 INFO mapred.MapTask: bufstart = 0; bufend = 103041; bufvoid = 104857600
12/04/27 17:21:30 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26206836(104827344); length = 7561/6553600
12/04/27 17:21:30 INFO mapred.MapTask: Finished spill 0
12/04/27 17:21:30 INFO mapred.Task: Task:attempt_local_0001_m_000002_0 is done. And is in the process of commiting
12/04/27 17:21:30 INFO mapred.LocalJobRunner: Records R/W=557/1
12/04/27 17:21:30 INFO mapred.Task: Task 'attempt_local_0001_m_000002_0' done.
12/04/27 17:21:30 INFO mapred.LocalJobRunner: Finishing task: attempt_local_0001_m_000002_0
12/04/27 17:21:30 INFO mapred.LocalJobRunner: Starting task: attempt_local_0001_m_000003_0
12/04/27 17:21:30 INFO mapred.Task:  Using ResourceCalculatorPlugin : null
12/04/27 17:21:30 INFO mapred.MapTask: numReduceTasks: 1
12/04/27 17:21:31 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
12/04/27 17:21:31 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
12/04/27 17:21:31 INFO mapred.MapTask: soft limit at 83886080
12/04/27 17:21:31 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
12/04/27 17:21:31 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
12/04/27 17:21:31 INFO streaming.PipeMapRed: PipeMapRed exec [/Users/tkid/Projects/mongo-hadoop/target/./twit_map.py]
12/04/27 17:21:31 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
12/04/27 17:21:31 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
12/04/27 17:21:31 INFO streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]
12/04/27 17:21:32 INFO streaming.PipeMapRed: Records R/W=527/1
12/04/27 17:21:32 INFO streaming.PipeMapRed: R/W/S=1000/499/0 in:1000=1000/1 [rec/s] out:499=499/1 [rec/s]
12/04/27 17:21:33 INFO input.MongoRecordReader: Cursor exhausted.
Done Mapping.
12/04/27 17:21:33 INFO io.BSONWritable: No Length Header available.java.io.EOFException
12/04/27 17:21:33 INFO streaming.PipeMapRed: MRErrorThread done
12/04/27 17:21:33 INFO streaming.PipeMapRed: mapRedFinished
12/04/27 17:21:33 INFO mapred.LocalJobRunner: 
12/04/27 17:21:33 INFO mapred.MapTask: Starting flush of map output
12/04/27 17:21:33 INFO mapred.MapTask: Spilling map output
12/04/27 17:21:33 INFO mapred.MapTask: bufstart = 0; bufend = 102427; bufvoid = 104857600
12/04/27 17:21:33 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26206836(104827344); length = 7561/6553600
12/04/27 17:21:33 INFO mapred.MapTask: Finished spill 0
12/04/27 17:21:33 INFO mapred.Task: Task:attempt_local_0001_m_000003_0 is done. And is in the process of commiting
12/04/27 17:21:33 INFO mapred.LocalJobRunner: Records R/W=527/1
12/04/27 17:21:33 INFO mapred.Task: Task 'attempt_local_0001_m_000003_0' done.
12/04/27 17:21:33 INFO mapred.LocalJobRunner: Finishing task: attempt_local_0001_m_000003_0
12/04/27 17:21:33 INFO mapred.LocalJobRunner: Starting task: attempt_local_0001_m_000004_0
12/04/27 17:21:33 INFO mapred.Task:  Using ResourceCalculatorPlugin : null
12/04/27 17:21:33 INFO mapred.MapTask: numReduceTasks: 1
12/04/27 17:21:33 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
12/04/27 17:21:33 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
12/04/27 17:21:33 INFO mapred.MapTask: soft limit at 83886080
12/04/27 17:21:33 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
12/04/27 17:21:33 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
12/04/27 17:21:33 INFO streaming.PipeMapRed: PipeMapRed exec [/Users/tkid/Projects/mongo-hadoop/target/./twit_map.py]
12/04/27 17:21:34 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
12/04/27 17:21:34 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
12/04/27 17:21:34 INFO streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]
12/04/27 17:21:34 INFO streaming.PipeMapRed: Records R/W=559/1
12/04/27 17:21:35 INFO streaming.PipeMapRed: R/W/S=1000/515/0 in:1000=1000/1 [rec/s] out:515=515/1 [rec/s]
12/04/27 17:21:36 INFO input.MongoRecordReader: Cursor exhausted.
Done Mapping.
12/04/27 17:21:36 INFO streaming.PipeMapRed: MRErrorThread done
12/04/27 17:21:36 INFO io.BSONWritable: No Length Header available.java.io.EOFException
12/04/27 17:21:36 INFO streaming.PipeMapRed: mapRedFinished
12/04/27 17:21:36 INFO mapred.LocalJobRunner: 
12/04/27 17:21:36 INFO mapred.MapTask: Starting flush of map output
12/04/27 17:21:36 INFO mapred.MapTask: Spilling map output
12/04/27 17:21:36 INFO mapred.MapTask: bufstart = 0; bufend = 99915; bufvoid = 104857600
12/04/27 17:21:36 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26206836(104827344); length = 7561/6553600
12/04/27 17:21:36 INFO mapreduce.Job:  map 66% reduce 0%
12/04/27 17:21:36 INFO mapred.MapTask: Finished spill 0
12/04/27 17:21:36 INFO mapred.Task: Task:attempt_local_0001_m_000004_0 is done. And is in the process of commiting
12/04/27 17:21:36 INFO mapred.LocalJobRunner: Records R/W=559/1
12/04/27 17:21:36 INFO mapred.Task: Task 'attempt_local_0001_m_000004_0' done.
12/04/27 17:21:36 INFO mapred.LocalJobRunner: Finishing task: attempt_local_0001_m_000004_0
12/04/27 17:21:36 INFO mapred.LocalJobRunner: Starting task: attempt_local_0001_m_000005_0
12/04/27 17:21:36 INFO mapred.Task:  Using ResourceCalculatorPlugin : null
12/04/27 17:21:36 INFO mapred.MapTask: numReduceTasks: 1
12/04/27 17:21:36 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
12/04/27 17:21:36 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
12/04/27 17:21:36 INFO mapred.MapTask: soft limit at 83886080
12/04/27 17:21:36 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
12/04/27 17:21:36 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
12/04/27 17:21:36 INFO streaming.PipeMapRed: PipeMapRed exec [/Users/tkid/Projects/mongo-hadoop/target/./twit_map.py]
12/04/27 17:21:36 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
12/04/27 17:21:36 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
12/04/27 17:21:37 INFO streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]
12/04/27 17:21:37 INFO mapreduce.Job:  map 100% reduce 0%
12/04/27 17:21:37 INFO streaming.PipeMapRed: Records R/W=552/1
12/04/27 17:21:38 INFO streaming.PipeMapRed: R/W/S=1000/504/0 in:1000=1000/1 [rec/s] out:504=504/1 [rec/s]
12/04/27 17:21:38 INFO input.MongoRecordReader: Cursor exhausted.
Done Mapping.
12/04/27 17:21:38 INFO io.BSONWritable: No Length Header available.java.io.EOFException
12/04/27 17:21:38 INFO streaming.PipeMapRed: MRErrorThread done
12/04/27 17:21:38 INFO streaming.PipeMapRed: mapRedFinished
12/04/27 17:21:38 INFO mapred.LocalJobRunner: 
12/04/27 17:21:38 INFO mapred.MapTask: Starting flush of map output
12/04/27 17:21:38 INFO mapred.MapTask: Spilling map output
12/04/27 17:21:38 INFO mapred.MapTask: bufstart = 0; bufend = 90089; bufvoid = 104857600
12/04/27 17:21:38 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26207708(104830832); length = 6689/6553600
12/04/27 17:21:39 INFO mapred.MapTask: Finished spill 0
12/04/27 17:21:39 INFO mapred.Task: Task:attempt_local_0001_m_000005_0 is done. And is in the process of commiting
12/04/27 17:21:39 INFO mapred.LocalJobRunner: Records R/W=552/1
12/04/27 17:21:39 INFO mapred.Task: Task 'attempt_local_0001_m_000005_0' done.
12/04/27 17:21:39 INFO mapred.LocalJobRunner: Finishing task: attempt_local_0001_m_000005_0
12/04/27 17:21:39 INFO mapred.LocalJobRunner: Map task executor complete.
12/04/27 17:21:39 INFO mapred.Task:  Using ResourceCalculatorPlugin : null
12/04/27 17:21:39 INFO mapred.Merger: Merging 6 sorted segments
12/04/27 17:21:39 INFO mapred.Merger: Down to the last merge-pass, with 6 segments left of total size: 632791 bytes
12/04/27 17:21:39 INFO mapred.LocalJobRunner: 
12/04/27 17:21:39 INFO streaming.PipeMapRed: PipeMapRed exec [/Users/tkid/Projects/mongo-hadoop/target/./twit_reduce.py]
12/04/27 17:21:39 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
12/04/27 17:21:39 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
12/04/27 17:21:39 INFO streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]
12/04/27 17:21:39 INFO streaming.PipeMapRed: R/W/S=1000/0/0 in:NA [rec/s] out:NA [rec/s]
Processing Timezon None
Processing Timezon Abu Dhabi
Processing Timezon Adelaide
Processing Timezon Alaska
Processing Timezon Almaty
Processing Timezon Amsterdam
Processing Timezon Arizona
Processing Timezon Astana
Processing Timezon Athens
Processing Timezon Atlantic Time (Canada)
Processing Timezon Auckland
Processing Timezon Azores
Processing Timezon Baghdad
Processing Timezon Bangkok
Processing Timezon Beijing
Processing Timezon Belgrade
Processing Timezon Berlin
Processing Timezon Bern
Processing Timezon Bogota
Processing Timezon Brasilia
Processing Timezon Brisbane
Processing Timezon Brussels
Processing Timezon Bucharest
Processing Timezon Budapest
Processing Timezon Buenos Aires
Processing Timezon Cairo
Processing Timezon Canberra
Processing Timezon Cape Verde Is.
Processing Timezon Caracas
Processing Timezon Casablanca
Processing Timezon Central America
Processing Timezon Central Time (US & Canada)
Processing Timezon Chennai
Processing Timezon Chihuahua
Processing Timezon Copenhagen
Processing Timezon Dhaka
Processing Timezon Dublin
Processing Timezon Eastern Time (US & Canada)
Processing Timezon Edinburgh
Processing Timezon Ekaterinburg
Processing Timezon Fiji
Processing Timezon Georgetown
Processing Timezon Greenland
Processing Timezon Guadalajara
Processing Timezon Guam
Processing Timezon Hanoi
Processing Timezon Harare
Processing Timezon Hawaii
Processing Timezon Helsinki
Processing Timezon Hong Kong
Processing Timezon Indiana (East)
Processing Timezon International Date Line West
Processing Timezon Irkutsk
Processing Timezon Islamabad
Processing Timezon Istanbul
Processing Timezon Jakarta
Processing Timezon Jerusalem
Processing Timezon Kabul
Processing Timezon Karachi
Processing Timezon Kathmandu
Processing Timezon Kuala Lumpur
Processing Timezon Kuwait
Processing Timezon Kyiv
Processing Timezon La Paz
Processing Timezon Lima
Processing Timezon Lisbon
Processing Timezon Ljubljana
Processing Timezon London
Processing Timezon Madrid
Processing Timezon Mazatlan
Processing Timezon Melbourne
Processing Timezon Mexico City
Processing Timezon Mid-Atlantic
Processing Timezon Minsk
Processing Timezon Monterrey
Processing Timezon Moscow
Processing Timezon Mountain Time (US & Canada)
Processing Timezon Mumbai
Processing Timezon Muscat
Processing Timezon Nairobi
Processing Timezon New Caledonia
Processing Timezon New Delhi
Processing Timezon Novosibirsk
Processing Timezon Nuku'alofa
Processing Timezon Osaka
Processing Timezon Pacific Time (US & Canada)
12/04/27 17:21:40 INFO streaming.PipeMapRed: R/W/S=10000/0/0 in:10000=10000/1 [rec/s] out:0=0/1 [rec/s]
Processing Timezon Paris
Processing Timezon Perth
Processing Timezon Prague
Processing Timezon Pretoria
Processing Timezon Quito
Processing Timezon Riga
Processing Timezon Riyadh
Processing Timezon Rome
Processing Timezon Santiago
Processing Timezon Sapporo
Processing Timezon Sarajevo
Processing Timezon Seoul
Processing Timezon Singapore
Processing Timezon St. Petersburg
Processing Timezon Stockholm
Processing Timezon Sydney
Processing Timezon Taipei
Processing Timezon Tallinn
Processing Timezon Tashkent
Processing Timezon Tehran
Processing Timezon Tijuana
Processing Timezon Tokyo
Processing Timezon Ulaan Bataar
Processing Timezon Vienna
Processing Timezon Warsaw
Processing Timezon Wellington
Processing Timezon West Central Africa
Processing Timezon Yakutsk
Processing Timezon Zagreb
12/04/27 17:21:40 INFO streaming.PipeMapRed: MRErrorThread done
12/04/27 17:21:40 INFO streaming.PipeMapRed: Records R/W=11127/1
12/04/27 17:21:40 INFO io.BSONWritable: No Length Header available.java.io.EOFException
12/04/27 17:21:40 INFO streaming.PipeMapRed: mapRedFinished
12/04/27 17:21:40 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
12/04/27 17:21:40 INFO mapred.LocalJobRunner: Records R/W=11127/1 > reduce
12/04/27 17:21:40 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.
12/04/27 17:21:40 WARN mapred.LocalJobRunner: job_local_0001
java.io.FileNotFoundException: File file:/tmp/_temporary/0 does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:315)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1249)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1289)
    at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:540)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1249)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1289)
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.getAllCommittedTaskPaths(FileOutputCommitter.java:262)
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:302)
    at org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136)
    at org.apache.hadoop.mapred.OutputCommitter.commitJob(OutputCommitter.java:208)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:455)
12/04/27 17:21:41 INFO mapreduce.Job:  map 100% reduce 100%
12/04/27 17:21:41 INFO mapreduce.Job: Job job_local_0001 failed with state FAILED due to: NA
12/04/27 17:21:41 INFO mapreduce.Job: Counters: 29
    File System Counters
        FILE: Number of bytes read=669175
        FILE: Number of bytes written=3766514
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=0
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=0
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
    Map-Reduce Framework
        Map input records=11127
        Map output records=11127
        Map output bytes=610597
        Map output materialized bytes=632887
        Input split bytes=1306
        Combine input records=0
        Combine output records=0
        Reduce input groups=115
        Reduce shuffle bytes=0
        Reduce input records=11127
        Reduce output records=115
        Spilled Records=22254
        Shuffled Maps =0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=2262
        Total committed heap usage (bytes)=1720918016
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=0
12/04/27 17:21:41 ERROR streaming.StreamJob: Job not Successful!
MongoDB Streaming Command Failed!


-- 
robee

Brendan W. McAdams

Apr 27, 2012, 10:58:51 AM
to mongod...@googlegroups.com
The previous poster indicated that despite this error, the data was successfully written to his mongodb output collection. Can you verify this?

We are looking into the filesystem error separately, but since we do not use the filesystem at any time, we believe it to be a false error and the job to have succeeded.
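
A quick way to verify from Python (pymongo 2.x API; database and collection names taken from your command, adjust as needed):

from pymongo import Connection

coll = Connection('127.0.0.1')['test']['twit_reduction']
print coll.count()                    # should line up with "Reduce output records=115"
for doc in coll.find().sort('count', -1).limit(5):
    print doc                         # top time zones by count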

robee

Apr 27, 2012, 11:00:26 AM
to mongod...@googlegroups.com
Oh right, the data was successfully written to the mongodb output collection.

-- 
robee

Tyler Brock

May 10, 2012, 8:54:07 AM
to mongodb-user
I'm having the same problem... any updates on this issue?

-Tyler


Tyler Brock

May 10, 2012, 9:15:00 AM
to mongodb-user
So the issue is hadoop 0.23.1; if you downgrade to 0.23.0 it works just fine.

You can use this homebrew formula if you want:
https://github.com/TylerBrock/evil-formulas/blob/master/0.23.0/hadoop.rb

brew uninstall hadoop, then brew install
https://raw.github.com/TylerBrock/evil-formulas/master/0.23.0/hadoop.rb

Cheers,

Tyler

Jesse Sanford

Aug 23, 2012, 3:07:52 PM
to mongod...@googlegroups.com
I am seeing something similar when using hadoop 0.20.2 from CDH3:

2012-08-23 14:35:41,708 INFO org.apache.hadoop.streaming.PipeMapRed: R/W/S=1000/0/0 in:NA [rec/s] out:NA [rec/s]
2012-08-23 14:35:42,163 INFO com.mongodb.hadoop.io.BSONWritable: No Length Header available.java.io.EOFException
2012-08-23 14:35:42,163 INFO org.apache.hadoop.streaming.PipeMapRed: MRErrorThread done
2012-08-23 14:35:42,163 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed failed!
2012-08-23 14:35:42,184 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2012-08-23 14:35:42,202 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
	at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
	at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
	at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:137)
	at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:479)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
	at org.apache.hadoop.mapred.Child.main(Child.java:264)