Hello,

I am new to this forum and have been looking around for information, but I can't find anything relevant. I am running a MapReduce job on Amazon EMR using the latest Mongo-Hadoop connector and Hadoop 0.21-something.

My MongoDB database is unsharded, and the input collection has 9.4 million records. When I examine the output, it looks like all the splits are being created properly and all records should be accounted for. However, the MapReduce output says there were only ~295k input records. I should be expecting 9.4m, yes?

When I look at my db:

> db.crawled.count()
9483528

But the MR output says:

map 100% reduce 100%
2012-10-01 00:30:40,167 INFO org.apache.hadoop.mapred.JobClient (main): Job complete: job_201209302024_0001
2012-10-01 00:30:40,204 INFO org.apache.hadoop.mapred.JobClient (main): Counters: 30
2012-10-01 00:30:40,204 INFO org.apache.hadoop.mapred.JobClient (main): Job Counters
2012-10-01 00:30:40,204 INFO org.apache.hadoop.mapred.JobClient (main): Launched reduce tasks=459
2012-10-01 00:30:40,204 INFO org.apache.hadoop.mapred.JobClient (main): SLOTS_MILLIS_MAPS=6186073223
2012-10-01 00:30:40,205 INFO org.apache.hadoop.mapred.JobClient (main): Total time spent by all reduces waiting after reserving slots (ms)=0
2012-10-01 00:30:40,205 INFO org.apache.hadoop.mapred.JobClient (main): Total time spent by all maps waiting after reserving slots (ms)=0
2012-10-01 00:30:40,205 INFO org.apache.hadoop.mapred.JobClient (main): Rack-local map tasks=2992
2012-10-01 00:30:40,205 INFO org.apache.hadoop.mapred.JobClient (main): Launched map tasks=2992
2012-10-01 00:30:40,205 INFO org.apache.hadoop.mapred.JobClient (main): SLOTS_MILLIS_REDUCES=2459973706
2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main): Failed map tasks=1
2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main): File Input Format Counters
2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main): Bytes Read=0
2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main): File Output Format Counters
2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main): Bytes Written=0
2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main): FileSystemCounters
2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main): FILE_BYTES_READ=783482
2012-10-01 00:30:40,207 INFO org.apache.hadoop.mapred.JobClient (main): HDFS_BYTES_READ=720739
2012-10-01 00:30:40,207 INFO org.apache.hadoop.mapred.JobClient (main): FILE_BYTES_WRITTEN=105144496
2012-10-01 00:30:40,207 INFO org.apache.hadoop.mapred.JobClient (main): Map-Reduce Framework
2012-10-01 00:30:40,207 INFO org.apache.hadoop.mapred.JobClient (main): Map output materialized bytes=15042281
2012-10-01 00:30:40,207 INFO org.apache.hadoop.mapred.JobClient (main): Map input records=294990
2012-10-01 00:30:40,207 INFO org.apache.hadoop.mapred.JobClient (main): Reduce shuffle bytes=14983089
2012-10-01 00:30:40,208 INFO org.apache.hadoop.mapred.JobClient (main): Spilled Records=13133
2012-10-01 00:30:40,208 INFO org.apache.hadoop.mapred.JobClient (main): Map output bytes=1153768
2012-10-01 00:30:40,208 INFO org.apache.hadoop.mapred.JobClient (main): Total committed heap usage (bytes)=1089867415552
2012-10-01 00:30:40,208 INFO org.apache.hadoop.mapred.JobClient (main): CPU time spent (ms)=72729860
2012-10-01 00:30:40,208 INFO org.apache.hadoop.mapred.JobClient (main): Map input bytes=0
2012-10-01 00:30:40,208 INFO org.apache.hadoop.mapred.JobClient (main): SPLIT_RAW_BYTES=720739
2012-10-01 00:30:40,208 INFO org.apache.hadoop.mapred.JobClient (main): Combine input records=11954
2012-10-01 00:30:40,209 INFO org.apache.hadoop.mapred.JobClient (main): Reduce input records=6381
2012-10-01 00:30:40,209 INFO org.apache.hadoop.mapred.JobClient (main): Reduce input groups=5806
2012-10-01 00:30:40,209 INFO org.apache.hadoop.mapred.JobClient (main): Combine output records=10582
2012-10-01 00:30:40,209 INFO org.apache.hadoop.mapred.JobClient (main): Physical memory (bytes) snapshot=1345551327232
2012-10-01 00:30:40,209 INFO org.apache.hadoop.mapred.JobClient (main): Reduce output records=5806
2012-10-01 00:30:40,209 INFO org.apache.hadoop.mapred.JobClient (main): Virtual memory (bytes) snapshot=2689983688704
2012-10-01 00:30:40,209 INFO org.apache.hadoop.mapred.JobClient (main): Map output records=7753

See the "Map input records=294990" line above. There are no other errors aside from the occasional failed task. I am also definitely getting orders of magnitude fewer reduce output records than I expect.

I am using all the standard configuration options, and I get a reasonable split output:

2012-09-30 20:25:40,669 INFO com.flicket.hadoop.Matcher (main): Setting input URI: mongodb://<REDACTED>.compute-1.amazonaws.com/flicket.crawled
2012-09-30 20:25:40,733 INFO com.flicket.hadoop.Matcher (main): Setting output URI: mongodb://<REDACTED>.compute-1.amazonaws.com/flicket.matched_20120930
2012-09-30 20:25:41,719 INFO org.apache.hadoop.mapred.JobClient (main): Default number of map tasks: null
2012-09-30 20:25:41,719 INFO org.apache.hadoop.mapred.JobClient (main): Setting default number of map tasks based on cluster size to : 6
2012-09-30 20:25:41,719 INFO org.apache.hadoop.mapred.JobClient (main): Default number of reduce tasks: 351
2012-09-30 20:25:41,919 INFO org.apache.hadoop.mapred.JobClient (main): Setting group to hadoop
2012-09-30 20:25:42,385 INFO com.mongodb.hadoop.util.MongoSplitter (main): Calculate Splits Code ... Use Shards? false, Use Chunks? true; Collection Sharded? false
2012-09-30 20:25:42,385 INFO com.mongodb.hadoop.util.MongoSplitter (main): Creation of Input Splits is enabled.
2012-09-30 20:25:42,385 INFO com.mongodb.hadoop.util.MongoSplitter (main): Using Unsharded Split mode (Calculating multiple splits though)
2012-09-30 20:25:42,409 INFO com.mongodb.hadoop.util.MongoSplitter (main): Calculating unsharded input splits on namespace 'flicket.crawled' with Split Key '{ "_id" : 1}' and a split size of '8'mb per
2012-09-30 20:25:47,471 INFO com.mongodb.hadoop.util.MongoSplitter (main): Calculated 2478 splits.

It then goes on to list all of the splits, which look perfectly acceptable.

I also looked in the individual syslogs for each task, and I can see all of the contents of my input collection there (though they are hard to read; they seem to be partially binary).

I am confused and can't figure out what's going on. Any hints on what else I should look at?

Thanks,
Eric
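
To put a number on the discrepancy described above, here is a minimal Python sketch that pulls counters out of JobClient log lines and compares "Map input records" against the collection count. The names `parse_counters`, `SAMPLE_LOG`, `EXPECTED_DOCS`, and `SPLITS` are hypothetical helpers introduced for illustration; the values are copied from the logs and the `db.crawled.count()` output in this thread.

```python
import re

# Matches JobClient counter lines such as
# "... JobClient (main): Map input records=294990".
COUNTER_RE = re.compile(r"\(main\):\s+(?P<name>[^=]+?)=(?P<value>\d+)\s*$")


def parse_counters(log_text):
    """Return a dict of counter name -> integer value found in the log text."""
    counters = {}
    for line in log_text.splitlines():
        match = COUNTER_RE.search(line)
        if match:
            counters[match.group("name").strip()] = int(match.group("value"))
    return counters


# Three counter lines copied from the job output in this thread.
SAMPLE_LOG = """\
2012-10-01 00:30:40,205 INFO org.apache.hadoop.mapred.JobClient (main): Launched map tasks=2992
2012-10-01 00:30:40,207 INFO org.apache.hadoop.mapred.JobClient (main): Map input records=294990
2012-10-01 00:30:40,209 INFO org.apache.hadoop.mapred.JobClient (main): Map output records=7753
"""

EXPECTED_DOCS = 9483528  # from db.crawled.count()
SPLITS = 2478            # from "Calculated 2478 splits."

counters = parse_counters(SAMPLE_LOG)
seen = counters["Map input records"]

fraction_seen = seen / EXPECTED_DOCS
seen_per_split = seen / SPLITS
expected_per_split = EXPECTED_DOCS / SPLITS

print(f"mappers saw {fraction_seen:.2%} of the collection")
print(f"records per split: saw about {seen_per_split:.0f}, "
      f"expected about {expected_per_split:.0f}")
```

With these figures the mappers read roughly 3.1% of the collection, or about 119 records per split where an even spread would give about 3,827. This only quantifies the gap; it does not say where the missing records went.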

2012-09-30 20:25:49,003 INFO com.mongodb.hadoop.mapred.input.MongoInputSplit (main): Creating a new MongoInputSplit for MongoURI 'mongodb://REDACTED.compute-1.amazonaws.com/flicket.crawled', query: '{ "$query" : { } , "$min" : { "_id" : "http://www.220.ro/desene-animate/Foamy-Si-Uraganul/UCLSo9zokS/"} , "$max" : { "_id" : "http://www.220.ro/documentare/Pleo-Robot-Sau-Animal-De-Casa/1nhjDTywjZ/"}}', fieldSpec: '{ }', sort: '{ }', limit: 0, skip: 0 .
2012-09-30 20:25:49,003 INFO com.mongodb.hadoop.mapred.input.MongoInputSplit (main): Creating a new MongoInputSplit for MongoURI 'mongodb://REDACTED.compute-1.amazonaws.com/flicket.crawled', query: '{ "$query" : { } , "$min" : { "_id" : "http://www.220.ro/documentare/Pleo-Robot-Sau-Animal-De-Casa/1nhjDTywjZ/"} , "$max" : { "_id" : "http://www.220.ro/faze-tari/Au-Ajuns-Faimosi-Pe-Internet-Blocati-In-Aeroport/mosMTHYCAu/"}}', fieldSpec: '{ }', sort: '{ }', limit: 0, skip: 0 .
2012-09-30 20:25:49,003 INFO com.mongodb.hadoop.mapred.input.MongoInputSplit (main): Creating a new MongoInputSplit for MongoURI 'mongodb://REDACTED.compute-1.amazonaws.com/flicket.crawled', query: '{ "$query" : { } , "$min" : { "_id" : "http://www.220.ro/faze-tari/Au-Ajuns-Faimosi-Pe-Internet-Blocati-In-Aeroport/mosMTHYCAu/"} , "$max" : { "_id" : "http://www.220.ro/videoclipuri/Florinl-Si-Ioana-Iubirea-Nu-Are-Lege/pNNJJlR543/"}}', fieldSpec: '{ }', sort: '{ }', limit: 0, skip: 0 .
2012-09-30 20:25:49,003 INFO com.mongodb.hadoop.mapred.input.MongoInputSplit (main): Creating a new MongoInputSplit for MongoURI 'mongodb://REDACTED.compute-1.amazonaws.com/flicket.crawled', query: '{ "$query" : { } , "$min" : { "_id" : "http://www.220.ro/videoclipuri/Florinl-Si-Ioana-Iubirea-Nu-Are-Lege/pNNJJlR543/"} , "$max" : { "_id" : "http://www.69stream.com/EN/scheda/7526/aiutami-figlio-mio.html"}}', fieldSpec: '{ }', sort: '{ }', limit: 0, skip: 0 .
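
One way to check that the listed splits really tile the `_id` range is to confirm that each split's `$max` equals the next split's `$min`. The following is a rough Python sketch under some assumptions: the helper names `split_bounds` and `find_gaps` are hypothetical, the sample text is trimmed to just the query portion of two of the log lines above, and the regular expression only handles string `_id` values as they appear in these logs. Splits without both a `$min` and a `$max` (for example an open-ended first or last split, if the connector emits one) would be skipped by this pattern.

```python
import re

# Pulls the "$min" and "$max" _id strings out of a logged split query.
BOUNDS_RE = re.compile(
    r'"\$min"\s*:\s*\{\s*"_id"\s*:\s*"(?P<lo>[^"]*)"\s*\}\s*,\s*'
    r'"\$max"\s*:\s*\{\s*"_id"\s*:\s*"(?P<hi>[^"]*)"\s*\}'
)


def split_bounds(log_text):
    """Return (min_id, max_id) pairs in the order they appear in the log."""
    return [(m.group("lo"), m.group("hi")) for m in BOUNDS_RE.finditer(log_text)]


def find_gaps(bounds):
    """Return (prev_max, next_min) pairs where consecutive splits do not meet."""
    gaps = []
    for (_, prev_max), (next_min, _) in zip(bounds, bounds[1:]):
        if prev_max != next_min:
            gaps.append((prev_max, next_min))
    return gaps


# Query portions of the first two split log lines, trimmed for brevity.
SAMPLE_SPLITS = r'''
query: { "$query" : { } , "$min" : { "_id" : "http://www.220.ro/desene-animate/Foamy-Si-Uraganul/UCLSo9zokS/"} , "$max" : { "_id" : "http://www.220.ro/documentare/Pleo-Robot-Sau-Animal-De-Casa/1nhjDTywjZ/"}}
query: { "$query" : { } , "$min" : { "_id" : "http://www.220.ro/documentare/Pleo-Robot-Sau-Animal-De-Casa/1nhjDTywjZ/"} , "$max" : { "_id" : "http://www.220.ro/faze-tari/Au-Ajuns-Faimosi-Pe-Internet-Blocati-In-Aeroport/mosMTHYCAu/"}}
'''

bounds = split_bounds(SAMPLE_SPLITS)
gaps = find_gaps(bounds)
print(f"{len(bounds)} splits parsed, {len(gaps)} gaps between consecutive splits")
```

An empty gap list only shows that the key ranges are contiguous. It does not prove that each mapper actually reads every document inside its range, so a record shortfall can still originate in how the `$min`/`$max` query is executed.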
{"content_url":"http://www.amazon.com/Batman-Dark-Knight-Returns-Part/dp/B009GEAPYW","archive_date":1349211077783,"title":"Batman: The Dark Knight Returns Part 1","released":"2012","duration":"77","genre":["animation","action"],"actor":["peter weller","michael emerson"],"director":["jay oliva"],"keyword":["peter weller","michael emerson","david selby","michael mckean","ariel winter","wade williams","jay oliva","alan burnett","bob goodman","batman: the dark knight returns part 1"],"description":"It is ten years after an aging Batman has retired, and Gotham City has sunk deeper into decadence and lawlessness. Now, when his city needs him most, he returns in a blaze of glory."}