I am new to this forum and have been looking around for some information, but can't seem to find anything relevant. I am running a MR job on Amazon EMR using the latest Mongo-Hadoop connector and Hadoop 0.21something.
My mongo db is unsharded, and the input collection has 9.4 million records. When I examine the output, it seems like all the splits are being created properly and all records should be accounted for. However, the MapReduce output says that there were only ~295k input records. I should be expecting 9.4m, yes?
When I look at my db:
> db.crawled.count()
9483528
But the MR output says:
map 100% reduce 100% 2012-10-01 00:30:40,167 INFO org.apache.hadoop.mapred.JobClient (main): Job complete: job_201209302024_0001 2012-10-01 00:30:40,204 INFO org.apache.hadoop.mapred.JobClient (main): Counters: 30 2012-10-01 00:30:40,204 INFO org.apache.hadoop.mapred.JobClient (main): Job Counters 2012-10-01 00:30:40,204 INFO org.apache.hadoop.mapred.JobClient (main): Launched reduce tasks=459 2012-10-01 00:30:40,204 INFO org.apache.hadoop.mapred.JobClient (main): SLOTS_MILLIS_MAPS=6186073223 2012-10-01 00:30:40,205 INFO org.apache.hadoop.mapred.JobClient (main): Total time spent by all reduces waiting after reserving slots (ms)=0 2012-10-01 00:30:40,205 INFO org.apache.hadoop.mapred.JobClient (main): Total time spent by all maps waiting after reserving slots (ms)=0 2012-10-01 00:30:40,205 INFO org.apache.hadoop.mapred.JobClient (main): Rack-local map tasks=2992 2012-10-01 00:30:40,205 INFO org.apache.hadoop.mapred.JobClient (main): Launched map tasks=2992 2012-10-01 00:30:40,205 INFO org.apache.hadoop.mapred.JobClient (main): SLOTS_MILLIS_REDUCES=2459973706 2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main): Failed map tasks=1 2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main): File Input Format Counters 2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main): Bytes Read=0 2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main): File Output Format Counters 2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main): Bytes Written=0 2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main): FileSystemCounters 2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main): FILE_BYTES_READ=783482 2012-10-01 00:30:40,207 INFO org.apache.hadoop.mapred.JobClient (main): HDFS_BYTES_READ=720739 2012-10-01 00:30:40,207 INFO org.apache.hadoop.mapred.JobClient (main): FILE_BYTES_WRITTEN=105144496 2012-10-01 00:30:40,207 INFO org.apache.hadoop.mapred.JobClient (main): Map-Reduce Framework 2012-10-01 00:30:40,207 INFO org.apache.hadoop.mapred.JobClient (main): Map output materialized bytes=15042281*2012-10-01 00:30:40,207 INFO org.apache.hadoop.mapred.JobClient (main): Map input records=294990* 2012-10-01 00:30:40,207 INFO org.apache.hadoop.mapred.JobClient (main): Reduce shuffle bytes=14983089 2012-10-01 00:30:40,208 INFO org.apache.hadoop.mapred.JobClient (main): Spilled Records=13133 2012-10-01 00:30:40,208 INFO org.apache.hadoop.mapred.JobClient (main): Map output bytes=1153768 2012-10-01 00:30:40,208 INFO org.apache.hadoop.mapred.JobClient (main): Total committed heap usage (bytes)=1089867415552 2012-10-01 00:30:40,208 INFO org.apache.hadoop.mapred.JobClient (main): CPU time spent (ms)=72729860 2012-10-01 00:30:40,208 INFO org.apache.hadoop.mapred.JobClient (main): Map input bytes=0 2012-10-01 00:30:40,208 INFO org.apache.hadoop.mapred.JobClient (main): SPLIT_RAW_BYTES=720739 2012-10-01 00:30:40,208 INFO org.apache.hadoop.mapred.JobClient (main): Combine input records=11954 2012-10-01 00:30:40,209 INFO org.apache.hadoop.mapred.JobClient (main): Reduce input records=6381 2012-10-01 00:30:40,209 INFO org.apache.hadoop.mapred.JobClient (main): Reduce input groups=5806 2012-10-01 00:30:40,209 INFO org.apache.hadoop.mapred.JobClient (main): Combine output records=10582 2012-10-01 00:30:40,209 INFO org.apache.hadoop.mapred.JobClient (main): Physical memory (bytes) snapshot=1345551327232 2012-10-01 00:30:40,209 INFO org.apache.hadoop.mapred.JobClient (main): Reduce output records=5806 2012-10-01 00:30:40,209 INFO org.apache.hadoop.mapred.JobClient (main): Virtual memory (bytes) snapshot=2689983688704 2012-10-01 00:30:40,209 INFO org.apache.hadoop.mapred.JobClient (main): Map output records=7753
See bolded line above. There are no other errors aside from the occasional failed task. I am also definitely getting orders of magnitude fewer reduce output records than I expect.
I am using all the standard configuration options, and get a reasonable split output:
2012-09-30 20:25:40,669 INFO com.flicket.hadoop.Matcher (main): Setting input URI: mongodb://<REDACTED>.compute-1.amazonaws.com/flicket.crawled 2012-09-30 20:25:40,733 INFO com.flicket.hadoop.Matcher (main): Setting output URI: mongodb://<REDACTED>.compute-1.amazonaws.com/flicket.matched_20120930
2012-09-30 20:25:41,719 INFO org.apache.hadoop.mapred.JobClient (main): Default number of map tasks: null 2012-09-30 20:25:41,719 INFO org.apache.hadoop.mapred.JobClient (main): Setting default number of map tasks based on cluster size to : 6 2012-09-30 20:25:41,719 INFO org.apache.hadoop.mapred.JobClient (main): Default number of reduce tasks: 351 2012-09-30 20:25:41,919 INFO org.apache.hadoop.mapred.JobClient (main): Setting group to hadoop 2012-09-30 20:25:42,385 INFO com.mongodb.hadoop.util.MongoSplitter (main): Calculate Splits Code ... Use Shards? false, Use Chunks? true; Collection Sharded? false 2012-09-30 20:25:42,385 INFO com.mongodb.hadoop.util.MongoSplitter (main): Creation of Input Splits is enabled. 2012-09-30 20:25:42,385 INFO com.mongodb.hadoop.util.MongoSplitter (main): Using Unsharded Split mode (Calculating multiple splits though) 2012-09-30 20:25:42,409 INFO com.mongodb.hadoop.util.MongoSplitter (main): Calculating unsharded input splits on namespace 'flicket.crawled' with Split Key '{ "_id" : 1}' and a split size of '8'mb per 2012-09-30 20:25:47,471 INFO com.mongodb.hadoop.util.MongoSplitter (main): Calculated 2478 splits.
It then goes on to list all of the splits which are perfectly acceptable. I also looked in the individual syslogs for each task, and I can see all of the contents of my input collection (though kind of unreadable -- it seems they are partially binary?).
I am confused and can't figure out what's going on. Any hints on what else I should look at?
Could you please upload the actual map/reduce command that you passed to the database to start this whole process? It is possible that something in the command could be causing this issue. It would also be helpful to see the schema of the documents that are going into your map reduce.
I don't know what could be causing this problem right now, hopefully with some more information we can figure it out.
On Monday, October 1, 2012 3:59:58 PM UTC-7, Eric Lange wrote:
> Hello,
> I am new to this forum and have been looking around for some information, > but can't seem to find anything relevant. I am running a MR job on Amazon > EMR using the latest Mongo-Hadoop connector and Hadoop 0.21something.
> My mongo db is unsharded, and the input collection has 9.4 million > records. When I examine the output, it seems like all the splits are being > created properly and all records should be accounted for. However, the > MapReduce output says that there were only ~295k input records. I should > be expecting 9.4m, yes?
> When I look at my db:
> > db.crawled.count() > 9483528
> But the MR output says:
> map 100% reduce 100% > 2012-10-01 00:30:40,167 INFO org.apache.hadoop.mapred.JobClient (main): Job complete: job_201209302024_0001 > 2012-10-01 00:30:40,204 INFO org.apache.hadoop.mapred.JobClient (main): Counters: 30 > 2012-10-01 00:30:40,204 INFO org.apache.hadoop.mapred.JobClient (main): Job Counters > 2012-10-01 00:30:40,204 INFO org.apache.hadoop.mapred.JobClient (main): Launched reduce tasks=459 > 2012-10-01 00:30:40,204 INFO org.apache.hadoop.mapred.JobClient (main): SLOTS_MILLIS_MAPS=6186073223 > 2012-10-01 00:30:40,205 INFO org.apache.hadoop.mapred.JobClient (main): Total time spent by all reduces waiting after reserving slots (ms)=0 > 2012-10-01 00:30:40,205 INFO org.apache.hadoop.mapred.JobClient (main): Total time spent by all maps waiting after reserving slots (ms)=0 > 2012-10-01 00:30:40,205 INFO org.apache.hadoop.mapred.JobClient (main): Rack-local map tasks=2992 > 2012-10-01 00:30:40,205 INFO org.apache.hadoop.mapred.JobClient (main): Launched map tasks=2992 > 2012-10-01 00:30:40,205 INFO org.apache.hadoop.mapred.JobClient (main): SLOTS_MILLIS_REDUCES=2459973706 > 2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main): Failed map tasks=1 > 2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main): File Input Format Counters > 2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main): Bytes Read=0 > 2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main): File Output Format Counters > 2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main): Bytes Written=0 > 2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main): FileSystemCounters > 2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main): FILE_BYTES_READ=783482 > 2012-10-01 00:30:40,207 INFO org.apache.hadoop.mapred.JobClient (main): HDFS_BYTES_READ=720739 > 2012-10-01 00:30:40,207 INFO org.apache.hadoop.mapred.JobClient (main): FILE_BYTES_WRITTEN=105144496 > 2012-10-01 00:30:40,207 INFO org.apache.hadoop.mapred.JobClient (main): Map-Reduce Framework > 2012-10-01 00:30:40,207 INFO org.apache.hadoop.mapred.JobClient (main): Map output materialized bytes=15042281*2012-10-01 00:30:40,207 INFO org.apache.hadoop.mapred.JobClient (main): Map input records=294990* > 2012-10-01 00:30:40,207 INFO org.apache.hadoop.mapred.JobClient (main): Reduce shuffle bytes=14983089 > 2012-10-01 00:30:40,208 INFO org.apache.hadoop.mapred.JobClient (main): Spilled Records=13133 > 2012-10-01 00:30:40,208 INFO org.apache.hadoop.mapred.JobClient (main): Map output bytes=1153768 > 2012-10-01 00:30:40,208 INFO org.apache.hadoop.mapred.JobClient (main): Total committed heap usage (bytes)=1089867415552 > 2012-10-01 00:30:40,208 INFO org.apache.hadoop.mapred.JobClient (main): CPU time spent (ms)=72729860 > 2012-10-01 00:30:40,208 INFO org.apache.hadoop.mapred.JobClient (main): Map input bytes=0 > 2012-10-01 00:30:40,208 INFO org.apache.hadoop.mapred.JobClient (main): SPLIT_RAW_BYTES=720739 > 2012-10-01 00:30:40,208 INFO org.apache.hadoop.mapred.JobClient (main): Combine input records=11954 > 2012-10-01 00:30:40,209 INFO org.apache.hadoop.mapred.JobClient (main): Reduce input records=6381 > 2012-10-01 00:30:40,209 INFO org.apache.hadoop.mapred.JobClient (main): Reduce input groups=5806 > 2012-10-01 00:30:40,209 INFO org.apache.hadoop.mapred.JobClient (main): Combine output records=10582 > 2012-10-01 00:30:40,209 INFO org.apache.hadoop.mapred.JobClient (main): Physical memory (bytes) snapshot=1345551327232 > 2012-10-01 00:30:40,209 INFO org.apache.hadoop.mapred.JobClient (main): Reduce output records=5806 > 2012-10-01 00:30:40,209 INFO org.apache.hadoop.mapred.JobClient (main): Virtual memory (bytes) snapshot=2689983688704 > 2012-10-01 00:30:40,209 INFO org.apache.hadoop.mapred.JobClient (main): Map output records=7753
> See bolded line above. There are no other errors aside from the occasional failed task. I am also definitely getting orders of magnitude fewer reduce output records than I expect.
> I am using all the standard configuration options, and get a reasonable split output:
> 2012-09-30 20:25:41,719 INFO org.apache.hadoop.mapred.JobClient (main): Default number of map tasks: null > 2012-09-30 20:25:41,719 INFO org.apache.hadoop.mapred.JobClient (main): Setting default number of map tasks based on cluster size to : 6 > 2012-09-30 20:25:41,719 INFO org.apache.hadoop.mapred.JobClient (main): Default number of reduce tasks: 351 > 2012-09-30 20:25:41,919 INFO org.apache.hadoop.mapred.JobClient (main): Setting group to hadoop > 2012-09-30 20:25:42,385 INFO com.mongodb.hadoop.util.MongoSplitter (main): Calculate Splits Code ... Use Shards? false, Use Chunks? true; Collection Sharded? false > 2012-09-30 20:25:42,385 INFO com.mongodb.hadoop.util.MongoSplitter (main): Creation of Input Splits is enabled. > 2012-09-30 20:25:42,385 INFO com.mongodb.hadoop.util.MongoSplitter (main): Using Unsharded Split mode (Calculating multiple splits though) > 2012-09-30 20:25:42,409 INFO com.mongodb.hadoop.util.MongoSplitter (main): Calculating unsharded input splits on namespace 'flicket.crawled' with Split Key '{ "_id" : 1}' and a split size of '8'mb per > 2012-09-30 20:25:47,471 INFO com.mongodb.hadoop.util.MongoSplitter (main): Calculated 2478 splits.
> It then goes on to list all of the splits which are perfectly acceptable. I also looked in the individual syslogs for each task, and I can see all of the contents of my input collection (though kind of unreadable -- it seems they are partially binary?).
> I am confused and can't figure out what's going on. Any hints on what else I should look at?
Thanks for the response. I am not sure I totally understand what you mean by the "map/reduce command that you passed to the database". I am passing no query. I want to process every record in the collection. I am not sure if this is relevant or not, but here is the elastic-mapreduce command I run to kick the process off.
{"content_url":"http://www.amazon.com/Batman-Dark-Knight-Returns-Part/dp/B009GEAPYW","archive_date":1349211077783,"title":"Batman: The Dark Knight Returns Part 1","released":"2012","duration":"77","genre":["animation","action"],"actor" :["peter weller","michael emerson"],"director":["jay oliva"],"keyword":["peter weller","michael emerson","david selby","michael mckean","ariel winter","wade williams","jay oliva","alan burnett","bob goodman","batman: the dark knight returns part 1"],"description":"It is ten years after an aging Batman has retired, and Gotham City has sunk deeper into decadence and lawlessness. Now, when his city needs him most, he returns in a blaze of glory."} and store it in the collection "flicket.crawled". This is, incidentally, another hadoop process that parses crawl archive text files on S3 and writes the output to MongoDB when it detects a video on the page. To my knowledge, this worked without issue, generating the 9.4m records without a problem. I have another collection called "flicket.master" with the exact same schema that is used as the canonical master DB of movies/shows (think IMDb). My hadoop process does a search of the above record against the set of master records to try and find a match and then assigns a likelihood score. The idea being that when you search for "Batman the Dark Knight Returns", you will find the most likely places to watch it online.
I know that records are getting dropped somehow, because I can see individual results in the crawl that when I manually match them against the master collection using the same matching code that the mapper uses, I can see strong match results. But they are missing from the hadoop run.
My mapper takes the input document and performs a search against the master collection, rendering the master record with the highest "score" (my own secret sauce). If the score is over a certain threshold, I then send it out to the combiner/reducer as a BSONWritable with the schema { _id: <canonical db id>, match: [ {content_url: <url of content>, score: <matching score>} ]} -- note it will always be an array with just one element.
The reducer then takes that output and combines all the matches for a canonical id into a single record: { _id: <canonical db id>, match: [ {content_url: <url of content #1>, score: <matching score #1>}, {content_url: <url of content #2>, score: <marching score #2>}, ... ]}
I hope this helps. Please let me know if there is any additional information I can provide.
On Tuesday, October 2, 2012 10:59:01 AM UTC-7, Andrew Emil wrote:
> Hi Eric,
> Could you please upload the actual map/reduce command that you passed to > the database to start this whole process? It is possible that something in > the command could be causing this issue. It would also be helpful to see > the schema of the documents that are going into your map reduce.
> I don't know what could be causing this problem right now, hopefully with > some more information we can figure it out.
> Thanks, > Andrew
> On Monday, October 1, 2012 3:59:58 PM UTC-7, Eric Lange wrote:
>> Hello,
>> I am new to this forum and have been looking around for some information, >> but can't seem to find anything relevant. I am running a MR job on Amazon >> EMR using the latest Mongo-Hadoop connector and Hadoop 0.21something.
>> My mongo db is unsharded, and the input collection has 9.4 million >> records. When I examine the output, it seems like all the splits are being >> created properly and all records should be accounted for. However, the >> MapReduce output says that there were only ~295k input records. I should >> be expecting 9.4m, yes?
>> When I look at my db:
>> > db.crawled.count() >> 9483528
>> But the MR output says:
>> map 100% reduce 100% >> 2012-10-01 00:30:40,167 INFO org.apache.hadoop.mapred.JobClient (main): Job complete: job_201209302024_0001 >> 2012-10-01 00:30:40,204 INFO org.apache.hadoop.mapred.JobClient (main): Counters: 30 >> 2012-10-01 00:30:40,204 INFO org.apache.hadoop.mapred.JobClient (main): Job Counters >> 2012-10-01 00:30:40,204 INFO org.apache.hadoop.mapred.JobClient (main): Launched reduce tasks=459 >> 2012-10-01 00:30:40,204 INFO org.apache.hadoop.mapred.JobClient (main): SLOTS_MILLIS_MAPS=6186073223 >> 2012-10-01 00:30:40,205 INFO org.apache.hadoop.mapred.JobClient (main): Total time spent by all reduces waiting after reserving slots (ms)=0 >> 2012-10-01 00:30:40,205 INFO org.apache.hadoop.mapred.JobClient (main): Total time spent by all maps waiting after reserving slots (ms)=0 >> 2012-10-01 00:30:40,205 INFO org.apache.hadoop.mapred.JobClient (main): Rack-local map tasks=2992 >> 2012-10-01 00:30:40,205 INFO org.apache.hadoop.mapred.JobClient (main): Launched map tasks=2992 >> 2012-10-01 00:30:40,205 INFO org.apache.hadoop.mapred.JobClient (main): SLOTS_MILLIS_REDUCES=2459973706 >> 2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main): Failed map tasks=1 >> 2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main): File Input Format Counters >> 2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main): Bytes Read=0 >> 2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main): File Output Format Counters >> 2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main): Bytes Written=0 >> 2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main): FileSystemCounters >> 2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main): FILE_BYTES_READ=783482 >> 2012-10-01 00:30:40,207 INFO org.apache.hadoop.mapred.JobClient (main): HDFS_BYTES_READ=720739 >> 2012-10-01 00:30:40,207 INFO org.apache.hadoop.mapred.JobClient (main): FILE_BYTES_WRITTEN=105144496 >> 2012-10-01 00:30:40,207 INFO org.apache.hadoop.mapred.JobClient (main): Map-Reduce Framework >> 2012-10-01 00:30:40,207 INFO org.apache.hadoop.mapred.JobClient (main): Map output materialized bytes=15042281*2012-10-01 00:30:40,207 INFO org.apache.hadoop.mapred.JobClient (main): Map input records=294990* >> 2012-10-01 00:30:40,207 INFO org.apache.hadoop.mapred.JobClient (main): Reduce shuffle bytes=14983089 >> 2012-10-01 00:30:40,208 INFO org.apache.hadoop.mapred.JobClient (main):
Perhaps this is more of an AWS/Hadoop issue, but I am now seeing that all of my tasks are uncerimoniously ending after approximately 3 hours whether they have completed their slice or not. In fact, the entire job takes the same amount of time (right around 4 hours) no matter how many instances I throw at it. When doing a google search for anything relating to 3 hours, nothing comes up. Also, regardless of how many instances I give it, mongo-hadoop slices my collection into 8mb chunks. Is there a reason for this? Do I need to manually cut it into smaller slices if I want to try to make the tasks shorter?
On Tuesday, October 2, 2012 2:14:52 PM UTC-7, Eric Lange wrote:
> Hi Andrew,
> Thanks for the response. I am not sure I totally understand what you mean > by the "map/reduce command that you passed to the database". I am passing > no query. I want to process every record in the collection. I am not sure > if this is relevant or not, but here is the elastic-mapreduce command I run > to kick the process off.
> {"content_url":" > http://www.amazon.com/Batman-Dark-Knight-Returns-Part/dp/B009GEAPYW","archive_date":1349211077783,"title":"Batman: The Dark Knight Returns Part 1","released":"2012","duration":"77","genre":["animation","action"],"actor" :["peter weller","michael emerson"],"director":["jay oliva"],"keyword":["peter weller","michael emerson","david selby","michael mckean","ariel winter","wade williams","jay oliva","alan burnett","bob goodman","batman: the dark knight returns part 1"],"description":"It is ten years after an aging Batman has retired, and Gotham City has sunk deeper into decadence and lawlessness. Now, when his > city needs him most, he returns in a blaze of glory."} > and store it in the collection "flicket.crawled". This is, incidentally, > another hadoop process that parses crawl archive text files on S3 and > writes the output to MongoDB when it detects a video on the page. To my > knowledge, this worked without issue, generating the 9.4m records without a > problem. I have another collection called "flicket.master" with the exact > same schema that is used as the canonical master DB of movies/shows (think > IMDb). My hadoop process does a search of the above record against the set > of master records to try and find a match and then assigns a likelihood > score. The idea being that when you search for "Batman the Dark Knight > Returns", you will find the most likely places to watch it online.
> I know that records are getting dropped somehow, because I can see > individual results in the crawl that when I manually match them against the > master collection using the same matching code that the mapper uses, I can > see strong match results. But they are missing from the hadoop run.
> My mapper takes the input document and performs a search against the > master collection, rendering the master record with the highest "score" (my > own secret sauce). If the score is over a certain threshold, I then send > it out to the combiner/reducer as a BSONWritable with the schema { _id: > <canonical db id>, match: [ {content_url: <url of content>, score: > <matching score>} ]} -- note it will always be an array with just one > element.
> The reducer then takes that output and combines all the matches for a > canonical id into a single record: { _id: <canonical db id>, match: [ > {content_url: <url of content #1>, score: <matching score #1>}, > {content_url: <url of content #2>, score: <marching score #2>}, ... ]}
> I hope this helps. Please let me know if there is any additional > information I can provide.
> Thanks! > -Eric
> On Tuesday, October 2, 2012 10:59:01 AM UTC-7, Andrew Emil wrote:
>> Hi Eric,
>> Could you please upload the actual map/reduce command that you passed to >> the database to start this whole process? It is possible that something in >> the command could be causing this issue. It would also be helpful to see >> the schema of the documents that are going into your map reduce.
>> I don't know what could be causing this problem right now, hopefully with >> some more information we can figure it out.
>> Thanks, >> Andrew
>> On Monday, October 1, 2012 3:59:58 PM UTC-7, Eric Lange wrote:
>>> Hello,
>>> I am new to this forum and have been looking around for some >>> information, but can't seem to find anything relevant. I am running a MR >>> job on Amazon EMR using the latest Mongo-Hadoop connector and Hadoop >>> 0.21something.
>>> My mongo db is unsharded, and the input collection has 9.4 million >>> records. When I examine the output, it seems like all the splits are being >>> created properly and all records should be accounted for. However, the >>> MapReduce output says that there were only ~295k input records. I should >>> be expecting 9.4m, yes?
>>> When I look at my db:
>>> > db.crawled.count() >>> 9483528
>>> But the MR output says:
>>> map 100% reduce 100% >>> 2012-10-01 00:30:40,167 INFO org.apache.hadoop.mapred.JobClient (main): Job complete: job_201209302024_0001 >>> 2012-10-01 00:30:40,204 INFO org.apache.hadoop.mapred.JobClient (main): Counters: 30 >>> 2012-10-01 00:30:40,204 INFO org.apache.hadoop.mapred.JobClient (main): Job Counters >>> 2012-10-01 00:30:40,204 INFO org.apache.hadoop.mapred.JobClient (main): Launched reduce tasks=459 >>> 2012-10-01 00:30:40,204 INFO org.apache.hadoop.mapred.JobClient (main): SLOTS_MILLIS_MAPS=6186073223 >>> 2012-10-01 00:30:40,205 INFO org.apache.hadoop.mapred.JobClient (main): Total time spent by all reduces waiting after reserving slots (ms)=0 >>> 2012-10-01 00:30:40,205 INFO org.apache.hadoop.mapred.JobClient (main): Total time spent by all maps waiting after reserving slots (ms)=0 >>> 2012-10-01 00:30:40,205 INFO org.apache.hadoop.mapred.JobClient (main): Rack-local map tasks=2992 >>> 2012-10-01 00:30:40,205 INFO org.apache.hadoop.mapred.JobClient (main): Launched map tasks=2992 >>> 2012-10-01 00:30:40,205 INFO org.apache.hadoop.mapred.JobClient (main): SLOTS_MILLIS_REDUCES=2459973706 >>> 2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main): Failed map tasks=1 >>> 2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main): File Input Format Counters >>> 2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main): Bytes Read=0 >>> 2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main): File Output Format Counters >>> 2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main): Bytes Written=0
Another update: I changed the split size from the default 8mb to 2mb (setting it to 1mb caused a Java Heap Space exception during startup). By doing this, the runtime went from 4 hours to 12 and the number of records processed from 295k to 1.07m, still well short of the 9.4m total records. But one thing is clear: there is some sort of time limit for running a task in the 3-4 hour range. Is there some cursor timeout that could be happening? Clearly, I need to optimize the mapper, but still this should work. Again, any help is appreciated.
On Thursday, October 4, 2012 8:07:16 AM UTC-7, Eric Lange wrote:
> Quick update:
> Perhaps this is more of an AWS/Hadoop issue, but I am now seeing that all > of my tasks are uncerimoniously ending after approximately 3 hours whether > they have completed their slice or not. In fact, the entire job takes the > same amount of time (right around 4 hours) no matter how many instances I > throw at it. When doing a google search for anything relating to 3 hours, > nothing comes up. Also, regardless of how many instances I give it, > mongo-hadoop slices my collection into 8mb chunks. Is there a reason for > this? Do I need to manually cut it into smaller slices if I want to try to > make the tasks shorter?
> Thanks, > Eric
> On Tuesday, October 2, 2012 2:14:52 PM UTC-7, Eric Lange wrote:
>> Hi Andrew,
>> Thanks for the response. I am not sure I totally understand what you >> mean by the "map/reduce command that you passed to the database". I am >> passing no query. I want to process every record in the collection. I am >> not sure if this is relevant or not, but here is the elastic-mapreduce >> command I run to kick the process off.
>> {"content_url":" >> http://www.amazon.com/Batman-Dark-Knight-Returns-Part/dp/B009GEAPYW","archive_date":1349211077783,"title":"Batman: The Dark Knight Returns Part 1","released":"2012","duration":"77","genre":["animation","action"],"actor" :["peter weller","michael emerson"],"director":["jay oliva"],"keyword":["peter weller","michael emerson","david selby","michael mckean","ariel winter","wade williams","jay oliva","alan burnett","bob goodman","batman: the dark knight returns part 1"],"description":"It is ten years after an aging Batman has retired, and Gotham City has sunk deeper into decadence and lawlessness. Now, when his >> city needs him most, he returns in a blaze of glory."} >> and store it in the collection "flicket.crawled". This is, incidentally, >> another hadoop process that parses crawl archive text files on S3 and >> writes the output to MongoDB when it detects a video on the page. To my >> knowledge, this worked without issue, generating the 9.4m records without a >> problem. I have another collection called "flicket.master" with the exact >> same schema that is used as the canonical master DB of movies/shows (think >> IMDb). My hadoop process does a search of the above record against the set >> of master records to try and find a match and then assigns a likelihood >> score. The idea being that when you search for "Batman the Dark Knight >> Returns", you will find the most likely places to watch it online.
>> I know that records are getting dropped somehow, because I can see >> individual results in the crawl that when I manually match them against the >> master collection using the same matching code that the mapper uses, I can >> see strong match results. But they are missing from the hadoop run.
>> My mapper takes the input document and performs a search against the >> master collection, rendering the master record with the highest "score" (my >> own secret sauce). If the score is over a certain threshold, I then send >> it out to the combiner/reducer as a BSONWritable with the schema { _id: >> <canonical db id>, match: [ {content_url: <url of content>, score: >> <matching score>} ]} -- note it will always be an array with just one >> element.
>> The reducer then takes that output and combines all the matches for a >> canonical id into a single record: { _id: <canonical db id>, match: [ >> {content_url: <url of content #1>, score: <matching score #1>}, >> {content_url: <url of content #2>, score: <marching score #2>}, ... ]}
>> I hope this helps. Please let me know if there is any additional >> information I can provide.
>> Thanks! >> -Eric
>> On Tuesday, October 2, 2012 10:59:01 AM UTC-7, Andrew Emil wrote:
>>> Hi Eric,
>>> Could you please upload the actual map/reduce command that you passed to >>> the database to start this whole process? It is possible that something in >>> the command could be causing this issue. It would also be helpful to see >>> the schema of the documents that are going into your map reduce.
>>> I don't know what could be causing this problem right now, hopefully >>> with some more information we can figure it out.
>>> Thanks, >>> Andrew
>>> On Monday, October 1, 2012 3:59:58 PM UTC-7, Eric Lange wrote:
>>>> Hello,
>>>> I am new to this forum and have been looking around for some >>>> information, but can't seem to find anything relevant. I am running a MR >>>> job on Amazon EMR using the latest Mongo-Hadoop connector and Hadoop >>>> 0.21something.
>>>> My mongo db is unsharded, and the input collection has 9.4 million >>>> records. When I examine the output, it seems like all the splits are being >>>> created properly and all records should be accounted for. However, the >>>> MapReduce output says that there were only ~295k input records. I should >>>> be expecting 9.4m, yes?
>>>> When I look at my db:
>>>> > db.crawled.count() >>>> 9483528
>>>> But the MR output says:
>>>> map 100% reduce 100% >>>> 2012-10-01 00:30:40,167 INFO org.apache.hadoop.mapred.JobClient (main): Job complete: job_201209302024_0001 >>>> 2012-10-01 00:30:40,204 INFO org.apache.hadoop.mapred.JobClient (main): Counters: 30 >>>> 2012-10-01 00:30:40,204 INFO org.apache.hadoop.mapred.JobClient (main): Job Counters >>>> 2012-10-01 00:30:40,204 INFO org.apache.hadoop.mapred.JobClient (main): Launched reduce tasks=459 >>>> 2012-10-01 00:30:40,204 INFO org.apache.hadoop.mapred.JobClient (main): SLOTS_MILLIS_MAPS=6186073223 >>>> 2012-10-01 00:30:40,205 INFO org.apache.hadoop.mapred.JobClient (main): Total time spent by all reduces waiting after reserving slots (ms)=0 >>>> 2012-10-01 00:30:40,205 INFO org.apache.hadoop.mapred.JobClient (main): Total time spent by all maps waiting after reserving slots (ms)=0 >>>> 2012-10-01 00:30:40,205 INFO