Account Options

  1. Sign in
The old Google Groups will be going away soon, but your browser is incompatible with the new version.
Google Groups Home
« Groups Home
Unexpected number of input records on large dataset
There are currently too many topics in this group that display first. To make this topic appear first, remove this option from another topic.
There was an error processing your request. Please try again.
flag
  5 messages - Collapse all  -  Translate all to Translated (View all originals)
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
 
From:
To:
Cc:
Followup To:
Add Cc | Add Followup-to | Edit Subject
Subject:
Validation:
For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon. Listen and type the numbers you hear
 
Eric Lange  
View profile  
 More options Oct 1 2012, 6:59 pm
From: Eric Lange <eric.w.la...@gmail.com>
Date: Mon, 1 Oct 2012 15:59:58 -0700 (PDT)
Local: Mon, Oct 1 2012 6:59 pm
Subject: Unexpected number of input records on large dataset

Hello,

I am new to this forum and have been looking around for some information,
but can't seem to find anything relevant.  I am running a MR job on Amazon
EMR using the latest Mongo-Hadoop connector and Hadoop 0.21something.

My mongo db is unsharded, and the input collection has 9.4 million records.
 When I examine the output, it seems like all the splits are being created
properly and all records should be accounted for.  However, the MapReduce
output says that there were only ~295k input records.  I should be
expecting 9.4m, yes?

When I look at my db:

> db.crawled.count()

9483528

But the MR output says:

map 100% reduce 100%
2012-10-01 00:30:40,167 INFO org.apache.hadoop.mapred.JobClient (main): Job complete: job_201209302024_0001
2012-10-01 00:30:40,204 INFO org.apache.hadoop.mapred.JobClient (main): Counters: 30
2012-10-01 00:30:40,204 INFO org.apache.hadoop.mapred.JobClient (main):   Job Counters
2012-10-01 00:30:40,204 INFO org.apache.hadoop.mapred.JobClient (main):     Launched reduce tasks=459
2012-10-01 00:30:40,204 INFO org.apache.hadoop.mapred.JobClient (main):     SLOTS_MILLIS_MAPS=6186073223
2012-10-01 00:30:40,205 INFO org.apache.hadoop.mapred.JobClient (main):     Total time spent by all reduces waiting after reserving slots (ms)=0
2012-10-01 00:30:40,205 INFO org.apache.hadoop.mapred.JobClient (main):     Total time spent by all maps waiting after reserving slots (ms)=0
2012-10-01 00:30:40,205 INFO org.apache.hadoop.mapred.JobClient (main):     Rack-local map tasks=2992
2012-10-01 00:30:40,205 INFO org.apache.hadoop.mapred.JobClient (main):     Launched map tasks=2992
2012-10-01 00:30:40,205 INFO org.apache.hadoop.mapred.JobClient (main):     SLOTS_MILLIS_REDUCES=2459973706
2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main):     Failed map tasks=1
2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main):   File Input Format Counters
2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main):     Bytes Read=0
2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main):   File Output Format Counters
2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main):     Bytes Written=0
2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main):   FileSystemCounters
2012-10-01 00:30:40,206 INFO org.apache.hadoop.mapred.JobClient (main):     FILE_BYTES_READ=783482
2012-10-01 00:30:40,207 INFO org.apache.hadoop.mapred.JobClient (main):     HDFS_BYTES_READ=720739
2012-10-01 00:30:40,207 INFO org.apache.hadoop.mapred.JobClient (main):     FILE_BYTES_WRITTEN=105144496
2012-10-01 00:30:40,207 INFO org.apache.hadoop.mapred.JobClient (main):   Map-Reduce Framework
2012-10-01 00:30:40,207 INFO org.apache.hadoop.mapred.JobClient (main):     Map output materialized bytes=15042281*2012-10-01 00:30:40,207 INFO org.apache.hadoop.mapred.JobClient (main):     Map input records=294990*
2012-10-01 00:30:40,207 INFO org.apache.hadoop.mapred.JobClient (main):     Reduce shuffle bytes=14983089
2012-10-01 00:30:40,208 INFO org.apache.hadoop.mapred.JobClient (main):     Spilled Records=13133
2012-10-01 00:30:40,208 INFO org.apache.hadoop.mapred.JobClient (main):     Map output bytes=1153768
2012-10-01 00:30:40,208 INFO org.apache.hadoop.mapred.JobClient (main):     Total committed heap usage (bytes)=1089867415552
2012-10-01 00:30:40,208 INFO org.apache.hadoop.mapred.JobClient (main):     CPU time spent (ms)=72729860
2012-10-01 00:30:40,208 INFO org.apache.hadoop.mapred.JobClient (main):     Map input bytes=0
2012-10-01 00:30:40,208 INFO org.apache.hadoop.mapred.JobClient (main):     SPLIT_RAW_BYTES=720739
2012-10-01 00:30:40,208 INFO org.apache.hadoop.mapred.JobClient (main):     Combine input records=11954
2012-10-01 00:30:40,209 INFO org.apache.hadoop.mapred.JobClient (main):     Reduce input records=6381
2012-10-01 00:30:40,209 INFO org.apache.hadoop.mapred.JobClient (main):     Reduce input groups=5806
2012-10-01 00:30:40,209 INFO org.apache.hadoop.mapred.JobClient (main):     Combine output records=10582
2012-10-01 00:30:40,209 INFO org.apache.hadoop.mapred.JobClient (main):     Physical memory (bytes) snapshot=1345551327232
2012-10-01 00:30:40,209 INFO org.apache.hadoop.mapred.JobClient (main):     Reduce output records=5806
2012-10-01 00:30:40,209 INFO org.apache.hadoop.mapred.JobClient (main):     Virtual memory (bytes) snapshot=2689983688704
2012-10-01 00:30:40,209 INFO org.apache.hadoop.mapred.JobClient (main):     Map output records=7753

See bolded line above.  There are no other errors aside from the occasional failed task.  I am also definitely getting orders of magnitude fewer reduce output records than I expect.

I am using all the standard configuration options, and get a reasonable split output:

2012-09-30 20:25:40,669 INFO com.flicket.hadoop.Matcher (main): Setting input URI: mongodb://<REDACTED>.compute-1.amazonaws.com/flicket.crawled
2012-09-30 20:25:40,733 INFO com.flicket.hadoop.Matcher (main): Setting output URI: mongodb://<REDACTED>.compute-1.amazonaws.com/flicket.matched_20120930

2012-09-30 20:25:41,719 INFO org.apache.hadoop.mapred.JobClient (main): Default number of map tasks: null
2012-09-30 20:25:41,719 INFO org.apache.hadoop.mapred.JobClient (main): Setting default number of map tasks based on cluster size to : 6
2012-09-30 20:25:41,719 INFO org.apache.hadoop.mapred.JobClient (main): Default number of reduce tasks: 351
2012-09-30 20:25:41,919 INFO org.apache.hadoop.mapred.JobClient (main): Setting group to hadoop
2012-09-30 20:25:42,385 INFO com.mongodb.hadoop.util.MongoSplitter (main):  Calculate Splits Code ... Use Shards? false, Use Chunks? true; Collection Sharded? false
2012-09-30 20:25:42,385 INFO com.mongodb.hadoop.util.MongoSplitter (main): Creation of Input Splits is enabled.
2012-09-30 20:25:42,385 INFO com.mongodb.hadoop.util.MongoSplitter (main): Using Unsharded Split mode (Calculating multiple splits though)
2012-09-30 20:25:42,409 INFO com.mongodb.hadoop.util.MongoSplitter (main): Calculating unsharded input splits on namespace 'flicket.crawled' with Split Key '{ "_id" : 1}' and a split size of '8'mb per
2012-09-30 20:25:47,471 INFO com.mongodb.hadoop.util.MongoSplitter (main): Calculated 2478 splits.

It then goes on to list all of the splits which are perfectly acceptable.  I also looked in the individual syslogs for each task, and I can see all of the contents of my input collection (though kind of unreadable -- it seems they are partially binary?).

I am confused and can't figure out what's going on.  Any hints on what else I should look at?

Thanks,

Eric


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Andrew Emil  
View profile  
 More options Oct 2 2012, 1:59 pm
From: Andrew Emil <andrew.e...@10gen.com>
Date: Tue, 2 Oct 2012 10:59:01 -0700 (PDT)
Local: Tues, Oct 2 2012 1:59 pm
Subject: Re: Unexpected number of input records on large dataset

Hi Eric,

Could you please upload the actual map/reduce command that you passed to
the database to start this whole process?  It is possible that something in
the command could be causing this issue.  It would also be helpful to see
the schema of the documents that are going into your map reduce.  

I don't know what could be causing this problem right now, hopefully with
some more information we can figure it out.

Thanks,
Andrew


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Eric Lange  
View profile  
 More options Oct 2 2012, 5:14 pm
From: Eric Lange <eric.w.la...@gmail.com>
Date: Tue, 2 Oct 2012 14:14:52 -0700 (PDT)
Local: Tues, Oct 2 2012 5:14 pm
Subject: Re: Unexpected number of input records on large dataset

Hi Andrew,

Thanks for the response.  I am not sure I totally understand what you mean
by the "map/reduce command that you passed to the database".  I am passing
no query.  I want to process every record in the collection.  I am not sure
if this is relevant or not, but here is the elastic-mapreduce command I run
to kick the process off.

/opt/aws/emr/elastic-mapreduce --create --plain-output --name
Job_Matcher__20120929_234142 --ami-version=2.1.1 --hadoop-version=0.20.205
--jar s3n://flicket/emr/jars/flicket-hadoop-1.0.1.jar --step-name
Run_Matcher --log-uri s3n://flicket/emr/logs --main-class
com.flicket.hadoop.Matcher --access-id ******** --private-key ********
--arg -Dmapreduce.job.split.metainfo.maxsize=-1 --arg
-Dmapred.max.map.failures.percent=50 --arg
InputPath=mongodb://ec2-107-22-127-215.compute-1.amazonaws.com/flicket.craw led
--arg
OutputURI=mongodb://ec2-********.compute-1.amazonaws.com/flicket.matched_20 120929
--arg InputURI=mongodb://********.compute-1.amazonaws.com/flicket.crawled
--instance-group master --instance-type m1.large --instance-count 1
--availability-zone us-east-1b --instance-group core --instance-type
m1.large --instance-count 1 --instance-group task --instance-type c1.xlarge
--instance-count 100 --bid-price 0.08

Here are a few random lines from the syslog showing how the split is done,
which seems completely rational (the ID is the URL of the content):

2012-09-30 20:25:49,003 INFO com.mongodb.hadoop.mapred.input.MongoInputSplit (main): Creating a new MongoInputSplit for MongoURI 'mongodb://REDACTED.compute-1.amazonaws.com/flicket.crawled', query: '{ "$query" : { } , "$min" : { "_id" : "http://www.220.ro/desene-animate/Foamy-Si-Uraganul/UCLSo9zokS/"} , "$max" : { "_id" : "http://www.220.ro/documentare/Pleo-Robot-Sau-Animal-De-Casa/1nhjDTywjZ/"}}', fieldSpec: '{ }', sort: '{ }', limit: 0, skip: 0 .
2012-09-30 20:25:49,003 INFO com.mongodb.hadoop.mapred.input.MongoInputSplit (main): Creating a new MongoInputSplit for MongoURI 'mongodb://REDACTED.compute-1.amazonaws.com/flicket.crawled', query: '{ "$query" : { } , "$min" : { "_id" : "http://www.220.ro/documentare/Pleo-Robot-Sau-Animal-De-Casa/1nhjDTywjZ/"} , "$max" : { "_id" : "http://www.220.ro/faze-tari/Au-Ajuns-Faimosi-Pe-Internet-Blocati-In-A..."}}', fieldSpec: '{ }', sort: '{ }', limit: 0, skip: 0 .
2012-09-30 20:25:49,003 INFO com.mongodb.hadoop.mapred.input.MongoInputSplit (main): Creating a new MongoInputSplit for MongoURI 'mongodb://REDACTED.compute-1.amazonaws.com/flicket.crawled', query: '{ "$query" : { } , "$min" : { "_id" : "http://www.220.ro/faze-tari/Au-Ajuns-Faimosi-Pe-Internet-Blocati-In-A..."} , "$max" : { "_id" : "http://www.220.ro/videoclipuri/Florinl-Si-Ioana-Iubirea-Nu-Are-Lege/p..."}}', fieldSpec: '{ }', sort: '{ }', limit: 0, skip: 0 .
2012-09-30 20:25:49,003 INFO com.mongodb.hadoop.mapred.input.MongoInputSplit (main): Creating a new MongoInputSplit for MongoURI 'mongodb://REDACTED.compute-1.amazonaws.com/flicket.crawled', query: '{ "$query" : { } , "$min" : { "_id" : "http://www.220.ro/videoclipuri/Florinl-Si-Ioana-Iubirea-Nu-Are-Lege/p..."} , "$max" : { "_id" : "http://www.69stream.com/EN/scheda/7526/aiutami-figlio-mio.html"}}', fieldSpec: '{ }', sort: '{ }', limit: 0, skip: 0 .

As for the schema, these are video records from a crawl.  For instance,
when this page<http://www.amazon.com/gp/product/B009GEAPYW/ref=s9_al_bw_g318_ir24?pf...>is crawled, I generate the following document:

{"content_url":"http://www.amazon.com/Batman-Dark-Knight-Returns-Part/dp/B009GEAPYW","archive_date":1349211077783,"title":"Batman: The Dark Knight Returns Part 1","released":"2012","duration":"77","genre":["animation","action"],"actor" :["peter weller","michael emerson"],"director":["jay oliva"],"keyword":["peter weller","michael emerson","david selby","michael mckean","ariel winter","wade williams","jay oliva","alan burnett","bob goodman","batman: the dark knight returns part 1"],"description":"It is ten years after an aging Batman has retired, and Gotham City has sunk deeper into decadence and lawlessness. Now, when his
city needs him most, he returns in a blaze of glory."}
and store it in the collection "flicket.crawled".  This is, incidentally,
another hadoop process that parses crawl archive text files on S3 and
writes the output to MongoDB when it detects a video on the page.  To my
knowledge, this worked without issue, generating the 9.4m records without a
problem.  I have another collection called "flicket.master" with the exact
same schema that is used as the canonical master DB of movies/shows (think
IMDb).  My hadoop process does a search of the above record against the set
of master records to try and find a match and then assigns a likelihood
score.  The idea being that when you search for "Batman the Dark Knight
Returns", you will find the most likely places to watch it online.

I know that records are getting dropped somehow, because I can see
individual results in the crawl that when I manually match them against the
master collection using the same matching code that the mapper uses, I can
see strong match results.  But they are missing from the hadoop run.

My mapper takes the input document and performs a search against the master
collection, rendering the master record with the highest "score" (my own
secret sauce).  If the score is over a certain threshold, I then send it
out to the combiner/reducer as a BSONWritable with the schema { _id:
<canonical db id>, match: [ {content_url: <url of content>, score:
<matching score>} ]} -- note it will always be an array with just one
element.

The reducer then takes that output and combines all the matches for a
canonical id into a single record: { _id: <canonical db id>, match: [
{content_url: <url of content #1>, score: <matching score #1>},
{content_url: <url of content #2>, score: <marching score #2>}, ... ]}

I hope this helps.  Please let me know if there is any additional
information I can provide.

Thanks!
-Eric

...

read more »


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Eric Lange  
View profile  
 More options Oct 4 2012, 11:07 am
From: Eric Lange <eric.w.la...@gmail.com>
Date: Thu, 4 Oct 2012 08:07:16 -0700 (PDT)
Local: Thurs, Oct 4 2012 11:07 am
Subject: Re: Unexpected number of input records on large dataset

Quick update:

Perhaps this is more of an AWS/Hadoop issue, but I am now seeing that all
of my tasks are uncerimoniously ending after approximately 3 hours whether
they have completed their slice or not.  In fact, the entire job takes the
same amount of time (right around 4 hours) no matter how many instances I
throw at it.  When doing a google search for anything relating to 3 hours,
nothing comes up.  Also, regardless of how many instances I give it,
mongo-hadoop slices my collection into 8mb chunks.  Is there a reason for
this?  Do I need to manually cut it into smaller slices if I want to try to
make the tasks shorter?

Thanks,
Eric

...

read more »


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Eric Lange  
View profile  
 More options Oct 5 2012, 12:59 am
From: Eric Lange <eric.w.la...@gmail.com>
Date: Thu, 4 Oct 2012 21:59:25 -0700 (PDT)
Local: Fri, Oct 5 2012 12:59 am
Subject: Re: Unexpected number of input records on large dataset

Another update:  I changed the split size from the default 8mb to 2mb
(setting it to 1mb caused a Java Heap Space exception during startup).  By
doing this, the runtime went from 4 hours to 12 and the number of records
processed from 295k to 1.07m, still well short of the 9.4m total records.
 But one thing is clear: there is some sort of time limit for running a
task in the 3-4 hour range.  Is there some cursor timeout that could be
happening?  Clearly, I need to optimize the mapper, but still this should
work.  Again, any help is appreciated.

...

read more »


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
End of messages
« Back to Discussions « Newer topic     Older topic »