BSONFileInputFormat dropping records (Mongo Hadoop project)


Mayur Gupta

Mar 11, 2014, 5:12:45 AM
to mongod...@googlegroups.com
Hey There,

I am using the mongo-hadoop project to import mongo BSON files into Hive tables via BSONFileInputFormat. After importing a dump that had millions of records, the number of rows in Hive was considerably lower. When I looked at the Hive log, I found the following error messages:

2014-03-11 13:22:54,881 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 7 forwarding 10000 rows
2014-03-11 13:22:54,881 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 3 forwarding 10000 rows
2014-03-11 13:22:54,881 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 2 forwarding 10000 rows
2014-03-11 13:22:54,881 INFO ExecMapper: ExecMapper: processing 10000 rows: used memory = 144388608
2014-03-11 13:22:55,162 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 7 forwarding 100000 rows
2014-03-11 13:22:55,163 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 3 forwarding 100000 rows
2014-03-11 13:22:55,163 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 2 forwarding 100000 rows
2014-03-11 13:22:55,163 INFO ExecMapper: ExecMapper: processing 100000 rows: used memory = 152418384
2014-03-11 13:22:56,001 INFO com.mongodb.hadoop.mapred.input.BSONFileRecordReader: closing bson file split.
2014-03-11 13:22:56,001 INFO com.mongodb.hadoop.mapred.input.BSONFileRecordReader: closing bson file split.
2014-03-11 13:22:56,003 INFO org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://localhost:8020/user/hive/warehouse/jb.db/visit/visits.bson
2014-03-11 13:22:56,005 ERROR com.mongodb.hadoop.mapred.input.BSONFileRecordReader: Error reading key/value from bson file: BSONDecoder doesn't understand type : 57 name: 4091990.531827d6e2076
2014-03-11 13:22:56,005 INFO com.mongodb.hadoop.mapred.input.BSONFileRecordReader: closing bson file split.
2014-03-11 13:22:56,005 INFO com.mongodb.hadoop.mapred.input.BSONFileRecordReader: closing bson file split.
2014-03-11 13:22:56,007 INFO org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://localhost:8020/user/hive/warehouse/jb.db/visit/visits.bson
2014-03-11 13:22:56,008 ERROR com.mongodb.hadoop.mapred.input.BSONFileRecordReader: Error reading key/value from bson file: null
2014-03-11 13:22:56,008 INFO com.mongodb.hadoop.mapred.input.BSONFileRecordReader: closing bson file split.
2014-03-11 13:22:56,008 INFO com.mongodb.hadoop.mapred.input.BSONFileRecordReader: closing bson file split.
2014-03-11 13:22:56,010 INFO org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://localhost:8020/user/hive/warehouse/jb.db/visit/visits.bson
2014-03-11 13:22:56,011 ERROR com.mongodb.hadoop.mapred.input.BSONFileRecordReader: Error reading key/value from bson file: BSONDecoder doesn't understand type : 117 name: id
2014-03-11 13:22:56,011 INFO com.mongodb.hadoop.mapred.input.BSONFileRecordReader: closing bson file split.
2014-03-11 13:22:56,011 INFO com.mongodb.hadoop.mapred.input.BSONFileRecordReader: closing bson file split.
2014-03-11 13:22:56,011 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 7 finished. closing... 
2014-03-11 13:22:56,011 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 7 forwarded 541194 rows
2014-03-11 13:22:56,011 INFO org.apache.hadoop.hive.ql.exec.MapOperator: DESERIALIZE_ERRORS:0
2014-03-11 13:22:56,011 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 3 finished. closing... 
2014-03-11 13:22:56,011 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 3 forwarded 541194 rows
2014-03-11 13:22:56,011 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 2 finished. closing... 
2014-03-11 13:22:56,012 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 2 forwarded 541194 rows
2014-03-11 13:22:56,012 INFO org.apache.hadoop.hive.ql.exec.GroupByOperator: 1 finished. closing... 

Below is my table definition in Hive:

CREATE EXTERNAL TABLE test( 
  visitId       STRING,
  browserId     STRING,
  softUserId    STRING,
  userId        STRING,
  matchType     BOOLEAN,
  ts            TIMESTAMP
) 
ROW FORMAT SERDE 'com.mongodb.hadoop.hive.BSONSerDe'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"visitid":"_id", "browserid":"bid", "softuserid":"uid0", "userid":"uid", "matchtype":"um"}')
STORED AS INPUTFORMAT 'com.mongodb.hadoop.mapred.BSONFileInputFormat'
OUTPUTFORMAT 'com.mongodb.hadoop.hive.output.HiveBSONFileOutputFormat';

The number of rows in MongoDB is about 3.2 million, but in Hive I see only 0.5 million rows.

I am using Hadoop 1.0.3 and version 1.2 of the mongo-hadoop project. The mongo Java driver is 2.11.3.

Any ideas what is causing this?

Thanks
-Mayur

Justin Lee

Mar 11, 2014, 9:39:50 AM
to mongod...@googlegroups.com
Offhand, it looks like a bad key in your input file.  The logging isn't *terribly* helpful as it is.  I just added some more context to the log message so you can at least tell what line it is, but that won't help you in this instance, unfortunately.  If you turn on debug logging, the connector will log every 10000 lines and at least isolate which section of the file it's in.  Then you can scan those keys for anything odd.  If you don't mind building a snapshot build, you could use the new debug logging with this import file and see if anything leaps out at you.
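
For reference, one way to raise the task-side log level straight from the Hive CLI (a sketch only: mapred.map.child.log.level is the Hadoop 1.x property for the child JVM log level, so verify it exists in your distribution):

-- Raise the log level of the map-task JVMs that run the record reader
-- (Hadoop 1.x property name; verify against your distribution).
SET mapred.map.child.log.level=DEBUG;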



Mayur Gupta

Mar 12, 2014, 1:33:49 PM
to mongod...@googlegroups.com
I tried with debug-level logging, but the strange thing is that I don't see the BSONSplitter used anywhere. The same file is processed properly by a plain MapReduce job with BSONFileInputFormat, and even with an embedded Hive server. I see that the getSplits method of BSONFileInputFormat is not called at all when the format is used from Hive.

One other point: I am using the MapReduce API.

I will continue looking; I'm still far from the root cause.
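
If Hive's split combining is what bypasses getSplits, one experiment worth trying (a sketch only; whether this setting takes effect depends on the Hive version in use) is forcing Hive to delegate split computation to the table's own InputFormat:

-- Have Hive call BSONFileInputFormat.getSplits directly instead of
-- building its own combined splits.
SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
SELECT COUNT(*) FROM test;

If the count then matches MongoDB, the combined splits are the culprit.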

Mayur Gupta

Mar 13, 2014, 9:21:07 AM
to mongod...@googlegroups.com
Hey Justin,

I tried with the updated logging and below is a snippet from that log (the complete log is also attached). The error always happens at doc 0. Moreover, if I process the same BSON file using plain MapReduce without Hive, it all works perfectly. I don't see the BSONSplitter being called in Hive jobs, but it is called in plain MapReduce jobs. Is this how it is intended to work?

Just to make sure that the file is not corrupted, I imported it back into Mongo and it works without any errors. I would appreciate it if you could point out where I am going wrong.

2014-03-13 17:57:42,510 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 7 forwarding 100000 rows
2014-03-13 17:57:42,510 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 3 forwarding 100000 rows
2014-03-13 17:57:42,510 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 2 forwarding 100000 rows
2014-03-13 17:57:42,510 INFO ExecMapper: ExecMapper: processing 100000 rows: used memory = 154566472
2014-03-13 17:57:43,397 INFO com.mongodb.hadoop.mapred.input.BSONFileRecordReader: closing bson file split.
2014-03-13 17:57:43,397 INFO com.mongodb.hadoop.mapred.input.BSONFileRecordReader: closing bson file split.
2014-03-13 17:57:43,400 INFO org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://localhost:8020/user/hive/warehouse/bug.db/visit/visits.bson
2014-03-13 17:57:43,412 ERROR com.mongodb.hadoop.mapred.input.BSONFileRecordReader: Error reading key/value from bson file on line BSONDecoder doesn't understand type : -101 name: ��D : 0, value=<BSONWritable:{ "_id" : "v-2014-03-13-1394701146.5321735ae61cb" , "bid" : "bw-2014-03-13-1394701146.5321735ae4337" , "ts" : { "$date" : "2014-03-13T08:59:06.942Z"} , "uid" : 619965 , "um" : true}>
java.lang.UnsupportedOperationException: BSONDecoder doesn't understand type : -101 name: ��D
at org.bson.BasicBSONDecoder.decodeElement(BasicBSONDecoder.java:226)
at org.bson.BasicBSONDecoder._decode(BasicBSONDecoder.java:79)
at org.bson.BasicBSONDecoder.decode(BasicBSONDecoder.java:57)
at com.mongodb.hadoop.mapred.input.BSONFileRecordReader.next(BSONFileRecordReader.java:85)
at com.mongodb.hadoop.mapred.input.BSONFileRecordReader.next(BSONFileRecordReader.java:43)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:274)
at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:101)
at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:41)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:108)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:331)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:249)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:215)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:200)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
2014-03-13 17:57:43,414 INFO com.mongodb.hadoop.mapred.input.BSONFileRecordReader: closing bson file split.
2014-03-13 17:57:43,414 INFO com.mongodb.hadoop.mapred.input.BSONFileRecordReader: closing bson file split.

(Attachment: log.txt)

Mayur Gupta

Mar 17, 2014, 1:24:33 PM
to mongod...@googlegroups.com
Is this supposed to work or not? Has it worked for anybody? I am close to abandoning this approach.

Thanks

Justin Lee

Mar 17, 2014, 1:34:56 PM
to mongod...@googlegroups.com
Sorry.  I had this post open trying to get a reply written but kept getting sidetracked.  Can you post the errant log line so I can dig into it?

Mayur Gupta

Mar 17, 2014, 2:17:33 PM
to mongod...@googlegroups.com
Below is the errant log line. It seems to be a problem with splitting; the BSONSplitter is never called. I attached the complete log in my previous post.

2014-03-13 17:57:43,397 INFO com.mongodb.hadoop.mapred.input.BSONFileRecordReader: closing bson file split.
2014-03-13 17:57:43,400 INFO org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://localhost:8020/user/hive/warehouse/bug.db/visit/visits.bson
2014-03-13 17:57:43,412 ERROR com.mongodb.hadoop.mapred.input.BSONFileRecordReader: Error reading key/value from bson file on line BSONDecoder doesn't understand type : -101 name: ��D : 0, value=<BSONWritable:{ "_id" : "v-2014-03-13-1394701146.5321735ae61cb" , "bid" : "bw-2014-03-13-1394701146.5321735ae4337" , "ts" : { "$date" : "2014-03-13T08:59:06.942Z"} , "uid" : 619965 , "um" : true}>
java.lang.UnsupportedOperationException: BSONDecoder doesn't understand type : -101 name: ��D
at org.bson.BasicBSONDecoder.decodeElement(BasicBSONDecoder.java:226)


Justin Lee

Mar 17, 2014, 2:30:23 PM
to mongod...@googlegroups.com
I mean the line it's trying to import...  I can bang out a unit test and see what's going on.

Mayur Gupta

Mar 18, 2014, 2:24:22 AM
to mongod...@googlegroups.com
Hey Justin,

The problem is not with a particular line. The reason I say this is that I tried running a plain MapReduce job with multiple mappers and it worked. It also works in Hive when I run a unit test in standalone mode. So, to give you test data, I used the raw table from https://github.com/mongodb/mongo-hadoop/blob/master/examples/enron/hive/hive_enron.q and populated 1 million records. Below is the script I used to do this; I have also uploaded the BSON file at https://dl.dropboxusercontent.com/u/7493716/mail.bson.

for (var i = 0; i < 1000000; i++) {
  db.mail.insert({headers: {From: "FromAddress" + i + "@.com", To: "ToAddress" + i + "@.com"}});
}

If I just run a select count(*) from raw, the count is 687052. If I look at the logs, this is what I see:

Processing split: Paths:/user/hive/warehouse/raw/mail.bson:0+67108864,/user/hive/warehouse/raw/mail.bson:67108864+30668916
InputFormatClass: com.mongodb.hadoop.mapred.BSONFileInputFormat

This is the point after which the failure happens:
2014-03-18 11:00:19,435 DEBUG input.BSONFileRecordReader (BSONFileRecordReader.java:next(91)) - read 685000 docs from hdfs://localhost:8020/user/hive/warehouse/raw/mail.bson:0+67108864 at 66907780

Error:
2014-03-18 11:00:19,485 ERROR input.BSONFileRecordReader (BSONFileRecordReader.java:next(95)) - Error reading key/value from bson file on line BSONDecoder doesn't understand type : 64 name: .com: 0, value=<BSONWritable:{ "_id" : { "$oid" : "5327d4c04ee094adc7af08a7"} , "headers" : { "From" : "FromAddress687051@.com" , "To" : "ToAddress687051@.com"}}>
java.lang.UnsupportedOperationException: BSONDecoder doesn't understand type : 64 name: .com
at org.bson.BasicBSONDecoder.decodeElement(BasicBSONDecoder.java:226)
at org.bson.BasicBSONDecoder._decode(BasicBSONDecoder.java:79)
at org.bson.BasicBSONDecoder.decode(BasicBSONDecoder.java:57)
at com.mongodb.hadoop.mapred.input.BSONFileRecordReader.next(BSONFileRecordReader.java:85)
at com.mongodb.hadoop.mapred.input.BSONFileRecordReader.next(BSONFileRecordReader.java:43)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:274)
at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:101)
at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:41)


Attached is the complete log for your reference.

One other thing: when I run plain MapReduce I see the split file (.mail.bson.splits), but it is not there in the Hive case.
(Attachment: log.log)

Mayur Gupta

Mar 18, 2014, 2:34:56 AM
to mongod...@googlegroups.com
It looks like it is failing at HDFS block boundaries: the first split in the log above ends at byte 67108864, which is exactly 64 MiB (64 * 1024 * 1024), the default HDFS block size, and the reader fails right around that offset.
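
A quick way to check the block size in effect from the Hive CLI (SET with a bare property name prints its current value; dfs.block.size is the Hadoop 1.x property name):

-- Expect 67108864 (64 MiB) if the stock default applies.
SET dfs.block.size;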

Mayur Gupta

Mar 20, 2014, 12:31:55 AM
to mongod...@googlegroups.com
Hey Justin,

Did you get a chance to look at this?

Justin Lee

May 29, 2014, 11:49:25 AM
to mongod...@googlegroups.com
I apologize for not getting back sooner.  I've been stuck in other code and trying to get the Hive tests cleaned up enough to work with.  At any rate, I just ran hive_enron.q with your file and it worked just fine.  I don't see any errors in the logs.  For what it's worth, I ran it like this:

hive -v -f examples/enron/hive/hive_enron.q -d INPUT=`pwd`/mail.bson -d OUTPUT=hdfs:///user/hive/warehouse/enron.out

Mayur Gupta

May 30, 2014, 7:25:45 AM
to mongod...@googlegroups.com
It seems the test is faulty. The 'raw' table is an internal (managed) table, so when it is dropped, the file mail.bson is deleted along with it and there is nothing left to process. Just make the table external and then try; you should see the error.
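
For anyone reproducing this, a minimal external variant of the table (a sketch only: the headers column is assumed from the generation script above, and the LOCATION matches the warehouse path in the logs):

-- External table: DROP TABLE leaves mail.bson in place, so
-- repeated runs keep their input.
CREATE EXTERNAL TABLE raw (
  headers STRUCT<From:STRING, To:STRING>
)
ROW FORMAT SERDE 'com.mongodb.hadoop.hive.BSONSerDe'
STORED AS INPUTFORMAT 'com.mongodb.hadoop.mapred.BSONFileInputFormat'
OUTPUTFORMAT 'com.mongodb.hadoop.hive.output.HiveBSONFileOutputFormat'
LOCATION '/user/hive/warehouse/raw';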

Justin Lee

May 30, 2014, 9:38:09 AM
to mongod...@googlegroups.com
I ran the test as you described it, though.  Shouldn't I be seeing the same results if the fault was in that script?



Mayur Gupta

May 31, 2014, 12:33:50 AM
to mongod...@googlegroups.com
Did you run the script twice? When you ran it right after making the table external, the table was still internal from the previous run, so the data was deleted by the DROP. Run it again.



Mayur Gupta

Jun 6, 2014, 3:39:02 AM
to mongod...@googlegroups.com
Were you able to replicate the problem?

Justin Lee

Jun 6, 2014, 10:20:01 AM
to mongod...@googlegroups.com
I have not been able to yet, no.


David Beveridge

Jul 14, 2014, 10:09:47 PM
to mongod...@googlegroups.com
Mayur,


I'm trying to use your code here as a sample test on an Amazon Elastic MapReduce cluster (so, a nearly identical setup to what you posted).  Data imports OK, but when querying (directly from the Hive command line on the master server), the MR job fails with:

    java.io.IOException: cannot find class com.mongodb.hadoop.mapred.BSONFileInputFormat
        at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:557)


We've copied the following jars into the /home/hadoop/lib/ directory on both the master and core machines:

    total 620K
    -rw-r--r-- 1 hadoop  72K Jul 11 22:24 json-serde-1.1.9.3-SNAPSHOT.jar
    -rw-r--r-- 1 hadoop 104K Jul 11 22:52 mongo-hadoop-core-1.4.0-SNAPSHOT.jar
    -rw-r--r-- 1 hadoop  21K Jul 11 22:24 mongo-hadoop-hive-1.4.0-SNAPSHOT.jar
    -rw-r--r-- 1 hadoop 408K Jul 14 22:33 mongo-java-driver-2.11.1.jar


Did you have to do any configuration beyond that?

Mayur Gupta

Jul 15, 2014, 2:36:27 PM
to mongod...@googlegroups.com
Hey David,

The reason you didn't get the error on load is that Hive is schema-on-read, so the InputFormat doesn't come into the picture until the table is queried. As for your question, I remember adding the jars on the master only, using an EMR bootstrap action. If you added the jars manually, I think you need to restart the Hadoop daemons.
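
A session-level alternative that avoids restarting the daemons (a sketch; the paths are the ones from your listing, and ADD JAR ships the jars to the MapReduce tasks along with each job):

-- Register the connector jars for the current Hive session only.
ADD JAR /home/hadoop/lib/mongo-hadoop-core-1.4.0-SNAPSHOT.jar;
ADD JAR /home/hadoop/lib/mongo-hadoop-hive-1.4.0-SNAPSHOT.jar;
ADD JAR /home/hadoop/lib/mongo-java-driver-2.11.1.jar;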

David Beveridge

Jul 16, 2014, 1:51:24 PM
to mongod...@googlegroups.com
Mayur,


Thanks for the response! We wound up having to reboot all the machines for them to pick up the libs... next step is creating custom bootstrap actions to do all this, so we're happily back in *familiar* territory.


Cheers!

Mayur Gupta

Jul 17, 2014, 1:58:40 AM
to mongod...@googlegroups.com
Do let me know whether the BSON format works for Hive tables, since, as you saw from the thread, it didn't work for me.



董浩

Aug 5, 2014, 10:48:24 AM
to mongod...@googlegroups.com
Mayur,

I have the same problem as you: BSONFileInputFormat does not work in Hive. Have you resolved it?

Mayur Gupta

Aug 8, 2014, 6:58:40 AM
to mongod...@googlegroups.com
I wrote MapReduce jobs that use BSONFileInputFormat to convert the BSON into text, and the Hive tables are created over that text output.
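
To illustrate that workaround (a sketch only: visits_text, its column list, and the tab-separated layout are all hypothetical, loosely based on the visit table earlier in the thread):

-- Hypothetical table over the text output of the conversion job.
CREATE EXTERNAL TABLE visits_text (
  visitId     STRING,
  browserId   STRING,
  softUserId  STRING,
  userId      STRING,
  matchType   BOOLEAN,
  ts          STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/hive/warehouse/visits_text';

Querying plain text sidesteps BSONFileInputFormat at read time entirely, so Hive's split handling no longer matters.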

