BSONFileInputFormat dropping records (Mongo Hadoop project)


Mayur Gupta

Mar 11, 2014, 5:12:45 AM
to mongod...@googlegroups.com
Hey There,

I am using the mongo-hadoop project to import mongo BSON files into Hive tables via BSONFileInputFormat. After importing a dump that had millions of records, the number of rows in Hive was considerably lower. When I looked at the Hive log, I found the following error messages:

2014-03-11 13:22:54,881 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 7 forwarding 10000 rows
2014-03-11 13:22:54,881 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 3 forwarding 10000 rows
2014-03-11 13:22:54,881 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 2 forwarding 10000 rows
2014-03-11 13:22:54,881 INFO ExecMapper: ExecMapper: processing 10000 rows: used memory = 144388608
2014-03-11 13:22:55,162 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 7 forwarding 100000 rows
2014-03-11 13:22:55,163 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 3 forwarding 100000 rows
2014-03-11 13:22:55,163 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 2 forwarding 100000 rows
2014-03-11 13:22:55,163 INFO ExecMapper: ExecMapper: processing 100000 rows: used memory = 152418384
2014-03-11 13:22:56,001 INFO com.mongodb.hadoop.mapred.input.BSONFileRecordReader: closing bson file split.
2014-03-11 13:22:56,001 INFO com.mongodb.hadoop.mapred.input.BSONFileRecordReader: closing bson file split.
2014-03-11 13:22:56,003 INFO org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://localhost:8020/user/hive/warehouse/jb.db/visit/visits.bson
2014-03-11 13:22:56,005 ERROR com.mongodb.hadoop.mapred.input.BSONFileRecordReader: Error reading key/value from bson file: BSONDecoder doesn't understand type : 57 name: 4091990.531827d6e2076
2014-03-11 13:22:56,005 INFO com.mongodb.hadoop.mapred.input.BSONFileRecordReader: closing bson file split.
2014-03-11 13:22:56,005 INFO com.mongodb.hadoop.mapred.input.BSONFileRecordReader: closing bson file split.
2014-03-11 13:22:56,007 INFO org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://localhost:8020/user/hive/warehouse/jb.db/visit/visits.bson
2014-03-11 13:22:56,008 ERROR com.mongodb.hadoop.mapred.input.BSONFileRecordReader: Error reading key/value from bson file: null
2014-03-11 13:22:56,008 INFO com.mongodb.hadoop.mapred.input.BSONFileRecordReader: closing bson file split.
2014-03-11 13:22:56,008 INFO com.mongodb.hadoop.mapred.input.BSONFileRecordReader: closing bson file split.
2014-03-11 13:22:56,010 INFO org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://localhost:8020/user/hive/warehouse/jb.db/visit/visits.bson
2014-03-11 13:22:56,011 ERROR com.mongodb.hadoop.mapred.input.BSONFileRecordReader: Error reading key/value from bson file: BSONDecoder doesn't understand type : 117 name: id
2014-03-11 13:22:56,011 INFO com.mongodb.hadoop.mapred.input.BSONFileRecordReader: closing bson file split.
2014-03-11 13:22:56,011 INFO com.mongodb.hadoop.mapred.input.BSONFileRecordReader: closing bson file split.
2014-03-11 13:22:56,011 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 7 finished. closing... 
2014-03-11 13:22:56,011 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 7 forwarded 541194 rows
2014-03-11 13:22:56,011 INFO org.apache.hadoop.hive.ql.exec.MapOperator: DESERIALIZE_ERRORS:0
2014-03-11 13:22:56,011 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 3 finished. closing... 
2014-03-11 13:22:56,011 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 3 forwarded 541194 rows
2014-03-11 13:22:56,011 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 2 finished. closing... 
2014-03-11 13:22:56,012 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 2 forwarded 541194 rows
2014-03-11 13:22:56,012 INFO org.apache.hadoop.hive.ql.exec.GroupByOperator: 1 finished. closing... 

Below is my table definition in Hive:

CREATE EXTERNAL TABLE test( 
  visitId       STRING,
  browserId     STRING,
  softUserId    STRING,
  userId        STRING,
  matchType     BOOLEAN,
  ts            TIMESTAMP
) 
ROW FORMAT SERDE 'com.mongodb.hadoop.hive.BSONSerDe'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"visitid":"_id", "browserid":"bid", "softuserid":"uid0", "userid":"uid", "matchtype":"um"}')
STORED AS INPUTFORMAT 'com.mongodb.hadoop.mapred.BSONFileInputFormat'
OUTPUTFORMAT 'com.mongodb.hadoop.hive.output.HiveBSONFileOutputFormat';

The number of rows in MongoDB is about 3.2 million, but in Hive I see only 0.5 million rows.

I am using Hadoop 1.0.3 and version 1.2 of the mongo-hadoop project. The mongo Java driver is 2.11.3.

Any ideas what is causing this?

Thanks
-Mayur

Justin Lee

Mar 11, 2014, 9:39:50 AM
to mongod...@googlegroups.com
Offhand, it looks like a bad key in your input file.  The logging isn't *terribly* helpful as it is.  I just added some more context to the log message so you can at least tell what line it is, but that won't help you in this instance, unfortunately.  If you turn on debug logging, the connector will log every 10000 lines and at least isolate which section of the file it's in.  Then you can scan those keys for anything odd.  If you don't mind building a snapshot build, you could use the new debug logging with this import file and see if anything leaps out at you.
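
For reference, one way to raise the task-side log level straight from the Hive CLI (a sketch only: mapred.map.child.log.level is the Hadoop 1.x property for the child JVM log level, so verify it exists in your distribution):

-- Raise the log level of the map-task JVMs that run the record reader
-- (Hadoop 1.x property name; verify against your distribution).
SET mapred.map.child.log.level=DEBUG;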



Mayur Gupta

Mar 12, 2014, 1:33:49 PM
to mongod...@googlegroups.com
I tried with debug-level logging, but the strange thing is that I don't see the BSONSplitter used anywhere. The same file is processed properly by a plain MapReduce job with BSONFileInputFormat, and even with an embedded Hive server. I see that the getSplits method of BSONFileInputFormat is not called at all when the format is used from Hive.

One other point: I am using the MapReduce API.

I will continue looking; I'm still far from the root cause.
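
If Hive's split combining is what bypasses getSplits, one experiment worth trying (a sketch only; whether this setting takes effect depends on the Hive version in use) is forcing Hive to delegate split computation to the table's own InputFormat:

-- Have Hive call BSONFileInputFormat.getSplits directly instead of
-- building its own combined splits.
SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
SELECT COUNT(*) FROM test;

If the count then matches MongoDB, the combined splits are the culprit.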

Mayur Gupta

Mar 13, 2014, 9:21:07 AM
to mongod...@googlegroups.com
Hey Justin,

I tried with the updated logging and below is a snippet from that log (the complete log is also attached). The error always happens at doc 0. Moreover, if I process the same BSON file using plain MapReduce without Hive, it all works perfectly. I don't see the BSONSplitter being called in Hive jobs, but it is called in plain MapReduce jobs. Is this how it is intended to work?

Just to make sure that the file is not corrupted, I imported it back into Mongo and it works without any errors. I would appreciate it if you could point out where I am going wrong.

2014-03-13 17:57:42,510 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 7 forwarding 100000 rows
2014-03-13 17:57:42,510 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 3 forwarding 100000 rows
2014-03-13 17:57:42,510 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 2 forwarding 100000 rows
2014-03-13 17:57:42,510 INFO ExecMapper: ExecMapper: processing 100000 rows: used memory = 154566472
2014-03-13 17:57:43,397 INFO com.mongodb.hadoop.mapred.input.BSONFileRecordReader: closing bson file split.
2014-03-13 17:57:43,397 INFO com.mongodb.hadoop.mapred.input.BSONFileRecordReader: closing bson file split.
2014-03-13 17:57:43,400 INFO org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://localhost:8020/user/hive/warehouse/bug.db/visit/visits.bson
2014-03-13 17:57:43,412 ERROR com.mongodb.hadoop.mapred.input.BSONFileRecordReader: Error reading key/value from bson file on line BSONDecoder doesn't understand type : -101 name: ��D : 0, value=<BSONWritable:{ "_id" : "v-2014-03-13-1394701146.5321735ae61cb" , "bid" : "bw-2014-03-13-1394701146.5321735ae4337" , "ts" : { "$date" : "2014-03-13T08:59:06.942Z"} , "uid" : 619965 , "um" : true}>
java.lang.UnsupportedOperationException: BSONDecoder doesn't understand type : -101 name: ��D
at org.bson.BasicBSONDecoder.decodeElement(BasicBSONDecoder.java:226)
at org.bson.BasicBSONDecoder._decode(BasicBSONDecoder.java:79)
at org.bson.BasicBSONDecoder.decode(BasicBSONDecoder.java:57)
at com.mongodb.hadoop.mapred.input.BSONFileRecordReader.next(BSONFileRecordReader.java:85)
at com.mongodb.hadoop.mapred.input.BSONFileRecordReader.next(BSONFileRecordReader.java:43)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:274)
at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:101)
at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:41)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:108)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:331)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:249)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:215)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:200)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
2014-03-13 17:57:43,414 INFO com.mongodb.hadoop.mapred.input.BSONFileRecordReader: closing bson file split.
2014-03-13 17:57:43,414 INFO com.mongodb.hadoop.mapred.input.BSONFileRecordReader: closing bson file split.

(Attachment: log.txt)

Mayur Gupta

Mar 17, 2014, 1:24:33 PM
to mongod...@googlegroups.com
Is this supposed to work or not? Has it worked for anybody? I am close to abandoning this approach.

Thanks

Justin Lee

Mar 17, 2014, 1:34:56 PM
to mongod...@googlegroups.com
Sorry.  I had this post open trying to get a reply written but kept getting sidetracked.  Can you post the errant log line so I can dig into it?

Mayur Gupta

Mar 17, 2014, 2:17:33 PM
to mongod...@googlegroups.com
Below is the errant log line. It seems to be a problem with splitting; the BSONSplitter is never called. I attached the complete log in my previous post.

2014-03-13 17:57:43,397 INFO com.mongodb.hadoop.mapred.input.BSONFileRecordReader: closing bson file split.
2014-03-13 17:57:43,400 INFO org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://localhost:8020/user/hive/warehouse/bug.db/visit/visits.bson
2014-03-13 17:57:43,412 ERROR com.mongodb.hadoop.mapred.input.BSONFileRecordReader: Error reading key/value from bson file on line BSONDecoder doesn't understand type : -101 name: ��D : 0, value=<BSONWritable:{ "_id" : "v-2014-03-13-1394701146.5321735ae61cb" , "bid" : "bw-2014-03-13-1394701146.5321735ae4337" , "ts" : { "$date" : "2014-03-13T08:59:06.942Z"} , "uid" : 619965 , "um" : true}>
java.lang.UnsupportedOperationException: BSONDecoder doesn't understand type : -101 name: ��D
at org.bson.BasicBSONDecoder.decodeElement(BasicBSONDecoder.java:226)


Justin Lee

Mar 17, 2014, 2:30:23 PM
to mongod...@googlegroups.com
I mean the line it's trying to import...  I can bang out a unit test and see what's going on.

Mayur Gupta

Mar 18, 2014, 2:24:22 AM
to mongod...@googlegroups.com
Hey Justin,

The problem is not with a particular line. The reason I say this is that I tried running a plain MapReduce job with multiple mappers and it worked. It also works in Hive when I run a unit test in standalone mode. So, to give you test data, I used the raw table from https://github.com/mongodb/mongo-hadoop/blob/master/examples/enron/hive/hive_enron.q and populated 1 million records. Below is the script I used to do this; I have also uploaded the BSON file at https://dl.dropboxusercontent.com/u/7493716/mail.bson.

for (var i = 0; i < 1000000; i++) {
  db.mail.insert({headers: {From: "FromAddress" + i + "@.com", To: "ToAddress" + i + "@.com"}});
}

If I just run a select count(*) from raw, the count is 687052. If I look at the logs, this is what I see:

Processing split: Paths:/user/hive/warehouse/raw/mail.bson:0+67108864,/user/hive/warehouse/raw/mail.bson:67108864+30668916
InputFormatClass: com.mongodb.hadoop.mapred.BSONFileInputFormat

This is the point after which the failure happens:
2014-03-18 11:00:19,435 DEBUG input.BSONFileRecordReader (BSONFileRecordReader.java:next(91)) - read 685000 docs from hdfs://localhost:8020/user/hive/warehouse/raw/mail.bson:0+67108864 at 66907780

Error:
2014-03-18 11:00:19,485 ERROR input.BSONFileRecordReader (BSONFileRecordReader.java:next(95)) - Error reading key/value from bson file on line BSONDecoder doesn't understand type : 64 name: .com: 0, value=<BSONWritable:{ "_id" : { "$oid" : "5327d4c04ee094adc7af08a7"} , "headers" : { "From" : "FromAddress687051@.com" , "To" : "ToAddress687051@.com"}}>
java.lang.UnsupportedOperationException: BSONDecoder doesn't understand type : 64 name: .com
at org.bson.BasicBSONDecoder.decodeElement(BasicBSONDecoder.java:226)
at org.bson.BasicBSONDecoder._decode(BasicBSONDecoder.java:79)
at org.bson.BasicBSONDecoder.decode(BasicBSONDecoder.java:57)
at com.mongodb.hadoop.mapred.input.BSONFileRecordReader.next(BSONFileRecordReader.java:85)
at com.mongodb.hadoop.mapred.input.BSONFileRecordReader.next(BSONFileRecordReader.java:43)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:274)
at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:101)
at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:41)


Attached is the complete log for your reference.

One other thing: when I run plain MapReduce I see the split file (.mail.bson.splits), but it is not there in the Hive case.
(Attachment: log.log)

Mayur Gupta

Mar 18, 2014, 2:34:56 AM
to mongod...@googlegroups.com
It looks like it is failing at HDFS block boundaries: the first split in the log above ends at byte 67108864, which is exactly 64 MiB (64 * 1024 * 1024), the default HDFS block size, and the reader fails right around that offset.
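
A quick way to check the block size in effect from the Hive CLI (SET with a bare property name prints its current value; dfs.block.size is the Hadoop 1.x property name):

-- Expect 67108864 (64 MiB) if the stock default applies.
SET dfs.block.size;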

Mayur Gupta

Mar 20, 2014, 12:31:55 AM
to mongod...@googlegroups.com
Hey Justin,

Did you get a chance to look at this?

Justin Lee

May 29, 2014, 11:49:25 AM
to mongod...@googlegroups.com
I apologize for not getting back sooner.  I've been stuck in other code and trying to get the Hive tests cleaned up enough to work with.  At any rate, I just ran hive_enron.q with your file and it worked just fine.  I don't see any errors in the logs.  For what it's worth, I ran it like this:

hive -v -f examples/enron/hive/hive_enron.q -d INPUT=`pwd`/mail.bson -d OUTPUT=hdfs:///user/hive/warehouse/enron.out

Mayur Gupta

May 30, 2014, 7:25:45 AM
to mongod...@googlegroups.com
It seems the test is faulty. The 'raw' table is an internal (managed) table, so when it is dropped, the file mail.bson is deleted along with it and there is nothing left to process. Just make the table external and then try; you should see the error.
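
For anyone reproducing this, a minimal external variant of the table (a sketch only: the headers column is assumed from the generation script above, and the LOCATION matches the warehouse path in the logs):

-- External table: DROP TABLE leaves mail.bson in place, so
-- repeated runs keep their input.
CREATE EXTERNAL TABLE raw (
  headers STRUCT<From:STRING, To:STRING>
)
ROW FORMAT SERDE 'com.mongodb.hadoop.hive.BSONSerDe'
STORED AS INPUTFORMAT 'com.mongodb.hadoop.mapred.BSONFileInputFormat'
OUTPUTFORMAT 'com.mongodb.hadoop.hive.output.HiveBSONFileOutputFormat'
LOCATION '/user/hive/warehouse/raw';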

Justin Lee

May 30, 2014, 9:38:09 AM
to mongod...@googlegroups.com
I ran the test as you described it, though.  Shouldn't I be seeing the same results if the fault was in that script?



Mayur Gupta

May 31, 2014, 12:33:50 AM
to mongod...@googlegroups.com
Did you run the script twice? When you ran it right after making the table external, the table was still internal from the previous run, so the data was deleted by the DROP. Run it again.



Mayur Gupta

Jun 6, 2014, 3:39:02 AM
to mongod...@googlegroups.com
Were you able to replicate the problem?

Justin Lee

Jun 6, 2014, 10:20:01 AM
to mongod...@googlegroups.com
I have not been able to yet, no.


David Beveridge

Jul 14, 2014, 10:09:47 PM
to mongod...@googlegroups.com
Mayur,


I'm trying to use your code here as a sample test on an Amazon Elastic MapReduce cluster (so, a nearly identical setup to what you posted).  Data imports OK, but when querying (directly from the Hive command line on the master server), the MR job fails with:

    java.io.IOException: cannot find class com.mongodb.hadoop.mapred.BSONFileInputFormat
        at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:557)


We've copied the following jars into the /home/hadoop/lib/ directory on both the master and core machines:

    total 620K
    -rw-r--r-- 1 hadoop  72K Jul 11 22:24 json-serde-1.1.9.3-SNAPSHOT.jar
    -rw-r--r-- 1 hadoop 104K Jul 11 22:52 mongo-hadoop-core-1.4.0-SNAPSHOT.jar
    -rw-r--r-- 1 hadoop  21K Jul 11 22:24 mongo-hadoop-hive-1.4.0-SNAPSHOT.jar
    -rw-r--r-- 1 hadoop 408K Jul 14 22:33 mongo-java-driver-2.11.1.jar


Did you have to do any configuration beyond that?

Mayur Gupta

Jul 15, 2014, 2:36:27 PM
to mongod...@googlegroups.com
Hey David,

The reason you didn't get the error on load is that Hive is schema-on-read, so the InputFormat doesn't come into the picture until the table is queried. As for your question, I remember adding the jars on the master only, using an EMR bootstrap action. If you added the jars manually, I think you need to restart the Hadoop daemons.
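
A session-level alternative that avoids restarting the daemons (a sketch; the paths are the ones from your listing, and ADD JAR ships the jars to the MapReduce tasks along with each job):

-- Register the connector jars for the current Hive session only.
ADD JAR /home/hadoop/lib/mongo-hadoop-core-1.4.0-SNAPSHOT.jar;
ADD JAR /home/hadoop/lib/mongo-hadoop-hive-1.4.0-SNAPSHOT.jar;
ADD JAR /home/hadoop/lib/mongo-java-driver-2.11.1.jar;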

David Beveridge

Jul 16, 2014, 1:51:24 PM
to mongod...@googlegroups.com
Mayur,


Thanks for the response! We wound up having to reboot all the machines for them to pick up the libs... next step is creating custom bootstrap actions to do all this, so we're happily back in *familiar* territory.


Cheers!

Mayur Gupta

Jul 17, 2014, 1:58:40 AM
to mongod...@googlegroups.com
Do let me know whether the BSON format works for Hive tables, since, as you saw from the thread, it didn't work for me.



董浩

Aug 5, 2014, 10:48:24 AM
to mongod...@googlegroups.com
Mayur,

I have the same problem as you: BSONFileInputFormat does not work in Hive. Have you resolved it?

Mayur Gupta

Aug 8, 2014, 6:58:40 AM
to mongod...@googlegroups.com
I wrote MapReduce jobs that use BSONFileInputFormat to convert the BSON into text, and the Hive tables are created over that text output.
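
To illustrate that workaround (a sketch only: visits_text, its column list, and the tab-separated layout are all hypothetical, loosely based on the visit table earlier in the thread):

-- Hypothetical table over the text output of the conversion job.
CREATE EXTERNAL TABLE visits_text (
  visitId     STRING,
  browserId   STRING,
  softUserId  STRING,
  userId      STRING,
  matchType   BOOLEAN,
  ts          STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/hive/warehouse/visits_text';

Querying plain text sidesteps BSONFileInputFormat at read time entirely, so Hive's split handling no longer matters.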

