Small percentage of corruption on mongoimport

mark birch

unread,
Feb 9, 2014, 9:49:15 AM2/9/14
to mongod...@googlegroups.com
Hi guys

I'm currently experiencing the joys of data transfer from a single primary to another single primary.

Does anyone have any experience with finding out the reason for data corruption using mongoimport?

Basically:
1. mongoexport to a JSON file on the source server (3.2 million rows)
2. mongoimport of that file into a db on a remote machine (roughly the commands sketched below)
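The commands look roughly like this (the database/collection names, query, and file name here are simplified placeholders, not the literal invocation):

mongoexport --db user_keywords --collection user_data --query '...' --out /tmp/userdata_<date range>.json
mongoimport --host <remote server> --db user_keywords --collection user_data --file /tmp/userdata_<date range>.json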

Then on the remote machine I cannot add an index, due to invalid BSON.
Running db.user_data.validate(true) shows 6 documents with invalid BSON. After deleting one of these and re-importing that single row from the original mongoexport file, the row is no longer corrupt. So the corruption appears to have been introduced by the mongoimport.

As it stands this can't be used in an automated fashion - it's not fit for purpose with these results. Can anyone help me figure out why I'm getting any corruption at all, and how to get rid of it?

Asya Kamsky

unread,
Feb 9, 2014, 11:42:47 AM2/9/14
to mongodb-user
If you are transferring data from one mongod to another mongod you should
NOT be using mongoimport.

Use mongodump - it dumps the data in BSON format. mongoimport and
mongoexport use JSON, which is appropriate if you are extracting data
for use in another (non-MongoDB) system, but it is lossy, since MongoDB
has types that are not standard JSON, and it is extremely inefficient,
since the data has to be converted to JSON just to be converted back
to BSON.
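For a mongod-to-mongod transfer the round trip looks roughly like this (the host, database and collection names below are placeholders, not your actual setup):

mongodump --host source-host --db mydb --collection mycoll --out /tmp/dump
mongorestore --host target-host --db mydb --collection mycoll /tmp/dump/mydb/mycoll.bson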

Asya
P.S. If you'd like to track down the root cause of particular
documents being corrupted, you will need to provide more information:
the version of MongoDB, OS/platform, the contents of the document(s)
that were corrupted, and what was in the (receiving) mongod's logs
during the first import (the one that created the bad rows).

mark birch

unread,
Feb 9, 2014, 3:13:53 PM2/9/14
to mongod...@googlegroups.com
Thanks for your response.

There are a number of items for me to respond to in your reply:

1. Yes, mongoexport is inefficient, but it should either work or not work. For roughly the first 20 million documents it worked perfectly, and then these 6 documents in the 4th mongoimport came through corrupted.

Here is one of the rows from the mongoexport JSON file which generated invalid BSON when imported - the rows are all very similar (I've had to obfuscate the data a bit, as it is sensitive):

{ "_id" : { "$oid" : "52f385693d14c8f96942b9e3" }, "cpn" : "XXXXXXXXXXX", "brand" : "YYY", "ua" : "Windows,Firefox,27", "keywords" : "Home", "section" : "/home", "art_id" : "", "time_offset" : "0", "utag_id" : "12345678", "wruid" : "111111.222222, "ad_cat" : "Sports", "ad_time" : "139169111188611", "ad_size" : "300x250", "ad_slot_size" : "300x250", "ad_slot_pos" : "pos2", "ad_id" : "63811254", "ad_creative_id" : "32848601694", "ad_name" : "ADNAME1", "ad_action" : "view", "ad_order_id" : "1111111", "res" : "1920x1080", "adv_id" : "14176854", "ad_sid" : "4690014", "log_time" : { "$date" : 1391691113771 }, "ua_original" : "Mozilla/5.0 (Windows NT 6.1; rv:27.0) Gecko/20100101 Firefox/27.0", "ip" : "1.1.1.1", "ip_city" : "A CITY", "ip_country" : "United Kingdom", "ip_iso_code" : "GB", "ip_lookup_time" : { "$date" : 1385838098250 }, "ip_org" : "org info", "ip_province" : "England", "is_logged_in" : true, "cookie_value" : "12345-12345-12345", "client_type" : "PC" }

When I later wrote this document out to its own file, deleted the corrupted copy with this _id from the collection, and ran mongoimport on that single-document file, it imported fine - roughly as sketched below.
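(The single-document re-import was nothing fancy; something along these lines, with file names from memory and the simplified names from above:)

grep '52f385693d14c8f96942b9e3' /tmp/userdata_<date range>.json > /tmp/single_doc.json
# in the shell on the second server, remove the corrupted copy first:
#   db.user_data.remove({ _id: ObjectId("52f385693d14c8f96942b9e3") })
mongoimport --host <second server> --db user_keywords --collection user_data --file /tmp/single_doc.json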

The row immediately after it, which was fine during the original import, is this:
{ "_id" : { "$oid" : "52f3856a3d14c8df6942b9dc" }, "cpn" : "XXXXXXXXX", "brand" : "YYY", "ua" : "Windows,Firefox,27", "keywords" : "Home", "section" : "/home", "art_id" : "", "time_offset" : "0", "utag_id" : "123456679", "wruid" : "111111.2222222", "ad_cat" : "Sports", "ad_time" : "139169111188611", "ad_size" : "728x90", "ad_slot_size" : "728x90", "ad_slot_pos" : "pos2", "ad_id" : "63811014", "ad_creative_id" : "32848648974", "ad_name" : "ADNAME1", "ad_action" : "view", "ad_order_id" : "111111", "res" : "1920x1080", "adv_id" : "14176854", "ad_sid" : "4690014", "log_time" : { "$date" : 1391691114006 }, "ua_original" : "Mozilla/5.0 (Windows NT 6.1; rv:27.0) Gecko/20100101 Firefox/27.0", "ip" : "1.1.1.1", "ip_city" : "A CITY", "ip_country" : "United Kingdom", "ip_iso_code" : "GB", "ip_lookup_time" : { "$date" : 1385838098250 }, "ip_org" : "org info", "ip_province" : "England", "is_logged_in" : true, "cookie_value" : "12345-12345-12345", "client_type" : "PC" }

This led me to believe that the corruption was due to a glitch at that moment in time, which is worrying.


2. The reason I used mongoexport/mongoimport and not mongodump/mongorestore is that I have to specify a --query involving an ISODate field (the log_time attribute) in order to export a subset of the data for one day. mongodump seemed to ignore any query referencing the ISODate field, even when I specified it in different ways, e.g. Date(12345678...) or {"$date": 12345678...}; I spent an entire day trying to get this to work (a couple of the variants I tried are sketched below). I would be happy to rewrite the application code to not use ISODate, and to update all the existing documents to provide a Date() field instead, but only if this is demonstrably necessary.
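(From memory, the variants I tried looked roughly like the following; neither selected the date range I wanted. Names and dates are simplified as above.)

mongodump --db user_keywords --collection user_data --query '{log_time: {$gte: Date(1391212800000)}}' --out /tmp/dump
mongodump --db user_keywords --collection user_data --query '{log_time: {$gte: {"$date": 1391212800000}}}' --out /tmp/dump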


The machine with the live data on it is running MongoDB 2.4.8 on a 4-core, non-NUMA box. It runs mongoexport locally, writing to a local file, and then immediately runs mongoimport against the second, remote server.

The second machine is running MongoDB 2.4.9. It is a 32-core, 128 GB RAM machine which sadly uses NUMA. I am running mongod under /usr/bin/numactl --interleave=all to keep it happy with this, roughly as sketched below (let me know if you have any experience with this, please!).
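(A rough sketch of that startup; the config path is the stock Ubuntu package default and the exact wiring into the init script is illustrative. Disabling zone reclaim is the other tweak the MongoDB production notes recommend for NUMA hardware:)

# disable zone reclaim, recommended alongside memory interleaving on NUMA boxes
echo 0 | sudo tee /proc/sys/vm/zone_reclaim_mode
# start mongod with memory interleaved across all NUMA nodes
sudo /usr/bin/numactl --interleave=all /usr/bin/mongod --config /etc/mongodb.conf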

Both are running Ubuntu 12.04


Is there a known issue with mongoimport to a remote host?

Does MongoDB have an 'acceptable corruption percentage' for data imports, such that I just have to live with this and manually fix things each day?


Thanks for any input.

MB

Asya Kamsky

unread,
Feb 9, 2014, 3:27:09 PM2/9/14
to mongodb-user
Mark,

Sorry you're having trouble with this, and I can assure you there is no such thing as an "acceptable level" of corruption from mongoimport or any other tool. mongoimport does run in "unacknowledged" mode for most of its duration, but that just means it wouldn't notify you about a failure to import some records - not that it would silently import data that was somehow corrupt.

How exactly was the data corrupted in the DB? Were some fields missing, did they have the wrong values, or was it something else?

Also, can you provide the exact mongoexport command you used? I'm going to try to replicate your issue with ISODate().

By the way, if you only insert into this collection and never update, you can query on _id rather than on ISODate() - I noticed the timestamp embedded in the _id of your example document is the same as its log_time value. A rough sketch follows.
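(In the shell, a date range can be turned into an _id range by building ObjectIds whose first 4 bytes are the timestamp in seconds - this only lines up with log_time if documents are inserted once and never re-created:)

// lower/upper bounds as seconds since the epoch, in hex
var from = Math.floor(ISODate("2014-02-01T00:00:00Z").getTime() / 1000).toString(16);
var to = Math.floor(ISODate("2014-02-08T00:00:00Z").getTime() / 1000).toString(16);
// pad out to a full 24-character ObjectId and query on _id instead of log_time
db.user_data.find({ _id: { $gte: ObjectId(from + "0000000000000000"),
                           $lte: ObjectId(to + "0000000000000000") } })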

Asya

mark birch

unread,
Feb 9, 2014, 4:22:26 PM2/9/14
to mongod...@googlegroups.com

Thanks - it's good to know that there is no acceptable level of corruption; that answers one of my questions.

The corruption seems to be at the BSON level, which is really bad news. Basically, I tried to add an index back onto the collection and it failed with an error message along the lines of "Invalid BSONElement type: 43" (note: I think it was 43 - I have seen a few different numbers during all the testing I've done).

To test the import, I ran mongoimport with the same JSON file from the first server into a new, empty collection on the second server:


mongoimport --host [second server reference] --db user_keywords_test_import --collection user_data  --file /tmp/userdata_2014-02-01T00:00:00Z_2014-02-08T00:00:00Z.json

This reported:

check 9 3181232
Sat Feb  8 21:14:53.326 imported 3181232 objects

Logged in to server2, I did a complete count on this new, clean collection:
use user_keywords_test_import
db.user_data.count()
3181232

Seems great. However:

db.user_data.count( {log_time: { $gt : ISODate('2014-02-01T00:00:00Z'), $lte : ISODate('2014-02-08T00:00:00Z') } } )

FAILED:

Sat Feb  8 21:24:00.384 count failed: { "ok" : 0, "errmsg" : "10320 BSONElement: bad type 116" } at src/mongo/shell/query.js:180

So, what's wrong? I ran validate(true):


db.user_data.validate(true)
{
    "ns" : "user_keywords_test_import.user_data",
    "firstExtent" : "0:2000 ns:user_keywords_test_import.user_data",
    "lastExtent" : "5:45007000 ns:user_keywords_test_import.user_data",
    "extentCount" : 19,
    "extents" : [
...
all the extents
...
    ],
    "datasize" : 2845897088,
    "nrecords" : 3181232,
    "lastExtentSize" : 840650752,
    "padding" : 1,
    "firstExtentDetails" : {
        "loc" : "0:2000",
        "xnext" : "0:1a000",
        "xprev" : "null",
        "nsdiag" : "user_keywords_test_import.user_data",
        "size" : 49152,
        "firstRecord" : "0:20b0",
        "lastRecord" : "0:dc70"
    },
    "lastExtentDetails" : {
        "loc" : "5:45007000",
        "xnext" : "null",
        "xprev" : "5:1d5cd000",
        "nsdiag" : "user_keywords_test_import.user_data",
        "size" : 840650752,
        "firstRecord" : "5:450070b0",
        "lastRecord" : "5:64669e10"
    },
    "objectsFound" : 3181232,
    "invalidObjects" : 19,
    "bytesWithHeaders" : 2896796800,
    "bytesWithoutHeaders" : 2845897088,
    "deletedCount" : 11,
    "deletedSize" : 313861744,
    "nIndexes" : 1,
    "keysPerIndex" : {
        "user_keywords_test_import.user_data.$_id_" : 3181232
    },
    "valid" : false,
    "errors" : [
        "invalid bson object detected (see logs for more info)"
    ],
    "advice" : "ns corrupt, requires repair",
    "ok" : 1
}


So - looking at /var/log/mongodb/mongo.log I find these:

Sun Feb  9 21:00:20.693 [conn119] Assertion: 10320:BSONElement: bad type 116
0xde46e1 0xda5e1b 0x6eb769 0xa45dd0 0xa47775 0xa5a073 0xa5ab8b 0x8167b1 0xa7f28f 0x8f1c02 0x8e049a 0x8e1402 0x8e2472 0xa85630 0xa89efc 0x9fe119 0x9ff633 0x6e8518 0xdd0cae 0x7fdc65897e9a
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xde46e1]
 /usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x9b) [0xda5e1b]
 /usr/bin/mongod(_ZNK5mongo11BSONElement4sizeEv+0x1f9) [0x6eb769]
 /usr/bin/mongod(_ZNK5mongo7Matcher13matchesDottedEPKcRKNS_11BSONElementERKNS_7BSONObjEiRKNS_14ElementMatcherEbPNS_12MatchDetailsE+0x14f0) [0xa45dd0]
 /usr/bin/mongod(_ZNK5mongo7Matcher7matchesERKNS_7BSONObjEPNS_12MatchDetailsE+0xe5) [0xa47775]
 /usr/bin/mongod(_ZNK5mongo19CoveredIndexMatcher7matchesERKNS_7BSONObjERKNS_7DiskLocEPNS_12MatchDetailsEb+0xd3) [0xa5a073]
 /usr/bin/mongod(_ZNK5mongo19CoveredIndexMatcher14matchesCurrentEPNS_6CursorEPNS_12MatchDetailsE+0xeb) [0xa5ab8b]
 /usr/bin/mongod(_ZN5mongo6Cursor14currentMatchesEPNS_12MatchDetailsE+0x41) [0x8167b1]
 /usr/bin/mongod(_ZN5mongo8runCountEPKcRKNS_7BSONObjERSsRi+0x9af) [0xa7f28f]
 /usr/bin/mongod(_ZN5mongo8CmdCount3runERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x122) [0x8f1c02]
 /usr/bin/mongod(_ZN5mongo12_execCommandEPNS_7CommandERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x3a) [0x8e049a]
 /usr/bin/mongod(_ZN5mongo7Command11execCommandEPS0_RNS_6ClientEiPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0xc02) [0x8e1402]
 /usr/bin/mongod(_ZN5mongo12_runCommandsEPKcRNS_7BSONObjERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x5f2) [0x8e2472]
 /usr/bin/mongod(_ZN5mongo11runCommandsEPKcRNS_7BSONObjERNS_5CurOpERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x40) [0xa85630]
 /usr/bin/mongod(_ZN5mongo8runQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_+0xd7c) [0xa89efc]
 /usr/bin/mongod() [0x9fe119]
 /usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x383) [0x9ff633]
 /usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x98) [0x6e8518]
 /usr/bin/mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x42e) [0xdd0cae]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7fdc65897e9a]
Sun Feb  9 21:00:20.699 [conn119] Count with ns: user_keywords_test_import.user_data and query: { log_time: { $gt: new Date(1391212800000), $lte: new Date(1391817600000) } } failed with exception: 10320 BSONElement: bad type 116 code: 10320
Sun Feb  9 21:00:20.699 [conn119] command user_keywords_test_import.$cmd command: { count: "user_data", query: { log_time: { $gt: new Date(1391212800000), $lte: new Date(1391817600000) } }, fields: {} } ntoreturn:1 keyUpdates:0 numYields: 3 locks(micros) r:4273449 reslen:81 2146ms
Sun Feb  9 21:01:33.141 [conn119] Invalid bson detected in user_keywords_test_import.user_data with _id: ObjectId('52eea6283d14c8ef2d27ae11')
Sun Feb  9 21:01:33.142 [conn119] Invalid bson detected in user_keywords_test_import.user_data with _id: ObjectId('52eea62b3d14c8992d27ae10')
Sun Feb  9 21:01:33.142 [conn119] Invalid bson detected in user_keywords_test_import.user_data with _id: ObjectId('52eea6303d14c8992d27ae13')
Sun Feb  9 21:01:33.142 [conn119] Invalid bson detected in user_keywords_test_import.user_data with _id: ObjectId('52eea6323d14c8db2d27ae16')
Sun Feb  9 21:01:43.195 [conn119] Invalid bson detected in user_keywords_test_import.user_data with _id: ObjectId('52f5113d3d14c8d62642b9d6')
Sun Feb  9 21:01:43.195 [conn119] Invalid bson detected in user_keywords_test_import.user_data with _id: ObjectId('52f5113f3d14c87b2642b9d9')
Sun Feb  9 21:01:43.195 [conn119] Invalid bson detected in user_keywords_test_import.user_data with _id: ObjectId('52f511423d14c8e62642b9ce')
Sun Feb  9 21:01:43.294 [conn119] Invalid bson detected in user_keywords_test_import.user_data with _id: ObjectId('52f51f493d14c8483542b9ea')
Sun Feb  9 21:01:43.294 [conn119] Invalid bson detected in user_keywords_test_import.user_data with _id: ObjectId('52f51f493d14c8723542b9e4')
Sun Feb  9 21:01:43.294 [conn119] Invalid bson detected in user_keywords_test_import.user_data with _id: ObjectId('52f51f503d14c8503642b9d3')
Sun Feb  9 21:01:43.294 [conn119] Invalid bson detected in user_keywords_test_import.user_data with _id: ObjectId('52f51f503d14c80a3642b9dc')
Sun Feb  9 21:01:43.294 [conn119] Invalid bson detected in user_keywords_test_import.user_data with _id: ObjectId('52f51f503d14c8993542b9e3')
Sun Feb  9 21:01:43.384 [conn119] Invalid bson detected in user_keywords_test_import.user_data with _id: ObjectId('52f52cf23d14c8694542b9ec')
Sun Feb  9 21:01:43.384 [conn119] Invalid bson detected in user_keywords_test_import.user_data with _id: ObjectId('52f52cf43d14c8c74542b9e6')
Sun Feb  9 21:01:43.384 [conn119] Invalid bson detected in user_keywords_test_import.user_data with _id: ObjectId('52f52cf73d14c8e24542b9dc')
Sun Feb  9 21:01:43.384 [conn119] Invalid bson detected in user_keywords_test_import.user_data with _id: ObjectId('52f52cf73d14c82e4642b9d6')
Sun Feb  9 21:01:43.384 [conn119] Invalid bson detected in user_keywords_test_import.user_data with _id: ObjectId('52f52cf83d14c8364542b9f1')
Sun Feb  9 21:01:43.384 [conn119] Invalid bson detected in user_keywords_test_import.user_data with _id: ObjectId('52f52cfa3d14c8674642b9d1')
Sun Feb  9 21:01:43.384 [conn119] Invalid bson detected in user_keywords_test_import.user_data with _id: ObjectId('52f52cfa3d14c84e4642b9d1')
Sun Feb  9 21:01:43.810 [conn119] validating index 0: user_keywords_test_import.user_data.$_id_
Sun Feb  9 21:01:43.932 [conn119] command user_keywords_test_import.$cmd command: { validate: "user_data", full: true } ntoreturn:1 keyUpdates:0 locks(micros) r:13831388 reslen:4351 13831ms


Understanding this is beyond me - I just want to finish developing the app so that it does *reliable stuff*, and it's already way behind schedule!

You also mentioned that the _id field carries the same timestamp as the log_time value - can you let me know how you get the time value out of the _id? I'd be very interested to learn about this.

e.g. for one document:

 "_id" : { "$oid" : "52f385693d14c8f96942b9e3" } ... "log_time" : { "$date" : 1391691113771 }


MB

Asya Kamsky

unread,
Feb 9, 2014, 6:20:21 PM2/9/14
to mongodb-user
Mark,

A few things I can suggest to help. First of all, can you make sure that the collection you're dumping from passes validate as well?

Second, you can use mongodump the same way you would use mongoexport (and then, if there is any BSON corruption in the source, you will know about it right away).

Here's my test on your sample document:
$ mongoexport -d gg -c badjson --query '{log_time:{$gt:Date(1391691114000)}}'
connected to: 127.0.0.1
{ "_id" : { "$oid" : "52f3856a3d14c8df6942b9dc" }, "cpn" : "XXXXXXXXX", "brand" : "YYY", "ua" : "Windows,Firefox,27", "keywords" : "Home", "section" : "/home", "art_id" : "", "time_offset" : "0", "utag_id" : "123456679", "wruid" : "111111.2222222", "ad_cat" : "Sports", "ad_time" : "139169111188611", "ad_size" : "728x90", "ad_slot_size" : "728x90", "ad_slot_pos" : "pos2", "ad_id" : "63811014", "ad_creative_id" : "32848648974", "ad_name" : "ADNAME1", "ad_action" : "view", "ad_order_id" : "111111", "res" : "1920x1080", "adv_id" : "14176854", "ad_sid" : "4690014", "log_time" : { "$date" : "2014-02-06T07:51:54.006-0500" }, "ua_original" : "Mozilla/5.0 (Windows NT 6.1; rv:27.0) Gecko/20100101 Firefox/27.0", "ip" : "1.1.1.1", "ip_city" : "A CITY", "ip_country" : "United Kingdom", "ip_iso_code" : "GB", "ip_lookup_time" : { "$date" : "2013-11-30T14:01:38.250-0500" }, "ip_org" : "org info", "ip_province" : "England", "is_logged_in" : true, "cookie_value" : "12345-12345-12345", "client_type" : "PC" }
exported 1 records
asyasmacbook:asya-mongo asya13$ mongodump -d gg -c badjson --query '{log_time:{$gt:Date(1391691114000)}}' -o /tmp/ddate
connected to: 127.0.0.1
2014-02-09T18:13:11.629-0500 DATABASE: gg to /tmp/ddate/gg
2014-02-09T18:13:11.631-0500 gg.badjson to /tmp/ddate/gg/badjson.bson
2014-02-09T18:13:11.632-0500 1 objects
2014-02-09T18:13:11.632-0500 Metadata for gg.badjson to /tmp/ddate/gg/badjson.metadata.json

How I used _id to see that it was the same timestamp as log_time:

> db.badjson.findOne({},{_id:1,log_time:1})
{
"_id" : ObjectId("52f3856a3d14c8df6942b9dc"),
"log_time" : ISODate("2014-02-06T12:51:54.006Z")
}
> ObjectId("52f3856a3d14c8df6942b9dc").getTimestamp()
ISODate("2014-02-06T12:51:54Z")
> ObjectId("52f3856a3d14c8df6942b9dc").getTimestamp().getTime()
1391691114000

Note that the ISODate has higher (millisecond) precision than the ObjectId timestamp; not sure if that matters to you...

But you should be able to use either form in a query filter with mongodump (which I strongly recommend rather than mongoexport/mongoimport).

I'd love to get to the bottom of what's happening to your JSON during mongoimport, but IMHO you will get your project finished much faster if you switch (back) to mongodump/mongorestore...

Asya


mark birch

unread,
Feb 9, 2014, 7:35:01 PM2/9/14
to mongod...@googlegroups.com
Hi Asya

Thanks for your help & useful information regarding this.

After looking at your examples I was able to figure out where I was going wrong with my mongodump attempts, and I can now write a query that mongodump accepts to select the part of the collection I want - roughly the command sketched below. I will switch over to using mongodump & mongorestore, and hopefully these issues will disappear.
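(For anyone finding this thread later, the form that now works for me looks roughly like this, run on the source server, with names simplified as in my earlier posts:)

mongodump --db user_keywords --collection user_data --query '{log_time: {$gt: Date(1391212800000), $lte: Date(1391817600000)}}' --out /tmp/userdata_dump
mongorestore --host <second server> --db user_keywords --collection user_data /tmp/userdata_dump/user_keywords/user_data.bson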

I will also validate the original collection.

regards
Mark