Mongodump - bad performance on a particular collection


Paul McAdam

Jul 9, 2013, 12:27:39 PM
to mongod...@googlegroups.com
Hi,

We have a Mongo cluster running on VMware. The underlying hardware consists of HP G7s with RAID 5 15K SAS drives, so it's pretty hefty. One of our databases contains a particular collection with about 10 million objects that takes up 11GB. The database holds other large collections too; for example, another collection has 2.2 million objects and takes up 1.1GB. All contain simple objects with just text and date fields; nothing unusual is stored.

When I do a mongodump (on a secondary), the 10 million object collection takes around 12-14 hours to dump, but the 2.2 million object collection only takes about 30 seconds! Here is the output from mongodump:

Tue Jul  9 16:52:18 analytica.staging_rfq2s to /data/dump/analytica/staging_rfq2s.bson
Tue Jul  9 16:52:21 250400/2281649 10% (objects)
Tue Jul  9 16:52:24 577200/2281649 25% (objects)
Tue Jul  9 16:52:27 898000/2281649 39% (objects)
Tue Jul  9 16:52:30 1198900/2281649 52% (objects)
Tue Jul  9 16:52:33 1549600/2281649 67% (objects)
Tue Jul  9 16:52:36 1899700/2281649 83% (objects)
Tue Jul  9 16:52:39 2252600/2281649 98% (objects)
Tue Jul  9 16:52:39 2281649 objects
Tue Jul  9 16:52:39 Metadata for analytica.staging_rfq2s to /data/dump/analytica/staging_rfq2s.metadata.json
Tue Jul  9 16:52:39 analytica.snowplow_events to /data/dump/analytica/snowplow_events.bson
Tue Jul  9 16:53:04 200/10881422 0% (objects)
Tue Jul  9 16:53:27 4300/10881422 0% (objects)
Tue Jul  9 16:53:49 8500/10881422 0% (objects)
Tue Jul  9 16:54:10 12700/10881422 0% (objects)
Tue Jul  9 16:54:31 16900/10881422 0% (objects)
Tue Jul  9 16:54:52 21000/10881422 0% (objects)
Tue Jul  9 16:55:14 25200/10881422 0% (objects)
Tue Jul  9 16:55:37 29400/10881422 0% (objects)
Tue Jul  9 16:55:58 33600/10881422 0% (objects)
Tue Jul  9 16:56:17 37800/10881422 0% (objects)
Tue Jul  9 16:56:37 41900/10881422 0% (objects)
Tue Jul  9 16:56:56 46100/10881422 0% (objects)
Tue Jul  9 16:57:17 50300/10881422 0% (objects)
Tue Jul  9 16:57:36 54500/10881422 0% (objects)
Tue Jul  9 16:57:55 58700/10881422 0% (objects)
Tue Jul  9 16:58:15 62900/10881422 0% (objects)
Tue Jul  9 16:58:34 67000/10881422 0% (objects)

You can see that the pace slows to a crawl. Throughout the dump, disk I/O and CPU usage are fairly high, but not maxing out the VMware instance or the underlying host's resources. CPU and disk usage are the same whether it's the fast-dumping collection or the slow one.

I've tried it with and without --oplog set; it makes no difference. I also tried running it directly on the same instance as the Mongo secondary and from another instance over the network. Again, no difference.

The obvious thought is that the _id index on the slow collection is missing or corrupt. It's not missing:

rs1:SECONDARY> db.snowplow_events.getIndexes()
[
    {
        "v" : 1,
        "key" : {
            "_id" : 1
        },
        "ns" : "analytica.snowplow_events",
        "name" : "_id_"
    },
    {
        "v" : 1,
        "key" : {
            "analytica_updated_at" : 1
        },
        "ns" : "analytica.snowplow_events",
        "name" : "analytica_updated_at_1",
        "background" : true
    }
]

How can I check if it's corrupt? Searches on the collection return fine. Can the index on _id be rebuilt?
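For reference, the shell commands I'm thinking of, assuming validate() is the right corruption check and that reIndex() would have to run against the primary since this is a secondary:

db.snowplow_events.validate(true)   // full validation, including the indexes
db.snowplow_events.reIndex()        // drops and rebuilds every index on the collection, _id_ included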

What else could cause this?

Thanks,
Paul.

Asya Kamsky

Jul 9, 2013, 10:51:59 PM
to mongod...@googlegroups.com
If the _id index cannot fit in RAM while the backup is running, mongod may have to swap parts of the index out in order to load the documents. That would indicate badly under-provisioned RAM, which I don't suppose could be the case without you noticing. Can you provide the output of mongostat while this collection is being dumped, as well as full stats for the collection?
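Something along these lines, run while the dump is in progress (the host name is just a placeholder):

mongostat --host yoursecondary:27017 1

plus db.snowplow_events.stats() from the shell.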




Paul McAdam

Jul 10, 2013, 1:12:59 PM
to mongod...@googlegroups.com
Hi,

Firstly, I should have mentioned: it's version 2.2.2.

The instance has 6GB of memory and we never get close to using it all. Here is the output from mongodump and mongostat whilst dumping the same two collections:

Wed Jul 10 17:08:11 analytica.staging_rfq2s to /data/dump/analytica/staging_rfq2s.bson
Wed Jul 10 17:08:14 97400/2285792 4% (objects)
Wed Jul 10 17:08:17 410400/2285792 17% (objects)
Wed Jul 10 17:08:20 625800/2285792 27% (objects)
Wed Jul 10 17:08:24 946500/2285792 41% (objects)
Wed Jul 10 17:08:27 1320700/2285792 57% (objects)
Wed Jul 10 17:08:30 1455800/2285792 63% (objects)
Wed Jul 10 17:08:34 1777700/2285792 77% (objects)
Wed Jul 10 17:08:37 2137600/2285792 93% (objects)
Wed Jul 10 17:08:39 2285792 objects
Wed Jul 10 17:08:39 Metadata for analytica.staging_rfq2s to /data/dump/analytica/staging_rfq2s.metadata.json
Wed Jul 10 17:08:39 analytica.snowplow_events to /data/dump/analytica/snowplow_events.bson
Wed Jul 10 17:09:04 200/10937305 0% (objects)
Wed Jul 10 17:09:28 4300/10937305 0% (objects)
Wed Jul 10 17:09:49 8500/10937305 0% (objects)

insert  query update delete getmore command flushes mapped  vsize    res faults       locked db idx miss %     qr|qw   ar|aw  netIn netOut  conn set repl       time
    *0     *0     *2     *0      15     3|0       0    38g  76.8g  2.38g     79  analytica:0.5%          0       0|0     1|0   389b     2k    10 rs1  SEC   17:08:04
    *0     *0     *5     *0      14     3|0       0    38g  76.8g   1.8g     86  analytica:6.0%          0       0|0     1|0   388b     2k    10 rs1  SEC   17:08:05
    *0     *0     *5     *0      22     3|0       0    38g  76.8g  1.93g     41  analytica:0.3%          0       0|0     1|0   389b     2k    10 rs1  SEC   17:08:06
    *0      4     *0     *0      22     7|0       0    38g  76.8g  2.08g     13  analytica:0.0%          0       0|0     1|0   951b     7k    10 rs1  SEC   17:08:07
    *0     *0     *0     *0       0     4|0       0    38g  76.8g  2.08g      2  analytica:0.0%          0       0|0     1|0   452b    30k    10 rs1  SEC   17:08:08
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  2.08g      3  analytica:0.0%          0       0|0     0|0   388b     2k    10 rs1  SEC   17:08:09
    *0     *0     *0     *0      12     5|0       0    38g  76.8g  2.14g      6  analytica:0.0%          0       0|0     1|0   598b     3k    10 rs1  SEC   17:08:10
    *0      4     *0     *0      25     3|0       0    38g  76.8g  2.27g     27  analytica:0.0%          0       0|0     1|0   388b     2k    10 rs1  SEC   17:08:11
    *0      3     *0     *0      16     6|0       0    38g  76.8g  2.32g    209  analytica:0.0%          0       0|1     1|0   910b   953m    10 rs1  SEC   17:08:12
    *0     *0     *0     *0       0     4|0       0    38g  76.8g  2.32g      2  analytica:0.0%          0       0|0     1|0   451b    30k    10 rs1  SEC   17:08:13
insert  query update delete getmore command flushes mapped  vsize    res faults       locked db idx miss %     qr|qw   ar|aw  netIn netOut  conn set repl       time
    *0     *0     *0     *0       2     3|0       0    38g  76.8g  2.33g     14  analytica:0.0%          0       0|0     1|0   389b     2k    10 rs1  SEC   17:08:14
    *0     *0     *0     *0      14     3|0       0    38g  76.8g  2.39g     56  analytica:0.0%          0       0|0     1|0   388b     2k    10 rs1  SEC   17:08:15
    *0     *0     *0     *0      17     3|0       0    38g  76.8g  2.45g     87  analytica:0.1%          0       0|0     1|0   389b     2k    10 rs1  SEC   17:08:16
    *0     *0     *0     *0       9     3|0       0    38g  76.8g  2.46g     42  analytica:4.2%          0       0|0     1|0   388b     2k    10 rs1  SEC   17:08:17
    *0     *0     *0     *0       0     6|0       0    38g  76.8g  2.46g      2  analytica:0.0%          0       0|0     1|0   661b    30k    10 rs1  SEC   17:08:18
    *0     *0     *0     *0      13     3|0       0    38g  76.8g  2.49g     96  analytica:0.0%          0       0|0     1|0   388b     2k    10 rs1  SEC   17:08:19
    *0     *0     *0     *0      18     3|0       1    38g  76.8g  2.52g    165  analytica:0.0%          0       0|0     1|0   389b     2k    10 rs1  SEC   17:08:20
    *0     *0     *0     *0      18     3|0       0    38g  76.8g  2.55g    141  analytica:0.0%          0       0|0     1|0   388b     2k    10 rs1  SEC   17:08:21
    *0     *0     *0     *0      16     3|0       0    38g  76.8g  2.57g    102  analytica:0.0%          0       0|0     1|0   389b     2k    10 rs1  SEC   17:08:22
    *0     *0     *0     *0       0    14|0       0    38g  76.8g  2.57g      2  analytica:0.0%          0       0|0     1|0     1k    35k    10 rs1  SEC   17:08:23
insert  query update delete getmore command flushes mapped  vsize    res faults       locked db idx miss %     qr|qw   ar|aw  netIn netOut  conn set repl       time
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  2.57g      3      local:0.0%          0       0|1     1|0   389b     2k    10 rs1  SEC   17:08:24
    *0     *0     *0     *0      16     5|0       0    38g  76.8g  2.61g    152  analytica:0.0%          0       0|0     1|0   596b     3k    10 rs1  SEC   17:08:25
    *0     *0     *0     *0      21     3|0       0    38g  76.8g  2.64g    124  analytica:0.0%          0       0|0     1|0   389b     2k    10 rs1  SEC   17:08:26
    *0     *0     *0     *0      13     3|0       0    38g  76.8g  2.64g     92  analytica:0.0%          0       0|0     1|0   388b     2k    10 rs1  SEC   17:08:27
    *0     *0     *0     *0       0     4|0       0    38g  76.8g  2.64g      2  analytica:0.0%          0       0|0     1|0   452b    29k    10 rs1  SEC   17:08:28
    *0     *0     *0     *0       2     3|0       0    38g  76.8g  2.65g     26  analytica:0.0%          0       0|0     1|0   388b     2k    10 rs1  SEC   17:08:29
    *0     *0     *0     *0      18     3|0       0    38g  76.8g  2.67g    130  analytica:0.0%          0       0|0     1|0   389b     2k    10 rs1  SEC   17:08:30
    *0     *0     *0     *0      17     3|0       0    38g  76.8g  2.67g    179  analytica:0.0%          0       0|0     1|0   388b     2k    10 rs1  SEC   17:08:31
    *0     *0     *0     *0      14     3|0       0    38g  76.8g  2.65g    160  analytica:0.0%          0       0|0     1|0   389b     2k    10 rs1  SEC   17:08:32
    *0     *0     *0     *0       0     6|0       0    38g  76.8g  2.65g      2  analytica:0.0%          0       0|0     1|0   659b    29k    10 rs1  SEC   17:08:33
insert  query update delete getmore command flushes mapped  vsize    res faults       locked db idx miss %     qr|qw   ar|aw  netIn netOut  conn set repl       time
    *0      4     *0     *0       3     7|0       0    38g  76.8g  2.67g      4  analytica:0.0%          0       0|0     1|0   952b     7k    10 rs1  SEC   17:08:34
    *1     *0     *0     *0      13     3|0       0    38g  76.8g  1.25g     12  analytica:0.1%          0       1|0     1|0   388b     2k    10 rs1  SEC   17:08:35
    *0     *0     *0     *0      19     3|0       0    38g  76.8g  1.32g     12  analytica:8.6%          0       0|0     1|0   389b     2k    10 rs1  SEC   17:08:36
    *0     *0     *0     *0      13     3|0       0    38g  76.8g  1.36g     13  analytica:0.0%          0       0|0     1|0   388b     2k    10 rs1  SEC   17:08:37
    *0     *0     *0     *0       0     4|0       0    38g  76.8g  1.36g      2  analytica:0.0%          0       0|0     1|0   452b    29k    10 rs1  SEC   17:08:38
    *0     *0     *0     *0      10     3|0       0    38g  76.8g   1.4g     14  analytica:0.0%          0       0|0     1|0   388b     2k    10 rs1  SEC   17:08:39
    *0      1     *0     *0       5     6|0       0    38g  76.8g  1.39g    119  analytica:0.0%          0       0|0     0|0   766b     1g    10 rs1  SEC   17:08:40
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  1.39g    154  analytica:0.0%          0       0|0     1|0   388b     2k    10 rs1  SEC   17:08:41
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  1.39g    155  analytica:0.0%          0       0|0     1|0   389b     2k    10 rs1  SEC   17:08:42
    *0     *0     *0     *0       0     4|0       0    38g  76.8g  1.38g     94  analytica:0.0%          0       0|0     1|0   451b    29k    10 rs1  SEC   17:08:43
insert  query update delete getmore command flushes mapped  vsize    res faults       locked db idx miss %     qr|qw   ar|aw  netIn netOut  conn set repl       time
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  1.38g    100  analytica:0.0%          0       0|0     1|0   389b     2k    10 rs1  SEC   17:08:44
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  1.38g    155  analytica:0.0%          0       0|0     1|0   388b     2k    10 rs1  SEC   17:08:45
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  1.39g    153  analytica:0.0%          0       0|0     0|0   389b     2k    10 rs1  SEC   17:08:46
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  1.39g    158  analytica:0.0%          0       0|0     1|0   388b     2k    10 rs1  SEC   17:08:47
    *0     *0     *0     *0       0     6|0       0    38g  76.8g  1.38g    152  analytica:0.0%          0       0|0     0|0   661b    29k    10 rs1  SEC   17:08:48
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  1.38g    158  analytica:0.0%          0       0|0     1|0   388b     2k    10 rs1  SEC   17:08:49
    *0     *0     *0     *0       0    13|0       0    38g  76.8g  1.38g    162  analytica:0.0%          0       0|0     1|0     1k     8k    10 rs1  SEC   17:08:50
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  1.38g    150  analytica:0.0%          0       0|0     0|0   388b     2k    10 rs1  SEC   17:08:51
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  1.38g    155  analytica:0.0%          0       0|0     1|0   389b     2k    10 rs1  SEC   17:08:52
    *0     *0     *0     *0       0     4|0       0    38g  76.8g  1.38g    108  analytica:0.0%          0       0|0     1|0   451b    29k    10 rs1  SEC   17:08:53
insert  query update delete getmore command flushes mapped  vsize    res faults       locked db idx miss %     qr|qw   ar|aw  netIn netOut  conn set repl       time
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  1.38g    157  analytica:0.0%          0       0|0     1|0   389b     2k    10 rs1  SEC   17:08:54
    *0     *0     *0     *0       0     5|0       0    38g  76.8g  1.38g    156  analytica:0.0%          0       0|0     1|0   597b     3k    10 rs1  SEC   17:08:55
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  1.37g    149  analytica:0.0%          0       0|0     0|0   389b     2k    10 rs1  SEC   17:08:56
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  1.37g    144  analytica:0.0%          0       0|0     1|0   388b     2k    10 rs1  SEC   17:08:57
    *0     *0     *0     *0       0     4|0       0    38g  76.8g  1.36g    151  analytica:0.0%          0       0|1     1|0   452b    28k    10 rs1  SEC   17:08:58
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  1.36g    148  analytica:0.0%          0       0|0     0|0   388b     2k    10 rs1  SEC   17:08:59
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  1.36g    151  analytica:0.0%          0       0|0     1|0   389b     2k    10 rs1  SEC   17:09:00
    *0      4     *0     *0       0     7|0       0    38g  76.8g  1.36g    140  analytica:0.0%          0       0|0     0|0   951b     7k    10 rs1  SEC   17:09:01
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  1.37g    135  analytica:0.0%          0       0|0     0|0   389b     2k    10 rs1  SEC   17:09:02
    *0     *0     *0     *0       0     6|0       0    38g  76.8g  1.37g    137  analytica:0.0%          0       0|0     0|0   660b    28k    10 rs1  SEC   17:09:03
insert  query update delete getmore command flushes mapped  vsize    res faults       locked db idx miss %     qr|qw   ar|aw  netIn netOut  conn set repl       time
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  1.37g    152  analytica:0.0%          0       0|0     1|0   389b     2k    10 rs1  SEC   17:09:04
    *0     *0     *0     *0       1     3|0       0    38g  76.8g  1.37g    160  analytica:0.0%          0       0|0     1|0   388b     2k    10 rs1  SEC   17:09:05
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  1.37g    150  analytica:0.0%          0       0|0     1|0   389b     2k    10 rs1  SEC   17:09:06
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  1.37g    150  analytica:0.0%          0       0|0     1|0   388b     2k    10 rs1  SEC   17:09:07
    *0     *0     *0     *0       0     4|0       0    38g  76.8g  1.37g    141  analytica:0.0%          0       0|0     0|0   452b    28k    10 rs1  SEC   17:09:08
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  1.37g    144  analytica:0.0%          0       0|0     1|0   388b     2k    10 rs1  SEC   17:09:09
    *0      4     *0     *0       0     5|0       0    38g  76.8g  1.37g    156  analytica:0.0%          0       0|0     1|0   598b     3k    10 rs1  SEC   17:09:10
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  1.37g    144  analytica:0.0%          0       0|0     1|0   388b     2k    10 rs1  SEC   17:09:11
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  1.37g    147  analytica:0.0%          0       0|0     1|0   389b     2k    10 rs1  SEC   17:09:12
    *0     *0     *0     *0       0     4|0       0    38g  76.8g  1.37g    153  analytica:0.0%          0       0|0     1|0   451b    28k    10 rs1  SEC   17:09:13
insert  query update delete getmore command flushes mapped  vsize    res faults       locked db idx miss %     qr|qw   ar|aw  netIn netOut  conn set repl       time
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  1.37g    154  analytica:0.0%          0       0|0     0|0   389b     2k    10 rs1  SEC   17:09:14
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  1.36g    148  analytica:0.0%          0       0|0     1|0   388b     2k    10 rs1  SEC   17:09:15
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  1.36g    159  analytica:0.0%          0       0|0     1|0   389b     2k    10 rs1  SEC   17:09:16
    *0     *0     *0     *0       0    13|0       0    38g  76.8g  1.36g    142  analytica:0.0%          0       0|0     1|0     1k     8k    10 rs1  SEC   17:09:17
    *0     *0     *0     *0       0     6|0       0    38g  76.8g  1.36g    151  analytica:0.0%          0       0|0     1|0   661b    28k    10 rs1  SEC   17:09:18
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  1.36g    154  analytica:0.0%          0       0|0     0|0   388b     2k    10 rs1  SEC   17:09:19
    *0     *0     *0     *0       0     3|0       1    38g  76.8g  1.37g    129  analytica:0.0%          0       0|0     1|0   389b     2k    10 rs1  SEC   17:09:20
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  1.37g    141  analytica:0.0%          0       0|0     1|0   388b     2k    10 rs1  SEC   17:09:21
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  1.37g    160  analytica:0.0%          0       0|0     1|0   389b     2k    10 rs1  SEC   17:09:22
    *0     *0     *0     *0       0     4|0       0    38g  76.8g  1.37g    161  analytica:0.0%          0       0|0     1|0   451b    28k    10 rs1  SEC   17:09:23
insert  query update delete getmore command flushes mapped  vsize    res faults       locked db idx miss %     qr|qw   ar|aw  netIn netOut  conn set repl       time
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  1.36g    150  analytica:0.0%          0       0|0     1|0   389b     2k    10 rs1  SEC   17:09:24
    *0     *0     *0     *0       0     5|0       0    38g  76.8g  1.35g    161  analytica:0.0%          0       0|0     1|0   597b     3k    10 rs1  SEC   17:09:25
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  1.35g    157  analytica:0.0%          0       0|0     1|0   389b     2k    10 rs1  SEC   17:09:26
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  1.35g    153  analytica:0.0%          0       0|0     0|0   388b     2k    10 rs1  SEC   17:09:27
    *0      4     *0     *0       0     8|0       0    38g  76.8g  1.35g    156  analytica:0.0%          0       0|0     1|0     1k    32k    10 rs1  SEC   17:09:28
    *0     *0     *0     *0       1     3|0       0    38g  76.8g  1.34g    155  analytica:0.0%          0       0|0     1|0   388b     2k    10 rs1  SEC   17:09:29
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  1.34g    139  analytica:0.0%          0       0|0     1|0   389b     2k    10 rs1  SEC   17:09:30
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  1.34g    159  analytica:0.0%          0       0|0     1|0   388b     2k    10 rs1  SEC   17:09:31
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  1.33g    153  analytica:0.0%          0       0|0     0|0   389b     2k    10 rs1  SEC   17:09:32
    *0     *0     *0     *0       0     6|0       0    38g  76.8g  1.33g    157  analytica:0.0%          0       0|0     0|0   660b    27k    10 rs1  SEC   17:09:33
insert  query update delete getmore command flushes mapped  vsize    res faults       locked db idx miss %     qr|qw   ar|aw  netIn netOut  conn set repl       time
    *0     *0     *0     *0       0     4|0       0    38g  76.8g  1.33g    159  analytica:0.0%          0       0|0     1|0   447b     3k    10 rs1  SEC   17:09:34
    *0     *0     *5     *0       0     3|0       0    38g  76.8g  1.33g    157  analytica:1.7%          0       0|0     1|0   388b     2k    10 rs1  SEC   17:09:35
    *0     *0     *1     *0       0     3|0       0    38g  76.8g  1.32g    160  analytica:0.2%          0       0|0     1|0   389b     2k    10 rs1  SEC   17:09:36
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  1.25g    161  analytica:0.7%          0       0|0     1|0   388b     2k    10 rs1  SEC   17:09:37
    *0     *0     *0     *0       0     4|0       0    38g  76.8g  1.25g    143  analytica:0.0%          0       0|0     0|0   452b    27k    10 rs1  SEC   17:09:38
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  1.23g    152  analytica:0.0%          0       0|0     0|0   388b     2k    10 rs1  SEC   17:09:39
    *0     *0     *0     *0       0     5|0       0    38g  76.8g  1.22g    150  analytica:0.0%          0       0|0     1|0   598b     3k    10 rs1  SEC   17:09:40
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  1.21g    162  analytica:0.0%          0       0|0     1|0   388b     2k    10 rs1  SEC   17:09:41
    *0     *0     *0     *0       0     3|0       0    38g  76.8g  1.21g    161  analytica:0.0%          0       0|0     1|0   389b     2k    10 rs1  SEC   17:09:42
    *0     *0     *0     *0       0     4|0       0    38g  76.8g  1.21g    154  analytica:0.0%          0       0|0     1|0   451b    27k    10 rs1  SEC   17:09:43
insert  query update delete getmore command flushes mapped  vsize    res faults       locked db idx miss %     qr|qw   ar|aw  netIn netOut  conn set repl       time
    *0     *0     *0     *0       0    13|0       0    38g  76.8g  1.21g    158  analytica:0.0%          0       0|0     1|0     1k     8k    10 rs1  SEC   17:09:44
    *0     *0     *0     *0       0     3|0       0    38g  76.8g   1.2g    162  analytica:0.0%          0       0|0     1|0   388b     2k    10 rs1  SEC   17:09:45
    *0     *0     *0     *0       0     3|0       0    38g  76.8g   1.2g    162  analytica:0.0%          0       0|0     1|0   389b     2k    10 rs1  SEC   17:09:46
    *0     *0     *0     *0       0     3|0       0    38g  76.8g   1.2g    146  analytica:0.0%          0       0|0     0|0   388b     2k    10 rs1  SEC   17:09:47
    *0     *0     *0     *0       0     4|0       0    38g  76.8g   1.2g    152  analytica:0.0%          0       0|0     1|0   346b    27k    10 rs1  SEC   17:09:48
    *0     *0     *6     *0       1     5|0       0    38g  76.8g  1.19g    158  analytica:0.5%          0       0|0     1|0   703b     3k    10 rs1  SEC   17:09:49
    *0     *0     *0     *0       0     2|0       0    38g  76.8g  1.19g    155  analytica:0.1%          0       0|0     1|0   225b     2k    10 rs1  SEC   17:09:50
    *0     *0     *0     *0       0     4|0       0    38g  76.8g  1.19g    159  analytica:0.0%          0       0|0     1|0   552b     3k    10 rs1  SEC   17:09:51
    *0     *0     *0     *0       0     2|0       0    38g  76.8g  1.19g    160  analytica:0.0%          0       0|0     1|0   225b     2k    10 rs1  SEC   17:09:52
    *0     *0     *0     *0       0     5|0       0    38g  76.8g  1.18g    158  analytica:0.0%          0       0|0     1|0   615b    27k    10 rs1  SEC   17:09:53

You can see that resident memory never gets close to 6GB. We have page faults throughout, and the fault rate does go up a bit on the slow collection, but that doesn't seem like enough to slow it down so much. What is a 'bad' number of page faults?

The collection stats are as follows:

rs1:SECONDARY> db.snowplow_events.stats()
{
    "ns" : "analytica.snowplow_events",
    "count" : 10941460,
    "size" : 11000417908,
    "avgObjSize" : 1005.3884863628803,
    "storageSize" : 13296381904,
    "numExtents" : 24,
    "nindexes" : 2,
    "lastExtentSize" : 2146426864,
    "paddingFactor" : 1.0000000000000089,
    "systemFlags" : 1,
    "userFlags" : 0,
    "totalIndexSize" : 1215583152,
    "indexSizes" : {
        "_id_" : 914395664,
        "analytica_updated_at_1" : 301187488
    },
    "ok" : 1
}

If I'm reading this correctly, the _id index is about 870MB, so it easily fits in our 6GB of memory.
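Side note in case it helps anyone: I believe stats() takes a scale argument, so the same numbers can be reported in MB directly:

rs1:SECONDARY> db.snowplow_events.stats(1024 * 1024)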

One other thing I've noticed is that the disk write rate slows down dramatically during the slow dump; basically it's almost zero. It's not that the disks can't keep up, it's that they are not being written to. Reads also slow down, from about 30-40MB/s to about 16MB/s. So it seems Mongo is slowly reading the collection and then extremely slowly writing to the dump file. Any idea why?

Thanks,
Paul.

Asya Kamsky

Jul 10, 2013, 8:33:15 PM
to mongod...@googlegroups.com
I'm going to guess your disk readahead is set very high.

You need sudo to see it - run this command:
sudo blockdev --report

The RA column (the second one) is the readahead value, in 512-byte sectors.
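If it does turn out to be high, you can lower it per device; the device name here is just an example:

sudo blockdev --setra 32 /dev/sdb

32 sectors corresponds to 16KB of readahead.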

Asya

Paul McAdam

Jul 11, 2013, 12:56:56 PM
to mongod...@googlegroups.com
Thanks. I tried setting it all the way down to 16 (and up to 1024, just in case), but it made no difference. In fact it seemed slightly worse.

The cluster is spread across VMware servers in our office and AWS instances. I tried doing the backup on AWS with the same result. It really seems to be a Mongo problem rather than an OS or hardware one.

Asya Kamsky

Jul 12, 2013, 6:04:22 AM
to mongod...@googlegroups.com
Did you restart mongod after lowering the readahead? Otherwise it has no effect.




Paul McAdam

Jul 12, 2013, 11:13:45 AM
to mongod...@googlegroups.com
No difference :(

Paul McAdam

Jul 12, 2013, 11:32:45 AM
to mongod...@googlegroups.com
If I use --forceTableScan it dumps fine, which implies that something is wrong with the _id index, even though it passes validation. I'll re-index and see if that helps, though as it's a replica set it's going to be a pain.
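For the record, the dump command was something like this (host argument omitted; /data/dump is the same target as before):

mongodump --db analytica --collection snowplow_events --out /data/dump --forceTableScan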

Asya Kamsky

Jul 12, 2013, 9:09:08 PM
to mongodb-user
That is strange.  Let us know what happens when you reIndex.  




Paul McAdam

Jul 17, 2013, 1:27:22 PM
to mongod...@googlegroups.com
Rebuilding the indexes didn't help.

I've found some more symptoms. Basically it's all iowait, caused by reading the collection. On a multi-CPU instance, one CPU is 100% busy doing iowait: on a 4-CPU instance I see 25% iowait but only 1-2% CPU use, and on a 1-CPU instance I see 100% iowait but again virtually no CPU use. Looking at the iowait, it's all reads. If I run the mongodump from another instance it's still slow, and it's still all iowait from reading the Mongo DB on the Mongo instance.

I've tried various things to tune the I/O, but at the end of the day this is happening on both local VMware and on AWS (using Xen). It's acting like it just won't use the index, so I took a closer look and found something interesting.

If I do a find or a count on the collection it returns immediately. But if I then ask it to explain the find, it takes a long time and, interestingly, only uses a BasicCursor, so it's not using the _id index. If I try the same thing against an unindexed field, again BasicCursor (as expected). If I do an explain explicitly searching the _id field, it uses a BtreeCursor and returns immediately.
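In shell terms, roughly this (the ObjectId is made up for illustration):

db.snowplow_events.find().explain()
// "cursor" : "BasicCursor" - and it takes a long time
db.snowplow_events.find({ _id : ObjectId("51dc2fd8e4b0a9a0a1b2c3d4") }).explain()
// "cursor" : "BtreeCursor _id_" - returns immediately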

Now, my understanding is that mongodump should use _id, but it's not. Why would it not? Also, why does db.collection.find().explain() not use the _id index?

Thanks,
Paul.





Asya Kamsky

Jul 18, 2013, 1:28:26 AM
to mongodb-user
Paul,

I think there may be a couple of misunderstandings here...

First of all, db.collection.find() would never use the _id (or any other) index, and that's expected. An index is used to speed up filtering, i.e. selecting a subset of records; this query says "return everything".

The query that mongodump actually runs is db.collection.find().snapshot(). If you run explain on that, you will see that it does walk the _id index.
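For example, on your collection you should see something like (output trimmed):

db.snowplow_events.find().snapshot().explain()
{
    "cursor" : "BtreeCursor _id_",
    ...
}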

Now, you may be getting the impression that mongodump time is proportional to the number of records or to the size of the _id index. It is not, at least not very directly, and certainly not across collections. Your 2.2 million object collection may simply have smaller documents; mongodump time is really proportional to the total amount of data the collection holds.

I see no evidence that your _id index is corrupt. What I do see is that your disk I/O cannot keep up with dumping a collection of this size; the collection is about 13GB. I don't know about your VMware set-up, but I do know about AWS. I just did a quick measurement and could get (at best) about 1-2MB/second on an EBS volume, at 50% reads and 50% writes, which is what you are doing (your documents need to be read from disk into RAM and then written out to the dump directory).

Even assuming that you get half of those IOPS as writes, it will take close to 20 minutes to write one GB, and you have a 13GB collection: roughly 13 x 20 = 260 minutes. So we're looking at about four hours for the full dump, and that's an optimistic estimate.

The fact that you see "disk usage is the same whether it's the fast dumping collection or the slow one" tells me you are at the limit of how much the disk can handle...

Asya


Paul McAdam

Oct 17, 2013, 11:13:06 AM
to mongod...@googlegroups.com
FYI, I never fully got to the bottom of this, but the collection was corrupt. Recreating it completely fixed the problem. Rebuilding the indexes didn't help and it passed validation, but nevertheless...
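For anyone who hits the same thing, the recreate was roughly along these lines (run against the primary; not necessarily verbatim what we used):

// copy everything into a fresh collection
db.snowplow_events.find().forEach(function (doc) {
    db.snowplow_events_fixed.insert(doc);
});
// swap it into place, dropping the corrupt original
db.snowplow_events_fixed.renameCollection("snowplow_events", true);
// recreate the secondary index
db.snowplow_events.ensureIndex({ analytica_updated_at : 1 }, { background : true });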






Daniel Coupal

Oct 17, 2013, 11:51:53 AM
to mongod...@googlegroups.com
Paul,

As a note, RAID 5 is not a good choice for MongoDB. Use RAID 1+0 instead.


Regards,

Daniel Coupal

Paul McAdam

Oct 17, 2013, 12:56:33 PM
to mongod...@googlegroups.com
Thanks Daniel; unfortunately it's not on a dedicated array. Our normal-use I/O is low enough for it not to matter, though; it's only backups that really spike the I/O. Even that is fine apart from this one collection, which was corrupt.

In an ideal world I'd move it though :)



