Simple inserts and updates to MongoDB run very slowly under load


Leonid Koren

Oct 13, 2015, 12:33:57 PM
to mongodb-user
Hi,

I'm using Mongo 2.6.9 with a cluster of 2 shards; each shard has 3 replicas, one of which is hidden.
It's a 5-machine deployment running on RedHat, where 4 machines each hold a single replica of one shard and the 5th machine holds the hidden replicas of both shards.
The load is around 250 inserts per second and 50 updates per second. These are simple inserts and updates of fairly small documents.
In addition, there is a small load of small files inserted into GridFS (around 1 file per second). The average file size is less than 1 MB.

There are 14 indexes defined on the involved collections. They will be required once I add the application that reads from the DB.

Throughout the run, the logs of the primary replicas show a huge number of simple inserts and updates, and even getLastError requests, that take hundreds of milliseconds or sometimes seconds (the default logging level only shows operations that took more than 100ms). For example, this simple update uses an index for the query and doesn't update any index:

2015-10-12T06:12:17.258+0000 [conn166086] update chatlogging.SESSIONS query: { _id: "743_12101506113018605820fe43610c0a81eb9_IM" } update: { $set: { EndTime: new Date(1444630335126) } } nscanned:1 nscannedObjects:1 nMatched:1 nModified:1 keyUpdates:0 numYields:0 locks(micros) w:430 2131ms

2015-10-12T06:12:17.259+0000 [conn166086] command chatlogging.$cmd command: update { update: "SESSIONS", updates: [ { q: { _id: "743_12101506113018605820fe43610c0a81eb9_IM" }, u: { $set: { EndTime: new Date(1444630335126) } }, multi: false, upsert: false } ], writeConcern: { w: 1 }, ordered: true, metadata: { shardName: "S1R", shardVersion: [ Timestamp 17000|3, ObjectId('56017697ca848545f5f47bf5') ], session: 0 } } ntoreturn:1 keyUpdates:0 numYields:0  reslen:155 2132ms

All inserts and updates are made with w:1, j:1.

The machines have plenty of available CPU and memory. The disk I/O is significant, but nowhere near 100% utilisation when these slow operations occur.

I really need to figure out what's causing this unexpectedly slow responsiveness from the DB. It's quite possible that I need to change something in the way the DB is set up. Mongo runs with the default configuration, including the logging level.

Any help is highly appreciated!

Thank you,
Leonid

Leonid Koren

Oct 19, 2015, 11:56:29 AM
to mongodb-user
I've continued looking into this issue; here are additional details that I hope will help pinpoint the root cause of the problem, or at least point me in the right direction:
 
The total DB size for a single shard is currently more than 200GB, with the indexes taking almost 50GB. Here is the relevant part of db.stats() and the mem section of db.serverStatus() from the primary replica of one of the shards:
 
        "collections" : 7,
        "objects" : 73497326,
        "avgObjSize" : 1859.9700916465995,
        "dataSize" : 136702828176,
        "storageSize" : 151309253648,
        "numExtents" : 150,
        "indexes" : 14,
        "indexSize" : 46951096976,
        "fileSize" : 223163187200,
        "nsSizeMB" : 16,
 
 
"mem" : {
                "bits" : 64,
                "resident" : 5155,
                "virtual" : 526027,
                "supported" : true,
                "mapped" : 262129,
                "mappedWithJournal" : 524258
        },
 
The servers have 8GB of RAM, of which the mongod process uses around 5GB. So the majority of the data, and more importantly the indexes, is not kept in memory. Could this be our root cause? When I previously wrote that the system has plenty of free memory, I was referring to the fact that the mongod process isn't using as much as it could, and also that most of the RAM is used as cache memory that can be released if required:
 
 free -m
                    total       used       free     shared    buffers     cached
Mem:          7871       7745       126              0         96           6022
-/+ buffers/cache:      1627      6244
Swap:         1983           0        1983

 
Here is the output of mongostat from the same mongod:
insert  query update delete getmore command flushes mapped  vsize    res faults         locked db idx miss %     qr|qw   ar|aw  netIn netOut  conn set repl       time
   172     *0     39     *0     193   431|0       0   256g   514g  4.95g      4 chatlogging:21.8%          0       0|0     0|0   193k   263k   584 S1R  PRI   09:42:59
   226     *0     41     *0     193   484|0       0   256g   514g  4.95g      6 chatlogging:15.7%          0       0|0     0|0   293k   441k   584 S1R  PRI   09:43:00
    97     *0     23     *0     133   278|0       0   256g   514g  4.95g      3  chatlogging:5.6%          0       0|0     0|0   191k   296k   584 S1R  PRI   09:43:01
   198     *0     37     *0     184   421|0       0   256g   514g  4.95g      8 chatlogging:26.9%          0       0|0     0|0   205k   287k   584 S1R  PRI   09:43:02
   161     *0     36     *0     166   383|0       0   256g   514g  4.95g      9 chatlogging:26.9%          0       0|0     0|0   275k   436k   584 S1R  PRI   09:43:03
   170     *0     37     *0     182   409|0       0   256g   514g  4.95g      6 chatlogging:14.5%          0       0|0     0|0   209k   295k   584 S1R  PRI   09:43:04
   151     *0     29     *0     179   363|0       0   256g   514g  4.95g      8 chatlogging:18.6%          0       0|0     0|0   301k   495k   584 S1R  PRI   09:43:05
   101     *0     28     *0     153   288|0       0   256g   514g  4.95g      4  chatlogging:7.1%          0       0|0     0|0   150k   212k   584 S1R  PRI   09:43:06

I do see a few faults here, but these numbers look too low to me to indicate a real problem. Am I wrong?
I also don't know whether the numbers shown under "locked db" are reasonable, or whether they indicate lock contention.
 
During the same timeframe in which these stats were taken, many simple update operations that find a document using an index and don't update any index, like the following one, took hundreds of ms:
2015-10-19T09:44:09.220+0000 [conn210844] update chatlogging.SESSIONS query: { _id: "838_19101509420840010420fe43620c0a81eb9_IM" } update: { $set: { EndTime: new Date(1445247849092) } } nscanned:1 nscannedObjects:1 nMatched:1 nModified:1 keyUpdates:0 numYields:0 locks(micros) w:214 126ms
Many other types of insert and update operations take hundreds of ms too, so the issue looks to be system-wide rather than related to a specific type of query. Using mtools, I'm not able to find operations that scan large numbers of documents.
 
I'm hoping to get help here with finding the root cause of the problem. I can provide any additional info or statistics from the system.
 
Thank you in advance,
Leonid

Wan Bachtiar

Nov 1, 2015, 8:25:47 PM
to mongodb-user

The servers have 8GB of RAM, of which the mongod process uses around 5GB. So the majority of the data, and more importantly the indexes, is not kept in memory. Could this be our root cause?

Hi Leonid,

Note that MongoDB's MMAPv1 storage engine does not buffer data within its own process; instead, data is cached in the filesystem cache. Due to the way MongoDB journalling works, data files are often in memory under the filesystem cache rather than in resident memory. See SERVER-9415 for further information on this.

The general recommendation is to ensure that indexes fit in RAM to reduce index page faults. Since you have 50GB of indexes, you should probably also look at optimising your indexes, for example along the lines of the sketch below.
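As a minimal sketch of how to review the existing indexes from the mongo shell (the SESSIONS collection name is taken from your log lines; the index name passed to dropIndex() is purely hypothetical):

    // run against the primary of each shard
    use chatlogging
    // per-index sizes in bytes for the collection
    db.SESSIONS.stats().indexSizes
    // list index definitions to spot unused or redundant ones
    db.SESSIONS.getIndexes()
    // drop an index only after confirming nothing needs it (hypothetical name)
    db.SESSIONS.dropIndex("EndTime_1")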

One point to add is that the new WiredTiger storage engine in MongoDB v3.0 supports compression for all collections and indexes. The compression on the indexes may help reduce your RAM requirements.
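If you do move to 3.0 with WiredTiger, block compression can also be chosen per collection at creation time. A sketch only (it applies to collections created after the upgrade; the collection name below is hypothetical, and snappy is the default compressor):

    // create a collection with zlib block compression under WiredTiger (3.0+)
    db.createCollection("SESSIONS_archive", {
        storageEngine: { wiredTiger: { configString: "block_compressor=zlib" } }
    })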


2015-10-19T09:44:09.220+0000 [conn210844] update chatlogging.SESSIONS query: { _id: "838_19101509420840010420fe43620c0a81eb9_IM" } update: { $set: { EndTime: new Date(1445247849092) } } nscanned:1 nscannedObjects:1 nMatched:1 nModified:1 keyUpdates:0 numYields:0 locks(micros) w:214 126ms

I noticed that your _id is quite interesting: 838_19101509420840010420fe43620c0a81eb9_IM.
One tip for managing the size of your working set concerns index access patterns. If you are inserting into indexes at random locations (as happens with ids that are randomly generated by hashes), you will continually be using the whole index. If instead you are able to create your ids in approximately ascending order (e.g. a day prefix concatenated with a random id), new entries will be inserted at the right-hand side of the B-tree, and the working set for the _id index may be smaller if your reads favour the newest documents.
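As an illustration only (this is not your actual ID scheme; the timestamp-plus-suffix layout below is an assumption, and it is a sketch of the access-pattern idea rather than a recommendation), an approximately ascending _id could be built like this in the shell:

    // illustrative sketch: lead with a coarse UTC timestamp so new index entries
    // land at the right-hand side of the _id B-tree, then append a unique suffix
    function makeAscendingId(suffix) {
        function pad(n) { return (n < 10 ? "0" : "") + n; }
        var d = new Date();
        var ts = d.getUTCFullYear() + pad(d.getUTCMonth() + 1) + pad(d.getUTCDate()) +
                 pad(d.getUTCHours()) + pad(d.getUTCMinutes());
        return ts + "_" + suffix;
    }
    db.SESSIONS.insert({ _id: makeAscendingId(ObjectId().str), EndTime: new Date() })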


All inserts and updates are made with w:1, j:1.

With a write concern of j:1, mongod confirms the write only after it has written the operation to the journal. Although this ensures that the write can survive a shutdown and increases write durability, it comes with a performance cost (especially on spinning disks). See the j option. In a replica set deployment, w:majority is normally used to improve durability.
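As a concrete sketch (reusing the _id from your earlier log line), the two write concerns look like this from the 2.6 shell:

    // current pattern: acknowledge once the primary has journalled the write
    db.SESSIONS.update(
        { _id: "743_12101506113018605820fe43610c0a81eb9_IM" },
        { $set: { EndTime: new Date() } },
        { writeConcern: { w: 1, j: true } }
    )
    // alternative: acknowledge once a majority of the replica set has the write
    db.SESSIONS.update(
        { _id: "743_12101506113018605820fe43610c0a81eb9_IM" },
        { $set: { EndTime: new Date() } },
        { writeConcern: { w: "majority" } }
    )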

For general recommendations for your deployment, please see the MongoDB v2.6 production notes. I would specifically point out the following:

  • Make sure that read-ahead settings for the block devices that store the database files are appropriate. For random access use patterns, set low readahead values. A readahead of 32 (16kb) often works well.

  • Disable Transparent Huge Pages, as MongoDB performs better with normal (4096 bytes) virtual memory pages.

  • Use SSD if available and economical.

  • Make sure that there are no other processes running on the instances that may create resource contention issues on the machine.



Regards,

Wan.

Leonid Koren

Nov 2, 2015, 7:17:07 AM
to mongodb-user
Wan, thank you very much for the detailed response!

I managed to remove two of the existing indexes, since for now they are not used. However, as we will eventually be using the stored data for additional purposes, I think even more indexes will need to be added.

Upgrading to version 3.x in order to use WiredTiger with compression, at the expense of CPU, sounds like a good suggestion, since in general CPU doesn't look to be a bottleneck in my case. This is something I won't be able to do immediately, but going forward it sounds worth checking. Can you estimate the effect this will have on the working set size, as well as the effect on CPU usage? Any other pointers regarding MMAPv1 vs WiredTiger?

I realize that using SSD disks would greatly improve performance, but unfortunately at the moment that's not something I can do.

Regarding the _id format, the prefix is indeed randomly generated. This was specifically designed to increase the write scalability of the system, based on the documentation here: https://docs.mongodb.org/manual/core/sharding-shard-key/#write-scaling. This is because the _id field is the shard key for this collection. My understanding is that if the values were monotonically increasing, new documents would always be inserted into a single shard, and chunks would later have to be rebalanced to keep the shards in balance. So the write scalability of the whole system would be capped by the write scalability of a single shard. The extra rebalancing would also cause larger parts of the indexes to become part of the working set when chunks are migrated between shards. Is my understanding not entirely correct?

I realize now that using random values has the negative effect of making the whole relevant index (or indexes) part of the working set, which slows things down if the amount of RAM isn't sufficient, as in my case at the moment.

So in my case it looks like random and monotonically increasing values each have their positive and negative effects. Is there a recommended way to investigate which of the approaches is more suitable, besides trying them both under load to see how each behaves?

As for using w:majority instead of j:1, waiting either for the data to be written to the journal or for it to be replicated to secondaries adds its own penalty. In my case the replica set consists of 3 replicas, so with majority we would have to wait for replication to a single secondary. Are you suggesting w:majority because the penalty of waiting for the data to be journaled is usually bigger than that of replicating it to a secondary?

Following your suggestion I will look into the read-ahead and the Transparent Huge Pages settings on the boxes and make sure they are configured according to the recommendations.


Thank you,
Leonid

Wan Bachtiar

Nov 5, 2015, 2:07:45 AM
to mongodb-user

Can you estimate the effect this will have on the working set size, as well as the effect on CPU usage? Any other pointers regarding MMAPv1 vs WiredTiger?

Hi Leonid,

Estimating the working set is quite complex, as there are many factors to consider. The most effective way to measure it is to test in your environment with a representative data set and workload.

Regarding MMAPv1 vs WiredTiger, this also depends on your use case, but in general WiredTiger performs better thanks to compression, document-level concurrency control, and its ability to utilise multi-core processors more effectively than MMAPv1.

For WiredTiger, you may also find the storage FAQs in the MongoDB manual useful.


My understanding is that if the values were monotonically increasing, new documents would always be inserted into a single shard, and chunks would later have to be rebalanced to keep the shards in balance. So the write scalability of the whole system would be capped by the write scalability of a single shard. The extra rebalancing would also cause larger parts of the indexes to become part of the working set when chunks are migrated between shards. Is my understanding not entirely correct?

It's correct that utilising Hash Based Partitioning ensures an even distribution of writes between the shards, but at the expense of efficient range queries. Also see Performance distinctions between Range and Hash Based Partitioning.
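For reference, the shard key itself can be declared hashed rather than embedding randomness in the value. Note that a shard key cannot be changed on a collection that is already sharded, so this sketch (using your collection name) would only apply to a new collection:

    // distribute writes across shards via a hashed shard key on _id
    sh.shardCollection("chatlogging.SESSIONS", { _id: "hashed" })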

So in my case it looks like random and monotonically increasing values each have their positive and negative effects. Is there a recommended way to investigate which of the approaches is more suitable, besides trying them both under load to see how each behaves?

There are a number of approaches for this. For more detailed discussion of those approaches and their pros and cons, I would suggest checking out Socialite: Social Data Reference Architecture.

For even more on Socialite, there were three sessions recorded at MongoDB World 2014 that are worth watching.


Are you suggesting w:majority because the penalty of waiting for the data to be journaled is usually bigger than that of replicating it to a secondary?

Journaling provides single-instance write durability. For a replica set deployment, we normally recommend w:majority so that writes are acknowledged by a majority of the replica set members. If writes are only acknowledged by the primary, they can be rolled back in the event of a primary failover. See Rollbacks During Replica Set Failover.


Regards,

Wan.

PS: You can also download the white paper for MongoDB Performance Best Practices by answering a few questions.

Leonid Koren

Nov 9, 2015, 4:16:16 AM
to mongodb-user
Wan, thank you again for providing the extra details.

I've started going through the recommendations and links you have provided. One of the things I've done so far is to review the estimated working set size from the db.serverStatus() output. The numbers look like this:

"workingSet" : {
                "note" : "thisIsAnEstimate",
                "pagesInMemory" : 89539,
                "computationTimeMicros" : 24955,
                "overSeconds" : 826
        },

After monitoring for 24 hours, I see that they don't change much. The pagesInMemory and overSeconds values are sometimes a little higher and sometimes a little lower, but they both stay around the same numbers over time.

The servers now have 24GB of RAM, which is the maximum I could get at the moment. The current index size is almost 50GB and the data size is more than 110GB. Multiplying 89539 by 4KB gives around 350MB, which is the estimated working set holding the data touched over the last ~14 minutes. This server only runs 1 mongos router, 1 config server and 1 mongod replica, and the processes other than the mongod replica do not require much memory. The value in overSeconds doesn't decrease over time. I would expect pagesInMemory to be much higher, but that's not the case. What am I missing here?
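For reference, this is how the estimate can be pulled and converted in the shell (a small sketch; 4096 bytes is the default Linux page size):

    // workingSet is only included when explicitly requested
    var ws = db.serverStatus({ workingSet: 1 }).workingSet
    // pages touched during the measurement window, converted to MB
    print((ws.pagesInMemory * 4096 / (1024 * 1024)).toFixed(0) + " MB touched over " + ws.overSeconds + " s")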

Thank you,
Leonid

Wan Bachtiar

Nov 16, 2015, 5:29:51 PM
to mongodb-user

Hi Leonid,

The workingSet is an estimated value to give you an idea of what proportion of data you are using.

The pagesInMemory is the total number of pages accessed by mongod over the measurement period. It is not what is currently in physical memory, but rather the pages "touched". MongoDB is estimating that your operations "touched" 89539 pages (or ~350MB of data). So this is a measure of how much data is "hot" for that measurement interval (~14 minutes).

Right now it looks like you are accessing a relatively low amount of data within your mongod process. However, this can be misleading as your working set can change over the course of the day as your workload increases/decreases. Ideally you should measure when you are at the peak of your workload for the day and see what the values look like then. You can also try out MongoDB Cloud Manager Monitoring for metric trending.

To get a better idea of how MongoDB is utilising RAM, you should also keep an eye on the page fault and memory metrics, such as the faults column in mongostat and the page fault counters in db.serverStatus().

If those numbers are all low, it is more likely that your working set is contained in memory.
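For example, the cumulative hard page fault counter can be checked from the shell (this counter is reported on Linux):

    // hard page faults since the mongod process started
    db.serverStatus().extra_info.page_faults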


Regards,
Wan.


Leonid Koren

Nov 22, 2015, 10:16:04 AM
to mongod...@googlegroups.com
Thank you Wan.

As this is a test environment, the load doesn't change much during the day. The numbers displayed in workingSet are not changing much over time.

I would think that if my working set were much smaller than the amount of RAM, then over time the overSeconds value should grow, and so should the pagesInMemory value. This is because additional pages are accessed by mongod (as new documents keep getting inserted into the DB), and previously accessed pages should not be evicted from RAM, since nothing else requires the extra space (mongod and mongos are the only applications on this server doing a significant amount of work, and only mongod takes a significant amount of RAM). On the other hand, if my working set were larger than RAM, pagesInMemory should be very high and overSeconds small.

Since neither of these happens in my environment, I must be missing something about the workingSet output of db.serverStatus(). Could you provide more details on this?

Thank you,
Leonid


Wan Bachtiar

Nov 24, 2015, 1:25:07 AM
to mongodb-user

As this is a test environment, the load doesn't change much during the day. The numbers displayed in workingSet are not changing much over time.

Hi Leonid,

What kind of operations are you running on the test environment? For example, a loop of read queries? Random inserts?


Since neither of these happens in my environment, I must be missing something about the workingSet output of db.serverStatus(). Could you provide more details on this?

To look into this further, can you please post the readahead values?
You can run sudo blockdev --report to get the readahead settings.

Also, are your journal and data files on the same block device? The default journal commit interval is 100ms if a single block device (e.g. a physical volume, RAID device, or LVM volume) contains both the journal and the data files (see commitIntervalMs).
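As a sketch (assuming the MMAPv1 journalCommitInterval server parameter, which can be read and adjusted at runtime):

    // check the current journal commit interval, in milliseconds
    db.adminCommand({ getParameter: 1, journalCommitInterval: 1 })
    // adjust it at runtime if needed
    db.adminCommand({ setParameter: 1, journalCommitInterval: 50 })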

You mentioned previously that you are using mtools; can you plot the log file using mplotqueries --group operation and post the resulting graph?

If you are setting up a new environment, I would suggest using the latest stable release of MongoDB (currently v3.0.7).


Kind Regards,

Wan.


Leonid Koren

Nov 25, 2015, 1:54:22 PM
to mongodb-user
Hi Wan,

The load running on the environment consists of random inserts, simple queries that use indexes, and simple updates. The inserts and updates are made on individual docs rather than in bulk. In addition, once per day several heavy queries are executed to delete old docs.

At the moment the readahead value is set to 256. I plan to try decreasing it to 32 to see if it makes a difference, but I haven't had a chance to try that yet.

The journal is on the same block device as the DB data files. I believe the commit interval is automatically reduced to about 33ms, since the inserts use j:1.

I hope to be able to provide the mplotqueries output early next week, as I'm away until the end of this week.

Thank you,
Leonid