mongodb 3.4.10 memory growth


Daniel Doyle

Jan 9, 2018, 9:48:47 AM
to mongodb-user
Hi all,

We are running a MongoDB 3.4.10 instance on CentOS 7 whose RES usage appears to grow continuously. Poking around a little bit, I believe this is because the "page heap freelist" in tcmalloc keeps growing. Eventually the process steals all of the filesystem cache and performance degrades to the point where we have to restart, usually on the order of every 3-4 days. The machine has 128G of memory.

I grabbed a section of the serverStatus output from just before the last restart. My reading of it is that tcmalloc has allocated a ton of memory but is never (or at least not sufficiently) unmapping it back to the OS.

db.serverStatus().wiredTiger.cache
                ...........

                  "bytes currently in the cache" : 42629567194,  

                  "maximum bytes configured" : 56408145920,



db.serverStatus().tcmalloc

            "generic" : {

                        "current_allocated_bytes" : NumberLong("72999682360"),

                        "heap_size" : NumberLong("99688194048")

                },

                "tcmalloc" : {

                        "pageheap_free_bytes" : NumberLong("22580260864"),

                        "pageheap_unmapped_bytes" : NumberLong("2185707520"),

                        "max_total_thread_cache_bytes" : NumberLong(1073741824),

                        "current_total_thread_cache_bytes" : 27332768,

                        "total_free_bytes" : NumberLong(1922543304),

                        "central_cache_free_bytes" : NumberLong(1894941896),

                        "transfer_cache_free_bytes" : 268640,

                        "thread_cache_free_bytes" : 27332768,

                        "aggressive_memory_decommit" : 0,

                        "formattedString" : "------------------------------------------------\nMALLOC:    72999682360 (69617.9 MiB) Bytes in use by application\nMALLOC: +  22580260864 (21534.2 MiB) Bytes in page heap freelist\nMALLOC: +   1894941896 ( 1807.2 MiB) Bytes

 in central cache freelist\nMALLOC: +       268640 (    0.3 MiB) Bytes in transfer cache freelist\nMALLOC: +     27332768 (   26.1 MiB) Bytes in thread cache freelists\nMALLOC: +    558653632 (  532.8 MiB) Bytes in malloc metadata\nMALLOC:   ------------\nMALLOC: =  98

061140160 (93518.4 MiB) Actual memory used (physical + swap)\nMALLOC: +   2185707520 ( 2084.5 MiB) Bytes released to OS (aka unmapped)\nMALLOC:   ------------\nMALLOC: = 100246847680 (95602.8 MiB) Virtual address space used\nMALLOC:\nMALLOC:        7267027

 Spans in use\nMALLOC:            177              Thread heaps in use\nMALLOC:           4096              Tcmalloc page size\n------------------------------------------------\nCall ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).\nBytes relea

sed to the OS take up virtual address space but no physical memory.\n"

                }
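
For reference, the handful of counters above can be snapshotted straight from the mongo shell; a minimal sketch using the same field names (the {tcmalloc: true} argument just asks for the detailed tcmalloc section):

    // Snapshot the tcmalloc counters discussed above. Field paths match the
    // serverStatus output; {tcmalloc: true} requests the detailed section.
    var s = db.serverStatus({tcmalloc: true}).tcmalloc;
    printjson({
        current_allocated_bytes: s.generic.current_allocated_bytes,
        heap_size:               s.generic.heap_size,
        pageheap_free_bytes:     s.tcmalloc.pageheap_free_bytes,
        pageheap_unmapped_bytes: s.tcmalloc.pageheap_unmapped_bytes
    });

The gap between heap_size and current_allocated_bytes is mostly the page heap freelist (plus the unmapped bytes), and that is the part that never seems to shrink.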




Based on reading issues like https://jira.mongodb.org/browse/SERVER-20104, I tried enabling aggressive decommit via the environment variable, but the overhead of doing that resulted in a massive overall performance drop, so we reverted fairly quickly. There are a few other tcmalloc environment variables, such as RELEASE_RATE and HEAP_LIMIT, that may be relevant, but after trying a few in a dev environment and managing to segfault mongod (for example with the max heap size setting), I figured it would be good to ask around first.
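
For completeness, I believe the aggressive decommit switch is also exposed as a runtime server parameter; we only used the environment variable, so treat the following as an untested sketch and verify the parameter name on your build:

    // Assumed runtime equivalent of TCMALLOC_AGGRESSIVE_DECOMMIT=1; verify that
    // this parameter exists on your build before relying on it.
    // 1 enables the aggressive behavior, 0 restores the default.
    db.adminCommand({ setParameter: 1, tcmallocAggressiveMemoryDecommit: 1 })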


Attached is the current full output of serverStatus, if that is helpful. As you can see, the heap size is considerably lower right now, but it will grow over time until it looks like the output above.


A few misc details - 

1. The server is running in a replica set, but at the moment the secondary has been removed while we do some hardware changes.

2. The server is currently a single shard. This is an artifact from the 2.X days, when we had to shard data to get around the single-threaded limitations of MMAP, but it is a peculiarity so I wanted to point it out.

3. Our workflow is very heavily update based. We're storing time series data at a fairly regular interval, and documents contain "blocks" of time. One document, for example, represents a few hours worth of 60s samples, so the workflow each minute is to update an array in that document to add the value. These arrays are fixed in size; once one fills up, a new document is created. Internally, though, I would expect the documents to grow from their initial size as values are added (a rough sketch of this update pattern follows below).
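
Illustrative sketch of that per-minute update; the collection and field names here are made up rather than our real schema:

    // Illustrative sketch of the per-minute update (made-up names). One document
    // holds a fixed-size block of 60s samples; we append to its array until the
    // block is full, at which point a new document is allocated for the next block.
    db.datapoints.update(
        { identifier: "xe-0/0/0.in", block_start: ISODate("2018-01-09T12:00:00Z") },
        {
            $push: { values: { ts: ISODate("2018-01-09T12:34:00Z"), value: 42.7 } },
            $inc:  { num_values: 1 }
        }
    );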


I am happy to provide any additional information if I can. Thanks.

serverStatus.txt

Daniel Doyle

Jan 16, 2018, 9:28:12 AM
to mongodb-user
As a followup to this, we tried turning off replication to see if that helped any. Judging by our memory usage monitoring, it appears to have had no appreciable impact. I did verify, via rs.printReplicationInfo(), that the oplog is no longer updating, though I still see a "WT RecordStoreThread: local.oplog.rs" operation running; I'm not sure what to make of that.

The WiredTiger cache size was set to 50G, which it seems to be holding to based on the output of db.serverStatus(), but the RES of the mongod process had reached 62G as of this morning and, based on the last couple of weeks, will likely keep climbing until we restart it.
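
For reference, the two numbers I am comparing come straight out of serverStatus; a quick shell check (note that mem.resident is reported in megabytes):

    // Compare WiredTiger cache usage with overall process resident memory.
    // serverStatus reports mem.resident in megabytes.
    var s = db.serverStatus();
    print("WT cache bytes      : " + s.wiredTiger.cache["bytes currently in the cache"]);
    print("process resident MB : " + s.mem.resident);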

This system was upgraded from 3.0.12 to 3.4.10 including the steps along the way such as converting config servers to wiredtiger. We did not experience this behavior in the 3.0 line, and we will be trying to downgrade to 3.2 to see if that helps anything.

Any help or insights would be greatly appreciated.

Kevin Adistambha

Jan 16, 2018, 7:40:02 PM
to mongodb-user

Hi Daniel

Could you add more details please:

  • Are you running with the --enableMajorityReadConcern setting?
  • Could you post the output of db.serverCmdLineOpts(), rs.conf(), and the content of any config files?
  • Do you have big databases? How many collections do you have in the system? The output of db.stats() per database would help to determine this.
  • Could you post the directory listing of your dbpath that includes the file sizes, e.g. output of ls -l
  • Do you have big aggregation queries running all the time?

This system was upgraded from 3.0.12 to 3.4.10 including the steps along the way such as converting config servers to wiredtiger.

Did you upgrade to MongoDB 3.2 in-between? Note that the supported upgrade path involves upgrading through every intermediate major version.

Also, did you change the replica set protocol version during the upgrade process?

we will be trying to downgrade to 3.2 to see if that helps anything.

Without knowing more about your use case, I would venture to say that it’s likely to be better to upgrade to the current version of 3.6.2 instead of moving downward to the 3.2 series.

Best regards
Kevin

Daniel Doyle

Jan 17, 2018, 2:26:08 PM
to mongodb-user
Kevin,

Certainly, thanks for replying. Responses inline...


On Tuesday, January 16, 2018 at 7:40:02 PM UTC-5, Kevin Adistambha wrote:

Hi Daniel

Could you add more details please:

  • Are you running with the --enableMajorityReadConcern setting?
I do not believe so. The command line options are just the config file and "--quiet", as you will see below.


  • Could you post the output of db.serverCmdLineOpts(), rs.conf(), and the content of any config files?
> db.serverCmdLineOpts();
{
        "argv" : [
                "/usr/bin/mongod",
                "--quiet",
                "-f",
                "/etc/mongo/mongod-shard2.conf"
        ],
        "parsed" : {
                "config" : "/etc/mongo/mongod-shard2.conf",
                "net" : {
                        "port" : 27025
                },
                "processManagement" : {
                        "fork" : true,
                        "pidFilePath" : "/run/mongodb/mongod-shard2.pid"
                },
                "security" : {
                        "authorization" : "enabled",
                        "keyFile" : "/etc/mongo/keyfile-tsds.bldc"
                },
                "sharding" : {
                        "clusterRole" : "shardsvr"
                },
                "storage" : {
                        "dbPath" : "/mongo/shard2",
                        "engine" : "wiredTiger",
                        "syncPeriodSecs" : 60,
                        "wiredTiger" : {
                                "engineConfig" : {
                                        "cacheSizeGB" : 30,
                                        "journalCompressor" : "snappy"
                                }
                        }
                },
                "systemLog" : {
                        "destination" : "file",
                        "logAppend" : true,
                        "logRotate" : "reopen",
                        "path" : "/var/log/mongodb/mongod-shard2.log",
                        "quiet" : true
                }
        },
        "ok" : 1


$ cat /etc/mongo/mongod-shard2.conf
# THIS FILE IS MAINTAINED BY PUPPET

systemLog:
   destination: file
   path: "/var/log/mongodb/mongod-shard2.log"
   logAppend: true
   logRotate: reopen

storage:
   engine: "wiredTiger"
   syncPeriodSecs: 60
   dbPath: "/mongo/shard2"
   wiredTiger:
      engineConfig:
         journalCompressor: "snappy"
         cacheSizeGB: 30

processManagement:
   fork: true
   pidFilePath: "/run/mongodb/mongod-shard2.pid"

net:
   port: 27025


#replication:
#   replSetName: "tsds.bldc"

sharding:
   clusterRole: shardsvr

security:
   authorization: "enabled"
   keyFile: "/etc/mongo/keyfile-tsds.bldc"



I can't show the output of rs.conf() since it's not currently running in a replica set (see my second message), but it was only a single mongod data instance and a single mongod arbiter instance on separate hosts. There is another mongod data instance not currently in the set, as noted in my original message. As far as I know, we haven't done anything special with replication; it's a 50G oplog with defaults.


  • Do you have big databases? How many collections do you have in the system? The output of db.stats() per database would help to determine this.
We have about 9 databases with any substantial data in them, and each of those has mostly the same set of 8 collections. "interface" is the obvious dominator in terms of size. All queries should be fully indexed.
 
I have attached a txt file with that output, db_stats.txt.

  • Could you post the directory listing of your dbpath that includes the file sizes, e.g. output of ls -l

I have attached a txt file with that output, db_path_list.txt.

  • Do you have big aggregation queries running all the time?

I don't know what qualifies as a "big" aggregation query, but most of what we do is aggregate and update queries. The workflow across a given time period is pretty consistent. There are jobs at the top of the hour and at midnight that do bigger crunches, but they run across the same dataset (data received over the last hour/day respectively, which is effectively a constant). These bigger crunches typically do aggregate and/or find queries that return tens of thousands of results and page through them in blocks.
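
The shape of those crunch queries is roughly the sketch below (placeholder names and pipeline, not our actual code): a large aggregate whose results are walked in cursor batches.

    // Rough shape of an hourly crunch query (placeholder names, not our actual
    // pipeline). The driver pages through the result set in cursor batches.
    var hourEnd   = new Date();
    var hourStart = new Date(hourEnd.getTime() - 3600 * 1000);
    var cur = db.measurements.aggregate(
        [ { $match: { start: { $gte: hourStart, $lt: hourEnd } } } ],
        { allowDiskUse: true, cursor: { batchSize: 1000 } }
    );
    while (cur.hasNext()) {
        var doc = cur.next();   // each document gets summarized by application code
    }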
 

This system was upgraded from 3.0.12 to 3.4.10 including the steps along the way such as converting config servers to wiredtiger.

Did you upgrade to MongoDB 3.2 in-between? Note that the supported upgrade path involves upgrading through every intermediate major version.

Yes, we upgraded to 3.2 in between. If memory serves, that is where the conversion of the config servers to WiredTiger had to happen.

 

Also, did you change the replica set protocol version during the upgrade process?

Yes.
 

we will be trying to downgrade to 3.2 to see if that helps anything.

Without knowing more about your use case, I would venture to say that it’s likely to be better to upgrade to the current version of 3.6.2 instead of moving downward to the 3.2 series.

Unfortunately, we cannot currently go to 3.6.2 because the driver for the language we are using (Perl) has not yet released official support for that version. I expect that will change in the near-ish future, but for today we can only try going backwards. https://jira.mongodb.org/projects/PERL/issues/PERL-754?filter=allopenissues

As noted earlier, this behavior did not occur on the 3.0 line, so it seems to be an unintended result of a change made somewhere between there and 3.4.10. While rolling back versions is not an optimal or desired solution, our first priority is getting to a stable environment. I am more than happy to provide more information and stay on 3.4.10 for the time being if it helps track anything down.

As an aside, while poking around in the JIRA tracker, we did discover an issue that seems like it may be related. I looked at the diagnostics directory but didn't see an obvious way to turn that into something useful; I'm guessing it's for internal team tools only.


Thanks again for taking a look, we appreciate it.
-Dan
 


db_path_list.txt
db_stats.txt

Kevin Adistambha

Jan 18, 2018, 10:01:33 PM
to mongodb-user

Hi Daniel

Unfortunately there is nothing that points to a definitive issue in the serverStatus output that you posted. There are a number of databases and indexes, but I think it’s still a relatively manageable number. None of the other information you posted points to an obvious cause either.

It would be helpful if you could post 3-4 outputs of db.serverStatus() taken within a minute or so of each other during the height of the issue. Hopefully this will point to a reason.

In the meantime, you might want to check out the Production Notes and see if there are any settings that were missed/non-optimal.

Best regards
Kevin

Daniel Doyle

Jan 25, 2018, 9:47:53 AM
to mongodb-user
Kevin,

Thanks for reviewing, and apologies for the delayed response; I have been traveling. I have reviewed the Production Notes and confirmed that, to the best of my knowledge, we are following all of the best practices.

Attached is a 10-minute span of db.serverStatus() output collected from a once-a-minute crontab. I hope this helps.

-Dan
cronServerStatus.txt

Daniel Doyle

Feb 8, 2018, 10:34:13 AM
to mongodb-user
I hate to bump this, but we are still seeing this issue. We tried upgrading to the most recent 3.4.11 to see if that would change anything, but the behavior seems to be the same.

I am attaching several graphs from a monitoring system that show the overall memory usage pattern in a few different dimensions. The far left side is where things stood when we restarted it last time (note: we adjusted the cache size from 30G to 40G during that restart to see if there was any difference). The sawtoothing in the graphs lines up pretty well with when the hourly crunch jobs run, which I think is to be expected: a fairly large aggregate query runs during that time, so memory balloons slightly and then falls when it finishes. The activity around midnight is some more large jobs running. The problem is that the overall trend line is clearly going upwards over time, and I can't see anything obvious causing it.

The graphs labeled "mongodb_*" come from tracking the output of db.serverStatus(); the other one is just tracking memory usage of the whole server.

I can also open up a ticket in the Jira tracker if this is the wrong place.
Screen Shot 2018-02-08 at 10.31.41 AM.png
Screen Shot 2018-02-08 at 10.31.31 AM.png
Screen Shot 2018-02-08 at 10.31.27 AM.png
Screen Shot 2018-02-08 at 10.31.19 AM.png
Screen Shot 2018-02-08 at 10.31.12 AM.png
Screen Shot 2018-02-08 at 10.30.17 AM.png
Screen Shot 2018-02-08 at 10.27.07 AM.png

Kevin Adistambha

Feb 8, 2018, 9:26:40 PM
to mongodb-user

Hi Daniel

From the mongodb_WT_cache graph you posted, the WiredTiger cache seems to be holding steady at ~32 GB, which is expected if you set the WiredTiger cache to 40 GB: WiredTiger will use up to 80% of the configured cache size (0.8 × 40 GB = 32 GB) and then tries to keep cache utilization at that level for performance reasons.

Other than the mongodb_current_allocated graph, everything else looks as expected. I would like to reconfirm: other than this constantly climbing allocation, is everything else in the database running as expected, e.g. no stalling, drops in throughput, unexpected shutdowns, etc.?

In the past, have you seen the mongodb_current_allocated graph plateau at some point?

It’s doing a fairly large aggregate query during that time

The stuff around midnight is some more large jobs running

It is entirely possible that the job you’re doing simply needs that much memory to run.

Could you elaborate on the large aggregate queries? How large are we talking, e.g. the number of documents, the average document size, whether they use the $out stage, etc.? It might be useful to see the aggregation query and some example documents to get a sense of the scale. A bit more insight into the other “large jobs” would be useful as well.

Finally, if possible, could you perform these steps to collect more information?

1. Run mongod with mongod --setParameter heapProfilingEnabled=true, or add this entry in your config file:

setParameter:
    heapProfilingEnabled: true

This will run mongod in a heap profiling mode, which is a troubleshooting mode.

2. Run this script in the server to collect serverStatus data every minute:

mongo --eval "while(true) {print(JSON.stringify(db.serverStatus({tcmalloc:true}))); sleep(60*1000)}" > serverStatus.out

Ideally, the script should be started right after a restart and left running for ~2 days (the length of the graphs you posted); longer would definitely be better. It will produce a serverStatus() output every minute. Note that there is no identifiable information in the serverStatus() output; it only contains various metrics of what the server is doing at that point in time.

Please post the resulting serverStatus.out. It’s a text file, so you can compress it with zip or similar.

Best regards
Kevin

Daniel Doyle

Feb 9, 2018, 10:05:33 AM
to mongodb-user
Hi Kevin,

Thanks for sticking with me here, we really appreciate it.


On Thursday, February 8, 2018 at 9:26:40 PM UTC-5, Kevin Adistambha wrote:

Hi Daniel

From the mongodb_WT_cache graph you posted, the WiredTiger cache seems to be holding steady at ~32 GB, which is expected if you set the WiredTiger cache to 40 GB. WiredTiger will use up to 80% of the configured cache size, then will try to keep cache utilization at this level for performance reasons.


Yes, it would seem that the WT engine may not be the culprit here as that part is solid.
 

Other than the mongodb_current_allocated graph, everything else looks as expected. I would like to reconfirm: other than this constantly climbing allocation, is everything else in the database running as expected, e.g. no stalling, drops in throughput, unexpected shutdowns, etc.?

In the past, have you seen the mongodb_current_allocated graph to plateau at some point?


Yes, everything else appears to be fine. We have never had a crash that wasn't OOM induced, and throughput seems more or less consistent based on the output of mongostat. There is a 1-3s drop in write throughput following a "flush", but that doesn't seem unreasonable, and it doesn't drop to 0. During the ~2 days it takes to reach the restart point, the application behaves just fine and is performant. Eventually the performance starts to degrade as the FS cache is eaten up, at which point we restart things and everything goes back to being happy again for another few days.
 

It’s doing a fairly large aggregate query during that time
The stuff around midnight is some more large jobs running

It is entirely possible that the job you’re doing simply needs that much memory to run.

Could you elaborate on the large aggregate queries? How large are we talking, e.g. the number of documents, the average document size, whether they use the $out stage, etc.? It might be useful to see the aggregation query and some example documents to get a sense of the scale. A bit more insight into the other “large jobs” would be useful as well.

The large jobs are cyclical in nature. We allocate a document that stores an array of datapoints, and that array gets updated during the hour as data is collected. The array is finite in size, so for a given "top of the hour" the size and workflow are constant even though the overall dataset is growing. Today we are collecting on about 75k things, so to keep it simple, we have 75k documents every hour that need summarizing.

At the top of the hour, an aggregation pipeline runs across about 75k metadata documents that (according to stats() output) have an avgObjSize of 741 bytes. For each block of this metadata (our block size is 5), it then queries the actual datapoint collection for the documents it needs to summarize. The datapoint collection's avgObjSize is 79618 bytes according to stats(). This "join" happens in memory in our code, and there is no $out stage. The summarized data is crunched and then written back to a third collection with an avgObjSize of 554680 bytes.

At the top of the day there is an additional stage where it does the exact same thing but for daily summaries instead of hourly (based on the hourlies, not the original data, so again 75k total docs).
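
Expressed as a mongo shell sketch (all names are made up for illustration; the real implementation is Perl application code), the hourly flow is roughly:

    // Sketch of the hourly summarization flow described above. All names are
    // made up for illustration; the real implementation is Perl application code.
    var hourEnd   = new Date();
    var hourStart = new Date(hourEnd.getTime() - 3600 * 1000);

    // stand-in for the real crunching logic
    function summarize(docs) { return { count: docs.length, generated_at: new Date() }; }

    var meta  = db.metadata.aggregate([{ $match: { active: true } }]);  // ~75k metadata docs, ~741 bytes each
    var block = [];
    while (meta.hasNext()) {
        block.push(meta.next());
        if (block.length == 5 || !meta.hasNext()) {                     // our block size is 5
            var ids = block.map(function (m) { return m._id; });
            // the "join" happens in application memory: pull the raw datapoints for this block
            var raw = db.datapoints.find({ meta_id: { $in: ids },
                                           start: { $gte: hourStart, $lt: hourEnd } }).toArray();
            db.hourly_summaries.insert(summarize(raw));                 // crunched summary goes to a third collection
            block = [];
        }
    }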

I am inclined to think that total memory required isn't the problem here, because it runs fine for several days and on restart does not immediately go back to where it was, but at this point I am not completely sure of anything. We have tried letting it run through the entire box's memory just to see if it would eventually settle, but it got OOM'd when it hit something like 106 or 107GB (with a 30GB WT cache size).
 
 

Finally, if possible, could you perform these steps to collect more information?

1. Run mongod with mongod --setParameter heapProfilingEnabled=true, or add this entry in your config file:

setParameter:
    heapProfilingEnabled: true

This will run mongod in a heap profiling mode, which is a troubleshooting mode.

2. Run this script in the server to collect serverStatus data every minute:

mongo --eval "while(true) {print(JSON.stringify(db.serverStatus({tcmalloc:true}))); sleep(60*1000)}" > serverStatus.out

Ideally, the script should be started right after a restart and left running for ~2 days (the length of the graphs you posted); longer would definitely be better. It will produce a serverStatus() output every minute. Note that there is no identifiable information in the serverStatus() output; it only contains various metrics of what the server is doing at that point in time.

Please post the resulting serverStatus.out. It’s a text file, so you can compress it with zip or similar.


Thanks, I will add this to the server when we restart it today and get you the output.

 


Daniel Doyle

Feb 12, 2018, 9:20:56 AM
to mongodb-user
Here is a .xz of the output of the command you posted above, starting a minute or so after server startup and running until we had to restart, which I think was a little less than 2 days.

The original file size is around 534M. I had to use "xz" to get it to compress down enough for Google Groups to let me attach it here; even something like "gzip -9" resulted in a file it wouldn't accept. If that format is a problem, let me know and I can send it some other way.

Thanks again.
-Dan
serverStatus.out.xz

Kevin Adistambha

Feb 12, 2018, 8:49:04 PM
to mongodb-user

Hi Daniel

Thanks for the data. From what you posted, it appears that your workload actually requires all that memory.

By default (in MongoDB), tcmalloc is not set to aggressively decommit allocated memory. This was done to ensure that if this memory is required quite soon in the future, tcmalloc doesn’t have to go through a deallocate-reallocate cycle that would be detrimental to performance. The side effect of this setting is that memory is not promptly released back to the OS, which is what you are seeing in your case.

By plotting the tcmalloc pageheap_free_bytes metric from your data, it can be observed that there is a large portion of memory (26 GB at peak) that was allocated but not released back to the OS. However, the following observation from that plot illustrates a good reason for not releasing it:


At point B on that plot, there was 17 GB considered “free” at 2018-02-11T14:51:19Z. By point C (2018-02-11T14:52:24Z), about one minute later, that number had dropped to just 69 MB: almost all of the 17 GB was reused.


The fact that turning on aggressive decommit leads to poor performance seems to confirm that the process does actually need all those resources: the poor performance means that mongod was deallocating memory that it then had to reallocate again shortly afterwards.

There are some things that you could try to alleviate this situation:

  1. If you’re able to provision more memory, you might be able to observe memory usage plateau at some point.
  2. Given the current hardware, the workload seems to require more resources than are available. You might be able to shard and distribute the load across more servers.

Best regards
Kevin
