mongo initial sync performance impact


Daniel Doyle

Apr 17, 2018, 12:15:48 PM
to mongodb-user
Hi,

We're testing an initial sync as a way to "compact" a database. Everything is running 3.6.3. The primary machine handles all activity very comfortably during regular replication and live data use.

When we add the blank secondary and the initial sync starts, the dirty cache percentage on the primary rapidly rises to 20% and stays there, and write throughput drops from thousands of updates/sec to several hundred/sec. From my reading, I understand this is because mongo blocks while trying to clean out the dirty cache to keep it within certain limits. The machine is still more or less coasting during this time: plenty of IO available according to tools like iotop/iostat, and plenty of CPU available (80%+ most of the time).
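
(For reference, this is roughly how we're watching the dirty cache percentage: a quick mongo shell sketch, with stat names taken from serverStatus that may vary slightly by version.)

    // Rough dirty-cache ratio on the primary, from the wiredTiger.cache
    // section of serverStatus. Field names may differ slightly by version.
    var c = db.serverStatus().wiredTiger.cache;
    var dirty = c["tracked dirty bytes in the cache"];
    var total = c["maximum bytes configured"];
    print("dirty: " + (100 * dirty / total).toFixed(1) + "% of configured cache");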

Is this expected behavior? It basically renders the primary unserviceable for the application, and judging by how far the sync got before we stopped it, it would take hours to complete, so we can't really afford to take that sort of hit. Is there a knob to tune this? If I had to guess, the issue may be related to keeping the entire oplog in the cache or something like that, so maybe the sizing consideration here has to be working set + size of the oplog?

Thanks,
-Dan

Daniel Doyle

Apr 17, 2018, 12:17:57 PM
to mongodb-user

I forgot to mention this, but figured I would in case it's helpful: when we remove the blank secondary from the replica set, everything more or less instantly goes back to regular performance. The dirty cache percentage drops and stays low, and all the backlogged operations rapidly complete.

Daniel Doyle

Apr 18, 2018, 6:44:10 PM
to mongodb-user
Another testing note - this also appears to happen whenever a secondary is significantly lagged behind. If I take a secondary out for, say, 4 hours and then put it back in, the dirty cache percentage rises dramatically on the primary and stays at around 20% until the secondary is caught up. During this time, throughput is significantly limited (from thousands of ops/sec to hundreds). This may be the same thing that occurs during initial sync, where a large amount of replication data is needed.
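
(For context, we're gauging how far behind the returning secondary is with the shell helper below; just a quick check, nothing fancy.)

    // Prints how far each secondary is behind the primary (3.6-era shell
    // helper; newer shells expose the same thing as
    // rs.printSecondaryReplicationInfo()).
    rs.printSlaveReplicationInfo();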

Kevin Adistambha

May 24, 2018, 2:03:56 AM
to mongodb-user

Hi Daniel

It’s been some time since you posted this question. Have you been successful in performing the initial sync?

> When we add the blank secondary and the initial sync starts, the dirty cache percentage on the primary rapidly rises to 20% and stays there, and write throughput drops from thousands of updates/sec to several hundred/sec.

During normal operation, WiredTiger attempts to keep the percentage of dirty data in its cache at around 5%. Once this number hits 20%, WiredTiger will try harder to evict dirty data from its cache by using application threads in addition to its dedicated cache eviction threads.

In other words, this process attempts to throttle data coming into WiredTiger so that it does not overwhelm the physical storage subsystem.

However, evicting dirty data from the cache involves a lot of work. Since WiredTiger is an MVCC no-overwrite storage engine, different versions of data in memory must be reconciled before a consistent state of the database can be written to disk. If the machine is struggling with the load imposed on it, it may appear to “stall” while it processes this work.
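
If you want to confirm that application threads are being pulled into eviction work while the sync runs, the thread-yield counters in the wiredTiger section of serverStatus are one place to look. A rough sketch (exact field names can vary a little between versions):

    // Cumulative time (microseconds) that application threads have spent
    // doing eviction work or waiting for cache space since the node started.
    // Sharp growth in these while the initial sync runs would be consistent
    // with the behaviour described above.
    var ty = db.serverStatus().wiredTiger["thread-yield"];
    print("application thread time evicting (usecs): " +
          ty["application thread time evicting (usecs)"]);
    print("application thread time waiting for cache (usecs): " +
          ty["application thread time waiting for cache (usecs)"]);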

From your description so far, it appears that your hardware can cope with normal day-to-day operations, but not with those operations and an initial sync at the same time. One suggestion is to perform the initial sync during off-peak times.

> Another testing note - this also appears to happen whenever a secondary is significantly lagged behind.

This situation may be described in SERVER-34938. Please comment on or upvote the ticket if you think it applies to your situation.

Best regards
Kevin

Daniel Doyle

May 24, 2018, 9:41:25 AM
to mongodb-user
Hi Kevin,

No, we were never successful in performing the initial sync. The issue you linked seems to describe what we are seeing pretty accurately. Our application is very write heavy, so there is likely a lot of cache contention while doing a sync. I have tried fiddling with things like increasing the cache limits, adding more memory, etc., which seems to help a little, but we always end up in the same boat.
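
(The cache-limit changes I mean were along these lines; the size shown is just an example, not what we actually run.)

    // Illustrative only: bump the WiredTiger cache at runtime on one node.
    // Equivalently, storage.wiredTiger.engineConfig.cacheSizeGB can be set in
    // the config file followed by a restart.
    db.adminCommand({ setParameter: 1, wiredTigerEngineRuntimeConfig: "cache_size=10GB" });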

The particular secondary in question is definitely weaker than the primary (though not by much), so I suspect it becomes the bottleneck during large resyncs. Its intended use case was to run backups from and to serve periodically as a "compact" host by doing resyncs, since our data rate changes significantly at times. As noted in my original post, the primary machine is still coasting during this initial sync, with loads of CPU, memory, and IOPS available.

I can try using the setParameter option described in that ticket to see if there is an appreciable change.

Thanks,
Dan