Mongo OOM and disks in production


RoieY

Mar 18, 2018, 3:31:48 PM
to mongodb-user
Hi, 
Last week I opened a Jira ticket with MongoDB about OOM restarts we encountered in production.
As part of that discussion, and based on the diagnostic data, I still have open questions that should be discussed here.

In general, we encountered an OOM crash in production - according to the MongoDB engineers it is expected behaviour and not considered a bug.
I still have open questions: why did only one node crash? Is it one problematic query? How can I avoid this in the future?

In addition, I understood that we might have some disk issues - how can I know that the disks we use are fast enough?

What is considered poor IO performance? What disks should we have? We use an AWS EBS volume, io1 type, with 5000 provisioned IOPS.

I'm adding the Jira ticket for more information, including the diagnostic data:


Thanks.

Kevin Adistambha

Mar 29, 2018, 12:56:08 AM
to mongodb-user

Hi

Let me try to address your questions, referring also to some of the facts in the linked ticket.

why only one node crashed?

It is possible that the way your application was designed contributes to this issue. In the linked ticket you mentioned:

our application is configured to prefer reads from secondaries.

Thus it’s possible that for some reason (e.g. network speed, accessibility, etc.), your application simply performs more reads from this particular secondary. Hence this one node was hammered with queries from the application, all the while it is trying to keep up with writes from the primary.

Note that the main purpose of a replica set is to provide high availability and redundancy. It is not for scaling your reads. Often, work pressure is actually higher on a secondary than on the primary, because the secondary must keep up with the primary's writes while also serving reads, since it can be called on to become primary at any time.
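If reads do need to be served by secondaries, the read preference can at least allow the driver to spread them across all members rather than pinning them to one node. A minimal connection-string sketch (the hostnames and replica set name `rs0` are made up for illustration):

```
mongodb://host1:27017,host2:27017,host3:27017/?replicaSet=rs0&readPreference=secondaryPreferred
```

`secondaryPreferred` falls back to the primary if no secondary is available, which avoids read outages when a secondary is down, whereas a strict `secondary` preference would fail in that situation.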

how can i know that disks we use are fast enough

This is a question that has a different answer for everyone. What is fast enough for you may be too slow for one use case, or too fast for another. There is no simple answer to instance sizing, since it depends entirely on the use case.

For example, you can have two different deployments, both containing identical data: one for development and one for production. Since the dev servers are typically not as heavily used as the prod servers, you can run much smaller hardware on the dev servers, even though the data on both is the same.

So the answer to "how do you know that the disks you use are fast enough" is: they are fast enough if they can keep up with the load you're putting on them. If you find that the current provisioning is not enough, you may want to use a larger/faster disk.
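As a rough sanity check, you can compare the IOPS the volume actually serves (e.g. the `r/s` and `w/s` columns of `iostat -x` on the host) against what is provisioned. A small illustrative sketch - the sample numbers and the 80% warning threshold are my own assumptions, not an official guideline:

```python
# Rough check of whether a volume's provisioned IOPS are saturated.
# Read r/s and w/s from `iostat -x` on the actual host; the figures
# below are illustrative only.

PROVISIONED_IOPS = 5000  # io1 volume from the original post

def iops_utilization(reads_per_sec: float, writes_per_sec: float,
                     provisioned: int = PROVISIONED_IOPS) -> float:
    """Return observed IOPS as a fraction of the provisioned limit."""
    return (reads_per_sec + writes_per_sec) / provisioned

util = iops_utilization(reads_per_sec=3200, writes_per_sec=1500)
print(f"utilization: {util:.0%}")  # utilization: 94%
if util > 0.8:  # assumed threshold, tune for your workload
    print("disk is near its provisioned IOPS limit")
```

Sustained utilization near 100% means the volume caps throughput and queues build up; in that case consider provisioning more IOPS or reducing the read load on that node.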

Best regards
Kevin
