High writeops count on secondaries


Valtteri Pirttilä

Aug 14, 2016, 2:16:48 PM
to mongodb-user
Hi!

I'm running Mongodb 3.2 on Amazon EC2 with 3 shards, each having a replica set of 3 instances.

EC2 instances limit the number of IO operations you can do per second. With this in mind, I've been surprised that our primaries and secondaries seem to write in very different ways: the primaries do roughly half as many write operations as our secondaries, while also writing roughly 50% of the bytes. Below is a chart from our first replica set; the primary is blue, the secondaries are green and orange.

[chart omitted: write operations for replica set 1 - primary in blue, secondaries in green and orange]

Is this behavior something I can change? The secondaries are becoming problematic: the amount of data saved in terms of bytes is very small, but the secondaries are overwhelming our IO operations per second limits. The primaries have no such problem.

Appreciate any insight you can give,

Valtteri

Chris Cunningham

Aug 24, 2016, 12:08:22 AM
to mongodb-user

Hi Valtteri,

The behavior you observed, where the IOPS/write pattern on Secondaries differs from that on Primaries, can occur when a single query results in multiple oplog entries. This happens because of oplog idempotency: a single operation on the Primary (e.g. an update with {multi: true}) is recorded as individual per-document updates, which each Secondary then applies. For example:

// Insert 10 documents
for (i = 0; i < 10; i++) { db.foo.insert({ "_id": i, "message": "this is a test message" }) }

// Update 5 of them with a single update statement
db.foo.update(
    { _id: { $gte: 2, $lte: 6 } },
    { $set: { "message": "this is an updated message" } },
    { multi: true }
)

This single update results in one oplog entry for each document matching the specified query (in the example above, _id: { $gte: 2, $lte: 6 }). Below is a sample of the oplog showing the multiple entries that resulted from the single update:

> use local
> db.oplog.rs.find()
...
{ "ts" : Timestamp(1471935844, 11), "t" : NumberLong(1), "h" : NumberLong("-2676367555054767283"), "v" : 2, "op" : "u", "ns" : "test.foo", "o2" : { "_id" : 2 }, "o" : { "$set" : { "message" : "this is an updated message" } } }
{ "ts" : Timestamp(1471935844, 12), "t" : NumberLong(1), "h" : NumberLong("-2041993120688929804"), "v" : 2, "op" : "u", "ns" : "test.foo", "o2" : { "_id" : 3 }, "o" : { "$set" : { "message" : "this is an updated message" } } }
{ "ts" : Timestamp(1471935844, 13), "t" : NumberLong(1), "h" : NumberLong("6366090271544179458"), "v" : 2, "op" : "u", "ns" : "test.foo", "o2" : { "_id" : 4 }, "o" : { "$set" : { "message" : "this is an updated message" } } }
{ "ts" : Timestamp(1471935844, 14), "t" : NumberLong(1), "h" : NumberLong("-3214242448671565250"), "v" : 2, "op" : "u", "ns" : "test.foo", "o2" : { "_id" : 5 }, "o" : { "$set" : { "message" : "this is an updated message" } } }
{ "ts" : Timestamp(1471935844, 15), "t" : NumberLong(1), "h" : NumberLong("5701116647602331752"), "v" : 2, "op" : "u", "ns" : "test.foo", "o2" : { "_id" : 6 }, "o" : { "$set" : { "message" : "this is an updated message" } } }
...

Please note that the idempotency property is required to ensure the oplog remains robust and correct under many scenarios. Therefore this behaviour is by design.
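
If you want to confirm whether this is what is happening in your deployment, one quick check is to tally the oplog entries by operation type. The following is only a rough sketch (run from the mongo shell on any member, against the default local.oplog.rs collection):

use local
db.oplog.rs.aggregate([
    // "i" = insert, "u" = update, "d" = delete, "c" = command, "n" = no-op
    { $group: { _id: "$op", count: { $sum: 1 } } },
    { $sort: { count: -1 } }
])

If multi-document writes are being expanded, the count of "u" entries will be much higher than the number of update commands actually issued on the Primary.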

It might be helpful if you could provide additional details on the types of operations your system is performing for further analysis and possible recommendations.

Thanks,

Chris

Valtteri Pirttilä

Aug 24, 2016, 7:10:12 AM
to mongodb-user
Hi Chris!

Thank you very much for your response.

I investigated our queries and our oplog, and I'm afraid it doesn't look like we use multi updates or inserts to a degree that would explain the behavior. We have some multi updates, but they are typically very constrained: for example, certain documents of a single user when something very fundamental changes in their data, which forces us to update denormalized data that we thought would be of a more permanent nature. However, these operations are rare, and the IO difference is consistent: the secondaries do roughly double the ops all the time.

The MongoDB Cloud Manager shows Opcounters like this:

[chart omitted: Cloud Manager opcounters for the primary and secondaries]

If I understand correctly, a multi-update on the primary would show up here as more updates on the secondaries? Alas, that is not happening.

On the types of operations we perform: we typically have a lot of in-place updates. I would hazard a guess that over 95% of our updates are in-place. The commands seen in the pictures above are typically findAndModify operations that contain in-place updates.
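
To give a feel for the workload, a typical operation looks roughly like the one below. This is purely illustrative - the collection and field names are made up and are not from our actual schema:

db.sessions.findAndModify({
    query: { _id: "session-123" },
    update: { $set: { lastSeen: new Date() }, $inc: { requestCount: 1 } },   // in-place update of existing fields
    new: true
})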

Respectfully,

Valtteri

Chris Cunningham

Aug 31, 2016, 9:49:39 PM
to mongodb-user

Hi Valtteri,

Thank you for the additional screenshots. However, they appear to show the Primary doing more ops than the Secondaries at that moment.

Are you experiencing any particular issue with your Cloud Manager Deployment that you are concerned with? If so, please post the relevant logs and the corresponding oplog entries for further analysis to find out what is causing the difference in the OpCount between the Primary & Secondaries.

Please note that having a difference in the OpCount between the Primary & Secondaries is expected and this behavior is by design. It is a prerequisite for the idempotency of the oplog that some operations behave this way.

Thanks,

Chris

Valtteri Pirttilä

Sep 6, 2016, 3:39:38 AM
to mongodb-user
Unfortunately, the Cloud Manager OpCount for the secondaries does not reflect the real IO.

I've double-checked the IO usage with several tools, including iostat and Amazon CloudWatch, and they all report roughly twice the number of IO ops on the secondaries compared to the opcounters I'm seeing in Cloud Manager. The Cloud Manager numbers do seem to make sense - I would expect ops in the range it is reporting - but the actual file IO is double.

Here's the situation as reported by iostat -d -x. I've picked a single point in time, but it is a typical pattern:

Primary:

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
xvda              0.00    22.17   12.12  365.61   337.10  9695.55    26.56     1.48    3.91   0.30  11.32

Secondary:

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
xvda              0.00   121.68    8.79  702.81   199.62 13235.16    18.88     1.41    1.98   0.45  31.80

For any single Mongo operation, it would seem that the secondaries are doing two write ops on disk. That is the problem I am trying to solve.

Valtteri Pirttilä

Oct 5, 2016, 6:56:27 AM
to mongodb-user
Hello again. I've had time to run further experiments on MongoDB replica set write ops in the Amazon cloud environment, and the results seem to confirm that this is not an issue with our application but rather ordinary MongoDB behavior.

I created a fresh replica set (the installation steps I used are at the end of this post) and then ran the following script through the primary member's shell:

for (i = 0; i < 100000; i++) {
  db.data.insert({"foo": "bar"})
}

mongostat 30 shows the following on the primary:

insert query update delete getmore command % dirty % used flushes vsize  res qr|qw ar|aw netIn netOut conn    set repl                 time
    *0    *0     *0     *0       0     1|0     0.2    5.6       0  840M 115M   0|0   0|0  424b  1.24k    8 rstest  PRI 2016-10-05T10:28:20Z
   918    *0     *0     *0     317   591|0     1.4    6.7       1  841M 124M   0|0   0|0  448k   434k    9 rstest  PRI 2016-10-05T10:28:50Z
  1116    *0     *0     *0     363   675|0     2.6    7.9       0  857M 136M   0|0   0|0  520k   519k    9 rstest  PRI 2016-10-05T10:29:20Z
  1115    *0     *0     *0     372   689|0     0.9    9.1       1  867M 148M   0|0   0|0  529k   523k   11 rstest  PRI 2016-10-05T10:29:50Z
   183    *0     *0     *0      59   110|0     1.3    9.3       0  868M 150M   0|0   0|0 84.8k  86.2k   11 rstest  PRI 2016-10-05T10:30:20Z

And it looks very similar on the secondaries; here's one of them:


insert query update delete getmore command % dirty % used flushes vsize   res qr|qw ar|aw netIn netOut conn    set repl                 time
    *0    *0     *0     *0       0     1|0     0.0    3.7       1  807M 97.0M   0|0   0|0  164b  1.11k    3 rstest  SEC 2016-10-05T10:28:21Z
  *956    *0     *0     *0       0     1|0     1.8    4.8       0  815M  107M   0|0   0|0  166b  1.12k    3 rstest  SEC 2016-10-05T10:28:51Z
 *1117    *0     *0     *0       0     1|0     1.8    6.0       1  827M  119M   0|0   0|0  166b  1.12k    3 rstest  SEC 2016-10-05T10:29:21Z
 *1117    *0     *0     *0       0     1|0     3.0    7.2       0  838M  130M   0|0   0|0  166b  1.12k    3 rstest  SEC 2016-10-05T10:29:51Z
  *141    *0     *0     *0       0     1|0     0.0    7.4       1  840M  131M   0|0   0|0  166b  1.12k    3 rstest  SEC 2016-10-05T10:30:21Z

And the interesting part is the file IO behavior we get from Amazon CloudWatch:

[chart omitted: CloudWatch WriteOps for the three members during the insert test]

The blue line is the primary, doing a few thousand file WriteOps for these 100k inserts. Yellow and green are the secondaries, each doing approximately 50k file WriteOps for the same 100k inserts.


Here is the installation procedure I used for the instances, which were all identical:


2) Create directories:
sudo mkdir /data /log /journal

sudo chown -R mongod:mongod /data /journal /log

sudo ln -s /journal /data/journal

3) Create the following conf file at /etc/mongod.conf

storage:
  dbPath: "/data"
  engine: "wiredTiger"
  wiredTiger:
    collectionConfig:
      blockCompressor: snappy

systemLog:
  destination: file
  path: "/log/mongod.log"
  logAppend: true
  logRotate: reopen
  timeStampFormat: iso8601-utc

replication:
  replSetName: "rstest"

processManagement:
  fork: true
  pidFilePath: "/var/run/mongods/mongoconf.pid"

net:
  port: 27017

operationProfiling:
  slowOpThresholdMs: 100
  mode: slowOp

4) sudo nano /etc/security/limits.conf

* soft nofile 64000
* hard nofile 64000
* soft nproc 32000
* hard nproc 32000

5) sudo nano /etc/security/limits.d/90-nproc.conf

* soft nproc 32000
* hard nproc 32000

6) ReadAhead:

sudo blockdev --setra 32 /dev/xvda1

echo 'ACTION=="add", KERNEL=="xvda1", ATTR{bdi/read_ahead_kb}="16"' | sudo tee -a /etc/udev/rules.d/85-ebs.rules

7) Disable transparent huge pages defrag according to http://docs.mongodb.org/manual/tutorial/transparent-huge-pages/#transparent-huge-pages-thp-settings

8) Add each server to /etc/hosts

9) Initialize the replica set
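
For reference, the initiation in step 9 was roughly along these lines (the hostnames below are placeholders rather than the real server names, and the exact config document may have differed slightly):

rs.initiate({
    _id: "rstest",
    members: [
        { _id: 0, host: "mongo-a:27017" },
        { _id: 1, host: "mongo-b:27017" },
        { _id: 2, host: "mongo-c:27017" }
    ]
})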


This should be reproducible. I tried it twice and got the same results on both runs.


Valtteri


Valtteri Pirttilä

Oct 5, 2016, 7:02:44 AM
to mongodb-user
A note on the AWS settings: I used m3.medium instances with 100 GB EBS drives.