High writeops count on secondaries


Valtteri Pirttilä

Aug 14, 2016, 2:16:48 PM
to mongodb-user
Hi!

I'm running Mongodb 3.2 on Amazon EC2 with 3 shards, each having a replica set of 3 instances.

EC2 instances limit the number of IO operations you can do per second. With this in mind, I've been surprised that our primaries and secondaries seem to write in very different ways: the primaries do roughly half as many write operations as our secondaries, while also writing roughly 50% of the bytes. Below is a chart from our first replica set; the primary is blue, the secondaries are green and orange.

[chart omitted: write operations for replica set 1 - primary in blue, secondaries in green and orange]

Is this behavior something I can change? The secondaries are becoming problematic: the amount of data saved in terms of bytes is very small, but the secondaries are overwhelming our IO operations per second limits. The primaries have no such problem.

Appreciate any insight you can give,

Valtteri

Chris Cunningham

Aug 24, 2016, 12:08:22 AM
to mongodb-user

Hi Valtteri,

The behavior you observed, where the IOPS/write pattern on Secondaries differs from that on Primaries, can occur when a single query results in multiple oplog entries. This happens because of oplog idempotency: a single operation on the Primary (e.g. an update with {multi: true}) is recorded as individual per-document updates, which each Secondary then applies. For example:

// Insert 10 documents
for (i = 0; i < 10; i++) { db.foo.insert({ "_id": i, "message": "this is a test message" }) }

// Update 5 of them with a single update statement
db.foo.update(
    { _id: { $gte: 2, $lte: 6 } },
    { $set: { "message": "this is an updated message" } },
    { multi: true }
)

This single update results in one oplog entry for each document matching the specified query (in the example above, _id: { $gte: 2, $lte: 6 }). Below is a sample of the oplog showing the multiple entries that resulted from the single update:

> use local
> db.oplog.rs.find()
...
{ "ts" : Timestamp(1471935844, 11), "t" : NumberLong(1), "h" : NumberLong("-2676367555054767283"), "v" : 2, "op" : "u", "ns" : "test.foo", "o2" : { "_id" : 2 }, "o" : { "$set" : { "message" : "this is an updated message" } } }
{ "ts" : Timestamp(1471935844, 12), "t" : NumberLong(1), "h" : NumberLong("-2041993120688929804"), "v" : 2, "op" : "u", "ns" : "test.foo", "o2" : { "_id" : 3 }, "o" : { "$set" : { "message" : "this is an updated message" } } }
{ "ts" : Timestamp(1471935844, 13), "t" : NumberLong(1), "h" : NumberLong("6366090271544179458"), "v" : 2, "op" : "u", "ns" : "test.foo", "o2" : { "_id" : 4 }, "o" : { "$set" : { "message" : "this is an updated message" } } }
{ "ts" : Timestamp(1471935844, 14), "t" : NumberLong(1), "h" : NumberLong("-3214242448671565250"), "v" : 2, "op" : "u", "ns" : "test.foo", "o2" : { "_id" : 5 }, "o" : { "$set" : { "message" : "this is an updated message" } } }
{ "ts" : Timestamp(1471935844, 15), "t" : NumberLong(1), "h" : NumberLong("5701116647602331752"), "v" : 2, "op" : "u", "ns" : "test.foo", "o2" : { "_id" : 6 }, "o" : { "$set" : { "message" : "this is an updated message" } } }
...

Please note that the idempotency property is required to ensure the oplog remains robust and correct under many scenarios. Therefore this behaviour is by design.
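
If you want to confirm whether this is what is happening in your deployment, one quick check is to tally the oplog entries by operation type. The following is only a rough sketch (run from the mongo shell on any member, against the default local.oplog.rs collection):

use local
db.oplog.rs.aggregate([
    // "i" = insert, "u" = update, "d" = delete, "c" = command, "n" = no-op
    { $group: { _id: "$op", count: { $sum: 1 } } },
    { $sort: { count: -1 } }
])

If multi-document writes are being expanded, the count of "u" entries will be much higher than the number of update commands actually issued on the Primary.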

It might be helpful if you could provide additional details on the types of operations your system is performing for further analysis and possible recommendations.

Thanks,

Chris

Valtteri Pirttilä

Aug 24, 2016, 7:10:12 AM
to mongodb-user
Hi Chris!

Thank you very much for your response.

I investigated our queries and our oplog, and I'm afraid it doesn't look like we use multi updates or inserts to a degree that would explain the behavior. We have some multi updates, but they are typically very constrained: for example, certain documents of a single user when something very fundamental changes in their data, which forces us to update denormalized data that we thought would be of a more permanent nature. However, these operations are rare, and the IO difference is consistent: the secondaries do roughly double the ops all the time.

The MongoDB Cloud Manager shows Opcounters like this:

[chart omitted: Cloud Manager opcounters for the primary and secondaries]

If I understand correctly, a multi-update on the primary would show up here as more updates on the secondaries? Alas, that is not happening.

On the types of operations we perform: we typically have a lot of in-place updates. I would hazard a guess that over 95% of our updates are in-place. The commands seen in the pictures above are typically findAndModify operations that contain in-place updates.
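
To give a feel for the workload, a typical operation looks roughly like the one below. This is purely illustrative - the collection and field names are made up and are not from our actual schema:

db.sessions.findAndModify({
    query: { _id: "session-123" },
    update: { $set: { lastSeen: new Date() }, $inc: { requestCount: 1 } },   // in-place update of existing fields
    new: true
})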

Respectfully,

Valtteri

Chris Cunningham

Aug 31, 2016, 9:49:39 PM
to mongodb-user

Hi Valtteri,

Thank you for the additional screenshots. However, they appear to show the Primary doing more ops than the Secondaries at that moment.

Are you experiencing any particular issue with your Cloud Manager Deployment that you are concerned with? If so, please post the relevant logs and the corresponding oplog entries for further analysis to find out what is causing the difference in the OpCount between the Primary & Secondaries.

Please note that having a difference in the OpCount between the Primary & Secondaries is expected and this behavior is by design. It is a prerequisite for the idempotency of the oplog that some operations behave this way.

Thanks,

Chris

Valtteri Pirttilä

Sep 6, 2016, 3:39:38 AM
to mongodb-user
Unfortunately, the Cloud Manager OpCount for the secondaries does not reflect the real IO.

I've double-checked the IO usage with several tools, including iostat and Amazon CloudWatch, and they all report roughly twice the number of IO ops on the secondaries compared to the opcounters I'm seeing in Cloud Manager. The Cloud Manager numbers do seem to make sense - I would expect ops in the range it is reporting - but the actual file IO is double.

Here's the situation as reported by iostat -d -x. I've picked a single point in time, but it is a typical pattern:

Primary:

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
xvda              0.00    22.17   12.12  365.61   337.10  9695.55    26.56     1.48    3.91   0.30  11.32

Secondary:

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
xvda              0.00   121.68    8.79  702.81   199.62 13235.16    18.88     1.41    1.98   0.45  31.80

For any single Mongo operation, it would seem that the secondaries are doing two write ops on disk. That is the problem I am trying to solve.

Valtteri Pirttilä

Oct 5, 2016, 6:56:27 AM
to mongodb-user
Hello again. I've had time to run further experiments on MongoDB replica set write ops in the Amazon cloud environment, and the results seem to confirm that this is not an issue with our application but rather ordinary MongoDB behavior.

I created a fresh replica set (the installation steps I used are at the end of this post) and then ran the following script through the primary member's shell:

for (i = 0; i < 100000; i++) {
  db.data.insert({"foo": "bar"})
}

mongostat 30 shows the following on the primary:

insert query update delete getmore command % dirty % used flushes vsize  res qr|qw ar|aw netIn netOut conn    set repl                 time
    *0    *0     *0     *0       0     1|0     0.2    5.6       0  840M 115M   0|0   0|0  424b  1.24k    8 rstest  PRI 2016-10-05T10:28:20Z
   918    *0     *0     *0     317   591|0     1.4    6.7       1  841M 124M   0|0   0|0  448k   434k    9 rstest  PRI 2016-10-05T10:28:50Z
  1116    *0     *0     *0     363   675|0     2.6    7.9       0  857M 136M   0|0   0|0  520k   519k    9 rstest  PRI 2016-10-05T10:29:20Z
  1115    *0     *0     *0     372   689|0     0.9    9.1       1  867M 148M   0|0   0|0  529k   523k   11 rstest  PRI 2016-10-05T10:29:50Z
   183    *0     *0     *0      59   110|0     1.3    9.3       0  868M 150M   0|0   0|0 84.8k  86.2k   11 rstest  PRI 2016-10-05T10:30:20Z

And it looks very similar on the secondaries; here's one of them:


insert query update delete getmore command % dirty % used flushes vsize   res qr|qw ar|aw netIn netOut conn    set repl                 time
    *0    *0     *0     *0       0     1|0     0.0    3.7       1  807M 97.0M   0|0   0|0  164b  1.11k    3 rstest  SEC 2016-10-05T10:28:21Z
  *956    *0     *0     *0       0     1|0     1.8    4.8       0  815M  107M   0|0   0|0  166b  1.12k    3 rstest  SEC 2016-10-05T10:28:51Z
 *1117    *0     *0     *0       0     1|0     1.8    6.0       1  827M  119M   0|0   0|0  166b  1.12k    3 rstest  SEC 2016-10-05T10:29:21Z
 *1117    *0     *0     *0       0     1|0     3.0    7.2       0  838M  130M   0|0   0|0  166b  1.12k    3 rstest  SEC 2016-10-05T10:29:51Z
  *141    *0     *0     *0       0     1|0     0.0    7.4       1  840M  131M   0|0   0|0  166b  1.12k    3 rstest  SEC 2016-10-05T10:30:21Z

And the interesting part is the file IO behavior we get from Amazon CloudWatch:

[chart omitted: CloudWatch WriteOps for the three members during the insert test]

The blue line is the primary, doing a few thousand file WriteOps for these 100k inserts. Yellow and green are the secondaries, each doing approximately 50k file WriteOps for the same 100k inserts.


Here is the installation procedure I used for the instances, which were all identical:


2) Create directories:
sudo mkdir /data /log /journal

sudo chown -R mongod:mongod /data /journal /log

sudo ln -s /journal /data/journal

3) Create the following conf file at /etc/mongod.conf

storage:
  dbPath: "/data"
  engine: "wiredTiger"
  wiredTiger:
    collectionConfig:
      blockCompressor: snappy

systemLog:
  destination: file
  path: "/log/mongod.log"
  logAppend: true
  logRotate: reopen
  timeStampFormat: iso8601-utc

replication:
  replSetName: "rstest"

processManagement:
  fork: true
  pidFilePath: "/var/run/mongods/mongoconf.pid"

net:
  port: 27017

operationProfiling:
  slowOpThresholdMs: 100
  mode: slowOp

4) sudo nano /etc/security/limits.conf

* soft nofile 64000
* hard nofile 64000
* soft nproc 32000
* hard nproc 32000

5) sudo nano /etc/security/limits.d/90-nproc.conf

* soft nproc 32000
* hard nproc 32000

6) ReadAhead:

sudo blockdev --setra 32 /dev/xvda1

echo 'ACTION=="add", KERNEL=="xvda1", ATTR{bdi/read_ahead_kb}="16"' | sudo tee -a /etc/udev/rules.d/85-ebs.rules

7) Disable transparent huge pages defrag according to http://docs.mongodb.org/manual/tutorial/transparent-huge-pages/#transparent-huge-pages-thp-settings

8) Add each server to /etc/hosts

9) Initialize the replica set
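
For reference, the initiation in step 9 was roughly along these lines (the hostnames below are placeholders rather than the real server names, and the exact config document may have differed slightly):

rs.initiate({
    _id: "rstest",
    members: [
        { _id: 0, host: "mongo-a:27017" },
        { _id: 1, host: "mongo-b:27017" },
        { _id: 2, host: "mongo-c:27017" }
    ]
})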


This should be reproducible. I tried it twice and got the same results on both runs.


Valtteri


Valtteri Pirttilä

Oct 5, 2016, 7:02:44 AM
to mongodb-user
A note on the AWS settings: I used m3.medium instances with 100 GB EBS drives.