Hi Valtteri,
The behavior you observed, where the IOPS/write patterns on the Secondaries differ from the Primary, can occur when a single operation produces multiple oplog entries. This is a consequence of oplog idempotency: a single operation on the Primary (e.g. an update with {multi: true}) is recorded as individual per-document updates, which the Secondaries then replay one by one. For example:
// Insert 10 documents
for (var i = 0; i < 10; i++) { db.foo.insert({ "_id": i, "message": "this is a test message" }) }
// Update 5 of them using a single update statement
db.foo.update(
{ _id: { $gte: 2, $lte: 6 } },
{ $set: { "message": "this is an updated message" } },
{ multi: true }
)
This single update results in one oplog entry for each document that matches the specified query (in the example above, _id: { $gte: 2, $lte: 6 }). Below is a sample of the oplog showing the five entries produced by that single update:
> use local
> db.oplog.rs.find()
...
{ "ts" : Timestamp(1471935844, 11), "t" : NumberLong(1), "h" : NumberLong("-2676367555054767283"), "v" : 2, "op" : "u", "ns" : "test.foo", "o2" : { "_id" : 2 }, "o" : { "$set" : { "message" : "this is an updated message" } } }
{ "ts" : Timestamp(1471935844, 12), "t" : NumberLong(1), "h" : NumberLong("-2041993120688929804"), "v" : 2, "op" : "u", "ns" : "test.foo", "o2" : { "_id" : 3 }, "o" : { "$set" : { "message" : "this is an updated message" } } }
{ "ts" : Timestamp(1471935844, 13), "t" : NumberLong(1), "h" : NumberLong("6366090271544179458"), "v" : 2, "op" : "u", "ns" : "test.foo", "o2" : { "_id" : 4 }, "o" : { "$set" : { "message" : "this is an updated message" } } }
{ "ts" : Timestamp(1471935844, 14), "t" : NumberLong(1), "h" : NumberLong("-3214242448671565250"), "v" : 2, "op" : "u", "ns" : "test.foo", "o2" : { "_id" : 5 }, "o" : { "$set" : { "message" : "this is an updated message" } } }
{ "ts" : Timestamp(1471935844, 15), "t" : NumberLong(1), "h" : NumberLong("5701116647602331752"), "v" : 2, "op" : "u", "ns" : "test.foo", "o2" : { "_id" : 6 }, "o" : { "$set" : { "message" : "this is an updated message" } } }
...
Please note that idempotency is required so that the oplog can be replayed safely, for example when a Secondary re-applies operations after an interruption, and still produce the same end state. This behavior is therefore by design.
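As a concrete illustration, consider an operator whose effect depends on the document's current state, such as $inc. A minimal sketch you can run against the foo collection from the example above (the exact oplog field layout varies by server version):

// On the Primary: increment a field
db.foo.update({ _id: 2 }, { $inc: { counter: 1 } })

// The oplog records the resulting value as a $set rather than the $inc
// itself, so replaying the entry a second time cannot change the outcome:
// { "op" : "u", "ns" : "test.foo", "o2" : { "_id" : 2 }, "o" : { "$set" : { "counter" : 1 } } }
db.getSiblingDB("local").oplog.rs.find({ ns: "test.foo", op: "u" }).sort({ $natural: -1 }).limit(1)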
It might be helpful if you could provide additional details on the types of operations your system is performing for further analysis and possible recommendations.
Thanks,
Chris
Hi Valtteri,
Thank you for the additional screenshots. However, they appear to show the Primary performing more operations than the Secondaries at that moment.
Are you experiencing a particular issue with your Cloud Manager deployment? If so, please post the relevant logs and the corresponding oplog entries so we can analyze what is causing the difference in op count between the Primary and the Secondaries.
Please note that a difference in op count between the Primary and the Secondaries is expected and by design: as described above, the idempotency of the oplog requires that some operations be expanded into multiple entries.
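If it would help to quantify this on your deployment, the opcounters and opcountersRepl sections of serverStatus show both sides directly; for example, in a mongo shell on each member:

// On the Primary: operations issued by clients
db.serverStatus().opcounters

// On a Secondary: operations applied via replication; a single {multi:true}
// update on the Primary appears here as many individual updates
db.serverStatus().opcountersRepl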
Thanks,
Chris
Primary:
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
xvda 0.00 22.17 12.12 365.61 337.10 9695.55 26.56 1.48 3.91 0.30 11.32
Secondary:
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
xvda 0.00 121.68 8.79 702.81 199.62 13235.16 18.88 1.41 1.98 0.45 31.80
For any single MongoDB operation, the secondaries appear to be doing two write ops on disk. That is the problem I am trying to solve.
Here is the mongostat output from the primary:
insert query update delete getmore command % dirty % used flushes vsize res qr|qw ar|aw netIn netOut conn set repl time
*0 *0 *0 *0 0 1|0 0.2 5.6 0 840M 115M 0|0 0|0 424b 1.24k 8 rstest PRI 2016-10-05T10:28:20Z
918 *0 *0 *0 317 591|0 1.4 6.7 1 841M 124M 0|0 0|0 448k 434k 9 rstest PRI 2016-10-05T10:28:50Z
1116 *0 *0 *0 363 675|0 2.6 7.9 0 857M 136M 0|0 0|0 520k 519k 9 rstest PRI 2016-10-05T10:29:20Z
1115 *0 *0 *0 372 689|0 0.9 9.1 1 867M 148M 0|0 0|0 529k 523k 11 rstest PRI 2016-10-05T10:29:50Z
183 *0 *0 *0 59 110|0 1.3 9.3 0 868M 150M 0|0 0|0 84.8k 86.2k 11 rstest PRI 2016-10-05T10:30:20Z
The secondaries look very similar; here is one of them:
insert query update delete getmore command % dirty % used flushes vsize res qr|qw ar|aw netIn netOut conn set repl time
*0 *0 *0 *0 0 1|0 0.0 3.7 1 807M 97.0M 0|0 0|0 164b 1.11k 3 rstest SEC 2016-10-05T10:28:21Z
*956 *0 *0 *0 0 1|0 1.8 4.8 0 815M 107M 0|0 0|0 166b 1.12k 3 rstest SEC 2016-10-05T10:28:51Z
*1117 *0 *0 *0 0 1|0 1.8 6.0 1 827M 119M 0|0 0|0 166b 1.12k 3 rstest SEC 2016-10-05T10:29:21Z
*1117 *0 *0 *0 0 1|0 3.0 7.2 0 838M 130M 0|0 0|0 166b 1.12k 3 rstest SEC 2016-10-05T10:29:51Z
*141 *0 *0 *0 0 1|0 0.0 7.4 1 840M 131M 0|0 0|0 166b 1.12k 3 rstest SEC 2016-10-05T10:30:21Z
The interesting part is the file I/O behavior reported by Amazon CloudWatch:
The blue line is the primary, doing a few thousand file WriteOps for these 100k inserts. Yellow and green are the secondaries, each doing approximately 50k file WriteOps for the same 100k inserts.
Here is the installation procedure I used for the instances, all identical:
# Give the mongod user ownership of the data, journal, and log volumes
sudo chown -R mongod:mongod /data /journal /log
# Keep the journal on its own volume, symlinked under the dbPath
sudo ln -s /journal /data/journal
3) Create the following config file at /etc/mongod.conf:
storage:
  dbPath: "/data"
  engine: "wiredTiger"
  wiredTiger:
    collectionConfig:
      blockCompressor: snappy
systemLog:
  destination: file
  path: "/log/mongod.log"
  logAppend: true
  logRotate: reopen
  timeStampFormat: iso8601-utc
replication:
  replSetName: "rstest"
processManagement:
  fork: true
  pidFilePath: "/var/run/mongods/mongoconf.pid"
net:
  port: 27017
operationProfiling:
  slowOpThresholdMs: 100
  mode: slowOp
4) sudo nano /etc/security/limits.conf
* soft nofile 64000
* hard nofile 64000
* soft nproc 32000
* hard nproc 32000
5) sudo nano /etc/security/limits.d/90-nproc.conf
* soft nproc 32000
* hard nproc 32000
6) ReadAhead (blockdev --setra takes 512-byte sectors, so 32 sectors = 16 KB, matching the udev rule):
sudo blockdev --setra 32 /dev/xvda1
echo 'ACTION=="add", KERNEL=="xvda1", ATTR{bdi/read_ahead_kb}="16"' | sudo tee -a /etc/udev/rules.d/85-ebs.rules
7) Disabled hugepage defrag according to http://docs.mongodb.org/manual/tutorial/transparent-huge-pages/#transparent-huge-pages-thp-settings
8) Add each server to /etc/hosts
9) Initialize the replica set
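To make steps 3 and 9 concrete, this is roughly what I run; treat it as a sketch, since the hostnames below are placeholders for the /etc/hosts entries. On each instance:

# Start mongod against the config file from step 3 (the PID directory must exist)
sudo mkdir -p /var/run/mongods && sudo chown mongod:mongod /var/run/mongods
sudo -u mongod mongod --config /etc/mongod.conf

Then, in a mongo shell on one member only:

rs.initiate({
  _id: "rstest",
  members: [
    { _id: 0, host: "mongo1:27017" },
    { _id: 1, host: "mongo2:27017" },
    { _id: 2, host: "mongo3:27017" }
  ]
})
rs.status()  // should eventually show one PRIMARY and two SECONDARY members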
This should be reproducible. I tried it twice and got the same results on both runs.
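The workload was the 100k inserts mentioned above, run while watching iostat on each member and the WriteOps metric in CloudWatch. A sketch of how it can be driven from the mongo shell (the collection name and document shape are just illustrative):

// ~100k small single-document inserts against the primary
for (var i = 0; i < 100000; i++) {
  db.testdata.insert({ _id: i, message: "this is a test message" })
}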
Valtteri