I am using MongoDB 2.2.0. I populate a collection with 1M documents:

    var documents = 1000000;
    db.example.drop();
    for (var i = 0; i < documents; i++) {
        db.example.insert({_id: i, value: 0});
    }

Then, from an independent connection, I watch the number of documents in the collection:

    while (true) {
        print("find(filter).count(): " + db.example.find({value: 0}).count());
        print("count(filter): " + db.example.count({value: 0}));
        sleep(1000);
    }

Then I modify every single document, adding a new field (I'm simulating a migration here):

    var affected = 0;
    db.example.find().snapshot().forEach(function(d) {
        d.new_field = true;
        db.example.save(d);
        affected++;
    });
    print(affected);

I use .snapshot() to prevent the cursor from returning documents it has already processed. Its output is, correctly:

[20:19:45][giorgio@Desmond:~]$ mongo add_snapshot.js
MongoDB shell version: 2.2.0
connecting to: test
1000000

However, the output of my watching script is:

MongoDB shell version: 2.2.0
connecting to: test
find(filter).count(): 1000000
count(filter): 1000000
find(filter).count(): 1000288
count(filter): 1008043
find(filter).count(): 1006630
count(filter): 1006200
find(filter).count(): 1006894
count(filter): 1006360
find(filter).count(): 1005347
count(filter): 1007082
find(filter).count(): 1004306
count(filter): 1008218
find(filter).count(): 1006303
count(filter): 1007049
find(filter).count(): 1007005
count(filter): 1006759
find(filter).count(): 1006258
count(filter): 1005788
find(filter).count(): 1005684
count(filter): 1003041
find(filter).count(): 1007427
count(filter): 1006692
find(filter).count(): 1006000
count(filter): 1006322
find(filter).count(): 1007484
count(filter): 1007345
find(filter).count(): 1007936
count(filter): 1004408
find(filter).count(): 1006841
count(filter): 1007516
find(filter).count(): 1007674
count(filter): 1007300
find(filter).count(): 1008179
count(filter): 1005257
find(filter).count(): 1006860
count(filter): 1006940
find(filter).count(): 1008112
count(filter): 1002686
find(filter).count(): 1005848
count(filter): 1006899
find(filter).count(): 1006854
count(filter): 1005871
find(filter).count(): 1006772
count(filter): 1007599
find(filter).count(): 1003431
count(filter): 1007896
find(filter).count(): 1005819
count(filter): 1007818
find(filter).count(): 1008379
count(filter): 1007536
find(filter).count(): 1007804
count(filter): 1007147
find(filter).count(): 1005792
count(filter): 1006068
find(filter).count(): 1007565
count(filter): 1006985
find(filter).count(): 1007409
count(filter): 1007327
find(filter).count(): 1007119
count(filter): 1007088
find(filter).count(): 1003996
count(filter): 1006486
find(filter).count(): 1006202
count(filter): 1007679
find(filter).count(): 1007214
count(filter): 1006594
find(filter).count(): 1005883
count(filter): 1004979
find(filter).count(): 1002196
count(filter): 1000000
find(filter).count(): 1000000
count(filter): 1000000
find(filter).count(): 1000000
count(filter): 1000000
// ...many identical rows...
find(filter).count(): 1000000
count(filter): 1000000

Since a modification to a single document is atomic [1], I would expect there to always be exactly 1M documents in the collection: I modify a single document, save it, and it takes the place of its previous version. Yet the output of the count operations says otherwise: the number of documents grows and eventually comes back to 1M.

What is happening?
I think this is "expected" behavior. I have not tested it, but if you make the count()s also use "snapshot" mode, that may remove the phantom/double reads.
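For instance, one way to route the filtered count through a snapshot cursor (an untested sketch; itcount() counts by iterating the cursor client-side rather than running the count command, so it is slower, but each document is visited at most once):

    // untested: count by iterating a snapshot cursor instead of using count()
    db.example.find({value: 0}).snapshot().itcount()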
Here is what I think is happening:

1) The 1M documents are inserted into the collection closely packed, i.e., with no extra padding or space around the documents.

2) When the update is performed, the document grows and no longer fits in its original location. That forces the document to be moved to the "end" of the collection's allocated space.

3) Since cursors that are not "snapshot" may return the same document more than once, the count is inflated: during the time it takes to scan all 1M documents, some documents are moved from the "start" of the collection to the "end", and all of those moved documents are double counted. There was a good discussion of this in one of Kristina Chodorow's MongoDB books, but I don't have it handy to give a better reference. Sorry.
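The mechanism in 2) and 3) can be sketched with a toy model in plain JavaScript (this is a simulation of the storage behavior, not MongoDB itself; the function name and numbers are made up for illustration):

```javascript
// Toy model of an MMAP-era collection: documents live in an array in storage
// order, and a naive (non-snapshot) scan visits slots by position.
// A "grow" update frees the old slot and appends the document at the end,
// which is what happens when an updated document no longer fits in place.
function naiveScanCount(total, movedDuringScan) {
    var store = [];
    for (var i = 0; i < total; i++) {
        store.push({ _id: i });
    }
    var count = 0;
    for (var pos = 0; pos < store.length; pos++) {
        if (store[pos] !== null) {
            count++; // counts whatever currently occupies the slot
        }
        // Halfway through the scan, documents that were already visited
        // grow, get moved, and land past the current scan position:
        if (pos === Math.floor(total / 2)) {
            for (var j = 0; j < movedDuringScan; j++) {
                store[j] = null;                         // old location freed
                store.push({ _id: j, new_field: true }); // re-appended at the end
            }
        }
    }
    return count;
}

console.log(naiveScanCount(1000, 0));  // 1000: nothing moved, the count is exact
console.log(naiveScanCount(1000, 50)); // 1050: the 50 moved documents are counted twice
```

A snapshot cursor avoids this by traversing the _id index, so moving a document on disk does not change when (or how often) it is visited.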
A count without a filter works because it can probably use the _id index (or another optimization), which is not mutated by the updates.
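Along the same lines, another untested idea: pinning the filtered scan to the _id index with hint() should make the traversal order independent of where documents physically live, since a moved document keeps its position in the _id index:

    // untested: traverse via the _id index so on-disk moves don't change visit order
    db.example.find({value: 0}).hint({_id: 1}).itcount()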