Inconsistent query results on large data and result sets


Shi Shei

Jan 6, 2012, 12:47:53 PM1/6/12
to mongodb-user
We are running mongodb-linux-x86_64-2.0.1.
The problem we encounter is that not all documents are returned. This
happens only on large result sets (more than 1 million documents) in
large databases with more than 50 million documents.
Our large databases are all sharded. We never encountered the problem
on our unsharded system, which holds only about 23 million documents.

It seems that a query run through the Java driver (2.7.2) may return
2.8 million documents instead of the actual 3.38 million. The
last returned document is empty; all of its fields are empty.
However, since the returned result is sorted on _id, we can re-execute
the query with _id greater than the last received _id. This was
our workaround to retrieve the whole result set.
Can you suggest another solution for this issue?
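
The resume-by-_id workaround described here can be sketched as follows. This is a minimal self-contained sketch: `fetchSortedBatch` is a hypothetical stand-in for the real driver call (a find with `_id: {$gt: lastId}`, sorted on `_id` and limited), simulated over an in-memory array.

```javascript
// In-memory stand-in for the collection; 25 documents with _id 1..25.
const docs = Array.from({length: 25}, (_, i) => ({_id: i + 1}));

// Hypothetical stand-in for e.g.
// find({shopId: X, _id: {$gt: minId}}).sort({_id: 1}).limit(limit)
function fetchSortedBatch(minId, limit) {
  return docs.filter(d => d._id > minId).slice(0, limit);
}

// Iterate the whole result set, restarting from the last seen _id
// whenever a batch ends, instead of trusting one long-lived cursor.
function fetchAll(batchLimit) {
  const out = [];
  let lastId = 0;                          // resume point: last _id seen so far
  for (;;) {
    const batch = fetchSortedBatch(lastId, batchLimit);
    if (batch.length === 0) break;         // no more documents after lastId
    out.push(...batch);
    lastId = batch[batch.length - 1]._id;  // next query starts after this _id
  }
  return out;
}
```

Each iteration issues a fresh query, so a cursor that dies mid-way only costs one batch, not the remainder of the result set.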

Please also have a look at my console output, where I iterate over and
count the returned documents.
The result varies from 937,000 to 3.35 million, but I have never gotten
the whole 3.38 million.

As you can see, I also played with sort and batchSize. Even the
simplest query, without sort or batchSize, returned a wrong result.

mongos> var cursor = db.offer.find({shopId: 4640},{_id: 1}).batchSize(100).sort({_id:1});
mongos> var count = 0;
mongos> for (var i = 0; cursor.hasNext(); ++i){cursor.next(); count++;}
2727700
mongos> db.offer.count({shopId:4640})
3381735
mongos> var count = 0;
mongos> var cursor = db.offer.find({shopId: 4640},{_id: 1}).batchSize(10000);
mongos> for (var i = 0; cursor.hasNext(); ++i){cursor.next(); count++;}
3351902
mongos> db.offer.count({shopId:4640})
3382546
mongos> var count = 0;
mongos> var cursor = db.offer.find({shopId: 4640},{_id: 1}).batchSize(10000);
mongos> for (var i = 0; cursor.hasNext(); ++i){cursor.next(); count++;}
3351901
mongos> db.offer.count({shopId:4640})
3382546
mongos> var count = 0;
mongos> var cursor = db.offer.find({shopId: 4640});
mongos> for (var i = 0; cursor.hasNext(); ++i){cursor.next(); count++;}
937007
mongos> db.offer.count({shopId:4640})
3382648
mongos>

What information do you need to track down this issue?
Thank you!

Antoine Girbal

Jan 6, 2012, 4:18:24 PM1/6/12
to mongodb-user
There is indeed a bug in count() on a sharded system, which may
report more documents than actually exist:
https://jira.mongodb.org/browse/SERVER-3645

In short:
During chunk migrations, some documents exist on both the source and
target shards.
There can also be leftover documents from previous failed migrations.
find() properly filters out the documents that do not belong to a
shard, but this does not apply to count(), so it may overcount.

Now, the query not returning all documents is a real problem.
Maybe migrations are going on, and iterating over so many documents is
interrupted by a migration.
In theory this should not happen (documents stay on the old shard as
long as a cursor is open on them), but maybe there is a bug.
It is also very odd that it would return an empty document instead of
throwing an error.
Could you try turning off the balancer temporarily and see whether
queries return consistent results then?

Shi Shei

Jan 6, 2012, 7:32:49 PM1/6/12
to mongodb-user
I'm aware of the bug you mention, but I don't think it explains our
observation, because the difference is too large. I can hardly believe
that the same query returns fewer than a million documents and, some
minutes later, more than 3 million. Even though it's a live system, our
documents do not change that often, and I don't think the balancer is
moving that many chunks in such a short period.

However, I will try your suggestion and turn off the balancer to see
whether that yields consistent results. I'll keep you up to date on
Monday.
Thank you!

Eliot Horowitz

Jan 7, 2012, 12:50:32 AM1/7/12
to mongod...@googlegroups.com
Can you run this test with mongos running in verbose mode?

Shi Shei

Jan 7, 2012, 5:10:42 AM1/7/12
to mongodb-user
Yes, of course.

Shi Shei

Jan 9, 2012, 6:15:46 AM1/9/12
to mongodb-user
I've set up a new router on my local machine to get clean logs that
don't interfere with requests from other mongo clients.
I couldn't find a way to attach the log file here, so you can download
it from:
http://www.yourfilelink.com/get.php?fid=743890

I executed the same statements as before. The results are as weird as
before.
Here is the console output:

MongoDB shell version: 2.0.1
connecting to: localhost:27018/offerStore
mongos> var cursor = db.offer.find({shopId: 4640},{_id: 1}).batchSize(100).sort({_id:1});
mongos> var count = 0;
mongos> for (var i = 0; cursor.hasNext(); ++i){cursor.next(); count++;}
2725800
mongos> db.offer.count({shopId:4640})
3381864
mongos> var count = 0;
mongos> var cursor = db.offer.find({shopId: 4640},{_id: 1}).batchSize(10000);
mongos> for (var i = 0; cursor.hasNext(); ++i){cursor.next(); count++;}
3351222
mongos> db.offer.count({shopId:4640})
3381866
mongos> var count = 0;
mongos> var cursor = db.offer.find({shopId: 4640},{_id: 1}).batchSize(10000);
mongos> for (var i = 0; cursor.hasNext(); ++i){cursor.next(); count++;}
3351223
mongos> db.offer.count({shopId:4640})
3381887
mongos> var count = 0;
mongos> var cursor = db.offer.find({shopId: 4640});
mongos> for (var i = 0; cursor.hasNext(); ++i){cursor.next(); count++;}
806915
mongos> db.offer.count({shopId:4640})
3381996
mongos>


On Jan 7, 6:50 am, Eliot Horowitz <el...@10gen.com> wrote:
> Can you run this test with mongos running in verbose mode?
>

Shi Shei

Jan 9, 2012, 6:25:43 AM1/9/12
to mongodb-user
I'll execute the same test with balancer off right now.

Shi Shei

Jan 9, 2012, 7:21:49 AM1/9/12
to mongodb-user
Here is the same test with the balancer turned off:

MongoDB shell version: 2.0.1
connecting to: localhost:27018/offerStore
mongos> use config
switched to db config
mongos> db.settings.update( { _id: "balancer" }, { $set : { stopped: true } } , true );
mongos> while( db.locks.findOne({_id: "balancer"}).state ) { print("waiting..."); sleep(1000); }
mongos> use offerStore
switched to db offerStore
mongos> var cursor = db.offer.find({shopId: 4640},{_id: 1}).batchSize(100).sort({_id:1});
mongos> var count = 0;
mongos> while(cursor.hasNext()){cursor.next(); count++;}
2726100
mongos> db.offer.count({shopId:4640})
3383877
mongos> var count = 0;
mongos> var cursor = db.offer.find({shopId: 4640},{_id: 1}).batchSize(10000);
mongos> while(cursor.hasNext()){cursor.next(); count++;}
3353297
mongos> db.offer.count({shopId:4640})
3383883
mongos> var count = 0;
mongos> var cursor = db.offer.find({shopId: 4640},{_id: 1}).batchSize(10000);
mongos> while(cursor.hasNext()){cursor.next(); count++;}
3353251
mongos> db.offer.count({shopId:4640})
3383879
mongos> var count = 0;
mongos> var cursor = db.offer.find({shopId: 4640});
mongos> while(cursor.hasNext()){cursor.next(); count++;}
806766
mongos> db.offer.count({shopId:4640})
3383966
mongos>


As you can see, the results are nearly the same.
You can download the router logs here:
http://www.yourfilelink.com/get.php?fid=743894

Could it be that some items of every batch get lost? With a total of
3.38 million items, I'd have 33,800 batches at a batchSize of 100, and
I'm missing roughly 650,000 items.
When I set batchSize to 10,000, I should have 338 batches, and I'm
missing fewer items than before, still roughly 30,000.

The last example, where I don't use batchSize, returns the fewest
items. However, the whole document is returned (not just _id), so it
may be that only a few documents fit in one batch and many more batches
have to be requested. If every batch loses some of its items, the total
number of returned items is heavily decimated.
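
The batch arithmetic above can be checked directly; the figures are the approximate ones quoted in this message, and `numBatches` is just a helper for the division.

```javascript
// Rough check of the batch arithmetic: how many getMore round trips a
// result set of ~3.38M documents needs at a given batchSize.
function numBatches(totalDocs, batchSize) {
  return Math.ceil(totalDocs / batchSize);
}

const total = 3380000;                  // approximate result-set size from the thread
console.log(numBatches(total, 100));    // 33800 batches (~650k docs went missing)
console.log(numBatches(total, 10000));  // 338 batches   (~30k docs went missing)
```

The number of missing documents does shrink with the number of batches, which is what motivates the per-batch-loss hypothesis.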

Eliot Horowitz

Jan 9, 2012, 1:52:23 PM1/9/12
to mongod...@googlegroups.com
Is the mongos also 2.0.1?
Can you try 2.0.2?


Antoine Girbal

Jan 9, 2012, 6:52:15 PM1/9/12
to mongodb-user
In addition to trying with 2.0.2, could you:
- give a sample document. For regular-size documents, the default
100-doc batch size should easily fit all 100 docs in one batch.
- give the output of db.printShardingInfo()
- for the run that returns the fewest docs, tell us whether the number
of docs you get is roughly constant or changes randomly?

Shi Shei

Jan 10, 2012, 5:50:20 AM1/10/12
to mongodb-user
> Is the mongos also 2.0.1?

Yes, the mongos was also 2.0.1

> For regular size documents the regular 100 doc batch size
> should be able to easily fit all 100 docs in 1 batch.

Yes, I do think that 100 docs should fit in one batch. The documents
are not that huge (see attached).

> is the number of docs you get about constant or changes randomly?

It's never exactly the same number (it's a live system), but the order
of magnitude is the same. For a total of roughly 3.5M documents, I got
930k yesterday, then 807k, and today 792k documents for the same query
without batchSize.

Here is the session, replayed with mongos 2.0.2.
Please see the sharding status, stats, and 3 obfuscated sample
documents here:
http://pastebin.com/raw.php?i=YQRbCa0e
The mongos logs can be downloaded here:
http://www.yourfilelink.com/get.php?fid=744131

mongos> var count = 0;
mongos> while(cursor.hasNext()){cursor.next(); count++;}
1113400
mongos> db.offer.count({shopId:4640})
3566867
mongos> var count = 0;
mongos> var cursor = db.offer.find({shopId: 4640},{_id: 1}).batchSize(10000);
mongos> while(cursor.hasNext()){cursor.next(); count++;}
3533439
mongos> db.offer.count({shopId:4640})
3564098
mongos> var count = 0;
mongos> var cursor = db.offer.find({shopId: 4640},{_id: 1}).batchSize(10000);
mongos> while(cursor.hasNext()){cursor.next(); count++;}
3533661
mongos> db.offer.count({shopId:4640})
3564105
mongos> var count = 0;
mongos> var cursor = db.offer.find({shopId: 4640});
mongos> while(cursor.hasNext()){cursor.next(); count++;}
792141
mongos> db.offer.count({shopId:4640})
3564145

Shi Shei

Jan 10, 2012, 5:56:48 AM1/10/12
to mongodb-user
One document is on average about 2 kB, so one batch (4 MB) should fit
up to 2,000 documents.

Shi Shei

Jan 10, 2012, 6:20:37 AM1/10/12
to mongodb-user
I forgot to copy & paste the first line of my replayed test. However,
it's the same as yesterday (batchSize 100), which returned 2.7M
documents yesterday and 1.1M documents today.

mongos> var cursor = db.offer.find({shopId: 4640},{_id: 1}).batchSize(100).sort({_id:1});
mongos> var count = 0;
mongos> while(cursor.hasNext()){cursor.next(); count++;}
1113400

Eliot Horowitz

Jan 10, 2012, 11:12:37 AM1/10/12
to mongod...@googlegroups.com
Can you turn up the verbosity on mongos:
db.adminCommand( { setParameter : 1 , logLevel : 5 } )
and do the same thing?


Shi Shei

Jan 11, 2012, 4:14:04 AM1/11/12
to mongodb-user
I thought I started mongos with the -vvvvvv parameter. Didn't I?
Well, I will execute the command you posted and re-run the test.

Shi Shei

Jan 11, 2012, 9:06:53 AM1/11/12
to mongodb-user
Here it is again. It took some time because we had some hardware
trouble with a config server (sx175) and needed to replace it (sx210
now), in case you stumble over that in the logs.
The mongos logs can be downloaded here:
http://www.yourfilelink.com/get.php?fid=744438

I executed the batchSize(10000) query only once because it always
produces the same result (same order of magnitude).

connecting to: localhost:27018/offerStore
mongos> db.adminCommand( { setParameter : 1 , logLevel : 5 } )
{ "was" : 6, "ok" : 1 }
mongos> var cursor = db.offer.find({shopId: 4640},{_id: 1}).batchSize(100).sort({_id:1});
mongos> var count = 0;
mongos> while(cursor.hasNext()){cursor.next(); count++;}
2861800
mongos> db.offer.count({shopId:4640})
3572829
mongos> var count = 0;
mongos> var cursor = db.offer.find({shopId: 4640},{_id: 1}).batchSize(10000);
mongos> while(cursor.hasNext()){cursor.next(); count++;}
3542187
mongos> db.offer.count({shopId:4640})
3572834
mongos> var count = 0;
mongos> var cursor = db.offer.find({shopId: 4640});
mongos> while(cursor.hasNext()){cursor.next(); count++;}
758560
mongos> db.offer.count({shopId:4640})
3572905

Shi Shei

Jan 12, 2012, 6:29:46 AM1/12/12
to mongodb-user
It seems that the iterator did not finish, due to a timeout or server
restart.
However, no server was restarted!
Also, a batchSize of 100 shouldn't cause the cursor to time out. What
happened to Mongo???

Furthermore, I couldn't declare any variable in the mongo shell;
please see below:

mongos> var count = 0;
mongos> var cursor = db.offer.find({shopId: 4640},{_id:1,bokey:1}).batchSize(100);
mongos> while(cursor.hasNext()){obj=cursor.next(); if(obj.bokey==undefined){printjson(obj)}; count++;}
{
    "$err" : "getMore: cursor didn't exist on server, possible restart or timeout?",
    "code" : 13127
}
2912200
mongos> count
2912201
mongos> var count = 0;
getMore: cursor didn't exist on server, possible restart or timeout?
mongos> var count = 0;
getMore: cursor didn't exist on server, possible restart or timeout?
mongos> var foo = 0;
getMore: cursor didn't exist on server, possible restart or timeout?
mongos>

Mongos logs can be downloaded here:
http://www.yourfilelink.com/get.php?fid=744691

Please advise.
Thanks!

Antoine Girbal

Jan 12, 2012, 2:45:39 PM1/12/12
to mongodb-user
How long does the query take to iterate through the documents?
You are correct that, in theory, the cursor is touched every time a
batch is fetched, so it should not time out.

Creating the variable does work; it's just an oddity of the shell,
which displays the result of the last getLastError even when the
statement didn't go to the server.
This was fixed in the current 2.1 dev branch.

Taking a look at the logs.

Shi Shei

Jan 12, 2012, 4:51:13 PM1/12/12
to mongodb-user
It took roughly 25 minutes to iterate.
Using batchSize(10000) was faster, around 5 minutes.

Thanks for looking at the logs!

Btw: with the no-timeout query option I can iterate through *all*
documents, but that's not an option for us, because I can't guarantee
that the cursor is always closed by the client; the cursors would
stack up on the server.
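
The concern about cursors stacking up can be illustrated with a small sketch: whatever happens during iteration, a no-timeout cursor has to be closed, typically in a finally block. `openNoTimeoutCursor` is a hypothetical stand-in for a driver call with the no-timeout option, simulated here over an in-memory array.

```javascript
// Hypothetical stand-in for opening a server-side cursor with the
// no-timeout option; tracks whether close() was ever called.
function openNoTimeoutCursor(docs) {
  let i = 0;
  return {
    closed: false,
    hasNext() { return i < docs.length; },
    next()    { return docs[i++]; },
    close()   { this.closed = true; },
  };
}

// Iterate the cursor, guaranteeing close() runs even if iteration throws.
function countAll(cursor) {
  let count = 0;
  try {
    while (cursor.hasNext()) { cursor.next(); count++; }
  } finally {
    cursor.close();   // without this, a no-timeout cursor leaks on the server
  }
  return count;
}
```

If the client can't guarantee this close-on-every-path discipline (crashes, killed processes), the server accumulates orphaned cursors, which is exactly the objection raised above.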

Antoine Girbal

Jan 12, 2012, 5:01:03 PM1/12/12
to mongodb-user
If you use the no-timeout option, can you iterate through all
documents regardless of the batch size you use?
Does the problem basically go away?

Shi Shei

Jan 12, 2012, 5:45:52 PM1/12/12
to mongodb-user
Yes.

Shi Shei

Jan 13, 2012, 12:34:33 PM1/13/12
to mongodb-user
Should I open a JIRA for this?

Antoine Girbal

Jan 13, 2012, 4:27:40 PM1/13/12
to mongodb-user
I tried to reproduce this by pulling documents over a long period of
time through mongos, but I haven't been able to make it fail so far.
I asked the kernel team and no one knows of such an issue, so yes,
please open a server ticket for this.

Shi Shei

Jan 13, 2012, 7:14:26 PM1/13/12
to mongodb-user
Done: https://jira.mongodb.org/browse/SERVER-4680
Please check and assign it.
Thanks!

Shi Shei

Jan 16, 2012, 1:40:38 PM1/16/12
to mongodb-user
This is very hot: I ran into the same issue using a similar query, but
the no-timeout option didn't help. Only batchSize and the returned
fields had an impact on whether the iterator could go through all
documents or not.
I was able to copy the whole result set into a brand-new collection so
I could play with it. I could reproduce the inconsistent query results
on this collection too! Then I dumped the collection and restored it
into a brand-new sharded system, where I could reproduce the same
phenomenon!
Then I executed:
db.shop199850.validate({full:true})
But all seemed OK. You can check the output here: http://pastie.org/3196491
db.repairDatabase() didn't help either:
{
    "raw" : {
        "s1/localhost:20017" : {
            "ok" : 1
        },
        "s2/localhost:20018" : {
            "ok" : 1
        },
        "s3/localhost:20019" : {
            "ok" : 1
        }
    },
    "ok" : 1
}

Let me show you the console output of the queries.

1) two returned fields (id, bokey) with batchSize 10000 OK

2) two returned fields (id, sku) with batchSize 10000 OK

3) three returned fields (id, bokey, sku) with batchSize 10000 NOT OK

4) three returned fields (id, bokey, sku) with batchSize 1000 OK

mongos> var cursor = db.shop199850.find({shopId: 199850},{_id:1,bokey:1}).batchSize(10000);
mongos> var count=0;while(cursor.hasNext()){obj=cursor.next(); count++;}
89605
mongos> var cursor = db.shop199850.find({shopId: 199850},{_id:1,sku:1}).batchSize(10000);
mongos> var count=0;while(cursor.hasNext()){obj=cursor.next(); count++;}
89605
mongos> var cursor = db.shop199850.find({shopId: 199850},{_id:1,bokey:1,sku:1}).batchSize(10000);
mongos> var count=0;while(cursor.hasNext()){obj=cursor.next(); count++;}
25305
mongos> var cursor = db.shop199850.find({shopId: 199850},{_id:1,bokey:1,sku:1}).batchSize(1000);
mongos> var count=0;while(cursor.hasNext()){obj=cursor.next(); count++;}
89605
mongos>

I restored the same collection into an unsharded system and could NOT
reproduce the error, at least not with batchSize 10000 and the 3
returned fields above.
My sharded system consists of 3 shards, 1 config server, and 1 router.
Each shard has only 1 node. I'm running mongo v2.0.1 on 64-bit Linux.

The collection is only 16 MB (compressed), so it would be easy to send
it to you so you could hopefully reproduce and track down the issue.
Of course, our data is confidential. How can I send it to you
privately?

Eliot Horowitz

Jan 17, 2012, 12:42:46 AM1/17/12
to mongod...@googlegroups.com
Can you open a case @ http://jira.mongodb.org/
The community private section will create a private ticket, so only
10gen engineers can see it.

Shi Shei

Jan 17, 2012, 5:25:38 AM1/17/12
to mongodb-user
Here it is:
https://jira.mongodb.org/browse/SUPPORT-202

Thank you!

On Jan 17, 6:42 am, Eliot Horowitz <el...@10gen.com> wrote:
> Can you open a case @http://jira.mongodb.org/
> The community private section will create a private ticket, so only
> 10gen engineers can see it.
>

Nick Gerner

Sep 11, 2012, 6:02:24 PM9/11/12
to mongod...@googlegroups.com
Any updates on this issue? I think I'm reproducing it with mongo 2.0.2 (a sharded setup) and Java driver 2.9.1.

I set my batchSize to 10000 and I only get 25400 rows in my result set (this is what itcount returns, and what I get when I iterate in my app logic using .hasNext() and .next()).
When I run count, or use a different (smaller) projection, or use a smaller (like 100 or 1000) or much larger (like 100000) batchSize, I get all 125149 rows.

So what's up with that?

Nick Gerner

Sep 11, 2012, 7:20:41 PM9/11/12
to mongod...@googlegroups.com
Here are some surprising results:

mongos> d=new Date();printjson(db.visitgrain.find(q).count());new Date() - d
125149
5057

d=new Date();printjson(db.visitgrain.find(q).itcount());new Date() - d
125149
6526

so far, so good.

now, let's bump up the batch size (it should only affect how results get grouped into single getMore calls, right?)
mongos> d=new Date();printjson(db.visitgrain.find(q).batchSize(3000).itcount());new Date() - d
8678
603

that's fast, but clearly incorrect

mongos> d=new Date();printjson(db.visitgrain.find(q).batchSize(30000).itcount());new Date() - d
43732
1863

still wrong

mongos> d=new Date();printjson(db.visitgrain.find(q).batchSize(300000).itcount());new Date() - d
125149
4072

at last, we're correct (and faster than the default batch size)

If I run these queries a second time, I get exactly the same results in approximately the same time.

so what's up with this?

Antoine Girbal

Sep 12, 2012, 10:56:54 AM9/12/12
to mongod...@googlegroups.com
Nick,
there have been tickets fixed since 2.0.2 that are related to this bug, basically around batchSize and sharding.
Could you try the latest, 2.0.7?

Nick Gerner

Nov 6, 2012, 7:07:58 PM11/6/12
to mongod...@googlegroups.com
By the way, upgrading to 2.2.0 solved my problem.