Using the MongoDB C# driver, I'm trying to pull down sets of documents
for processing externally. I want to pull down at least 10,000-50,000
documents at a time, and this never used to cause a problem. Now that
there are over 20 million documents in the database, though, my query
barely ever resolves. Even when I reduce the limit to 500-1,000
documents I get this error:
"Unable to read data from the transport connection: A connection
attempt failed because the connected party did not properly respond
after a period of time, or established connection failed because
connected host has failed to respond."
On default settings, the error tends to fire after about 31 seconds.
Here's a standard example query that causes the problem:
{
    "LastUpdated" : { "$gte" : ISODate("2011-11-22T15:01:54.851Z"),
                      "$lte" : ISODate("2011-11-22T17:39:48.013Z") },
    "_id" : { "$gt" : "1300broadband.com" }
}
There is an index on LastUpdated. Obviously _id is indexed, and I also
created this composite index { "_id" : 1, "LastUpdated" : -1 }
Here's the explain on limit(5) with no sorting:
{
    "cursor" : "BtreeCursor LastUpdated_-1",
    "nscanned" : 5,
    "nscannedObjects" : 5,
    "n" : 5,
    "millis" : 1,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "isMultiKey" : false,
    "indexOnly" : false,
    "indexBounds" : {
        "LastUpdated" : [
            [
                ISODate("2011-11-22T17:39:48.013Z"),
                ISODate("2011-11-22T15:01:54.851Z")
            ]
        ]
    }
}
The explain when I introduce sorting on the _id column (of type
string) basically never completes, or at least not within a reasonable
enough amount of time for me to extract an explain before the server
dies of old age.
So you're thinking: well, duh, obviously it's the sort causing the
issue. Maybe so, but three things puzzle me:
1) _id is indexed, so why is it causing me problems when trying to
sort on it?
2) Why does the query resolution time increase exponentially as I
increase the limit() size?
3) While my code is running the query, there's no matching currentOp()
returned.
I have the server connect timeout set to 5 minutes and SlaveOk = true.
The socket timeout is at its default, though I've tried setting it to
1 minute, and when I do, the error occurs after about 100 seconds.
Can anyone help me figure this out?
The other option is to use only the index on _id; it will still have
to scan, but the results will be pre-sorted.
I would try a hint on the _id index and see how that performs.
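To illustrate the point about pre-sorting: a toy sketch in plain Python (the data and names are made up, and a sorted list stands in for an index walk) of why scanning an index that is already in _id order yields sorted batches with no in-memory sort step:

```python
import bisect
from datetime import datetime

# Simulated _id index: entries kept in ascending _id order, as a B-tree
# scan would return them. All data here is made up for illustration.
id_index = sorted([
    ("1300broadband.com", datetime(2011, 11, 22, 16, 0)),
    ("aardvark.example",  datetime(2011, 11, 22, 15, 30)),
    ("zebra.example",     datetime(2011, 11, 22, 18, 0)),   # outside the window
    ("middle.example",    datetime(2011, 11, 22, 16, 45)),
])
ids = [k for k, _ in id_index]

def find_sorted(after_id, start, end, limit):
    """Seek past after_id, scan forward, filter on LastUpdated.
    Output is in _id order for free because the index itself is."""
    out = []
    for _id, last_updated in id_index[bisect.bisect_right(ids, after_id):]:
        if start <= last_updated <= end:
            out.append(_id)
            if len(out) == limit:
                break
    return out

batch = find_sorted("1300broadband.com",
                    datetime(2011, 11, 22, 15, 1),
                    datetime(2011, 11, 22, 17, 39), 50)
# batch comes back already sorted by _id; no scanAndOrder step was needed.
```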
Basically, the documents all get updated periodically, so their
LastUpdated values change every now and again, and the query is
designed to pull back changed documents so they can be indexed by
Lucene. The idea is to page through the documents between date A and
date B, using the last document in each batch to find the next batch
between those two dates. When no more documents are returned, the _id
I'm comparing against is reset and the date markers move forward.
This lets me always prioritise the documents that have been waiting
the longest for retrieval since they were updated, and page correctly
even when documents earlier in the database have been updated, which
would break the paging if I were using skip() with limit().
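A minimal sketch of this paging scheme, with an in-memory list standing in for the collection (all names and data are illustrative; fetch_batch plays the role of find({...}).sort({_id: 1}).limit(n)):

```python
# All names and data below are illustrative only.

FIRST_ID = ""  # sentinel that sorts before any real _id string

docs = [
    {"_id": "a.com", "LastUpdated": 10},
    {"_id": "c.com", "LastUpdated": 11},
    {"_id": "b.com", "LastUpdated": 12},
    {"_id": "d.com", "LastUpdated": 30},  # outside the current date window
]

def fetch_batch(after_id, start, end, limit):
    """Keyset page: LastUpdated in [start, end) and _id > after_id,
    sorted by _id ascending."""
    matches = sorted(
        (d for d in docs
         if start <= d["LastUpdated"] < end and d["_id"] > after_id),
        key=lambda d: d["_id"],
    )
    return matches[:limit]

def page_through(start, end, limit=2):
    """Use the last _id of each batch as the cursor for the next; an
    empty batch means the window is drained (the caller would then
    reset the cursor and advance the date markers)."""
    cursor, seen = FIRST_ID, []
    while True:
        batch = fetch_batch(cursor, start, end, limit)
        if not batch:
            return seen
        seen.extend(d["_id"] for d in batch)
        cursor = batch[-1]["_id"]

ids = page_through(10, 20)  # every in-window doc exactly once, in _id order
```

Because the cursor is a value (_id) rather than an offset, documents shifting position between batches can't cause rows to be skipped or repeated, which is the failure mode of skip()/limit().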
Why can't you simply have an index on last_updated, and just query
for last_updated >= XXX and keep track of how far you've read?
db.domains.find({
    "LastUpdated" : {
        "$gte" : ISODate("2011-11-22T15:01:54.851Z"),
        "$lt" : ISODate("2011-11-22T17:39:48.013Z")
    },
    "_id" : { "$gt" : "1300broadband.com" }
}).sort({ _id : 1 }).limit(50).explain()
Here is the explain. Note the 55 minute execution time!!
{
    "cursor" : "BtreeCursor Lastupdated_-1__id_1",
    "nscanned" : 13112,
    "nscannedObjects" : 13100,
    "n" : 50,
    "scanAndOrder" : true,
    "millis" : 3347845,
    "nYields" : 5454,
    "nChunkSkips" : 0,
    "isMultiKey" : false,
    "indexOnly" : false,
    "indexBounds" : {
        "LastUpdated" : [
            [
                ISODate("2011-11-22T17:39:48.013Z"),
                ISODate("2011-11-22T15:01:54.851Z")
            ]
        ],
        "_id" : [
            [
                "1300broadband.com",
                {
                }
            ]
        ]
    }
}
It clearly has a composite index to work with, and the ranges are
clearly defined. Why is this taking so long to execute?

Switching the composite index to { _id : 1, LastUpdated : 1 } fixed
it. Query speeds are back to acceptable levels.
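The "scanAndOrder" : true in the explain is the clue: with the index leading on LastUpdated, every entry in the date range must be fetched and sorted in memory before limit() can apply, whereas an index leading on _id streams results in sorted order and can stop early. A rough sketch of that cost difference over a toy in-memory dataset (illustrative only, not the real server internals):

```python
import random

random.seed(0)

# Toy collection: 10,000 docs with random LastUpdated values.
docs = [{"_id": f"site{i:05}.com", "LastUpdated": random.randint(0, 100)}
        for i in range(10_000)]

def in_window(d):
    return 40 <= d["LastUpdated"] <= 60  # the date-range filter

LIMIT = 50

# Index { LastUpdated: -1, _id: 1 }: the date range bounds the scan, but
# entries arrive in LastUpdated order, so every in-range entry must be
# read and then sorted in memory ("scanAndOrder" : true) before limit().
in_range = [d for d in docs if in_window(d)]
scanned_date_first = len(in_range)
batch_a = sorted(in_range, key=lambda d: d["_id"])[:LIMIT]

# Index { _id: 1, LastUpdated: 1 }: scan in _id order, filter on the
# date, and stop as soon as limit() matches are found -- no sort at all.
scanned_id_first = 0
batch_b = []
for d in sorted(docs, key=lambda d: d["_id"]):
    scanned_id_first += 1
    if in_window(d):
        batch_b.append(d)
        if len(batch_b) == LIMIT:
            break

# Same 50 documents either way, but far fewer index entries touched.
assert batch_a == batch_b and scanned_id_first < scanned_date_first
```

This also fits the observation that time grew sharply with limit(): a bigger limit means a bigger in-memory sort buffer on the date-first index, while on the _id-first index it only means scanning a little further.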