I would like to run geo queries against a collection of 1M+ documents.
I'm seeing terrible performance for these queries, even though the
output of find().explain() and getIndexes() indicates that a 2d geo
index is being used.
Each entry looks like this:
> db.commits.findOne()
{
    "_id" : ObjectId("4ce4ba7df360623d32000000"),
    "loc" : [
        55.59664,
        13.00156
    ],
    "sha1" : "88d2a028ebfb7ddc9f8a8b11efac03503c4ddd7f",
    "parents" : [
        "5f2c2ed26d5a467b83d5df91c8366f3c4a7caa23"
    ],
    "location" : "Malmö / Sweden",
    "committed_date" : "2010-04-24T05:49:34-07:00",
    "committed_date_native" : "Fri Apr 23 2010 22:49:34 GMT-0700 (PDT)",
    "author" : "FredrikL",
    "authored_date" : "2010-04-24T05:49:34-07:00"
}
Indexes:
> db.commits.getIndexes()
[
    {
        "name" : "_id_",
        "ns" : "processed.commits",
        "key" : {
            "_id" : 1
        }
    },
    {
        "ns" : "processed.commits",
        "name" : "sha1_1",
        "key" : {
            "sha1" : 1
        }
    },
    {
        "ns" : "processed.commits",
        "name" : "loc_2d",
        "key" : {
            "loc" : "2d"
        }
    },
    {
        "ns" : "processed.commits",
        "name" : "committed_date_native_1",
        "key" : {
            "committed_date_native" : 1
        }
    }
]
Of the 955K entries, about 73K match a geo search for the SF Bay
Area. Oddly, the naive JavaScript filter runs faster than the geo
index: 20 seconds vs. 43 seconds. The results are identical.
var geo_javascript = "this.loc ? (this.loc[0] > 37.200000000000003 && this.loc[0] < 38.0 && this.loc[1] > -123.0 && this.loc[1] < -121.0) : false";
var geo_filter = {"loc": {"$within": {"$box": [[37.200000000000003, -123.0], [38.0, -121.0]]}}};
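For clarity, here is the same bounding-box predicate pulled out into a standalone function and applied to a few sample points (a plain-JavaScript sketch; inBayAreaBox and the sample docs are names I made up for illustration):

```javascript
// Standalone version of the geo_javascript predicate above: a doc matches
// when it has a loc field whose [lat, long] pair falls inside the box.
function inBayAreaBox(doc) {
  return doc.loc
    ? (doc.loc[0] > 37.200000000000003 && doc.loc[0] < 38.0 &&
       doc.loc[1] > -123.0 && doc.loc[1] < -121.0)
    : false;
}

console.log(inBayAreaBox({ loc: [37.6, -122.0] }));       // → true (inside the box)
console.log(inBayAreaBox({ loc: [55.59664, 13.00156] })); // → false (the Malmö example doc)
console.log(inBayAreaBox({ sha1: "abc" }));               // → false (no loc field at all)
```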
> db.commits.find(geo_javascript).explain()
{
    "cursor" : "BasicCursor",
    "nscanned" : 955831,
    "nscannedObjects" : 955831,
    "n" : 73774,
    "millis" : 19962,
    "indexBounds" : {
    }
}
> db.commits.find(geo_filter).explain()
{
    "cursor" : "GeoBrowse-box",
    "nscanned" : 73774,
    "nscannedObjects" : 73774,
    "n" : 73774,
    "millis" : 42900,
    "indexBounds" : {
    }
}
The order flips, and the indexed version is faster, when I use a 100K-
entry subset of the same data (with roughly the same proportion of
entries matching the query):
> use processed100k
switched to db processed100k
> db.commits.find(geo_javascript).explain()
{
    "cursor" : "BasicCursor",
    "nscanned" : 100000,
    "nscannedObjects" : 100000,
    "n" : 8588,
    "millis" : 2911,
    "indexBounds" : {
    }
}
> db.commits.find(geo_filter).explain()
{
    "cursor" : "GeoBrowse-box",
    "nscanned" : 8588,
    "nscannedObjects" : 8588,
    "n" : 8588,
    "millis" : 214,
    "indexBounds" : {
    }
}
I converted the query to use the geoNear command, but since it returns
a single result document rather than a cursor, the result set becomes
too large somewhere between 20K and 40K results:
> db.runCommand({geoNear:"commits", near:[37.600000000000001, -122.0], num:100000, maxDistance: 1.0770329614269003}).results.length
Sun Nov 21 02:28:13 uncaught exception: error {
    "$err" : "Invalid BSONObj spec size: 28385338 (3A20B101) first element:ns: \"processed.commits\" ",
    "code" : 10334
}
... so I can't use geoNear.
As for using a regular find() with $near and a max distance (a
circular bound rather than a box), I can't seem to raise the number of
results returned above 100:
> db.commits.find({"loc": {"$near": [37.600000000000001, -122.0], "$maxDistance": 1.0770329614269003}}).count()
100
Including '$num' and 'num' as parameters returns nothing.
I'm not the first one with this problem:
http://stackoverflow.com/questions/3889601/mongodbs-geospatial-index-how-fast-is-it
Any ideas as to how to get reasonable geo performance with larger
collections?
I wonder if some aspect of my data gets worse at larger sizes: there
are lots of duplicate lat/long pairs, and about half of the entries
have no loc field at all. For an indexed query to work, must all
documents in the collection have the queried field defined?
I guess an ugly workaround would be a separate locations collection
holding only the unique locations, each with pointers back to the
larger commits collection, to shrink the search space.
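If it comes to that, the dedup step itself would be simple; a rough sketch in plain JavaScript (buildLocations and the sample docs are invented for illustration):

```javascript
// Collapse commit docs that share a lat/long pair into one "location"
// entry that carries the _ids of the original commits. Docs with no
// loc field are skipped, since they can never match a geo query anyway.
function buildLocations(commits) {
  var byLoc = {};
  commits.forEach(function (c) {
    if (!c.loc) return;                // skip docs without a loc field
    var key = c.loc.join(",");         // e.g. "55.59664,13.00156"
    if (!byLoc[key]) byLoc[key] = { loc: c.loc, commit_ids: [] };
    byLoc[key].commit_ids.push(c._id);
  });
  return Object.keys(byLoc).map(function (k) { return byLoc[k]; });
}

var sample = [
  { _id: 1, loc: [55.59664, 13.00156] },
  { _id: 2, loc: [55.59664, 13.00156] },  // duplicate location
  { _id: 3, loc: [37.6, -122.0] },
  { _id: 4 }                              // no loc field
];
console.log(buildLocations(sample).length); // → 2 unique locations
```

The resulting entries could then be inserted into their own collection with a single 2d index, and geo hits joined back to commits via the stored _ids.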
All the numbers are from a months-old MacBook with 4 GB RAM, 4x2.4 GHz
cores, and an SSD, running MongoDB 1.6.3 installed via Homebrew.
Thanks,
Brandon