Re: [mongodb-user] Slow queries using geospatial indexes


Shane Spencer

Nov 4, 2012, 9:30:59 PM
to mongod...@googlegroups.com
Hey Ben... hit me up off list if you're interested in a lengthy but
practical solution for this.

As for your query and data... it looks like you never hit your index,
because 'indexBounds' is empty. Make sure the index you are using
follows the 2d compound index rules and has the geospatial field
first, and make sure your query leads with the geospatial field as well.
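
A rough sketch of what I mean, in shell syntax (the collection and field
names here are just placeholders for whatever you actually have, and this
is MongoDB 2.x era syntax):

// 2d field first in the index...
db.people.ensureIndex({ location: "2d", snapshot_ids: -1 });

// ...and the geospatial predicate leading the query, in the same order:
db.people.find({
    location: { $near: [ -77.43, 37.54 ], $maxDistance: 0.5 },
    snapshot_ids: 12345
}).explain();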

As for further optimization if you need it once you figure out the indexing...

The gist of it: compute your own geohash for each document and use a
geohash solver to narrow the data down, querying plain string indexes
built on a standard geohash.

I was able to query a few hundred million documents in a few msec for
a very large area (much larger than VA :) by using a first pair,
second pair, third pair compound index of a geohash, and got some
pretty sweet sharding out of it to boot. On top of that, I can place
the geohash after other fields in a compound index.
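
Roughly like this (the collection name, field names, and hash values are
made up for illustration, and the solver that produces the candidate pairs
isn't shown):

// Hypothetical document shape: the geohash split into leading character pairs.
{
    geohash: "bebq8ryksv",
    gh_pair1: "be",     // characters 1-2
    gh_pair2: "bq",     // characters 3-4
    gh_pair3: "8r"      // characters 5-6
}

// Compound index over the pairs:
db.points.ensureIndex({ gh_pair1: 1, gh_pair2: 1, gh_pair3: 1 });

// A geohash solver turns the search area into candidate pairs, then:
db.points.find({
    gh_pair1: "be",
    gh_pair2: { $in: [ "bq", "br", "bx" ] }
});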

- Shane

ben lamothe

Nov 5, 2012, 5:15:10 PM
to mongod...@googlegroups.com, sh...@bogomip.com
Thanks a lot for the reply, but I still have a few more questions.

I assumed that this query used an index because a GeoSearchCursor was used instead of a BasicCursor. Although I can't seem to find much documentation about the GeoSearchCursor per se, this says that a BasicCursor would have been used if the query had run without an index. Also, both nscanned and nscannedObjects are 0; if the query didn't use an index, I would expect nscanned to be all 24 million or so documents in the collection, because it would have to do a full collection scan.

I thought indexBounds would be set only if I used max() or min() in my query, which I did not. Is there something I'm not understanding about the behavior of this type of index, or about what explain() is telling me?
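
For reference, this is roughly what I am running and what I am looking at
in the output (the query here is a simplified stand-in for my real one):

// Simplified stand-in for the actual query.
db.people.find({
    location: { $near: [ -78.65, 37.43 ], $maxDistance: 1 }
}).explain();

// Output fields I am going by:
//   cursor          -- "GeoSearchCursor" in my case; "BasicCursor" would mean no index
//   nscanned        -- 0
//   nscannedObjects -- 0
//   indexBounds     -- empty, which is what prompted the question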

Also, here is the relevant index returned from getIndexes() called on the collection:

{
    "v" : 1,
    "key" : {
        "location" : "2d",
        "snapshot_ids" : -1
    },
    "ns" : "brief_production.people",
    "name" : "location_2d_snapshot_ids_-1"
}

Sundar Rajagopalan

Dec 12, 2012, 3:03:32 PM
to mongod...@googlegroups.com, sh...@bogomip.com
Hi Ben,

I guess you must have found the solution, or a workaround, by now - but to complete the loop:

First, I think your query was causing a full scan of the records matching the distance criteria (regardless of the compound index) because of the "$not".

Second, having an "$or" has the same effect: it causes a complete scan of the documents matching the distance criteria.

Third, and most importantly, compound indexes combined with a "2d" index don't work as well, since the first index criterion (the spatial one) isn't very "selective" - think of it as a monotonically increasing sequence of records ordered by distance (that's not how it's implemented, but as a mental model of how the index behaves, the analogy works) - so adding further criteria to the compound index doesn't help as much.
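
To make the first two points concrete, these are the kinds of query shapes
I mean (illustrative only, not your actual query):

// A negation next to $near: every document within range is still examined
// before the $not is applied.
db.people.find({
    location: { $near: [ -78.65, 37.43 ], $maxDistance: 1 },
    snapshot_ids: { $not: { $in: [ 101, 102 ] } }
});

// An $or has the same effect: the distance-matched documents are scanned
// and each branch of the $or is evaluated against them.
db.people.find({
    location: { $near: [ -78.65, 37.43 ], $maxDistance: 1 },
    $or: [ { snapshot_ids: 101 }, { snapshot_ids: 102 } ]
});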

How (and if) did you overcome this issue?

=Sundar

Shane Spencer

Dec 27, 2012, 11:55:53 AM
to Sundar Rajagopalan, mongod...@googlegroups.com
Sorry for the late reply... it's actually rather difficult for me to explain.

But here goes.

I have polygons and points in my data.  I use geohashes at precision level 1 all the way down to 10 to help me get data for both polygons and points.

For points I simply need to associate the highest-precision geohash with each point and use my own geohash solver to figure out which geohashes need to be queried based on my viewport or virtual range.  I have a LOT of options for how I store geographic information, so I'm just going to explain what I'm doing for a client right now.
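
For illustration, a point ends up stored something like this (the encoder
below is a throwaway standard geohash encoder, not my production solver, and
the collection and field names are placeholders):

// Minimal standard geohash encoder, for illustration only.
function geohashEncode(lat, lon, precision) {
    var BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";
    var latRange = [ -90, 90 ], lonRange = [ -180, 180 ];
    var hash = "", bits = 0, ch = 0, evenBit = true;
    while (hash.length < precision) {
        var range = evenBit ? lonRange : latRange;
        var value = evenBit ? lon : lat;
        var mid = (range[0] + range[1]) / 2;
        if (value >= mid) { ch = (ch << 1) | 1; range[0] = mid; }
        else              { ch = ch << 1;       range[1] = mid; }
        evenBit = !evenBit;
        if (++bits === 5) { hash += BASE32.charAt(ch); bits = 0; ch = 0; }
    }
    return hash;
}

// Each point carries its highest-precision (precision 10) hash:
db.points.insert({
    loc: [ -149.9003, 61.2181 ],                       // [lon, lat]
    geohash: geohashEncode(61.2181, -149.9003, 10)
});
db.points.ensureIndex({ geohash: 1 });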

My client has lots of polygons that need to be presented on a map.  I use GDAL to split the polygons across lots of different geohashes.  First I use precision 1 geohashes and create a document for each precision 1 geohash that the polygon touches... then I do the same for precision 2, and so on until I have reached my target precision.  I can optionally... completely optionally... store extra information in each of the documents that together represent a single polygon.
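
The resulting documents look something like this (the collection name,
hashes, and schema are made up for illustration, and the GDAL step that
computes which cells a polygon touches isn't shown):

// One document per (polygon, geohash cell), at every precision from 1 up to
// the target precision.  Values here are illustrative only.
db.polygon_cells.insert({ polygon_id: "parcel-42", precision: 1, polygon_geohash: "b" });
db.polygon_cells.insert({ polygon_id: "parcel-42", precision: 2, polygon_geohash: "be" });
db.polygon_cells.insert({ polygon_id: "parcel-42", precision: 3, polygon_geohash: "beb" });
// ...one insert per touched cell, continuing down to the target precision.

db.polygon_cells.ensureIndex({ polygon_geohash: 1, polygon_id: 1 });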

For instance, we are toggling some statistics for each high-precision geohash that a polygon touches... so one of the documents will contain some statistical information that users of the data can change.  When that happens we also update the statistics in the lower-precision documents.  This way we are pre-aggregating information, and it is immediately accessible at whatever zoom level we need.
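
The pre-aggregation is then just a matter of pushing the same delta into
every lower-precision ancestor cell by trimming the hash (field names and
the counter are placeholders, and the ancestor documents are assumed to
already exist):

// Update the touched cell and all of its lower-precision ancestors.
function bumpStat(polygonId, cellHash, delta) {
    for (var p = cellHash.length; p >= 1; p--) {
        db.polygon_cells.update(
            { polygon_id: polygonId, polygon_geohash: cellHash.substr(0, p) },
            { $inc: { "stats.toggles": delta } }
        );
    }
}

bumpStat("parcel-42", "beb1u7x2", 1);    // illustrative values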

To find documents that match a specific viewport or range I have my own options as well.  I use the same fast algorithm to create a list of the most appropriate geohashes based on user or site parameters.  I determine the precision from the Google Maps zoom level and then find all geohashes that are contained in or touch the viewport.  It's incredibly fast to solve, and it gives me direct index access to my data.  I simply need to query for "polygon_geohash in ['hash', ....]", as sketched below.
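
So the read path is just this (the hashes come out of the solver for the
current viewport and zoom-derived precision; the values here are made up):

var coveringHashes = [ "beb1", "beb3", "beb4", "beb6" ];

db.polygon_cells.find({ polygon_geohash: { $in: coveringHashes } });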

I can optionally choose to query one geohash at a time based on a concentric query as well.  This can sometimes speed up applications that need a quick redraw as users move around.

I chose this technique before sharding of geographic data was available.  What I ended up with was an ultimately more shardable solution that I could tune myself.

So, the point: you don't need to use geospatial indexes, though they are wicked fast for point data.  I find that storing geospatial information using my own geohash and quadtree 'user oriented' indexing allows for a lot of flexibility.

And it's easy enough to say that being able to rebuild pyramids of aggregated statistics for your data from the highest precision is amazingly useful.  It completely removes the need to do larger queries later on.

- Shane


On Mon, Nov 5, 2012 at 1:29 PM, Sundar Rajagopalan <happy...@gmail.com> wrote:
Hi Shane,

I'm having problems similar to the OP's, so I'm interested in the solution too.

Can you please elaborate on - "I was able to query a few hundred million documents by using a first 
pair, second pair, third pair compound index of a geohash"...?

What did the compound index contain? Are you suggesting that you computed a geohash, and indexed it as part of a compound index?

Doesn't the GeoHaystackIndex (www.mongodb.org/display/DOCS/Geospatial+Haystack+Indexing) do pretty much the same thing?

Thanks for your time!

Sundar

Shane Spencer

Dec 27, 2012, 12:06:22 PM
to Sundar Rajagopalan, mongod...@googlegroups.com
It's worth pointing out that this makes 'near' queries very, very simple.  You can simply select a handful of geohashes that represent your bounds and use GDAL to calculate the distance as results come in.

I also use $min and $max when I want access to the highest-precision documents from a lower precision.

So for all documents under the gghj57 geohash (precision 6) I can select the range $min: gghj570000 to $max: gghj57zzzz instead of using a regular expression to do a partial match.
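
Written out with range operators instead of the min()/max() cursor modifiers
(assuming lowercase geohashes in a string field and a target precision of 10,
with the same placeholder names as above):

// Everything at precision 10 underneath the precision-6 cell gghj57.
db.polygon_cells.find({
    polygon_geohash: {
        $gte: "gghj570000",    // lowest 10-character hash with prefix gghj57
        $lte: "gghj57zzzz"     // highest 10-character hash with prefix gghj57
    }
});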