Using geohash in GAE

178 views
Skip to first unread message

Ubaldo Huerta

unread,
Aug 1, 2008, 9:25:18 AM8/1/08
to Google App Engine
Just discovered geohash a couple of days ago. I was about to write off
GAE for a new project, and instead use MySQL or Postgres spatial
extensions. It doesn't look like, given the big table architecture,
that lat-long indexes will be supported soon.

I'd love to know what can and can't be done with geohash, since there
are also plenty of people claiming that there are false positives in
bounding box searches, that doesn't work near lat- long 0, 90, 180,
etc, etc

I've tried myself a few cases of querying if a point falls within a
bounding box (using geohash of SW and NE corners) as described below
and it seems to work (tried with small bounding boxes, representing
areas of 10 x 10 km approx). Haven't tested near equator, poles, etc,
etc nor have studied geohash algorithm in detail.

http://labs.metacarta.com/blog/27.entry/geographic-queries-on-google-app-engine/.

Is there mathematical prove that bounding box searches as described by
metacarta blog work? Are there any special cases?

It would also like to know what other queries can be performed using
geohashes. For example, can it be used to test if two bounding boxes
intersect (have common area)?

Here's also a claim that google is using geohash internally.
http://blog.labix.org/2008/05/20/google-using-geohash

Calvin Spealman

unread,
Aug 1, 2008, 1:19:31 PM8/1/08
to google-a...@googlegroups.com
Geohash makes a great approximation, but I think you'd often want to
use it for a resolution just larger than what you need, and then
exclude the results that are along the edge, outside your intended
bounding box. But, yes, geohash is a really good algorithm. I've
looked at using it before, but had only run across it after the work
on traditional bounding boxes was finished for that project.

--
Read my blog! I depend on your acceptance of my opinion! I am interesting!
http://ironfroggy-code.blogspot.com/

Carsten Neubert

unread,
Aug 4, 2008, 1:33:44 PM8/4/08
to Google App Engine
On Aug 1, 3:25 pm, Ubaldo Huerta <uba...@gmail.com> wrote:
>I'd love to know what can and can't be done with geohash, since there
>are also plenty of people claiming that there are false positives in
>bounding box searches, that doesn't work near lat- long 0, 90, 180,
>etc, etc

The underlying structure of a geohash is, in essence, a quadtree.
Locations with a similiar "prefix" will be in the same sub-tree.
Problem is, even two very nearby locations might end up being in
totally different branches of the tree. So, if you want to make an
efficient search for nearby locations, you'll have to include up to 4
different prefixes.

Simple example: you want to find all locations within an area that
happens to cross the equator. You'll have to search for two different
prefixes - one for each side of the equator. And the same sort of
problem doesn't just happen near the equator. The equator is just the
worst case example.

Ubaldo Huerta

unread,
Aug 4, 2008, 1:46:54 PM8/4/08
to Google App Engine
Hi Carsten

I've seen what you've described at http://github.com/davetroy/geohash-js/tree/master

Carsten Neubert

unread,
Aug 4, 2008, 1:57:27 PM8/4/08
to Google App Engine
Yep, exactly. If you want go down the hashing route, "geohashes"
aren't the only option. Here's a link that I posted earlier:

http://www.ddj.com/184410998

I'd think that one could come up with something better than vanilla
geohashes for this purpose.

On Aug 4, 7:46 pm, Ubaldo Huerta <uba...@gmail.com> wrote:
> Hi Carsten
>
> I've seen what you've described athttp://github.com/davetroy/geohash-js/tree/master

Ubaldo Huerta

unread,
Aug 4, 2008, 2:00:47 PM8/4/08
to Google App Engine
Let me clarify my initial questions while I'm at it, because I think
they're confusing.

I'm not, obviously, trying to find out if a given point (lat,lon)
falls within a given bounding box, described by (lat,lon) of SW corner
and (lat, lon) of NE corner. That can be easily tested by simple
comparison. While I'm trying to find out is, given a large collection
of points (lat,lon), the points that are located within a given
bounding box.

Similarly, given a large collection of bounding boxes, described the
usual way by the 2 corner points, I'd like to query for the bounding
boxes that intersect (have common area) with a given bounding box.

Might be over stating the obvious, but I thought that my initial
question is awkwardly phrased.




On Aug 4, 7:46 pm, Ubaldo Huerta <uba...@gmail.com> wrote:
> Hi Carsten
>
> I've seen what you've described athttp://github.com/davetroy/geohash-js/tree/master

Ubaldo Huerta

unread,
Aug 4, 2008, 2:11:17 PM8/4/08
to Google App Engine
Thanks Carsten, for the article. I'll digest it soon. BTW, looks like
google base developers, which presumably are using bigtable indexing
technology, have already solved this problem. See
http://code.google.com/apis/base/attrs-queries.html#LocDatQuer

I also think that I should clarify my initial questions while I'm at
it, because they're confusing.

I'm not, obviously, trying to find out if a given point (lat,lon)
falls within a given bounding box, described by (lat,lon) of SW corner
and (lat, lon) of NE corner. That can be easily tested by simple
comparison. While I'm trying to find out is, given a large collection
of points (lat,lon), the points that are located within a given
bounding box.

Similarly, given a large collection of bounding boxes, described the
usual way by the 2 corner points, I'd like to query for the bounding
boxes that intersect (have common area) with a given bounding box.
Second problem might be reduced to problem one; I haven't thought
about it thoroughly

Might be over stating the obvious, but I thought that my initial
question is awkwardly phrased.





José Oliver Segura

unread,
Aug 4, 2008, 2:53:08 PM8/4/08
to google-a...@googlegroups.com
On Mon, Aug 4, 2008 at 7:33 PM, Carsten Neubert <carst...@gmail.com> wrote:
>
> The underlying structure of a geohash is, in essence, a quadtree.
> Locations with a similiar "prefix" will be in the same sub-tree.
> Problem is, even two very nearby locations might end up being in
> totally different branches of the tree. So, if you want to make an
> efficient search for nearby locations, you'll have to include up to 4
> different prefixes.

Isn't it necessary to lookup for 9 different prefixes instead of 4?

I mean: if the problem is with "tree limits" (different branches in
the algorithm binary comparison for lat/lon), what is being constructed
by the geohash algorithm is a grid. Given a search point (lat,lon), if
you want to find "nearby" points, the simple "hash(p1) = hash(p2)" is
not enough, since the "equator problem" can prevent you from getting
geographically nearby points that fall in another branch of the quadtree.

But "near" points can be -without further investigation about the
position of the search point inside its quadrant- in any of the 8 bounding
boxes that surrond the search quadrant.

So, if you want to assure that, given a certain zoom (precission)
in the hash, you get all nearby points to a given point, you have to
search the quadrant holding the search point plus the 8 surrounding
quadrants (up,down,left,right...and... up-left, up-right, down-left,
down-right).

Well, at least that is what I assume by reading the description of
geohashes...

Please, correct me if I'm wrong.

Jose

Carsten Neubert

unread,
Aug 4, 2008, 4:25:22 PM8/4/08
to Google App Engine
Necessary? I don't think so. As you said, the problem is getting to
geographically nearby points that fall in another branch of the
quadtree. Given a "worst case" query, you'll need at least 4 different
prefixes (or "cells overlapping the query window") to get sufficiently
deep into the tree. It's a quad tree after all. It might be that you
could further optimize the query by getting more cells, but it's not
"necessary".

Am I overlooking something?

On Aug 4, 8:53 pm, "José Oliver Segura" <primi...@gmail.com> wrote:

José Oliver Segura

unread,
Aug 5, 2008, 4:21:17 AM8/5/08
to google-a...@googlegroups.com
On Mon, Aug 4, 2008 at 10:25 PM, Carsten Neubert <carst...@gmail.com> wrote:
>
> Necessary? I don't think so. As you said, the problem is getting to
> geographically nearby points that fall in another branch of the
> quadtree. Given a "worst case" query, you'll need at least 4 different
> prefixes (or "cells overlapping the query window") to get sufficiently
> deep into the tree. It's a quad tree after all. It might be that you
> could further optimize the query by getting more cells, but it's not
> "necessary".
>
> Am I overlooking something?

Ok, I think I see the difference in our reasonings. First of
all, I think I'm refering to "prefixes" when I should talk about
"quadrants" (actually, up to three four quadrants can fall into the
same prefix). The second point is also important; you are talking
about finding four prefixes ("cells overlapping the query window"), I
didn't take into consideration the "query window", but only "nearby
points" given a geohash (a "starting point", no explicit bounding box,
but the one defined by the geohash and assuming the "limit" is, maybe,
the outer zoom level -that is, a convention of mine, sorry- and that
the "zoom level" is implictely defined by the precission of the hash).

I'm working with a simplified version of geohashes. Basically,
the only difference is using characters [a-d] instead of bits, I
think, which leads to a less compact but much easier to "understand"
(for me) naming of quadrants. For example, for precission=1, you'll
split the space into four cells:

a|b
----
c|d

and with precission=2, again, split the space:

aa|ab|ba|bb
----------------
ac|ad|bc|bd
-----------------
ca|cb|da|db
-----------------
cc|cd|dc|dd

Now, if you want to find points near the hash "da" (without a explicit
query window), you have to look for points in da, but also in db, dc,
dd, bc, bd, ad, cb and cd (that is, 9 quadrants). Of course, talking
about prefixes, that list can be reducet to d, bc, bd, ad, db and cd
(6 prefixes, since 4 of the original 9 quadrants get assumed by the
"d" prefix)

As I told, it's just a matter of "my" decission that "nearby points"
given a geohash are those points that fall into the same hash and also
the 8 neighbours. Of course, if you have a "bounding box" things are
easier, since you can obtain the four hashes for each one of the four
corners and just perform a search for those 4 prefixes (that also map
to 4 quadrants, as long as you select the appropriate precission).

Don't know; maybe this is a little bit more complicated than it should
be (I mean, I'm making it more complicated myself), but it's tricky to
try to take into consideration all the variations that can occur when
you use geohashes and, on the other side, the user can use just
"google maps points" and nothing assures you that the "bounding boxes"
of those maps map exactly to the geohash bounding boxes (for example,
what happens if you want to show all point that are inside a given
area, defined by a Google Map instance in a webpage which occurs not
to be a perfect square? that can break the "four cells overlapping the
query window" assumption, I believe)

As I said... a little bit trick. It could be, of course, much more
easier if the datastore had support for GeoPt comparisons (filter
>,<), but at that point you'll get another problem if, for some
reason, you want to also filter by another property (for example,
rating), as long as the datastore api only allows one inequality per
query...

Best,
Jose

Gareth Watts

unread,
Aug 6, 2008, 1:54:27 PM8/6/08
to google-a...@googlegroups.com
Incidentally, if you want to get a feel for what sort of area different geohashes cover, you might find my little google maps maplet interesting:
http://maplets.omnipotent.net/geohash.php
Reply all
Reply to author
Forward
0 new messages