Optimizing geographical search


Dvir Volk

Apr 11, 2012, 7:14:15 AM
to redi...@googlegroups.com
Hi All,
As some of you remember, I have a little geo location library using redis and geohashing that currently mostly tells you where you are based on IP or lat,lon, with resolutions of continent, country, state, city, zipcode, etc.

I'm in the process of adding textual radius search, I've hit a bit of a performance wall, and I would appreciate a fresh perspective.

My use case is a query that, given all the businesses in a city (indexed by name or tags), can tell me where I can get pizza within a 1km radius.
This is how I've modeled it:

1. I have a sorted set for each tag/name I want to index (e.g. "pizza"), with nodes and their prior score, e.g. { "luiggi's pizza": 0.5, "mario's pizza": 0.35, ... }

2. I have divided the objects' geohash indexes into "cells" of several resolutions, which are geohash prefixes (the fewer bits in a geohash, the larger the area it covers). So for this example, let's say I divide businesses into ~1x1km cells and ~4x4km cells.
Each such cell is a SET.

3. When a user searches for pizza within a given radius, I select the right geohash resolution to search and calculate the list of cells I want to query (the user's cell, plus some adjacent cells based on the radius).

4. Then I do a SUNIONSTORE of the businesses in those cells, and a ZINTERSTORE of the result with the textual index for "pizza". Since ZINTERSTORE is affected mainly by the shorter set, I'm not worried about indexing keywords nationwide.

5. After loading the result, I do a more exact radius trim of the returned nodes and recompute their scores based on proximity. (A rough sketch of steps 3-5 follows below.)
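
A rough sketch of steps 3-5 as redis-py calls (a sketch only; the key names and the cell-list input are made up for illustration):

import redis

r = redis.StrictRedis()

def radius_search(tag, cell_keys):
    # cell_keys: the SET keys of the geohash cells covering the radius (step 3)
    r.sunionstore("tmp:union", cell_keys)                     # step 4: all businesses in the area
    r.zinterstore("tmp:result", ["tmp:union", "tag:" + tag])  # step 4: keep only those tagged e.g. "pizza"
    ids = r.zrange("tmp:result", 0, -1, withscores=True)
    # step 5: exact radius trimming + proximity re-scoring happens client side
    return ids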

Now, this works well, but as you might have guessed, searching a very crowded city center means I have thousands of businesses in every cell, and the SUNION takes a long time (5-7ms on my machine, for about five 1x1km cells).
I've also tried using a single sorted set and doing range queries on it, but it's not much better - since there isn't one range that contains all the adjacent cells, you need to do several queries.

Currently I'm thinking of splitting the geo indexes by category, so I won't have so many nodes in each cell.

But as I said, a fresh perspective would be appreciated!

Thanks,







Salvatore Sanfilippo

Apr 11, 2012, 7:31:21 AM
to redi...@googlegroups.com
Hi Dvir,

this is my first instinct:

Use cells that are keyword-based, so for instance you have a different
square-kilometer key for pizza, bike, ..., whatever. Of course you
create the key only if there is at least one shop, and you delete it
when you remove the last shop in the square.

I said 1 km, but it's up to you; you could even start with 1km cells
and, if there are more than N shops inside one, set a sentinel key and
subdivide further. That's the correct way to do it algorithmically,
but my instinct would be to go with fixed-size squares.

Then if the user searches "pizza" you simply fetch and refine by doing
the math client-side.

If the user typed "pizza foo bar" you fetch all three and perform an
intersection client-side.

And so forth.
My feeling is that's going to work well.

P.S. In your squares just put the IDs of the shops, so that fetching /
intersecting will be very fast, even if this forces you to fetch the
shop objects later to render them.
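
A minimal sketch of the idea (the key names and the cell helper below are invented for illustration, not from any existing library):

import redis

r = redis.StrictRedis()

def cell_id(lat, lon, size_km=1.0):
    # hypothetical helper: bucket a coordinate into a rough ~1x1km grid cell
    # (111 km per degree of latitude; good enough for a sketch)
    return "%d:%d" % (int(lat * 111 / size_km), int(lon * 111 / size_km))

# indexing: one SET per (keyword, cell); the key exists only while it has members
r.sadd("cell:pizza:" + cell_id(32.0853, 34.7818), "shop:42")

# query: the user's cell (in practice you'd also SUNION the adjacent cells)
shop_ids = r.smembers("cell:pizza:" + cell_id(32.0853, 34.7818))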

Salvatore


--
Salvatore 'antirez' Sanfilippo
open source developer - VMware

http://invece.org
"We are what we repeatedly do. Excellence, therefore, is not an act,
but a habit." -- Aristotele

Salvatore Sanfilippo

Apr 11, 2012, 7:32:05 AM
to redi...@googlegroups.com
P.S. I think this should also be seen in the context that users often
type a single term in this kind of search box.

Dvir Volk

Apr 11, 2012, 8:09:15 AM
to redi...@googlegroups.com
Thanks Salvatore.

I thought of that approach; my instinct was that it's good for popular tags like pizza, and wasteful for sparse tags (the amount of data for the US is huge anyway). Also, since what I want is more like an inverted index, where I index each word of the business name, it can create tons of duplicate keys for the same business, as I'm effectively precaching my queries. But I'll see how much of an impact it makes.

It also won't allow me to just do a radius query without text, unless I duplicate some of the data.

Salvatore Sanfilippo

Apr 11, 2012, 8:20:32 AM
to redi...@googlegroups.com
On Wed, Apr 11, 2012 at 2:09 PM, Dvir Volk <dv...@everything.me> wrote:

> I thought of that approach, my instinct was that it's good for popular tags
> like Pizza, and wasteful for sparse tags

That's true. However, if you can afford not to store the lat/long of
individual shops in your "squares", but just their IDs, then with
suitably tuned "intset" parameters you can store that info inside Sets
very efficiently.

This would require a fetch for the additional on-client filtering
stage, to avoid returning something that is 1.3 km away.
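
For example (a sketch; the key name is invented, and set-max-intset-entries is the stock redis.conf parameter that controls how long the compact encoding is kept):

import redis

r = redis.StrictRedis()
r.sadd("cell:1km:dr5ru", 1001, 1002, 1003)         # all-integer members -> compact intset encoding
print(r.object("encoding", "cell:1km:dr5ru"))      # -> 'intset' while the set is small and all-integer
# the relevant redis.conf setting is set-max-intset-entries (512 by default)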

Cheers,
Salvatore

Dvir Volk

Apr 11, 2012, 8:25:18 AM
to redi...@googlegroups.com
On Wed, Apr 11, 2012 at 3:20 PM, Salvatore Sanfilippo <ant...@gmail.com> wrote:
> On Wed, Apr 11, 2012 at 2:09 PM, Dvir Volk <dv...@everything.me> wrote:
>
>> I thought of that approach, my instinct was that it's good for popular tags
>> like pizza, and wasteful for sparse tags
>
> That's true. However, if you can afford not to store the lat/long of
> individual shops in your "squares", but just their IDs, then with
> suitably tuned "intset" parameters you can store that info inside Sets
> very efficiently.

Oh, I forgot to mention that's what I do. The actual lat,lons are saved in separate HASHes with the rest of the object's data, which makes the post-filtering slower, but saves a lot of space.

One optimization to that would be to make the cells ZSETs where the full-resolution geohash is the score, so I can do the filtering by converting the scores to lat,lon without doing HMGET on the actual objects first. But as I said, the most costly thing right now is the SUNIONSTORE.

 

Josiah Carlson

Apr 12, 2012, 9:04:29 PM
to redi...@googlegroups.com
I know that it is convenient to perform the union of the cells and then
execute the intersection... But as far as I can tell, because the
intersection will necessarily filter out locations, the only gain from
the early union is that you don't need to execute multiple
intersection calls. That isn't really a gain, as the union is
essentially just copying around data that will be discarded before the
final result.

If you performed the intersections, then directly filtered the
resulting zsets, you would end up with the same result, without paying
for the costly union up front. If you then find that filtering in
Redis is expensive, you could union the intersection results, and
return that directly to the client (with geohash scores), who could
then perform the distance filtering.
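
In redis-py terms, something along these lines (a sketch; key names are invented):

import redis

r = redis.StrictRedis()

cells = ["cell:1km:dr5ru", "cell:1km:dr5rv", "cell:1km:dr5rt"]   # cells covering the radius
matches = {}
for i, cell in enumerate(cells):
    tmp = "tmp:inter:%d" % i
    r.zinterstore(tmp, [cell, "tag:pizza"])      # small per-cell result, no big union up front
    matches.update(dict(r.zrange(tmp, 0, -1, withscores=True)))
    r.delete(tmp)
# distance filtering on `matches` then happens client side (pipeline the above in practice)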

For filtering based on distance, because you are using such small grid
regions (1x1km or 4x4km), you can assume a flat world and scale
the east/west dimension for simplicity. This would reduce your
distance calculations from N standard geo distance calculations to 1
geo distance calculation (for the scaling factor) followed by an
addition and 3 multiplications (not necessarily in that order) for
each of your N points.

If you've tried these things before and they didn't work out, please
let me know :)

Regards,
- Josiah

Dvir Volk

Apr 13, 2012, 1:56:00 PM
to redi...@googlegroups.com
On Fri, Apr 13, 2012 at 4:04 AM, Josiah Carlson <josiah....@gmail.com> wrote:
> I know that it is convenient to perform the union of the cells and then
> execute the intersection... But as far as I can tell, because the
> intersection will necessarily filter out locations, the only gain from
> the early union is that you don't need to execute multiple
> intersection calls. That isn't really a gain, as the union is
> essentially just copying around data that will be discarded before the
> final result.

That's a very good point. It crossed my mind, but I instinctively had the knee-jerk reaction of thinking "oh, this will require 5-9 queries instead of 2, of course it will suck" and dismissed it. I'll give it a try; it sounds like it will improve things.
 


> For filtering based on distance, because you are using such small grid
> regions (1x1km or 4x4km), you can assume a flat world and scale
> the east/west dimension for simplicity. This would reduce your
> distance calculations from N standard geo distance calculations to 1
> geo distance calculation (for the scaling factor) followed by an
> addition and 3 multiplications (not necessarily in that order) for
> each of your N points.

I didn't quite get what you mean by "scaling" them; could you elaborate?
But this part is really cheap compared to the unions, so I'm not too worried.

Josiah Carlson

Apr 13, 2012, 6:02:28 PM
to redi...@googlegroups.com
On Fri, Apr 13, 2012 at 10:56 AM, Dvir Volk <dv...@everything.me> wrote:
> On Fri, Apr 13, 2012 at 4:04 AM, Josiah Carlson <josiah....@gmail.com>
> wrote:
>>
>> I know that it is convenient to perform the union of the cells and then
>> execute the intersection... But as far as I can tell, because the
>> intersection will necessarily filter out locations, the only gain from
>> the early union is that you don't need to execute multiple
>> intersection calls. That isn't really a gain, as the union is
>> essentially just copying around data that will be discarded before the
>> final result.
>
> That's a very good point. It crossed my mind, but I instinctively had the
> knee-jerk reaction of thinking "oh, this will require 5-9 queries instead
> of 2, of course it will suck" and dismissed it. I'll give it a try; it
> sounds like it will improve things.
>
>> For filtering based on distance, because you are using such small grid
>> regions (1x1km or 4x4km), you can assume a flat world and scale
>> the east/west dimension for simplicity. This would reduce your
>> distance calculations from N standard geo distance calculations to 1
>> geo distance calculation (for the scaling factor) followed by an
>> addition and 3 multiplications (not necessarily in that order) for
>> each of your N points.
>
> I didn't quite get what you mean by "scaling" them; could you elaborate?
> But this part is really cheap compared to the unions, so I'm not too worried.

Say that you have a geo_dist() function that takes lat/lon pairs and
returns the distance in km.

center = (34.7228, -118.17)  # the center of your search
km_per_degree = geo_dist((0, 0), (0, 1))
# east/west shrink factor at this latitude (roughly cos(latitude))
scale = geo_dist(center, (center[0], center[1] + 1)) / km_per_degree
scale2 = km_per_degree * km_per_degree
cutoff_squared = max_distance * max_distance

close_enough = []
for id, pt in matches:                       # pt is a (lat, lon) pair
    dla = pt[0] - center[0]                  # delta latitude, in degrees
    dlo = scale * (pt[1] - center[1])        # delta longitude, scaled to "latitude-sized" degrees
    dist = (dla * dla + dlo * dlo) * scale2  # squared distance, in km^2
    if dist < cutoff_squared:
        close_enough.append((id, pt))
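
(For completeness: the assumed geo_dist() above could be a plain haversine, for example:)

import math

def geo_dist(a, b):
    # great-circle distance in km between two (lat, lon) pairs given in degrees
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))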

Regards,

Dvir Volk

Apr 14, 2012, 8:42:03 AM
to redi...@googlegroups.com
Thanks Josiah. 
I've tried your first advice and it reduced the time spent on the intersection in redis from 5-7ms to 0.4-0.5ms :)

As for the other part - nice trick, but this doesn't seem like much of a problem currently. The bigger problem is that I have to actually load the lat,lon pairs of each matching object before filtering, which takes longer than the actual calculations (by then we are usually left with a handful of nodes anyway). This can be optimized by making the geo cells sorted sets, with the exact geohash as the score; then I won't need to load anything to filter, just transform the scores to lat,lons.

Josiah Carlson

Apr 15, 2012, 3:14:24 PM
to redi...@googlegroups.com
On Sat, Apr 14, 2012 at 5:42 AM, Dvir Volk <dv...@everything.me> wrote:
> Thanks Josiah.
> I've tried your first advice and it reduced the time spent on the
> intersection in redis from 5-7ms to 0.4-0.5ms :)

I'd been hoping for a factor of 2x, but better than 10x is awesome :D

> As for the other part - nice trick, but this doesn't seem like much of a
> problem currently. The bigger problem is that I have to actually load
> the lat,lon pairs of each matching object before filtering, which takes
> longer than the actual calculations (by then we are usually left with a
> handful of nodes anyway). This can be optimized by making the geo cells
> sorted sets, with the exact geohash as the score; then I won't need to load
> anything to filter, just transform the scores to lat,lons.

It's always about the lowest-hanging fruit. Even when you get rid of the
lat/lon fetch by using the geohash score and then translate your data,
you will still need to compute the distances. If you can do your
distance calculations in 1/10th the time it takes for the other parts,
then I'd say the low-hanging fruit is gone. But this change should
give you close to a 4x improvement in lat/lon distance calculation
performance, with generally better than a .05% difference in the
distance calculated.

Regards,
- Josiah

gradetwo

Apr 19, 2012, 5:52:20 AM
to redi...@googlegroups.com
1. Encode lat,lng as the score: interleave the bits of lat and lng with each other:

def geo_encode(latitude, longitude):
    latitude = float(latitude)
    longitude = float(longitude)
    # sanity check
    ...

    lat_s = '{0:0>32b}'.format(int(latitude * 100000))
    lng_s = '{0:0>32b}'.format(int(longitude * 100000))
    return int("".join("".join(x) for x in zip(lat_s, lng_s)), 2)


ZADD key geocode shop_id:shop_type    # score = geocode, member = "shop_id:shop_type"

2. Search:
From the radius, quickly compute lat_min, lat_max, lng_min, lng_max;
(lat_min, lng_min) and (lat_max, lng_max) give a bounding box.

geocode_min = geo_encode(lat_min, lng_min)
geocode_max = geo_encode(lat_max, lng_max)

result = ZRANGEBYSCORE key geocode_min geocode_max WITHSCORES

for value_v, geo_v in result:
    # filter by shop type here
    shop_id, shop_type = value_v.split(':')
    if shop_type != 'xxxxx':
        continue
    lat_v, lng_v = geo_decode(geo_v)
    # ignore shops outside the box
    if (lng_v < lng_min or lng_v > lng_max or lat_v < lat_min or lat_v > lat_max):
        continue
    # use shop_id here, or filter by real distance
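
(geo_decode isn't shown above; a minimal sketch matching the geo_encode, and assuming non-negative coordinates as the encode above does, would be:)

def geo_decode(geocode):
    bits = '{0:0>64b}'.format(int(geocode))
    lat_s, lng_s = bits[0::2], bits[1::2]   # de-interleave: even bits are latitude, odd bits are longitude
    return int(lat_s, 2) / 100000.0, int(lng_s, 2) / 100000.0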
                                                

Dvir Volk

Apr 19, 2012, 6:05:44 AM
to redi...@googlegroups.com
Thanks, that was the initial approach I took, but the number of nodes outside the box was so big that it was the slowest of the many approaches I've tried.

The thing is that the points between min_geocode and max_geocode include not only points outside the radius, but a _lot_ of points outside the box, so this approach was not fast enough.

                                                


Josiah Carlson

Apr 19, 2012, 12:03:53 PM
to redi...@googlegroups.com
Sadly, to do geohashing right, you may need to query a variety of
ranges depending on your box (the simple range that gradetwo mentioned
is not actually the way to use geohashing). That's one of the reasons
why I'm a big fan of just pre-sharding into areas (like your 1x1 km
and 4x4 km boxes). Not only are all of the calculations involving
"which boxes do I use" much easier, but when you need to explain
to others what you did, there isn't any magic of a recursive
z-order traversal of the 2-d space.

Regards,
- Josiah

gradetwo

Apr 19, 2012, 10:19:02 PM
to redi...@googlegroups.com


On Thursday, April 19, 2012 6:05:44 PM UTC+8, Dvir Volk wrote:
> Thanks, that was the initial approach I took, but the number of nodes outside the box was so big that it was the slowest of the many approaches I've tried.
>
> The thing is that the points between min_geocode and max_geocode include not only points outside the radius, but a _lot_ of points outside the box, so this approach was not fast enough.
 
Two reasons to use this bit-interleaved geocode:
1. There are usually just a few points outside the box, especially for spatial data
2. It's an integer, so it can be a zset's score

Disadvantages (ZRANGEBYSCORE time complexity is O(log(N)+M)):
1. It's only for spatial data
2. If you have hundreds of millions of entries (high N), or hundreds of shop types (low hit rate), it costs too much
3. If you use a large radius (high M), it costs too much

Plan B: pre-filter
If you have a lot of shop types, pre-shard by shop_type:
ZADD key:shop_type geo_code shop_id

Optimizing further:
1. write Lua scripts
2. deploy slaves as needed, and query the slaves

BTW, I have a geo zset with approximately 9 million entries and it works pretty well.

Dvir Volk

Apr 20, 2012, 7:56:00 AM
to redi...@googlegroups.com
For my SLA it was too slow; maybe yours is more relaxed. But the solution I ended up with works pretty well so far.


Salvatore Sanfilippo

Apr 20, 2012, 7:58:37 AM
to redi...@googlegroups.com
On Fri, Apr 20, 2012 at 1:56 PM, Dvir Volk <dv...@everything.me> wrote:
> For my SLA it was too slow, maybe yours is more relaxed. but the solution I
> ended up with works pretty well so far.

Hi Dvir, for the people who will read this interesting thread in the
coming months, could you please summarize what you ended up using? Thanks!

Dvir Volk

Apr 20, 2012, 8:31:20 AM
to redi...@googlegroups.com
Sure. It may evolve some more and be further optimized down the road, and I'll push it to github soon, too, but to sum up the solution:

1. Every node, of course, has a geohash of its lat,lon.

2. I've mapped geohash bit lengths onto very rough estimates of box size, meaning that all the nodes whose geohashes are stripped to the N most significant bits share the same prefix if they are within the same "geo cell".

I've used several resolutions you can choose from: 1km, 4km, 16km, 64km, and 128km. You can select the indexing resolutions for each object, depending on how you're going to use them (in my library, all indexed objects inherit from a main "location" object, so I use different resolutions for cities and for businesses, for example).

3. Each such cell is a ZSET of the ids of all the objects that are in that cell, each with the complete 64-bit geohash as the ZSET score. This can be optimized down the road by splitting it further by "category" (shops, restaurants, etc.), or, as Salvatore suggested, if you have few tags, by creating a separate geo cell for each tag. But this is not my case.

4. On top of that, you have indexes that map node names (or, in one use case, an inverted index for businesses) to their ids. Each tag, word, or name is a sorted set containing ids.

5. Querying "pizza" within a radius of N km from a point involves (a rough sketch follows at the end of this message):
  • Selecting the right resolution based on the radius - note that currently for simplicity I do not allow a radius greater than the max resolution.
  • Getting all the cells for that resolution that fall within this radius - by "drawing" a sample circle of a few lat,lon pairs around our center and converting the points to the correct geo cells. Neither of these steps requires any redis queries yet. This approach also solves the geohash discontinuity problem around lat or lon 0 degrees.
  • Once you have the cells you want, you intersect each cell's nodes with the textual index, as Josiah suggested. Since ZINTERSTORE depends on the length of the smaller set and the intersection size, crossing with all the pizza places around the globe should not be a problem.
  • This is done with ZINTERSTORE <dest> 2 <geo_key> <text_key> WEIGHTS 1 0 AGGREGATE SUM - the resulting score of each element in the intersection is its full-length geohash.
  • Then we just fetch the resulting intersections (no need to union them on the server) with their ids and scores, convert each score to lat,lon, and filter out the nodes outside the exact radius.
  • Finally, we take the remaining ids, load the actual objects they represent, and we can then sort them by whatever weight function we want to give them.
Querying just by radius involves a big union of all the nodes around me. That makes sense if your data sets are of reasonable size.

You can of course create other keys to intersect against; my library supports that.
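
A rough sketch of the query side of step 5, to make the flow concrete (key names and the three helpers are placeholders for the library's own geohashing code, not real APIs):

import redis

r = redis.StrictRedis()

def search(word, lat, lon, radius_km):
    results = {}
    for cell in cells_in_radius(lat, lon, radius_km):   # placeholder helper, computed client side
        tmp = "tmp:" + cell
        # the weights make each match's score its full-resolution geohash
        r.zinterstore(tmp, {"geocell:" + cell: 1, "text:" + word: 0}, aggregate="SUM")
        for node_id, geohash in r.zrange(tmp, 0, -1, withscores=True):
            node_lat, node_lon = geohash_to_latlon(int(geohash))        # placeholder decoder
            if distance_km((lat, lon), (node_lat, node_lon)) <= radius_km:  # placeholder distance
                results[node_id] = (node_lat, node_lon)
        r.delete(tmp)
    return results   # finally, load the objects for these ids and rank them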





Reflejo

May 8, 2012, 5:52:08 PM
to redi...@googlegroups.com
Hello Dvir,

First of all, thank you for sharing your thoughts. It was very helpful. I was wondering if you have already published your library. I have some doubts about the fifth step and I would love to see how the ZINTERSTORE is done.

Thank you!

Dvir Volk

May 9, 2012, 3:24:58 AM
to redi...@googlegroups.com
Not yet, but I can do it in a couple of days if you have patience; or I can send you the related code, or tell you more about whatever you specifically have doubts about.


Bohu TANG

May 9, 2012, 4:00:10 AM
to redi...@googlegroups.com
Which branch are you using? unstable or v1.8.1?
--
BohuTANG

"The great artist is the simplifier."--- Vincent Van Gogh

Dvir Volk

May 9, 2012, 4:27:13 AM
to redi...@googlegroups.com
Branch of what? geodis has no branches on github.

Reflejo

May 9, 2012, 11:52:52 PM
to redi...@googlegroups.com
Hello Dvir,

Let me see if I get this right: if you need to look up neighbors in M categories (pizza, etc.) you need to do M * 9 queries, right? 9 = your cell plus the cells around it.

I have a similar but somewhat different situation: I'm looking for people who have different categories, but in my case I need to look for at least 3 categories of people.

I can think of three different solutions to my problem:

1) Use the mechanism that you described, which is my current implementation, but then I'm doing too many queries. [Queries: #cat * 8 + intersect] Complexity: (O(N*K) + O(M*log(M))) * #cat * 8
2) I could use a zset with geohashes as members and my "categoryid" as the score, get the 8 lists and filter out the categories on the client. [Queries: 8] Complexity: O(1) server-side, O(M) client-side
3) I could use a zset with the 64-bit geohash as the score, query it with ZRANGEBYSCORE, and filter out the categories on the client. [Queries: 1] Complexity: O(log(N)+M) server-side, O(M) client-side

Any bets on which one will be the fastest? The second and third seem to be faster but they have bandwidth overhead.

Cheers, 
M



Dvir Volk

May 10, 2012, 5:11:34 AM
to redi...@googlegroups.com
I have two cases. In city search, you just cross one index with the geo boxes, so it's always fast. In business name search, it works sort of like an inverted index, to allow partial matches (e.g. if there's a business named "joe's famous pizza", I still want to return it if you're looking for "joe's pizza").

In "inverted index" mode, if you have K words, you first need to intersect the K words' non-geo indexes, which is O(N*K) + O(log(M)+M), where N is the smallest set and M is the size of the result. That's heavy if you have many popular words. Then you cross the resulting set with the geohash cells, just as you would a single word, and usually the result is pretty small, so that's fast. It doesn't change the client code, and you can also cache the textual intersection to deal with popular n-grams. Keep in mind that I usually cross about 1-3 words, no more than that, usually just one popular one, so it's not that bad.

If I were working with "categories", and I'm guessing you have relatively few categories that don't overlap much, I'd just create geohash indices for each category and union the relevant boxes.
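
A minimal sketch of that (invented key names; one geo index per category, then a union of the relevant boxes):

import redis

r = redis.StrictRedis()

cells = ["dr5ru", "dr5rv"]                     # the boxes covering the search radius
keys = ["geo:%s:%s" % (cat, c) for cat in ("pizza", "cafe") for c in cells]
r.zunionstore("tmp:nearby", keys)
nearby = r.zrange("tmp:nearby", 0, -1, withscores=True)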

I didn't really get what you meant in solution 2. 

Solution 3 sounds like it will produce TONS of out-of-area hits that you will need to filter, making it very slow.
