Advice Needed: Stuck trying to model some Address/Locations with RavenDb

34 views
Skip to first unread message

Justin A

unread,
May 25, 2012, 12:58:57 AM5/25/12
to rav...@googlegroups.com
Hi folks,

Ok, got a really weird modelling question which I'm not sure how to do.

Goal: *Smart* searching for locations, given a massive list of locations. ie. Autocomplete or SEO friendly url.

Problem: I'm not too sure how to model a Location, given that I'm cautious of having denormalized references because sometimes locations change which has a flow on effect for locations 'underneath'. (A city is under a state. a street is under a city .. which is under a state.. etc).

eg. An old neighbourhood is now split into two. Which means all streets in each one will need to be updated, also. 
         Denormalising would solve the problem, if I didn't need to update.

Details: Previously Ayende showed me how to do some autocomplete with a bunch of locations in a system. 
A location, in -that- scenario was just a single string - the fully formatted name of the place, which came from an existing SQL Server.
eg. smith street, new york, new york.

The data was a simple export from the existing Sql Server.

My problem is: I'm a bit stuck to how I can model this. I'm starting out with a SEARCH Index and working my way up to Location object.


Sample Data
- California (State)
- San Francisco, California (City / State)
- Downtown, San Francisco, California (Neighbourhood / City / State)
- O'Farrell St, Downtown, San Francisco, California (Street / Neighbourhood / City / State)
- Post St, Downtown, San Francisco, California (Street / Neighbourhood / City / State)
- Post St, Tribeca, New York, New York (Street / Neighbourhood / City / State)


and here's some Search Queries

- post st downtown
- post st downtown san fran
- post st san fran
- post st ny
- post st ny ny
- post st tribeca

Ok. So what we currently have was to allow each location to have an alias (or aliases - just a single field, space delimeted) .. which is *inherited* as you go down.
An alias(es) are used ONLY for searching. (refer to search examples, above). *inherited* means, it's used when generating the search query ... it's not actually copied to each location. (that would be ack!!)

If the state of New York has an alias of NY, then this is what the TERMS would be for that location.

location: Post St, Tribeca, New York, New York
search query terms: "post st", "tribeca", "new york", "ny"

so this means if we add a new alias / update an alias to a state or a city, etc.. then -all- locations under it also need to be updated.

It's like we want to have a 'view' of a location which is the sum of all the location 'parts' (eg. neighbourhood name, city name, etc) and the alias(es), which is searchable.

lastly, these parts also generate a unique uri : post-st-tribeca-new-york-new-york (for example).


Ideas:

Idea #1 - Class with Id's.

So at first i was thinking about having a Location look like this :-

Id
StreetId
NeighbourhoodId
CityId
StateId
CountryType (an enum).

and possibly use Include(..) to preload all the location 'segments'.

Pros: No denormalized references means that if i have a street location .. and the neighbourhood changes name, then i do not have to find all streets that are in this neighbourhood and update. (get's messier when I go up the hierarchy).
Cons: have to include(..) for each xxxId to get the full location name.
         Not sure i could make a static index with this format .. cause i need to Include in a static index?

Idea #2 - Denormalised

Id
Street
Neighbourhood
City
State
CountryType

Pros: No more including required. Can easily create a search static index.
Cons: A real PITA if anything above this location, is updated or changed. 

So that's where i'm at :(

and i still stuck on how the static index would look :(

Does anyone have any suggestions about how I should tackle this?

Justin A

unread,
May 25, 2012, 3:15:07 AM5/25/12
to rav...@googlegroups.com
Just another update ... I just got the following to work .. but I'm not sure if this would be a recommended way to solve this problem.

I've used Idea #1 Class with Id's and created a Static Index with a TransformResults 

public class Locations_Details : AbstractIndexCreationTask<Location>
{
    public Locations_Details()
    {
        Map = locations => from location in locations
                            select new
                                    {
                                        location.Id,
                                        location.Name,
                                        location.NeighbourhoodId,
                                        location.CityId,
                                        location.StateId,
                                        location.LocationType,
                                        location.Aliases
                                    };
 
        TransformResults = (database, locations) => from location in locations
                                                    let Neighbourhood =
                                                        database.Load<Location>(location.NeighbourhoodId)
                                                    let City = database.Load<Location>(location.CityId)
                                                    let State = database.Load<Location>(location.StateId)
                                                    select new
                                                            {
                                                                location.Id,
                                                                location.Name,
                                                                Neighbourhood = Neighbourhood.Name,
                                                                City = City.Name,
                                                                State = State.Name,
                                                                location.LocationType,
                                                                Aliases = (location.Aliases + " " + 
                                                                Neighbourhood.Aliases + " " +
                                                                City.Aliases + " " +
                                                                State.Aliases).Trim()
                                                            };
    }
}


and here's a sample test i created...

[Fact]
public void TestingLocations()
{
    var locations = this.DocumentSession
        .Query<LocationLocations_Details>()
        .Where(x => x.Id == "Locations/10")
        .As<LocationDetail>()
        .ToList();
 
    Console.WriteLine(locations.First().Aliases);
 
    Assert.NotNull(locations.First().Name);
    Assert.NotNull(locations.First().Aliases);
    
}

So this means I had to create a NEW class which get the transformed results mapped, to.

So would this be a good way to go about this?

Because i'm doing a Load<Location> for every location 'level' .. I'm guessing that .. if a person changes a level above, then all levels below would be auto-updated? The higher the level u change, the more things need to be updated and the longer the index needs to take to update.

Secondly, I'm assuming that, because this is also using a TransformResults, that each 'new document' that is 'generated' in this index is persisted to the file system (all properties, etc).

Matt Warren

unread,
May 25, 2012, 6:25:00 AM5/25/12
to rav...@googlegroups.com
> Because i'm doing a Load<Location> for every location 'level' .. I'm guessing that .. if a person changes a level above, then all 
> levels below would be auto-updated? The higher the level u change, the more things need to be updated and the longer the index 
> needs to take to update.

No really auto-updated, it'll just fetch the current version from the database (using Load<T>) each time you run the query.

> Secondly, I'm assuming that, because this is also using a TransformResults, that each 'new document' that is 'generated' in this 
> index is persisted to the file system (all properties, etc).

No TransformResults isn't persisted anywhere, it's only built up in-memory at query time, after the lucene query has run. But because calls to the doc store are cached, if you fetch the same neighbourhood/city doc several times it'll still be quick

Itamar Syn-Hershko

unread,
May 25, 2012, 7:04:39 AM5/25/12
to rav...@googlegroups.com
The problem description is somewhat vague. It is also not clear whether this is a small feature you are looking to implement, or this is a main business concern that you are willing to put efforts on.

I'll assume the following (quoted) and move from there, let me know if I missed your point.

Sample Data
- California (State)
- San Francisco, California (City / State)
- Downtown, San Francisco, California (Neighbourhood / City / State)
- O'Farrell St, Downtown, San Francisco, California (Street / Neighbourhood / City / State)
- Post St, Downtown, San Francisco, California (Street / Neighbourhood / City / State)
- Post St, Tribeca, New York, New York (Street / Neighbourhood / City / State)


and here's some Search Queries

- post st downtown
- post st downtown san fran
- post st san fran
- post st ny
- post st ny ny
- post st tribeca

Basically, textual location / address search is a well known problem, and quite frankly a non-trivial one to solve. To get this done right, you need to maintain a large list of synonyms and abbreviations, and obviously you'd want to be able to auto-correct spelling mistakes which are quite common for street names etc. To make this most efficient, this should be done in the index level - storing all synonyms for each word on the same token position - meaning doing some low-level Lucene stuff.

However, luckily for you this is a solved problem. Google Maps does this quite nicely, and it has an external API you can rely on.

Take a look at this excellent discussion: http://www.ibm.com/developerworks/opensource/library/j-spatial/

If translating a Location to a lat-lon coordinates using the Google Maps API, and then using it again while parsing a query makes sense in your case, I'd go with it. Calling an external server may be a bit pricey, but the results should be worth it, plus you can always store queries and lat-lon results and lookup on them before querying that external API.

This way you can also use proximity searches using the spatial module, which is always nice to have
Reply all
Reply to author
Forward
0 new messages