Ranked similarity search

Luanne Coutinho

unread,

Feb 7, 2012, 3:10:04 AM2/7/12

to ne...@googlegroups.com

Hi,

Does any one have "best practices" for coding a similarity search that ranks results? An example is find all people similar to me, where the criteria which establishes similarity in the order of importance are:

1. Matching interests
2. Age +/- 3 = my age
3. Same city

I don't want an exact match , so my results should typically be ordered by people that have the same interests as me, followed by people in my age group and/or city.
Was thinking of writing a query to return the top level set of people and then filter them manually based on sub-queries or some such. Not sure how good an idea that is :)
Obviously the list of criteria is going to grow to include things like people that shop at the same places that I do etc. etc.

Thanks
Luanne

Luanne Coutinho

unread,

Feb 7, 2012, 4:54:42 AM2/7/12

to ne...@googlegroups.com

The other not-so-good option is to match as much as possible optimistically hoping my results contain at least x nodes. If not, then execute a less strict traversal and make up the results to contain x nodes. The results of this particular recommendation would be cached by person for about 12 hours so I am not too worried about the multiple traversals--but it doesn't seem like such a nice solution to me.

On the other hand, I was just reading about Gremlin...wondering if that is more suitable?

James Thornton

unread,

Feb 7, 2012, 5:12:10 AM2/7/12

to ne...@googlegroups.com

Hi Luanne -

In graph parlance, a "similarity search" is called a collaborative filtering (recommendation) algorithm. Examples include, Amazon's book recommendations or Netflix's movie recommendations.

Gremlin (https://github.com/tinkerpop/gremlin/wiki) excels at these type of queries.

See Marko's blog on building a movie recommendation engine (http://markorodriguez.com/2011/09/22/a-graph-based-movie-recommender-engine/).

- James

Luanne Coutinho

unread,

Feb 7, 2012, 5:24:51 AM2/7/12

to ne...@googlegroups.com

Thanks James, I'll spend some time reading through these.

Marko Rodriguez

unread,

Feb 7, 2012, 11:05:24 AM2/7/12

to ne...@googlegroups.com

Hi,

To do "similarity searches" in Gremlin, you make use of a path expression that contains groupCount. For example:

1. "Which people are most similar to person 1 by reading behavior?"

g.v(1).out('read').in('read').groupCount.cap

2. "Which products are most similar to person 1 based on the collective purchasing behavior of community?" (i.e. collaborative filtering)

g.v(1).out('purchased').in('purchased').out('purchased').groupCount.cap

-- but don't recommend person 1 products they have already purchased!

x = []; g.v(1).out('purchased').aggregate(x).in('purchased').out('purchased').except(x).groupCount.cap

3. "Which people are most similar to person 1 based on their purchasing behavior, food likes, and book they read?"

g.v(1).out('purchased','eats','read').in('purchased','eats','read').groupCount.cap

4. "Which concepts are most similar to concept 1 in a web of concept associations?" (i.e. spreading activation -- similarity through graph resonance)

g.v(1).out('relatedTo').loop(1){it.loops < 4}.groupCount.cap

In short, make up a semantically reasonable path expression and then count how many times you run into each vertex as you evaluate that expression. That is "similarity search!"

------------------------------------------

Lets say your graph is really dense and you want this to run in real-time, you can do stuff like this:

g.v(1).out('purchased').in('purchased').out('purchased')[0..1000].groupCount.cap

By making 1000 a query variable, then you can determine how many clock cylces you want to contribute to the computation. A larger range filter, more time, less accurate results. A smaller range filter, more accurate results and less time.

HTH,

Marko.

http://markorodriguez.com

Luanne Coutinho

unread,

Feb 7, 2012, 10:51:34 PM2/7/12

to ne...@googlegroups.com

Thanks for the help Marko. I’ll be spending some time next week on Gremlin.

Regards

Luanne

Reply all

Reply to author

Forward