Hi,
To do "similarity searches" in Gremlin, you make use of a path expression that contains groupCount. For example:
1. "Which people are most similar to person 1 by reading behavior?"
g.v(1).out('read').in('read').groupCount.cap
2. "Which products are most similar to person 1 based on the collective purchasing behavior of the community?" (i.e. collaborative filtering)
g.v(1).out('purchased').in('purchased').out('purchased').groupCount.cap
-- but don't recommend person 1 products they have already purchased!
x = []; g.v(1).out('purchased').aggregate(x).in('purchased').out('purchased').except(x).groupCount.cap
3. "Which people are most similar to person 1 based on their purchasing behavior, food likes, and the books they read?"
g.v(1).out('purchased','eats','read').in('purchased','eats','read').groupCount.cap
4. "Which concepts are most similar to concept 1 in a web of concept associations?" (i.e. spreading activation -- similarity through graph resonance)
g.v(1).out('relatedTo').loop(1){it.loops < 4}.groupCount.cap
In short, make up a semantically reasonable path expression and then count how many times you run into each vertex as you evaluate that expression. That is "similarity search!"
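If it helps to see the counting spelled out, here is a rough plain-Python sketch of example 2 (with the "already purchased" exclusion). It is not Gremlin and not how Gremlin is implemented; the toy edge list and the names step and similar_products are made up for illustration. Each hop keeps duplicates, which is what makes the final count meaningful.

```python
from collections import Counter

# Hypothetical toy graph as (source, label, target) edges.
EDGES = [
    (1, "purchased", "a"), (1, "purchased", "b"),
    (2, "purchased", "a"), (2, "purchased", "c"),
    (3, "purchased", "a"), (3, "purchased", "b"),
    (3, "purchased", "c"), (3, "purchased", "d"),
]

def step(vertices, direction, label):
    """Follow every matching edge from every traverser; duplicates are kept."""
    result = []
    for v in vertices:
        for src, lbl, dst in EDGES:
            if lbl != label:
                continue
            if direction == "out" and src == v:
                result.append(dst)
            elif direction == "in" and dst == v:
                result.append(src)
    return result

def similar_products(person):
    owned = step([person], "out", "purchased")      # plays the role of aggregate(x)
    buyers = step(owned, "in", "purchased")         # everyone who bought those products
    candidates = step(buyers, "out", "purchased")   # everything those buyers bought
    # Drop what the person already owns, then count arrivals per product.
    return Counter(p for p in candidates if p not in owned)

ranking = similar_products(1)
```

On this toy data, ranking.most_common() puts product "c" ahead of "d", because more paths from person 1 end at "c".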
------------------------------------------
Let's say your graph is really dense and you want this to run in real-time. You can cap the number of paths that get counted with a range filter, like this:
g.v(1).out('purchased').in('purchased').out('purchased')[0..1000].groupCount.cap
By making 1000 a query variable, you can decide how many clock cycles you want to contribute to the computation. A larger range filter means more time and more accurate results. A smaller range filter means less time and less accurate results.
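Here is the same trade-off as a small plain-Python sketch. The stream of product ids and the name bounded_group_count are made up for illustration, and I am assuming the [0..high] range is inclusive on both ends, so it lets high + 1 objects through.

```python
from collections import Counter
from itertools import islice

def bounded_group_count(traversers, high):
    """Count only the first high + 1 traversers (assumed inclusive [0..high] range)."""
    return Counter(islice(traversers, high + 1))

# Hypothetical stream of product ids, in the order a traversal would emit them.
stream = ["c", "d", "c", "e", "c", "d", "c", "c"]

small = bounded_group_count(iter(stream), 2)     # looks at only 3 traversers
full = bounded_group_count(iter(stream), 1000)   # budget exceeds the stream length
```

The small budget does less work but never sees "e" at all, while the large one gives the complete counts.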
HTH,
Marko.