Recommendation engine

213 views
Skip to first unread message

Josh Adell

unread,
Feb 10, 2012, 12:09:40 AM2/10/12
to Neo4j
Hey all,

I'm trying to build a simple recommendation engine. What I need are
two separate (but perhaps related?) pieces of data. I'm having a tough
time expressing the query in either Cypher or Gremlin, so maybe
someone can help?

Question 1 - Estimated rating of an unrated item: "For all the users
who tend to rate things similarly to me and have already rated some
item that I have not rated, what is their average rating of that
item?"
The thing I'm having trouble with is the "similarly to me" part.
Ratings go from 0 to 10. If a user's ratings on average are within 2
points of my ratings for the same items, that is considered similar,
and they should be included when averaging the ratings of items I have
not yet rated.

Question 2 - Recommendation of unrated items: "What items have I not
rated that have on average been rated 7 or higher by users who tend to
rate things similarly to me."
This is essentially the same problem as the first, except rather than
asking about a specific item, I want to get back all items that have
been highly rated.

In both cases, it seems like the first thing to do is gather a list of
users who rate things similarly to me. Then in the first case, filter
out any who have not rated the item I'm interested in, and average the
ratings of the remaining users. In the second case, average the
ratings of those users for all items I have not yet rated that they
have rated and filter out any that are less than 7.

I started with Marko's recommendation engine blog post (http://
markorodriguez.com/2011/09/22/a-graph-based-movie-recommender-engine/)
but I'm not sure how to tweak it for what I want. Any help would be
appreciated.

Thanks!

-- Josh Adell

Luanne Coutinho

unread,
Feb 10, 2012, 12:17:44 AM2/10/12
to ne...@googlegroups.com
Hi Josh,

I finished working on a use case very similar to yours just a day ago, but using Cypher, not Gremlin; perhaps I can be of some assistance there. How is your graph structured?

Regards
Luanne

豁暴~Louis

unread,
Feb 10, 2012, 1:39:31 AM2/10/12
to Neo4j
According to my experience , the cypher engine is not fast enough, I
adopt gremlin, much more fast than cyher

Josh Adell

unread,
Feb 10, 2012, 8:23:16 AM2/10/12
to Neo4j
Hi Luanne,

Cypher or Gremlin doesn't matter as long as I get the right
results :-)

My graph is very simple:

(user) -[:RATED]->(item)

The same (user) will have either 0 or 1 [:RATED] relationships to any
given item. Multiple users can rate the same item. So the graph like
this:

(user1) -[:RATED 3]->(itemA)
(user2) -[:RATED 2]->(itemA)
(user3) -[:RATED 7]->(itemA)

(user1) -[:RATED 8]->(itemB)
(user2) -[:RATED 7]->(itemB)
(user3) -[:RATED 4]->(itemB)

(user2) -[:RATED 5]->(itemC)
(user3) -[:RATED 9]->(itemC)

(user2) -[:RATED 8]->(itemD)
(user3) -[:RATED 4]->(itemD)

So in this example, user1 and user2 tend to rate things similarly,
while user3 does not. So when answering question 1 (what is the
estimated rating for user1 of itemC) user2's ratings should be
included, but user3's should not. For question 2 (recommend items that
are highly rated) we should include itemD but not itemC because even
though user3 rated it highly, their ratings are not as similar to
user1's as user2's ratings.

Does that make sense?

-- Josh

On Feb 10, 12:17 am, Luanne Coutinho <luanne.couti...@gmail.com>
wrote:

Marko Rodriguez

unread,
Feb 10, 2012, 1:15:05 PM2/10/12
to ne...@googlegroups.com
Hello,

>> In both cases, it seems like the first thing to do is gather a list of
>> users who rate things similarly to me. Then in the first case, filter
>> out any who have not rated the item I'm interested in, and average the
>> ratings of the remaining users. In the second case, average the
>> ratings of those users for all items I have not yet rated that they
>> have rated and filter out any that are less than 7.

First, so I know we are on the same page. That is all people that have rated the same things as person 1.

g.v(1).out('rated').in('rated')

The traversal below returns all the people that have rated things "similar" to person 1. Similar is defined as the absolute value of vertex 1's rating minus the rating of the person in question.

g.v(1).outE('rated').sideEffect{ w = it.stars }.inV.inE('rated').filter{ Math.abs(it.stars - w) <= 1 }.outV

You can then average their ratings for all items using mean():

g.v(1).outE('rated').sideEffect{ w = it.stars }.inV.inE('rated').filter{ Math.abs(it.stars - w) <= 1 }.stars.mean()

...hopefully that is enough to get you going on your particulars. Here are things to realize:

1. You will need to make use of the outE.inV pattern as you are reasoning beyond an edge's label and into their properties (e.g. the stars on an edge).
2. Make use of sideEffects to store intermediate values (e.g. person 1's stars)
3. With filter you can apply any function you want that returns boolean -- use the Java API as you see fit.
4. If you want to make this as blazing fast as you want it to be, look into strategically place range filters to sacrifice accuracy for time.

Good luck!,
Marko.

http://markorodriguez.com

Xia @ Internet

unread,
Feb 10, 2012, 1:25:27 PM2/10/12
to ne...@googlegroups.com
This is an awesome discussion.

Hi Marko and others, I know I can use ExecutionEngine in Java to run cypher queries. What do I use to run Gremlin queries in Java? Any tutorial/documentaion to suggest?

Many thanks,

Marko Rodriguez

unread,
Feb 10, 2012, 1:35:49 PM2/10/12
to ne...@googlegroups.com
Hi,

Gremlin comes in 3 languages. Gremlin-Java, Gremlin-Groovy, and Gremlin-Scala. Most people are familiar with Gremlin-Groovy as that is the original implementation and the one deployed by Neo4j. As a sidenote, TinkerPop plans to support Gremlin-JavaScript and Gremlin-Jython in the relatively near future.

In particular to your question about Gremlin and native Java, you can see this page:

Basically, it looks like Gremlin-Groovy, except that you have to wrap your "start" as a new GremlinPipeline() and your anonymous/closure functions are defined as anonymous classes of PipeFunction. A simple FOAF example:

new GremlinPipeline(g.getVertex(1)).out("knows").out("knows").property("name")
However, when you want to do anonymous function filtering, its:

new GremlinPipeline(g.getVertex(1)).out("knows").out("knows").property("name").filter(new PipeFunction<String,Boolean> { public Boolean compute(String a) { a.startsWith("m") } })

That will filter all FOAFs whose name does not start with m. Realize, if you don't want a one line, simply define a the function as a variable.

PipeFunction<String,Boolean> mFilter = new PipeFunction<String,Boolean> { public Boolean compute(String a) { a.startsWith("m") } };
new GremlinPipeline(g.getVertex(1)).out("knows").out("knows").property("name").filter(mFilter);

The nice thing about Gremlin-Java is that its very easy to write your own Pipes in Pipes (http://pipes.tinkerpop.com) and thus, create your own domain specific query language.


I recommend reading http://markorodriguez.com/2011/08/03/on-the-nature-of-pipes/ as it will show you how Gremlin is just one data-flow DSL over Pipes --- a DSL for property graphs.

HTH,
Marko.

Josh Adell

unread,
Feb 10, 2012, 4:31:44 PM2/10/12
to Neo4j
Thanks, Marko. I'll give this a shot and see how far I get. I may have
additional questions, but hopefully they'll be more specific.

-- Josh

Josh Adell

unread,
Feb 12, 2012, 9:51:12 PM2/12/12
to Neo4j
So, this is what I have so far (I haven't gotten much accomplished):

user1=g.v(1);
user1.outE("RATED").sideEffect{w=it.rating}
.inV.inE("RATED").outV.except([user1]).back(2)
.sideEffect{diff=Math.abs(it.rating-w)}

What I'd like to have is all the "diff" in a map where each key is the
node id of a user, and the value is a list of the differences of a
rating by that user and our target user's (user1) rating of the same
item. Something like:

ratingDiffs = {
2 : [1, 3, 2, 2],
3 : [3, 3],
4 : [1, 4, 3, 5],
5 : [1, 0, 2],
...etc.
}

My plan is to average all the lists, and filter out any where the
average is less than or equal to 2. The keys left over are my
"similar" users. So in the table above, I should end up with 2 and 5,
or better yet node(2) and node(5).

What I am stuck on now seems to be a lack of Groovy skills. How can I
take each of my "diff" from the last sideEffect and put them in a
format like the "ratingDiffs" table? I thought something like this
would work, but it doesn't:

ratingDiffs = [:].withDefault{ [] };
user1=g.v(1);
user1.outE("RATED").sideEffect{w=it.rating}
.inV.inE("RATED").outV.except([user1]).back(2)
.sideEffect{ ratingDiffs[it.outV.id][] = Math.abs(it.rating-
w) }

Any help would be appreciated. Thanks!

-- Josh



Marko Rodriguez

unread,
Feb 13, 2012, 11:22:59 AM2/13/12
to ne...@googlegroups.com
Hi,

You can do something like this:

m = [:].withDefault({[]})
....groupCount(m){it.id}{it.add(diff)}.iterate()

Realize that the first line is your resultant data structure and that will generate values that are lists (hence the "withDefault()").

In essence, groupCount allows two closures -- one that is the key to use and one is the value to use. In Gremlin 1.4 and below, the key-"it" is the object propagating through the pipeline and the value-"it" is the last value for that key in the map. In Gremlin 1.5+, its a bit different, the key-"it" is the object propagating through the pipe and the value-"it" is a Pair which is the object propagating as the getA() and the last value for that key as the getB(). However, I suspect you are using Gremlin 1.4 or below so the code snippet above will work for you.

HTH,
Marko.

http://markorodriguez.com

Josh Adell

unread,
Feb 14, 2012, 12:17:19 AM2/14/12
to Neo4j
Thanks Marko! I am a lot further than I was a few hours ago. I've
gotten two major pieces of the recommendation engine done.

Piece 1: find all the users who tend to rate things similarly to a
target user

m=[:].withDefault{[0,0,0]};
user=g.v(1);
user.outE("RATED").sideEffect{w=it.rating}
.inV.inE("RATED").outV.except([user]).back(2)
.sideEffect{diff=Math.abs(it.rating-w)}
.outV.sideEffect{ me=m[it.id]; me[0]++; me[1]+=diff; me[2] =
me[1]/me[0]; }
.filter{m[it.id][2]<=2}.dedup()

Building off of this, I can get Piece 2, the answer to one of my
original questions: given users who are similar to a target user,
calculate the predicted rating of an item by that target user

m=[:].withDefault{[0,0,0]};
user=g.v(1);
item=g.v(9);
user.outE("RATED").sideEffect{w=it.rating}
.inV.inE("RATED").outV.except([user]).back(2)
.sideEffect{diff=Math.abs(it.rating-w)}
.outV.sideEffect{ me=m[it.id]; me[0]++; me[1]+=diff; me[2] =
me[1]/me[0]; }
.filter{m[it.id][2]<=2}.dedup()
.outE("RATED").inV.filter{it==item}.back(2).dedup().rating.mean()

So now I just need one more piece, the answer to the other question in
my original post: what are the top rated items by users similar to my
target user, that the target user has not rated. I think I'll need to
have another map variable and sideEffect to calculate the average
ratings for every item rated by my similar users, then sort by average
and limit to the top 10 or however many. Still working on it.

Also, I have not benchmarked this yet, but my instinct is that it
would not be very performant.

If anyone has any insights in solving the final problem, or improving
my existing solutions, I'd love to hear them. Until then, I'll keep
hacking away on it.

Thanks!

-- Josh

Marko Rodriguez

unread,
Feb 14, 2012, 10:17:12 AM2/14/12
to ne...@googlegroups.com
Hey,

Thanks Marko! I am a lot further than I was a few hours ago.  I've
gotten two major pieces of the recommendation engine done.

Whoa. You have some rockin' Gremlin there. Good job on learning how to throw down the pipelines.

Before we begin, here are some links you might appreciate:
1. http://groovy.codehaus.org/groovy-jdk/java/util/Map.html (Groovy maps have nice methods)
2. http://markorodriguez.com/2011/08/03/on-the-nature-of-pipes/ (you seem to understand Gremlin mechanics quite well, but its always good to read more)
3. https://github.com/tinkerpop/gremlin/wiki/User-Defined-Steps (talk at a higher level of abstraction than the low-level graph)
4. Take one of your expressions and do .toString() at the end --- crazy town!


Piece 1: find all the users who tend to rate things similarly to a
target user

m=[:].withDefault{[0,0,0]};
user=g.v(1);
user.outE("RATED").sideEffect{w=it.rating}
       .inV.inE("RATED").outV.except([user]).back(2)
       .sideEffect{diff=Math.abs(it.rating-w)}
       .outV.sideEffect{ me=m[it.id]; me[0]++; me[1]+=diff; me[2] =
me[1]/me[0]; }
       .filter{m[it.id][2]<=2}.dedup()

Interesting. I like your mapped 3 element array. Nice thinking.
key = other user being compared to 'user'
me[0] = number of times 'user' has been compared to that key-user
me[1] = the total similarity as defined by absolution value of their ratings
me[2] = the average similarity of 'user' to key-user.

Building off of this, I can get Piece 2, the answer to one of my
original questions: given users who are similar to a target user,
calculate the predicted rating of an item by that target user

m=[:].withDefault{[0,0,0]};
user=g.v(1);
item=g.v(9);
user.outE("RATED").sideEffect{w=it.rating}
       .inV.inE("RATED").outV.except([user]).back(2)
       .sideEffect{diff=Math.abs(it.rating-w)}
       .outV.sideEffect{ me=m[it.id]; me[0]++; me[1]+=diff; me[2] =
me[1]/me[0]; }
       .filter{m[it.id][2]<=2}.dedup()
       .outE("RATED").inV.filter{it==item}.back(2).dedup().rating.mean()

Okay. So here you are now filtering out all users whose similarity to 'user' is less than 3 (<=2).
You are then finding out what they rated item g.v(9) and taking the mean of that.
Makes sense -- assuming a Gaussian distribution (which I would suspect in practice).

So now I just need one more piece, the answer to the other question in
my original post: what are the top rated items by users similar to my
target user, that the target user has not rated. I think I'll need to
have another map variable and sideEffect to calculate the average
ratings for every item rated by my similar users, then sort by average
and limit to the top 10 or however many. Still working on it.

You will also want to look at this pattern to filter out items that the user has not already rated:

e.g. g.v(1).out('RATED').aggregate(x).in('RATED').out('RATED').except(x)
-OR- g.v(1).out('RATED').aggregate(x).in('RATED').except([g.v(1)]).out('RATED').except(x)

When you figure out your final solution, post it again and I can look it over.

Also, I have not benchmarked this yet, but my instinct is that it
would not be very performant.

The only step in there that is problematic is the dedup step. dedup is building an internal Set that is doing a .contains() for everything that flows through the step. Besides that, the next factor is your graph structure. If you have a massive branch factor, then you are checking LOTS of paths. You can do this to determine your bushy-ness:

g.v(1).out('RATED').count()
g.v(1).out('RATED').in('RATED').count()
g.v(1).out('RATED').in('RATED').out('RATED').count()

You will probably notice an exponential increase in those counts over the three expressions --- assuming this is a natural, real-world dataset you are working with. There are various optimizations you can do -- e.g. random walks or limits.

g.v(1).out('RATED').in('RATED')[0..1000] -- only make use of 1000 co-rating users.
- Also, you can do g.v(1).out('RATED').in('RATED').range(0,1000)
g.v(1).out('RATED').in('RATED').random(0.5) -- only make use of 50% of the co-rated users (a biased coin toss).

* I tend to do random walks early on in an expression as random() will still go through ALL the elements up to random(). It is after random() that your bushy tree is pruned by whatever bias you provided (0% to 100%).

* I tend to do range filters here and there to keep things clipped down AND I make the high end value a variable. By making it a variable I can control the real-timeness of the traversal. With range(), you always know the upper limit of what can pass through that particular section of the expression so you know the max number of things you can possibly process. This allows you to know exactly how fast your traversal will perform --- however, as you drop your high end down, you are sacrificing accuracy for speed.

When you are done, I'd love it if you put this in a blog post or added it as a comment to the following:

Good luck man,
Marko.

Luanne Coutinho

unread,
Feb 14, 2012, 12:32:57 PM2/14/12
to ne...@googlegroups.com
Hey Josh,

Sorry for taking so long to get back to you, was stuck with other stuff.
Anyhow, it looks like you're done with this using Gremlin (which I started learning today btw), but I also gave the queries a shot in Cypher-

First, I got all users similar to me:
start me = node(1)
match (me)-[myRating:RATED]->(i)<-[otherRating:RATED]-(u)
where abs(myRating.rating-otherRating.rating)<=2
return u.name

Node(1)= user 1 in your example, and the result of that query is User 2.

Using the results above in the next query to answer your question 1:
start item=node(6), similarUsers=node(3)
match (similarUsers)-[r:RATED]->(item)
return AVG(r.rating)

Node 6 here is Item C, and similarUsers=results from the first query, which is user 2.

To answer question 2:

start me=node(1), similarUsers=node(3)
match (similarUsers)-[r:RATED]->(item)
where r.rating > 7 and not((me)-[:RATED]->(item)) 
return item,r.rating, similarUsers.name 

Pretty simple!
Tried it out on the 3 user example you provided, may need some tweaking with some more data and similar users.

Regards
Luanne

Peter Neubauer

unread,
Feb 14, 2012, 12:42:34 PM2/14/12
to ne...@googlegroups.com

Wow.
If you two could could coordinate on a blog post about both approaches, that would be most awesome. Just sayin.

Send from a device with crappy keyboard and autocorrection.

/peter

Luanne Coutinho

unread,
Feb 14, 2012, 12:59:10 PM2/14/12
to ne...@googlegroups.com
Sure, I'm game :-)

Marko Rodriguez

unread,
Feb 14, 2012, 1:58:26 PM2/14/12
to ne...@googlegroups.com
Hey,

Piece 1: find all the users who tend to rate things similarly to a
target user

m=[:].withDefault{[0,0,0]};
user=g.v(1);
user.outE("RATED").sideEffect{w=it.rating}
       .inV.inE("RATED").outV.except([user]).back(2)
       .sideEffect{diff=Math.abs(it.rating-w)}
       .outV.sideEffect{ me=m[it.id]; me[0]++; me[1]+=diff; me[2] =
me[1]/me[0]; }
       .filter{m[it.id][2]<=2}.dedup()

Looking over this part again, here is another optimization:

  Use it.getProperty('rating') as it is faster.

Also, since you are filling a map, you may need to iterate all those values first before you filter

Also, note that in Gremlin 1.5-SNAPSHOT, you can make use of select(). This allows you to avoid "backing up" and using sideEffects as you can "select" elements from previous in the pipeline. Here is how your statement would look with select().

user.outE('RATED').as('a').inV
.inE('RATED').as('b').outV.except([user]).as('c').select(['a','b'])
.sideEffect{me=m[it.id]; me[0]++; me[1]+=Math.abs(it[0].getProperty('rating') - it[1].getProperty('rating')); me[2] =me[1]/me[0];}
.select('c').filter{m[it[0].id][2]<=2}.dedup()

However, now that I look at your query more closely, you may need to iterate the full pipeline before doing the final filter as your map is not completely full. Moreover, your me[2] is an average for only that rating/rating comparison. To rectify this, I suspect you want:

m=[:].withDefault{[0,0]}; // only two slots. running count and running absolute value.
user=g.v(1); 
user.outE('RATED').sideEffect{w=it.getProperty('rating')}.inV
.inE('RATED').outV.except([user]).back(2)
.sideEffect{diff=Math.abs(it.getProperty('rating')-w)}.outV
.sideEffect{me=m[it.id]; me[0]++; me[1]+=diff}.iterate()
m.entrySet._().filter{it.value[1]/it.value[0] <= 2}.transform{it.key} // or some fancy Groovy map filter function.

Anywho....

Take care,
Marko.

Josh Adell

unread,
Feb 24, 2012, 9:45:29 AM2/24/12
to Neo4j
Finally got around to writing this up. Feedback always appreciated.

http://blog.everymansoftware.com/2012/02/similarity-based-recommendation-engines.html

-- Josh

On Feb 14, 12:42 pm, Peter Neubauer <neubauer.pe...@gmail.com> wrote:
> Wow.
> If you two could could coordinate on a blog post about both approaches,
> that would be most awesome. Just sayin.
>
> Send from a device with crappy keyboard and autocorrection.
>
> /peter
> On Feb 14, 2012 6:32 PM, "Luanne Coutinho" <luanne.couti...@gmail.com>
> >>http://markorodriguez.com/2011/09/22/a-graph-based-movie-recommender-...

Luanne

unread,
Feb 24, 2012, 11:56:42 AM2/24/12
to ne...@googlegroups.com
And I shall have the cypher approach written up this weekend

-Luanne

Peter Neubauer

unread,
Feb 24, 2012, 4:39:02 PM2/24/12
to ne...@googlegroups.com

Wow,
That will be supercool, both of you!

Sent from a device with crappy keyboard and autocorrection.

/peter

Luanne Coutinho

unread,
Feb 25, 2012, 11:50:12 AM2/25/12
to ne...@googlegroups.com
Here's the Cypher equivalent- http://thought-bytes.blogspot.in/2012/02/similarity-based-recommendations-with.html

We'll have them linked up

Peter Neubauer

unread,
Feb 25, 2012, 12:32:29 PM2/25/12
to ne...@googlegroups.com
That is - impressive. I guess in the future, with Cypher doing subqueries, the "similar users to me" first query can even be put into a subquery, Andres?

Thanks for taking the time to write this up, Josh, Luanne and Marko and Andres for helping out!

I think this could lead the way to the cheat/comparison sheet we have been talking about?

Cheers,

/peter neubauer

G:  neubauer.peter
S:  peter.neubauer
P:  +46 704 106975
L:   http://www.linkedin.com/in/neubauer
T:   @peterneubauer

Neo4j 1.6 released                 - dzone.com/6S4K
The Neo4j Heroku Challenge   - http://neo4j-challenge.herokuapp.com/
Reply all
Reply to author
Forward
0 new messages