Strategies for optimizing a recommendation algorithm?


Jeroen van Dijk

Nov 9, 2011, 4:49:50 AM
to Gremlin-users
Hi all,

My compliments to everybody involved with Gremlin and TinkerPop. I'm a
very happy user so far.

I would like to share my implementation of a recommendation algorithm
[*] in the hope we could all benefit and I could get some advice on
how to make it fast for all cases. The recommendation algorithm is
written in Gremlin and runs on the Neo4j add-on for Heroku. It uses the
Jaccard index to compute the similarity between entities. I'm
currently looking into how I could make the computation feasible with
a huge dataset. My dataset has many items that are heavily connected,
some with almost all other items in the graph. This means that for the
similarity calculation in most cases all the items in the graph need
to be traversed. I have been thinking about pruning or sampling paths,
but so far these strategies heavily influence the results and do not
give me the speed boost I want.

Are there known strategies that deal with these situations? I can
imagine that I could abstract the heavily connected items by putting
them in precalculated-clusters and adapt the Jaccard similarity
algorithm slightly.

And as a side question, are implementations of algorithms like mine
interesting for the Furnace project (https://github.com/tinkerpop/furnace)?

Also, if there are possible optimizations in the Gremlin code itself, I
would be keen to hear about them.

Thanks in advance,
Jeroen

[*] https://gist.github.com/517b32e715d6108238bc

Marko Rodriguez

Nov 9, 2011, 12:46:34 PM
to gremli...@googlegroups.com
Hey,

> My compliments to everybody involved with Gremlin and TinkerPop. I'm a
> very happy user so far.

Sweet. Thanks. Glad you like the TinkerPop stack.

> I would like to share my implementation of a recommendation algorithm
> [*] in the hope we could all benefit and I could get some advice on
> how to make it fast for all cases. The recommendation algorithm is
> written in Gremlin and runs on the Neo4j add-on for Heroku. It uses the
> Jaccard index to compute the similarity between entities. I'm
> currently looking into how I could make the computation feasible with
> a huge dataset.

So for any two user vertices, you are determining the Jaccard index of their "purchased" item vertices (intersection / union). This should be fast unless you have many many items per user. What is the distribution of items for users?
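For two users that could look roughly like the following in Gremlin (sketch only, not the code from your gist; the 'bought' edge label and the vertex ids are placeholders):

// Jaccard index of two users' item sets: |intersection| / |union|
jaccard = { userA, userB ->
  a = userA.out('bought').toList() as Set
  b = userB.out('bought').toList() as Set
  shared = a.intersect(b).size()
  total  = (a + b).size()
  total == 0 ? 0.0d : shared / (double) total
}
jaccard(g.v(1), g.v(2))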

> My dataset has many items that are heavily connected,
> some with almost all other items in the graph. This means that for the
> similarity calculation in most cases all the items in the graph need
> to be traversed. I have been thinking about pruning or sampling paths,
> but so far these strategies heavily influence the results and do not
> give me the speed boost I want.

Hm. I tend to do graph sampling, which boosts performance to an arbitrary degree, and even with heavy sampling I tend not to see accuracy degenerate. The most basic sampling is:

g.v(1).out.blah.groupCount(m)[0..maxResults]

where maxResults correlates with how much time you want to spend on the traversal. However, for your algorithm, which is based on set intersections/unions, sampling may not be too successful (I speculate).
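For instance, a sampled co-purchase recommendation could look roughly like this (sketch only; 'bought' is a made-up edge label and 1000 an arbitrary cap):

maxResults = 1000
m = [:]
// lazy evaluation means only the first maxResults co-purchase paths are
// pulled through groupCount, so the running time stays bounded
g.v(1).out('bought').in('bought').except([g.v(1)]).out('bought').
  groupCount(m)[0..maxResults].iterate()
// rank the sampled items by how often they were reached
m.sort{ a, b -> b.value <=> a.value }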

> Are there known strategies that deal with these situations? I can
> imagine that I could abstract the heavily connected items by putting
> them in precalculated-clusters and adapt the Jaccard similarity
> algorithm slightly.

Yes---if you have heavily connected items. In general, you could probably leave such items out as they are not contributing information (everyone likes them! -- am I similar to you because we both like oxygen? -- no, because everyone likes oxygen.)
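For example, a cheap degree cut-off might look like this (again just a sketch; 'bought' and the 1000 cutoff are placeholders, and counting the degree inside the filter is itself linear in the degree, so caching it as a vertex property is probably better for the really heavy hitters):

m = [:]
degreeCutoff = 1000
// skip "oxygen" items whose in-degree exceeds the cutoff before fanning out
g.v(1).out('bought').filter{ it.in('bought').count() < degreeCutoff }.
  in('bought').except([g.v(1)]).groupCount(m).iterate()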

> And as a side question, are implementations of algorithms like mine
> interesting for the Furnace project (https://github.com/tinkerpop/furnace)?

Yes. However, Furnace is still at a very early stage and the foundation is being designed. Once that is complete, we hope to go to town adding algorithms.

See ya,
Marko.

http://markorodriguez.com

Jeroen van Dijk

Nov 16, 2011, 10:12:42 AM
to gremli...@googlegroups.com
Hi Marko,

Thanks for your reply and sorry for the delay. I have replied inline.
 

> So for any two user vertices, you are determining the Jaccard index of their "purchased" item vertices (intersection / union). This should be fast unless you have many many items per user. What is the distribution of items for users?

My dataset can be compared to the graph behind the Amazon bookstore: some users have bought many items, but more importantly, some items have been bought by a large part of the user population. It is especially these frequently bought items that add many paths to the traversal. But as you say below, these items don't say as much as items that occur less often.


> > My dataset has many items that are heavily connected,
> > some with almost all other items in the graph. This means that for the
> > similarity calculation in most cases all the items in the graph need
> > to be traversed. I have been thinking about pruning or sampling paths,
> > but so far these strategies heavily influence the results and do not
> > give me the speed boost I want.

> Hm. I tend to do graph sampling, which boosts performance to an arbitrary degree, and even with heavy sampling I tend not to see accuracy degenerate. The most basic sampling is:
>
>        g.v(1).out.blah.groupCount(m)[0..maxResults]
>
> where maxResults correlates with how much time you want to spend on the traversal. However, for your algorithm, which is based on set intersections/unions, sampling may not be too successful (I speculate).

I might have implemented the sampling wrongly then. I'll try again and see if your proposal solves my issue.
 
> > Are there known strategies that deal with these situations? I can
> > imagine that I could abstract the heavily connected items by putting
> > them in precalculated-clusters and adapt the Jaccard similarity
> > algorithm slightly.

> Yes---if you have heavily connected items. In general, you could probably leave such items out as they are not contributing information (everyone likes them! -- am I similar to you because we both like oxygen? -- no, because everyone likes oxygen.)

I'll try this and see how the quality and speed of the recommendations turn out; I hope both will be acceptable. My intuition still says that I will need to pre-calculate clusters to make it really scale well. I want to make the recommendations available to end users in real time through an API call. This means it should take no more than 100ms, and preferably more like 10ms, and all this while the graph grows bigger and the queries become more complex. If you have good literature on this, I would highly appreciate it.
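Concretely, the kind of thing I have in mind is roughly the following (just a sketch, not working code from my app; the 'type', 'bought' and 'similar' labels, the 'score' property and the top-10 cutoff are all placeholders): an offline job that materializes top-N 'similar' edges per item, so that the request-time query becomes a single hop.

// offline job, run periodically rather than per request
g.V.filter{ it.getProperty('type') == 'item' }.each { item ->
  counts = [:]
  item.in('bought').out('bought').except([item]).groupCount(counts).iterate()
  ranked = counts.entrySet().sort{ a, b -> b.value <=> a.value }
  ranked.subList(0, Math.min(10, ranked.size())).each { entry ->
    // score is the raw co-purchase count here; substitute a Jaccard value if preferred
    g.addEdge(null, item, entry.key, 'similar').setProperty('score', entry.value)
  }
}

// request time: recommendations for an item are a single cheap hop
g.v(someItemId).out('similar')   // someItemId is a placeholder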
 
 
> > And as a side question, are implementations of algorithms like mine
> > interesting for the Furnace project (https://github.com/tinkerpop/furnace)?

> Yes. However, Furnace is still at a very early stage and the foundation is being designed. Once that is complete, we hope to go to town adding algorithms.

Cool! I'm looking forward to contributing.

Cheers,

Jeroen