> My compliments to everybody involved with Gremlin and Tinkerpop. I'm a
> very happy user so far.
Sweet. Thanks. Glad you like the TinkerPop stack.
> I would like to share my implementation of a recommendation algorithm
> [*] in the hope we could all benefit and I could get some advice on
> how to make it fast for all cases. The recommendation algorithm is
> written in Gremlin and runs on the Neo4j addon for Heroku. It uses the
> Jaccard index to compute the similarity between entities. I'm
> currently looking into how I could make the computation feasible with
> a huge dataset.
So for any two user vertices, you are determining the Jaccard index of their "purchased" item vertices (intersection / union). This should be fast unless you have many many items per user. What is the distribution of items for users?
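For reference, that computation in Gremlin-Groovy might look like the following sketch (assuming a "purchased" edge label and user vertex ids 1 and 2 -- adjust to your schema):

a = g.v(1).out('purchased').toSet()
b = g.v(2).out('purchased').toSet()
jaccard = a.intersect(b).size() / (a + b).size()

Groovy's intersect() and plus() on sets give you the intersection and union directly, and Groovy's / returns a decimal, so no cast is needed.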
> My dataset has many items that are heavily connected,
> some with almost all other items in the graph. This means that for the
> similarity calculation in most cases all the items in the graph need
> to be traversed. I have been thinking about pruning or sampling paths,
> but so far these strategies heavily influence the results and do not
> give me the speed boost I want.
Hm. I tend to do graph sampling, which can boost performance to any arbitrary degree, and even with heavy sampling I tend not to see accuracy degenerate. The most basic sampling is:
g.v(1).out.blah.groupCount(m)[0..maxResults]
Where maxResults is correlated to how much time you want to spend on the traversal. However, for your algorithm which is based on set intersections/unions, sampling may not be too successful (I speculate).
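That said, if you want to try it, range-limiting both neighborhoods before the set operations is one way to combine sampling with your Jaccard computation (sampleSize is a hypothetical knob, "purchased" an assumed label):

a = g.v(1).out('purchased')[0..sampleSize].toSet()
b = g.v(2).out('purchased')[0..sampleSize].toSet()
sim = a.intersect(b).size() / (a + b).size()

The resulting score is then an estimate over the sampled neighborhoods rather than the true Jaccard index, which is why I speculate it may distort set-based measures more than it distorts a simple groupCount ranking.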
> Are there known strategies that deal with these situations? I can
> imagine that I could abstract the heavily connected items by putting
> them in precalculated-clusters and adapt the Jaccard similarity
> algorithm slightly.
Yes---if you have heavily connected items. In general, you could probably leave such items out, as they are not contributing information (everyone likes them! -- am I similar to you because we both like oxygen? -- no, because everyone likes oxygen).
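A cheap approximation of that idea is to cap item degree during the traversal -- e.g. skip any item whose in-degree exceeds some threshold (degreeCap is a hypothetical parameter, "purchased" an assumed label):

g.v(1).out('purchased').filter{it.in('purchased').count() < degreeCap}.toSet()

Items that nearly everyone purchased get dropped before the intersection/union, which both shrinks the sets you compare and removes the uninformative "oxygen" items.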
> And as a side question, are implementations of algorithms like mine
> interesting for https://github.com/tinkerpop/furnace project?
Yes. However, Furnace is still very young and its foundation is being designed. Once that is complete, we hope to go to town adding algorithms.
See ya,
Marko.