I am currently playing with a neo4j database that consists of 3 types of node: “User”, “Merchant” and “Offer”.
Users click on offers, and make purchases from merchants. So, the following relationships exist:
(:User)-[:PURCHASED_FROM]->(:Merchant)
(:User)-[:CLICKED_ON]->(:Offer)
I want to perform a cypher query that will recommend merchants to a particular user, based on how many other users with similar tastes have purchased from that merchant. So a pretty standard collaborative filtering query, I believe.
The query looks like this:
MATCH(u:User{userId:1234)-[:PURCHASED_FROM]->(:Merchant)<-[:PURCHASED_FROM]-(:User)-[r:PURCHASED_FROM]->(newMerchants:Merchant)
WHERE NOT (u)-[:PURCHASED_FROM]->(newMerchants)
RETURN DISTINCT newMerchants.name, newMerchants.merchantId, count(r) AS rel_count
ORDER BY rel_count DESC
LIMIT 30
My problem is that this query will take a LONG time to run. In fact, it’ll take so long I haven’t seen it finish yet. Removing the count() aggregate functions will help things, but I need to order by this count of relationships so that the user sees the merchants most relevant to them.
I believe the problem is caused by the sheer number of nodes and relationships in graph, and the number of traversals that then occur when the above query is executed.
It’s not uncommon for a single merchant node to have >1m :PURCHASED_FROM relationships. There are approximately 60m :PURCHASED_FROM relationships is the graph, around 12m :User nodes and around 7.5k :Merchant nodes.
I figure that this must be a problem that comes up fairly frequently in large graphs. What modelling approaches are available to lessen the number of traversals required at query time?
Thanks in advance for your brain power!