Optimising query-time traversals

10 views
Skip to first unread message

David Smith

unread,
Sep 11, 2015, 11:52:29 AM9/11/15
to Neo4j
I am currently playing with a neo4j database that consists of 3 types of node: “User”, “Merchant” and “Offer”.

Users click on offers, and make purchases from merchants. So, the following relationships exist:

(:User)-[:PURCHASED_FROM]->(:Merchant)
(:User)-[:CLICKED_ON]->(:Offer)


I want to perform a cypher query that will recommend merchants to a particular user, based on how many other users with similar tastes have purchased from that merchant. So a pretty standard collaborative filtering query, I believe.

The query looks like this:

MATCH(u:User{userId:1234)-[:PURCHASED_FROM]->(:Merchant)<-[:PURCHASED_FROM]-(:User)-[r:PURCHASED_FROM]->(newMerchants:Merchant)
WHERE NOT
(u)-[:PURCHASED_FROM]->(newMerchants)
RETURN DISTINCT newMerchants
.name, newMerchants.merchantId, count(r) AS rel_count
ORDER BY rel_count DESC
LIMIT
30


My problem is that this query will take a LONG time to run. In fact, it’ll take so long I haven’t seen it finish yet. Removing the count() aggregate functions will help things, but I need to order by this count of relationships so that the user sees the merchants most relevant to them.

I believe the problem is caused by the sheer number of nodes and relationships in graph, and the number of traversals that then occur when the above query is executed.

It’s not uncommon for a single merchant node to have >1m :PURCHASED_FROM relationships. There are approximately 60m :PURCHASED_FROM relationships is the graph, around 12m :User nodes and around 7.5k :Merchant nodes.

I figure that this must be a problem that comes up fairly frequently in large graphs. What modelling approaches are available to lessen the number of traversals required at query time?

Thanks in advance for your brain power!

Michael Hunger

unread,
Sep 11, 2015, 1:54:51 PM9/11/15
to ne...@googlegroups.com, Max De Marzi
Hi David,

I presume you have an Index on :User(userId) ?

What version of neo do you run?

It would be good to see your query plan which you should get with prefixing the query with PROFILE.

Do multiple PURCHASED_FROM rels exist between one user and one merchant? If not you can skip the first distinct in my following query.

In general try to get your cardinalities down in between, try this:

profile MATCH (u:User{userId:1234})-[:PURCHASED_FROM]->(m:Merchant)
WITH distinct u,m
MATCH (m)<-[:PURCHASED_FROM]-(other)
WITH distinct u,other
MATCH (other)-[:PURCHASED_FROM]->(newMerchants)
WITH u,other, newMerchants, count(*) as rel_count
ORDER BY rel_count desc LIMIT 100
WHERE NOT (u)-[:PURCHASED_FROM]->(newMerchants)
RETURN newMerchants.name, newMerchants.merchantId, rel_count
LIMIT 30


I would also skip merchants and users that add no information (like ones where everyone bought or users that bought at more than x (e.g. 20) different merchants as they are not selective enough.


--
You received this message because you are subscribed to the Google Groups "Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email to neo4j+un...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Reply all
Reply to author
Forward
0 new messages