[Cypher] Analytics queries slow when graph size increases


Michael Rehse

unread,
Mar 30, 2012, 10:40:16 AM3/30/12
to ne...@googlegroups.com
Hello,

I'm working on a web reading platform written in Python, and we've been experimenting with Neo4j to track reads. I'm currently using the neo4jrestclient library. The graph has two types of nodes (articles and users), connected by "read" relationships. Our graph currently has just over half a million nodes and just under a million relationships. This query:

START article=node:article("article_id:*") 
MATCH article<-[r:read]-user 
WHERE r.timestamp > {start_time} AND article.blacklist? = false 
RETURN article, count(r) 
ORDER BY count(r) DESC 
SKIP 0 LIMIT 20

was quite fast with a smaller dataset, but now takes well over a minute. Am I doing something wrong? How can I speed this query up to be manageable?

Thanks

Jim Webber

unread,
Mar 30, 2012, 11:10:17 AM3/30/12
to ne...@googlegroups.com
Hi Michael,

You're not doing anything wrong, you're just getting the performance exactly as you'd expect with the algorithm that you're using.

START article=node:article("article_id:*")

Let's start here - you're asking for all articles. That's probably O(n) in complexity (I don't know how Lucene works internally).

Then you do this:

MATCH article<-[r:read]-user
WHERE r.timestamp > {start_time} AND article.blacklist? = false

That's O(n) in complexity too - every article has to be checked to see if it matches these constraints.

Why is this slow? Because when you access nodes sequentially like this, it's unlikely that any of them will be in Neo4j's cache, meaning that you'll end up going to disk most of the time (and seeks on a mechanical disk are very slow).

The way that you get performance with these kinds of graph-global algorithms (the same class as PageRank) is to keep your cache warm, and you do that by keeping Neo4j alive between analytical requests. Otherwise in Neo4j - as in any other database - you're going to be I/O bound.
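
A common trick for warming the cache (not specific to your model, just a general sketch, and assuming your Neo4j version supports node(*) in START) is to run a query after startup that touches every node and relationship once, so later analytical queries hit the cache instead of disk:

START n=node(*)
MATCH n-[r?]->()
RETURN count(distinct n), count(r)

That query will itself be slow on a cold store, but everything after it should be much faster.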

There may also be things in Cypher, or its implementation, that can help, but I'll defer to Andres on that.

HTH.

Jim

Michael Hunger

unread,
Mar 30, 2012, 11:26:21 AM3/30/12
to ne...@googlegroups.com
Additionally,

#3 Doing the aggregation pulls the whole dataset into memory.
#4 And ordering does this too (and sorts it in memory).
#5 Then you state that you're only interested in the top 20.

What is the actual use case for this query?

You might be interested in adding additional in-graph indexes (and linking) inside your graph, to put things like an order or ranking directly into the graph (and update it when a measure changes).

An interesting discussion around this is the Graphity approach by René Pickhardt:
http://www.rene-pickhardt.de/graphity-an-efficient-graph-model-for-retrieving-the-top-k-news-feeds-for-users-in-social-networks/

Michael Rehse

unread,
Mar 30, 2012, 11:50:43 AM3/30/12
to ne...@googlegroups.com
Awesome, thanks for the quick responses!

@Jim: I'm not sure what you mean by keeping Neo4j alive between analytical requests. We're running Neo4j as a server, and thus, we shouldn't have problems keeping the cache warm, should we?

@Michael: The use case is to find the 20 most-read articles since a given timestamp. Thus (coming from a SQL point of view), we'd want to take the whole dataset, filter out reads outside of our timeframe, then aggregate those reads based on the article. In SQL, I'd add an index on r.timestamp to make that portion of the query more efficient... and I was under the impression that indexes in Cypher don't work that way. Do you mean something else when you say "in-graph indexes"? I haven't had a chance to check out that link in any detail yet, so I apologize if the answer to that is in there...

Obviously, there is a large amount of data here, and as such, it will take some work to make sure it's performant. I like Cypher a whole lot, but I also have heard that it might be worth it to try writing the query in Gremlin instead... is that something that you'd recommend?

Thanks for all your help!

Marko Rodriguez

unread,
Mar 30, 2012, 12:02:56 PM3/30/12
to ne...@googlegroups.com
Hi,

Obviously, there is a large amount of data here, and as such, it will take some work to make sure it's performant. I like Cypher a whole lot, but I also have heard that it might be worth it to try writing the query in Gremlin instead... is that something that you'd recommend?

I suspect that the query in Gremlin or Cypher will only give you a constant-factor difference in speed. The problem is not the query language, but the amount of data you are touching. It's equivalent to doing a:

SELECT * FROM a, b WHERE a.id1=b.id2 ORDER BY a.something

That query can get nasty fast depending on your underlying data structure. 

HTH,
Marko

Michael Hunger

unread,
Mar 30, 2012, 12:06:48 PM3/30/12
to ne...@googlegroups.com
Michael,

Do these analytic queries need to be real-time?

Here's what we do in this case (a more graphy approach):

We create a time-tree structure in the graph: see http://blog.neo4j.org/2012/02/modeling-multilevel-index-in-neoj4.html

Then we connect each article, on read, with a relationship to the day leaf node. On that relationship you can count multiple reads. After each day has ended you can even optimize this by keeping only the top-n "most-read" article relationships and dropping the rest as insignificant (after all, once the day is over this is static, historical data).

To determine the most often read articles you would go down the time-tree to the leaf nodes and follow them (via the next-relationships) from start to end.

There (depending on your interval) you only have a few nodes (at most 365 per year) to look at. Fetching the most-read relationships from those is simple as well, and then you just have a few article nodes to aggregate over, order, and take the top of.
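
For example, a day-level query for the most-read articles against such a structure might look roughly like this (the read_on relationship type, its count property and the time index name are just placeholders for whatever pre-aggregated relationships and index you actually create):

START root=node:time("root:root")
MATCH root-[:`2012`]->year-[:`3`]->month-[:`30`]->day-[r:read_on]->article
WHERE article.blacklist? = false
RETURN article, sum(r.count)
ORDER BY sum(r.count) DESC
LIMIT 20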

In general this supports the local-traversal use case: the few local nodes to start from are the day nodes of your time tree, and from those you can traverse directly, via semantically relevant relationships, to the articles.

As a graph database moves the cost of creating relationships (joins) to insertion time so that they can be followed quickly at query time, you should leverage that and introduce additional, use-case-specific relationships into the graph that enrich your domain and make your use cases fast. Pre-aggregating event streams (like your "read" events) is a common approach here, and linking the aggregated event data to the dimensions you want to query from later enriches your ability to search.

These are also the approaches discussed in Graphity.

HTH

Michael

Michael Rehse

unread,
Mar 30, 2012, 2:48:26 PM3/30/12
to ne...@googlegroups.com
Michael,

Optimally, we'd like the queries to be real-time.

I read the time-tree article before, but had forgotten about it until you mentioned it here. It's a great way to index within the graph!

We'll be looking to store information about reads this week, today, and this hour, so we'd be storing for the whole week until we had to compact the data. However, we also want to store historical data. I was thinking that we might want to strip off read relationships at the end of every week and place them into an archive DB... perhaps another, flatter instance of Neo4j where it's alright if queries take a long time. Does this sound like an insane idea? Do other people do something like this?

Also, since we'd want to keep track of who read what article, I was thinking that I could use a hyper-edge from Hour to User & Article. Just about to try it out. I'll let you know how it looks!

Luanne Coutinho

unread,
Apr 2, 2012, 4:08:31 AM4/2/12
to ne...@googlegroups.com
Adding on to this thread because I've just started figuring out how I'm planning to store my data for similar analytics.

The kind of data I deal with has more to do with promotions and data around them - so information about promotions availed of today, this week, this month, etc. I was thinking of actually keeping my current graph model (which deals with promotion functionality) as it is - e.g. Promotion-[sent_to]->Customer and Customer-[cashed]->Promotion. But I will also attach to the timeline node(s) some rolled-up data for a promotion, like the number of customers it was sent to, or the number of customers that cashed it on that day. This will help me produce numbers by promotion.
Might also roll up numbers and take them off the graph once the promotion ends... but I'm not sure about this until we get really bad numbers with a lot of historical data in the same graph.
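
Roughly, reading the rolled-up numbers back out for a promotion would then be something like this (the index, relationship type and property names are just placeholders, I haven't settled on anything yet):

START promo=node:promotion("promo_id:p123")
MATCH promo-[s:stats_for]->day
WHERE day.date >= {from} AND day.date <= {to}
RETURN sum(s.sent_count), sum(s.cashed_count)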

The only concern is that every time a customer is sent a promotion, or cashes one in, then along with the basic model being updated, the offer node in the timeline index for that day also has to be updated. I'm finding it a bit difficult to maintain the multilevel index directly on my core entities like offers etc., unless I introduce a kind of node that holds the association - so:
 Promotion-[cant-think-of-a-name]-AssociationNode-[sent_to]->Customer
Customer-[cashed]-AnotherAssociationNode-[cant-think-of-a-name]->Promotion

And link the AssociationNodes to the Day node in the timeline...something like that. But it's so not intuitive and very relational :-)

I'd be grateful for any suggestions.
Thanks
Luanne

Michael Hunger

unread,
Apr 2, 2012, 4:49:40 AM4/2/12
to ne...@googlegroups.com
Is the promotion per customer or global? Where do you store customer-local information (like a custom promotion code - on the sent_to relationship)?

If it is already per customer - why not just add two additional relationships on the promotion that point to the time index?

If the promotion is global then you probably hit the hyperedge issue (where you want a relationship to connect more than 2 nodes).

The node you're probably looking for is something like a PromotionCode?

HTH

Michael

Luanne Coutinho

unread,
Apr 2, 2012, 4:56:13 AM4/2/12
to ne...@googlegroups.com
Yes the promotion is global. So a single promotion is sent to multiple customers and some of those customers can cash in on it. Yeah, it does look like the hyperedge example in the cookbook. Is it advisable to go down that route or just duplicate some information for reporting purposes on the timeline?

Michael Hunger

unread,
Apr 2, 2012, 5:03:48 AM4/2/12
to ne...@googlegroups.com
Is there more connected to the cashing in? Perhaps the reporting could go via those additional things connected to it?

Otherwise I'd do a PoC (with generated data) to see if the analytics (which might not need to be real-time) are fast enough. After all, I assume the promotions are time-interval-limited anyway, so you can use that to pre-filter the promotions that fall into your analytics range.
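
E.g. something along these lines as a pre-filter (just a sketch - the promotion index and the valid_from/valid_to properties are assumptions about your model, not something you've described):

START promo=node:promotion("promo_id:*")
WHERE promo.valid_from <= {range_end} AND promo.valid_to >= {range_start}
RETURN promo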

Michael

Luanne Coutinho

unread,
Apr 2, 2012, 5:06:27 AM4/2/12
to ne...@googlegroups.com
Yeah, I think I'll just have to try some PoCs on this - they would probably focus more on the cashing-in part - like, for a promotion, give me the total value of all cash-ins and group those by customer age, location and gender.
Getting to the promotion is not a problem since I'll be generating reports for a single promotion at a time.
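
Something along these lines, I guess (assuming the promotion sits in an index, that age/location/gender live on the Customer node, and that the cashed-in value is a property on the cashed relationship - all of that is still up in the air):

START promo=node:promotion(promo_id = "p123")
MATCH customer-[c:cashed]->promo
RETURN customer.age, customer.location, customer.gender, sum(c.value)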

Luanne Coutinho

unread,
Apr 3, 2012, 9:57:09 AM4/3/12
to ne...@googlegroups.com
So I tried an example without the timeline index- just using my actual domain model as is.

For a promotion sent out to 50000 customers, assuming 80% accepted it and 60% cashed in, the time taken to get the total number of customers that accepted it within a specified date range averages to about 1123ms.

For a promotion sent out to 1 million customers, assuming 80% accepted it and 60% cashed in, the time taken to get the total number of customers that accepted it within a specified date range averages to 38565ms.

These queries were executed using the shell on a developer box (4 GB RAM) with no other load on the db, and the only tweak in the settings was
wrapper.java.maxmemory=1024 (because the second example caused an OOM exception).
I am not yet accessing properties on the customer such as age and location.

The timings are not really disastrous - I could roll up numbers every day etc. - but I am now beginning to think that this particular aspect is really just flat reporting. The domain model I have serves me well for more valuable intelligence and analytics, so I am leaning towards keeping it as it is and using something else entirely for straight reports and summaries - like MongoDB, perhaps.
Still mulling this over though - any insights would be helpful.

Thanks
Luanne

Michael Hunger

unread,
Apr 3, 2012, 10:02:45 AM4/3/12
to ne...@googlegroups.com
What version are you running? Already 1.7 with the Cypher improvements?

Michael

Luanne Coutinho

unread,
Apr 3, 2012, 10:02:36 AM4/3/12
to ne...@googlegroups.com
Yup, 1.7 M02

Michael Rehse

unread,
Apr 4, 2012, 4:57:08 PM4/4/12
to ne...@googlegroups.com
So I've re-architected the data to use a time tree, but the performance seems to have gotten HORRIBLY worse. A query like this:

START root=node:time("root:root") 
MATCH common_path=root-[:`2012`]->common_leaf, 
start_path=common_leaf-[:`3`]->()-[:`28`]->()-[:`19`]->start_leaf, 
end_path=common_leaf-[:`4`]->()-[:`4`]->()-[:`19`]->end_leaf, 
value_path=start_leaf-[:NEXT*0..]->middle-[:NEXT*0..]->end_leaf, 
values=middle-[:user_read]->hyperedge-[r:read]->article, hyperedge-[:has_read_in_hour]->user 
WHERE article.blacklist? = false 
RETURN article, count(r) 
ORDER BY count(r) DESC 
SKIP 0 LIMIT 20

Takes between 2 and 4 seconds. The size of the dataset right now is extremely small (1 user, 2 articles, 3 reads), meaning the database is currently mostly the time tree.

I should mention that I'm on the stable build of Neo4j (1.6.1).

Am I doing something silly here? Why is my shiny new time tree index so slow?

reyman

unread,
Apr 4, 2012, 5:26:50 PM4/4/12
to ne...@googlegroups.com
Hi,

I'm trying to build a time tree in my project to store a lot of data at each step of a simulation - can you give some more information or source code about your Neo4j architecture?

Your tests and conclusions interest me, because I'll have this type of complex query to do later...

Michael Rehse

unread,
Apr 4, 2012, 5:40:30 PM4/4/12
to ne...@googlegroups.com
My data is organized based off of these two recipes from the cookbook: http://docs.neo4j.org/chunked/snapshot/cypher-cookbook-path-tree.html and http://docs.neo4j.org/chunked/snapshot/cypher-cookbook-hyperedges.html

It's basically a root node that points to years, which point to months, which point to days, which point to hours (so it's one level deeper than the cookbook). The "values" path that I've specified is a hyperedge (see the second cookbook article) between an hour, a user, and an article, meant to represent users who have read an article during that hour. That makes the query a little more complex than the initial time tree example, but not by much. 

So everyone knows, I've tried installing 1.7.M02, and the queries seem to be running much faster - sometimes down to 8 ms or faster, but occasionally shooting up to nearly a second for no apparent reason. I'm not sure I want to recommend that my sysadmin put a milestone release into production, so I'd really rather use 1.6.1 - but maybe 1.7.M02 is stable enough...

Can anyone shed some light on the performance disparity there?

Michael Hunger

unread,
Apr 4, 2012, 5:55:28 PM4/4/12
to ne...@googlegroups.com
Michael,

Andres and I worked on Cypher performance by re-adding the original pattern matcher for simple patterns, which increases the speed by orders of magnitude.
The slower pattern matcher is still used for more complex patterns, but we'll keep working on improving it.

We also made a bunch of internal changes to Cypher execution, mostly using mutable data structures inside an execution, rewriting some parts to compute things only once, etc.

The thing that makes your query slow is the variable-length path. If you change the query so that, instead of walking the leaf nodes via this path, you just go one level up in the tree and collect everything below it, this query should also be fast with 1.7.M02.
Especially since the order at the lowest level is not important to your aggregation (the variable-length path ensures the time order, which might be important for event processing but not for aggregating analytics).

Peter Neubauer

unread,
Apr 4, 2012, 6:25:01 PM4/4/12
to ne...@googlegroups.com

I think there is material for a number of interesting topics in there, Michael :-)

Michael Rehse

unread,
Apr 4, 2012, 6:27:25 PM4/4/12
to ne...@googlegroups.com
Thanks!

I certainly don't care about time order... all I care about is the number of :read relationships for any specific article. That number has to be counted off the hyperedge (as there can be multiple reads off each one). I'm not sure how to write that query so it's not variable length... I will know the precise number of hops, but I need a nice reference point like "middle" to refer to, because I don't want to write out start-[:NEXT]->hop1-[:NEXT]->hop2-[:NEXT]->end for a 3-hop path and then refer to each node individually - my minimum number of hops would be 24, and my maximum would likely be 168. That would make for the messiest of queries. I DO have to aggregate on the lowest level... can you suggest a way to do that? I've just started learning Cypher, so thanks for your patience with me...

Also, if I can eliminate that variable length path, will it also speed up the query in 1.6.1? I'd prefer to use that, if possible.

Michael Rehse

unread,
Apr 4, 2012, 6:57:23 PM4/4/12
to ne...@googlegroups.com
I noticed that constraining the variable-length path (i.e. start_leaf-[:NEXT*0..168]->middle-[:NEXT*0..168]->end_leaf) speeds up the query significantly. Is this what you meant? Or is it still technically a variable-length path, just one that now realizes it doesn't have to keep running after it hits the end?

It's still much faster in 1.7.M02 than in 1.6.1, but the 1.6.1 performance is almost bearable. However, my data set is still small, and so almost bearable (1.2 seconds) makes me very nervous for when I actually hit it with a large data set.

Michael Hunger

unread,
Apr 4, 2012, 7:08:40 PM4/4/12
to ne...@googlegroups.com
What I meant was: if the order is not important, you don't need the variable-length path at all.

And if you restrict the granularity of your query, e.g. to one hour or one day or one month, you just traverse from there down to the articles:

START root=node:time("root:root") 
MATCH common_path=root-[:`2012`]->year-[:`3`]->month-[d]->day-[h]->hour, 
hour-[:user_read]->hyperedge-[r:read]->article, hyperedge-[:has_read_in_hour]->user 
WHERE article.blacklist? = false and type(h) =~ /\d+/ and type(d) =~ /\d+/
RETURN article, count(r) 
ORDER BY count(r) DESC 
SKIP 0 LIMIT 20
