Account Options

  1. Sign in
The old Google Groups will be going away soon, but your browser is incompatible with the new version.
Google Groups Home
« Groups Home
Message from discussion [Cypher] Analytics queries slow when graph size increases
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
 
From:
To:
Cc:
Followup To:
Add Cc | Add Followup-to | Edit Subject
Subject:
Validation:
For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon. Listen and type the numbers you hear
 
Michael Hunger  
View profile  
 More options Mar 30 2012, 12:06 pm
From: Michael Hunger <michael.hun...@neotechnology.com>
Date: Fri, 30 Mar 2012 18:06:48 +0200
Local: Fri, Mar 30 2012 12:06 pm
Subject: Re: [Neo4j] [Cypher] Analytics queries slow when graph size increases

Michael,

do these analytic queries to be real-time?

So what we do in this case (a more graphy approach):

we created a time-tree structure in the graph: see http://blog.neo4j.org/2012/02/modeling-multilevel-index-in-neoj4.html

And then we would connect articles on read with a relationship to the day-leaf-node. And on the relationship you can count multiple reads (after each day ended you can even optimize that by keeping only the top-n "most-read" article relationships and dropping the other ones due to insignificance (after all after that day this is static, history data).

To determine the most often read articles you would go down the time-tree to the leaf nodes and follow them (via the next-relationships) from start to end.

There (depending on your interval) you only have a few nodes (max 365 per year) to look at. Fetching the most-read relatonships from those is simple as well and then you just have a few article nodes to aggregate on, order and take the top.

In general this would support the local-traversal use-case as the few local nodes to start from are the day-nodes of your time-tree from those you can traverse directly via semantically relevant relationships to the articles.

As the graphdb moves the cost of creating relationships(joins) to insertion time to allow following them at query time quickly one should leverage that and introduce use-case specific, additional relationships into the graph that enrich your domain and make your use-cases fast. Also pre-aggregation of event streams (like your "read"-events is a common approach here and linking the "aggregated" event data to the dimensions you want to query from later enriches your ability to search.

These are also the approaches discussed in graphity.

HTH

Michael

Am 30.03.2012 um 17:50 schrieb Michael Rehse:

> Awesome, thanks for the quick responses!

> @Jim: I'm not sure what you mean by keeping Neo4j alive between analytical requests. We're running Neo4j as a server, and thus, we shouldn't have problems keeping the cache warm, should we?

> @Michael: The use case is to find the 20 most-read articles since a given timestamp. Thus (coming from a SQL point of view), we'd want to take the whole dataset, filter out reads outside of our timeframe, then aggregate those reads based on the article. In SQL, I'd add an index on r.timestamp to make that portion of the query more efficient... and I was under the impression that indexes in Cypher don't work that way. Do you mean something else when you say "in-graph indexes"? I haven't had a chance to check out that link in any detail yet, so I apologize if the answer to that is in there...

> Obviously, there is a large amount of data here, and as such, it will take some work to make sure it's performant. I like Cypher a whole lot, but I also have heard that it might be worth it to try writing the query in Gremlin instead... is that something that you'd recommend?

> Thanks for all your help!

> On Friday, March 30, 2012 11:26:21 AM UTC-4, Michael Hunger wrote:
> Additionally,
> #3 doing the aggregation pulls the whole dataset into memory.
> #4 and ordering does this too (and sorts it in memory)
> #5 then you state that you're only interested in the top 20.

> what is the actual use-case for this query ??

> You might be interested into adding additional in-graph indexes (and linking) inside of your graph to put things like an order or ranking directly into your graph (and update the graph when a measure changes).

> An interesting discussion around this is the graphity approach by Rene Pickard:
> http://www.rene-pickhardt.de/graphity-an-efficient-graph-model-for-re...

> Am 30.03.2012 um 17:10 schrieb Jim Webber:

> > Hi Michael,

> > You're not doing anything wrong, you're just getting the performance exactly as you'd expect with the algorithm that you're using.

> > START article=node:article("article_id:*")

> > Let's start here - you're asking for all articles. That's probably O(n) in complexity (I don't know how Lucene works internally).

> > Then you do this:

> > MATCH article<-[r:read]-user
> > WHERE r.timestamp > {start_time} AND article.blacklist? = false

> > That's O(n) in complexity too - every article has to be checked to see if it matches these constraints.

> > Why is this slow? Because when you access nodes sequentially like this, it's unlikely that any of them will be in neo4j's cache meaning that you'll end up going to disk most of the time (and seek time on mechanical disk is really latent).

> > The way that you get performance with these kind of graph global algorithms (the same class as PageRank) is to keep your cache warm, and you do that by keeping Neo4j alive between analytical requests. Otherwise in Neo4j - as any other database - you're going to be IO bound.

> > There may also be things that can help in using Cypher, or its implementation, but I'll defer to Andres on that.

> > HTH.

> > Jim


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.