performance question


Third Replicator

Jan 12, 2010, 2:41:39 AM
to neo4jrb
Say you had a graph of users, books and favorite books.

Would neo4j be fast at this problem: given a user, find other users
with similar favorites.

Say, 100 million users, 1 million books, 100-1000 favorites on average
up to 10,000 favorites.

Could you fit all that on one server?

How many seconds would it take to run a query like that?

would you need to do any extra work for caching or is Neo4j's smart
caching smart enough?

Andreas Ronge

Jan 12, 2010, 5:54:00 AM
to neo...@googlegroups.com
Hi

One nice thing about Neo4j is that performance does not depend on the
size of the database. You will get around 1,000-3,000 traversals or
property reads per ms, independent of the size of the database.
So it's up to your application to organize the node space so that
the database can be queried quickly by doing traversals.

Regarding what can fit in one machine I found this on their wiki:
"With normal rotating media here are some guidelines: Laptop 1-2 GB
RAM handles tens of millions of primitives. A standard server 4-8 GB
RAM handles hundreds of millions of primitives. More expensive servers
with 16-32GB RAM can handle billions of primitives. With Solid State
Drives (SSDs) you can handle larger graphs on less RAM."

/Andreas

> --
> You received this message because you are subscribed to the Google Groups "neo4jrb" group.
> To post to this group, send email to neo...@googlegroups.com.
> To unsubscribe from this group, send email to neo4jrb+u...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/neo4jrb?hl=en.
>
>
>
>

David Beckwith

Jan 12, 2010, 11:40:31 AM
to neo...@googlegroups.com
> size of the database. You will have around  1000-3000 traversals or
> property gets per ms (reads) independent on the size of the database.

Per ms?! So that's 1,000,000 - 3,000,000 reads per second?

David Beckwith

Jan 12, 2010, 11:41:58 AM
to neo...@googlegroups.com
that's fast.

David Beckwith

Jan 12, 2010, 12:26:29 PM
to neo...@googlegroups.com
I think the problem is that if you have 1,000,000 people loving the
Harry Potter books, you've got to hit the other users through the books
several million times, which would mean at least 1 second per user, and
there are only 2.6 million seconds in a month.

Andreas Ronge

Jan 12, 2010, 1:26:29 PM
to neo...@googlegroups.com
But if you have 1,000,000 people loving that book, then it is really
easy to cache that HTML page, so you don't need to ask/traverse Neo4j
at all. (The closer the caching is to the user, the better.)

David Beckwith

Jan 12, 2010, 2:39:52 PM
to neo...@googlegroups.com
Yes, caching is a good strategy as long as you can construct the cache
in a reasonable amount of time and you're not updating the cache n^2
times more often than you are viewing it. The issue, I think, is the
complexity of:

1. Creating the cache initially (e.g. if it takes 3 years to construct
it, it's probably not that useful)
2. Updating the cache (i.e. we don't want to bog the server down
checking whether the cache needs to be updated or not)

David :)

Tobias Ivarsson

Jan 12, 2010, 5:30:03 PM
to neo...@googlegroups.com
Hi,

Short answer: No, this dataset would not fit on one server.

100-1000 favorites per user would mean an expected value of 549.5 favorites per user (if a random distribution is assumed). Multiplied by 100M users, that's a total of 54.95 billion favorite-relationships. This is out of range of the current implementation (although a future version might support it). From a disk-space point of view there are certainly server disk arrays that can store that many relationships: 54.95 billion relationships would consume just over 1.8TB of disk space. Then there is storage for the nodes, and for the properties you might want to store with the nodes and relationships, in addition to that 1.8TB.
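Tobias's storage figure can be reproduced with a quick back-of-envelope calculation. The ~33 bytes per relationship record below is an assumption inferred from his 1.8TB figure, not a number stated in this thread:

```ruby
# Back-of-envelope storage check; bytes_per_rel is an assumed value
# inferred from the 1.8TB figure above, not an official constant.
users         = 100_000_000
avg_favorites = 549.5                     # expected value quoted above
relationships = users * avg_favorites     # ~54.95 billion
bytes_per_rel = 33                        # assumption
terabytes     = relationships * bytes_per_rel / 1e12

puts relationships        # 54.95 billion favorite-relationships
puts terabytes.round(2)   # ~1.81 TB for the relationships alone
```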

As for the speed of traversing to find users with similar favorites:
First of all, your definition of "similar favorites" is vague, so I'll interpret it as users with the same favorites.
The problem with this query is the sheer number of such users:

If each user has on average F favorites, and there are U users and B books in the system, then each book will on average be favorited by F' = F*U/B users. With your numbers (U=100M, B=1M, F=100-1000) this means each book is favorited by 10,000-100,000 users. The expected number of users that favor a particular book would be 54,950.

This means that when starting from one user, finding all users with the same favorites would need to traverse from that user to 100-1000 books, and from each book to 10,000-100,000 other users. That is a total of 1 million - 100 million users, with the expected size of the result set being just over 30 million. These are likely not 1M-100M unique users, but since filtering out duplicates is an additional operation, the fact that the result contains duplicates will not improve the traversal speed.
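The fan-out arithmetic above can be restated as a tiny script, using only the numbers already given in the post:

```ruby
# F' = F * U / B: expected number of users favoriting each book,
# and the number of user nodes a two-hop traversal would touch.
u = 100_000_000   # users
b = 1_000_000     # books
[100, 1000].each do |f|
  f_prime = f * u / b     # fans per book: 10,000 or 100,000
  visited = f * f_prime   # two-hop endpoints: 1 million or 100 million
  puts "F=#{f}: #{f_prime} fans per book, #{visited} users visited"
end
```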

However, your question didn't state that you were interested in finding all such users. If you were satisfied with getting only five random such users, you would simply do a depth-first traversal and stop after finding as many (five) users as you wanted, paying no penalty for the fact that there are millions of other matching user nodes.
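The early-stopping idea can be sketched in plain Ruby over a hypothetical in-memory adjacency Hash (this is not the neo4j.rb traversal API; the `similar_users` helper and the node names are invented for illustration):

```ruby
# Depth-first, two-hop walk (user -> book -> user) that stops as soon
# as `limit` matches are found, so cost scales with `limit`, not with
# the millions of potential matches.
def similar_users(graph, user, limit: 5)
  found = []
  graph[user].each do |book|        # hop 1: user -> favorite book
    graph[book].each do |other|     # hop 2: book -> co-favoriting user
      next if other == user || found.include?(other)
      found << other
      return found if found.size >= limit
    end
  end
  found
end

graph = {
  "alice" => ["hp1"],
  "hp1"   => ["alice", "bob", "carol", "dave"],
}
p similar_users(graph, "alice", limit: 2)   # => ["bob", "carol"]
```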

When traversing through the graph, Neo4j loads the nodes and relationships lazily as they are needed. This means that the speed of your traversal depends only on the structure you are traversing. In this case the structure is two relationships: one from the initial user to the book, and one from the book to another user. That is two relationship hops, a constant-time operation. If you traverse a fixed number of these (five in the previous paragraph), you end up with a traversal time of the number of result elements multiplied by the traversal time for one. For finding all such users the equation is similar, but the number of result elements is bigger. Either way, the traversal time depends only on the number of elements you actually traverse. The total size of your graph does not affect the traversal time (as it would in a relational database). So if F and F' stayed constant while U increased, you would still get the same traversal speed for your query.

Since my laptop does not have a 2TB disk (and since Neo4j currently does not handle that many relationships) I couldn't run simulations with a dataset distributed exactly according to your specifications. Instead I created two datasets, representing only the users and books seen in one traversal. One dataset where the starting user has 100 favorite books, and each book is the favorite of 10 thousand users (the number of expected users to have a random book as a favorite if each user has 100 favorites). And one dataset where the starting user has 1000 favorite books, and each book is the favorite of 100 thousand users.

The second dataset was too large for Neo4j to fit in its caches when running on my laptop, so I only have traversal figures for running with uncached data. The execution time in this case, for traversing over the entire graph (representing a worst-case query according to your description), was just over 2,000 seconds. That is a traversal speed of 50,000 relationships/second. This is with a pretty cheap mechanical (spinning) disk; an SSD would probably increase the speed at least five times.

The first dataset was much more cache friendly. With cold caches the traversal took slightly less than 9 seconds (ca 115,000 relationships traversed/second), and 0.9 seconds with warm caches (ca 1,150,000 relationships traversed/second). The first improvement, over the cold-cache case of the large dataset, is due to the fact that this dataset fits in the filesystem cache. The Neo4j caches then give an additional 10x performance improvement on top of that, which is what we see in the case with warm caches.
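As a rough consistency check of those figures (dataset shape assumed from the description above: 100 favorite books with 10,000 fans each, i.e. about a million relationship hops):

```ruby
# Relationship hops in the first dataset: 100 user->book hops plus
# 100 books * 10,000 book->user hops (sizes assumed from the post).
hops = 100 + 100 * 10_000
puts hops                  # 1,000,100 hops
puts (hops / 9.0).round    # cold caches: roughly 111,000 hops/second
puts (hops / 0.9).round    # warm caches: roughly 1,111,000 hops/second
```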

Cheers,
Tobias

--
Tobias Ivarsson <tobias....@neotechnology.com>
Hacker, Neo Technology
www.neotechnology.com
Cellphone: +46 706 534857