Thread-safe NFBuildGraph?

129 views
Skip to first unread message

Steve Reed

unread,
Mar 14, 2013, 7:08:35 PM3/14/13
to netfli...@googlegroups.com
I didn't see any mention of thread-safety, or a lack thereof, in the javadoc. Looking through the code, though, it seems the NFBuildGraph is not thread-safe.

We just replaced a different third-party graph database with netflix-graph in production and it is working great. Serializing a compressed graph into a read-only in-memory version fits our use case perfectly. I wish you guys had announced this one week earlier.

However we're currently CPU-limited with the graph building, and if possible I'd like to spread the work out among different threads. This would be especially simple in my case because I build our graph out from multiple sources. As it is now I'm building a few dozen million connections between 5 types of nodes, some that number in the millions, in about 20 minutes on a single thread. Better than I was able to do with the other java-based graph libraries and databased I've tried, so again, thanks.

Naturally I could just synchronize access to the graph but that would create too much contention. My thought is using something like Guava's Striped<Semaphore> to increase throughput. I will probably do something along these lines outside of the graph if push comes to shove.

What complicates my use-case is that I want to add one and only one connection between certain nodes, so I have to check for one before adding it lest I duplicate it. Perhaps there's a simpler approach that I missed.

dkosz...@netflix.com

unread,
Mar 14, 2013, 10:11:01 PM3/14/13
to netfli...@googlegroups.com
Hi Steve,

Wow, we're really glad to see such fast adoption in a production environment!  That's great news!

You're correct that NetflixGraph is currently not thread-safe.  Thank you for pointing out that the documentation doesn't mention this.  I will correct this oversight on the wiki.

When you are checking to see whether or not a connection already exists, do you get an OrdinalSet from the NFBuildGraph, then ask contains()?  Do you have a feel for whether or not this (in aggregate) is the most expensive operation in your build?  I'd be very interested to see it if you could hook up a profiler like YourKit or jvisualvm to your build process to find out where the time is being spent.  20 minutes seems like a very long time to be CPU bound.

The Striped<Semaphore> applied outside of the NFBuildGraph will help you get to the "addConnectionIfAbsent" atomic operation, but I think you may still run into issues unless you make the NFBuildGraphNodeCache thread safe as well.  Would you feel comfortable making a modification to the NFBuildGraphNodeCache to get that thread safety?  I think using a ConcurrentHashMap for nodesByOrdinal with appropriate use of putIfAbsent, plus synchronized access to the List returned from getNodes() is all that is required.  If you want to go all the way and make NFBuildGraphNode thread-safe as well, it would make a great first external pull request :).  If you'd prefer I made the changes, that's ok too, I can put it on my list...just let me know.

Thank you for reaching out, I am very glad to hear NetflixGraph is useful for you!

Drew.

Steve Reed

unread,
Mar 15, 2013, 10:55:54 AM3/15/13
to netfli...@googlegroups.com
As you guessed I'm doing builder.getConnectionSet(...).contains(...).

I can very easily run this in yourkit and will try to get to it later today. I just realized I have some very highly connected nodes in my graph and I'm using COMPACT all over the place. Perhaps I should switch to HASH? (Anecdotally I haven't noticed the graph building process slow down, its performance seems pretty constant). Do separate connection models improve performance at all?

I'll take a look at your suggestions for getting some thread-safety introduced.

dkosz...@netflix.com

unread,
Mar 15, 2013, 12:45:19 PM3/15/13
to netfli...@googlegroups.com
Hi Steve,

COMPACT vs HASH will only help with runtime contains() performance (e.g. on the NFCompressedGraph), the build performance will remain constant.

Thank you for running the profiler, I'm interested to see the results.  What I suspect may be happening is:  contains() for an OrdinalSet returned from an NFBuildGraph always requires a full iteration of the set.  If you just add connections, each add is an O(1) operation on the NFBuildGraph, but if you check beforehand to see whether or not the item is there, each add is O(n).  Of course, we won't know whether or not this is the case until we see the profiler.  Hopefully the profiler shows most of the time being spent in the NFBuildGraphOrdinalSet.contains() method.

If it does, here are a couple of ideas:  Is there a way you could use a hash data structure for the connections you've already added to a specific node, and use that data structure for the contains() query instead of the NFBuildGraph itself?  Depending on how you retrieve your connections to be added to the build graph in the first place, is it possible to get the originating node of each connection in sorted order?  That way you would only need one of these hash data structures in memory at a time.  If you can't get the input in sorted order, then a Bloom filter may be a memory-efficient way to help you reduce the number of contains() queries you need to perform, by letting you know that a connection is definitely not already in your graph (it will tell you that either an item is not or might be in the graph -- the more bits you use for the filter, the fewer false positives you'll get).  This will only be very effective if most of your connections are not duplicates.

Hope this helps!

Thanks again,
Drew.

Steve Reed

unread,
Mar 15, 2013, 1:27:57 PM3/15/13
to netfli...@googlegroups.com
Well, some good news is that running the profiler identified that too much of the time was in some of my own code, so I've cleaned that up. :) Overall it's taking about 15 minutes now to process about 34,000,000 lines of input data. 

I haven't profiled the builder towards the end of the process, but in the beginning most time (with the com.netflix packages) is spent in addConnection. 4.140 seconds over 576,797 calls. I'm using yourkit, and tracing not sampling. NFBuildGraphOrdinalSet.contains accounts for 0.294 seconds over 660,043 calls.

I also suspect, though, that towards the end of the process the contains() methods should be slowing down for those very highly connected nodes I mentioned. I'll try to get some data about that and keep you posted. If that is the case, I will definitely use a bloom filter. I don't get things in any predictable order. Most of my connections are dupes.

Thanks for your help.

dkosz...@netflix.com

unread,
Mar 15, 2013, 2:29:40 PM3/15/13
to netfli...@googlegroups.com
I also suspect, though, that towards the end of the process the contains() methods should be slowing down for those very highly connected nodes I mentioned

I think you're right, but the data will confirm.

The bloom filter might not be as effective if you have mostly duplicates.  It can only tell you whether or not a connection has definitely not already been added.  It can't tell you for sure that a connection has already been added; since it can give false positives, if it tells you a connection might already be there, you'll still have to do the contains() to be sure.

So, we may be back to thread safety and an atomic "addConnectionIfAbsent" type operation.

Thanks,
Drew.

Steve Reed

unread,
Mar 15, 2013, 3:39:16 PM3/15/13
to netfli...@googlegroups.com
TL;DR: Most CPU time is still outside of netflix-graph in my app, doing some input parsing and splitting, etc. Given that I can't sort the input data before-hand I think my best bet for a speed-up is to build the graph from multiple threads. I can make some gains in parallelizing my splitting and parsing code but until netflix-graph is thread-safe I am still stuck building it from a single thread.

Also TL;DR: Given the load I'm very impressed with the graph building performance. While profiling it I found it to be very very efficient. Good work.

I've finished profiling. At the end of the graph-building process there is still more time spent in addConnection than contains. NFBuildGraphNodeCache.get ends up showing up near the top for addConnection and getConnectionSet. I think perhaps I was mistaken in thinking it was the few highly connected nodes that were a problem, I think it might be the node count leading to very large hash maps. Nothing much to be done about that I imagine.

method, time, %, own time, invocation count

addConnection, 47615, 24%, 1, 906018
getConnectionSet, 27746, 14%, 3, 1202328
contains, 6805, 3%, 6805, 1201423


Reply all
Reply to author
Forward
0 new messages