Edge weights


Glen Maddern

Mar 13, 2013, 3:45:25 AM
to netfli...@googlegroups.com
Spent the day playing with NetflixGraph and you've clearly done some impressive work. I had next to no trouble firing it up in a new SBT project (despite not having done anything on the JVM in ages) and playing around. It's incredibly fast, and the interface (and the source!) is extremely clean.

My first question is about edge weights - as far as I can see they won't fit very well with the design and memory optimisations of NetflixGraph, but I could be wrong. So I'm wondering if there's another way to model them?

Consider something like:

val movieOrdinals = new OrdinalMap[String]

val graph = new NFBuildGraph(new NFGraphSpec(
  new NFNodeSpec(
    "Movie",
    new NFPropertySpec("similar_to", "Movie", MULTIPLE | COMPACT | GLOBAL)
  )
))

val inception = movieOrdinals.add("Inception")
val matrix = movieOrdinals.add("Matrix")

// a final 'weight' parameter here would be ideal
graph.addConnection("Movie", inception, "similar_to", matrix)
graph.addConnection("Movie", matrix, "similar_to", inception)

(if that doesn't work in an email, the gist is here: https://gist.github.com/geelen/5150018)

I'm wondering what other approaches would work here. I'm coming at this from a relational database perspective, so the idea of a 'join table' makes sense to me, but I can't see how it would fit well within a graph.

Any ideas?

Cheers,
-glen.

dkosz...@netflix.com

Mar 13, 2013, 2:34:23 PM
to netfli...@googlegroups.com
Hi Glen,

I'm very happy to hear you found NetflixGraph clean and easy to work with.  

Your question is a very good one.  I think you're right: I can't immediately see a good way to add weighted connections through the existing interface, but it seems like a feature that could come in very handy.

I'm thinking that this could be added to the library without too much difficulty.  Essentially, each connected ordinal needs to be mapped to some value.

In the compact connection set representation, we could interleave the connection weights (also represented as variable-byte integers) with the connected nodes' ordinals.
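As a rough sketch of what I mean (this is illustrative code only, not anything from the library - the class and helper names are made up):

```java
import java.io.ByteArrayOutputStream;

// Illustrative sketch (not NetflixGraph code): each connection is stored as a
// var-byte gap to the previous connected ordinal, immediately followed by its
// var-byte weight.
public class WeightedVarByte {

    // Write one non-negative int as 7-bit groups; the high bit marks
    // continuation bytes.
    static void writeVarInt(ByteArrayOutputStream out, int v) {
        while (v >= 0x80) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
    }

    // Read a var-byte int starting at pos; returns {value, nextPos}.
    static int[] readVarInt(byte[] data, int pos) {
        int v = 0, shift = 0;
        while ((data[pos] & 0x80) != 0) {
            v |= (data[pos++] & 0x7F) << shift;
            shift += 7;
        }
        v |= (data[pos++] & 0x7F) << shift;
        return new int[] { v, pos };
    }

    // Encode (ordinal, weight) pairs, ordinals sorted ascending,
    // as [gap, weight, gap, weight, ...].
    static byte[] encode(int[][] pairs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int[] p : pairs) {
            writeVarInt(out, p[0] - prev); // gap-compressed connected ordinal
            writeVarInt(out, p[1]);        // weight interleaved right after it
            prev = p[0];
        }
        return out.toByteArray();
    }

    // Decode count pairs back out of the byte array.
    static int[][] decode(byte[] data, int count) {
        int[][] pairs = new int[count][2];
        int pos = 0, prev = 0;
        for (int i = 0; i < count; i++) {
            int[] r = readVarInt(data, pos);
            prev += r[0];                  // undo the gap encoding
            pairs[i][0] = prev;
            r = readVarInt(data, r[1]);
            pairs[i][1] = r[0];
            pos = r[1];
        }
        return pairs;
    }
}
```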

For the hashed connection set representation, I have something similar to the "variable-byte hashed integer array" which acts as a hash map between two variable-byte integers.  At a very high level: it hashes the key to a byte position in the array, then appends the value immediately after the key is written.  It steals two bits (instead of one) from the begin byte of each token, and the second stolen bit indicates whether the token is a key or a value.

I can't immediately see how to map connected ordinals to weights with the bit set representation; it's possible that weighted connections just won't ever collapse into bit sets.

Send me an email, and I'll put together a quick prototype of this feature for you.  Unfortunately, I can't promise I'll be able to get to it before this weekend.

Drew.

Glen Maddern

Mar 14, 2013, 4:42:02 AM
to netfli...@googlegroups.com
Ideally, I'd like to be able to iterate over the connections in order of descending weight, so I can implement things like movie similarities and grab the N movies most correlated with X. But I understand that storing the connections in ascending id order is the key to saving bytes in the variable-length delta encoding, so I'm not sure if that's possible.

There's also the issue that, while the connections themselves may be stored extremely efficiently, if the weights are arbitrary (say, double-precision floats) then the memory cost of each connection blows out. So, would you anticipate using integer or even natural-number weights?

At the moment I'm experimenting with a join object - returning the full set of similarities and then sorting, rather than returning the first few from an already-sorted set. So far it's performing OK, but I wonder if I'll need to do some careful pruning of the tree to keep the connection sets small as the graph grows.

Cheers,
-glen.

dkosz...@netflix.com

Mar 14, 2013, 2:57:08 PM
to netfli...@googlegroups.com
Hi Glen,

I apologize if this is a duplicate.  I thought I sent a reply but it's not showing up.  I'm new to Google Groups so I probably just hit the wrong button somewhere.

The way I was describing it, you would not be able to retrieve the connections in descending weight order without iterating over the entire set.  I'm guessing you would only be interested in the top m items, so you would likely want to use a bounded heap for this, rather than pulling everything into memory and sorting.  Besides using less memory, this gives you O(n log m) performance rather than O(n log n).  Let me know if you need more details here.
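For example, with a size-m min-heap (plain Java, nothing NetflixGraph-specific - the class name and the (ordinal, weight) pair layout are just for illustration):

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Sketch: keep only the m heaviest connections in a size-m min-heap,
// giving O(n log m) instead of the O(n log n) of a full sort.
public class TopM {

    // entries are (ordinal, weight) pairs; returns the ordinals of the m
    // largest weights, heaviest first.
    static int[] topM(int[][] entries, int m) {
        PriorityQueue<int[]> heap =
            new PriorityQueue<>(Comparator.comparingInt((int[] e) -> e[1]));
        for (int[] e : entries) {
            heap.offer(e);
            if (heap.size() > m) heap.poll(); // evict the current lightest
        }
        int[] result = new int[heap.size()];
        for (int i = result.length - 1; i >= 0; i--) {
            result[i] = heap.poll()[0];       // min-heap pops lightest first
        }
        return result;
    }
}
```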

I was thinking of storing the weights as variable-byte integers, so you'd likely want to normalize the weights yourself, depending on the resolution you need.  Because of the encoding, you get the most efficient use of the available bits by normalizing to a range of 0 to 127, 0 to 16383, or in general 0 to 2^(7n) - 1.
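The normalization step itself is simple - something like this (illustrative only; pass 127 for one-byte weights or 16383 for two-byte resolution):

```java
// Sketch of linearly rescaling arbitrary double weights into 0..max so each
// weight fits the var-byte budget you picked. Not library code.
public class WeightNormalizer {

    static int[] normalize(double[] weights, int max) {
        double lo = Double.POSITIVE_INFINITY, hi = Double.NEGATIVE_INFINITY;
        for (double w : weights) {
            lo = Math.min(lo, w);
            hi = Math.max(hi, w);
        }
        int[] out = new int[weights.length];
        if (hi == lo) return out; // all weights equal: everything maps to 0
        for (int i = 0; i < weights.length; i++) {
            out[i] = (int) Math.round((weights[i] - lo) / (hi - lo) * max);
        }
        return out;
    }
}
```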

If you know you'll mostly be retrieving this data in descending weight order, you may want to make a custom modification to the library which, at compression time, sorts the normalized connections by descending weight and then uses a "reverse" delta compression on the weights (when reading, you subtract the deltas instead of adding them).  This way you would at least get the benefit of gap compression on the weights, though not on the connected ordinals - the inverse of what I was originally suggesting.
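The two halves of that reverse-delta idea might look like this (again just a sketch, not library code - a real version would var-byte encode the deltas rather than keep them in an int array):

```java
// Sketch of "reverse" delta compression: weights arrive sorted descending, so
// we store the first weight and then prev - cur gaps; the reader subtracts the
// deltas back off to recover the original weights.
public class ReverseDelta {

    static int[] encode(int[] descendingWeights) {
        int[] deltas = new int[descendingWeights.length];
        int prev = 0;
        for (int i = 0; i < descendingWeights.length; i++) {
            deltas[i] = (i == 0) ? descendingWeights[0]
                                 : prev - descendingWeights[i];
            prev = descendingWeights[i];
        }
        return deltas;
    }

    static int[] decode(int[] deltas) {
        int[] weights = new int[deltas.length];
        int prev = 0;
        for (int i = 0; i < deltas.length; i++) {
            weights[i] = (i == 0) ? deltas[0] : prev - deltas[i];
            prev = weights[i];
        }
        return weights;
    }
}
```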

Of course, if you use the hash format you don't get gap compression anyway, and you won't be able to retrieve the results in any sorted order, but you would be able to determine the weight of any given connection in constant time.

Drew.