Re: [TinkerPop] Count a vertex's incident edges without actually loading them? (It has > 100,000 edges)


Marko Rodriguez

Oct 3, 2012, 7:36:56 PM
to gremli...@googlegroups.com
Hi Chris,

I'm not certain of exactly how Titan does it, but you can do:

v.query().labels('knows').count()

That may provide a count without loading. Try it and report back if you don't mind.

You can read more here:

Good luck,
Marko.


On Oct 3, 2012, at 4:51 PM, Chris Roth wrote:

I'm trying to figure out if it's possible to count a vertex's incident edges without actually loading them. The reason I don't want to load them is that the vertex is a super-node with over 100,000 edges. Attempting to traverse them, e.g. "v.outE()" in Gremlin, just times out because it can't load that many edges.

If it helps, I'm using Titan with Cassandra as a backend, so this may be more of a Titan question than a Gremlin question. I'm also open to doing this in Blueprints, or even at a lower level, if possible.

Thanks!


Matthias Broecheler

Oct 3, 2012, 8:07:21 PM
to gremli...@googlegroups.com
Hey Chris,

If you need an exact count of the number of edges, then you currently need to load all edges.
There are two improvements coming in Titan:
1) switching to an iterator storage model, which should allow you to load all those edges without a timeout (next version)
2) support for counters and degree approximation (future version)

1) is not going to help you much in a real-time environment because you are still pulling a lot of data over the wire. If you just need to know whether a vertex is a supernode (e.g. to avoid it in a traversal), you can do a v.query().limit(L+1)....,
where L is the number such that every vertex with more than L edges is considered a supernode. Then you can check whether count==L+1 to know whether you have found a devil.
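To make the limit trick concrete, here is a minimal language-agnostic sketch in Python (the real call would be Titan's v.query().limit(L+1).count(); count_edges_limited, is_supernode, and the limit of 10,000 are hypothetical stand-ins):

```python
SUPERNODE_LIMIT = 10_000  # hypothetical L: more than L edges means "supernode"

def count_edges_limited(edges, limit):
    """Stand-in for v.query().limit(limit).count(): stop scanning at `limit`."""
    n = 0
    for _ in edges:
        n += 1
        if n == limit:
            break
    return n

def is_supernode(edges, L=SUPERNODE_LIMIT):
    # Ask for at most L+1 edges; getting exactly L+1 back means more than
    # L exist, without ever pulling the full edge list over the wire.
    return count_edges_limited(edges, L + 1) == L + 1
```

The point is that the scan cost is capped at L+1 edges no matter how large the vertex actually is.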

Another approach would be to use Faunus in a batch job to determine the vertex degree.

Hope some of this helps,
Matthias




--
Matthias Broecheler, PhD
http://www.matthiasb.com
E-Mail: m...@matthiasb.com

Chris Roth

Oct 4, 2012, 12:54:31 AM
to gremli...@googlegroups.com
Marko - I think I gave that a try yesterday and found that it timed out just like v.outE(), but I'll give it a try again in case I missed something.

Matthias -

I'm evaluating Titan for use in our analytics engine, and I do need the exact count since it's one of our statistics, so I think using Faunus might be my best bet here. I'm setting up a test cluster to give it a try now. Do you have any rough idea of what the performance of counting 100K edges might be, or any specifics on how it loads all of the edges under the hood?

I'm interested in finding some in-depth details on the philosophy behind how Titan is storing edges/vertices and how it indexes properties. That said, I know it's quite new so I'm guessing my best bet is to read the source code.

Anyway, thank you so much. I think I've got some ideas to work with here :) Also excited about the upcoming improvements!

Chris

Marko Rodriguez

Oct 4, 2012, 10:11:46 AM
to gremli...@googlegroups.com
Hi,

I'm evaluating Titan for use in our analytics engine, and I do need the exact count since it's one of our statistics, so I think using Faunus might be my best bet here. I'm setting up a test cluster to give it a try now. Do you have any rough idea of what the performance of counting 100K edges might be, or any specifics on how it loads all of the edges under the hood?

General rule:
If you are doing g.v(...), use Titan.
If you are using g.V, use Faunus.

Basically, there is a tipping point: once a traversal touches too much of the graph, Faunus will be faster.

If I remember correctly, on an 8-machine m1.xlarge cluster, you can get a ~500 million edge graph through Faunus in ~5 minutes (that is the map load from Titan/HBase). How long your Gremlin traversal will take depends, of course, on its complexity and ultimately the number of MapReduce jobs it compiles to.

If you can say a bit more about what you are trying to do, I can provide you better information as to how Faunus will perform.

Also, if you dive deep into Aurelius tech, it would be best to use Aurelius' mailing list for further discussion.


Good luck,
Marko.



Jonathan Haddad

Oct 4, 2012, 10:59:27 AM
to gremli...@googlegroups.com
Is there a rule of thumb you guys follow for the max number of edges you'd want to traverse? How long is Gremlin's timeout? We're going to need to do something similar here.

Matthias Broecheler

Oct 5, 2012, 1:02:38 AM
to gremli...@googlegroups.com
Hey Chris,

If you use Faunus, then loading in those edges will be pretty fast (per vertex; of course, the more vertices you have, the longer it will take in total). That's because Titan keeps all edges for one vertex sequentially on disk, so loading them requires only sequential disk access, which is very fast. (If you use Titan+Cassandra, running a major compaction before doing a bunch of Faunus jobs will help performance a lot.)

Note, however, that Faunus is still very young and we have not yet extensively tested it with supernodes, so some optimization might be necessary to avoid blowing out your heap. But with 100K edges and reasonable hardware, that should not be the case.

Cheers,
Matthias

Matthias Broecheler

Oct 5, 2012, 1:05:01 AM
to gremli...@googlegroups.com
Hey Jon,

It very much depends on what latency you want to have. Obviously, the fewer edges, the faster the response. If you are pulling in so many edges that you are getting close to timeouts, then your latencies under stress will likely go through the roof, so timeouts won't be your biggest worry.

If you need real time access to lots of edges on a vertex, we highly recommend using vertex centric queries and primary keys.

HTH,
Matthias

Chris Roth

Oct 27, 2012, 1:53:24 PM
to gremli...@googlegroups.com
Marko, Matthias,

Thanks so much for the detailed responses! This is really interesting stuff. So it sounds like Faunus can do what I'm looking for just fine, with the caveat that it isn't real time since it takes some time to spin up Hadoop jobs.

Just had a random thought for you, though: would it be feasible to use a probabilistic data structure to estimate the degree of a supernode? In my case, if I could get within 3% of the actual number of edges in real time, I would be super happy.

The thought is still kind of fuzzy in my mind, but wanted to throw it out there :)
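(For what it's worth, approximate counters along these lines are a classic technique. A minimal Python sketch of a Morris-style approximate counter, purely to illustrate the idea; the function name and parameters are made up, and Titan's planned degree approximation may work quite differently:)

```python
import random

def morris_estimate(n_events, n_counters=32, seed=7):
    """Approximate a large count (e.g. a vertex degree) with tiny counters.

    A Morris counter holds only c ~ log2(n) after n events: each event
    bumps counter c with probability 2**-c, and 2**c - 1 is an unbiased
    estimate of n. Averaging independent counters tightens the variance
    (relative error roughly 1/sqrt(2 * n_counters)).
    """
    rng = random.Random(seed)
    counters = [0] * n_counters
    for _ in range(n_events):
        for i in range(n_counters):
            if rng.random() < 2.0 ** -counters[i]:
                counters[i] += 1
    # Average the per-counter estimates.
    return sum(2.0 ** c - 1.0 for c in counters) / n_counters
```

With 32 counters the estimate typically lands within roughly 12% of the true count; hitting 3% would take a few hundred counters (or a different sketch), but each counter is only a handful of bits.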

Marko Rodriguez

Oct 27, 2012, 3:13:17 PM
to gremli...@googlegroups.com
Hi Chris,

With Titan, you get the benefit of vertex-centric indices (as Matthias says). You can learn more about those benefits here:

For instance, Gremlin will automagically compile down to use vertex-centric queries. See:

gremlin> g.v(1).outE('knows').has('weight',1.0).toString()
==>[StartPipe, OutEdgesPipe(knows), PropertyFilterPipe(weight,EQUAL,1.0)]
gremlin> g.v(1).outE('knows').has('weight',1.0).inV.toString()
==>[StartPipe, QueryPipe(out,[knows],has:true,interval:false,limit:-1,edge), IdentityPipe, InVertexPipe]
If you provide no discriminating information (edge label, property values, etc.), then the best you will get is a linear scan through the incident edges -- O(n).

However, what you could do (and this gets to your question) is use Faunus to run a Hadoop job that computes the vertex degrees and then writes them back to the vertices in Titan. Thus, using Faunus/Gremlin:

g.V.sideEffect{it.degree = it.bothE.count()}

This way, you have the vertex degree as a property on the vertices and can use it in your traversal (in Titan/Gremlin) to handle/avoid vertices you consider super nodes. Operationally, you can re-run this job every so often as your graph mutates to keep the degree counts fresh.
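(The precompute-then-filter pattern can be sketched with a plain Python dict standing in for the graph; all names and the limit of 100 are illustrative, and in the real setup the degrees would come from the Faunus job rather than an in-memory comprehension:)

```python
# Toy adjacency list standing in for the Titan graph; names are made up.
graph = {
    "a": ["b", "c", "hub"],
    "b": ["a"],
    "c": ["a"],
    "hub": ["v%d" % i for i in range(500)],  # the supernode
}

# Batch step (Faunus's role): compute every vertex's degree once, offline.
# (Out-degree here, a simplification of the bothE.count() one-liner above.)
degree = {v: len(nbrs) for v, nbrs in graph.items()}

# Traversal step (Titan's role): consult the precomputed degree property to
# skip supernodes without ever touching their incident edges.
def neighbors_avoiding_supernodes(v, limit=100):
    return [n for n in graph.get(v, []) if degree.get(n, 0) <= limit]
```

The traversal pays only a property lookup per neighbor, regardless of how many edges the supernode actually has.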

However, unfortunately, it will not be until Faunus 0.1 is released that writing back to the Titan graph will be possible. Right now Faunus is read-only.

HTH,
Marko.
