[Blog] A Solution to the Supernode Problem


Marko Rodriguez

Oct 25, 2012, 4:04:46 PM10/25/12
to gremli...@googlegroups.com
Hello,

Blueprints 2.x introduced the concept of vertex queries. I thought many of you might be interested in this post, which discusses the supernode problem and how it is overcome for property graphs, with a specific focus on Titan and its vertex-centric indices.


The results are pretty profound... a testament to this technique.
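For intuition, the core idea can be sketched independently of any particular database: if a vertex keeps its incident edges sorted by (label, sort key), a range query over a supernode touches O(log n + matches) entries instead of scanning all n edges. A toy Python sketch (the class and names are hypothetical, not Titan's actual API):

```python
import bisect

class ToyVertex:
    """Keeps incident edges sorted by (label, sort_key) to mimic a
    vertex-centric index. Hypothetical sketch, not Titan's API."""

    def __init__(self, edges):
        # edges: iterable of (label, sort_key, neighbor_id); sort once up front
        self._edges = sorted(edges)

    def query(self, label, lo_key, hi_key):
        # Binary search narrows the scan to the matching slice,
        # so the cost is O(log n + matches) even on a supernode.
        lo = bisect.bisect_left(self._edges, (label, lo_key, ""))
        hi = bisect.bisect_right(self._edges, (label, hi_key, "\uffff"))
        return [nid for _, _, nid in self._edges[lo:hi]]

# A 100,000-degree "supernode": the query still touches a tiny slice.
v = ToyVertex(("follows", i % 100, "user%d" % i) for i in range(100_000))
recent = v.query("follows", 10, 12)  # edges whose sort key is 10, 11, or 12
```

The real thing pushes this sorted structure down into the storage layer, but the complexity argument is the same.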

Take care,
Marko.

daniel...@gmail.com

Oct 25, 2012, 6:10:12 PM10/25/12
to gremli...@googlegroups.com
Thanks for the good read, Marko!  In bio, supernodes happen when you try to co-locate the ontology with the actual data (as opposed to saying type= as a property).  Attributes that type things are usually bad for real queries... but I have always implemented them that way because of the supernode issue.  Love the index idea, uber-useful!

Oh, and Dan's a good name ;)

Dan

Sent from my iPhone

Marko Rodriguez

Oct 25, 2012, 6:14:38 PM10/25/12
to gremli...@googlegroups.com
Hey Dan,

> Thanks for the good read Marko!  In Bio Super Nodes Happen when you try to co-locate the Ontology in with the actual data (as opposed to saying type= as a property).  Attributes that type things is usually bad for real queries... but I have always implemented that because of the super-node issue.  Love the index idea, uberuseful!

Thanks, man. Note that Matthias is the creator of the vertex-centric index idea; I simply benchmarked it and wrote it up. I was amazed at the results. Not only were they crazy performant, but they also perfectly matched complexity theory. It is rare that theory and practice align so well.

Finally, I was looking for one more domain with supernodes and the problems they cause, but the post got too long... As such, it covers only file-transfer networks and Twitter. However, bio would have been a good one. Feel free to send it around to your bio-buddies.

Take care and thanks for the stoke Dan,
Marko.






Eddy

Oct 26, 2012, 10:43:48 PM10/26/12
to gremli...@googlegroups.com
Hi Marko,

I haven't had a look at Titan yet, but I was wondering: are there any plans to provide a solution for hyperedges?

Pablo Pareja

Oct 27, 2012, 6:59:48 AM10/27/12
to gremli...@googlegroups.com
Hi Marko,

This is such great news!!
I've been waiting for this in Neo4j for way too long (more than a year and a half...), and it's extremely important for the projects I'm developing, such as Bio4j.
Quick question: how hard would it be to move a DB from Neo4j to Titan?

Thanks!

Pablo




--
Pablo Pareja Tobes 

Luca Garulli

Oct 27, 2012, 11:08:19 AM10/27/12
to gremli...@googlegroups.com
Hi Marko,
this is the same solution we adopted in OrientDB about 10 months ago: 

Lvc@



Marko Rodriguez

Oct 27, 2012, 12:22:24 PM10/27/12
to gremli...@googlegroups.com
Hi Luca,

> this is the same solution we adopted in OrientDB about 10 months ago:

Very cool. What would be great is if OrientGraph implemented its own Query object. Right now in Blueprints, the DefaultQuery is used:

If you create an OrientQuery, then Gremlin (etc.) will be super speedy around supernodes.

That would be super stellar, Luca, and a much appreciated piece of the OrientGraph codebase.

Finally, I have the benchmark code that we used for Titan, so I can show you the statistics for OrientDB with and without Query.
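For intuition, the difference can be sketched in a few lines (a hypothetical toy, not the actual Blueprints interfaces): a DefaultQuery-style implementation pulls every incident edge and filters client-side, while a backend-specific query answers from a per-vertex index:

```python
class ToyVertex:
    """A vertex that stores its edges both as a flat list and in a
    per-(key, value) index. Illustrative only, not the Blueprints API."""

    def __init__(self, edges):
        self._edges = list(edges)  # each edge is a property dict
        self._index = {}
        for e in self._edges:
            for kv in e.items():
                self._index.setdefault(kv, []).append(e)

    def all_edges(self):
        return list(self._edges)

    def index_lookup(self, filters):
        # Intersect the (key, value) buckets instead of scanning everything.
        result = self._edges
        for kv in filters:
            bucket = self._index.get(kv, [])
            result = [e for e in result if e in bucket]
        return result

class DefaultVertexQuery:
    """Mimics DefaultQuery: fetch every incident edge, filter in memory."""

    def __init__(self, vertex):
        self.vertex, self.filters = vertex, []

    def has(self, key, value):
        self.filters.append((key, value))
        return self

    def edges(self):
        # O(degree): every edge of the supernode is pulled and tested
        return [e for e in self.vertex.all_edges()
                if all(e.get(k) == v for k, v in self.filters)]

class IndexedVertexQuery(DefaultVertexQuery):
    """What a backend-specific Query would do: push filters into the index."""

    def edges(self):
        return self.vertex.index_lookup(self.filters)

v = ToyVertex([{"label": "knows", "since": 2010},
               {"label": "knows", "since": 2012},
               {"label": "likes", "since": 2010}])
slow = DefaultVertexQuery(v).has("label", "knows").has("since", 2010).edges()
fast = IndexedVertexQuery(v).has("label", "knows").has("since", 2010).edges()
```

Both return the same edges; the point is that the indexed variant never materializes the supernode's full edge list.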

Thanks man,
Marko.





Marko Rodriguez

Oct 27, 2012, 2:50:06 PM10/27/12
to gremli...@googlegroups.com
Hey Pablo,

> This is such great news!!

Thanks.

> I've been waiting for this in Neo4j for way too long! (more than a year and a half...) and it's extremely important for the projects I'm developing such as Bio4j.
> Quick question, how hard would it be to move a DB from Neo4j to Titan?

I don't know how Bio4j is implemented, but if it's written to TinkerPop (e.g. Blueprints and/or Rexster), then it is fairly trivial to move between the various graph vendors: OrientDB, Neo4j, Titan, DEX, InfiniteGraph, Sail RDF, etc.

If Bio4j is not written to TinkerPop, then there is solace in the fact that all graph software is premised on the same foundational concepts. There are vertices, edges, traversals, etc. Therefore, while there is code to be written, there are no new concepts to be learned.

To continue beyond your original question... the TinkerPop approach is to make sure that graph technologies are interoperable. There is more than just graph databases: there are in-memory toolkits (JUNG, GraphStream, etc.), visualization packages (Cytoscape, Gephi), batch analytics packages (Giraph, Faunus, Hama), etc. The goal with TinkerPop is to make the interfaces the same so that people can use, e.g., OrientDB with Faunus and then visualize the graph using Gephi. This way, developers don't need to learn new APIs, and vendor lock-in is avoided as best as humanly possible.
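The whole trick is programming against a shared contract. A minimal sketch of the idea (illustrative names only, not the real Blueprints API):

```python
from abc import ABC, abstractmethod

class Graph(ABC):
    """A Blueprints-style contract (names illustrative): application code
    written against this interface runs on any conforming backend."""

    @abstractmethod
    def add_vertex(self, vid): ...

    @abstractmethod
    def vertex(self, vid): ...

class InMemoryGraph(Graph):
    """One possible backend; Neo4j/OrientDB/Titan adapters would be others."""

    def __init__(self):
        self._v = {}

    def add_vertex(self, vid):
        self._v[vid] = {"id": vid, "out": []}
        return self._v[vid]

    def vertex(self, vid):
        return self._v.get(vid)

def load_follows(g):
    # Application logic sees only the Graph interface, so swapping the
    # backend requires no changes here -- that's the lock-in avoidance.
    marko = g.add_vertex("marko")
    pablo = g.add_vertex("pablo")
    marko["out"].append(pablo["id"])
    return g

g = load_follows(InMemoryGraph())
```

Swapping `InMemoryGraph` for any other implementation of the same interface leaves `load_follows` untouched.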

I hope that answers your question.

Good luck with your project,
Marko.





Pablo Pareja

Oct 31, 2012, 4:21:28 AM10/31/12
to gremli...@googlegroups.com
Hey,

Thanks for your answer.

> If Bio4j is not written to TinkerPop, then there is solace in the fact that all graph software is premised on the same foundational concepts. There are vertices, edge, traversals, etc. Therefore, while there is code to be written, there are not new concepts to be learned.

What about indexing? According to this discussion I found here: https://groups.google.com/forum/#!msg/neo4j/TWISaqitivo/vcbjZsxKDlAJ full-text indexes are not directly available with Blueprints, right?
Also, in the importing process I have to create many entities based on discoverability, so I need to add/flush/query values in the same process. Would that be possible with Blueprints/TinkerPop?
Thanks in advance.

Cheers,

Pablo

Luca Garulli

Oct 31, 2012, 5:23:39 AM10/31/12
to gremli...@googlegroups.com
Hi Marko,
this is something in our queue but planned after 1.3.

Lvc@


Marko Rodriguez

Oct 31, 2012, 7:58:03 AM10/31/12
to gremli...@googlegroups.com
Hi,

>> If Bio4j is not written to TinkerPop, then there is solace in the fact that all graph software is premised on the same foundational concepts. There are vertices, edge, traversals, etc. Therefore, while there is code to be written, there are not new concepts to be learned.

> What about indexing? According to this discussion I found here: https://groups.google.com/forum/#!msg/neo4j/TWISaqitivo/vcbjZsxKDlAJ fulltext indexes are not directly available with Blueprints right?

Full-text indexes are supported through Blueprints if the underlying storage layer (graph engine) supports full-text search. Titan will be supporting ElasticSearch in the near future --- currently, there is no support for full-text search in Titan.

> Also, in the importing process I have to create many entities based on discoverability so I need to add/flush/query values in the same process. Would that be possible with Blueprints/Tinkerpop?

I don't quite understand your question. Can you say more?

Thanks Pablo,
Marko.



Pablo Pareja

Nov 2, 2012, 4:38:05 AM11/2/12
to gremli...@googlegroups.com
Hi,


> Full-text indexes are supported through Blueprints if the underlying storage layer (graph engine) supports full text search. Next, Titan will be supporting Elastic Search in the near future --- currently, there is no support for full text search in Titan.

Mmm, too bad; we will have to wait for the ElasticSearch support to come out then.

>> Also, in the importing process I have to create many entities based on discoverability so I need to add/flush/query values in the same process. Would that be possible with Blueprints/Tinkerpop?

> I don't quite understand your question. Can you say more?

What I meant is that when I'm importing the data, I cannot do things like import all the nodes first and then connect them with all the relationships that would be present in the database. On the contrary, I'm "discovering" nodes that I have to create and index along the way, and I was concerned about the flushing of the indexed values that I'm doing all the time in the middle of the process. If I couldn't add and flush an indexed value in the same transaction/process, that would be a problem, because I would end up creating tons of redundant nodes and everything would just be a mess.
Let me know if this is more understandable; otherwise I can try to explain it with an example of input data.

Cheers,

Pablo


Marko Rodriguez

Nov 2, 2012, 4:43:27 PM11/2/12
to gremli...@googlegroups.com
Hi,

> What I meant is that when I'm importing the data, I cannot do things like import all the nodes first and then connect them with all relationships that would be present in the database. On the contrary, I'm "discovering" nodes that I have to create and index on the way and I was concerned about the flushing of the indexed values I am doing all the time in the middle of the process. In the case where I couldn't add and flush an indexed value in the same transaction/process, that would be a problem because I would end up creating tones of redundant nodes and everything would just be a mess.

You should look into Blueprints' BatchGraph.

With BatchGraph, an internal HashMap is maintained in memory on the parsing machine. This map provides the "database id to dataset id" mapping so that indices are not needed. If you have to hit an index on every parse chunk (e.g. every new line of your text file), you will incur speed penalties (which are compounded in a distributed environment). It is best, if possible, to hold this mapping in memory.

Next, Matthias Broecheler has done some great work on BatchGraph where he compresses your ids, so if you use Long, String, URI, etc., they get compressed and the HashMap doesn't blow up your heap. As an example, for parsing DBPedia (200 million edges), Stephen says he was able to do that with -Xmx8g. We did a 100-million-edge parse into Titan/HBase the other day with BatchGraph in ~1.25 hours and -Xmx10g (probably could have used less memory... ?). The limiting factors of BatchGraph are that 1) it's single-threaded, 2) it talks to a single machine in the cluster, and 3) your vertex ids must fit in the HashMap on the parsing machine.
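The pattern is roughly this (an illustrative sketch, not BatchGraph's real API): hold the external-id map in memory, get-or-create vertices through it instead of hitting an index, and commit a transaction every N mutations:

```python
class ToyTxGraph:
    """Stand-in for a transactional graph database (illustrative only)."""

    def __init__(self):
        self.vertices, self.edges, self.commits = {}, [], 0

    def add_vertex(self, vid):
        self.vertices[vid] = {"id": vid}
        return self.vertices[vid]

    def add_edge(self, src, dst, label):
        self.edges.append((src["id"], dst["id"], label))

    def commit(self):
        self.commits += 1

class BatchLoader:
    """Sketch of BatchGraph's core trick (hypothetical names): hold the
    dataset-id -> database-vertex map in memory so no index lookups are
    needed, and commit a transaction every `buffer_size` mutations."""

    def __init__(self, graph, buffer_size=1000):
        self.graph, self.buffer_size = graph, buffer_size
        self.id_map = {}  # external id -> vertex, replaces index hits
        self.pending = 0

    def get_or_create(self, ext_id):
        v = self.id_map.get(ext_id)
        if v is None:
            v = self.graph.add_vertex(ext_id)
            self.id_map[ext_id] = v
            self._mutated()
        return v

    def add_edge(self, src_id, dst_id, label):
        self.graph.add_edge(self.get_or_create(src_id),
                            self.get_or_create(dst_id), label)
        self._mutated()

    def _mutated(self):
        self.pending += 1
        if self.pending >= self.buffer_size:
            self.graph.commit()
            self.pending = 0

g = ToyTxGraph()
loader = BatchLoader(g, buffer_size=3)
loader.add_edge("a", "b", "knows")  # creates a, b, edge -> 3 mutations, commit
loader.add_edge("a", "b", "likes")  # vertices reused from the id map
```

This also answers the earlier "discovering nodes on the way" concern: get-or-create against the in-memory map is exactly the add/flush/query cycle, without touching an index.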

To remedy these limitations for massive graph loading, we are currently writing Faunus InputFormats for Titan so you can do your parsing via Faunus and have the parse run as Map/Reduce jobs against the Titan cluster. E.g., via Faunus/Gremlin:

gremlin> g.loadGraphSON('hdfs://mybiggraph.json')

This has the benefit of parallel parsing, with each machine in the cluster balancing out the write load. Finally, Matthias and Dan have some tricks for 100+ billion edge graphs where they use CloudFormation/etc. to spawn writing machines. They have demonstrated ~1 million edges per second using this model. Unfortunately, it is not open source, and we hope to make the Faunus method the "easy and sufficient" model for bulk loading 100s of billions of edges.

I hope that answers your questions,
Marko.




Pablo Pareja

Dec 12, 2012, 5:53:48 AM12/12/12
to gremli...@googlegroups.com
Hi Marko,

Thanks so much for your answer.

I finally managed to find some time to do this so I'm going to start adapting a small part of the importing process to BatchGraph and see if I get it working.

So, once I have the data imported through Blueprints, using Titan should be a pretty straightforward step, right?

I have been having a look at the three different storage backends and I wanted to ask you which one would better suit my use case.
In principle my database is not going to change at all once the big initial import is done. Perhaps BerkeleyDB would then be the best choice?

Thanks!

Pablo


Matthias Broecheler

Dec 12, 2012, 6:00:48 AM12/12/12
to gremli...@googlegroups.com
Yes, if you just import the data once and it's not too much data, then BerkeleyDB is your best choice.

Cheers,
Matthias




--
Matthias Broecheler, PhD
http://www.matthiasb.com
E-Mail: m...@matthiasb.com

Pablo Pareja

Dec 12, 2012, 6:07:38 AM12/12/12
to gremli...@googlegroups.com
OK, thanks ;)

One more question I'm having trouble with right now: how can I create a TransactionalGraph that I can pass as an argument to the BatchGraph constructor?
I tried with TinkerGraph but it doesn't work...

TinkerGraph graph = new TinkerGraph("folder");
BatchGraph bGraph = new BatchGraph(graph, BatchGraph.IdType.STRING, 1000); // compilation error

Cheers,

Pablo

Matthias Broecheler

Dec 12, 2012, 6:15:17 AM12/12/12
to gremli...@googlegroups.com
Hey Pablo,

The purpose of BatchGraph is to provide a simple interface that simplifies loading medium-sized graphs into transactional graph databases by chunking the input graph into smaller parts and loading each per transaction. That's why it expects a transactional graph, such as Neo4j, OrientDB, or Titan. If you want to use TinkerGraph, BatchGraph will not provide any benefit.

HTH,
Matthias

Pablo Pareja

Dec 12, 2012, 6:40:51 AM12/12/12
to gremli...@googlegroups.com
Hi Matthias,

Well, my goal is actually to use the Titan graph DB, so if I can somehow create a TitanGraph instance that'd be great.
I've tried to create an instance of TitanGraph, but I just realized it's an interface. Am I supposed to create my own class that implements this interface in order to use it?

Thanks in advance.

Pablo



Stephen Mallette

Dec 12, 2012, 8:03:18 AM12/12/12
to gremli...@googlegroups.com
From the Titan/Gremlin prompt:

gremlin> graph = TitanFactory.open("/tmp/graph")
==>titangraph[local:/tmp/graph]
gremlin> bg = new BatchGraph(graph, BatchGraph.IdType.STRING, 1000)
==>batchgraph[titangraph[local:/tmp/graph]]

TitanFactory will return a TitanGraph instance which you can then
pass directly into BatchGraph. The example above creates an
instance backed by BerkeleyDB, as described here:

https://github.com/thinkaurelius/titan/wiki/Using-BerkeleyDB

Check out the options for instantiating a Titan instance on Cassandra or HBase here:

https://github.com/thinkaurelius/titan/wiki/Using-Cassandra
https://github.com/thinkaurelius/titan/wiki/Using-HBase

Stephen

Pablo Pareja

Dec 12, 2012, 11:41:57 AM12/12/12
to gremli...@googlegroups.com
Hi Stephen,

Thanks a lot! ;)

Pablo 
