Loading Large Data into JanusGraph and HBase

james....@gmail.com

Feb 13, 2018, 7:50:44 AM
to JanusGraph users
I have a JanusGraph database, backed by HBase. Into this database, I'm trying to load a number of GraphML files, which range in size from a few hundred KB to a couple of GB. I'm doing this with the following Java code:

Graph graph = GraphFactory.open(conf);
graph.io(IoCore.graphml()).writer().create().writeGraph(new FileOutputStream(file), graph);

where conf is loaded from the following configuration file:

gremlin.graph=org.janusgraph.core.JanusGraphFactory
storage.backend=hbase
storage.hostname=my_host
storage.hbase.table=my_table

cache.db-cache = true
cache.db-cache-clean-wait = 20
cache.db-cache-time = 180000
cache.db-cache-size = 0.5

index.search.backend=lucene
index.search.directory=data/index

Loading the smaller files works, but if I try to load the large files I get out-of-memory errors whenever there is already data in the table. If I load a large file into an empty table, it seems to work (although I do have issues with HBase/ZooKeeper timeouts). To me, this feels like JanusGraph is trying to load the existing data into memory before ingesting the new data. Is this correct? And if so, is there any way I can avoid that, or load my data in differently? If not, any ideas what is going on?

Thanks,
James

marc.de...@gmail.com

Feb 14, 2018, 9:48:05 AM
to JanusGraph users
Hi James,

I do not understand. The graph writer writes data from a graph to a file. Your text suggests you are using it the other way around, that is, to get data from the file into the graph.

See:

Maybe this remark helps in clarifying the situation!

Cheers,    Marc

On Tuesday, February 13, 2018 at 13:50:44 UTC+1, James Baker wrote:

james....@gmail.com

Feb 15, 2018, 3:08:51 AM
to JanusGraph users
Apologies, I copied the wrong piece of code - I am indeed using the graph reader:

graph.io(IoCore.graphml()).reader().create().readGraph(inputStream, graph);

The graph variable is connected to HBase, and the inputStream comes from a GraphML file. When I connect graph to HBase, does it try to load the HBase data into memory?

HadoopMarc

Feb 15, 2018, 5:42:45 AM
to JanusGraph users
Hi James,

Two things you can try:

1. Reading through the code, things could possibly go wrong with the vertex cache that is used (which can only be enlarged via JVM memory options) and with the default batchSize of 10000. So you can try reader().batchSize(1000).create().
2. If you have Hadoop/Spark available next to HBase, you can use the BulkLoaderVertexProgram, which accepts very large GraphML files from Hadoop. For this approach it is important to have a JanusGraph index defined on bulkloader.vertex.id.
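For what it's worth, the first suggestion would look roughly like the sketch below. This is a hedged example, not a tested recipe: "conf.properties" and "data.graphml" are placeholder names, and the right batch size depends on your heap and the size of your vertices.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.structure.io.IoCore;
import org.apache.tinkerpop.gremlin.structure.util.GraphFactory;

public class LoadGraphML {
    public static void main(String[] args) throws Exception {
        // Open the HBase-backed JanusGraph using the properties file
        // shown earlier in the thread (placeholder name here).
        Graph graph = GraphFactory.open("conf.properties");
        try (InputStream in = new FileInputStream("data.graphml")) {
            // A smaller batchSize means intermediate commits happen more
            // often, so fewer uncommitted mutations are buffered in memory
            // at once while the large file is ingested.
            graph.io(IoCore.graphml())
                 .reader()
                 .batchSize(1000)
                 .create()
                 .readGraph(in, graph);
        }
        graph.close();
    }
}
```

If 1000 still runs out of memory, you could try lowering it further, at the cost of more (smaller) commits against HBase.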

HTH,    Marc


On Thursday, February 15, 2018 at 09:08:51 UTC+1, James Baker wrote: