Bulk loading into JanusGraph with HBase


Michele Polonioli

Oct 5, 2017, 10:17:06 AM
to JanusGraph users
I have JanusGraph using HBase as the storage backend on a Hadoop cluster.

I need to load a very large quantity of data that represents a social network graph stored in CSV files.
So far I have written a Java program that creates the schema and loads vertices and edges using Gremlin.


The problem is that this method is very slow.


Is there a way to bulk load directly into HBase in order to significantly reduce the loading time?


The CSV files are generated by ldbc_snb_datagen: https://github.com/ldbc/ldbc_snb_datagen

I'll attach a small portion of the files I need to load and the Java classes that I wrote.

Thanks.
JanusGraphImporter.java

Jason Plurad

Oct 6, 2017, 10:30:42 AM
to JanusGraph users
Thanks for providing the code. It would be even better if you shared everything as a GitHub project that's easy to clone and build, contains the CSV files, and also lists the specific parameters you're passing to the program, like batchSize.

You didn't mention how slow is slow. What is the ingestion rate for vertices, properties, and edges? Some more concrete details would be helpful. What does your HBase deployment look like?

         * Note: For unknown reasons, it seems that each modification to the
         * schema must be committed in its own transaction.

I noticed this comment in the code. I don't think that's true, and GraphOfTheGodsFactory does all of its schema updates in one mgmt transaction. I'd be interested to hear more details on this scenario too.

Michele Polonioli

Oct 6, 2017, 11:36:39 AM
to JanusGraph users
I created a repository on GitHub with my code and a very small CSV sample here: https://github.com/mpolonioli/JanusGraph-importer-example.

The CSV files that I provided with the repo are a very small example. I loaded 1.2 GB of files in about 12 hours.

My JanusGraph deployment is on a Hadoop cluster of 4 nodes, with HBase installed through Cloudera Manager.

I didn't measure the ingestion rate for vertices, properties, and edges, and I don't actually know how to do that.

I apologize for the wrong comment in my code. That code is partly derived from an implementation of a titan-importer, and I forgot to delete that comment.

I'm wondering if there is a way to load the data directly into HBase without using the JanusGraph API, or if my code can be optimized.

I hope this helps to solve my problem, thank you.

Michele Polonioli

Oct 6, 2017, 11:51:03 AM
to JanusGraph users
https://drive.google.com/file/d/0B-f-jjH6bDhnZUx1RkoyOElEQlE/view?usp=sharing

Here is a zip containing the data that took 12h to load on the cluster.

I also tried loading that data into JanusGraph-HBase with the default configuration on a laptop, and the loading took 3h.

Joe Obernberger

Oct 6, 2017, 12:01:12 PM
to Jason Plurad, JanusGraph users

We are having similar performance issues loading graph data into JanusGraph backed by HBase. I agree with Jason: we didn't have any issues doing all the mgmt calls in one go.

One thing that we did was to multi-thread the Java code, which certainly helped performance. HBase seems to respond well to multiple calls at once. For example, in your loadVerticies method, you may want to start a task inside the main for loop and give it a pool of maybe 32 threads (depending on the machine you're running on). I use the Java ExecutorService, like this:

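// Pool of worker threads plus a semaphore that caps the tasks in flight.
ExecutorService doWork = Executors.newFixedThreadPool(MAX_WORK_CALLS);
Semaphore smDoWork = new Semaphore(MAX_WORK_CALLS);

// Inside the main loop: wait for a free slot, then hand off the work.
try {
    smDoWork.acquire();
} catch (InterruptedException ex) {
    log.error("Interrupt: " + ex);
}
// doJanusStuff is the Runnable; it must release smDoWork when it finishes.
someThread = new doJanusStuff(this);
doWork.execute(someThread);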

Just make sure to release the semaphore when the thread is completed. All that said, performance was then limited by the one machine doing the ingesting, and still seemed slower than one would expect. In our case, generating a graph of 154 million nodes and ~275 million edges took 3 days on a 5-node Hadoop cluster.
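
For readers who want to try the pattern described above, here is a minimal, self-contained sketch. It is not Joe's actual code: the class name BoundedLoader, the method runAll, and the counter standing in for the real loading work are all hypothetical, and the value 32 is only the figure suggested in the post. It shows the semaphore being released in a finally block so that a failing task cannot leak a permit.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class BoundedLoader {
    // Assumed cap on in-flight tasks; tune it for the ingesting machine.
    static final int MAX_WORK_CALLS = 32;

    /** Submits one task per chunk and returns how many tasks completed. */
    static int runAll(int chunks) {
        ExecutorService pool = Executors.newFixedThreadPool(MAX_WORK_CALLS);
        Semaphore permits = new Semaphore(MAX_WORK_CALLS);
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < chunks; i++) {
            // Blocks the submitting loop while MAX_WORK_CALLS tasks are running.
            permits.acquireUninterruptibly();
            pool.execute(() -> {
                try {
                    // Placeholder for the real work: load one chunk of rows
                    // in its own transaction and commit it.
                    done.incrementAndGet();
                } finally {
                    // Always release, even if the loading code throws.
                    permits.release();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
        return done.get();
    }

    public static void main(String[] args) {
        System.out.println("completed tasks: " + runAll(100));
    }
}
```

In a real importer each task would open its own transaction, since work spread across threads should not share uncommitted state.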

-Joe


Jason Plurad

Oct 8, 2017, 2:46:33 PM
to JanusGraph users
Thanks for sharing all that info; it makes it much easier to have a constructive conversation.

Your default batch size of 100,000 between commits looks really large. After dropping that down to 5,000, these were my results running on my machine (2015 MacBook Pro, 2.8 GHz Intel Core i7 quad core, 16 GB RAM, 1 TB SSD):

Time needed for loading schema into the graph in milliseconds: 94592
Time needed for loading data into the graph in milliseconds: 4587774
Time needed for loading vertices into the graph in milliseconds: 302718
Time needed for loading properties into the graph in milliseconds: 13071
Time needed for loading edges into the graph in milliseconds: 4271985
Total duration in milliseconds: 4682366

Time Elapsed for loading schema into the graph: 000h.01m.34s
Time Elapsed for loading data into the graph: 001h.16m.27s
Total duration: 001h.18m.2s
vertices: 3181724, edges: 17436661
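
As a rough illustration, the counts and timings reported above can be converted into ingestion rates. The sketch below is not part of the importer; the class name IngestRate and the method perSecond are hypothetical, and it assumes the vertex and edge totals correspond to the vertex-loading and edge-loading phases respectively.

```java
class IngestRate {
    /** Elements per second, given a count and an elapsed time in milliseconds. */
    static double perSecond(long count, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            throw new IllegalArgumentException("elapsedMillis must be positive");
        }
        return count * 1000.0 / elapsedMillis;
    }

    public static void main(String[] args) {
        // Counts and timings copied from the run reported above.
        System.out.printf("vertices/s: %.0f%n", perSecond(3_181_724L, 302_718L));
        System.out.printf("edges/s:    %.0f%n", perSecond(17_436_661L, 4_271_985L));
    }
}
```

With these figures the run works out to roughly 10,500 vertices per second and roughly 4,100 edges per second, which suggests edge loading dominates the total time.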

Not sure what your machine specs are, but that's already 2x faster. I didn't spend much more time on it, but experimenting with the batch size could get you better results.
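
To make the batch-size experiment concrete, here is a minimal sketch of committing every N rows. It is not taken from the importer in this thread: BatchCommitter, load, and the two callbacks are hypothetical names. In a JanusGraph program the addRow callback would typically create the vertex or edge and the commit callback would call graph.tx().commit(); here both are left as no-ops so the sketch runs on its own.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

class BatchCommitter {
    /**
     * Applies addRow to every row and calls commit after each full batch,
     * plus once at the end for a trailing partial batch.
     * Returns the number of commits performed.
     */
    static <T> int load(Iterable<T> rows, int batchSize,
                        Consumer<? super T> addRow, Runnable commit) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int pending = 0;
        int commits = 0;
        for (T row : rows) {
            addRow.accept(row);
            if (++pending == batchSize) {
                commit.run();
                commits++;
                pending = 0;
            }
        }
        if (pending > 0) {
            // Flush whatever is left after the last full batch.
            commit.run();
            commits++;
        }
        return commits;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("a", "b", "c", "d", "e");
        // No-op callbacks stand in for the real graph calls.
        int commits = load(rows, 2, row -> {}, () -> {});
        System.out.println("commits: " + commits);
    }
}
```

Keeping the batch size as a parameter makes it easy to rerun the same load with values such as 1,000, 5,000, and 20,000 and compare the timings.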

You mentioned you saw 3h on local laptop vs 12h on the HBase cluster. This sounds like either your cluster is misconfigured/unoptimized or you have a big latency involved between your client application and the cluster.

Michele Polonioli

Oct 9, 2017, 4:47:14 PM
to JanusGraph users
Thank you, reducing the batch size did help.

If you want to see any of the cluster configuration, please ask, because I don't know where to look for a better configuration.

Michele Polonioli

Oct 10, 2017, 11:11:04 AM
to JanusGraph users
Thank you for the hint.

I've updated my code to implement the multi-threaded logic.

I haven't tested the performance yet.

Liping Huang

Dec 22, 2017, 1:28:31 AM
to JanusGraph users
Hi Michele,

Could you share some sample code, if that is OK? I tried to load data with threading but encountered a lot of NPEs, and I get a lot of messages like the ones below and nothing else:
14:26:23.328 [pool-20-thread-1] DEBUG o.j.d.cassandra.CassandraTransaction - Created CassandraTransaction@278d5905[read=QUORUM,write=QUORUM]
14:26:28.332 [pool-20-thread-1] DEBUG o.j.d.cassandra.CassandraTransaction - Created CassandraTransaction@6f637d04[read=QUORUM,write=QUORUM]
14:26:33.334 [pool-20-thread-1] DEBUG o.j.d.cassandra.CassandraTransaction - Created CassandraTransaction@487d8d07[read=QUORUM,write=QUORUM]
14:26:38.339 [pool-20-thread-1] DEBUG o.j.d.cassandra.CassandraTransaction - Created CassandraTransaction@6fff4e25[read=QUORUM,write=QUORUM]
14:26:43.342 [pool-20-thread-1] DEBUG o.j.d.cassandra.CassandraTransaction - Created CassandraTransaction@227777ae[read=QUORUM,write=QUORUM]
14:26:48.346 [pool-20-thread-1] DEBUG o.j.d.cassandra.CassandraTransaction - Created CassandraTransaction@5688b5b9[read=QUORUM,write=QUORUM]
14:26:53.350 [pool-20-thread-1] DEBUG o.j.d.cassandra.CassandraTransaction - Created CassandraTransaction@72390fda[read=QUORUM,write=QUORUM]
14:26:58.352 [pool-20-thread-1] DEBUG o.j.d.cassandra.CassandraTransaction - Created CassandraTransaction@6722e9de[read=QUORUM,write=QUORUM]
14:27:03.354 [pool-20-thread-1] DEBUG o.j.d.cassandra.CassandraTransaction - Created CassandraTransaction@7ef13893[read=QUORUM,write=QUORUM]
14:27:08.356 [pool-20-thread-1] DEBUG o.j.d.cassandra.CassandraTransaction - Created CassandraTransaction@6d477f89[read=QUORUM,write=QUORUM]
14:27:13.359 [pool-20-thread-1] DEBUG o.j.d.cassandra.CassandraTransaction - Created CassandraTransaction@57baea24[read=QUORUM,write=QUORUM]
14:27:18.361 [pool-20-thread-1] DEBUG o.j.d.cassandra.CassandraTransaction - Created CassandraTransaction@27268514[read=QUORUM,write=QUORUM]
14:27:23.363 [pool-20-thread-1] DEBUG o.j.d.cassandra.CassandraTransaction - Created CassandraTransaction@506414c3[read=QUORUM,write=QUORUM]
14:27:28.365 [pool-20-thread-1] DEBUG o.j.d.cassandra.CassandraTransaction - Created CassandraTransaction@6899271[read=QUORUM,write=QUORUM]
14:27:33.368 [pool-20-thread-1] DEBUG o.j.d.cassandra.CassandraTransaction - Created CassandraTransaction@662b34e[read=QUORUM,write=QUORUM]
14:27:38.370 [pool-20-thread-1] DEBUG o.j.d.cassandra.CassandraTransaction - Created CassandraTransaction@157f98b3[read=QUORUM,write=QUORUM]

On Tuesday, October 10, 2017 at 11:11:04 PM UTC+8, Michele Polonioli wrote:

Michele Polonioli

Jan 25, 2018, 11:50:27 AM
to JanusGraph users
Hi Liping, I'm sorry for the very late reply.

Here is the GitHub repository with the code I wrote: https://github.com/mpolonioli/janusgraph-importer-example.
I didn't write any documentation, so I'm not sure it is understandable.

Michele Polonioli

Mar 6, 2018, 12:07:09 PM
to JanusGraph users
I've made an open source Java library and created a separate repository.
Check it out at https://github.com/mpolonioli/janusgraph-csv-importer

Jerry He

Mar 6, 2018, 11:24:28 PM
to Michele Polonioli, JanusGraph users
Sorry for the late response.
You can take a look and comment on https://github.com/JanusGraph/janusgraph/issues/885

Thanks,

Jerry


cyrus.vil...@gmail.com

May 17, 2018, 10:32:11 AM
to JanusGraph users
Hi Michele,

I was just wondering how you run the Java program that you have written to bulk load data into JanusGraph. Do you first start the Gremlin console, import the Java class from the console, and run it from there? Or is there some other way?

Cheers,
Cyrus