How much data can Titan+BerkeleyDB handle?


史晔翎

Jun 14, 2016, 10:52:12 AM
to Aurelius
Dear Titan experts:

I'm currently trying out Titan for storing data that is very graph-like, and I'm using Titan 1.0.0 with the BerkeleyDB storage backend to build a prototype.

What I did in the prototype is generate some random vertex data (one string, one int, and one date property per vertex) and randomly add edges between them.
After inserting about 2 million vertices and 8 million edges, the Java process heap grows to 8G and the CPU stays at 100% even though the server is actually doing nothing.

Is this expected? I remember the docs saying BerkeleyDB should handle something like 100m vertices.

Thanks in advance.

Yeling


Stephen Mallette

Jun 15, 2016, 10:54:30 AM
to Aurelius
You shouldn't have a problem with 8 million edges in BerkeleyDB. Are you remembering to periodically commit transactions when loading your data?
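Something along these lines is what I have in mind, a minimal sketch against the Titan 1.0 Java API (the config file path, property keys, counts, and batch size below are just illustrative):

import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;

public class BulkLoad {
    public static void main(String[] args) throws Exception {
        // open the graph against a BerkeleyDB config (illustrative path)
        TitanGraph graph = TitanFactory.open("conf/titan-berkeleyje.properties");
        final int batchSize = 10000;   // mutations per transaction
        int counter = 0;
        for (int i = 0; i < 2000000; i++) {
            graph.addVertex("name", "v" + i, "count", i);
            if (++counter % batchSize == 0) {
                graph.tx().commit();   // flush this batch; a new tx opens on the next mutation
            }
        }
        graph.tx().commit();           // commit the final partial batch
        graph.close();
    }
}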


史晔翎

Jun 16, 2016, 1:58:56 AM
to Aurelius
Yes, I'm committing periodically. What's the suggested transaction size?

Currently I've switched to using Cassandra, and my observation is that the tx.commit() call returns fast but graph.close() takes very long. It took around 10 minutes to finish the close() call after inserting about a million vertices and edges.

PS, storage.batch-loading is set to true.
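For reference, the graph in this Cassandra setup is opened roughly like this (a fragment, assuming the usual TitanFactory/TitanGraph imports; the hostname and backend name are just my local test setup):

// open the Cassandra-backed graph with batch loading enabled
TitanGraph graph = TitanFactory.build()
        .set("storage.backend", "cassandrathrift")
        .set("storage.hostname", "127.0.0.1")
        .set("storage.batch-loading", true)
        .open();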

Daniel Kuppitz

Jun 16, 2016, 2:18:43 AM
to aureliu...@googlegroups.com
I would start with 10k mutations per transaction. Then, depending on how powerful your cluster is, increase or decrease the tx size slightly and see whether it has a performance impact or not.
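A quick-and-dirty timing loop like the following can show which size works best on your hardware (just a sketch, assuming graph is the already-open TitanGraph; the sizes, counts, and property key are arbitrary):

// compare ingest time for a few transaction sizes
for (int batchSize : new int[]{5000, 10000, 20000, 50000}) {
    long start = System.currentTimeMillis();
    int counter = 0;
    for (int i = 0; i < 100000; i++) {
        graph.addVertex("name", "v" + i);
        if (++counter % batchSize == 0) graph.tx().commit();
    }
    graph.tx().commit();   // finish the run before timing it
    System.out.println(batchSize + " mutations/tx -> "
            + (System.currentTimeMillis() - start) + " ms");
}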

Cheers,
Daniel


史晔翎

Jun 16, 2016, 11:37:06 AM
to Aurelius
Thanks Daniel.
I've been using a 100k batch size on a toy cluster setup (8G, 4-core VMs); I guess that's too big.

Now I'm doing smaller batches and monitoring the network activity at the same time. What I found is that the data transfer to Cassandra does not happen when I call tx.commit(), but when I call graph.close() instead. I'm not sure if this is by design, but it does tell me where the bottleneck is in my setup.

Austin Sharp

Jun 21, 2016, 2:36:00 PM
to Aurelius
This last point sounds familiar. I remember being surprised by Cassandra's default setting for when a call returns - i.e. the data had been committed but not yet written to disk, or something like that. Sorry, I don't recall exactly, but it's worth looking into Cassandra rather than Titan for this behavior.

史晔翎

Jun 27, 2016, 11:10:08 AM
to Aurelius
Good point. Guess I should dig deeper. 
Thanks man.