Proper way to generate and insert data into OrientDB


Cyprien Gottstein

Dec 22, 2016, 12:29:49 PM
to OrientDB
Hello everyone,

We are currently testing OrientDB to see whether we can use it as our database. The problem we are facing right now is memory consumption during data generation.

We want to test whether OrientDB supports our queries at a larger scale, so we built a small generator in Java to insert data matching our needs. To generate and insert the data quickly, we first parse ontologies (which serve as semantic reference data) and store them in memory. We then generate random data, bind the records together on the fly, and also bind them to concepts in the ontology graphs. All of this is done through the Java Graph API.

It works nicely at the beginning, but it always ends up crashing with "java.lang.OutOfMemoryError: GC overhead limit exceeded". The Java program that handles the data generation has 1.5 GB of RAM to work with, and by the time it crashes we have generated almost a million OrientDB elements (about a third vertices, the rest edges).

We tried a lot of things: the Massive Insert intent, setting keepReferencesInMemory to false, limiting the disk cache size, and we checked multiple times to make sure we were not doing anything stupid with the memory. We also thought about using a fetch plan to ensure the cache only keeps the main document in memory and not all of its edges, but this option is not accessible from the Graph API. Yet we can't make the generator go any further because it always runs out of memory.
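For reference, the core of our generator loop looks roughly like this (a simplified sketch; the URL, class names and edge label are placeholders, not our real schema):

import com.orientechnologies.orient.core.intent.OIntentMassiveInsert;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.orient.OrientGraph;

public class Generator {
    public static void main(String[] args) {
        OrientGraph graph = new OrientGraph("remote:orientdb-vm/testdb", "admin", "admin");
        try {
            // The Massive Insert intent we enabled, set on the underlying document database
            graph.getRawGraph().declareIntent(new OIntentMassiveInsert());

            Vertex concept = graph.addVertex("class:Concept"); // stands in for an ontology node
            for (int i = 0; i < 1_000_000; i++) {
                Vertex item = graph.addVertex("class:Item");
                item.setProperty("uid", i);
                graph.addEdge(null, item, concept, "refersTo"); // bind generated data to the ontology
            }
        } finally {
            graph.shutdown();
        }
    }
}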

We think it's related to disk cache usage. We can't measure it precisely, but it is visible in htop: memory usage keeps growing during the second half of the generation, even though at that point we are only inserting data into the graph and no longer storing anything in the Java program. Our theory is that we hold pointers to the ontology nodes, which themselves point to the nodes we generate on the fly, and at some point this may force the cache to keep the pointed-to nodes alive in memory. This would explain why the memory keeps growing.

I'm sorry if it's a bit fuzzy.

We could just add more RAM to the JVM, but we can't help but wonder: what are we missing? Is there a proper way to generate a set of interconnected data and insert it into OrientDB?

Thanks,

Cyprien Gottstein.




Luca Garulli

Dec 22, 2016, 12:32:46 PM
to OrientDB
Hi Cyprien,

To insert 1M graph elements you should need just 100 MB, but it depends on how you are doing it. A few questions:
  1. are you using a plocal or a remote connection?
  2. embedded, one server, or multiple servers?
  3. are you inserting using SQL COMMANDs, or are you using the regular Graph API with vertices and edges?
  4. are you using transactions? Please bear in mind that open transactions consume RAM (see the sketch below)
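On point 4: a transaction buffers every change on the client until commit, so one huge transaction grows without bound. A minimal sketch of keeping it bounded with periodic commits (the URL, credentials and class name are just examples):

import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.orient.OrientGraph;

public class BatchedInsert {
    public static void main(String[] args) {
        OrientGraph graph = new OrientGraph("remote:localhost/testdb", "admin", "admin");
        try {
            final int batchSize = 1000;
            for (int i = 0; i < 1_000_000; i++) {
                Vertex v = graph.addVertex("class:Item");
                v.setProperty("uid", i);
                // Committing every batchSize elements sends the batch to the
                // server and frees the RAM held by the open transaction.
                if (i % batchSize == batchSize - 1)
                    graph.commit();
            }
            graph.commit(); // flush the last partial batch
        } finally {
            graph.shutdown();
        }
    }
}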


Best Regards,

Luca Garulli
Founder & CEO


Oleksandr Gubchenko

Dec 22, 2016, 12:36:30 PM
to OrientDB
It might be a JVM memory-tuning issue. You can check the GC log by adding these parameters:

    -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintGCID -XX:+PrintGCDetails -Xloggc:$ORIENTDBHOME/log/gc%p_%t.log

Then check the memory usage and give the JVM the correct amount of memory.
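For example, if your generator runs with 1.5 GB of heap, the full command could look like this (the jar name is just an example):

    java -Xmx1536m \
         -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintGCID -XX:+PrintGCDetails \
         -Xloggc:$ORIENTDBHOME/log/gc%p_%t.log \
         -jar generator.jar

If the log shows back-to-back full GCs that reclaim almost nothing, the heap is simply too small for what the program keeps live.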

Cyprien Gottstein

Dec 23, 2016, 4:14:44 AM
to OrientDB
Hello,

First things first: thank you very much for your answers, your help is much appreciated.

l.garulli

1. We are using a remote connection

2. We run a standalone OrientDB server on an isolated VM with 8 GB of RAM; no problems on that side

3. To insert data we rely only on the regular Graph API, indeed using vertices and edges. We thought about writing a big file of SQL statements to insert the data, but haven't had the time to do so yet

4. We do use transactions. I understand that they may be costly in RAM, but this much?

Anyway, we are in a prototype environment, so I won't break anything; I will try to patch our generator to work without transactions.
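I imagine something along these lines (a sketch of what I plan to try, with placeholder names, not our actual generator code):

import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.orient.OrientGraphNoTx;

public class NoTxGenerator {
    public static void main(String[] args) {
        // Non-transactional graph: every operation is persisted immediately,
        // so nothing accumulates in a client-side transaction.
        OrientGraphNoTx graph = new OrientGraphNoTx("remote:orientdb-vm/testdb", "admin", "admin");
        try {
            for (int i = 0; i < 1_000_000; i++) {
                Vertex v = graph.addVertex("class:Item");
                v.setProperty("uid", i);
            }
        } finally {
            graph.shutdown();
        }
    }
}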

Oleksandr Gubchenko
I don't get exactly what you are proposing; maybe I did not make myself clear enough. The memory issue we are experiencing is on the generator side (which you can see as our client); the OrientDB server itself is running just fine.
Still, I will try to make something out of those flags, I'm curious.

Thanks again, I will keep you updated

Cyprien Gottstein

Jan 2, 2017, 8:09:13 AM
to orient-...@googlegroups.com
Hello everyone,

Back from holidays, we had the time to patch the generator to run in non-transactional mode, and it works! Sadly, the insert rate has dropped by a factor of roughly 50.
We guessed it would be slower, but didn't think the gap would be this large.

We will run it once more to get actual numbers/rates for each mode.

Anyway, we also thought about writing a "big file" full of INSERT statements to do a bulk insert, but we ran into an unexpected problem: when inserting a new element (be it an edge or a vertex), we can't manually force its @rid.
Consequently, we can't easily write all the edges between the vertices, because we have no way of knowing the proper vertex @rids in advance.

We did some research; the workaround is to use subqueries in the INSERT to retrieve the vertices to bind. But that means OrientDB will have to run two subqueries for each edge to insert. It will probably work, but we highly doubt it will be faster than the non-transactional mode, and in the end we would have done all this for nothing.
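To illustrate (class and property names below are placeholders), the subquery workaround we found looks like this:

CREATE EDGE Knows FROM (SELECT FROM Person WHERE uid = 'a') TO (SELECT FROM Person WHERE uid = 'b')

As far as we understand, OrientDB's SQL batch syntax could sidestep the lookup entirely, since LET variables capture the @rids of freshly created vertices:

BEGIN
LET a = CREATE VERTEX Person SET uid = 'a'
LET b = CREATE VERTEX Person SET uid = 'b'
CREATE EDGE Knows FROM $a TO $b
COMMIT

But we have not measured either variant yet.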

Are we, once more, missing something? Or is writing a large insert file just not a good idea with OrientDB (at least when you have lots of edges)?

Thanks a lot, and Happy New Year!

Cyprien Gottstein.

Luca Garulli

Jan 2, 2017, 9:25:58 AM
to OrientDB
Hi Cyprien,

Could you please attach the batch script that executed the bulk inserts?

Best Regards,

Luca Garulli
Founder & CEO


Cyprien Gottstein

Jan 5, 2017, 9:44:54 AM
to OrientDB
Hi,

The scripts are in Java, they are not batch scripts, and to be perfectly honest we didn't want to post them online.

But.

Great news, problem solved!

I remembered something crucial: it's mainly me working on this subject, and at first I wanted to run the OrientDB server on my own local machine. Since my computer is not what I would call young and powerful, I asked to run it on a virtual machine in our own datacenter. I work in France. The datacenter is in Poland.

What was it again? Non-transactional mode means "every new entity is written to disk immediately", so for every single entry we were paying the network latency between our building in France and the datacenter in Poland.

We built a proper uber-jar to run the generation code on the very same machine that hosts the OrientDB server instance.
We then ran a simple benchmark to see how the total execution time behaves depending on the transaction mode and the location of the program.

Workplace (France) - Transactional: ~900 secs (I don't have the exact number, I just know it's around that, sorry.)
Workplace (France) - Non-transactional: more than 54,000 secs
Datacenter (Poland) - Transactional: 212 secs
Datacenter (Poland) - Non-transactional: ~2,700 secs (again, an approximation, sorry.)

So, by executing the code from the right place, we avoided a HUGE loss of time. The non-transactional mode is slower, but your promise was kept: it runs with a constant amount of RAM. That means if we want to generate 10 or even 50 million entities, we can do it in a reasonable time, so I would guess it's fine.

Last question, and I think this case will be closed (at least for me):

The OrientDB documentation warns, in http://orientdb.com/docs/last/Performance-Tuning.html#wise-use-of-transactions, that:

"Transactions slow down massive inserts unless you're using a "remote" connection. In that case it speeds up all the insertion because the client/server communication happens only at commit time."

We can indeed see this effect when we run our little benchmark (transactional: 212 secs vs non-transactional: 2,700 secs). Is this difference in speed to be expected, or does something still feel off?

Thanks again for your time,

Cyprien Gottstein

Luca Garulli

Jan 5, 2017, 5:06:44 PM
to OrientDB
Hi Cyprien,

Glad you solved it.

With remote connections and/or graph operations, especially when you add multiple edges, transactions always help. Before 2.2 transactions didn't run in parallel, but since 2.2 they can. In fact, this is my next question: are you running on multiple threads?
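If not, it is worth trying. A rough sketch of parallel inserts (pool size, thread count and class name are only examples):

import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.orient.OrientGraph;
import com.tinkerpop.blueprints.impls.orient.OrientGraphFactory;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelInsert {
    public static void main(String[] args) throws InterruptedException {
        final OrientGraphFactory factory =
            new OrientGraphFactory("remote:localhost/testdb", "admin", "admin").setupPool(1, 8);
        ExecutorService executor = Executors.newFixedThreadPool(4);
        for (int t = 0; t < 4; t++) {
            final int shard = t;
            executor.submit(() -> {
                OrientGraph graph = factory.getTx(); // one graph instance per thread
                try {
                    for (int i = 0; i < 250_000; i++) {
                        Vertex v = graph.addVertex("class:Item");
                        v.setProperty("uid", shard + "-" + i);
                        if (i % 1000 == 999)
                            graph.commit(); // small transactions, committed in parallel
                    }
                    graph.commit();
                } finally {
                    graph.shutdown();
                }
            });
        }
        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.HOURS);
        factory.close();
    }
}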

Best Regards,

Luca Garulli
Founder & CEO
