Efficiently load data into OrientDB from Spark

al...@xagongroup.com

Sep 26, 2017, 2:41:12 AM
to OrientDB
Hello friends, how is your day?

I wanted to ask you for suggestions on how to load data into OrientDB efficiently.
I tried several approaches, and none of them worked well enough.

My setup at the moment is my laptop, which has 8 GB of RAM and an i7 processor.

The data I'm trying to ingest is quite modest: about 40k nodes, each with some properties. I need to be able to run the same loading job idempotently, that is, running it twice should not produce twice the data.
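For context, the kind of schema and unique index I have in mind looks roughly like this; a minimal sketch, assuming a Person vertex class keyed by person_id (the class, property and connection details are illustrative):

    import com.orientechnologies.orient.core.sql.OCommandSQL
    import com.tinkerpop.blueprints.impls.orient.OrientGraphFactory

    // One-off schema setup so that duplicate checks can hit a unique index
    // instead of scanning the whole Person class.
    val factory = new OrientGraphFactory("remote:db-host/mydb", "admin", "admin")
    val graph = factory.getNoTx()
    try {
      graph.command(new OCommandSQL("CREATE CLASS Person EXTENDS V")).execute()
      graph.command(new OCommandSQL("CREATE PROPERTY Person.person_id STRING")).execute()
      graph.command(new OCommandSQL("CREATE INDEX Person.person_id UNIQUE")).execute()
    } finally {
      graph.shutdown()
      factory.close()
    }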

The approaches I tried are:

  1. Calling foreachPartition on the DataFrame and opening a connection to OrientDB inside each partition (because the classes are not thread safe). From there I determine an identifying value for each record, e.g. the 'person_id' column, and query OrientDB (against an index, of course) to see whether such a node already exists; if it does I skip it, otherwise I create it. Roughly what this looks like is sketched after this list. This approach is terribly slow: it took over 15 hours, and I gave up.
  2. Modifying approach 1 to skip the lookup and just insert. It does work, but then all of my data is duplicated, since I can't assign my own ID.
  3. Using the Spark connector for OrientDB, but it messed up my data model, either complaining that classes don't exist or that they already exist.
  4. Looking at the connector's source code, it seems to handle idempotency by first deleting the graph, which is not what I want.
  5. Using batch loading, but that won't work against a remote database, which is a problem since the job runs on a different server.
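
To make approach 1 concrete, here is a stripped-down sketch of what each partition does; the connection string, the Person class and the name property are illustrative, not my exact code:

    import com.tinkerpop.blueprints.impls.orient.OrientGraphFactory
    import org.apache.spark.sql.{DataFrame, Row}

    def loadPersons(df: DataFrame): Unit = {
      df.foreachPartition { rows: Iterator[Row] =>
        // One connection per partition, because the OrientDB graph classes
        // are not thread safe and cannot be shared between tasks.
        val factory = new OrientGraphFactory("remote:db-host/mydb", "admin", "admin")
        val graph = factory.getNoTx()
        try {
          rows.foreach { row =>
            val personId = row.getAs[String]("person_id")
            // Look the identifying value up against the unique index...
            val existing = graph.getVertices("Person.person_id", personId)
            // ...and only create the vertex if it is not there yet.
            if (!existing.iterator().hasNext) {
              graph.addVertex("class:Person",
                "person_id", personId,
                "name", row.getAs[String]("name"))
            }
          }
        } finally {
          graph.shutdown()
          factory.close()
        }
      }
    }

Even for only 40k records this means at least one remote round trip per record just for the existence check, plus another one for every insert.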
 
I know these designs aren't all that great, but I would still expect them to take a couple of hours at most, not 15.

Any advice?
