OrientDB delete performance

Skip to first unread message

larisa kaykov

Jul 4, 2021, 10:15:55 AMJul 4
to OrientDB
I am trying to delete 200K records from the database.
When I use gremlin and use
 graph.traversal().V().has(ID_PROP, P.within(nodesIds)).drop().count().next();

it can take 20 minutes.

If I am trying to delete directly using orient db 
orientGraph.executeSql("DELETE VERTEX V where id in [" + String.join(",", nodesIds) + "]");

It still take a very long time.

Do you know how can I solve the performance issue?

Craig Trader

Jul 4, 2021, 4:25:53 PMJul 4
to OrientDB
  1. Get faster disk drives.
  2. Get more RAM.
  3. Use smaller transactions (eg: don't try to delete 200K records in one shot).
  4. Use the Orient GraphAPI instead of SQL queries.
Seriously, OrientDB is not optimized for deleting records, particularly when those records are nodes (vertices) that may have many edges. Each node record in OrientDB has a set (aka RIDBag) for each type of edge that connects to it, so the process for deleting a node involves going through each of those RIDBags and identifying the opposite node (the linked or linking node) and then removing the original node from *that* RIDBag. If you're using Thick edges (ie: edges with properties, which are stored in their own tables) then you'll also need to delete those records as well.

For example, suppose you want to delete a node that has 10 edges (to 10 other nodes). If you're using thin edges, then you'll end up with a transaction that modifies 10 records and deletes 1 record. If you're using thick edges, then you'll end up with a transaction that modifies 10 records, deletes 10 edge records, and deletes the original node record.

Now multiply that by 200,000.

Now there are things you can do to speed this up:
  1. Start by sorting your list of RIDs (aka Record IDs) so that you're processing the records in table order. 
  2. Split your single RID list into several smaller chunks (say no more than 100 RIDs per chunk) and delete each chunk sequentially. (You don't want to delete the chunks in parallel because you'll end up with contention across the node tables, and you definitely don't want to delete two nodes that have edges to a common node in separate chunks.) 
  3. Use the GraphAPI instead of SQL queries to remove the overhead of SQL processing.

- Craig -
Reply all
Reply to author
0 new messages