Arango newbie trying to understand poor performance


Rob Gratz

Feb 14, 2020, 1:27:49 PM2/14/20
to ArangoDB

I am in the process of evaluating a number of different graph databases for use in an existing application.  This application currently uses Neo4j as the repository, but we are looking at whether a switch would make sense.  As part of the evaluation, we have created a test harness to perform consistent tests across the databases we are evaluating, ArangoDB being one of them.  One of the simple tests adds nodes and edges individually and in batches, since that is how our application interacts with the DB.  What I have found is that ArangoDB is 5-6 times slower than Neo4j in these tests.  I am using the Java driver and doing the inserts through the graph API, not the generic collection API.  I have been following the ArangoDB-provided Java guides for adding the nodes and edges, and using the StreamTransaction API for the batches, so I'm not sure where I could be going too far wrong in my approach.  With that said, I don't see how ArangoDB could be that much slower than Neo4j.

Following are examples of how I am adding the data (this isn't the harness, just an example of how we are adding data).  Any feedback as to how I can improve the performance would be greatly appreciated.

  private void addEdge(ArangoDB arangoDB)
  {
    ArangoGraph graph = arangoDB.db(DATABASE).graph(GRAPH);
    String[] collections = new String[] {"MY_test_edge"};
    
    StreamTransactionEntity tx = graph.db().beginStreamTransaction(
            new StreamTransactionOptions()
            .waitForSync(false)
            .writeCollections(collections));
    EdgeCreateOptions options = new EdgeCreateOptions()
            .streamTransactionId(tx.getId())
            .waitForSync(false);
    
    System.out.println("Transaction collections: " + String.join(",", collections));
    try
    {
      BaseEdgeDocument edge = new BaseEdgeDocument("MY_test_vertex_from1/MY_from_key1", "MY_test_vertex_to/MY_to_key");
      graph.edgeCollection("MY_test_edge").insertEdge(edge, options);

      edge = new BaseEdgeDocument("MY_test_vertex_from2/MY_from_key2", "MY_test_vertex_to/MY_to_key");
      graph.edgeCollection("MY_test_edge").insertEdge(edge, options);
      
      graph.db().commitStreamTransaction(tx.getId());
    }
    catch (Exception e)
    {
      graph.db().abortStreamTransaction(tx.getId());
      throw e;
    }
  }

  private void addNodes(ArangoDB arangoDB)
  {
    ArangoGraph graph = arangoDB.db(DATABASE).graph(GRAPH);
    
    String[] collections = new String[] {"MY_test_vertex_from1", "MY_test_vertex_from2", "MY_test_vertex_to"};
    StreamTransactionEntity tx = graph.db().beginStreamTransaction(
            new StreamTransactionOptions()
            .waitForSync(false)
            .writeCollections(collections));
    VertexCreateOptions options = new VertexCreateOptions()
            .streamTransactionId(tx.getId())
            .waitForSync(false);
    try
    {
      graph.vertexCollection("MY_test_vertex_from1").insertVertex(new BaseDocument("MY_from_key1"), options);
      graph.vertexCollection("MY_test_vertex_from2").insertVertex(new BaseDocument("MY_from_key2"), options);
      graph.vertexCollection("MY_test_vertex_to").insertVertex(new BaseDocument("MY_to_key"), options);
      graph.db().commitStreamTransaction(tx.getId());
    }
    catch (Exception e)
    {
      graph.db().abortStreamTransaction(tx.getId());
      throw e; // rethrow so the caller sees the failure, matching addEdge
    }
  }

Michele Rastelli

Feb 15, 2020, 1:05:47 PM2/15/20
to ArangoDB
What and how are you measuring exactly? What are the numbers that you get? And what is the execution time that you get in Neo4j?
Your code is correct and you should get good performance running it.

I have slightly modified your code to measure the performance; on my machine it takes on average around 1.1 ms to execute. I ran it against a single-instance DB (version 3.6.1-community) in a local Docker container.


Here is the code:

import com.arangodb.ArangoDB;
import com.arangodb.ArangoGraph;
import com.arangodb.entity.BaseDocument;
import com.arangodb.entity.BaseEdgeDocument;
import com.arangodb.entity.EdgeDefinition;
import com.arangodb.entity.StreamTransactionEntity;
import com.arangodb.model.EdgeCreateOptions;
import com.arangodb.model.StreamTransactionOptions;
import com.arangodb.model.VertexCreateOptions;

import java.util.Collections;
import java.util.Date;
import java.util.UUID;

public class Test {
    static String DATABASE = "mydb";
    static String GRAPH = "mygraph";

    public static void main(String[] args) {
        ArangoDB arangoDB = new ArangoDB.Builder()
                .host("localhost", 8529)
                .build();

        if (arangoDB.db(DATABASE).exists()) {
            arangoDB.db(DATABASE).drop();
        }
        arangoDB.db(DATABASE).create();

        arangoDB.db(DATABASE).createCollection("MY_test_vertex_from1");
        arangoDB.db(DATABASE).createCollection("MY_test_vertex_from2");
        arangoDB.db(DATABASE).createCollection("MY_test_vertex_to");
        arangoDB.db(DATABASE).createGraph(GRAPH, Collections.singletonList(new EdgeDefinition()
                .collection("MY_test_edge")
                .from("MY_test_vertex_from1", "MY_test_vertex_from2")
                .to("MY_test_vertex_to")
        ));

        int iterations = 1_000;

        // warmup
        for (int i = 0; i < iterations; i++) {
            String from1 = "from1-" + UUID.randomUUID().toString();
            String from2 = "from2-" + UUID.randomUUID().toString();
            String to = "to-" + UUID.randomUUID().toString();
            addNodes(arangoDB, from1, from2, to);
            addEdge(arangoDB, from1, from2, to);
        }

        long start = new Date().getTime();
        for (int i = 0; i < iterations; i++) {
            String from1 = "from1-" + UUID.randomUUID().toString();
            String from2 = "from2-" + UUID.randomUUID().toString();
            String to = "to-" + UUID.randomUUID().toString();
            addNodes(arangoDB, from1, from2, to);
            addEdge(arangoDB, from1, from2, to);
        }
        long end = new Date().getTime();
        long elapsed = end - start;
        System.out.println("elapsed: " + elapsed + " ms");
        System.out.println("avg: " + (1.0 * elapsed / iterations) + " ms");
        arangoDB.shutdown();
    }

    private static void addEdge(ArangoDB arangoDB, String from1, String from2, String to) {
        ArangoGraph graph = arangoDB.db(DATABASE).graph(GRAPH);
        String[] collections = new String[]{"MY_test_edge"};

        StreamTransactionEntity tx = graph.db().beginStreamTransaction(
                new StreamTransactionOptions()
                        .waitForSync(false)
                        .writeCollections(collections));
        EdgeCreateOptions options = new EdgeCreateOptions()
                .streamTransactionId(tx.getId())
                .waitForSync(false);

        try {
            BaseEdgeDocument edge = new BaseEdgeDocument("MY_test_vertex_from1/" + from1, "MY_test_vertex_to/" + to);
            graph.edgeCollection("MY_test_edge").insertEdge(edge, options);

            edge = new BaseEdgeDocument("MY_test_vertex_from2/" + from2, "MY_test_vertex_to/" + to);
            graph.edgeCollection("MY_test_edge").insertEdge(edge, options);

            graph.db().commitStreamTransaction(tx.getId());
        } catch (Exception e) {
            graph.db().abortStreamTransaction(tx.getId());
            throw e;
        }
    }

    private static void addNodes(ArangoDB arangoDB, String from1, String from2, String to) {
        ArangoGraph graph = arangoDB.db(DATABASE).graph(GRAPH);

        String[] collections = new String[]{"MY_test_vertex_from1", "MY_test_vertex_from2", "MY_test_vertex_to"};
        StreamTransactionEntity tx = graph.db().beginStreamTransaction(
                new StreamTransactionOptions()
                        .waitForSync(false)
                        .writeCollections(collections));
        VertexCreateOptions options = new VertexCreateOptions()
                .streamTransactionId(tx.getId())
                .waitForSync(false);
        try {
            graph.vertexCollection("MY_test_vertex_from1").insertVertex(new BaseDocument(from1), options);
            graph.vertexCollection("MY_test_vertex_from2").insertVertex(new BaseDocument(from2), options);
            graph.vertexCollection("MY_test_vertex_to").insertVertex(new BaseDocument(to), options);
            graph.db().commitStreamTransaction(tx.getId());
        } catch (Exception e) {
            e.printStackTrace();
            graph.db().abortStreamTransaction(tx.getId());
            throw e;
        }
    }
}

Ingo Friepoertner

Feb 19, 2020, 10:17:30 AM2/19/20
to ArangoDB
Hi Rob,

can you please share some more details?
Are you using a local deployment, single server, or a cluster? What numbers are you getting, and for what amount of data?

Rob Gratz

Feb 24, 2020, 8:53:58 AM2/24/20
to ArangoDB

We have written a test harness to evaluate the performance of a number of graph alternatives.  The original snippet of code is not part of the harness, but was an example of how we are adding data through the Java driver.  Because we are currently using Neo4j, that was the initial implementation for the harness.  The test consists of adding 2M nodes using batches/transactions of 500.  Tests are being run initially on dev laptops, with the end goal of running all of the tests on a single, more powerful environment.  With Neo4j, we are able to add the 2M nodes in roughly 6-7 minutes.  With ArangoDB, we are in the neighborhood of 35-40 minutes for the same data, so as you can see, this is a pretty dramatic difference.

This is a single arangodb instance running in a docker container.  Here is the docker-compose file being used:  

version: '3.7'
services:
  arangodb_db_container:
    image: arangodb:latest
    environment:
      ARANGO_ROOT_PASSWORD: rootpassword
    ports:
      - 8529:8529
    volumes:
      - arangodb_data_container:/var/lib/arangodb3
      - arangodb_apps_data_container:/var/lib/arangodb3-apps

volumes:
  arangodb_data_container:
  arangodb_apps_data_container:


The data has 6 different node types, with each type having between 4 and 7 fields. All of the fields being added are indexed, with one field indexed as unique.
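For context, every secondary index adds work per insert, so an index setup like this directly affects ingest speed. A hedged sketch of how such indexes could be created up front with the Java driver (the collection and field names here are invented placeholders, not from the actual harness):

```java
import com.arangodb.ArangoCollection;
import com.arangodb.ArangoDB;
import com.arangodb.model.PersistentIndexOptions;

import java.util.Arrays;
import java.util.List;

public class CreateIndexes {
    // Pure helper: the fields to index for one (hypothetical) node type.
    static List<String> indexedFields() {
        return Arrays.asList("name", "type", "createdAt");
    }

    public static void main(String[] args) {
        ArangoDB arangoDB = new ArangoDB.Builder().host("localhost", 8529).build();
        ArangoCollection col = arangoDB.db("mydb").collection("MY_test_vertex_from1");

        // One non-unique persistent index per field...
        for (String field : indexedFields()) {
            col.ensurePersistentIndex(Arrays.asList(field), new PersistentIndexOptions());
        }
        // ...and one unique index, e.g. on an external id.
        col.ensurePersistentIndex(Arrays.asList("externalId"),
                new PersistentIndexOptions().unique(true));

        arangoDB.shutdown();
    }
}
```

Each extra index means one more tree write per inserted document, so it can be worth benchmarking with and without the unique index.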

Frank Celler

Feb 24, 2020, 9:18:17 AM2/24/20
to ArangoDB
Hi Rob,

thanks a lot for the details. We will create a similar test environment.

best Frank

Frank Celler

Feb 26, 2020, 6:14:23 AM2/26/20
to ArangoDB

> We have written a test harness to evaluate performance of a number of graph alternatives.
> The original snippet of code is not part of the harness, but was an example of how we are
> adding data through the java driver.  Because we are currently using neo4j, that was the
> initial implementation for that harness.  The test consists of adding 2M nodes using
> batches/transactions of 500.  Tests are being run initially on dev laptops with the end goal
> of running all of the tests on a single, more powerful environment.  With neo, we are able
> to add the 2M nodes in roughly 6-7 minutes.  With Arangodb, we are in the neighborhood
> of 35-40 minutes for the same data so as you can see, this is a pretty dramatic difference.


Hi Rob,


there are different approaches to improve the performance considerably.


(1) Single Document Operation


The initial program you provided uses a single document operation for each vertex and edge inserted. It is based on the synchronous driver, so it will not run in parallel.


To make better use of the server, you can use threads in Java to create the vertices and edges in parallel. Also, in this example program, we raised the transaction size to 500.


You can find Michele’s version here: https://gist.github.com/rashtao/831c7e0281314789a2e2b57e8b3bfe67

This is just a proof of concept and not production code quality.


With this setup we get on a laptop:


1200000 vertexes
800000 edges
elapsed: 195275 ms
10241 insertions/s


This is roughly 10x faster than your numbers. Obviously, this is not the same test environment, but the laptop we used is not the fastest.
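As a rough illustration of this parallel approach (this is not Michele's gist; the pool size, database name, and collection name are assumptions made for the sketch), a fixed thread pool driving single-document inserts could look like:

```java
import com.arangodb.ArangoCollection;
import com.arangodb.ArangoDB;
import com.arangodb.entity.BaseDocument;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelInsert {
    // Pure helper: per-worker unique keys, so threads never collide.
    static String key(int worker, int i) {
        return "w" + worker + "-" + i;
    }

    public static void main(String[] args) throws InterruptedException {
        ArangoDB arangoDB = new ArangoDB.Builder()
                .host("localhost", 8529)
                .maxConnections(8) // roughly one connection per worker thread
                .build();
        ArangoCollection col = arangoDB.db("mydb").collection("MY_test_vertex_from1");

        int workers = 8;
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int t = 0; t < workers; t++) {
            final int worker = t;
            pool.submit(() -> {
                for (int i = 0; i < 250_000; i++) { // 8 x 250k = 2M documents
                    col.insertDocument(new BaseDocument(key(worker, i)));
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        arangoDB.shutdown();
    }
}
```

The synchronous driver is thread-safe, so the parallelism comes entirely from running several blocking request loops at once.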


The drawback of this approach is the large number of round trips between the client and the server. To make full use of batching, the following approach will help.


(2) Insert using AQL


You can use AQL to insert batches of vertices and edges. Michele’s example program can be found here: https://gist.github.com/rashtao/5b72b6187d1a6b50aa129a9f3c5fb2ef


With this version we reach the following numbers (on the same laptop as above):


1200000 vertexes
800000 edges
elapsed: 72617 ms
27541 insertions/s


That is a factor of ~2.5 faster than the previous approach. With this setup, you can import 2 million documents in 1 min 12 sec.


Please note that if you use much larger transaction sizes, you should enable intermediate commits; see https://www.arangodb.com/docs/3.6/transactions-limitations.html#rocksdb-storage-engine
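The AQL batching idea can be sketched as follows. This is a simplified illustration, not the linked gist: the database and collection names are placeholders, the batch contents are invented, and edges would work the same way with `_from`/`_to` attributes in each document.

```java
import com.arangodb.ArangoDB;
import com.arangodb.ArangoDatabase;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AqlBatchInsert {
    static final int BATCH_SIZE = 500;

    // Pure helper: build one batch of plain maps to bind as @docs.
    static List<Map<String, Object>> buildBatch(int start, int size) {
        List<Map<String, Object>> docs = new ArrayList<>(size);
        for (int i = start; i < start + size; i++) {
            Map<String, Object> d = new HashMap<>();
            d.put("_key", "v-" + i);
            d.put("value", i);
            docs.add(d);
        }
        return docs;
    }

    public static void main(String[] args) {
        ArangoDB arangoDB = new ArangoDB.Builder().host("localhost", 8529).build();
        ArangoDatabase db = arangoDB.db("mydb"); // assumed database name

        for (int start = 0; start < 2_000_000; start += BATCH_SIZE) {
            Map<String, Object> bindVars = new HashMap<>();
            bindVars.put("docs", buildBatch(start, BATCH_SIZE));
            // One round trip inserts the whole batch.
            db.query("FOR d IN @docs INSERT d INTO MY_test_vertex_from1", bindVars, null, Void.class);
        }
        arangoDB.shutdown();
    }
}
```

Each AQL query runs as a single transaction on the server, which is why a batch of 500 behaves like one transaction of 500 inserts.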


(3) Batch Generation of Documents


We also provide a specialized API for inserting batches of documents. This can be used as an alternative to (2) and allows you to gain even more performance. For example, see https://gist.github.com/rashtao/22a43ba5233669d610eca65e06bc7b87

 

This gives


1200000 vertexes
800000 edges
elapsed: 64428 ms
31042 insertions/s


This is slightly faster than using AQL.
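Assuming the batch API referred to here is `ArangoCollection.insertDocuments`, a minimal sketch (database, collection, and document contents are placeholders, not from the gist) would be:

```java
import com.arangodb.ArangoCollection;
import com.arangodb.ArangoDB;
import com.arangodb.entity.BaseDocument;

import java.util.ArrayList;
import java.util.List;

public class BatchDocInsert {
    static final int BATCH_SIZE = 500;

    // Pure helper: build one batch of documents.
    static List<BaseDocument> buildBatch(int start, int size) {
        List<BaseDocument> docs = new ArrayList<>(size);
        for (int i = start; i < start + size; i++) {
            BaseDocument d = new BaseDocument("v-" + i);
            d.addAttribute("value", i);
            docs.add(d);
        }
        return docs;
    }

    public static void main(String[] args) {
        ArangoDB arangoDB = new ArangoDB.Builder().host("localhost", 8529).build();
        ArangoCollection col = arangoDB.db("mydb").collection("MY_test_vertex_from1");

        for (int start = 0; start < 2_000_000; start += BATCH_SIZE) {
            col.insertDocuments(buildBatch(start, BATCH_SIZE)); // one HTTP call per batch
        }
        arangoDB.shutdown();
    }
}
```

Compared to the AQL variant, this skips query parsing on the server, which would explain the slight edge in throughput.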


(4) Import


There is also a special API for bulk imports. However, this does not support transactions (see https://www.arangodb.com/docs/stable/http/bulk-imports.html). We can provide more details if required.
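In the Java driver, this bulk import endpoint is exposed as `ArangoCollection.importDocuments`. A hedged sketch (collection name and payload are illustrative assumptions):

```java
import com.arangodb.ArangoCollection;
import com.arangodb.ArangoDB;
import com.arangodb.entity.BaseDocument;
import com.arangodb.entity.DocumentImportEntity;
import com.arangodb.model.DocumentImportOptions;

import java.util.ArrayList;
import java.util.List;

public class BulkImport {
    // Pure helper: build the documents to import (hypothetical payload).
    static List<BaseDocument> buildDocs(int count) {
        List<BaseDocument> docs = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            docs.add(new BaseDocument("imp-" + i));
        }
        return docs;
    }

    public static void main(String[] args) {
        ArangoDB arangoDB = new ArangoDB.Builder().host("localhost", 8529).build();
        ArangoCollection col = arangoDB.db("mydb").collection("MY_test_vertex_from1");

        // One import call; note that imports are not transactional.
        DocumentImportEntity result = col.importDocuments(buildDocs(10_000),
                new DocumentImportOptions().onDuplicate(DocumentImportOptions.OnDuplicate.ignore));
        System.out.println("created: " + result.getCreated());

        arangoDB.shutdown();
    }
}
```

Because there is no transaction, a partially failed import leaves the already-imported documents in place; the returned entity reports created/errors counts so you can reconcile afterwards.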


(5) Outlook


We are also working on the next version of the Java driver. This will be reactive and non-blocking on the network side. It will use fewer threads and will do auto-tuning of the parallelism in the client. This will make (1) even easier to implement.


If you have any further questions please do not hesitate to ask. Alternatively, we can set up a call to discuss the various options.

  Michele & Frank


Rob Gratz

Feb 26, 2020, 9:26:22 AM2/26/20
to ArangoDB

If you extrapolate your results across 2M records, you end up with roughly what I'm getting, about 35+ minutes.  I am adding the same data through Neo4j and it takes roughly 6-7 minutes.

Ingo Friepoertner

Feb 26, 2020, 9:55:29 AM2/26/20
to ArangoDB
Hi Rob,

as mentioned above by Frank and Michele, your initial approach used single document operations that cannot run in parallel (using the synchronous driver).
All options stated above are based on 2M records:

Option 1: using Java threads and transaction size 500: 3 min 15 sec.
Option 2: using AQL: 1 min 12 sec.
Option 3: using batches of documents: 1 min 5 sec.

Your mileage will vary, this was a test on a local machine.