AWS performance question

Curt Kohler

Jun 8, 2016, 10:34:30 AM

to OrientDB

I've been asked to kick the tires on OrientDB as a possible graph DB solution for an upcoming project at my company. To do so, I've spun up an EC2 instance using the OrientDB marketplace AMI on an m4.xlarge box with an EBS drive (picked as a general-purpose box since I couldn't find any hardware recommendations in the documentation or through searches). I've got the database running, but when I try to use the ETL bulk import tools with CSV files, I'm seeing what I consider very poor performance compared to the claims I have read. The best I've seen is about 5K records/second loaded. The existing documentation leaves a bit to be desired, so I was hoping someone might be able to offer some insight.

Here are some details (I've scaled things back trying to understand where I may have gone wrong).

One file of 2 million records with two columns (record key and text field), e.g. ABC\tString here
A class schema was predefined outside the ETL config script with those two fields and an index on the id field
This ETL script, based on one in the documentation, is what I am running on the EC2 box (I am using a remote: connection because the project will consist of a distributed DB, even though both are on the same box right now)

{
    "source":{
        "file":{
            "path":"/user/poc1_Datasets/organization.tsv"
        }
    },
    "extractor":{
        "row":{

        }
    },
    "transformers":[
        {
            "csv":{
               "separator": "\t"
            }
        },
        {
            "vertex":{
                "class":"Organization"
            }
        }
    ],
    "loader":{
        "orientdb":{
            "dbURL":"remote:localhost/DataSpine1",
            "dbType":"graph",
            "wal":false,
            "tx":false,
            "batchCommit":25000
        }
    }
}

The final output of the ETL loader in this case was:

END ETL PROCESSOR
+ extracted 1,822,150 rows (3,904 rows/sec) - 1,822,150 rows -> loaded 1,822,149 vertices (3,907 vertices/sec) Total time: 520411ms [0 warnings, 0 errors]
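As a quick sanity check on the summary line above, the average rate can be recomputed from the row count and total time it reports. This is only a minimal sketch, and the function name is made up for illustration. The result comes out lower than the per-second figure printed in the log, which may be measured over a different window (an assumption; the log itself does not say).

```python
def average_rate(count: int, elapsed_ms: int) -> float:
    """Average items per second over the whole run."""
    return count / (elapsed_ms / 1000.0)


# Figures copied from the ETL summary line above.
rows_extracted = 1_822_150
total_time_ms = 520_411

rate = average_rate(rows_extracted, total_time_ms)
print(f"{rate:,.0f} rows/sec averaged over the full run")
```

Averaged over the whole run, that works out to roughly 3,500 rows/sec.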

Does using the remote: protocol really hurt performance that much? I believe the AMI has configured the data to sit on the EBS drive. Should I try to find an instance type that would use local ephemeral storage instead?

Any insights you could provide would be appreciated.

Curt

Curt Kohler

Jun 20, 2016, 9:56:11 AM

to OrientDB

I spent a little additional time over the past few weeks trying different variants of this basic setup, with little success in improving the performance. I'm posting my results in case anyone else comes along later looking for posts on this subject.

To see whether the networked EBS drives were the bottleneck, I ran on a single r3.large instance and saw basically the same throughput across the various vertices and edges I was attempting to load. When I switched from remote to plocal, I saw approximately a 7X performance increase in loading the vertices. Unfortunately, in our envisioned scenario, running in plocal mode is likely not feasible.

Loading the edges was a different story altogether. Based on our data flows, the edges were extracted from our data separately from the vertices, so we had to load them after populating the vertices in the DB. As a result, I assume the ETL loader had to run 2 queries (to convert our native record ids into RIDs) before it could actually add the edge to the graph. Running version 2.2.0 of the software had the ETL tool throwing errors while processing our file (a move to 2.2.2 eventually solved the issue). When we were finally able to run the files successfully, we were seeing throughput in the range of about 150 edges/sec (running with one thread). We also wrote a simple Apache Spark driver program using the Java Graph API and were able to run parallel record streams and reach a load rate of approximately 1,000 edges/sec before errors started showing up in our loader.
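To make the per-edge work described above concrete, here is a rough in-memory sketch: each edge row needs both endpoint keys resolved to RIDs before the edge can be created. The dictionary stands in for the index on the id field, and the function name, keys, and RID values are all hypothetical. It does not use any OrientDB API; over a remote: connection each lookup would presumably be a separate round trip, which is the assumption made in the post.

```python
from typing import Dict, List, Tuple


def load_edges(
    key_to_rid: Dict[str, str],
    edge_rows: List[Tuple[str, str]],
) -> Tuple[List[Tuple[str, str]], int]:
    """Resolve both endpoint keys to RIDs, then record the edge.

    Returns the edges created and the number of index lookups performed.
    """
    edges: List[Tuple[str, str]] = []
    lookups = 0
    for from_key, to_key in edge_rows:
        # Two lookups per edge row: one for each endpoint.
        from_rid = key_to_rid.get(from_key)
        to_rid = key_to_rid.get(to_key)
        lookups += 2
        if from_rid is None or to_rid is None:
            continue  # skip rows whose endpoints were never loaded
        edges.append((from_rid, to_rid))
    return edges, lookups


# Hypothetical data: the dict plays the role of the index on the id field.
index = {"ABC": "#12:0", "DEF": "#12:1"}
rows = [("ABC", "DEF"), ("ABC", "XYZ")]
created, lookup_count = load_edges(index, rows)
print(created, lookup_count)
```

The point of the sketch is only that every edge costs two lookups plus one write, so the edge rate will naturally sit well below the vertex rate.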

Francisco Reyes

Jun 21, 2016, 12:12:33 AM

to OrientDB

On Monday, June 20, 2016 at 9:56:11 AM UTC-4, Curt Kohler wrote:

eventually solved the issue). When we were finally able to run the files successfully, we were seeing throughput in the range of about 150 edges/sec (running with one thread).

Curt,

New OrientDB user here, but I was wondering if you checked iostat to see whether it was an issue with the disk subsystem. Also, is the disk an SSD? Is it using provisioned IOPS?

Curt Kohler

Jun 24, 2016, 11:57:58 AM

to OrientDB

Sorry, I should have been more explicit. I moved over to the r3 instance types to use the attached SSD ephemeral drives instead of the networked EBS drive, to take possible network issues out of the picture. I didn't notice anything specific in iostat when running the loads.

Luca Garulli

Jun 24, 2016, 12:33:52 PM

to OrientDB

Hi guys,

A couple of weeks ago we created an internal division at OrientDB to take care of AWS (and other clouds). Soon we will publish some metrics about OrientDB and Amazon AWS server configurations, so it will be much easier to choose the right hw/sw configuration for your workload.

Back to your first question: I think the ETL is slow because it does not run in parallel. Have you tried the "parallel" option?
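For anyone wanting to try this, the sketch below adds the flag to an ETL configuration programmatically. It assumes the option is read from a top-level "config" block as "parallel": true, which should be checked against the ETL documentation for the version in use. The helper name is invented for illustration, and the base config is a trimmed copy of the one posted earlier in the thread.

```python
import json


def with_parallel(etl_config: dict, enabled: bool = True) -> dict:
    """Return a copy of an ETL config with the parallel flag set.

    ASSUMPTION: the flag lives in a top-level "config" block.
    """
    updated = dict(etl_config)
    config_block = dict(updated.get("config", {}))
    config_block["parallel"] = enabled
    updated["config"] = config_block
    return updated


# Trimmed copy of the configuration from the first post.
base = {
    "source": {"file": {"path": "/user/poc1_Datasets/organization.tsv"}},
    "extractor": {"row": {}},
    "transformers": [
        {"csv": {"separator": "\t"}},
        {"vertex": {"class": "Organization"}},
    ],
    "loader": {
        "orientdb": {
            "dbURL": "remote:localhost/DataSpine1",
            "dbType": "graph",
            "wal": False,
            "tx": False,
            "batchCommit": 25000,
        }
    },
}

print(json.dumps(with_parallel(base), indent=4))
```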

Best Regards,

Luca Garulli

Founder & CEO

OrientDB LTD


Francisco Reyes

Jun 24, 2016, 2:28:07 PM

to OrientDB

On Friday, June 24, 2016 at 12:33:52 PM UTC-4, l.garulli wrote:

A couple of weeks ago we created an internal division at OrientDB to take care of AWS (and other clouds).

Although AWS is likely the biggest provider, there are lots of others. Will the information and recommendations from this new division be generic enough to be used with other providers?

Curt Kohler

Jun 27, 2016, 10:14:15 AM

to OrientDB

Luca,

Thanks for taking the time to reply. In answer to your question, yes and no. I was running the ETL tool on an instance that only had 2 cores, so there was really only one core available for the tool to use (hence the 150/sec result for one thread). I also wrote a simple Spark-based loading program (using OrientGraphNoTx and setting the intent for massive insert) and ran it as a job on my AWS Spark cluster for easily controllable parallelization. I was able to run up to 8 worker nodes (basically 8 threads) before I started seeing exceptions come back from the calls, for a load rate of about 1,042 recs/sec (approx. 130 recs/sec/thread). I should note this rate was for creating edges between existing vertices from a file that had our internal ids for the nodes. The code had to look up the RIDs based on those keys (which had an index on them) and then create the link (basically the same work that our ETL config file was set up to do on our earlier runs).
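The per-thread figure above can be reproduced with a couple of lines of arithmetic. This is just a sketch using the numbers quoted in this thread. Comparing against the single-threaded ETL rate assumes the two loaders do comparable work per edge, which is only approximately true.

```python
# Numbers quoted in the thread.
total_rate = 1042          # edges/sec with the Spark-based loader
workers = 8                # worker nodes, roughly one thread each
single_thread_rate = 150   # edges/sec from the single-threaded ETL run

per_thread = total_rate / workers
# Fraction of ideal linear scaling, relative to the single-thread ETL rate.
scaling_efficiency = total_rate / (workers * single_thread_rate)

print(f"{per_thread:.2f} edges/sec per thread")
print(f"{scaling_efficiency:.0%} of linear scaling")
```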

Glad to hear that you are going to provide some guidance on cloud deployment recommendations. Having that type of info would have been helpful during this exercise.

Curt
