I've been asked to kick the tires on OrientDB as a possible graph DB solution for an upcoming project at my company. To do so, I've spun up an EC2 instance from the OrientDB marketplace AMI on an m4.xlarge box with an EBS drive (picked as a general-purpose box, since I couldn't find any hardware recommendations in the documentation or through searches). I've got the database running, but when I try to use the ETL bulk import tools with CSV files, I'm seeing what I consider very poor performance compared to the claims I have read. The best I've seen is about 5K records/second loaded. The existing documentation leaves a bit to be desired, so I was hoping someone might be able to offer some insight.
Here are some details (I've scaled things back trying to understand where I may have gone wrong).
- One file of 2 million records with two tab-separated columns (record key and text field), e.g. ABC\tString here
- A class schema was predefined outside the ETL config script with those two fields and an index on the id field
- This ETL script, based on one in the documentation, which I am running on the EC2 box (I am using a remote: connection because the project will consist of a distributed DB, even though the loader and the server are on the same box right now)
The final output of the ETL loader in this case was:
END ETL PROCESSOR
+ extracted 1,822,150 rows (3,904 rows/sec) - 1,822,150 rows -> loaded 1,822,149 vertices (3,907 vertices/sec) Total time: 520411ms [0 warnings, 0 errors]
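As a sanity check on that summary line, here is a quick sketch in Python that derives the average throughput from the row count and total time reported above. The function name is just for illustration. The result comes out at roughly 3,500 rows/sec, a bit below the 3,904 rows/sec printed by the loader. I'm assuming the loader's figure covers a recent interval rather than the whole run, but I haven't confirmed how it is calculated.

```python
# Figures copied from the ETL summary line above.
rows_extracted = 1_822_150
total_time_ms = 520_411


def rows_per_second(rows: int, elapsed_ms: int) -> float:
    """Average throughput over the whole run (illustrative helper)."""
    return rows / (elapsed_ms / 1000.0)


average = rows_per_second(rows_extracted, total_time_ms)
print(f"average throughput: {average:,.0f} rows/sec")
print(f"total run time: {total_time_ms / 60_000:.1f} minutes")
```

Either way, the run took well over eight minutes for under 2 million two-column rows, which is the number I'm trying to improve.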
Does using the remote: protocol really hurt performance that much? I believe the AMI has configured the data to sit on the EBS drive. Should I try to find an instance type that can use local ephemeral storage instead?
Any insights you could provide would be appreciated.