I've been asked to kick the tires on OrientDB as a possible graph DB solution for an upcoming project at my company. To do so, I've spun up an EC2 instance using the OrientDB marketplace AMI on an m4.xlarge box with an EBS drive (picked as a general-purpose box, since I couldn't find any hardware recommendations in the documentation or through searches). I've got the database running, but when I try to use the ETL bulk import tools with CSV files, I'm seeing what I consider very poor performance compared to the claims I have read. The best I've seen is about 5K records/second loaded. The existing documentation leaves a bit to be desired, so I was hoping someone might be able to offer some insight.
Here are some details (I've scaled things back trying to understand where I may have gone wrong).
- One file of about 2 million records with two tab-separated columns (record key and text field), e.g. ABC\tString here
- A class schema was predefined outside the ETL config script with those two fields and an index on the id field
- This ETL script, based on one in the documentation, which I am running on the EC2 box itself (I am using a remote: connection because the project will eventually use a distributed DB, even though the client and server are on the same box right now)
{
"source":{
"file":{
"path":"/user/poc1_Datasets/organization.tsv"
}
},
"extractor":{
"row":{
}
},
"transformers":[
{
"csv":{
"separator": "\t"
}
},
{
"vertex":{
"class":"Organization"
}
}
],
"loader":{
"orientdb":{
"dbURL":"remote:localhost/DataSpine1",
"dbType":"graph",
"wal":false,
"tx":false,
"batchCommit":25000
}
}
}
The final output of the ETL loader in this case was:
END ETL PROCESSOR
+ extracted 1,822,150 rows (3,904 rows/sec) - 1,822,150 rows -> loaded 1,822,149 vertices (3,907 vertices/sec) Total time: 520411ms [0 warnings, 0 errors]
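As a sanity check on that summary line, here is a minimal Python sketch that recomputes the whole-run average from the row count and total time. It assumes the log format is exactly as shown above, and the function name average_rate is just something I made up for illustration. The result comes out somewhat lower than the rate printed in parentheses, which may be measured over a different window (that part is a guess).

```python
import re

def average_rate(summary_line: str) -> float:
    """Return overall rows/sec from an ETL summary line.

    The line format is assumed from the output pasted above;
    this is not an official OrientDB log parser.
    """
    rows = re.search(r"extracted ([\d,]+) rows", summary_line)
    millis = re.search(r"Total time: (\d+)ms", summary_line)
    if rows is None or millis is None:
        raise ValueError("unrecognized summary line")
    row_count = int(rows.group(1).replace(",", ""))  # drop thousands separators
    seconds = int(millis.group(1)) / 1000.0          # milliseconds to seconds
    return row_count / seconds

line = ("+ extracted 1,822,150 rows (3,904 rows/sec) - 1,822,150 rows -> "
        "loaded 1,822,149 vertices (3,907 vertices/sec) Total time: 520411ms "
        "[0 warnings, 0 errors]")
print(f"{average_rate(line):.0f} rows/sec averaged over the whole run")
```

Either way, the load is running at roughly 3,500 to 3,900 rows/sec, which is the number I'm trying to improve.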
Does using the remote: protocol really hurt performance that much? I believe the AMI has configured the data to sit on the EBS drive. Should I try to find an instance type that would let me use local ephemeral storage instead?
Any insights you could provide would be appreciated.
Curt