In my OrientDB-based application, I need to do an INSERT-IF-NOT-EXISTS operation using the Java (TinkerPop) API.
I have created a vertex type "Identifier." It has a single property, "identifier," which contains a URI (effectively a String for purposes of this discussion).
I have also created an index like this:
ParametersBuilder builder=new ParametersBuilder();
builder.add("class", "Identifier");
builder.add("type", "UNIQUE_HASH_INDEX");
graph.createKeyIndex("identifier", Vertex.class, builder.build());
Then I perform the INSERT-IF-NOT-EXISTS operation in a loop like this. The snippet uses the Google Guava library and is obviously a simplification of our real application:
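int n=10000;
for (int i=0; i<n; i++)
{
    // Look up the vertex by its "identifier" property
    Iterable<Vertex> vertices=graph.getVertices("identifier", myUriStr);
    // The two-argument overload returns null for an empty result;
    // the one-argument overload would throw NoSuchElementException instead
    Vertex vertex=Iterables.getOnlyElement(vertices, null);
    if (null==vertex)
    {
        // Create vertex
        ...
    }
    // Use vertex
    ...
}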
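To make the intent of the loop concrete, here is a minimal sketch of the same get-or-create pattern in plain Java. It is not OrientDB code: a HashMap stands in for the unique hash index, and every name in it (GetOrCreateSketch, getOrCreate, createdCount, the urn:example: URIs) is hypothetical. It only shows the behaviour I expect, namely a constant-time lookup followed by a create on a miss.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for the graph: a HashMap plays the role of the unique hash index
// on "identifier". None of this is OrientDB API; all names are hypothetical.
class GetOrCreateSketch {
    private final Map<String, Integer> index = new HashMap<>();
    private int nextId = 0;
    private int created = 0;

    // Returns the id stored for the URI, creating an entry only on a miss.
    int getOrCreate(String uri) {
        Integer existing = index.get(uri); // expected O(1) hash lookup
        if (existing == null) {
            existing = nextId++;
            index.put(uri, existing);
            created++;
        }
        return existing;
    }

    // Number of entries that were actually created.
    int createdCount() {
        return created;
    }

    public static void main(String[] args) {
        GetOrCreateSketch graph = new GetOrCreateSketch();
        // 10,000 lookups over 100 distinct URIs: only the first hit per URI creates.
        for (int i = 0; i < 10000; i++) {
            graph.getOrCreate("urn:example:" + (i % 100));
        }
        System.out.println(graph.createdCount()); // prints 100
    }
}
```

With a working unique index I would expect the cost per iteration to stay roughly flat as the number of stored identifiers grows, as it does in this stand-in.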
What I am seeing is that the throughput of this loop rapidly diminishes as more vertices are added, like this (with the throughput relative to the n=1,000 baseline):
n=10,000 throughput=16.5%
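For clarity on how the percentage is derived, here is a small sketch, assuming "throughput" means loop iterations per second and that the figure is the ratio to the n=1,000 run. The class and method names are hypothetical, and the timings in main are made-up placeholders for illustration, not my measurements.

```java
// Sketch of the relative-throughput calculation; names are hypothetical.
class RelativeThroughput {

    // Operations completed per second of wall-clock time.
    static double opsPerSecond(int operations, double elapsedSeconds) {
        return operations / elapsedSeconds;
    }

    // Throughput of a run expressed as a percentage of the baseline run.
    static double relativePercent(int n, double elapsedSeconds,
                                  int baselineN, double baselineElapsedSeconds) {
        return 100.0 * opsPerSecond(n, elapsedSeconds)
                / opsPerSecond(baselineN, baselineElapsedSeconds);
    }

    public static void main(String[] args) {
        // Placeholder timings only: 1,000 iterations in 2 s, 10,000 in 40 s.
        double percent = relativePercent(10000, 40.0, 1000, 2.0);
        System.out.printf("%.1f%%%n", percent);
    }
}
```

If the per-iteration cost were constant, this ratio would stay near 100% as n grows.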
This drop-off obviously suggests that indexing is not working, so I ran a SQL EXPLAIN command, which produced this output:
documentReads=1
fullySortedByIndex=false
documentAnalyzedCompatibleClass=1
recordReads=1
fetchingFromTargetElapsed=0
indexIsUsedInOrderBy=false
compositeIndexUsed=1
involvedIndexes=[Identifier.identifier]
limit=-1
evaluated=1
user=#5:0
elapsed=2.387001
resultType=collection
resultSize=1
The documentation at http://orientdb.com/docs/master/SQL-Explain.html does not seem to be 100% current on how to interpret the output of the EXPLAIN command, but my interpretation is that the query did recognize and use the index that I created.
I also tried some profiling (with JProfiler) and see a hot spot at com.tinkerpop.blueprints.impls.orient.OrientElementIterator.hasNext.
All of this is with OrientDB running in embedded mode, on a fairly high-end Linux machine and with a fresh, empty database at the beginning of each test.
I have to believe I am doing something wrong to see such a rapid drop-off in query performance at such relatively small data volumes.
I have been struggling with this for several days off-and-on now and it's time to ask for help. Has anyone else encountered a similar issue? What can I do to address this?
Thanks in advance!
-- John