Issue with duplicate vertices with unique index not being indexed

149 views
Skip to first unread message

Kevin Schmidt

unread,
Jul 22, 2016, 6:01:34 PM7/22/16
to aureliu...@googlegroups.com
Using Titan 1.0.0 with Cassandra, I have a scenario where multiple threads are adding vertices to a graph, each vertex having a property with a unique index on it.  If requests are made non-concurrently, a unique constraint violation occurs when an attempt is made to create a duplicate vertex.

If concurrent requests are made to create a duplicate vertex though, the constraint violation is not caught and both vertices get created.  Now I believe this is expected behavior and the only solution is to also use locks, is that correct?  But locks would slow things down ...

In any case, that isn't my real question or issue.  My issue is that when duplicate vertices are created, looking up the vertex only retrieves one of them if the index is used.  The only way to retrieve both vertices (and thus know a duplicate exists) is to do a query/traversal that doesn't use the index.  But doing so on a large graph would be slow.

For example, after the duplicate vertices have been created a traversal without the index returns both vertices (but warns about iterating over all vertices):

gremlin> g.V().has('myId','2');
14:53:17 WARN  com.thinkaurelius.titan.graphdb.transaction.StandardTitanTx  - Query requires iterating over all vertices [(myId = 2)]. For better performance, use indexes
==>v[286724096]
==>v[245764144]

But using the index returns just one of the vertices:

gremlin> g.V().hasLabel('Person').has('myId','2');
==>v[245764144]


Is this correct behavior or is there a bug?  Why wouldn't both vertices be returned when using the index?

If it is correct behavior, how can I detect the duplicate vertex without traversing over all vertices?

Thanks,

Kevin

Kevin Schmidt

unread,
Jul 25, 2016, 11:01:05 AM7/25/16
to Aurelius
Here is some example Gremlin code to recreate the issue.

First, create the label/property/index.

graph = TitanFactory.open('titan10.properties');
g = graph.traversal();
mgmt = graph.openManagement();

carLabel = mgmt.makeVertexLabel('car').make();
modelKey = mgmt.makePropertyKey('model').dataType(String.class).make();
mgmt.buildIndex('byModelU',Vertex.class).addKey(modelKey).unique().indexOnly(carLabel).buildCompositeIndex();
mgmt.commit();

Next, add a vertex but don't commit:

gremlin> v = graph.addVertex('car');

==>v[4296]

gremlin> v.property('model','Boxster');

==>vp[model->Boxster]


Then, in a separate Gremlin shell open the graph using the same properties file and add a vertex with the same key and commit and do an indexed and non-indexed lookup:

gremlin> v = graph.addVertex('car');
==>v[4168]
gremlin> v.property('model','Boxster');
==>vp[model->Boxster]
gremlin> graph.tx().commit();
==>null
gremlin> g.V().has('model','Boxster');
==>v[4168]
gremlin> g.V().hasLabel('car').has('model','Boxster');
==>v[4168]

Now in the first Gremlin shell, commit the transaction and do the same lookup:

gremlin> graph.tx().commit();
==>null
gremlin> g.V().has('model','Boxster');
==>v[4168]
==>v[4296]
gremlin> g.V().hasLabel('car').has('model','Boxster');
==>v[4296]

Observe the non-indexed lookup returns both vertices while the indexed returns just the one committed from this first shell.

Doing the same lookups from the second shell after the commit in the first shell:

gremlin> g.V().has('model','Boxster');
==>v[4168]
==>v[4296]
gremlin> g.V().hasLabel('car').has('model','Boxster');
==>v[4168]

It also returns both vertices without the index, but with the index returns the vertex created in the second shell.  However, upon doing a rollback, it then returns the vertex from the first transaction:

gremlin> g.tx().rollback();
==>null
gremlin> g.V().hasLabel('car').has('model','Boxster');
==>v[4296]

The issue is that there will be edges created to these vertices in the transactions so doing an indexed lookup is going to miss some of them as only one vertex (either one) is returned.

Stephen Mallette

unread,
Jul 29, 2016, 8:47:26 AM7/29/16
to Aurelius
Now I believe this is expected behavior and the only solution is to also use locks, is that correct?  But locks would slow things down ...

yes - i think you would want locks and yes it would slow stuff down. locks should be used judiciously.

My issue is that when duplicate vertices are created, looking up the vertex only retrieves one of them if the index is used.  The only way to retrieve both vertices (and thus know a duplicate exists) is to do a query/traversal that doesn't use the index.  But doing so on a large graph would be slow.

I don't know if that's expected behavior or not, though I presume that if you define something as unique it would be weird for it to return more than one vertex so perhaps titan is making some assumptions there about making sure than just one vertex "wins" in the case of unique. I don't think I've ever done a unique index without a lock which is why i've probably never run into that problem.

I think your only real option for finding these duplicates on a large graph would be to bust some olap traversals over it. 

--
You received this message because you are subscribed to the Google Groups "Aurelius" group.
To unsubscribe from this group and stop receiving emails from it, send an email to aureliusgraph...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/aureliusgraphs/CAJFiCb_GdStXL%3Dd0oxiOYoGo_yR%2BiYoo8v-eKaMv9ZjNNH5dvQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Kevin Schmidt

unread,
Jul 29, 2016, 8:53:38 AM7/29/16
to aureliu...@googlegroups.com
Thanks for the reply.

We are trying locks now to see how that changes behavior, but are also considering making the index not unique and just changing our traversals to recognize there may be multiple vertices and handle it appropriately, and perhaps "merge" the vertices when multiple are discovered.

Dario Amiri

unread,
Sep 6, 2016, 6:45:13 PM9/6/16
to Aurelius
Hi Kevin,

I work with a team that is running into a similar issue: we want observed uniqueness but would like to avoid locking. I'm curious if the strategy you proposed in your last post ultimately worked out for you and if it did if you wouldn't mind sharing some of the details.

D

Kevin Schmidt

unread,
Sep 7, 2016, 2:04:30 AM9/7/16
to aureliu...@googlegroups.com
Dario,

Yeah, it is generally working.  We just had to modify some traversals to be tolerant of duplicate vertices and handle things appropriately.  We have not gone to any lengths to try and merge/remove the duplicates yet.

Kevin

To unsubscribe from this group and stop receiving emails from it, send an email to aureliusgraphs+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/aureliusgraphs/22844f79-f978-46e6-9344-c3810601a8e0%40googlegroups.com.
Reply all
Reply to author
Forward
0 new messages