Using a mixed index for sorting (Lucene / Elasticsearch)


Brecht De Rooms

Feb 7, 2016, 4:19:21 PM
to Aurelius
Dear Titandb mailing list,

I am test driving Titan to see whether we would gain performance over Neo4j on sort statements.
In our use case we often have to retrieve entities through several links and take the top 100 based on a weight.
From what I read, it seemed that Titan would be able to do this using mixed indexes (and for a future use case, the vertex-centric indexes might be useful).
However, I am not succeeding in making Titan use the mixed index (I tried both Lucene and Elasticsearch) for sorting. The sorting time starts at 500 ms for 10000 elements and more than doubles when I add another 10000, which is far worse performance than we had with Neo4j. Moreover, Titan keeps sending me the warning:
 WARN StandardTitanTx:1262 - Query requires iterating over all vertices [()]. For better performance, use indexes  

I assume that I am doing something wrong and would be grateful if someone could point me in the right direction.

Kind Regards,
Brecht


Query (which I run each time after loading another 10000 elements):
g.V().order().by("weight1", Order.decr).limit(100).toList().get(0).property("weight1")

Configuration:
        BaseConfiguration config = new BaseConfiguration();
        config.setProperty("storage.directory", DATA_DIR);
        config.setProperty("storage.backend", "berkeleyje");
        config.setProperty("storage.port", "8182");
        config.setProperty("storage.batch-loading", "false");
        config.setProperty("storage.transactions", "true");
        config.setProperty("index.search.backend", "elasticsearch");
        config.setProperty("index.search.directory","/tmp/searchindex");
        config.setProperty("index.search.elasticsearch.interface", "NODE");

        graph = TitanFactory.open(config);
        g = graph.traversal();

Schema with weights:
        TitanManagement mgmt = graph.openManagement();
        PropertyKey neo4jId = mgmt.makePropertyKey(NEO4J_ID_ATTRIBUTE_NAME).dataType(Long.class).cardinality(Cardinality.SINGLE).make();
        PropertyKey identifier = mgmt.makePropertyKey("identifier").dataType(String.class).cardinality(Cardinality.SINGLE).make();
        PropertyKey weight1 = mgmt.makePropertyKey("weight1").dataType(Long.class).cardinality(Cardinality.SINGLE).make();
        PropertyKey weight2 = mgmt.makePropertyKey("weight2").dataType(Long.class).cardinality(Cardinality.SINGLE).make();

        mgmt.buildIndex("byIdentifierComposite", Vertex.class).addKey(identifier).buildCompositeIndex();
        mgmt.buildIndex("byNeo4jIdComposite", Vertex.class).addKey(neo4jId).buildCompositeIndex();
        mgmt.buildIndex("byWeight1Composite", Vertex.class).addKey(weight1).buildCompositeIndex();
        mgmt.buildIndex("byWeight2Composite", Vertex.class).addKey(weight2).buildCompositeIndex();
        mgmt.buildIndex("byWeight1Mixed", Vertex.class).addKey(weight1).buildMixedIndex("search");

        mgmt.commit();



Daniel Kuppitz

Feb 7, 2016, 4:30:51 PM
to aureliu...@googlegroups.com
Hi,

Try adding a condition so that Titan knows that you're only interested in vertices that actually have that weight property:

g.V().has("weight1", gte(0L)).order().by("weight1", Order.decr).limit(100)

Cheers,
Daniel


--
You received this message because you are subscribed to the Google Groups "Aurelius" group.
To unsubscribe from this group and stop receiving emails from it, send an email to aureliusgraph...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/aureliusgraphs/3fe325b2-503b-49d4-b5f0-873b31976f20%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Brecht De Rooms

Feb 8, 2016, 4:14:11 AM
to Aurelius
Hi Daniel,

Thank you for your fast answer.
Your suggestion removes the warning (WARN StandardTitanTx:1262 - Query requires iterating over all vertices [()]. For better performance, use indexes).
However, the speeds remain exactly the same (Lucene and ES). I already tried reindexing every time before I run the test, which doesn't change anything.
The current query is: g.V().has("weight1", P.gte(0L)).order().by("weight1", Order.decr).limit(1).toList().get(0).property("weight1")
Every time I add 10000 elements I run this query to see whether it scales.
For 100 000 elements we are already at 5 seconds. In Neo4j it took 5 seconds for 1 000 000 elements, so I can't believe that Titan is actually taking advantage of this index.

As you can see in the results below, the timings go up quite fast and quite linearly.

Loaded 10000 entities, loading next 10000
Received batch from neo4j
10:02:42,553  INFO IndexRepairJob:101 - Found index byWeight1Mixed
10:02:42,566  INFO IndexRepairJob:101 - Found index byWeight1Mixed
10:02:42,703  INFO ManagementSystem:994 - Index update job successful for [byWeight1Mixed]
-----------------------
vp[weight1->2777]
runtime: 1209

Loaded 20000 entities, loading next 10000
Received batch from neo4j
10:02:44,733  INFO IndexRepairJob:101 - Found index byWeight1Mixed
10:02:44,738  INFO IndexRepairJob:101 - Found index byWeight1Mixed
10:02:44,873  INFO ManagementSystem:994 - Index update job successful for [byWeight1Mixed]
----------------------- Test 1
vp[weight1->2777]
runtime: 1156

Loaded 30000 entities, loading next 10000
Received batch from neo4j
10:02:46,761  INFO IndexRepairJob:101 - Found index byWeight1Mixed
10:02:46,765  INFO IndexRepairJob:101 - Found index byWeight1Mixed
10:02:46,903  INFO ManagementSystem:994 - Index update job successful for [byWeight1Mixed]
-----------------------
vp[weight1->2777]
runtime: 1696

Loaded 40000 entities, loading next 10000
Received batch from neo4j
10:02:49,351  INFO IndexRepairJob:101 - Found index byWeight1Mixed
10:02:49,355  INFO IndexRepairJob:101 - Found index byWeight1Mixed
10:02:49,494  INFO ManagementSystem:994 - Index update job successful for [byWeight1Mixed]
-----------------------
vp[weight1->2841]
runtime: 2149

Loaded 50000 entities, loading next 10000
Received batch from neo4j
10:02:52,315  INFO IndexRepairJob:101 - Found index byWeight1Mixed
10:02:52,318  INFO IndexRepairJob:101 - Found index byWeight1Mixed
10:02:52,461  INFO ManagementSystem:994 - Index update job successful for [byWeight1Mixed]
-----------------------
vp[weight1->2848]
runtime: 2634

Loaded 60000 entities, loading next 10000
Received batch from neo4j
10:02:55,920  INFO IndexRepairJob:101 - Found index byWeight1Mixed
10:02:55,923  INFO IndexRepairJob:101 - Found index byWeight1Mixed
10:02:56,059  INFO ManagementSystem:994 - Index update job successful for [byWeight1Mixed]
-----------------------
vp[weight1->2848]
runtime: 3180

Loaded 70000 entities, loading next 10000
Received batch from neo4j
10:03:00,211  INFO IndexRepairJob:101 - Found index byWeight1Mixed
10:03:00,214  INFO IndexRepairJob:101 - Found index byWeight1Mixed
10:03:00,356  INFO ManagementSystem:994 - Index update job successful for [byWeight1Mixed]
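
To put a number on "quite linearly", the six runtimes reported in the log can be run through an ordinary least-squares fit. This is only a rough sketch: the class name RuntimeTrend is made up for illustration, and it assumes that the i-th measurement corresponds to i batches of 10000 entities, which the log does not state explicitly.

```java
// Least-squares fit of the query runtimes reported in the log.
// Assumption: the i-th measurement corresponds to i batches of 10000 entities.
class RuntimeTrend {

    /** Returns {slope, intercept} of the least-squares line through (x[i], y[i]). */
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double sumX = 0, sumY = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
        }
        double meanX = sumX / n;
        double meanY = sumY / n;

        double sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        double slope = sxy / sxx;
        return new double[] {slope, meanY - slope * meanX};
    }

    public static void main(String[] args) {
        double[] batches = {1, 2, 3, 4, 5, 6};
        double[] runtimesMs = {1209, 1156, 1696, 2149, 2634, 3180};

        double[] line = fit(batches, runtimesMs);
        System.out.printf("growth per 10000 entities: %.1f ms%n", line[0]);
        System.out.printf("extrapolated runtime at 100000 entities: %.0f ms%n",
                line[0] * 10 + line[1]);
    }
}
```

Under that assumption the fit gives roughly 420 ms of extra runtime per 10000 entities and extrapolates to about 4.7 seconds at 100 000 entities, which is in line with the 5 seconds mentioned above.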


Extra information/questions:
- I didn't mention this in my first post: I am using Titan 1.0.0.
- What I did not do when reindexing: report = mgmt.awaitGraphIndexStatus(graph, "mixedExample").call()
  because the Java interface does not have an awaitGraphIndexStatus method. Should I manually wait until the index goes from "ENABLED" to "REGISTERED", or is this simply not necessary in Java?
- Should I somehow indicate that I will use this index for ordering, or does any Lucene/Elasticsearch index work both for full-text search and for ordering? And does Lucene/ES know from the PropertyKey data type that it should index weight as a numeric value?

Kind Regards and thanks again for your help,
Brecht 

Daniel Kuppitz

Feb 8, 2016, 7:54:27 AM
to aureliu...@googlegroups.com
Hi Brecht,

I can't reproduce your result. Here's my little test script:

graph = TitanFactory.open("conf/titan-berkeleyje-es.properties")
g = graph.traversal()

mgmt = graph.openManagement()

weight1 = mgmt.makePropertyKey("weight1").dataType(Long.class).cardinality(Cardinality.SINGLE).make()
mgmt.buildIndex("byWeight1Mixed", Vertex.class).addKey(weight1).buildMixedIndex("search")
mgmt.commit()

add10K = {
  for (i = 1; i <= 10000; i++) {
    graph.addVertex("weight1", (Long) (System.currentTimeMillis() / i))
  }
  graph.tx().commit()
}

for (j = 0; j < 100; j++) {
  add10K()
  println "== ${(j+1)*10000} Vertices =="
  sleep(2000) // respect ES indexing latency
  println clockWithResult {g.V().has("weight1", gte(0L)).order().by("weight1", decr).limit(1).values("weight1").next()}
  println ""
}

I updated the configuration file to use a standalone ES instance, as I don't trust the embedded mode performance-wise.
The first few measurements were pretty slow, but over time the query performed better and better. This is how it started:

== 10000 Vertices ==
[0.98979063, 1454935229717]

== 20000 Vertices ==
[0.99538907, 1454935246933]

== 30000 Vertices ==
[0.75932559, 1454935253354]

== 40000 Vertices ==
[1.09850552, 1454935258999]

== 50000 Vertices ==
[0.63404432, 1454935264687]

== 60000 Vertices ==
[0.77917722, 1454935270246]

... and this is how it ended:

== 980000 Vertices ==
[0.05919474999999999, 1454935635203]

== 990000 Vertices ==
[0.055688629999999996, 1454935639298]

== 1000000 Vertices ==
[0.06061103999999999, 1454935643687]

So the slowest queries were right at the beginning, where it took almost 1 second to answer the query. After inserting 1 million vertices, the query time ultimately went down to 60 ms. And yea, my test machine is rather slow, hence these (or better) results should be reproducible on any other machine.
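
Since the first measurements are the slowest, it can help to separate warm-up from steady-state timing when comparing numbers between setups. The following is a minimal, independent sketch of such a harness in plain Java; TimedRun, Timed and clock are hypothetical names, not Titan or TinkerPop API, and the workload in main is only a stand-in for the real traversal.

```java
import java.util.function.Supplier;

// Hypothetical timing helper: runs a query a few times untimed, then reports
// the mean runtime of the timed runs in milliseconds.
class TimedRun {

    /** Mean runtime in milliseconds plus the result of the last timed run. */
    static final class Timed<T> {
        final double meanMillis;
        final T result;

        Timed(double meanMillis, T result) {
            this.meanMillis = meanMillis;
            this.result = result;
        }
    }

    static <T> Timed<T> clock(int warmups, int runs, Supplier<T> query) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be >= 1");
        }
        // Warm-up runs are executed but not measured.
        for (int i = 0; i < warmups; i++) {
            query.get();
        }
        T result = null;
        long totalNanos = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            result = query.get();
            totalNanos += System.nanoTime() - start;
        }
        return new Timed<>(totalNanos / 1_000_000.0 / runs, result);
    }

    public static void main(String[] args) {
        // Stand-in workload; in this thread it would be the sorted traversal.
        Timed<Long> timed = clock(3, 10, () -> {
            long max = Long.MIN_VALUE;
            for (long i = 0; i < 1_000_000; i++) {
                max = Math.max(max, i % 2777);
            }
            return max;
        });
        System.out.println("mean runtime: " + timed.meanMillis + " ms, result: " + timed.result);
    }
}
```

A single timed run, as in the "runtime:" lines earlier in the thread, includes whatever one-off start-up cost the first query pays, so averaged numbers after a warm-up are easier to compare.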

Cheers,
Daniel



Brecht De Rooms

Feb 8, 2016, 8:09:38 AM
to Aurelius
Hi Daniel,

First of all, thanks for testing this!
Since you could not reproduce it, I decided to run exactly the same test as you did, but with my configuration.
The problem must be somewhere in the data, because when I use your test data the results are significantly better:

----------------------- Test 1
1454936730176
runtime: 260 ms

----------------------- Test 2
1454936740145
runtime: 61 ms

----------------------- Test 3
1454936743278
runtime: 41 ms

----------------------- Test 4
1454936746264
runtime: 40 ms

----------------------- Test 5
1454936748794
runtime: 40 ms


Now I know that I should stop looking at the configuration and I will investigate further what goes wrong with the data. Thank you for your aid and have a nice day!


Kind Regards,

Brecht

Brecht De Rooms

Feb 8, 2016, 9:46:50 AM
to aureliu...@googlegroups.com
Hey,

What happens is that it works when I pass the current time, i.e. (Long) System.currentTimeMillis().
However, when I pass a fixed number such as new Long(200), or a value coming from the Neo4j database, it results in a FastNoSuchElementException.
Whenever I remove values("weight1"), Titan does return a node. It seems that from the moment I change the value to anything other than the one in your example, the weight1 property is no longer there.
I also tried: P.gte(0L)).order().by("weight1", Order.decr).limit(1).properties().next(), which also does not work.

Maybe this is logical; could you explain to me why it behaves that way?

Kind Regards,
Brecht




Daniel Kuppitz

Feb 8, 2016, 10:27:06 AM
to aureliu...@googlegroups.com
I don't know how you execute your tests, but note that there's a latency in ES indexing (index.refresh_interval). You can't update or insert a value and immediately query it. Wait at least 1 second before you query the value, otherwise the result may be empty.
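
Instead of a fixed sleep, a test can also poll until the value becomes visible in the index or a timeout expires. The sketch below is plain Java and only illustrates that idea: IndexPoller and pollUntilPresent are hypothetical names, and the Supplier stands in for the real query (for example a traversal ending in tryNext(), which is an assumption about how it would be wired in).

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical helper: repeats a query until it yields a value or the timeout expires.
class IndexPoller {

    static <T> Optional<T> pollUntilPresent(Supplier<Optional<T>> query,
                                            long timeoutMillis,
                                            long intervalMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            Optional<T> result = query.get();
            if (result.isPresent()) {
                return result;
            }
            if (System.nanoTime() >= deadline) {
                return Optional.empty(); // index never caught up within the timeout
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return Optional.empty();
            }
        }
    }

    public static void main(String[] args) {
        // Simulated index that only "sees" the value from the third query on.
        int[] calls = {0};
        Supplier<Optional<Long>> query =
                () -> ++calls[0] < 3 ? Optional.<Long>empty() : Optional.of(200L);

        Optional<Long> weight = pollUntilPresent(query, 5_000, 100);
        System.out.println("result: " + weight + " after " + calls[0] + " queries");
    }
}
```

Returning an empty Optional on timeout, rather than calling next() on an empty result, also avoids the FastNoSuchElementException described earlier in the thread.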

Cheers,
Daniel


David

Feb 8, 2016, 4:26:22 PM
to Aurelius
I did a quick conversion of Daniel's script into a Java program, changed the System.currentTimeMillis() to new Long(200) and was able to run without exceptions.
Daniel's findings were reproduced - this was my final output:
[0.049398529999999996, [200]]

I added the ids.block-size=1000000 setting to the properties file which made the overall program run faster even though it doesn't impact index performance.

import com.thinkaurelius.titan.core.Cardinality;
import com.thinkaurelius.titan.core.PropertyKey;
import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;
import com.thinkaurelius.titan.core.schema.TitanManagement;
import org.apache.tinkerpop.gremlin.process.traversal.Order;
import org.apache.tinkerpop.gremlin.process.traversal.P;
import org.apache.tinkerpop.gremlin.process.traversal.Traversal;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.apache.tinkerpop.gremlin.util.TimeUtil;
import java.util.function.Supplier;

public class Driver16
{
    public static void main(String[] args)
    {
        Driver16 me = new Driver16();
        String propsFile = args[0];
        me.doWork(propsFile);
    }

    private void doWork(String propsFile)
    {
        TitanGraph graph = null;
        try
        {
            graph = openGraph(propsFile);
            createSchema(graph);

            // Alternate between loading a batch and timing the sorted query
            for (int i = 0; i < 101; i++)
            {
                writeData(graph);
                readData(graph, i);
            }
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
        finally
        {
            if (graph != null)
                graph.close();
        }
    }

    private TitanGraph openGraph(String propsFile)
    {
        return TitanFactory.open(propsFile);
    }

    private void createSchema(TitanGraph graph)
    {
        TitanManagement mgmt = graph.openManagement();

        // Single Long property, indexed through the "search" index backend
        PropertyKey weight1 = mgmt.makePropertyKey("weight1").dataType(Long.class).cardinality(Cardinality.SINGLE).make();
        mgmt.buildIndex("byWeight1MixedIdx", Vertex.class).addKey(weight1).buildMixedIndex("search");
        mgmt.commit();
    }

    private void writeData(TitanGraph graph)
    {
        // Add 10000 vertices at a time
        for (int i = 1; i <= 10000; i++)
        {
            // graph.addVertex("weight1", (Long)(System.currentTimeMillis() / i));
            graph.addVertex("weight1", new Long(200));
        }
        graph.tx().commit();
    }

    private void readData(TitanGraph graph, int currentCount)
    {
        try
        {
            GraphTraversalSource g = graph.traversal();

            // writeData has already run for this iteration, hence currentCount + 1
            System.out.println("== " + ((currentCount + 1) * 10000) + " Vertices ==");

            // respect ES indexing latency
            Thread.sleep(2000);

            final Supplier<Traversal<?, ?>> traversal = () ->
                g.V().has("weight1", P.gte(0L)).order().by("weight1", Order.decr).limit(1).values("weight1");

            System.out.println(TimeUtil.clockWithResult(() -> traversal.get().toList()).toString());
            System.out.println("");
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}

Brecht De Rooms

Feb 9, 2016, 2:36:23 AM
to aureliu...@googlegroups.com
Hi David,

Thank you for double-checking. It works now with Elasticsearch, even though my code is very similar to yours.
For Lucene I still get these errors from time to time. I am pretty sure it is a silly mistake on my part (maybe I don't delete the index properly; at the moment I just remove the folder). In any case I can continue working with ES. Thanks to both of you for the help.

Brecht

Surendra Pratap

Jun 11, 2018, 6:41:55 AM
to Aurelius
Hi,

How can I configure a standalone ES instance with a Titan DB app backed by Berkeley DB? Please advise on the points below:

1) How to check that Titan DB is writing to Elasticsearch

2) What needs to be done so that the graph is loaded by the Java code and can then be queried using the TinkerPop console
