Titan embedded


Matthias Broecheler

Sep 4, 2013, 10:10:08 PM
to aureliu...@googlegroups.com
Hey guys,

after doing some stress-test comparisons of Titan-Cassandra embedded with Titan-Cassandra over localhost, we came to the conclusion that Titan-Cassandra over localhost is the preferred option for performance-oriented deployments.

We conducted a series of benchmarks that led us to this conclusion, and we believe the reasons that Titan-Cassandra over localhost is faster are as follows:
1) Communication over localhost does not invoke the network stack on modern operating systems (i.e. it's fast)
2) Running Titan and Cassandra in the same jvm causes longer GC pauses and generally makes GC tuning more complicated because the memory footprint of Titan is very different from that of Cassandra.

Bottom line: We will most likely remove Titan-Cassandra embedded in 0.4.0 and encourage everybody to run Titan alongside Cassandra but in separate jvms. This has the additional benefit that it removes a lot of code complexity from Titan and allows us to move Rexster out of the Titan code base (i.e. far fewer dependencies).

Just a heads up,
Matthias


PS: Thanks to Zack Maril for devising, running, and evaluating the benchmarks that were the basis for this conclusion.

--
Matthias Broecheler
http://www.matthiasb.com

Zack Maril

Sep 5, 2013, 10:24:16 AM
to aureliu...@googlegroups.com
If anyone has any questions about the methodology and process behind these results, I'd be happy to answer them.
-Zack

Tom Michaud

Sep 5, 2013, 11:17:21 AM
to aureliu...@googlegroups.com
Hi Zack,

Would you be able to share the test results with the community?

Thx,
Tom


--
You received this message because you are subscribed to the Google Groups "Aurelius" group.
To unsubscribe from this group and stop receiving emails from it, send an email to aureliusgraph...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

bytor...@gmail.com

Sep 5, 2013, 11:39:52 AM
to aureliu...@googlegroups.com
Are you sure you want to completely remove it? 

I was just thinking about Integration tests in people's applications. I think embedded makes it simpler to run the tests without asking the build to start up an instance of Cassandra before running the tests. Kind of just adds an extra dependency on devs and CI.

But with that said, I also thought that embedded meant in-memory and that each test run would be working on a clean DB, but I think I am wrong on that, based on my current test, which always runs the MakeType/createSchema code first, and on it returning true for something like this:

if (graph.getType(PLAYER_INVITED)==null)

Thanks

Mark

Matthias Broecheler

Sep 5, 2013, 12:10:36 PM
to aureliu...@googlegroups.com
Hi Mark,

yes, Titan-embedded is convenient for testing, but so is the localhost variant. If you look at the Titan test cases, we implemented a process starter that starts Cassandra separately which is nice for tests.

We can highlight this in the documentation so that devs and CI know how to use that in their environment. It should not be any more work.
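
As a rough illustration of such a process starter (this is a sketch, not the helper from Titan's test suite), a test harness could launch Cassandra as a child process and wait for its Thrift port before opening the graph. The class name `CassandraTestStarter` and the `cassandraHome` directory are hypothetical; `-f` is Cassandra's foreground flag and 9160 is its default Thrift port.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class CassandraTestStarter {

    // Default Cassandra Thrift (RPC) port; adjust if cassandra.yaml overrides rpc_port.
    public static final int THRIFT_PORT = 9160;

    // Launches Cassandra as a separate OS process. cassandraHome is a
    // hypothetical install directory; "-f" keeps Cassandra in the foreground
    // instead of daemonizing.
    public static Process start(String cassandraHome) throws IOException {
        ProcessBuilder pb = new ProcessBuilder(cassandraHome + "/bin/cassandra", "-f");
        pb.inheritIO(); // forward Cassandra's log output to the test console
        return pb.start();
    }

    // Polls host:port until a TCP connect succeeds or the timeout elapses.
    public static boolean waitForPort(String host, int port, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, port), 250);
                return true;
            } catch (IOException notListeningYet) {
                Thread.sleep(100); // not up yet, retry shortly
            }
        }
        return false;
    }
}
```

In a test's setup one would call `start(...)`, then `waitForPort("127.0.0.1", THRIFT_PORT, 30000)`, and point Titan at `storage.hostname=127.0.0.1`; teardown would call `destroy()` on the returned `Process`.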

Cheers,
Matthias



bytor...@gmail.com

Sep 5, 2013, 6:32:01 PM
to aureliu...@googlegroups.com
Great. Thanks. 

Although, with the localhost variant, wouldn't that mean that any types created by makeType() calls would remain after the first run?

Just asking out loud, you don't have to respond to that question.

Mark
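
On the question above: if the store keeps its data between runs, schema setup has to tolerate types that already exist. A hedged sketch of the guard pattern from Mark's earlier snippet follows. `TypeStore` and `InMemoryTypeStore` are hypothetical stand-ins for the graph, not Titan classes; with Titan the check would be the `graph.getType(name) == null` test quoted earlier in the thread.

```java
import java.util.HashMap;
import java.util.Map;

public class SchemaSetup {

    // Hypothetical stand-in for the two graph calls the guard needs.
    public interface TypeStore {
        Object getType(String name);   // null when the type does not exist
        void makeType(String name);    // creates the type
    }

    // Toy store that keeps types for as long as the object lives, mimicking
    // a backend that is not wiped between test runs.
    public static class InMemoryTypeStore implements TypeStore {
        private final Map<String, Object> types = new HashMap<>();
        public int created = 0;

        public Object getType(String name) { return types.get(name); }

        public void makeType(String name) {
            types.put(name, new Object());
            created++;
        }
    }

    // Creates the type only when it is missing; returns true if it was created.
    public static boolean ensureType(TypeStore store, String name) {
        if (store.getType(name) == null) {
            store.makeType(name);
            return true;
        }
        return false;
    }
}
```

Running `ensureType` twice against the same store creates the type once, so the setup code is safe to rerun whether or not the backend was cleaned.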

murat migdisoglu

Sep 6, 2013, 5:01:03 AM
to aureliu...@googlegroups.com
I have trouble understanding why so many people are doing performance tests on embedded cassandra or single node deployments (one instance of cassandra/localhost).

I don't believe that embedded cassandra or single node deployments are realistic use cases. Cassandra is fast when it is running in a multi node cluster. It is designed to be run on a cluster. I don't use embedded cassandra except in my unit tests. Even the test cluster that I have at home is running on a 4 node raspberry pi cluster.

In my opinion, making it run and executing traversal queries is one thing, but benchmarking it is another. And I'm pretty sure that you cannot convince any serious customer/IT Manager to use embedded cassandra in a production environment.

That said, I'm also not sure if Titan is really ready to run in a distributed fashion. As I discussed in a previous email on this forum, the transactions, locks, and unique checks are thread related. When I ran an application using the Blueprints API on a storm cluster, I did not find any mechanism to check the unique indexes.

Kind Regards
Murat



"Find a job you enjoy, and you'll never work a day in your life."
Confucius

Zack Maril

Sep 6, 2013, 9:58:09 AM
to aureliu...@googlegroups.com
(Got the green light on releasing all results, working on putting together something coherent about this over the weekend. )

Matthias Broecheler

Sep 6, 2013, 1:14:41 PM
to aureliu...@googlegroups.com
Hey Murat,

when you run a distributed graph cluster, you previously had the choice between running Titan embedded with Cassandra on EACH node in the cluster or running it locally in a separate process on EACH node in the cluster or running Titan nodes SEPARATE from the Cassandra cluster.
Those are the 3 deployment modes applicable for production systems (see documentation for more details). Of those, the last has higher query latency, and hence users choose one of the first two for high-performance deployments. The benchmarking we have done suggests that running Titan locally in a separate jvm is the best option. Zack will publish a more comprehensive review of those results.
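
The three modes differ mainly in the storage settings handed to Titan. A minimal sketch, assuming the key names from the Titan 0.3 Cassandra documentation (`storage.backend`, `storage.hostname`, `storage.cassandra-config-dir`); the class `DeploymentModes`, the yaml path, and the host address are placeholders, so check the keys against the documentation for your version.

```java
import java.util.Properties;

public class DeploymentModes {

    // Titan and Cassandra in one JVM (the mode proposed for removal).
    // cassandraYaml is a placeholder path to a cassandra.yaml file.
    public static Properties embedded(String cassandraYaml) {
        Properties p = new Properties();
        p.setProperty("storage.backend", "embeddedcassandra");
        p.setProperty("storage.cassandra-config-dir", cassandraYaml);
        return p;
    }

    // Titan in its own JVM, talking to a Cassandra node at the given host.
    public static Properties remote(String host) {
        Properties p = new Properties();
        p.setProperty("storage.backend", "cassandra");
        p.setProperty("storage.hostname", host);
        return p;
    }

    // Same as remote, but Cassandra runs on the same machine in a separate JVM.
    public static Properties local() {
        return remote("127.0.0.1");
    }

    public static void main(String[] args) {
        System.out.println("embedded: " + embedded("config/cassandra.yaml"));
        System.out.println("local:    " + local());
        System.out.println("remote:   " + remote("10.0.0.5")); // placeholder address
    }
}
```

Seen this way, moving from embedded to local is a configuration change on the Titan side plus a separately started Cassandra process on the same machine.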


> That said, I'm also not sure if Titan is really ready to run in a distributed fashion. As I discussed in a previous email on this forum, the transactions, locks, and unique checks are thread related. When I ran an application using the Blueprints API on a storm cluster, I did not find any mechanism to check the unique indexes.


That is not true. Please refer to the documentation and previous emails on the list to understand how locking and uniqueness checks work in a distributed Titan deployment.

HTH,
Matthias

Zack Maril

Sep 9, 2013, 12:02:02 AM
to aureliu...@googlegroups.com
The main purpose of these experiments was to see what the "best" setup for a single node server was. There had been rumblings in the Aurelius community that people weren't seeing much improvement in terms of speed, in a variety of metrics, when using Titan embedded versus running Cassandra in a separate jvm process locally.

The data involved was a graph of the marvel comic book universe: https://github.com/zmaril/marvel.graphson

The gatling stress tool was used to launch load scenarios and summarize results: http://gatling-tool.org/ 
The main scenario that was tested can be found here (scala formatting seems odd for some reason on github):

The scenarios tested some simple queries that involved traversals and random side effects. Before anyone starts throwing their specific scripts and scenarios at us to test, please note that there is literally an infinite number of possible scripts we could run, each meant to achieve a different goal. We chose a simple set of scripts that represented a semi-realistic workload for a single node setup.

The following scripts were used to start the embedded and local titan setups: 

For hardware, I used m3.2xlarge instances on ec2 with ubuntu 12.04 LTS. I had one instance for gatling to run on and one for titan to run on. These instances all networked with each other via elastic IPs. The embedded instances were given 20g of heap space to work with, while the local setup had 10g for titan and 10g for cassandra.

Here are some of the results (if you are reading this in the future and the webpages are down, send me an email and I'll put them back up for you): 

Local results (along with gc output):

Embedded results (along with gc output): 

Local results are generally faster than embedded results, period. More requests/second, smaller means, smaller std deviations, smaller percentiles: all of them were better. These tests were repeated several times and all showed that embedded was consistently worse than local. The Aurelius team theorized that garbage collection was the reason behind the unexpected slowdown. Pavel Yaskevich explained it as follows:

"By separating Titan and Cassandra processes you get separate GC behavior (all ParNew and PSYoungGen are Stop-The-World events so even for 5-10 ms stops like that disrupt the whole pipeline)  and help from operating system to buffer producer (Titan) packets while consumer is stopped (Cassandra) via loopback, scheduling also becomes easier task for OS as both now have separate quantum."

In conclusion, we did not see a reason to keep titan embedded in active development. It was expected that embedded would at the least blow local out of the water in terms of speed, but we weren't able to find a scenario where that happened. Thus, while there may be a scenario where running cassandra embedded with titan is a win, titan embedded didn't win in simple scenarios that were more likely to occur. So, it doesn't make sense to maintain and develop titan embedded moving forward. If you have any questions, please shoot them my way and I'll answer as best I can. 
-Zack

P.S.: Nothing good happens when you put titan and cassandra on different servers (http://ec2-54-225-52-5.compute-1.amazonaws.com:8000/distinct-load-20130829131615/ network latency sucks). 
P.P.S: perf doesn't collect many stats on ec2 because they've compiled the kernel with most of the useful kernel flags off. 
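
For anyone who wants to check the garbage collection explanation in this post on their own deployment, one option is to compare how much time each JVM spends collecting. The sketch below reads the JVM's standard `GarbageCollectorMXBean` counters; `GcTimeProbe` is a hypothetical helper name, and the value it reports is the collectors' accumulated collection time, which only approximates stop-the-world pause time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcTimeProbe {

    // Total milliseconds this JVM has spent in garbage collection so far,
    // summed over all collectors that report a time.
    public static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long before = totalGcMillis();
        // Allocate short-lived garbage to provoke young-generation collections.
        long allocated = 0;
        for (int i = 0; i < 200_000; i++) {
            byte[] junk = new byte[1024];
            allocated += junk.length;
        }
        long after = totalGcMillis();
        System.out.println("allocated " + allocated + " bytes, GC time delta: "
                + (after - before) + " ms");
    }
}
```

Sampling `totalGcMillis()` before and after a load run in the Titan JVM and in the Cassandra JVM gives a rough per-process picture that can be set against the latency results.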

Antonio VonG

Sep 9, 2013, 10:48:25 AM
to aureliu...@googlegroups.com
I think we should have reflected the last part (titan and cassandra on different servers) in the wiki; so many people were doing this and became totally frustrated.

Regards,
Antonio

Matthias Broecheler

Sep 9, 2013, 9:00:55 PM
to aureliu...@googlegroups.com
Hey guys,

let me add that Zack's experiments were focused on low latency query answering. In other words, the emphasis of these experiments was to get query results quickly. For low latency applications, we therefore recommend running Titan on the same machine as Cassandra but in a separate jvm.

Deploying Titan and Cassandra on separate clusters remains a viable option where low latency is not the primary concern and the operational simplicity of maintaining separate clusters (or being able to serve other workloads out of the cassandra cluster) outweighs the lower latency benefits.

Best,
Matthias



Steven McCraw

Sep 15, 2013, 8:57:43 AM
to aureliu...@googlegroups.com
Hi All,

Sorry I'm late to the party on this one, but I do want to weigh in. The performance results are interesting, and for an internal application I'm developing, I'll definitely decouple Titan and Cassandra from the same process. However, I hate to hear that you're planning to remove the embedding capability altogether. I originally started using Titan for a small back-end datastore to an application that I'll be distributing to customers. Being able to embed whatever database we chose was a big decision-point, because we don't want our customers to have to do a bunch of configuration. As is, we can deliver a very simple package to customers that launches a single process on their machine, and the whole lifecycle maintenance of the thing is pretty easy.

We don't want our clients to have to know about a second external process that our product leans on, and we don't want them to have to know anything about Cassandra, what ports it comes up on, or how to troubleshoot connection issues between the two.  That would probably be fine for enterprise level customers, but not so much for small individual users like college students that we distribute our software to.  It's best and easiest if they can just start up the process we deliver and see one process get created as a result.  In this scenario, the slight performance gains we would get from decoupling the two take a back seat to the deployment and software lifecycle management experience we want our customers to have.  I suspect this might be the case for lots of users, particularly when Titan isn't being used with huge data sets or particularly complex graphs.

Therefore, I support the notion of changing the wiki to reflect the new understanding of better performance with running Cassandra non-embedded, but advocate that embedding be left in as a possibility to address other use cases.

Thanks very much!
Mark

Antonio VonG

Sep 15, 2013, 10:16:08 AM
to aureliu...@googlegroups.com
Great point there, but I think it should be easy to write a launch script that runs cassandra first, then runs the titan and rexster server connecting to localhost.
The point here is that if your client ever decides to take a closer look at Titan, they will find out about cassandra anyway. But if we do this smart enough people won't even notice the change:-)

Oh btw did you ever think about distributing your application using docker? It was quite cool.

Regards,
Antonio 

Zack Maril

Sep 15, 2013, 9:41:51 PM
to aureliu...@googlegroups.com
Steven, I agree that embedded makes it really easy to set things up. Why doesn't the BerkeleyDB backend work for your use case?
-Zack

Mark McCraw

Sep 15, 2013, 11:26:16 PM
to aureliu...@googlegroups.com
Hi Zack, BerkeleyDB is actually a perfect fit for this scenario, but we can't distribute it because of licensing restrictions (I tried getting it past legal and it got nowhere fast). Cassandra's Apache 2.0 license is the business-friendly license we needed.


--
You received this message because you are subscribed to a topic in the Google Groups "Aurelius" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/aureliusgraphs/EasJTTkDtfY/unsubscribe.
To unsubscribe from this group and all its topics, send an email to aureliusgraph...@googlegroups.com.

Zack Maril

Sep 16, 2013, 9:52:05 PM
to aureliu...@googlegroups.com
I hear the Persistit backend was suggested as one way of dealing with that particular scenario. I'm unsure how far that got or whether it is officially endorsed yet. Perhaps that could do it?
-Zack

Steven McCraw

Sep 17, 2013, 3:28:12 PM
to aureliu...@googlegroups.com
Persistit looks great.  The licensing will work.  I'd love to give it a try.  How do I go about doing that?  Is there documentation anywhere (my 30 seconds of googling didn't turn up much, but github, weirdly, is down right now, so it's even harder to figure out what my options are)?

Matthias Broecheler

Sep 18, 2013, 12:40:36 AM
to aureliu...@googlegroups.com
Persistit is on the master branch, but not yet part of the documentation. It is similar to BDB, though. Check out the adapter (in the titan-persistit module) for the details.



Steven McCraw

Sep 18, 2013, 1:20:20 PM
to aureliu...@googlegroups.com
Ok, I think I'm very close to being able to test my application with persistit.  I built from the master branch and added everything from titan-dist/titan-dist-all/target/titan-all-standalone to my classpath.  Two notes here:

1) The other data stores have separate folders created in titan-dist (i.e. titan-dist-berkeleyje, titan-dist-cassandra, titan-dist-hbase).  I'd be happy to monkey with the builds to get the same thing to happen with persistit, but I'm not familiar with maven.  Still, if someone could point me in the right direction, I bet I could muddle through based on how the other things work.  Secondly, I don't really know which of all those jar files are dependencies for persistit/titan.  I suppose I could diff the other three distributions, get the common set, and tack on the persistit specific jars.  Does that seem like the best approach?

2) When I try to configure titan, I'm creating an org.apache.commons.configuration.MapConfiguration that looks like:

storage.backend => persistit
storage.directory => <whatever>
storage.buffercount => 5000

I just took that from config/titan-server-persistit.properties.  The problem is, when I feed that to com.thinkaurelius.titan.core.TitanFactory.open, I get the following exception:

LoadError: load error: titan -- java.lang.IllegalArgumentException: Could not find implementation class: persistit

which makes me think that I'm doing something wrong with that configuration.

Any ideas about what might be going on?

Thanks so much!
Mark

P.S.  Watching all those tests run really made me appreciate the tremendous amount of work that has gone into this awesome project.  Kudos to everyone involved for the amazing work.

Steven McCraw

Sep 18, 2013, 5:37:12 PM
to aureliu...@googlegroups.com
Ok, the error I got was because I somehow missed pulling some of the titan-0.3.0 jars out of my classpath.  That's fixed now, but now I'm getting:

Features.java:186:in `checkCompliance': java.lang.IllegalStateException: The feature isRDFModel was not specified
from TitanFeatures.java:46:in `getBaselineTitanFeatures'
from TitanFeatures.java:51:in `getFeatures'
from StandardTitanGraph.java:104:in `getFeatures'


My plan is to dive into that late tonight or early tomorrow, but if someone knows what this is about right away, I wouldn't turn down the information :-)

Thanks,
Mark

Matthias Broecheler

Sep 18, 2013, 8:57:39 PM
to aureliu...@googlegroups.com
Might be some old jars still lying around. isRDFModel was deprecated (and removed?).
HTH,
Matthias

Steven McCraw

Sep 19, 2013, 9:29:43 AM
to aureliu...@googlegroups.com
Yes!  That was it, things are working fine now.  Thanks very much!  I'm still interested in building out the distributions (and maybe even updating the doc) for persistit.  I can do that and then submit a pull request.  Can someone tell me how the distributions at https://github.com/thinkaurelius/titan/wiki/Downloads get built?  I'm interested in creating a similar distribution for persistit, containing only the jars it needs, but it isn't obvious to me how the other ones get created.

Thanks!
Mark

Ian

Oct 6, 2013, 6:50:40 PM
to aureliu...@googlegroups.com
I want to use cassandra as a backend and distribute it on at least 3 servers to have high availability. 

Is this the best procedure to install?

Install cassandra and titan on each server and set them up in Local server mode, but use a seed IP to set them up as a multinode cluster.

If this is correct, how does my Java application communicate with the cluster? Do I have to install Rexster on each server and then choose one of the Rexster endpoints in my application? Can I still use the Blueprints API in this scenario (I guess through the RexPro protocol)?

I have been reading a lot, but I'm still not sure how best to set this up.

Antonio VonG

Oct 6, 2013, 7:17:37 PM
to aureliu...@googlegroups.com
Yes you need rexster, via REST/RexPro, or you could consider embedding Titan into your application.

Regards,
Antonio

drake.c...@gmail.com

Jan 14, 2014, 7:58:15 PM
to aureliu...@googlegroups.com
What read and write consistency settings were used? With consistency > 1, even using localhost will still have network latency between nodes of the cassandra cluster.

Matthias Broecheler

Jan 14, 2014, 10:49:11 PM
to aureliu...@googlegroups.com
Yes, when you use a consistency level higher than 1, Cassandra consults at least two instances, one of which will be remote.
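
To make the arithmetic concrete, here is a small sketch. It assumes Titan's local Cassandra node holds a replica of the data being read or written, which is not guaranteed for every key; `ConsistencyMath` is a hypothetical helper, and the quorum formula is Cassandra's standard (replication factor / 2) + 1 with integer division.

```java
public class ConsistencyMath {

    // Replicas that must respond for QUORUM at a given replication factor.
    public static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1; // integer division
    }

    // Responses that must come from other nodes when exactly one replica is
    // local (assumption: the co-located Cassandra node owns a replica).
    public static int remoteResponses(int requiredResponses) {
        return Math.max(0, requiredResponses - 1);
    }

    public static void main(String[] args) {
        int rf = 3;
        System.out.println("QUORUM at RF=3 needs " + quorum(rf) + " responses, "
                + remoteResponses(quorum(rf)) + " of them remote");
        System.out.println("ONE needs 1 response, "
                + remoteResponses(1) + " of them remote");
    }
}
```

With a replication factor of 3, QUORUM needs 2 responses, so at least 1 must cross the network; at consistency ONE a local replica can answer alone.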

