faunus+cassandra common issues


Stephen Mallette

Sep 18, 2013, 11:49:29 AM
to aureliu...@googlegroups.com
I've been doing some work with Faunus recently on a reasonably large graph and hit a number of problems that seem to come up a lot. I say that because I did a fair amount of searching for answers and either found them scattered across different places or didn't find clear answers at all. I'm not sure I have clear answers here either, but perhaps documenting things a bit will help someone out.

First of all, some environmental information:

+ Titan+Cassandra
+ Cassandra cluster consists of 6 m1.xlarge EC2 instances
+ Hadoop cluster consists of 6 m2.xlarge EC2 instances

My first job was simply to get all of the data into a sequence file with g._(). The first problem I solved was the "Message length exceeded" error, and I think most people know the answer to that one: add these lines to your faunus.properties file:

cassandra.thrift.framed.size_mb=49
cassandra.thrift.message.max_size_mb=50

It actually took some trial and error to get these settings right.  I just kept bumping the size up until the errors disappeared.  Not sure if there is a better way to do that or to know the right setting from the outset.  I ended up with 256 as the value for both settings given my graph structure.
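
For reference, those two lines in my faunus.properties ended up as:

cassandra.thrift.framed.size_mb=256
cassandra.thrift.message.max_size_mb=256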

The next error message I dealt with was "TitanException: Could not connect to Cassandra to read partitioner information. Please check the connection". In this case, I fixed it by changing the storage backend from "cassandrathrift" to:

faunus.graph.input.titan.storage.backend=cassandra

I can never remember which of those works best in EC2. It always feels like I guess wrong, no matter which one I pick first, but in this case "cassandra" was the answer.
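
For context, the whole input section of my faunus.properties looked roughly like the sketch below. The hostname and keyspace are just placeholders for your own values, and the input format class name may differ depending on your Faunus version:

faunus.graph.input.format=com.thinkaurelius.faunus.formats.titan.cassandra.TitanCassandraInputFormat
faunus.graph.input.titan.storage.backend=cassandra
faunus.graph.input.titan.storage.hostname=<one-of-your-cassandra-nodes>
faunus.graph.input.titan.storage.port=9160
faunus.graph.input.titan.storage.keyspace=titan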

Then I started to blow the heap, getting "GC overhead limit exceeded" and OutOfMemoryError exceptions. I edited my faunus.properties file as follows:

mapred.map.child.java.opts=-Xmx6144M

The m2.xlarge instances are memory optimized and have 17 GB of RAM. Given two mappers per node, I figured 6 GB was OK to give each mapper.
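
Roughly, the per-node budget works out like this (assuming the only other significant memory consumers on each node are the usual Hadoop daemons and the OS):

2 mappers x 6 GB heap  = 12 GB
17 GB total - 12 GB    = ~5 GB left for everything else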

Then I started to get timeout exceptions when connecting to Cassandra (these usually happened a good way into the job I was executing). I fixed that with:

cassandra.range.batch.size=256

The default for that value is 4096. If anyone can share why the default of 4096 was too high in my case, I'd like to know the reason.

Anyway, since those changes went into place, I've not had any problems executing my Faunus jobs. Hope this helps someone out in the future...even if that person is just me :)

Stephen









Daniel Kuppitz

Sep 18, 2013, 12:14:17 PM
to aureliu...@googlegroups.com
There's one more, my favorite exception in Faunus. It looks like this:

java.lang.RuntimeException: TimedOutException()
        ...
        at org.apache.cassandra.thrift.Cassandra$Client.recv_get_range_slices(Cassandra.java:734)
        at org.apache.cassandra.thrift.Cassandra$Client.get_range_slices(Cassandra.java:718)
        at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$StaticRowIterator.maybeInit(ColumnFamilyRecordReader.java:346)
        ... 17 more

The solution is to find the following lines in cassandra.yaml:

# How long the coordinator should wait for seq or index scans to complete
range_request_timeout_in_ms: 10000

...increase the range request timeout (I always use 60000 and it works, maybe lower values will also work) and restart Cassandra.
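
So the adjusted entry in cassandra.yaml would be (60000 is just the value I know works; lower values may be fine too):

range_request_timeout_in_ms: 60000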

Cheers,
Daniel





stephen mallette

Sep 19, 2013, 6:37:47 PM
to aureliu...@googlegroups.com
Thought I would supply some more updates here.  As it turned out, this setting:

cassandra.range.batch.size

is not so good. It doesn't appear to work as advertised (I could be misinterpreting the results) and produces some inconsistencies in the in/out edge counts (at least in graphs of the size I was working with). It almost behaves as a "cap" on the number of edges that can be read. You can read more about that issue here (I added it to Faunus, but I'm not sure it belongs there): https://github.com/thinkaurelius/faunus/issues/143  I've since removed it from my properties file.

I think Daniel's suggestion of adding range_request_timeout_in_ms to cassandra.yaml was a good one, but I also bumped these numbers considerably (without the change to cassandra.yaml...I couldn't touch that in the environment I was working in):

mapred.max.tracker.failures=256
mapred.map.max.attempts=128
mapred.reduce.max.attempts=128

I picked up a number of timeouts, but at least the job kept going and finished. Looking forward to when I can change cassandra.yaml to test that out.

Stephen

Kevin ADDA

Oct 18, 2013, 5:26:06 AM
to aureliu...@googlegroups.com
Another one that seems closely related is described and answered here: https://groups.google.com/forum/#!topic/aureliusgraphs/jlm2Md2Mvmk

In a nutshell, the failure is "java.io.IOException: Could not commit transaction due to exception during persistence", but the stack trace also says "Caused by: java.net.SocketException: Connection reset", even though the connection seemed to be up during the job.

The key was a thrift_framed_transport_size_in_mb setting that was too low; the default is 15. Increasing it to 50 made the job go through.
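
In cassandra.yaml that change looks like:

thrift_framed_transport_size_in_mb: 50

Presumably this server-side limit should be at least as large as the client-side cassandra.thrift.framed.size_mb discussed earlier in the thread, so it makes sense to keep the two in sync.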


Kevin