Sensei 1.5.1 node repeatedly disconnecting from zookeeper.

Jayadev Jayaraman

Jul 25, 2013, 5:57:33 PM
to sensei...@googlegroups.com
Hi,

We're hosting a 3-node Sensei 1.5.1 cluster (4 shards and a 25G heap per node, with approx. 28 GB of index files on disk per node and about 192M documents in total) and a 3-node ZooKeeper ensemble running ZooKeeper 3.4.0. We use Kafka for our gateway, with the Kafka server hosted on a separate machine.

We're having trouble with one of the nodes (10.70.158.129 in the attached ZooKeeper logs), which repeatedly disconnects from the ZooKeeper ensemble and has to refresh its ZooKeeper connection, sometimes several times a minute. We've started encountering this issue only recently, so we're wondering if it's because of GC halts (https://issues.apache.org/jira/browse/ZOOKEEPER-1382). We're planning a hardware upgrade for Sensei anyway, but we aren't sure if this is really the issue. The problem occurs on this node regardless of query load, and it's slowing down data consumption a lot.

I've attached the sensei.properties file, a snapshot of sensei-main.log from the problem node, and a snapshot of the ZooKeeper leader's logs (notice session 0x23b1981b80ec73e).

 
sensei.properties
sensei-node.log
zookeeper-leader.log

Volodymyr Zhabiuk

Jul 26, 2013, 2:45:49 AM
to sensei...@googlegroups.com
It's hard to tell, but it seems to be a full/endless GC issue. Could you
add GC logging:
-verbose:gc -Xloggc:<file> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
-XX:+PrintGCDateStamps -XX:+PrintHeapAtGC


Thanks,
Volodymyr

Jayadev Jayaraman

Jul 27, 2013, 1:01:41 AM
to sensei...@googlegroups.com
Thanks for the pointer. I've attached a snapshot of the GC logs generated by the new GC_OPTS switches. How do I interpret the contents?

Also, the ZooKeeper disconnects keep happening, though the consumption slowdown seems milder since the restart that applied the new GC options.
sensei-gc.log

Jayadev Jayaraman

Jul 27, 2013, 1:41:52 AM
to sensei...@googlegroups.com
Update: Performance has deteriorated again. I am attaching snapshots of the GC logs and the sensei-main.log file as well. It looks more and more like a GC / heap issue. What perplexes me is why the other 2 nodes remain unaffected and don't face ZooKeeper connection problems.
sensei-gc.log
sensei-node.log

Volodymyr Zhabiuk

Jul 27, 2013, 5:19:42 PM
to sensei...@googlegroups.com
It's a GC issue

2013-07-27T01:35:29.993-0400: 3132.291: [Full GC 3132.291: [CMS: 24117248K->24117247K(24117248K), 17.6407850 secs] 26004735K->26004729K(26004736K), [CMS Perm : 38463K->38463K(64412K)], 17.6409100 secs] [Times: user=17.63 sys=0.00, real=17.64 secs]
Heap after GC invocations=94 (full 163):
 ...
 concurrent mark-sweep generation total 24117248K, used 24117247K [0x000000023ae00000, 0x00000007fae00000, 0x00000007fae00000)
2013-07-27T01:35:47.635-0400: 3149.933: [Full GC 3149.933: [CMS

A Full GC blocks all threads, which leads to the connection timeouts.
You can see that after the GC cycle there is no free space in the tenured generation.

You could try increasing the heap and/or changing the young/old generation ratio, or decreasing the number of facets.
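
For anyone who wants to read these entries without doing the subtraction by hand, here is a minimal Python sketch that pulls the key figures out of the Full GC entry quoted above. The regular expression and the `summarize_full_gc` helper are illustrative names written against that single sample line; other collectors or JVM versions log in different formats, so treat it as a starting point rather than a general parser.

```python
import re

# Pattern for the CMS "Full GC" entry format seen in the quoted log line.
# It is written against that one sample, not against every HotSpot log variant.
FULL_GC = re.compile(
    r"\[Full GC [\d.]+: "
    r"\[CMS: (?P<old_before>\d+)K->(?P<old_after>\d+)K\((?P<old_cap>\d+)K\), [\d.]+ secs\] "
    r"(?P<heap_before>\d+)K->(?P<heap_after>\d+)K\((?P<heap_cap>\d+)K\), "
    r".*?real=(?P<real>[\d.]+) secs\]"
)


def summarize_full_gc(line):
    """Return key figures from one Full GC entry, or None if the line does not match."""
    m = FULL_GC.search(line)
    if m is None:
        return None
    old_after = int(m.group("old_after"))
    old_cap = int(m.group("old_cap"))
    return {
        # Space left in the old (tenured) generation after the collection.
        "old_free_kb": old_cap - old_after,
        "old_used_pct": 100.0 * old_after / old_cap,
        # Total heap reclaimed by this collection.
        "heap_freed_kb": int(m.group("heap_before")) - int(m.group("heap_after")),
        # Wall-clock time during which application threads were stopped.
        "pause_s": float(m.group("real")),
    }


# The entry copied from the log excerpt in this thread.
SAMPLE = (
    "2013-07-27T01:35:29.993-0400: 3132.291: [Full GC 3132.291: "
    "[CMS: 24117248K->24117247K(24117248K), 17.6407850 secs] "
    "26004735K->26004729K(26004736K), "
    "[CMS Perm : 38463K->38463K(64412K)], 17.6409100 secs] "
    "[Times: user=17.63 sys=0.00, real=17.64 secs]"
)

summary = summarize_full_gc(SAMPLE)
print(summary)
```

For the quoted entry this reports 1K free in the old generation, 6K reclaimed from the whole heap, and a 17.64-second pause.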

Volodymyr


Otis

Jul 30, 2013, 2:21:08 PM
to sensei...@googlegroups.com
Hi,

You could also try G1 if you are using a recent Java 7.
It helped us and some of our clients in similar situations.


Otis

Volodymyr Zhabiuk

Jul 30, 2013, 2:43:53 PM
to sensei...@googlegroups.com
Hi Otis

I believe it wouldn't help in this case. G1 would reduce GC pauses and use fewer resources, but it can't do anything about insufficient heap space.
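
A rough way to see the heap-space point is to do the arithmetic on the two Full GC entries quoted earlier in the thread. The sketch below only uses numbers copied from those entries; the variable names are illustrative.

```python
# Values copied from the two Full GC entries quoted earlier in the thread.
first_gc_start_s = 3132.291     # JVM uptime when the first Full GC started
first_gc_pause_s = 17.6409100   # duration of that Full GC
second_gc_start_s = 3149.933    # JVM uptime when the next Full GC started
heap_before_kb = 26004735       # total heap in use before the first Full GC
heap_after_kb = 26004729        # total heap in use after it

# Memory the collection actually gave back to the application.
reclaimed_kb = heap_before_kb - heap_after_kb

# Time the application could run between the end of one Full GC
# and the start of the next.
gap_s = second_gc_start_s - (first_gc_start_s + first_gc_pause_s)

# Share of wall-clock time spent inside the Full GC pause.
paused_fraction = first_gc_pause_s / (second_gc_start_s - first_gc_start_s)

print(f"reclaimed: {reclaimed_kb}K")
print(f"gap before next Full GC: {gap_s * 1000:.1f} ms")
print(f"fraction of time paused: {paused_fraction:.4%}")
```

With these numbers the collection frees 6K, and the next Full GC begins about a millisecond after the first one ends. The live data simply does not fit in the heap, which is a sizing problem rather than something a different collector can fix.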

Thanks,
Volodymyr

