Possible DNS issues


Chris Freyer

Sep 28, 2016, 7:30:33 PM
To: Druid Development
Hi Druid Dev team,

My team's Druid cluster is running in AWS, behind a VIP.  We have been experiencing a number of communication issues.  My first step in resolving this was to upgrade the Apache Curator jars (https://groups.google.com/forum/#!topic/druid-development/_vcQzVCtztM).  That worked well and has resolved some of our issues.

But we've experienced other issues as well.  My biggest concern is that some of our nodes get an OutOfMemoryError after having communication issues.  In several cases, a node hangs (i.e., the process is still running but non-functional) and requires manual intervention to fix.

All these errors seem to center on loss of communication with ZooKeeper.  I'm currently looking into a possible DNS issue with our infrastructure.  It's possible that IPv6 is enabled in our DNS and is slowing down IPv4 lookups.  I don't know how long Druid (or the JDK) caches DNS records, but that could be part of it.  Here is a description of that issue:  https://www.sixxs.net/faq/dns/?faq=ipv6slowconnect.
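
On the DNS caching question, a minimal sketch of how it could be checked from inside a JVM. As far as I know, the JDK's positive lookup cache is governed by the `networkaddress.cache.ttl` security property, and when it is unset (with no security manager installed) the default is implementation-specific. The class name `DnsCheck` and its helper methods are hypothetical, made up for illustration; they are not part of Druid or Curator.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.security.Security;

// Hypothetical diagnostic helper, not part of Druid.
final class DnsCheck {

    // Returns the configured positive DNS cache TTL (seconds), or "(unset)"
    // when the JVM falls back to its implementation-specific default.
    static String cacheTtlSetting() {
        String ttl = Security.getProperty("networkaddress.cache.ttl");
        return ttl == null ? "(unset)" : ttl;
    }

    // Resolves the host and returns the elapsed time in milliseconds,
    // or -1 if the name could not be resolved.
    static long timeLookupMillis(String host) {
        long start = System.nanoTime();
        try {
            InetAddress[] addresses = InetAddress.getAllByName(host);
            long elapsed = (System.nanoTime() - start) / 1_000_000;
            for (InetAddress address : addresses) {
                // Shows whether IPv4, IPv6, or both kinds of records come back.
                System.out.println("  " + address);
            }
            return elapsed;
        } catch (UnknownHostException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        // Pass the ZooKeeper host name as the first argument.
        String host = args.length > 0 ? args[0] : "localhost";
        System.out.println("networkaddress.cache.ttl = " + cacheTtlSetting());
        System.out.println("java.net.preferIPv4Stack = "
                + System.getProperty("java.net.preferIPv4Stack"));
        // The first call normally reaches the resolver; the second is
        // usually answered from the JVM's own cache.
        System.out.println("first lookup:  " + timeLookupMillis(host) + " ms");
        System.out.println("second lookup: " + timeLookupMillis(host) + " ms");
    }
}
```

If the first lookup is consistently slow, starting the same test with `-Djava.net.preferIPv4Stack=true` is one way to see whether IPv6 resolution is involved.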

Has anyone experienced issues like these?
Chris

Charles Allen

Sep 28, 2016, 9:44:18 PM
To: Druid Development
Is it possible that GC is causing connection timeouts to ZK?

That would manifest as "connection problems" followed by an OOME.
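
One rough way to test the GC theory from inside a JVM is to sample the standard `GarbageCollectorMXBean` counters and see how much of each interval goes to collection. This is only a sketch: `GcPauseCheck`, the 5-second window, and the 50% threshold are all assumptions chosen for illustration, and `getCollectionTime()` is an approximate accumulated figure that can include concurrent (non-pausing) work for some collectors.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical diagnostic helper, not part of Druid.
final class GcPauseCheck {

    // Sums the approximate accumulated collection time (ms) across all
    // collectors since JVM start. Collectors reporting -1 are skipped.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    // Fraction of a sampling window spent in GC; 0.0 for an empty window.
    static double gcFraction(long gcMillis, long windowMillis) {
        return windowMillis <= 0 ? 0.0 : (double) gcMillis / windowMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        long windowMillis = 5_000;   // assumed sampling window
        double threshold = 0.5;      // assumed "mostly in GC" cutoff
        while (true) {
            long gcBefore = totalGcMillis();
            long start = System.nanoTime();
            Thread.sleep(windowMillis);
            // Measure the real elapsed time: a long pause stretches the sleep.
            long elapsed = (System.nanoTime() - start) / 1_000_000;
            long gcDelta = totalGcMillis() - gcBefore;
            double fraction = gcFraction(gcDelta, elapsed);
            System.out.printf("window=%dms gc=%dms fraction=%.2f%s%n",
                    elapsed, gcDelta, fraction,
                    fraction >= threshold ? "  <-- mostly in GC" : "");
        }
    }
}
```

A window that runs far longer than the requested 5 seconds, with most of it attributed to GC, would be consistent with pauses long enough to miss ZooKeeper heartbeats. GC logging on the affected node would give more precise pause times than this sampling does.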

Chris Freyer

Sep 29, 2016, 1:30:58 PM
To: Druid Development
I do see GC statements in the log before an OOME happens.  An example (abbreviated):

2016-09-27T21:34:11,799 INFO [main-SendThread(ip-XX-XX-XX-XX.ec2.XXXXXXX:2181)] org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 30579ms for sessionid 0x3562c97cb1a1cef, closing socket connection and attempting reconnect
2016-09-27T21:34:11,799 INFO [main-SendThread(ip-XX-XX-XX-XX.ec2.XXXXXXX:2181)] org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 30579ms for sessionid 0x3562c97cb1a1cef, closing socket connection and attempting reconnect
2016-09-27T21:34:16,325 WARN [Finalizer] io.druid.collections.StupidPool - Not closed!  Object was[java.nio.DirectByteBuffer[pos=0 lim=65536 cap=65536]]. Allowing gc to prevent leak.
2016-09-27T21:34:16,325 WARN [Finalizer] io.druid.collections.StupidPool - Not closed!  Object was[java.nio.DirectByteBuffer[pos=0 lim=65536 cap=65536]]. Allowing gc to prevent leak.
2016-09-27T21:34:16,325 WARN [Finalizer] io.druid.collections.StupidPool - Not closed!  Object was[java.nio.DirectByteBuffer[pos=0 lim=49153 cap=65536]]. Allowing gc to prevent leak.
2016-09-27T21:34:16,325 WARN [Finalizer] io.druid.collections.StupidPool - Not closed!  Object was[java.nio.DirectByteBuffer[pos=0 lim=65536 cap=65536]]. Allowing gc to prevent leak.
2016-09-27T21:34:16,325 WARN [Finalizer] io.druid.collections.StupidPool - Not closed!  Object was[java.nio.DirectByteBuffer[pos=0 lim=1000000000 cap=1000000000]]. Allowing gc to prevent leak.
2016-09-27T21:34:16,326 ERROR [processing-0] com.google.common.util.concurrent.Futures$CombinedFuture - input future failed.
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3181) ~[?:1.8.0_60]
        at java.util.ArrayList.grow(ArrayList.java:261) ~[?:1.8.0_60]
