zookeeper connection loss?


Rui Wang

Oct 31, 2013, 5:17:18 PM
to druid-de...@googlegroups.com
Hi,

When I am running an intensive query, I noticed the following messages scrolling in the compute node logs:

2013-10-31 20:32:34,188 INFO [main-EventThread] org.apache.curator.framework.state.ConnectionStateManager - State change: RECONNECTED
2013-10-31 20:32:34,188 INFO [ZkCoordinator-0] com.metamx.druid.coordination.ZkCoordinator - Ignoring event[PathChildrenCacheEvent{type=CONNECTION_RECONNECTED, data=null}]
2013-10-31 20:32:34,304 INFO [main-EventThread] org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
2013-10-31 20:32:34,304 INFO [ZkCoordinator-0] com.metamx.druid.coordination.ZkCoordinator - Ignoring event[PathChildrenCacheEvent{type=CONNECTION_SUSPENDED, data=null}]

and

2013-10-31 20:40:04,557 ERROR [CuratorFramework-0] org.apache.curator.ConnectionState - Connection timed out for connection string (10.5.11.24:2181) and timeout (15000) / elapsed (23693)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:191)
        at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:86)
        at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:108)
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:455)
        at org.apache.curator.framework.imps.GetChildrenBuilderImpl.performBackgroundOperation(GetChildrenBuilderImpl.java:175)
        at org.apache.curator.framework.imps.OperationAndData.callPerformBackgroundOperation(OperationAndData.java:57)
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:659)
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:651)
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:54)
        at org.apache.curator.framework.imps.CuratorFrameworkImpl$3.call(CuratorFrameworkImpl.java:242)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:679)

May I ask what this means? Is the latency to ZooKeeper too high? One thing worth mentioning is that the Druid cluster we set up is in California while the ZooKeeper ensemble is in Virginia. Would the fix be to stand up a ZooKeeper ensemble nearby, or am I missing something else in the config?

Thanks,
Rui

Rui Wang

Oct 31, 2013, 7:59:51 PM
to druid-de...@googlegroups.com
I also noticed that once a query runs for 5 minutes, it gets disconnected:

java.util.concurrent.ExecutionException: java.lang.RuntimeException: Channel disconnected

real    5m0.067s
user    0m0.005s
sys    0m0.003s

However, sometimes the data size is just enormous... how do we deal with this?

Thanks,
Rui

Fangjin Yang

Oct 31, 2013, 9:40:06 PM
to druid-de...@googlegroups.com
Hi Rui,

What types of queries and data volumes are we talking about to see these sorts of disconnects? I want to understand the time range of the query and how many segments it hits. Do you always see these errors associated with a long query? What are your typical segment sizes? Are you running these queries against the Broker?

-- FJ



Eric Tschetter

Nov 1, 2013, 12:41:11 AM
to druid-de...@googlegroups.com
To add some color to what Fangjin is asking for:

Disconnects to ZK are generally caused by two things:

1) GC churn: while the node is GCing, it is not talking to ZK. If you have lots of long GC pauses, they can cause reconnect churn (a quick way to check for this is sketched after this list).

2) Latency: if you are running in CA and ZK is in VA, then this could definitely be causing issues. You can try increasing the session timeout of your ZK connections and see if that helps (see the sketch after this list). ZK also has Observer nodes (the "repeater" nodes I had in mind) that you can run in your CA data center: they look like any other ZK quorum node to clients, but instead of participating in votes, they just forward all writes over to the real quorum (an example observer config is also sketched below).
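
For (1), a quick way to see whether GC pauses line up with the SUSPENDED/RECONNECTED messages is to turn on GC logging on the compute node JVMs. These are standard HotSpot flags, nothing Druid-specific, and the log path is just an example:

-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/druid/gc.log

For (2), here is a minimal sketch of what raising the timeouts means at the Curator level. This is illustrative only: Druid builds its Curator client internally from its own runtime properties, so the actual knob to turn depends on your Druid version, but these are the two timeouts in play (Curator's default connection timeout is 15000 ms, which matches the number in the error above).

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ZkTimeoutSketch
{
  public static void main(String[] args) throws Exception
  {
    // Illustrative values; in a real deployment these come from Druid's ZK configuration.
    CuratorFramework client = CuratorFrameworkFactory.builder()
        .connectString("10.5.11.24:2181")    // the ZK connect string from the error above
        .sessionTimeoutMs(60000)             // session timeout: raise this on a high-latency link
        .connectionTimeoutMs(30000)          // connection timeout: 15000 is Curator's default
        .retryPolicy(new ExponentialBackoffRetry(1000, 10))
        .build();
    client.start();
    client.close();
  }
}

And a minimal sketch of running an Observer in the CA data center, with made-up host names (zk-va-1 through zk-va-3 are the real quorum in VA, zk-ca-1 is the observer). The observer's own zoo.cfg carries peerType=observer, every member of the ensemble lists the observer with the :observer suffix, and the CA Druid nodes then point their ZK connect string at zk-ca-1:2181:

# zoo.cfg on zk-ca-1 (the observer)
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
peerType=observer
server.1=zk-va-1:2888:3888
server.2=zk-va-2:2888:3888
server.3=zk-va-3:2888:3888
server.4=zk-ca-1:2888:3888:observer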

For the 5 minute disconnect thing, we've often seen load balancers and other networking infrastructure kill "idle" connections and an HTTP request that is just sitting there waiting for a response definitely looks like an "idle" connection.
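
One cheap way to confirm that theory: run the same long query twice under time, once through whatever load balancer or proxy sits in front of the broker, and once straight at the broker's own host and port. If only the first one dies at almost exactly 300 seconds, it is the middlebox and not Druid. Host names, port, and query file below are placeholders, and the /druid/v2/ path assumes a 0.5/0.6-era broker endpoint:

time curl -X POST -H 'Content-Type: application/json' -d @query.json 'http://lb-in-front-of-broker/druid/v2/?pretty'
time curl -X POST -H 'Content-Type: application/json' -d @query.json 'http://broker-host:8080/druid/v2/?pretty'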

If your ZK issues are in fact GC, then fixing that will most likely speed up your queries and hopefully eliminate this problem altogether.  The other option, if it is indeed some networking infrastructure messing with stuff, is to adjust the configuration on that infrastructure (which can have other ramifications as well...)

--Eric


Rui Wang

Nov 5, 2013, 6:24:05 PM
to druid-de...@googlegroups.com
Hi Fangjin,

Sorry for the delay.

 
What types of queries and data volumes are we talking about to see these sorts of disconnects? I want to understand the time range of the query and how many segments it hits. Do you always see these errors associated with a long query? What are your typical segment sizes? Are you running these queries against the Broker?


The query would be something like the following:

{
   "queryType":"groupBy",
   "dataSource":"stagecdh",
   "granularity":"all",
   "dimensions":[
      "publisher_id"
   ],
   "aggregations":[
      {
         "type":"count",
         "name":"rows"
      },
      {
         "type":"doubleSum",
         "fieldName":"revenue",
         "name":"revenue"
      },
      {
         "type":"longSum",
         "fieldName":"impressions",
         "name":"impPerCountry"
      },
      {
         "type":"longSum",
         "fieldName":"sold_cnt",
         "name":"soldcntPerCountry"
      }
   ],
   "intervals":[
      "2013-09-02T00/2013-09-15T23"
   ]
}

As you can see, it is just normal stuff. I suspect this problem has nothing to do with the query, because the error messages keep coming even when there is no query running. Latency might be a problem, as Eric said; I'll talk to our SRE to find out.

Yes, I was querying against the broker node.

Segment size... this might be another potential reason. Is it true that the ideal segment size is between 700MB and 1000MB? At the beginning, due to OOM problems on the reducers, we had to make the partitions really small to let the job run through, so we have a lot of small segments (around 14MB each). Later we used other machines with more RAM, which let us build segments of about 1GB each.

The total number of those smaller segments is about 42,000, and the query ran against them.

On a separate note, after we successfully generated the 1GB segments, we started getting rid of the smaller (14MB) segments. There was a period when the cluster was full (for this test we only had 3 machines with 800GB each for Druid), and the 1GB segments were only loaded for part of the time range we have available. So I was expecting that as Druid removed the smaller segments, the larger segments would continue loading... but I don't see that happening. Over 3 days each machine dropped about 200GB of the smaller segments, yet the remaining larger segments were not loaded. Am I missing something?

Another matter: after I modified the replication factor, it didn't seem to take effect. Is it true that it only applies to segments added in the future?

Thanks,
Rui


Fangjin Yang

Nov 6, 2013, 9:54:46 PM
to druid-de...@googlegroups.com
Hi Rui, see inline.

The 700MB to 1GB range you mention has worked for us, but we've also used smaller segment sizes.
 
At the beginning, due to OOM problems on the reducers, we had to make the partitions really small to let the job run through, so we have a lot of small segments (around 14MB each). Later we used other machines with more RAM, which let us build segments of about 1GB each.

The total number of those smaller segments is about 42,000, and the query ran against them.

On a separate note, after we successfully generated the 1GB segments, we started getting rid of the smaller (14MB) segments. There was a period when the cluster was full (for this test we only had 3 machines with 800GB each for Druid), and the 1GB segments were only loaded for part of the time range we have available. So I was expecting that as Druid removed the smaller segments, the larger segments would continue loading... but I don't see that happening. Over 3 days each machine dropped about 200GB of the smaller segments, yet the remaining larger segments were not loaded. Am I missing something?

Do you see any error messages on your compute nodes about being out of capacity or out of storage?
 

Another matter: after I modified the replication factor, it didn't seem to take effect. Is it true that it only applies to segments added in the future?

They should apply to the immediate state of the cluster. There may be logs on the master node indicating what is happening and why segments aren't being loaded or dropped.

Rui Wang

Nov 12, 2013, 2:47:21 PM
to druid-de...@googlegroups.com
Hi Fangjin,

Please see inline.

The total number of those smaller segments is about 42,000, and the query ran against them.

On a separate note, after we successfully generated the 1GB segments, we started getting rid of the smaller (14MB) segments. There was a period when the cluster was full (for this test we only had 3 machines with 800GB each for Druid), and the 1GB segments were only loaded for part of the time range we have available. So I was expecting that as Druid removed the smaller segments, the larger segments would continue loading... but I don't see that happening. Over 3 days each machine dropped about 200GB of the smaller segments, yet the remaining larger segments were not loaded. Am I missing something?

Do you see any error messages on your compute nodes about being out of capacity or out of storage?

I saw a few, but later, when there was about 200GB on each machine, those large segments still didn't load. They only loaded after all the smaller segments were gone.
 
 

Another matter: after I modified the replication factor, it didn't seem to take effect. Is it true that it only applies to segments added in the future?

They should apply to the immediate state of the cluster. There may be logs on the master node indicating what is happening and why segments aren't being loaded or dropped.

Yes, I take back what I said. All the duplicates are gone now, too.

Btw, do you guys have a real example of setting up a realtime node with Kafka?

Thanks,
Rui
 

Fangjin Yang

Nov 15, 2013, 1:24:38 AM
to druid-de...@googlegroups.com
Hi, see inline.


On Tuesday, November 12, 2013 11:47:21 AM UTC-8, Rui Wang wrote:
Hi Fangjin,

Please see inline.

The total number of those smaller segments is about 42,000, and the query ran against them.

On a separate note, after we successfully generated the 1GB segments, we started getting rid of the smaller (14MB) segments. There was a period when the cluster was full (for this test we only had 3 machines with 800GB each for Druid), and the 1GB segments were only loaded for part of the time range we have available. So I was expecting that as Druid removed the smaller segments, the larger segments would continue loading... but I don't see that happening. Over 3 days each machine dropped about 200GB of the smaller segments, yet the remaining larger segments were not loaded. Am I missing something?

Do you see any error messages on your compute nodes about being out of capacity or out of storage?

I saw a few, but later, when there was about 200GB on each machine, those large segments still didn't load. They only loaded after all the smaller segments were gone.

Do you have any logs you can share about this problem? In particular, I am curious about what your maxSize is set to and how much data each node was serving.
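
(For reference, a minimal sketch of the capacity settings in question. The property names here follow the 0.6.x docs and the numbers are placeholders; older compute node configs express the same idea with slightly different property names, so treat this as illustrative rather than exact.)

# runtime.properties on each compute/historical node (illustrative values, roughly 300GB)
druid.server.maxSize=300000000000
druid.segmentCache.locations=[{"path": "/mnt/druid/indexCache", "maxSize": 300000000000}]

The master will not assign a segment to a node whose maxSize would be exceeded, which is the usual reason large segments sit unloaded even though the disk itself still has room.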
 
 
 

Another matter: after I modified the replication factor, it didn't seem to take effect. Is it true that it only applies to segments added in the future?

They should apply to the immediate state of the cluster. There may be logs on the master node indicating what is happening and why segments aren't being loaded or dropped.

Yes, I take back what I said. All the duplicates are gone now, too.

Btw, do you guys have a real example of setting up a realtime node with Kafka?


We do for Druid 0.6.x; check out http://druid.io/docs/latest/Realtime.html (a rough sketch of what such a spec looks like is below).
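
As an outline only (this is from memory, not the authoritative example; check the linked docs for the exact field names and firehose type string for your version), a Kafka-backed realtime.spec in the 0.6.x style looks roughly like the following. The dataSource and aggregators are borrowed from the query earlier in this thread, and the ZK host and Kafka topic are placeholders:

[
  {
    "schema": {
      "dataSource": "stagecdh",
      "aggregators": [
        {"type": "count", "name": "rows"},
        {"type": "doubleSum", "fieldName": "revenue", "name": "revenue"}
      ],
      "indexGranularity": "minute",
      "shardSpec": {"type": "none"}
    },
    "config": {
      "maxRowsInMemory": 500000,
      "intermediatePersistPeriod": "PT10m"
    },
    "firehose": {
      "type": "kafka-0.7.2",
      "consumerProps": {
        "zk.connect": "zk-host:2181",
        "groupid": "druid-example",
        "fetch.size": "1048576",
        "autooffset.reset": "largest",
        "autocommit.enable": "false"
      },
      "feed": "your_kafka_topic",
      "parser": {
        "timestampSpec": {"column": "timestamp", "format": "iso"},
        "data": {"format": "json"},
        "dimensionExclusions": []
      }
    },
    "plumber": {
      "type": "realtime",
      "windowPeriod": "PT10m",
      "segmentGranularity": "hour",
      "basePersistDirectory": "/tmp/realtime/basePersist",
      "rejectionPolicy": {"type": "messageTime"}
    }
  }
]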