Scheduler loses zookeeper connection then cannot reconnect


Kevin Taylor

Feb 13, 2015, 6:30:46 AM
to chronos-...@googlegroups.com
The scheduler periodically stops scheduling, typically after about two days of correctly functioning jobs.
The problem appears to be that the ZooKeeper connection gets closed and the scheduler doesn't reconnect, so state is not maintained correctly and the whole process stops functioning.
I have to restart both the scheduler and ZooKeeper to re-establish the process.

This is a single ZooKeeper instance, so is there a way of configuring ZooKeeper so that client timeouts do not occur?
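For context on the question: session timeouts cannot be disabled outright, but the range a standalone server will negotiate with clients is governed by tickTime plus the minSessionTimeout/maxSessionTimeout settings in zoo.cfg. A sketch with illustrative values only (not a recommendation for this particular setup):

```
# zoo.cfg (standalone instance) -- illustrative values only
tickTime=2000
# The server clamps each client's requested session timeout to this range;
# the defaults are 2*tickTime (min) and 20*tickTime (max).
minSessionTimeout=4000
maxSessionTimeout=120000
```

Raising the ceiling only makes sessions survive longer pauses; it does not fix a client that fails to re-establish its session after expiry, which is what the logs below suggest.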


airbnbchronos_1 | [2015-02-13 00:00:02,295] INFO Task 'ct:1423785600000:0:keep-alive:' launched, status: 'DRIVER_RUNNING' (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:174)
airbnbchronos_1 | [2015-02-13 00:00:02,426] INFO Task with id 'ct:1423785600000:0:keep-alive:' RUNNING. (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:222)
airbnbchronos_1 | [2015-02-13 00:00:02,527] INFO Task with id 'ct:1423785600000:0:keep-alive:' FINISHED (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:208)
airbnbchronos_1 | [2015-02-13 00:00:02,528] INFO Persisting job 'keep-alive' with data 'ScheduleBasedJob(R/2015-02-13T01:00:00.000Z/PT1H,keep-alive,echo '' >/dev/null,PT60S,32,0,,,2,a...@a.com,2015-02-13T00:00:02.528Z,,false,0.1,256.0,128.0,false,0,ListBuffer(),false,root,null,,ListBuffer(),true,ListBuffer(),false)' (org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore:63)
airbnbchronos_1 | [2015-02-13 00:00:02,530] INFO Key for state exists already: J_keep-alive (org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore:79)
airbnbchronos_1 | 2015-02-13 00:00:09,196:1(0x7f9340d6b700):ZOO_ERROR@handle_socket_error_msg@1643: Socket [172.17.0.3:2181] zk retcode=-7, errno=110(Connection timed out): connection to 172.17.0.3:2181 timed out (exceeded timeout by 0ms)
airbnbchronos_1 | 2015-02-13 00:00:12,531:1(0x7f9340d6b700):ZOO_WARN@zookeeper_interest@1557: Exceeded deadline by 3334ms
airbnbchronos_1 | 2015-02-13 00:00:12,532:1(0x7f9340d6b700):ZOO_INFO@check_events@1703: initiated connection to server [172.17.0.3:2181]
airbnbchronos_1 | 2015-02-13 00:00:12,534:1(0x7f9340d6b700):ZOO_INFO@check_events@1750: session establishment complete on server [172.17.0.3:2181], sessionId=0x14b7909e2fa0002, negotiated timeout=10000
airbnbchronos_1 | [2015-02-13 00:00:12,538] WARN State update failed. (org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore:86)
airbnbchronos_1 | [2015-02-13 00:00:12,539] INFO Dependents: [] (org.apache.mesos.chronos.scheduler.graph.JobGraph:168)
airbnbchronos_1 | [2015-02-13 00:00:12,540] INFO Received resource offers
airbnbchronos_1 |  (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:77)
airbnbchronos_1 | [2015-02-13 00:00:12,540] INFO No tasks scheduled or next task has been disabled.
airbnbchronos_1 |  (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:103)
airbnbchronos_1 | [2015-02-13 00:00:12,540] INFO Declining unused offers.
airbnbchronos_1 |  (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:84)
airbnbchronos_1 | [2015-02-13 00:00:12,541] INFO Task with id 'ct:1423785600000:0:keep-alive:' FINISHED (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:208)
airbnbchronos_1 | [2015-02-13 00:00:12,542] INFO Persisting job 'keep-alive' with data 'ScheduleBasedJob(R/2015-02-13T01:00:00.000Z/PT1H,keep-alive,echo '' >/dev/null,PT60S,33,0,,,2,a...@a.com,2015-02-13T00:00:12.541Z,,false,0.1,256.0,128.0,false,0,ListBuffer(),false,root,null,,ListBuffer(),true,ListBuffer(),false)' (org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore:63)
airbnbchronos_1 | [2015-02-13 00:00:12,543] INFO Key for state exists already: J_keep-alive (org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore:79)
airbnbchronos_1 | 2015-02-13 00:00:19,210:1(0x7f9340d6b700):ZOO_ERROR@handle_socket_error_msg@1643: Socket [172.17.0.3:2181] zk retcode=-7, errno=110(Connection timed out): connection to 172.17.0.3:2181 timed out (exceeded timeout by 0ms)
airbnbchronos_1 | 2015-02-13 00:00:22,546:1(0x7f9340d6b700):ZOO_WARN@zookeeper_interest@1557: Exceeded deadline by 3336ms
airbnbchronos_1 | 2015-02-13 00:00:22,547:1(0x7f9340d6b700):ZOO_INFO@check_events@1703: initiated connection to server [172.17.0.3:2181]
airbnbchronos_1 | 2015-02-13 00:00:22,549:1(0x7f9340d6b700):ZOO_INFO@check_events@1750: session establishment complete on server [172.17.0.3:2181], sessionId=0x14b7909e2fa0002, negotiated timeout=10000
airbnbchronos_1 | [2015-02-13 00:00:28,962] INFO Client session timed out, have not heard from server in 26668ms for sessionid 0x14b7909e2fa0001, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn:1096)
airbnbchronos_1 | [2015-02-13 00:00:29,064] INFO State change: SUSPENDED (org.apache.curator.framework.state.ConnectionStateManager:228)
airbnbchronos_1 | 2015-02-13 00:00:29,214:1(0x7f9340d6b700):ZOO_ERROR@handle_socket_error_msg@1643: Socket [172.17.0.3:2181] zk retcode=-7, errno=110(Connection timed out): connection to 172.17.0.3:2181 timed out (exceeded timeout by 0ms)
airbnbchronos_1 | [2015-02-13 00:00:30,650] INFO Opening socket connection to server 172.17.0.3/172.17.0.3:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn:975)
airbnbchronos_1 | [2015-02-13 00:00:30,652] INFO Socket connection established to 172.17.0.3/172.17.0.3:2181, initiating session (org.apache.zookeeper.ClientCnxn:852)
airbnbchronos_1 | [2015-02-13 00:00:30,653] INFO Session establishment complete on server 172.17.0.3/172.17.0.3:2181, sessionid = 0x14b7909e2fa0001, negotiated timeout = 40000 (org.apache.zookeeper.ClientCnxn:1235)
airbnbchronos_1 | [2015-02-13 00:00:30,918] INFO State change: RECONNECTED (org.apache.curator.framework.state.ConnectionStateManager:228)
airbnbchronos_1 | I0213 00:00:32.106279    52 sched.cpp:1286] Asked to stop the driver
airbnbchronos_1 | [2015-02-13 00:00:32,106] INFO Defeated. Not the current leader. (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:612)
airbnbchronos_1 | 2015-02-13 00:00:32,549:1(0x7f9340d6b700):ZOO_WARN@zookeeper_interest@1557: Exceeded deadline by 3334ms
airbnbchronos_1 | 2015-02-13 00:00:32,550:1(0x7f9340d6b700):ZOO_INFO@check_events@1703: initiated connection to server [172.17.0.3:2181]
airbnbchronos_1 | 2015-02-13 00:00:32,551:1(0x7f9340d6b700):ZOO_INFO@check_events@1750: session establishment complete on server [172.17.0.3:2181], sessionId=0x14b7909e2fa0002, negotiated timeout=10000
airbnbchronos_1 | [2015-02-13 00:00:32,808] WARN State update failed. (org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore:86)
airbnbchronos_1 | [2015-02-13 00:00:32,808] INFO Size of streams: 2 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:491)

Brenden Matthews

Feb 13, 2015, 2:44:25 PM
to Kevin Taylor, chronos-...@googlegroups.com
How old is that build of Chronos? It looks like a bug that got fixed a while ago.


Kevin Taylor

Feb 14, 2015, 5:20:17 AM
to chronos-...@googlegroups.com, kevin.ta...@gmail.com
Built directly from the GitHub head a few days ago (so it is the jar from the target folder).
I know it is recent because the recently added fix to the retry delay interval is working.
Kevin

Kevin Taylor

Feb 24, 2015, 6:45:40 AM
to chronos-...@googlegroups.com, kevin.ta...@gmail.com
Hi

Does anyone have any further input on this?

It's slightly annoying that the scheduler runs for a few days and then just appears to die without recovering; I have to restart the processes.

Thanks for any input

Kevin

Brenden Matthews

Mar 1, 2015, 12:02:11 PM
to Kevin Taylor, chronos-...@googlegroups.com
Seems like it might be a problem with your ZooKeeper setup. Are your ZK instances doing anything weird? Do you specify their location with DNS and then have their IPs change?

Kevin Taylor

Mar 1, 2015, 2:12:13 PM
to chronos-...@googlegroups.com
Nothing particularly fancy, although it does run in a Docker container and is a single instance. Could you post your ZooKeeper config so that I can compare settings? Thanks.

Kevin Taylor

Mar 9, 2015, 3:55:39 AM
to chronos-...@googlegroups.com
This suite almost ran for a full week and then crashed: it ran for six days without any issue, then completely collapsed. The only difference I can see on the day it fell over is that there was a failed job in the suite; every job had been completing successfully all week since the 3rd.

I am at a loss as to why this is occurring, and I cannot proceed to a production environment with this problem.

I have dumped the entire logs of Chronos, ZooKeeper and Mesos if you wish to see them.

This is a snippet of the failure:

ZooKeeper:

2015-03-08 11:00:01,705 [myid:] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x14bdf5f3ab70002, likely client has closed socket
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
        at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
        at java.lang.Thread.run(Thread.java:724)
2015-03-08 11:00:01,706 [myid:] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /172.17.0.5:53187 which had sessionid 0x14bdf5f3ab70002


Chronos:

[2015-03-08 11:00:03,333] INFO State change: SUSPENDED (org.apache.curator.framework.state.ConnectionStateManager:228)
[2015-03-08 11:00:05,105] INFO Opening socket connection to server 172.17.0.3/172.17.0.3:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn:975)
[2015-03-08 11:00:05,105] INFO Socket connection established to 172.17.0.3/172.17.0.3:2181, initiating session (org.apache.zookeeper.ClientCnxn:852)
[2015-03-08 11:00:05,107] INFO Session establishment complete on server 172.17.0.3/172.17.0.3:2181, sessionid = 0x14bdf5f3ab70001, negotiated timeout = 40000 (org.apache.zookeeper.ClientCnxn:1235)
2015-03-08 11:00:05,042:1(0x7fab17fff700):ZOO_WARN@zookeeper_interest@1557: Exceeded deadline by 3337ms
2015-03-08 11:00:05,042:1(0x7fab17fff700):ZOO_INFO@check_events@1703: initiated connection to server [172.17.0.3:2181]
2015-03-08 11:00:05,044:1(0x7fab17fff700):ZOO_INFO@check_events@1750: session establishment complete on server [172.17.0.3:2181], sessionId=0x14bdf5f3ab70002, negotiated timeout=10000



[2015-03-08 11:00:55,063] ERROR Listener (org.apache.mesos.chronos.scheduler.jobs.JobScheduler$$anon$3@4eb8715e) threw an exception (org.apache.curator.framework.listen.ListenerContainer:96)
java.lang.IllegalArgumentException: requirement failed: Vertex already exists in graph keep-alive
        at scala.Predef$.require(Predef.scala:219)
        at org.apache.mesos.chronos.scheduler.graph.JobGraph.addVertex(JobGraph.scala:67)
        at org.apache.mesos.chronos.scheduler.jobs.JobScheduler$$anonfun$registerJob$3.apply(JobScheduler.scala:152)
        at org.apache.mesos.chronos.scheduler.jobs.JobScheduler$$anonfun$registerJob$3.apply(JobScheduler.scala:151)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
        at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45)
        at org.apache.mesos.chronos.scheduler.jobs.JobScheduler.registerJob(JobScheduler.scala:150)
        at org.apache.mesos.chronos.scheduler.jobs.JobUtils$.loadJobs(JobUtils.scala:75)
        at org.apache.mesos.chronos.scheduler.jobs.JobScheduler.onElected(JobScheduler.scala:628)
        at org.apache.mesos.chronos.scheduler.jobs.JobScheduler$$anon$3.isLeader(JobScheduler.scala:594)
        at org.apache.curator.framework.recipes.leader.LeaderLatch$9.apply(LeaderLatch.java:644)
        at org.apache.curator.framework.recipes.leader.LeaderLatch$9.apply(LeaderLatch.java:640)
        at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:92)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:701)

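The stack trace above shows the failure mode: after the ZooKeeper session bounces, the scheduler is re-elected leader and onElected replays loadJobs, but JobGraph.addVertex enforces uniqueness with require(), so re-registering a job that is still in the in-memory graph throws instead of being a no-op. A minimal hypothetical sketch of that pattern (JobGraphSketch and addVertexIfAbsent are illustrative names, not Chronos's actual code):

```java
import java.util.HashSet;
import java.util.Set;

public class JobGraphSketch {
    private final Set<String> vertices = new HashSet<>();

    // Strict insert, mirroring the require() in JobGraph.addVertex:
    // a duplicate registration throws, as in the stack trace above.
    public void addVertex(String name) {
        if (vertices.contains(name)) {
            throw new IllegalArgumentException(
                "requirement failed: Vertex already exists in graph " + name);
        }
        vertices.add(name);
    }

    // Idempotent alternative: safe to replay after re-election.
    // Returns true only if the vertex was actually added.
    public boolean addVertexIfAbsent(String name) {
        return vertices.add(name);
    }

    public static void main(String[] args) {
        JobGraphSketch g = new JobGraphSketch();
        g.addVertex("keep-alive");                 // first registration succeeds
        System.out.println(g.addVertexIfAbsent("keep-alive")); // prints "false"
        try {
            g.addVertex("keep-alive");             // strict path throws
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

If the leader replays job loading without clearing or guarding the graph, the listener exception above is the result; an idempotent insert (or resetting the graph before reload) avoids it.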
Thanks, Kevin