The scheduler periodically stops scheduling; this seems to happen after about 2 days of jobs running correctly.
The problem appears to be that the ZooKeeper connection gets closed and the scheduler doesn't reconnect, so state is not maintained correctly and the whole process stops functioning.
I have to restart both the scheduler and ZooKeeper to get things working again.
This is a single ZooKeeper instance, so is there a way of configuring ZooKeeper so that client timeouts do not occur?
airbnbchronos_1 | [2015-02-13 00:00:02,295] INFO Task 'ct:1423785600000:0:keep-alive:' launched, status: 'DRIVER_RUNNING' (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:174)
airbnbchronos_1 | [2015-02-13 00:00:02,426] INFO Task with id 'ct:1423785600000:0:keep-alive:' RUNNING. (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:222)
airbnbchronos_1 | [2015-02-13 00:00:02,527] INFO Task with id 'ct:1423785600000:0:keep-alive:' FINISHED (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:208)
airbnbchronos_1 | [2015-02-13 00:00:02,528] INFO Persisting job 'keep-alive' with data 'ScheduleBasedJob(R/2015-02-13T01:00:00.000Z/PT1H,keep-alive,echo '' >/dev/null,PT60S,32,0,,,2,a...@a.com,2015-02-13T00:00:02.528Z,,false,0.1,256.0,128.0,false,0,ListBuffer(),false,root,null,,ListBuffer(),true,ListBuffer(),false)' (org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore:63)
airbnbchronos_1 | [2015-02-13 00:00:02,530] INFO Key for state exists already: J_keep-alive (org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore:79)
airbnbchronos_1 | 2015-02-13 00:00:09,196:1(0x7f9340d6b700):ZOO_ERROR@handle_socket_error_msg@1643: Socket [172.17.0.3:2181] zk retcode=-7, errno=110(Connection timed out): connection to 172.17.0.3:2181 timed out (exceeded timeout by 0ms)
airbnbchronos_1 | 2015-02-13 00:00:12,531:1(0x7f9340d6b700):ZOO_WARN@zookeeper_interest@1557: Exceeded deadline by 3334ms
airbnbchronos_1 | 2015-02-13 00:00:12,532:1(0x7f9340d6b700):ZOO_INFO@check_events@1703: initiated connection to server [172.17.0.3:2181]
airbnbchronos_1 | 2015-02-13 00:00:12,534:1(0x7f9340d6b700):ZOO_INFO@check_events@1750: session establishment complete on server [172.17.0.3:2181], sessionId=0x14b7909e2fa0002, negotiated timeout=10000
airbnbchronos_1 | [2015-02-13 00:00:12,538] WARN State update failed. (org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore:86)
airbnbchronos_1 | [2015-02-13 00:00:12,539] INFO Dependents: [] (org.apache.mesos.chronos.scheduler.graph.JobGraph:168)
airbnbchronos_1 | [2015-02-13 00:00:12,540] INFO Received resource offers
airbnbchronos_1 | (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:77)
airbnbchronos_1 | [2015-02-13 00:00:12,540] INFO No tasks scheduled or next task has been disabled.
airbnbchronos_1 | (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:103)
airbnbchronos_1 | [2015-02-13 00:00:12,540] INFO Declining unused offers.
airbnbchronos_1 | (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:84)
airbnbchronos_1 | [2015-02-13 00:00:12,541] INFO Task with id 'ct:1423785600000:0:keep-alive:' FINISHED (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:208)
airbnbchronos_1 | [2015-02-13 00:00:12,542] INFO Persisting job 'keep-alive' with data 'ScheduleBasedJob(R/2015-02-13T01:00:00.000Z/PT1H,keep-alive,echo '' >/dev/null,PT60S,33,0,,,2,a...@a.com,2015-02-13T00:00:12.541Z,,false,0.1,256.0,128.0,false,0,ListBuffer(),false,root,null,,ListBuffer(),true,ListBuffer(),false)' (org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore:63)
airbnbchronos_1 | [2015-02-13 00:00:12,543] INFO Key for state exists already: J_keep-alive (org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore:79)
airbnbchronos_1 | 2015-02-13 00:00:19,210:1(0x7f9340d6b700):ZOO_ERROR@handle_socket_error_msg@1643: Socket [172.17.0.3:2181] zk retcode=-7, errno=110(Connection timed out): connection to 172.17.0.3:2181 timed out (exceeded timeout by 0ms)
airbnbchronos_1 | 2015-02-13 00:00:22,546:1(0x7f9340d6b700):ZOO_WARN@zookeeper_interest@1557: Exceeded deadline by 3336ms
airbnbchronos_1 | 2015-02-13 00:00:22,547:1(0x7f9340d6b700):ZOO_INFO@check_events@1703: initiated connection to server [172.17.0.3:2181]
airbnbchronos_1 | 2015-02-13 00:00:22,549:1(0x7f9340d6b700):ZOO_INFO@check_events@1750: session establishment complete on server [172.17.0.3:2181], sessionId=0x14b7909e2fa0002, negotiated timeout=10000
airbnbchronos_1 | [2015-02-13 00:00:28,962] INFO Client session timed out, have not heard from server in 26668ms for sessionid 0x14b7909e2fa0001, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn:1096)
airbnbchronos_1 | [2015-02-13 00:00:29,064] INFO State change: SUSPENDED (org.apache.curator.framework.state.ConnectionStateManager:228)
airbnbchronos_1 | 2015-02-13 00:00:29,214:1(0x7f9340d6b700):ZOO_ERROR@handle_socket_error_msg@1643: Socket [172.17.0.3:2181] zk retcode=-7, errno=110(Connection timed out): connection to 172.17.0.3:2181 timed out (exceeded timeout by 0ms)
airbnbchronos_1 | [2015-02-13 00:00:30,650] INFO Opening socket connection to server 172.17.0.3/172.17.0.3:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn:975)
airbnbchronos_1 | [2015-02-13 00:00:30,652] INFO Socket connection established to 172.17.0.3/172.17.0.3:2181, initiating session (org.apache.zookeeper.ClientCnxn:852)
airbnbchronos_1 | [2015-02-13 00:00:30,653] INFO Session establishment complete on server 172.17.0.3/172.17.0.3:2181, sessionid = 0x14b7909e2fa0001, negotiated timeout = 40000 (org.apache.zookeeper.ClientCnxn:1235)
airbnbchronos_1 | [2015-02-13 00:00:30,918] INFO State change: RECONNECTED (org.apache.curator.framework.state.ConnectionStateManager:228)
airbnbchronos_1 | I0213 00:00:32.106279 52 sched.cpp:1286] Asked to stop the driver
airbnbchronos_1 | [2015-02-13 00:00:32,106] INFO Defeated. Not the current leader. (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:612)
airbnbchronos_1 | 2015-02-13 00:00:32,549:1(0x7f9340d6b700):ZOO_WARN@zookeeper_interest@1557: Exceeded deadline by 3334ms
airbnbchronos_1 | 2015-02-13 00:00:32,550:1(0x7f9340d6b700):ZOO_INFO@check_events@1703: initiated connection to server [172.17.0.3:2181]
airbnbchronos_1 | 2015-02-13 00:00:32,551:1(0x7f9340d6b700):ZOO_INFO@check_events@1750: session establishment complete on server [172.17.0.3:2181], sessionId=0x14b7909e2fa0002, negotiated timeout=10000
airbnbchronos_1 | [2015-02-13 00:00:32,808] WARN State update failed. (org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore:86)
airbnbchronos_1 | [2015-02-13 00:00:32,808] INFO Size of streams: 2 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:491)
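
For context on the two "negotiated timeout" values in the log above (10000 and 40000): as far as I understand it, a ZooKeeper server does not accept whatever session timeout the client asks for. It clamps the request between minSessionTimeout and maxSessionTimeout, which default to 2x and 20x tickTime. Below is a rough Python sketch of that rule, not ZooKeeper's actual code. The function name is made up for illustration, and tickTime=2000 is an assumption taken from the sample zoo.cfg; I have not confirmed what this server is configured with.

```python
def negotiate_session_timeout(requested_ms, tick_time_ms=2000,
                              min_session_timeout_ms=None,
                              max_session_timeout_ms=None):
    """Illustrative model of how a ZooKeeper server picks a session timeout.

    Hypothetical helper, not part of any ZooKeeper client library.
    tick_time_ms=2000 is an assumed value (the one in the sample zoo.cfg).
    """
    # Server-side defaults: floor is 2 ticks, ceiling is 20 ticks.
    lower = 2 * tick_time_ms if min_session_timeout_ms is None else min_session_timeout_ms
    upper = 20 * tick_time_ms if max_session_timeout_ms is None else max_session_timeout_ms
    # The client's request is clamped into [lower, upper].
    return max(lower, min(requested_ms, upper))


if __name__ == "__main__":
    for requested in (1000, 10000, 60000):
        print(requested, "->", negotiate_session_timeout(requested))
```

If that assumption holds, the 40000 in the log is exactly the default ceiling (20 x 2000), so asking for a longer timeout on the client side would have no effect unless maxSessionTimeout (or tickTime) is raised in zoo.cfg. The 10000 session is inside the allowed range and would need to be raised on the client side instead.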