We are running Druid 0.8.1. When we induced a network partition, the Middle Manager (MM) immediately dropped out of the Overlord's list of active Middle Managers, and we observed that it was no longer announcing itself in /druid/indexer/announcements. When we removed the partition, the MM still did not announce itself in /druid/indexer/announcements. Interestingly, the realtime ingestion peons that were running on that host continued to run. About 2 hours after the partition was removed, the logs seem to show the peons finally exiting, along with some errors. At that point, however, the MM did successfully reconnect to the rest of the cluster, announced itself on /druid/indexer/announcements, and became visible in the Overlord console. Here are the logs from after we removed the partition:
159119.639: [GC159119.639: [ParNew: 18536K->725K(19648K), 0.0021630 secs] 32107K->14296K(63360K), 0.0023100 secs] [Times: user=0.03 sys=0.00, real=0.00 secs]
2016-05-12T20:27:31,141 INFO [pool-9-thread-3] io.druid.indexing.overlord.ForkingTaskRunner - Process exited with status[0] for task: index_realtime_observations_2016-05-12T18:00:00.000Z_0_0
2016-05-12T20:27:31,141 INFO [pool-9-thread-3] io.druid.storage.hdfs.tasklog.HdfsTaskLogs - Writing task log to: hdfs:/tmp/druid/indexer/logs/index_realtime_observations_2016-05-12T18_00_00.000Z_0_0
2016-05-12T20:27:31,283 INFO [pool-9-thread-3] io.druid.storage.hdfs.tasklog.HdfsTaskLogs - Wrote task log to: hdfs:/tmp/druid/indexer/logs/index_realtime_observations_2016-05-12T18_00_00.000Z_0_0
2016-05-12T20:27:31,284 INFO [pool-9-thread-3] io.druid.indexing.overlord.ForkingTaskRunner - Removing temporary directory: /tmp/persistent/task/index_realtime_observations_2016-05-12T18:00:00.000Z_0_0/17419c42-f9ef-4618-a0e9-8a1a8732e509
2016-05-12T20:27:46,288 WARN [WorkerTaskMonitor-0] org.apache.curator.ConnectionState - Connection attempt unsuccessful after 8510795 (greater than max timeout of 30000). Resetting connection and trying again with a new connection.
2016-05-12T20:27:46,379 INFO [WorkerTaskMonitor-0] org.apache.zookeeper.ZooKeeper - Session: 0x354180afd58e007 closed
2016-05-12T20:27:46,379 INFO [WorkerTaskMonitor-0-EventThread] org.apache.zookeeper.ClientCnxn - EventThread shut down
2016-05-12T20:27:46,399 INFO [WorkerTaskMonitor-0-EventThread] org.apache.curator.framework.state.ConnectionStateManager - State change: RECONNECTED
2016-05-12T20:27:46,401 INFO [WorkerTaskMonitor-0-EventThread] io.druid.curator.announcement.Announcer - Node[/druid/indexer/announcements/dwh01.dev.skyportsystems.com:8080] dropped, reinstating.
2016-05-12T20:27:46,478 ERROR [WorkerTaskMonitor-0] io.druid.indexing.worker.WorkerTaskMonitor - Failed to update task status: {class=io.druid.indexing.worker.WorkerTaskMonitor, exceptionType=class java.lang.RuntimeException, exceptionMessage=org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /druid/indexer/status/dwh01.dev.skyportsystems.com:8080/index_realtime_observations_2016-05-12T18:00:00.000Z_0_0, task=index_realtime_observations_2016-05-12T18:00:00.000Z_0_0}
at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[guava-16.0.1.jar:?]
at io.druid.indexing.worker.WorkerCuratorCoordinator.announceTaskAnnouncement(WorkerCuratorCoordinator.java:219) ~[druid-indexing-service-0.8.1.jar:0.8.1]
at io.druid.indexing.worker.WorkerCuratorCoordinator.updateAnnouncement(WorkerCuratorCoordinator.java:233) ~[druid-indexing-service-0.8.1.jar:0.8.1]
at io.druid.indexing.worker.WorkerTaskMonitor$1$1.run(WorkerTaskMonitor.java:159) [druid-indexing-service-0.8.1.jar:0.8.1]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [?:1.7.0_67]
at java.util.concurrent.FutureTask.run(FutureTask.java:262) [?:1.7.0_67]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_67]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_67]
at java.lang.Thread.run(Thread.java:745) [?:1.7.0_67]
at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
at org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:696) ~[curator-framework-2.8.0.jar:?]
at org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:679) ~[curator-framework-2.8.0.jar:?]
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) ~[curator-client-2.8.0.jar:?]
at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:676) ~[curator-framework-2.8.0.jar:?]
at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:453) ~[curator-framework-2.8.0.jar:?]
at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:443) ~[curator-framework-2.8.0.jar:?]
at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:44) ~[curator-framework-2.8.0.jar:?]
at io.druid.indexing.worker.WorkerCuratorCoordinator.announceTaskAnnouncement(WorkerCuratorCoordinator.java:212) ~[druid-indexing-service-0.8.1.jar:0.8.1]
... 7 more
2016-05-12T20:34:47,369 INFO [pool-9-thread-2] io.druid.indexing.overlord.ForkingTaskRunner - Process exited with status[0] for task: index_realtime_delta_netflow_2016-05-12T18:00:00.000Z_0_0
2016-05-12T20:34:47,370 INFO [pool-9-thread-2] io.druid.storage.hdfs.tasklog.HdfsTaskLogs - Writing task log to: hdfs:/tmp/druid/indexer/logs/index_realtime_delta_netflow_2016-05-12T18_00_00.000Z_0_0
2016-05-12T20:34:47,730 INFO [pool-9-thread-2] io.druid.storage.hdfs.tasklog.HdfsTaskLogs - Wrote task log to: hdfs:/tmp/druid/indexer/logs/index_realtime_delta_netflow_2016-05-12T18_00_00.000Z_0_0
2016-05-12T20:34:47,731 INFO [pool-9-thread-2] io.druid.indexing.overlord.ForkingTaskRunner - Removing temporary directory: /tmp/persistent/task/index_realtime_delta_netflow_2016-05-12T18:00:00.000Z_0_0/616a5b7b-4ce5-4724-bff1-ba9febaac610
2016-05-12T20:34:47,774 INFO [WorkerTaskMonitor-1] io.druid.indexing.worker.WorkerTaskMonitor - Job's finished. Completed [index_realtime_delta_netflow_2016-05-12T18:00:00.000Z_0_0] with status [SUCCESS]
2016-05-12T20:36:01,498 INFO [pool-9-thread-4] io.druid.indexing.overlord.ForkingTaskRunner - Process exited with status[0] for task: index_realtime_insights_2016-05-12T18:00:00.000Z_0_0
2016-05-12T20:36:01,499 INFO [pool-9-thread-4] io.druid.storage.hdfs.tasklog.HdfsTaskLogs - Writing task log to: hdfs:/tmp/druid/indexer/logs/index_realtime_insights_2016-05-12T18_00_00.000Z_0_0
2016-05-12T20:36:01,848 INFO [pool-9-thread-4] io.druid.storage.hdfs.tasklog.HdfsTaskLogs - Wrote task log to: hdfs:/tmp/druid/indexer/logs/index_realtime_insights_2016-05-12T18_00_00.000Z_0_0
2016-05-12T20:36:01,849 INFO [pool-9-thread-4] io.druid.indexing.overlord.ForkingTaskRunner - Removing temporary directory: /tmp/persistent/task/index_realtime_insights_2016-05-12T18:00:00.000Z_0_0/afe0d4dd-2d9a-4d74-91a3-f4cda085fbf8
2016-05-12T20:36:01,862 INFO [WorkerTaskMonitor-3] io.druid.indexing.worker.WorkerTaskMonitor - Job's finished. Completed [index_realtime_insights_2016-05-12T18:00:00.000Z_0_0] with status [SUCCESS]
164879.062: [GC164879.062: [ParNew: 18197K->624K(19648K), 0.0038250 secs] 31768K->14196K(63360K), 0.0041280 secs] [Times: user=0.05 sys=0.00, real=0.00 secs]
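
To line up the "about 2 hours" with the logs, here is a rough sketch that reads the elapsed value out of the Curator warning above ("Connection attempt unsuccessful after 8510795") and works backwards from the log timestamp. It assumes the number is in milliseconds, like the 30000 max timeout printed next to it. The helper name `disconnected_since` is made up for illustration and is not part of Druid or Curator.

```python
import re
from datetime import datetime, timedelta

# The Curator warning copied from the Middle Manager log above.
WARN_LINE = (
    "2016-05-12T20:27:46,288 WARN [WorkerTaskMonitor-0] "
    "org.apache.curator.ConnectionState - Connection attempt unsuccessful "
    "after 8510795 (greater than max timeout of 30000). "
    "Resetting connection and trying again with a new connection."
)

# Capture the log timestamp and the elapsed counter from the warning.
PATTERN = re.compile(
    r"^(?P<ts>\S+) WARN .*Connection attempt unsuccessful after (?P<elapsed>\d+) "
)


def disconnected_since(line):
    """Hypothetical helper: return (estimated start, elapsed) or None.

    Assumption: the elapsed counter is in milliseconds.
    """
    match = PATTERN.search(line)
    if match is None:
        return None
    logged_at = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S,%f")
    elapsed = timedelta(milliseconds=int(match.group("elapsed")))
    return logged_at - elapsed, elapsed


result = disconnected_since(WARN_LINE)
if result is not None:
    since, elapsed = result
    print("elapsed:", elapsed)
    print("connection attempts failing since roughly:", since.isoformat())
```

Under that assumption the counter works out to roughly 2 hours 22 minutes, so the connection attempt would have started at about 18:05:55 in the same timezone as the log. That matches the roughly 2 hours we waited before the MM reappeared.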