Hi -
Currently running and loving Marathon 0.7.1 and Mesos 0.20.1, having success deploying Docker containers and other tasks. I have encountered some behavior which I would like to understand better.
Background:
- Initially deployed 100 Docker containers running a simple web server across 4 slaves, with reasonable distribution among them. Marathon showed 100/100 Tasks/Instances, and everything had been running for several days.
- Reconfigured and re-registered one of the slaves (slave-4) to add some attributes in order to test constraints (this is not relevant to the actual issue as far as I can tell).
- Once it re-registered successfully, the slave showed back up in Mesos.
Looking for clarity on whether the following are issues or expected behavior:
- While slave-4 was down, Marathon still showed 100/100 Tasks/Instances, even though none of the tasks were restarted on any of the remaining slaves. As expected, health checks failed during that time.
- "docker ps" on the slave after the restart confirms no containers are running. However, Marathon still shows the tasks on that particular slave; I can see them listed with failed health checks and no recent update.
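To illustrate the mismatch, here is a rough sketch of the check I am doing. The `/v2/apps/{appId}/tasks` endpoint is Marathon's task-list API; the leader hostname and the abridged JSON response below are hypothetical, but the app id and slave IP come from the logs further down:

```shell
# Marathon's view of where the app's tasks live, e.g.:
#   curl -s http://<marathon-leader>:8080/v2/apps/bridged-webapp-4/tasks
# Abridged, hypothetical response shown inline for illustration:
response='{"tasks":[{"id":"bridged-webapp-4.641c8eb9","host":"10.202.12.220"}]}'

# Count how many tasks Marathon still places on the restarted slave,
# even though "docker ps" there shows no containers:
echo "$response" | grep -o '"host":"10.202.12.220"' | wc -l
```

On my setup the count from Marathon is nonzero while the container count on the slave itself is zero, which is the discrepancy I am asking about.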
- Marathon log on leader shows:
[INFO] [09/29/2014 18:42:52.192] [marathon-akka.actor.default-dispatcher-23647] [akka://marathon/user/$d] Killing task bridged-webapp-4.641c8eb9-45f3-11e4-9712-56847afe9799 on host 10.202.12.220
[INFO] [09/29/2014 18:42:52.192] [marathon-akka.actor.default-dispatcher-23647] [akka://marathon/user/$d] Received health result: [Unhealthy(bridged-webapp-4.6718b7bc-45f3-11e4-9712-56847afe9799,2014-09-27T03:04:54.193Z,ConnectionAttempt:42:52.167Z)]
[2014-09-29 18:42:52,193] INFO Task launch delay for [/bridged-webapp-4] is now [3600] seconds (mesosphere.util.RateLimiter:34)
[2014-09-29 18:42:52,193] WARN Task [bridged-webapp-4.6718b7bc-45f3-11e4-9712-56847afe9799] for app [/bridged-webapp-4] was killed for failing too many health checks (mesosphere.marathon.MarathonScheduler:193)
[INFO] [09/29/2014 18:42:52.193] [marathon-akka.actor.default-dispatcher-23647] [akka://marathon/user/$d] Killing task bridged-webapp-4.6718b7bc-45f3-11e4-9712-56847afe9799 on host 10.202.12.220
[INFO] [09/29/2014 18:42:52.193] [marathon-akka.actor.default-dispatcher-23647] [akka://marathon/user/$d] Received health result: [Unhealthy(bridged-webapp-4.5a91c720-45f3-11e4-9712-56847afe9799,2014-09-27T03:04:54.193Z,ConnectionAttempt:42:52.167Z)]
[2014-09-29 18:42:52,194] INFO Task launch delay for [/bridged-webapp-4] is now [3600] seconds (mesosphere.util.RateLimiter:34)
[2014-09-29 18:42:52,194] WARN Task [bridged-webapp-4.5a91c720-45f3-11e4-9712-56847afe9799] for app [/bridged-webapp-4] was killed for failing too many health checks (mesosphere.marathon.MarathonScheduler:193)
[INFO] [09/29/2014 18:42:52.194] [marathon-akka.actor.default-dispatcher-23647] [akka://marathon/user/$d] Killing task bridged-webapp-4.5a91c720-45f3-11e4-9712-56847afe9799 on host 10.202.12.220
[2014-09-29 18:42:52,195] INFO Task launch delay for [/bridged-webapp-4] is now [3600] seconds (mesosphere.util.RateLimiter:34)
Questions:
- Should Marathon show "degraded" and indicate something less than 100/100 while the slave is down?
- Should Marathon restart the failed tasks on the remaining 3 slaves while slave-4 is down?
- Should Marathon reconcile the tasks upon slave-4's successful re-registration and restart them on that slave?
I still have this test environment in this state, so I can collect more logs/output, or try to resolve and/or reproduce the issue.
Thanks for any help,
Tory