Task reconciliation after slave restart


Tory Long

Sep 29, 2014, 2:54:02 PM
to marathon-...@googlegroups.com
Hi - 

Currently running and loving Marathon 0.7.1 and Mesos 0.20.1, having success deploying Docker containers and other tasks.  I have encountered some behavior which I would like to understand better.  

Background:
- Initially deployed 100 Docker containers running a simple web server across 4 slaves with reasonable distribution across the slaves.  Marathon showed 100/100 Tasks/Instances.  Running for several days.
- Reconfigured and re-registered one of the slaves (slave-4) to add some attributes in order to test constraints (this is not relevant to the actual issue as far as I can tell).
- Upon re-registering successfully, the slave shows back up in Mesos.

Looking for clarity on whether the following are issues or expected behavior:
- While slave-4 was down, I still saw 100/100 Tasks/Instances even though none of the tasks were restarted on any of the remaining slaves.  As expected, health checks failed during that time.
- "docker ps" on the slave after restart confirms no containers are running.  However, Marathon still shows the tasks on that particular slave; I can see those tasks (with failed health checks and no recent update).
- Marathon log on leader shows:

[INFO] [09/29/2014 18:42:52.192] [marathon-akka.actor.default-dispatcher-23647]
 [akka://marathon/user/$d] Killing task bridged-webapp-4.641c8eb9-45f3-11e4-971
2-56847afe9799 on host 10.202.12.220
[INFO] [09/29/2014 18:42:52.192] [marathon-akka.actor.default-dispatcher-23647]
 [akka://marathon/user/$d] Received health result: [Unhealthy(bridged-webapp-4.
6718b7bc-45f3-11e4-9712-56847afe9799,2014-09-27T03:04:54.193Z,ConnectionAttempt
FailedException: Connection attempt to 10.202.12.220:31506 failed,2014-09-29T18
:42:52.167Z)]
[2014-09-29 18:42:52,193] INFO Task launch delay for [/bridged-webapp-4] is now
 [3600] seconds (mesosphere.util.RateLimiter:34)
[2014-09-29 18:42:52,193] WARN Task [bridged-webapp-4.6718b7bc-45f3-11e4-9712-5
6847afe9799] for app [/bridged-webapp-4] was killed for failing too many health
 checks (mesosphere.marathon.MarathonScheduler:193)
[INFO] [09/29/2014 18:42:52.193] [marathon-akka.actor.default-dispatcher-23647]
 [akka://marathon/user/$d] Killing task bridged-webapp-4.6718b7bc-45f3-11e4-971
2-56847afe9799 on host 10.202.12.220
[INFO] [09/29/2014 18:42:52.193] [marathon-akka.actor.default-dispatcher-23647]
 [akka://marathon/user/$d] Received health result: [Unhealthy(bridged-webapp-4.
5a91c720-45f3-11e4-9712-56847afe9799,2014-09-27T03:04:54.193Z,ConnectionAttempt
FailedException: Connection attempt to 10.202.12.220:31209 failed,2014-09-29T18
:42:52.167Z)]
[2014-09-29 18:42:52,194] INFO Task launch delay for [/bridged-webapp-4] is now
 [3600] seconds (mesosphere.util.RateLimiter:34)
[2014-09-29 18:42:52,194] WARN Task [bridged-webapp-4.5a91c720-45f3-11e4-9712-5
6847afe9799] for app [/bridged-webapp-4] was killed for failing too many health
 checks (mesosphere.marathon.MarathonScheduler:193)
[INFO] [09/29/2014 18:42:52.194] [marathon-akka.actor.default-dispatcher-23647]
 [akka://marathon/user/$d] Killing task bridged-webapp-4.5a91c720-45f3-11e4-971
2-56847afe9799 on host 10.202.12.220
[2014-09-29 18:42:52,195] INFO Task launch delay for [/bridged-webapp-4] is now
 [3600] seconds (mesosphere.util.RateLimiter:34)


Questions:
- Should Marathon show "degraded" and indicate something less than 100/100 while the slave is down?  
- Should Marathon restart the tasks on the remaining 3 slaves when the slave-4 is down?
- Should Marathon reconcile the tasks upon the slave-4 successful re-registration and restart on that slave?

I have this test environment still in this state, so I can collect more logs/outputs or try to resolve and/or reproduce.

Thanks for any help,
Tory


--
Tory Long
tory...@gmail.com
@vocemmeam

Dario Rexin

Sep 29, 2014, 3:11:44 PM
to Tory Long, marathon-...@googlegroups.com
Hi Tory,

usually Mesos should send a TASK_LOST when a slave is unreachable for some time. If you stop the slave on purpose, it should kill all of its tasks before it shuts down. In either case the master should notify Marathon. If the tasks are still shown in Marathon, it has not been notified about their termination. This can happen when Marathon was down at the same time the master failed over. Can you please check the Marathon log for any TASK_KILLED, TASK_FAILED, TASK_FINISHED or TASK_LOST messages?
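For example, something like the grep below will surface any terminal updates. The sample log lines here are stand-ins so the filter can be tried end to end; point the same grep at the leader's real log file, whose path depends on how Marathon was started.

```shell
# Stand-in for a couple of real Marathon log lines, written to a temp file
# so the filter below can be demonstrated without access to a cluster.
cat > /tmp/marathon-sample.log <<'EOF'
[2014-09-29 18:42:52,193] INFO Received status update TASK_LOST for task bridged-webapp-4.abc
[2014-09-29 18:42:52,194] INFO Received health result for task bridged-webapp-4.abc
EOF

# Count terminal status updates; run the same filter against the real log
# (e.g. whatever file the Marathon leader logs to in your deployment).
grep -cE 'TASK_(KILLED|FAILED|FINISHED|LOST)' /tmp/marathon-sample.log
```

If this never matches for the affected tasks, Marathon was never told they terminated, which fits the failover scenario above.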

Usually Marathon should do a task reconciliation from time to time, but unfortunately Mesos does not send a reply for tasks it does not know about (which is what happens in the scenario mentioned above). This is a known limitation and is being worked on.

Hope this helps.

Best regards,
Dario


--
You received this message because you are subscribed to the Google Groups "marathon-framework" group.
To unsubscribe from this group and stop receiving emails from it, send an email to marathon-framew...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Tory Long

Sep 29, 2014, 3:34:04 PM
to Dario Rexin, marathon-...@googlegroups.com
Hi Dario,

Thanks for the quick response.  Here is a subset of what I see in the mesos-master.INFO log, from a master that is not leading now (so, failed over at some point):

ubuntu@mesos-master-003:~$ egrep '(TASK_FAILED|TASK_LOST|TASK_FINISHED)' /var/log/mesos/mesos-master.INFO | grep bridged-webapp-4

I0929 14:53:54.481621 22482 master.cpp:3175] Sending status update TASK_LOST (UUID: b2c10a97-1cfd-4e4a-8af9-908dfc25f1d0) for task bridged-webapp-4.8c156836-45f3-11e4-9712-56847afe9799 of framework 20140923-012617-1527564810-5050-15959-0000 'Slave 10.202.12.220 removed'
W0929 15:30:55.106998 22481 master.cpp:3096] Ignoring status update TASK_FAILED (UUID: e6751b29-6318-4cab-8d73-8a431a15d053) for task bridged-webapp-4.8c156836-45f3-11e4-9712-56847afe9799 of framework 20140923-012617-1527564810-5050-15959-0000 from removed slave slave(1)@10.202.12.220:5051 with id 20140923-012617-1527564810-5050-15959-1 ; asking slave  to shutdown

I do see some TASK_KILLED messages from a slave restart, but they were on another slave at a different time, so I do not believe they correspond to this issue.

This is not critical in my environment, but I am curious whether there is a clean way to recover from this.  I have not yet tried to kill the individual tasks for that particular slave, which I'm sure is the obvious way to do it, but I wanted to keep the environment in this state to provide any logs, etc.
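For the record, the kind of command I would try is below. The endpoint shape is from the 0.7.x REST API as I understand it, and the host/port are placeholders, so treat this as a sketch rather than a verified recipe; it only prints the command instead of executing it.

```shell
# Hypothetical recovery sketch: remove one stale task through Marathon's
# REST API. MARATHON is a placeholder host/port; the task id is one of the
# stale ids from the log above. scale=false keeps the instance count so
# Marathon should launch a replacement.
MARATHON=http://localhost:8080
APP=bridged-webapp-4
TASK_ID=bridged-webapp-4.641c8eb9-45f3-11e4-9712-56847afe9799

# Build and print the command rather than running it, so nothing is killed
# by accident while the broken state is still being inspected.
CMD="curl -X DELETE $MARATHON/v2/apps/$APP/tasks/$TASK_ID?scale=false"
echo "$CMD"
```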

Also, on a related note, should Marathon indicate the app is "degraded" when health checks are failing for a subset of the tasks?  Something like "79/100" for Tasks/Instances and "Degraded" in the status field?

Thanks,
Tory

-- 
Tory Long
tory...@gmail.com
@vocemmeam

Dario Rexin

Sep 29, 2014, 6:53:16 PM
to Tory Long, marathon-...@googlegroups.com
If health checks fail too many times (configurable), Marathon (or Mesos, if health checks run there) will kill the task, and Marathon will remove it from its set of running tasks once it receives the terminal status update. If Mesos does not know the task (as in the scenario from my first reply), it will ignore the kill and not send any status update, so the task will stay in Marathon's set of running tasks forever. This has to be fixed in Mesos and will hopefully be part of the next bug fix release. Once it is fixed, things should work as you expect: Marathon will remove the stale tasks from the set and start new tasks elsewhere.

Best regards,
Dario

ZBhatti

Oct 7, 2014, 1:52:08 AM
to marathon-...@googlegroups.com, tory...@gmail.com
Just to add to this topic ... I continue to have issues where a restart of the Mesos master or the Marathon master sends the cluster into a tailspin of task state confusion.  In the majority of cases, Mesos has the correct and consistent information.  In these scenarios the slaves remain up and the tasks continue to run, but Marathon does not have a clue.  I will be putting in health checks to work around other kinks, but that will not help if the slaves are healthy and Marathon cannot reconcile with Mesos.  The cleanup after the confusion is all manual: kill the tasks manually on the slaves and restart all Marathon processes.  Repeat as necessary.

If I were proficient with Scala, I would be deep in the code by now.  Task reconciliation is weak and needs a rethink.  I have reached the point of writing a smaller replacement, taken a deep breath and stepped back.  I can barely manage to bring it back up with four slaves in the cluster.

I am open to suggestions on debugging/recovery ...

Dario Rexin

Oct 7, 2014, 2:15:57 AM
to ZBhatti, marathon-...@googlegroups.com, tory...@gmail.com
You are right, Mesos has the right set of running tasks. The problem is that the API does not allow us to fetch all of those tasks. Currently we can only send the set of tasks that we have in Marathon, and Mesos will then send status updates for the tasks it knows. This means that we will neither learn about running tasks that Marathon doesn't know, nor get any info about tasks that Mesos no longer knows. This can't be fixed in Marathon. AFAIK the Mesos team is already working on this, and I hope to see it in the next bugfix release.
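In the meantime you can see the disagreement concretely by diffing the task ids both sides report. The two lists below are stand-ins; in reality you would pull the ids from the Mesos master's /master/state.json and Marathon's /v2/tasks endpoints (both present in the versions discussed here, though the exact JSON layout may differ by release).

```shell
# Stand-in task-id lists. In practice, extract the ids from
# $MESOS_MASTER/master/state.json and $MARATHON/v2/tasks, one per line,
# then sort them so comm can compare the files.
printf '%s\n' task-a task-b        | sort > /tmp/mesos-tasks.txt
printf '%s\n' task-a task-b task-c | sort > /tmp/marathon-tasks.txt

# Lines only in the Marathon list: tasks Marathon still believes in
# but Mesos no longer knows about (the stale ones to clean up).
comm -13 /tmp/mesos-tasks.txt /tmp/marathon-tasks.txt   # -> task-c
```

Swapping the flags (comm -23) shows the opposite problem: tasks running in Mesos that Marathon has forgotten.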

Best,
Dario

