I found out a little more, since the problem keeps recurring and happened again today. I checked the fleet log and found the following entries, logged at the same time the main unit gets stopped and then fails to start:
Dec 11 14:02:17 doa3wrkprd001 fleetd[605]: ERROR engine.go:135: Unable to determine current lessee: timeout reached
Dec 11 14:02:17 doa3wrkprd001 fleetd[605]: WARN job.go:253: No Unit found in Registry for Job(doa3_test_restart_watcher.service)
Dec 11 14:02:17 doa3wrkprd001 fleetd[605]: ERROR job.go:95: Failed to parse Unit from etcd: unable to parse Unit in Registry at key
Dec 11 14:02:17 doa3wrkprd001 fleetd[605]: INFO client.go:278: Failed getting response from https://[etcd-server]/:
Dec 11 14:02:17 doa3wrkprd001 fleetd[605]: INFO client.go:278: Failed getting response from https://[etcd-server]/:
Dec 11 14:02:17 doa3wrkprd001 fleetd[605]: INFO manager.go:89: Triggered systemd unit doa3_test_restart_watcher.service stop: job=5
Dec 11 14:02:17 doa3wrkprd001 fleetd[605]: INFO manager.go:231: Removing systemd unit doa3_test_restart_watcher.service
Dec 11 14:02:17 doa3wrkprd001 fleetd[605]: INFO manager.go:142: Instructing systemd to reload units
Dec 11 14:02:17 doa3wrkprd001 fleetd[605]: INFO reconcile.go:274: AgentReconciler completed task: type=UnloadUnit job=doa3_test_res
Dec 11 14:02:20 doa3wrkprd001 fleetd[605]: INFO manager.go:218: Writing systemd unit doa3_test_restart_watcher.service (996b)
Dec 11 14:02:20 doa3wrkprd001 fleetd[605]: INFO manager.go:142: Instructing systemd to reload units
Dec 11 14:02:20 doa3wrkprd001 fleetd[605]: INFO reconcile.go:274: AgentReconciler completed task: type=LoadUnit job=doa3_test_resta
Dec 11 14:02:20 doa3wrkprd001 fleetd[605]: INFO manager.go:78: Triggered systemd unit doa3_test_restart_watcher.service start: job=
Dec 11 14:02:20 doa3wrkprd001 fleetd[605]: INFO reconcile.go:274: AgentReconciler completed task: type=StartUnit job=doa3_test_rest
The doa3_test_restart_watcher is a sidekick service to doa3_test. doa3_test has a line "Wants=doa3_test_restart_watcher.service" and doa3_test_restart_watcher has a line "BindsTo=doa3.test.service".
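To make the relationship concrete, here is a minimal sketch of how the two units reference each other. Only the Wants= and BindsTo= lines are taken from my actual unit files; the Description= and ExecStart= values are placeholders I made up for illustration.

```ini
# --- doa3_test.service (main unit) ---
[Unit]
# Placeholder description, not from the real unit file.
Description=doa3_test main service
# Taken from the real unit: pulls in the sidekick when this unit starts.
Wants=doa3_test_restart_watcher.service

[Service]
# Placeholder command, not from the real unit file.
ExecStart=/usr/bin/sleep infinity

# --- doa3_test_restart_watcher.service (sidekick) ---
[Unit]
# Placeholder description, not from the real unit file.
Description=restart watcher sidekick for doa3_test
# Taken from the real unit, copied exactly as quoted above.
BindsTo=doa3.test.service

[Service]
# Placeholder command, not from the real unit file.
ExecStart=/usr/bin/sleep infinity
```

With this layout each unit refers to the other, so whatever happens to the sidekick (for example fleet unloading it, as in the log above) can affect the main service too.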
So it seems that fleet itself is the reason systemd is unable to restart the job. When this specific unit stops, it also stops the doa3_test service. systemd then starts doa3_test again, but it cannot fulfill the "Wants=doa3_test_restart_watcher.service" line, because that unit was removed (according to the fleet logs).
So now the question is: what exactly do the lines from 14:02:17 mean, and how can we make sure they don't happen?