Hmm, I replied via email on May 17, but from a different email account, so maybe that's why this message never made it to the group:
Well, I encountered several separate problems, starting with my initial attempts to stray from the default values, e.g. for s3configprefix. What finally eliminated the latent interval when auto-registering to a new cluster was the combination of a short initial settling period (something like 3s or 30s), a fixed_ensemble_size set to the number of nodes in my cluster, and, occasionally, a pre-populated config object at the s3configprefix.
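As a rough illustration of that last point, here is a sketch of pre-populating a defaults.conf before first boot. The property names follow Exhibitor's auto-manage-instances settings as I understand them, and the values (30s settling, ensemble of 5) plus the bucket/prefix in the comment are made-up examples, not my actual config:

```shell
# Sketch: seed a defaults.conf with a short settling period and a fixed
# ensemble size so new nodes register without the latent interval.
# Property names assume Exhibitor's auto-manage-instances settings.
cat > /tmp/defaults.conf <<'EOF'
com.netflix.exhibitor.auto-manage-instances=1
com.netflix.exhibitor.auto-manage-instances-settling-period-ms=30000
com.netflix.exhibitor.auto-manage-instances-fixed-ensemble-size=5
EOF
# Then place it at your s3configprefix before the first node boots, e.g.:
# aws s3 cp /tmp/defaults.conf s3://my-bucket/myconfigprefix
```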
I have mostly, perhaps unjustly, blamed S3 for the issues with defunct processes, since the defunct processes appeared every five, then ten, then five, then ten minutes, corresponding to my values for the backup/cleanup intervals. But since I was also running in Docker without a PID 1 process reaper, I ended up building a custom container image with Yelp's dumb-init as the ENTRYPOINT and an extended wrapper.sh as the CMD, loosely based on the mbabineau/zk-exhibitor container but also accepting arbitrary environment parameters to populate defaults.conf and zoo.cfg with defaults tuned for my environments.
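The wrapper idea looks roughly like this. This is a simplified sketch, not my actual script: the ZK_* variable names and paths are invented for illustration, and the real exec line (commented out) would launch Exhibitor itself:

```shell
#!/bin/sh
# Sketch of a wrapper.sh: translate environment variables into a zoo.cfg
# before exec'ing the real process, so that dumb-init (the ENTRYPOINT)
# stays PID 1 and reaps any defunct children.
# ZK_DATA_DIR, ZK_CFG, etc. are assumed names, not Exhibitor's own.
ZK_DATA_DIR="${ZK_DATA_DIR:-/tmp/zookeeper/data}"
ZK_CFG="${ZK_CFG:-/tmp/zoo.cfg}"
cat > "$ZK_CFG" <<EOF
dataDir=$ZK_DATA_DIR
tickTime=${ZK_TICK_TIME:-2000}
initLimit=${ZK_INIT_LIMIT:-10}
syncLimit=${ZK_SYNC_LIMIT:-5}
EOF
# exec keeps the process tree shallow under dumb-init, e.g.:
# exec java -jar /opt/exhibitor/exhibitor.jar --configtype s3 "$@"
```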
I also eliminated all my S3 VPC endpoints (which were apparently intermittent causal factors), rebuilt my ensembles several times, and replaced my s3prefix, s3configprefix, and extra-config.backup-prefix values with strings containing neither underscores ('_') nor slashes ('/'). I think I was also doing something fancy with my S3 bucket policy and had to fix that as well.
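If it helps, the prefix rule is trivial to enforce up front. A tiny helper (hypothetical, just to show the character check) that rejects any prefix containing '_' or '/':

```shell
# Reject prefix values containing '_' or '/', which caused me trouble
# for s3prefix / s3configprefix / extra-config.backup-prefix.
safe_prefix() {
  case "$1" in
    *_*|*/*) return 1 ;;   # contains underscore or slash: reject
    *)       return 0 ;;   # otherwise: accept
  esac
}
safe_prefix "myclusterconfig" && echo "ok: myclusterconfig"
safe_prefix "my_cluster/config" || echo "rejected: my_cluster/config"
```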
Of the roughly 16 ensembles I'm most familiar with, each with 3-5 members in different network configurations and accounts/providers, three are still running an early version of my image and keep reaching load averages in the thousands on an 8-vCPU instance within about three weeks, requiring rolling reboots; the rest have had no problems in months. The very first clusters, which I built quasi-manually on Mesosphere DC/OS 1.0.3 with vanilla-ish containers, have continued to operate with no issues since December 2015, so go figure.
If you aren't using persistent storage/volumes for the ZK snapshot/transaction directories, make sure you take a backup directly from the leader's container before redeploying with any changes; Exhibitor's backups on S3 may be insufficient to rebuild from scratch without a fuzzy snapshot at the beginning of the transaction logs.
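A leader-side backup can be as simple as tarring the data directory before you touch anything. The paths below are assumptions (the mkdir just stands in for a real data dir so the sketch runs anywhere); point ZK_DATA at wherever your image actually keeps snapshots and transaction logs:

```shell
# Sketch: archive the ZooKeeper data directory off the leader before
# redeploying. ZK_DATA default is an assumption; adjust to your image.
ZK_DATA="${ZK_DATA:-/tmp/zookeeper/data}"
mkdir -p "$ZK_DATA/version-2"   # stand-in for the real snapshot/txn-log dir
BACKUP="/tmp/zk-leader-backup.tar.gz"
tar czf "$BACKUP" -C "$(dirname "$ZK_DATA")" "$(basename "$ZK_DATA")"
echo "wrote $BACKUP"
```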
Hope that helps at all, and let me know how it goes!
--
xpaul