Zombie processes when running in Docker


Eli Oxman

Jan 18, 2016, 5:11:54 AM
to exhibitor-users
Hi,

We're seeing an issue which prevents us from smoothly performing a rolling update to the Zookeeper cluster managed by Exhibitor.
The issue is that Exhibitor uses the zkServer.sh script to run ZooKeeper, which in turn launches it with nohup, causing it to end up as a child of PID 1.
When running in Docker, PID 1 is the Exhibitor process rather than an init system, so the ZooKeeper process is never reaped after it stops (this is a known issue with Docker - docker-zombie-reaping).
This in turn blocks the rolling update, because Exhibitor still sees the old process when it runs jps.
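To illustrate the mechanism (this is a minimal sketch, not Exhibitor's code, which is Java): a process acting as PID 1 has to reap exited children it inherits, which roughly means handling SIGCHLD and draining waitpid in a loop.

```python
import os
import signal
import time

def reap_children(signum, frame):
    """SIGCHLD handler: reap every exited child, as a real PID 1 must."""
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return          # no children at all
        if pid == 0:
            return          # remaining children are still running

signal.signal(signal.SIGCHLD, reap_children)

# Demo: fork a child that exits immediately. Without the handler above,
# it would linger as <defunct> until this process called wait() itself.
child = os.fork()
if child == 0:
    os._exit(0)             # child: exit right away

time.sleep(0.5)             # parent: let SIGCHLD be delivered and handled

# Once reaped, the pid is gone; waitpid() on it raises ChildProcessError.
try:
    os.waitpid(child, os.WNOHANG)
    reaped = False
except ChildProcessError:
    reaped = True
print("child reaped:", reaped)
```

A process like Exhibitor that never does this, while running as PID 1, accumulates defunct entries for every ZooKeeper process it stops.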

We are specifically using the ZeroToDocker image, though this issue would affect any Docker image that does not run an init system as PID 1.

We were wondering if there are any suggestions for mitigating this issue (aside from using a Docker image with an init system, like phusion's).

Thanks,
Eli

P.S. I know this is not strictly an Exhibitor issue, but I am hoping that someone here has already solved it.

xPaul Vigil

Jun 1, 2016, 10:22:35 PM
to exhibitor-users
I've also hit this issue, particularly on new clusters with automatic server registration turned on. As nodes come online, they all seem to enter an infinite loop: restarting, updating the configuration file to add another server, then restarting again. Some nodes reached load averages above 1000 due to the number of defunct java processes.

I haven't had a chance to drill into it yet, but I may have to in the near future. My first stab at a fix was to add the recently announced "dumb-init" as PID 1, invoking my wrapper shell script, which execs the Exhibitor java process. Sadly, it doesn't seem to have done the trick: I'm looking at a list of 37 defunct processes below Exhibitor a few hours after deployment, while ZooKeeper is humming along merrily as a sibling of Exhibitor, both of them children of dumb-init.
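For reference, the dumb-init arrangement described above looks roughly like this (the base image, file names, and paths here are illustrative assumptions, not the exact setup):

```dockerfile
FROM openjdk:8-jre                      # illustrative base image
# dumb-init binary obtained separately from Yelp's GitHub releases
COPY dumb-init /usr/local/bin/dumb-init
COPY wrapper.sh /wrapper.sh             # wrapper should `exec java ... exhibitor ...`
RUN chmod +x /usr/local/bin/dumb-init /wrapper.sh
# dumb-init becomes PID 1: it forwards signals and reaps orphaned children
ENTRYPOINT ["/usr/local/bin/dumb-init", "--"]
CMD ["/wrapper.sh"]
```

The `exec` in the wrapper matters: it replaces the shell, so Exhibitor runs directly under dumb-init, and anything zkServer.sh detaches with nohup reparents to dumb-init, which can then reap it.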

Perhaps next I will try periodically restarting the Docker container to see whether that clears out all those zombies.

Joseph Zargari

May 17, 2017, 7:58:48 AM
to exhibitor-users
Hi xPaul Vigil,
Any chance you found the solution (and remember what it was)? I'm having the exact same issue now...

Thanks,
Joseph

xPaul Vigil

Jun 16, 2017, 6:00:37 PM
to exhibitor-users
Hmmm, I replied via email on May 17, but from a different email account, so maybe that's why the message never made it to the group:


Well, there were several separate problems I encountered, starting with my initial attempt to stray from the default values, e.g. for s3configprefix. Setting the initial settling period to something like 3s or 30s, setting fixed_ensemble_size to the number of nodes in my cluster, and occasionally pre-populating a config object at the s3prefix all together finally fixed the lag when auto-registering to a new cluster.

I have mostly, perhaps unjustly, blamed S3 for the defunct processes, since they appeared every five, then ten, then five, then ten minutes, corresponding to my values for the backup/cleanup intervals. But since I was also running in Docker without a PID 1 process reaper, I ended up building a custom container image with Yelp's dumb-init as the ENTRYPOINT and an extended CMD wrapper.sh, loosely based on the mbabineau/zk-exhibitor container, but also accepting arbitrary environment parameters to populate defaults.conf and zoo.cfg with defaults tuned for my environments.

I also eliminated all my S3 VPC endpoints (which appeared to be intermittent causal factors), rebuilt my ensembles several times, and replaced my s3prefix, s3configprefix, and extra-config.backup-prefix values with strings containing neither underscores ('_') nor slashes ('/'). I think I was also doing something fancy with my S3 bucket policy and had to fix that as well....

Of around 16 ensembles I am most familiar with, each with 3-5 members in different network configurations and accounts/providers, three are still running an early version of my image and continue to reach load averages in the thousands on an 8-vCPU instance within 3 weeks, requiring rolling reboots; the rest have had no problems in months. The very first clusters, which I built quasi-manually on Mesosphere DC/OS 1.0.3 with vanilla-ish containers, have operated with no issues since December 2015, so go figure.

If you aren't using persistent storage/volumes for the ZooKeeper snapshot/transaction directories, make sure you take a backup directly from the leader container before redeploying with any changes; the Exhibitor backups on S3 may be insufficient to rebuild from scratch without a fuzzy snapshot at the beginning of the transaction logs.

Hope that helps at all, and let me know how it goes!

--
xpaul
