salt-minion hangs when starting custom service

Kevin Dodge

Jun 4, 2012, 5:59:54 PM
to salt-...@googlegroups.com
We have written a custom service to start and stop a never-ending Java process. Inside the service we start it by doing the following:

daemon --user $USER --pidfile /var/run/javabatch.pid "(/usr/bin/java -classpath $CLASSPATH  MyClass 2>&1 | logger -t MyClass -p local5.notice) &"

This is on CentOS 5.6.

When I run the service start from the shell, it executes and returns immediately. When I run salt-call state.highstate on the minion, it runs through everything but then hangs on starting my service. I can watch my Java log files and see that the job has started, but the minion simply doesn't continue until I press Ctrl-C, at which point it kills the job.

When I press Ctrl-C, the output ends with this error:

        Comment:   An exception occured in this state: Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/salt/state.py", line 811, in call
    ret = self.states[cdata['full']](*cdata['args'])
  File "/usr/lib/python2.6/site-packages/salt/states/service.py", line 353, in mod_watch
    changes = {name: __salt__['service.restart'](name)}
  File "/usr/lib/python2.6/site-packages/salt/modules/rh_service.py", line 126, in restart
    return not __salt__['cmd.retcode'](cmd)
  File "/usr/lib/python2.6/site-packages/salt/modules/cmdmod.py", line 210, in retcode
    return _run(cmd, runas=runas, cwd=cwd, shell=shell, env=env)['retcode']
  File "/usr/lib/python2.6/site-packages/salt/modules/cmdmod.py", line 109, in _run
    out = proc.communicate()
  File "/usr/lib64/python2.6/subprocess.py", line 691, in communicate
    return self._communicate(input)
  File "/usr/lib64/python2.6/subprocess.py", line 1211, in _communicate
    rlist, wlist, xlist = select.select(read_set, write_set, [])
KeyboardInterrupt


The service entry in my sls file is as follows.

MyClassBatch:
  service:
    - running


Any ideas?

Kevin Dodge

Jun 4, 2012, 7:45:34 PM
to salt-...@googlegroups.com
Here is a little more info for you. I changed the service to simply ping Google. Here is my sls file:

testPing:
  service:
    - running
 
Then here is the code for my testPing service (placed in /etc/init.d/testPing on the minion):

# chkconfig: 2345 64 36
# description: 
# processname: 
# pidfile: /var/run/TestPing.pid

# source function library
. /etc/init.d/functions

RETVAL=0
USER=tomcat
NAME="TestPing"

start() {
        echo -n "Starting $NAME: "
        daemon --user $USER --pidfile /var/run/$NAME.pid "(ping www.google.com 2>&1 ) &" 
        RETVAL=$?
        
        if [ $RETVAL -eq 0 ]; then
                PID=`ps U $USER | grep google | awk '{ print $1}'`
                #PID=`jps | grep $NAME | awk '{ print $1 }'`
                echo $PID > /var/run/$NAME.pid
                touch /var/lock/subsys/$NAME
        fi
        echo
}


stop() {
        echo -n $"Shutting down $NAME: "
        killproc $NAME
        RETVAL=$?
        # report the result, then clean up the lock and pid files
        [ $RETVAL -eq 0 ] && success || failure
        rm -f /var/lock/subsys/$NAME
        rm -f /var/run/$NAME.pid
        echo
}

case "$1" in
start)
        start
;;

stop)
        stop

;;
restart|reload)
        stop
        start
;;

condrestart)
        if [ -f /var/lock/subsys/$NAME ]; then
                stop
                start
        fi
;;

status)
        status $NAME
        RETVAL=$?
;;
*)
        echo $"Usage: $0 {start|stop|restart|condrestart|status}"
        exit 1
esac

exit $RETVAL

When I run salt-call state.highstate on the minion it never returns after it outputs:
[INFO    ] Executing command /sbin/service rfqbatch-testPing status in directory /root
[INFO    ] Executing command /sbin/service rfqbatch-testPing start in directory /root

Running the same command with debug gives no new info.

It's almost as if it is ignoring the & that makes the process run in the background.

Thomas S Hatch

Jun 4, 2012, 7:59:32 PM
to salt-...@googlegroups.com
That is what I am thinking as well, because it is clearly halting while waiting for the process to finish. Can you please file an issue with this information? This is something that I will need to come back to and spend a little time on.

Kevin Dodge

Jun 5, 2012, 11:42:35 AM
to salt-...@googlegroups.com
Just playing around here, but I found another way to reproduce the problem. It appears that the & does not work in cmd.run either:

salt-call cmd.run "ping www.google.com > /tmp/pinggoogle &"
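
One plausible explanation, consistent with the proc.communicate() frame in the traceback earlier in the thread: communicate() returns only once every copy of the pipe's write end has been closed, and a backgrounded child inherits those copies. The sketch below is not Salt's code. It assumes the caller captures both stdout and stderr through pipes, and it uses sleep as a stand-in for ping. Note that the cmd.run example above redirects only stdout, so under that assumption stderr would still keep a pipe open.

```python
import subprocess
import time


def timed_run(cmd):
    """Run cmd through a shell, capturing stdout/stderr via pipes (assumed
    to mirror how the caller invokes it), and return elapsed seconds."""
    start = time.monotonic()
    proc = subprocess.Popen(
        cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    # communicate() reads until EOF on both pipes, i.e. until every process
    # holding a write end (including backgrounded children) has closed it.
    proc.communicate()
    return time.monotonic() - start


# Only stdout is redirected: the background sleep still inherits the stderr
# pipe, so the caller waits for the full lifetime of the child.
blocked = timed_run("sleep 2 > /dev/null &")

# Both streams are redirected: nothing inherits the pipes, so the caller
# returns as soon as the shell itself exits.
detached = timed_run("sleep 2 > /dev/null 2>&1 &")

print("stdout only redirected: %.1fs" % blocked)
print("stdout and stderr redirected: %.1fs" % detached)
```

With a process that never exits, such as ping, the first case would wait forever, which matches the behaviour described in this thread.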



Thomas S Hatch

Jun 5, 2012, 12:32:29 PM
to salt-...@googlegroups.com
Thanks, I see the bug, and I have moved it to something I will look at very soon.

- Thomas S Hatch

Alan Ooi

May 1, 2013, 8:32:57 PM
to salt-...@googlegroups.com
Hi Thomas,
 
I'm wondering if this bug ever got resolved? I believe I am experiencing a similar problem when trying to start my own custom service init script.
 
I can't post the exact sls that I'm using (or the service startup script), but it looks like this:
 
serviceName:
    service.running:
        - enabled: true
        - watch:
            - file: /etc/sysconfig/someFile
            - file: /opt/someotherFile
 
When I change /opt/someotherFile and run state.highstate, I can see in my minion log an attempt to restart the service, but then it hangs indefinitely despite the service restarting correctly.
 
Any help would be greatly appreciated! Thanks.

Thomas S Hatch

May 2, 2013, 11:52:59 AM
to salt-...@googlegroups.com
This was resolved. Is there a chance the script itself is blocking?

Thomas S. Hatch  |  Founder, CTO


5272 South College Drive, Suite 301 | Murray, UT 84123


--
You received this message because you are subscribed to the Google Groups "Salt-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to salt-users+...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Alan Ooi

May 3, 2013, 3:15:42 AM
to salt-...@googlegroups.com
Thanks for replying.

The script is running a process using something very close to
  runuser -l -s /bin/bash -c "insert looong cmd line here" &

I'll be honest, my experience with startup scripts is pretty basic and I don't really know how this may or may not affect salt's ability to tell when the script has finished running.
So my answer to your question is an unfortunate "maybe".

However, I do know that when I run the script using "service scriptName start|stop|restart|status|etc" that it will run perfectly fine. 

Do you have any recommendations for a way around this? I've been looking into replacing the runuser command and seeing if that will change anything but have yet to get around to attempting this.

Cheers,



--
- Alan Ooi

Thomas S Hatch

May 6, 2013, 3:37:27 PM
to salt-...@googlegroups.com
I think that we have an issue where scripts with an & in one of the commands will block. I resolved this a long time ago, but it may have resurfaced somewhere else in the code. Is there any way to start the service in daemon mode and not with a &?


Alan Ooi

May 7, 2013, 5:02:49 AM
to salt-...@googlegroups.com
That's exactly what I've been meaning to try. I'm fairly busy at the moment though so I'll report back when I get around to trying it!

Even if it doesn't fix it, I'm hoping it'll at least rule out the & as the potential cause for this problem. 

Alan Ooi

May 20, 2013, 7:23:05 PM
to salt-...@googlegroups.com
Hi Thomas,
 
Just letting you know that we resolved the issue.
 
In the end, it was caused by the odd startup script we had, which existed because our application was not responsible for writing its PID out to a file.
 
As mentioned before, our command to run the application was something along the lines of
runuser -l -s /bin/bash userName -c "application to run with parameters" &
 
This ends up creating a shell that runs the application, and our startup script then attempts to push that shell to the background. We discovered that Python does not like this and the Popen/communicate methods will hang.
 
I have now modified our application to be responsible for writing its PID out to file so that we can blow away the dodgy hacks that we have in our startup scripts. This way, we can still use runuser to get environment variables we need for the execution of the application, but we can now push the application to the background from the spawned shell instead of pushing the spawn shell to the background.
 
Anyway, thanks for your help!

maartens...@opencredo.com

Jun 25, 2013, 8:16:34 AM
to salt-...@googlegroups.com
We've been experiencing a similar problem, which I think is worth reporting in more detail. We're currently using Salt v. 0.15.3 on CentOS 6.3 and CentOS 6.4.

Two things happen. The service command hangs, and a zombie process appears. I can reproduce the problem with, for instance, the logstash client and this command:

java -jar /opt/logstash/logstash.jar agent -f /opt/logstash/client.conf &

I have verified that the problem is not down to the "&" itself by executing e.g. "sleep 5 &". To test, I created an executable file, e.g. "/tmp/test", containing the above command, then executed it via Salt. This showed that the problem isn't with the service function itself, but with something more general:

salt 'server1' cmd.run '/tmp/test'

In this case the salt command just hangs, and on the client side I see two minions, one of which has spawned a defunct process, e.g.:

root     124 123  0 10:15 ?        00:00:00 [sh] <defunct>
root     123   1  0 10:15 ?        00:00:00 /usr/bin/python /usr/bin/salt-minion -d

In the meantime the logstash process has started running ...

Breaking the salt process on the master side (ctrl-c) does not change anything. When I kill the logstash process, the zombie process and the above salt-minion process both disappear.

No amount of fiddling with the command to use nohup, or the init daemon function, makes any difference. 

I had a funny feeling that a file descriptor could be part of the problem though, and noticed someone using the technique of closing file descriptors in an init script:


If I close the basic file descriptors before running the java command, Salt no longer hangs when executing the "/tmp/test" script:

exec 1>&-   # Close stdout
exec 2>&-   # Close stderr
exec 3>&-   # Close fd 3 (note: stdin is fd 0, which "exec 0<&-" would close)
java -jar /opt/logstash/logstash.jar agent -f /opt/logstash/client.conf &

This is a genuine problem though, because it means we have to hack the init scripts to get Salt to work with them. 

We noticed zombie processes in earlier versions of Salt we used too (0.11 and 0.14), but hadn't yet drilled down to the root cause. It is possible that those zombies had been caused by the same problem, but I can't be sure.

For anyone experiencing similar issues, a separate workaround is to use the at command:

echo "java -jar /opt/logstash/logstash.jar agent -f /opt/logstash/client.conf" | at now

Regards,
Maartens

David Anderson

Jun 25, 2013, 11:17:49 AM
to salt-...@googlegroups.com
This isn't a bug with salt, it is just blocking while waiting for output
from your command's stdout. I suspect you can fix your logstash init
script problem by specifying a place to send the logs (logstash defaults
to stdout). Try this in your init script:

java -jar /opt/logstash/logstash.jar agent -f /opt/logstash/indexer.conf --log /var/log/logstash.log

For any other homegrown init scripts, you should handle stdout/stderr
appropriately, either by redirecting each to a file or, as you
mentioned, closing them with 2>&- 3>&-
--
Dave
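
As a rough sketch of the "redirecting each to a file" advice above, this is how a launcher might start a long-running process so that it holds none of the caller's pipes. The helper name launch_detached, the temporary log path, and the use of Python instead of a shell init script are all assumptions made for illustration; the short sh command is a stand-in for the real service.

```python
import os
import subprocess
import tempfile


def launch_detached(argv, log_path):
    """Start argv with stdin from /dev/null and stdout/stderr appended to
    log_path, in its own session, and return the Popen handle."""
    with open(log_path, "ab") as log:
        return subprocess.Popen(
            argv,
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,  # detach from the caller's session
        )


# Hypothetical usage with a short-lived stand-in for the real service.
log_path = os.path.join(tempfile.mkdtemp(), "service.log")
proc = launch_detached(["sh", "-c", "echo started; echo oops >&2"], log_path)
proc.wait()

with open(log_path) as fh:
    log_text = fh.read()
print(log_text)
```

Because the child's three standard descriptors all point at /dev/null or the log file, a caller that captures output through pipes has nothing left to wait on once the launcher itself exits.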


maartens...@opencredo.com

Jun 26, 2013, 10:26:50 AM
to salt-...@googlegroups.com
Hi Dave,

Thanks for your explanation.

Whereas I am inclined to agree that Salt is not at fault per se, I can see how the typical use case in which this problem appears can mean the root cause remains hidden without a lot of troubleshooting. This is largely down to the fact that the problem does not manifest in the shell environment, at least not in a clear way. The "&" places the process in the background, and that is usually enough to control the startup and shutdown of the process. The problem could be down to the way the software is written, and this could be 3rd party software (usually is). 

I am not suggesting Salt stoop to compensate for a problem outside its domain, but is Salt aware when it is waiting on something before proceeding? If it is, I would prefer that it report (maybe at DEBUG level) that it is waiting on X once a certain amount of time has passed. To the user it just looks like Salt is waiting (indefinitely) for the process to start up when clearly the process has successfully started running.
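
A rough sketch of what such a report could look like, not how Salt implements anything. The function name run_with_notice, the logger name, and the 0.3 second threshold are made up for illustration, and the timeout argument to communicate() needs Python 3.3 or later, so it would not have been available on the Python 2.6 shown in the traceback earlier in the thread.

```python
import logging
import subprocess

log = logging.getLogger("wait_notice")  # hypothetical logger name


def run_with_notice(cmd, notice_after):
    """Run cmd through a shell, capturing output via pipes. If the pipes are
    still open after notice_after seconds, log why and keep waiting.

    Returns (returncode, noticed).
    """
    proc = subprocess.Popen(
        cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    noticed = False
    try:
        proc.communicate(timeout=notice_after)
    except subprocess.TimeoutExpired:
        noticed = True
        if proc.poll() is None:
            log.warning("%r is still running after %.1fs", cmd, notice_after)
        else:
            # The command itself has exited; something it started still
            # holds the output pipes open.
            log.warning(
                "%r exited with status %s, but its output pipes are still "
                "open; waiting for them to close", cmd, proc.returncode
            )
        # Retrying communicate() after a timeout does not lose output.
        proc.communicate()
    return proc.returncode, noticed


# "sleep 1 &" stands in for an init script that backgrounds a process
# without redirecting its output.
status, noticed = run_with_notice("sleep 1 &", notice_after=0.3)
print(status, noticed)
```

The distinction between "still running" and "exited but pipes still open" is the piece of information that is hard to see from the shell.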

In the logstash example, the logfile suggestion looked like a sound one to me; unfortunately, it doesn't work. In such a case the Salt administrator would have to resort to the "tricks" I suggested, or understand the nature of the problem, work through all the options of the software, and hope one of them closes the file descriptors. There is no guarantee. (I'm guessing, but this is likely to apply to many custom startup scripts for Java applications, not just Logstash.)

Cheers,
Maartens

Ollie Walsh

Jun 28, 2013, 10:40:38 PM
to salt-...@googlegroups.com
Hi,

Salt may handle misbehaving init scripts soon, but that's really treating the symptom, not the cause. See https://github.com/saltstack/salt/issues/5567.

> as you mentioned, closing them with 2>&- 3>&- 
I wouldn't recommend this - it will probably cause crashes. Most processes assume they can read/write the std file handles.
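
A small illustration of the risk, using the shell's own echo as the process that assumes stdout is usable. The helper name exit_status is made up for this sketch, and pointing stdout at /dev/null is shown only as a contrasting case, not as something recommended elsewhere in the thread.

```python
import subprocess


def exit_status(script):
    """Run a shell snippet and return its exit status. The snippet's stderr
    is captured so shell diagnostics do not clutter the terminal."""
    return subprocess.run(
        ["sh", "-c", script], stdout=subprocess.PIPE, stderr=subprocess.PIPE
    ).returncode


# stdout closed: the write fails, so echo (and therefore the shell) reports
# an error status.
closed = exit_status("exec 1>&-; echo hello")

# stdout pointed at /dev/null: the write succeeds and is discarded.
devnull = exit_status("exec 1>/dev/null; echo hello")

print("closed stdout:", closed)
print("/dev/null stdout:", devnull)
```

A program that does not check for write errors may behave worse than echo does here, which is why a closed descriptor is riskier than one redirected somewhere harmless.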


Or (my personal preference) use supervisord to run the process instead: http://supervisord.org/.

Cheers,
Ollie