Restarting salt-master via salt


Mike Chesnut

Apr 4, 2013, 9:48:52 PM
to salt-...@googlegroups.com
I decided to try to have salt distribute my master config file.  I added a clause to have it restart the salt-master service, too, as I would for any other service with a config file.  This appears to have been a bad idea (which I kind of suspected before doing it anyway, but it has been more catastrophic than I'd anticipated...).

When I tried to apply the state, it just hung.  I noticed that my salt-master process had died.  The only thing in the log file was this:

2013-04-05 01:25:48,021 [salt.master         ][WARNING ] Caught signal 15, stopping the Salt Master

I stopped the minion and verified that no salt processes were running.
Now when I try to start the master back up, I get this:

[mikec@admin1 ~]$ sudo /usr/local/bin/python /usr/local/bin/salt-master
[WARNING ] Unable to bind socket, error: [Errno -2] Name or service not known
The ports are not available to bind

I don't see anything bound to the normal salt ports (4505 and 4506), though.  I also don't see a stale pidfile or lock file or anything.
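
For reference, this is roughly the kind of check I mean (the pidfile path is just my guess at the CentOS default, so adjust as needed):

sudo netstat -tlnp | grep -E ':(4505|4506)'
ls -l /var/run/salt-master.pid /var/run/salt/ 2>/dev/null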

I'm on CentOS 6.3.  Any other ideas of what I should look at?  Has anybody else tried controlling their master with salt?  (Does everybody else agree that I'm pretty dumb for having tried it in the first place?)

Thanks,
Mike

Corey Quinn

Apr 4, 2013, 9:55:36 PM
to salt-...@googlegroups.com
Hmm, I can't replicate that here:

sudo salt 'salt*' service.restart salt-master
Failed to authenticate, is this user permitted to execute commands?
[root@salt cquinn]# pgrep salt-master
29107
29108
29115
29118
29121
29122
29125
29126
29131

What salt version are you on?

-- Corey



Mike Chesnut

Apr 4, 2013, 11:10:06 PM
to salt-...@googlegroups.com
Hmm, well good to know that it worked for you.  Thanks for the data point.

I'm on 0.14.0+, the latest git head as of last Friday.  Maybe I'll try rebuilding the newest commits later tonight... I've noticed that Travis has been showing red a lot lately, so maybe I'll actually try checking out a SHA from the last time it was green or something.

I was also thinking I might just reboot the master and see if that helps anything.  Was hoping to diagnose as much as possible first, though.

Corey Quinn

Apr 4, 2013, 11:15:23 PM
to salt-...@googlegroups.com
Hmm, you're braver than I am-- I'm running the packaged version from EPEL(-testing?), salt-master-0.14.0-1.el6.noarch.

Have an example of the state file that caused the restart? I'll give that a shot next if you'd like.

-- Corey

Mike Chesnut

Apr 4, 2013, 11:27:03 PM
to salt-...@googlegroups.com
Yeah, there tend to be fixes after the tagged releases that I usually want or need.  I also build all of my own packages, so it's not a big deal to switch between building from a particular tag, a particular SHA, the latest head, or whatever...

The SLS was pretty simple:

salt-master:
{% if salt['network.interfaces']()['br0']['inet'][0]['address'] == pillar['saltmaster'] %}
  service.running:
    - enable: True
    - watch:
      - file: /etc/salt/master
{% else %}
  service.dead:
    - enable: False
{% endif %}
  pkg.installed: []

/etc/salt/master:
  file.managed:
    - source: salt://admin/master.conf
    - template: jinja
    - mode: 640
    - user: root
    - group: root
    - require:
      - pkg: salt-master

I applied this to both of my masters (the active and the standby).  Now neither will start the salt-master process (both give the same error), so I'm wondering if maybe I just have a bad build and forgot to restart it when I updated the packages (although that seems unlikely).

Thanks Corey!

Corey Quinn

Apr 4, 2013, 11:37:49 PM
to salt-...@googlegroups.com
Yeah, that should be calling service.restart.  Very, very odd.

I'm about out of ideas; can you test on a VM to see if you've just got a busted package?

-- Corey

Sean Channel

Apr 4, 2013, 11:47:25 PM
to salt-...@googlegroups.com, Mike Chesnut
It sounds like you got Upstart stuck. Try ``initctl stop salt-master``
followed by ``initctl start salt-master`` on the affected systems (*not*
using 'restart').

_S

Mike Chesnut

Apr 4, 2013, 11:51:37 PM
to Sean Channel, salt-...@googlegroups.com

I'm not using Upstart, Sean. Sorry, I should've specified that this is with regular old SysV init.

I'll do more experimenting with a VM as Corey suggested later tonight and report back. Thanks guys.

Mike Chesnut

Apr 5, 2013, 1:44:30 AM
to Sean Channel, salt-...@googlegroups.com
With the same package on a different VM, the master starts up fine.  I also started a minion on the same host and was able to run commands (e.g., test.version) against it.
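
Roughly the smoke test I ran, for the record (the exact invocations are from memory):

sudo salt-master -d
sudo salt-minion -d
sudo salt-key -A            # accept the pending minion key
sudo salt '*' test.version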

This really makes me think I must just be overlooking a stale pid file or something along those lines somewhere, and the error message is just misleading.

Any other ideas for what to look at?  I really appreciate the assistance!

Mike Chesnut

Apr 5, 2013, 2:01:47 AM
to Sean Channel, salt-...@googlegroups.com
Well, I kind of wussed out: I uninstalled the RPM and reinstalled it, and now everything works again.  I had started deleting things (e.g., /var/run/salt/master and /var/cache/salt/master), but that didn't help, so I got sick of trying to figure out where the problem resided and went the more brute-force route.

I can now apply highstates on my masters, including the SLS I pasted before*, and things don't blow up.  Weird.  Glad it's back to working, at any rate.

* - the one change I made was to remove the ":80" from the end of the interface line; I had copied it from a different file and didn't notice the port before.  Maybe taking that out made all the difference in the world?

Corey Quinn

Apr 5, 2013, 12:03:36 PM
to salt-...@googlegroups.com
That actually might do it. It's not going to be able to bind to an invalid interface...  

-- Corey

Sean Channel

Apr 5, 2013, 12:32:04 PM
to salt-...@googlegroups.com, Corey Quinn
Yeah, that seems suspect.

FWIW / for future reference: if you think it's a file that needs to be removed, sometimes 'strace -f -e trace=open <command>' can help by showing every attempt to open a file (though there are normally a lot of ENOENTs you might need to grep out).
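
For example, something along these lines (I haven't tried this against salt-master specifically, so adjust the binary/path to match your install):

strace -f -e trace=open salt-master -l debug 2>&1 | grep -v ENOENT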

annoying spooky failures! :)
_S

Mike Chesnut

Apr 5, 2013, 12:35:24 PM
to salt-...@googlegroups.com
Yep, pretty sure that was it.  Just for clarity, what I didn't show was my master.conf, which has a templated interface line.  It looks just like the interface check I did in the SLS, but it had the ":80" at the end of it previously.  Removing that has fixed things (and I'm fairly certain that's where the problem was... had I caught it sooner I'm pretty sure I wouldn't have had to reinstall the RPM).
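
To illustrate (reconstructing from memory rather than pasting verbatim), the broken line in master.conf was something along the lines of:

interface: {{ salt['network.interfaces']()['br0']['inet'][0]['address'] }}:80

and the fix was simply dropping the trailing ":80".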

I wish I'd had the patience to figure out the root cause of why the master couldn't start, though.  I did try some straces but didn't see anything that jumped out at me; then again, that may have just been fatigue last night.  I guess the good news is that I know how to recreate the issue, so if I get ambitious at some point maybe I can try again and track it down. :)  (If nothing else, it'd be nice to have a better error message there.)

Mike Chesnut

Apr 5, 2013, 2:17:31 PM
to salt-...@googlegroups.com
Okay, sorry to reply to myself; hopefully this is the last word on this issue... I did try again and realized that the *whole* story is the port in the interface setting.

interface: 10.10.10.10

works fine, whereas

interface: 10.10.10.10:80

fails with the error "The ports are not available to bind."  There is no issue with stale lock files or anything along those lines; it's just that the master is interpreting the interface setting as specified and having trouble with it.  (Even if I stop Apache so that nothing is already bound to port 80, or if I change it to some other random unused port, I still get the same error.  So the master isn't actually trying to use that port, which isn't surprising, since that's not how the config directive is supposed to work in the first place.)
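
In other words, the working config just keeps the address and the ports separate; something like the following, where the two port lines are the defaults and can be omitted entirely:

interface: 10.10.10.10
publish_port: 4505
ret_port: 4506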

So, as we all suspected all along, this was just due to my own stupidity!  (It all came from a sloppy copy-and-paste and not paying enough attention...) Just trying to clarify so that hopefully nobody else gets confused in the same way in the future. :)