Nginx restarting

Morgan Blackthorne

Dec 7, 2018, 12:21:56 PM

to LOPSA Discuss List

(Apologies if this should go to te...@lopsa.org instead of discuss; I'm not really clear what messages should be directed to which list.)

I've been running into a bit of a weird situation lately at $WORK. We have a few LDAP REST proxy servers (basically they provide a REST API to query AD over LDAP for authentication) in the various regions, and I keep getting alerts about nginx on these nodes from our monitoring systems. The checks are returning the following message: "There's no pid file for nginx. Is nginx running? Please also make sure whether your pid path and name is correct." The service ends up restarting on its own, and these are lightly used servers (only for employee related logins, not customer-affecting), so it hasn't been a big deal yet, but I'd like to resolve it before it becomes one, especially since the NOC tends to ask me about it when it happens (despite happening on a regular basis).

I've confirmed that our logrotate configs are overriding /etc/logrotate.d/nginx, and they do not restart the service as they are using copytruncate like so:

/var/log/nginx/access.log
{
    firstaction
        /bin/bash /usr/local/noc/aws_s3_logrotate.sh /var/log/nginx/access.log
    endscript
    hourly
    missingok
    copytruncate
    rotate 2
    notifempty
    create 0644 root root
}

I don't see anything else scheduled via cron (there are no user crontabs in /var/spool/cron/crontabs, and I checked everything in /etc/cron.*) that restarts nginx. This is Ubuntu 14.04 (we're still in the process of migrating over to Bionic, which will be a bigger push after the holiday change freeze ends), so I don't have journalctl available. I don't see anything in /var/log being logged about restarting nginx, so it's not coming in via sudo. I'm at a bit of a loss as to what is going on with these nodes. It's happening every few hours, and doesn't seem to be on any obvious schedule:

Concierge Production [#53550] Component concierge-nginx-status (Concierge LDAP Template on irldap01) is Critical at 8:42 AM Resolved through the integration API.
Concierge Production [#53550] Component concierge-nginx-status (Concierge LDAP Template on irldap01) is Critical at 8:37 AM Triggered through the API.
Concierge Production [#53549] Component concierge-nginx-status (Concierge LDAP Template on irldap01) is Critical at 4:26 AM Resolved through the integration API.
Concierge Production [#53549] Component concierge-nginx-status (Concierge LDAP Template on irldap01) is Critical at 4:21 AM Triggered through the API.
Concierge Production [#53548] Component concierge-nginx-status (Concierge LDAP Template on irldap01) is Critical at 12:12 AM Resolved through the integration API.
Concierge Production [#53548] Component concierge-nginx-status (Concierge LDAP Template on irldap01) is Critical at 12:07 AM Triggered through the API.

Anyone have any ideas what might be happening or where I can look to dig into the issue further?

Dan Ritter

Dec 7, 2018, 12:52:36 PM

to Morgan Blackthorne, LOPSA Discuss List

Morgan Blackthorne wrote:
>
> I've been running into a bit of a weird situation lately at $WORK. We have
> a few LDAP REST proxy servers (basically they provide a REST API to query
> AD over LDAP for authentication) in the various regions, and I keep getting
> alerts about nginx on these nodes from our monitoring systems. The checks
> are returning the following message: "There's no pid file for nginx. Is
> nginx running? Please also make sure whether your pid path and name is
> correct."

1. Where does nginx.conf (or, most likely, /etc/nginx/conf.d/* )
tell the pid file to be placed? grep pid should show you
pid /run/nginx.pid;
or something similar. If not, set it.

2. Is the monitoring system looking in that location?

> The service ends up restarting on its own, and these are lightly
> used servers (only for employee related logins, not customer-affecting), so
>

> I've confirmed that our logrotate configs are overriding
> /etc/logrotate.d/nginx, and they do not restart the service as they are
> using copytruncate like so:
>
> /var/log/nginx/access.log
> {
> firstaction
> /bin/bash /usr/local/noc/aws_s3_logrotate.sh
> /var/log/nginx/access.log
> endscript
> hourly
> missingok
> copytruncate
> rotate 2
> notifempty
> create 0644 root root
> }

hourly is weird for a lightly used server.

If you send USR1 to nginx, it will close logfiles and reopen
them cleanly. The nginx binary even provides 'nginx -s reopen'
which does the same thing. That will be cleaner than
copytruncate.

> anything in /var/log being logged about restarting nginx, so it's not
> coming in via sudo. I'm at a bit of a loss as to what is going on with
> these nodes.

Did you restart nginx, make note of the pid, and then check
again after an event? It's always good to make sure your
monitoring system isn't hallucinating.

-dsr-
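
The before/after comparison suggested above can be scripted so it does not depend on memory. A minimal sketch, assuming the pid file path is passed in by the caller; the function names and the snapshot path are hypothetical, not part of nginx or any monitoring tool:

```shell
#!/bin/sh
# Sketch only: function names and the snapshot path are made up for illustration.

# Print the PID stored in the pid file, or "none" if the file is absent.
read_pid() {
    if [ -f "$1" ]; then
        cat "$1"
    else
        echo "none"
    fi
}

# Save the current PID to a snapshot file.
# $1 = pid file, $2 = snapshot file
record_pid() {
    read_pid "$1" > "$2"
}

# Exit 0 if the PID now differs from the snapshot, 1 otherwise.
pid_changed() {
    [ "$(read_pid "$1")" != "$(cat "$2")" ]
}

# Example:
#   record_pid /var/run/nginx.pid /tmp/nginx.pid.snapshot
#   ... wait for the next alert ...
#   pid_changed /var/run/nginx.pid /tmp/nginx.pid.snapshot && echo "master PID changed"
```

A changed PID means the master process really was replaced; an unchanged PID would point at the monitoring check instead.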

Morgan Blackthorne

Dec 7, 2018, 1:43:44 PM

to dis...@lopsa.org

Replies inline.

On Fri, Dec 7, 2018 at 9:52 AM Dan Ritter <d...@randomstring.org> wrote:

> Morgan Blackthorne wrote:
> >
> > I've been running into a bit of a weird situation lately at $WORK. We have
> > a few LDAP REST proxy servers (basically they provide a REST API to query
> > AD over LDAP for authentication) in the various regions, and I keep getting
> > alerts about nginx on these nodes from our monitoring systems. The checks
> > are returning the following message: "There's no pid file for nginx. Is
> > nginx running? Please also make sure whether your pid path and name is
> > correct."
>
> 1. Where does nginx.conf (or, most likely, /etc/nginx/conf.d/* )
> tell the pid file to be placed? grep pid should show you
> pid /run/nginx.pid;
> or something similar. If not, set it.

/var/run/nginx.pid

> 2. Is the monitoring system looking in that location?

Yes, I confirmed that.

> > The service ends up restarting on its own, and these are lightly
> > used servers (only for employee related logins, not customer-affecting), so
> >
> > I've confirmed that our logrotate configs are overriding
> > /etc/logrotate.d/nginx, and they do not restart the service as they are
> > using copytruncate like so:
> >
> > /var/log/nginx/access.log
> > {
> > firstaction
> > /bin/bash /usr/local/noc/aws_s3_logrotate.sh
> > /var/log/nginx/access.log
> > endscript
> > hourly
> > missingok
> > copytruncate
> > rotate 2
> > notifempty
> > create 0644 root root
> > }
>
> hourly is weird for a lightly used server.

That's more a matter of how I crafted our log rotation Chef code. I didn't abstract it to handle different schedules for different log files; I just set them all to hourly. It is something I probably should improve, however.

> If you send USR1 to nginx, it will close logfiles and reopen
> them cleanly. The nginx binary even provides 'nginx -s reopen'
> which does the same thing. That will be cleaner than
> copytruncate.

The whole reason I was using copytruncate is that it avoids needing to send any signals or do any restarts. Since it's Chef code running in a loop, if I need to signal the service, then I'll have to track another value in the array of log files to indicate which service/pid file to send the signal to. I'm also creating multiple /etc/logrotate.d/ files for the same service in a few cases, and I wanted to avoid doing multiple restarts/signals in those scenarios.

> > anything in /var/log being logged about restarting nginx, so it's not
> > coming in via sudo. I'm at a bit of a loss as to what is going on with
> > these nodes.
>
> Did you restart nginx, make note of the pid, and then check
> again after an event? It's always good to make sure your
> monitoring system isn't hallucinating.
>
> -dsr-

That's a good point, I have not. I just made a note of it now and I'll watch for the next event.

Morgan Blackthorne

Dec 7, 2018, 4:02:36 PM

to dis...@lopsa.org

Confirmed that the pid file did change when it happened again, and that the pid file is accurate compared to ps.

Dan Ritter

Dec 7, 2018, 4:14:27 PM

to Morgan Blackthorne, dis...@lopsa.org

Morgan Blackthorne wrote:
> Confirmed that the pid file did change when it happened again, and that the
> pid file is accurate compared to ps.
>

Is your monitoring system too fine-grained? For example, is it
using inotifywait to track the file instead of periodically
inspecting it?

-dsr-

Guus Snijders

Dec 7, 2018, 4:15:38 PM

to Morgan Blackthorne, LOPSA Discuss List

On Fri, Dec 7, 2018 at 22:02, Morgan Blackthorne <mor...@windsofstorm.net> wrote:

> Confirmed that the pid file did change when it happened again, and that the pid file is accurate compared to ps.

Perhaps a stupid question, but isn't there something in the nginx logs?

Since the daemon appears to restart, perhaps there's some log message about why it shut down or crashed in the first place...

Kind regards, Guus Snijders

Morgan Blackthorne

Dec 7, 2018, 5:19:03 PM

to gsni...@gmail.com, dis...@lopsa.org

I only see one error in /var/log/nginx/error.log.1 and it's not related to the startup/shutdown. /var/log/nginx/error.log is empty. I should have mentioned that before, but that was definitely the first thing that I looked at. :)

As for the monitoring system, it's Solarwinds running a script over the agent. The script is a nagios one, simplified version is as follows:

#!/bin/sh

# This program is free software; you can redistribute it and/or modify
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA

PROGNAME=`basename $0`
VERSION="Version 1.1,"
AUTHOR="2009, Mike Adolphs (http://www.matejunkie.com/)"

ST_OK=0
ST_WR=1
ST_CR=2
ST_UK=3
hostname="localhost"
port=80
path_pid=/var/run
name_pid="nginx.pid"
status_page="nginx_status"
pid_check=1
secure=0

check_pid() {
    if [ -f "$path_pid/$name_pid" ]
    then
        retval=0
    else
        retval=1
    fi
}

if [ ${pid_check} = 1 ]
then
    check_pid
    if [ "$retval" = 1 ]
    then
        echo "There's no pid file for nginx. Is nginx running? Please \
also make sure whether your pid path and name is correct."
        exit $ST_CR
    fi
fi

So it's definitely catching the server bouncing, but I have no idea why. I checked dmesg and I don't see anything from the OOM killer; it doesn't seem to be running out of memory. (Although I will note these are t2.micro instances since as I mentioned before they're very lightly used.)

If it were upstart or systemd, I might be able to get some further info out of those systems, but it's just a SysV init script. OS is Ubuntu 14.04 (we're in the process of moving things to 18.04, but are not far enough along to hit this particular setup yet).
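
Note that the check above only tests whether the pid file exists, so it cannot tell a missing file apart from a file that names a dead process. A hedged sketch of a stricter variant follows; it is not part of the original plugin, `check_pid_alive` is a hypothetical name, and the `/proc` test assumes Linux:

```shell
#!/bin/sh
# Nagios-style exit codes, as in the plugin quoted above.
ST_OK=0
ST_CR=2

# Hypothetical stricter check: the pid file must exist AND name a live process.
# $1 = pid file path
check_pid_alive() {
    pidfile="$1"
    if [ ! -f "$pidfile" ]; then
        echo "CRITICAL - no pid file at $pidfile"
        return $ST_CR
    fi
    pid=$(cat "$pidfile")
    # /proc/PID exists only while the process is alive (Linux-specific).
    if [ -z "$pid" ] || [ ! -d "/proc/$pid" ]; then
        echo "CRITICAL - $pidfile names pid '$pid', which is not running"
        return $ST_CR
    fi
    echo "OK - process $pid from $pidfile is alive"
    return $ST_OK
}

# Example:
#   check_pid_alive /var/run/nginx.pid; exit $?
```

The two CRITICAL messages separate "the file vanished" from "the file is stale", which narrows down what happened at alert time.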

Alicia Smith

Dec 8, 2018, 5:29:18 AM

to Morgan Blackthorne, gsni...@gmail.com, dis...@lopsa.org

If you check the nginx config, is the error log empty because nginx isn't configured to write to one?

Have you checked syslog for any indications?

If you're using Nagios checks I'd think you probably have alerting set up for OOM errors if they occur.

A. Smith

--
This list provided by the League of Professional System Administrators
http://lopsa.org/
---
You received this message because you are subscribed to the Google Groups "LOPSA Discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to discuss+u...@lopsa.org.
Visit this group at https://groups.google.com/a/lopsa.org/group/discuss/.

John Stoffel

Dec 9, 2018, 10:11:10 PM

to Morgan Blackthorne, gsni...@gmail.com, dis...@lopsa.org

Why don't you try running nginx by hand and seeing whether it
restarts on its own or gets restarted by something else?
Bump up the default log file verbosity as well.

Could it be that the rotating of the log files is causing nginx to
restart, and Solarwinds is checking just as the change is happening but
before the pid file is updated properly?

Does Solarwinds show which PID it was expecting to see?

I'd also just remove the log rotation completely, let the logs fill up, and
rotate them weekly. Especially if the servers are lightly loaded... would it matter?

John
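
On the verbosity point: nginx logs the signals its master process receives at the notice level, while the error_log directive defaults to the error level, which may be why error.log is empty here. With something like `error_log /var/log/nginx/error.log notice;` in nginx.conf, a stop or restart should leave lines of the form "signal N (SIGNAME) received". A small sketch for pulling those lines out; the function name is hypothetical and the log paths in the example are the ones mentioned in this thread:

```shell
#!/bin/sh
# Print error-log lines that record a signal received by nginx.
# $@ = one or more nginx error log files
nginx_signal_lines() {
    # -h: omit file names, -E: extended regex
    grep -hE 'signal [0-9]+ \([A-Z0-9]+\) received' "$@"
}

# Example:
#   nginx_signal_lines /var/log/nginx/error.log /var/log/nginx/error.log.1
```

The timestamps on any matches can then be lined up against the alert times and against cron and Chef run times.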

Morgan Blackthorne

Dec 17, 2018, 9:57:56 AM

to John Stoffel, Guus Snijders, LOPSA Discuss List

Something I should have mentioned is that I'm trying to debug this without violating our holiday change freeze policy. I can probably get approval to make changes during the window (after my boss is back tomorrow) given what the impact would be (only internal employee access would be affected), but I'd rather save that for when I know the fix. The idea of running a tmux/screen session and strace'ing the nginx master process (using the PID from the pid file) to see whether it's crashing or being signaled to die has merit, though; that's non-intrusive.
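
A rough sketch of that strace approach, assuming strace is installed and the command is run as root inside tmux or screen; `trace_master_signals` and the output path are hypothetical names. With `-e trace=none` strace skips system calls and reports only the signals delivered to the process, and depending on the strace version the signal line may include the sender's PID:

```shell
#!/bin/sh
# Attach strace to the process named in a pid file and log only its signals.
# $1 = pid file, $2 = file to write the strace output to
trace_master_signals() {
    pidfile="$1"
    outfile="$2"
    if [ ! -f "$pidfile" ]; then
        echo "no pid file at $pidfile" >&2
        return 1
    fi
    # -tt: wall-clock timestamps with microseconds
    # -e trace=none: no syscalls, signals only
    # -o: write to a file instead of the terminal
    strace -tt -e trace=none -p "$(cat "$pidfile")" -o "$outfile"
}

# Example (returns when the traced process exits):
#   trace_master_signals /var/run/nginx.pid /tmp/nginx-master.strace
```

If the master exits, the last lines of the output file should show whether it was killed by a signal or exited on its own.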

John Stoffel

Dec 17, 2018, 3:35:11 PM

to Morgan Blackthorne, John Stoffel, Guus Snijders, LOPSA Discuss List

So why not just spin up a test instance that only you hammer with one
of the web testing tools, and see if you can get it to fall over, then
you can debug at your leisure without affecting production.

Having Solarwinds monitor a test instance and only notify you seems
like a good strategy to nail down the root cause.

John

Billy Vierra

Dec 20, 2018, 6:56:36 PM

to Morgan Blackthorne, LOPSA Discuss List

A few things here :)

> firstaction
> /bin/bash /usr/local/noc/aws_s3_logrotate.sh /var/log/nginx/access.log
> endscript

This really should be in a postrotate script, so the logs are uploaded after they have been moved.

> create 0644 root root

This should be create 0644 nginx nginx, as nginx should not be running as root (and you probably want it to be 0640).

Also, you shouldn't use copytruncate.

postrotate
  [ -f /var/run/nginx.pid ] && kill -USR1 `cat /var/run/nginx.pid`
  /bin/bash /usr/local/noc/aws_s3_logrotate.sh /var/log/nginx/access.log.1
endscript

should work just fine without copytruncate.

As for why you should do the kill -USR1:

> In order to rotate log files, they need to be renamed first. After that
> USR1 signal should be sent to the master process. The master process will
> then re-open all currently open log files and assign them an unprivileged
> user under which the worker processes are running, as an owner. After
> successful re-opening, the master process closes all open files and sends
> the message to worker process to ask them to re-open files. Worker
> processes also open new files and close old files right away. As a result,
> old files are almost immediately available for post processing, such as
> compression.
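
Pulling the suggestions in this message together, the full stanza might look like the sketch below. It has not been tested against the setup in this thread. The www-data/adm ownership is an assumption based on Ubuntu's packaged nginx logrotate file rather than anything stated here, and the upload script path is copied from the original config:

```
/var/log/nginx/access.log
{
    hourly
    missingok
    rotate 2
    notifempty
    create 0640 www-data adm
    postrotate
        # Ask the nginx master to reopen its log files.
        [ -f /var/run/nginx.pid ] && kill -USR1 "$(cat /var/run/nginx.pid)"
        # Upload the file that was just rotated.
        /bin/bash /usr/local/noc/aws_s3_logrotate.sh /var/log/nginx/access.log.1
    endscript
}
```

If one stanza covers several log files for the same service, logrotate's sharedscripts directive makes the postrotate script run once for the whole set instead of once per file, which limits how many signals the service receives.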
