Errors when reloading Prometheus service


nsmeds

May 10, 2017, 6:32:28 PM
to Prometheus Users
I'm trying to modify our Prometheus Puppet module to `reload` the service instead of restarting it when:
- the Prometheus config file changes
- alert rule files change

Currently, restarting the Prometheus service has led to over an hour of `crash-recovery`. Roughly, the change looks like the sketch below.
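
A minimal sketch of the Puppet side (resource names, paths, and the reload command here are illustrative, not our exact module):

service { 'prometheus':
  ensure  => running,
  enable  => true,
  # When notified, run a reload instead of a full restart
  restart => 'systemctl reload prometheus',
}

file { '/etc/prometheus/prometheus.yaml':
  ensure => file,
  source => 'puppet:///modules/prometheus/prometheus.yaml',
  # Refresh (i.e. reload) the service whenever the config changes
  notify => Service['prometheus'],
}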

I've successfully tweaked the Puppet module, but now I'm seeing the following in logs:

May 10 18:22:41 ip-10-232-20-108 systemd[1]: Reloading Prometheus Monitoring framework.
May 10 18:22:41 ip-10-232-20-108 prometheus[19893]: time="2017-05-10T18:22:41-04:00" level=info msg="Loading configuration file /etc/prometheus/prometheus.yaml" source="main.go:251"
May 10 18:22:41 ip-10-232-20-108 systemd[1]: Reloaded Prometheus Monitoring framework.
May 10 18:23:07 ip-10-232-20-108 prometheus[19893]: time="2017-05-10T18:23:07-04:00" level=warning msg="Error on ingesting results from rule evaluation with different value but same timestamp" numDropped=10 source="manager.go:313"
May 10 18:23:07 ip-10-232-20-108 prometheus[19893]: time="2017-05-10T18:23:07-04:00" level=error msg="Error sending alerts: bad response status 400 Bad Request" alertmanager="http:<omitted>.com:80/api/v1/alerts" count=10 source="notifier.go:370"
May 10 18:23:38 ip-10-232-20-108 prometheus[19893]: time="2017-05-10T18:23:38-04:00" level=error msg="Error sending alerts: bad response status 400 Bad Request" alertmanager="http:<omitted>.com:80/api/v1/alerts" count=10 source="notifier.go:370"
May 10 18:24:08 ip-10-232-20-108 prometheus[19893]: time="2017-05-10T18:24:08-04:00" level=error msg="Error sending alerts: bad response status 400 Bad Request" alertmanager="http:<omitted>.com:80/api/v1/alerts" count=10 source="notifier.go:370"
May 10 18:24:38 ip-10-232-20-108 prometheus[19893]: time="2017-05-10T18:24:38-04:00" level=error msg="Error sending alerts: bad response status 400 Bad Request" alertmanager="http:<omitted>.com:80/api/v1/alerts" count=10 source="notifier.go:370"

Has anyone experienced something similar before? Thoughts on a possible solution?

Thanks as always!

nsmeds

May 10, 2017, 6:33:21 PM
to Prometheus Users
Important note: we're on Xenial with systemd. The reload command we're sending is `service prometheus reload`.
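
On Xenial, service(8) redirects to systemctl for systemd-managed units, so (if I understand correctly) this should be equivalent to:

sudo systemctl reload prometheus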

nsmeds

May 14, 2017, 5:04:28 PM
to Prometheus Users
*bump*

Julius Volz

May 14, 2017, 5:31:39 PM
to nsmeds, Prometheus Users
The errors you quote indicate that Prometheus is getting a 400 HTTP response while trying to send alerts to Alertmanager. I struggle to see how that would be restart/reload-related per se, unless the new config contains a bad Alertmanager URL where the old one was good. Is the Alertmanager URL really correct?
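
One way to see what Alertmanager objects to is to post a minimal alert by hand and look at the response body (URL and payload here are illustrative; the v1 API takes a JSON array of alerts):

curl -i -XPOST 'http://<your-alertmanager>:80/api/v1/alerts' \
  -d '[{"labels":{"alertname":"HandTest"}}]'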


nsmeds

May 15, 2017, 2:11:27 PM
to Prometheus Users, niko...@gmail.com
I actually didn't change any configuration while testing this, and the URL is correct. A new discovery:

Running `sudo service prometheus reload` leads to errors:
May 15 13:51:13 ip-10-236-137-113 systemd[1]: Reloading Prometheus Monitoring framework.
May 15 13:51:13 ip-10-236-137-113 prometheus[17350]: time="2017-05-15T13:51:13-04:00" level=info msg="Loading configuration file /etc/prometheus/prometheus.yaml" source="main.go:251"
May 15 13:51:13 ip-10-236-137-113 systemd[1]: Reloaded Prometheus Monitoring framework.
May 15 13:51:37 ip-10-236-137-113 prometheus[17350]: time="2017-05-15T13:51:37-04:00" level=warning msg="Error on ingesting results from rule evaluation with different value but same timestamp" numDropped=1 source="manager.go:313"
May 15 13:51:38 ip-10-236-137-113 prometheus[17350]: time="2017-05-15T13:51:38-04:00" level=error msg="Error sending alerts: bad response status 400 Bad Request" alertmanager="http://alertmanager.REDACTED:80/api/v1/alerts" count=1 source="notifier.go:370"
May 15 13:51:55 ip-10-236-137-113 prometheus[17350]: time="2017-05-15T13:51:55-04:00" level=info msg="Done checkpointing in-memory metrics and chunks in 1m5.287294923s." source="persistence.go:665"
May 15 13:52:08 ip-10-236-137-113 prometheus[17350]: time="2017-05-15T13:52:08-04:00" level=error msg="Error sending alerts: bad response status 400 Bad Request" alertmanager="http://alertmanager.REDACTED:80/api/v1/alerts" count=1 source="notifier.go:370"
May 15 13:52:38 ip-10-236-137-113 prometheus[17350]: time="2017-05-15T13:52:38-04:00" level=error msg="Error sending alerts: bad response status 400 Bad Request" alertmanager="http://alertmanager.REDACTED:80/api/v1/alerts" count=1 source="notifier.go:370"
May 15 13:53:08 ip-10-236-137-113 prometheus[17350]: time="2017-05-15T13:53:08-04:00" level=error msg="Error sending alerts: bad response status 400 Bad Request" alertmanager="http://alertmanager.REDACTED:80/api/v1/alerts" count=1 source="notifier.go:370"

Running `sudo /bin/kill -HUP 17350` works just fine:
May 15 13:30:38 ip-10-236-137-113 prometheus[17350]: time="2017-05-15T13:30:38-04:00" level=info msg="Loading configuration file /etc/prometheus/prometheus.yaml" source="main.go:251"
May 15 13:31:07 ip-10-236-137-113 prometheus[17350]: time="2017-05-15T13:31:07-04:00" level=warning msg="Error on ingesting results from rule evaluation with different value but same timestamp" numDropped=1 source="manager.go:313"
May 15 13:31:38 ip-10-236-137-113 prometheus[17350]: time="2017-05-15T13:31:38-04:00" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:633"
May 15 13:32:43 ip-10-236-137-113 prometheus[17350]: time="2017-05-15T13:32:43-04:00" level=info msg="Done checkpointing in-memory metrics and chunks in 1m5.296435504s." source="persistence.go:665"
May 15 13:33:55 ip-10-236-137-113 prometheus[17350]: time="2017-05-15T13:33:55-04:00" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:633"
May 15 13:34:53 ip-10-236-137-113 prometheus[17350]: time="2017-05-15T13:34:53-04:00" level=info msg="Done checkpointing in-memory metrics and chunks in 58.450524385s." source="persistence.go:665"
May 15 13:35:01 ip-10-236-137-113 CRON[533]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
May 15 13:35:59 ip-10-236-137-113 prometheus[17350]: time="2017-05-15T13:35:59-04:00" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:633"
May 15 13:36:50 ip-10-236-137-113 prometheus[17350]: time="2017-05-15T13:36:50-04:00" level=info msg="Done checkpointing in-memory metrics and chunks in 50.722972084s." source="persistence.go:665"
May 15 13:38:04 ip-10-236-137-113 prometheus[17350]: time="2017-05-15T13:38:04-04:00" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:633"
May 15 13:39:09 ip-10-236-137-113 prometheus[17350]: time="2017-05-15T13:39:09-04:00" level=info msg="Done checkpointing in-memory metrics and chunks in 1m4.270110301s." source="persistence.go:665"
May 15 13:40:14 ip-10-236-137-113 prometheus[17350]: time="2017-05-15T13:40:14-04:00" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:633"
May 15 13:41:17 ip-10-236-137-113 prometheus[17350]: time="2017-05-15T13:41:17-04:00" level=info msg="Done checkpointing in-memory metrics and chunks in 1m2.580461724s." source="persistence.go:665"
May 15 13:42:29 ip-10-236-137-113 prometheus[17350]: time="2017-05-15T13:42:29-04:00" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:633"
May 15 13:43:27 ip-10-236-137-113 prometheus[17350]: time="2017-05-15T13:43:27-04:00" level=info msg="Done checkpointing in-memory metrics and chunks in 57.761827228s." source="persistence.go:665"
May 15 13:44:35 ip-10-236-137-113 prometheus[17350]: time="2017-05-15T13:44:35-04:00" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:633"
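
(Side note: if signaling through systemd stays flaky, Prometheus should also accept a config reload over HTTP, assuming the default listen address:

curl -X POST http://localhost:9090/-/reload

That would at least take the init system out of the equation.)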


Julius Volz

May 15, 2017, 2:39:45 PM
to nsmeds, Prometheus Users
That's strange. Is Prometheus even trying to send alerts in the second case? I.e., do you see them arrive in Alertmanager, and do you see them firing on /alerts in Prometheus?


nsmeds

May 15, 2017, 3:12:46 PM
to Prometheus Users, niko...@gmail.com
Ah. I ran `service prometheus restart` to restart the service and clear the errors, and it went back into crash-recovery for an hour. I'll go check /alerts on Prometheus (but I'm 99% sure we had active alerts, yes).

We run two Prometheus instances, and thankfully I've been fiddling with the secondary one. Ready for more confusion?

/etc/systemd/system/multi-user.target.wants$ cat prometheus.service

[Unit]
Description=Prometheus Monitoring framework
Wants=basic.target
After=basic.target network.target

[Service]
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/prometheus \
  -config.file=/etc/prometheus/prometheus.yaml \
  -storage.local.path=/data/prometheus \
  -web.console.templates=/usr/local/share/prometheus/consoles \
  -web.console.libraries=/usr/local/share/prometheus/console_libraries \
  -alertmanager.url=http://alertmanager.REDACTED.com \
  -storage.local.memory-chunks=6290432 \
  -storage.local.retention=336h0m0s \
  -storage.local.series-file-shrink-ratio=0.1 \
  -storage.local.max-chunks-to-persist=3145216 \
  -web.external-url=http://REDACTED.com
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=always
RestartSec=42s

[Install]
WantedBy=multi-user.target

Unless I'm understanding this incorrectly, `service prometheus reload` should result in the same `/bin/kill -HUP $MAINPID`. Huh.
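
I can at least double-check what systemd thinks the reload command is with:

systemctl show -p ExecReload prometheus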

Julius Volz

May 15, 2017, 3:29:38 PM
to nsmeds, Prometheus Users
On Mon, May 15, 2017 at 9:12 PM, nsmeds <niko...@gmail.com> wrote:
Ah. I ran `service prometheus restart` to restart the service and clear the errors, and it went back into crash-recovery for an hour. I'll go check /alerts on Prometheus (but I'm 99% sure we had active alerts, yes).

You probably need to configure the shutdown to allow more time (TimeoutStopSec?) for Prometheus to finish writing out the final checkpoint.
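
For example, a drop-in along these lines (path and value are illustrative; 1.x checkpoints can take a while):

# /etc/systemd/system/prometheus.service.d/timeout.conf
[Service]
TimeoutStopSec=900

followed by a systemctl daemon-reload.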
 
Unless I'm understanding this incorrectly, `service prometheus reload` should result in the same `/bin/kill -HUP $MAINPID`. Huh.

I would expect that to work, yeah. Above, you mention restart instead of reload though.
 