Getting 503 error from just one target among hundreds, a Prometheus issue?

emreha...@gmail.com

May 26, 2017, 3:47:45 AM
to Prometheus Users
Hello all Prometheus users,

We've just set up a very capable monitoring system with Prometheus. It all works fine, except for one scrape_config target.

We have hundreds of Docker containers, and each of them exposes Prometheus metrics. All of them are "UP" except one, which shows the error "server returned HTTP status 503 Service Unavailable" on our Prometheus dashboard. We haven't found this error in our VM logs, and our load balancer (Marathon) doesn't log it either.

We can reach the failing container's listed target from the Prometheus container itself, so we suspect the problem could be on Prometheus' side. We've tried restarting Prometheus, that VM, and the VM's load balancer, with no luck.

We use the latest prom/prometheus:v1.6.3 Docker image, and the same happened with v1.6.1.

We tried setting -log.level=debug, but I'm not sure it took effect: the docker logs output only shows info-level entries, none of which mention the failing target, and nothing indicates that debug mode is active.
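For reference, passing the flag to the prom/prometheus image looks roughly like this (the config path and container name below are placeholders, not our exact setup):

    # Flags given after the image name replace the image's default command,
    # so the config file and storage path have to be specified again explicitly.
    docker run -d --name prometheus \
      -p 9090:9090 \
      -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
      prom/prometheus:v1.6.3 \
      -config.file=/etc/prometheus/prometheus.yml \
      -storage.local.path=/prometheus \
      -log.level=debug

    # Check whether the flag took effect:
    docker logs prometheus 2>&1 | head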

How can we prove that this error is on Prometheus' part and debug it further?

Best,
Han Tuzun

Ben Kochie

May 26, 2017, 3:57:06 AM
to emreha...@gmail.com, Prometheus Users
The only way for this error to show up is if the target positively responded with a 503.  The likelihood that Prometheus would incorrectly report this error is very low.

I would suggest doing a tcpdump capture from both the Prometheus side and the target host. You can then use Wireshark to inspect the packets.
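Something along these lines should do it (the interface, address, and port below are placeholders for your target):

    # On the Prometheus host: capture traffic to the failing target.
    tcpdump -i any -nn -s0 -w /tmp/prom-side.pcap 'host 10.0.0.42 and tcp port 9102'

    # On the target host: capture the same scrapes from the other side.
    tcpdump -i any -nn -s0 -w /tmp/target-side.pcap 'tcp port 9102'

    # Open the .pcap files in Wireshark, or take a quick look directly:
    tcpdump -r /tmp/prom-side.pcap -A | grep '503 Service Unavailable'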

emreha...@gmail.com

May 26, 2017, 4:13:53 AM
to Prometheus Users, emreha...@gmail.com
Thanks a lot Ben.

Okay, I'll inspect it further.

Best,
Han

Han Tuzun

May 26, 2017, 7:55:57 AM
to Prometheus Users, Emrehan TÜZÜN
Hi all,

The issue was on our part, apologies for the thread!

Best,
Han

Ben Kochie

May 26, 2017, 8:31:16 AM
to Han Tuzun, Prometheus Users
No problem. Anything useful to share about the debugging process, in case someone else runs into a similar problem?

Glad that Prometheus found a bug for you. :-)

Han Tuzun

May 27, 2017, 2:51:10 PM
to Ben Kochie, Prometheus Users
Haha, yes it helped!

I captured the TCP traffic with a verbose log level, as you suggested, and confirmed that our load balancer does return the 503. The problem turned out to be the Host headers our nginx accepts: we had previously fixed it to accept url, url:80, and url:443, but strangely that one instance hadn't received the update.
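For anyone hitting something similar, a quick way to confirm this kind of Host header issue is to vary the header with curl against the load balancer (the address and hostnames below are placeholders):

    # Accepted Host header -> the metrics come back with a 200.
    curl -s -o /dev/null -w '%{http_code}\n' \
      -H 'Host: app.example.com' http://10.0.0.42/metrics

    # Any other Host header -> nginx answers with the 503 that Prometheus saw.
    curl -s -o /dev/null -w '%{http_code}\n' \
      -H 'Host: something-else' http://10.0.0.42/metrics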

As a side note: after an evaluation we decided to migrate our monitoring from Riemann to Prometheus & Alertmanager, and it's working really well so far.

Best,
Han

Ben Kochie

May 27, 2017, 5:51:10 PM
to Han Tuzun, Prometheus Users
Thanks for the update.

Generally we recommend direct connections between Prometheus and targets where possible, to avoid these kinds of problems. There's also the failure mode where the LB itself is the only thing broken, which can trigger alerts for systems that aren't actually affected, or worse, leave you blind to real problems.

Han Tuzun

May 29, 2017, 12:09:47 PM
to Ben Kochie, Prometheus Users
You're welcome!

We're running a SaaS, and in our monitoring we err on the side of false positives so as not to miss any downtime for our customers. That is, we scrape our customers' instances' metrics over HTTPS via their IPs, just as the customers do. Wouldn't you recommend doing it that way?

We do monitor internal services, including the load balancers in front of customer instances, via direct connections. Our load balancers are highly available as well.

Best,
Han

Ben Kochie

May 29, 2017, 12:24:23 PM
to Han Tuzun, Prometheus Users
Prometheus' design is slightly different. What we aim for is direct instrumentation of each piece of software, so that Prometheus can communicate as directly as possible with each component rather than "end-to-end". This lets you pinpoint problems more easily at each layer of the stack: host metrics, load-balancer metrics, app metrics.

This lets each thing be monitored in isolation, so one system doesn't mask problems in another.

End-to-end "blackbox" metrics are also something we monitor, but those are for explicitly testing the whole path.  This is the "what the customer sees" kind of result.
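As a rough sketch of the difference (hostnames, ports, and the blackbox module name below are placeholders and depend on your setup), direct per-layer scrapes versus an end-to-end probe look like:

    # Direct, per-layer scrapes: each component exposes its own /metrics.
    curl -s http://app-host:9102/metrics | head     # application metrics
    curl -s http://lb-host:9113/metrics | head      # load-balancer metrics
    curl -s http://app-host:9100/metrics | head     # node_exporter host metrics

    # End-to-end "what the customer sees" check via the blackbox exporter,
    # probing the public URL through the load balancer:
    curl -s 'http://blackbox-host:9115/probe?module=http_2xx&target=https://customer.example.com/'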
