Prometheus getting timeouts on DescribeInstances, with a blank page


psreej...@gmail.com

Aug 9, 2017, 3:21:22 AM
to Prometheus Users
Hello team,

My Prometheus instance goes down after running for a while, and in the logs I found timeouts for EC2 DescribeInstances calls:

time="2017-08-09T07:11:09Z" level=warning msg="Scrape sample count post-relabeling sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="running", job="cadvisor"} => 186.450205515 @[1502262440.962] source="scrape.go:596"
time="2017-08-09T07:11:26Z" level=error msg="could not describe instances: RequestError: send request failed
caused by: Post https://ec2.us-east-1.amazonaws.com/: net/http: TLS handshake timeout" source="ec2.go:118"
time="2017-08-09T07:11:26Z" level=error msg="could not describe instances: RequestError: send request failed
caused by: Post https://ec2.us-east-1.amazonaws.com/: net/http: TLS handshake timeout" source="ec2.go:118"
time="2017-08-09T07:11:29Z" level=error msg="could not describe instances: RequestError: send request failed
caused by: Post https://ec2.us-east-1.amazonaws.com/: net/http: TLS handshake timeout" source="ec2.go:118"
time="2017-08-09T07:11:31Z" level=error msg="could not describe instances: RequestError: send request failed
caused by: Post https://ec2.us-east-1.amazonaws.com/: net/http: TLS handshake timeout" source="ec2.go:118"
time="2017-08-09T07:11:39Z" level=error msg="could not describe instances: RequestError: send request failed
caused by: Post https://ec2.us-east-1.amazonaws.com/: net/http: TLS handshake timeout" source="ec2.go:118"

Please find the prometheus.yml content below:

[ec2-user@ip-10-75-10-87 prometheus_etc]$ cat prometheus.yml
global:
  external_labels:
    monitor: 'codelab-monitor'

rule_files:
  - "targets.rules"
  - "host.rules"
scrape_configs:

  - job_name: 'Devops-node'
    scrape_interval: 10s
    scrape_timeout: 5s
    ec2_sd_configs:
      - region: us-east-1
        access_key: xxxxxx
        secret_key: xxxxxx
        port: 9100
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name]
        regex: DevOps-Cluster-ECSInstance
        action: keep
      - source_labels: [__meta_ec2_instance_id]
        target_label: instance

  - job_name: 'Devops-cadvisor'
    scrape_interval: 10s
    scrape_timeout: 5s
    ec2_sd_configs:
      - region: us-east-1
        access_key: xxxxxx
        secret_key: xxxxxx
        port: 8080
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name]
        regex: DevOps-Cluster-ECSInstance
        action: keep
      - source_labels: [__meta_ec2_instance_id]
        target_label: instance

      - source_labels: [__meta_ec2_instance_state]
        target_label: instance
  - job_name: 'prometheus'
    scrape_interval: 10s
    scrape_timeout: 5s
    ec2_sd_configs:
      - region: us-east-1
        access_key: xxxxxx
        secret_key: xxxxxx
        port: 9090
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name]
        regex: DevOps-Cluster-ECSInstance
        action: keep


  - job_name: 'nodes'
    scrape_interval: 10s
    scrape_timeout: 5s
    ec2_sd_configs:
      - region: us-east-1
        access_key: xxxxxx
        secret_key: xxxxxx
        port: 9100
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Monitor]
        regex: Enable
        action: keep
      - source_labels: [__meta_ec2_instance_id]
        target_label: instance

  - job_name: 'cadvisor'
    scrape_interval: 10s
    scrape_timeout: 5s
    ec2_sd_configs:
      - region: us-east-1
        access_key: xxxxxx
        secret_key: xxxxx
        port: 8080
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Monitor]
        regex: Enable
        action: keep
      - source_labels: [__meta_ec2_instance_id]
        target_label: instance

      - source_labels: [__meta_ec2_instance_state]
        target_label: instance


[ec2-user@ip-10-75-10-87 prometheus_etc]$

Please find the Dockerfile below:

FROM prom/prometheus
MAINTAINER Sreejith.Pilakkat
COPY ./*.yml /etc/prometheus/
COPY ./*.rules /etc/prometheus/
VOLUME ["/etc/prometheus"]
EXPOSE 9090
VOLUME     [ "/prometheus" ]
WORKDIR    /prometheus
ENTRYPOINT [ "/bin/prometheus" ]
CMD        [ "-config.file=/etc/prometheus/prometheus.yml", "-storage.local.path=/prometheus", "-alertmanager.url=http://alertmanager:9093", "-storage.local.memory-chunks=500000" ]


Prometheus is running as a container on a Debian machine. Please let me know if you need any more info from my end.

Thanks
Sreejith

Matthias Rampke

Aug 9, 2017, 3:33:12 AM
to psreej...@gmail.com, Prometheus Users

From the machine that Prometheus runs on, and from within a container started like Prometheus, what happens when you

curl -v https://ec2.us-east-1.amazonaws.com/

? In other words, is this reachable at all?

/MR



psreej...@gmail.com

Aug 9, 2017, 3:37:38 AM
to Prometheus Users, psreej...@gmail.com

When I tried inside the container:

[ec2-user@ip-10-75-10-87 prometheus_etc]$ docker exec -it 0c15e17d45a5 sh
/prometheus # curl -v https://ec2.us-east-1.amazonaws.com/
sh: curl: not found
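(Aside: the prom/prometheus image is busybox-based, so curl is missing inside it. One tool-free alternative is a TCP probe via bash's /dev/tcp — a sketch only: it needs bash, which busybox sh lacks, so run it from the host or a bash-equipped container, and it verifies TCP reachability only, not the TLS handshake that is actually timing out.)

```shell
# TCP-level probe without curl, using bash's /dev/tcp redirection.
# Checks only that a TCP connection to port 443 can be opened.
if timeout 5 bash -c 'exec 3<>/dev/tcp/ec2.us-east-1.amazonaws.com/443'; then
  echo "tcp connect ok"
else
  echo "tcp connect failed"
fi
```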

On the host machine:

[ec2-user@ip-host prometheus_etc]$ curl -v https://ec2.us-east-1.amazonaws.com/
*   Trying 54.239.28.176...
* Connected to ec2.us-east-1.amazonaws.com (54.239.28.176) port 443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* ALPN/NPN, server did not agree to a protocol
* SSL connection using TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA
* Server certificate:
*       subject: CN=ec2.us-east-1.amazonaws.com,O="Amazon.com, Inc.",L=Seattle,ST=Washington,C=US
*       start date: Aug 07 00:00:00 2017 GMT
*       expire date: May 07 23:59:59 2018 GMT
*       common name: ec2.us-east-1.amazonaws.com
*       issuer: CN=Symantec Class 3 Secure Server CA - G4,OU=Symantec Trust Network,O=Symantec Corporation,C=US
> GET / HTTP/1.1
> User-Agent: curl/7.47.1
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
< Content-Length: 0
< Date: Wed, 09 Aug 2017 07:36:14 GMT
< Server: AmazonEC2
<
* Connection #0 to host ec2.us-east-1.amazonaws.com left intact

psreej...@gmail.com

Aug 9, 2017, 3:52:21 AM
to Prometheus Users, psreej...@gmail.com
Please find the ping result from inside the container:

/prometheus # ping ec2.us-east-1.amazonaws.com
PING ec2.us-east-1.amazonaws.com (54.239.29.8): 56 data bytes
64 bytes from 54.239.29.8: seq=0 ttl=244 time=1.902 ms
64 bytes from 54.239.29.8: seq=1 ttl=244 time=134.534 ms
64 bytes from 54.239.29.8: seq=2 ttl=244 time=1.645 ms
64 bytes from 54.239.29.8: seq=3 ttl=244 time=147.676 ms
^C
--- ec2.us-east-1.amazonaws.com ping statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max = 1.645/71.439/147.676 ms

Matthias Rampke

Aug 9, 2017, 5:42:35 AM
to psreej...@gmail.com, Prometheus Users
Okay. Do these errors happen all the time, or only after Prometheus "went down"? How specifically does it go down? Does it crash? If so, there should be messages from the crash, or Docker should know if it's been OOM killed.
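Two quick ways to check the OOM-kill angle — a sketch; CONTAINER is a placeholder for the Prometheus container ID:

```shell
# Did Docker record an OOM kill for the container?
docker inspect -f '{{.State.OOMKilled}}' CONTAINER

# Kernel-side evidence of OOM kills on the host:
dmesg | grep -i "killed process"
```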

psreej...@gmail.com

Aug 9, 2017, 5:55:25 AM
to Prometheus Users, psreej...@gmail.com
Prometheus gets stuck with a blank page after running for a while, together with the timeout error messages. Please find the first lines of the timeout messages fetched from the container logs:

{"log":"time=\"2017-08-09T07:10:57Z\" level=warning msg=\"Scrape health sample discarded\" error=\"sample timestamp out of order\" sample=up{instance=\"running\", job=\"cadvisor\"} =\u003e
0 @[1502262641.977] source=\"scrape.go:587\" \n","stream":"stderr","time":"2017-08-09T07:11:01.95726599Z"}
{"log":"time=\"2017-08-09T07:11:10Z\" level=warning msg=\"Scrape duration sample discarded\" error=\"sample timestamp out of order\" sample=scrape_duration_seconds{instance=\"running\", job
=\"cadvisor\"} =\u003e 10.87713533 @[1502262641.977] source=\"scrape.go:590\" \n","stream":"stderr","time":"2017-08-09T07:11:15.76624904Z"}
{"log":"time=\"2017-08-09T07:11:09Z\" level=warning msg=\"Scrape sample count post-relabeling sample discarded\" error=\"sample timestamp out of order\" sample=scrape_duration_seconds{insta
nce=\"running\", job=\"cadvisor\"} =\u003e 186.450205515 @[1502262440.962] source=\"scrape.go:596\" \n","stream":"stderr","time":"2017-08-09T07:11:19.832195484Z"}
{"log":"time=\"2017-08-09T07:11:26Z\" level=error msg=\"could not describe instances: RequestError: send request failed\n","stream":"stderr","time":"2017-08-09T07:11:27.513669038Z"}
{"log":"caused by: Post https://ec2.us-east-1.amazonaws.com/: net/http: TLS handshake timeout\" source=\"ec2.go:118\" \n","stream":"stderr","time":"2017-08-09T07:11:27.513705198Z"}
{"log":"time=\"2017-08-09T07:11:26Z\" level=error msg=\"could not describe instances: RequestError: send request failed\n","stream":"stderr","time":"2017-08-09T07:11:29.024310112Z"}
{"log":"caused by: Post https://ec2.us-east-1.amazonaws.com/: net/http: TLS handshake timeout\" source=\"ec2.go:118\" \n","stream":"stderr","time":"2017-08-09T07:11:29.024341458Z"}
{"log":"time=\"2017-08-09T07:11:29Z\" level=error msg=\"could not describe instances: RequestError: send request failed\n","stream":"stderr","time":"2017-08-09T07:11:34.615989678Z"}
{"log":"caused by: Post https://ec2.us-east-1.amazonaws.com/: net/http: TLS handshake timeout\" source=\"ec2.go:118\" \n","stream":"stderr","time":"2017-08-09T07:11:34.616029714Z"}
{"log":"time=\"2017-08-09T07:11:31Z\" level=error msg=\"could not describe instances: RequestError: send request failed\n","stream":"stderr","time":"2017-08-09T07:11:42.481535042Z"}
{"log":"caused by: Post https://ec2.us-east-1.amazonaws.com/: net/http: TLS handshake timeout\" source=\"ec2.go:118\" \n","stream":"stderr","time":"2017-08-09T07:11:42.481569743Z"}
{"log":"time=\"2017-08-09T07:11:39Z\" level=error msg=\"could not describe instances: RequestError: send request failed\n","stream":"stderr","time":"2017-08-09T07:11:43.871243849Z"}
{"log":"caused by: Post https://ec2.us-east-1.amazonaws.com/: net/http: TLS handshake timeout\" source=\"ec2.go:118\" \n","stream":"stderr","time":"2017-08-09T07:11:43.871275284Z"}
{"log":"time=\"2017-08-09T07:11:27Z\" level=warning msg=\"Scrape sample count sample discarded\" error=\"sample timestamp out of order\" sample=scrape_duration_seconds{instance=\"running\", job=\"cadvisor\"} =\u003e 10.87713533 @[1502262641.977] source=\"scrape.go:593\" \n","stream":"stderr","time":"2017-08-09T07:11:47.815350003Z"}


On the 2-core machine, please find the currently available memory:

[root@host ]# free -m
             total       used       free     shared    buffers     cached
Mem:          7986       2516       5469          0        264        918
-/+ buffers/cache:       1334       6652
Swap:            0          0          0


[root@ip-host ~]# w
 09:48:11 up 13 days, 18:15,  1 user,  load average: 2.03, 2.15, 2.11
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
ec2-user pts/0    ip-10-75-1-214.e 05:06    1.00s  0.13s  0.00s sshd: ec2-user [priv]

Please find the top output:

top - 09:49:09 up 13 days, 18:16,  1 user,  load average: 2.01, 2.12, 2.10
Tasks: 140 total,   1 running, 139 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.8%us,  1.7%sy,  0.0%ni, 19.7%id, 77.6%wa,  0.0%hi,  0.0%si,  0.2%st
Mem:   8178428k total,  2577656k used,  5600772k free,   270640k buffers
Swap:        0k total,        0k used,        0k free,   937604k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
31115 root      20   0  426m  37m  12m S  1.0  0.5   3:04.77 cadvisor
 3159 root      20   0  362m  14m 8852 S  0.7  0.2 148:27.94 docker-containe
 3428 root      20   0  248m  47m 7480 S  0.7  0.6 158:10.85 agent
30932 root      20   0  182m 121m    0 S  0.7  1.5   5:32.94 prometheus

psreej...@gmail.com

Aug 9, 2017, 11:41:53 PM
to Prometheus Users, psreej...@gmail.com

Finally I was able to run Prometheus without any issues. As you suggested, I checked the RAM, which was the culprit: the memory given to the Prometheus container was too low, so I increased it to 2 GB, and now it is not failing. Thank you, Matthias, for your help.
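For anyone landing here later: the 2 GB figure lines up with the -storage.local.memory-chunks=500000 flag in the Dockerfile above. A rough sizing sketch — the 1024-byte in-memory chunk size and the ~3x headroom rule of thumb come from the Prometheus 1.x local-storage documentation:

```shell
# Back-of-the-envelope RAM estimate for -storage.local.memory-chunks=500000
chunks=500000
chunk_bytes=1024   # in-memory size of one chunk in Prometheus 1.x
echo "chunks alone: $(( chunks * chunk_bytes / 1024 / 1024 )) MiB"
echo "with 3x headroom: $(( 3 * chunks * chunk_bytes / 1024 / 1024 )) MiB"
```

That puts the working set well above the default container limit, which is consistent with the container being starved below 2 GB.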