Prometheus getting timeouts on DescribeInstances, with a blank page


psreej...@gmail.com

Aug 9, 2017, 3:21:22 AM
to Prometheus Users
Hello team,

My Prometheus instance goes down after running for a while, and in the logs I found timeouts for EC2 DescribeInstances calls:

time="2017-08-09T07:11:09Z" level=warning msg="Scrape sample count post-relabeling sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="running", job="cadvisor"} => 186.450205515 @[1502262440.962] source="scrape.go:596"
time="2017-08-09T07:11:26Z" level=error msg="could not describe instances: RequestError: send request failed
caused by: Post https://ec2.us-east-1.amazonaws.com/: net/http: TLS handshake timeout" source="ec2.go:118"
time="2017-08-09T07:11:26Z" level=error msg="could not describe instances: RequestError: send request failed
caused by: Post https://ec2.us-east-1.amazonaws.com/: net/http: TLS handshake timeout" source="ec2.go:118"
time="2017-08-09T07:11:29Z" level=error msg="could not describe instances: RequestError: send request failed
caused by: Post https://ec2.us-east-1.amazonaws.com/: net/http: TLS handshake timeout" source="ec2.go:118"
time="2017-08-09T07:11:31Z" level=error msg="could not describe instances: RequestError: send request failed
caused by: Post https://ec2.us-east-1.amazonaws.com/: net/http: TLS handshake timeout" source="ec2.go:118"
time="2017-08-09T07:11:39Z" level=error msg="could not describe instances: RequestError: send request failed
caused by: Post https://ec2.us-east-1.amazonaws.com/: net/http: TLS handshake timeout" source="ec2.go:118"

Please find the prometheus.yml content below:

[ec2-user@ip-10-75-10-87 prometheus_etc]$ cat prometheus.yml
global:
  external_labels:
    monitor: 'codelab-monitor'

rule_files:
  - "targets.rules"
  - "host.rules"
scrape_configs:

  - job_name: 'Devops-node'
    scrape_interval: 10s
    scrape_timeout: 5s
    ec2_sd_configs:
      - region: us-east-1
        access_key: xxxxxx
        secret_key: xxxxxx
        port: 9100
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name]
        regex: DevOps-Cluster-ECSInstance
        action: keep
      - source_labels: [__meta_ec2_instance_id]
        target_label: instance

  - job_name: 'Devops-cadvisor'
    scrape_interval: 10s
    scrape_timeout: 5s
    ec2_sd_configs:
      - region: us-east-1
        access_key: xxxxxx
        secret_key: xxxxxx
        port: 8080
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name]
        regex: DevOps-Cluster-ECSInstance
        action: keep
      - source_labels: [__meta_ec2_instance_id]
        target_label: instance

      - source_labels: [__meta_ec2_instance_state]
        target_label: instance
  - job_name: 'prometheus'
    scrape_interval: 10s
    scrape_timeout: 5s
    ec2_sd_configs:
      - region: us-east-1
        access_key: xxxxxx
        secret_key: xxxxxx
        port: 9090
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name]
        regex: DevOps-Cluster-ECSInstance
        action: keep


  - job_name: 'nodes'
    scrape_interval: 10s
    scrape_timeout: 5s
    ec2_sd_configs:
      - region: us-east-1
        access_key: xxxxxx
        secret_key: xxxxxx
        port: 9100
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Monitor]
        regex: Enable
        action: keep
      - source_labels: [__meta_ec2_instance_id]
        target_label: instance

  - job_name: 'cadvisor'
    scrape_interval: 10s
    scrape_timeout: 5s
    ec2_sd_configs:
      - region: us-east-1
        access_key: xxxxxx
        secret_key: xxxxx
        port: 8080
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Monitor]
        regex: Enable
        action: keep
      - source_labels: [__meta_ec2_instance_id]
        target_label: instance

      - source_labels: [__meta_ec2_instance_state]
        target_label: instance


[ec2-user@ip-10-75-10-87 prometheus_etc]$

Please find the Dockerfile below:

FROM prom/prometheus
MAINTAINER Sreejith.Pilakkat
COPY ./*.yml /etc/prometheus/
COPY ./*.rules /etc/prometheus/
VOLUME ["/etc/prometheus"]
EXPOSE 9090
VOLUME     [ "/prometheus" ]
WORKDIR    /prometheus
ENTRYPOINT [ "/bin/prometheus" ]
CMD        [ "-config.file=/etc/prometheus/prometheus.yml", "-storage.local.path=/prometheus", "-alertmanager.url=http://alertmanager:9093", "-storage.local.memory-chunks=500000" ]


Prometheus is running as a container on a Debian machine. Please let me know if you need any more info from my end.

Thanks
Sreejith

Matthias Rampke

Aug 9, 2017, 3:33:12 AM
to psreej...@gmail.com, Prometheus Users

From the machine that Prometheus runs on, and from within a container started like Prometheus, what happens when you

curl -v https://ec2.us-east-1.amazonaws.com/

? In other words, is this reachable at all?

/MR



psreej...@gmail.com

Aug 9, 2017, 3:37:38 AM
to Prometheus Users, psreej...@gmail.com

When I tried inside the container:

[ec2-user@ip-10-75-10-87 prometheus_etc]$ docker exec -it 0c15e17d45a5 sh
/prometheus # curl -v https://ec2.us-east-1.amazonaws.com/
sh: curl: not found
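(Aside: the prom/prometheus image is busybox-based, so curl is missing inside it. One tool-free alternative is a TCP probe via bash's /dev/tcp — a sketch only: it needs bash, which busybox sh lacks, so run it from the host or a bash-equipped container, and it verifies TCP reachability only, not the TLS handshake that is actually timing out.)

```shell
# TCP-level probe without curl, using bash's /dev/tcp redirection.
# Checks only that a TCP connection to port 443 can be opened.
if timeout 5 bash -c 'exec 3<>/dev/tcp/ec2.us-east-1.amazonaws.com/443'; then
  echo "tcp connect ok"
else
  echo "tcp connect failed"
fi
```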

On the host machine:

[ec2-user@ip-host prometheus_etc]$ curl -v https://ec2.us-east-1.amazonaws.com/
*   Trying 54.239.28.176...
* Connected to ec2.us-east-1.amazonaws.com (54.239.28.176) port 443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* ALPN/NPN, server did not agree to a protocol
* SSL connection using TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA
* Server certificate:
*       subject: CN=ec2.us-east-1.amazonaws.com,O="Amazon.com, Inc.",L=Seattle,ST=Washington,C=US
*       start date: Aug 07 00:00:00 2017 GMT
*       expire date: May 07 23:59:59 2018 GMT
*       common name: ec2.us-east-1.amazonaws.com
*       issuer: CN=Symantec Class 3 Secure Server CA - G4,OU=Symantec Trust Network,O=Symantec Corporation,C=US
> GET / HTTP/1.1
> User-Agent: curl/7.47.1
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
< Content-Length: 0
< Date: Wed, 09 Aug 2017 07:36:14 GMT
< Server: AmazonEC2
<
* Connection #0 to host ec2.us-east-1.amazonaws.com left intact

psreej...@gmail.com

Aug 9, 2017, 3:52:21 AM
to Prometheus Users, psreej...@gmail.com
Please find the ping result from inside the container:

/prometheus # ping ec2.us-east-1.amazonaws.com
PING ec2.us-east-1.amazonaws.com (54.239.29.8): 56 data bytes
64 bytes from 54.239.29.8: seq=0 ttl=244 time=1.902 ms
64 bytes from 54.239.29.8: seq=1 ttl=244 time=134.534 ms
64 bytes from 54.239.29.8: seq=2 ttl=244 time=1.645 ms
64 bytes from 54.239.29.8: seq=3 ttl=244 time=147.676 ms
^C
--- ec2.us-east-1.amazonaws.com ping statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max = 1.645/71.439/147.676 ms

Matthias Rampke

Aug 9, 2017, 5:42:35 AM
to psreej...@gmail.com, Prometheus Users
Okay. Do these errors happen all the time, or only after Prometheus "went down"? How specifically does it go down? Does it crash? If so, there should be messages from the crash, or Docker should know if it's been OOM killed.
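Two quick ways to check the OOM-kill angle — a sketch; CONTAINER is a placeholder for the Prometheus container ID:

```shell
# Did Docker record an OOM kill for the container?
docker inspect -f '{{.State.OOMKilled}}' CONTAINER

# Kernel-side evidence of OOM kills on the host:
dmesg | grep -i "killed process"
```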

psreej...@gmail.com

Aug 9, 2017, 5:55:25 AM
to Prometheus Users, psreej...@gmail.com
Prometheus gets stuck with a blank page after running for a while, together with the timeout error messages. Please find the first lines of the timeout messages fetched from the container logs:

{"log":"time=\"2017-08-09T07:10:57Z\" level=warning msg=\"Scrape health sample discarded\" error=\"sample timestamp out of order\" sample=up{instance=\"running\", job=\"cadvisor\"} =\u003e
0 @[1502262641.977] source=\"scrape.go:587\" \n","stream":"stderr","time":"2017-08-09T07:11:01.95726599Z"}
{"log":"time=\"2017-08-09T07:11:10Z\" level=warning msg=\"Scrape duration sample discarded\" error=\"sample timestamp out of order\" sample=scrape_duration_seconds{instance=\"running\", job
=\"cadvisor\"} =\u003e 10.87713533 @[1502262641.977] source=\"scrape.go:590\" \n","stream":"stderr","time":"2017-08-09T07:11:15.76624904Z"}
{"log":"time=\"2017-08-09T07:11:09Z\" level=warning msg=\"Scrape sample count post-relabeling sample discarded\" error=\"sample timestamp out of order\" sample=scrape_duration_seconds{insta
nce=\"running\", job=\"cadvisor\"} =\u003e 186.450205515 @[1502262440.962] source=\"scrape.go:596\" \n","stream":"stderr","time":"2017-08-09T07:11:19.832195484Z"}
{"log":"time=\"2017-08-09T07:11:26Z\" level=error msg=\"could not describe instances: RequestError: send request failed\n","stream":"stderr","time":"2017-08-09T07:11:27.513669038Z"}
{"log":"caused by: Post https://ec2.us-east-1.amazonaws.com/: net/http: TLS handshake timeout\" source=\"ec2.go:118\" \n","stream":"stderr","time":"2017-08-09T07:11:27.513705198Z"}
{"log":"time=\"2017-08-09T07:11:26Z\" level=error msg=\"could not describe instances: RequestError: send request failed\n","stream":"stderr","time":"2017-08-09T07:11:29.024310112Z"}
{"log":"caused by: Post https://ec2.us-east-1.amazonaws.com/: net/http: TLS handshake timeout\" source=\"ec2.go:118\" \n","stream":"stderr","time":"2017-08-09T07:11:29.024341458Z"}
{"log":"time=\"2017-08-09T07:11:29Z\" level=error msg=\"could not describe instances: RequestError: send request failed\n","stream":"stderr","time":"2017-08-09T07:11:34.615989678Z"}
{"log":"caused by: Post https://ec2.us-east-1.amazonaws.com/: net/http: TLS handshake timeout\" source=\"ec2.go:118\" \n","stream":"stderr","time":"2017-08-09T07:11:34.616029714Z"}
{"log":"time=\"2017-08-09T07:11:31Z\" level=error msg=\"could not describe instances: RequestError: send request failed\n","stream":"stderr","time":"2017-08-09T07:11:42.481535042Z"}
{"log":"caused by: Post https://ec2.us-east-1.amazonaws.com/: net/http: TLS handshake timeout\" source=\"ec2.go:118\" \n","stream":"stderr","time":"2017-08-09T07:11:42.481569743Z"}
{"log":"time=\"2017-08-09T07:11:39Z\" level=error msg=\"could not describe instances: RequestError: send request failed\n","stream":"stderr","time":"2017-08-09T07:11:43.871243849Z"}
{"log":"caused by: Post https://ec2.us-east-1.amazonaws.com/: net/http: TLS handshake timeout\" source=\"ec2.go:118\" \n","stream":"stderr","time":"2017-08-09T07:11:43.871275284Z"}
{"log":"time=\"2017-08-09T07:11:27Z\" level=warning msg=\"Scrape sample count sample discarded\" error=\"sample timestamp out of order\" sample=scrape_duration_seconds{instance=\"running\", job=\"cadvisor\"} =\u003e 10.87713533 @[1502262641.977] source=\"scrape.go:593\" \n","stream":"stderr","time":"2017-08-09T07:11:47.815350003Z"}


On the 2-core machine, please find the currently available memory:

[root@host ]# free -m
             total       used       free     shared    buffers     cached
Mem:          7986       2516       5469          0        264        918
-/+ buffers/cache:       1334       6652
Swap:            0          0          0


[root@ip-host ~]# w
 09:48:11 up 13 days, 18:15,  1 user,  load average: 2.03, 2.15, 2.11
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
ec2-user pts/0    ip-10-75-1-214.e 05:06    1.00s  0.13s  0.00s sshd: ec2-user [priv]

Please find the top output:

top - 09:49:09 up 13 days, 18:16,  1 user,  load average: 2.01, 2.12, 2.10
Tasks: 140 total,   1 running, 139 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.8%us,  1.7%sy,  0.0%ni, 19.7%id, 77.6%wa,  0.0%hi,  0.0%si,  0.2%st
Mem:   8178428k total,  2577656k used,  5600772k free,   270640k buffers
Swap:        0k total,        0k used,        0k free,   937604k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
31115 root      20   0  426m  37m  12m S  1.0  0.5   3:04.77 cadvisor
 3159 root      20   0  362m  14m 8852 S  0.7  0.2 148:27.94 docker-containe
 3428 root      20   0  248m  47m 7480 S  0.7  0.6 158:10.85 agent
30932 root      20   0  182m 121m    0 S  0.7  1.5   5:32.94 prometheus

psreej...@gmail.com

Aug 9, 2017, 11:41:53 PM
to Prometheus Users, psreej...@gmail.com

Finally I was able to run Prometheus without any issues. As you suggested, I checked the RAM, which was the culprit: the memory given to the Prometheus container was too low, so I increased it to 2 GB, and now it is not failing. Thank you, Matthias, for your help.
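For anyone landing here later: the 2 GB figure lines up with the -storage.local.memory-chunks=500000 flag in the Dockerfile above. A rough sizing sketch — the 1024-byte in-memory chunk size and the ~3x headroom rule of thumb come from the Prometheus 1.x local-storage documentation:

```shell
# Back-of-the-envelope RAM estimate for -storage.local.memory-chunks=500000
chunks=500000
chunk_bytes=1024   # in-memory size of one chunk in Prometheus 1.x
echo "chunks alone: $(( chunks * chunk_bytes / 1024 / 1024 )) MiB"
echo "with 3x headroom: $(( 3 * chunks * chunk_bytes / 1024 / 1024 )) MiB"
```

That puts the working set well above the default container limit, which is consistent with the container being starved below 2 GB.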