Prometheus scraping stalled


sebastian...@gmail.com

Sep 13, 2018, 10:06:39 AM
to Prometheus Users
Hi,

we had a strange issue with our Prometheus installation, which runs in our Kubernetes cluster.
While investigating another issue in the setup, I noticed on the targets page of the web interface that the last scrape for every target was ~65h ago. The log did not show anything interesting:
level=info ts=2018-09-10T13:00:02.398884219Z caller=compact.go:398 component=tsdb msg="write block" mint=1536573600000 maxt=1536580800000 ulid=01CQ1S2FHWQ3MCN2KCX0945BZW
level=info ts=2018-09-10T13:00:02.744459059Z caller=head.go:348 component=tsdb msg="head GC completed" duration=129.962428ms
level=info ts=2018-09-10T13:00:03.427840587Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=683.288113ms
level=info ts=2018-09-10T15:00:02.31455488Z caller=compact.go:398 component=tsdb msg="write block" mint=1536580800000 maxt=1536588000000 ulid=01CQ1ZY6SGHZQ4WDJ55X54ARW5
level=info ts=2018-09-10T15:00:02.683163067Z caller=head.go:348 component=tsdb msg="head GC completed" duration=135.644083ms
level=info ts=2018-09-10T15:00:03.357779834Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=674.543186ms
level=info ts=2018-09-10T15:00:05.40285549Z caller=compact.go:352 component=tsdb msg="compact blocks" count=3 mint=1536559200000 maxt=1536580800000 ulid=01CQ1ZY9VYFBA0TXW1GBBBKQQF sources="[01CQ1BB28SA780693DMQKNV40C 01CQ1J6TQJBVQHJ29RHW9T61RZ 01CQ1S2FHWQ3MCN2KCX0945BZW]"
level=info ts=2018-09-10T17:00:02.269244263Z caller=compact.go:398 component=tsdb msg="write block" mint=1536588000000 maxt=1536595200000 ulid=01CQ26SY1SECXT7R4WWMZAGV78
level=info ts=2018-09-10T17:00:02.632895764Z caller=head.go:348 component=tsdb msg="head GC completed" duration=134.117106ms
level=info ts=2018-09-10T17:00:03.157658829Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=524.698287ms

(scraping stopped around 2018-09-10T17:18)
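
As an aside, rather than eyeballing the targets page, the same last-scrape timestamps can also be read from the HTTP API. A rough sketch, assuming the standard /api/v1/targets endpoint and that jq is available (the jq filter is only illustrative):

# print job, last scrape time and health for every active target
curl -s http://localhost:9090/api/v1/targets \
  | jq '.data.activeTargets[] | {job: .labels.job, lastScrape: .lastScrape, health: .health}'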

After killing the pod, the Prometheus process became a zombie and we had to restart the Kubernetes node to get the pod running again.
After the restart the log showed:

level=info ts=2018-09-13T11:58:49.221994284Z caller=main.go:222 msg="Starting Prometheus" version="(version=2.3.2, branch=HEAD, revision=71af5e29e815795e9dd14742ee7725682fa14b7b)"
level=info ts=2018-09-13T11:58:49.222074925Z caller=main.go:223 build_context="(go=go1.10.3, user=root@5258e0bd9cc1, date=20180712-14:02:52)"
level=info ts=2018-09-13T11:58:49.222112401Z caller=main.go:224 host_details="(Linux 4.4.0-1065-aws #75-Ubuntu SMP Fri Aug 10 11:14:32 UTC 2018 x86_64 prometheus-server-567fdf7d7-dbhfl (none))"
level=info ts=2018-09-13T11:58:49.222195476Z caller=main.go:225 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2018-09-13T11:58:49.222946146Z caller=web.go:415 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-09-13T11:58:49.222924295Z caller=main.go:533 msg="Starting TSDB ..."
level=info ts=2018-09-13T11:58:49.223969438Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536235200000 maxt=1536256800000 ulid=01CPRAYMF7TN74TBR32NW0N9G2
level=info ts=2018-09-13T11:58:49.224570079Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536256800000 maxt=1536278400000 ulid=01CPRZHT930YQNTMG8E3JPY07J
level=info ts=2018-09-13T11:58:49.225058675Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536278400000 maxt=1536300000000 ulid=01CPSM4ZZ0WYMWRJDR3AJSR02T
level=info ts=2018-09-13T11:58:49.229090824Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536300000000 maxt=1536321600000 ulid=01CPT8R5QRQ1973RK244SD02YB
level=info ts=2018-09-13T11:58:49.229513818Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536321600000 maxt=1536343200000 ulid=01CPTXBBFSRD4JVYCXDSTWHRXG
level=info ts=2018-09-13T11:58:49.231213295Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536343200000 maxt=1536364800000 ulid=01CPVHYH7SC5MJ4G68VPJZ9H6X
level=info ts=2018-09-13T11:58:49.231631891Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536364800000 maxt=1536386400000 ulid=01CPW6HQ3FQ5TCB1A873G90XRE
level=info ts=2018-09-13T11:58:49.232024267Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536386400000 maxt=1536408000000 ulid=01CPWV4WR16G2GH3TGDX75ZERC
level=info ts=2018-09-13T11:58:49.232414711Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536408000000 maxt=1536429600000 ulid=01CPXFR2M7JTGNPJT2JDMSQH86
level=info ts=2018-09-13T11:58:49.232789291Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536429600000 maxt=1536451200000 ulid=01CPY4B88QB6XKFC1W8ER3GHXZ
level=info ts=2018-09-13T11:58:49.233175141Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536451200000 maxt=1536472800000 ulid=01CPYRYE199X2XSBVVVPQFYWHT
level=info ts=2018-09-13T11:58:49.233544619Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536472800000 maxt=1536494400000 ulid=01CPZDHKW125HSTM7FVW1F48Q7
level=info ts=2018-09-13T11:58:49.233945457Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536494400000 maxt=1536516000000 ulid=01CQ024SPNZXT74X9CXS1438DA
level=info ts=2018-09-13T11:58:49.234356567Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536516000000 maxt=1536537600000 ulid=01CQ0PQZDWQWXDH8QRDM082S0P
level=info ts=2018-09-13T11:58:49.234744966Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536537600000 maxt=1536559200000 ulid=01CQ1BB59E3V0AFRCWPDS6JME9
level=info ts=2018-09-13T11:58:49.235196429Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536580800000 maxt=1536588000000 ulid=01CQ1ZY6SGHZQ4WDJ55X54ARW5
level=info ts=2018-09-13T11:58:49.235624086Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536559200000 maxt=1536580800000 ulid=01CQ1ZY9VYFBA0TXW1GBBBKQQF
level=info ts=2018-09-13T11:58:49.236118599Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536588000000 maxt=1536595200000 ulid=01CQ26SY1SECXT7R4WWMZAGV78
level=error ts=2018-09-13T11:58:54.75199523Z caller=wal.go:275 component=tsdb msg="WAL corruption detected; truncating" err="unexpected CRC32 checksum 4ff425ef, want 0" file=/data/wal/000728 pos=167408354
level=info ts=2018-09-13T11:58:54.849435873Z caller=main.go:543 msg="TSDB started"
level=info ts=2018-09-13T11:58:54.849478601Z caller=main.go:603 msg="Loading configuration file" filename=/etc/config/prometheus.yml

So startup was normal except for the WAL corruption. Everything worked again after that. 

The Prometheus version here was 2.3.2, and the Kubernetes persistent volume holds about 2 GB of data.

So my questions are these:
- Has anyone ever experienced something like this? Any idea what happened? The web interface was still working fine; it seems only the scraping part somehow crashed.
- What is the best way to check for this? As Prometheus itself is still running, one could probably check whether the prometheus_build_info metric is present? (One possible check is sketched below.)
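
One self-monitoring idea I am considering (an assumption on my part, not something we have verified): watch the rate at which samples are appended to the TSDB head. If scraping stalls while the process stays up, that rate should drop to zero. Evaluated from a second Prometheus that scrapes the first one's /metrics endpoint, this would look roughly like:

# fires when the watched Prometheus has stopped appending samples
rate(prometheus_tsdb_head_samples_appended_total{job="prometheus"}[5m]) == 0

Evaluating this on the affected instance itself is less reliable, since a stalled scraper stops producing the metric at all after the staleness window and the expression would simply return no data.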

Thanks for your help!
Sebastian

sebastian...@gmail.com

Sep 25, 2018, 11:20:20 AM
to Prometheus Users
Hi all,

the issue hit us again. The error looked the same, but this time I tried using the attached EBS volume directly on the host and it was fine, so I guess the hangup must be somewhere in Prometheus itself.
I will now update to 2.4.2 to see if that improves things.
We are now monitoring the Prometheus state with an alerting rule:
absent(up{job="prometheus"})
which will hopefully alert us if this happens again.
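
For completeness, a minimal sketch of how this could be wrapped in a 2.x rule file (the group and alert names below are made up, and the rule is ideally evaluated by a second Prometheus that scrapes the first, since a fully wedged instance may not evaluate or deliver its own alerts):

groups:
- name: meta-monitoring
  rules:
  - alert: PrometheusSelfScrapeMissing
    expr: absent(up{job="prometheus"})
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "No up metric from the prometheus job for 5 minutes"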

If anyone has an idea what is happening here, any hint would be appreciated!
Sebastian
 

Simon Pasquier

Sep 26, 2018, 5:09:56 AM
to sebastian...@gmail.com, Prometheus Users
If you can access the Prometheus UI (which I presume you can, judging from your initial message), you can try taking a goroutine profile.
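
For reference, a goroutine dump can be fetched over HTTP from the pprof endpoints Prometheus exposes; a minimal sketch, assuming the default port and that the UI is reachable from where you run this:

# full text dump of all goroutine stacks, useful for spotting where scraping is stuck
curl -s 'http://localhost:9090/debug/pprof/goroutine?debug=2' > goroutines.txt

# or, interactively, with the Go toolchain installed
go tool pprof http://localhost:9090/debug/pprof/goroutine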


