Prometheus scraping stalled


sebastian...@gmail.com

Sep 13, 2018, 10:06:39 AM
to Prometheus Users
Hi,

we had a strange issue with our Prometheus installation, which runs in our Kubernetes cluster.
While investigating another issue in the setup, I noticed on the targets page of the web interface that the last scrape for every target was ~65h ago. The log did not show anything interesting:
level=info ts=2018-09-10T13:00:02.398884219Z caller=compact.go:398 component=tsdb msg="write block" mint=1536573600000 maxt=1536580800000 ulid=01CQ1S2FHWQ3MCN2KCX0945BZW
level=info ts=2018-09-10T13:00:02.744459059Z caller=head.go:348 component=tsdb msg="head GC completed" duration=129.962428ms
level=info ts=2018-09-10T13:00:03.427840587Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=683.288113ms
level=info ts=2018-09-10T15:00:02.31455488Z caller=compact.go:398 component=tsdb msg="write block" mint=1536580800000 maxt=1536588000000 ulid=01CQ1ZY6SGHZQ4WDJ55X54ARW5
level=info ts=2018-09-10T15:00:02.683163067Z caller=head.go:348 component=tsdb msg="head GC completed" duration=135.644083ms
level=info ts=2018-09-10T15:00:03.357779834Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=674.543186ms
level=info ts=2018-09-10T15:00:05.40285549Z caller=compact.go:352 component=tsdb msg="compact blocks" count=3 mint=1536559200000 maxt=1536580800000 ulid=01CQ1ZY9VYFBA0TXW1GBBBKQQF sources="[01CQ1BB28SA780693DMQKNV40C 01CQ1J6TQJBVQHJ29RHW9T61RZ 01CQ1S2FHWQ3MCN2KCX0945BZW]"
level=info ts=2018-09-10T17:00:02.269244263Z caller=compact.go:398 component=tsdb msg="write block" mint=1536588000000 maxt=1536595200000 ulid=01CQ26SY1SECXT7R4WWMZAGV78
level=info ts=2018-09-10T17:00:02.632895764Z caller=head.go:348 component=tsdb msg="head GC completed" duration=134.117106ms
level=info ts=2018-09-10T17:00:03.157658829Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=524.698287ms

(scraping stopped around 2018-09-10T17:18)
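
As an aside, rather than eyeballing the targets page, the same last-scrape timestamps can also be read from the HTTP API. A rough sketch, assuming the standard /api/v1/targets endpoint and that jq is available (the jq filter is only illustrative):

# print job, last scrape time and health for every active target
curl -s http://localhost:9090/api/v1/targets \
  | jq '.data.activeTargets[] | {job: .labels.job, lastScrape: .lastScrape, health: .health}'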

After killing the pod, the Prometheus process became a zombie and we had to restart the Kubernetes node to get the pod running again.
After the restart the log showed:

level=info ts=2018-09-13T11:58:49.221994284Z caller=main.go:222 msg="Starting Prometheus" version="(version=2.3.2, branch=HEAD, revision=71af5e29e815795e9dd14742ee7725682fa14b7b)"
level=info ts=2018-09-13T11:58:49.222074925Z caller=main.go:223 build_context="(go=go1.10.3, user=root@5258e0bd9cc1, date=20180712-14:02:52)"
level=info ts=2018-09-13T11:58:49.222112401Z caller=main.go:224 host_details="(Linux 4.4.0-1065-aws #75-Ubuntu SMP Fri Aug 10 11:14:32 UTC 2018 x86_64 prometheus-server-567fdf7d7-dbhfl (none))"
level=info ts=2018-09-13T11:58:49.222195476Z caller=main.go:225 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2018-09-13T11:58:49.222946146Z caller=web.go:415 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-09-13T11:58:49.222924295Z caller=main.go:533 msg="Starting TSDB ..."
level=info ts=2018-09-13T11:58:49.223969438Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536235200000 maxt=1536256800000 ulid=01CPRAYMF7TN74TBR32NW0N9G2
level=info ts=2018-09-13T11:58:49.224570079Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536256800000 maxt=1536278400000 ulid=01CPRZHT930YQNTMG8E3JPY07J
level=info ts=2018-09-13T11:58:49.225058675Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536278400000 maxt=1536300000000 ulid=01CPSM4ZZ0WYMWRJDR3AJSR02T
level=info ts=2018-09-13T11:58:49.229090824Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536300000000 maxt=1536321600000 ulid=01CPT8R5QRQ1973RK244SD02YB
level=info ts=2018-09-13T11:58:49.229513818Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536321600000 maxt=1536343200000 ulid=01CPTXBBFSRD4JVYCXDSTWHRXG
level=info ts=2018-09-13T11:58:49.231213295Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536343200000 maxt=1536364800000 ulid=01CPVHYH7SC5MJ4G68VPJZ9H6X
level=info ts=2018-09-13T11:58:49.231631891Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536364800000 maxt=1536386400000 ulid=01CPW6HQ3FQ5TCB1A873G90XRE
level=info ts=2018-09-13T11:58:49.232024267Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536386400000 maxt=1536408000000 ulid=01CPWV4WR16G2GH3TGDX75ZERC
level=info ts=2018-09-13T11:58:49.232414711Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536408000000 maxt=1536429600000 ulid=01CPXFR2M7JTGNPJT2JDMSQH86
level=info ts=2018-09-13T11:58:49.232789291Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536429600000 maxt=1536451200000 ulid=01CPY4B88QB6XKFC1W8ER3GHXZ
level=info ts=2018-09-13T11:58:49.233175141Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536451200000 maxt=1536472800000 ulid=01CPYRYE199X2XSBVVVPQFYWHT
level=info ts=2018-09-13T11:58:49.233544619Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536472800000 maxt=1536494400000 ulid=01CPZDHKW125HSTM7FVW1F48Q7
level=info ts=2018-09-13T11:58:49.233945457Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536494400000 maxt=1536516000000 ulid=01CQ024SPNZXT74X9CXS1438DA
level=info ts=2018-09-13T11:58:49.234356567Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536516000000 maxt=1536537600000 ulid=01CQ0PQZDWQWXDH8QRDM082S0P
level=info ts=2018-09-13T11:58:49.234744966Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536537600000 maxt=1536559200000 ulid=01CQ1BB59E3V0AFRCWPDS6JME9
level=info ts=2018-09-13T11:58:49.235196429Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536580800000 maxt=1536588000000 ulid=01CQ1ZY6SGHZQ4WDJ55X54ARW5
level=info ts=2018-09-13T11:58:49.235624086Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536559200000 maxt=1536580800000 ulid=01CQ1ZY9VYFBA0TXW1GBBBKQQF
level=info ts=2018-09-13T11:58:49.236118599Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536588000000 maxt=1536595200000 ulid=01CQ26SY1SECXT7R4WWMZAGV78
level=error ts=2018-09-13T11:58:54.75199523Z caller=wal.go:275 component=tsdb msg="WAL corruption detected; truncating" err="unexpected CRC32 checksum 4ff425ef, want 0" file=/data/wal/000728 pos=167408354
level=info ts=2018-09-13T11:58:54.849435873Z caller=main.go:543 msg="TSDB started"
level=info ts=2018-09-13T11:58:54.849478601Z caller=main.go:603 msg="Loading configuration file" filename=/etc/config/prometheus.yml

So startup was normal except for the WAL corruption. Everything worked again after that. 

The Prometheus version here was 2.3.2, and the Kubernetes persistent volume holds about 2 GB of data.

So my questions are these:
- Has anyone ever experienced something like this? Any idea what happened? The web interface was still working fine; it seems only the scraping part somehow crashed.
- What is the best way to check for this? As Prometheus itself is still running, one could probably check whether the prometheus_build_info metric is present? (One possible check is sketched below.)
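
One self-monitoring idea I am considering (an assumption on my part, not something we have verified): watch the rate at which samples are appended to the TSDB head. If scraping stalls while the process stays up, that rate should drop to zero. Evaluated from a second Prometheus that scrapes the first one's /metrics endpoint, this would look roughly like:

# fires when the watched Prometheus has stopped appending samples
rate(prometheus_tsdb_head_samples_appended_total{job="prometheus"}[5m]) == 0

Evaluating this on the affected instance itself is less reliable, since a stalled scraper stops producing the metric at all after the staleness window and the expression would simply return no data.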

Thanks for your help!
Sebastian

sebastian...@gmail.com

Sep 25, 2018, 11:20:20 AM
to Prometheus Users
Hi all,

the issue hit us again. The error looked the same, but this time I tried using the attached EBS volume directly on the host and it was fine, so I guess the hangup must be somewhere in Prometheus itself.
I will now update to 2.4.2 to see if that improves things.
We are now monitoring the Prometheus state with an alerting rule:
absent(up{job="prometheus"})
which will hopefully alert us if this happens again.
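
For completeness, a minimal sketch of how this could be wrapped in a 2.x rule file (the group and alert names below are made up, and the rule is ideally evaluated by a second Prometheus that scrapes the first, since a fully wedged instance may not evaluate or deliver its own alerts):

groups:
- name: meta-monitoring
  rules:
  - alert: PrometheusSelfScrapeMissing
    expr: absent(up{job="prometheus"})
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "No up metric from the prometheus job for 5 minutes"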

If anyone has an idea what is happening here, any hint would be appreciated!
Sebastian
 

Simon Pasquier

Sep 26, 2018, 5:09:56 AM
to sebastian...@gmail.com, Prometheus Users
If you can access the Prometheus UI (which I presume you can, judging from your initial message), you can try taking a goroutine profile.
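
For reference, a goroutine dump can be fetched over HTTP from the pprof endpoints Prometheus exposes; a minimal sketch, assuming the default port and that the UI is reachable from where you run this:

# full text dump of all goroutine stacks, useful for spotting where scraping is stuck
curl -s 'http://localhost:9090/debug/pprof/goroutine?debug=2' > goroutines.txt

# or, interactively, with the Go toolchain installed
go tool pprof http://localhost:9090/debug/pprof/goroutine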


