Whole target dropped when some metrics are old - out of bounds

szymo...@gmail.com

Oct 2, 2018, 3:26:06 PM
to Prometheus Users
I'm scraping data from wM-Bus devices; they usually send data for only a few hours a day. The exporter is written in Node.js: it reads data from a serial device and exposes metrics for all devices at once.

I noticed that when one device stops sending data and its metrics are no longer updated, the whole target is dropped. This starts about an hour after the data stops updating.
What is strange is that it starts working again after head GC and "WAL truncation completed".

The target drop happens at 16:47 UTC every day, and at 17:00 UTC "WAL truncation completed" is logged, so I lose data for around 15 minutes per day.

The difference in the logs is this line:
{"log":"level=warn ts=2018-10-02T16:57:57.484641177Z caller=scrape.go:713 component=\"scrape manager\" scrape_pool=prometheus target=http://192.168.27.101:3000/metrics msg=\"append failed\" err=\"out of bounds\"\n","stream":"stderr","time":"2018-10-02T16:57:57.485174381Z"}

How can I solve this problem?

Before "WAL truncation completed" the logs look like this:
{"log":"level=debug ts=2018-10-02T16:57:57.484383611Z caller=scrape.go:866 component=\"scrape manager\" scrape_pool=prometheus target=http://192.168.27.101:3000/metrics msg=\"Out of bounds metric\" series=\"meters_heat_temperature_c{location=\\\"siemianowicka\\\",direction=\\\"in\\\"}\"\n","stream":"stderr","time":"2018-10-02T16:57:57.485089313Z"}
{"log":"level=debug ts=2018-10-02T16:57:57.48446671Z caller=scrape.go:866 component=\"scrape manager\" scrape_pool=prometheus target=http://192.168.27.101:3000/metrics msg=\"Out of bounds metric\" series=\"meters_heat_temperature_c{location=\\\"siemianowicka\\\",direction=\\\"out\\\"}\"\n","stream":"stderr","time":"2018-10-02T16:57:57.485118977Z"}
{"log":"level=warn ts=2018-10-02T16:57:57.484562854Z caller=scrape.go:948 component=\"scrape manager\" scrape_pool=prometheus target=http://192.168.27.101:3000/metrics msg=\"Error on ingesting samples that are too old or are too far into the future\" num_dropped=4\n","stream":"stderr","time":"2018-10-02T16:57:57.485147441Z"}
{"log":"level=warn ts=2018-10-02T16:57:57.484641177Z caller=scrape.go:713 component=\"scrape manager\" scrape_pool=prometheus target=http://192.168.27.101:3000/metrics msg=\"append failed\" err=\"out of bounds\"\n","stream":"stderr","time":"2018-10-02T16:57:57.485174381Z"}
{"log":"level=debug ts=2018-10-02T16:58:57.508866114Z caller=scrape.go:866 component=\"scrape manager\" scrape_pool=prometheus target=http://192.168.27.101:3000/metrics msg=\"Out of bounds metric\" series=\"meters_heat_energy_joule{location=\\\"siemianowicka\\\"}\"\n","stream":"stderr","time":"2018-10-02T16:58:57.509181124Z"}
{"log":"level=debug ts=2018-10-02T16:58:57.509001113Z caller=scrape.go:866 component=\"scrape manager\" scrape_pool=prometheus target=http://192.168.27.101:3000/metrics msg=\"Out of bounds metric\" series=\"meters_heat_usage_m3{location=\\\"siemianowicka\\\"}\"\n","stream":"stderr","time":"2018-10-02T16:58:57.509261055Z"}
{"log":"level=debug ts=2018-10-02T16:58:57.509093345Z caller=scrape.go:866 component=\"scrape manager\" scrape_pool=prometheus target=http://192.168.27.101:3000/metrics msg=\"Out of bounds metric\" series=\"meters_heat_temperature_c{location=\\\"siemianowicka\\\",direction=\\\"in\\\"}\"\n","stream":"stderr","time":"2018-10-02T16:58:57.509352818Z"}
{"log":"level=debug ts=2018-10-02T16:58:57.509185756Z caller=scrape.go:866 component=\"scrape manager\" scrape_pool=prometheus target=http://192.168.27.101:3000/metrics msg=\"Out of bounds metric\" series=\"meters_heat_temperature_c{location=\\\"siemianowicka\\\",direction=\\\"out\\\"}\"\n","stream":"stderr","time":"2018-10-02T16:58:57.50942039Z"}
{"log":"level=warn ts=2018-10-02T16:58:57.509265519Z caller=scrape.go:948 component=\"scrape manager\" scrape_pool=prometheus target=http://192.168.27.101:3000/metrics msg=\"Error on ingesting samples that are too old or are too far into the future\" num_dropped=4\n","stream":"stderr","time":"2018-10-02T16:58:57.509463386Z"}
{"log":"level=warn ts=2018-10-02T16:58:57.509348931Z caller=scrape.go:713 component=\"scrape manager\" scrape_pool=prometheus target=http://192.168.27.101:3000/metrics msg=\"append failed\" err=\"out of bounds\"\n","stream":"stderr","time":"2018-10-02T16:58:57.509501305Z"}
{"log":"level=debug ts=2018-10-02T16:59:57.484017027Z caller=scrape.go:866 component=\"scrape manager\" scrape_pool=prometheus target=http://192.168.27.101:3000/metrics msg=\"Out of bounds metric\" series=\"meters_heat_energy_joule{location=\\\"siemianowicka\\\"}\"\n","stream":"stderr","time":"2018-10-02T16:59:57.484372848Z"}
{"log":"level=debug ts=2018-10-02T16:59:57.484157198Z caller=scrape.go:866 component=\"scrape manager\" scrape_pool=prometheus target=http://192.168.27.101:3000/metrics msg=\"Out of bounds metric\" series=\"meters_heat_usage_m3{location=\\\"siemianowicka\\\"}\"\n","stream":"stderr","time":"2018-10-02T16:59:57.484482335Z"}
{"log":"level=debug ts=2018-10-02T16:59:57.484286845Z caller=scrape.go:866 component=\"scrape manager\" scrape_pool=prometheus target=http://192.168.27.101:3000/metrics msg=\"Out of bounds metric\" series=\"meters_heat_temperature_c{location=\\\"siemianowicka\\\",direction=\\\"in\\\"}\"\n","stream":"stderr","time":"2018-10-02T16:59:57.484568398Z"}
{"log":"level=debug ts=2018-10-02T16:59:57.484374408Z caller=scrape.go:866 component=\"scrape manager\" scrape_pool=prometheus target=http://192.168.27.101:3000/metrics msg=\"Out of bounds metric\" series=\"meters_heat_temperature_c{location=\\\"siemianowicka\\\",direction=\\\"out\\\"}\"\n","stream":"stderr","time":"2018-10-02T16:59:57.484622026Z"}
{"log":"level=warn ts=2018-10-02T16:59:57.484461815Z caller=scrape.go:948 component=\"scrape manager\" scrape_pool=prometheus target=http://192.168.27.101:3000/metrics msg=\"Error on ingesting samples that are too old or are too far into the future\" num_dropped=4\n","stream":"stderr","time":"2018-10-02T16:59:57.484662502Z"}
{"log":"level=warn ts=2018-10-02T16:59:57.484552979Z caller=scrape.go:713 component=\"scrape manager\" scrape_pool=prometheus target=http://192.168.27.101:3000/metrics msg=\"append failed\" err=\"out of bounds\"\n","stream":"stderr","time":"2018-10-02T16:59:57.484823336Z"}
{"log":"level=info ts=2018-10-02T17:00:08.764955695Z caller=compact.go:398 component=tsdb msg=\"write block\" mint=1538488800000 maxt=1538496000000 ulid=01CRTVJ00W7NAAZ0TNTP4C0T8J\n","stream":"stderr","time":"2018-10-02T17:00:08.765243476Z"}
{"log":"level=info ts=2018-10-02T17:00:08.785146302Z caller=head.go:348 component=tsdb msg=\"head GC completed\" duration=2.876125ms\n","stream":"stderr","time":"2018-10-02T17:00:08.785408092Z"}
{"log":"level=info ts=2018-10-02T17:00:08.869096484Z caller=head.go:357 component=tsdb msg=\"WAL truncation completed\" duration=83.835367ms\n","stream":"stderr","time":"2018-10-02T17:00:08.86932147Z"}

After "WAL truncation completed" logs looks like that:
{"log":"level=debug ts=2018-10-02T17:00:57.484290949Z caller=scrape.go:915 component=\"scrape manager\" scrape_pool=prometheus target=http://192.168.27.101:3000/metrics msg=\"Out of bounds metric\" series=\"meters_heat_energy_joule{location=\\\"siemianowicka\\\"}\"\n","stream":"stderr","time":"2018-10-02T17:00:57.484741546Z"}
{"log":"level=debug ts=2018-10-02T17:00:57.484461348Z caller=scrape.go:915 component=\"scrape manager\" scrape_pool=prometheus target=http://192.168.27.101:3000/metrics msg=\"Out of bounds metric\" series=\"meters_heat_usage_m3{location=\\\"siemianowicka\\\"}\"\n","stream":"stderr","time":"2018-10-02T17:00:57.484859337Z"}
{"log":"level=debug ts=2018-10-02T17:00:57.484581803Z caller=scrape.go:915 component=\"scrape manager\" scrape_pool=prometheus target=http://192.168.27.101:3000/metrics msg=\"Out of bounds metric\" series=\"meters_heat_temperature_c{location=\\\"siemianowicka\\\",direction=\\\"in\\\"}\"\n","stream":"stderr","time":"2018-10-02T17:00:57.484942988Z"}
{"log":"level=debug ts=2018-10-02T17:00:57.484701562Z caller=scrape.go:915 component=\"scrape manager\" scrape_pool=prometheus target=http://192.168.27.101:3000/metrics msg=\"Out of bounds metric\" series=\"meters_heat_temperature_c{location=\\\"siemianowicka\\\",direction=\\\"out\\\"}\"\n","stream":"stderr","time":"2018-10-02T17:00:57.484983764Z"}
{"log":"level=warn ts=2018-10-02T17:00:57.484780281Z caller=scrape.go:948 component=\"scrape manager\" scrape_pool=prometheus target=http://192.168.27.101:3000/metrics msg=\"Error on ingesting samples that are too old or are too far into the future\" num_dropped=4\n","stream":"stderr","time":"2018-10-02T17:00:57.485084539Z"}
{"log":"level=debug ts=2018-10-02T17:01:57.484540532Z caller=scrape.go:915 component=\"scrape manager\" scrape_pool=prometheus target=http://192.168.27.101:3000/metrics msg=\"Out of bounds metric\" series=\"meters_heat_energy_joule{location=\\\"siemianowicka\\\"}\"\n","stream":"stderr","time":"2018-10-02T17:01:57.484895465Z"}
{"log":"level=debug ts=2018-10-02T17:01:57.484691371Z caller=scrape.go:915 component=\"scrape manager\" scrape_pool=prometheus target=http://192.168.27.101:3000/metrics msg=\"Out of bounds metric\" series=\"meters_heat_usage_m3{location=\\\"siemianowicka\\\"}\"\n","stream":"stderr","time":"2018-10-02T17:01:57.48497302Z"}
{"log":"level=debug ts=2018-10-02T17:01:57.484816074Z caller=scrape.go:915 component=\"scrape manager\" scrape_pool=prometheus target=http://192.168.27.101:3000/metrics msg=\"Out of bounds metric\" series=\"meters_heat_temperature_c{location=\\\"siemianowicka\\\",direction=\\\"in\\\"}\"\n","stream":"stderr","time":"2018-10-02T17:01:57.48508342Z"}
{"log":"level=debug ts=2018-10-02T17:01:57.484929617Z caller=scrape.go:915 component=\"scrape manager\" scrape_pool=prometheus target=http://192.168.27.101:3000/metrics msg=\"Out of bounds metric\" series=\"meters_heat_temperature_c{location=\\\"siemianowicka\\\",direction=\\\"out\\\"}\"\n","stream":"stderr","time":"2018-10-02T17:01:57.485154231Z"}


Simon Pasquier

Oct 3, 2018, 4:38:40 AM
to szymo...@gmail.com, Prometheus Users
I presume that your exporter exposes metrics with explicit timestamps and that the timestamp values are too old, which leads the TSDB to reject them.
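
For example, a sample with an explicit timestamp looks like this in the text exposition format (series name taken from your logs; the value and timestamp are made up):

meters_heat_temperature_c{location="siemianowicka",direction="in"} 21.5 1538460000000

The trailing number is a millisecond Unix timestamp. Once such a timestamp falls before the minimum time the TSDB head will still accept, the append fails with "out of bounds", which matches the errors in your logs.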



Szymon O

Oct 3, 2018, 4:43:06 AM
to spas...@redhat.com, promethe...@googlegroups.com
Yes, I attach timestamps.

So how should I handle not having data for a particular timestamp - not publish the metric at all during that time?

Simon Pasquier

Oct 4, 2018, 5:25:45 AM
to szymo...@gmail.com, Prometheus Users
If your exporter receives metrics from the devices (push model), you probably want to drop the timestamps and expire metrics that haven't been updated for a while.
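
For illustration, here is a minimal sketch of that approach using the prom-client library in Node.js (the metric, label names and the 10-minute expiry are placeholders, not your actual exporter code):

const http = require('http');
const client = require('prom-client');

// Series not updated for this long are removed from the output (pick your own value).
const STALE_MS = 10 * 60 * 1000;

const tempGauge = new client.Gauge({
  name: 'meters_heat_temperature_c',
  help: 'Heat meter temperature in Celsius',
  labelNames: ['location', 'direction'],
});

// Remember when each label set was last updated.
const lastSeen = new Map();

// Call this whenever a reading arrives from the serial device.
// No timestamp is passed to set(), so Prometheus stamps the sample with the scrape time.
function onReading(location, direction, value) {
  tempGauge.labels(location, direction).set(value);
  lastSeen.set(location + '|' + direction, Date.now());
}

// Remove label sets that went stale so they simply disappear from /metrics.
function expireStale() {
  const now = Date.now();
  for (const [key, ts] of lastSeen) {
    if (now - ts > STALE_MS) {
      const [location, direction] = key.split('|');
      tempGauge.remove(location, direction);
      lastSeen.delete(key);
    }
  }
}

http.createServer(async (req, res) => {
  expireStale();
  res.setHeader('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
}).listen(3000);

That way a device that stops sending simply has no series in the scrape, Prometheus marks them stale automatically, and no old timestamps are left to make the append fail.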

