query_range returns no results when using some step values


Francois Valiquette

Sep 22, 2020, 7:04:01 PM
to Prometheus Users
Hello,

I'm observing a strange behavior with Prometheus and before upgrading the software or submitting a bug report I would like to know if I'm doing something wrong. 

My issue is that when I make a query_range request, I sometimes get no results depending on the step I set. I've observed that this behavior doesn't occur for all metrics.

Here are some examples:

wget "http://xyz:123/api/v1/query_range?query=elasticsearch_breakers_estimated_size_bytes&start=1600190400&end=1600795200&step=10m"
The query above returns no results: {"status":"success", "data":{"resultType":"matrix","result":[]}}

I observed the same behavior for step=20m

What is odd is that I get results for 9m, 11m, 19m, 21m.  Is this normal?

Thank you

Brian Candler

Sep 23, 2020, 4:09:30 AM
to Prometheus Users
That seems wrong, and I can't reproduce it here.

- What's the prometheus version?
- What's the scrape interval for this metric?
- Does anything get logged by prometheus when you run this query and get empty result? (e.g. "journalctl -eu prometheus" if running under systemd)
- How many timeseries does this metric have?  Does it make a difference if you limit the query to a single TS, e.g. elasticsearch_breakers_estimated_size{foo="bar"} ?
- What happens if you give an instant query with a range vector? This should give you a matrix with the raw ingested data points
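That last check hits the `/api/v1/query` (instant query) endpoint with a range-vector expression, which returns the raw ingested samples rather than step-aligned evaluations. A sketch of how the request URL is built (the `xyz:123` host is the anonymized one from the thread; the brackets must be URL-encoded):

```python
from urllib.parse import urlencode

# An instant query whose expression is a range vector returns a matrix
# of the raw samples ingested over the last hour for each series.
base = "http://xyz:123/api/v1/query"
params = {"query": "elasticsearch_breakers_estimated_size_bytes[1h]"}
url = base + "?" + urlencode(params)
print(url)
# http://xyz:123/api/v1/query?query=elasticsearch_breakers_estimated_size_bytes%5B1h%5D
```

If this returns data while query_range returns an empty matrix, the problem is in how the range query samples the data, not in ingestion.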


Ben Kochie

Sep 23, 2020, 4:09:58 AM
to Francois Valiquette, Prometheus Users
What is your scrape interval for this data? This sounds like staleness handling.

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/ac379bb5-7b82-4bef-b34a-a2bf54c9de93n%40googlegroups.com.

Francois Valiquette

Sep 23, 2020, 3:25:48 PM
to Prometheus Users
- What's the prometheus version?
    -> 2.15.2
- What's the scrape interval for this metric?
    -> scrape_interval: 30m
- Does anything get logged by prometheus when you run this query and get empty result? (e.g. "journalctl -eu prometheus" if running under systemd)
    -> We run it in Kubernetes/Docker and the pod didn't generate logs when I ran that query
- How many timeseries does this metric have?  Does it make a difference if you limit the query to a single TS, e.g. elasticsearch_breakers_estimated_size{foo="bar"} ?
    -> I'm not 100% sure how many time series it has, but Grafana says we have 995 time series for that metric. I've tried to narrow it down as much as possible with {es_master_node="true"}, since we only have 2 masters, and it still returns 0.
- What happens if you give an instant query with a range vector? This should give you a matrix with the raw ingested data points
    -> This works; I see the raw data points.

Stuart Clark

Sep 23, 2020, 3:56:49 PM
to promethe...@googlegroups.com
On 23/09/2020 20:25, Francois Valiquette wrote:
> - What's the prometheus version?
>     -> 2.15.2
> - What's the scrape interval for this metric?
>     -> scrape_interval: 30m


Due to staleness handling, the maximum practical scrape interval is about 2
minutes, so with a 30-minute interval you will be seeing the metric
regularly going stale, which will cause issues with queries. Try lowering
this to 2 minutes.

Brian Candler

Sep 24, 2020, 4:27:02 AM
to Prometheus Users
On Wednesday, 23 September 2020 20:25:48 UTC+1, Francois Valiquette wrote:
- What's the scrape interval for this metric?
    -> scrape_interval: 30m

Aha!

Prometheus only looks back 5 minutes for data points; anything older than that is considered "stale".

Therefore, if you sample the time series with steps of 10 or 20 minutes, you may keep "missing" all the data points, and get no results.  If you sample at 9 or 11 minute intervals, you will hit some (but miss others).

Example: let's say you have data points at t=0min, t=30min, t=60min, t=90min.

The current time is t=118min.

You read data with step=10m.  You are therefore sampling the data at t=118min, t=108min, t=98min etc.

Each sample looks up to 5 minutes in the past.  It picks the latest data point from t=113-118min, 103-108min, 93-98min, 83-88min, 73-78min, 63-68min, 53-58min etc.

It misses the data points at t=90min, t=60min etc.  You get no results, exactly as you see.

Solution: scrape at least every 5 minutes.  Scraping at 2 minutes is strongly recommended, so that a single missed scrape does not cause these sorts of staleness issues.

Don't worry about additional storage utilisation.  Prometheus puts all the data points next to each other and uses delta encoding and compression.

Francois Valiquette

Sep 25, 2020, 12:54:13 PM
to Prometheus Users
Thank you so much, that was super helpful. I've read a bit about staleness after reading your posts. Our elasticsearch_exporter takes more than 1 minute to run, so setting scrape_interval to 2 minutes doesn't seem like a good idea. We are now thinking of increasing the scrape_interval and changing --query.lookback-delta, which controls the default 5-minute lookback. Do you have any thoughts on that?

Brian Candler

Sep 25, 2020, 5:10:12 PM
to Prometheus Users
I'd say it's better to run your elasticsearch exporter at whatever interval makes sense - say once every 30 minutes as you are now - but scrape that value every 2 minutes.

The simplest approach is to run a cronjob every 30 minutes, and write the results to a file for node_exporter's textfile collector.  Or you can send the results to pushgateway.

There are a number of advantages to this - in particular, you can safely scrape from multiple prometheus servers without generating additional load on your elasticsearch node.
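For the textfile-collector route, the important detail is writing the .prom file atomically (write to a temp file, then rename), so node_exporter never scrapes a half-written file. A minimal sketch of what a 30-minute cron job might call; the metric, labels, and output path are illustrative, not from the thread (a real job would query Elasticsearch to get the values):

```python
import os
import tempfile

def write_textfile(metrics, path):
    """Atomically write metrics in the Prometheus text exposition format.

    `metrics` is a list of (name, labels-dict, value) tuples.
    Write-then-rename ensures node_exporter never reads a partial file.
    """
    lines = []
    for name, labels, value in metrics:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    body = "\n".join(lines) + "\n"
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(body)
        os.replace(tmp, path)  # atomic rename on POSIX
    except BaseException:
        os.unlink(tmp)
        raise

# Illustrative value; in production, point `out` at the directory given to
# node_exporter's --collector.textfile.directory flag.
out = os.path.join(tempfile.gettempdir(), "elasticsearch.prom")
write_textfile(
    [("elasticsearch_breakers_estimated_size_bytes",
      {"es_master_node": "true"}, 123456)],
    out,
)
```

Prometheus then scrapes node_exporter every 2 minutes and keeps re-reading the last written value, so nothing ever goes stale between exporter runs.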

Francois Valiquette

Oct 8, 2020, 6:08:17 PM
to Brian Candler, Prometheus Users
Thank you so much! It worked. We set lookback-delta to 15 minutes since this is just a POC, but we might use a pushgateway in our next iteration.

