Prometheus queue manager config needs to be configurable


Colstuwjx

Jul 7, 2017, 3:36:16 AM
to Prometheus Users
Hi team,

Currently, the queue manager is not configurable: the config is hardcoded here (https://github.com/prometheus/prometheus/blob/master/storage/remote/queue_manager.go#L128) and is not part of the startup parameters.
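For reference, the hardcoded defaults behind that link can be sketched as a plain mapping like this (the 100k capacity and the 5 s batch deadline are mentioned elsewhere in this thread; the other names and values are assumptions for illustration, not quoted from the source):

```python
# Sketch of the hardcoded queue manager defaults (illustration only;
# values marked "assumed" are not taken from the source).
DEFAULT_QUEUE_MANAGER_CONFIG = {
    "queue_capacity": 100_000,      # samples buffered per shard queue
    "max_shards": 1000,             # upper bound on parallel senders (assumed)
    "max_samples_per_send": 100,    # batch size per send (assumed)
    "batch_send_deadline_s": 5,     # flush a partial batch after this long
}
print(DEFAULT_QUEUE_MANAGER_CONFIG["queue_capacity"])  # 100000
```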

We should make it configurable. 
BTW, since the queue manager is also part of the Prometheus server, it seems that too many samples could trouble the server itself, for example by slowing rule evaluation, since Prometheus is monolithic. Is there a better way, or any plan for this? For example, the Cortex approach, with fully separated components working together (https://docs.google.com/document/d/1C7yhMnb1x2sfeoe45f4mnnKConvroWhJ8KQZwIHJOuw/edit#).

Thanks.

Tom Wilkie

Jul 7, 2017, 5:28:41 AM
to Colstuwjx, Prometheus Users
Hi Colstuwjx,

> We should make it configurable.

We decided not to make it configurable to begin with, as the aim of the code is to dynamically adapt to the given situation, adding and removing shards to try to flush samples at the current ingest rate with a maximum delay of 5s.  I fully expect there are situations where it does the wrong thing; do you have an example of one?
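The adaptive behaviour described here can be sketched roughly like this (an illustration of the idea only, with made-up function and parameter names, not the actual Prometheus implementation):

```python
import math

# Sketch: pick enough parallel senders (shards) to keep up with the
# incoming sample rate, given the observed time to send one sample.
def desired_shards(samples_in_per_sec: float, sec_per_sample: float) -> int:
    # Seconds of send work generated per wall-clock second; each shard
    # can do one second of work per second, so round up.
    work_per_sec = samples_in_per_sec * sec_per_sample
    return max(1, math.ceil(work_per_sec))

print(desired_shards(40_000, 0.0001))  # 40k samples/s at 100 µs/sample -> 4
```

Prometheus recomputes an estimate like this periodically, so a sustained rise in send latency shows up as a growing shard count.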

> Is there any better way or just plan about this?

The queueing code is designed to cap the amount of memory it uses, so as not to bother the Prometheus server.  If it can't flush samples quickly enough, it will drop them.
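That cap-and-drop behaviour can be sketched like this (a Python illustration of the idea, not the actual Go code):

```python
import queue

# Sketch of "cap memory, drop when full": a non-blocking put on a
# bounded queue that drops (and counts) samples once the queue is at
# capacity, instead of blocking the server.
def enqueue(q: queue.Queue, sample, stats: dict) -> None:
    try:
        q.put_nowait(sample)
    except queue.Full:
        stats["dropped"] += 1  # drop rather than grow memory without bound

q = queue.Queue(maxsize=2)  # tiny capacity for illustration
stats = {"dropped": 0}
for s in range(5):
    enqueue(q, s, stats)
print(q.qsize(), stats["dropped"])  # 2 3
```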

When using Cortex we tend to run Prometheus without any local storage, relying on Cortex for queries, rule evaluation and alerting.  In this mode, Prometheus acts as the agents collecting and forwarding samples to Cortex.  We haven't found it to be a bottleneck.  Have you?

Thanks

Tom




Colstuwjx

Jul 7, 2017, 6:49:05 AM
to Prometheus Users, cols...@gmail.com
I just set up a Prometheus server and configured it with remote storage. Because there are too many targets to scrape, the remote storage queue fills up quickly, so I'd like to configure the maximum number of samples the queue manager holds.

As Cortex has not been merged into Prometheus upstream, what is your suggestion about this?

Tom Wilkie

Jul 7, 2017, 7:32:14 AM
to Colstuwjx, Prometheus Users
Do you have any logs from Prometheus?  It sounds like your remote storage is too slow.  Can you take a screenshot of the following queries:

- 90th percentile send batch latency: `histogram_quantile(0.9, sum(rate(prometheus_remote_storage_sent_batch_duration_seconds_bucket[5m])) by (le,queue))`  
- Rate of dropped samples: `sum(rate(prometheus_remote_storage_failed_samples_total[1m])) by (queue)`

Thanks

Tom

Colstuwjx

Jul 7, 2017, 9:02:34 PM
to Prometheus Users, cols...@gmail.com
The queue capacity is 100k, and the number of shards grew from 2 to 4. For one queue, 12.5M samples were sent successfully in 5 minutes. There is no failed_samples_total metric; instead, I found that dropped_samples_total grew by 140k samples in 5 minutes. Graphs are shown below:

(figure 1: 90th percentile send batch latency)

(figure 2: succeeded samples rate)

(figure 3: dropped samples rate)

Any suggestions about this? My local storage is doing fine, and it shows 480k in-memory series stored.
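For context, a back-of-the-envelope conversion of those numbers into per-second rates (5 minutes = 300 s):

```python
# Rates implied by the figures above (5 min = 300 s).
succeeded_per_sec = 12_500_000 / 300   # ~41,667 samples/s sent
dropped_per_sec = 140_000 / 300        # ~467 samples/s dropped
drop_fraction = 140_000 / (12_500_000 + 140_000)
print(round(succeeded_per_sec), round(dropped_per_sec),
      f"{drop_fraction:.1%}")  # 41667 467 1.1%
```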

Tom Wilkie

Jul 11, 2017, 11:41:23 AM
to Colstuwjx, Prometheus Users
Hi Colstuwjx,

Sorry for the delay.  Those graphs look fine to me; some initially dropped samples during the period where Prometheus ramps up the number of shards are to be expected.  What were you expecting?

Thanks

Tom
