Prometheus Remote Write Large Volume of Metrics


YY Wan

Oct 23, 2019, 2:04:04 PM
to Prometheus Users
Hi,

I'm trying to remote write a large volume of metrics (~0.5 to 1 million samples/s) to a remote storage backend. However, I'm running into an issue where Prometheus can't remote write fast enough to keep up with the rate of ingested samples.

With the default remote write configuration, which should in theory be able to send 1 million samples/s, and an ingested samples rate of 400k/s, Prometheus only remote writes at a rate fluctuating between 70k/s and 200k/s.

[Screenshot attached: Screen Shot 2019-10-23 at 10.48.42 AM.png]
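
For reference, here is how I'm comparing the two rates in PromQL (a rough sketch; these are the Prometheus 2.x metric names, which have been renamed across versions, so they may differ on yours):

      # ingested samples per second
      rate(prometheus_tsdb_head_samples_appended_total[5m])

      # samples successfully remote-written per second
      rate(prometheus_remote_storage_succeeded_samples_total[5m])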



The most I've been able to get Prometheus to remote write so far is 250k samples/s, but only with a different remote write configuration:

[Screenshot attached: Screen Shot 2019-10-23 at 10.54.16 AM.png]

      queue_config:
        capacity: 4000 # default = 500
        max_shards: 500 # default = 1000
        min_shards: 50 # default = 1
        max_samples_per_send: 256 # default = 100
        # batch_send_deadline: 5s # default = 5s
        # min_backoff: 30ms # default = 30ms
        # max_backoff: 100ms # default = 100ms
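
For context, this queue_config sits under a remote_write entry in prometheus.yml; a minimal sketch (the URL is a placeholder for my M3 coordinator endpoint, not the literal value):

      remote_write:
        - url: "http://m3coordinator:7201/api/v1/prom/remote/write" # placeholder host
          queue_config:
            capacity: 4000
            max_shards: 500
            min_shards: 50
            max_samples_per_send: 256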


I then tried a similar configuration with an ingested samples rate of 400k samples/s; the only change was to raise max_shards to 750. However, while the remote write rate fluctuates between 200k/s and 450k/s, metrics from the remote write destination show that the samples are falling behind. The number of shards staying pinned at the maximum of 750 also indicates that Prometheus is falling behind, since it is trying to send at its maximum throughput but can't keep up.
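
One way to see that pinning directly is to graph the shard gauges (again assuming 2.x metric names; the desired-shards gauge may not exist on older versions):

      # how many shards remote write wants vs. has vs. is allowed
      prometheus_remote_storage_shards_desired
      prometheus_remote_storage_shards
      prometheus_remote_storage_shards_max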

[Screenshot attached: Screen Shot 2019-10-23 at 10.59.45 AM.png]

      queue_config:
        capacity: 4000 # default = 500
        max_shards: 750 # default = 1000
        min_shards: 50 # default = 1
        max_samples_per_send: 256 # default = 100
        # batch_send_deadline: 5s # default = 5s
        # min_backoff: 30ms # default = 30ms
        # max_backoff: 100ms # default = 100ms



I do not think the bottleneck is at the remote write destination, since I have scaled it (by increasing replicas; I am using M3) proportionally to the number of ingested samples.

Any help with how to tune Prometheus remote write configurations for larger metrics volumes would be much appreciated!

Aliaksandr Valialkin

Oct 23, 2019, 5:22:20 PM
to YY Wan, Prometheus Users
Hi YY,

Which Prometheus version do you use? Try upgrading to the latest version (at least v2.12.0, but v2.13.1 is better), since previous versions had suboptimal resharding logic for remote_write, which could hurt performance - see https://github.com/prometheus/prometheus/pull/5763 .

Also I'd recommend increasing max_samples_per_send to 10000 and reducing max_shards to 100 for your setup.
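
In config terms, that would be something like this (a sketch of the suggested values, keeping the rest of your settings as they are):

      queue_config:
        capacity: 4000
        max_shards: 100             # reduced from 500
        min_shards: 50
        max_samples_per_send: 10000 # increased from 256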



--
Best Regards,

Aliaksandr

YY Wan

Nov 4, 2019, 1:52:37 AM
to Prometheus Users
Hi Aliaksandr,

Thanks! I have upgraded to 2.13.1 as you recommended, and tried increasing max_samples_per_send.

Increasing the batch size does seem to improve it.

I tried increasing it to 10k per batch. However, the write request latency increased a lot, the remote destination returned a lot of write errors, and remote write started to fall behind as well.

Changing it to 1k per batch seems more stable. In the first graph below, which spans 12 hours, the number of shards does not stay at the maximum of 1000, which indicates that Prometheus is more or less keeping up. It does seem to recover less and less quickly as the queue size increases, though. I'm not sure why the queue size keeps increasing. (Any ideas? The ingested samples rate seems unchanged, and the write latency observed from the remote write destination's perspective isn't climbing either.)
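
To watch the queue growth, I've been graphing the pending-samples gauge (metric name as of v2.13; it has been renamed in some releases, so it may differ on other versions):

      # samples buffered in the remote write queue, not yet sent
      prometheus_remote_storage_pending_samples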

[Screenshot attached: Screen Shot 2019-11-03 at 10.50.49 PM.png]



The batch size of each remote write request does seem very sensitive, though: when I set it to around 500 or 2000, neither worked (samples were delayed more and more from the beginning, whereas with a batch size of 1000 it seems to keep up, at least over the first day so far).

Next I'm trying to decrease the maximum shards to around 200, as you suggested; the sketch below shows what I mean. Maybe too many shards are bad for remote write performance when the batch size is large.
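
The configuration I plan to try next (a sketch; the other settings stay unchanged):

      queue_config:
        capacity: 4000
        max_shards: 200
        min_shards: 50
        max_samples_per_send: 1000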


Aliaksandr Valialkin

Nov 4, 2019, 10:42:36 AM
to YY Wan, Prometheus Users
Which remote storage do you use that doesn't handle big batches? Usually remote storage prefers big batches of samples over small ones, since big batches have lower per-sample overhead. Perhaps you should try another remote storage for Prometheus with better performance and lower resource usage? For example, try VictoriaMetrics - https://github.com/VictoriaMetrics/VictoriaMetrics . It shows good numbers in benchmarks - https://medium.com/@valyala/measuring-vertical-scalability-for-time-series-databases-in-google-cloud-92550d78d8ae - and it can be used as a Prometheus datasource in Grafana.
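
Pointing Prometheus at it is a one-line change in prometheus.yml (the hostname below is illustrative; single-node VictoriaMetrics accepts the Prometheus remote write protocol on port 8428 by default):

      remote_write:
        - url: "http://victoriametrics:8428/api/v1/write"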
