Debugging Help: Remote storage queue full


dshi...@vt.edu

Apr 20, 2017, 11:55:22 AM
to Prometheus Users
Hi,

I'm having some issues writing to remote storage and am looking for some advice on debugging it.

I'm running Prometheus v1.6.0 in Docker.  My Prometheus is currently scraping about 400 targets on a 15-second interval.  I have about 300,000 local memory series, which consume about 13.7 GiB (I have 95 GiB as the target max heap).  Prometheus is currently sending about 192 remote writes every second at my writer application, which amounts to about 1,000,000 individual samples per minute at my writer.
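
To put those numbers together, here is a back-of-the-envelope check (illustrative arithmetic only, using just the figures quoted above, not anything measured from this setup):

```java
// Back-of-the-envelope check of the rates described above (illustrative only).
public class IngestRateCheck {
    public static void main(String[] args) {
        int activeSeries = 300_000;            // in-memory series
        double scrapeIntervalSeconds = 15.0;   // scrape interval
        double samplesPerSecond = activeSeries / scrapeIntervalSeconds;    // 20,000 samples/s
        double samplesPerMinute = samplesPerSecond * 60;                   // ~1.2M/min, i.e. "about 1,000,000"
        double remoteWritesPerSecond = 192;    // remote-write requests per second
        double samplesPerWrite = samplesPerSecond / remoteWritesPerSecond; // ~104 samples per request
        System.out.printf("%.0f samples/s, %.0f samples/min, ~%.0f samples per write%n",
                samplesPerSecond, samplesPerMinute, samplesPerWrite);
    }
}
```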

Periodically I get a 'remote storage queue full, multiple subsequent message...' warning for a brief blip of time, and then it clears.  I've profiled my remote writer application's HTTP endpoint and it is at about 1us at the 0.95 quantile and never more than about 40us at the 0.999 quantile, so I don't think my remote writer application is too slow in handling the requests.

I've been digging into the machine running Prometheus, and I don't see anything obvious that would starve it of CPU.

Do you have any suggestions on which Prometheus metrics I could inspect to see whether Prometheus itself is the issue?  I thought maybe Prometheus GC might be the cause, but I don't see anything obvious.

Failing that, are there any ways to configure Prometheus to have a larger remote queue?  Maybe I can just buffer more so this is less likely to happen?

My only other thought is that it might be the network between my Prometheus and my writer.  However, I'm not sure how to track that down.

Thanks,
Dan

Tom Wilkie

Apr 20, 2017, 12:35:02 PM
to dshi...@vt.edu, Prometheus Users
Hi Dan

We added a bunch of metrics here -
https://github.com/prometheus/prometheus/blob/master/storage/remote/queue_manager.go#L50.

The best thing to look at would be
prometheus_remote_storage_sent_batch_duration_seconds; see what the
99th percentile latency is from the client. You can also look at
prometheus_remote_storage_queue_length to see the queue backing up.

Do you see a lot of "Remote storage resharding" in your logs? Full
logs would be useful.

> Failing that, are there any ways to configure Prometheus to have a larger remote queue? Maybe I can just buffer more so this is less likely to happen?

We didn't make this configurable in this release, as we expected the
dynamic resharding to be able to cope with throughput up to ~1 million
samples/second.

Thanks

Tom

dshi...@vt.edu

Apr 20, 2017, 2:26:09 PM
to Prometheus Users, dshi...@vt.edu
Hi Tom,

Thanks for the quick reply.

Just to double-check, my query should be 'histogram_quantile(0.99, prometheus_remote_storage_sent_batch_duration_seconds_bucket[1m]))'?  That gives me some interesting info.  It looks like the 99th percentile hovers around 5ms, so that seems reasonable.  However, right when I see my queue full messages in the log, my 99.9th percentile shoots up to 1.2 seconds!

Meanwhile, my remote storage shard count stays at 2.  However, two days ago when I upgraded to v1.6.0, I also updated my writer.  For the brief moment it was down, Prometheus started resharding.  It got up to 10.  Then, when my writer was back, it slowly settled down.  It appears that before all this took place it was at 1; however, it has never gone back down to 1.

Side note: my writer is a Spring/Java application and I had to write the Filter that handles the Content-Encoding: snappy decompression.  That runs before my receive function, which is what is being timed.  I'll get some metrics added into my Filter to verify that it isn't the cause of the blips.
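
A minimal sketch of what such a filter might look like (not the actual code discussed in this thread): it assumes the org.xerial.snappy library for block-format decompression, Spring's StreamUtils for reading the body, the Servlet 3.1 API, and the Prometheus Java simpleclient for the timing metric; the metric name is made up.

```java
// Hypothetical sketch only -- not the filter discussed in this thread.
import io.prometheus.client.Histogram;
import org.springframework.util.StreamUtils;
import org.xerial.snappy.Snappy;

import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletRequestWrapper;
import java.io.ByteArrayInputStream;
import java.io.IOException;

public class SnappyDecompressionFilter implements Filter {

    // Hypothetical metric name; times the whole decompression step.
    private static final Histogram DECOMPRESS_SECONDS = Histogram.build()
            .name("writer_snappy_decompress_seconds")
            .help("Time spent snappy-decompressing remote-write request bodies.")
            .register();

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest httpReq = (HttpServletRequest) req;
        if (!"snappy".equalsIgnoreCase(httpReq.getHeader("Content-Encoding"))) {
            chain.doFilter(req, res);
            return;
        }

        final byte[] uncompressed;
        Histogram.Timer timer = DECOMPRESS_SECONDS.startTimer();
        try {
            // Prometheus remote write sends the protobuf body as block-format snappy.
            byte[] compressed = StreamUtils.copyToByteArray(httpReq.getInputStream());
            uncompressed = Snappy.uncompress(compressed);
        } finally {
            timer.observeDuration();
        }

        // Hand the decompressed bytes to the rest of the chain (e.g. the protobuf converter).
        chain.doFilter(new HttpServletRequestWrapper(httpReq) {
            @Override
            public ServletInputStream getInputStream() {
                ByteArrayInputStream body = new ByteArrayInputStream(uncompressed);
                return new ServletInputStream() {
                    @Override public int read() { return body.read(); }
                    @Override public boolean isFinished() { return body.available() == 0; }
                    @Override public boolean isReady() { return true; }
                    @Override public void setReadListener(ReadListener listener) { }
                };
            }
            @Override public int getContentLength() { return uncompressed.length; }
        }, res);
    }

    @Override public void init(FilterConfig filterConfig) { }
    @Override public void destroy() { }
}
```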

Tom Wilkie

Apr 20, 2017, 2:57:59 PM
to dshi...@vt.edu, Prometheus Users
Query should be `histogram_quantile(0.99, rate(prometheus_remote_storage_sent_batch_duration_seconds_bucket[1m]))`.

If you're hovering around 5ms for a 100-sample batch, and only doing 20k samples/s, you'd expect to only need one shard.  It looks like your backend periodically has large latency spikes, and this is causing samples to get dropped.
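
The arithmetic behind that estimate, written out (illustrative only; this is not the actual resharding code):

```java
// Illustrative arithmetic for the estimate above -- not the Prometheus resharding code.
public class ShardEstimate {
    public static void main(String[] args) {
        double batchSendSeconds = 0.005;     // ~5ms per batch (the observed 99th percentile)
        int samplesPerBatch = 100;           // the 100-sample batch referred to above
        double perShardThroughput = samplesPerBatch / batchSendSeconds;   // ~20,000 samples/s per shard
        double ingestRate = 20_000;          // roughly what this Prometheus is sending
        double shardsNeeded = Math.ceil(ingestRate / perShardThroughput); // -> 1 shard
        System.out.printf("per-shard capacity: %.0f samples/s, shards needed: %.0f%n",
                perShardThroughput, shardsNeeded);
    }
}
```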

In this case the queues back up and drop samples, as Prometheus can't reshard quickly enough.  The resharding isn't designed to cope with changes like this, TBH, and in this case I think it's okay to drop samples.  Currently the queues are a fixed size and per shard - we could probably do something like making the queue longer when there is a small number of shards, or even make a fixed-length global queue that is shared amongst the shards.

Tom



dshi...@vt.edu

Apr 20, 2017, 4:08:36 PM
to Prometheus Users, dshi...@vt.edu
OK.  Those numbers make sense WRT the batch size and my measured rates; I should be throwing about 17k samples/s at my writer.  Is it normal that, once it has exceeded one shard, it never settles back down to one?  That's what I'm seeing.  I just restarted my writer (with additional timing metrics) and saw the same behavior: up to 10 shards while my writer was down, and then it worked back down to two shards.

So right off the bat, I see that the timer I put around my Snappy decompression filter is running around 4ms per call.  I don't know what else in the Spring/Tomcat HTTP request chain is also happening in there; I'll assume it is just about everything (I know the stream-to-protobuf converter is in there).  So that seems to line up with the ~5ms per call seen from Prometheus.

I'll report back again later once I've experienced some more blips and let you know whether I caught where the time was going.

As to the queue sizing, I'm not at all familiar with how the queue size and shards relate.  I would think there is some reasonable way to estimate the potential 'blip' size based on the current record rate and the histogram of the write call time.  I.e., you might say that if the current record rate is X records per second and the .999 quantile of writes is 10 times the .99 quantile, then you need a 10X buffer size?  With some reasonable max cap.
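
Sketched out, that heuristic might look roughly like this (illustrative only; not something Prometheus implements, and the cap value is an arbitrary placeholder):

```java
// Sketch of the buffer-sizing heuristic described above -- illustrative only.
public class BufferSizeHeuristic {
    static final long MAX_BUFFERED_SAMPLES = 1_000_000;  // arbitrary placeholder cap

    static long suggestedBufferSize(double recordsPerSecond, double q99Seconds, double q999Seconds) {
        double blipFactor = q999Seconds / q99Seconds;          // e.g. 10 if the tail is 10x slower
        long suggested = (long) Math.ceil(recordsPerSecond * blipFactor);
        return Math.min(suggested, MAX_BUFFERED_SAMPLES);
    }

    public static void main(String[] args) {
        // 20k records/s with a .999 quantile 10x the .99 quantile -> buffer of ~200k samples.
        System.out.println(suggestedBufferSize(20_000, 0.005, 0.05));
    }
}
```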

Tom Wilkie

Apr 20, 2017, 4:16:54 PM
to dshi...@vt.edu, Prometheus Users
> Is it normal that, once it has exceeded one shard, it never settles back down to one?

It should settle back down; however, there is some hysteresis in the system to stop it flapping too much.

> I.e., you might say that if the current record rate is X records per second and the .999 quantile of writes is 10 times the .99 quantile, then you need a 10X buffer size?

This would be possible, but it's not what we do right now.  We track the average batch time and size the shards accordingly.  We then have a fixed-size buffer per shard.

The main design constraint on this system was not using too much memory in Prometheus - we don't want the remote write buffering to OOM the process.  At the end of the day, the remote write API will always potentially drop samples if the remote end is slow.

Would you mind sending us a screenshot of the graph of `histogram_quantile(0.99, rate(prometheus_remote_storage_sent_batch_duration_seconds_bucket[1m]))`?

Thanks

Tom


prase...@gmail.com

Apr 21, 2017, 10:07:46 AM
to Prometheus Users
Has anybody tried this with a container? Please paste your Dockerfile.

I ask because I am not able to find the "remote_storage_adapter" executable in the "prom/prometheus" version 1.6 Docker image:

/prometheus # find / -name remote_storage_adapter
/prometheus #

dshi...@vt.edu

Apr 21, 2017, 10:40:46 AM
to Prometheus Users, dshi...@vt.edu

Tom,

Makes sense about the hysteresis.

So I ran with the additional metrics overnight.  I think the problem must be in my application.  There seems to be a tight correlation between when Prometheus blips and when the total call time (as seen from my SnappyDecompressionFilter) increases.  My guess is Java GC.  I'm swapping out my underlying embedded Tomcat in favor of Undertow, since it supposedly uses less memory, and also forcing my I/O thread count up.  We'll see.
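
Roughly what that swap might look like, as a sketch (assuming Spring Boot 1.x with spring-boot-starter-undertow replacing spring-boot-starter-tomcat; the thread count here is just an example, not a recommendation):

```java
// Sketch only: customize the embedded Undertow container's XNIO I/O thread count
// (assumes Spring Boot 1.x and spring-boot-starter-undertow on the classpath).
import org.springframework.boot.context.embedded.undertow.UndertowEmbeddedServletContainerFactory;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class UndertowConfig {

    @Bean
    public UndertowEmbeddedServletContainerFactory servletContainer() {
        UndertowEmbeddedServletContainerFactory factory = new UndertowEmbeddedServletContainerFactory();
        // Force the I/O thread count up; 8 is just an example value.
        factory.addBuilderCustomizers(builder -> builder.setIoThreads(8));
        return factory;
    }
}
```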

Either way, I don't think it is Prometheus, so I'll end here.

Thanks again for all your help in debugging this.

Dan

Tom Wilkie

Apr 21, 2017, 12:12:39 PM
to dshi...@vt.edu, Prometheus Users
Thanks for letting me know.  I'll give some thought now to making the remote queuing stuff actually deal with occasional blips - I've filed https://github.com/prometheus/prometheus/issues/2648 for more discussion.

Tom
