Exposing extra options in the remote write configuration

Joaquin Fernandez Campo

Aug 28, 2020, 5:21:12 AM
to Prometheus Developers
Hi!

My team is facing an issue using remote write with Prometheus on a Consul Connect service mesh.

Right now we have Prometheus configured to remote write to a Consul Connect service mesh endpoint (http://localhost:19090). Consul takes care of distributing connections across every discovered copy of our remote write service.

Prometheus seems to establish long-lived TCP connections, and at first, when two copies of the remote service are running, they are balanced.

The problem comes up when one of the copies goes away for any reason. Prometheus detects this and establishes new TCP connections to the localhost endpoint, but because Consul only has one copy running at that point, all of them end up going to that copy. When the other copy comes back up, it sits there doing nothing.

To fix it, I think we would need to expose some more configuration options on remote write, specifically the ones that are set here (https://github.com/prometheus/prometheus/blob/d30f202c08a7bf4109f18c755c2cbc6a067666bb/vendor/github.com/prometheus/common/config/http_config.go#L158-L174), and ideally a maximum number of sends per HTTP client.

I wanted to write this to see what the dev team thinks and whether this change makes sense. I do think at least the keep-alive option should be exposed so that TCP connections are not reused. We could probably also get the behavior we expect by lowering the idle timeouts, so maybe exposing those two options would be enough.

What do y'all think? Has anyone faced this kind of issue using another service mesh? 

Brian Brazil

Aug 28, 2020, 5:29:07 AM
to Joaquin Fernandez Campo, Prometheus Developers
On Fri, 28 Aug 2020 at 10:21, Joaquin Fernandez Campo <jfc...@gmail.com> wrote:
> Hi!
>
> My team is facing an issue using remote write with Prometheus on a Consul Connect service mesh.
>
> Right now we have Prometheus configured to remote write to a Consul Connect service mesh endpoint (http://localhost:19090). Consul takes care of distributing connections across every discovered copy of our remote write service.
>
> Prometheus seems to establish long-lived TCP connections, and at first, when two copies of the remote service are running, they are balanced.
>
> The problem comes up when one of the copies goes away for any reason. Prometheus detects this and establishes new TCP connections to the localhost endpoint, but because Consul only has one copy running at that point, all of them end up going to that copy. When the other copy comes back up, it sits there doing nothing.
>
> To fix it, I think we would need to expose some more configuration options on remote write, specifically the ones that are set here (https://github.com/prometheus/prometheus/blob/d30f202c08a7bf4109f18c755c2cbc6a067666bb/vendor/github.com/prometheus/common/config/http_config.go#L158-L174), and ideally a maximum number of sends per HTTP client.

This code is used in many other places in Prometheus and its ecosystem, so any changes would have to make sense in all the other contexts too. For example, it doesn't make sense to periodically reestablish the HTTP connection used for scraping, and it would confuse users of the blackbox exporter, which doesn't use persistent connections.

> I wanted to write this to see what the dev team thinks and whether this change makes sense. I do think at least the keep-alive option should be exposed so that TCP connections are not reused. We could probably also get the behavior we expect by lowering the idle timeouts, so maybe exposing those two options would be enough.
>
> What do y'all think? Has anyone faced this kind of issue using another service mesh?

It sounds like the issue here is with your load balancing setup. I'd suggest tackling it at that level and ensuring it does request-based rather than connection-based balancing, instead of complicating the configuration for everyone.
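
To illustrate the difference between the two balancing modes, here is a small, self-contained Go simulation. It is a toy model under stated assumptions, not Consul or Prometheus code: the backend names `a` and `b`, the function names, and the connection and request counts are all hypothetical.

```go
package main

import "fmt"

// connectionBased models a proxy that picks a backend once per TCP
// connection. Backend "b" goes away and later returns, as in the
// scenario described in this thread.
func connectionBased(conns, requestsPerConn int) map[string]int {
	backends := []string{"a", "b"}

	// Phase 1: both backends are up; connections are spread round-robin.
	assigned := make([]string, conns)
	for i := range assigned {
		assigned[i] = backends[i%len(backends)]
	}

	// Phase 2: "b" goes away; its connections are redialed and land on "a".
	for i, b := range assigned {
		if b == "b" {
			assigned[i] = "a"
		}
	}

	// Phase 3: "b" is back, but no existing connection is redialed,
	// so every request still follows its connection to "a".
	counts := map[string]int{"a": 0, "b": 0}
	for _, b := range assigned {
		counts[b] += requestsPerConn
	}
	return counts
}

// requestBased models a proxy that picks a backend for every request,
// regardless of which client connection the request arrived on.
func requestBased(conns, requestsPerConn int) map[string]int {
	backends := []string{"a", "b"} // both healthy again
	counts := map[string]int{"a": 0, "b": 0}
	for i := 0; i < conns*requestsPerConn; i++ {
		counts[backends[i%len(backends)]]++
	}
	return counts
}

func main() {
	fmt.Println("connection-based:", connectionBased(4, 100))
	fmt.Println("request-based:", requestBased(4, 100))
}
```

With 4 connections sending 100 requests each after the second copy has recovered, the connection-based model sends all 400 requests to `a`, while the request-based model splits them 200 and 200. Whether a given mesh can balance per request depends on its proxy configuration, which this sketch does not model.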
