Use remote-write instead of federation

tejaswini vadlamudi

Jul 18, 2022, 12:21:47 PM
to Prometheus Users
Can someone point me to the advantages of using remote-write over federation?
I understand that remote-write is more of a standard interface in the monitoring domain.
Are there any handy performance measurements that were observed/recorded?

Thanks, Teja

Stuart Clark

Jul 18, 2022, 12:49:45 PM
to tejaswini vadlamudi, Prometheus Users
They are really quite different.

Federation is a way of pulling data from a remote Prometheus into
(generally) a local one. The puller gets to choose how often to pull
data and what data to fetch. If the puller can't fetch the data for any
reason (local/remote outage, network issues, etc.) there will be gaps.
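
As a rough illustration of the pull model described above, a federation job on the pulling Prometheus might look like the sketch below. The target host names and the `match[]` selectors are placeholders made up for this example; only `/federate`, `honor_labels` and `match[]` come from the standard federation setup.

```yaml
# Sketch of a federation scrape job on the pulling (local) Prometheus.
# Host names and selectors are hypothetical.
scrape_configs:
  - job_name: federate
    scrape_interval: 60s      # the puller decides how often to pull
    honor_labels: true        # keep the job/instance labels set by the source
    metrics_path: /federate
    params:
      'match[]':              # the puller decides which series to fetch
        - '{job="kubelet"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          - source-prometheus-1:9090
          - source-prometheus-2:9090
```

Because this is an ordinary scrape job, a failed pull simply leaves a gap for that interval; nothing is retried or back-filled later.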

Remote write is a way of pushing data from a Prometheus server to
"something else", which could be another Prometheus or one of the many
things which implement the API (e.g. various databases, Thanos, custom
analytics tools, etc.). For these you get all the data (basically as
soon as it has been scraped) with the ability to do filtering via
relabeling. If there is an outage/disconnect, data will be queued for a
while (too long and things will get lost) so small issues can be handled
transparently.
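
For comparison, a minimal sketch of the push side, configured on each sending Prometheus. The URL, the metric-name regex and the queue values are assumptions chosen for illustration, not recommendations.

```yaml
# Sketch of remote write on a sending Prometheus.
# URL, regex and queue values are hypothetical.
remote_write:
  - url: http://central-prometheus:9090/api/v1/write
    write_relabel_configs:
      # Filtering via relabeling: forward only series whose name matches.
      - source_labels: [__name__]
        regex: 'container_.*|kube_.*|up'
        action: keep
    queue_config:
      capacity: 10000         # samples buffered per shard
      max_shards: 50          # upper bound on parallel senders
```

If the destination is itself a Prometheus server, it has to be started with `--web.enable-remote-write-receiver` to accept these writes.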

So you have a difference in what data you get - either all (filtered)
data or data on a schedule (so in effect a form of built-in
downsampling), and who controls that - either the data source Prometheus
or the destination.

Which is "better" depends on what you are trying to achieve and the
constraints you might have (for example difficulties with accepting
network connections or data storage/transfer limits). Don't forget the
organisational differences too: with remote write, adding or changing a
destination (or filter rules) requires changes to every data source
Prometheus, whereas federation is controlled purely at the other end,
which might be a good or bad thing depending on team
responsibilities/timings.

--
Stuart Clark

tejaswini vadlamudi

Jul 18, 2022, 1:00:30 PM
to Prometheus Users
Hello Stuart, 

I have 4 Prometheus instances in the same cluster.
  • Instance-1, monitoring k8s & cadvisor
  • Instance-2, monitoring workload-1 in namespace-1
  • Instance-3, monitoring workload-2 in namespace-2
  • Instance-4, the central one, collecting metrics from the other 3 instances (for global querying and alerting).

I'm not sure whether federation is a good fit for this sort of deployment pattern.

Thanks, Teja


Ben Kochie

Jul 18, 2022, 2:16:27 PM
to tejaswini vadlamudi, Prometheus Users
I would probably skip federation and remote write with that setup and use Thanos to create a single pane view of all of them.

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/a174584b-8f1d-4ab8-bcda-bfae9401af0en%40googlegroups.com.

Stuart Clark

Jul 18, 2022, 2:52:59 PM
to tejaswini vadlamudi, Prometheus Users
On 18/07/2022 18:00, tejaswini vadlamudi wrote:
> Hello Stuart,
>
> I have the 4 Prometheus instances in the same cluster.
>   • Instance-1, monitoring k8s & cadvisor
>   • Instance-2, monitoring workload-1 in namespace-1
>   • Instance-3, monitoring workload-2 in namespace-2
>   • Instance-4 is the central one collecting metrics from all 3 instances (for global querying and alerting). not sure if the federation is a good fit for this sort of deployment pattern.

What's the reason for having all the different instances? Are these all full instances of Prometheus (with local storage) or using agent mode?

If you are just going to copy everything to the "central" instance on the same cluster, why not do without the extra three instances and have just the one instance that monitors everything?

-- 
Stuart Clark

tejaswini vadlamudi

Jul 18, 2022, 4:53:13 PM
to Prometheus Users
@Ben: Thanks for the suggestion! I heard that remote-write consumes more system resources (CPU, for example) than federation. I can test and cross-check it myself, but I would like to hear feedback from the Prometheus experts.
@Stuart: Ideally, it is possible to manage the complete stack with instance-1 but the current case is about deploying and monitoring multiple workloads/software owned by different vendors.

/Teja

Ben Kochie

Jul 18, 2022, 5:02:57 PM
to tejaswini vadlamudi, Prometheus Users
Yes, Thanos will eliminate the need for instance-4. At the same time it's more efficient because it doesn't use remote write or federation. It can query data from all your Prometheus instances.

tejaswini vadlamudi

Jul 19, 2022, 8:24:09 AM
to Prometheus Users
@Ben: Good point, but bringing Thanos or Cortex into the picture is something we could do later. For now, do you think it is good enough to use remote-write instead of federation? From a performance and resource-consumption point of view, do you see remote-write as the way forward?

Thanks, Teja

Stuart Clark

Jul 19, 2022, 11:02:11 AM
to tejaswini vadlamudi, Prometheus Users
On 19/07/2022 13:24, tejaswini vadlamudi wrote:
> @Ben: Makes a point, but getting Thanos or Cortex into the picture
> could be a way forward after some time. For now, do you think it is
> good enough to use remote-write instead of federation?  From a
> performance and resource consumption POV, do you see remote-write as
> the way-forward?
>
With remote write you could use agent mode, so you don't have to have
local storage other than for the destination instance.
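
A minimal sketch of what an agent-mode sender for one of the workload instances could look like. The file name, the `prometheus_instance` label and the service URL are assumptions for illustration; the namespace and job names follow the layout described earlier in the thread.

```yaml
# prometheus-agent.yml (hypothetical file name)
# Started with something like:
#   prometheus --enable-feature=agent --config.file=prometheus-agent.yml
# (the agent flag as of Prometheus 2.x at the time of this thread)
global:
  scrape_interval: 30s
  external_labels:
    prometheus_instance: instance-2   # assumed label identifying the sender
scrape_configs:
  - job_name: workload-1
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [namespace-1]
remote_write:
  - url: http://instance-4.monitoring.svc:9090/api/v1/write
```

In agent mode there is no local querying, alerting or rule evaluation, so this only fits if those are all done on the destination instance.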

However again it depends what you are trying to achieve and why you have
suggested having four instances. Are you wanting to query all four
instances or only the "global" one? Are you wanting to copy all data to
the "global" instance or only some metrics? Every data point, or only at
a lower frequency?

If you are intending to copy all data (both metrics & data points) that
leans towards remote write as federation works differently. But in that
case there doesn't seem to be any advantage in having the extra three
instances at all (unless you are intending on doing local querying,
alerting or recording rules) - so I'd just have a single instance that
scrapes all namespaces.

Alternatively if you are needing to have separate instances with local
storage/querying then I'd probably not look to copy all the data to the
"global" instance (which just doubles storage and memory usage) and
either use remote write for a much smaller subset of metrics, federation
with a slower scrape rate/reduced set of metrics, or as Ben suggested
something like Thanos (other options exist as well) to do away with the
fourth instance entirely and distribute the queries to the individual
instances instead.

Maybe if you could explain a bit about what the design is hoping to
achieve it would help us advise better?

--
Stuart Clark

tejaswini vadlamudi

Jul 20, 2022, 12:16:29 PM
to Prometheus Users
@Stuart: I agree with most of your points :-) I see remote-write as the most appropriate way of forwarding metrics for my deployment use case.
Federation is not a good fit in terms of interface standardization, HA of the monitoring stack, and feature support.
In the case above, I have functions and a dedicated set of engineers who own each workload and query the individual instances, while the global instance is used for centralized monitoring.
I was looking at this closed bug, raised against Prometheus in summer 2019. To my understanding, there were performance issues with remote-write, but most of them have been resolved, and the community now sees remote-write as performing better than federation. Am I thinking correctly?
Could you clarify the performance comparison between remote-write and federation?

/Teja

Oleg Gumbar

Sep 8, 2022, 1:01:31 AM
to Prometheus Users
I have a similar case. I would like to use remote-write to collect metrics from multiple namespaces/clusters, but federation seems much more reliable to me. A federation endpoint is just another scrape target: in case of a network failure (or any other failure) I will get an alert that the federation endpoint is down. With remote write I risk being left blind. I see no clear mechanism to be sure I'm getting the metrics =/

What are the possible solutions in this case?

Brian Candler

Sep 8, 2022, 5:31:49 AM
to Prometheus Users
> Federation endpoint is just another scrape target - in case of network failure (or any other failure) I will get an alert that federation endpoint is down

This is true.  However the flip side is that remote_write buffers metrics while the network is down, whereas federation will not back-fill any historical data when the network comes back up.

You can alert on a remote_write endpoint going away, as described here:

I think you can make a generic alert against loss of *any* remote write sender - something like this (untested):
up{prometheus_agent="true"} offset 1h unless up

(i.e. "alert if the given metric/timeseries was present one hour ago but isn't present now")
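
Wrapped in a rule file on the receiving Prometheus, the expression above might look like the following sketch. Like the expression itself, it is untested. The group name, alert name, `for` duration and severity are made up for this example, and it assumes every sender attaches a `prometheus_agent="true"` label (for instance via `external_labels`).

```yaml
# Sketch of an alerting rule on the receiving side; names are hypothetical.
groups:
  - name: remote-write-senders
    rules:
      - alert: RemoteWriteSenderMissing
        # Series that existed one hour ago but are absent now.
        expr: up{prometheus_agent="true"} offset 1h unless up
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "No recent samples from {{ $labels.instance }} via remote write"
```

Note that the alert can only fire during the hour after a series disappears; once the `offset 1h` window has moved past the last received sample, the expression returns nothing again.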