Losing metrics for the time when federation node is down

Shrikant Keni

Nov 26, 2020, 11:35:18 PM
to Prometheus Users
Hi,

For one of our use cases we are trying Prometheus federation to scrape metrics from the master federation node (using the attached yml configuration), and I am able to see the metrics on the local Prometheus node.

But when the local Prometheus federation node goes down and comes back up after some interval, we see a gap in the graph for that interval.

Is there any configuration to scrape the metrics from the time when the last successful scrape happened?

Or is there any other way we can solve this issue?
 

Regards,
Shrikant Keni









Message has been deleted

b.ca...@pobox.com

Nov 27, 2020, 4:02:05 AM
to Prometheus Users
On Friday, 27 November 2020 at 04:35:18 UTC shrika...@gmail.com wrote:
Is there any configuration to scrape the metrics from the time when the last successful scrape happened?

No.  Prometheus has no "back-fill" capability, or any ability to ingest historical data.
 
Or is there any other way we can solve this issue?

Using something other than federation.  remote_write is able to buffer up data locally if the endpoint is down.

Prometheus itself can't accept remote_write requests, so you'd have to write to some other system which can.  I suggest VictoriaMetrics, as it's simple to run and has a very prometheus-like API, which can be queried as if it were a prometheus instance.
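
For example, a minimal remote_write block in prometheus.yml might look something like this (the hostname is just a placeholder for wherever your VictoriaMetrics instance is listening; port 8428 and /api/v1/write are the defaults for single-node VictoriaMetrics):

remote_write:
  # while the endpoint is unreachable, Prometheus keeps retrying and replays
  # samples from its local WAL once the endpoint comes back up
  - url: "http://victoria-metrics.example.com:8428/api/v1/write"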

Ben Kochie

Nov 27, 2020, 4:11:26 AM
to b.ca...@pobox.com, Prometheus Users
On Fri, Nov 27, 2020 at 10:02 AM b.ca...@pobox.com <b.ca...@pobox.com> wrote:
On Friday, 27 November 2020 at 04:35:18 UTC shrika...@gmail.com wrote:
Is there any configuration to scrape the metrics from the time when the last successful scrape happened?

No.  Prometheus has no "back-fill" capability, or any ability to ingest historical data.

 
 
Or is there any other way we can solve this issue?

Using something other than federation.  remote_write is able to buffer up data locally if the endpoint is down.

Prometheus itself can't accept remote_write requests, so you'd have to write to some other system which can.  I suggest VictoriaMetrics, as it's simple to run and has a very prometheus-like API, which can be queried as if it were a prometheus instance.

I recommend Thanos, as it scales better and with less effort than VictoriaMetrics. It also uses PromQL code directly, so you will get the same results as Prometheus, not an emulation of PromQL.
 
 


b.ca...@pobox.com

Nov 27, 2020, 4:41:09 AM
to Prometheus Users
> Soon! https://github.com/prometheus/prometheus/pull/8084

That'll be a cool feature for data migration - thanks!

Unfortunately I don't think it helps with the OP's problem, unless:
1. federation is extended to query historical data (with range queries?)
2. prometheus supports a remote_write endpoint for backfilling

I can suggest another option for the OP though. They can run *two* independent federation nodes, such that they are never both down at the same time. Then put promxy in front of them, and query via that.  At query time, it will fill gaps in one server with data from the other.
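
For reference, a minimal promxy config for that setup might look roughly like this (the hostnames are placeholders for the two federation nodes):

promxy:
  server_groups:
    # promxy queries both servers and merges the results at query time,
    # so a gap on one node is filled in from the data on the other
    - static_configs:
        - targets:
            - prom-federate-a:9090
            - prom-federate-b:9090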

Re Thanos: is there a simple one-process deployment for that yet?  What I like about VictoriaMetrics is that I can run
/opt/victoria-metrics/victoria-metrics-prod -storageDataPath=/var/lib/victoria-metrics/data -retentionPeriod=6 \
    -httpAuth.username=admin -httpAuth.password=XXXXXX
and I'm done.

Ben Kochie

Nov 27, 2020, 4:52:31 AM
to b.ca...@pobox.com, Prometheus Users
On Fri, Nov 27, 2020 at 10:41 AM b.ca...@pobox.com <b.ca...@pobox.com> wrote:
> Soon! https://github.com/prometheus/prometheus/pull/8084

That'll be a cool feature for data migration - thanks!

Unfortunately I don't think it helps with the OP's problem, unless:
1. federation is extended to query historical data (with range queries?)
2. prometheus supports a remote_write endpoint for backfilling

No, it doesn't solve federation gaps. Only remote_write and Thanos do this correctly.

IMO, Federation is mostly useful in extremely high-scale environments where the metrics on the "scrape Prometheus" instances are mostly used for recording rule rollups, and only the federation data is used.
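
For reference, that rollup pattern on the global node is typically a /federate scrape that only pulls the aggregated recording rules, something like this (the job: prefix and the target names are just conventions/placeholders):

scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'   # only the recording-rule rollups
    static_configs:
      - targets:
          - 'scrape-prom-1:9090'
          - 'scrape-prom-2:9090'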

With remote_write and Thanos, I mostly see Federation as a legacy technique. It was an easy form of rollup and aggregation for large scale that was implemented in the early days.
 

I can suggest another option for the OP though. They can run *two* independent federation nodes, such that they are never both down at the same time. Then put promxy in front of them, and query via that.  At query time, it will fill gaps in one server with data from the other.

Re Thanos: is there a simple one-process deployment for that yet?  What I like about VictoriaMetrics is that I can run
/opt/victoria-metrics/victoria-metrics-prod -storageDataPath=/var/lib/victoria-metrics/data -retentionPeriod=6 \
    -httpAuth.username=admin -httpAuth.password=XXXXXX
and I'm done.


Ben Kochie

Nov 27, 2020, 4:57:38 AM
to b.ca...@pobox.com, Prometheus Users
On Fri, Nov 27, 2020 at 10:52 AM Ben Kochie <sup...@gmail.com> wrote:
On Fri, Nov 27, 2020 at 10:41 AM b.ca...@pobox.com <b.ca...@pobox.com> wrote:
> Soon! https://github.com/prometheus/prometheus/pull/8084

That'll be a cool feature for data migration - thanks!

Unfortunately I don't think it helps with the OP's problem, unless:
1. federation is extended to query historical data (with range queries?)
2. prometheus supports a remote_write endpoint for backfilling

No, it doesn't solve federation gaps. Only remote_write and Thanos do this correctly.

IMO, Federation is mostly useful in extremely high-scale environments where the metrics on the "scrape Prometheus" instances are mostly used for recording rule rollups, and only the federation data is used.

With remote_write and Thanos, I mostly see Federation as a legacy technique. It was an easy form of rollup and aggregation for large scale that was implemented in the early days.
 

I can suggest another option for the OP though. They can run *two* independent federation nodes, such that they are never both down at the same time. Then put promxy in front of them, and query via that.  At query time, it will fill gaps in one server with data from the other.

Re Thanos: is there a simple one-process deployment for that yet?  What I like about VictoriaMetrics is that I can run
/opt/victoria-metrics/victoria-metrics-prod -storageDataPath=/var/lib/victoria-metrics/data -retentionPeriod=6 \
    -httpAuth.username=admin -httpAuth.password=XXXXXX
and I'm done.

Forgot to reply to this one. No, not yet. I've made two proposals: a `./thanos standalone` mode, which is possible, and Epimitheus, a Prometheus with no scraping, just a receiver, for small use cases.

But, someone has to write the code. :)

Aliaksandr Valialkin

Nov 29, 2020, 5:51:11 AM
to Ben Kochie, b.ca...@pobox.com, Prometheus Users
On Fri, Nov 27, 2020 at 11:11 AM Ben Kochie <sup...@gmail.com> wrote:
 
Or is there any other way we can solve this issue?

Using something other than federation.  remote_write is able to buffer up data locally if the endpoint is down.

Prometheus itself can't accept remote_write requests, so you'd have to write to some other system which can.  I suggest VictoriaMetrics, as it's simple to run and has a very prometheus-like API, which can be queried as if it were a prometheus instance.

I recommend Thanos, as it scales better and with less effort than VictoriaMetrics. It also uses PromQL code directly, so you will get the same results as Prometheus, not an emulation of PromQL.


Could you share more details on why you think that VictoriaMetrics has scalability issues and is harder to set up and operate than Thanos? VictoriaMetrics users have quite the opposite opinion. See https://victoriametrics.github.io/CaseStudies.html and https://medium.com/faun/comparing-thanos-to-victoriametrics-cluster-b193bea1683 .

--
 
Aliaksandr Valialkin, CTO VictoriaMetrics

Ben Kochie

Nov 29, 2020, 8:10:54 AM
to Aliaksandr Valialkin, b.ca...@pobox.com, Prometheus Users
Thanos uses object storage, which avoids the need for manual sharding of TSDB storage. Today I have 100TiB of data stored in object storage buckets. I make no changes to scale up or down these buckets.

This object storage design also works when Thanos is in remote-write mode, rather than sidecar mode.
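
As a rough sketch, sidecar mode is just a second process next to each Prometheus that uploads completed TSDB blocks to a bucket (the paths, URL and bucket name below are placeholders):

thanos sidecar \
    --tsdb.path=/var/lib/prometheus \
    --prometheus.url=http://localhost:9090 \
    --objstore.config-file=bucket.yml

with bucket.yml being a small objstore definition, e.g. for GCS:

type: GCS
config:
  bucket: "my-thanos-metrics"   # placeholder bucket name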

Aliaksandr Valialkin

Nov 30, 2020, 7:08:40 AM
to Ben Kochie, b.ca...@pobox.com, Prometheus Users
On Sun, Nov 29, 2020 at 3:10 PM Ben Kochie <sup...@gmail.com> wrote:
On Sun, Nov 29, 2020 at 11:51 AM Aliaksandr Valialkin <val...@gmail.com> wrote:


On Fri, Nov 27, 2020 at 11:11 AM Ben Kochie <sup...@gmail.com> wrote:
 
Or is there any other way we can solve this issue?

Using something other than federation.  remote_write is able to buffer up data locally if the endpoint is down.

Prometheus itself can't accept remote_write requests, so you'd have to write to some other system which can.  I suggest VictoriaMetrics, as it's simple to run and has a very prometheus-like API, which can be queried as if it were a prometheus instance.

I recommend Thanos, as it scales better and with less effort than VictoriaMetrics. It also uses PromQL code directly, so you will get the same results as Prometheus, not an emulation of PromQL.


Could you share more details on why you think that VictoriaMetrics has scalability issues and is harder to set up and operate than Thanos? VictoriaMetrics users have quite the opposite opinion. See https://victoriametrics.github.io/CaseStudies.html and https://medium.com/faun/comparing-thanos-to-victoriametrics-cluster-b193bea1683 .

Thanos uses object storage, which avoids the need for manual sharding of TSDB storage. Today I have 100TiB of data stored in object storage buckets. I make no changes to scale up or down these buckets.


VictoriaMetrics stores data on persistent disks. Every replicated durable persistent disk in GCP can scale up to 64TB without the need to stop VictoriaMetrics, i.e. without downtime. Given that VictoriaMetrics compresses real-world data much better than Prometheus, a single-node VictoriaMetrics can substitute for the whole Thanos cluster for your workload (in theory of course - just give it a try in order to verify this statement :) ). The cluster version of VictoriaMetrics can scale to petabytes. For example, a cluster with one petabyte of capacity can be built from 16 vmstorage nodes with a 64TB persistent disk on each node. That's why VictoriaMetrics in production usually has lower infrastructure costs than Thanos.


--
Best Regards,

Aliaksandr Valialkin, CTO VictoriaMetrics

b.ca...@pobox.com

Nov 30, 2020, 7:22:21 AM
to Prometheus Users
On Monday, 30 November 2020 at 12:08:40 UTC val...@gmail.com wrote:
VictoriaMetrics stores data on persistent disks.

In the case of AWS, which I'm more familiar with, EBS (block) storage is about 5 times more expensive than S3 (object). Furthermore, you pay for the provisioned EBS space even when it is not occupied by data.

A quick look suggests the same is true for GCP:

Persistent disks: $0.17/GB (SSD) or $0.10/GB (balanced)
Cloud Storage: $0.02/GB (standard)
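
To put rough numbers on it, taking the 100TiB figure mentioned earlier as an example: that's roughly 102,400GB, i.e. about $10,240/month on balanced persistent disks versus about $2,048/month in standard Cloud Storage at these list prices, a 5x difference.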

Ben Kochie

Nov 30, 2020, 7:23:16 AM
to Aliaksandr Valialkin, b.ca...@pobox.com, Prometheus Users
* GCP persistent disk costs double that of object storage, and is zone-local only.
* The cost is four times as much if you want regional replication.
* GCP persistent disks don't have multi-regional replication (GCS does by default).
* Object storage versioning makes for easy lifecycle management for disaster recovery.
* Plus you have to maintain some percentage of unused filesystem overhead to avoid running out of space.
* You can't shrink persistent disks.
* And we're back to manual labor required to scale.

Storing on persistent disks is a major reason why we don't just use Prometheus for TSDB storage: it's an instance-level SPoF, persistent disks cost more than object storage, and there's the toil involved.

No thanks, we're moving away from old-school architectures.

b.ca...@pobox.com

Nov 30, 2020, 7:27:12 AM
to Prometheus Users
It looks like there are also Standard (HDD) persistent disks available, at $0.04/GB. Still twice as expensive as Cloud Storage though, even if HDD performance is acceptable.

Cloud Storage also has other options which could make sense if the bulk of the data is rarely accessed, like Nearline ($0.01/GB).

Ben Kochie

Nov 30, 2020, 7:27:35 AM
to b.ca...@pobox.com, Prometheus Users
Yes, I don't have enough info on whether object storage speed is comparable to SSD or HDD persistent disks. So I went the "cheap" way and compared with HDD. If you wanted to use SSD, it'd be even more expensive.

On the other hand, GCP persistent disks are implemented on top of Google's object storage, so the IOPS you can get are quite impressive, especially at the larger single volume sizes.

But the big impact for me is that GCS is multi-regionally replicated, which means I can read and write to it from multiple geo locations.


Ben Kochie

Nov 30, 2020, 7:31:32 AM
to b.ca...@pobox.com, Prometheus Users
We've talked about writing a small tool for Thanos to automatically move the raw chunk data (not indexes) of older blocks to Nearline storage, but keep the downsampled data and indexes in Standard storage.

On Mon, Nov 30, 2020 at 1:27 PM b.ca...@pobox.com <b.ca...@pobox.com> wrote:
It looks like there are also Standard (HDD) persistent disks available, at $0.04/GB. Still twice as expensive as Cloud Storage though, even if HDD performance is acceptable.

Cloud Storage also has other options which could make sense if the bulk of the data is rarely accessed, like Nearline ($0.01/GB).
