Is there a way to scrape historical time series data from an existing Prometheus server?


goel...@gmail.com

Mar 20, 2018, 11:56:59 PM
to Prometheus Users
I am trying to figure out if there is a way to scrape historical time series data from an existing Prometheus server. 

The use case is to create an HA configuration:

1. Start with two Prometheus instances (say A and B).
2. A is configured in scraping mode, while B is configured as a federation destination of A (roughly the config sketched below).
3. B fails.
4. A new instance, say C, is started as a federation destination of A.
5. While C will be able to collect all metrics from the point in time after it started, it does not get the historical metrics that are present on A.
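
For reference, B's federation scrape config would be roughly the following (the target address is just a placeholder for A):

    scrape_configs:
      - job_name: 'federate'
        honor_labels: true
        metrics_path: '/federate'
        params:
          'match[]':
            - '{job!=""}'              # pull every series A has scraped
        static_configs:
          - targets:
              - 'prometheus-a:9090'    # placeholder address of instance A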


Brian Brazil

Mar 21, 2018, 4:09:58 AM
to goel...@gmail.com, Prometheus Users

goel...@gmail.com

Mar 21, 2018, 11:57:46 AM
to Prometheus Users
Thanks Brian.

I failed to mention that the setup is under a cluster orchestrator (e.g. Kubernetes) which can re-spawn a failed instance. It is also possible to dynamically change the configuration file to convert the non-scraping Prometheus instance into a scraping instance.

Bottom line: my environment does NOT allow me to run two Prometheus instances both in scraping mode. Nevertheless, in an HA setup the problem of re-syncing a replacement Prometheus instance (launched after one has failed) with old time series data still remains. The problem is the same even if the HA configuration is made up of two "scraping" Prometheus instances. There has to be a mechanism to re-sync the replacement Prometheus instance that allows it to get the historical time series data from a point in time "before" the replacement was started. Otherwise, the HA setup is good only until the first failure. While it is possible to launch a new instance after one has failed, without resync capability you end up with one instance with a full time series and the other with a partial time series.

-Atul

Brian Brazil

Mar 21, 2018, 12:02:28 PM
to Atul Goel, Prometheus Users
On 21 March 2018 at 15:57, <goel...@gmail.com> wrote:
Thanks Brian.

I failed to mention that the setup is under a cluster orchestrator (e.g. Kubernetes) which can re-spawn a failed instance. It is also possible to dynamically change the configuration file to convert the non-scraping Prometheus instance into a scraping instance.

Bottom line: my environment does NOT allow me to run two Prometheus instances both in scraping mode.

You can't avoid a SPOF in such a setup. I'd suggest using a Kubernetes volume with a DaemonSet so the restarted instance has the old data.
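
A minimal sketch of that idea, assuming node-local storage via hostPath (names, paths and the image tag below are illustrative, not from an actual setup):

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: prometheus
    spec:
      selector:
        matchLabels:
          app: prometheus
      template:
        metadata:
          labels:
            app: prometheus
        spec:
          containers:
            - name: prometheus
              image: prom/prometheus:v2.2.1
              args:
                - --config.file=/etc/prometheus/prometheus.yml
                - --storage.tsdb.path=/prometheus
              volumeMounts:
                - name: data
                  mountPath: /prometheus      # TSDB survives pod restarts on the same node
          volumes:
            - name: data
              hostPath:
                path: /var/lib/prometheus     # node-local directory; illustrative path

Whether a DaemonSet or a StatefulSet with a persistent volume fits better depends on the cluster; the point is simply that the data directory outlives the pod.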

Brian
 
Ben Kochie

Mar 21, 2018, 12:23:38 PM
to Brian Brazil, Atul Goel, Prometheus Users
I would follow this advice.

If you want another option, I would look into Thanos or Cortex to enable cluster storage.
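
For what it's worth, with a remote-storage backend such as Cortex the Prometheus side is essentially just a remote_write block (the URL is a placeholder, not a real endpoint):

    remote_write:
      - url: http://cortex-gateway/api/prom/push    # placeholder endpoint of the remote store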

goel...@gmail.com

Mar 22, 2018, 4:04:44 AM
to Prometheus Users
Thanks, guys.

Just to confirm, the suggestion is to rely on "highly available shared" clustered storage for the time series database. That way the restarted Prometheus instance will indeed have the historical time series data.

However, it's not clear how to handle the time series data that was still in memory and hadn't yet been flushed to disk on the failed Prometheus instance. Based on the default checkpoint interval, I guess this could be five minutes' worth of time series. So, are we saying that we are still exposed to a "checkpoint-interval" worth of data loss, unless there is a way to plug this hole by querying the non-failed instance of Prometheus?

A cursory look at Thanos appears promising.




Brian Brazil

Mar 22, 2018, 4:09:44 AM
to Atul Goel, Prometheus Users
On 22 March 2018 at 08:04, <goel...@gmail.com> wrote:
Thanks, guys.

Just to confirm, the suggestion is to rely on "highly available shared" clustered storage for the time series database. That way the restarted Prometheus instance will indeed have the historical time series data.

However, it's not clear how to handle the time series data that was still in memory and hadn't yet been flushed to disk on the failed Prometheus instance. Based on the default checkpoint interval, I guess this could be five minutes' worth of time series. So, are we saying that we are still exposed to a "checkpoint-interval" worth of data loss, unless there is a way to plug this hole by querying the non-failed instance of Prometheus?

There is no non-failed instance, because you say you can only have one scraper. Avoiding a SPOF in Prometheus depends on having at least two scrapers.
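
A common way to get two scrapers is to run two identically configured instances that differ only in an external label, roughly as follows (label name and target are illustrative):

    global:
      external_labels:
        replica: prometheus-a                  # e.g. prometheus-b on the second, otherwise identical, instance
    scrape_configs:
      - job_name: 'node'
        static_configs:
          - targets: ['node-exporter:9100']    # placeholder target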

Brian
 

goel...@gmail.com

Mar 22, 2018, 4:51:43 AM
to Prometheus Users
Apologies. I guess I should have been even clearer; the configuration is as follows:

1) Two Prometheus instances: one in scraping mode ("A"), and the other as a federation destination ("B").
2) If "A" fails, then "B" is converted to scraping mode and hence becomes the primary. Recovery involves starting a new Prometheus instance "C" and making it a federation destination of "B".
3) If, however, "B" had failed, then "A" is still the primary, and recovery involves starting a new instance "C" as a federation destination of "A".

The problem is getting "C" to have the historical time-series data from before it was started.

Based on your suggestion, if both "A" and "B" were each using "highly available remote storage", then when "C" is spawned it could be made to point to the remote storage that was being used by the failed instance.

My question is that doing the above still doesn't address the hole of the checkpoint interval's worth of time series data that did not get flushed to disk.

Btw, as far as I understand, the above problem exists even if there were always two instances, "A" and "B", each configured in scraping mode.






Brian Brazil

Mar 22, 2018, 5:02:40 AM
to Atul Goel, Prometheus Users
On 22 March 2018 at 08:51, <goel...@gmail.com> wrote:
Apologies. I guess I should have been even clearer; the configuration is as follows:

1) Two Prometheus instances: one in scraping mode ("A"), and the other as a federation destination ("B").
2) If "A" fails, then "B" is converted to scraping mode and hence becomes the primary. Recovery involves starting a new Prometheus instance "C" and making it a federation destination of "B".
3) If, however, "B" had failed, then "A" is still the primary, and recovery involves starting a new instance "C" as a federation destination of "A".

This is not a sane use of federation, see https://www.robustperception.io/federation-what-is-it-good-for/

You should accept that you will only have one Prometheus scraping, and not over-complicate things by adding components that don't help. A storage volume that's kept across restarts is simpler and better than anything else you're considering.
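
For illustration, the "kept across restarts" part can be as simple as a persistent volume claim mounted at the data path, as an alternative to the hostPath sketch earlier in the thread (name and size are placeholders):

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: prometheus-data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 100Gi                       # placeholder size; mount the claim at --storage.tsdb.path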

Brian
 

goel...@gmail.com

Mar 22, 2018, 5:24:28 AM
to Prometheus Users
Thanks Brian. 

Could you please also clarify:
a) whether there is indeed a hole where time series data that was not yet flushed to disk would be lost in the event of a failure, i.e. even with a storage volume that is kept across restarts; and
b) if so, whether there is a way to plug this hole.




Brian Brazil

Mar 22, 2018, 5:49:13 AM
to Atul Goel, Prometheus Users
On 22 March 2018 at 09:24, <goel...@gmail.com> wrote:
Thanks Brian.

Could you please also clarify:
a) whether there is indeed a hole where time series data that was not yet flushed to disk would be lost in the event of a failure, i.e. even with a storage volume that is kept across restarts; and
b) if so, whether there is a way to plug this hole.

Prometheus 2.2.1 has a WAL which minimises this. Gaps are a fact of life when it comes to monitoring and can be caused by many different things; it's generally not worth chasing after the 0.1% of your data which is missing.
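
Concretely (directory names below assume the default 2.x TSDB layout), the most recent samples live in the wal/ directory under the data path and are replayed when the instance restarts on the same volume:

    /prometheus/                # --storage.tsdb.path
        wal/                    # write-ahead log, replayed on restart
        01C.../                 # persisted blocks (ULID-named directories)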

Brian
 



goel...@gmail.com

Mar 22, 2018, 10:01:31 AM
to Prometheus Users
Thanks Brian. 

