How to preserve the two hours of in-memory Prometheus data during upgrade/failover


Shaam Dinesh

Mar 30, 2020, 11:39:45 AM
to Prometheus Users
Hi Team,

Current setup:

I am running a Prometheus setup along with Thanos for extended storage, to accommodate 2 months of data for longer persistence (a rough sketch of the commands involved follows the list below):

  1. 2 instances of Prometheus (replica 0 / replica 1), both running with the Thanos sidecar enabled
  2. The Thanos sidecar subsequently writes data to a GCS bucket for extended long-term storage
  3. Thanos Querier connected to the two instances, and reading bucket data via the Store Gateway
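For context, each Prometheus/sidecar pair is started roughly along these lines (a sketch only; the paths, URL and bucket config file are placeholders, not the exact values):

    # Prometheus: keep local blocks at 2h so the sidecar can upload completed blocks promptly
    prometheus \
      --storage.tsdb.path=/data/prometheus \
      --storage.tsdb.min-block-duration=2h \
      --storage.tsdb.max-block-duration=2h

    # Thanos sidecar: watches the TSDB directory and uploads finished blocks to the GCS bucket
    thanos sidecar \
      --tsdb.path=/data/prometheus \
      --prometheus.url=http://localhost:9090 \
      --objstore.config-file=/etc/thanos/gcs.yaml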
Challenge/Issue in current setup:

Despite having long-term storage, we could still lose the latest/most recent 2 hours of data (storage.tsdb.max-block-duration=2h). How do we handle the cases below?

  • A backup/snapshot of the old instance during an upgrade is an option, but it cannot be seamless: it only helps post-facto and does not provide fault tolerance
  • 2 HA instances of Prometheus handle DR scenarios, but in the extreme case where both instances fail, all 2 hours of data are lost
Question:

Is there any efficient mechanism built into Prometheus to preserve the 2 hours of in-memory data across restarts/crash recovery/upgrades? What measures/guidelines should we follow to minimize data loss with the least disruption?

Brian Candler

Mar 30, 2020, 3:05:40 PM
to Prometheus Users
When you stop Prometheus cleanly it writes out its WAL to disk, and when you start it again it reads the WAL back from disk. This is why a Prometheus restart can take several minutes (and you should ensure that your supervisor process isn't configured to do a hard kill after a short timeout).
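For example, under systemd that usually means giving the unit a generous stop timeout. A minimal sketch, assuming a unit named prometheus.service (the timeout value is just an illustration):

    # /etc/systemd/system/prometheus.service.d/override.conf
    [Service]
    # Send SIGTERM and give Prometheus plenty of time to checkpoint/flush
    # before systemd escalates to SIGKILL.
    KillSignal=SIGTERM
    TimeoutStopSec=10min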

Of course it won't be ingesting data during that time, but if you have a second instance, that one still will be.

Shaam Dinesh

Mar 30, 2020, 3:21:00 PM
to Prometheus Users
Hi Brian,

Thanks for the response. Yes, I am already using a persistent disk to survive restarts, but it has not helped to save the 2 hours of data as configured.

Is there any better way to address it?

Brian Candler

Mar 30, 2020, 4:48:50 PM
to Prometheus Users
It Works For Me™, but I don't use Thanos.

Maybe you should try to make a standalone test case that reproduces the problem. My suspicion is that you're not letting Prometheus shut down cleanly.
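A rough sketch of such a test, assuming a standard Linux install (the data directory path is an assumption; substitute your --storage.tsdb.path):

    # Stop Prometheus with SIGTERM and wait for it to exit on its own - no SIGKILL.
    kill -TERM "$(pidof prometheus)"
    while pidof prometheus >/dev/null; do sleep 1; done

    # After a clean shutdown the WAL (and checkpoint) should still be on disk,
    # and the next start should replay it without losing recent samples.
    ls -l /data/prometheus/wal/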

Aliaksandr Valialkin

Mar 31, 2020, 10:42:09 AM
to Shaam Dinesh, Prometheus Users
Another option is to configure the Prometheus instances to replicate data to remote storage via remote_write. Prometheus replicates data to the configured remote storage systems as soon as it is scraped, so it shouldn't lose large amounts of data on an unclean shutdown. See the list of supported remote storage systems here. The most promising systems are Cortex, M3DB and VictoriaMetrics. You can evaluate multiple systems at once - just add multiple `remote_write->url` entries in the Prometheus config.
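A minimal sketch of what that looks like in prometheus.yml (the endpoint URLs are placeholders; each backend documents its own ingestion path):

    remote_write:
      # Each entry gets its own send queue; samples go to every URL as they are scraped.
      - url: http://cortex.example:9009/api/prom/push
      - url: http://victoriametrics.example:8428/api/v1/write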

It is worth reading these docs on remote_write config tuning in Prometheus.
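Most of that tuning lives under queue_config per remote_write entry; for example (the numbers below are illustrative only, check the docs for the current defaults):

    remote_write:
      - url: http://cortex.example:9009/api/prom/push
        queue_config:
          capacity: 2500              # samples buffered per shard
          max_shards: 200             # upper bound on parallel senders
          max_samples_per_send: 500   # batch size per request
          batch_send_deadline: 5s     # flush a partial batch after this long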

--
Best Regards,

Aliaksandr