Why is Prometheus not suitable for long-term storage?


rigsb...@gmail.com

Sep 19, 2018, 4:12:35 AM
to Prometheus Users
I am considering using Prometheus for monitoring, and I want to keep metrics for at least one year. However, many sources have mentioned that it is not meant for long-term storage. As far as I know, we can set a long data retention period, such as years, with --storage.tsdb.retention. So what is the reason it is considered unsuitable for long-term storage? Is it a performance issue, or something else?
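
For example, I assume I would just start Prometheus with something like:

  prometheus --config.file=prometheus.yml --storage.tsdb.retention=365d  # keep roughly one year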

Ben Kochie

Sep 19, 2018, 4:39:07 AM
to rigsb...@gmail.com, Prometheus Users
This was true for Prometheus 1.x. With Prometheus 2.x there are far fewer problems with long-term storage; now the question is mainly one of scale and planning.

Another issue with long-term storage is that, with years of high-resolution data, very long queries can take a lot of memory to process. This also comes down to scale.

For example, a rate() over 1 year of data at a 15-second scrape interval has to read about 2.1 million samples, or roughly 2.6 MiB of data, and that's for a single series.
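
To spell out the arithmetic: one year is about 365 * 86,400 = ~31.5 million seconds, so at one sample every 15 seconds that is roughly 2.1 million samples per series; at the ~1.3 bytes per sample the 2.x TSDB typically achieves on disk, that works out to about 2.6 MiB.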

There are several ways to mitigate this problem. You can pre-compute the data you need with recording rules, and use federation to selectively store those results in a separate Prometheus server.
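
As a rough sketch (the file, rule, and metric names here are only illustrative), a recording rule on the high-resolution server could look like:

  # recording_rules.yml (example file; names are illustrative)
  groups:
    - name: aggregations
      rules:
        # pre-compute a per-job request rate so long-range queries read few samples
        - record: job:http_requests:rate5m
          expr: sum by (job) (rate(http_requests_total[5m]))

The long-term server then federates only the pre-computed job:* series, which are far cheaper to keep for years than the raw data.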

So, if you have a small infrastructure, there is nothing wrong with adjusting the retention time to years; the current TSDB implementation is perfectly able to handle this.

For larger installations, you may want to consider a distributed TSDB layer like Cortex or Thanos.


sull.l...@gmail.com

Sep 20, 2018, 4:14:52 AM
to Prometheus Users
Hi,

But what counts as a small infrastructure?

We have 15 days of retention:
1 million time series, with 20K samples/s.

About 200 node_exporter instances, 800 TCP/HTTP probes, and 200 targets with different kinds of exporters (apache / elastic / postgres / jmx / rancher / cadvisor...).
Everything with a scrape interval of 60s.

We will be growing our infrastructure next, but we are already facing performance issues with Grafana queries (a rate over the last 7 days with 5 metrics can take 20s!). We are thinking about pre-computing (recording rules), but is this normal?

Kind regards.

Ben Kochie

Sep 20, 2018, 5:41:19 AM
to Ptitlusone, Prometheus Users
Yes, sorry, I didn't really say what small was.

I would call that single Prometheus a "medium" size.

With a 60s scrape interval, 7 days of 5 metrics is only ~50k samples; that should return in maybe 100ms.
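(That works out to 7 * 86,400 / 60 = 10,080 samples per series over 7 days, times 5 series, so about 50,400 samples.)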

For example, I just ran this query:
* rate(http_response_size_bytes_count[1h])
* step 3600
* 7 days
* 27 instances
Response time: 403ms

Another example: Running a count(up) for 12 weeks takes 3.5 seconds for 900 metrics.

Our main Prometheus instance is similar in size, but we scrape every 15s, and ingest around 55k samples/sec. We currently keep 6 months in local TSDB.
We also use a number of recording rules to generate key metrics.

The main server uses about 10GB of memory, but we use a 30GB instance to allow for good caching of data for longer queries.

We are doing a couple of things to improve performance for our setup:
* We are increasing sharding by application: several smaller Prometheus servers, each dedicated to a specific set of jobs.
* We're deploying Thanos to provide a global query proxy layer as we scale, and eventually we'll move long-term storage there.



sull.l...@gmail.com

Sep 20, 2018, 8:51:22 AM
to Prometheus Users
Great!

I understand, and I can pretty much see where we fit.
I ran the same query and I observe the same response time.

I think we will upgrade the memory from 8 GB to at least 20 GB for caching.
I also think our short-range queries are indeed not suitable for the long term.

I also understand that we can do long-term storage with a dedicated Prometheus that federates the others (with filters and aggregation).
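
As a rough sketch of what I have in mind (the source address is just a placeholder):

  # prometheus.yml on the dedicated long-term server
  scrape_configs:
    - job_name: 'federate'
      honor_labels: true
      metrics_path: '/federate'
      params:
        'match[]':
          # pull only the aggregated recording-rule series
          - '{__name__=~"job:.*"}'
      static_configs:
        # placeholder address for one of our existing Prometheus servers
        - targets: ['source-prometheus:9090']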

That's while waiting to try Uber's solution (M3DB).

Thanks a lot!