Prometheus 2.0 TSDB and Durability

Peter Zaitsev

Jan 22, 2018, 5:11:08 PM

to Prometheus Developers

Hi,

Reading on Prometheus 2.0 TSDB

https://prometheus.io/docs/prometheus/latest/storage/

I wonder what the expected durability of the TSDB really is by design. (I recognize that crash recovery bugs are to be expected in new code.)

On one side it states: "It is secured against crashes by a write-ahead-log (WAL) that can be replayed when the Prometheus server restarts after a crash."

On the other:

"If your local storage becomes corrupted for whatever reason, your best bet is to shut down Prometheus and remove the entire storage directory. However, you can also try removing individual block directories to resolve the problem. This means losing a time window of around two hours worth of data per block directory. Again, Prometheus's local storage is not meant as durable long-term storage."

Is it just the case that there might be some bugs, and time will tell, or are there known conditions under which the storage will become corrupted?

Julius Volz

Jan 23, 2018, 4:29:38 AM

to Peter Zaitsev, Prometheus Developers

Basically if the storage gets corrupted in some non-anticipated way, there's nothing that we can easily do, and since it's not a replicated storage system, we also can't just restore it from some other replica. Finished blocks are mostly immutable, except for when they get compacted into bigger blocks.

The WAL only exists so that recent sample data is not lost every time the server crashes.

--
You received this message because you are subscribed to the Google Groups "Prometheus Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-developers+unsub...@googlegroups.com.
To post to this group, send email to prometheus-developers@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-developers/388fa107-6a69-4584-81b1-79528bc9f1b2%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Fabian Reinartz

Jan 23, 2018, 9:50:18 AM

to Julius Volz, Peter Zaitsev, Prometheus Developers

> Finished blocks are mostly immutable, except for when they get compacted into bigger blocks.

Correct, but the newly produced block is immutable again.

Overall corruption to block data can only really happen if the filesystem/disk has an issue. We have relatively fine-grained crc32 checksums spread across our files. In theory we could implement best-effort reads there. It just happens to fail the whole query right now IIRC.
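To make the difference between failing the whole read and a best-effort read concrete, here is a minimal sketch. The framing (length prefix, payload, CRC32) and the function names are assumptions made up for illustration, and `zlib.crc32` is used only because it is in the Python standard library; none of this is the actual TSDB file format or checksum variant.

```python
import struct
import zlib


def encode_chunk(payload: bytes) -> bytes:
    """Frame a payload as: 4-byte big-endian length, payload, 4-byte CRC32.

    Hypothetical framing for illustration only.
    """
    return struct.pack(">I", len(payload)) + payload + struct.pack(">I", zlib.crc32(payload))


def read_chunks(buf: bytes, best_effort: bool = False) -> list:
    """Return the payloads stored in buf.

    Strict mode raises on the first checksum mismatch, so the whole read fails.
    Best-effort mode skips the damaged chunk and continues, which only works
    as long as the length prefix of the damaged chunk is still intact.
    """
    out, pos = [], 0
    while pos < len(buf):
        (length,) = struct.unpack_from(">I", buf, pos)
        payload = buf[pos + 4 : pos + 4 + length]
        (stored,) = struct.unpack_from(">I", buf, pos + 4 + length)
        if zlib.crc32(payload) == stored:
            out.append(payload)
        elif not best_effort:
            raise ValueError(f"checksum mismatch at offset {pos}")
        pos += 4 + length + 4
    return out
```

The finer-grained the checksums, the less data a best-effort read has to give up per corrupted region. Damage to a length field or to index structures is harder to work around, which matches the caveat about the lookup chain.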

Depending on where in the lookup chain of the index a corruption occurs though, the impact would be more critical regardless.

More or less the same applies for the WAL. Just that records may be dependent on one another – so there we do the safe default and truncate everything after the last valid record.
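The "truncate everything after the last valid record" behavior can be sketched as follows. The record framing (length and CRC32 header, then payload) and the names `encode_record` and `replay` are hypothetical and chosen for illustration; the real WAL format differs.

```python
import struct
import zlib

# Hypothetical record header: payload length, CRC32 of the payload.
HEADER = struct.Struct(">II")


def encode_record(payload: bytes) -> bytes:
    """Frame one log record as header + payload."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def replay(log: bytes):
    """Return (records, repaired_log).

    Replay stops at the first torn or corrupted record. Because later records
    may depend on earlier ones, nothing after that point is trusted, even if
    it would checksum correctly on its own.
    """
    records, pos = [], 0
    while pos + HEADER.size <= len(log):
        length, stored = HEADER.unpack_from(log, pos)
        payload = log[pos + HEADER.size : pos + HEADER.size + length]
        if len(payload) < length or zlib.crc32(payload) != stored:
            break  # torn write or corruption: discard this record and the rest
        records.append(payload)
        pos += HEADER.size + length
    return records, log[:pos]
```

In this sketch a valid third record is still dropped when the second one is damaged. That is the safe default described above: some recent data is lost, but the log never replays records whose prerequisites are missing.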

Peter Zaitsev

Jan 23, 2018, 6:04:03 PM

to Julius Volz, Prometheus Developers

Hi,

So, if I understand you correctly, unless there are bugs in the code or corruption at the filesystem or device level, the storage should be reliable?

I would point out that many database engines suffer the same kind of data loss in case of filesystem corruption. Some of them have repair tools or emergency data extraction tools; others may not.

Really, the only universal recovery process is to recover from backup, which I assume Prometheus supports too.

The only "gap" I see is that Prometheus is missing some sort of long-term transaction log which can be replayed to achieve point-in-time recovery, such as the binlog in MySQL.

--

Peter Zaitsev, CEO, Percona
Tel: +1 888 401 3401 ext 7360 Skype: peter_zaitsev

Julius Volz

Jan 24, 2018, 3:03:52 AM

to Peter Zaitsev, Prometheus Developers

On Wed, Jan 24, 2018 at 12:04 AM, Peter Zaitsev <p...@percona.com> wrote:

> and really the only universal recovery process is to recover from backup... which Prometheus supports too I assume

Yeah, you can do consistent snapshots in 2.x now.

However, when we talk about data durability, we usually think about local-only storage vs. clustered and replicated storage systems, which Prometheus is not (and by design, shouldn't be). That's the main differentiation we want to make with those points.
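For readers who want to try the snapshot feature, a rough sketch follows. It assumes the admin API is enabled on the server (`--web.enable-admin-api`) and uses the `/api/v1/admin/tsdb/snapshot` endpoint as documented for 2.x releases; the exact path may differ for your version, so check its documentation. The helper names and the example URL and data directory are made up for illustration.

```python
import json
import urllib.request


def build_snapshot_request(base_url: str) -> urllib.request.Request:
    """Build the POST request that asks the server to create a snapshot."""
    url = f"{base_url.rstrip('/')}/api/v1/admin/tsdb/snapshot"
    return urllib.request.Request(url, method="POST")


def snapshot_path(data_dir: str, response_body: bytes) -> str:
    """Map the JSON reply to the snapshot directory under <data_dir>/snapshots/."""
    reply = json.loads(response_body)
    if reply.get("status") != "success":
        raise RuntimeError(f"snapshot failed: {reply}")
    return f"{data_dir.rstrip('/')}/snapshots/{reply['data']['name']}"


def take_snapshot(base_url: str, data_dir: str) -> str:
    """Trigger a snapshot and return the directory to copy to backup storage."""
    with urllib.request.urlopen(build_snapshot_request(base_url)) as resp:
        return snapshot_path(data_dir, resp.read())
```

A call such as `take_snapshot("http://localhost:9090", "/var/lib/prometheus")` would return the directory to archive. The snapshot still lives on the same disk as the live data until it is copied elsewhere, so on its own it does not change the local-only durability picture described above.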

Matthias Rampke

Jan 24, 2018, 4:34:51 AM

to Julius Volz, Peter Zaitsev, Prometheus Developers

Given the immutable blocks, are there better recommendations we can give for full or partial recovery? Like deleting the WAL, or the corrupt block? Does the log give enough information to identify the problematic bits?

/MR
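As one possible starting point for the question above, and not an answer from the thread: each block directory carries a `meta.json` with the block's `ulid`, `minTime` and `maxTime` (milliseconds since the epoch). Listing those ranges shows which time window would be lost if a given block directory were removed. This sketch does not detect corruption itself, and the function names are made up for illustration.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def block_window(meta: dict):
    """Return (ulid, start, end) with the block's time range as UTC datetimes."""
    def to_dt(ms):
        return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

    return meta["ulid"], to_dt(meta["minTime"]), to_dt(meta["maxTime"])


def list_blocks(data_dir: str) -> list:
    """List the time windows of all blocks directly under data_dir, oldest first."""
    windows = []
    for meta_file in sorted(Path(data_dir).glob("*/meta.json")):
        try:
            windows.append(block_window(json.loads(meta_file.read_text())))
        except (OSError, ValueError, KeyError) as err:
            # Unreadable metadata is itself a hint about which block is damaged.
            print(f"unreadable block metadata: {meta_file}: {err}")
    return sorted(windows, key=lambda w: w[1])
```

Comparing the block named in an error log line against this list makes it easier to judge how much data deleting that single directory would cost.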

Peter Zaitsev

Jan 24, 2018, 5:30:15 PM

to Julius Volz, Prometheus Developers

Hi,

Well, even in 1.8 I can use LVM or filesystem-based snapshots to get a backup, assuming crash recovery works.

I understand the general philosophy that there is some persistent external storage which is reliable, etc. But I wonder what your estimate is of how many users actually use it vs. relying on local storage only?

In my understanding, because at this point all the aggregation happens on the Prometheus side, you often get the best performance from using local storage, especially with Prometheus 2.x, which seems to be heavily optimized in this area.
