Chris Siebenmann
Jun 24, 2019, 4:16:07 PM
to promethe...@googlegroups.com, cks.prom...@cs.toronto.edu
We had a power failure here yesterday morning, and in the aftermath
our Prometheus 2.10.0 installation began logging errors:
prometheus[648]: level=error ts=2019-06-23T17:18:00.118Z caller=db.go:363 component=tsdb msg="compaction failed" err="reload blocks: invalid block sequence: block time ranges overlap: [mint: 1561276800000, maxt: 1561284000000, range: 2h0m0s, blocks: 2]: <ulid: 01DE2053MEB6JVS709F798AHPB, mint: 1561276800000, maxt: 1561284000000, range: 2h0m0s>, <ulid: 01DE20QHR4ZWPH4X1YAZC59P75, mint: 1561276800000, maxt: 1561284000000, range: 2h0m0s>\n[mint: 1561284000000, maxt: 1561291200000, range: 2h0m0s, blocks: 2]: <ulid: 01DE276KWCRPAS09VKG73QA6V1, mint: 1561284000000, maxt: 1561291200000, range: 2h0m0s>, <ulid: 01DE2NGD7Y420VMYVAQ3RJC1WR, mint: 1561284000000, maxt: 1561291200000, range: 2h0m0s>"
Later, after attempting a Prometheus restart, we also saw:
prometheus[9164]: level=error ts=2019-06-24T17:51:59.442Z caller=main.go:723 err="opening storage failed: invalid block sequence: block time ranges overlap: [mint: 1561276800000, maxt: 1561284000000, range: 2h0m0s, blocks: 2]: <ulid: 01DE2053MEB6JVS709F798AHPB, mint: 1561276800000, maxt: 1561284000000, range: 2h0m0s>, <ulid: 01DE20QHR4ZWPH4X1YAZC59P75, mint: 1561276800000, maxt: 1561284000000, range: 2h0m0s>"
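For reference, the mint and maxt values in these errors are Unix
timestamps in milliseconds, so a quick conversion shows the window
involved; a minimal Python sketch:

    from datetime import datetime, timezone

    # Convert the mint/maxt values from the first error (milliseconds
    # since the Unix epoch) into human-readable UTC times.
    for label, ms in [("mint", 1561276800000), ("maxt", 1561284000000)]:
        ts = datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
        print(f"{label}: {ts:%Y-%m-%d %H:%M} UTC")
    # mint: 2019-06-23 08:00 UTC
    # maxt: 2019-06-23 10:00 UTC

That is 08:00 to 10:00 UTC on June 23rd, which lines up with the
morning of our power failure.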
After some investigation, I moved the 01DE2053MEB6JVS709F798AHPB directory
out of Prometheus's TSDB directory (but did not delete it), because it
looked like it had significantly fewer metrics, judging from its storage
usage, and covered a narrower time range. It turns out that it probably
has metrics I would like to get back (there is certainly now a gap in
our metrics that wasn't there before).
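For context on how I compared the two blocks: each block directory has
a meta.json file with the block's time range and statistics. This is a
minimal sketch of what I looked at, assuming the field names I see in
our meta.json files (ulid, minTime, maxTime, stats.numSamples):

    import json
    import sys

    # Print a block's time range and sample count from its meta.json,
    # to compare two overlapping blocks; pass the block directory as
    # the first argument.
    with open(f"{sys.argv[1]}/meta.json") as f:
        meta = json.load(f)

    print("ulid:      ", meta["ulid"])
    print("minTime:   ", meta["minTime"])
    print("maxTime:   ", meta["maxTime"])
    print("numSamples:", meta["stats"]["numSamples"])

Run against both block directories, this shows which one holds less
data.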
Prometheus 2.10.0 has the new --storage.tsdb.allow-overlapping-blocks
command line option (marked as experimental), which sounds like it might
fix this issue, based on its description and the GitHub pull request
that added it. But I don't really know, and I would rather lose these
metrics than run any significant risk of destroying our entire
Prometheus TSDB (which has historical data we care about).
So, does anyone know if this is the right and safe approach to recovery
here? Has anyone already used --storage.tsdb.allow-overlapping-blocks?
I looked in the GitHub repos and didn't find any reported issues, at
least.
(Alternatively, perhaps I have already lost those metrics and they
can't be retrieved now.)
If this is the right approach in general, I suspect that the correct
process is:
- stop Prometheus
- move the 01DE2053MEB6JVS709F798AHPB directory back into Prometheus's
TSDB directory
- start Prometheus with --storage.tsdb.allow-overlapping-blocks
If I do this, will Prometheus eventually de-overlap the blocks during
compaction, allowing me to drop the flag, or will I need to run with it
basically forever? (Possibly we want to run with it in general, in case
this sort of thing happens again.)
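In case it's useful: a rough way to check for overlapping blocks from
outside Prometheus is to read every block's meta.json and compare time
ranges. A sketch, again assuming the meta.json fields I see on disk:

    import glob
    import json
    import os

    def load_blocks(data_dir):
        # Collect (minTime, maxTime, ulid) from every block's meta.json.
        blocks = []
        for path in glob.glob(os.path.join(data_dir, "*", "meta.json")):
            with open(path) as f:
                meta = json.load(f)
            blocks.append((meta["minTime"], meta["maxTime"], meta["ulid"]))
        return sorted(blocks)

    def report_overlaps(blocks):
        # With blocks sorted by minTime, any block that starts before
        # the previous one ends overlaps it.
        for (amin, amax, aid), (bmin, bmax, bid) in zip(blocks, blocks[1:]):
            if bmin < amax:
                print(f"overlap: {aid} [{amin}, {amax}) "
                      f"and {bid} [{bmin}, {bmax})")

    report_overlaps(load_blocks("/path/to/prometheus/data"))

Running this periodically after the restart would presumably tell me
when (or whether) compaction has merged the overlapping blocks and the
flag can be dropped.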
Thanks in advance.
- cks