I have Prometheus running in EKS (App Version: 2.18.1), with its data stored on an EFS mount. I repeatedly get compaction failure errors, and the number of WAL files grows drastically. The only thing that fixes it is deleting the WAL directory and restarting the pod, but deleting the WAL directory loses data. Is there a permanent fix for this issue?
Error from logs:
level=error ts=2020-07-23T03:51:41.230Z caller=db.go:667 component=tsdb msg="compaction failed" err="persist head block: write compaction: add series: out-of-order series added with label set \"{__name__=\\\"go_gc_duration_seconds\\\", instance=\\\"<hostname>:<port>\\\", job=\\\"<job_name>\\\", quantile=\\\"0\\\", region=\\\"<region_label>\\\"}\""
The job name in the label set varies; each time the error occurs, it points to a different job.
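In case it helps with diagnosis, here is a minimal sketch of how the failing jobs could be tallied from the pod logs, to check whether the error really is spread across jobs or concentrated in a few. It assumes the log lines have the same shape as the one above (a `job=` label inside the escaped label set). The function name `count_failing_jobs` is my own and not part of Prometheus.

```python
import re
from collections import Counter

# Matches job="<value>" inside the label set, tolerating any number of
# backslashes before the quotes (the log line escapes them as \\\").
# Assumption: job values contain neither quotes nor backslashes.
JOB_RE = re.compile(r'\bjob=\\*"([^"\\]+)\\*"')


def count_failing_jobs(lines):
    """Count how often each job appears in out-of-order compaction errors."""
    counts = Counter()
    for line in lines:
        # Only look at the compaction error shown above; skip other log lines.
        if "out-of-order series added" not in line:
            continue
        match = JOB_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

The function accepts any iterable of log lines, so it could be called on an open file holding saved pod logs, and `most_common()` on the result would list the jobs by how often they show up in the error.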