One of my Prometheus servers deployed in Kubernetes was restarted for some reason, and I found that memory usage is huge during "replaying WAL".
Prometheus version: v2.13.1
Arguments: --storage.tsdb.retention.time=1d --web.enable-lifecycle --storage.tsdb.no-lockfile --web.route-prefix=/ --storage.tsdb.min-block-duration=2h --storage.tsdb.max-block-duration=2h
Server: 160 GB RAM, 20 cores
The RSS value from the process_resident_memory_bytes metric shows that this server has already restarted 3 times. It gets OOM killed while replaying the WAL:
level=info ts=2019-11-25T08:42:27.192Z caller=head.go:562 component=tsdb msg="WAL segment loaded" segment=10648 maxSegment=10762
level=info ts=2019-11-25T08:42:31.414Z caller=head.go:562 component=tsdb msg="WAL segment loaded" segment=10649 maxSegment=10762
level=info ts=2019-11-25T08:42:37.323Z caller=head.go:562 component=tsdb msg="WAL segment loaded" segment=10650 maxSegment=10762
level=info ts=2019-11-25T08:42:43.324Z caller=head.go:562 component=tsdb msg="WAL segment loaded" segment=10651 maxSegment=10762
level=info ts=2019-11-25T08:42:49.385Z caller=head.go:562 component=tsdb msg="WAL segment loaded" segment=10652 maxSegment=10762
level=info ts=2019-11-25T08:43:04.284Z caller=head.go:562 component=tsdb msg="WAL segment loaded" segment=10653 maxSegment=10762
level=warn ts=2019-11-25T08:43:54.134Z caller=main.go:501 msg="Received SIGTERM, exiting gracefully..."
level=info ts=2019-11-25T08:43:54.136Z caller=main.go:526 msg="Stopping scrape discovery manager..."
level=info ts=2019-11-25T08:43:54.137Z caller=main.go:540 msg="Stopping notify discovery manager..."
level=info ts=2019-11-25T08:43:54.137Z caller=main.go:562 msg="Stopping scrape manager..."
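
To put a number on how slow the replay is, the "WAL segment loaded" lines above can be parsed for the average time per segment and the number of segments still to load. This is only a minimal sketch: `replay_progress` is a hypothetical helper name, not part of Prometheus, and it assumes the logfmt layout shown in the log above.

```python
import re
from datetime import datetime

# Matches lines such as:
# level=info ts=... caller=head.go:562 component=tsdb msg="WAL segment loaded" segment=10648 maxSegment=10762
LINE_RE = re.compile(
    r'ts=(?P<ts>\S+) .*msg="WAL segment loaded" '
    r'segment=(?P<seg>\d+) maxSegment=(?P<max>\d+)'
)


def replay_progress(log_lines):
    """Return (last_segment, max_segment, mean_seconds_per_segment).

    Returns None when fewer than two distinct segments were logged,
    because no rate can be computed in that case.
    """
    points = []
    for line in log_lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a "WAL segment loaded" line
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S.%fZ")
        points.append((ts, int(match.group("seg")), int(match.group("max"))))
    if len(points) < 2 or points[-1][1] == points[0][1]:
        return None
    first_ts, first_seg, _ = points[0]
    last_ts, last_seg, max_seg = points[-1]
    mean = (last_ts - first_ts).total_seconds() / (last_seg - first_seg)
    return last_seg, max_seg, mean


if __name__ == "__main__":
    import sys

    result = replay_progress(sys.stdin)
    if result is not None:
        last_seg, max_seg, mean = result
        remaining = max_seg - last_seg
        print(f"{remaining} segments left, about {remaining * mean:.0f}s at {mean:.1f}s/segment")
```

It can be fed the container log on stdin, for example by piping the output of `kubectl logs` for the pod into the script.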

Before the restart, the series count was around 20-25 million. During "replaying WAL", I found the number of series in the head went up to 100 million, around 4x the usual.
I SSHed into the node and checked the wal folder.
I found lots of WAL files that are more than two hours old and have not been purged:
total 39G
drwxrwsr-x 3 root root 16K Nov 25 08:44 .
drwxrwsr-x 18 root root 4.0K Nov 25 07:38 ..
-rw-rw-r-- 1 root root 121M Nov 25 03:47 00010450
-rw-rw-r-- 1 root root 122M Nov 25 03:47 00010451
-rw-rw-r-- 1 root root 128M Nov 25 03:47 00010452
-rw-rw-r-- 1 root root 124M Nov 25 03:47 00010453
-rw-rw-r-- 1 root root 124M Nov 25 03:47 00010454
-rw-rw-r-- 1 root root 126M Nov 25 03:47 00010455
-rw-rw-r-- 1 root root 128M Nov 25 03:47 00010456
-rw-rw-r-- 1 root root 128M Nov 25 03:47 00010457
-rw-rw-r-- 1 root root 125M Nov 25 03:47 00010458
-rw-rw-r-- 1 root root 124M Nov 25 03:47 00010459
-rw-rw-r-- 1 root root 128M Nov 25 03:48 00010460
-rw-rw-r-- 1 root root 128M Nov 25 03:49 00010461
-rw-rw-r-- 1 root root 128M Nov 25 03:49 00010462
-rw-rw-r-- 1 root root 128M Nov 25 03:50 00010463
-rw-rw-r-- 1 root root 128M Nov 25 03:51 00010464
-rw-rw-r-- 1 root root 128M Nov 25 03:52 00010465
~~~~~~~~~~~~~SKIP~~~~~~~~~~~~~~~~~~~~~~
-rw-rw-r-- 1 root root 128M Nov 25 07:28 00010744
-rw-rw-r-- 1 root root 128M Nov 25 07:29 00010745
-rw-rw-r-- 1 root root 128M Nov 25 07:29 00010746
-rw-rw-r-- 1 root root 121M Nov 25 07:30 00010747
-rw-rw-r-- 1 root root 128M Nov 25 07:30 00010748
-rw-rw-r-- 1 root root 128M Nov 25 07:31 00010749
-rw-rw-r-- 1 root root 128M Nov 25 07:32 00010750
-rw-rw-r-- 1 root root 124M Nov 25 07:32 00010751
-rw-rw-r-- 1 root root 128M Nov 25 07:33 00010752
-rw-rw-r-- 1 root root 128M Nov 25 07:33 00010753
-rw-rw-r-- 1 root root 128M Nov 25 07:34 00010754
-rw-rw-r-- 1 root root 128M Nov 25 07:35 00010755
-rw-rw-r-- 1 root root 128M Nov 25 07:36 00010756
-rw-rw-r-- 1 root root 128M Nov 25 07:38 00010757
-rw-rw-r-- 1 root root 128M Nov 25 07:39 00010758
-rw-rw-r-- 1 root root 35M Nov 25 07:39 00010759
-rw-r--r-- 1 root root 0 Nov 25 07:39 00010760
-rw-r--r-- 1 root root 0 Nov 25 08:01 00010761
-rw-r--r-- 1 root root 0 Nov 25 08:23 00010762
-rw-r--r-- 1 root root 0 Nov 25 08:44 00010763
drwxrwsr-x 2 root root 12K Nov 25 07:17 checkpoint.010449
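
For anyone who wants to run the same check on their own data directory, here is a rough Python sketch that lists WAL segment files older than a cutoff. `stale_segments` is a hypothetical helper name, not a Prometheus tool, and it assumes segments are the 8-digit file names seen in the listing above, skipping checkpoint directories.

```python
import os
import re
import time

# Assumption: WAL segment files are named with exactly 8 digits,
# as in the directory listing above (e.g. 00010450).
SEGMENT_RE = re.compile(r"^\d{8}$")


def stale_segments(wal_dir, max_age_seconds=2 * 3600, now=None):
    """Return a sorted list of (name, size_bytes) for old segment files.

    A segment counts as old when its modification time is more than
    max_age_seconds before `now` (defaults to the current time).
    """
    now = time.time() if now is None else now
    stale = []
    for entry in os.scandir(wal_dir):
        if not entry.is_file() or not SEGMENT_RE.match(entry.name):
            continue  # skip checkpoint.* directories and anything else
        info = entry.stat()
        if now - info.st_mtime > max_age_seconds:
            stale.append((entry.name, info.st_size))
    return sorted(stale)


if __name__ == "__main__":
    import sys

    found = stale_segments(sys.argv[1])
    total_mib = sum(size for _, size in found) / (1024 * 1024)
    print(f"{len(found)} segments older than 2h, {total_mib:.0f} MiB in total")
```

Run it with the path to the wal folder as the only argument.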
Please correct me if I'm wrong:
1. I think Prometheus should only replay the WAL files from the last 2 hours.
2. I found that if the Prometheus server successfully replays the WAL files, it immediately removes series (no idea why).