No space left on device


Abbas Gori

Apr 6, 2021, 9:33:59 AM
to Postgres Operator
Hi,

The database container in the Postgres cluster is stuck in the NotReady state. The log is:

2021-04-06 06:07:51,268 INFO: Lock owner: None; I am pgo-qa-7f64888857-brttb
06/04/2021 11:37:51 2021-04-06 06:07:51,288 INFO: Reaped pid=6934, exit status=0
06/04/2021 11:37:51 2021-04-06 06:07:51,289 INFO: Lock owner: None; I am pgo-qa-7f64888857-brttb
06/04/2021 11:37:51 2021-04-06 06:07:51,290 INFO: starting as a secondary
06/04/2021 11:37:51 2021-04-06 06:07:51.600 UTC [6944] FATAL: could not write lock file "postmaster.pid": No space left on device
06/04/2021 11:37:51 2021-04-06 06:07:51,622 INFO: postmaster pid=6944
06/04/2021 11:37:51 /tmp:5432 - no response
06/04/2021 11:37:51 2021-04-06 06:07:51,647 WARNING: Postgresql is not running.
06/04/2021 11:37:51 2021-04-06 06:07:51,647 INFO: Lock owner: None; I am pgo-qa-7f64888857-brttb
06/04/2021 11:37:51 2021-04-06 06:07:51,663 INFO: Reaped pid=6946, exit status=0
06/04/2021 11:37:51 2021-04-06 06:07:51,665 INFO: pg_controldata:
06/04/2021 11:37:51 pg_control version number: 1201
06/04/2021 11:37:51 Catalog version number: 201909212
06/04/2021 11:37:51 Database system identifier: 6926871385558286504
06/04/2021 11:37:51 Database cluster state: in archive recovery
06/04/2021 11:37:51 pg_control last modified: Mon Apr 5 23:59:08 2021
06/04/2021 11:37:51 Latest checkpoint location: 41/55000060
06/04/2021 11:37:51 Latest checkpoint's REDO location: 41/5404B618
06/04/2021 11:37:51 Latest checkpoint's REDO WAL file: 000000090000004100000054
06/04/2021 11:37:51 Latest checkpoint's TimeLineID: 9
06/04/2021 11:37:51 Latest checkpoint's PrevTimeLineID: 9
06/04/2021 11:37:51 Latest checkpoint's full_page_writes: on
06/04/2021 11:37:51 Latest checkpoint's NextXID: 0:441724
06/04/2021 11:37:51 Latest checkpoint's NextOID: 524294
06/04/2021 11:37:51 Latest checkpoint's NextMultiXactId: 1
06/04/2021 11:37:51 Latest checkpoint's NextMultiOffset: 0
06/04/2021 11:37:51 Latest checkpoint's oldestXID: 480
06/04/2021 11:37:51 Latest checkpoint's oldestXID's DB: 1
06/04/2021 11:37:51 Latest checkpoint's oldestActiveXID: 441724
06/04/2021 11:37:51 Latest checkpoint's oldestMultiXid: 1
06/04/2021 11:37:51 Latest checkpoint's oldestMulti's DB: 1
06/04/2021 11:37:51 Latest checkpoint's oldestCommitTsXid: 0
06/04/2021 11:37:51 Latest checkpoint's newestCommitTsXid: 0
06/04/2021 11:37:51 Time of latest checkpoint: Mon Apr 5 19:13:32 2021
06/04/2021 11:37:51 Fake LSN counter for unlogged rels: 0/3E8
06/04/2021 11:37:51 Minimum recovery ending location: 41/5A000000
06/04/2021 11:37:51 Min recovery ending loc's timeline: 9
06/04/2021 11:37:51 Backup start location: 0/0
06/04/2021 11:37:51 Backup end location: 0/0
06/04/2021 11:37:51 End-of-backup record required: no
06/04/2021 11:37:51 wal_level setting: logical
06/04/2021 11:37:51 wal_log_hints setting: on
06/04/2021 11:37:51 max_connections setting: 100
06/04/2021 11:37:51 max_worker_processes setting: 8
06/04/2021 11:37:51 max_wal_senders setting: 6
06/04/2021 11:37:51 max_prepared_xacts setting: 0
06/04/2021 11:37:51 max_locks_per_xact setting: 64
06/04/2021 11:37:51 track_commit_timestamp setting: off
06/04/2021 11:37:51 Maximum data alignment: 8
06/04/2021 11:37:51 Database block size: 8192
06/04/2021 11:37:51 Blocks per segment of large relation: 131072
06/04/2021 11:37:51 WAL block size: 8192
06/04/2021 11:37:51 Bytes per WAL segment: 16777216
06/04/2021 11:37:51 Maximum length of identifiers: 64
06/04/2021 11:37:51 Maximum columns in an index: 32
06/04/2021 11:37:51 Maximum size of a TOAST chunk: 1996
06/04/2021 11:37:51 Size of a large-object chunk: 2048
06/04/2021 11:37:51 Date/time type storage: 64-bit integers
06/04/2021 11:37:51 Float4 argument passing: by value
06/04/2021 11:37:51 Float8 argument passing: by value
06/04/2021 11:37:51 Data page checksum version: 1
06/04/2021 11:37:51 Mock authentication nonce: 98414342bea502fcf14ce270b2c95c85c768f89151061fac1aa0fdea19a03c9c

The persistent volume had 1GiB, so I increased it to 2GiB. I also resized the PVC to 2GiB, but nothing seems to solve the issue.
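For reference, this is roughly how I did the resize (the PVC being named after the cluster, pgo-qa, is my assumption; run in the cluster's namespace):

# Bump the storage request on the PVC
kubectl patch pvc pgo-qa -p '{"spec":{"resources":{"requests":{"storage":"2Gi"}}}}'
# Confirm the requested capacity shows up on the PVC
kubectl get pvc pgo-qa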

Could you help me solve this issue?

Environment: GKE 1.17.17-gke.2800
Storage class: standard (GCE)

Jonathan S. Katz

Apr 6, 2021, 9:35:27 AM
to Abbas Gori, Postgres Operator
- What do the logs in /pgdata/pg_log/... say?
- Did you scale down/up the Deployment after resizing the PVC?

Jonathan S. Katz
VP Platform Engineering

Crunchy Data
Enterprise PostgreSQL 


Abbas Gori

Apr 6, 2021, 10:18:39 AM
to Jonathan S. Katz, Postgres Operator
1. Logs from /pgdata/pgo-qa/pg_log/postgresql-Mon.log:

2021-04-05 23:57:34.498 UTC [2211920] LOG:  entering standby mode
2021-04-05 23:57:34.523 UTC [2211928] FATAL:  the database system is starting up
2021-04-05 23:57:34.636 UTC [2211932] FATAL:  the database system is starting up
2021-04-05 23:57:34.651 UTC [2211934] FATAL:  the database system is starting up
2021-04-05 23:57:34.822 UTC [2211920] LOG:  restored log file "00000009.history" from archive
2021-04-05 23:57:35.141 UTC [2211938] FATAL:  the database system is starting up
2021-04-05 23:57:35.536 UTC [2211967] FATAL:  the database system is starting up
2021-04-05 23:57:35.540 UTC [2211920] LOG:  redo starts at 41/5404B618
2021-04-05 23:57:35.665 UTC [2211974] FATAL:  the database system is starting up
2021-04-05 23:57:36.550 UTC [2211984] FATAL:  the database system is starting up
2021-04-05 23:57:36.677 UTC [2211989] FATAL:  the database system is starting up
2021-04-05 23:57:37.219 UTC [2211920] LOG:  consistent recovery state reached at 41/5A000000
2021-04-05 23:57:37.220 UTC [2211917] LOG:  database system is ready to accept read only connections
2021-04-05 23:57:38.204 UTC [2211920] LOG:  received promote request
2021-04-05 23:57:38.204 UTC [2211920] LOG:  redo done at 41/5904D598
2021-04-05 23:57:38.204 UTC [2211920] LOG:  last completed transaction was at log time 2021-04-05 19:18:19.179725+00
2021-04-05 23:57:38.860 UTC [2211920] LOG:  selected new timeline ID: 10
2021-04-05 23:57:38.860 UTC [2211920] FATAL:  could not write to file "pg_wal/xlogtemp.2211920": No space left on device
2021-04-05 23:57:38.862 UTC [2211917] LOG:  startup process (PID 2211920) exited with exit code 1
2021-04-05 23:57:38.862 UTC [2211917] LOG:  terminating any other active server processes
2021-04-05 23:57:38.862 UTC [2212006] WARNING:  terminating connection because of crash of another server process
2021-04-05 23:57:38.862 UTC [2212006] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2021-04-05 23:57:38.862 UTC [2212006] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2021-04-05 23:57:38.868 UTC [2211917] LOG:  database system is shut down
2021-04-05 23:57:58.465 UTC [2212207] LOG:  database system was interrupted while in recovery at log time 2021-04-05 19:13:32 UTC
2021-04-05 23:57:58.465 UTC [2212207] HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2021-04-05 23:57:59.050 UTC [2212207] LOG:  entering standby mode
2021-04-05 23:57:59.109 UTC [2212215] FATAL:  the database system is starting up
2021-04-05 23:57:59.326 UTC [2212219] FATAL:  the database system is starting up
2021-04-05 23:57:59.339 UTC [2212221] FATAL:  the database system is starting up
2021-04-05 23:57:59.371 UTC [2212207] LOG:  restored log file "00000009.history" from archive
2021-04-05 23:58:00.107 UTC [2212207] LOG:  redo starts at 41/5404B618
2021-04-05 23:58:00.118 UTC [2212248] FATAL:  the database system is starting up
2021-04-05 23:58:00.125 UTC [2212250] FATAL:  the database system is starting up
2021-04-05 23:58:00.351 UTC [2212254] FATAL:  the database system is starting up
2021-04-05 23:58:01.139 UTC [2212265] FATAL:  the database system is starting up
2021-04-05 23:58:01.362 UTC [2212269] FATAL:  the database system is starting up
2021-04-05 23:58:01.784 UTC [2212207] LOG:  consistent recovery state reached at 41/5A000000
2021-04-05 23:58:01.785 UTC [2212204] LOG:  database system is ready to accept read only connections
2021-04-05 23:58:02.740 UTC [2212207] LOG:  received promote request
2021-04-05 23:58:02.740 UTC [2212207] LOG:  redo done at 41/5904D598
2021-04-05 23:58:02.740 UTC [2212207] LOG:  last completed transaction was at log time 2021-04-05 19:18:19.179725+00
2021-04-05 23:58:03.418 UTC [2212207] LOG:  selected new timeline ID: 10
2021-04-05 23:58:03.418 UTC [2212207] FATAL:  could not write to file "pg_wal/xlogtemp.2212207": No space left on device
2021-04-05 23:58:03.421 UTC [2212204] LOG:  startup process (PID 2212207) exited with exit code 1
2021-04-05 23:58:03.421 UTC [2212204] LOG:  terminating any other active server processes
2021-04-05 23:58:03.421 UTC [2212286] WARNING:  terminating connection because of crash of another server process
2021-04-05 23:58:03.421 UTC [2212286] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2021-04-05 23:58:03.421 UTC [2212286] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2021-04-05 23:58:03.426 UTC [2212204] LOG:  database system is shut down
2021-04-05 23:58:23.023 UTC [2212546] LOG:  database system was interrupted while in recovery at log time 2021-04-05 19:13:32 UTC
2021-04-05 23:58:23.023 UTC [2212546] HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2021-04-05 23:58:23.670 UTC [2212546] LOG:  entering standby mode
2021-04-05 23:58:23.788 UTC [2212554] FATAL:  the database system is starting up
2021-04-05 23:58:23.910 UTC [2212558] FATAL:  the database system is starting up
2021-04-05 23:58:23.922 UTC [2212560] FATAL:  the database system is starting up
2021-04-05 23:58:24.015 UTC [2212546] LOG:  restored log file "00000009.history" from archive
2021-04-05 23:58:24.708 UTC [2212546] LOG:  redo starts at 41/5404B618
2021-04-05 23:58:24.800 UTC [2212573] FATAL:  the database system is starting up
2021-04-05 23:58:24.933 UTC [2212577] FATAL:  the database system is starting up
2021-04-05 23:58:25.812 UTC [2212588] FATAL:  the database system is starting up
2021-04-05 23:58:25.943 UTC [2212592] FATAL:  the database system is starting up
2021-04-05 23:58:26.372 UTC [2212546] LOG:  consistent recovery state reached at 41/5A000000
2021-04-05 23:58:26.373 UTC [2212543] LOG:  database system is ready to accept read only connections
2021-04-05 23:58:27.322 UTC [2212546] LOG:  received promote request
2021-04-05 23:58:27.322 UTC [2212546] LOG:  redo done at 41/5904D598
2021-04-05 23:58:27.323 UTC [2212546] LOG:  last completed transaction was at log time 2021-04-05 19:18:19.179725+00
2021-04-05 23:58:27.976 UTC [2212546] LOG:  selected new timeline ID: 10
2021-04-05 23:58:27.977 UTC [2212546] FATAL:  could not write to file "pg_wal/xlogtemp.2212546": No space left on device
2021-04-05 23:58:27.979 UTC [2212543] LOG:  startup process (PID 2212546) exited with exit code 1
2021-04-05 23:58:27.979 UTC [2212543] LOG:  terminating any other active server processes
2021-04-05 23:58:27.979 UTC [2212610] WARNING:  terminating connection because of crash of another server process
2021-04-05 23:58:27.979 UTC [2212610] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2021-04-05 23:58:27.979 UTC [2212610] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2021-04-05 23:58:27.985 UTC [2212543] LOG:  database system is shut down
2021-04-05 23:58:47.642 UTC [2212811] LOG:  database system was interrupted while in recovery at log time 2021-04-05 19:13:32 UTC
2021-04-05 23:58:47.642 UTC [2212811] HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2021-04-05 23:58:48.286 UTC [2212811] LOG:  entering standby mode
2021-04-05 23:58:48.362 UTC [2212840] FATAL:  the database system is starting up
2021-04-05 23:58:48.376 UTC [2212842] FATAL:  the database system is starting up
2021-04-05 23:58:48.586 UTC [2212846] FATAL:  the database system is starting up
2021-04-05 23:58:48.595 UTC [2212848] FATAL:  the database system is starting up
2021-04-05 23:58:48.613 UTC [2212811] LOG:  restored log file "00000009.history" from archive
2021-04-05 23:58:49.281 UTC [2212811] LOG:  redo starts at 41/5404B618
2021-04-05 23:58:49.395 UTC [2212861] FATAL:  the database system is starting up
2021-04-05 23:58:49.605 UTC [2212811] FATAL:  could not extend file "base/16415/518628": wrote only 4096 of 8192 bytes at block 0
2021-04-05 23:58:49.605 UTC [2212811] HINT:  Check free disk space.
2021-04-05 23:58:49.605 UTC [2212811] CONTEXT:  WAL redo at 41/55043330 for XLOG/FPI:
2021-04-05 23:58:49.607 UTC [2212808] LOG:  startup process (PID 2212811) exited with exit code 1
2021-04-05 23:58:49.607 UTC [2212808] LOG:  terminating any other active server processes
2021-04-05 23:58:49.612 UTC [2212808] LOG:  database system is shut down
2021-04-05 23:59:01.118 UTC [2212947] LOG:  database system was interrupted while in recovery at log time 2021-04-05 19:13:32 UTC
2021-04-05 23:59:01.118 UTC [2212947] HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2021-04-05 23:59:01.710 UTC [2212947] LOG:  entering standby mode
ERROR: [064]: unable to write '/pgdata/pgo-qa/pg_wal/RECOVERYHISTORY': [28] No space left on device
2021-04-05 23:59:01.946 UTC [2212955] FATAL:  the database system is starting up
2021-04-05 23:59:01.955 UTC [2212957] FATAL:  the database system is starting up
2021-04-05 23:59:02.671 UTC [2212947] LOG:  redo starts at 41/5404B618
2021-04-05 23:59:02.966 UTC [2212970] FATAL:  the database system is starting up
2021-04-05 23:59:02.986 UTC [2212947] FATAL:  could not extend file "base/16415/518611": No space left on device
2021-04-05 23:59:02.986 UTC [2212947] HINT:  Check free disk space.
2021-04-05 23:59:02.986 UTC [2212947] CONTEXT:  WAL redo at 41/55034BC0 for XLOG/FPI:
2021-04-05 23:59:02.988 UTC [2212944] LOG:  startup process (PID 2212947) exited with exit code 1
2021-04-05 23:59:02.988 UTC [2212944] LOG:  terminating any other active server processes
2021-04-05 23:59:02.992 UTC [2212944] LOG:  database system is shut down
2021-04-05 23:59:04.454 UTC [2213029] LOG:  database system was interrupted while in recovery at log time 2021-04-05 19:13:32 UTC
2021-04-05 23:59:04.454 UTC [2213029] HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2021-04-05 23:59:05.075 UTC [2213029] LOG:  entering standby mode
ERROR: [064]: unable to write '/pgdata/pgo-qa/pg_wal/RECOVERYHISTORY': [28] No space left on device
2021-04-05 23:59:05.412 UTC [2213040] FATAL:  the database system is starting up
2021-04-05 23:59:05.423 UTC [2213042] FATAL:  the database system is starting up
2021-04-05 23:59:06.043 UTC [2213029] LOG:  redo starts at 41/5404B618
2021-04-05 23:59:06.360 UTC [2213029] FATAL:  could not extend file "base/16415/518611": No space left on device
2021-04-05 23:59:06.360 UTC [2213029] HINT:  Check free disk space.
2021-04-05 23:59:06.360 UTC [2213029] CONTEXT:  WAL redo at 41/55034BC0 for XLOG/FPI:
2021-04-05 23:59:06.362 UTC [2213003] LOG:  startup process (PID 2213029) exited with exit code 1
2021-04-05 23:59:06.362 UTC [2213003] LOG:  terminating any other active server processes
2021-04-05 23:59:06.365 UTC [2213003] LOG:  database system is shut down
2021-04-05 23:59:06.814 U
2021-04-05 23:59:08.677 UTC [2213060] LOG:  database system is shut down

2. Yes, I scaled the deployment down to zero replicas and back up. pgbadger is in the Running state, but the database pod is stuck in the NotReady state.
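These are roughly the commands I used for that (the Deployment being named after the cluster, pgo-qa, is my assumption):

# Scale the database Deployment down and back up
kubectl scale deployment pgo-qa --replicas=0
kubectl scale deployment pgo-qa --replicas=1
# Watch the pods come back
kubectl get pods -w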

Abbas Gori

Apr 15, 2021, 7:07:10 AM
to Postgres Operator, Abbas Gori, jonath...@crunchydata.com
Hi, any update on this issue? I'm still facing it and am not able to connect to the database.

Jonathan S. Katz

Apr 17, 2021, 9:46:01 AM
to Abbas Gori, Postgres Operator
Hi Abbas,

The logs provide a strong indication of what you need to do:

2021-04-05 23:59:06.360 UTC [2213029] FATAL:  could not extend file "base/16415/518611": No space left on device
2021-04-05 23:59:06.360 UTC [2213029] HINT:  Check free disk space.

Apparently you are out of disk space. You should resize your PVC. PGO 4.7 will make this seamless, as it includes a feature to do so, but for now you will have to resize manually.
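At the Kubernetes level, a manual resize looks roughly like this (the PVC name matching your cluster name, pgo-qa, is an assumption; adjust as needed):

# The StorageClass must allow expansion, otherwise the PVC resize is a no-op
kubectl get storageclass standard -o jsonpath='{.allowVolumeExpansion}'
# Request more storage on the PVC
kubectl patch pvc pgo-qa -p '{"spec":{"resources":{"requests":{"storage":"5Gi"}}}}'
# Check that the capacity in the status grows; if a FileSystemResizePending
# condition appears, restart the database pod so the filesystem itself is expanded
kubectl describe pvc pgo-qa

The key part is verifying that the filesystem inside the pod actually grows, not just the size on the PVC object.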

If you require timely responses, please refer to the Support page for commercial options.

Thanks,

Jonathan

Jonathan S. Katz
VP Platform Engineering

Crunchy Data
Enterprise PostgreSQL 


Abbas Gori

Apr 19, 2021, 7:17:32 AM
to Postgres Operator, jonath...@crunchydata.com, Abbas Gori
Hi,

I have already tried that. I even did it again.

The database pod had a 1GiB PVC attached. I have now increased it to 5GiB and restarted the database pod by scaling the deployment to zero and then back to one, but the error logs are the same. Resizing the PVC didn't solve the issue.
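For completeness, checking space inside the pod can be done with something like this (the pod name is copied from the earlier logs and will have changed after the restarts; the container name "database" is a guess):

# Check whether the filesystem mounted at /pgdata actually grew to 5Gi
kubectl exec pgo-qa-7f64888857-brttb -c database -- df -h /pgdata
# See how much of it pg_wal is taking up
kubectl exec pgo-qa-7f64888857-brttb -c database -- du -sh /pgdata/pgo-qa/pg_wal
# Tail the latest Postgres log
kubectl exec pgo-qa-7f64888857-brttb -c database -- tail -n 100 /pgdata/pgo-qa/pg_log/postgresql-Mon.log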
