problem trying to use repmgr node rejoin


Sotiris Tsimbonis

Feb 10, 2021, 3:47:35 PM
to repmgr
Hello everyone, hope you are safe during the covid outbreak.

Using postgresql 11.10 and repmgr 5.2.1 on CentOS 7.
I have set up a cluster with a primary, a standby and a witness node.
repmgrd is running on all nodes.

[postgres@seajets-replica data]$ repmgr cluster show
 ID | Name            | Role    | Status    | Upstream        | Location | Priority | Timeline | Connection string
----+-----------------+---------+-----------+-----------------+----------+----------+----------+-----------------------------------------------
 1  | seajets-replica | primary | * running |                 | default  | 100      | 1        | host=188.34.130.148 user=repmgr dbname=repmgr
 2  | popi            | standby |   running | seajets-replica | default  | 100      | 1        | host=94.130.68.29 user=repmgr dbname=repmgr
 3  | seajets         | witness | * running | seajets-replica | default  | 0        | n/a      | host=116.202.47.138 user=repmgr dbname=repmgr


Replication between the two nodes works, and I manually stop postgresql on the primary.

[postgres@seajets-replica data]$ sudo systemctl stop postgresql-11


repmgrd works and the standby node is successfully promoted to primary.

[postgres@popi data]$ repmgr cluster show
 ID | Name            | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-----------------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------
 1  | seajets-replica | primary | - failed  | ?        | default  | 100      |          | host=188.34.130.148 user=repmgr dbname=repmgr
 2  | popi            | primary | * running |          | default  | 100      | 2        | host=94.130.68.29 user=repmgr dbname=repmgr
 3  | seajets         | witness | * running | popi     | default  | 0        | n/a      | host=116.202.47.138 user=repmgr dbname=repmgr

WARNING: following issues were detected
  - unable to connect to node "seajets-replica" (ID: 1)

Right after I verify that the promotion has actually worked, I try to use repmgr node rejoin --force-rewind, in order to have the failed primary rejoin the cluster as a standby node.

[postgres@seajets-replica data]$ repmgr node rejoin -d 'host=94.130.68.29 user=repmgr dbname=repmgr' --force-rewind
NOTICE: pg_rewind execution required for this node to attach to rejoin target node 2
DETAIL: rejoin target server's timeline 2 forked off current database system timeline 1 before current recovery point 6/95000028
NOTICE: executing pg_rewind
DETAIL: pg_rewind command is "/usr/pgsql-11/bin/pg_rewind -D '/var/lib/pgsql/11/data' --source-server='host=94.130.68.29 user=repmgr dbname=repmgr'"
ERROR: pg_rewind execution failed
DETAIL: servers diverged at WAL location 6/95000000 on timeline 1

I tried running the exact same pg_rewind command manually. It complains about a missing file, but I can't understand why it looks for that file or why the file is missing.

[postgres@seajets-replica data]$ /usr/pgsql-11/bin/pg_rewind -D '/var/lib/pgsql/11/data' --source-server='host=94.130.68.29 user=repmgr dbname=repmgr'
servers diverged at WAL location 6/95000000 on timeline 1
could not open file "/var/lib/pgsql/11/data/pg_wal/000000010000000600000094": No such file or directory

could not find previous WAL record at 6/940446D0
Failure, exiting

[postgres@seajets-replica data]$ ls pg_wal/
000000010000000600000091.00000060.backup  0000000100000006000000A9  0000000100000006000000BE  0000000100000006000000D3  0000000100000006000000E8  0000000100000006000000FD
000000010000000600000095                  0000000100000006000000AA  0000000100000006000000BF  0000000100000006000000D4  0000000100000006000000E9  0000000100000006000000FE
000000010000000600000096                  0000000100000006000000AB  0000000100000006000000C0  0000000100000006000000D5  0000000100000006000000EA  0000000100000006000000FF
000000010000000600000097                  0000000100000006000000AC  0000000100000006000000C1  0000000100000006000000D6  0000000100000006000000EB  000000010000000700000000
...
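
For anyone puzzled by the file name in the error: pg_rewind scans the target's WAL backwards from the divergence point to the last checkpoint before it, so it needs the segment that holds the record at 6/940446D0. The sketch below shows how a timeline and an LSN map to a segment file name. It assumes the default 16 MB WAL segment size (not confirmed for this cluster), and `wal_segment_name` is a hypothetical helper, not part of PostgreSQL or repmgr.

```python
# Minimal sketch: map (timeline, LSN) to a WAL segment file name.
# Assumption: default 16 MB segment size; wal_segment_name is a
# hypothetical helper written for illustration only.

DEFAULT_SEGMENT_SIZE = 16 * 1024 * 1024


def wal_segment_name(timeline, lsn, segment_size=DEFAULT_SEGMENT_SIZE):
    """Return the 24-character WAL file name containing the given LSN."""
    high_hex, low_hex = lsn.split("/")
    # An LSN is a 64-bit byte position written as two 32-bit hex halves.
    position = (int(high_hex, 16) << 32) | int(low_hex, 16)
    segment_no = position // segment_size
    # Number of segments that fit into one 4 GB "log id".
    segments_per_id = 0x100000000 // segment_size
    return "%08X%08X%08X" % (
        timeline,
        segment_no // segments_per_id,
        segment_no % segments_per_id,
    )


# Record that pg_rewind could not find:
print(wal_segment_name(1, "6/940446D0"))  # 000000010000000600000094
# Divergence point reported by pg_rewind:
print(wal_segment_name(1, "6/95000000"))  # 000000010000000600000095
```

Under that assumption, the record pg_rewind wants lives in 000000010000000600000094, one segment older than the oldest regular segment in the listing above (000000010000000600000095), which would explain the "No such file or directory" error.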

Any explanation and help to make this work would be really appreciated.
Thanks in advance,
Sot.

Wolf Schwurack

Feb 12, 2021, 1:53:36 PM
to rep...@googlegroups.com
Do you have WAL archiving set up in PostgreSQL? If so, check whether the WAL file got moved to the archive, copy it back into the pg_wal directory, and then retry the command.

Wolf


Sotiris Tsimbonis

Feb 12, 2021, 2:12:08 PM
to repmgr
Hi, thanks for your answer.

My archive setup is what the documentation says (https://repmgr.org/docs/current/quickstart-postgresql-configuration.html):
    archive_mode = on
    archive_command = '/bin/true'
 
Sot.

Wolf Schwurack

Feb 12, 2021, 2:27:59 PM
to rep...@googlegroups.com
It should be set to something like this:
    archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'
That way you have a copy of the missing WAL file. Otherwise, it should be in your daily backups.
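
As a rough illustration of what that kind of archive_command does (refuse to overwrite an already archived segment, otherwise copy it), here is the same logic as a small Python script. This is only a sketch: `archive_wal_segment` and the script itself are hypothetical, the archive directory is whatever you pass in, and like the plain `cp` variant it does not protect against a partially written copy.

```python
# Minimal sketch of a "copy only if not already archived" WAL archiver.
# Assumption: archive_wal_segment and this script are hypothetical; they
# mirror the "test ! -f ... && cp ..." pattern, nothing more.
import shutil
import sys
from pathlib import Path


def archive_wal_segment(source_path, archive_dir):
    """Copy one WAL segment into archive_dir.

    Returns True if the file was copied, False if a file with the same
    name already exists in the archive (it is never overwritten).
    """
    source = Path(source_path)
    target = Path(archive_dir) / source.name
    if target.exists():
        return False
    shutil.copy2(source, target)
    return True


if __name__ == "__main__" and len(sys.argv) == 3:
    # PostgreSQL treats a non-zero exit status as "archiving failed"
    # and keeps the segment in pg_wal until a later attempt succeeds.
    copied = archive_wal_segment(sys.argv[1], sys.argv[2])
    sys.exit(0 if copied else 1)
```

If saved under a hypothetical path such as /usr/local/bin/archive_wal.py, it could be wired in as archive_command = 'python3 /usr/local/bin/archive_wal.py %p /mnt/server/archivedir'. With either variant, segments that pg_rewind later asks for can be copied back from the archive directory instead of being lost once PostgreSQL recycles them.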
