switchover working only in one way

245 views
Skip to first unread message

Sandro Bordacchini

unread,
Dec 15, 2021, 10:04:25 AM12/15/21
to repmgr
Hello all.
I am trying to configure a 3-node cluster (primary, standby, witness) with repmgr.
I am facing a weird situation where standby switchover works on a standby, but after i cannot revert doing another switchover... :

I mean:
postgres@postgres01:~$ repmgr cluster show
 ID | Name            | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-----------------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------
 1  | PG-Node1        | standby |   running | PG-Node2 | default  | 100      | 2        | host=172.16.50.16 user=repmgr dbname=repmgr connect_timeout=2
 2  | PG-Node2        | primary | * running |          | default  | 100      | 2        | host=172.16.50.17 user=repmgr dbname=repmgr connect_timeout=2
 3  | PG-Node-Witness | witness | * running | PG-Node2 | default  | 0        | n/a      | host=172.16.50.15 user=repmgr dbname=repmgr connect_timeout=2

postgres@postgres01:~$ repmgr standby switchover --log-level=DEBUG --verbose --siblings-follow
INFO: looking for configuration file in /etc
[ ... log available if needed ... ]
NOTICE: STANDBY SWITCHOVER has completed successfully

postgres@postgres01:~$ repmgr cluster show
 ID | Name            | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-----------------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------
 1  | PG-Node1        | primary | * running |          | default  | 100      | 3        | host=172.16.50.16 user=repmgr dbname=repmgr connect_timeout=2
 2  | PG-Node2        | standby |   running | PG-Node1 | default  | 100      | 2        | host=172.16.50.17 user=repmgr dbname=repmgr connect_timeout=2
 3  | PG-Node-Witness | witness | * running | PG-Node1 | hel1-dc2 | 0        | n/a      | host=172.16.50.15 user=repmgr dbname=repmgr connect_timeout=2

But, after this if i login on postgres02 (now standby), and i issue the same command
postgres@postgres02:~$ repmgr standby switchover --log-level=DEBUG --verbose --siblings-follow
INFO: looking for configuration file in /etc
INFO: configuration file found at: "/etc/repmgr.conf"
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=172.16.50.17 fallback_application_name=repmgr options=-csearch_path="
DEBUG: set_config():
  SET synchronous_commit TO 'local'
DEBUG: get_node_record():
  SELECT n.node_id, n.type, n.upstream_node_id, n.node_name,  n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached   FROM repmgr.nodes n  WHERE n.node_id = 2
NOTICE: executing switchover on node "PG-Node2" (ID: 2)
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
INFO: searching for primary node
DEBUG: get_primary_connection():
  SELECT node_id, conninfo,          CASE WHEN type = 'primary' THEN 1 ELSE 2 END AS type_priority         FROM repmgr.nodes    WHERE active IS TRUE      AND type != 'witness' ORDER BY active DESC, type_priority, priority, node_id
INFO: checking if node 1 is primary
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=172.16.50.16 fallback_application_name=repmgr options=-csearch_path="
DEBUG: set_config():
  SET synchronous_commit TO 'local'
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
INFO: current primary node is 1
DEBUG: get_node_record():
  SELECT n.node_id, n.type, n.upstream_node_id, n.node_name,  n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached   FROM repmgr.nodes n  WHERE n.node_id = 1
DEBUG: remote node name is "PG-Node1"
DEBUG: test_ssh_connection(): executing ssh -o Batchmode=yes -q -o ConnectTimeout=10 172.16.50.16 /bin/true 2>/dev/null
INFO: SSH connection to host "172.16.50.16" succeeded
DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 172.16.50.16 /usr/lib/postgresql/12/bin/repmgr -f /etc/repmgr.conf  -L DEBUG --version >/dev/null 2>&1 && echo "1" || echo "0"
DEBUG: remote_command(): output returned was:
1

DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 172.16.50.16 /usr/lib/postgresql/12/bin/repmgr -f /etc/repmgr.conf  -L DEBUG --version 2>/dev/null
DEBUG: remote_command(): output returned was:
repmgr 5.3

DEBUG: "repmgr" version on "172.16.50.16" is 50300
DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 172.16.50.16 test -f /etc/repmgr.conf && echo 1 || echo 0
DEBUG: remote_command(): output returned was:
1

DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 172.16.50.16 /usr/lib/postgresql/12/bin/repmgr -f /etc/repmgr.conf  -L DEBUG node check --data-directory-config --optformat -LINFO 2>/dev/null
DEBUG: remote_command(): output returned was:
--configured-data-directory=OK

DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 172.16.50.16 /usr/lib/postgresql/12/bin/repmgr -f /etc/repmgr.conf  -L DEBUG node check --replication-config-owner --optformat -LINFO 2>/dev/null
DEBUG: remote_command(): output returned was:
--replication-config-owner=OK

DEBUG: get_node_replication_stats():
 SELECT pg_catalog.current_setting('max_wal_senders')::INT AS max_wal_senders,         (SELECT pg_catalog.count(*) FROM pg_catalog.pg_stat_replication) AS attached_wal_receivers,         current_setting('max_replication_slots')::INT AS max_replication_slots,         (SELECT pg_catalog.count(*) FROM pg_catalog.pg_replication_slots WHERE slot_type='physical') AS total_replication_slots,         (SELECT pg_catalog.count(*) FROM pg_catalog.pg_replication_slots WHERE active IS TRUE AND slot_type='physical')  AS active_replication_slots,         (SELECT pg_catalog.count(*) FROM pg_catalog.pg_replication_slots WHERE active IS FALSE AND slot_type='physical') AS inactive_replication_slots,         pg_catalog.pg_is_in_recovery() AS in_recovery
DEBUG: get_active_sibling_node_records():
  SELECT n.node_id, n.type, n.upstream_node_id, n.node_name,  n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached     FROM repmgr.nodes n    WHERE n.upstream_node_id = 1      AND n.node_id != 2      AND n.active IS TRUE ORDER BY n.node_id
DEBUG: clear_node_info_list() - closing open connections
DEBUG: clear_node_info_list() - unlinking
INFO: 1 active sibling nodes found
DEBUG: test_ssh_connection(): executing ssh -o Batchmode=yes -q -o ConnectTimeout=10 172.16.50.15 /bin/true 2>/dev/null
INFO: all sibling nodes are reachable via SSH
DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 172.16.50.16 /usr/lib/postgresql/12/bin/repmgr -f /etc/repmgr.conf  -L DEBUG node check --remote-node-id=2 --replication-connection
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=172.16.50.16 fallback_application_name=repmgr options=-csearch_path="
DEBUG: remote_command(): output returned was:
--connection=OK

DEBUG: guc_set():
SELECT true FROM pg_catalog.pg_settings  WHERE name = 'archive_mode' AND setting != 'off'
DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 172.16.50.16 /usr/lib/postgresql/12/bin/repmgr -f /etc/repmgr.conf  -L DEBUG node check --terse -LERROR --archive-ready --optformat
DEBUG: remote_command(): output returned was:
--status=OK --files=0

INFO: 0 pending archive files
DEBUG: get_replication_lag_seconds():
 SELECT CASE WHEN (pg_catalog.pg_last_wal_receive_lsn() = pg_catalog.pg_last_wal_replay_lsn())           THEN 0         ELSE EXTRACT(epoch FROM (pg_catalog.clock_timestamp() - pg_catalog.pg_last_xact_replay_timestamp()))::INT           END         AS lag_seconds
DEBUG: lag is 0
INFO: replication lag on this standby is 0 seconds
DEBUG: get_all_node_records():
  SELECT n.node_id, n.type, n.upstream_node_id, n.node_name,  n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached     FROM repmgr.nodes n ORDER BY n.node_id
DEBUG: clear_node_info_list() - closing open connections
DEBUG: clear_node_info_list() - unlinking
NOTICE: attempting to pause repmgrd on 3 nodes
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=172.16.50.16 fallback_application_name=repmgr options=-csearch_path="
DEBUG: set_config():
  SET synchronous_commit TO 'local'
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=172.16.50.17 fallback_application_name=repmgr options=-csearch_path="
DEBUG: set_config():
  SET synchronous_commit TO 'local'
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=172.16.50.15 fallback_application_name=repmgr options=-csearch_path="
DEBUG: set_config():
  SET synchronous_commit TO 'local'
NOTICE: local node "PG-Node2" (ID: 2) will be promoted to primary; current primary "PG-Node1" (ID: 1) will be demoted to standby
NOTICE: stopping current primary node "PG-Node1" (ID: 1)
DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 172.16.50.16 /usr/lib/postgresql/12/bin/repmgr -f /etc/repmgr.conf  -L DEBUG node service --action=stop --checkpoint
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=172.16.50.16 fallback_application_name=repmgr options=-csearch_path="
NOTICE: issuing CHECKPOINT on node "PG-Node1" (ID: 1)
DETAIL: executing server command "sudo /usr/bin/pg_ctlcluster 12 main stop"
DEBUG: remote_command(): no output returned
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_NO_RESPONSE
DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 172.16.50.16 /usr/lib/postgresql/12/bin/repmgr -f /etc/repmgr.conf  -L DEBUG node status --is-shutdown-cleanly
DEBUG: remote_command(): output returned was:
--state=SHUTDOWN --last-checkpoint-lsn=0/14000028

DEBUG: remote node status is: SHUTDOWN
NOTICE: current primary has been cleanly shut down at location 0/14000028
DEBUG: get_replication_info():
 SELECT ts,         in_recovery,         last_wal_receive_lsn,         last_wal_replay_lsn,         last_xact_replay_timestamp,         CASE WHEN (last_wal_receive_lsn = last_wal_replay_lsn)           THEN 0::INT         ELSE           CASE WHEN last_xact_replay_timestamp IS NULL             THEN 0::INT           ELSE             EXTRACT(epoch FROM (pg_catalog.clock_timestamp() - last_xact_replay_timestamp))::INT           END         END AS replication_lag_time,         last_wal_receive_lsn >= last_wal_replay_lsn AS receiving_streamed_wal,         wal_replay_paused,         upstream_last_seen,         upstream_node_id    FROM (  SELECT CURRENT_TIMESTAMP AS ts,         pg_catalog.pg_is_in_recovery() AS in_recovery,         pg_catalog.pg_last_xact_replay_timestamp() AS last_xact_replay_timestamp,         COALESCE(pg_catalog.pg_last_wal_receive_lsn(), '0/0'::PG_LSN) AS last_wal_receive_lsn,         COALESCE(pg_catalog.pg_last_wal_replay_lsn(),  '0/0'::PG_LSN) AS last_wal_replay_lsn,         CASE WHEN pg_catalog.pg_is_in_recovery() IS FALSE           THEN FALSE           ELSE pg_catalog.pg_is_wal_replay_paused()         END AS wal_replay_paused,         CASE WHEN pg_catalog.pg_is_in_recovery() IS FALSE           THEN -1           ELSE repmgr.get_upstream_last_seen()         END AS upstream_last_seen,         CASE WHEN pg_catalog.pg_is_in_recovery() IS FALSE           THEN -1           ELSE repmgr.get_upstream_node_id()         END AS upstream_node_id           ) q
NOTICE: waiting up to 30 seconds (parameter "wal_receive_check_timeout") for received WAL to flush to disk
INFO: sleeping 1 of maximum 30 seconds waiting for standby to flush received WAL to disk
[  .... repeats ... ]
INFO: sleeping 30 of maximum 30 seconds waiting for standby to flush received WAL to disk
WARNING: local node "PG-Node2" is behind shutdown primary "PG-Node1"
DETAIL: local node last receive LSN is 0/13024000, primary shutdown checkpoint LSN is 0/14000028
NOTICE: aborting switchover
HINT: use --always-promote to force promotion of standby


At this point i am left with the primary down.
I have the same postgresql.conf and pg_hba.conf on both server.
repmgr.conf differs only for node_id, node_name, conninfo and location... other parameters are:
data_directory='/var/lib/postgresql/12/main'
failover='automatic'
promote_command='/usr/lib/postgresql/12/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file'
follow_command='/usr/lib/postgresql/12/bin/repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n'
monitor_interval_secs=2
connection_check_type='query'
reconnect_attempts=4
reconnect_interval=8
primary_visibility_consensus=true
monitoring_history=yes
log_status_interval=60
repmgrd_service_start_command='sudo /usr/bin/systemctl start repmgrd.service'
repmgrd_service_stop_command='sudo /usr/bin/systemctl stop repmgrd.service'
service_start_command='sudo /usr/bin/pg_ctlcluster 12 main start'
service_stop_command='sudo /usr/bin/pg_ctlcluster 12 main stop'
service_restart_command='sudo /usr/bin/pg_ctlcluster 12 main restart'
service_reload_command='sudo /usr/bin/pg_ctlcluster 12 main reload'
log_file='/var/log/repmgr/repmgrd.log'
pg_bindir='/usr/lib/postgresql/12/bin/'
ssh_options='-q -o ConnectTimeout=10'


I think i am missing something quite substantial here... can anyone point me in the right direction with an hint?
I am on a debian 10, postgresql 12.9-1.pgdg100+1 and repmgr 5.3.0-1.pgdg100+1

Thanks.

Axel Rau

unread,
Dec 15, 2021, 4:12:52 PM12/15/21
to rep...@googlegroups.com


> Am 15.12.2021 um 16:04 schrieb 'Sandro Bordacchini' via repmgr <rep...@googlegroups.com>:
>
>
> But, after this if i login on postgres02 (now standby), and i issue the same command
> postgres@postgres02:~$ repmgr standby switchover --log-level=DEBUG --verbose --siblings-follow
Try repmgr --force standby switchover

Axel
---
PGP-Key: CDE74120 ☀ computing @ chaos claudius

signature.asc

Sandro Bordacchini

unread,
Dec 16, 2021, 4:50:22 AM12/16/21
to repmgr
Hi Axel.
Thank you for your reply.

I fixed the problem adding
wal_keep_segments = 10
to postgres configuration and also realizing that before the 2nd switchover, primary and standby were on a different timeline (primary was on 3, standby was on 2).
So, before doing the switchover i issued on the standby a
repmgr standby follow
to bring both nodes on the same timeline
I still have to figure out which one of these two actions was the one to fix things :-)

S.
Reply all
Reply to author
Forward
0 new messages