Hi Team,
repmgr version - 5.2.1
PostgreSQL version - 11.7
We run multiple repmgrd processes on the witness node, one for each database instance on the other nodes. I think the problem is that when repmgr checks upstream_node_id and upstream_last_seen, it checks them for another instance rather than the instance that went down, so both functions returned wrong results.
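To make the suspected mechanism concrete, here is a toy model in Python. It is only a sketch of my hypothesis, not repmgr's actual implementation: the `SharedState` class, its method names, and the idea that a second monitored cluster also uses node ID 1 for its primary are all assumptions for illustration.

```python
class SharedState:
    """Toy stand-in for a single 'last seen' slot shared by every
    repmgrd process on the witness (hypothetical, not repmgr code)."""

    def __init__(self):
        self.upstream_node_id = None
        self.upstream_last_seen = None

    def set_upstream_last_seen(self, node_id, now):
        # Any monitor that reaches its own primary overwrites the slot.
        self.upstream_node_id = node_id
        self.upstream_last_seen = now

    def get_upstream_last_seen(self, now):
        # Seconds since *any* monitor last recorded a successful check.
        return now - self.upstream_last_seen


shared = SharedState()

# Monitor A: its primary (node 1) was last reachable at t=0, then went down.
shared.set_upstream_last_seen(1, now=0)

# Monitor B: watches a different, healthy primary (assumed to also be
# node ID 1) and keeps refreshing the same slot every 2 seconds.
for t in range(2, 90, 2):
    shared.set_upstream_last_seen(1, now=t)

# At t=88 a sibling asks the witness when it last saw "node 1".
age = shared.get_upstream_last_seen(now=88)
print(age)  # 0, although monitor A has not seen its primary for 88 seconds
```

If something like this is happening, the witness would answer "0 seconds ago" for a primary that its own repmgrd process can no longer reach.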
Scenario and logs:
The primary database was down for 15 minutes, but no failover took place. The logs show the following:
[2022-03-05 21:31:38] [INFO] checking state of sibling node "a" (ID: 3)
[2022-03-05 21:31:38] [DEBUG] connecting to: "user=pgrepmgr connect_timeout=3 dbname=pgrepmgr host=a port=5432 application_name=repmgrd sslmode=require fallback_application_name=repmgr options=-csearch_path="
[2022-03-05 21:31:39] [INFO] node "a" (ID: 3) reports its upstream is node 1, last seen 88 second(s) ago
[2022-03-05 21:31:39] [INFO] standby node "a" (ID: 3) last saw primary node 88 second(s) ago
[2022-03-05 21:31:39] [INFO] last receive LSN for sibling node "a" (ID: 3) is: 225/9800F858
[2022-03-05 21:31:39] [INFO] node "a" (ID: 3) has same LSN as current candidate "b" (ID: 2)
[2022-03-05 21:31:39] [INFO] checking state of sibling node "witness" (ID: 101)
[2022-03-05 21:31:39] [DEBUG] connecting to: "user=pgrepmgr connect_timeout=3 dbname= pgrepmgr host=witness port=5432 application_name=repmgrd sslmode=require fallback_application_name=repmgr options=-csearch_path="
[2022-03-05 21:31:39] [INFO] node "witness" (ID: 101) reports its upstream is node 1, last seen 0 second(s) ago
[2022-03-05 21:31:39] [NOTICE] witness node "witness" (ID: 101) last saw primary node 0 second(s) ago, considering primary still visible
[2022-03-05 21:31:39] [DEBUG] node 101 is witness, not querying state
[2022-03-05 21:31:39] [INFO] 1 nodes can see the primary
[2022-03-05 21:31:39] [DETAIL] following nodes can see the primary:
- node "witness" (ID: 101): 0 second(s) ago
The primary was down, but the witness reports that it saw the primary 0 seconds ago.
The logs on the witness node also show that it was unable to connect to the primary:
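For anyone who wants to compare what each sibling reported across a longer log, a small Python helper along these lines may be useful. It is a sketch that assumes the log lines have exactly the "reports its upstream is node ..." format shown in the excerpt; the function name `last_seen_reports` is my own.

```python
import re

# Matches lines such as:
#   node "a" (ID: 3) reports its upstream is node 1, last seen 88 second(s) ago
PATTERN = re.compile(
    r'node "(?P<name>[^"]+)" \(ID: (?P<id>\d+)\) reports its upstream is node '
    r'(?P<upstream>\d+), last seen (?P<age>\d+) second\(s\) ago'
)


def last_seen_reports(lines):
    """Return {node_name: (upstream_node_id, seconds_since_last_seen)}."""
    reports = {}
    for line in lines:
        match = PATTERN.search(line)
        if match:
            reports[match.group("name")] = (
                int(match.group("upstream")),
                int(match.group("age")),
            )
    return reports


sample = [
    '[2022-03-05 21:31:39] [INFO] node "a" (ID: 3) reports its upstream is node 1, last seen 88 second(s) ago',
    '[2022-03-05 21:31:39] [INFO] node "witness" (ID: 101) reports its upstream is node 1, last seen 0 second(s) ago',
]
reports = last_seen_reports(sample)
print(reports)
```

On the two lines from the excerpt, this yields 88 seconds for node "a" and 0 seconds for the witness, which is the mismatch described here.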
[2022-03-05 21:30:57] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2022-03-05 21:30:57] [WARNING] unable to reconnect to node 1 after 6 attempts
[2022-03-05 21:31:36] [WARNING] new primary "b" (node ID: 2) is in recovery
[2022-03-05 21:31:39] [WARNING] unable to connect to "host=primary port=5432 sslmode=require dbname=pgrepmgr user=pgrepmgr connect_timeout=3 application_name=repmgrd"
[2022-03-05 21:31:39] [DETAIL]
timeout expired
[2022-03-05 21:32:33] [ERROR] unable to determine if server is in recovery
[2022-03-05 21:32:33] [DETAIL]
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.