No Auto-Failover when primary instance is down

Nikhil Shetty

Mar 14, 2022, 10:31:10 AM
to repmgr
Hi Team,

repmgr version - 5.2.1
Postgresql version - 11.7

We run multiple repmgrd processes on the witness node, one for each database instance hosted on the other nodes. I think the problem is that when repmgr checks upstream_node_id and upstream_last_seen, the values come from a different instance rather than the one that went down, so both functions returned wrong results.

Scenario and logs:
The primary database was down for 15 minutes, but there was no failover. When checking the logs we see the following:

[2022-03-05 21:31:38] [INFO] checking state of sibling node "a" (ID: 3)
[2022-03-05 21:31:38] [DEBUG] connecting to: "user=pgrepmgr connect_timeout=3 dbname=pgrepmgr host=a port=5432 application_name=repmgrd sslmode=require fallback_application_name=repmgr options=-csearch_path="
[2022-03-05 21:31:39] [INFO] node "a" (ID: 3) reports its upstream is node 1, last seen 88 second(s) ago
[2022-03-05 21:31:39] [INFO] standby node "a" (ID: 3) last saw primary node 88 second(s) ago
[2022-03-05 21:31:39] [INFO] last receive LSN for sibling node "a" (ID: 3) is: 225/9800F858
[2022-03-05 21:31:39] [INFO] node "a" (ID: 3) has same LSN as current candidate "b" (ID: 2)
[2022-03-05 21:31:39] [INFO] checking state of sibling node "witness" (ID: 101)
[2022-03-05 21:31:39] [DEBUG] connecting to: "user=pgrepmgr connect_timeout=3 dbname= pgrepmgr host=witness port=5432 application_name=repmgrd sslmode=require fallback_application_name=repmgr options=-csearch_path="
[2022-03-05 21:31:39] [INFO] node "witness" (ID: 101) reports its upstream is node 1, last seen 0 second(s) ago
[2022-03-05 21:31:39] [NOTICE] witness node "witness" (ID: 101) last saw primary node 0 second(s) ago, considering primary still visible
[2022-03-05 21:31:39] [DEBUG] node 101 is witness, not querying state
[2022-03-05 21:31:39] [INFO] 1 nodes can see the primary
[2022-03-05 21:31:39] [DETAIL] following nodes can see the primary:
 - node "witness" (ID: 101): 0 second(s) ago

The primary was down, but the witness reports that it saw the primary 0 seconds ago.
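
To make the discrepancy easier to spot in longer logs, here is a minimal Python sketch that pulls the "reports its upstream is node N, last seen S second(s) ago" lines out of a repmgrd log and flags any node whose value is far lower than the highest one reported. It is not part of repmgr: the function names and the 10-second threshold are my own assumptions, and the regular expression is based only on the log format shown above.

```python
import re

# Matches the repmgrd INFO line format seen in the log excerpt above.
REPORT_RE = re.compile(
    r'node "(?P<name>[^"]+)" \(ID: (?P<node_id>\d+)\) reports its upstream is '
    r'node (?P<upstream>\d+), last seen (?P<seconds>\d+) second\(s\) ago'
)


def parse_reports(log_text):
    """Return one dict per 'reports its upstream' line found in log_text."""
    reports = []
    for match in REPORT_RE.finditer(log_text):
        reports.append({
            "name": match.group("name"),
            "node_id": int(match.group("node_id")),
            "upstream": int(match.group("upstream")),
            "seconds": int(match.group("seconds")),
        })
    return reports


def suspicious_reports(reports, threshold=10):
    """Hypothetical check: flag nodes whose last-seen value is at least
    `threshold` seconds lower than the highest value reported by any node."""
    if not reports:
        return []
    highest = max(r["seconds"] for r in reports)
    return [r for r in reports if highest - r["seconds"] >= threshold]


# Two lines copied from the log excerpt above.
SAMPLE = (
    '[2022-03-05 21:31:39] [INFO] node "a" (ID: 3) reports its upstream is '
    'node 1, last seen 88 second(s) ago\n'
    '[2022-03-05 21:31:39] [INFO] node "witness" (ID: 101) reports its '
    'upstream is node 1, last seen 0 second(s) ago\n'
)

if __name__ == "__main__":
    for report in suspicious_reports(parse_reports(SAMPLE)):
        print(report["name"], report["node_id"], report["seconds"])
```

Run against the two sample lines, this flags only the witness (ID: 101), which is the node I suspect of reporting a value that belongs to another instance.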

The logs on the witness node also show that it was not able to connect to the primary:

[2022-03-05 21:30:57] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2022-03-05 21:30:57] [WARNING] unable to reconnect to node 1 after 6 attempts
[2022-03-05 21:31:36] [WARNING] new primary "b" (node ID: 2) is in recovery
[2022-03-05 21:31:39] [WARNING] unable to connect to "host=primary port=5432 sslmode=require dbname=pgrepmgr user=pgrepmgr connect_timeout=3 application_name=repmgrd"
[2022-03-05 21:31:39] [DETAIL] timeout expired
[2022-03-05 21:32:33] [ERROR] unable to determine if server is in recovery
[2022-03-05 21:32:33] [DETAIL] server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.

