"node is not attached" but it IS!

Stefan Lisowski

Apr 14, 2021, 3:41:18 PM
to repmgr
$ repmgr cluster show
 ID | Name        | Role    | Status    | Upstream      | Location | Priority | Timeline | Connection string                                        
----+-------------+---------+-----------+---------------+----------+----------+----------+-----------------------------------------------------------
 1  | pp1         | standby |   running | ! walreceiver | default  | 100      | 2        | host=pp1 user=repadmin dbname=repmgr connect_timeout=2
 2  | walreceiver | primary | * running |               | default  | 100      | 2        | host=pp2 user=repadmin dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - node "pp1" (ID: 1) is not attached to its upstream node "walreceiver" (ID: 2)

However, if I look at row counts, the data is replicating, and the following shows the state is "streaming"...

postgres=# select * from pg_stat_replication; 
-[ RECORD 1 ]----+-----------------------------
pid              | 20691
usesysid         | 18520
usename          | repadmin
application_name | walreceiver
client_addr      | 130.x.x.x
client_hostname  | 
client_port      | 55776
backend_start    | 2021-04-13 01:50:00.84162+00
backend_xmin     | 
state            | streaming
sent_lsn         | 47/C25E6730
write_lsn        | 47/C25E6730
flush_lsn        | 47/C25E6730
replay_lsn       | 47/C25E6730
write_lag        | 00:00:00.010946
flush_lag        | 00:00:00.011089
replay_lag       | 00:00:00.011296
sync_priority    | 0
sync_state       | async

postgres=# select * from pg_stat_wal_receiver;
-[ RECORD 1 ]---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pid                   | 22463
status                | streaming
receive_start_lsn     | 47/91000000
receive_start_tli     | 2
received_lsn          | 47/C25A18C8
received_tli          | 2
last_msg_send_time    | 2021-04-14 19:21:35.277112+00
last_msg_receipt_time | 2021-04-14 19:21:35.283151+00
latest_end_lsn        | 47/C25A18C8
latest_end_time       | 2021-04-14 19:21:35.277112+00
slot_name             | 
conninfo              | user=repadmin passfile=/var/lib/pgsql/.pgpass dbname=replication host=pp2 port=5432 fallback_application_name=walreceiver sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any

So does this mean my /etc/repmgr/10/repmgr.conf is improperly configured, or is it an issue with my postgresql.conf?

On another server pair I have the same symptom, and it causes trouble here:
[postgres@pp1 data]$ repmgr standby switchover --dry-run 
NOTICE: checking switchover on node "pp1" (ID: 1) in --dry-run mode
WARNING: no superuser connection available
DETAIL: it is recommended to perform switchover operations with a database superuser
HINT: provide the name of a superuser with -S/--superuser
WARNING: node "pp1" not found in "pg_stat_replication"
ERROR: local node "pp1" (ID: 1) is not attached to demotion candidate "walreceiver" (ID: 2)

(Right now pp2, aka walreceiver, is the primary and pp1 is the standby; in pg_stat_replication, the IP address listed is the standby's IP. In both of these cases I set up streaming replication and then failed over from pp1 to pp2 with repmgr standby switchover --always-promote... that did promote pp2, but streaming back to pp1 didn't start, so I used pg_basebackup -R to get streaming going pp2 -> pp1.)

I've been reading many man pages, but feel free to point me to more documentation if you feel I need to do some more reading before asking these questions.

Ian Barwick

Apr 18, 2021, 8:48:51 PM
to rep...@googlegroups.com, Stefan Lisowski
On 15/04/2021 04:41, Stefan Lisowski wrote:
> $ repmgr cluster show
>  ID | Name        | Role    | Status    | Upstream      | Location | Priority | Timeline | Connection string
> ----+-------------+---------+-----------+---------------+----------+----------+----------+-----------------------------------------------------------
>  1  | pp1         | standby |   running | ! walreceiver | default  | 100      | 2        | host=pp1 user=repadmin dbname=repmgr connect_timeout=2
>  2  | walreceiver | primary | * running |               | default  | 100      | 2        | host=pp2 user=repadmin dbname=repmgr connect_timeout=2
>
> WARNING: following issues were detected
>   - node "pp1" (ID: 1) is not attached to its upstream node "walreceiver" (ID: 2)
>
> However, if I look at row counts, the data is replicating, and the following shows the state is "streaming"...
>
> postgres=# select * from pg_stat_replication;
> [...]
> application_name | walreceiver
> [...]
> state            | streaming
> [...]
>
> postgres=# select * from pg_stat_wal_receiver;
> [...]
> conninfo              | user=repadmin passfile=/var/lib/pgsql/.pgpass dbname=replication host=pp2 port=5432 fallback_application_name=walreceiver sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any
>
> So does this mean my /etc/repmgr/10/repmgr.conf is improperly configured, or is it an issue with my postgresql.conf?
>
> [...]

Some background here - in PostgreSQL there's no definitive way to identify a
streaming replication child node. repmgr relies on the "application_name" connection
parameter matching the "node_name" set in repmgr.conf, and sets this accordingly
when it generates the replication configuration for the standby node.

From the above it looks like the standby (repmgr "node_name" set to "pp1")
is still sending its "application_name" as "walreceiver", which is the PostgreSQL
default if "application_name" is not explicitly set.

It's possible the standby node's "recovery.conf" was correctly configured by
repmgr, but the standby node has not yet been restarted, so the changes have not
taken effect.
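
(A quick way to check that, assuming the standard PGDG layout with the data directory
at /var/lib/pgsql/10/data - adjust the path if yours differs - is to compare what is
on disk with what the running walreceiver is using:

$ grep application_name /var/lib/pgsql/10/data/recovery.conf
postgres=# SELECT conninfo FROM pg_stat_wal_receiver;

If the file already contains "application_name=pp1" but the running conninfo does not,
restarting the standby, e.g. "pg_ctl restart -D /var/lib/pgsql/10/data", should be all
that's needed.)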

If that's not the case, you can have repmgr regenerate the standby node's "recovery.conf"
with:

repmgr standby clone -d '$upstream_conninfo' --replication-conf-only --force

which will generate the appropriate "recovery.conf" file. Replace '$upstream_conninfo'
with the primary's connection string.
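
In your case, judging by the conninfo shown in "repmgr cluster show", that would
presumably be something like:

repmgr standby clone -d 'host=pp2 user=repadmin dbname=repmgr connect_timeout=2' --replication-conf-only --force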

Append "--dry-run" to see what it would do without making any changes; example:

$ repmgr standby clone -d '$upstream_conninfo' --replication-conf-only --force --dry-run
NOTICE: destination directory "/tmp/repmgr-test/node_2/data" provided
WARNING: "recovery.conf" would be created in an active data directory
WARNING: the existing "recovery.conf" file would be overwritten
INFO: would create "recovery.conf" file in "/tmp/repmgr-test/node_2/data"
DETAIL:
standby_mode = 'on'
primary_conninfo = 'user=repmgr host=localhost port=6001 connect_timeout=2 fallback_application_name=repmgr options=''-csearch_path='' application_name=node2'
recovery_target_timeline = 'latest'

Without "--dry-run":

$ repmgr standby clone -d '$upstream_conninfo' --replication-conf-only --force
NOTICE: destination directory "/tmp/repmgr-test/node_2/data" provided
WARNING: creating "recovery.conf" in an active data directory
WARNING: the existing "recovery.conf" file will be overwritten
NOTICE: "recovery.conf" created as "/tmp/repmgr-test/node_2/data/recovery.conf"
HINT: node must be restarted for the new file to take effect

BTW it would be a good idea to change the other node's name from "walreceiver" to something
more specific.
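
Roughly (an untested sketch): update "node_name" in that node's /etc/repmgr/10/repmgr.conf,
re-register it with "repmgr primary register --force" (or "repmgr standby register --force"
if it is a standby at that point), and, if it is a standby, regenerate its replication
configuration as above so the new name is sent as "application_name".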

Regards


Ian Barwick


--
Ian Barwick https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Stefan Lisowski

Apr 27, 2021, 12:31:57 AM
to repmgr
On Sunday, April 18, 2021 at 8:48:51 PM UTC-4 ian.b...@2ndquadrant.com wrote:
> Some background here - in PostgreSQL there's no definitive way to identify a
> streaming replication child node. repmgr relies on the "application_name" connection
> parameter matching the "node_name" set in repmgr.conf, and sets this accordingly
> when it generates the replication configuration for the standby node.


Thank you for your help, Ian!
OK, so node_name is set in repmgr.conf, but where is application_name set? I see I can put it in the conninfo in repmgr.conf, but does it live anywhere else?
 
> From the above it looks like the standby (repmgr "node_name" set to "pp1")
> is still sending its "application_name" as "walreceiver", which is the PostgreSQL
> default if "application_name" is not explicitly set.
>
> It's possible the standby node's "recovery.conf" was correctly configured by
> repmgr, but the standby node has not yet been restarted, so the changes have not
> taken effect.


That didn't seem to be the case.
 
> If that's not the case, you can have repmgr regenerate the standby node's "recovery.conf"
> with:
>
> repmgr standby clone -d '$upstream_conninfo' --replication-conf-only --force
>
> which will generate the appropriate "recovery.conf" file. Replace '$upstream_conninfo'
> with the primary's connection string.
>
> Append "--dry-run" to see what it would do without making any changes; example:
>
> $ repmgr standby clone -d '$upstream_conninfo' --replication-conf-only --force --dry-run

This worked. A relief for sure. But then I later tried:

repmgr node rejoin --force-rewind --verbose -d 'host= pp2 user=repadmin dbname=repmgr connect_timeout=2' 

I saw errors like:
INFO: waiting for node "walreceiver" (ID: 2) to connect to new primary; 46 of max 60 attempts (parameter "node_rejoin_timeout")
DETAIL: checking for record in node "pp1"'s "pg_stat_replication" table where "application_name" is "walreceiver"
WARNING: node "walreceiver" not found in "pg_stat_replication"
WARNING: node "walreceiver" not found in "pg_stat_replication"
WARNING: node "walreceiver" not found in "pg_stat_replication" 

Replication is actually working though; again it is just a matter of the names not matching up for repmgr:

[postgres@pp1 data]$ repmgr node status
WARNING: node "walreceiver" not found in "pg_stat_replication"
Node "pp1":
        PostgreSQL version: 10.10
        Total data size: 1571 MB
        Conninfo: host=pp1 user=repadmin dbname=repmgr connect_timeout=2
        Role: primary
        WAL archiving: off
        Archive command: (none)
        Replication connections: 1 (of maximal 3)
        Replication slots: 0 physical (of maximal 80; 0 missing)
        Replication lag: n/a

WARNING: following issue(s) were detected:
  - 1 of 1 downstream nodes not attached:
    - walreceiver (ID: 2)

HINT: execute "repmgr node check" for more details

So it seems pp1 is expecting a certain name from pp2 ("walreceiver") which pp2 no longer sends, maybe because I've renamed the application in the conninfo to something more meaningful... but where do I change the name that primary pp1 expects?



> BTW it would be a good idea to change the other node's name from "walreceiver" to something
> more specific.


So I think this is my problem, but I don't know how to do this beyond the conninfo in repmgr.conf. Where do I set that to something better -- in this case I'd want it to be pp2, which is now the child node?

Thanks again for your very specific and useful tips.