Re: [codership-team] Failed to open IST listener ...

Alex Yurchenko

Jan 31, 2013, 5:35:25 AM
to codersh...@googlegroups.com
On 2013-01-31 06:13, joshua wrote:
> Hey folks.
>
> I have an XtraDB Cluster in AWS.
>
> Each cluster node is fronted by an Amazon Elastic IP. Thus, the
> *wsrep_cluster_address* and *wsrep_node_address* are represented by
> internet routable addresses, the *ec2_public_ipv4*.
>
> These, in turn, are mapped by Amazon to the nodes' *ec2_private_ipv4*,
> each node's regional LAN address.
>
> This was done to allow nodes in different Amazon regions to communicate
> with each other globally (using SSL communication of course).
>
> The cluster, in both dev and production environments, is working well:
>
> 1. SST instantiation of new nodes works just fine with xtrabackup
>    across regions.
> 2. All nodes are in sync globally.
>
> If, however, a node is taken offline for any reason, upon a restart we
> are seeing:
>
> 30131 3:04:19 [Warning] WSREP: Failed to prepare for incremental state transfer: Failed to open IST listener at ssl://ec2_public_ipv4:4568', asio error 'Cannot assign requested address': 99 (Cannot assign requested address)
>  at galera/src/ist.cpp:prepare():309. IST will be unavailable.
>
> What I believe this means:
>
> 1. The node is trying to listen on its *ec2_public_ipv4:4568*, which is
>    the Elastic IP, not local, and thus the node can't bind to it.
> 2. I want the node to listen on its *ec2_private_ipv4:4568*, which is
>    its local LAN address.
> 3. I want an IST sender to continue to identify this node on its
>    *ec2_public_ipv4:4568* such that IST occurs just as Galera
>    replication does, on the Elastic IPs, transparently NAT'd to the
>    nodes' local LAN IPs.
>
> I thought the *wsrep_sst_receive_address* might provide for this, but in
> my testing it does not appear to. Is there something similar to a
> *wsrep_ist_receive_address* that needs to be set?

Yes, there is:

wsrep_provider_options="ist.recv_addr=<ec2_private_ipv4>"

Yet you still have to set wsrep_sst_receive_address to ec2_public_ipv4.

Regards,
Alex

> Cheers,
> Joshua

--
Alexey Yurchenko,
Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011
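
The errno 99 ("Cannot assign requested address") in the quoted warning can be reproduced outside Galera. Below is a minimal Python sketch, assuming a Linux host with default network settings. `can_bind` is a hypothetical helper name, not part of Galera or MySQL. It tries to bind a TCP socket to a given address, which fails with EADDRNOTAVAIL when the address is not assigned to any local interface, as is the case for an Elastic IP that Amazon NATs to the instance.

```python
import errno
import socket


def can_bind(address, port=0):
    """Return True if a TCP socket can bind to `address` on this host.

    Port 0 lets the kernel pick a free port, so only the address is tested.
    Other errors (for example a port already in use) are re-raised.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((address, port))
        except OSError as exc:
            if exc.errno == errno.EADDRNOTAVAIL:
                # The address is not configured on any local interface.
                return False
            raise
        return True


if __name__ == "__main__":
    # Loopback is always local; replace with the node's own addresses.
    print(can_bind("127.0.0.1"))
```

On an EC2 node one would expect the check to succeed for the private address and fail for the Elastic IP, which matches the diagnosis in the quoted message.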

Joshua Levine

Jan 31, 2013, 12:04:36 PM
to codersh...@googlegroups.com

Perfect! Sorry I missed that. All is working better than expected.

Cheers,
Joshua

Joshua Levine

Jan 31, 2013, 12:31:19 PM
to codersh...@googlegroups.com
Actually, I was wrong. 

Things looked like this:

130131 17:21:31 [Note] WSREP: IST receiver using ssl
130131 17:21:31 [Note] WSREP: Prepared IST receiver, listening at: ssl://<ec2_local_ip>:4568

Which is definite progress. However, even with wsrep_sst_receive_address set to <ec2_external_ip>, on my donor node I am seeing:

130131 17:22:34 [ERROR] WSREP: IST failed: IST sender, failed to connect 'ssl://<ec2_local_ip>:4568': Connection timed out: 110 (Connection timed out) at galera/src/ist.cpp:Sender():628

What I want is for the remote node to see <ec2_external_ip>:4568 while the local node listens on <ec2_local_ip>:4568.

It seems that ist.recv_addr=<ec2_local_ip> works correctly for binding, but the node also announces that same address to remote nodes, which defeats my goal.

Thank you,
Joshua

Alex Yurchenko

Jan 31, 2013, 12:40:23 PM
to codersh...@googlegroups.com
Please post logs from both nodes that include the full joining handshake.

Joshua Levine

Jan 31, 2013, 1:49:55 PM
to codersh...@googlegroups.com
From the receiver:

130131 17:21:29 [Note] WSREP: Flow-control interval: [14, 28]
130131 17:21:29 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 8745)
130131 17:21:29 [Note] WSREP: State transfer required: 
Group state: b973ab86-61b9-11e2-0800-e1666f261698:8745
Local state: b973ab86-61b9-11e2-0800-e1666f261698:8744
130131 17:21:29 [Note] WSREP: New cluster view: global state: b973ab86-61b9-11e2-0800-e1666f261698:8745, view# 43: Primary, number of nodes: 3, my index: 0, protocol version 2
130131 17:21:29 [Warning] WSREP: Gap in state sequence. Need state transfer.
130131 17:21:31 [Note] WSREP: Running: 'wsrep_sst_xtrabackup --role 'joiner' --address '<ec2_external_ip>' --auth 'root:PASSWORDREDACTED' --datadir '/mnt/mysql/data/' --defaults-file '/local/mysql/etc/mysql/my.cnf' --parent '14945''
130131 17:21:31 [Note] WSREP: Prepared SST request: xtrabackup|'<ec2_external_ip>:4444/xtrabackup_sst
130131 17:21:31 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130131 17:21:31 [Note] WSREP: Assign initial position for certification: 8745, protocol version: 2
130131 17:21:31 [Note] WSREP: IST receiver using ssl
130131 17:21:31 [Note] WSREP: Prepared IST receiver, listening at: ssl://'<ec2_local_ip>:4568
130131 17:21:31 [Note] WSREP: Node 0 (<requestor_node>) requested state transfer from '*any*'. Selected 1 (<donor_node>)(SYNCED) as donor.
130131 17:21:31 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 8745)
130131 17:21:31 [Note] WSREP: Requesting state transfer: success, donor: 1
130131 17:21:31 [Note] WSREP: SST complete, seqno: 8744
...
130131 17:21:34 [Note] /usr/sbin/mysqld: ready for connections.
130131 17:22:34 [Warning] WSREP: 1 (<donor_node>): State transfer to 0 (<requestor_node>) failed: -110 (Connection timed out)
130131 17:22:34 [ERROR] WSREP: gcs/src/gcs_group.c:gcs_group_handle_join_msg():712: Will never receive state. Need to abort.
130131 17:22:34 [Note] WSREP: gcomm: terminating thread
130131 17:22:34 [Note] WSREP: gcomm: joining thread
130131 17:22:34 [Note] WSREP: gcomm: closing backend


From the donor:

130131 17:21:29 [Note] WSREP: Flow-control interval: [14, 28]
130131 17:21:29 [Note] WSREP: New cluster view: global state: b973ab86-61b9-11e2-0800-e1666f261698:8745, view# 43: Primary, number of nodes: 3, my index: 1, protocol version 2
130131 17:21:29 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130131 17:21:29 [Note] WSREP: Assign initial position for certification: 8745, protocol version: 2
130131 17:21:31 [Note] WSREP: Node 0 (<requestor_node>) requested state transfer from '*any*'. Selected 1 (<donor_node>)(SYNCED) as donor.
130131 17:21:31 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 8745)
130131 17:21:31 [Note] WSREP: IST request: b973ab86-61b9-11e2-0800-e1666f261698:8744-8745|ssl://<requestor_ec2_local_ip>:4568
130131 17:21:31 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130131 17:21:31 [Note] WSREP: Running: 'wsrep_sst_xtrabackup --role 'donor' --address '/<requestor_ec2_external_ip>:4444/xtrabackup_sst' --auth 'root:PASSWORDREDACTED' --socket '/var/run/mysqld/mysqld.sock' --datadir '/mnt/mysql/data/' --defaults-file '/local/mysql/etc/mysql/my.cnf' --gtid 'b973ab86-61b9-11e2-0800-e1666f261698:8744' --bypass'
130131 17:21:31 [Note] WSREP: sst_donor_thread signaled with 0
130131 17:21:31 [Note] WSREP: IST sender using ssl
130131 17:22:34 [ERROR] WSREP: IST failed: IST sender, failed to connect 'ssl:///<requestor_ec2_local_ip>:4568': Connection timed out: 110 (Connection timed out)
 at galera/src/ist.cpp:Sender():628
130131 17:22:34 [Warning] WSREP: 1 (/<donor_node>): State transfer to 0 (/<requestor_node>) failed: -110 (Connection timed out)
130131 17:22:34 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 8745)


Thank you,
Joshua
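
In the donor log above, the "IST request" line shows which address the joiner advertised for IST. The Python sketch below pulls that address out of such a line. The regular expression is inferred only from the single line format shown in this thread, so it is an assumption and may not match other Galera versions. `parse_ist_request` is a hypothetical helper name.

```python
import re

# Format inferred from the donor log in this thread:
#   IST request: <uuid>:<first seqno>-<last seqno>|<scheme>://<host>:<port>
IST_REQUEST = re.compile(
    r"IST request: (?P<uuid>[0-9a-f-]+):(?P<first>\d+)-(?P<last>\d+)"
    r"\|(?P<scheme>\w+)://(?P<host>[^:\s]+):(?P<port>\d+)"
)


def parse_ist_request(line):
    """Return the advertised IST endpoint from a log line, or None."""
    match = IST_REQUEST.search(line)
    if match is None:
        return None
    return {
        "uuid": match.group("uuid"),
        "first": int(match.group("first")),
        "last": int(match.group("last")),
        "scheme": match.group("scheme"),
        "host": match.group("host"),
        "port": int(match.group("port")),
    }


if __name__ == "__main__":
    # 10.0.0.5 is a made-up stand-in for the redacted local IP.
    sample = ("130131 17:21:31 [Note] WSREP: IST request: "
              "b973ab86-61b9-11e2-0800-e1666f261698:8744-8745|ssl://10.0.0.5:4568")
    print(parse_ist_request(sample))
```

If the host reported here is a private address that the donor cannot route to, the sender's connection attempt would be expected to time out, as it does in the log.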

Joshua Levine

Jan 31, 2013, 9:00:12 PM
to codersh...@googlegroups.com

I am not at all convinced this is a Galera issue.

Between node_address, sst_receive_address, and ist.recv_addr, I think it's down to mapping the names, local IPs, and NAT IPs appropriately.

I will test some more.

Alex Yurchenko

Feb 1, 2013, 4:05:05 AM
to codersh...@googlegroups.com
That too, but currently IST relies on the _DNS name_ resolving
differently on the listener and on the sender (this is proven to work;
it's what people use). To use an IP address, you have to have the global
IP address assigned to a local interface on the listener. I'm not sure
that is possible in EC2.

Joshua Levine

Feb 12, 2013, 8:27:08 PM
to codersh...@googlegroups.com

Just to fill folks in on what I went with...

We have each of:

wsrep_sst_receive_address
wsrep_node_name
wsrep_node_address

all set to the node's FQDN, which DNS answers with the Elastic IP, globally routed (for regional failover) to the address of the node. Ports 3306, 4444, 4567, and 4568 are managed by our Security Groups and local iptables.

We then have a local /etc/hosts entry on any given node which resolves that node's own FQDN to its local/LAN/RFC 1918 address. That is the only local hosts entry besides localhost.

This works such that a local node binds to its local address, ec2_private_ip, but announces itself by an FQDN that resolves to its ec2_public_ip for every other node, ensuring that all cluster communication (writes, SST, IST) works just fine.

With this configuration, we do not need to pass ist.recv_addr in our wsrep_provider_options.

Thank you for your help.

Joshua 
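
The setup described above can be sanity-checked from each node. The Python sketch below assumes the system resolver consults /etc/hosts before DNS; `is_rfc1918` and `local_view` are hypothetical helper names. It resolves a name the way the local host sees it and reports whether the result is an RFC 1918 private address.

```python
import ipaddress
import socket

# The three private IPv4 ranges defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address):
    """Return True if `address` is an RFC 1918 private IPv4 address."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in RFC1918_NETWORKS)


def local_view(fqdn):
    """Resolve `fqdn` as this host sees it; return (address, is_private)."""
    address = socket.gethostbyname(fqdn)
    return address, is_rfc1918(address)


if __name__ == "__main__":
    # Replace with the node's own FQDN when running on a cluster node.
    print(local_view(socket.gethostname()))
```

Run on the node itself, the node's FQDN would be expected to come back as a private address. Run from any other node, the same name should resolve to the Elastic IP.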

Bhaskar Peddireddy

May 5, 2014, 12:09:46 PM
to codersh...@googlegroups.com, joshua...@gmail.com

Hi All,

I have a 3 node galera cluster with

Percona XtraDB Cluster (GPL), wsrep_23.7.5.r3880

mysqld  Ver 5.5.31 for Linux on x86_64 (Percona XtraDB Cluster (GPL), wsrep_23.7.5.r3880)


When I bring down one of the nodes for a couple of minutes of host maintenance and bring it back up, it fails to do IST and does an SST instead, which takes longer.

I have gcache set to 2GB and page size set to 1GB.

Here is the error in the error log:

140505 12:00:30 [Warning] WSREP: Failed to prepare for incremental state transfer: Failed to open IST listener at tcp://10.89.15.6:7302', asio error 'Address already in use': 98 (Address already in use)
         at galera/src/ist.cpp:prepare():313. IST will be unavailable.


Any help is very much appreciated.

Thanks again,
Bhaskar

Alex Yurchenko

May 5, 2014, 7:15:16 PM
to codersh...@googlegroups.com
On 2014-05-05 19:09, Bhaskar Peddireddy wrote:
> Hi All,
>
> I have a 3 node galera cluster with
>
> Percona XtraDB Cluster (GPL), wsrep_23.7.5.r3880
>
> mysqld Ver 5.5.31 for Linux on x86_64 (Percona XtraDB Cluster (GPL),
> wsrep_23.7.5.r3880)
>
>
> When I bring down one of the nodes for couple of min. host maintenance
> and
> bring the node back up, it is failing to do IST and instead it is done
> SST
> and takes longer.
>
> I have gcache set to 2GB and page size set to 1GB
>
> Here is the error in the error log;
>
> 140505 12:00:30 [Warning] WSREP: Failed to prepare for incremental
> state
> transfer: Failed to open IST listener at tcp://10.89.15.6:7302', asio
> error
> 'Address already in use': 98 (Address already in use)

Well, it is what it says: that port is already taken by some other
process. You either need to find and kill that process, or specify a
different IST port (default is replication port + 1) via ist.recv_addr
(http://galeracluster.com/documentation-webpages/galeraparameters.html#ist-recv-addr).
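
A quick way to confirm that diagnosis before restarting the node is to try binding the port yourself. The Python sketch below uses hypothetical helper names (`port_is_free`, `default_ist_port`) and assumes 4567 as the replication port when none is given. It only reports whether the port is taken, not which process holds it.

```python
import errno
import socket


def default_ist_port(replication_port=4567):
    """IST port when ist.recv_addr is not set: replication port + 1."""
    return replication_port + 1


def port_is_free(host, port):
    """Return False if binding host:port fails with 'Address already in use'."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return False
            raise
        return True


if __name__ == "__main__":
    # Address and port taken from the warning quoted above; run on that node.
    print(port_is_free("10.89.15.6", 7302))
```

If the port turns out to be taken, the two options given above apply: stop the process that holds it, or point ist.recv_addr at a different port.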

> at galera/src/ist.cpp:prepare():313. IST will be unavailable.
>
>
> Any help is very much appreciated.
>
> Thanks again,
> Bhaskar
>
>
> On Wednesday, January 30, 2013 11:13:31 PM UTC-5, joshua wrote: