Re: Issue on ubuntu when i reboot the machine Percona cluster fails on node

Jay Janssen

Aug 7, 2012, 1:22:28 PM
to percona-d...@googlegroups.com
Hi Amol,
  I'll try to answer your questions below:

On Aug 7, 2012, at 1:03 PM, amol <ajk...@gmail.com> wrote:

Question1: How can i change the rsync process to use private IP instead of public IP?




         4. Once the sync is completed on node 3, the clustercheck still shows that the node is down and node is not usable as a cluster node
         5. Then i have to issue sudo service mysql stop and then sudo /etc/init.d/mysql start and it says database failed to start but the rsync process starts and after the process is completed node3 becomes a part of the cluster

Question2: How can i change the mysql process to start using /etc/init.d/mysql instead of service mysql start during the boot time?

I think this is a bug.  Feel free to poke around the launchpad project  (https://bugs.launchpad.net/percona-xtradb-cluster/+bugs), and file a bug if one does not exist. 


Question3: if node1 becomes a donor it stops accepting connections, which makes the application unusable. One suggestion is to add +if [ "$WSREP_STATUS" == "4" ] || [ "$WSREP_STATUS" == "2" ] in the cluster check, but doing that, how accurate is the data during the rsync, or should i be using xtrabackup?

The rsync will flush tables and pause replication on the donor node while the rsync is copying.  The xtrabackup method allows for replication to continue during the donation, but it does briefly block on replication right at the end of the data copy.
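
As a rough sketch of the clustercheck change raised in Question 3: the function below treats a node as usable when wsrep_local_state is 4 (Synced) and, optionally, when it is 2 (Donor/Desynced). The function name and the allow-donor flag are hypothetical and not part of the stock clustercheck script. Given the behaviour described above, accepting state 2 probably only makes sense with the xtrabackup method, since an rsync donor is blocked while the copy runs.

```shell
#!/bin/bash
# Hypothetical helper, not part of the stock clustercheck script.
# $1: value of the wsrep_local_state status variable
#     (4 = Synced, 2 = Donor/Desynced)
# $2: "1" to also accept a donor node (reasonable with xtrabackup,
#     not with rsync); defaults to "0"
node_is_usable() {
    local state="$1" allow_donor="${2:-0}"
    if [ "$state" = "4" ]; then
        return 0
    fi
    if [ "$state" = "2" ] && [ "$allow_donor" = "1" ]; then
        return 0
    fi
    return 1
}
```

The state value could be read with something like mysql -N -B -e "SHOW STATUS LIKE 'wsrep_local_state'" and taking the second column.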

Question4: how do i configure the nodes to use incremental to avoid this error?

The most obvious way to configure IST is using the ist.recv_addr in the wsrep_provider_options (http://www.codership.com/wiki/doku.php?id=galera_parameters).  It's not obvious, but IST uses its own port.

A shortcut here is using the undocumented wsrep_node_address setting, which sets the listen, ist, and sst addresses automatically if they are all on the same IP.  
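
For illustration only, the two approaches might look like this in my.cnf. The IP and the IST port are the ones amol uses for node1 elsewhere in this thread; treat them as assumptions and substitute the private address of each node.

```ini
[mysqld]
# Explicit form: set the SST and IST receive addresses separately
wsrep_sst_receive_address=10.1.6.118
wsrep_provider_options="ist.recv_addr=10.1.6.118:4568"

# Shortcut form (use instead of the two lines above):
# wsrep_node_address=10.1.6.118
```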

Hope this helps. 


Jay Janssen, Senior MySQL Consultant, Percona Inc.
Percona Live in NYC Oct 1-2nd: http://www.percona.com/live/nyc-2012/

amol

Aug 8, 2012, 11:39:19 AM
to percona-d...@googlegroups.com
Thanks Jay, your suggestions were indeed helpful, and yes, i will investigate the mysql startup in detail and then file a bug report.
Another thing to note here: could this be happening because the wsrep_urls are in the mysqld_safe section, since the database then has to be started through mysqld_safe?

Can that parameter be moved to the mysqld section? Or is there another variable that i can use, e.g.
wsrep_cluster_address=gcomm://10.1.6.118:4567,gcomm://10.1.3.30:4567,gcomm://10.1.3.101:4567,gcomm://


After your suggestions i noticed that if the db is restarted, the node becomes available very quickly. Is that due to IST taking effect?
Here are my new parameters:
[mysqld_safe]
socket = /var/run/mysqld/mysqld.sock
nice = 0
wsrep_urls      = gcomm://10.1.6.118:4567,gcomm://10.1.3.30:4567,gcomm://10.1.3.101:4567,gcomm://

[mysqld]
#
# * Basic Settings
#
server_id=1
binlog_format=ROW   
wsrep_provider=/usr/lib64/libgalera_smm.so   
wsrep_slave_threads=2 
wsrep_cluster_name=dev_cluster  
wsrep_sst_method=xtrabackup    # changed to xtrabackup from rsync in order to use IST
wsrep_node_name=node1   
innodb_locks_unsafe_for_binlog=1 
innodb_autoinc_lock_mode=2
log_slave_updates
wsrep_replicate_myisam=1
wsrep_sst_receive_address=10.1.6.118  # i believe this should be the private ip of this node?
wsrep_provider_options = "gmcast.listen_addr=tcp://0.0.0.0:4567; ist.recv_addr=10.1.6.118:4568; "  # i believe this should be the private ip of this node?


Thanks

Jay Janssen

Aug 8, 2012, 1:49:56 PM
to percona-d...@googlegroups.com
On Aug 8, 2012, at 11:39 AM, amol <ajk...@gmail.com> wrote:

Thanks Jay, your suggestions were indeed helpful, and yes, i will investigate the mysql startup in detail and then file a bug report.
Another thing to note here: could this be happening because the wsrep_urls are in the mysqld_safe section, since the database then has to be started through mysqld_safe?

It's possible, yes.


Can that parameter be moved to the mysqld section? Or is there another variable that i can use, e.g.


No, you cannot.  wsrep_urls is the only variable that supports multiple gcomm:// addresses, and it's really just a bit of a hack that finds an open port in the list and passes that to the mysqld as the wsrep_cluster_address for you.

Again, I'd defer to either filing a bug or getting involved in any discussion on an existing bug (if one exists).  


After your suggestions i noticed that if the db is restarted, the node becomes available very quickly. Is that due to IST taking effect?

Check the log, it should tell you when IST or SST is happening.  
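
A minimal sketch of that log check: the function below prints the WSREP lines of an error log that mention IST, SST, or a state transfer. The function name is hypothetical; the log path in the usage note is the --log-error path shown elsewhere in this thread. The match is deliberately loose, so it may also catch unrelated WSREP lines that happen to contain those letters.

```shell
#!/bin/bash
# Print the WSREP lines in a MySQL error log that mention IST, SST,
# or a state transfer. $1 is the path to the error log.
# The function name is hypothetical.
state_transfer_lines() {
    grep -E 'WSREP:.*(IST|SST|[Ss]tate transfer)' "$1"
}
```

For example: state_transfer_lines /var/log/mysql/error.log | tail -n 20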

Here are my new parameters:
[mysqld_safe]
socket = /var/run/mysqld/mysqld.sock
nice = 0

[mysqld]
#
# * Basic Settings
#
server_id=1
binlog_format=ROW   
wsrep_provider=/usr/lib64/libgalera_smm.so   
wsrep_slave_threads=2 
wsrep_cluster_name=dev_cluster  
wsrep_sst_method=xtrabackup    # changed to xtrabackup from rsync in order to use IST

This has nothing to do with IST.

wsrep_node_name=node1   
innodb_locks_unsafe_for_binlog=1 
innodb_autoinc_lock_mode=2
log_slave_updates
wsrep_replicate_myisam=1
wsrep_sst_receive_address=10.1.6.118  # i believe this should be the private ip of this node?
wsrep_provider_options = "gmcast.listen_addr=tcp://0.0.0.0:4567; ist.recv_addr=10.1.6.118:4568; "  # i believe this should be the private ip of this node?

What IP you run SST and IST on is up to you and your environment.  

amol

Aug 8, 2012, 2:42:16 PM
to percona-d...@googlegroups.com
Hi Jay, thanks for the answers. Another question, which might be a side track, so let me know if i should open a new thread for it:

we were running some load tests on the entire setup which has (1 haproxy lb + 3 nodes) and i am noticing that after a few connections the scripts stop running with the error "Error connecting to mysql"
and it starts back after a while..
i checked the innotop and did not see any locks or deadlocks in the db node, plus i am just running 1 thread at a time so i don't think it should be using too many connections..
but i wasn't sure whether any of the system user connections is causing the db to lock down?
here is what my process list looks while the load test is running

mysql> show full processlist;
+--------+-------------+--------------------+---------+---------+-------+--------------------+-----------------------+-----------+---------------+-----------+
| Id     | User        | Host               | db      | Command | Time  | State              | Info                  | Rows_sent | Rows_examined | Rows_read |
+--------+-------------+--------------------+---------+---------+-------+--------------------+-----------------------+-----------+---------------+-----------+
|      1 | system user |                    | NULL    | Sleep   | 39005 | wsrep aborter idle | NULL                  |         0 |             0 |         1 |
|      2 | system user |                    | NULL    | Sleep   |  2815 | committed 81728    | NULL                  |         0 |             0 |         1 |
|      3 | system user |                    | NULL    | Sleep   |  2816 | committed 81727    | NULL                  |         0 |             0 |         1 |
| 136444 | user1       | localhost          | NULL    | Query   |     0 | sleeping           | show full processlist |         0 |             0 |         1 |
| 141856 | applusdev   | 10.1.4.6:34993     | demo    | Sleep   |     0 |                    | NULL                  |         0 |             0 |        59 |
| 141869 | applusdev   | 10.1.4.6:35006     | demo    | Sleep   |     0 |                    | NULL                  |         1 |             1 |         1 |
| 141871 | applusdev   | 10.1.4.6:35008     | demo    | Sleep   |     0 |                    | NULL                  |         0 |             0 |         1 |
+--------+-------------+--------------------+---------+---------+-------+--------------------+-----------------------+-----------+---------------+-----------+

I am open to suggestions to debug this issue, as i cannot proceed to production with this issue lingering...

Jay Janssen

Aug 8, 2012, 2:50:37 PM
to percona-d...@googlegroups.com
Amol,
  You should try to check exactly where the test scripts are failing (is it on connect?, is it on a query?, etc.) and, if possible, see if there is a more precise mysql error code associated with the problem.    Are your scripts reconnecting every time they query?

  The system users there seem normal.


amol

Aug 8, 2012, 2:58:45 PM
to percona-d...@googlegroups.com
yes the precise error in the log file is 

PHP Warning:  mysqli::mysqli(): (HY000/2003): Can't connect to MySQL server on '<server_IP>'

and yes the script is using a new connection for every new record it inserts, and closes the connection 

some variables from the db...

mysql> show variables like 'max_connection%';
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| max_connections | 151   |
+-----------------+-------+

mysql> show status like '%connections';
+----------------------+--------+
| Variable_name        | Value  |
+----------------------+--------+
| Connections          | 305747 |
| Max_used_connections | 115    |
+----------------------+--------+

Jay Janssen

Aug 8, 2012, 3:12:42 PM
to percona-d...@googlegroups.com
Is that connecting directly to a cluster node or to a VIP/proxy like HAproxy?


amol

Aug 8, 2012, 3:16:56 PM
to percona-d...@googlegroups.com
connecting to haproxy.. 
and here is the config i used in haproxy to avoid lock conflicts

backend pxc-onenode-back
        mode tcp
        balance leastconn
        option httpchk
        server c2 10.1.3.3:3306 check port 9200 inter 12000 rise 3 fall 3
        server c1 10.1.6.8:3306 check port 9200 inter 12000 rise 3 fall 3 backup
        server c3 10.1.3.1:3306 check port 9200 inter 12000 rise 3 fall 3 backup

and i also tried connecting to server c2 directly, and during the load test i got similar errors...



Jay Janssen

Aug 8, 2012, 3:21:32 PM
to percona-d...@googlegroups.com
On Aug 8, 2012, at 3:16 PM, amol <ajk...@gmail.com> wrote:

connecting to haproxy.. 
and here is the config i used in haproxy to avoid lock conflicts

backend pxc-onenode-back
        mode tcp
        balance leastconn
        option httpchk
        server c2 10.1.3.3:3306 check port 9200 inter 12000 rise 3 fall 3
        server c1 10.1.6.8:3306 check port 9200 inter 12000 rise 3 fall 3 backup
        server c3 10.1.3.1:3306 check port 9200 inter 12000 rise 3 fall 3 backup

and i also tried connecting to server c2 directly, and during the load test i got similar errors…

I'd look at the HAProxy dashboard to see if any error counters are increasing, and likewise on c2, specifically status counters such as Aborted_connects and Aborted_clients (visible with SHOW GLOBAL STATUS LIKE 'Aborted%').

amol

Aug 8, 2012, 3:41:46 PM
to percona-d...@googlegroups.com
where do i check for aborted_connections?

So here are my findings from haproxy (this is from csv output)

run 1
pxc-onenode-back,c2,0,0,0,13,,12693,21212854,541070495,,0,,6,0,25,1,UP,1,1,0,4,1,25,36,,1,6,1,,12668,,2,0,,115,L7OK,200,22,,,,,,,0,,,,0,0,
pxc-onenode-back,c1,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,0,1,3,1,23,36,,1,6,2,,0,,2,0,,0,L7OK,200,33,,,,,,,0,,,,0,0,
pxc-onenode-back,c3,0,0,0,7,,360,157380,383640,,0,,0,0,0,0,UP,1,0,1,4,1,30,40,,1,6,3,,360,,2,0,,97,L7OK,200,66,,,,,,,0,,,,0,0,
pxc-onenode-back,BACKEND,0,0,0,13,0,13442,21370234,541454135,0,0,,421,0,25,1,UP,1,1,2,,1,30,29,,1,6,0,,13028,,1,0,,115,,,,,,,,,,,,,,0,0,

run 2
pxc-onenode-back,c2,0,0,0,13,,17644,23366860,546367867,,0,,8,0,34,1,UP,1,1,0,5,1,240,36,,1,6,1,,17610,,2,0,,132,L7OK,200,23,,,,,,,0,,,,0,0,
pxc-onenode-back,c1,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,0,1,5,1,238,36,,1,6,2,,0,,2,0,,0,L7OK,200,23,,,,,,,0,,,,0,0,
pxc-onenode-back,c3,0,0,0,7,,360,157380,383640,,0,,0,0,0,0,UP,1,0,1,6,1,245,40,,1,6,3,,360,,2,0,,97,L7OK,200,21,,,,,,,0,,,,0,0,
pxc-onenode-back,BACKEND,0,0,0,13,0,18384,23524240,546751507,0,0,,423,0,34,1,UP,1,1,2,,1,245,29,,1,6,0,,17970,,1,0,,132,,,,,,,,,,,,,,0,0,

run 3
pxc-onenode-back,c2,0,0,0,13,,21574,25073755,550547926,,0,,10,0,48,1,UP,1,1,0,7,2,87,73,,1,6,1,,21526,,2,0,,132,L7OK,200,22,,,,,,,0,,,,0,0,
pxc-onenode-back,c1,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,0,1,7,2,97,72,,1,6,2,,0,,2,0,,0,L7OK,200,22,,,,,,,0,,,,0,0,
pxc-onenode-back,c3,0,0,0,10,,1393,606406,1508709,,0,,0,0,0,0,UP,1,0,1,8,1,522,40,,1,6,3,,1393,,2,0,,105,L7OK,200,22,,,,,,,0,,,,0,0,
pxc-onenode-back,BACKEND,0,0,0,13,0,23333,25680161,552056635,0,0,,425,0,48,1,UP,1,1,2,,1,522,29,,1,6,0,,22919,,1,0,,132,,,,,,,,,,,,,,0,0,

So i see an increase in the error count
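
To make the counters in the CSV rows above easier to read, here is a small sketch that pulls out the error-related columns. It assumes the standard HAProxy stats CSV column order (pxname, svname, ..., econ in column 14, eresp in 15, wretr in 16, wredis in 17, status in 18); the function name is hypothetical. Applied to the c2 rows of the three runs, it shows econ going 6, 8, 10 and wretr going 25, 34, 48.

```shell
#!/bin/bash
# Read HAProxy stats CSV on stdin and print the error-related columns.
# Column positions follow the HAProxy stats CSV layout:
#   1=pxname 2=svname 14=econ 15=eresp 16=wretr 17=wredis 18=status
# Header lines (starting with '#') and short lines are skipped.
# The function name is hypothetical.
haproxy_error_summary() {
    awk -F, '$1 !~ /^#/ && NF >= 18 {
        printf "%s/%s econ=%s eresp=%s wretr=%s wredis=%s status=%s\n",
               $1, $2, $14, $15, $16, $17, $18
    }'
}
```

Here econ counts failed connection attempts to the server and wretr counts connection retries, so both rising on c2 points at the proxy having trouble opening connections to that node.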

another thing i noticed is that the haproxy log shows an increase in activity and then flips over from node 2 to node3

Aug  8 15:31:43 localhost haproxy[30255]: 10.1.4.5:40021 [08/Aug/2012:15:31:40.309] pxc-onenode-front pxc-onenode-back/c2 0/3005/3008 500 -- 2/2/2/2/0 0/0
Aug  8 15:32:04 localhost haproxy[30255]: 10.1.4.5:40027 [08/Aug/2012:15:31:51.310] pxc-onenode-front pxc-onenode-back/c2 0/13024/13028 500 -- 6/6/6/6/2 0/0
Aug  8 15:32:06 localhost haproxy[30255]: 10.1.4.5:40011 [08/Aug/2012:15:31:37.222] pxc-onenode-front pxc-onenode-back/c2 0/0/29621 3016 -- 4/4/4/4/0 0/0
Aug  8 15:32:06 localhost haproxy[30255]: 10.1.4.5:40020 [08/Aug/2012:15:31:37.299] pxc-onenode-front pxc-onenode-back/c2 0/3005/29554 765 -- 3/3/3/3/0 0/0
Aug  8 15:32:21 localhost haproxy[30255]: 10.1.4.5:40048 [08/Aug/2012:15:32:21.419] pxc-onenode-front pxc-onenode-back/c2 0/0/4 500 -- 4/4/4/4/0 0/0
Aug  8 15:32:25 localhost haproxy[30255]: 10.1.4.5:40026 [08/Aug/2012:15:31:45.262] pxc-onenode-front pxc-onenode-back/c2 0/3002/40118 3016 -- 2/2/2/2/0 0/0
Aug  8 15:32:37 localhost haproxy[30255]: 10.1.4.5:40029 [08/Aug/2012:15:31:58.300] pxc-onenode-front pxc-onenode-back/c2 15024/3010/39130 3016 -- 2/2/2/2/3 0/0
Aug  8 15:32:42 localhost haproxy[30255]: 10.1.4.5:40035 [08/Aug/2012:15:32:10.359] pxc-onenode-front pxc-onenode-back/c2 0/8015/32100 3016 -- 1/1/1/1/1 0/0
Aug  8 15:32:42 localhost haproxy[30255]: Backup Server pxc-onenode-back/c1 is DOWN, reason: Layer4 timeout, check duration: 12000ms. 1 active and 1 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Aug  8 15:32:51 localhost haproxy[30255]: Server pxc-onenode-back/c2 is DOWN, reason: Layer4 timeout, check duration: 12009ms. 0 active and 1 backup servers left. Running on backup. 1 sessions active, 0 requeued, 0 remaining in queue.
Aug  8 15:32:52 localhost haproxy[30255]: 10.1.4.5:40069 [08/Aug/2012:15:32:52.650] pxc-onenode-front pxc-onenode-back/c3 0/0/46 500 -- 3/3/2/1/0 0/0
Aug  8 15:32:52 localhost haproxy[30255]: 10.1.4.5:40070 [08/Aug/2012:15:32:52.696] pxc-onenode-front pxc-onenode-back/c3 0/0/3 995 -- 2/2/2/1/0 0/0
Aug  8 15:32:53 localhost haproxy[30255]: 10.1.4.5:40086 [08/Aug/2012:15:32:52.871] pxc-onenode-front pxc-onenode-back/c3 0/0/144 1185 -- 4/4/4/3/0 0/0
Aug  8 15:32:53 localhost haproxy[30255]: 10.1.4.5:40085 [08/Aug/2012:15:32:52.869] pxc-onenode-front pxc-onenode-back/c3 0/0/146 2276 -- 3/3/3/2/0 0/0
Aug  8 15:32:53 localhost haproxy[30255]: 10.1.4.5:40068 [08/Aug/2012:15:32:52.579] pxc-onenode-front pxc-onenode-back/c3 0/0/437 3016 -- 2/2/2/1/0 0/0
Aug  8 15:32:53 localhost haproxy[30255]: Backup Server stats-back/c3 is DOWN, reason: Layer4 timeout, check duration: 12000ms. 1 active and 1 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Aug  8 15:32:53 localhost haproxy[30255]: 10.1.4.5:40081 [08/Aug/2012:15:32:52.839] pxc-onenode-front pxc-onenode-back/c3 0/0/187 669 -- 1/1/1/0/0 0/0

amol

Aug 8, 2012, 4:54:09 PM
to percona-d...@googlegroups.com
now when i reboot the node1 the database does not start (even after using /etc/init.d/mysql start)

and i see these errors in mysql/error.log

120808 16:39:19 [Note] WSREP: Flow-control interval: [14, 28]
120808 16:39:19 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 151412)
120808 16:39:19 [Note] WSREP: State transfer required: 
Group state: afc4ea7d-dc5e-11e1-0800-0616c529eebe:151412
Local state: 00000000-0000-0000-0000-000000000000:-1
120808 16:39:19 [Note] WSREP: New cluster view: global state: afc4ea7d-dc5e-11e1-0800-0616c529eebe:151412, view# 31: Primary, number of nodes: 3, my index: 0, protocol version 2
120808 16:39:19 [Warning] WSREP: Gap in state sequence. Need state transfer.
120808 16:39:21 [Note] WSREP: Running: 'wsrep_sst_xtrabackup 'joiner' '10.1.6.8' '' '/var/lib/mysql/' '/etc/mysql/conf.d/mysqld_safe_syslog.cnf' '4411' 2>sst.err'
120808 16:39:21 [Note] WSREP: Prepared SST request: xtrabackup|10.1.6.8:4444/xtrabackup_sst
120808 16:39:21 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
120808 16:39:21 [Note] WSREP: Assign initial position for certification: 151412, protocol version: 2
120808 16:39:21 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (afc4ea7d-dc5e-11e1-0800-0616c529eebe): 1 (Operation not permitted)
at galera/src/replicator_str.cpp:prepare_for_IST():439. IST will be unavailable.
120808 16:39:21 [Note] WSREP: Node 0 (node1) requested state transfer from '*any*'. Selected 1 (node2)(SYNCED) as donor.
120808 16:39:21 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 151412)
120808 16:39:21 [Note] WSREP: Requesting state transfer: success, donor: 1
120808 16:39:27 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup 'joiner' '10.1.6.8' '' '/var/lib/mysql/' '/etc/mysql/conf.d/mysqld_safe_syslog.cnf' '4411' 2>sst.err: 32 (Broken pipe)
120808 16:39:27 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
120808 16:39:27 [ERROR] WSREP: SST failed: 32 (Broken pipe)
120808 16:39:27 [ERROR] Aborting

120808 16:39:27 [Warning] WSREP: 1 (node2): State transfer to 0 (node1) failed: -1 (Operation not permitted)
120808 16:39:27 [ERROR] WSREP: gcs/src/gcs_group.c:gcs_group_handle_join_msg():712: Will never receive state. Need to abort.
120808 16:39:27 [Note] WSREP: gcomm: terminating thread
120808 16:39:27 [Note] WSREP: gcomm: joining thread
120808 16:39:27 [Note] WSREP: gcomm: closing backend
120808 16:39:27 [Note] WSREP: view(view_id(NON_PRIM,20ca3744-e199-11e1-0800-0de247e11b46,31) memb {
20ca3744-e199-11e1-0800-0de247e11b46,
} joined {
} left {
} partitioned {
5ffb372a-e118-11e1-0800-1e749dee7061,
71386e58-e109-11e1-0800-8855542b6c12,
})
120808 16:39:27 [Note] WSREP: view((empty))
120808 16:39:27 [Note] WSREP: gcomm: closed
120808 16:39:27 [Note] WSREP: /usr/sbin/mysqld: Terminated.
Aborted
120808 16:39:27 mysqld_safe mysqld from pid file /var/lib/mysql/dev2-db-upgrade.pid ended


this is after all the changes i did earlier in the day on node 1 my.cnf for IST
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
[mysqld_safe]
socket = /var/run/mysqld/mysqld.sock
nice = 0
wsrep_urls      = gcomm://10.1.6.3:4567,gcomm://10.1.3.1:4567,gcomm://10.1.6.8:4567,gcomm://

[mysqld]
#
# * Basic Settings
#
server_id=1
binlog_format=ROW   
wsrep_provider=/usr/lib64/libgalera_smm.so   
wsrep_slave_threads=2 
wsrep_cluster_name=dev_cluster 
wsrep_sst_method=xtrabackup
wsrep_node_name=node1   
innodb_locks_unsafe_for_binlog=1 
innodb_autoinc_lock_mode=2
log_slave_updates
wsrep_replicate_myisam=1
wsrep_sst_receive_address=10.1.6.8
wsrep_provider_options = "gmcast.listen_addr=tcp://0.0.0.0:4567; ist.recv_addr=10.1.6.8:4568; "
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


On Tuesday, August 7, 2012 1:03:48 PM UTC-4, amol wrote:
Hi this is my first post to the group and i am hoping to find some answers to my questions, i apologize for a long post, but i think if i give you all the details then debugging will be easier..

So here is a detailed description of the issue

Server Version : ubuntu 10.04 LTS
percona version: 5.5.24-55-log Percona XtraDB Cluster (GPL), wsrep_23.6.r341

Configuration details: 3 nodes (node1, node2, node3)
(my.cnf) in node 1 
++++++++++++++++++++++++++++
[mysqld_safe]
socket = /var/run/mysqld/mysqld.sock
nice = 0
wsrep_urls      = gcomm://10.1.6.118:4567,gcomm://10.1.3.30:4567,gcomm://10.1.3.101:4567,gcomm://

[mysqld]
#
# * Basic Settings
#
server_id=1
binlog_format=ROW   
wsrep_provider=/usr/lib64/libgalera_smm.so   
#wsrep_cluster_address=gcomm://
wsrep_slave_threads=2 
wsrep_cluster_name=dev_cluster 
wsrep_sst_method=rsync
wsrep_node_name=node1   
innodb_locks_unsafe_for_binlog=1 
innodb_autoinc_lock_mode=2
log_slave_updates
wsrep_replicate_myisam=1
++++++++++++++++++++++++++++

(my.cnf) in node 2 
++++++++++++++++++++++++++++
# This was formally known as [safe_mysqld]. Both versions are currently parsed.
[mysqld_safe]
socket = /var/run/mysqld/mysqld.sock
nice = 0
wsrep_urls      = gcomm://10.1.6.118:4567,gcomm://10.1.3.30:4567,gcomm://10.1.3.101:4567,gcomm://

[mysqld]
#
# * Basic Settings
#
server_id=2
binlog_format=ROW   
wsrep_provider=/usr/lib64/libgalera_smm.so   
#wsrep_cluster_address=gcomm://10.1.6.118:4567
wsrep_slave_threads=2 
wsrep_cluster_name=dev_cluster 
wsrep_sst_method=rsync
wsrep_node_name=node2   
innodb_locks_unsafe_for_binlog=1 
innodb_autoinc_lock_mode=2
log_slave_updates
wsrep_replicate_myisam=1
++++++++++++++++++++++++++++

(my.cnf) in node 3

++++++++++++++++++++++++++++
# This was formally known as [safe_mysqld]. Both versions are currently parsed.
[mysqld_safe]
socket = /var/run/mysqld/mysqld.sock
nice = 0
wsrep_urls      = gcomm://10.1.6.118:4567,gcomm://10.1.3.30:4567,gcomm://10.1.3.101:4567,gcomm://

[mysqld]
#
# * Basic Settings
#
server_id=3
binlog_format=ROW
wsrep_provider=/usr/lib64/libgalera_smm.so
#wsrep_cluster_address=gcomm://10.1.6.118:4567
wsrep_slave_threads=2
wsrep_cluster_name=dev_cluster
wsrep_sst_method=rsync
wsrep_node_name=node3
innodb_locks_unsafe_for_binlog=1
innodb_autoinc_lock_mode=2
log_slave_updates
wsrep_replicate_myisam=1
++++++++++++++++++++++++++++


Testing Scenario: Setup haproxy with node1 up and node2 and node3 as backup (so the connections always go to one node)

When i reboot node 3: 
  1. node1 becomes the donor:  wsrep_local_state_comment  | Donor (+) 
  2. node2 is up and running 
  3. node3 comes back  up and starts to sync 
node3:~$ ps -ef | grep mysql
mysql     2429     1  0 11:27 ?        00:00:00 /usr/sbin/mysqld
root      2549     1  0 11:27 ?        00:00:00 /bin/sh /usr/bin/mysqld_safe
mysql     3031  2549  0 11:27 ?        00:00:00 /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib/mysql/plugin --user=mysql --log-error=/var/log/mysql/error.log --pid-file=/var/lib/mysql/dev-db-node3.pid --socket=/var/run/mysqld/mysqld.sock --port=3306 --wsrep_cluster_address=gcomm://10.1.6.118:4567
mysql     3188  3031  0 11:27 ?        00:00:00 sh -c wsrep_sst_rsync 'joiner' '<public_ip_node3>' '' '/var/lib/mysql/' '/etc/mysql/conf.d/mysqld_safe_syslog.cnf' '3031' 2>sst.err
mysql     3189  3188  0 11:27 ?        00:00:01 /bin/bash -ue /usr//bin/wsrep_sst_rsync joiner <public_ip_node3>  /var/lib/mysql/ /etc/mysql/conf.d/mysqld_safe_syslog.cnf 3031
mysql     3203     1  0 11:27 ?        00:00:00 rsync --daemon --port 4444 --config /var/lib/mysql//rsync_sst.conf
mysql     3243  3203  0 11:27 ?        00:00:00 rsync --daemon --port 4444 --config /var/lib/mysql//rsync_sst.conf
mysql     3248  3243  1 11:27 ?        00:00:08 rsync --daemon --port 4444 --config /var/lib/mysql//rsync_sst.conf
mysql     5279  3189  0 11:35 ?        00:00:00 sleep 1
akedar    5281  3771  0 11:35 pts/0    00:00:00 grep --color=auto mysql
node3:~$ 

Question1: How can i change the rsync process to use private IP instead of public IP?

         4. Once the sync is completed on node 3, the clustercheck still shows that the node is down and node is not usable as a cluster node
         5. Then i have to issue sudo service mysql stop and then sudo /etc/init.d/mysql start and it says database failed to start but the rsync process starts and after the process is completed node3 becomes a part of the cluster

Question2: How can i change the mysql process to start using /etc/init.d/mysql instead of service mysql start during the boot time?

Question3: if node1 becomes a donor it stops accepting connections, which makes the application unusable. One suggestion is to add +if [ "$WSREP_STATUS" == "4" ] || [ "$WSREP_STATUS" == "2" ] in the cluster check, but doing that, how accurate is the data during the rsync, or should i be using xtrabackup?

Question4: how do i configure the nodes to use incremental to avoid this error?
120807 11:48:00 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (afc4ea7d-dc5e-11e1-0800-0616c529eebe): 1 (Operation not permitted)
         at galera/src/replicator_str.cpp:prepare_for_IST():439. IST will be unavailable.
 
I have many more questions as i go on and test the configuration but if someone can answer these, i think i can clear a lot of my doubts...





Jay Janssen

Aug 9, 2012, 8:00:35 AM
to percona-d...@googlegroups.com
Something about how you have SST configured is causing the ultimate problem here.

I can't say why the local state was reset to all zeros on reboot. How was the machine restarted?  If the local server had kept its state correctly, an IST should have been possible.  
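
One way to see what state a node has saved before it rejoins is to look at the grastate.dat file that Galera keeps in the data directory. The sketch below prints its uuid and seqno in the same uuid:seqno form as the "Local state" line in the log. The default path assumes the /var/lib/mysql datadir shown in this thread, and the function name is hypothetical. An all-zero uuid would match the log above and rules out IST; a seqno of -1 on its own is expected while mysqld is running.

```shell
#!/bin/bash
# Print "uuid:seqno" from a Galera saved-state file.
# $1: path to grastate.dat; defaults to /var/lib/mysql/grastate.dat
#     (datadir taken from the thread).
# The function name is hypothetical.
saved_galera_state() {
    local file="${1:-/var/lib/mysql/grastate.dat}"
    awk '$1 == "uuid:"  { uuid = $2 }
         $1 == "seqno:" { seqno = $2 }
         END            { print uuid ":" seqno }' "$file"
}
```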


amol

Aug 9, 2012, 12:33:00 PM
to percona-d...@googlegroups.com
yes i just did a reboot of the machine

and after some searching i found this error on the donor node

innobackupex: Error: mysql child process has died: ERROR 1045 (28000): Access denied for user 'mysql'@'localhost' (using password: NO)

So i resolved that error by creating a user.. 

grant process on *.* to 'mysql'@'localhost' identified by '';
flush privileges;

and then on server reboot i see that the donor was a different node and it shows this error....

innobackupex: Error: mysql child process has died: ERROR 1044 (42000) at line 3: Access denied for user 'mysql'@'localhost' to database 'mysql'
 while waiting for reply to MySQL request: 'USE mysql;' at /usr//bin/innobackupex line 374.

now i see that the mysql user needs more privileges, so i have granted all privileges to mysql. Next i have to try getting the node back up using SST and then try the reboot



amol

Aug 9, 2012, 4:13:20 PM
to percona-d...@googlegroups.com
the notable part here is that there is absolutely no error when i just stop the db (/etc/init.d/mysql stop) and start the db (/etc/init.d/mysql start)

so the question is...does IST only work when you have stopped the db and started it? 
if you reboot a node does it always do SST?

and my observation is that SST using xtrabackup is slower than rsync, but the donor node is at least available for db connections...is that a valid statement?

amol

Aug 9, 2012, 10:55:15 PM
to percona-d...@googlegroups.com
well that seems to have done the trick for now. Once the permissions were set for the mysql user, all nodes rebooted fine. I just need to remove some privileges, as "all" is not ideal for a user with no password...:)
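
As a possible alternative to a passwordless 'mysql'@'localhost' account with broad privileges: the xtrabackup SST method can be given explicit credentials through wsrep_sst_auth. The fragment below is only a sketch. The account name and password are placeholders, and the account would have to exist on every node that can act as a donor. The Percona XtraDB Cluster documentation lists RELOAD, LOCK TABLES and REPLICATION CLIENT as the privileges for such an account, which is worth checking against the installed version.

```ini
[mysqld]
wsrep_sst_method=xtrabackup
# Placeholder credentials for a dedicated SST account (user:password).
wsrep_sst_auth=sstuser:CHANGE_ME
```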


Alexey Yurchenko

Aug 23, 2012, 10:13:29 PM
to percona-d...@googlegroups.com
On Friday, August 10, 2012 3:13:20 AM UTC+7, amol wrote:

so the question is...does IST only work when you have stopped the db and started it? 
if you reboot a node does it always do SST?


IST happens whenever it is possible. That is:

1) joiner position can be reliably established (e.g. if a server crashes during DDL it can't)
2) donor cache contains enough writesets to cover the gap

These are the only two conditions which IST depends on. If at least one of these conditions is not met, donor will defer to SST.
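
Condition 2 depends on how much the donor's write-set cache (gcache) can hold, which is configured through the gcache.size provider option. The fragment below is a sketch only: the 1G value is an arbitrary illustration, not a recommendation, and the ist.recv_addr part simply repeats the setting used earlier in this thread, since all provider options share one string.

```ini
[mysqld]
# gcache.size value is illustrative; size it to cover the writes expected
# while a node is down. Other provider options go in the same string.
wsrep_provider_options="gcache.size=1G; ist.recv_addr=10.1.6.118:4568"
```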

amol

Aug 23, 2012, 11:19:52 PM
to percona-d...@googlegroups.com
Thanks Alexey for the clarification...