Account Options

  1. Sign in
The old Google Groups will be going away soon, but your browser is incompatible with the new version.
Google Groups Home
« Groups Home
WSREP FSM Error when load testing PHP application
There are currently too many topics in this group that display first. To make this topic appear first, remove this option from another topic.
There was an error processing your request. Please try again.
flag
  10 messages - Collapse all  -  Translate all to Translated (View all originals)
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
 
From:
To:
Cc:
Followup To:
Add Cc | Add Followup-to | Edit Subject
Subject:
Validation:
For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon. Listen and type the numbers you hear
 
PaulO  
View profile  
 More options Oct 29 2012, 3:15 pm
From: PaulO <paul....@ceifx.com>
Date: Mon, 29 Oct 2012 12:15:26 -0700 (PDT)
Local: Mon, Oct 29 2012 3:15 pm
Subject: WSREP FSM Error when load testing PHP application

Hello, I have a 3 node cluster accessed by a PHP application. All tables
are Innodb. When load testing the application with apache-jmeter, the
primary cluster node fails with the following error:

[ERROR] WSREP: FSM: no such a transition COMMITTING -> ROLLED_BACK
Aborted
121029 14:53:02 mysqld_safe Number of processes running now: 0
121029 14:53:02 mysqld_safe mysqld restarted
121029 14:53:02 [Note] Flashcache bypass: disabled
121029 14:53:02 [Note] Flashcache setup error is : ioctl failed

121029 14:53:02 [Note] WSREP: Read nil XID from storage engines, skipping
position init
121029 14:53:02 [Note] WSREP: wsrep_load(): loading provider library
'/usr/lib/libgalera_smm.so'
121029 14:53:02 [Note] WSREP: wsrep_load(): Galera 2.1dev(r112) by
Codership Oy <i...@codership.com> loaded succesfully.
121029 14:53:02 [Note] WSREP: Found saved state:
00000000-0000-0000-0000-000000000000:-1
121029 14:53:02 [Note] WSREP: Reusing existing '/data/mysql1//galera.cache'.
121029 14:53:02 [Note] WSREP: Passing config to GCS: base_host =
11.11.11.110; gcache.dir = /data/mysql1/; gcache.keep_pages_size = 0;
gcache.mem_size = 0; gcache.name = /data/mysql1//galera.cache;
gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0;
gcs.fc_factor = 0.5; gcs.fc_limit = 16; gcs.fc_master_slave = NO;
gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit
= 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO;
replicator.causal_read_timeout = PT30S; replicator.commit_order = 3
121029 14:53:02 [Note] WSREP: Assign initial position for certification:
-1, protocol version: -1
121029 14:53:02 [Note] WSREP: wsrep_sst_grab()
121029 14:53:02 [Note] WSREP: Start replication
121029 14:53:02 [Note] WSREP: Setting initial position to
00000000-0000-0000-0000-000000000000:-1
121029 14:53:02 [Note] WSREP: protonet asio version 0
121029 14:53:02 [Note] WSREP: backend: asio
121029 14:53:02 [Note] WSREP: GMCast version 0
121029 14:53:02 [Note] WSREP: (ddd40d58-21f9-11e2-0800-19fa5c7b5506,
'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
121029 14:53:02 [Note] WSREP: (ddd40d58-21f9-11e2-0800-19fa5c7b5506,
'tcp://0.0.0.0:4567') multicast: , ttl: 1
121029 14:53:02 [Note] WSREP: EVS version 0
121029 14:53:02 [Note] WSREP: PC version 0
121029 14:53:02 [Note] WSREP: gcomm: connecting to group 'test_cluster2',
peer '11.11.11.111:4567'
121029 14:53:02 [Note] WSREP: (ddd40d58-21f9-11e2-0800-19fa5c7b5506,
'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
tcp://11.11.11.112:4567
121029 14:53:03 [Note] WSREP: (ddd40d58-21f9-11e2-0800-19fa5c7b5506,
'tcp://0.0.0.0:4567') turning message relay requesting off
121029 14:53:07 [Note] WSREP: declaring
02b0f513-21dc-11e2-0800-7a0a8588f798 stable
121029 14:53:07 [Note] WSREP: declaring
ebe80665-21f4-11e2-0800-bedffce635c8 stable
121029 14:53:07 [Note] WSREP:
view(view_id(PRIM,02b0f513-21dc-11e2-0800-7a0a8588f798,11) memb {
        02b0f513-21dc-11e2-0800-7a0a8588f798,
        ddd40d58-21f9-11e2-0800-19fa5c7b5506,
        ebe80665-21f4-11e2-0800-bedffce635c8,

} joined {
} left {
} partitioned {

        92cdb47c-21f7-11e2-0800-262cdab68ee6,
})

121029 14:53:08 [Note] WSREP: gcomm: connected
121029 14:53:08 [Note] WSREP: Changing maximum packet size to 64500,
resulting msg size: 32636
121029 14:53:08 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
121029 14:53:08 [Note] WSREP: Opened channel 'test_cluster2'
121029 14:53:08 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no,
my_idx = 1, memb_num = 3
121029 14:53:08 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
121029 14:53:08 [Note] WSREP: Waiting for SST to complete.
121029 14:53:08 [Note] WSREP: STATE EXCHANGE: sent state msg:
e0eef247-21f9-11e2-0800-c2c66b6d352c
121029 14:53:08 [Note] WSREP: STATE EXCHANGE: got state msg:
e0eef247-21f9-11e2-0800-c2c66b6d352c from 0 (node112)
121029 14:53:08 [Note] WSREP: STATE EXCHANGE: got state msg:
e0eef247-21f9-11e2-0800-c2c66b6d352c from 2 (node111)
121029 14:53:08 [Note] WSREP: STATE EXCHANGE: got state msg:
e0eef247-21f9-11e2-0800-c2c66b6d352c from 1 (node110)
121029 14:53:08 [Note] WSREP: Quorum results:
        version    = 2,
        component  = PRIMARY,
        conf_id    = 9,
        members    = 2/3 (joined/total),
        act_id     = 8642,
        last_appl. = -1,
        protocols  = 0/4/2 (gcs/repl/appl),
        group UUID = 97be425f-21db-11e2-0800-d98563b1f16e
121029 14:53:08 [Note] WSREP: Flow-control interval: [14, 28]
121029 14:53:08 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 8642)
121029 14:53:08 [Note] WSREP: State transfer required:
        Group state: 97be425f-21db-11e2-0800-d98563b1f16e:8642
        Local state: 00000000-0000-0000-0000-000000000000:-1
121029 14:53:08 [Note] WSREP: New cluster view: global state:
97be425f-21db-11e2-0800-d98563b1f16e:8642, view# 10: Primary, number of
nodes: 3, my index: 1, protocol version 2
121029 14:53:08 [Warning] WSREP: Gap in state sequence. Need state transfer.
121029 14:53:10 [Note] WSREP: Running: 'wsrep_sst_xtrabackup 'joiner'
'11.11.11.110' 'xtrabackup:1234567' '/data/mysql1/' '/etc/mysql/my.cnf'
'11904' 2>sst.err'
121029 14:53:10 [Note] WSREP: Prepared SST request:
xtrabackup|11.11.11.110:4444/xtrabackup_sst
121029 14:53:10 [Note] WSREP: wsrep_notify_cmd is not defined, skipping
notification.
121029 14:53:10 [Note] WSREP: Assign initial position for certification:
8642, protocol version: 2
121029 14:53:10 [Warning] WSREP: Failed to prepare for incremental state
transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not
match group state UUID (97be425f-21db-11e2-0800-d98563b1f16e): 1 (Operation
not permitted)
         at galera/src/replicator_str.cpp:prepare_for_IST():439. IST will
be unavailable.
121029 14:53:10 [Note] WSREP: Node 1 (node110) requested state transfer
from '*any*'. Selected 0 (node112)(SYNCED) as donor.
121029 14:53:10 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 8682)
121029 14:53:10 [Note] WSREP: Requesting state transfer: success, donor: 0
121029 14:58:25 [Note] WSREP: 0 (node112): State transfer to 1 (node110)
complete.
121029 14:58:25 [Note] WSREP: Member 0 (node112) synced with group.

At other times, the same load test causes this error:

[ERROR] WSREP: FSM: no such a transition REPLICATING -> REPLICATING

Can any insight be provided for this error? After it occurs, the cluster
automatically fixes itself, but what is causing the issue in the first
place?


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
PaulO  
View profile  
 More options Oct 29 2012, 3:42 pm
From: PaulO <paul....@ceifx.com>
Date: Mon, 29 Oct 2012 12:42:58 -0700 (PDT)
Local: Mon, Oct 29 2012 3:42 pm
Subject: Re: WSREP FSM Error when load testing PHP application

Furthermore, I have the following configurations on my 3 nodes:

Node 1:
wsrep_urls=gcomm://10.10.11.111:4567,gcomm://10.10.11.112:4567,gcomm://

Node 2:
wsrep_urls=gcomm://10.10.11.110:4567,gcomm://10.10.11.112:4567

Node 3:
wsrep_urls=gcomm://10.10.11.110:4567,gcomm://10.10.11.111:4567

I am using Percona XtraDB Cluster 5.5.24-55.


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Alex Yurchenko  
View profile  
 More options Oct 29 2012, 4:54 pm
From: Alex Yurchenko <alexey.yurche...@codership.com>
Date: Mon, 29 Oct 2012 22:54:37 +0200
Local: Mon, Oct 29 2012 4:54 pm
Subject: Re: [percona-group] Re: WSREP FSM Error when load testing PHP application
You must be hitting a known and fixed bug. At least some bugs with the
same symptoms have been identified and fixed some time ago.
Unfortunately not all of these fixes are in PXC yet. Still, try the
latest PXC release it may fix your particular problem.

On 2012-10-29 21:42, PaulO wrote:

--
Alexey Yurchenko,
Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011

 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
PaulO  
View profile  
 More options Oct 30 2012, 4:39 pm
From: PaulO <paul....@ceifx.com>
Date: Tue, 30 Oct 2012 13:39:51 -0700 (PDT)
Local: Tues, Oct 30 2012 4:39 pm
Subject: Re: [percona-group] Re: WSREP FSM Error when load testing PHP application

After upgrading each of the 3 nodes to the newest version, I am still
getting this error:

121030 16:34:17 [ERROR] WSREP: FSM: no such a transition REPLICATING ->
ROLLED_BACK
Aborted
121030 16:34:17 mysqld_safe Number of processes running now: 0
121030 16:34:17 mysqld_safe mysqld restarted
121030 16:34:17 [Note] Flashcache bypass: disabled
121030 16:34:17 [Note] Flashcache setup error is : ioctl failed

121030 16:34:17 [Note] WSREP: Read nil XID from storage engines, skipping
position init
121030 16:34:17 [Note] WSREP: wsrep_load(): loading provider library
'/usr/lib/libgalera_smm.so'
121030 16:34:17 [Note] WSREP: wsrep_load(): Galera 2.2(r115) by Codership
Oy <i...@codership.com> loaded succesfully.
121030 16:34:17 [Note] WSREP: Found saved state:
00000000-0000-0000-0000-000000000000:-1
121030 16:34:17 [Note] WSREP: Reusing existing '/data/mysql1//galera.cache'.
121030 16:34:17 [Note] WSREP: Passing config to GCS: base_host =
10.10.11.110; gcache.dir = /data/mysql1/; gcache.keep_pages_size = 0;
gcache.mem_size = 0; gcache.name = /data/mysql1//galera.cache;
gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0;
gcs.fc_factor = 0.5; gcs.fc_limit = 16; gcs.fc_master_slave = NO;
gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit
= 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO;
pc.ignore_sb = false; replicator.causal_read_timeout = PT30S;
replicator.commit_order = 3
121030 16:34:17 [Note] WSREP: Assign initial position for certification:
-1, protocol version: -1
121030 16:34:17 [Note] WSREP: wsrep_sst_grab()
121030 16:34:17 [Note] WSREP: Start replication
121030 16:34:17 [Note] WSREP: Setting initial position to
00000000-0000-0000-0000-000000000000:-1
121030 16:34:17 [Note] WSREP: protonet asio version 0
121030 16:34:17 [Note] WSREP: backend: asio
121030 16:34:17 [Note] WSREP: GMCast version 0
121030 16:34:17 [Note] WSREP: (2d2525aa-22d1-11e2-0800-9da5e98fae00,
'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
121030 16:34:17 [Note] WSREP: (2d2525aa-22d1-11e2-0800-9da5e98fae00,
'tcp://0.0.0.0:4567') multicast: , ttl: 1
121030 16:34:17 [Note] WSREP: EVS version 0
121030 16:34:17 [Note] WSREP: PC version 0
121030 16:34:17 [Note] WSREP: gcomm: connecting to group 'test_cluster2',
peer '10.10.11.111:4567'
121030 16:34:17 [Note] WSREP: (2d2525aa-22d1-11e2-0800-9da5e98fae00,
'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
tcp://10.10.11.112:4567
121030 16:34:18 [Note] WSREP: (2d2525aa-22d1-11e2-0800-9da5e98fae00,
'tcp://0.0.0.0:4567') turning message relay requesting off
121030 16:34:22 [Note] WSREP: declaring
3f6142cf-22cf-11e2-0800-b481e890b368 stable
121030 16:34:22 [Note] WSREP: declaring
526375df-22cf-11e2-0800-3d47d06db344 stable
121030 16:34:22 [Note] WSREP:
view(view_id(PRIM,2d2525aa-22d1-11e2-0800-9da5e98fae00,4) memb {
        2d2525aa-22d1-11e2-0800-9da5e98fae00,
        3f6142cf-22cf-11e2-0800-b481e890b368,
        526375df-22cf-11e2-0800-3d47d06db344,

} joined {
} left {
} partitioned {

        5970a776-22cf-11e2-0800-64b44ccf18ca,
})

121030 16:34:23 [Note] WSREP: gcomm: connected
121030 16:34:23 [Note] WSREP: Changing maximum packet size to 64500,
resulting msg size: 32636
121030 16:34:23 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
121030 16:34:23 [Note] WSREP: Opened channel 'test_cluster2'
121030 16:34:23 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no,
my_idx = 0, memb_num = 3
121030 16:34:23 [Note] WSREP: Waiting for SST to complete.
121030 16:34:23 [Note] WSREP: STATE_EXCHANGE: sent state UUID:
306ced91-22d1-11e2-0800-3368c23d357e
121030 16:34:23 [Note] WSREP: STATE EXCHANGE: sent state msg:
306ced91-22d1-11e2-0800-3368c23d357e
121030 16:34:23 [Note] WSREP: STATE EXCHANGE: got state msg:
306ced91-22d1-11e2-0800-3368c23d357e from 0 (node110)
121030 16:34:23 [Note] WSREP: STATE EXCHANGE: got state msg:
306ced91-22d1-11e2-0800-3368c23d357e from 1 (node112)
121030 16:34:23 [Note] WSREP: STATE EXCHANGE: got state msg:
306ced91-22d1-11e2-0800-3368c23d357e from 2 (node111)
121030 16:34:23 [Note] WSREP: Quorum results:
        version    = 2,
        component  = PRIMARY,
        conf_id    = 3,
        members    = 2/3 (joined/total),
        act_id     = 2016,
        last_appl. = -1,
        protocols  = 0/4/2 (gcs/repl/appl),
        group UUID = 721fa11a-22b7-11e2-0800-28d019b4cb3f
121030 16:34:23 [Note] WSREP: Flow-control interval: [14, 28]
121030 16:34:23 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 2016)
121030 16:34:23 [Note] WSREP: State transfer required:
        Group state: 721fa11a-22b7-11e2-0800-28d019b4cb3f:2016
        Local state: 00000000-0000-0000-0000-000000000000:-1
121030 16:34:23 [Note] WSREP: New cluster view: global state:
721fa11a-22b7-11e2-0800-28d019b4cb3f:2016, view# 4: Primary, number of
nodes: 3, my index: 0, protocol version 2
121030 16:34:23 [Warning] WSREP: Gap in state sequence. Need state transfer.
121030 16:34:23 [Note] WSREP: (2d2525aa-22d1-11e2-0800-9da5e98fae00,
'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
tcp://10.10.11.111:4567
121030 16:34:23 [Note] WSREP: remote endpoint tcp://10.10.11.111:4567
changed identity 526375df-22cf-11e2-0800-3d47d06db344 ->
30f824fd-22d1-11e2-0800-3971ee83c1f9
121030 16:34:23 [Note] WSREP: (2d2525aa-22d1-11e2-0800-9da5e98fae00,
'tcp://0.0.0.0:4567') turning message relay requesting off
121030 16:34:25 [Note] WSREP: Running: 'wsrep_sst_xtrabackup 'joiner'
'10.10.11.110:4444' 'xtrabackup:123456789' '/data/mysql1/'
'/etc/mysql/my.cnf' '16821' 2>sst.err'
121030 16:34:25 [Note] WSREP: Prepared SST request:
xtrabackup|10.10.11.110:4444/xtrabackup_sst
121030 16:34:25 [Note] WSREP: wsrep_notify_cmd is not defined, skipping
notification.
121030 16:34:25 [Note] WSREP: Assign initial position for certification:
2016, protocol version: 2
121030 16:34:25 [Warning] WSREP: Failed to prepare for incremental state
transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not
match group state UUID (721fa11a-22b7-11e2-0800-28d019b4cb3f): 1 (Operation
not permitted)
         at galera/src/replicator_str.cpp:prepare_for_IST():440. IST will
be unavailable.
121030 16:34:28 [Note] WSREP:
evs::proto(2d2525aa-22d1-11e2-0800-9da5e98fae00, GATHER,
view_id(REG,2d2525aa-22d1-11e2-0800-9da5e98fae00,4)) suspecting node:
526375df-22cf-11e2-0800-3d47d06db344
121030 16:34:29 [Note] WSREP: declaring
30f824fd-22d1-11e2-0800-3971ee83c1f9 stable
121030 16:34:29 [Note] WSREP: declaring
3f6142cf-22cf-11e2-0800-b481e890b368 stable
121030 16:34:29 [Note] WSREP:
view(view_id(PRIM,2d2525aa-22d1-11e2-0800-9da5e98fae00,5) memb {
        2d2525aa-22d1-11e2-0800-9da5e98fae00,
        30f824fd-22d1-11e2-0800-3971ee83c1f9,
        3f6142cf-22cf-11e2-0800-b481e890b368,
} joined {
} left {
} partitioned {

        526375df-22cf-11e2-0800-3d47d06db344,
})

121030 16:34:29 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no,
my_idx = 0, memb_num = 3
121030 16:34:29 [Note] WSREP: STATE_EXCHANGE: sent state UUID:
34117026-22d1-11e2-0800-b17b27d9cef1
121030 16:34:29 [Note] WSREP: STATE EXCHANGE: sent state msg:
34117026-22d1-11e2-0800-b17b27d9cef1
121030 16:34:29 [Note] WSREP: STATE EXCHANGE: got state msg:
34117026-22d1-11e2-0800-b17b27d9cef1 from 0 (node110)
121030 16:34:29 [Note] WSREP: STATE EXCHANGE: got state msg:
34117026-22d1-11e2-0800-b17b27d9cef1 from 2 (node112)
121030 16:34:29 [Note] WSREP: STATE EXCHANGE: got state msg:
34117026-22d1-11e2-0800-b17b27d9cef1 from 1 (node111)
121030 16:34:29 [Note] WSREP: Quorum results:
        version    = 2,
        component  = PRIMARY,
        conf_id    = 4,
        members    = 1/3 (joined/total),
        act_id     = 2049,
        last_appl. = 0,
        protocols  = 0/4/2 (gcs/repl/appl),
        group UUID = 721fa11a-22b7-11e2-0800-28d019b4cb3f
121030 16:34:29 [Note] WSREP: Flow-control interval: [14, 28]
121030 16:34:29 [Note] WSREP: Shifting PRIMARY -> PRIMARY (TO: 2049)
121030 16:34:29 [Note] WSREP: Node 0 (node110) requested state transfer
from '*any*'. Selected 2 (node112)(SYNCED) as donor.
121030 16:34:29 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 2049)
121030 16:34:29 [Note] WSREP: Requesting state transfer: success, donor: 2

...

read more »


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
PaulO  
View profile  
 More options Oct 31 2012, 8:37 am
From: PaulO <paul....@ceifx.com>
Date: Wed, 31 Oct 2012 05:37:26 -0700 (PDT)
Local: Wed, Oct 31 2012 8:37 am
Subject: Re: [percona-group] Re: WSREP FSM Error when load testing PHP application

Do you have any other thoughts on what the issue might be? Is it possible
that the load test is just writing to the nodes too quickly? The strange
part is that on some runs of the load test with jMeter, it will not have
the issue. However, on other identical runs, it will cause the issue almost
immediately.

Any help would be appreciated.


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Alex Yurchenko  
View profile  
 More options Oct 31 2012, 10:28 am
From: Alex Yurchenko <alexey.yurche...@codership.com>
Date: Wed, 31 Oct 2012 16:28:26 +0200
Local: Wed, Oct 31 2012 10:28 am
Subject: Re: [percona-group] Re: WSREP FSM Error when load testing PHP application
On 2012-10-31 14:37, PaulO wrote:

> Do you have any other thoughts on what the issue might be? Is it
> possible
> that the load test is just writing to the nodes too quickly?

Of course if you used a single threaded test, there would be no
problem. But you probably won't need a cluster then. ;)

> The strange
> part is that on some runs of the load test with jMeter, it will not
> have
> the issue. However, on other identical runs, it will cause the issue
> almost
> immediately.

This error is a result of a bad execution context cleanup (wrong state
is preserved) and most likely involves cluster-wide deadlock (you should
not be seeing that if you use the cluster in a master-slave way). So it
is a race and may happen and may not.

> Any help would be appreciated.

You could

1) describe your test in sufficient detail so that we could reproduce
it in our lab, then we could fix it or confirm that it is fixed. (if
there is sensitive information involved, you can e-mail directly to
i...@codership.com)

2) compile the latest head from lp:codership-mysql and test with it and
Galera 2.2 (again from https://launchpad.net/galera/+download) and
verify that it is fixed there.

Thanks,
Alex

--
Alexey Yurchenko,
Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
PaulO  
View profile  
 More options Oct 31 2012, 1:31 pm
From: PaulO <paul....@ceifx.com>
Date: Wed, 31 Oct 2012 10:31:56 -0700 (PDT)
Local: Wed, Oct 31 2012 1:31 pm
Subject: Re: [percona-group] Re: WSREP FSM Error when load testing PHP application

Thanks for the insight so far. We have narrowed down the issue to a
specific function in our application, so we will continue to dig there.
However, one other thing that we have noticed during these tests is that
sometimes, when 2 of the 3 nodes go down and only 1 is still up, that node
still has a local_state_comment of "Synced (6)" and we are still able to
access the data through that node. It seems that the Split brain detection
is not functioning correctly. Have you seen an issue like this? Here is
some of the output of "show global status like 'wsrep%';"

| wsrep_local_state          | 4                                    |
| wsrep_local_state_comment  | Synced (6)                           |
| wsrep_cert_index_size      | 0                                    |
| wsrep_causal_reads         | 0                                    |
| wsrep_cluster_conf_id      | 9                                    |
| wsrep_cluster_size         | 1                                    |
| wsrep_cluster_state_uuid   | 721fa11a-22b7-11e2-0800-28d019b4cb3f |
| wsrep_cluster_status       | Primary                              |
| wsrep_connected            | ON                                   |
| wsrep_local_index          | 0                                    |
| wsrep_provider_name        | Galera                               |
| wsrep_provider_vendor      | Codership Oy <i...@codership.com>    |
| wsrep_provider_version     | 2.2(r115)                            |
| wsrep_ready                | ON                                   |

Any idea why the primary node does not block access to the DB?


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
PaulO  
View profile  
 More options Oct 31 2012, 1:34 pm
From: PaulO <paul....@ceifx.com>
Date: Wed, 31 Oct 2012 10:34:04 -0700 (PDT)
Local: Wed, Oct 31 2012 1:34 pm
Subject: Re: [percona-group] Re: WSREP FSM Error when load testing PHP application

Here is my "wsrep_provider_options" on the primary node with node 2 and
node 3 down.

| wsrep_provider_options | base_host = 11.11.11.110; base_port = 4567;
evs.debug_log_mask = 0x1; evs.inactive_check_period = PT0.5S;
evs.inactive_timeout = PT15S; evs.info_log_mask = 0; evs.install_timeout =
PT15S; evs.join_retrans_period = PT0.3S; evs.keepalive_period = PT1S;
evs.max_install_timeouts = 1; evs.send_window = 4; evs.stats_report_period
= PT1M; evs.suspect_timeout = PT5S; evs.use_aggregate = true;
evs.user_send_window = 2; evs.version = 0; evs.view_forget_timeout = PT5M;
gcache.dir = /data/mysql1/; gcache.keep_pages_size = 0; gcache.mem_size =
0; gcache.name = /data/mysql1//galera.cache; gcache.page_size = 128M;
gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 0.5; gcs.fc_limit =
16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle
= 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit
= 0.25; gcs.sync_donor = NO; gmcast.listen_addr = tcp://0.0.0.0:4567;
gmcast.mcast_addr = ; gmcast.mcast_ttl = 1; gmcast.peer_timeout = PT3S;
gmcast.time_wait = PT5S; gmcast.version = 0; ist.recv_addr = 11.11.11.110;
pc.checksum = true; pc.ignore_quorum = false; pc.ignore_sb = false;
pc.linger = PT2S; pc.npvo = false; pc.version = 0; protonet.backend = asio;
protonet.version = 0; replicator.causal_read_timeout = PT30S;
replicator.commit_order = 3 |

I confirmed that pc.ignore_sb=false;


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
PaulO  
View profile  
 More options Oct 31 2012, 3:06 pm
From: PaulO <paul....@ceifx.com>
Date: Wed, 31 Oct 2012 12:06:11 -0700 (PDT)
Local: Wed, Oct 31 2012 3:06 pm
Subject: Re: [percona-group] Re: WSREP FSM Error when load testing PHP application

Upon further investigation, the issue seems to be related to when we have
multiple web servers attempting to connect (many times per second) at the
same time. We are using KEMP hardware load balancers, not HAProxy, could
this be part of the issue?

When I have load tested just 1 web server which writes to just 1 DB server,
there is no issue.
If I have 1 web server write to multiple DB servers, I get the error.
If I have 2 web servers write to 1 DB server, I get the error, regardless
of whether I use the KEMP balancer or not.

Does HAProxy somehow funnel the requests better than a hardware balancer
would such that it makes it easier for the cluster to keep up?


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Alex Yurchenko  
View profile  
 More options Oct 31 2012, 6:15 pm
From: Alex Yurchenko <alexey.yurche...@codership.com>
Date: Thu, 01 Nov 2012 00:15:40 +0200
Local: Wed, Oct 31 2012 6:15 pm
Subject: Re: [percona-group] Re: WSREP FSM Error when load testing PHP application
On 2012-10-31 21:06, PaulO wrote:

> Upon further investigation, the issue seems to be related to when we
> have
> multiple web servers attempting to connect (many times per second) at
> the
> same time. We are using KEMP hardware load balancers, not HAProxy,
> could
> this be part of the issue?

It may help the issue to surface, but there is nothing wrong with using
hardware balancers.

> When I have load tested just 1 web server which writes to just 1 DB
> server,
> there is no issue.
> If I have 1 web server write to multiple DB servers, I get the error.

Yes, this is when you get cluster-wide conflicts.

> If I have 2 web servers write to 1 DB server, I get the error,
> regardless
> of whether I use the KEMP balancer or not.

Are you *absolutely* sure that they are writing to 1 DB server?

> Does HAProxy somehow funnel the requests better than a hardware
> balancer
> would such that it makes it easier for the cluster to keep up?

HAProxy is a dumb TCP proxy, it does not understand what's in the
connections. If anything it may be far slower which may reduce the
contention. But if there is a bug, you'll hit it with HAProxy sooner or
later.

About the 3rd node staying online: apparently the nodes didn't leave
the cluster ungracefully (in most such cases we're trying to close
cluster connection before aborting). This allowed the 3rd node to
continue. Details in the logs.

Regards,
Alex

--
Alexey Yurchenko,
Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011

 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
End of messages
« Back to Discussions « Newer topic     Older topic »