partial error when doing SST

47 views
Skip to first unread message

Ale C.

unread,
May 28, 2012, 2:23:40 PM5/28/12
to codership
Hi guys, we are having some weird issue.

We got a 4 master cluster schema with galera/mysql 5.5.23 ( 3 r/w
nodes and one node of reference )
We create a new database and filled it with tables and data.
After that, i ran a "mysqldump" from the node of reference, and the
dump is ok, THEN! i stop and start the node of reference, the SST is
done, but when i try to read one of the tables, i get this error :

- - - - - - - - - - - - - - - - - - - - -
May 28 11:57:55 i-00000092-asm mysqld: 120528 11:57:55 [ERROR] Cannot
find or open table nicira1/projects from
May 28 11:57:55 i-00000092-asm mysqld: the internal data dictionary of
InnoDB though the .frm file for the
May 28 11:57:55 i-00000092-asm mysqld: table exists. Maybe you have
deleted and recreated InnoDB data
May 28 11:57:55 i-00000092-asm mysqld: files but have forgotten to
delete the corresponding .frm files
May 28 11:57:55 i-00000092-asm mysqld: of InnoDB tables, or you have
moved .frm files to another database?
May 28 11:57:55 i-00000092-asm mysqld: or, the table contains indexes
that this version of the engine
May 28 11:57:55 i-00000092-asm mysqld: doesn't support.
May 28 11:57:55 i-00000092-asm mysqld: See
http://dev.mysql.com/doc/refman/5.5/en/innodb-troubleshooting.html
May 28 11:57:55 i-00000092-asm mysqld: how you can resolve the
problem.
- - - - - - - - - - - - - - - - - - - - -

And from the client perspective, it says that the table doesnt exist.
This is the log from the Node of Reference in the SST time.

- - - - - - - - - - - - - - - - - - - - -
May 28 11:56:31 i-00000092-asm mysqld: version = 2,
May 28 11:56:31 i-00000092-asm mysqld: component = PRIMARY,
May 28 11:56:31 i-00000092-asm mysqld: conf_id = 64,
May 28 11:56:31 i-00000092-asm mysqld: members = 3/4 (joined/
total),
May 28 11:56:31 i-00000092-asm mysqld: act_id = 210837192,
May 28 11:56:31 i-00000092-asm mysqld: last_appl. = -1,
May 28 11:56:31 i-00000092-asm mysqld: protocols = 0/3/1 (gcs/repl/
appl),
May 28 11:56:31 i-00000092-asm mysqld: group UUID =
3108998a-67d4-11e1-0800-84a08b380d82
May 28 11:56:31 i-00000092-asm mysqld: 120528 11:56:31 [Note] WSREP:
Flow-control interval: [16, 32]
May 28 11:56:31 i-00000092-asm mysqld: 120528 11:56:31 [Note] WSREP:
Shifting OPEN -> PRIMARY (TO: 210837192)
May 28 11:56:31 i-00000092-asm mysqld: 120528 11:56:31 [Note] WSREP:
State transfer required:
May 28 11:56:31 i-00000092-asm mysqld: Group state:
3108998a-67d4-11e1-0800-84a08b380d82:210837192
May 28 11:56:31 i-00000092-asm mysqld: Local state:
3108998a-67d4-11e1-0800-84a08b380d82:210835601
May 28 11:56:31 i-00000092-asm mysqld: 120528 11:56:31 [Note] WSREP:
New cluster view: global state:
3108998a-67d4-11e1-0800-84a08b380d82:210192, view# 65: Primary, number
of nodes: 4, my index: 3, protocol version 1
May 28 11:56:31 i-00000092-asm mysqld: 120528 11:56:31 [Warning]
WSREP: Gap in state sequence. Need state transfer.
May 28 11:56:31 i-00000092-asm mysqld: 120528 11:56:31 [Note] WSREP:
[debug]: closing client connections for PRIM
May 28 11:56:33 i-00000092-asm mysqld: 120528 11:56:33 [Note] WSREP:
waiting for client connections to close: 2
May 28 11:56:33 i-00000092-asm mysqld: 120528 11:56:33 [Note] WSREP:
Running: 'wsrep_sst_rsync 'joiner' '172.16.159.121' 'wsrep_sst:wspass'
'r/lib/mysql/' '/etc/mysql/conf.d/wsrep.cnf' '28132' 2>sst.err'
May 28 11:56:33 i-00000092-asm rsyncd[28175]: rsyncd version 3.0.7
starting, listening on port 4444
May 28 11:56:33 i-00000092-asm mysqld: 120528 11:56:33 [Note] WSREP:
Prepared SST request: rsync|172.16.159.121:4444/rsync_sst
May 28 11:56:33 i-00000092-asm mysqld: 120528 11:56:33 [Note] WSREP:
wsrep_notify_cmd is not defined, skipping notification.
May 28 11:56:33 i-00000092-asm mysqld: 120528 11:56:33 [Note] WSREP:
Assign initial position for certification: 210837192, protocol
version:
May 28 11:56:33 i-00000092-asm mysqld: 120528 11:56:33 [Note] WSREP:
Prepared IST receiver, listening at: tcp://172.16.159.121:4568
May 28 11:56:33 i-00000092-asm mysqld: 120528 11:56:33 [Note] WSREP:
Node 3 (i-00000092-asm) requested state transfer from '*any*'.
Selected i-0000015c-psm)(SYNCED) as donor.
May 28 11:56:33 i-00000092-asm mysqld: 120528 11:56:33 [Note] WSREP:
Shifting PRIMARY -> JOINER (TO: 210837263)
May 28 11:56:33 i-00000092-asm mysqld: 120528 11:56:33 [Note] WSREP:
Requesting state transfer: success, donor: 0
May 28 11:56:33 i-00000092-asm rsyncd[28186]: name lookup failed for
172.16.148.95: Temporary failure in name resolution
May 28 11:56:33 i-00000092-asm rsyncd[28186]: connect from UNKNOWN
(172.16.148.95)
May 28 11:56:33 i-00000092-asm rsyncd[28186]: rsync to rsync_sst/ from
UNKNOWN (172.16.148.95)
May 28 11:56:33 i-00000092-asm rsyncd[28186]: receiving file list
May 28 11:56:33 i-00000092-asm rsyncd[28186]: sent 54 bytes received
182 bytes total size 47
May 28 11:56:35 i-00000092-asm rsyncd[28175]: rsync error: received
SIGINT, SIGTERM, or SIGHUP (code 20) at rsync.c(541) [Receiver=3.0.7]
May 28 11:56:35 i-00000092-asm mysqld: 120528 11:56:35 [Note] WSREP:
SST complete, seqno: 210835601
May 28 11:56:35 i-00000092-asm mysqld: 120528 11:56:35 InnoDB: The
InnoDB memory heap is disabled
May 28 11:56:35 i-00000092-asm mysqld: 120528 11:56:35 InnoDB: Mutexes
and rw_locks use GCC atomic builtins
May 28 11:56:35 i-00000092-asm mysqld: 120528 11:56:35 InnoDB:
Compressed tables use zlib 1.2.3.3
May 28 11:56:35 i-00000092-asm mysqld: 120528 11:56:35 InnoDB: Using
Linux native AIO
May 28 11:56:35 i-00000092-asm mysqld: 120528 11:56:35 InnoDB:
Initializing buffer pool, size = 4.0G
May 28 11:56:35 i-00000092-asm mysqld: 120528 11:56:35 InnoDB:
Completed initialization of buffer pool
May 28 11:56:35 i-00000092-asm mysqld: 120528 11:56:35 InnoDB: highest
supported file format is Barracuda.
May 28 11:56:37 i-00000092-asm mysqld: 120528 11:56:37 InnoDB:
Waiting for the background threads to start
May 28 11:56:38 i-00000092-asm mysqld: 120528 11:56:38 InnoDB: 1.1.8
started; log sequence number 100441910215
May 28 11:56:38 i-00000092-asm mysqld: 120528 11:56:38 [Note] WSREP:
Initial TC log open: dummy
May 28 11:56:39 i-00000092-asm mysqld: 120528 11:56:39 [Note] Event
Scheduler: Loaded 0 events
May 28 11:56:39 i-00000092-asm mysqld: 120528 11:56:39 [Note] WSREP:
Signalling provider to continue.
May 28 11:56:39 i-00000092-asm mysqld: 120528 11:56:39 [Note] WSREP:
Received SST: 3108998a-67d4-11e1-0800-84a08b380d82:210835601
May 28 11:56:39 i-00000092-asm mysqld: 120528 11:56:39 [Note] WSREP:
SST received: 3108998a-67d4-11e1-0800-84a08b380d82:210835601
May 28 11:56:39 i-00000092-asm mysqld: 120528 11:56:39 [Note] WSREP:
Receiving IST: 1591 writesets, seqnos 210835601-210837192
May 28 11:56:39 i-00000092-asm mysqld: 120528 11:56:39 [Note] /usr/
sbin/mysqld: ready for connections.
May 28 11:56:39 i-00000092-asm mysqld: Version: '5.5.23' socket: '/
var/run/mysqld/mysqld.sock' port: 3306 Source distribution,
wsrep_23.543
May 28 11:56:39 i-00000092-asm mysqld: 120528 11:56:39 [Note] WSREP:
IST received: 3108998a-67d4-11e1-0800-84a08b380d82:210837192
May 28 11:56:39 i-00000092-asm mysqld: 120528 11:56:39 [Note] WSREP: 0
(i-0000015c-psm): State transfer to 3 (i-00000092-asm) complete.
May 28 11:56:39 i-00000092-asm mysqld: 120528 11:56:39 [Note] WSREP: 3
(i-00000092-asm): State transfer from 0 (i-0000015c-psm) complete.
May 28 11:56:39 i-00000092-asm mysqld: 120528 11:56:39 [Note] WSREP:
Shifting JOINER -> JOINED (TO: 210837519)
May 28 11:56:39 i-00000092-asm mysqld: 120528 11:56:39 [Note] WSREP:
Member 0 (i-0000015c-psm) synced with group.
May 28 11:56:39 i-00000092-asm mysqld: 120528 11:56:39 [Note] WSREP:
Member 3 (i-00000092-asm) synced with group.
May 28 11:56:39 i-00000092-asm mysqld: 120528 11:56:39 [Note] WSREP:
Shifting JOINED -> SYNCED (TO: 210837520)
May 28 11:56:39 i-00000092-asm mysqld: 120528 11:56:39 [Note] WSREP:
Synchronized with group, ready for connections
May 28 11:56:39 i-00000092-asm mysqld: 120528 11:56:39 [Note] WSREP:
wsrep_notify_cmd is not defined, skipping notification.
- - - - - - - - - - - - - - - - - - - - -

What might be happening in the SST time ? ( if the problem is in the
SST )

Alex Yurchenko

unread,
May 28, 2012, 3:33:15 PM5/28/12
to codersh...@googlegroups.com
Hi,

Do you guys use the latest release (MySQL-wsrep-5.5.23-23.6, Galera
23.2.1)? Are the nodes identical, if not, what is the difference? Is
rsync version the same on all nodes? Are you using Cluster Control or
any other tool to manage the nodes? Could you post a corresponding log
piece from i-0000015c-psm ?

Thanks,
Alex
--
Alexey Yurchenko,
Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011

Ale C.

unread,
May 29, 2012, 9:38:15 AM5/29/12
to codership
Hi Alex.
Responses in order.

On May 28, 4:33 pm, Alex Yurchenko <alexey.yurche...@codership.com>
wrote:
> Hi,
>
> Do you guys use the latest release (MySQL-wsrep-5.5.23-23.6, Galera
> 23.2.1)? Are the nodes identical, if not, what is the difference? Is
> rsync version the same on all nodes?

All the nodes are the same virtual machine image as base, Ubuntu
10.10.
They are all using : galera-23.2.0-amd64.deb
And we have upgraded all the nodes from mysql-server-wsrep-5.5.20-23.4-
amd64.deb to mysql-server-wsrep-5.5.23-23.5-amd64.deb.
As they are all the same, the rsync version is : rsync version 3.0.7
protocol version 30.

> Are you using Cluster Control or any other tool to manage the nodes?

We are not using any special tool to manage the nodes.

>Could you post a corresponding log
> piece from i-0000015c-psm ?

Here you go, this is the log from the DONOR :
-------------------------------------------
May 28 11:56:31 i-0000015c-psm mysqld: 120528 11:56:31 [Note] WSREP:
(031c7151-a73e-11e1-0800-5b90b6781083, 'tcp://0.0.0.0:4567') cleaning
up duplicate 0x7f07f40e5a20 after established 0x7f06b84973a0
May 28 11:56:31 i-0000015c-psm mysqld: 120528 11:56:31 [Note] WSREP:
GMCast::handle_stable_view: view(view_id(PRIM,031c7151-
a73e-11e1-0800-5b90b6781083,114) memb {
May 28 11:56:31 i-0000015c-psm mysqld: 031c7151-
a73e-11e1-0800-5b90b6781083,
May 28 11:56:31 i-0000015c-psm mysqld: 73140858-
a73d-11e1-0800-8519f763d7c6,
May 28 11:56:31 i-0000015c-psm mysqld: 8cb1ccb5-a73e-11e1-0800-
f3c769af6bda,
May 28 11:56:31 i-0000015c-psm mysqld: b11ebccc-
a8dd-11e1-0800-327a5484e333,
May 28 11:56:31 i-0000015c-psm mysqld: } joined {
May 28 11:56:31 i-0000015c-psm mysqld: } left {
May 28 11:56:31 i-0000015c-psm mysqld: } partitioned {
May 28 11:56:31 i-0000015c-psm mysqld: })
May 28 11:56:31 i-0000015c-psm mysqld: 120528 11:56:31 [Note] WSREP:
declaring 73140858-a73d-11e1-0800-8519f763d7c6 stable
May 28 11:56:31 i-0000015c-psm mysqld: 120528 11:56:31 [Note] WSREP:
New COMPONENT: primary = yes, my_idx = 0, memb_num = 4
May 28 11:56:31 i-0000015c-psm mysqld: 120528 11:56:31 [Note] WSREP:
declaring 8cb1ccb5-a73e-11e1-0800-f3c769af6bda stable
May 28 11:56:31 i-0000015c-psm mysqld: 120528 11:56:31 [Note] WSREP:
declaring b11ebccc-a8dd-11e1-0800-327a5484e333 stable
May 28 11:56:31 i-0000015c-psm mysqld: 120528 11:56:31 [Note] WSREP:
STATE_EXCHANGE: sent state UUID: b1521664-a8dd-11e1-0800-354876d79350
May 28 11:56:31 i-0000015c-psm mysqld: 120528 11:56:31 [Note] WSREP:
STATE EXCHANGE: sent state msg: b1521664-a8dd-11e1-0800-354876d79350
May 28 11:56:31 i-0000015c-psm mysqld: 120528 11:56:31 [Note] WSREP:
STATE EXCHANGE: got state msg: b1521664-a8dd-11e1-0800-354876d79350
from 0 (i-0000015c-psm)
May 28 11:56:31 i-0000015c-psm mysqld: 120528 11:56:31 [Note] WSREP:
STATE EXCHANGE: got state msg: b1521664-a8dd-11e1-0800-354876d79350
from 1 (i-000000c6-wsm)
May 28 11:56:31 i-0000015c-psm mysqld: 120528 11:56:31 [Note] WSREP:
STATE EXCHANGE: got state msg: b1521664-a8dd-11e1-0800-354876d79350
from 2 (i-000000df-usm)
May 28 11:56:31 i-0000015c-psm mysqld: 120528 11:56:31 [Note] WSREP:
STATE EXCHANGE: got state msg: b1521664-a8dd-11e1-0800-354876d79350
from 3 (i-00000092-asm)
May 28 11:56:31 i-0000015c-psm mysqld: 120528 11:56:31 [Note] WSREP:
Quorum results:
May 28 11:56:31 i-0000015c-psm mysqld: version = 2,
May 28 11:56:31 i-0000015c-psm mysqld: component = PRIMARY,
May 28 11:56:31 i-0000015c-psm mysqld: conf_id = 64,
May 28 11:56:31 i-0000015c-psm mysqld: members = 3/4 (joined/
total),
May 28 11:56:31 i-0000015c-psm mysqld: act_id = 210837192,
May 28 11:56:31 i-0000015c-psm mysqld: last_appl. = 210836442,
May 28 11:56:31 i-0000015c-psm mysqld: protocols = 0/3/1 (gcs/repl/
appl),
May 28 11:56:31 i-0000015c-psm mysqld: group UUID =
3108998a-67d4-11e1-0800-84a08b380d82
May 28 11:56:31 i-0000015c-psm mysqld: 120528 11:56:31 [Note] WSREP:
Flow-control interval: [16, 32]
May 28 11:56:31 i-0000015c-psm mysqld: 120528 11:56:31 [Note] WSREP:
New cluster view: global state:
3108998a-67d4-11e1-0800-84a08b380d82:210837192, view# 65: Primary,
number of nodes: 4, my index: 0, protocol version 1
May 28 11:56:31 i-0000015c-psm mysqld: 120528 11:56:31 [Note] WSREP:
wsrep_notify_cmd is not defined, skipping notification.
May 28 11:56:31 i-0000015c-psm mysqld: 120528 11:56:31 [Note] WSREP:
Assign initial position for certification: 210837192, protocol
version: 2
May 28 11:56:33 i-0000015c-psm mysqld: 120528 11:56:33 [Note] WSREP:
Node 3 (i-00000092-asm) requested state transfer from '*any*'.
Selected 0 (i-0000015c-psm)(SYNCED) as donor.
May 28 11:56:33 i-0000015c-psm mysqld: 120528 11:56:33 [Note] WSREP:
Shifting SYNCED -> DONOR/DESYNCED (TO: 210837263)
May 28 11:56:33 i-0000015c-psm mysqld: 120528 11:56:33 [Note] WSREP:
IST request: 3108998a-67d4-11e1-0800-84a08b380d82:210835601-210837192|
tcp://172.16.159.121:4568
May 28 11:56:33 i-0000015c-psm mysqld: 120528 11:56:33 [Note] WSREP:
wsrep_notify_cmd is not defined, skipping notification.
May 28 11:56:33 i-0000015c-psm mysqld: 120528 11:56:33 [Note] WSREP:
Running: 'wsrep_sst_rsync 'donor' '172.16.159.121:4444/rsync_sst'
'wsrep_sst:wspass' '/var/lib/mysql/' '/etc/mysql/conf.d/wsrep.cnf'
'3108998a-67d4-11e1-0800-84a08b380d82' '210835601' '1''
May 28 11:56:33 i-0000015c-psm mysqld: 120528 11:56:33 [Note] WSREP:
sst_donor_thread signaled with 0
May 28 11:56:33 i-0000015c-psm mysqld: 120528 11:56:33 [Note] WSREP:
async IST sender starting to serve tcp://172.16.159.121:4568 sending
210835602-210837192
May 28 11:56:39 i-0000015c-psm mysqld: 120528 11:56:39 [Note] WSREP:
async IST sender served
May 28 11:56:39 i-0000015c-psm mysqld: 120528 11:56:39 [Note] WSREP: 0
(i-0000015c-psm): State transfer to 3 (i-00000092-asm) complete.
May 28 11:56:39 i-0000015c-psm mysqld: 120528 11:56:39 [Note] WSREP:
Shifting DONOR/DESYNCED -> JOINED (TO: 210837519)
May 28 11:56:39 i-0000015c-psm mysqld: 120528 11:56:39 [Note] WSREP: 3
(i-00000092-asm): State transfer from 0 (i-0000015c-psm) complete.
May 28 11:56:39 i-0000015c-psm mysqld: 120528 11:56:39 [Note] WSREP:
Member 0 (i-0000015c-psm) synced with group.
May 28 11:56:39 i-0000015c-psm mysqld: 120528 11:56:39 [Note] WSREP:
Shifting JOINED -> SYNCED (TO: 210837519)
May 28 11:56:39 i-0000015c-psm mysqld: 120528 11:56:39 [Note] WSREP:
Synchronized with group, ready for connections
May 28 11:56:39 i-0000015c-psm mysqld: 120528 11:56:39 [Note] WSREP:
wsrep_notify_cmd is not defined, skipping notification.
May 28 11:56:39 i-0000015c-psm mysqld: 120528 11:56:39 [Note] WSREP:
Member 3 (i-00000092-asm) synced with group.
-------------------------------------------

Any sugestion Alex ?
Since when a node goes down, and after SST it gets that unconsistent,
is critical for us.

Thanks as allways.

Alex Yurchenko

unread,
May 29, 2012, 10:02:01 AM5/29/12
to codersh...@googlegroups.com
Well, logs just don't show anything suspicious except that strange
rsync error notification, but I'm sure it is ok and must be caused (the
notification) by some monitoring tool installed - there actually was no
SST to corrupt the data.

My only guess is that data on that node either was corrupted while it
was stopped, or had been corrupted long ago.

What was your complete mysqldump command? If you repeat it on the
reference node again, does it fail?
What was the command to access the table? Can you access this table on
i-0000015c-psm?
Is this table routinely modified by your application?
Do you have enough disk space there?

Thanks,
Alex

Ale C.

unread,
May 31, 2012, 9:48:26 PM5/31/12
to codership
The problem is fixed !
It was related to the database schema from the application side :

This is the bug : https://bugs.launchpad.net/nova/+bug/993663
Patch applied and everything is working just fine !

On May 29, 11:02 am, Alex Yurchenko <alexey.yurche...@codership.com>
wrote:

Ale C.

unread,
May 31, 2012, 9:51:13 PM5/31/12
to codership
Thanks Alex for all the help !
Reply all
Reply to author
Forward
0 new messages