galera - 3 nodes crashed from OOM


Michael Mittentag

May 12, 2015, 11:44:11 AM
to codersh...@googlegroups.com
Last night we had 3 nodes crash at the same time out of a 6 node cluster.

We are running the following on an Amazon AMI in EC2:

galera-23.2.7-1.rhel6.x86_64
MariaDB-Galera-server-5.5.34-1.x86_64

wsrep_provider_version                   | 23.2.7(r157)

Our innodb_buffer_pool_size is set to 200G, and the server has 240 GB total
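With a 200G buffer pool on a 240 GB box, the headroom left for everything else (per-connection buffers, the Galera gcache, the OS, and InnoDB's own control structures allocated on top of the configured pool size) is thin. A back-of-envelope sketch, where the ~8% buffer-pool overhead figure is an assumption:

```shell
# Back-of-envelope headroom check (GB; the 8% overhead figure is assumed).
awk 'BEGIN {
  total = 240; pool = 200; overhead = pool * 0.08
  printf "pool + InnoDB overhead : %.0f GB\n", pool + overhead
  printf "headroom for the rest  : %.0f GB\n", total - pool - overhead
}'
```

With no swap configured (the Mem-Info dump later in this message shows Total swap = 0kB), any spike past that headroom goes straight to the OOM killer.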


We have been running these versions for 1 year without an issue

There was nothing in the MySQL log except that it ended:

150511 21:48:04 mysqld_safe Number of processes running now: 0

150511 21:48:04 mysqld_safe WSREP: not restarting wsrep node automatically

150511 21:48:04 mysqld_safe mysqld from pid file /var/lib/mysql/node4.pid ended




but in /var/log/messages we saw this on each of the 3 nodes that crashed:

May 11 21:48:02 node4 kernel: [7336182.265235] mysqld invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
May 11 21:48:02 node4 kernel: [7336182.273747] mysqld cpuset=/ mems_allowed=0-1
May 11 21:48:02 node4 kernel: [7336182.275878] Pid: 27347, comm: mysqld Not tainted 3.4.37-40.44.amzn1.x86_64 #1
May 11 21:48:02 node4 kernel: [7336182.279841] Call Trace:
May 11 21:48:02 node4 kernel: [7336182.281287]  [<ffffffff8110644e>] dump_header.constprop.6+0x7e/0x1b0
May 11 21:48:02 node4 kernel: [7336182.284242]  [<ffffffff81233130>] ? ___ratelimit+0xa0/0x120
May 11 21:48:02 node4 kernel: [7336182.286881]  [<ffffffff811067fd>] oom_kill_process.part.4.constprop.5+0x13d/0x280
May 11 21:48:02 node4 kernel: [7336182.290900]  [<ffffffff811e12a5>] ? security_capable_noaudit+0x15/0x20
May 11 21:48:02 node4 kernel: [7336182.294265]  [<ffffffff81054717>] ? has_capability_noaudit+0x17/0x20
May 11 21:48:02 node4 kernel: [7336182.297247]  [<ffffffff81106e37>] out_of_memory+0x367/0x540
May 11 21:48:02 node4 kernel: [7336182.299944]  [<ffffffff8110bfa9>] __alloc_pages_nodemask+0x8e9/0x900
May 11 21:48:02 node4 kernel: [7336182.302996]  [<ffffffff812116e2>] ? queue_unplugged+0x62/0xf0
May 11 21:48:02 node4 kernel: [7336182.305729]  [<ffffffffa009a7d0>] ? noalloc_get_block_write+0x30/0x30 [ext4]
May 11 21:48:02 node4 kernel: [7336182.309038]  [<ffffffff81142036>] alloc_pages_current+0xb6/0x120
May 11 21:48:02 node4 kernel: [7336182.311995]  [<ffffffff81102a8f>] __page_cache_alloc+0xcf/0xf0
May 11 21:48:02 node4 kernel: [7336182.314751]  [<ffffffff811052f8>] filemap_fault+0x298/0x460
May 11 21:48:02 node4 kernel: [7336182.317328]  [<ffffffff81124dcf>] __do_fault+0x6f/0x4d0
May 11 21:48:02 node4 kernel: [7336182.319698]  [<ffffffff81127e77>] handle_pte_fault+0xf7/0x970
May 11 21:48:02 node4 kernel: [7336182.322386]  [<ffffffff81129979>] handle_mm_fault+0x259/0x340
May 11 21:48:02 node4 kernel: [7336182.325571]  [<ffffffff8130e5ed>] ? sock_aio_read+0x2d/0x40
May 11 21:48:02 node4 kernel: [7336182.328416]  [<ffffffff813e9799>] do_page_fault+0x139/0x4e0
May 11 21:48:02 node4 kernel: [7336182.332204]  [<ffffffff813e5d89>] ? _raw_spin_unlock_bh+0x19/0x20
May 11 21:48:02 node4 kernel: [7336182.335024]  [<ffffffff8131264a>] ? release_sock+0xfa/0x120
May 11 21:48:02 node4 kernel: [7336182.338632]  [<ffffffff8103c355>] ? pvclock_clocksource_read+0x55/0xf0
May 11 21:48:02 node4 kernel: [7336182.341820]  [<ffffffff813e63e5>] page_fault+0x25/0x30
May 11 21:48:02 node4 kernel: [7336182.344426] Mem-Info:
May 11 21:48:02 node4 kernel: [7336182.345886] Node 0 DMA per-cpu:
May 11 21:48:02 node4 kernel: [7336182.348058] CPU    0: hi:    0, btch:   1 usd:   0
May 11 21:48:02 node4 kernel: [7336182.350483] CPU    1: hi:    0, btch:   1 usd:   0
May 11 21:48:02 node4 kernel: [7336182.352691] CPU    2: hi:    0, btch:   1 usd:   0
May 11 21:48:02 node4 kernel: [7336182.354883] CPU    3: hi:    0, btch:   1 usd:   0
May 11 21:48:02 node4 kernel: [7336182.357517] CPU    4: hi:    0, btch:   1 usd:   0
May 11 21:48:02 node4 kernel: [7336182.360319] CPU    5: hi:    0, btch:   1 usd:   0
May 11 21:48:02 node4 kernel: [7336182.362915] CPU    6: hi:    0, btch:   1 usd:   0
May 11 21:48:02 node4 kernel: [7336182.365239] CPU    7: hi:    0, btch:   1 usd:   0
May 11 21:48:02 node4 kernel: [7336182.367433] CPU    8: hi:    0, btch:   1 usd:   0
May 11 21:48:02 node4 kernel: [7336182.369906] CPU    9: hi:    0, btch:   1 usd:   0
May 11 21:48:02 node4 kernel: [7336182.372369] CPU   10: hi:    0, btch:   1 usd:   0
May 11 21:48:02 node4 kernel: [7336182.374840] CPU   11: hi:    0, btch:   1 usd:   0
May 11 21:48:02 node4 kernel: [7336182.377352] CPU   12: hi:    0, btch:   1 usd:   0
May 11 21:48:02 node4 kernel: [7336182.379548] CPU   13: hi:    0, btch:   1 usd:   0
May 11 21:48:02 node4 kernel: [7336182.381758] CPU   14: hi:    0, btch:   1 usd:   0
May 11 21:48:02 node4 kernel: [7336182.383932] CPU   15: hi:    0, btch:   1 usd:   0
May 11 21:48:02 node4 kernel: [7336182.386128] CPU   16: hi:    0, btch:   1 usd:   0
May 11 21:48:02 node4 kernel: [7336182.424113] Node 0 DMA32 per-cpu:
May 11 21:48:02 node4 kernel: [7336182.425959] CPU    0: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.428215] CPU    1: hi:  186, btch:  31 usd:  58
May 11 21:48:02 node4 kernel: [7336182.430416] CPU    2: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.432838] CPU    3: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.435054] CPU    4: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.437265] CPU    5: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.439458] CPU    6: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.441673] CPU    7: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.443874] CPU    8: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.446087] CPU    9: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.448336] CPU   10: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.450754] CPU   11: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.453503] CPU   12: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.455725] CPU   13: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.457934] CPU   14: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.460151] CPU   15: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.462632] CPU   16: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.464968] CPU   17: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.467251] CPU   18: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.469538] CPU   19: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.471831] CPU   20: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.474145] CPU   21: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.476683] CPU   22: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.478851] CPU   23: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.481044] CPU   24: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.483231] CPU   25: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.485765] CPU   26: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.487937] CPU   27: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.490449] CPU   28: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.492654] CPU   29: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.495148] CPU   30: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.497558] CPU   31: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.503985] Node 0 Normal per-cpu:
May 11 21:48:02 node4 kernel: [7336182.506022] CPU    0: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.508494] CPU    1: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.511403] CPU    2: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.514203] CPU    3: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.516750] CPU    4: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.518922] CPU    5: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.521127] CPU    6: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.523322] CPU    7: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.525829] CPU    8: hi:  186, btch:  31 usd:   0
May 11 21:48:02 node4 kernel: [7336182.668479] active_anon:59741220 inactive_anon:2719180 isolated_anon:0
May 11 21:48:02 node4 kernel: [7336182.668479]  active_file:331 inactive_file:0 isolated_file:52
May 11 21:48:02 node4 kernel: [7336182.668480]  unevictable:8 dirty:0 writeback:0 unstable:0
May 11 21:48:02 node4 kernel: [7336182.668481]  free:145665 slab_reclaimable:32128 slab_unreclaimable:18577
May 11 21:48:02 node4 kernel: [7336182.668482]  mapped:230 shmem:4038827 pagetables:150251 bounce:0
May 11 21:48:02 node4 kernel: [7336182.685186] Node 0 DMA free:15908kB min:4kB low:4kB high:4kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15652kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
May 11 21:48:02 node4 kernel: [7336182.704317] lowmem_reserve[]: 0 3760 123027 123027
May 11 21:48:02 node4 kernel: [7336182.707051] Node 0 DMA32 free:478348kB min:1376kB low:1720kB high:2064kB active_anon:3320088kB inactive_anon:36948kB active_file:0kB inactive_file:8kB unevictable:0kB isolated(anon):0kB isolated(file):28kB present:3850496kB mlocked:0kB dirty:0kB writeback:0kB mapped:12kB shmem:37820kB slab_reclaimable:2084kB slab_unreclaimable:272kB kernel_stack:8kB pagetables:3072kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:17 all_unreclaimable? no
May 11 21:48:02 node4 kernel: [7336182.724517] lowmem_reserve[]: 0 0 119266 119266
May 11 21:48:02 node4 kernel: [7336182.727889] Node 0 Normal free:43584kB min:43672kB low:54588kB high:65508kB active_anon:111972680kB inactive_anon:9333612kB active_file:1028kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):200kB present:122129280kB mlocked:0kB dirty:0kB writeback:0kB mapped:852kB shmem:12915252kB slab_reclaimable:97860kB slab_unreclaimable:46360kB kernel_stack:1856kB pagetables:353060kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
May 11 21:48:02 node4 kernel: [7336182.746277] lowmem_reserve[]: 0 0 0 0
May 11 21:48:02 node4 kernel: [7336182.748642] Node 1 Normal free:44900kB min:45056kB low:56320kB high:67584kB active_anon:123672112kB inactive_anon:1506192kB active_file:4kB inactive_file:128kB unevictable:32kB isolated(anon):0kB isolated(file):0kB present:126000000kB mlocked:32kB dirty:0kB writeback:0kB mapped:0kB shmem:3202260kB slab_reclaimable:28376kB slab_unreclaimable:27672kB kernel_stack:760kB pagetables:244872kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:81 all_unreclaimable? no
May 11 21:48:02 node4 kernel: [7336182.769637] lowmem_reserve[]: 0 0 0 0
May 11 21:48:02 node4 kernel: [7336182.771944] Node 0 DMA: 1*4kB 0*8kB 0*16kB 1*32kB 2*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15908kB
May 11 21:48:02 node4 kernel: [7336182.779264] Node 0 DMA32: 406*4kB 1259*8kB 2321*16kB 1691*32kB 1114*64kB 690*128kB 335*256kB 140*512kB 43*1024kB 7*2048kB 0*4096kB = 478368kB
May 11 21:48:02 node4 kernel: [7336182.786913] Node 0 Normal: 11237*4kB 93*8kB 14*16kB 2*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 45980kB
May 11 21:48:02 node4 kernel: [7336182.794923] Node 1 Normal: 1131*4kB 344*8kB 214*16kB 170*32kB 81*64kB 57*128kB 38*256kB 6*512kB 4*1024kB 0*2048kB 0*4096kB = 45516kB
May 11 21:48:02 node4 kernel: [7336182.802542] 4039692 total pagecache pages
May 11 21:48:02 node4 kernel: [7336182.804734] 0 pages in swap cache
May 11 21:48:02 node4 kernel: [7336182.806319] Swap cache stats: add 0, delete 0, find 0/0
May 11 21:48:02 node4 kernel: [7336182.809626] Free swap  = 0kB
May 11 21:48:02 node4 kernel: [7336182.811010] Total swap = 0kB
May 11 21:48:02 node4 kernel: [7336183.254335] 63999984 pages RAM
May 11 21:48:02 node4 kernel: [7336183.256417] 1020195 pages reserved
May 11 21:48:02 node4 kernel: [7336183.258218] 826 pages shared
May 11 21:48:02 node4 kernel: [7336183.259612] 62832290 pages non-shared
May 11 21:48:02 node4 kernel: [7336183.261633] [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
May 11 21:48:02 node4 kernel: [7336183.265074] [ 3363]     0  3363     3820      150   7     -17         -1000 udevd
May 11 21:48:02 node4 kernel: [7336183.268547] [ 3861]     0  3861     3739       97  25     -17         -1000 udevd
May 11 21:48:02 node4 kernel: [7336183.272116] [ 4014]     0  4014     2306      125   0       0             0 dhclient
May 11 21:48:02 node4 kernel: [7336183.275685] [ 4054]     0  4054    27956      104  15     -17         -1000 auditd
May 11 21:48:02 node4 kernel: [7336183.279471] [ 4067]     0  4067    60960     1702   0       0             0 rsyslogd
May 11 21:48:02 node4 kernel: [7336183.283316] [ 4078]     0  4078     3488      160   4       0             0 irqbalance
May 11 21:48:02 node4 kernel: [7336183.286987] [ 4089]    81  4089     5389       56   7       0             0 dbus-daemon
May 11 21:48:02 node4 kernel: [7336183.290994] [ 4127]     0  4127     1048       27   0       0             0 acpid
May 11 21:48:02 node4 kernel: [7336183.294928] [ 4265]     0  4265    19456      193   0     -17         -1000 sshd
May 11 21:48:02 node4 kernel: [7336183.298257] [ 4273]     0  4273     5697       60   0       0             0 xinetd
May 11 21:48:02 node4 kernel: [7336183.301753] [ 4298]    38  4298     6764      148   3       0             0 ntpd
May 11 21:48:02 node4 kernel: [7336183.305369] [ 4313]     0  4313    21773      457  12       0             0 sendmail
May 11 21:48:02 node4 kernel: [7336183.309023] [ 4321]    51  4321    19635      360  10       0             0 sendmail
May 11 21:48:02 node4 kernel: [7336183.312862] [ 4346]     0  4346    31040      152  23       0             0 crond
May 11 21:48:02 node4 kernel: [7336183.317054] [ 4357]     0  4357     5427       42   7       0             0 atd
May 11 21:48:02 node4 kernel: [7336183.320831] [ 4448]     0  4448     1047       23   0       0             0 agetty
May 11 21:48:02 node4 kernel: [7336183.324553] [ 4450]     0  4450     1044       22  14       0             0 mingetty
May 11 21:48:02 node4 kernel: [7336183.328233] [ 4454]     0  4454     1044       23   3       0             0 mingetty
May 11 21:48:02 node4 kernel: [7336183.331709] [ 4456]     0  4456     1044       21   5       0             0 mingetty
May 11 21:48:02 node4 kernel: [7336183.335198] [ 4458]     0  4458     1044       22   4       0             0 mingetty
May 11 21:48:02 node4 kernel: [7336183.338844] [ 4461]     0  4461     1044       22   0       0             0 mingetty
May 11 21:48:02 node4 kernel: [7336183.342335] [ 4466]     0  4466     3819      150  26     -17         -1000 udevd
May 11 21:48:02 node4 kernel: [7336183.345702] [ 4467]     0  4467     1044       23  13       0             0 mingetty
May 11 21:48:02 node4 kernel: [7336183.349315] [ 5253]     0  5253    32973     8130   0       0             0 mcollectived
May 11 21:48:02 node4 kernel: [7336183.352977] [ 5296]     0  5296   177224     1331  15       0             0 collectd
May 11 21:48:02 node4 kernel: [7336183.356472] [ 5354]     0  5354    34154    10779   5       0             0 puppet
May 11 21:48:02 node4 kernel: [7336183.359966] [ 5690]     0  5690     2884       88   6       0             0 mysqld_safe
May 11 21:48:02 node4 kernel: [7336183.363977] [ 6733]    27  6733 75871973 58376204   0       0             0 mysqld
May 11 21:48:02 node4 kernel: [7336183.368015] [ 7793]   219  7793   105021     5759  16       0             0 searchd
May 11 21:48:02 node4 kernel: [7336183.371736] [ 6570]     0  6570    29730     2098   5       0             0 sshd
May 11 21:48:02 node4 kernel: [7336183.375060] [ 7016]     0  7016    28951     1122   4       0             0 sshd
May 11 21:48:02 node4 kernel: [7336183.378946] [ 7522]     0  7522    28901     1255   1       0             0 sshd
May 11 21:48:02 node4 kernel: [7336183.382324] [ 7728]     0  7728    27989      344   1       0             0 sshd
May 11 21:48:02 node4 kernel: [7336183.385668] [12820]     0 12820    29080     1256  21       0             0 sshd
May 11 21:48:02 node4 kernel: [7336183.389167] [12896]     0 12896    28512      860   4       0             0 sshd
May 11 21:48:02 node4 kernel: [7336183.392661] [ 1395]     0  1395    27893      285   6       0             0 sshd
May 11 21:48:02 node4 kernel: [7336183.396161] [ 9664]     0  9664    27999      367   6       0             0 sshd
May 11 21:48:02 node4 kernel: [7336183.399728] [ 5176]     0  5176    28024      379   7       0             0 sshd
May 11 21:48:02 node4 kernel: [7336183.403152] [15032]     0 15032    27944      335   5       0             0 sshd
May 11 21:48:02 node4 kernel: [7336183.406488] [15184]     0 15184    27986      369   1       0             0 sshd
May 11 21:48:02 node4 kernel: [7336183.410095] [25739]     0 25739    28656      954   0       0             0 sshd
May 11 21:48:02 node4 kernel: [7336183.413724] [25793]     0 25793    27893      267   0       0             0 sshd
May 11 21:48:02 node4 kernel: [7336183.417337] [23950]     0 23950    27893      259   0       0             0 sshd
May 11 21:48:02 node4 kernel: [7336183.420668] [23990]     0 23990    27893      258   1       0             0 sshd
May 11 21:48:02 node4 kernel: [7336183.424269] [24063]     0 24063    27893      259   5       0             0 sshd
May 11 21:48:02 node4 kernel: [7336183.427876] [24371]     0 24371    27893      260   0       0             0 sshd
May 11 21:48:02 node4 kernel: [7336183.431672] [24473]     0 24473    27893      260   0       0             0 sshd
May 11 21:48:02 node4 kernel: [7336183.435179] [24753]     0 24753    27893      261   0       0             0 sshd
May 11 21:48:02 node4 kernel: [7336183.438971] [26253]     0 26253    27978      334  12       0             0 sshd
May 11 21:48:02 node4 kernel: [7336183.442588] [26269]     0 26269    28064      477   6       0             0 sshd
May 11 21:48:02 node4 kernel: [7336183.446445] [26312]     0 26312    27987      347   7       0             0 sshd
May 11 21:48:02 node4 kernel: [7336183.482543] [27338]     0 27338    27893      250   0       0             0 sshd
May 11 21:48:02 node4 kernel: [7336183.486744] Out of memory: Kill process 6733 (mysqld) score 930 or sacrifice child
May 11 21:48:02 node4 kernel: [7336183.490430] Killed process 6733 (mysqld) total-vm:303487892kB, anon-rss:233504816kB, file-rss:0kB
May 11 21:48:03 node4 kernel: [7336183.939107] serial8250: too much work for irq4
May 11 21:48:03 node4 kernel: [7336184.120830] serial8250: too much work for irq4
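The process-table and "Killed process" figures in the dump are consistent once the units are lined up: total_vm and rss in the table are 4 KiB pages, while the kill line reports kB. A quick conversion (pure arithmetic on the numbers above) shows mysqld's resident set alone was close to the machine's 240 GB:

```shell
# pid 6733 (mysqld) from the table above: rss=58376204 pages, total_vm=75871973 pages.
# x86_64 pages are 4 KiB, so pages * 4 gives kB, matching the 'Killed process' line.
awk 'BEGIN {
  printf "anon RSS : %.1f GiB\n", 58376204 * 4 / 1024 / 1024
  printf "total VM : %.1f GiB\n", 75871973 * 4 / 1024 / 1024
}'
# prints: anon RSS : 222.7 GiB, total VM : 289.4 GiB
```

222.7 GiB resident against 240 GB of RAM and zero swap leaves the kernel no option but to kill.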



Any ideas?  Have you seen this?


Thanks,
Michael 

Joseph De Oliveiro

May 12, 2015, 11:46:34 AM
to codersh...@googlegroups.com
I've seen this error before. mysqld ran out of memory and the kernel's
OOM killer terminated it.
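As the original post shows, the only record of the kill is in the kernel log, not the MySQL log. A quick way to confirm an OOM kill after the fact on each node (log path taken from the post; the sample line below is copied from the dump there):

```shell
# On each node: grep -E 'invoked oom-killer|Out of memory|Killed process' /var/log/messages
# Pulling the victim's resident-set size out of a 'Killed process' line:
line='Killed process 6733 (mysqld) total-vm:303487892kB, anon-rss:233504816kB, file-rss:0kB'
echo "$line" | grep -o 'anon-rss:[0-9]*kB'   # prints anon-rss:233504816kB
```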

Michael Mittentag

May 12, 2015, 11:59:10 AM
to codersh...@googlegroups.com
Joe,

Did you discover what caused it? How much RAM did you have, and did it happen on all your nodes or just one?

alexey.y...@galeracluster.com

May 12, 2015, 4:58:24 PM
to Michael Mittentag, codersh...@googlegroups.com
On 2015-05-12 18:44, Michael Mittentag wrote:
> Last night we had 3 nodes crash at the same time out of a 6 node
> cluster.

So this means that these 3 nodes are somehow different from the other 3.
How? HW, configuration, usage?

> We are running the following on Amazon AMI in EC2:
>
> galera-23.2.7-1.rhel6.x86_64
> MariaDB-Galera-server-5.5.34-1.x86_64
>
> wsrep_provider_version | 23.2.7(r157)
>
> Our innodb_buffer_pool_size is set to 200G, and the server has 240 GB total

One thing I can say is that Galera 2.x is relatively memory-greedy when
it comes to the certification index, and I take it that with such RAM
sizes you are trying to wield some huge transactions. But I'm not sure
it can consume more than 10 GB - after all, the total writeset limit is
2 GB. And I would expect it to happen on all 6 nodes, provided they have
the same configuration.
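That bound can be sanity-checked with simple arithmetic. Assuming a handful of writesets in flight at the 2 GB cap plus a couple of GB for the certification index (the counts here are illustrative assumptions, not measured values), replication-side memory stays an order of magnitude below the 200G buffer pool:

```shell
# Back-of-envelope: worst-case replication-side memory (GB; counts assumed).
awk 'BEGIN {
  writeset_cap = 2; in_flight = 4; cert_index = 2
  printf "replication-side worst case: ~%d GB\n", writeset_cap * in_flight + cert_index
}'
# prints: replication-side worst case: ~10 GB
```

So replication memory alone is an unlikely culprit for a ~223 GB resident set; the gap has to come from the buffer pool plus per-connection allocations.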

However, 2.x is also quite dated and really buggy by today's standards.
I'd upgrade anyway.


Michael Mittentag

May 12, 2015, 5:20:12 PM
to codersh...@googlegroups.com
Hi Alexey,

Thanks for taking the time to respond.

All 6 nodes are exact clones of each other: same AWS AMI, hardware instance size, and software configuration. The only differences are hostname, IP, and node name.


That is why I thought it was odd that 3 out of 6 crashed. I am glad they did not all crash at once (that would be really scary), but I cannot figure out why these 3 in particular did, and it was the first time we ever had an OOM like this.


I am hesitant to upgrade; the last time we did was 8 months ago. After that upgrade the entire cluster locked up several times in a short span, and we ultimately had to roll back. We had been stable until last night.

Here is the thread I started on codership with the details of what happened last time:

https://groups.google.com/forum/#!topic/codership-team/jB4aV0COkxI

Out of curiosity, do you know whether the issue we ran into has been resolved in the latest MariaDB?


Thanks

alexey.y...@galeracluster.com

May 13, 2015, 10:22:31 AM
to Michael Mittentag, codersh...@googlegroups.com
On 2015-05-13 00:20, Michael Mittentag wrote:
> Here is the thread I started on codership with the details that
> happened
> last time:
>
> https://groups.google.com/forum/#!topic/codership-team/jB4aV0COkxI
>
> Out of curiosity do you know if that issue we ran into has been
> resolved if
> we upgrade to the latest MariaDB?

Unfortunately I can't vouch that the problem you encountered is fixed.
The MariaDB ticket is closed, but it may be a different issue.