Re: [codership-team] 3 cluster galera setup but 2 clusters can't communicate firewall question

Alex Yurchenko

Oct 2, 2012, 2:55:31 AM
to codersh...@googlegroups.com
Hi,

1) if anything, "solo" is by far the least involved mode for a Galera
node to run. So if it crashes solo, you should report a bug. Also, it is
EXTREMELY unlikely that a Galera node would run solo as a result of
network partitioning or a peer crash. If it does, you should report a bug.

2) by default a Galera node uses port 4567 for replication and 4568 for
incremental state transfer, so these ports should be open too.

3) by default the rsync and xtrabackup snapshot methods use port 4444, so
that should be open too.

All in all, to simplify things I'd open all ports between those 3
nodes. If those nodes are trusted, restricting access to individual ports
does not really add much security. Just remove the --dport 3306 option
from the three per-node rules.

4) don't forget to upgrade to the latest versions. They are released
because they fix bugs. It sounds like you're running 0.8 or something ;)
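
To make points 2) and 3) concrete, here is a minimal Python sketch that prints INPUT rules in the same iptables-save style as the quoted rule set. It is an illustration only: the peer addresses are placeholders from the documentation range, the helper names `per_port_rules` and `open_all_rules` are made up for this example, and the port list simply repeats the defaults named above plus 3306. The second helper corresponds to dropping `--dport 3306` from the existing per-node rules.

```python
# Sketch: build iptables-save style INPUT rules for the default ports
# mentioned in this thread. Peer IPs below are hypothetical placeholders.

GALERA_TCP_PORTS = {
    3306: "MySQL client connections",
    4567: "Galera replication",
    4568: "incremental state transfer",
    4444: "rsync / xtrabackup state snapshot transfer",
}


def per_port_rules(peers, ports=GALERA_TCP_PORTS):
    """Return one ACCEPT rule per (peer, port) pair, ports in ascending order."""
    return [
        f"-A INPUT -s {peer}/32 -p tcp -m tcp --dport {port} -j ACCEPT"
        for peer in peers
        for port in sorted(ports)
    ]


def open_all_rules(peers):
    """Return one ACCEPT rule per peer with no --dport match (all TCP ports)."""
    return [f"-A INPUT -s {peer}/32 -p tcp -m tcp -j ACCEPT" for peer in peers]


if __name__ == "__main__":
    peers = ["192.0.2.10", "192.0.2.20"]  # hypothetical peer addresses
    print("# restrictive variant: one rule per port")
    print("\n".join(per_port_rules(peers)))
    print("# simple variant: all TCP traffic from trusted peers")
    print("\n".join(open_all_rules(peers)))
```

Either variant has to be applied on all three nodes, each listing the other two as peers, and the rules need to sit ahead of any rule that drops or rejects the same traffic.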

Regards,
Alex

On 2012-10-02 01:39, SteveSRS wrote:
> Hi,
>
> I've been running a 3-node (UK - NL - US) cluster with MySQL Galera.
> UK - runs same site (Geo targeted)
> US - runs same site (Geo targeted)
> NL - is the controller (no Apache server, just MySQL)
>
> Normally it works well; however, every now and then (mostly when my
> hosting f's up) something goes down, it all goes wrong big time, and
> it takes several days to get back up..
>
> It just happened again:
> My UK server went down (up, down, up down, down down still nothing)
> Then my site stayed online due to Galera (mirrored on US server)..
> however some hours later Galera on US also crashed..
>
> Then I checked and I noticed US Galera was running solo (where it
> should be 2, as the NL server is also still online).
>
> *This is the first issue*, which I've noticed for a long time (also in
> previous versions). MySQL Galera can't run solo; it will always give
> out after some hours (max 1 day).
>
> *Now the main issue:*
> Investigating further why the cluster didn't stay alive with just
> NL - US, I found out there is a firewall problem blocking
> communication between NL and US!
> However, when everything is online (including UK) and I check cluster
> status, it always states '3 nodes' connected like normal.
>
> I found out because I flushed the iptables rules on the US server and
> then it did work.
>
> So my iptables rules on the US server are:
> :INPUT DROP [462:55883]
> :FORWARD DROP [0:0]
> :OUTPUT ACCEPT [204532:1123918317]
> -A INPUT -i eth1 -p tcp -m tcp --dport 22 -m state --state NEW -m recent --update --seconds 180 --hitcount 4 --name DEFAULT --rsource -j DROP
> -A INPUT -i eth1 -p tcp -m tcp --dport 22 -m state --state NEW -m recent --set --name DEFAULT --rsource
> -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
> -A INPUT -p icmp -j ACCEPT
> -A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
> -A INPUT -p tcp -m tcp --dport 25 -j ACCEPT
> -A INPUT -p tcp -m tcp --dport 26 -j ACCEPT
> -A INPUT -p tcp -m tcp --dport 143 -j ACCEPT
> -A INPUT -p tcp -m tcp --dport 110 -j ACCEPT
> -A INPUT -p tcp -m tcp --dport 80 -j ACCEPT
> -A INPUT -p tcp -m tcp --dport 995 -j ACCEPT
> -A INPUT -p tcp -m tcp --dport 113 -j REJECT --reject-with icmp-port-unreachable
> -A INPUT -p tcp -m tcp --dport 443 -j ACCEPT
> -A INPUT -s [US-IP]/32 -p tcp -m tcp --dport 3306 -j ACCEPT
> -A INPUT -s [UK-IP]/32 -p tcp -m tcp --dport 3306 -j ACCEPT
> -A INPUT -s [NL-IP]/32 -p tcp -m tcp --dport 3306 -j ACCEPT
> -A INPUT -i lo -j ACCEPT
>
> With these rules active on the US server I tried the following on the
> NL server:
> telnet [IP-US] 3306
> This works, of course. It is a raw connection, so I can't really do
> anything, but I get some weird characters and "Escape character is '^]'".
>
> Connecting via MySQL
> /opt/galera/mysql-galera -g gcomm://[IP-US] --mysql-opt --socket=/var/lib/mysql/mysql.sock start
>
> Gives me error:
> ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet' (104)
>
> The connection between the UK and US servers always seemed to work
> fine (otherwise I don't think the whole cluster would connect at all
> when everything is online; I can't test this further right now as it
> is still down).
>
> Can somebody tell me what to add to the iptables rules to make
> connecting from the NL to the US server work?
>
> Thank you!
>
> P.S. The root user for the Galera setup on the US server has % as its
> host value, so that should be good.

--
Alexey Yurchenko,
Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011

SteveSRS

Oct 3, 2012, 4:40:51 PM
to codersh...@googlegroups.com
Hi,

No, I'm running 23.2.1 (r129)
MySQL version 5.5.23

Thanks for the port info. I'll indeed just open it up completely between those nodes; I think that would be best.

If it happens again that it is running on 1 node and it crashes, I'll check it out and collect log data. However, this is all a live environment, so I don't want to use it for testing too much.

The problem is solved now. When the UK server came back online it also didn't want to connect to US. So I completely turned off all firewalls on all 3 servers.
Then NL - US connected with no problem, but UK - US kept refusing to connect, always with MySQL errors 104, 110 and 111. The main problem seemed to be with mysql.sock, which was sometimes there and other times not (while MySQL seemed to be running fine).

In the end I deleted grastate.dat, and at about the same moment the hosting support guy turned the firewall on the US server back on (with a different set of rules than before), and it suddenly connected. That is weird, as the firewall (iptables) was completely flushed and turning it back on should not solve the problem, so this makes me conclude that deleting grastate.dat from the Galera mysql/var/ dir did the trick.


Alex Yurchenko

Oct 3, 2012, 4:50:20 PM
to codersh...@googlegroups.com
I would not bet my life on it (there can be a very convoluted
explanation) but it is EXTREMELY unlikely that grastate.dat has anything
to do with that. Errors 104, 110 and 111 are all *network communication
errors*. It is much more likely that iptables was not flushed the way
you expected it to be - there are three machines involved in all this
and one mistake on one machine can render the whole cluster
nonoperational.
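
One way to check each link directly is a plain TCP connect test from every node to every other node. The sketch below is a hedged illustration, not a Galera tool: `probe` is a hypothetical helper, the peer address is a placeholder, and the port list assumes the defaults mentioned earlier in the thread. On Linux the three error numbers correspond to ECONNRESET (104), ETIMEDOUT (110) and ECONNREFUSED (111).

```python
import errno
import socket

# Linux errno numbers for the three errors quoted in the thread.
LINUX_ERRNO_NAMES = {104: "ECONNRESET", 110: "ETIMEDOUT", 111: "ECONNREFUSED"}


def probe(host, port, timeout=3.0):
    """Attempt a TCP connect and return a short label for the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        return "timeout"
    except ConnectionRefusedError:
        return "refused"
    except ConnectionResetError:
        return "reset"
    except OSError as exc:
        # Any other failure: report the symbolic errno name when one is known.
        return "error: " + errno.errorcode.get(exc.errno, str(exc.errno))


if __name__ == "__main__":
    peer = "192.0.2.20"  # hypothetical peer address, replace with a real node
    for port in (3306, 4567, 4568, 4444):  # defaults assumed from this thread
        print(port, probe(peer, port))
```

As a rough guide, a "timeout" often points at packets being silently dropped along the way, while "refused" usually means the host was reached but either nothing is listening on that port or a rule actively rejects the connection. Note that the state transfer ports are typically only listening while a transfer is in progress, so "refused" on those is not necessarily a firewall problem.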

Anyway, glad you got it running!