Thanks for the advice. I did this and didn't see any indication that
the IP was being moved between hosts. After I put both hosts back
online, mmm_mond.log on the monitoring server started to fill with
alternating info and warning messages for the rep_threads and
rep_backlog checks. Here's a timeline of events:
I put these hosts online at 1:49pm:
2011/05/11 13:49:09 FATAL Admin changed state of 'mysql1' from AWAITING_RECOVERY to ONLINE
2011/05/11 13:49:09 INFO Orphaned role 'writer(192.168.1.200)' has been assigned to 'mysql1'
2011/05/11 13:49:22 FATAL Admin changed state of 'mysql2' from AWAITING_RECOVERY to ONLINE
mysql1 logged continuous errors (192.168.1.15 is mysql1):
2011/05/11 13:49:26 FATAL Couldn't allow writes: ERROR: Can't connect to MySQL (host = 192.168.1.15:3306, user = agent)! Can't connect to MySQL server on '192.168.1.15' (4)
This went on until 3:57pm, when mysql1 logged the following:
2011/05/11 15:57:18 INFO We have some new roles added or old rules deleted!
2011/05/11 15:57:18 INFO Deleted: writer(192.168.1.200)
2011/05/11 15:57:21 FATAL Couldn't deny writes: ERROR: Can't connect to MySQL (host = 192.168.1.15:3306, user = agent)! Can't connect to MySQL server on '192.168.1.15' (4)
At the same time, mysql2 (192.168.1.16, which has no roles assigned
at all) logged:
2011/05/11 15:57:20 INFO We have some new roles added or old rules deleted!
2011/05/11 15:57:20 INFO Added: writer(192.168.1.200)
2011/05/11 15:57:23 FATAL Couldn't sync with master: ERROR: Can't connect to MySQL (host = 192.168.1.16:3306, user = agent)! Can't connect to MySQL server on '192.168.1.16' (4)
Now mysql2 starts logging:
2011/05/11 15:57:26 FATAL Couldn't allow writes: ERROR: Can't connect to MySQL (host = 192.168.1.16:3306, user = agent)! Can't connect to MySQL server on '192.168.1.16' (4)
This continues for a few minutes until the monitor deletes the writer
role from this server and tries to give it back to mysql1. mysql1
holds the writer role for a bit, then it gets assigned back to
mysql2. This goes back and forth for a while until logging stops on
both hosts around 4pm.
Starting around 3:57pm, the monitoring node has been logging both
hosts cycling from HARD_OFFLINE to AWAITING_RECOVERY to ONLINE and
back to HARD_OFFLINE:
2011/05/11 13:49:09 FATAL Admin changed state of 'mysql1' from AWAITING_RECOVERY to ONLINE
2011/05/11 13:49:22 FATAL Admin changed state of 'mysql2' from AWAITING_RECOVERY to ONLINE
2011/05/11 15:57:32 FATAL State of host 'mysql1' changed from HARD_OFFLINE to AWAITING_RECOVERY
2011/05/11 15:57:35 FATAL State of host 'mysql1' changed from AWAITING_RECOVERY to ONLINE because it was down for only 18 seconds
2011/05/11 16:00:27 FATAL State of host 'mysql2' changed from HARD_OFFLINE to AWAITING_RECOVERY
2011/05/11 16:00:30 FATAL State of host 'mysql2' changed from AWAITING_RECOVERY to ONLINE because it was down for only 12 seconds
This is pretty much the state of affairs every time. My theory is
that the mmm agent is not able to connect to the local MySQL
database, and that this is causing the mysql-mmm monitor to think the
host is offline. I tested local access with the agent's credentials
and, just in case, even added an agent@localhost user with the same
privileges as the subnet-specific user I'd already created. I still
get the same issue.
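To separate a pure networking problem from a privileges problem, it might also help to probe port 3306 on each host with a plain TCP connect and see whether it succeeds, is refused, or times out: a refused connect means nothing is listening on that address, while a timeout points at a firewall or routing issue. A minimal Python sketch (the probe() helper is mine, not part of mysql-mmm; hosts and port are taken from the logs above):

```python
import socket

def probe(host, port, timeout=1.0):
    """Plain TCP connect to host:port; classify the result."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"   # something is listening on the port
    except ConnectionRefusedError:
        return "refused"         # host reachable, but nothing listening
    except socket.timeout:
        return "timeout"         # packets silently dropped (firewall?)
    except OSError as exc:
        return f"error: {exc}"   # e.g. no route to host

# Hosts and port taken from the log excerpts above.
for host in ("192.168.1.15", "192.168.1.16"):
    print(host, probe(host, 3306))
```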
On May 11, 10:51 am, Manuel Arostegui Ramirez <manuel.todoli...@gmail.com> wrote:
> Make sure the IP ain't moving around both machines.
> Try to do a 'watch ip addr' and check if it's going back and forward between
> them.
>
> Manuel.
>
> 2011/5/10 iglablues <stevielivesh...@gmail.com>