Hi All,
In advance, thanks for any light you can shed on this issue I'm
experiencing.
My setup is two MySQL servers in a MySQL-MMM configuration
(only one active writer role [10.0.1.1], and no reader roles between
them). Two further MySQL servers then replicate from 10.0.1.1 and are
used in the "application" as read-only sources (these are the
"application servers"). The application servers only read from
themselves and direct all writes to 10.0.1.1. For monitoring, a cron
job writes the current time into a table in the replicated database
every minute, and each slave "application server" runs its own cron
job that reads the replicated timestamp and compares it to its own
clock. If the difference is greater than 1 minute, I am alerted via
SMS.
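To make the check concrete, here is a minimal sketch of the comparison each slave performs. It is an illustration only, not the actual cron script: the function names and the way the timestamp is fetched are assumptions, and only the 1-minute threshold comes from the description above.

```python
from datetime import datetime, timedelta

# Threshold from the description above: alert when the replicated
# timestamp is more than 1 minute behind the local clock.
MAX_LAG = timedelta(minutes=1)


def replication_lag(heartbeat: datetime, now: datetime) -> timedelta:
    """Return how far the replicated heartbeat is behind the local clock."""
    return now - heartbeat


def should_alert(heartbeat: datetime, now: datetime,
                 max_lag: timedelta = MAX_LAG) -> bool:
    """True when the heartbeat is older than max_lag (hypothetical helper)."""
    return replication_lag(heartbeat, now) > max_lag
```

In the real job, `heartbeat` would be the value read from the replicated table on the local slave and `now` the slave's own time; a True result is what triggers the SMS.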
MySQL Version (on all 4 boxes) is: MySQL-5.0.84 (build 18) from
Percona [http://www.percona.com/mysql/5.0.84-b18]
MySQL-MMM Version (on 2 master boxes) is: 2.0.9
Perl Version: "This is perl, v5.8.8 built for i386-linux-thread-multi"
All boxes running CentOS-5.3(Final).
Sorry for the long-windedness, but some or all of the above may be
relevant.
Now the problem...
Seemingly at random, I'm seeing entries like the following in the
monitor log:
ERROR Check 'mysql' on 'db01' has failed for 14 seconds! Message: ERROR: Connect error (host = 10.0.1.11:3306, user = usr_mmm_monitor)!
Can't connect to MySQL server on '10.0.1.11' (4)
At the same time, the agent log shows things like:
FATAL Couldn't allow writes: ERROR: Can't connect to MySQL (host = 10.0.1.11:3306, user = usr_mmm_agent)!
Both of the above log entries appear (randomly?) throughout normal
operation and don't seem to affect performance or availability until
many of them occur in a row, which causes a fail-over.
As far as the monitor is concerned, mysql has been dead for 14 seconds
(in the example above).
With the default MySQL-MMM config, after 15 seconds of dead time, the
writer role is pushed to the other server (db02 in my case), which is
great but...
...during the 15 seconds of dead time above, the cron job has
connected to local mysql, updated the time in the check table and both
the "application servers" have replicated this value too, so mysql is
certainly not dead.
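To illustrate the behaviour I'm describing, here is a rough sketch of how a "dead for N seconds" rule turns a run of failed checks into a fail-over. This is not MMM's actual code; the function name, the input format and the exact comparison are assumptions made purely for illustration.

```python
from typing import Iterable, Optional, Tuple


def failover_time(checks: Iterable[Tuple[float, bool]],
                  dead_after: float = 15.0) -> Optional[float]:
    """Return the timestamp at which continuous failure first reaches
    dead_after seconds, or None if it never does.

    checks is a sequence of (timestamp_in_seconds, check_succeeded) pairs
    in time order. Any successful check resets the failure window.
    """
    first_failure = None
    for ts, ok in checks:
        if ok:
            first_failure = None  # host seen alive, start over
            continue
        if first_failure is None:
            first_failure = ts  # start of a new run of failures
        if ts - first_failure >= dead_after:
            return ts
    return None
```

Under this model, isolated failed checks are harmless because the next successful check resets the window, and raising `dead_after` from 15 to 30 only makes the window harder to fill; it does not stop the individual connect errors.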
For the time being, I've raised the required mysql dead time to 30
seconds, which works around the problem (although I still see entries
such as the above in the agent and monitor logs).
If anyone has any suggestions as to what I can check, I'd appreciate
it, and apologies for the potentially confusing long post.
Thanks in advance,
Paul