With MMM 2.2.1, I have run into a strange issue where checks seem to stop running after the monitor has been up for some time.
Here's an example of d01 being stuck in REPLICATION_DELAY even though it is actually fully caught up:
# mmm_control show
d01(10.0.0.200) master/REPLICATION_DELAY. Roles:
d02(10.0.0.201) master/ONLINE. Roles: writer(10.0.0.100)
On d01:
# mysql -uroot -e 'show slave status\G' | grep Seconds_Behind
Seconds_Behind_Master: 0
Here's what the checks show:
# mmm_control checks all all
d02 ping [last change: 2011/09/19 00:34:57] OK
d02 mysql [last change: 2011/09/19 23:43:30] OK
d02 rep_threads [last change: 2011/09/19 00:34:57] OK
d02 rep_backlog [last change: 2011/09/19 00:34:57] OK: Backlog is null
d01 ping [last change: 2011/09/19 00:34:57] OK
d01 mysql [last change: 2011/09/20 01:55:01] OK
d01 rep_threads [last change: 2011/09/19 00:34:57] OK
d01 rep_backlog [last change: 2011/09/19 00:34:57] ERROR: Backlog is too big
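The mismatch above (rep_backlog reporting ERROR while Seconds_Behind_Master is 0) could be spotted automatically. The following is only a rough sketch, not part of MMM: the function names are made up for illustration, the parsing assumes the `mmm_control checks all all` line format exactly as pasted above, and the Seconds_Behind_Master values are assumed to be collected separately on each host.

```python
import re

# Assumed line format, taken from the output pasted above:
#   <host> <check> [last change: <timestamp>] <STATUS>[: <message>]
CHECK_LINE = re.compile(
    r"^(?P<host>\S+)\s+(?P<check>\S+)\s+"
    r"\[last change: (?P<changed>[^\]]+)\]\s+(?P<status>[A-Z]+)"
    r"(?::\s*(?P<message>.*))?$"
)


def parse_checks(output):
    """Parse 'mmm_control checks all all' text into a list of dicts."""
    results = []
    for line in output.splitlines():
        match = CHECK_LINE.match(line.strip())
        if match:
            results.append(match.groupdict())
    return results


def suspect_backlog_hosts(checks_output, seconds_behind):
    """Return hosts whose rep_backlog check says ERROR although the
    separately measured Seconds_Behind_Master for that host is 0.

    seconds_behind is a hypothetical mapping such as {"d01": 0}.
    """
    return [
        check["host"]
        for check in parse_checks(checks_output)
        if check["check"] == "rep_backlog"
        and check["status"] == "ERROR"
        and seconds_behind.get(check["host"]) == 0
    ]
```

A host returned by `suspect_backlog_hosts` is only a candidate for a stuck checker; a genuine, short-lived delay could look the same for a moment, so it would make sense to require the mismatch to persist before acting on it.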
I was able to solve the issue by killing the rep_backlog checker:
# ps auxwwf | grep mmm
root 7765 0.0 0.2 166212 8892 ? S Sep19 0:00 mmm_mond
root 7767 0.2 1.1 703152 48160 ? Sl Sep19 12:12 \_ mmm_mond
root 7889 0.1 0.1 155096 7728 ? S Sep19 9:45 \_ perl /usr/libexec/mysql-mmm/monitor/checker ping_ip
root 7892 0.1 0.2 185580 9196 ? S Sep19 8:47 \_ perl /usr/libexec/mysql-mmm/monitor/checker mysql
root 7894 0.1 0.2 155096 8408 ? S Sep19 7:49 \_ perl /usr/libexec/mysql-mmm/monitor/checker ping
root 7896 0.1 3.9 338732 161732 ? S Sep19 8:14 \_ perl /usr/libexec/mysql-mmm/monitor/checker rep_backlog
root 7898 0.1 3.9 338732 161892 ? S Sep19 7:57 \_ perl /usr/libexec/mysql-mmm/monitor/checker rep_threads
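For anyone scripting the same step, the PID of a given checker can be pulled out of a process listing like the one above. This is a minimal sketch under assumptions: `checker_pids` is a hypothetical helper, the PID is taken to be the second column as in `ps aux` output, and the command line is assumed to end with `monitor/checker <check name>` as it does in this listing.

```python
import re

# Matches the tail of a checker command line as seen in the listing above,
# e.g. ".../monitor/checker rep_backlog"
CHECKER_RE = re.compile(r"monitor/checker\s+(\S+)\s*$")


def checker_pids(ps_output, check_name):
    """Return the PIDs of checker processes running the given check."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        match = CHECKER_RE.search(line)
        if (
            match
            and match.group(1) == check_name  # exact name, so "ping" != "ping_ip"
            and len(fields) > 1
            and fields[1].isdigit()  # second column of ps aux is the PID
        ):
            pids.append(int(fields[1]))
    return pids
```

With the listing above, `checker_pids(listing, "rep_backlog")` would pick out 7896. The resulting PID could then be passed to `os.kill` with a signal of your choice; which signal was used here is not recorded in this post.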
# mmm_control checks all all
d02 ping [last change: 2011/09/19 00:34:57] OK
d02 mysql [last change: 2011/09/19 23:43:30] OK
d02 rep_threads [last change: 2011/09/19 00:34:57] OK
d02 rep_backlog [last change: 2011/09/19 00:34:57] OK: Backlog is null
d01 ping [last change: 2011/09/19 00:34:57] OK
d01 mysql [last change: 2011/09/20 01:55:01] OK
d01 rep_threads [last change: 2011/09/19 00:34:57] OK
d01 rep_backlog [last change: 2011/09/22 15:55:31] OK: Backlog is null
# mmm_control show
d01(10.0.0.200) master/ONLINE. Roles:
d02(10.0.0.201) master/ONLINE. Roles: writer(10.0.0.100)
This shows that killing the checker did work: d01's rep_backlog check now reports OK and d01 is back to ONLINE.
Is this a known issue? Is there a workaround?
Thanks,
Eric