[sheepdog-ng] recovery stalled

Valerio Pachera

Feb 24, 2017, 8:16:00 AM
to sheepdog-ng
Ok, today there was an issue in my small datacenter that caused the switches to panic.
We are still looking for the cause.

Two of the three hosts are done with the recovery.
node id 0 is still recovering

dog node recovery
Nodes In Recovery:
  Id   Host:Port         V-Nodes       Zone       Progress
   1   192.168.6.111:7000    881 1862707392       98.5%

Progress is stuck at 98.5%.

In sheep.log I read:

Feb 24 14:09:22   WARN [gway 14573] wait_forward_request(393) poll timeout 1, disks of some nodes or network is busy. Going to poll-wait again
Feb 24 14:09:22   WARN [gway 15243] wait_forward_request(393) poll timeout 1, disks of some nodes or network is busy. Going to poll-wait again
Feb 24 14:09:22   WARN [gway 15249] wait_forward_request(393) poll timeout 1, disks of some nodes or network is busy. Going to poll-wait again
Feb 24 14:09:22   WARN [gway 14571] wait_forward_request(393) poll timeout 1, disks of some nodes or network is busy. Going to poll-wait again
Feb 24 14:09:25   WARN [gway 15248] wait_forward_request(393) poll timeout 1, disks of some nodes or network is busy. Going to poll-wait again

but I can tell you that there is NO network or disk activity (looking at atop).
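
In case it helps, an equivalent check without atop (assuming the sysstat tools are installed; intervals and counts are arbitrary) would be something like:

  iostat -x 2 5      # per-disk utilization, ~0 if the sheepdog disks are idle
  sar -n DEV 2 5     # per-NIC rx/tx rates, ~0 on the cluster NICs if the network is idle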

Any idea?
It's pretty urgent :-)

Valerio Pachera

Feb 24, 2017, 8:29:37 AM
to sheepdog-ng
root@sheep002:~# dog cluster info
Cluster status: running, auto-recovery enabled

Cluster created at Tue Mar  8 17:46:48 2016

Epoch Time           Version
2017-02-24 14:18:59     14 [192.168.6.80:7000(1), 192.168.6.111:7000(1), 192.168.6.112:7000(1)]
2017-02-24 14:18:54     13 [192.168.6.111:7000(1), 192.168.6.112:7000(1)]
2017-02-24 13:39:11     12 [192.168.6.80:7000(1), 192.168.6.111:7000(1), 192.168.6.112:7000(1)]
2017-02-24 13:16:09     11 [192.168.6.111:7000(1), 192.168.6.112:7000(1)]
2017-02-24 13:15:23     10 [192.168.6.112:7000(1)]
2017-02-24 12:53:27      9 [192.168.6.111:7000(1), 192.168.6.112:7000(1)]


Notice that at 14:18 someone very clever turned on a switch that I had shut off before, and then the cluster started recovering again!!!
It seems that node 192.168.6.111 is never going to finish the recovery.

I also noticed this in the sheep.log of 192.168.6.111:

...
Feb 24 13:15:25  ERROR [rw 26609] recover_object_work(549) failed to recover object 8a0b7000000c24
Feb 24 13:15:25  ERROR [rw 26608] recover_object_work(549) failed to recover object 8a0b7000000c28
Feb 24 13:15:25  ERROR [rw 26609] recover_object_work(549) failed to recover object 8a0b7000000c2a
Feb 24 13:15:25  ERROR [rw 26607] recover_object_work(549) failed to recover object 8a0b7000000a72
Feb 24 13:15:25  ERROR [rw 26608] recover_object_work(549) failed to recover object 8a0b7000000c33
Feb 24 13:15:25  ERROR [rw 26609] recover_object_work(549) failed to recover object 8a0b7000000dfa
Feb 24 13:15:25  ERROR [rw 26608] recover_object_work(549) failed to recover object 8a0b7000000dfc
Feb 24 13:15:25  ERROR [rw 26609] recover_object_work(549) failed to recover object 800c658b00000000
Feb 24 13:15:25  ERROR [rw 26607] recover_object_work(549) failed to recover object 8a0b7000000c34
Feb 24 13:15:25  ERROR [rw 26608] recover_object_work(549) failed to recover object 8080a86f00000000
Feb 24 13:15:25  ERROR [rw 26607] recover_object_work(549) failed to recover object 808a0b7000000000
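
A quick way to see how many distinct objects are affected (adjust the sheep.log path to wherever your store directory is):

  grep 'failed to recover object' /var/lib/sheepdog/sheep.log | awk '{print $NF}' | sort -u | wc -l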

Valerio Pachera

Feb 25, 2017, 4:24:07 AM
to sheepdog-ng
In the end I had to kill -9 the sheep daemon on 192.168.6.111.
After that, the other two nodes ran the recovery successfully.
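
Roughly the sequence, for the record (standard commands only; <PID> is whatever pgrep reports):

  # on 192.168.6.111
  pgrep -a sheep        # find the PID of the sheep daemon
  kill -9 <PID>
  # then, from one of the surviving nodes
  dog node recovery
  dog cluster info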

Believe it or not, a blonde girl connected two cables to the same switch (double loop!).
This small switch is connected to the main office switch, which in turn is connected to my datacenter switch.
All the switches went mad!

Sheepdog has its I/O NIC on a dedicated switch, so that one wasn't affected by the loop.
The other NIC couldn't reach the other nodes or ZooKeeper, which caused what you saw in the previous mail.
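
For reference, that separation is set up when sheep starts. Something like the line below (the addresses are hypothetical and the flags are from memory, so check sheep --help / the man page of your version):

  # 192.168.6.x = management/ZooKeeper NIC, 10.6.0.x = dedicated I/O NIC
  sheep -c zookeeper:192.168.6.80:2181,192.168.6.111:2181,192.168.6.112:2181 \
        -b 192.168.6.111 -i host=10.6.0.111 /var/lib/sheepdog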

I've been lucky that the VDIs didn't get corrupted!!!!
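
For anyone who wants to verify their own VDIs after something like this, a loop along these lines should do it (as far as I know, dog vdi check verifies, and can repair, the consistency of a VDI's objects; the awk field position may differ between dog versions, so compare with a plain dog vdi list first):

  for v in $(dog vdi list -r | awk '{print $2}'); do
      dog vdi check "$v"
  done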

Liu Yuan

Mar 2, 2017, 9:45:18 PM
to Valerio Pachera, sheepdog-ng
On Sat, Feb 25, 2017 at 10:24:06AM +0100, Valerio Pachera wrote:
> In the end I had to kill -9 the sheep daemon on 192.168.6.111.
> After that, the other two nodes ran the recovery successfully.
>
> Believe it or not, a blonde girl connected two cables to the same switch
> (double loop!).

Hmmm, she should buy you a dinner for that!

> This small switch is connected to the main office switch, which in turn is
> connected to my datacenter switch.
> All the switches went mad!
>
> Sheepdog has its I/O NIC on a dedicated switch, so that one wasn't
> affected by the loop.
> The other NIC couldn't reach the other nodes or ZooKeeper, which caused
> what you saw in the previous mail.
>
> I've been lucky that the VDIs didn't get corrupted!!!!

That's nice to hear.

Yuan