harep does not detect DEGRADED Disk status for instance!?


the2nd

Aug 22, 2014, 3:57:38 AM8/22/14
to gan...@googlegroups.com
hi,

Today is the first time I've tried harep, which comes with Ganeti 2.11.3:

root@vserver12:~# gnt-cluster version
Software version: 2.11.3
Internode protocol: 2110000
Configuration format: 2110000
OS api version: 20
Export interface: 0
VCS version: (ganeti) version v2.11.3

I've added a DRBD test instance and manually removed its LVs on the secondary node; the disks are detected as DEGRADED:


root@vserver12:~# gnt-instance info instance1 | grep DEGR
      on primary: /dev/drbd12 (147:12) in sync, status *DEGRADED*
      on primary: /dev/drbd13 (147:13) in sync, status *DEGRADED*
      on primary: /dev/drbd14 (147:14) in sync, status *DEGRADED*


I've also added the required tag as described in the manpage:

root@vserver12:~# gnt-instance list-tags instance1
ganeti:watcher:autorepair:fix-storage

But harep does not detect any problem with the instance:

root@vserver12:~# harep
Warning: cluster has inconsistent data:
  - node vserver12 is missing -19810 MB ram and 111 GB disk
  - node vserver6 is missing 259 MB ram and 100 GB disk

---------------------
Instance status count
---------------------
Healthy: 5

Am I doing something wrong?

regards

Petr Pudlák

Aug 22, 2014, 8:02:15 AM8/22/14
to gan...@googlegroups.com
Hi,

currently Harep only detects failures by looking at which nodes are marked as offline. However, it really seems reasonable to ask for checking the status of the disks attached to the instances as well. Please file a feature request as you see fit.
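To make the limitation concrete, here is a rough sketch (purely illustrative, not Harep's actual code) of what offline-node-only detection amounts to; the function and variable names are hypothetical:

```python
# Illustrative sketch of the detection policy described above: an instance
# is flagged for repair only if one of its nodes is marked offline; disk
# status (e.g. a DEGRADED DRBD mirror) is never consulted.

def needs_repair(instance_nodes, offline_nodes):
    """Return True if any node hosting the instance is marked offline."""
    return any(node in offline_nodes for node in instance_nodes)

# Example from this thread: instance1 runs on vserver12 (primary) and
# vserver6 (secondary). Neither node is offline, so even with DEGRADED
# disks the instance counts as healthy.
offline = set()  # no nodes manually marked offline
print(needs_repair(["vserver12", "vserver6"], offline))  # False -> "Healthy"
```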

  Thank you,
  Petr

candlerb

Aug 24, 2014, 5:12:50 PM8/24/14
to gan...@googlegroups.com
On Friday, 22 August 2014 13:02:15 UTC+1, Petr Pudlak wrote:
currently Harep only detects failures by looking at which nodes are marked as offline.

That's interesting. As I understand it, nodes which fail can only be marked offline manually. So this implies that harep will never make any repair without manual intervention, is that true?

If this is true, I'd say the documentation is highly misleading.

It talks only about checking the state of *instances*, not the state of *nodes*.

Furthermore, under this model, I can see the value of:
  • failover: allow instance reboot on the secondary
(i.e. if the node fails, restart the instance on the other node). But I can't see under what circumstance it would use
  • migrate: allow instance migration
How can a (live) migration be done if either the instance is down or one of the nodes is offline? Conversely, if the instance is working, why would it need migrating to "repair" it?
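For context, the harep manpage describes the autorepair tags as increasing permission levels, each level also allowing the repairs of the levels before it. A hedged sketch of that ladder (the ordering follows the manpage; the helper itself is illustrative, not harep's internals):

```python
# Illustrative sketch of the autorepair tag ladder. Each tag suffix of
# ganeti:watcher:autorepair:<suffix> permits its own repair type plus
# every weaker one before it in this list.

REPAIR_LEVELS = ["fix-storage", "migrate", "failover", "reinstall"]

def allowed_actions(tag_suffix):
    """Repair types permitted by a given autorepair tag suffix."""
    level = REPAIR_LEVELS.index(tag_suffix)
    return REPAIR_LEVELS[: level + 1]

print(allowed_actions("fix-storage"))  # ['fix-storage']
print(allowed_actions("failover"))     # ['fix-storage', 'migrate', 'failover']
```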

Regards,

Brian.

Petr Pudlák

Aug 25, 2014, 9:06:07 AM8/25/14
to gan...@googlegroups.com
Hi Brian,


On Sun, Aug 24, 2014 at 11:12 PM, candlerb <b.ca...@pobox.com> wrote:
On Friday, 22 August 2014 13:02:15 UTC+1, Petr Pudlak wrote:
currently Harep only detects failures by looking at which nodes are marked as offline.

That's interesting. As I understand it, nodes which fail can only be marked offline manually. So this implies that harep will never make any repair without manual intervention, is that true?

Yes, that is true. The problem is how to reliably detect that a node is offline. An automated test can't easily distinguish, for example, a network failure from a node failure. So the current decision is just to test whether a node is marked offline. But as mentioned, there are possible improvements in this area.
 

If this is true, I'd say the documentation is highly misleading.

It talks only about checking the state of *instances*, not the state of *nodes*.

Furthermore, under this model, I can see the value of:
  • failover: allow instance reboot on the secondary
(i.e. if the node fails, restart the instance on the other node). But I can't see under what circumstance it would use
  • migrate: allow instance migration
How can a (live) migration be done if either the instance is down or one of the nodes is offline? Conversely, if the instance is working, why would it need migrating to "repair" it?


That's a good point. By design, harep would migrate an instance away when a node is marked as drained, but as far as I know, this isn't implemented yet.
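A hypothetical sketch of the decision being described (not actual harep code; the drained-node branch is the unimplemented part): an offline node would trigger a failover, a drained node a live migration.

```python
# Hypothetical decision sketch for the behavior discussed above. An
# offline primary means the instance must be restarted on the secondary
# (failover); a drained node is still up, so a live migration is possible.

def repair_action(node_state, instance_running):
    """Pick a repair action based on the primary node's state."""
    if node_state == "offline":
        return "failover"   # primary is gone; restart on the secondary
    if node_state == "drained" and instance_running:
        return "migrate"    # node still up; migrate the instance away
    return None             # nothing to do

print(repair_action("offline", False))  # failover
print(repair_action("drained", True))   # migrate
```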

I'll add a feature request for this.

  Thanks for noticing,
  Petr

the2nd

Aug 25, 2014, 10:06:05 AM8/25/14
to gan...@googlegroups.com
Hi Petr,

The manpage states this:

       · fix-storage: allow disk replacement or fix the backend without affecting the instance itself (broken DRBD secondary)

I thought this means fixing broken DRBD setups of instances, for example after a "crash and restore" of the secondary node. Doesn't it?

Currently it needs a gnt-instance replace-disks after restoring the secondary node. Automating this task via harep would be great, but I'm unsure what will or should happen if harep is unable to fix the issue.

For example, it's also possible for the DRBD backend devices (LVs) to "crash" on the primary node (without affecting the running instance) because of a RAID failure. As long as the RAID/VG is in a failed state, I think it's not possible for harep to fix the issue. So it may be an idea to send the admin an email about the failed attempt to fix the issue? Or what is the intended behavior of harep in this case? Migrating the instance to the secondary node, to reduce the chance of an instance failure caused by a (storage) network failure on the primary node, may be an option.
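The fallback suggested here could look roughly like this (purely hypothetical, not harep behavior; function names are invented for illustration): if the backing storage itself is failed, the repair cannot proceed, so the tool at least reports the failure.

```python
# Hypothetical sketch of the "notify the admin when a repair is
# impossible" idea. If the volume group backing the DRBD devices is in a
# failed state, no disk replacement can succeed, so we only report it.

def attempt_fix_storage(vg_state, notify):
    """Try a storage repair; call notify() and bail out if impossible."""
    if vg_state != "ok":
        notify("fix-storage skipped: volume group state is %r" % vg_state)
        return False
    # ... actual disk replacement would go here ...
    return True

messages = []
attempt_fix_storage("failed", messages.append)
print(messages[0])  # the admin-facing notification
```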

The manpage is not clear about all of this.

regards
heiko

Petr Pudlák

Aug 25, 2014, 10:36:23 AM8/25/14
to gan...@googlegroups.com
FYI, I filed two bug/feature requests for Harep:

Let harep automatically detect degraded DRBD disks

Harep doesn't migrate instances from a drained node

  Best regards,
  Petr