Issue when converting from plain to drbd on 2.15


Phil Regnauld

Dec 19, 2016, 9:30:40 AM
to gan...@googlegroups.com
On Ganeti 2.15 - master is Ubuntu 16.04 LTS, but the two
nodes referenced are running Debian 8.5 with Ganeti 2.15
from backports.

The nodes were recently installed, and I'm seeing a funky error when trying
to convert a plain instance to drbd.

- nodes are connected via a secondary 10G replication network
- the node config reflects this
- connectivity has been tested


On the master, I see:

root@ganeti5:/var/log/ganeti# gnt-instance modify -t drbd -n ganeti7 customer-server.xyz
Mon Dec 19 14:49:02 2016 Converting disk template from 'plain' to 'drbd'
Mon Dec 19 14:49:02 2016 Creating additional volumes...
Mon Dec 19 14:49:06 2016 Renaming original volumes...
Mon Dec 19 14:49:06 2016 Initializing DRBD devices...
Mon Dec 19 14:49:13 2016 - INFO: Waiting for instance customer-server.xyz to sync disks
Mon Dec 19 14:49:27 2016 - INFO: Instance customer-server.xyz's disks are in sync
Failure: command execution error:
There are some degraded disks for this instance, please cleanup manually

Now, this is a 400 GB instance, so I doubt the disks were replicated in 14 seconds :)
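(A quick manual sanity check here: /proc/drbd on each node shows per-device connection and disk state. A minimal parser for the classic DRBD 8.x layout could look like the sketch below; the helper name and the exact line format are my assumptions, not Ganeti code.)

```python
import re

def degraded_minors(proc_drbd_text):
    """Return DRBD minor numbers whose connection or disk state is not
    fully healthy, parsed from /proc/drbd (DRBD 8.x status layout)."""
    degraded = []
    for line in proc_drbd_text.splitlines():
        # Device status lines look like:
        #  0: cs:SyncTarget ro:Secondary/Primary ds:Inconsistent/UpToDate C r-----
        m = re.match(r"\s*(\d+):\s+cs:(\S+)\s+ro:(\S+)\s+ds:(\S+)", line)
        if not m:
            continue
        minor, cs, _ro, ds = m.groups()
        # Healthy means connected with both local and peer disks UpToDate
        if cs != "Connected" or ds != "UpToDate/UpToDate":
            degraded.append(int(minor))
    return degraded
```

Feeding it `open("/proc/drbd").read()` on both nodes would show whether any device is still in SyncTarget/Inconsistent state, i.e. whether a 400 GB sync really finished.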

The error looks incomplete...

In jobs.log:

2016-12-19 14:49:13,884: job-1060193 pid=32037 INFO Waiting for instance customer-server.xyz to sync disks
2016-12-19 14:49:14,323: job-1060193 pid=32037 INFO Degraded disks found, 10 retries left
2016-12-19 14:49:15,688: job-1060193 pid=32037 INFO Degraded disks found, 9 retries left
2016-12-19 14:49:17,026: job-1060193 pid=32037 INFO Degraded disks found, 8 retries left
2016-12-19 14:49:18,349: job-1060193 pid=32037 INFO Degraded disks found, 7 retries left
2016-12-19 14:49:19,667: job-1060193 pid=32037 INFO Degraded disks found, 6 retries left
2016-12-19 14:49:21,080: job-1060193 pid=32037 INFO Degraded disks found, 5 retries left
2016-12-19 14:49:22,420: job-1060193 pid=32037 INFO Degraded disks found, 4 retries left
2016-12-19 14:49:23,754: job-1060193 pid=32037 INFO Degraded disks found, 3 retries left
2016-12-19 14:49:25,082: job-1060193 pid=32037 INFO Degraded disks found, 2 retries left
2016-12-19 14:49:26,447: job-1060193 pid=32037 INFO Degraded disks found, 1 retries left
2016-12-19 14:49:27,791: job-1060193 pid=32037 INFO Instance customer-server.xyz's disks are in sync
2016-12-19 14:49:28,288: job-1060193 pid=32037 ERROR Op 1/1: Caught exception in INSTANCE_SET_PARAMS(customer-server.xyz)
Traceback (most recent call last):
  File "/usr/share/ganeti/2.15/ganeti/jqueue/__init__.py", line 933, in _ExecOpCodeUnlocked
    timeout=timeout)
  File "/usr/share/ganeti/2.15/ganeti/jqueue/__init__.py", line 1227, in _WrapExecOpCode
    return execop_fn(op, *args, **kwargs)
  File "/usr/share/ganeti/2.15/ganeti/mcpu.py", line 697, in ExecOpCode
    calc_timeout)
  File "/usr/share/ganeti/2.15/ganeti/mcpu.py", line 624, in _LockAndExecLU
    pending=pending)
  File "/usr/share/ganeti/2.15/ganeti/mcpu.py", line 624, in _LockAndExecLU
    pending=pending)
  File "/usr/share/ganeti/2.15/ganeti/mcpu.py", line 624, in _LockAndExecLU
    pending=pending)
  File "/usr/share/ganeti/2.15/ganeti/mcpu.py", line 624, in _LockAndExecLU
    pending=pending)
  File "/usr/share/ganeti/2.15/ganeti/mcpu.py", line 631, in _LockAndExecLU
    result = self._LockAndExecLU(lu, level + 1, calc_timeout, pending=pending)
  File "/usr/share/ganeti/2.15/ganeti/mcpu.py", line 538, in _LockAndExecLU
    result = self._ExecLU(lu)
  File "/usr/share/ganeti/2.15/ganeti/mcpu.py", line 496, in _ExecLU
    result = _ProcessResult(submit_mj_fn, lu.op, lu.Exec(self.Log))
  File "/usr/share/ganeti/2.15/ganeti/cmdlib/instance_set_params.py", line 1879, in Exec
    self._DISK_CONVERSIONS[mode](self, feedback_fn)
  File "/usr/share/ganeti/2.15/ganeti/cmdlib/instance_set_params.py", line 1466, in _ConvertPlainToDrbd
    raise errors.OpExecError("There are some degraded disks for"
OpExecError: There are some degraded disks for this instance, please cleanup manually
2016-12-19 14:49:30,938: job-1060193 pid=32037 INFO Finished job 1060193, status = error


I checked node-daemon.log on both nodes; apart from DRBD setup and some LVM commands,
none of which report failures, I'm not sure what the issue could be.
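(The pattern in jobs.log above — a fixed number of short retries, then a final check that appears to pass — can be sketched as the bounded polling loop below. This is my simplified reading of the observed log output, not Ganeti's actual code; the function and parameter names are made up.)

```python
import time

def wait_for_sync(get_degraded, retries=10, delay=1.3, feedback=print):
    """Poll a degraded-disk check a bounded number of times.

    get_degraded() returns True while any disk is degraded.
    Returns True only if the final check saw no degraded disks.
    """
    for left in range(retries, 0, -1):
        if not get_degraded():
            return True
        feedback("Degraded disks found, %d retries left" % left)
        time.sleep(delay)
    # One last check after the retries are exhausted -- which would
    # explain how the log can report "disks are in sync" immediately
    # after "1 retries left", seconds into a 400 GB sync.
    return not get_degraded()
```

With ~1.3 s between attempts this gives up after roughly 14 seconds, matching the timestamps in the log; it says nothing about how much data actually synced.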


Phil Regnauld

Dec 19, 2016, 9:33:00 AM
to gan...@googlegroups.com
Phil Regnauld (regnauld) writes:
>
> On the master, I see:

Additional info - gnt-cluster verify does whine a bit:

Mon Dec 19 15:20:53 2016 * Verifying configuration file consistency
Mon Dec 19 15:20:53 2016 - ERROR: node: There is more than one key for node ganeti6.customer in the public key file.
Mon Dec 19 15:20:53 2016 - ERROR: node: There is more than one key for node ganeti5.customer in the public key file.
Mon Dec 19 15:20:53 2016 - ERROR: node: There is more than one key for node ganeti7.customer in the public key file.

... ganeti5 happens to be the master, and ganeti6+7 are the two nodes I'm
trying to set up a drbd instance on.

Which public key file is being referenced here? /root/.ssh/known_hosts?

Phil Regnauld

Dec 19, 2016, 10:04:30 AM
to gan...@googlegroups.com
... and of course, rebooting the affected nodes fixes the problem...

Sorry about the noise!

Benjamin Redling

Dec 19, 2016, 10:22:39 AM
to gan...@googlegroups.com
Hi Phil,
Probably.

I found the error message only in
https://github.com/ganeti/ganeti/blob/stable-2.15/lib/backend.py
Lines 1012 to 1043 should cover all the relevant logic.
SSH_LOGIN_USER is a constant set to "root".

I don't understand how that
if len(my_keys) != 1:
changes after a reboot and your ERROR is suddenly gone.

Regards,
Benjamin
--
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
vox: +49 3641 9 44323 | fax: +49 3641 9 44321

Phil Regnauld

Dec 19, 2016, 2:54:45 PM
to gan...@googlegroups.com
Benjamin Redling (benjamin.rampe) writes:
> > Mon Dec 19 15:20:53 2016 * Verifying configuration file consistency
> > Mon Dec 19 15:20:53 2016 - ERROR: node: There is more than one key for node ganeti6.customer in the public key file.
> > Mon Dec 19 15:20:53 2016 - ERROR: node: There is more than one key for node ganeti5.customer in the public key file.
> > Mon Dec 19 15:20:53 2016 - ERROR: node: There is more than one key for node ganeti7.customer in the public key file.
> >
> I don't understand how that
> if len(my_keys) != 1:
> changes after a reboot and your ERROR is suddenly gone.

Sorry - those seem to be unrelated. The problem (or warning?) above
is still there, but I was wondering if the DRBD failure stemmed from
some SSH-induced inability to execute a remote command. It doesn't seem
to be the case.
