I did a dist-upgrade on one of the secondary nodes from Debian Etch to
Debian Lenny.
Since then, the node is no longer reachable from the master node.
This is quite a disaster, because it's a production environment...
What can I do to solve my problem? I think it's an issue with Python:
Etch uses 2.4 and Lenny 2.5.
Is there a way to fix that without recreating the whole cluster?
Here's an error message (from /var/log/ganeti/errors):
2010-03-02 12:47:10,858 gnt-cluster verify: caller_connect: could not
connect to remote host sophie.oehwu.local, reason [Failure instance:
Traceback (failure with no frames):
twisted.internet.error.ConnectionLost: Connection to the other side
was lost in a non-clean fashion.
watcher.log:
| caller_connect: could not connect to remote host sophie.oehwu.local,
reason [Failure instance: Traceback (failure with no frames):
twisted.internet.error.ConnectionLost: Connection to the other side
was lost in a non-clean fashion: Connection lost.
| ]
| Unhandled error in Deferred:
| Traceback (most recent call last):
| File "/usr/lib/python2.4/site-packages/twisted/spread/pb.py", line
1555, in clientConnectionLost
| self._failAll(reason)
| File "/usr/lib/python2.4/site-packages/twisted/spread/pb.py", line
1542, in _failAll
| d.errback(reason)
| File "/usr/lib/python2.4/site-packages/twisted/internet/defer.py",
line 251, in errback
| self._startRunCallbacks(fail)
| File "/usr/lib/python2.4/site-packages/twisted/internet/defer.py",
line 294, in _startRunCallbacks
| self._runCallbacks()
| --- <exception caught here> ---
| File "/usr/lib/python2.4/site-packages/twisted/internet/defer.py",
line 307, in _runCallbacks
| self.result = callback(self.result, *args, **kw)
| File "/usr/local/lib/python2.4/site-packages/ganeti/rpc.py", line
133, in cb_err1
| self._check_end()
| File "/usr/local/lib/python2.4/site-packages/ganeti/rpc.py", line
97, in _check_end
| reactor.stop()
| File "/usr/lib/python2.4/site-packages/twisted/internet/base.py",
line 342, in stop
| raise RuntimeError, "can't stop reactor that isn't running"
| exceptions.RuntimeError: can't stop reactor that isn't running
My environment:
# gnt-cluster version
Software version: 1.2.4
Internode protocol: 12
Configuration format: 3
OS api version: 5
Export interface: 0
# uname -a
Linux susi.oehwu.local 2.6.18-6-xen-amd64 #1 SMP Thu Nov 5 04:10:03
UTC 2009 x86_64 GNU/Linux
I'm using Debian Etch on all nodes except the one that failed.
DRBD is 0.7.
I know everything is very old; that's why I wanted to update the
nodes ;)
Please help me!
Thank you in advance!
Cheers,
Thomas
As I understand it, the upgraded node was not the master node, right?
> This is quite a desaster, because it's a production environment...
>
> What can I do to solve my problem? I think it's an issue with python.
> Etch uses 2.4 and lenny 2.5.
> Is there a way to fix that without recreating the whole cluster?
First, backup /var/lib/ganeti on the master node, and keep that archive
safe.
Second, do not do any other cluster changes for the moment (instance
add, replace, etc.).
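The backup step above can be sketched as follows. This is a minimal example that uses a scratch directory with a dummy config.data so it is safe to dry-run anywhere; on the real master, DATA_DIR would be /var/lib/ganeti and the archive would contain the actual cluster configuration.

```shell
# Minimal sketch of the backup step. On the real master, DATA_DIR would be
# /var/lib/ganeti; a scratch copy with a dummy config.data is used here so
# the commands can be dry-run safely.
DATA_DIR="/tmp/demo-ganeti-data"
rm -rf "$DATA_DIR"
mkdir -p "$DATA_DIR"
echo "demo" > "$DATA_DIR/config.data"

ARCHIVE="/tmp/ganeti-config-backup.tar.gz"
tar czf "$ARCHIVE" -C "$(dirname "$DATA_DIR")" "$(basename "$DATA_DIR")"

# List the archive contents to confirm what was captured.
tar tzf "$ARCHIVE"
```

Keep the resulting archive off the cluster (copy it to a workstation, for example) so it survives any further damage.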
> Here's an error message (from /var/log/ganeti/errors):
> 2010-03-02 12:47:10,858 gnt-cluster verify: caller_connect: could not
> connect to remote host sophie.oehwu.local, reason [Failure instance:
> Traceback (failure with no frames):
> twisted.internet.error.ConnectionLost: Connection to the other side
> was lost in a non-clean fashion.
Has ganeti itself been upgraded on the target node?
Were you running from source or .deb packages?
What does "ps -ef | grep ganeti" on the lenny node say?
> My environment:
> # gnt-cluster version
> Software version: 1.2.4
> Internode protocol: 12
> Configuration format: 3
> OS api version: 5
> Export interface: 0
>
> # uname -a
> Linux susi.oehwu.local 2.6.18-6-xen-amd64 #1 SMP Thu Nov 5 04:10:03
> UTC 2009 x86_64 GNU/Linux
>
> I'm using Debian Etch on all nodes except the one that failed.
> DRBD is 0.7
>
> I know, everything is very old, that's why I wanted to update the
> nodes ;)
You have two options:
- 1) consider that node as "dead" (as if it had a hardware failure),
failover the instances, reinstall it with etch, keep ganeti 1.2.4
- 2) upgrade all nodes to lenny + ganeti 1.2.6 from Debian, but make
sure to remove any from-source installations
Python 2.4 versus 2.5 should not make a big difference. It's rather how
ganeti is installed.
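One quick way to check how Ganeti is installed on a node is to look for its binaries under both prefixes; a small sketch (the two prefixes are just the conventional from-source and package locations, not something verified on this cluster):

```shell
# Look for Ganeti binaries under both the from-source prefix (/usr/local)
# and the package prefix (/usr); files in both places at once usually mean
# a leftover from-source install shadowing the .deb one.
for prefix in /usr/local /usr; do
    echo "== $prefix =="
    ls "$prefix"/sbin/gnt-* "$prefix"/sbin/ganeti-noded 2>/dev/null \
        || echo "   no ganeti binaries here"
done
```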
iustin
Thanks for the quick reply!
On 2 Mar., 14:45, Iustin Pop <ius...@google.com> wrote:
> On Tue, Mar 02, 2010 at 05:29:40AM -0800, Thomas Rieschl wrote:
> > Hello!
>
> > I did a dist-upgrade on one of the secondary nodes from Debian Etch to
> > Debian lenny.
> > Since then the node is not reachable anymore by the master node.
>
> As I understand, the upgraded node was not the master node, right?
Yes, that's right. I "tried" the upgrade on a minor node with just two
primary instances.
>
> > This is quite a desaster, because it's a production environment...
>
> > What can I do to solve my problem? I think it's an issue with python.
> > Etch uses 2.4 and lenny 2.5.
> > Is there a way to fix that without recreating the whole cluster?
>
> First, backup /var/lib/ganeti on the master node, and keep that archive
> safe.
Did it (apart from that, I also have daily backups).
>
> Second, do not do any other cluster changes for the moment (instance
> add, replace, etc.).
>
> > Here's an error message (from /var/log/ganeti/errors):
> > 2010-03-02 12:47:10,858 gnt-cluster verify: caller_connect: could not
> > connect to remote host sophie.oehwu.local, reason [Failure instance:
> > Traceback (failure with no frames):
> > twisted.internet.error.ConnectionLost: Connection to the other side
> > was lost in a non-clean fashion.
>
> Has ganeti itself been upgraded on the target node?
>
Not after the dist-upgrade. But then, Ganeti didn't work at all
(I couldn't start the node daemon), so I downloaded the Ganeti 1.2.4
source, compiled it and installed it again. Then ganeti-noded worked.
> Were you running from source or .deb packages?
>
From source. There weren't any debs at the time I created the cluster.
> What does "ps -ef | grep ganeti" on the lenny node say?
# ps -ef | grep ganeti
root 3017 1 0 15:51 ? 00:00:00 /usr/bin/python /usr/
local/sbin/ganeti-noded
root 3021 2753 0 15:51 pts/0 00:00:00 grep ganeti
>
>
>
> > My environment:
> > # gnt-cluster version
> > Software version: 1.2.4
> > Internode protocol: 12
> > Configuration format: 3
> > OS api version: 5
> > Export interface: 0
>
> > # uname -a
> > Linux susi.oehwu.local 2.6.18-6-xen-amd64 #1 SMP Thu Nov 5 04:10:03
> > UTC 2009 x86_64 GNU/Linux
>
> > I'm using Debian Etch on all nodes except the one that failed.
> > DRBD is 0.7
>
> > I know, everything is very old, that's why I wanted to update the
> > nodes ;)
>
> You have two options:
>
> - 1) consider that node as "dead" (as if it had a hardware failure),
> failover the instances, reinstall it with etch, keep ganeti 1.2.4
> - 2) upgrade all nodes to lenny + ganeti 1.2.6 from Debian, but make
> sure to remove any from-source installations
Regarding 1): I tried that, but surprisingly it doesn't work either...
I tried three different approaches.
1st approach: with ganeti-noded running on the Lenny node ("sophie"),
the failover command didn't output anything until I stopped it with
CTRL-C. Then I got the following error:
# gnt-instance failover -f --ignore-consistency lilu.oehwu.local
caller_connect: could not connect to remote host sophie.oehwu.local,
reason [Failure instance: Traceback (failure with no frames):
twisted.internet.error.ConnectionLost: Connection to the other side
could not connect to node sophie.oehwu.local
Traceback (most recent call last):
File "/usr/local/sbin/gnt-instance", line 1048, in ?
override={"tag_type": constants.TAG_INSTANCE}))
File "/usr/local/lib/python2.4/site-packages/ganeti/cli.py", line
497, in GenericMain
result = func(options, args)
File "/usr/local/sbin/gnt-instance", line 545, in FailoverInstance
SubmitOpCode(op)
File "/usr/local/lib/python2.4/site-packages/ganeti/cli.py", line
389, in SubmitOpCode
return proc.ExecOpCode(op)
File "/usr/local/lib/python2.4/site-packages/ganeti/mcpu.py", line
128, in ExecOpCode
lu.CheckPrereq()
File "/usr/local/lib/python2.4/site-packages/ganeti/cmdlib.py", line
2832, in CheckPrereq
instance.name, instance.memory)
File "/usr/local/lib/python2.4/site-packages/ganeti/cmdlib.py", line
2256, in _CheckNodeFreeMemory
free_mem = nodeinfo[node].get('memory_free')
AttributeError: 'bool' object has no attribute 'get'
2nd approach: with ganeti-noded stopped:
The failover command immediately threw this error:
# gnt-instance failover -f --ignore-consistency lilu.oehwu.local
caller_connect: could not connect to remote host sophie.oehwu.local,
reason [Failure instance: Traceback (failure with no frames):
twisted.internet.error.ConnectionRefusedError: Connection was refused
by other side: 111: Connection refused.
]
could not connect to node sophie.oehwu.local
Traceback (most recent call last):
File "/usr/local/sbin/gnt-instance", line 1048, in ?
override={"tag_type": constants.TAG_INSTANCE}))
File "/usr/local/lib/python2.4/site-packages/ganeti/cli.py", line
497, in GenericMain
result = func(options, args)
File "/usr/local/sbin/gnt-instance", line 545, in FailoverInstance
SubmitOpCode(op)
File "/usr/local/lib/python2.4/site-packages/ganeti/cli.py", line
389, in SubmitOpCode
return proc.ExecOpCode(op)
File "/usr/local/lib/python2.4/site-packages/ganeti/mcpu.py", line
128, in ExecOpCode
lu.CheckPrereq()
File "/usr/local/lib/python2.4/site-packages/ganeti/cmdlib.py", line
2832, in CheckPrereq
instance.name, instance.memory)
File "/usr/local/lib/python2.4/site-packages/ganeti/cmdlib.py", line
2256, in _CheckNodeFreeMemory
free_mem = nodeinfo[node].get('memory_free')
AttributeError: 'bool' object has no attribute 'get'
... connection refused... quite obvious.
3rd approach: the "node-is-dead" approach with the node shut down; I
immediately got this error:
# gnt-instance failover --ignore-consistency lilu.oehwu.local
Failover will happen to image lilu.oehwu.local. This requires a
shutdown of the instance. Continue?
y/[n]/?: y
caller_connect: could not connect to remote host sophie.oehwu.local,
reason [Failure instance: Traceback (failure with no frames):
twisted.internet.error.ConnectError: An error occurred while
connecting: 113: No route to host.
]
could not connect to node sophie.oehwu.local
Traceback (most recent call last):
File "/usr/local/sbin/gnt-instance", line 1048, in ?
override={"tag_type": constants.TAG_INSTANCE}))
File "/usr/local/lib/python2.4/site-packages/ganeti/cli.py", line
497, in GenericMain
result = func(options, args)
File "/usr/local/sbin/gnt-instance", line 545, in FailoverInstance
SubmitOpCode(op)
File "/usr/local/lib/python2.4/site-packages/ganeti/cli.py", line
389, in SubmitOpCode
return proc.ExecOpCode(op)
File "/usr/local/lib/python2.4/site-packages/ganeti/mcpu.py", line
128, in ExecOpCode
lu.CheckPrereq()
File "/usr/local/lib/python2.4/site-packages/ganeti/cmdlib.py", line
2832, in CheckPrereq
instance.name, instance.memory)
File "/usr/local/lib/python2.4/site-packages/ganeti/cmdlib.py", line
2256, in _CheckNodeFreeMemory
free_mem = nodeinfo[node].get('memory_free')
AttributeError: 'bool' object has no attribute 'get'
I don't quite understand that behaviour, because it renders failover
and the whole Ganeti setup useless. What do I need a mirrored server
for if I can't fail it over?
I already used the failover command on this cluster some time ago, and
back then it worked as expected.
BTW: I removed the ganeti-watcher entry from /etc/cron.d/ because it
started issuing gnt-instance commands, and failover just returned a
"waiting for cmd lock" or something.
>
> Python 2.4 versus 2.5 should not make a big difference. It's rather how
> ganeti is installed.
Hm, strange. What else could be the problem? I use the same kernel on
the Lenny node and on the Etch nodes.
So is the only remaining option recreating the whole cluster?
>
> iustin
Thank you for your help, I really appreciate that!
Cheers, Thomas
That is good.
> > > This is quite a desaster, because it's a production environment...
> >
> > > What can I do to solve my problem? I think it's an issue with python.
> > > Etch uses 2.4 and lenny 2.5.
> > > Is there a way to fix that without recreating the whole cluster?
> >
> > First, backup /var/lib/ganeti on the master node, and keep that archive
> > safe.
>
> did it (despite that, I have daily backups)
That is good to hear :)
> > Second, do not do any other cluster changes for the moment (instance
> > add, replace, etc.).
> >
> > > Here's an error message (from /var/log/ganeti/errors):
> > > 2010-03-02 12:47:10,858 gnt-cluster verify: caller_connect: could not
> > > connect to remote host sophie.oehwu.local, reason [Failure instance:
> > > Traceback (failure with no frames):
> > > twisted.internet.error.ConnectionLost: Connection to the other side
> > > was lost in a non-clean fashion.
> >
> > Has ganeti itself been upgraded on the target node?
> >
>
> Not after the the dist-upgrade. But then, ganeti didn't work at all.
> (I couldn't start the node-daemon) So I downloaded the ganeti 1.2.4
> source and compiled and installed it again. Then ganeti-noded worked.
Ok…
[snip]
> I don't quite understand that behaviour because the failover and the
> whole ganeti thing renders useless. What do I need a mirrored server
> if I can't fail it over?
> I already used the failover command on that cluster some time ago, but
> then it worked as supposed.
You're running an extremely old Ganeti version that has known bugs.
Upgrade to 1.2.9 if you want to remain on the 1.2 branch, but it's
really recommended to upgrade to 2.0 ASAP.
> BTW: I removed the ganeti-watcher from /etc/cron.d/ because it started
> issuing gnt-instance commands and failover just returned a "waiting
> for cmd lock" or something.
>
> >
> > Python 2.4 versus 2.5 should not make a big difference. It's rather how
> > ganeti is installed.
>
> hm, strange. What else could be the problem? I use the same kernel on
> the lenny node and on the etch nodes.
> So the only remaining thing is recreating the whole cluster?
No, no. I'm not sure I understand: what is the status of the sophie node
now? What do you see in /var/log/ganeti/node-daemon.log?
Basically, if you have the same ganeti version, with the prerequisites
installed (twisted, etc.), and the SSL certificates good, then you
should be able to connect.
Try to copy back /var/lib/ganeti/* from the master node to that node.
Again, only the logical ganeti configuration is broken, you don't need
to recreate the cluster.
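The copy-back step above can be sketched like this. On the real cluster it would be a single copy from the master (something like `scp -r /var/lib/ganeti/* root@sophie.oehwu.local:/var/lib/ganeti/`), so the version below uses local scratch directories and dummy file names purely to make the steps dry-runnable; config.data and server.pem are stand-ins for whatever the directory actually contains.

```shell
# On the real master this would be, roughly:
#   scp -r /var/lib/ganeti/* root@sophie.oehwu.local:/var/lib/ganeti/
# Demonstrated here with scratch directories and dummy files.
MASTER_DIR="/tmp/demo-master-ganeti"   # stands in for /var/lib/ganeti (master)
NODE_DIR="/tmp/demo-node-ganeti"       # stands in for /var/lib/ganeti (node)
rm -rf "$MASTER_DIR" "$NODE_DIR"
mkdir -p "$MASTER_DIR" "$NODE_DIR"
echo "config" > "$MASTER_DIR/config.data"
echo "cert"   > "$MASTER_DIR/server.pem"

cp -a "$MASTER_DIR/." "$NODE_DIR/"     # copy everything, preserving modes

# No output from diff means master and node now agree.
diff -r "$MASTER_DIR" "$NODE_DIR" && echo "directories match"
```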
iustin
Okay, I managed to fail over the instances. The failover did actually
work, although the command threw those error messages. I could start
the instances with the "gnt-instance startup --force [instance]"
command...
Hm, I wanted to do that. But as you see, I ran into problems ;)
BTW, a bit off-topic: should I use the 1.2.6 from the Debian Lenny repo
or 2.x from source?
>
> > BTW: I removed the ganeti-watcher from /etc/cron.d/ because it started
> > issuing gnt-instance commands and failover just returned a "waiting
> > for cmd lock" or something.
>
> > > Python 2.4 versus 2.5 should not make a big difference. It's rather how
> > > ganeti is installed.
>
> > hm, strange. What else could be the problem? I use the same kernel on
> > the lenny node and on the etch nodes.
> > So the only remaining thing is recreating the whole cluster?
>
> No, no. I'm not sure I understand, what is the status of the sophie node
> now? What do you see in /var/log/ganeti/node-daemon.log?
Okay, there are problems. When I restart the Ganeti daemon
(/etc/init.d/ganeti restart), I get the following error message in
node-daemon.log:
Traceback (most recent call last):
File "/usr/local/sbin/ganeti-noded", line 587, in <module>
main()
File "/usr/local/sbin/ganeti-noded", line 583, in main
reactor.run()
File "/usr/lib/python2.5/site-packages/twisted/internet/base.py",
line 1048, in run
self.mainLoop()
--- <exception caught here> ---
File "/usr/lib/python2.5/site-packages/twisted/internet/base.py",
line 1060, in mainLoop
self.doIteration(t)
File "/usr/lib/python2.5/site-packages/twisted/internet/
selectreactor.py", line 126, in doSelect
self._preenDescriptors()
File "/usr/lib/python2.5/site-packages/twisted/internet/
selectreactor.py", line 88, in _preenDescriptors
self._disconnectSelectable(selectable, e, False)
File "/usr/lib/python2.5/site-packages/twisted/internet/
posixbase.py", line 196, in _disconnectSelectable
selectable.connectionLost(failure.Failure(why))
File "/usr/lib/python2.5/site-packages/twisted/internet/
posixbase.py", line 150, in connectionLost
os.close(fd)
exceptions.OSError: [Errno 9] Bad file descriptor
I have no clue what that means...
But I think there could be a problem with Xen. When I try to restart
the Xen daemon I get the following error:
# /etc/init.d/xend restart
Restarting XEN control daemon: xendTraceback (most recent call last):
File "/usr/lib/xen-3.0.3-1/bin/xend", line 40, in <module>
from xen.xend.server import SrvDaemon
File "/usr/lib/xen-3.0.3-1/lib/python/xen/xend/server/SrvDaemon.py",
line 17, in <module>
import xen.lowlevel.xc
ImportError: /usr/lib/xen-3.0.3-1/bin/../lib/python/xen/lowlevel/
xc.so: undefined symbol: Py_InitModule4
Traceback (most recent call last):
File "/usr/lib/xen-3.0.3-1/bin/xend", line 40, in <module>
from xen.xend.server import SrvDaemon
File "/usr/lib/xen-3.0.3-1/lib/python/xen/xend/server/SrvDaemon.py",
line 17, in <module>
import xen.lowlevel.xc
ImportError: /usr/lib/xen-3.0.3-1/bin/../lib/python/xen/lowlevel/
xc.so: undefined symbol: Py_InitModule4
Traceback (most recent call last):
File "/usr/lib/xen-3.0.3-1/bin/xend", line 40, in <module>
from xen.xend.server import SrvDaemon
File "/usr/lib/xen-3.0.3-1/lib/python/xen/xend/server/SrvDaemon.py",
line 17, in <module>
import xen.lowlevel.xc
ImportError: /usr/lib/xen-3.0.3-1/bin/../lib/python/xen/lowlevel/
xc.so: undefined symbol: Py_InitModule4
failed!
Should I try to update to the xen-linux-system-2.6.26-2-xen-amd64?
(http://packages.debian.org/lenny/xen-linux-system-2.6.26-2-xen-amd64)
>
> Basically, if you have the same ganeti version, with the prerequisites
> installed (twisted, etc.), and the SSL certificates good, then you
> should be able to connect.
>
> Try to copy back /var/lib/ganeti/* from the master node to that node.
> Again, only the logical ganeti configuration is broken, you don't need
> to recreate the cluster.
That didn't work either.
Any other suggestions, apart from removing the node and reinstalling
the whole server?
>
> iustin
thanks again!
cheers,
Thomas
This is good; at least you now have the instances up.
I just uploaded 2.x to the Lenny backports, but the packages are still
in the backports NEW queue. If you wait a couple more days, you should
be able to use the packages from backports.
Note that upgrade from 1.2 to 2.0 needs downtime, see
http://code.google.com/p/ganeti/wiki/UpgradeNotes.
Ah, I know what this is. The Twisted version in Lenny is newer, and you
need at least Ganeti 1.2.8 for it to work with Lenny's Twisted.
This one I'm not sure about; it's a Xen/Python issue rather than a
Ganeti one. I'd say that testing with that package is worth trying.
> > Basically, if you have the same ganeti version, with the prerequisites
> > installed (twisted, etc.), and the SSL certificates good, then you
> > should be able to connect.
> >
> > Try to copy back /var/lib/ganeti/* from the master node to that node.
> > Again, only the logical ganeti configuration is broken, you don't need
> > to recreate the cluster.
>
> That didn't work either.
> any other suggesstions except to remove the node an reinstall the
> whole server?
As I wrote above, for Lenny you need Ganeti either 1.2.8 from source or
1.2.6 from Lenny packages (which has the correct fix). Or the 2.0.x
package which will be in backports hopefully soon.
… not sure what to recommend. I would reinstall with Etch for now, and
then plan a downtime for the entire upgrade.
iustin
Yes!! Now it works! :)
I updated to the new Xen kernel and booted the system with it. Then Xen
worked.
After that I installed Ganeti 1.2.9 from source. Now the gnt-cluster
verify command works again, but with a version mismatch warning:
* Verifying node sophie.oehwu.local
- ERROR: sw version mismatch: master 12, node(sophie.oehwu.local) 15
Just another question:
The proposed upgrade sequence is: install Ganeti 1.2.9 on all nodes,
upgrade to Lenny, update the Xen kernel. Then I should have a running
Lenny cluster.
But: is there a recommended way to upgrade from DRBD 0.7 (installed as
described in the Ganeti 1.2.4 install doc) to drbd8 (Lenny repository),
and a recommended way to upgrade from Ganeti 1.2.9 source to Ganeti 2.x
from the lenny-backports repo? Do I have to remove the Ganeti 1.2.9
files? How? Is there an "uninstall" script? UpgradeNotes just talks
about 1.2 source to 2.0 source, not the Debian repository.
Thank you very much for your help! Without you I would probably be
recreating the cluster right now ;)
cheers,
Thomas
> Note that upgrade from 1.2 to 2.0 needs downtime, see
> http://code.google.com/p/ganeti/wiki/UpgradeNotes.
As expected, due to the software difference. Some functionality might
work, some might not.
> just another question:
> the proposed upgrade sequence is: install ganeti 1.2.9 on all nodes,
> upgrade to lenny, update xen kernel. Then I should have a running
> lenny cluster.
Yes.
> but: is there a recommended way to upgrade from drbd 0.7 (installed as
> described in the ganeti 1.2.4 install doc) to drbd8 (lenny repository)
In the source archive, in the tools directory, there's a script to do
this (drbd8-upgrade); please read the README.drbd8-upgrade file
carefully. This should work without problems.
> and a recommened way to upgrade from ganeti 1.2.9 source to ganeti 2.x
> from lenny-backports rep? Do I have to remove ganeti 1.2.9 files? How?
> Is there a "uninstall" script? UpgradeNotes justs talks about 1.2
> source to 2.0 source, not debian repository.
There's no uninstall script, but I can tell you the files that need
removal:
/usr/sbin/gnt-* (or /usr/local/sbin/...)
/usr/lib/python2.4/site-packages/ganeti/*
The data directory (/var/lib/ganeti) must *not* be removed, of course :)
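That removal list can be turned into a dry-run loop; a sketch only, with the globs following the paths listed above (plus ganeti-noded, which the thread shows living next to the gnt-* scripts). Note the echo guard: nothing is deleted until you remove it, and /var/lib/ganeti is deliberately left out.

```shell
# Dry-run cleanup of a from-source Ganeti 1.2 install, following the paths
# listed above. Remove the leading "echo" to actually delete; the data
# directory /var/lib/ganeti is deliberately NOT touched.
for path in /usr/sbin/gnt-* /usr/local/sbin/gnt-* \
            /usr/sbin/ganeti-noded /usr/local/sbin/ganeti-noded \
            /usr/lib/python2.4/site-packages/ganeti \
            /usr/local/lib/python2.4/site-packages/ganeti; do
    [ -e "$path" ] && echo rm -rf "$path"
done
true  # finding nothing on a clean machine is fine
```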
> Thank you very much for your help! Without you I would probably be
> sitting now recreating the cluster ;)
Glad to help!
iustin
I've got another problem...
It seems that Debian Lenny with the new 2.6.26-2-xen kernel doesn't
support DRBD 0.7 anymore.
When I try to build the DRBD kernel module with m-a I get the
following error message:
# cat /var/cache/modass/drbd0.7-module-source.buildlog.2.6.26-2-xen-
amd64.1268065746
dpatch deapply-all
rm -rf patch-stamp patch-stampT debian/patched
dh_clean
/usr/bin/make -C drbd clean
make[1]: Entering directory `/usr/src/modules/drbd/drbd'
rm -rf .tmp_versions
rm -f *.[oas] *.ko .*.cmd .*.d .*.tmp
*.mod.c .*.flags .depend .kernel*
make[1]: Leaving directory `/usr/src/modules/drbd/drbd'
/usr/bin/make -f debian/rules kdist_clean kdist_config binary-modules
make[1]: Entering directory `/usr/src/modules/drbd'
dpatch deapply-all
rm -rf patch-stamp patch-stampT debian/patched
dh_clean
/usr/bin/make -C drbd clean
make[2]: Entering directory `/usr/src/modules/drbd/drbd'
rm -rf .tmp_versions
rm -f *.[oas] *.ko .*.cmd .*.d .*.tmp
*.mod.c .*.flags .depend .kernel*
make[2]: Leaving directory `/usr/src/modules/drbd/drbd'
for templ in /usr/src/modules/drbd/debian/drbd0.7-module-
_KVERS_.postinst /usr/src/modules/drbd/debian/drbd0.7-module-
_KVERS_.postinst.backup /usr/src/modules/drbd/debian/drbd0.7-module-
_KVERS_.postinst.modules.in; do \
cp $templ `echo $templ | sed -e 's/_KVERS_/2.6.26-2-xen-amd64/
g'` ; \
done
for templ in `ls debian/*.modules.in` ; do \
test -e ${templ%.modules.in}.backup || cp ${templ%.modules.in} $
{templ%.modules.in}.backup 2>/dev/null || true; \
sed -e 's/##KVERS##/2.6.26-2-xen-amd64/g ;s/#KVERS#/2.6.26-2-xen-
amd64/g ; s/_KVERS_/2.6.26-2-xen-amd64/g ; s/##KDREV##/2.6.26-21lenny3/
g ; s/#KDREV#/2.6.26-21lenny3/g ; s/_KDREV_/2.6.26-21lenny3/g ' <
$templ > ${templ%.modules.in}; \
done
dh_testdir
dh_testroot
dh_clean -k
/usr/bin/make -C drbd KERNEL_SOURCES=/lib/modules/2.6.26-2-xen-amd64/
build MODVERSIONS=detect KERNEL=linux-2.6.26-2-xen-amd64 KDIR=/lib/
modules/2.6.26-2-xen-amd64/build ARCH_UM=
make[2]: Entering directory `/usr/src/modules/drbd/drbd'
Calling toplevel makefile of kernel source tree, which I believe
is in
KDIR=/lib/modules/2.6.26-2-xen-amd64/build
test -f ../scripts/adjust_drbd_config_h.sh && \
KDIR=/lib/modules/2.6.26-2-xen-amd64/build /bin/bash ../
scripts/adjust_drbd_config_h.sh
Using unmodified drbd_config.h
/usr/bin/make -C /lib/modules/2.6.26-2-xen-amd64/build SUBDIRS=/usr/
src/modules/drbd/drbd modules
make[3]: Entering directory `/usr/src/linux-headers-2.6.26-2-xen-
amd64'
scripts/Makefile.build:46: *** CFLAGS was changed in "/usr/src/modules/
drbd/drbd/Makefile". Fix it to use EXTRA_CFLAGS. Stop.
make[3]: *** [_module_/usr/src/modules/drbd/drbd] Error 2
make[3]: Leaving directory `/usr/src/linux-headers-2.6.26-2-xen-amd64'
make[2]: *** [kbuild] Error 2
make[2]: Leaving directory `/usr/src/modules/drbd/drbd'
make[1]: *** [binary-modules] Error 2
make[1]: Leaving directory `/usr/src/modules/drbd'
make: *** [kdist_build] Error 2
Setting KBUILD_NOPEDANTIC=1 doesn't work either:
[...]
Using unmodified drbd_config.h
/usr/bin/make -C /lib/modules/2.6.26-2-xen-amd64/build SUBDIRS=/usr/
src/modules/drbd/drbd modules
make[3]: Entering directory `/usr/src/linux-headers-2.6.26-2-xen-
amd64'
CC [M] /usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.o
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c:1:24: error:
linux/drbd.h: No such file or directory
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c: In function
‘__assert_sizeof_disk_config_eq_24’:
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c:10: error:
invalid application of ‘sizeof’ to incomplete type ‘struct
disk_config’
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c:10: error:
duplicate case value
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c:10: error:
previously used here
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c: In function
‘__assert_sizeof_disk_config_modulo_8_eq_0’:
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c:10: error:
invalid application of ‘sizeof’ to incomplete type ‘struct
disk_config’
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c: In function
‘__assert_sizeof_net_config_eq_304’:
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c:11: error:
invalid application of ‘sizeof’ to incomplete type ‘struct net_config’
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c:11: error:
duplicate case value
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c:11: error:
previously used here
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c: In function
‘__assert_sizeof_net_config_modulo_8_eq_0’:
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c:11: error:
invalid application of ‘sizeof’ to incomplete type ‘struct net_config’
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c: In function
‘__assert_sizeof_syncer_config_eq_24’:
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c:12: error:
invalid application of ‘sizeof’ to incomplete type ‘struct
syncer_config’
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c:12: error:
duplicate case value
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c:12: error:
previously used here
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c: In function
‘__assert_sizeof_syncer_config_modulo_8_eq_0’:
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c:12: error:
invalid application of ‘sizeof’ to incomplete type ‘struct
syncer_config’
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c: In function
‘__assert_sizeof_ioctl_disk_config_eq_32’:
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c:13: error:
invalid application of ‘sizeof’ to incomplete type ‘struct
ioctl_disk_config’
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c:13: error:
duplicate case value
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c:13: error:
previously used here
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c: In function
‘__assert_sizeof_ioctl_disk_config_modulo_8_eq_0’:
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c:13: error:
invalid application of ‘sizeof’ to incomplete type ‘struct
ioctl_disk_config’
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c: In function
‘__assert_sizeof_ioctl_net_config_eq_312’:
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c:14: error:
invalid application of ‘sizeof’ to incomplete type ‘struct
ioctl_net_config’
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c:14: error:
duplicate case value
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c:14: error:
previously used here
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c: In function
‘__assert_sizeof_ioctl_net_config_modulo_8_eq_0’:
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c:14: error:
invalid application of ‘sizeof’ to incomplete type ‘struct
ioctl_net_config’
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c: In function
‘__assert_sizeof_ioctl_syncer_config_eq_32’:
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c:15: error:
invalid application of ‘sizeof’ to incomplete type ‘struct
ioctl_syncer_config’
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c:15: error:
duplicate case value
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c:15: error:
previously used here
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c: In function
‘__assert_sizeof_ioctl_syncer_config_modulo_8_eq_0’:
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c:15: error:
invalid application of ‘sizeof’ to incomplete type ‘struct
ioctl_syncer_config’
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c: In function
‘__assert_sizeof_ioctl_wait_eq_16’:
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c:16: error:
invalid application of ‘sizeof’ to incomplete type ‘struct ioctl_wait’
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c:16: error:
duplicate case value
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c:16: error:
previously used here
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c: In function
‘__assert_sizeof_ioctl_wait_modulo_8_eq_0’:
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c:16: error:
invalid application of ‘sizeof’ to incomplete type ‘struct ioctl_wait’
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c: In function
‘__assert_sizeof_ioctl_get_config_eq_440’:
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c:17: error:
invalid application of ‘sizeof’ to incomplete type ‘struct
ioctl_get_config’
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c:17: error:
duplicate case value
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c:17: error:
previously used here
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c: In function
‘__assert_sizeof_ioctl_get_config_modulo_8_eq_0’:
/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.c:17: error:
invalid application of ‘sizeof’ to incomplete type ‘struct
ioctl_get_config’
make[4]: *** [/usr/src/modules/drbd/drbd/drbd_sizeof_sanity_check.o]
Error 1
make[3]: *** [_module_/usr/src/modules/drbd/drbd] Error 2
make[3]: Leaving directory `/usr/src/linux-headers-2.6.26-2-xen-amd64'
make[2]: *** [kbuild] Error 2
make[2]: Leaving directory `/usr/src/modules/drbd/drbd'
make[1]: *** [binary-modules] Error 2
make[1]: Leaving directory `/usr/src/modules/drbd'
make: *** [kdist_build] Error 2
Well, I could/would upgrade to drbd8, but the README.drbd8-upgrade
says that all disks (remote_raid1) must be fully synced and working
correctly, and that I have to upgrade all nodes to Lenny, the new Xen
kernel and drbd8 simultaneously.
Is it possible to build drbd0.7 somehow, or should I just try to update
to drbd8 anyway, despite the disks not being fully synced?
Thanks for your help!
Cheers,
Thomas
I updated everything to Lenny and drbd8. After a while it seemed to
work, but I can't start an instance:
# gnt-instance console laura
XENBUS: Device with no driver: device/console/0
Loading, please wait...
Begin: Loading essential drivers... ...
Done.
Begin: Running /scripts/init-premount ...
FATAL: Error inserting fan (/lib/modules/2.6.18-6-xen-amd64/kernel/
drivers/acpi/fan.ko): No such device
processor: Unknown symbol pm_idle
WARNING: Error inserting processor (/lib/modules/2.6.18-6-xen-amd64/
kernel/drivers/acpi/processor.ko): Unknown symbol in module, or
unknown parameter (see dmesg)
thermal: Unknown symbol acpi_processor_set_thermal_limit
FATAL: Error inserting thermal (/lib/modules/2.6.18-6-xen-amd64/kernel/
drivers/acpi/thermal.ko): Unknown symbol in module, or unknown
parameter (see dmesg)
Done.
Begin: Mounting root file system... ...
Begin: Running /scripts/local-top ...
Begin: Loading MD modules ...
md: md driver 0.90.3 MAX_MD_DEVS=256, MD_SB_DISKS=27
md: bitmap version 4.39
md: raid1 personality registered for level 1
Success: loaded module raid1.
Done.
Begin: Assembling all MD arrays ...
md: md6 stopped.
mdadm: no devices found for /dev/md6
md: md7 stopped.
mdadm: no devices found for /dev/md7
md: md0 stopped.
mdadm: no devices found for /dev/md0
md: md1 stopped.
mdadm: no devices found for /dev/md1
md: md4 stopped.
mdadm: no devices found for /dev/md4
md: md5 stopped.
mdadm: no devices found for /dev/md5
md: md6 stopped.
mdadm: no devices found for /dev/md6
md: md7 stopped.
mdadm: no devices found for /dev/md7
md: md0 stopped.
mdadm: no devices found for /dev/md0
md: md1 stopped.
mdadm: no devices found for /dev/md1
md: md2 stopped.
mdadm: no devices found for /dev/md2
md: md3 stopped.
mdadm: no devices found for /dev/md3
md: md4 stopped.
mdadm: no devices found for /dev/md4
md: md5 stopped.
mdadm: no devices found for /dev/md5
md: md2 stopped.
mdadm: no devices found for /dev/md2
md: md3 stopped.
mdadm: no devices found for /dev/md3
md: md10 stopped.
mdadm: no devices found for /dev/md10
md: md11 stopped.
mdadm: no devices found for /dev/md11
md: md8 stopped.
md: bind<sda>
raid1: raid set md8 active with 1 out of 1 mirrors
mdadm: /dev/md8 has been started with 1 drive.
md: md9 stopped.
md: bind<sdb>
raid1: raid set md9 active with 1 out of 1 mirrors
mdadm: /dev/md9 has been started with 1 drive.
Failure: failed to assemble all arrays.
Done.
device-mapper: ioctl: 4.7.0-ioctl (2006-06-24) initialised: dm-
de...@redhat.com
Done.
Begin: Running /scripts/local-premount ...
Done.
mount: Mounting /dev/sda on /root failed: Device or resource busy
Begin: Running /scripts/local-bottom ...
Done.
Done.
Begin: Running /scripts/init-bottom ...
mount: Mounting /root/dev on /dev/.static/dev failed: No such file or
directory
Done.
mount: Mounting /sys on /root/sys failed: No such file or directory
mount: Mounting /proc on /root/proc failed: No such file or directory
Target filesystem doesn't have /sbin/init
BusyBox v1.1.3 (Debian 1:1.1.3-4) Built-in shell (ash)
Enter 'help' for a list of built-in commands.
/bin/sh: can't access tty; job control turned off
(initramfs)
Why is the device /dev/sda busy? Activating the disks and mounting
them works as expected. What is the problem here?
I didn't find anything on google...
Please help! This is a production environment and the whole cluster
has been down for hours now...
Thank you!
Cheers,
Thomas
Uh-oh, an instance should *not* be doing RAID by itself. Well, if you
have a special setup, it might be, but I'd say it's uncommon.
> md: md driver 0.90.3 MAX_MD_DEVS=256, MD_SB_DISKS=27
> md: bitmap version 4.39
> md: raid1 personality registered for level 1
> Success: loaded module raid1.
> Done.
> Begin: Assembling all MD arrays ...
[…]
> md: md8 stopped.
> md: bind<sda>
> raid1: raid set md8 active with 1 out of 1 mirrors
> mdadm: /dev/md8 has been started with 1 drive.
> md: md9 stopped.
> md: bind<sdb>
> raid1: raid set md9 active with 1 out of 1 mirrors
> mdadm: /dev/md9 has been started with 1 drive.
> Failure: failed to assemble all arrays.
> Done.
This shows that the domU (instance) kernel sees some md devices, which
is strange in normal situations.
I think the domU kernel detects the md superblocks left over from the
old remote_raid (md+drbd 0.7) layout and activates the arrays, which
makes sda and sdb busy and leads to the next error:
> mount: Mounting /dev/sda on /root failed: Device or resource busy
> Begin: Running /scripts/local-bottom ...
> Done.
> Done.
> Begin: Running /scripts/init-bottom ...
> mount: Mounting /root/dev on /dev/.static/dev failed: No such file or
> directory
> Done.
> mount: Mounting /sys on /root/sys failed: No such file or directory
> mount: Mounting /proc on /root/proc failed: No such file or directory
> Target filesystem doesn't have /sbin/init
>
>
> BusyBox v1.1.3 (Debian 1:1.1.3-4) Built-in shell (ash)
> Enter 'help' for a list of built-in commands.
>
> /bin/sh: can't access tty; job control turned off
> (initramfs)
>
> Why is the device /dev/sda busy? Activating the disks and mounting
> them works as expected. What is the problem here?
> I didn't find anything on google...
I think it's because of the md arrays, as described above.
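One quick way to check this hypothesis is to look for a leftover md
superblock near the end of the disk. A hedged sketch, not from the
original thread: the md 0.90 superblock sits at the last 64 KiB-aligned
offset and starts with the magic number 0xa92b4efc (little-endian). The
demo below plants and probes the magic on a plain file so it can run
anywhere; on the real node DISK would be e.g. /dev/drbd0 (device name
is an example).

```shell
# Demo: detect an md 0.90 superblock by its magic bytes.
DISK=/tmp/md_probe_demo.img
dd if=/dev/zero of="$DISK" bs=1M count=1 2>/dev/null
# plant the magic (0xa92b4efc little-endian: fc 4e 2b a9) where a
# superblock would sit, to show the probe working
printf '\374\116\053\251' | dd of="$DISK" bs=1 seek=$((1048576 - 65536)) conv=notrunc 2>/dev/null
# superblock offset: last full 64 KiB block of the device
OFFSET=$(( ($(stat -c %s "$DISK") / 65536 - 1) * 65536 ))
MAGIC=$(dd if="$DISK" bs=1 skip=$OFFSET count=4 2>/dev/null | od -An -tx1 | tr -d ' \n')
[ "$MAGIC" = "fc4e2ba9" ] && echo "stale md superblock found"
```

On the real node, `mdadm --examine /dev/drbd0` should report the same
superblock in readable form (or "No md superblock detected" on a clean
disk).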
You have a couple of options:
- start the instance with kernel parameter “raid=noauto” to disable the
automatic mounting
- remove the md raid1 module from the domU kernel/initrd
- export/import the instance, in order to recreate the disks without the
old raid header
- or simply dump the filesystem, recreate a new one, import
For the short term, any of them suffices. But for the long term, you
will need to apply one of the last two, once you manage to get the
instance(s) back.
regards,
iustin
Thanks for your response!
On 8 Mar., 23:15, Iustin Pop <iu...@k1024.org> wrote:
> On Mon, Mar 08, 2010 at 12:10:11PM -0800, Thomas Rieschl wrote:
> > hi again...
>
> > I updated everything to lenny and drbd8. After a while it seemed to
> > work but I can't start an instance:
>
> > # gnt-instance console laura
> > XENBUS: Device with no driver: device/console/0
> > Loading, please wait...
> > Begin: Mounting root file system... ...
> > Begin: Running /scripts/local-top ...
> > Begin: Loading MD modules ...
>
> Uh-oh, an instance should *not* be doing RAID by itself. Well, if you
> have a special setup, it might be, but I'd say it's uncommon.
hm, I did not do anything like that. That must have happened somehow...
How do I pass kernel parameters to the domU? I didn't see anything
like that in the xend-config.sxp file?
The other alternatives seem to require more time. I have to bring up
the instances as fast as possible...
>
> For the short term, any of them suffices. But for long term, you will
> need to apply one of the last two, once you manage to get the
> instance(s) back.
>
> regards,
> iustin
Thanks!
cheers, thomas
On 8 Mar., 23:15, Iustin Pop <iu...@k1024.org> wrote:
[snip]
>
> You have a couple of options:
>
> - start the instance with kernel parameter “raid=noauto” to disable the
> automatic mounting
> - remove the md raid1 module from the domU kernel/initrd
> - export/import the instance, in order to recreate the disks without the
> old raid header
> - or simply dump the filesystem, recreate a new one, import
I didn't manage to do any of these things...
- I could not figure out how to pass kernel parameters to an instance
in ganeti
- I could not (did not try) to build my own kernel
- Export/Import didn't work either. I tried that but the new imported
instance could not be started (same error as source instance)
- dumping the filesystem didn't work. I tried to activate the disks of
the source instance and did a dd if=/dev/drbd0 of=backup-instance; I
also tried cat /dev/drbd0 > backup-instance. Both commands kept
filling the target file endlessly (it grew bigger than the instance
disk).
Activating both source and target disks and copying all files with
cp -av didn't work either. The instance started up, but lots of the
functions and programs didn't work.
So.. how do I manage that? What did I do wrong?
>
> For the short term, any of them suffices. But for long term, you will
> need to apply one of the last two, once you manage to get the
> instance(s) back.
>
> regards,
> iustin
Thanks for the help!
cheers,
thomas
In Ganeti 2.0 it's possible; if you still have Ganeti 1.2, just edit
/etc/xen/$FQDN and change the "extra" line.
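For reference, a hypothetical /etc/xen/<fqdn> fragment (file name,
kernel path and existing values are examples, not from the thread;
Ganeti 1.2 regenerates this file when it starts the instance, so the
edit may not survive a restart — and, as it turns out later in the
thread, mdadm run from the initramfs may ignore this parameter):

```python
# /etc/xen/laura.example.com (hypothetical fragment)
kernel = '/boot/vmlinuz-2.6.18-6-xen-amd64'
root   = '/dev/sda1 ro'
# append raid=noauto to whatever is already on the extra line, so the
# domU kernel skips md autodetection:
extra  = 'console=tty1 raid=noauto'
```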
> - I could not (did not try) to build my own kernel
ok.
> - Export/Import didn't work either. I tried that but the new imported
> instance could not be started (same error as source instance)
hmm, that is strange.
> - dumping the filesystem didn't work. I tried to activate the disks of
> the source instance, did a dd if=/dev/drbd0 of=backup-instance and
> also tried to cat /dev/drbd0 > backup-instance. Both commands did
> result in endless filling of the target file (bigger than instance-
> disk).
this is not useful, since the DRBD device (/dev/drbd0) contains
everything, including the bad RAID signature.
> Activating both source and target disks and cp -av all files didn't
> work either. The instance started up but lots of the functions and
> programs didn't work.
cp -av is not good; I'd rather try a tar pipe:
tar c -C /source . | tar x -C /destination
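A slightly more defensive variant of this pipe (a sketch, assuming GNU
tar): -p preserves permissions on extraction, and --numeric-owner
avoids uid/gid remapping when dom0 and the instance have different
passwd databases. The demo paths below are stand-ins; on the real node
SRC and DST would be the mounted old and new instance filesystems.

```shell
# Copy a whole tree, preserving permissions and numeric ownership.
SRC=/tmp/copy_demo_src
DST=/tmp/copy_demo_dst
mkdir -p "$SRC/etc" "$DST"
echo "marker" > "$SRC/etc/marker"
# create on one side of the pipe, extract on the other
tar -cf - --numeric-owner -C "$SRC" . | tar -xpf - --numeric-owner -C "$DST"
```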
Let us know if either the xen modification worked or this import/export.
A last resort would be to zero the last 64K (or 128K?) of the instance's
disk, and then run a fsck on it.
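A hedged sketch of that last resort, shown on a plain 4 MiB file so it
can run safely; on the real node DISK would be the instance's backing
device, e.g. an LV (the name and the exact superblock size are
assumptions, which is why the thread says "64K (or 128K?)" — 128 KiB
covers both cases):

```shell
# Wipe the trailing area where an md 0.90 superblock lives, without
# changing the device size, then fsck the filesystem.
DISK=/tmp/wipe_demo.img
dd if=/dev/zero of="$DISK" bs=1M count=4 2>/dev/null   # 4 MiB stand-in disk
SIZE=$(stat -c %s "$DISK")
# plant some non-zero bytes in the tail so the wipe is observable
printf 'STALE' | dd of="$DISK" bs=1 seek=$(( SIZE - 1000 )) conv=notrunc 2>/dev/null
# zero the last 128 KiB (two 64 KiB blocks), keeping the size intact
dd if=/dev/zero of="$DISK" bs=65536 seek=$(( SIZE / 65536 - 2 )) count=2 conv=notrunc 2>/dev/null
# on a real disk, follow with a forced fsck of the filesystem on it
```

If mdadm is available, `mdadm --zero-superblock` on the device is a
more targeted way to achieve the same thing (assuming an md 0.90
superblock).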
iustin
On 9 Mar., 09:43, Iustin Pop <ius...@google.com> wrote:
> On Mon, Mar 08, 2010 at 04:21:15PM -0800, Thomas Rieschl wrote:
> > and once again hi...
>
> > On 8 Mrz., 23:15, Iustin Pop <iu...@k1024.org> wrote:
>
> > [snip]
>
> > > You have a couple of options:
>
> > > - start the instance with kernel parameter “raid=noauto” to disable the
> > > automatic mounting
> > > - remove the md raid1 module from the domU kernel/initrd
> > > - export/import the instance, in order to recreate the disks without the
> > > old raid header
> > > - or simply dump the filesystem, recreate a new one, import
>
> > I didn't manage to do any of these things...
>
> > - I could not figure out how to pass kernel parameters to an instance
> > in ganeti
>
> In Ganeti 2.0 it's possible; if you still have ganeti 1.2, just modify
> /etc/xen/$FQDN and modify the "extra" line.
That didn't work. It had no effect at all...
>
> > - I could not (did not try) to build my own kernel
>
> ok.
>
> > - Export/Import didn't work either. I tried that but the new imported
> > instance could not be started (same error as source instance)
>
> hmm, that is strange.
I'll try that again...
>
> > - dumping the filesystem didn't work. I tried to activate the disks of
> > the source instance, did a dd if=/dev/drbd0 of=backup-instance and
> > also tried to cat /dev/drbd0 > backup-instance. Both commands did
> > result in endless filling of the target file (bigger than instance-
> > disk).
>
> this is not useful, since the DRBD device (/dev/drbd0) contains
> everything, including the bad RAID signature.
oh, I didn't know that...
>
> > Activating both source and target disks and cp -av all files didn't
> > work either. The instance started up but lots of the functions and
> > programs didn't work.
>
> cp -av is not good, I'd try rather a tar:
>
> tar c -C /source . | tar x -C /destination
hm, your tar command seems to work, but only partially... It worked
for one small instance, but when trying to restore a webserver, the
instance could not bring up eth0:
# ifup eth0
[...]
SIOCSIFADDR: No such device
eth0: ERROR while getting interface flags: No such device
eth0: ERROR while getting interface flags: No such device
Bind socket to interface: No such device
Failed to bring up eth0.
# cat /var/log/syslog | grep -i eth
Mar 9 13:08:54 alice kernel: netfront: Initialising virtual ethernet
driver.
Mar 9 13:08:54 alice kernel: netfront: device eth0 has copying
receive path.
The host node seems to activate the bridge, though (vif8.0).
Any suggestions?
Perhaps because I've overwritten /dev with the tar command?
>
> Let us know if either the xen modification worked or this import/export.
>
> A last resort would be to zero the last 64K (or 128K?) of the instance's
> disk, and then run a fsck on it.
>
> iustin
thanks
cheers, thomas
It means that md, being loaded from the initrd, doesn't use the kernel
arguments. I wasn't sure about this, sorry.
what does 'ifconfig -a' or 'ip l' say?
> The host node seems to activate the brigde, though (vif8.0)
>
> Any suggestions?
> Perhaps because I've overwritten /dev with the TAR command?
Not sure, this is very strange. But I have hopes that this will be
easy to fix.
Did you just create a new instance? Then the new instance might have a
different MAC, and it might be renamed by udev to eth1?
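If that's the case, the usual fix is to delete udev's cached
MAC-to-name rule inside the instance and reboot, so udev regenerates
it for the new MAC. A hedged sketch: the path shown is the lenny
location (older Debian releases used z25_persistent-net.rules), and a
/tmp stand-in file is used here so the demo is harmless.

```shell
# Drop the stale persistent-net rule that pins "eth0" to the old MAC.
RULES=/tmp/persistent_net_demo.rules   # stands in for /etc/udev/rules.d/70-persistent-net.rules
printf '%s\n' 'SUBSYSTEM=="net", ATTR{address}=="aa:00:00:11:22:33", NAME="eth0"' > "$RULES"
rm -f "$RULES"   # on the instance: remove the real rules file, then reboot
```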
iustin
I tried that again; it worked, just as with tar.
omg... such a rookie mistake! of course, udev renamed it.
now the whole cluster is up and running again :)
thank you very, very much for your help and patience ;)
cheers,
thomas
Glad to hear this, looking forward to your experiences with the new
ganeti :)
iustin