Failure: command execution error:
Node daemon on eugn-vnode1 didn't answer queries within 10.0 seconds
In commands.log, the following error is recorded (including traceback):
2011-02-23 13:51:36,028: gnt-cluster init pid=12857 ERROR RPC error in version from node eugn-vnode1: Error 77:
2011-02-23 13:51:36,029: gnt-cluster init pid=12857 ERROR Error during command processing
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/ganeti/cli.py", line 1931, in GenericMain
    result = func(options, args)
  File "/usr/lib/python2.6/site-packages/ganeti/rpc.py", line 176, in wrapper
    return fn(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/ganeti/client/gnt_cluster.py", line 146, in InitCluster
    prealloc_wipe_disks=opts.prealloc_wipe_disks,
  File "/usr/lib/python2.6/site-packages/ganeti/bootstrap.py", line 444, in InitCluster
    _InitGanetiServerSetup(hostname.name)
  File "/usr/lib/python2.6/site-packages/ganeti/bootstrap.py", line 176, in _InitGanetiServerSetup
    _WaitForNodeDaemon(master_name)
  File "/usr/lib/python2.6/site-packages/ganeti/bootstrap.py", line 192, in _WaitForNodeDaemon
    " %s seconds" % (node_name, _DAEMON_READY_TIMEOUT))
According to the manpage for curl, error code 77 means "Problem with
reading the SSL CA cert (path? access rights?)". It's not clear to me
why the client would have a problem verifying the SSL certificate of the
node when it is using the same certificate as the server. This host has
pycurl-7.19.0 installed. Since I'm running 'gnt-cluster init' as root,
I fail to see why there would be path or access rights issues.
Furthermore, it appears _ConfigRpcCurl() in rpc.py sets CAINFO to the
server's certificate, /var/lib/ganeti/server.pem. Any ideas why curl
would complain about a bad CACERT file?
I did not run into this problem when initializing a cluster on rhel5.
That is running ganeti-2.3.1 and pycurl-7.15.5.
Please let me know if any further information would be of assistance.
Further log output is below.
Thanks,
sf
Starting gnt-cluster init with the debug flag:
2011-02-23 15:11:49,370: gnt-cluster init pid=14713 process:195 DEBUG RunCmd /usr/lib/ganeti/daemon-util start ganeti-noded
2011-02-23 15:11:49,668: gnt-cluster init pid=14713 client:335 DEBUG Starting request <ganeti.http.client.HttpClientRequest 207.98.65.18:1811 PUT /version at 0x7f88d0c3ce50>
2011-02-23 15:11:49,668: gnt-cluster init pid=14713 client:320 DEBUG Created new client <ganeti.http.client._PooledHttpClient id=207.98.65.18/1811 lastuse=0 <ganeti.http.client._HttpClient object at 0x7f88d0c3ce90> at 0x7f88d0bcbdd0>
2011-02-23 15:11:49,710: gnt-cluster init pid=14713 client:232 DEBUG Request <ganeti.http.client.HttpClientRequest 207.98.65.18:1811 PUT /version at 0x7f88d0c3ce50> finished, errmsg=Error 77:
2011-02-23 15:11:49,710: gnt-cluster init pid=14713 client:350 DEBUG Returning client <ganeti.http.client._PooledHttpClient id=207.98.65.18/1811 lastuse=1 <ganeti.http.client._HttpClient object at 0x7f88d0c3ce90> at 0x7f88d0bcbdd0> to pool
2011-02-23 15:11:49,711: gnt-cluster init pid=14713 rpc:393 ERROR RPC error in version from node eugn-vnode1.nero.net: Error 77:
From node-daemon.log:
2011-02-23 13:51:36,038: ganeti-noded pid=12968 mlock:73 DEBUG Memory lock set
2011-02-23 13:51:36,038: ganeti-noded pid=12968 server:264 DEBUG Connection from 207.98.65.18:40696
2011-02-23 13:51:36,038: ganeti-noded pid=12968 server:305 DEBUG Disconnected 207.98.65.18:40696
[…]
Please make sure to clean up any existing old node daemons before
initializing the cluster. From the log you have given, it seems that
the node daemon was started two hours before the cluster init.
Unfortunately we don't detect this case in the init script right now. I
just filed Issue 145 for this.
thanks,
iustin
My apologies, I must have snipped together logs from two different
attempts to initialize the cluster. Before each attempt, I removed
everything in /var/lib/ganeti to start at a clean state. I tried
running gnt-cluster destroy, but that needs the master daemon to be up.
Aside from the timestamp issue, any ideas for what may be going on?
Thanks,
sf
I hope this means that you also killed the daemon (ganeti-noded), as
that will survive just the removal of /var/lib/ganeti.
> I tried
> running gnt-cluster destroy, but that needs the master daemon to be up.
>
> Aside from the timestamp issue, any ideas for what may be going on?
Firewalls? Check the time of start of ganeti-noded against the last mod
time of the server.pem file? Not sure.
regards,
iustin
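The start-time check suggested above can be sketched in Python on Linux. Presumably the concern is that a daemon started before server.pem was last rewritten could still be using an older certificate. This is only a sketch: the function names are made up for illustration and are not part of ganeti, and it assumes a Linux /proc filesystem and the default certificate path mentioned in this thread.

```python
import os

def start_ticks(stat_line):
    """Return field 22 (starttime, in clock ticks since boot) of /proc/<pid>/stat."""
    # The command name (field 2) is parenthesised and may contain spaces,
    # so only the text after the last ')' is split on whitespace.
    rest = stat_line.rsplit(")", 1)[1].split()
    return int(rest[19])  # rest[0] is field 3, so field 22 is rest[19]

def boot_time(proc_stat_text):
    """Return the boot time (seconds since the epoch) from the text of /proc/stat."""
    for line in proc_stat_text.splitlines():
        if line.startswith("btime "):
            return int(line.split()[1])
    raise ValueError("no btime line found")

def daemon_predates_cert(pid, cert_path="/var/lib/ganeti/server.pem"):
    """True if process `pid` started before `cert_path` was last modified."""
    with open("/proc/%d/stat" % pid) as f:
        ticks = start_ticks(f.read())
    with open("/proc/stat") as f:
        booted = boot_time(f.read())
    started = booted + ticks / float(os.sysconf("SC_CLK_TCK"))
    return started < os.path.getmtime(cert_path)
```

The pid can be taken from "pidof ganeti-noded". If daemon_predates_cert(pid) returns True, restarting the daemon before retrying the init is a reasonable next step.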
> > My apologies, I must have snipped together logs from two different
> > attempts to initialize the cluster. Before each attempt, I removed
> > everything in /var/lib/ganeti to start at a clean state.
>
> I hope this means that you also killed the daemon (ganeti-noded), as
> that will survive just the removal of /var/lib/ganeti.
Yes. I stop the service (/etc/init.d/ganeti stop),
remove /var/lib/ganeti, and then recreate the directory.
> > I tried
> > running gnt-cluster destroy, but that needs the master daemon to be up.
> >
> > Aside from the timestamp issue, any ideas for what may be going on?
>
> Firewalls? Check the time of start of ganeti-noded against the last mod
> time of the server.pem file? Not sure.
iptables is set to accept everything.
I've isolated the problem to the SSL certificate that ganeti creates.
The problem isn't in ganeti itself. A bug [1] recorded in
bugzilla.redhat.com suggests the problem is that openssl creates a
PKCS#8-encoded PEM RSA key file that NSS (which curl is compiled
against) cannot read. If I use a certificate generated on my
rhel5-based cluster (and tweak ganeti so as not to recreate the
certificate), initialization works just fine.
Regards,
sf
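One way to tell which key encoding a given server.pem contains is to look at the PEM "BEGIN" labels: a PKCS#8 key is labelled "PRIVATE KEY" (or "ENCRYPTED PRIVATE KEY"), while the traditional RSA format is labelled "RSA PRIVATE KEY". A minimal Python sketch of that check follows. The function names are hypothetical and not part of ganeti, and it only inspects the labels, not the key material.

```python
import re

# Matches lines such as "-----BEGIN RSA PRIVATE KEY-----" and captures the label.
_BEGIN = re.compile(r"^-----BEGIN ([A-Z0-9 ]+)-----\s*$", re.MULTILINE)

def pem_labels(pem_text):
    """Return the labels of all PEM blocks in the text, in order."""
    return _BEGIN.findall(pem_text)

def key_encoding(pem_text):
    """Classify the private key in a PEM file as 'traditional', 'pkcs8' or 'none'."""
    labels = pem_labels(pem_text)
    if "RSA PRIVATE KEY" in labels:
        return "traditional"
    if "PRIVATE KEY" in labels or "ENCRYPTED PRIVATE KEY" in labels:
        return "pkcs8"
    return "none"
```

For example, key_encoding(open("/var/lib/ganeti/server.pem").read()) returning "pkcs8" on the failing host and "traditional" on the rhel5 host would be consistent with the diagnosis above.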
Ah, nice debugging. Thanks! Since the bug in RHEL seems fixed, I presume
we don't need to check for this in Ganeti itself?
iustin
> > I've isolated the problem to the ssl certificate that ganeti creates.
> > The problem isn't in ganeti itself. A bug [1] recorded in
> > bugzilla.redhat.com suggests the problem is that openssl creates a
> > pkcs#8 encoded pem rsa key file that nss (which curl is compiled
> > against) cannot read. If I use a certificate generated from my
> > rhel5-based cluster (and tweak ganeti so as not to recreate the
> > certificate), initialization works just fine.
>
> Ah, nice debugging. Thanks! Since the bug in RHEL seems fixed, I presume
> we don't need to check for this in Ganeti itself?
Since ganeti is getting error code 77 from pycurl, it may help to log a
message that ganeti/pycurl could not correctly parse the server
certificate. Since you're providing the full path and it should be
owned by root (or whichever user ganeti runs as), that leaves a problem
with the certificate itself.
I know fedora builds curl against nss. Not sure if debian or ubuntu
will eventually deprecate building curl against openssl in favor of nss.
sf
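The logging idea above could look roughly like the following Python sketch. It is not ganeti code: the function name and the wording of the message are invented for illustration. The only fact it relies on is that libcurl's error 77 is CURLE_SSL_CACERT_BADFILE, the "problem with reading the SSL CA cert" error quoted from the curl manpage earlier in the thread.

```python
CURLE_SSL_CACERT_BADFILE = 77  # libcurl's numeric code for this error

def explain_curl_error(code, cainfo_path):
    """Return a log message for a curl error code, with a hint for code 77."""
    if code == CURLE_SSL_CACERT_BADFILE:
        return ("curl error 77: could not read the SSL CA certificate at %s; "
                "check the path, the permissions and the file's encoding"
                % cainfo_path)
    # Other codes are passed through without a hint.
    return "curl error %d" % code
```

A caller would log explain_curl_error(77, "/var/lib/ganeti/server.pem") in place of the bare "Error 77:" seen in the logs above.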
I am using ganeti 2.4.1 on Scientific Linux 6.0.
As a workaround, you can generate the certificates manually using the openssl command.
- clean up
    /etc/init.d/ganeti stop (or the equivalent command on your system)
    rm -rf /var/lib/ganeti/*
- initialize cluster (example)
    # gnt-cluster init --vg-name vmvg --master-netdev br0 --enabled-hypervisors kvm --nic-parameters link=br1 gcluster
    Failure: command execution error:
    Node daemon on node01.example.org didn't answer queries within 10.0 seconds
- stop node daemon
    killall ganeti-noded
- generate server.pem
    openssl req -new -x509 -days 1825 -keyout server-key.pem -out server-cert.pem
    openssl rsa -in server-key.pem -out server-key-nopass.pem
    cat server-key-nopass.pem server-cert.pem > /var/lib/ganeti/server.pem
- generate rapi.pem
    openssl req -new -x509 -days 1825 -keyout rapi-key.pem -out rapi-cert.pem
    openssl rsa -in rapi-key.pem -out rapi-key-nopass.pem
    cat rapi-key-nopass.pem rapi-cert.pem > /var/lib/ganeti/rapi.pem
- start daemons
    /etc/init.d/ganeti start (or the equivalent command on your system)
--
Jun Futagawa
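A note on the "cat ... > server.pem" steps above: with a typical umask, the shell redirection creates a file that other users can read, even though it contains the unencrypted private key. The following Python sketch does the same concatenation but restricts the result to its owner. The function name is made up for illustration, and the code assumes a Unix-like system.

```python
import os

def write_combined_pem(key_path, cert_path, out_path):
    """Concatenate key and certificate into out_path, readable by the owner only."""
    with open(key_path, "rb") as f:
        key = f.read()
    with open(cert_path, "rb") as f:
        cert = f.read()
    # Mode 0600 applies when the file is created by this call.
    fd = os.open(out_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as out:
        os.fchmod(out.fileno(), 0o600)  # also tighten a file that already existed
        out.write(key)   # key first, then certificate, as in the cat commands
        out.write(cert)
```

Usage, mirroring the server.pem step: write_combined_pem("server-key-nopass.pem", "server-cert.pem", "/var/lib/ganeti/server.pem").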