Problem initializing cluster: RPC error in version from node: Error 77

Stephen Fromm

Feb 24, 2011, 3:07:17 AM
to ganeti
I'm unable to initialize a cluster on rhel6. I've tried with
ganeti-2.3.1 and ganeti-2.4.0~rc2. Trying to initialize the cluster
fails with:

Failure: command execution error:
Node daemon on eugn-vnode1 didn't answer queries within 10.0 seconds

In commands.log, the following error is recorded (including traceback):

2011-02-23 13:51:36,028: gnt-cluster init pid=12857 ERROR RPC error in version from node eugn-vnode1: Error 77:
2011-02-23 13:51:36,029: gnt-cluster init pid=12857 ERROR Error during command processing
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/ganeti/cli.py", line 1931, in GenericMain
    result = func(options, args)
  File "/usr/lib/python2.6/site-packages/ganeti/rpc.py", line 176, in wrapper
    return fn(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/ganeti/client/gnt_cluster.py", line 146, in InitCluster
    prealloc_wipe_disks=opts.prealloc_wipe_disks,
  File "/usr/lib/python2.6/site-packages/ganeti/bootstrap.py", line 444, in InitCluster
    _InitGanetiServerSetup(hostname.name)
  File "/usr/lib/python2.6/site-packages/ganeti/bootstrap.py", line 176, in _InitGanetiServerSetup
    _WaitForNodeDaemon(master_name)
  File "/usr/lib/python2.6/site-packages/ganeti/bootstrap.py", line 192, in _WaitForNodeDaemon
    " %s seconds" % (node_name, _DAEMON_READY_TIMEOUT))

According to the manpage for curl, error code 77 means "Problem with
reading the SSL CA cert (path? access rights?)". It's not clear to me
why the client would have a problem verifying the SSL certificate of the
node when it is using the same certificate as the server. This host has
pycurl-7.19.0 installed. Since I'm running 'gnt-cluster init' as root,
I fail to see why there would be path or access rights issues.
Furthermore, it appears _ConfigRpcCurl() in rpc.py sets CAINFO to the
server's certificate, /var/lib/ganeti/server.pem. Any ideas why curl
would complain about a bad CACERT file?

I did not run into this problem when initializing a cluster on rhel5.
That is running ganeti-2.3.1 and pycurl-7.15.5.

Please let me know if any further information would be of assistance.
Further log output is below.

Thanks,
sf

Starting gnt-cluster init with the debug flag:

2011-02-23 15:11:49,370: gnt-cluster init pid=14713 process:195 DEBUG RunCmd /usr/lib/ganeti/daemon-util start ganeti-noded
2011-02-23 15:11:49,668: gnt-cluster init pid=14713 client:335 DEBUG Starting request <ganeti.http.client.HttpClientRequest 207.98.65.18:1811 PUT /version at 0x7f88d0c3ce50>
2011-02-23 15:11:49,668: gnt-cluster init pid=14713 client:320 DEBUG Created new client <ganeti.http.client._PooledHttpClient id=207.98.65.18/1811 lastuse=0 <ganeti.http.client._HttpClient object at 0x7f88d0c3ce90> at 0x7f88d0bcbdd0>
2011-02-23 15:11:49,710: gnt-cluster init pid=14713 client:232 DEBUG Request <ganeti.http.client.HttpClientRequest 207.98.65.18:1811 PUT /version at 0x7f88d0c3ce50> finished, errmsg=Error 77:
2011-02-23 15:11:49,710: gnt-cluster init pid=14713 client:350 DEBUG Returning client <ganeti.http.client._PooledHttpClient id=207.98.65.18/1811 lastuse=1 <ganeti.http.client._HttpClient object at 0x7f88d0c3ce90> at 0x7f88d0bcbdd0> to pool
2011-02-23 15:11:49,711: gnt-cluster init pid=14713 rpc:393 ERROR RPC error in version from node eugn-vnode1.nero.net: Error 77:

From node-daemon.log:

2011-02-23 13:51:36,038: ganeti-noded pid=12968 mlock:73 DEBUG Memory lock set
2011-02-23 13:51:36,038: ganeti-noded pid=12968 server:264 DEBUG Connection from 207.98.65.18:40696
2011-02-23 13:51:36,038: ganeti-noded pid=12968 server:305 DEBUG Disconnected 207.98.65.18:40696

Iustin Pop

Feb 24, 2011, 4:05:40 AM
to gan...@googlegroups.com
On Thu, Feb 24, 2011 at 12:07:17AM -0800, Stephen Fromm wrote:
> I'm unable to initialize a cluster on rhel6. I've tried with
> ganeti-2.3.1 and ganeti-2.4.0~rc2. Trying to initialize the cluster
> fails with:
>
> Failure: command execution error:
> Node daemon on eugn-vnode1 didn't answer queries within 10.0 seconds

[…]

Please make sure to clean up any existing old node daemons before
initializing the cluster. From the log you have given, it seems that
the node daemon was started two hours before the cluster init.

Unfortunately we don't detect this case in the init script right now. I
just filed Issue 145 for this.

thanks,
iustin
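
One way to spot logs stitched together from different runs is to parse the leading timestamps and subtract them. The following is only a sketch: `parse_log_time` is a made-up helper name, and it assumes every log line starts with a timestamp in the `YYYY-MM-DD HH:MM:SS,mmm` form seen in the excerpts quoted in this thread.

```python
from datetime import datetime

# Timestamp layout assumed from the log excerpts quoted in this thread.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_log_time(line):
    """Return the timestamp at the start of a log line (hypothetical helper)."""
    # The timestamp is everything before the first ": " separator.
    stamp = line.split(": ", 1)[0]
    return datetime.strptime(stamp, LOG_TIME_FORMAT)


# First node-daemon.log line and first debug line of gnt-cluster init,
# copied from the original post.
noded_line = ("2011-02-23 13:51:36,038: ganeti-noded pid=12968 "
              "mlock:73 DEBUG Memory lock set")
init_line = ("2011-02-23 15:11:49,370: gnt-cluster init pid=14713 "
             "process:195 DEBUG RunCmd /usr/lib/ganeti/daemon-util start ganeti-noded")

gap = parse_log_time(init_line) - parse_log_time(noded_line)
print("node daemon log precedes init by", gap)
```

A gap of more than a few seconds between the node daemon's first line and the init command suggests that either the daemon was already running or the excerpts come from different attempts.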

Stephen Fromm

Feb 24, 2011, 9:36:16 AM
to gan...@googlegroups.com

My apologies, I must have snipped together logs from two different
attempts to initialize the cluster. Before each attempt, I removed
everything in /var/lib/ganeti to start at a clean state. I tried
running gnt-cluster destroy, but that needs the master daemon to be up.

Aside from the timestamp issue, any ideas for what may be going on?

Thanks,
sf

Iustin Pop

Feb 24, 2011, 9:59:43 AM
to gan...@googlegroups.com
On Thu, Feb 24, 2011 at 06:36:16AM -0800, Stephen Fromm wrote:
> On Thu, 2011-02-24 at 10:05 +0100, Iustin Pop wrote:
> > On Thu, Feb 24, 2011 at 12:07:17AM -0800, Stephen Fromm wrote:
> > > I'm unable to initialize a cluster on rhel6. I've tried with
> > > ganeti-2.3.1 and ganeti-2.4.0~rc2. Trying to initialize the cluster
> > > fails with:
> > >
> > > Failure: command execution error:
> > > Node daemon on eugn-vnode1 didn't answer queries within 10.0 seconds
> >
> > […]
> >
> > Please make sure to clean up any existing old node daemons before
> > initializing the cluster. From the log you have given, it seems that
> > the node daemon was started two hours before the cluster init.
> >
> > Unfortunately we don't detect this case in the init script right now. I
> > just filed Issue 145 for this.
>
> My apologies, I must have snipped together logs from two different
> attempts to initialize the cluster. Before each attempt, I removed
> everything in /var/lib/ganeti to start at a clean state.

I hope this means that you also killed the daemon (ganeti-noded), as
that will survive just the removal of /var/lib/ganeti.

> I tried
> running gnt-cluster destroy, but that needs the master daemon to be up.
>
> Aside from the timestamp issue, any ideas for what may be going on?

Firewalls? Check the time of start of ganeti-noded against the last mod
time of the server.pem file? Not sure.

regards,
iustin

Stephen Fromm

Feb 24, 2011, 7:40:54 PM
to gan...@googlegroups.com
On Thu, 2011-02-24 at 15:59 +0100, Iustin Pop wrote:

> > My apologies, I must have snipped together logs from two different
> > attempts to initialize the cluster. Before each attempt, I removed
> > everything in /var/lib/ganeti to start at a clean state.
>
> I hope this means that you also killed the daemon (ganeti-noded), as
> that will survive just the removal of /var/lib/ganeti.

Yes. I stop the service (/etc/init.d/ganeti stop),
remove /var/lib/ganeti, and then recreate the directory.

> > I tried
> > running gnt-cluster destroy, but that needs the master daemon to be up.
> >
> > Aside from the timestamp issue, any ideas for what may be going on?
>
> Firewalls? Check the time of start of ganeti-noded against the last mod
> time of the server.pem file? Not sure.

iptables is set to accept everything.

I've isolated the problem to the ssl certificate that ganeti creates.
The problem isn't in ganeti itself. A bug [1] recorded in
bugzilla.redhat.com suggests the problem is that openssl creates a
pkcs#8 encoded pem rsa key file that nss (which curl is compiled
against) cannot read. If I use a certificate generated from my
rhel5-based cluster (and tweak ganeti so as not to recreate the
certificate), initialization works just fine.

Regards,
sf

[1] https://bugzilla.redhat.com/show_bug.cgi?id=631000
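
To check which key encoding a given server.pem uses, the PEM header line is usually enough: traditional RSA keys start with `-----BEGIN RSA PRIVATE KEY-----`, while PKCS#8 keys start with `-----BEGIN PRIVATE KEY-----` (or `-----BEGIN ENCRYPTED PRIVATE KEY-----` when passphrase-protected). The sketch below relies only on those headers; `classify_private_key` is a hypothetical helper, not part of Ganeti, and it does not parse the key material itself.

```python
def classify_private_key(pem_text):
    """Guess the private-key encoding from the PEM header (hypothetical helper)."""
    for line in pem_text.splitlines():
        header = line.strip()
        if header == "-----BEGIN RSA PRIVATE KEY-----":
            return "traditional RSA (PKCS#1)"
        if header == "-----BEGIN PRIVATE KEY-----":
            return "PKCS#8"
        if header == "-----BEGIN ENCRYPTED PRIVATE KEY-----":
            return "PKCS#8 (encrypted)"
    return "no private key found"


# Placeholder text standing in for a real file; only the header matters here.
sample = "-----BEGIN PRIVATE KEY-----\n(base64 data)\n-----END PRIVATE KEY-----\n"
print(classify_private_key(sample))
```

To check a real file, pass in the contents of /var/lib/ganeti/server.pem instead of the placeholder string. A "PKCS#8" result on a host where curl is built against an affected nss would be consistent with the bug described above.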

Iustin Pop

Feb 25, 2011, 3:17:53 AM
to gan...@googlegroups.com

Ah, nice debugging. Thanks! Since the bug in RHEL seems fixed, I presume
we don't need to check for this in Ganeti itself?

iustin

Stephen Fromm

Feb 25, 2011, 2:00:47 PM
to gan...@googlegroups.com
On Fri, 2011-02-25 at 09:17 +0100, Iustin Pop wrote:

> > I've isolated the problem to the ssl certificate that ganeti creates.
> > The problem isn't in ganeti itself. A bug [1] recorded in
> > bugzilla.redhat.com suggests the problem is that openssl creates a
> > pkcs#8 encoded pem rsa key file that nss (which curl is compiled
> > against) cannot read. If I use a certificate generated from my
> > rhel5-based cluster (and tweak ganeti so as not to recreate the
> > certificate), initialization works just fine.
>
> Ah, nice debugging. Thanks! Since the bug in RHEL seems fixed, I presume
> we don't need to check for this in Ganeti itself?

Since ganeti is getting error code 77 from pycurl, it may help to log a
message that ganeti/pycurl could not correctly parse the server
certificate. Since you're providing the full path and it should be
owned by root (or whomever ganeti runs as), that leaves a problem with
the certificate itself.

I know fedora builds curl against nss. Not sure if debian or ubuntu
will eventually deprecate building curl against openssl in favor of nss.

sf
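
The kind of log message suggested here could look roughly like the sketch below. It is not Ganeti code: `explain_curl_error` is a hypothetical helper, and how the numeric code actually reaches the caller inside Ganeti's RPC layer is left out. The only fact it relies on is that libcurl's error 77 is CURLE_SSL_CACERT_BADFILE.

```python
# Numeric libcurl error code for CURLE_SSL_CACERT_BADFILE.
CURLE_SSL_CACERT_BADFILE = 77


def explain_curl_error(code, ca_file):
    """Return a hint for a libcurl error code, or None (hypothetical helper)."""
    if code == CURLE_SSL_CACERT_BADFILE:
        return ("curl could not load the CA certificate %s; check the path, "
                "the permissions, and whether the SSL library curl was built "
                "against can parse the file" % ca_file)
    # Other codes are left to whatever message curl itself provides.
    return None


print(explain_curl_error(77, "/var/lib/ganeti/server.pem"))
```

With a hint like this in commands.log, the bare "Error 77:" would at least point at the certificate file rather than at the node daemon.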

Arthur Enright

Apr 30, 2011, 10:03:53 PM
to gan...@googlegroups.com
It looks like this is still an issue. I'm having the same problem on Scientific Linux 6.0. I checked the bugzilla entry and it reports the bug as fixed in nss-3.12.9-1.el6, but the package/SPEC has not yet been released (at least not on ftp.redhat.com). As I don't have a RHEL5-based environment to generate the certificate with, is anyone aware of a workaround until nss-3.12.9-1 is released?

Thanks,
-Art

Jun Futagawa

May 1, 2011, 9:09:50 AM
to gan...@googlegroups.com
Hi,

I am using ganeti 2.4.1 on Scientific Linux 6.0.
As a workaround, you can generate the certificates manually with the openssl command.

- cleanup
/etc/init.d/ganeti stop (or the equivalent command on your system)
rm /var/lib/ganeti/* -rf

- initialize cluster (example)
# gnt-cluster init --vg-name vmvg --master-netdev br0 --enabled-hypervisors kvm --nic-parameters link=br1 gcluster
Failure: command execution error:
Node daemon on node01.example.org didn't answer queries within 10.0 seconds

- stop node daemon
killall ganeti-noded

- generate server.pem
openssl req -new -x509 -days 1825 -keyout server-key.pem -out server-cert.pem
openssl rsa -in server-key.pem -out server-key-nopass.pem
cat server-key-nopass.pem server-cert.pem > /var/lib/ganeti/server.pem

- generate rapi.pem
openssl req -new -x509 -days 1825 -keyout rapi-key.pem -out rapi-cert.pem
openssl rsa -in rapi-key.pem -out rapi-key-nopass.pem
cat rapi-key-nopass.pem rapi-cert.pem > /var/lib/ganeti/rapi.pem

- start daemons
/etc/init.d/ganeti start (or the equivalent command on your system)

--
Jun Futagawa
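
After following these steps, it may be worth confirming that the combined file really contains a key block followed by a certificate block. The sketch below lists the PEM block labels in order; `pem_block_labels` is a hypothetical helper. The expectation that the key block is labelled `RSA PRIVATE KEY` is an assumption: it presumes that `openssl rsa` rewrites the key in the traditional format, as it did in the OpenSSL releases of that period, which is not something stated in this thread.

```python
import re

# Matches the opening line of any PEM block and captures its label.
PEM_BEGIN = re.compile(r"^-----BEGIN ([A-Z0-9 ]+)-----\s*$", re.MULTILINE)


def pem_block_labels(pem_text):
    """Return the labels of all PEM blocks in order (hypothetical helper)."""
    return PEM_BEGIN.findall(pem_text)


# Placeholder standing in for the output of the `cat` step above.
combined = (
    "-----BEGIN RSA PRIVATE KEY-----\n(base64 data)\n-----END RSA PRIVATE KEY-----\n"
    "-----BEGIN CERTIFICATE-----\n(base64 data)\n-----END CERTIFICATE-----\n"
)
print(pem_block_labels(combined))
```

If the first label comes back as `PRIVATE KEY` instead, the key is still PKCS#8-encoded and the original problem is likely to persist.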
