Error in resize and setup


Narcis Yousefi

Mar 26, 2020, 1:09:08 PM
to elasticluster
Dear Riccardo,
We have a cluster at S3IT and now I need to enlarge it.

To add new nodes I ran:
elasticluster resize -a 4:med128compute primrose -t primrose 

The nodes come in three different sizes, which I had added to the config file.

This part worked but with an error in the last lines:

-----
2020-03-26 07:54:21 1603ae1b395c gc3.elasticluster[1] ERROR Command `ansible-playbook --private-key=/home/ubuntu/.ssh/id_rsa /home/elasticluster/share/playbooks/main.yml --inventory=/home/ubuntu/.elasticluster/storage/primrose.inventory --become --become-user=root -e @extra_vars.yml` failed with exit code 2.
2020-03-26 07:54:21 1603ae1b395c gc3.elasticluster[1] WARNING The cluster has likely *not* been configured correctly. You may need to re-run `elasticluster setup`.
2020-03-26 07:54:21 1603ae1b395c gc3.elasticluster[1] WARNING Cluster `primrose` not yet configured. Please, re-run `elasticluster setup primrose` and/or check your configuration
Cluster name:     primrose
Cluster template: primrose
Default ssh to node: frontend001
- frontend nodes: 1
- smallcompute nodes: 2
- compute nodes: 3
- med128compute nodes: 4

To login on the frontend node, run the command:

    elasticluster ssh primrose

To upload or download files to the cluster, use the command:

    elasticluster sftp primrose
------

I can see the added nodes in the web interface, but not from the frontend node (I restarted Slurm).

And I get the attached error when I run:
elasticluster setup primrose

Could you please have a look and help me fix this error?
Thanks a lot in advance!

All the best,
Narcis
elasticluster_setup_error.pages
elasticluster_setup_error.txt

Riccardo Murri

Mar 26, 2020, 2:33:11 PM
to Narcis Yousefi, elasticluster
Hello Narcis,

> 2020-03-26 07:54:21 1603ae1b395c gc3.elasticluster[1] ERROR Command `ansible-playbook --private-key=/home/ubuntu/.ssh/id_rsa /home/elasticluster/share/playbooks/main.yml --inventory=/home/ubuntu/.elasticluster/storage/primrose.inventory --become --become-user=root -e @extra_vars.yml` failed with exit code 2.

This is the final, top-level error; the root cause would be in the
preceding lines. In this case, since the failure happened while
running Ansible, it would be in Ansible's output.
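
As a rough illustration of where to look: failed Ansible tasks are normally reported on lines containing `fatal:` or `FAILED!`. The sketch below assumes the setup output was saved to a plain-text file (for instance the attached `elasticluster_setup_error.txt`); the function name `failed_tasks` is made up for this example.

```shell
# List the lines of a saved Ansible log that report failed tasks,
# with line numbers so the surrounding context is easy to find.
# Assumption: the log is a plain-text file passed as the first argument.
failed_tasks() {
    grep -nE 'fatal:|FAILED!' "$1"
}
```

It could be called as `failed_tasks elasticluster_setup_error.txt`; the lines just above each match usually name the task that failed.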

Looking at the complete output (good that you included it!), it seems
to me that all tasks on `medXXX` nodes fail with this error
(reformatted for readability):

Running installation command 'apt-get install -y python2.7
python-simplejson' ...
E: Could not get lock /var/lib/dpkg/lock - open (11: Resource
temporarily unavailable)
E: Unable to lock the administration directory (/var/lib/dpkg/),
is another process using it?"

This typically happens on Debian/Ubuntu when you try to install a
package (`apt-get install`) while some other package is being
installed or upgraded.

It may be a transient error; just try again and see if it works now.

If it still fails, log into one of the failing nodes:

elasticluster ssh primrose -n med32compute002

then run these commands to see what is still holding the lock:

sudo fuser -v /var/lib/dpkg/lock
ps auxww | egrep 'dpkg|apt'
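
If the lock is only held transiently, a small wait loop around `fuser` can save a few manual retries. This is a minimal sketch, not part of ElastiCluster: the function name `wait_for_dpkg_lock`, the default of 30 tries, and the 10-second pause are all arbitrary choices for illustration. It should be run as root so that `fuser` can see processes owned by other users.

```shell
# Wait until no process holds the dpkg lock, or give up after some tries.
# $1: lock file (default /var/lib/dpkg/lock); $2: number of tries (default 30).
# Returns 0 once the lock is free, 1 if it is still held after the last try.
wait_for_dpkg_lock() {
    lock_file="${1:-/var/lib/dpkg/lock}"
    tries="${2:-30}"
    while [ "$tries" -gt 0 ]; do
        # fuser exits non-zero when no process has the file open
        if ! fuser "$lock_file" >/dev/null 2>&1; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 10
    done
    return 1
}
```

If the function returns 1, the lock is probably stuck rather than busy, and the `fuser -v` and `ps` commands above show which process to investigate.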

Thanks,
R

Narcis Yousefi

Mar 26, 2020, 3:07:45 PM
to elasticluster
Dear Riccardo, Thanks a lot for your prompt response!

When I run those commands here is what I get:

ubuntu@primrose-med32compute001:~$ sudo fuser -v /var/lib/dpkg/lock
                     USER        PID ACCESS COMMAND
/var/lib/dpkg/lock:  root       3316 F.... dpkg

ubuntu@primrose-med32compute001:~$ ps auxww | egrep 'dpkg|apt'
root      3223  0.0  0.1 115388 63232 ?        S    12:16   0:11 apt-get install -y python2.7 python-simplejson
root      3316  0.0  0.0  23592  4544 pts/1    Ss+  12:16   0:00 /usr/bin/dpkg --status-fd 43 --configure --pending
root      3328  0.0  0.0  67640 18524 pts/1    S+   12:17   0:00 /usr/bin/perl -w /usr/share/debconf/frontend /var/lib/dpkg/info/libssl1.1:amd64.postinst configure 1.1.0g-2ubuntu4.1
root      3334  0.0  0.0   4628  1792 pts/1    S+   12:17   0:00 /bin/sh /var/lib/dpkg/info/libssl1.1:amd64.postinst configure 1.1.0g-2ubuntu4.1
ubuntu    5371  0.0  0.0  17548  1084 pts/3    S+   19:02   0:00 grep -E --color=auto dpkg|apt


Just to mention that I ran setup several times and updated elasticluster in between.

Best regards,
Narcis

Riccardo Murri

Mar 29, 2020, 10:09:45 AM
to Narcis Yousefi, elasticluster
Hello Narcis,
From the output of `ps auxww`, it looks like the `apt-get install
python2.7 python-simplejson` command triggered an update of some other
package, which requires user interaction. The entire apt/dpkg system
is then blocked waiting for input which will never come.

It's a bug in ElastiCluster (`apt-get` should be called with an
explicit "no interaction" flag); I will commit a fix soon.
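
For illustration only (this is not necessarily the fix that went into ElastiCluster), a non-interactive `apt-get` call usually sets `DEBIAN_FRONTEND=noninteractive` and tells dpkg how to answer configuration-file prompts. The wrapper name `apt_install_quiet` and the `DRY_RUN` switch are invented for this sketch.

```shell
# Install packages without ever prompting for input.
# Hypothetical helper: with DRY_RUN=1 it only prints the command.
apt_install_quiet() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        run=echo
    else
        run=
    fi
    # DEBIAN_FRONTEND=noninteractive silences debconf questions;
    # --force-confdef/--force-confold keep existing config files
    # instead of asking which version to use.
    $run sudo env DEBIAN_FRONTEND=noninteractive \
        apt-get install -y \
        -o Dpkg::Options::=--force-confdef \
        -o Dpkg::Options::=--force-confold \
        "$@"
}
```

Running `DRY_RUN=1 apt_install_quiet python2.7 python-simplejson` prints the full command line, which is a cheap way to check it before running it on a node.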

For the time being, you can:

(1) kill those processes: from any node in the cluster, issue this command::

pdsh -a sudo pkill -f /usr/share/debconf/frontend

(2) resume configuration of pending packages: from any node in the
cluster, issue this command::

pdsh -a sudo env DEBIAN_FRONTEND=noninteractive dpkg --pending --configure

(3) Resume ElastiCluster setup::

elasticluster setup primrose

Hope this helps!

Riccardo

Narcis Yousefi

Mar 29, 2020, 3:27:31 PM
to elasticluster
Hey Riccardo,
Thanks a lot! That error is now fixed. I ran setup again. It went through many steps, but I get a new error almost at the end. Please see the attachment.
I can now see all of the added nodes with `sinfo` on the frontend node, but when I SSH into them I get a message saying that these nodes need to be restarted.

I appreciate your help, many thanks!
All the best,
Narcis
elasticluster_setup_error5.pages
elasticluster_setup_error5.txt

Narcis Yousefi

Mar 30, 2020, 9:02:02 AM
to elasticluster
Dear Riccardo,
Just an update. The problem was that slurmd was active on the old nodes but systemctl didn't know about it. It is now resolved and the new setup was successful. Thanks a lot for your help!
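
The thread does not say how this mismatch was found or fixed. One possible way to spot it is to compare what the process table says with what systemd says. The sketch below assumes the systemd unit is called `slurmd`; the function name `slurmd_state` is hypothetical.

```shell
# Report whether a slurmd process exists and whether systemd
# considers the unit active. A "process=running unit=inactive"
# result means slurmd was started outside of systemd's control.
# Assumption: the unit is named "slurmd".
slurmd_state() {
    if pgrep -x slurmd >/dev/null 2>&1; then
        proc=running
    else
        proc=stopped
    fi
    if systemctl is-active --quiet slurmd 2>/dev/null; then
        unit=active
    else
        unit=inactive
    fi
    echo "process=$proc unit=$unit"
}
```

Something like `pdsh -a slurmd_state` will not work as-is, because the function only exists in the local shell; the body would have to be saved as a script on the nodes first.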

All the best,
Narcis
 

Riccardo Murri

Mar 30, 2020, 9:03:36 AM
to Narcis Yousefi, elasticluster
Great!  All is well that ends well :-)

Ciao,
R