Error in resize and setup

Narcis Yousefi

Mar 26, 2020, 1:09:08 PM

to elasticluster

Dear Riccardo,
We have a cluster at S3IT, and I now need to enlarge it.

To add new nodes, I ran:
elasticluster resize -a 4:med128compute primrose -t primrose

The nodes I added come in three different sizes, which I added to the config file.

This mostly worked, but the last lines of the output show an error:

-----
2020-03-26 07:54:21 1603ae1b395c gc3.elasticluster[1] ERROR Command `ansible-playbook --private-key=/home/ubuntu/.ssh/id_rsa /home/elasticluster/share/playbooks/main.yml --inventory=/home/ubuntu/.elasticluster/storage/primrose.inventory --become --become-user=root -e @extra_vars.yml` failed with exit code 2.
2020-03-26 07:54:21 1603ae1b395c gc3.elasticluster[1] WARNING The cluster has likely *not* been configured correctly. You may need to re-run `elasticluster setup`.
2020-03-26 07:54:21 1603ae1b395c gc3.elasticluster[1] WARNING Cluster `primrose` not yet configured. Please, re-run `elasticluster setup primrose` and/or check your configuration
Cluster name: primrose
Cluster template: primrose
Default ssh to node: frontend001
- frontend nodes: 1
- smallcompute nodes: 2
- compute nodes: 3
- med128compute nodes: 4

To login on the frontend node, run the command:

elasticluster ssh primrose

To upload or download files to the cluster, use the command:

elasticluster sftp primrose
------

I can see the added nodes in the web interface but not from the frontend node (I restarted Slurm).

And I get the attached error when I run:
elasticluster setup primrose

Could you please have a look and help me fix this error?
Thanks a lot in advance!

All the best,
Narcis

elasticluster_setup_error.pages

elasticluster_setup_error.txt

Riccardo Murri

Mar 26, 2020, 2:33:11 PM

to Narcis Yousefi, elasticluster

Hello Narcis,

> 2020-03-26 07:54:21 1603ae1b395c gc3.elasticluster[1] ERROR Command `ansible-playbook --private-key=/home/ubuntu/.ssh/id_rsa /home/elasticluster/share/playbooks/main.yml --inventory=/home/ubuntu/.elasticluster/storage/primrose.inventory --become --become-user=root -e @extra_vars.yml` failed with exit code 2.

This is the final, top-level error; the root cause would be in the
preceding lines. In this case, since the failure happened while
running Ansible, it would be in Ansible's output.

Looking at the complete output (good that you included it!), it seems
to me that all tasks on `medXXX` nodes fail with this error
(reformatted for readability):

Running installation command 'apt-get install -y python2.7
python-simplejson' ...
E: Could not get lock /var/lib/dpkg/lock - open (11: Resource
temporarily unavailable)
E: Unable to lock the administration directory (/var/lib/dpkg/),
is another process using it?"

This typically happens on Debian/Ubuntu when you try to install a
package (`apt-get install`) while some other package is being
installed or upgraded.

It may be a transient error; just try again and see if it works now.

If it still fails, log into one of the failing nodes:

elasticluster ssh primrose -n med32compute002

then run these commands to see what is still holding the lock:

sudo fuser -v /var/lib/dpkg/lock
ps auxww | egrep 'dpkg|apt'

Thanks,
R

Narcis Yousefi

Mar 26, 2020, 3:07:45 PM

to elasticluster

Dear Riccardo, thanks a lot for your prompt response!

When I run those commands, here is what I get:

ubuntu@primrose-med32compute001:~$ sudo fuser -v /var/lib/dpkg/lock
                     USER        PID ACCESS COMMAND
/var/lib/dpkg/lock:  root       3316 F.... dpkg

ubuntu@primrose-med32compute001:~$ ps auxww | egrep 'dpkg|apt'
root 3223 0.0 0.1 115388 63232 ? S 12:16 0:11 apt-get install -y python2.7 python-simplejson
root 3316 0.0 0.0 23592 4544 pts/1 Ss+ 12:16 0:00 /usr/bin/dpkg --status-fd 43 --configure --pending
root 3328 0.0 0.0 67640 18524 pts/1 S+ 12:17 0:00 /usr/bin/perl -w /usr/share/debconf/frontend /var/lib/dpkg/info/libssl1.1:amd64.postinst configure 1.1.0g-2ubuntu4.1
root 3334 0.0 0.0 4628 1792 pts/1 S+ 12:17 0:00 /bin/sh /var/lib/dpkg/info/libssl1.1:amd64.postinst configure 1.1.0g-2ubuntu4.1
ubuntu 5371 0.0 0.0 17548 1084 pts/3 S+ 19:02 0:00 grep -E --color=auto dpkg|apt

Just to mention that I ran setup several times and updated elasticluster in between.

Best regards,

Narcis

Riccardo Murri

Mar 29, 2020, 10:09:45 AM

to Narcis Yousefi, elasticluster

Hello Narcis,
From the output of `ps auxww`, it looks like the `apt-get install
python2.7 python-simplejson` command triggered an update of some other
package, which requires user interaction. The entire apt/dpkg system
is then blocked waiting for input which will never come.
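
To check for this condition on a node, one can look for debconf frontend processes in the process list. The following is a minimal sketch, not part of ElastiCluster: the function name `debconf_frontend_pids` is made up for illustration, and it assumes the usual `ps auxww` column layout, where the PID is the second field.

```shell
# Hypothetical helper (not part of ElastiCluster): read `ps auxww`-style
# lines on stdin and print the PID (2nd column) of every process whose
# command line mentions the debconf frontend.
debconf_frontend_pids() {
    # The [f] bracket keeps this awk process's own command line from
    # matching the pattern when the input comes from a live `ps`.
    awk '$0 ~ "/usr/share/debconf/[f]rontend" { print $2 }'
}

# Usage on a node:
#   ps auxww | debconf_frontend_pids
```

Fed the `ps` output quoted earlier in this thread, it prints 3328, the PID of the `perl` process running the debconf frontend for the `libssl1.1` post-installation script.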

It's a bug in ElastiCluster (`apt-get` should be called with an
explicit "no interaction" flag); I will commit a fix soon.
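
For illustration, a non-interactive `apt-get` invocation could look like the sketch below. This is not the actual ElastiCluster fix, which is not shown in this thread; the function name and the `DRY_RUN` variable are hypothetical. `DEBIAN_FRONTEND=noninteractive` stops debconf from prompting, and the two `Dpkg::Options` settings let dpkg resolve configuration-file conflicts without asking (take the default action if there is one, otherwise keep the installed version).

```shell
# Hypothetical wrapper (not the actual ElastiCluster change): install
# packages without any interactive prompt from debconf or dpkg.
apt_install_noninteractive() {
    # If DRY_RUN is set to any non-empty value, the command is prefixed
    # with `echo`, so it is printed instead of executed.
    ${DRY_RUN:+echo} sudo env DEBIAN_FRONTEND=noninteractive \
        apt-get install -y \
        -o Dpkg::Options::=--force-confdef \
        -o Dpkg::Options::=--force-confold \
        "$@"
}

# Usage:
#   DRY_RUN=1 apt_install_noninteractive python2.7 python-simplejson
```

Running it with `DRY_RUN=1` first is a cheap way to review the exact command line before letting it touch the package database.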

For the time being, you can:

(1) Kill those processes: from any node in the cluster, issue this command::

    pdsh -a sudo pkill -f /usr/share/debconf/frontend

(2) Resume configuration of pending packages: from any node in the
cluster, issue this command::

    pdsh -a sudo env DEBIAN_FRONTEND=noninteractive dpkg --pending --configure

(3) Resume ElastiCluster setup::

    elasticluster setup primrose

Hope this helps!

Riccardo

Narcis Yousefi

Mar 29, 2020, 3:27:31 PM

to elasticluster

Hey Riccardo,

Thanks a lot! That error is now fixed. I ran setup again; it went through many steps, but I get a new error almost at the end. Please see the attachment.

Now I see all of the added nodes with `sinfo` on the frontend node, but when I ssh to them I get a message saying that these nodes need to be restarted.

I appreciate your help, many thanks!

All the best,

Narcis

elasticluster_setup_error5.pages

elasticluster_setup_error5.txt

Narcis Yousefi

Mar 30, 2020, 9:02:02 AM

to elasticluster

Dear Riccardo,

Just an update. The problem was that slurmd was active on the old nodes but systemctl didn't know about it. It is now resolved, and the new setup was successful. Thanks a lot for your help!
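
A mismatch like this can be spotted by comparing the process list with systemd's view of the service. The sketch below is only an illustration: the helper name is hypothetical, and it assumes that the daemon process and the systemd unit are both called `slurmd`, which may differ on other installations.

```shell
# Hypothetical helper: report whether a process with the given name is
# running and whether systemd considers the given unit active.
# usage: unit_vs_process UNIT PROCESS_NAME
unit_vs_process() {
    unit=$1
    proc=$2
    # pgrep -x matches the exact process name.
    if pgrep -x "$proc" >/dev/null 2>&1; then running=yes; else running=no; fi
    # `systemctl is-active --quiet` exits 0 only if the unit is active.
    if systemctl is-active --quiet "$unit" 2>/dev/null; then active=yes; else active=no; fi
    echo "process=$running unit=$active"
}

# Usage on a compute node (unit and process names are assumptions):
#   unit_vs_process slurmd slurmd
```

An output of `process=yes unit=no` would correspond to the situation described above: the daemon is running, but systemd does not consider the unit active.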

All the best,

Narcis

Riccardo Murri

Mar 30, 2020, 9:03:36 AM

to Narcis Yousefi, elasticluster

Great!  All is well that ends well :-)

Ciao,

R
