The deeper I go into the problem, the worse it seems... but maybe I'm a step closer to the solution.
I discovered that munge was disabled on the nodes (my fault: Gennaro pointed out the problem before, but I re-enabled it only on the master). By the way, it's very strange that the wheezy->jessie upgrade disabled munge on all the nodes and on the master...
Unfortunately, re-enabling munge on the nodes didn't make slurmd start again.
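To double-check that munge itself is now working, I could run the standard round-trip test, e.g. on a node ("master" here is just a placeholder for the controller's hostname):

munge -n | unmunge
munge -n | ssh master unmunge

If the second command reports STATUS: Success, the node's credential is accepted by the master's munged.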
Maybe filling in this setting could give me some info about the problem?
#SlurmdLogFile=
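For example (the log path here is only my guess; the directory matches the /etc/slurm-llnl/ paths that appear in the output below):

SlurmdLogFile=/var/log/slurm-llnl/slurmd.log

and then restarting slurmd on a node and reading that file.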
Thank you very much for your help. It is very precious to me.
betta
PS: some tests I made ->
Running on the nodes
slurmd -Dvvv
returns
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: Considering each NUMA node as a socket
slurmd: debug: CPUs:16 Boards:1 Sockets:4 CoresPerSocket:4 ThreadsPerCore:1
slurmd: Node configuration differs from hardware: CPUs=16:16(hw) Boards=1:1(hw) SocketsPerBoard=16:4(hw) CoresPerSocket=1:4(hw) ThreadsPerCore=1:1(hw)
slurmd: topology NONE plugin loaded
slurmd: Gathering cpu frequency information for 16 cpus
slurmd: debug: Reading cgroup.conf file /etc/slurm-llnl/cgroup.conf
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: Considering each NUMA node as a socket
slurmd: debug: CPUs:16 Boards:1 Sockets:4 CoresPerSocket:4 ThreadsPerCore:1
slurmd: debug: Reading cgroup.conf file /etc/slurm-llnl/cgroup.conf
slurmd: debug: task/cgroup: now constraining jobs allocated cores
slurmd: task/cgroup: loaded
slurmd: debug: spank: opening plugin stack /etc/slurm-llnl/plugstack.conf
slurmd: Munge cryptographic signature plugin loaded
slurmd: Warning: Core limit is only 0 KB
slurmd: slurmd version 14.03.9 started
slurmd: Job accounting gather LINUX plugin loaded
slurmd: debug: job_container none plugin loaded
slurmd: switch NONE plugin loaded
slurmd: slurmd started on Mon, 15 Jan 2018 18:07:17 +0100
slurmd: CPUs=16 Boards=1 Sockets=16 Cores=1 Threads=1 Memory=15999 TmpDisk=40189 Uptime=1254
slurmd: AcctGatherEnergy NONE plugin loaded
slurmd: AcctGatherProfile NONE plugin loaded
slurmd: AcctGatherInfiniband NONE plugin loaded
slurmd: AcctGatherFilesystem NONE plugin loaded
slurmd: debug2: No acct_gather.conf file (/etc/slurm-llnl/acct_gather.conf)
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug: Failed to contact primary controller: Connection refused
[the three lines above repeat several times]
^Cslurmd: got shutdown request
slurmd: waiting on 1 active threads
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug: Failed to contact primary controller: Connection refused
[the three lines above repeat several times]
slurmd: error: Unable to register: Unable to contact slurm controller (connect failure)
slurmd: debug: Unable to register with slurm controller, retrying
slurmd: all threads complete
slurmd: Consumable Resources (CR) Node Selection plugin shutting down ...
slurmd: Munge cryptographic signature plugin unloaded
slurmd: Slurmd shutdown completing
All those "Connection refused" errors are maybe not as bad as they seem: they may only mean that slurmctld is not up on the master, right?
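Apart from the controller connection, the log also complains that the node definition doesn't match the hardware: slurm.conf apparently declares SocketsPerBoard=16 CoresPerSocket=1, while hwloc detects 4 sockets with 4 cores each. If I'm reading it right, the NodeName line should look something like this (the node name is a placeholder):

NodeName=node01 CPUs=16 Boards=1 SocketsPerBoard=4 CoresPerSocket=4 ThreadsPerCore=1

though I guess that mismatch is a separate issue from the connection failures.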
On the master running
service slurmctld restart
returns
Job for slurmctld.service failed. See 'systemctl status slurmctld.service' and 'journalctl -xn' for details.
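Following the message's own suggestion, I suppose the controller's journal might say why it fails to start:

journalctl -xn -u slurmctld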
and
service slurmctld status
returns