I would hazard a guess that DNS is not fully working from, or for, the nodes themselves.
Validate that you can ping the nodes by name from the backup controller, and verify that they are named what DNS says they are. Also validate that you can ping the backup controller from the nodes, using the name it has in the slurm.conf file.
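If it helps, here is a rough sketch of those checks as a small shell helper. It is only an illustration: the function name check_host is made up, it assumes getent and a Linux-style ping (where -W is a timeout in seconds) are available, and you would pass in your own host names.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): check that a host name
# resolves and that the host answers a single ping.
check_host() {
    name=$1

    # Resolve through the system resolver (DNS, /etc/hosts, ...).
    addr=$(getent hosts "$name" 2>/dev/null | awk '{print $1; exit}')
    if [ -z "$addr" ]; then
        echo "FAIL resolve $name"
        return 1
    fi

    # One ping with a 2 second timeout (Linux iputils semantics assumed).
    if ping -c 1 -W 2 "$name" >/dev/null 2>&1; then
        echo "OK $name $addr"
    else
        echo "FAIL ping $name $addr"
        return 2
    fi
}

# Check every host name given on the command line.
for h in "$@"; do
    check_host "$h"
done
```

Saved as, say, check_hosts.sh, it could be run on the backup controller with the node names as arguments, and on each node with the backup controller's name as it appears in slurm.conf. Comparing the printed address with what you expect for each name covers the "named what DNS says" part.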
Also, a quick way to do the failover check is to run, from the backup controller:

    scontrol takeover
Brian Andrus