I've been fighting this for hours on end and the issue still persists.
After the upgrade to Stretch (all updates were successful, with no issues) we rebooted the nodes. At first, some VPSs wouldn't start because the mount point could not be found. We emptied cdrom_image_path and restarted sshfs (it failed on some nodes due to problems with keys and unknown hosts; we had to do some manual work on the cluster keys and root SSH keys because of the version jump and, I assume, an OpenSSH change).
2 out of 8 systems (same config) return the following error:
Wed Jan 10 16:27:12 2018 - ERROR: node vds-1: hypervisor kvm parameter verify failure (source instance system2): Parameter 'cdrom2_image_path' fails validation: not found or not a file or URL (current value: '/mnt/nfs/isos/debian-9.3.0-amd64-netinst.iso'
Wed Jan 10 16:27:12 2018 - ERROR: node vds-2: hypervisor kvm parameter verify failure (source instance system2): Parameter 'cdrom2_image_path' fails validation: not found or not a file or URL (current value: '/mnt/nfs/isos/debian-9.3.0-amd64-netinst.iso'
None of the other nodes report any issues at all.
When I run gnt-instance modify -H cdrom_image_path=/mnt/nfs/isos/win2008r2.iso system2, I get:
Hypervisor parameter validation failed on node vds-1: Parameter 'cdrom_image_path' fails validation: not found or not a file or URL (current value: '/mnt/nfs/isos/win2008r2.iso')
Now, /mnt/nfs/isos is actually mounted over SSHFS. Running a check such as gnt-cluster command file /mnt/nfs/isos/win2008r2.iso returns
/mnt/nfs/isos/win2008r2.iso: ISO 9660 CD-ROM filesystem data 'GRMSXVOL_EN_DVD' (bootable)
on all 8 nodes
It works perfectly on the other systems, and all /etc/fstab settings are the same. The Ganeti version is 2.15 and the cluster is also 2.15:
# gnt-cluster version
Software version: 2.15.2
Internode protocol: 2150000
Configuration format: 2150000
OS api version: 20
Export interface: 0
VCS version: (ganeti) version 2.15.2-7+deb9u1
The Linux version is the same on all nodes: 4.9.0-5-amd64 #1 SMP Debian 4.9.65-3+deb9u2 (2018-01-04) x86_64 GNU/Linux
The Python version is 2.7.14+ on all nodes except one, which has 2.7.13.
I don't even know where to look, because it works everywhere except on the master node and the second node. It fails only in Ganeti, and I find it impossible to untangle Ganeti's opcode structure with its tuple magic. If someone could help me debug this issue, it would be greatly appreciated.
I deeply underestimated the level of frustration this would bring. We cannot use files on 2 out of 8 nodes that were used there before the upgrade. I've restarted Ganeti on all nodes, with the same result. I've also restarted sshfs (umount and mount, checking ps for leftover processes).
When I modify in hypervisor/hv_base.py:
_FILE_OR_URL_CHECK = (lambda x: utils.IsNormAbsPath(x) or utils.IsUrl(x),
                      "must be an absolute normalized path or a URL",
                      lambda x: os.path.isfile(x) or utils.IsUrl(x),
                      "not found or not a file or URL")
to
_FILE_OR_URL_CHECK = (lambda x: True,
                      "must be an absolute normalized path or a URL",
                      lambda x: True,
                      "not found or not a file or URL")
The issue goes away, but this will cause problems in the future. I don't understand why the check is not working as intended.
If I run utils.IsNormAbsPath('/mnt/nfs/isos/win2008r2.iso') and os.path.isfile('/mnt/nfs/isos/win2008r2.iso') in a Python session, I get:
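A less drastic way to investigate, rather than disabling validation, might be to keep the check but make it say why it fails. os.path.isfile() returns False for any OSError raised by the underlying stat call (permission denied, a disconnected FUSE endpoint, and so on), so the real reason never reaches the error message. The sketch below is only an illustration: isfile_verbose is a hypothetical helper name, not part of Ganeti, and where the log output ends up depends on how the daemon's logging is configured.

```python
import errno
import logging
import os
import stat


def isfile_verbose(path):
    """Return the same answer as os.path.isfile(path), but log why a check fails.

    Hypothetical debugging helper, not part of Ganeti.
    """
    try:
        st = os.stat(path)  # follows symlinks, like os.path.isfile
    except OSError as err:
        # os.path.isfile would silently turn this into False
        logging.error("stat(%r) failed as uid=%d euid=%d: [%s] %s",
                      path, os.getuid(), os.geteuid(),
                      errno.errorcode.get(err.errno, err.errno),
                      err.strerror)
        return False
    if not stat.S_ISREG(st.st_mode):
        logging.error("%r exists but is not a regular file (mode %o)",
                      path, st.st_mode)
        return False
    return True
```

Temporarily substituting isfile_verbose(x) for os.path.isfile(x) in the third element of _FILE_OR_URL_CHECK on one of the failing nodes should leave validation behaviour unchanged while recording the errno and the user ID that the daemon process actually sees.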
>>> x = '/mnt/nfs/isos/win2008r2.iso'
>>> ganeti.utils.IsNormAbsPath(x) or ganeti.utils.IsUrl(x), os.path.isfile(x) or ganeti.utils.IsUrl(x)
(True, True)
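One caveat about this interactive test: it runs as my login user in my shell, which is not necessarily the same user or environment as the Ganeti daemon that performs the validation. FUSE mounts such as SSHFS are by default accessible only to the user who mounted them unless an option like allow_other is set, so the mount options are worth comparing between working and failing nodes. This is only one possible explanation, not a confirmed diagnosis. A minimal sketch, with find_mount and report as hypothetical helper names, that looks up the mount entry covering a path in /proc/mounts:

```python
import os


def find_mount(path, mounts_text):
    """Return (mount_point, fstype, options) for the longest mount point
    containing path, or None. mounts_text is the content of /proc/mounts.

    Note: /proc/mounts escapes spaces in paths as \\040; that case is
    not handled here.
    """
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        mount_point, fstype = fields[1], fields[2]
        options = fields[3].split(",")
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            if best is None or len(mount_point) > len(best[0]):
                best = (mount_point, fstype, options)
    return best


def report(path, mounts_file="/proc/mounts"):
    """Print the current user IDs and the mount entry that covers path."""
    with open(mounts_file) as fh:
        info = find_mount(path, fh.read())
    print("uid=%d euid=%d" % (os.getuid(), os.geteuid()))
    print("mount entry: %r" % (info,))
```

Calling report('/mnt/nfs/isos/win2008r2.iso') on a working node and on a failing node, and diffing the output, would show whether options such as allow_other or user_id differ between them.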
I checked the Ganeti logs, but there is nothing useful there, just the text from hv_base.py, which doesn't make sense (if I run the individual checks in a Python session, they all return true except for the URL check). Maybe I'm missing something?