I think it depends on the filesystem type. Lustre generally fails over nicely and handles reconnections without much of a problem. We've done this before without any hitches, even with jobs running live. Generally the jobs just hang and then resume once the filesystem comes back.
On a live system you will end up with a completion storm, because jobs are always exiting. While the filesystem is gone, the jobs that depend on it will just hang, and any that are completing will stall on the completion step. Once the filesystem returns, all that traffic flushes at once. This can create issues where a bunch of nodes get closed due to Kill task fail or other completion flags. Generally these are harmless, though I have seen stuck processes on nodes and have had to reboot them to clear, so you should check any node before putting it back in action.
That said, if you are pausing all the jobs and scheduling, this is
somewhat mitigated, though jobs will still exit due to timeout.
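For the node check mentioned above, one rough way to spot stuck processes is to look for anything sitting in uninterruptible sleep (state D) on the node. The sketch below is only an illustration, assuming a Linux node with /proc mounted; the function names are made up for this example and are not part of any scheduler or Lustre tooling. State D can also be transient under normal I/O, so it is worth rerunning a few times before deciding a node needs a reboot.

```python
import os


def parse_stat_state(stat_text):
    """Return (comm, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the first "(" and the last ")".
    lparen = stat_text.index("(")
    rparen = stat_text.rindex(")")
    comm = stat_text[lparen + 1:rparen]
    # The process state is the first field after the closing parenthesis.
    state = stat_text[rparen + 1:].split()[0]
    return comm, state


def find_stuck_processes(proc_root="/proc"):
    """List (pid, comm) for processes currently in state D."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "sys"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                comm, state = parse_stat_state(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan or the entry is unreadable
        if state == "D":
            stuck.append((int(entry), comm))
    return sorted(stuck)


if __name__ == "__main__":
    for pid, comm in find_stuck_processes():
        print(pid, comm)
```

If this prints the same PIDs on repeated runs after the filesystem is back, that node is a candidate for a reboot before it goes back into service.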
-Paul Edmon-