Has anyone successfully run FDS MPI on Windows OS based multi-nodes


Chermac Rolle

Jan 28, 2021, 5:28:33 PM
to FDS and Smokeview Discussions
Hi all,

I am seeking advice/anecdotal evidence on the above being accomplished.

For context, I have already done the following:
  • Reviewed Section 3.2 of the FDS User's Manual.
  • Searched the group discussions here using search terms such as 'Windows', 'MPI', 'cluster'.
  • Reviewed the Intel MPI Library guides for Windows OS, Running Applications and Developer Reference (especially the Command Reference pages).
  • Undertaken some trial and error to better understand the command line options, first mimicking the recommended test_mpi procedure and then trying to run a simple two-mesh, two-process simulation.
I performed these trials on two physical machines running Windows 10 Pro 20H2. I can confirm the machines can communicate via the shared domain (ping, remote management, 'hello world' returned via test_mpi). They both have access to a read/write network share. [Using FDS6.7.0-0-g5ccea76-master]

I have tried varying the command line options using test_mpi and the Intel guides, successfully returning the 'hello world' with different option combinations.

I then tried to run my simple simulation but had no success after a couple of hours of attempts. I added -verbose output from hydra, which indicates that the calls to execute fds prepend the UNC path to the input file in the arguments. This is understandably a problem for Windows, and mpiexec fails with an error about the input file not being in the working directory.

I have tried assigning the network share to the same drive letter on both machines as well as using the hydra options -map and -mapall. The UNC expansion still occurs.

Has anyone noticed and maybe solved this behaviour?

Noting the many discussion threads stating it would be better to attempt this on *nix, I made an attempt at repeating this exercise using WSL [FDS6.7.5-0-g71f0256-release]. I got a failure regarding the 'slots' option not being available at this time. I assume this is a WSL related shortcoming due to my use case as I have successfully run fds mpi on both of these machines under WSL individually.

Has anyone attempted/made this work before?

Windows is needed on these machines. A dual boot may be possible but is not preferred, as these are production machines. My next option will be to attempt creating a virtual *nix multi-node cluster via Hyper-V. But before I go down that road, I was hoping to crowdsource the experiences of others. If anyone has successfully used VM clusters on Windows, I would love to hear about it.

A bit of a long post, but any advice or points-in-the-right-direction would be appreciated. Thanks.

Kevin McGrattan

Jan 29, 2021, 9:35:19 AM
to fds...@googlegroups.com
Does the test_mpi program produce a "Hello World" string from multiple computers? If so, that is a good sign. If you then have trouble with FDS, the problem is most likely due to the shared directory. Is there any error message? Can you try to run a case where the working directory is on the computer that invokes the mpiexec command?

Chermac Rolle

Jan 29, 2021, 1:14:55 PM
to FDS and Smokeview Discussions
Hi Kevin,

I am working on THIS-HOST with:

 Revision         : FDS6.7.0-0-g5ccea76-master
 Revision Date    : Mon Jun 25 13:03:23 2018 -0400
 Compiler         : Intel ifort 18.0.2.185
 Compilation Date : Tue 06/26/2018  10:49 AM

The command: mpiexec -hosts 2 THIS-HOST 1 THAT-HOST 1 test_mpi

Returns:

 Hello world: rank            0  of            2  running on
 THIS-HOST

 Hello world: rank            1  of            2  running on
 THAT-HOST


I then attempt to run using just one host and have now launched: mpiexec -hosts 1 THIS-HOST 2 fds Speed-Test-02MSH-02MPI-08OMP.fds

This successfully runs the file when launched from a local folder [C:\path\to\folder]! Promising.

A parent folder [C:\path\to] is shared on the network and mapped on THIS-HOST to [M:\] and on THAT-HOST to its [M:\].

I have cd into [M:\folder] and run again: mpiexec -hosts 1 THIS-HOST 2 fds Speed-Test-02MSH-02MPI-08OMP.fds

I get the following error:

 Reading input file ...
forrtl: severe (9): permission to access file denied, unit 11, file C:\Speed-Test-02MSH-02MPI-08OMP.sinfo
Image              PC                Routine            Line        Source
fds.exe            00007FF7DCF1E507  Unknown               Unknown  Unknown
fds.exe            00007FF7DCEC190E  Unknown               Unknown  Unknown
fds.exe            00007FF7E4E1C412  Unknown               Unknown  Unknown
fds.exe            00007FF7E460ED2C  MAIN__                    125  main.f90
fds.exe            00007FF7E45F8632  Unknown               Unknown  Unknown
fds.exe            00007FF7E4F07A80  Unknown               Unknown  Unknown
KERNEL32.DLL       00007FFE15937034  Unknown               Unknown  Unknown
ntdll.dll          00007FFE1735D0D1  Unknown               Unknown  Unknown

Interestingly, the file has been identified as living on the local C:\ drive, but strangely the program seems to think it is at the root of the drive instead of in [C:\path\to\folder].

I can confirm that [C:\Speed-Test-02MSH-02MPI-08OMP.sinfo] does not exist but [C:\path\to\folder\Speed-Test-02MSH-02MPI-08OMP.sinfo] does.

So the following was tried, with cmd at [M:\folder]:

mpiexec -hosts 1 THIS-HOST 2 -wdir %CD% fds Speed-Test-02MSH-02MPI-08OMP.fds - Same error

mpiexec -hosts 1 THIS-HOST 2 -wdir  M:\folder  fds Speed-Test-02MSH-02MPI-08OMP.fds - Same error

mpiexec -hosts 1 THIS-HOST 2 -wdir  C:\path\to\folder  fds Speed-Test-02MSH-02MPI-08OMP.fds - Runs successfully!

Following a theory, I have moved the working folder over to another physical drive on THIS-HOST [D:\path\to\folder].

cd /D D:\path\to\folder
mpiexec -hosts 1 THIS-HOST 2 fds Speed-Test-02MSH-02MPI-08OMP.fds

This runs successfully!

There is no easy way (if any - parallel SCSI?!) for both hosts to share the same drive without network sharing; so would this mean that a Windows OS based multi-node approach is unattainable? The thinking behind my last test was to map physical drive D: on THIS-HOST to network drive D: on THAT-HOST (my current configuration means this must be planned out). But the failure with -wdir M:\folder leaves me thinking this too would fail.

Would you be able to suggest anything else I should try?

Any insight would help. Thank you.



Kevin McGrattan

Jan 29, 2021, 1:56:14 PM
to fds...@googlegroups.com
I would not refer to any folder using the DOS letters, C, D, M, etc. I would try

-wdir \\THIS-HOST\Users\...

THAT-HOST may not recognize M:\blahblah.  Try using directory names that are meaningful over the entire domain network.
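
As a rough illustration of the suggestion above, the following Python sketch rewrites a mapped-drive path as a UNC path before it is handed to -wdir. The drive-to-share table is an assumption that mirrors the anonymized paths used in this thread (M:\ pointing at \\THIS-HOST\path\to); the function name to_unc is hypothetical and not part of FDS or Intel MPI.

```python
from pathlib import PureWindowsPath

# Assumed mapping from drive letters to the network share they point at.
# These values only mirror the placeholder paths used in this thread.
DRIVE_TO_SHARE = {"M:": r"\\THIS-HOST\path\to"}


def to_unc(path, drive_map=DRIVE_TO_SHARE):
    """Return the UNC form of a mapped-drive path, or the path unchanged
    if its drive letter is not in drive_map."""
    p = PureWindowsPath(path)
    share = drive_map.get(p.drive.upper())
    if share is None:
        return str(p)
    # p.parts[0] is the drive anchor (e.g. 'M:\\'); the rest are folder names.
    return str(PureWindowsPath(share, *p.parts[1:]))


if __name__ == "__main__":
    print(to_unc(r"M:\folder"))
```

PureWindowsPath only manipulates the text of the path, so this can be checked on any machine; it does not verify that the share actually exists or is reachable from the other host.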


Chermac Rolle

Jan 29, 2021, 3:20:23 PM
to FDS and Smokeview Discussions
This works: mpiexec -hosts 1 THIS-HOST 2 -wdir \\THIS-HOST\path\to\folder  fds Speed-Test-02MSH-02MPI-08OMP.fds

Getting the single-machine run working with a UNC path is progress, so thank you.

Then this hangs (like it's doing something), and returns nothing on completion of execution: mpiexec -hosts 2 THIS-HOST 1 THAT-HOST 1 -wdir \\THIS-HOST\path\to\folder fds Speed-Test-02MSH-02MPI-08OMP.fds
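
The commands tried so far all follow the same pattern (-hosts, a host count, host/process pairs, -wdir, then the executable and input file), so a small helper can assemble them consistently while testing combinations. This is only a sketch: build_mpiexec is a hypothetical name, and it uses only the options that already appear in this thread.

```python
def build_mpiexec(hosts, wdir, case, exe="fds"):
    """Assemble an mpiexec argument list.

    hosts: list of (hostname, process_count) pairs
    wdir:  working directory passed to -wdir (a UNC path in this thread)
    case:  name of the FDS input file
    """
    cmd = ["mpiexec", "-hosts", str(len(hosts))]
    for name, nprocs in hosts:
        cmd += [name, str(nprocs)]
    cmd += ["-wdir", wdir, exe, case]
    return cmd


if __name__ == "__main__":
    args = build_mpiexec(
        [("THIS-HOST", 1), ("THAT-HOST", 1)],
        r"\\THIS-HOST\path\to\folder",
        "Speed-Test-02MSH-02MPI-08OMP.fds",
    )
    print(" ".join(args))
```

The returned list could be passed to subprocess.run on the launching machine; the sketch itself only prints the command and launches nothing.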

I have additionally tried other combinations of options (comma separated -hosts, using the local partition [:] to provide a -wdir to each host explicitly) but no joy.

Therefore, I have added -v to the above command that hung, to provide more insight. Note that edits have been made to the output for privacy and brevity.

host: THIS-HOST
host: THAT-HOST

==================================================================================================
mpiexec options:
----------------
  Base path: c:\Program Files\firemodels\FDS6\bin\
  Launcher: service
  Debug level: 1
  Enable X: -1

  Global environment:
  -------------------
  <ENV from THIS-HOST>
.
  COMPUTERNAME=THIS-HOST
.
  LOGONSERVER=\\THIS-HOST
  NUMBER_OF_PROCESSORS=16 [THAT-HOST has 32, so definitely THIS-HOST]
.
  Path=<THIS-HOST>
.
.
.

  Hydra internal environment:
  ---------------------------
    MPIR_CVAR_NEMESIS_ENABLE_CKPOINT=1
    GFORTRAN_UNBUFFERED_PRECONNECTED=y
    I_MPI_HYDRA_UUID=1ce81c0000-82d4e582-efe54aef-314a5e31

  Intel(R) MPI Library specific variables:
  ----------------------------------------
    I_MPI_HYDRA_UUID=1ce81c0000-82d4e582-efe54aef-314a5e31


    Proxy information:
    *********************
      [1] proxy: THIS-HOST (1 cores)
      Exec list: fds (1 processes);

      [2] proxy: THAT-HOST (1 cores)
      Exec list: fds (1 processes);


==================================================================================================

[mpiexec@THIS-HOST] Timeout set to -1 (-1 means infinite)
[mpiexec@THIS-HOST] Got a control port string of THIS-HOST:61238 

Proxy launch args: c:\Program Files\firemodels\FDS6\bin\pmi_proxy --control-port THIS-HOST:61238 --debug --pmi-connect alltoall --pmi-aggregate -s 0 --rmk user --launcher service --demux select --pgid 0 --enable-stdin 1 --retries 10 --control-code 24203 --usize -2 --proxy-id

Arguments being passed to proxy 0:
--version 3.2 --iface-ip-env-name MPIR_CVAR_CH3_INTERFACE_HOSTNAME --hostname THIS-HOST <options> <ENV from THIS-HOST>

Arguments being passed to proxy 1:
--version 3.2 --iface-ip-env-name MPIR_CVAR_CH3_INTERFACE_HOSTNAME --hostname THAT-HOST <options> <ENV from THIS-HOST> <<< contains COMPUTERNAME=THIS-HOST and NUMBER_OF_PROCESSORS=16

[mpiexec@ THIS-HOST] Launch arguments: c:\Program Files\firemodels\FDS6\bin\pmi_proxy --control-port  THIS-HOST:61238 --debug --pmi-connect alltoall --pmi-aggregate -s 0 --rmk user --launcher service --demux select --pgid 0 --enable-stdin 1 --retries 10 --control-code 24203 --usize -2 --proxy-id 0
[mpiexec@ THIS-HOST] Launch arguments: c:\Program Files\firemodels\FDS6\bin\pmi_proxy --control-port  THIS-HOST:61238 --debug --pmi-connect alltoall --pmi-aggregate -s 0 --rmk user --launcher service --demux select --pgid 0 --enable-stdin 1 --retries 10 --control-code 24203 --usize -2 --proxy-id 1
[mpiexec@ THIS-HOST] STDIN will be redirected to 1 fd(s): 4
[mpiexec@ THIS-HOST] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec@ THIS-HOST] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec@ THIS-HOST] PMI response to fd 668 pid 540: cmd=barrier_out
[mpiexec@ THIS-HOST] PMI response to fd 564 pid 540: cmd=barrier_out
[mpiexec@ THIS-HOST] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec@ THIS-HOST] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec@ THIS-HOST] PMI response to fd 668 pid 540: cmd=barrier_out
[mpiexec@ THIS-HOST] PMI response to fd 564 pid 540: cmd=barrier_out
[mpiexec@ THIS-HOST] [pgid: -1] got PMI command: cmd=put kvsname=kvs_7400_0 key=P0-businesscard-0 value=description# THIS-HOST$port#61261$ifname# THIS-HOST$fabrics_list#tcp$
[mpiexec@ THIS-HOST] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec@ THIS-HOST] [pgid: -1] got PMI command: cmd=put kvsname=kvs_7400_0 key=P1-businesscard-0 value=description# THAT-HOST$port#50310$ifname# THAT-HOST$fabrics_list#tcp$
[mpiexec@ THIS-HOST] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec@ THIS-HOST] PMI response to fd 668 pid 540: cmd=barrier_out
[mpiexec@ THIS-HOST] PMI response to fd 564 pid 540: cmd=barrier_out
[mpiexec@ THIS-HOST] [pgid: 0] got PMI command: cmd=abort exitcode=69777679
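
When comparing several long -v runs like the one above, it may help to pull out just the proxy list and any abort command. The sketch below does that with two regular expressions written against the line formats shown in this output; it is an assumption that other Intel MPI versions print the same formats, and summarize is a hypothetical helper name.

```python
import re

# Patterns written against the lines shown in this thread's -v output.
PROXY_RE = re.compile(r"\[(\d+)\] proxy: (\S+) \((\d+) cores\)")
ABORT_RE = re.compile(r"cmd=abort exitcode=(-?\d+)")


def summarize(log_text):
    """Return the proxies listed in the log and the abort exit code, if any."""
    proxies = [(int(i), host, int(cores))
               for i, host, cores in PROXY_RE.findall(log_text)]
    match = ABORT_RE.search(log_text)
    return {
        "proxies": proxies,
        "abort_exitcode": int(match.group(1)) if match else None,
    }


if __name__ == "__main__":
    sample = (
        "      [1] proxy: THIS-HOST (1 cores)\n"
        "      [2] proxy: THAT-HOST (1 cores)\n"
        "[mpiexec@ THIS-HOST] [pgid: 0] got PMI command: cmd=abort exitcode=69777679\n"
    )
    print(summarize(sample))
```

This only extracts what the log already says; it does not interpret the exit code.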


It appears there is an issue with ENV being forwarded from THIS-HOST(??) to THAT-HOST rather than using that machine's ENV, so I have attempted using -genvnone and -envnone options in case this is causing conflict. Unfortunately the variables are still passed.

Has this behavior been seen before? I assume in the *nix world when using a scheduler like Slurm, this is handled by the manager? Would I need to construct some elaborate -env options?

Kevin McGrattan

Jan 29, 2021, 4:40:14 PM
to fds...@googlegroups.com
Do both THIS-HOST and THAT-HOST have the same version of FDS installed, and the same version of Windows? If you login to THAT-HOST, can you cd to the -wdir and read and write files? Use the exact same dir name and path.
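
The read/write check described above can also be scripted so it is easy to repeat on each host. The following is a minimal sketch, assuming Python is available on both machines; can_read_write is a hypothetical helper, and it only tests access for the account that runs it, which may differ from the account the hydra service uses.

```python
import os
import tempfile


def can_read_write(directory):
    """Return True if a temporary file can be created, written, and read
    back in the given directory; False on any OS-level error."""
    try:
        with tempfile.NamedTemporaryFile(dir=directory, prefix="rw_check_") as fh:
            fh.write(b"ok")
            fh.flush()
            fh.seek(0)
            return fh.read() == b"ok"
    except OSError:
        return False


if __name__ == "__main__":
    # Replace with the exact path passed to -wdir, e.g. the UNC path.
    target = os.getcwd()
    print(target, "read/write OK" if can_read_write(target) else "NOT accessible")
```

The temporary file is removed automatically when the with block exits, so the check leaves nothing behind in the working directory.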

Chermac Rolle

Jan 29, 2021, 4:55:07 PM
to FDS and Smokeview Discussions
Hi Kevin,

I've just logged in via RDP and can confirm yes to all the above.

While logged in, and cd into M:\ on THAT-HOST, mpiexec -hosts 1 THAT-HOST 2 -wdir \\THIS-HOST\path\to\folder  fds Speed-Test-02MSH-02MPI-08OMP.fds runs successfully.

Kevin McGrattan

Jan 29, 2021, 5:11:19 PM
to fds...@googlegroups.com
Do you have PyroSim installed, or any other application that involves MPI? The hydra service is very finicky. Other than that, I've run out of ideas. I use a linux cluster to run MPI jobs. My luck running FDS across our Windows domain network at work is about 50%. What usually will work is if you have the exact same computers set up in the exact same way on the exact same subnet. Even then, it's not perfect, and I've never figured out why. My support tickets to Intel are usually ignored, I suspect because even they don't really know all the various ways that a Windows domain network can be configured. This is why people typically use linux compute clusters that are dedicated to high performance computing.

Chermac Rolle

Jan 29, 2021, 5:30:31 PM
to FDS and Smokeview Discussions
I have had PyroSim installed on THIS-HOST previously, as well as BlueCAPE's OpenFOAM come to think of it, so I'll do some digging to make sure there aren't any left-behinds from them. The Windows 10 OS versions have been updated since but never a fresh re-install.

I appreciate you taking the time to throw some ideas at me. After spending some time going through the group, I didn't find much in the way of a solution to this on here, so if anything this will hopefully save the next person some time when debugging this problem.

I now generally use WSL to run FDS or FOAM if I need *nix based simulating, but after my initial MPI fails yesterday I will now attempt using some Ubuntu VMs.

Thanks again for your time, and if I get a good working solution I'll be sure to post it.

Regards,

Chermac
