Greetings,
I am investigating BeeGFS 7.0 for use as a high-performance parallel scratch space for HPC jobs. One of the tests I am using is the io-500 benchmark suite. Its ior_hard test uses IOR to measure how well the file system handles writing and reading a single shared file with many small, irregular transfers. This is intended to be a "worst case scenario" test.
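For reference, here is a minimal Python sketch of the access pattern as I understand it (not IOR's actual code): each rank writes one 47008-byte record per segment into the shared file, and records are interleaved rank-by-rank within each segment. The function name and the example rank are mine, just for illustration.

# Sketch of the ior_hard access pattern as I understand it: every rank writes
# 47008-byte records into one shared file, interleaved by segment.
XFER = 47008          # -t / -b: one 47008-byte transfer per segment per rank

def file_offset(rank, segment, num_tasks, xfer=XFER):
    """Byte offset of a rank's record in the shared file
    (segmented layout: segment-major, then rank)."""
    return (segment * num_tasks + rank) * xfer

# Example: with 88 tasks, rank 3's first few records land at
for seg in range(3):
    print(file_offset(3, seg, 88))   # 141024, 4277728, 8414432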
Using 1 MPI process per client node, I get excellent write results (3.3 GB/s). But using 4 MPI processes per client node, write performance is significantly worse (0.39 GB/s), even though the file is smaller. Reads do not show any degradation in performance.
I have 22 clients and 11 servers. Clients and servers are on separate machines. Servers are running both storage and metadata services.
Metadata targets are two SATA SSDs per server attached to a RAID controller in a RAID-1 configuration. Storage targets are 12 SAS HDDs per server attached to the same RAID controller in a RAID-6 configuration.
The filesystem is mounted over FDR InfiniBand, and beegfs-net shows that RDMA support is enabled and working.
IOR is launched via OpenMPI 1.10.7 running over TCP on 10 Gb/s Ethernet, with round-robin process mapping (--map-by node).
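To illustrate what that mapping means for process placement, here is a small sketch under the assumption that --map-by node deals ranks out round-robin across my 22-node machinefile (the helper name is mine, not an Open MPI API):

# Round-robin placement of MPI ranks across the 22 client nodes.
NUM_NODES = 22

def node_of_rank(rank, num_nodes=NUM_NODES):
    return rank % num_nodes

# -np 22 -> one rank per node; -np 88 -> four ranks per node
print([node_of_rank(r) for r in range(4)])         # [0, 1, 2, 3]
print([node_of_rank(r) for r in (0, 22, 44, 66)])  # [0, 0, 0, 0]: node 0 hosts ranks 0, 22, 44, 66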
A couple of observations:
- When running with multiple processes per client node, the larger the file gets, the worse the performance gets. At 50000 transfers, the bandwidth is ~0.07 GB/s (see the file-size arithmetic after this list). The same level of degradation is not seen when running with one process per client node.
- When running with multiple processes per client node, the file transfers very quickly at first, then slows down significantly toward the end. The last 1% of the file takes longer to transfer than the first 99%.
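For context on the first observation, here is the file-size arithmetic (a quick sketch using the 47008-byte transfer size and 88 tasks; the 25000-segment value matches the aggregate size reported in the log below):

# Aggregate file size for the 88-task runs at different segment counts.
XFER, TASKS = 47008, 88

def aggregate_bytes(segments, tasks=TASKS, xfer=XFER):
    return segments * tasks * xfer

print(aggregate_bytes(25000))   # 103417600000 bytes, ~96.3 GiB (matches the -s 25000 run below)
print(aggregate_bytes(50000))   # 206835200000 bytes, ~192.6 GiB, where write bandwidth drops to ~0.07 GB/s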
Does anyone know what could be causing this? Is this a bug? If so, are there any workarounds?
Thanks,
Jim
P.S. The results of the tests are below:
+ mkdir -p /mnt/beegfs/jburto2/ior_hard
+ sudo beegfs-ctl --setpattern --numtargets=11 --chunksize=512k /mnt/beegfs/jburto2/ior_hard
New chunksize: 524288
New number of storage targets: 11
+ /usr/lib64/openmpi/bin/mpirun -np 22 --mca btl_tcp_if_exclude em1,ib0 --mca btl self,tcp --map-by node --machinefile /home/jburto2/pvfsnodelistmpi --prefix /usr/lib64/openmpi /home/jburto2/io-500-dev/bin/ior -w -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -s 500000 -o /mnt/beegfs/jburto2/ior_hard/IOR_file-22-512
IOR-3.1.0: MPI Coordinated Test of Parallel I/O
ior WARNING: fsync() only available in POSIX/MMAP. Using value of 0.
Began: Fri Jul 13 12:55:57 2018
Command line used: /home/jburto2/io-500-dev/bin/ior -w -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -s 500000 -o /mnt/beegfs/jburto2/ior_hard/IOR_file-22-512
Test 0 started: Fri Jul 13 12:55:57 2018
Summary:
api = MPIIO (version=3, subversion=0)
test filename = /mnt/beegfs/jburto2/ior_hard/IOR_file-22-512
access = single-shared-file
ordering in a file = sequential offsets
ordering inter file= constant task offsets = 1
clients = 22 (1 per node)
repetitions = 1
xfersize = 47008 bytes
blocksize = 47008 bytes
aggregate filesize = 481.58 GiB
access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
------ --------- ---------- --------- -------- -------- -------- -------- ----
write 3320.87 45.91 45.91 0.736730 147.65 0.113621 148.50 0
Max Write: 3320.87 MiB/sec (3482.19 MB/sec)
Summary of all tests:
Operation Max(MiB) Min(MiB) Mean(MiB) StdDev Max(OPs) Min(OPs) Mean(OPs) StdDev Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
write 3320.87 3320.87 3320.87 0.00 74076.48 74076.48 74076.48 0.00 148.49517 0 22 1 1 0 1 1 0 0 500000 47008 47008 517088000000 MPIIO 0
Finished: Fri Jul 13 12:58:25 2018
+ /usr/lib64/openmpi/bin/mpirun -np 88 --mca btl_tcp_if_exclude em1,ib0 --mca btl self,tcp --map-by node --machinefile /home/jburto2/pvfsnodelistmpi --prefix /usr/lib64/openmpi /home/jburto2/io-500-dev/bin/ior -w -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -s 25000 -o /mnt/beegfs/jburto2/ior_hard/IOR_file-88-512
IOR-3.1.0: MPI Coordinated Test of Parallel I/O
ior WARNING: fsync() only available in POSIX/MMAP. Using value of 0.
Began: Fri Jul 13 12:58:27 2018
Command line used: /home/jburto2/io-500-dev/bin/ior -w -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -s 25000 -o /mnt/beegfs/jburto2/ior_hard/IOR_file-88-512
Test 0 started: Fri Jul 13 12:58:27 2018
Summary:
api = MPIIO (version=3, subversion=0)
test filename = /mnt/beegfs/jburto2/ior_hard/IOR_file-88-512
access = single-shared-file
ordering in a file = sequential offsets
ordering inter file= constant task offsets = 1
clients = 88 (4 per node)
repetitions = 1
xfersize = 47008 bytes
blocksize = 47008 bytes
aggregate filesize = 96.32 GiB
access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
------ --------- ---------- --------- -------- -------- -------- -------- ----
write 388.56 45.91 45.91 0.356252 253.42 0.049576 253.82 0
Max Write: 388.56 MiB/sec (407.44 MB/sec)
Summary of all tests:
Operation Max(MiB) Min(MiB) Mean(MiB) StdDev Max(OPs) Min(OPs) Mean(OPs) StdDev Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
write 388.56 388.56 388.56 0.00 8667.42 8667.42 8667.42 0.00 253.82403 0 88 4 1 0 1 1 0 0 25000 47008 47008 103417600000 MPIIO 0
Finished: Fri Jul 13 13:02:41 2018
....
+ /usr/lib64/openmpi/bin/mpirun -np 22 --mca btl_tcp_if_exclude em1,ib0 --mca btl self,tcp --map-by node --machinefile /home/jburto2/pvfsnodelistmpi --prefix /usr/lib64/openmpi /home/jburto2/io-500-dev/bin/ior -r -R -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -s 500000 -o /mnt/beegfs/jburto2/ior_hard/IOR_file-22-512
IOR-3.1.0: MPI Coordinated Test of Parallel I/O
ior WARNING: fsync() only available in POSIX/MMAP. Using value of 0.
Began: Fri Jul 13 14:22:15 2018
Command line used: /home/jburto2/io-500-dev/bin/ior -r -R -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -s 500000 -o /mnt/beegfs/jburto2/ior_hard/IOR_file-22-512
Test 0 started: Fri Jul 13 14:22:15 2018
Summary:
api = MPIIO (version=3, subversion=0)
test filename = /mnt/beegfs/jburto2/ior_hard/IOR_file-22-512
access = single-shared-file
ordering in a file = sequential offsets
ordering inter file= constant task offsets = 1
clients = 22 (1 per node)
repetitions = 1
xfersize = 47008 bytes
blocksize = 47008 bytes
aggregate filesize = 481.58 GiB
access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
------ --------- ---------- --------- -------- -------- -------- -------- ----
read 2948.62 45.91 45.91 0.007031 167.23 0.010853 167.24 0
Max Read: 2948.62 MiB/sec (3091.86 MB/sec)
Summary of all tests:
Operation Max(MiB) Min(MiB) Mean(MiB) StdDev Max(OPs) Min(OPs) Mean(OPs) StdDev Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
read 2948.62 2948.62 2948.62 0.00 65772.97 65772.97 65772.97 0.00 167.24194 0 22 1 1 0 1 1 0 0 500000 47008 47008 517088000000 MPIIO 0
Finished: Fri Jul 13 14:25:03 2018
+ /usr/lib64/openmpi/bin/mpirun -np 88 --mca btl_tcp_if_exclude em1,ib0 --mca btl self,tcp --map-by node --machinefile /home/jburto2/pvfsnodelistmpi --prefix /usr/lib64/openmpi /home/jburto2/io-500-dev/bin/ior -r -R -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -s 25000 -o /mnt/beegfs/jburto2/ior_hard/IOR_file-88-512
IOR-3.1.0: MPI Coordinated Test of Parallel I/O
ior WARNING: fsync() only available in POSIX/MMAP. Using value of 0.
Began: Fri Jul 13 14:25:05 2018
Command line used: /home/jburto2/io-500-dev/bin/ior -r -R -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -s 25000 -o /mnt/beegfs/jburto2/ior_hard/IOR_file-88-512
Test 0 started: Fri Jul 13 14:25:05 2018
Summary:
api = MPIIO (version=3, subversion=0)
test filename = /mnt/beegfs/jburto2/ior_hard/IOR_file-88-512
access = single-shared-file
ordering in a file = sequential offsets
ordering inter file= constant task offsets = 1
clients = 88 (4 per node)
repetitions = 1
xfersize = 47008 bytes
blocksize = 47008 bytes
aggregate filesize = 96.32 GiB
access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
------ --------- ---------- --------- -------- -------- -------- -------- ----
read 4254 45.91 45.91 0.020689 23.16 0.006642 23.19 0
Max Read: 4253.75 MiB/sec (4460.38 MB/sec)
Summary of all tests:
Operation Max(MiB) Min(MiB) Mean(MiB) StdDev Max(OPs) Min(OPs) Mean(OPs) StdDev Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
read 4253.75 4253.75 4253.75 0.00 94885.46 94885.46 94885.46 0.00 23.18585 0 88 4 1 0 1 1 0 0 25000 47008 47008 103417600000 MPIIO 0
Finished: Fri Jul 13 14:25:28 2018