beegfs single-shared file writes with multiple MPI processes per client (ior_hard)


James Burton

Jul 13, 2018, 2:29:06 PM
to fhgfs...@googlegroups.com
Greetings,

I am investigating BeeGFS 7.0 for use as a high-performance parallel scratch space for HPC jobs. One of the tests I am using is the io-500 benchmark suite. The ior_hard test uses ior to measure how well the file system handles writing and reading a single shared file with many small, irregular writes; it is intended as a "worst case scenario" test.
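
For context, the ior_hard invocation in the logs below boils down to the following (the annotations are mine):

ior -w -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -s 500000 -o <shared file>
  # -t/-b 47008 : every transfer is a small, odd-sized 47008-byte write
  # -s 500000   : segments per task, so 22 tasks x 500000 x 47008 B ~= 482 GiB aggregate (-s 25000 in the 88-process runs)
  # -w          : write phase; the read runs use -r -R instead
  # -k          : keep the file so the read phase can reuse it
  # -e          : fsync after the write phase (ignored for MPIIO, per the warning in the output)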

Using 1 MPI process per client node, I get excellent results (3.3 GB/s). But using 4 MPI processes per client node, performance is significantly worse (0.39 GB/s), even though the aggregate file is smaller. Reads do not show any degradation in performance.

I have 22 clients and 11 servers. Clients and servers are on separate machines. Servers are running both storage and metadata services.

Metadata targets are two SATA SSDs per server attached to a RAID controller running in RAID-1 configuration. Storage targets are 12 SAS HDDs per server attached to the same RAID controller running in a RAID-6 configuration.

The filesystem is mounted over FDR InfiniBand, and beegfs-net shows that RDMA support is enabled and working.
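
In case it is useful, a couple of commands that show this (exact options and output vary by version):

beegfs-net                                               # per-server connection list on a client; RDMA entries confirm the IB path is in use
beegfs-ctl --listnodes --nodetype=storage --nicdetails   # NICs exported by each storage server, including the RDMA-capable ones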


IOR is launched via OpenMPI 1.10.7 running over TCP on 10 Gb/s Ethernet, with round-robin process mapping.

A couple of observations:
  1. When running with multiple processes per client node, performance gets worse as the file grows. At 50000 transfers, the bandwidth is ~0.07 GB/s. The same degradation is not seen when running with one process per client node.
  2. When running with multiple processes per client node, the file transfers very quickly at first, then slows down significantly toward the end. The last 1% of the file takes longer to transfer than the first 99%.
Does anyone know what could be causing this? Is this a bug? If so, are there any workarounds?

Thanks,

Jim

P.S. The results of the test are below:


+ mkdir -p /mnt/beegfs/jburto2/ior_hard
+ sudo beegfs-ctl --setpattern --numtargets=11 --chunksize=512k /mnt/beegfs/jburto2/ior_hard
New chunksize: 524288
New number of storage targets: 11

+ /usr/lib64/openmpi/bin/mpirun -np 22 --mca btl_tcp_if_exclude em1,ib0 --mca btl self,tcp --map-by node --machinefile /home/jburto2/pvfsnodelistmpi --prefix /usr/lib64/openmpi /home/jburto2/io-500-dev/bin/ior -w -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -s 500000 -o /mnt/beegfs/jburto2/ior_hard/IOR_file-22-512
IOR-3.1.0: MPI Coordinated Test of Parallel I/O

ior WARNING: fsync() only available in POSIX/MMAP.  Using value of 0.
Began: Fri Jul 13 12:55:57 2018
Command line used: /home/jburto2/io-500-dev/bin/ior -w -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -s 500000 -o /mnt/beegfs/jburto2/ior_hard/IOR_file-22-512

Test 0 started: Fri Jul 13 12:55:57 2018
Summary:
api                = MPIIO (version=3, subversion=0)
test filename      = /mnt/beegfs/jburto2/ior_hard/IOR_file-22-512
access             = single-shared-file
ordering in a file = sequential offsets
ordering inter file= constant task offsets = 1
clients            = 22 (1 per node)
repetitions        = 1
xfersize           = 47008 bytes
blocksize          = 47008 bytes
aggregate filesize = 481.58 GiB

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ---------- ---------  --------   --------   --------   --------   ----
write     3320.87    45.91      45.91      0.736730   147.65     0.113621   148.50     0

Max Write: 3320.87 MiB/sec (3482.19 MB/sec)

Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
write        3320.87    3320.87    3320.87       0.00   74076.48   74076.48   74076.48       0.00  148.49517 0 22 1 1 0 1 1 0 0 500000 47008 47008 517088000000 MPIIO 0

Finished: Fri Jul 13 12:58:25 2018
+ /usr/lib64/openmpi/bin/mpirun -np 88 --mca btl_tcp_if_exclude em1,ib0 --mca btl self,tcp --map-by node --machinefile /home/jburto2/pvfsnodelistmpi --prefix /usr/lib64/openmpi /home/jburto2/io-500-dev/bin/ior -w -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -s 25000 -o /mnt/beegfs/jburto2/ior_hard/IOR_file-88-512
IOR-3.1.0: MPI Coordinated Test of Parallel I/O

ior WARNING: fsync() only available in POSIX/MMAP.  Using value of 0.
Began: Fri Jul 13 12:58:27 2018
Command line used: /home/jburto2/io-500-dev/bin/ior -w -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -s 25000 -o /mnt/beegfs/jburto2/ior_hard/IOR_file-88-512

Test 0 started: Fri Jul 13 12:58:27 2018
Summary:
api                = MPIIO (version=3, subversion=0)
test filename      = /mnt/beegfs/jburto2/ior_hard/IOR_file-88-512
access             = single-shared-file
ordering in a file = sequential offsets
ordering inter file= constant task offsets = 1
clients            = 88 (4 per node)
repetitions        = 1
xfersize           = 47008 bytes
blocksize          = 47008 bytes
aggregate filesize = 96.32 GiB

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ---------- ---------  --------   --------   --------   --------   ----
write     388.56     45.91      45.91      0.356252   253.42     0.049576   253.82     0

Max Write: 388.56 MiB/sec (407.44 MB/sec)

Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
write         388.56     388.56     388.56       0.00    8667.42    8667.42    8667.42       0.00  253.82403 0 88 4 1 0 1 1 0 0 25000 47008 47008 103417600000 MPIIO 0

Finished: Fri Jul 13 13:02:41 2018

....

+ /usr/lib64/openmpi/bin/mpirun -np 22 --mca btl_tcp_if_exclude em1,ib0 --mca btl self,tcp --map-by node --machinefile /home/jburto2/pvfsnodelistmpi --prefix /usr/lib64/openmpi /home/jburto2/io-500-dev/bin/ior -r -R -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -s 500000 -o /mnt/beegfs/jburto2/ior_hard/IOR_file-22-512
IOR-3.1.0: MPI Coordinated Test of Parallel I/O

ior WARNING: fsync() only available in POSIX/MMAP.  Using value of 0.
Began: Fri Jul 13 14:22:15 2018
Command line used: /home/jburto2/io-500-dev/bin/ior -r -R -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -s 500000 -o /mnt/beegfs/jburto2/ior_hard/IOR_file-22-512

Test 0 started: Fri Jul 13 14:22:15 2018
Summary:
api                = MPIIO (version=3, subversion=0)
test filename      = /mnt/beegfs/jburto2/ior_hard/IOR_file-22-512
access             = single-shared-file
ordering in a file = sequential offsets
ordering inter file= constant task offsets = 1
clients            = 22 (1 per node)
repetitions        = 1
xfersize           = 47008 bytes
blocksize          = 47008 bytes
aggregate filesize = 481.58 GiB

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ---------- ---------  --------   --------   --------   --------   ----
read      2948.62    45.91      45.91      0.007031   167.23     0.010853   167.24     0

Max Read:  2948.62 MiB/sec (3091.86 MB/sec)

Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
read         2948.62    2948.62    2948.62       0.00   65772.97   65772.97   65772.97       0.00  167.24194 0 22 1 1 0 1 1 0 0 500000 47008 47008 517088000000 MPIIO 0

Finished: Fri Jul 13 14:25:03 2018
+ /usr/lib64/openmpi/bin/mpirun -np 88 --mca btl_tcp_if_exclude em1,ib0 --mca btl self,tcp --map-by node --machinefile /home/jburto2/pvfsnodelistmpi --prefix /usr/lib64/openmpi /home/jburto2/io-500-dev/bin/ior -r -R -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -s 25000 -o /mnt/beegfs/jburto2/ior_hard/IOR_file-88-512
IOR-3.1.0: MPI Coordinated Test of Parallel I/O

ior WARNING: fsync() only available in POSIX/MMAP.  Using value of 0.
Began: Fri Jul 13 14:25:05 2018
Command line used: /home/jburto2/io-500-dev/bin/ior -r -R -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -s 25000 -o /mnt/beegfs/jburto2/ior_hard/IOR_file-88-512

Test 0 started: Fri Jul 13 14:25:05 2018
Summary:
api                = MPIIO (version=3, subversion=0)
test filename      = /mnt/beegfs/jburto2/ior_hard/IOR_file-88-512
access             = single-shared-file
ordering in a file = sequential offsets
ordering inter file= constant task offsets = 1
clients            = 88 (4 per node)
repetitions        = 1
xfersize           = 47008 bytes
blocksize          = 47008 bytes
aggregate filesize = 96.32 GiB

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ---------- ---------  --------   --------   --------   --------   ----
read      4254       45.91      45.91      0.020689   23.16      0.006642   23.19      0

Max Read:  4253.75 MiB/sec (4460.38 MB/sec)

Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
read         4253.75    4253.75    4253.75       0.00   94885.46   94885.46   94885.46       0.00   23.18585 0 88 4 1 0 1 1 0 0 25000 47008 47008 103417600000 MPIIO 0

Finished: Fri Jul 13 14:25:28 2018

James Burton

Jul 16, 2018, 4:02:56 PM
to fhgfs...@googlegroups.com
I think I may have solved the mystery.

Looking at client and server stats provided by beegfs-ctl, the client appears to be doing some sort of buffering of sequential writes.
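
For reference, beegfs-ctl exposes these counters through its stats modes, roughly along these lines (options may differ between versions):

beegfs-ctl --clientstats --nodetype=storage   # per-client op counters against the storage servers (ops-wr, ops-rd, ...)
beegfs-ctl --serverstats --nodetype=storage   # aggregate per-server counters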

In the single-process-per-client-node case, all the writes from a given client node to the filesystem are sequential. If the client is buffering these, that would take the "hardness" out of the ior_hard test, leading to impressive performance in this benchmark. beegfs-ctl shows about 100,000 ops-wr per second throughout the test.

In the multiple-processes-per-client-node case, the writes from a given client node are NOT sequential, so they don't get the same benefit. beegfs-ctl shows about 30,000 ops-wr per second at first, with performance dropping throughout the test.

Running a test with a single process per client node and random offsets gives performance similar to the multiple-processes-per-client-node case.
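
Something along these lines reproduces that random-offset run; IOR's -z option requests random rather than sequential offsets, and the output filename here is just a placeholder:

/usr/lib64/openmpi/bin/mpirun -np 22 --mca btl_tcp_if_exclude em1,ib0 --mca btl self,tcp --map-by node --machinefile /home/jburto2/pvfsnodelistmpi --prefix /usr/lib64/openmpi /home/jburto2/io-500-dev/bin/ior -w -C -Q 1 -g -G 27 -k -e -z -t 47008 -b 47008 -s 500000 -o /mnt/beegfs/jburto2/ior_hard/IOR_file-22-random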

--
James Burton
OS and Storage Architect
Advanced Computing Infrastructure
Clemson University Computing and Information Technology
340 Computer Court
Anderson, SC 29625

John Bent

Jul 18, 2018, 1:46:05 AM
to beegfs-user
Hello James,

Great job debugging your own issue; I think you identified the problem perfectly!  This is a nice illustration of exactly why we have the IO500 ior-hard test.  Hopefully you will be submitting this result to the official IO500 list?  :)

Thanks,

John

James Burton

Jul 18, 2018, 10:49:41 AM
to fhgfs...@googlegroups.com
I expect we will be submitting an IO500 result in time for SC18. 

We have gotten very good numbers with BeeGFS and mediocre numbers with Lustre on the same hardware, but Lustre seems to handle the "worst case scenario" tests less badly.

The IO500 is useful not only as a way of comparing different storage systems, but also for determining the strengths and weaknesses of a given storage system. All parallel and distributed filesystems, including BeeGFS and Lustre, rely on at least some "smoke and mirrors" to get decent performance. It's good to know where the smoke and mirrors are.

I hope the BeeGFS developers are reading this mailing list, because if they can improve the multiple process per client node test case (or have any ideas for us to better tune the system), then that would be helpful for users.





John Bent

Jul 18, 2018, 11:15:26 AM
to fhgfs...@googlegroups.com
Fantastic. It sounds like you are planning to submit both the Lustre and BeeGFS results on the same hardware, which is awesome because a main purpose of the IO500 is to enable exactly these apples-to-apples comparisons, for the "smoke and mirrors" reasons you cite.

Please let us on the IO500 list know if you have any questions or need any help.

BeeGFS developers: apologies for hijacking this thread. 

James Burton

Jul 18, 2018, 2:35:13 PM
to fhgfs...@googlegroups.com
Perhaps this is hijacking the thread, but users and developers in this group might find this interesting.

I ran a true apples-to-apples comparison between BeeGFS 7 and Lustre 2.10.4 not too long ago. This is older hardware and I'm sure both setups could be better tuned, but here are the numbers I got. (The BeeGFS numbers do not qualify for submission.) The memo is below.

Jim

P.S. I think there is a bug in the IO500.sh script for calculating the final score. It gives the product of the bandwidth and IOPS, not the geometric mean.
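
For the numbers below, the geometric mean would be:

  BeeGFS: sqrt(4.28292 * 47.5528) = sqrt(203.66) ~= 14.27
  Lustre: sqrt(1.89261 * 47.0566) = sqrt(89.06)  ~= 9.44

which matches the "actual score" values shown, while the TOTAL field printed by the script is the raw product.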

---------------------------------
Setup

Storage nodes were the following:

11 x Dell R720 ofstest nodes
Metadata storage: 2 SSDs in RAID-1 configuration
Data storage: 10 SAS drives in RAID-6 configuration
Mellanox ConnectX-3 FDR interconnect
16GB RAM

Client nodes were the following:

22x Dell R510 pvfs nodes
4 clients per node (1 per CPU core)
Mellanox ConnectX-3 FDR interconnect
12GB RAM

Striping for the ior_easy directory was a single target with a 4m chunk size.
Striping for the ior_hard directory was across all nodes with a 4m chunk size. (Example commands are below.)

mdtest_easy for Lustre created all files in the same directory, but mdtest_easy for BeeGFS had each process create its files across 11 directories (to spread the metadata across the different metadata servers).
Otherwise, all tests were the same.
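
For reference, that striping corresponds to commands along these lines (the directory paths are placeholders):

beegfs-ctl --setpattern --numtargets=1  --chunksize=4m <beegfs mount>/ior_easy
beegfs-ctl --setpattern --numtargets=11 --chunksize=4m <beegfs mount>/ior_hard
lfs setstripe -c 1  -S 4m <lustre mount>/ior_easy
lfs setstripe -c -1 -S 4m <lustre mount>/ior_hard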

OpenMPI 1.10.7 running over TCP was used to run all tests.

Results

==> io-500-dev/ofstest-results/io500-lustre-ofstest-88clients.6.out <==
[RESULT] BW   phase 1            ior_easy_write                8.120 GB/s : time 520.19 seconds
[RESULT] BW   phase 2            ior_hard_write                0.182 GB/s : time 423.88 seconds
[RESULT] BW   phase 3             ior_easy_read                5.361 GB/s : time 787.93 seconds
[RESULT] BW   phase 4             ior_hard_read                1.621 GB/s : time  47.52 seconds
[RESULT] IOPS phase 1         mdtest_easy_write               23.268 kiops : time 381.13 seconds
[RESULT] IOPS phase 2         mdtest_hard_write               15.361 kiops : time 404.02 seconds
[RESULT] IOPS phase 3                      find              354.900 kiops : time  42.15 seconds
[RESULT] IOPS phase 4          mdtest_easy_stat              114.934 kiops : time  79.05 seconds
[RESULT] IOPS phase 5          mdtest_hard_stat               80.333 kiops : time  79.53 seconds
[RESULT] IOPS phase 6        mdtest_easy_delete               25.587 kiops : time 347.40 seconds
[RESULT] IOPS phase 7          mdtest_hard_read               39.911 kiops : time 156.92 seconds
[RESULT] IOPS phase 8        mdtest_hard_delete               20.101 kiops : time 309.40 seconds
[SCORE] Bandwidth 1.89261 GB/s : IOPS 47.0566 kiops : TOTAL 89.05979 (actual score 9.43715 (11th of 13))

==> io-500-dev/ofstest-results/io500-beegfs-ofstest-88clients.1.out <==
[RESULT] BW   phase 1            ior_easy_write               10.666 GB/s : time 396.03 seconds
[RESULT] BW   phase 2            ior_hard_write                0.903 GB/s : time  85.32 seconds
[RESULT] BW   phase 3             ior_easy_read                6.521 GB/s : time 647.82 seconds
[RESULT] BW   phase 4             ior_hard_read                5.357 GB/s : time  14.38 seconds
[RESULT] IOPS phase 1         mdtest_easy_write              102.013 kiops : time  89.42 seconds
[RESULT] IOPS phase 2         mdtest_hard_write                6.298 kiops : time 981.00 seconds
[RESULT] IOPS phase 3                      find              113.000 kiops : time 132.38 seconds
[RESULT] IOPS phase 4          mdtest_easy_stat              400.181 kiops : time  24.48 seconds
[RESULT] IOPS phase 5          mdtest_hard_stat               82.882 kiops : time  76.82 seconds
[RESULT] IOPS phase 6        mdtest_easy_delete               94.327 kiops : time  97.19 seconds
[RESULT] IOPS phase 7          mdtest_hard_read               18.642 kiops : time 333.59 seconds
[RESULT] IOPS phase 8        mdtest_hard_delete                6.174 kiops : time 1006.95 seconds
[SCORE] Bandwidth 4.28292 GB/s : IOPS 47.5528 kiops : TOTAL 203.66483 (actual score: 14.2711 (9th of 13))

Discussion

BeeGFS had considerably faster throughput than Lustre in all tests. (Note: subsequent testing has shown that BeeGFS's performance with multiple processes per client node degrades more than Lustre's as the size of the ior_hard problem increases.)

BeeGFS also had considerably faster "best case" metadata performance (mdtest_easy).

Lustre's big advantage was "worst case" metadata performance (mdtest_hard).

BeeGFS's performance degraded significantly between mdtest_easy and mdtest_hard, while Lustre showed consistently mediocre metadata performance. Because BeeGFS does not have distributed directories, one metadata server was getting hammered in the mdtest_hard test. Additionally, Lustre's large client-side cache probably meant it performed fewer writes to the slow data storage disks when storing file data in the mdtest_hard test. More tuning can probably improve BeeGFS's numbers, but I doubt it will catch Lustre.

Lustre's advantage in the mdtest_hard test will only increase with the new Data-on-MDT feature coming in Lustre 2.11, which allows small-file data to be stored on the low-latency metadata targets.

Lustre's other advantage was performance on (parallel) find, which is likely due to client-side caching.

Winner

Using the io-500's scoring system that attempts to balance data and metadata performance, BeeGFS trounced Lustre, 14.2711 to 9.43715.

However, due to BeeGFS's poor performance on mdtest_hard, Lustre finished faster, completing the benchmark in 3579.12 s, while BeeGFS took 3885.38 s. 

Conclusion

BeeGFS has faster throughput than Lustre. BeeGFS's metadata performance can be faster because it leverages optimizations from the underlying filesystem, but it lacks the features Lustre has for balancing metadata operations and storage between servers in complex worst-case scenarios.





James Burton

Jul 27, 2018, 5:05:08 PM
to fhgfs...@googlegroups.com
Update: Client AND server write buffering is confirmed in the source code.

In client_module/source/filesystem/FhgfsOpsHelper.c 

FhgfsOpsHelper_writeCached() buffers writes by appending consecutive writes to a per-file cache buffer until the buffer is full or a cache miss occurs. (A rough sketch of the logic is below.)
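
To make the behavior concrete, here is a simplified sketch of that logic in plain C. This is my own illustration, not the actual kernel-module code; the names and the buffer size are made up.

/* Illustrative sketch only -- not the actual BeeGFS client code. */
#include <stddef.h>
#include <string.h>

#define CACHE_BUF_SIZE (512 * 1024)            /* hypothetical buffer size */

struct file_write_cache {
    char   buf[CACHE_BUF_SIZE];
    size_t buf_len;                            /* bytes currently buffered */
    long   buf_offset;                         /* file offset of buf[0]; -1 when empty */
};

/* The expensive path: one round trip to the storage servers per flush. */
static void flush_to_servers(struct file_write_cache *c)
{
    /* ...send buf[0..buf_len) at buf_offset to the storage servers... */
    c->buf_len = 0;
    c->buf_offset = -1;
}

static void cached_write(struct file_write_cache *c,
                         const char *data, size_t len, long offset)
{
    int is_consecutive = (c->buf_offset >= 0 &&
                          offset == c->buf_offset + (long)c->buf_len);

    /* A non-consecutive offset (cache miss) or a full buffer forces a flush. */
    if (!is_consecutive || c->buf_len + len > CACHE_BUF_SIZE)
        flush_to_servers(c);

    if (len >= CACHE_BUF_SIZE) {
        /* Write too large to buffer: it would go straight to the servers. */
        return;
    }

    if (c->buf_len == 0)
        c->buf_offset = offset;                /* start a new buffered run */
    memcpy(c->buf + c->buf_len, data, len);
    c->buf_len += len;
}

With one writer per node, the sequential 47008-byte writes mostly hit the memcpy path and only flush every ~512 KiB; with several writers interleaving on one node, the offsets seen for the shared file are non-consecutive and nearly every call flushes.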

Looking at the design of the fileCacheBuffer in client_module/source/filesystem/FhgfsInode.c:
  • Consecutive writes to the same file maximize the number of cache hits and send writes up to the size of the fileCacheBuffer.
    • This is what is happening in the one-process-per-node case.
    • The BeeGFS storage servers see large, buffered writes and avoid most of the "hard" part of the ior_hard test.

  • Non-consecutive writes to the same file WILL cause a cache miss, which triggers an expensive write to the servers.
    • This is what is happening in the multiple-processes-per-node case.
    • The BeeGFS storage servers see small writes and deliver the usual poor performance on the ior_hard test.

Surprisingly, even if client-side buffering is turned off, performance in the one-process-per-node case does not suffer significantly.

Further investigation showed that the BeeGFS servers are buffering consecutive writes to the filesystem via the underlying Linux kernel functionality. 

The write process is in storage/net/message/session/rw/WriteLocalFileMessageEx.cpp 


