beegfs single-shared file writes with multiple MPI processes per client (ior_hard)


James Burton

Jul 13, 2018, 2:29:06 PM
to fhgfs...@googlegroups.com
Greetings,

I am investigating BeeGFS 7.0 for use as a high-performance parallel scratch space for HPC jobs. One of the tests I am using is the io-500 benchmark suite. The ior_hard test uses ior to measure how well the file system handles writing and reading a single shared file with many small, irregular writes; it is intended as a "worst case scenario" test.
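
For context, the ior_hard invocation in the logs below boils down to the following (the annotations are mine):

ior -w -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -s 500000 -o <shared file>
  # -t/-b 47008 : every transfer is a small, odd-sized 47008-byte write
  # -s 500000   : segments per task, so 22 tasks x 500000 x 47008 B ~= 482 GiB aggregate (-s 25000 in the 88-process runs)
  # -w          : write phase; the read runs use -r -R instead
  # -k          : keep the file so the read phase can reuse it
  # -e          : fsync after the write phase (ignored for MPIIO, per the warning in the output)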

Using 1 MPI process per client node, I get excellent results (3.3 GB/s). But using 4 MPI processes per client node, performance is significantly worse (0.39 GB/s), even though the aggregate file is smaller. Reads do not show any degradation in performance.

I have 22 clients and 11 servers. Clients and servers are on separate machines. Servers are running both storage and metadata services.

Metadata targets are two SATA SSDs per server attached to a RAID controller running in RAID-1 configuration. Storage targets are 12 SAS HDDs per server attached to the same RAID controller running in a RAID-6 configuration.

The filesystem is mounted over FDR InfiniBand, and beegfs-net shows that RDMA support is enabled and working.
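
In case it is useful, a couple of commands that show this (exact options and output vary by version):

beegfs-net                                               # per-server connection list on a client; RDMA entries confirm the IB path is in use
beegfs-ctl --listnodes --nodetype=storage --nicdetails   # NICs exported by each storage server, including the RDMA-capable ones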


IOR is launched via OpenMPI 1.10.7 running over TCP on 10 Gb/s Ethernet, with round-robin process mapping.

A couple of observations:
  1. When running with multiple processes per client node, performance gets worse as the file grows. At 50000 transfers, the bandwidth is ~0.07 GB/s. The same degradation is not seen when running with one process per client node.
  2. When running with multiple processes per client node, the file transfers very quickly at first, then slows down significantly toward the end. The last 1% of the file takes longer to transfer than the first 99%.
Does anyone know what could be causing this? Is this a bug? If so, are there any workarounds?

Thanks,

Jim

P.S. The results of the test are below:


+ mkdir -p /mnt/beegfs/jburto2/ior_hard
+ sudo beegfs-ctl --setpattern --numtargets=11 --chunksize=512k /mnt/beegfs/jburto2/ior_hard
New chunksize: 524288
New number of storage targets: 11

+ /usr/lib64/openmpi/bin/mpirun -np 22 --mca btl_tcp_if_exclude em1,ib0 --mca btl self,tcp --map-by node --machinefile /home/jburto2/pvfsnodelistmpi --prefix /usr/lib64/openmpi /home/jburto2/io-500-dev/bin/ior -w -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -s 500000 -o /mnt/beegfs/jburto2/ior_hard/IOR_file-22-512
IOR-3.1.0: MPI Coordinated Test of Parallel I/O

ior WARNING: fsync() only available in POSIX/MMAP.  Using value of 0.
Began: Fri Jul 13 12:55:57 2018
Command line used: /home/jburto2/io-500-dev/bin/ior -w -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -s 500000 -o /mnt/beegfs/jburto2/ior_hard/IOR_file-22-512

Test 0 started: Fri Jul 13 12:55:57 2018
Summary:
api                = MPIIO (version=3, subversion=0)
test filename      = /mnt/beegfs/jburto2/ior_hard/IOR_file-22-512
access             = single-shared-file
ordering in a file = sequential offsets
ordering inter file= constant task offsets = 1
clients            = 22 (1 per node)
repetitions        = 1
xfersize           = 47008 bytes
blocksize          = 47008 bytes
aggregate filesize = 481.58 GiB

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ---------- ---------  --------   --------   --------   --------   ----
write     3320.87    45.91      45.91      0.736730   147.65     0.113621   148.50     0

Max Write: 3320.87 MiB/sec (3482.19 MB/sec)

Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
write        3320.87    3320.87    3320.87       0.00   74076.48   74076.48   74076.48       0.00  148.49517 0 22 1 1 0 1 1 0 0 500000 47008 47008 517088000000 MPIIO 0

Finished: Fri Jul 13 12:58:25 2018
+ /usr/lib64/openmpi/bin/mpirun -np 88 --mca btl_tcp_if_exclude em1,ib0 --mca btl self,tcp --map-by node --machinefile /home/jburto2/pvfsnodelistmpi --prefix /usr/lib64/openmpi /home/jburto2/io-500-dev/bin/ior -w -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -s 25000 -o /mnt/beegfs/jburto2/ior_hard/IOR_file-88-512
IOR-3.1.0: MPI Coordinated Test of Parallel I/O

ior WARNING: fsync() only available in POSIX/MMAP.  Using value of 0.
Began: Fri Jul 13 12:58:27 2018
Command line used: /home/jburto2/io-500-dev/bin/ior -w -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -s 25000 -o /mnt/beegfs/jburto2/ior_hard/IOR_file-88-512

Test 0 started: Fri Jul 13 12:58:27 2018
Summary:
api                = MPIIO (version=3, subversion=0)
test filename      = /mnt/beegfs/jburto2/ior_hard/IOR_file-88-512
access             = single-shared-file
ordering in a file = sequential offsets
ordering inter file= constant task offsets = 1
clients            = 88 (4 per node)
repetitions        = 1
xfersize           = 47008 bytes
blocksize          = 47008 bytes
aggregate filesize = 96.32 GiB

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ---------- ---------  --------   --------   --------   --------   ----
write     388.56     45.91      45.91      0.356252   253.42     0.049576   253.82     0

Max Write: 388.56 MiB/sec (407.44 MB/sec)

Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
write         388.56     388.56     388.56       0.00    8667.42    8667.42    8667.42       0.00  253.82403 0 88 4 1 0 1 1 0 0 25000 47008 47008 103417600000 MPIIO 0

Finished: Fri Jul 13 13:02:41 2018

....

+ /usr/lib64/openmpi/bin/mpirun -np 22 --mca btl_tcp_if_exclude em1,ib0 --mca btl self,tcp --map-by node --machinefile /home/jburto2/pvfsnodelistmpi --prefix /usr/lib64/openmpi /home/jburto2/io-500-dev/bin/ior -r -R -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -s 500000 -o /mnt/beegfs/jburto2/ior_hard/IOR_file-22-512
IOR-3.1.0: MPI Coordinated Test of Parallel I/O

ior WARNING: fsync() only available in POSIX/MMAP.  Using value of 0.
Began: Fri Jul 13 14:22:15 2018
Command line used: /home/jburto2/io-500-dev/bin/ior -r -R -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -s 500000 -o /mnt/beegfs/jburto2/ior_hard/IOR_file-22-512

Test 0 started: Fri Jul 13 14:22:15 2018
Summary:
api                = MPIIO (version=3, subversion=0)
test filename      = /mnt/beegfs/jburto2/ior_hard/IOR_file-22-512
access             = single-shared-file
ordering in a file = sequential offsets
ordering inter file= constant task offsets = 1
clients            = 22 (1 per node)
repetitions        = 1
xfersize           = 47008 bytes
blocksize          = 47008 bytes
aggregate filesize = 481.58 GiB

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ---------- ---------  --------   --------   --------   --------   ----
read      2948.62    45.91      45.91      0.007031   167.23     0.010853   167.24     0

Max Read:  2948.62 MiB/sec (3091.86 MB/sec)

Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
read         2948.62    2948.62    2948.62       0.00   65772.97   65772.97   65772.97       0.00  167.24194 0 22 1 1 0 1 1 0 0 500000 47008 47008 517088000000 MPIIO 0

Finished: Fri Jul 13 14:25:03 2018
+ /usr/lib64/openmpi/bin/mpirun -np 88 --mca btl_tcp_if_exclude em1,ib0 --mca btl self,tcp --map-by node --machinefile /home/jburto2/pvfsnodelistmpi --prefix /usr/lib64/openmpi /home/jburto2/io-500-dev/bin/ior -r -R -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -s 25000 -o /mnt/beegfs/jburto2/ior_hard/IOR_file-88-512
IOR-3.1.0: MPI Coordinated Test of Parallel I/O

ior WARNING: fsync() only available in POSIX/MMAP.  Using value of 0.
Began: Fri Jul 13 14:25:05 2018
Command line used: /home/jburto2/io-500-dev/bin/ior -r -R -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -s 25000 -o /mnt/beegfs/jburto2/ior_hard/IOR_file-88-512

Test 0 started: Fri Jul 13 14:25:05 2018
Summary:
api                = MPIIO (version=3, subversion=0)
test filename      = /mnt/beegfs/jburto2/ior_hard/IOR_file-88-512
access             = single-shared-file
ordering in a file = sequential offsets
ordering inter file= constant task offsets = 1
clients            = 88 (4 per node)
repetitions        = 1
xfersize           = 47008 bytes
blocksize          = 47008 bytes
aggregate filesize = 96.32 GiB

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ---------- ---------  --------   --------   --------   --------   ----
read      4254       45.91      45.91      0.020689   23.16      0.006642   23.19      0

Max Read:  4253.75 MiB/sec (4460.38 MB/sec)

Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
read         4253.75    4253.75    4253.75       0.00   94885.46   94885.46   94885.46       0.00   23.18585 0 88 4 1 0 1 1 0 0 25000 47008 47008 103417600000 MPIIO 0

Finished: Fri Jul 13 14:25:28 2018

James Burton

Jul 16, 2018, 4:02:56 PM
to fhgfs...@googlegroups.com
I think I may have solved the mystery.

Looking at client and server stats provided by beegfs-ctl, the client appears to be doing some sort of buffering of sequential writes.
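
For reference, beegfs-ctl exposes these counters through its stats modes, roughly along these lines (options may differ between versions):

beegfs-ctl --clientstats --nodetype=storage   # per-client op counters against the storage servers (ops-wr, ops-rd, ...)
beegfs-ctl --serverstats --nodetype=storage   # aggregate per-server counters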

In the single-process-per-client-node case, all the writes from a given client node to the filesystem are sequential. If the client is buffering these, that would take the "hardness" out of the ior_hard test, leading to impressive performance in this benchmark. beegfs-ctl shows about 100,000 ops-wr per second throughout the test.

In the multiple-processes-per-client-node case, the writes from a given client node are NOT sequential, so they don't get the same benefit. beegfs-ctl shows about 30,000 ops-wr per second at first, with performance dropping throughout the test.

Running a test with a single process per client node and random offsets gives performance similar to the multiple-processes-per-client-node case.
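
Something along these lines reproduces that random-offset run; IOR's -z option requests random rather than sequential offsets, and the output filename here is just a placeholder:

/usr/lib64/openmpi/bin/mpirun -np 22 --mca btl_tcp_if_exclude em1,ib0 --mca btl self,tcp --map-by node --machinefile /home/jburto2/pvfsnodelistmpi --prefix /usr/lib64/openmpi /home/jburto2/io-500-dev/bin/ior -w -C -Q 1 -g -G 27 -k -e -z -t 47008 -b 47008 -s 500000 -o /mnt/beegfs/jburto2/ior_hard/IOR_file-22-random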

--
James Burton
OS and Storage Architect
Advanced Computing Infrastructure
Clemson University Computing and Information Technology
340 Computer Court
Anderson, SC 29625

John Bent

Jul 18, 2018, 1:46:05 AM
to beegfs-user
Hello James,

Great job debugging your own issue; I think you identified the problem perfectly!  This is a nice illustration of exactly why we have the IO500 ior-hard test.  Hopefully you will be submitting this result to the official IO500 list?  :)

Thanks,

John

James Burton

Jul 18, 2018, 10:49:41 AM
to fhgfs...@googlegroups.com
I expect we will be submitting an IO500 result in time for SC18. 

We have gotten very good numbers with BeeGFS and mediocre numbers with Lustre on the same hardware, but Lustre seems to handle the "worst case scenario" tests less badly.

The IO500 is useful not only as a way of comparing different storage systems, but also for determining the strengths and weaknesses of a given storage system. All parallel and distributed filesystems, including BeeGFS and Lustre, rely on at least some "smoke and mirrors" to get decent performance. It's good to know where the smoke and mirrors are.

I hope the BeeGFS developers are reading this mailing list, because if they can improve the multiple process per client node test case (or have any ideas for us to better tune the system), then that would be helpful for users.





John Bent

Jul 18, 2018, 11:15:26 AM
to fhgfs...@googlegroups.com
Fantastic. It sounds like you are planning to submit both the Lustre and BeeGFS results on the same hardware, which is awesome because a main purpose of the IO500 is to enable exactly these apples-to-apples comparisons, for the "smoke and mirrors" reasons you cite.

Please let us on the IO500 list know if you have any questions or need any help.

BeeGFS developers: apologies for hijacking this thread. 

James Burton

Jul 18, 2018, 2:35:13 PM
to fhgfs...@googlegroups.com
Perhaps this is hijacking the thread, but users and developers in this group might find this interesting.

I ran a true apples-to-apples comparison between BeeGFS 7 and Lustre 2.10.4 not too long ago. This is older hardware and I'm sure both setups could be better tuned, but here are the numbers I got. (The BeeGFS numbers do not qualify for submission.) The memo is below.

Jim

P.S. I think there is a bug in the IO500.sh script for calculating the final score. It gives the product of the bandwidth and IOPS, not the geometric mean.
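
For the numbers below, the geometric mean would be:

  BeeGFS: sqrt(4.28292 * 47.5528) = sqrt(203.66) ~= 14.27
  Lustre: sqrt(1.89261 * 47.0566) = sqrt(89.06)  ~= 9.44

which matches the "actual score" values shown, while the TOTAL field printed by the script is the raw product.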

---------------------------------
Setup

Storage nodes were the following:

11 x Dell R720 ofstest nodes
Metadata storage: 2 SSDs in RAID-1 configuration
Data storage: 10 SAS drives in RAID-6 configuration
Mellanox ConnectX-3 FDR interconnect
16GB RAM

Client nodes were the following:

22x Dell R510 pvfs nodes
4 clients per node (1 per CPU core)
Mellanox ConnectX-3 FDR interconnect
12GB RAM

Striping for the ior_easy directory was a single target with a 4m chunk size.
Striping for the ior_hard directory was across all nodes with a 4m chunk size. (Example commands are below.)

mdtest_easy for Lustre created all files in the same directory, but mdtest_easy for BeeGFS had each process create its files across 11 directories (to spread the metadata across the different metadata servers).
Otherwise, all tests were the same.
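
For reference, that striping corresponds to commands along these lines (the directory paths are placeholders):

beegfs-ctl --setpattern --numtargets=1  --chunksize=4m <beegfs mount>/ior_easy
beegfs-ctl --setpattern --numtargets=11 --chunksize=4m <beegfs mount>/ior_hard
lfs setstripe -c 1  -S 4m <lustre mount>/ior_easy
lfs setstripe -c -1 -S 4m <lustre mount>/ior_hard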

OpenMPI 1.10.7 running over TCP was used to run all tests.

Results

==> io-500-dev/ofstest-results/io500-lustre-ofstest-88clients.6.out <==
[RESULT] BW   phase 1            ior_easy_write                8.120 GB/s : time 520.19 seconds
[RESULT] BW   phase 2            ior_hard_write                0.182 GB/s : time 423.88 seconds
[RESULT] BW   phase 3             ior_easy_read                5.361 GB/s : time 787.93 seconds
[RESULT] BW   phase 4             ior_hard_read                1.621 GB/s : time  47.52 seconds
[RESULT] IOPS phase 1         mdtest_easy_write               23.268 kiops : time 381.13 seconds
[RESULT] IOPS phase 2         mdtest_hard_write               15.361 kiops : time 404.02 seconds
[RESULT] IOPS phase 3                      find              354.900 kiops : time  42.15 seconds
[RESULT] IOPS phase 4          mdtest_easy_stat              114.934 kiops : time  79.05 seconds
[RESULT] IOPS phase 5          mdtest_hard_stat               80.333 kiops : time  79.53 seconds
[RESULT] IOPS phase 6        mdtest_easy_delete               25.587 kiops : time 347.40 seconds
[RESULT] IOPS phase 7          mdtest_hard_read               39.911 kiops : time 156.92 seconds
[RESULT] IOPS phase 8        mdtest_hard_delete               20.101 kiops : time 309.40 seconds
[SCORE] Bandwidth 1.89261 GB/s : IOPS 47.0566 kiops : TOTAL 89.05979 (actual score 9.43715 (11th of 13))

==> io-500-dev/ofstest-results/io500-beegfs-ofstest-88clients.1.out <==
[RESULT] BW   phase 1            ior_easy_write               10.666 GB/s : time 396.03 seconds
[RESULT] BW   phase 2            ior_hard_write                0.903 GB/s : time  85.32 seconds
[RESULT] BW   phase 3             ior_easy_read                6.521 GB/s : time 647.82 seconds
[RESULT] BW   phase 4             ior_hard_read                5.357 GB/s : time  14.38 seconds
[RESULT] IOPS phase 1         mdtest_easy_write              102.013 kiops : time  89.42 seconds
[RESULT] IOPS phase 2         mdtest_hard_write                6.298 kiops : time 981.00 seconds
[RESULT] IOPS phase 3                      find              113.000 kiops : time 132.38 seconds
[RESULT] IOPS phase 4          mdtest_easy_stat              400.181 kiops : time  24.48 seconds
[RESULT] IOPS phase 5          mdtest_hard_stat               82.882 kiops : time  76.82 seconds
[RESULT] IOPS phase 6        mdtest_easy_delete               94.327 kiops : time  97.19 seconds
[RESULT] IOPS phase 7          mdtest_hard_read               18.642 kiops : time 333.59 seconds
[RESULT] IOPS phase 8        mdtest_hard_delete                6.174 kiops : time 1006.95 seconds
[SCORE] Bandwidth 4.28292 GB/s : IOPS 47.5528 kiops : TOTAL 203.66483 (actual score: 14.2711 (9th of 13))

Discussion

BeeGFS had considerably faster throughput than Lustre in all tests. (Note: subsequent testing has shown that BeeGFS's performance with multiple processes per client node degrades more than Lustre's as the size of the ior_hard problem increases.)

BeeGFS also had considerably faster "best case" metadata performance (mdtest_easy).

Lustre's big advantage was "worst case" metadata performance (mdtest_hard).

BeeGFS's performance degraded significantly between mdtest_easy and mdtest_hard, while Lustre showed consistently mediocre metadata performance. Because BeeGFS does not have distributed directories, one metadata server was getting hammered in the mdtest_hard test. Additionally, Lustre's large client-side cache probably meant it performed fewer writes to the slow data storage disks when storing file data in the mdtest_hard test. More tuning can probably improve BeeGFS's numbers, but I doubt it will catch Lustre.

Lustre's advantage in the mdtest_hard test will only increase with the new Data-on-MDT feature coming in Lustre 2.11, which allows small-file data to be stored on the low-latency metadata targets.

Lustre's other advantage was performance on (parallel) find, which is likely due to client-side caching.

Winner

Using the io-500's scoring system that attempts to balance data and metadata performance, BeeGFS trounced Lustre, 14.2711 to 9.43715.

However, due to BeeGFS's poor performance on mdtest_hard, Lustre finished faster, completing the benchmark in 3579.12 s, while BeeGFS took 3885.38 s. 

Conclusion

BeeGFS has faster throughput than Lustre. BeeGFS's metadata performance can be faster because it leverages optimizations from the underlying filesystem, but it lacks the features Lustre has for balancing metadata operations and storage between servers in complex worst-case scenarios.





James Burton

Jul 27, 2018, 5:05:08 PM
to fhgfs...@googlegroups.com
Update: Client AND server write buffering is confirmed in the source code.

In client_module/source/filesystem/FhgfsOpsHelper.c 

FhgfsOpsHelper_writeCached() buffers writes by appending consecutive writes to a per-file cache buffer until the buffer is full or a cache miss occurs. (A rough sketch of the logic is below.)
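
To make the behavior concrete, here is a simplified sketch of that logic in plain C. This is my own illustration, not the actual kernel-module code; the names and the buffer size are made up.

/* Illustrative sketch only -- not the actual BeeGFS client code. */
#include <stddef.h>
#include <string.h>

#define CACHE_BUF_SIZE (512 * 1024)            /* hypothetical buffer size */

struct file_write_cache {
    char   buf[CACHE_BUF_SIZE];
    size_t buf_len;                            /* bytes currently buffered */
    long   buf_offset;                         /* file offset of buf[0]; -1 when empty */
};

/* The expensive path: one round trip to the storage servers per flush. */
static void flush_to_servers(struct file_write_cache *c)
{
    /* ...send buf[0..buf_len) at buf_offset to the storage servers... */
    c->buf_len = 0;
    c->buf_offset = -1;
}

static void cached_write(struct file_write_cache *c,
                         const char *data, size_t len, long offset)
{
    int is_consecutive = (c->buf_offset >= 0 &&
                          offset == c->buf_offset + (long)c->buf_len);

    /* A non-consecutive offset (cache miss) or a full buffer forces a flush. */
    if (!is_consecutive || c->buf_len + len > CACHE_BUF_SIZE)
        flush_to_servers(c);

    if (len >= CACHE_BUF_SIZE) {
        /* Write too large to buffer: it would go straight to the servers. */
        return;
    }

    if (c->buf_len == 0)
        c->buf_offset = offset;                /* start a new buffered run */
    memcpy(c->buf + c->buf_len, data, len);
    c->buf_len += len;
}

With one writer per node, the sequential 47008-byte writes mostly hit the memcpy path and only flush every ~512 KiB; with several writers interleaving on one node, the offsets seen for the shared file are non-consecutive and nearly every call flushes.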

Looking at the design of the fileCacheBuffer in client_module/source/filesystem/FhgfsInode.c:
  • Consecutive writes to the same file maximize the number of cache hits and send writes up to the size of the fileCacheBuffer.
    • This is what is happening in the one-process-per-node case.
    • The BeeGFS storage servers see large, buffered writes and avoid most of the "hard" part of the ior_hard test.

  • Non-consecutive writes to the same file WILL cause a cache miss, which triggers an expensive write to the servers.
    • This is what is happening in the multiple-processes-per-node case.
    • The BeeGFS storage servers see small writes and deliver the usual poor performance on the ior_hard test.

Surprisingly, even if client-side buffering is turned off, performance in the one-process-per-node case does not suffer significantly.

Further investigation showed that the BeeGFS servers are buffering consecutive writes to the filesystem via the underlying Linux kernel functionality. 

The write process is in storage/net/message/session/rw/WriteLocalFileMessageEx.cpp 


