Issues with tuneFileCacheType=native … writes not possible at all


Dr. Thomas Orgis

Jun 29, 2021, 12:37:30 PM
to BeeGFS User Group
Hi,

we run beegfs-7.1.5 on CentOS 7.9 with a vanilla 4.14.237 kernel. Following
the recommendation on the currently sensible caching mode, I set

tuneFileCacheType=native

in the client config. Before that, I used the paged mode, which worked,
but showed only about 40 MB/s of actual I/O to the servers, far below the
600+ MB/s we get otherwise (or around 250 MB/s during busier times).
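
For reference, the change boils down to this (config file path as on a
standard install; the client needs a restart afterwards, of course):

/etc/beegfs/beegfs-client.conf:
  tuneFileCacheType=native

followed by e.g.:

  /etc/init.d/beegfs-client restart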

Now, the test of the native cache mode gets me stuck at this:

# dd if=/dev/zero of=/work_cached/testfile bs=4K count=100
100+0 records in
100+0 records out
409600 bytes (410 kB) copied, 0.00295583 s, 139 MB/s

Oh … that worked now … I meant … this:

[root@node136 ~]# dd if=/dev/zero of=/work_cached/testfile bs=1M count=100
dd: error writing ‘/work_cached/testfile’
1+0 records in
0+0 records out
2097152 bytes (2.1 MB) copied, 0.280609 s, 7.5 MB/s

# ls -l /work_cached/testfile
-rw-r--r-- 1 root root 2097152 Jun 29 18:27 /work_cached/testfile

# cp /work_cached/testfile /work_cached/testfile2
cp: error writing ‘/work_cached/testfile2’: Bad address
cp: failed to extend ‘/work_cached/testfile2’: Bad address

The kernel gives me this:

[ 1096.866422] ------------[ cut here ]------------
[ 1096.871139] WARNING: CPU: 22 PID: 20132 at lib/iov_iter.c:870 iov_iter_revert+0xae/0x130
[ 1096.879337] Modules linked in: xfs libcrc32c beegfs(O) ib_isert iscsi_target_mod target_core_mod ib_ipoib rpcrdma ib_umad rdma_ucm ib_uverbs ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm configfs mlx4_ib ib_core mlx4_en nfsv3 nfs_acl nfs lockd grace fscache x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm drm_kms_helper irqbypass crct10dif_pclmul syscopyarea crc32_pclmul sysfillrect sysimgblt ghash_clmulni_intel fb_sys_fops mlx4_core aesni_intel iTCO_wdt ttm aes_x86_64 iTCO_vendor_support crypto_simd drm cryptd pcspkr glue_helper sg lpc_ich i2c_i801 mfd_core wmi acpi_power_meter binfmt_misc ip_tables overlay squashfs loop sd_mod igb crc32c_intel ahci i2c_algo_bit libahci i2c_core sunrpc
[ 1096.942532] CPU: 22 PID: 20132 Comm: cp Tainted: G W O 4.14.237-1.0.32.el7.rrz.x86_64 #1
[ 1096.951684] Hardware name: Supermicro Super Server/X10DRT-H, BIOS 2.0 12/17/2015
[ 1096.959187] task: ffff88e72f13ba00 task.stack: ffffab2bcee28000
[ 1096.965204] RIP: 0010:iov_iter_revert+0xae/0x130
[ 1096.969909] RSP: 0018:ffffab2bcee2bd78 EFLAGS: 00010286
[ 1096.975224] RAX: 0000000080000000 RBX: 0000000000080000 RCX: 0000000000000000
[ 1096.982448] RDX: 0000000000200016 RSI: fffffffffff80000 RDI: ffffab2bcee2be78
[ 1096.989674] RBP: ffff88e6665105f8 R08: 0000000000080000 R09: ffff88e76822fde8
[ 1096.996894] R10: fffff4a55fb70980 R11: ffff88e767162300 R12: ffffab2bcee2be78
[ 1097.004115] R13: 0000000000000000 R14: ffffab2bcee2be50 R15: 000000000000007f
[ 1097.011343] FS: 0000145bb0464840(0000) GS:ffff88e79fd80000(0000) knlGS:0000000000000000
[ 1097.019546] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1097.025379] CR2: 0000145bb0462000 CR3: 00000007eeb9a004 CR4: 00000000001606e0
[ 1097.032602] Call Trace:
[ 1097.035139] generic_file_direct_write+0x127/0x170
[ 1097.040023] __generic_file_write_iter+0xb3/0x1d0
[ 1097.044823] generic_file_write_iter+0xe7/0x1d0
[ 1097.049451] __vfs_write+0xd6/0x130
[ 1097.053026] vfs_write+0xa8/0x1a0
[ 1097.056431] SyS_write+0x41/0xb0
[ 1097.059751] do_syscall_64+0x61/0x1b0
[ 1097.063507] entry_SYSCALL_64_after_hwframe+0x41/0xa6
[ 1097.068647] RIP: 0033:0x145baf940a90
[ 1097.072307] RSP: 002b:00007ffce790e148 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 1097.079987] RAX: ffffffffffffffda RBX: 0000000000080000 RCX: 0000145baf940a90
[ 1097.094886] RDX: 0000000000080000 RSI: 0000145bb03e3000 RDI: 0000000000000004
[ 1097.102112] RBP: 0000145bb03e3000 R08: 0000000000000000 R09: 00007ffce790f886
[ 1097.109338] R10: 00007ffce790dba0 R11: 0000000000000246 R12: 0000145bb03e3000
[ 1097.116558] R13: 0000000000000004 R14: 0000000000000003 R15: 0000000000080000
[ 1097.123778] Code: 48 83 c1 02 48 39 c6 4c 89 47 20 76 d9 48 83 ea 10 48 29 c6 48 8b 42 08 48 89 4f 20 48 83 c1 01 48 39 f0 72 e8 48 89 57 18 eb bf <0f> 0b c3 8b 47 20 4c 8b 4f 18 48 8b 4f 08 48 63 d0 48 6b d2 28
[ 1097.142786] ---[ end trace 76c7427338d43b36 ]---


Any hints? Any options to tune? Small writes work, large writes don't:

# dd if=/dev/zero of=/work_cached/testfile bs=1M count=1
dd: error writing ‘/work_cached/testfile’
1+0 records in
0+0 records out
2097152 bytes (2.1 MB) copied, 0.27871 s, 7.5 MB/s
# dd if=/dev/zero of=/work_cached/testfile bs=512K count=1
dd: error writing ‘/work_cached/testfile’
1+0 records in
0+0 records out
1048576 bytes (1.0 MB) copied, 0.277986 s, 3.8 MB/s
# dd if=/dev/zero of=/work_cached/testfile bs=256K count=1
1+0 records in
1+0 records out
262144 bytes (262 kB) copied, 0.00154456 s, 170 MB/s

As for read performance: writing via the non-cached mount, then reading back through the cached one:

# dd if=/dev/zero of=/work/testzero bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 5.72161 s, 751 MB/s

# dd if=/work_cached/testzero of=/dev/null bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 3.49394 s, 1.2 GB/s

# dd if=/work_cached/testzero of=/dev/null bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 3.40145 s, 1.3 GB/s

Hm. The 'paged' cache mount previously delivered twice that rate from
its cache. With 4K blocks, things really get bogged down:

# dd if=/work_cached/testzero of=/dev/null bs=4K
^C962871+0 records in
962870+0 records out
3943915520 bytes (3.9 GB) copied, 81.9798 s, 48.1 MB/s


So this config is not right. Does anyone have more experience with
tuneFileCacheType=native setups? Performance is one thing, but outright
breakage with larger block sizes is a proper bummer.


Alrighty then,

Thomas

--
Dr. Thomas Orgis
HPC @ Universität Hamburg

Dr. Thomas Orgis

Jul 13, 2021, 5:28:29 AM
to fhgfs...@googlegroups.com
On Tue, 29 Jun 2021 18:37:26 +0200,
"Dr. Thomas Orgis" <thomas...@uni-hamburg.de> wrote:

> [root@node136 ~]# dd if=/dev/zero of=/work_cached/testfile bs=1M count=100
> dd: error writing ‘/work_cached/testfile’
> 1+0 records in
> 0+0 records out
> 2097152 bytes (2.1 MB) copied, 0.280609 s, 7.5 MB/s

> [ 1097.032602] Call Trace:
> [ 1097.035139] generic_file_direct_write+0x127/0x170
> [ 1097.040023] __generic_file_write_iter+0xb3/0x1d0
> [ 1097.044823] generic_file_write_iter+0xe7/0x1d0
> [ 1097.049451] __vfs_write+0xd6/0x130

Nobody? Can someone at least confirm that caching works for them and
with proper transfer rates?

Bernd Schubert

Jul 13, 2021, 7:23:41 AM
to fhgfs...@googlegroups.com, Dr. Thomas Orgis


On 7/13/21 11:28 AM, Dr. Thomas Orgis wrote:
> On Tue, 29 Jun 2021 18:37:26 +0200,
> "Dr. Thomas Orgis" <thomas...@uni-hamburg.de> wrote:
>
>> [root@node136 ~]# dd if=/dev/zero of=/work_cached/testfile bs=1M count=100
>> dd: error writing ‘/work_cached/testfile’
>> 1+0 records in
>> 0+0 records out
>> 2097152 bytes (2.1 MB) copied, 0.280609 s, 7.5 MB/s
>
>> [ 1097.032602] Call Trace:
>> [ 1097.035139] generic_file_direct_write+0x127/0x170
>> [ 1097.040023] __generic_file_write_iter+0xb3/0x1d0
>> [ 1097.044823] generic_file_write_iter+0xe7/0x1d0
>> [ 1097.049451] __vfs_write+0xd6/0x130
>
> Nobody? Can someone at least confirm that caching works for them and
> with proper transfer rates?

Hmm, back when I worked on the paged mode (years ago), I was getting
about 2 GB/s per client node (single stream, as far as I remember). I
have not checked the native code in detail; it was created later.
I might be able to find some time to test and look into 'paged'
performance next week, but only on a plain CentOS 7.9 kernel - this
would be entirely outside of upstream BeeGFS, though.


- Bernd



Dr. Thomas Orgis

Jul 13, 2021, 7:44:48 AM
to Bernd Schubert, fhgfs...@googlegroups.com
On Tue, 13 Jul 2021 13:23:37 +0200,
Bernd Schubert <bernd.s...@fastmail.fm> wrote:

> I might be able to find some time to test and look into 'paged'
> performance next week, but on a plain centos7.9 kernel only - this would
> be totally outside of upstream BeeGFS, though.

Thanks for picking this up. The poor performance of paged mode may be
rather irrelevant anyway, since the BeeGFS folks consider it outdated,
even deprecated.

The real kicker is the failure of writes on the native one,

# cp /work_cached/testfile /work_cached/testfile2
cp: error writing ‘/work_cached/testfile2’: Bad address
cp: failed to extend ‘/work_cached/testfile2’: Bad address

which seems to relate to transfers above a certain block size (256K)
failing in generic_file_direct_write().

It might be that this relates to my kernel, which is a vanilla 4.14.237
built with the config from CentOS 7. If it works fine with the stock
CentOS 7.9 kernel, that is a hint, of course.

Dr. Thomas Orgis

Jul 20, 2021, 4:42:34 AM
to fhgfs...@googlegroups.com
I have an update on this. Could the devs investigating this issue give
me a hint whether they are already on this track?

On Tue, 13 Jul 2021 13:44:44 +0200,
"Dr. Thomas Orgis" <thomas...@uni-hamburg.de> wrote:

> # cp /work_cached/testfile /work_cached/testfile2
> cp: error writing ‘/work_cached/testfile2’: Bad address
> cp: failed to extend ‘/work_cached/testfile2’: Bad address
>
> which seems to relate to transfers above a certain block size (256K)
> failing in generic_file_direct_write().

I became aware of this default value:

tuneFileCacheBufSize=524288

And indeed, there is a close relation:

# dd if=/dev/zero bs=524288 count=2 of=/work_cached/testzero2
dd: error writing ‘/work_cached/testzero2’
1+0 records in
0+0 records out
1048576 bytes (1.0 MB) copied, 0.283151 s, 3.7 MB/s
# dd if=/dev/zero bs=524287 count=2 of=/work_cached/testzero2
2+0 records in
2+0 records out
1048574 bytes (1.0 MB) copied, 0.0028077 s, 373 MB/s

Staying one byte below that buffer size keeps transfers working. I
verified this by increasing this value up to 16M

(tuneFileCacheBufSize=16777216)

# dd if=/dev/zero bs=16777216 count=2 of=/work_cached/testzero2
dd: error writing ‘/work_cached/testzero2’
1+0 records in
0+0 records out
33554432 bytes (34 MB) copied, 0.31872 s, 105 MB/s
# dd if=/dev/zero bs=16777215 count=2 of=/work_cached/testzero2
2+0 records in
2+0 records out
33554430 bytes (34 MB) copied, 0.0461524 s, 727 MB/s


So … this is a very reproducible issue, clearly related to what the
code tries to do once the transfer size reaches the cache buffer size.
I'd assume the request would have to be split at that point. Seems
straightforward.
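
In case someone wants to reproduce this quickly: a loop like the
following (default tuneFileCacheBufSize of 524288 and a scratch file on
my cached mount assumed; adapt as needed) shows the switch from working
to failing right at the configured value:

# for bs in 524287 524288 524289; do
    echo "======== bs=$bs ========"
    dd if=/dev/zero of=/work_cached/probe bs=$bs count=2
  done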

Dr. Thomas Orgis

Jul 20, 2021, 9:05:36 AM
to fhgfs...@googlegroups.com
On Tue, 20 Jul 2021 10:42:30 +0200,
"Dr. Thomas Orgis" <thomas...@uni-hamburg.de> wrote:

> Staying one byte below that buffer size keeps transfers working. I
> verified this by increasing this value up to 16M

Sadly, performance is abysmal and erratic with an increased buffer size.
Reads slow down to 50 MB/s, and only with smaller buffer sizes do they
sometimes jump to levels approaching the uncached mount, apparently once
the cache has been invalidated. This is all backwards.

Is re-exporting BeeGFS via NFSv4 a robust approach? If that is stable,
it seems a bit weird to go that route just to get client-side caching,
but it would be an option to make certain workloads bearable on
BeeGFS. Of course, we could also decide to keep collections of
numerous small files and read-heavy workloads (think per-user
installations of Python software stacks that start lots of processes
all the time, being seriously hampered by reading the program files
from BeeGFS again and again) away from BeeGFS and instead install a
separate NFS server for such data.
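
(If someone has such a re-export running: I would imagine the gateway
export to look roughly like this - path, network range and fsid are made
up, and the explicit fsid should be needed since the export is not
backed by a local block device:)

# /etc/exports on a node that mounts BeeGFS at /mnt/beegfs
/mnt/beegfs  10.0.0.0/16(rw,no_subtree_check,fsid=100)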

I'd prefer caching to work for BeeGFS in a sensible manner, though. But
if it is just too complicated to get right, we should be honest about
it and tell people that they need to work on their data access patterns
to avoid situations where caching would be important. After all, BeeGFS
is a special-purpose file system for which such a statement could be
understood. Having options around that just do not work right is a
disservice to the product, though :-/ My two ¢.

Peter Serocka

Jul 20, 2021, 1:06:47 PM
to fhgfs...@googlegroups.com


> On Jul 20, 2021, at 15:05, Dr. Thomas Orgis <thomas...@uni-hamburg.de> wrote:
>
>
> numerous small files and read-heavy workloads (think per-user
> installations of Python software stacks that start lots of processes
> all the time, being seriously hampered by reading the program files
> from BeeGFS again and again) away from BeeGFS and instead install a
> separate NFS server for such data.
>

That has actually been working very well for us; with an NFS server
strictly dedicated to providing software stacks (and not being hammered
by test/dev builds) one gets two levels of RAM caching, and this has been
phenomenal even in the days of HDDs.

fwiw
-- Peter



Philipp Falk

Jul 22, 2021, 3:54:37 AM
to fhgfs...@googlegroups.com
Hello Thomas,

I just wanted to let you know that we have identified the issue and have
created a patch that will be released with our next maintenance release.

A bit of background: a while ago, direct I/O was disabled for the native
mode because it did not work correctly. We recently discovered that when
that change was made, one code path was not updated correctly: the path
that automatically switches cached I/O in native mode to direct I/O when
the write size is larger than tuneFileCacheBufSize.

Please find attached a patch for the client module that disables that code
path.

You can apply that patch like this if you use beegfs-client-dkms (for
example version 7.2.3, please adapt):

$ cd /usr/src/beegfs-7.2.3
$ patch -p2 < /path/to/patch/file
$ dkms remove beegfs/7.2.3 -k $(uname -r)
$ dkms install beegfs/7.2.3 -k $(uname -r)

Or if you use the beegfs-client package:

$ cd /opt/beegfs/src/client/client_module_7
$ patch -p2 < /path/to/patch/file
$ /etc/init.d/beegfs-client rebuild

The patch is based on version 7.2.x but should apply to 7.1.x as well with
a little bit of fuzz.
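
If you want to check that beforehand, a dry run is cheap (same source
directories as above, version adapted to your install):

$ cd /usr/src/beegfs-7.1.5   # or /opt/beegfs/src/client/client_module_7
$ patch -p2 --dry-run < /path/to/patch/file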

Best regards
- Philipp


--
Philipp Falk | System Engineer | m: philip...@thinkparq.com
ThinkParQ GmbH | Trippstadter Strasse 113 | 67663 Kaiserslautern | Germany
CEO: Frank Herold | COB: Dr. Franz-Josef Pfreundt | Registered: Amtsgericht Kaiserslautern HRB 31565 I VAT-ID-No.:DE 292001792

Attachment: native_no_direct.patch

Dr. Thomas Orgis

Jul 22, 2021, 4:19:15 AM
to fhgfs...@googlegroups.com
On Thu, 22 Jul 2021 09:54:32 +0200,
Philipp Falk <philip...@thinkparq.com> wrote:

> Please find attached a patch for the client module that disables that code
> path.

Great news, thanks for the update! To be clear: This just completely
disables direct I/O, so no bypassing of the cache.

If direct I/O worked, a subsequent read of these large blocks would have
had to come back from the servers, right? I guess one can debate whether
that is the desired behaviour. There might be value in ensuring that
everything that has been written is in the page cache (while it still
fits).
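
A quick way to check that, assuming a tool like vmtouch is available on
the node, would be to see how much of a freshly written file is still
resident in the page cache:

# dd if=/dev/zero of=/work_cached/testfile bs=1M count=100
# vmtouch /work_cached/testfile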

On the other hand, I will compare the achieved data rates. What
performance impact (loss/gain of (perceived) peak transfer rate) do you
measure with native caching compared to buffered mode?

I will test the patch with our 7.1.5 install and report back.


Regards,

Dr. Thomas Orgis

Jul 28, 2021, 5:56:11 AM
to fhgfs...@googlegroups.com
On Thu, 22 Jul 2021 10:19:11 +0200,
"Dr. Thomas Orgis" <thomas...@uni-hamburg.de> wrote:

> On the other hand, I will compare the achieved data rates. What
> performance impact (loss/gain of (perceived) peak transfer rate) do you
> measure with native caching compared to buffered mode?

OK, so I applied the change to our 7.1.5 client. This is the result
(the cluster is busy, so the uncached data rate is moderate).

$ for bs in 512 4K 16K 256K 1M; do
    echo "======== bs=$bs =========="
    for d in work work_cached; do
      echo 3 >/proc/sys/vm/drop_caches
      echo "***** $d *******"
      for n in 1 2 3; do dd if=$d/testzero bs=$bs of=/dev/null; done
    done
  done
======== bs=512 ==========
***** work *******
8388608+0 records in
8388608+0 records out
4294967296 bytes (4.3 GB) copied, 21.307 s, 202 MB/s
8388608+0 records in
8388608+0 records out
4294967296 bytes (4.3 GB) copied, 21.8692 s, 196 MB/s
8388608+0 records in
8388608+0 records out
4294967296 bytes (4.3 GB) copied, 22.3251 s, 192 MB/s
***** work_cached *******
8388608+0 records in
8388608+0 records out
4294967296 bytes (4.3 GB) copied, 99.4572 s, 43.2 MB/s
8388608+0 records in
8388608+0 records out
4294967296 bytes (4.3 GB) copied, 8.21897 s, 523 MB/s
8388608+0 records in
8388608+0 records out
4294967296 bytes (4.3 GB) copied, 8.19792 s, 524 MB/s
======== bs=4K ==========
***** work *******
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 9.82118 s, 437 MB/s
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 10.997 s, 391 MB/s
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 10.5302 s, 408 MB/s
***** work_cached *******
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 91.0911 s, 47.2 MB/s
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 1.39473 s, 3.1 GB/s
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 1.31659 s, 3.3 GB/s
======== bs=16K ==========
***** work *******
262144+0 records in
262144+0 records out
4294967296 bytes (4.3 GB) copied, 10.5552 s, 407 MB/s
262144+0 records in
262144+0 records out
4294967296 bytes (4.3 GB) copied, 12.229 s, 351 MB/s
262144+0 records in
262144+0 records out
4294967296 bytes (4.3 GB) copied, 10.8311 s, 397 MB/s
***** work_cached *******
262144+0 records in
262144+0 records out
4294967296 bytes (4.3 GB) copied, 90.2333 s, 47.6 MB/s
262144+0 records in
262144+0 records out
4294967296 bytes (4.3 GB) copied, 0.711975 s, 6.0 GB/s
262144+0 records in
262144+0 records out
4294967296 bytes (4.3 GB) copied, 0.649651 s, 6.6 GB/s
======== bs=256K ==========
***** work *******
16384+0 records in
16384+0 records out
4294967296 bytes (4.3 GB) copied, 9.44409 s, 455 MB/s
16384+0 records in
16384+0 records out
4294967296 bytes (4.3 GB) copied, 8.92695 s, 481 MB/s
16384+0 records in
16384+0 records out
4294967296 bytes (4.3 GB) copied, 7.98414 s, 538 MB/s
***** work_cached *******
16384+0 records in
16384+0 records out
4294967296 bytes (4.3 GB) copied, 88.6168 s, 48.5 MB/s
16384+0 records in
16384+0 records out
4294967296 bytes (4.3 GB) copied, 0.551815 s, 7.8 GB/s
16384+0 records in
16384+0 records out
4294967296 bytes (4.3 GB) copied, 0.487542 s, 8.8 GB/s
======== bs=1M ==========
***** work *******
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 26.9552 s, 159 MB/s
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 24.0293 s, 179 MB/s
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 23.6395 s, 182 MB/s
***** work_cached *******
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 25.2812 s, 170 MB/s
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 20.2517 s, 212 MB/s
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 25.3047 s, 170 MB/s


So, I see the cache working nicely once the data has been read in. For
1M block size, the cache is inactive - by design, I presume, given the
cache buffer size.

But it is very consistent that the first read into the cache is
extremely slow, below 50 MB/s, compared to the uncached read rate of
around 200 to 400 MB/s.

There must be some rather obvious bottleneck in the management of the
cache blocks that slows things down. Any chance of identifying it?
Paying with a quarter or less of the initial transfer rate is a rather
hefty tradeoff.


Alrighty then,