
Bug#1050446: nfs over rdma between debian12 client and debian11 server can cause data corruption


Alois Schlögl

Aug 24, 2023, 1:40:04 PM
Package: nfs-common,rdma-core

I've been testing the upgrade of a compute node from Debian 11 to Debian 12.
That node mounts its storage over NFS with the RDMA transport (proto=rdma) from a storage server running Debian 11.
The compute node and the storage server are part of a high-performance compute cluster and are connected over InfiniBand.
I'm not sure whether it matters, but the storage server exports a ZFS file system.

After the upgrade of the compute node (the NFS client) to Debian 12, this machine could not correctly read a few (small) files. The files were listed correctly by "ls", and their sizes matched as well.
However, the content was corrupted (it looked like random garbage). In one case .ssh/authorized_keys was corrupted; in another case "version.lua" from the lmod system was affected, rendering lmod unusable.
Interestingly, only very few files seemed to be affected. Most files were retrieved correctly.

So this is a very subtle error, and not an obvious one.
When these files were read, no error was reported, and data of the expected size was returned.
The returned data was nevertheless corrupted, which could lead to data loss.
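
Because the reads fail silently, one way to find affected files is to compare checksums against a copy of the same data read over a path known to be good. The following is only a minimal sketch and was not part of the original tests. It assumes the same export is mounted twice on the client, once with proto=rdma and once with proto=tcp; both mount points are hypothetical arguments, and client-side caching may influence what each mount returns.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(suspect_root, reference_root):
    """Yield relative paths whose content differs between the two trees.

    suspect_root:   hypothetical mount point using proto=rdma
    reference_root: hypothetical mount point of the same export using proto=tcp
    """
    for dirpath, _dirnames, filenames in os.walk(reference_root):
        for name in filenames:
            ref_path = os.path.join(dirpath, name)
            rel_path = os.path.relpath(ref_path, reference_root)
            sus_path = os.path.join(suspect_root, rel_path)
            if not os.path.isfile(sus_path):
                continue  # only compare files visible on both mounts
            if sha256_of(ref_path) != sha256_of(sus_path):
                yield rel_path
```

Calling list(find_mismatches(rdma_mount, tcp_mount)) with the two mount points would then list the files that differ, relative to the export root.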

The compute node on Debian 12 had:

ii  libnfsidmap1:amd64  1:2.6.2-4  amd64  NFS idmapping library
ii  nfs-common          1:2.6.2-4  amd64  NFS support files common to client and server
ii  librdmacm1:amd64    44.0-2     amd64  Library for managing RDMA connections
ii  rdma-core           44.0-2     amd64  RDMA core userspace infrastructure and documentation
ii  rdmacm-utils        44.0-2     amd64  Examples for the librdmacm library


The storage server on Debian 11 had:

ii  nfs-common         1:1.3.4-6  amd64  NFS support files common to client and server
ii  nfs-kernel-server  1:1.3.4-6  amd64  support for NFS kernel server
ii  librdmacm1:amd64   33.2-1     amd64  Library for managing RDMA connections


The problem went away when the NFS mount protocol was changed from proto=rdma to proto=tcp.
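
To see which mounts on a client would need this change, the mount table can be scanned for NFS entries that carry proto=rdma. This is a rough sketch, not something from the original report. It assumes the Linux /proc/mounts field layout (device, mount point, file system type, options) and that the transport appears in the options as proto=rdma; the function name is made up for illustration.

```python
def rdma_nfs_mounts(mounts_text):
    """Return mount points of NFS file systems mounted with proto=rdma.

    mounts_text is expected to be the content of /proc/mounts, one mount
    per line: device, mount point, file system type, options, ...
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fs_type, options = fields[:4]
        # Match the option exactly rather than as a substring.
        if fs_type in ("nfs", "nfs4") and "proto=rdma" in options.split(","):
            result.append(mount_point)
    return result
```

On a client, passing the text read from /proc/mounts to rdma_nfs_mounts() gives the mount points to remount with proto=tcp as a workaround.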

I tried to learn about this incompatibility, but did not find any information.
I'm also curious whether an nfs 2.6 server would correctly talk to an nfs 1.3 client over rdma.
Can anyone provide more information on this topic?

Alois Schlögl

Nov 9, 2023, 1:20:05 PM

Severity: wishlist

We ran more tests and observed that upgrading the storage server to
Debian 12 (bookworm) solves the issue.
The upgraded storage server can be accessed through rdma from both debian11 and debian12 clients.

That means the problem occurs only when the storage server is on
debian11 and the client is on debian12, mounted with proto=rdma.

This information is good enough for us, as we'll upgrade the
storage server first.

Another workaround is to switch to proto=tcp.

As this issue (proto=rdma from debian12 client to debian11 server) is
difficult to fix and a workaround exists, I'll downgrade the severity.