[client_module] BeeGFS client crash on armv8.2+ machine

Xinliang Liu

Jun 24, 2021, 10:16:48 AM
to fhgfs...@googlegroups.com, chenxu...@hisilicon.com, daizh...@huawei.com, fengch...@huawei.com, hanfe...@hisilicon.com, Jammy Zhou, jiangj...@huawei.com, Kevin Zhao, lar...@huawei.com, liuqi...@hisilicon.com, luor...@huawei.com, wangzen...@huawei.com, wuqi...@huawei.com, zhang...@huawei.com, Xinliang Liu, matthia...@huawei.com
Hi BeeGFS,
I’m Xinliang from Linaro[1], which focuses on open source software for Arm.

Crash issue background
==================
Recently, a customer told me that the BeeGFS client causes a kernel crash[2] on some new armv8.2 machines. Unfortunately, this means the BeeGFS client only works on older armv8 machines :(. The crash appears to be related to the PAN[3] and UAO[4] armv8 CPU features.

I then reproduced the crash, investigated the root cause, and finally worked out a draft patch that works. Below I describe the root cause and my thoughts on a long-term solution, and I paste the draft patch at the end. I’m not a BeeGFS expert, so please help review this, let me know if I am wrong, and share any input. Thanks.

The crash root cause
================
First, some background on set_fs() usage, in case readers are not familiar with this kernel function, because this crash is related to set_fs(). Usually people add set_fs(KERNEL_DS) before calling a system call with a kernel address as argument, in order to pass the user address check (a.k.a. access_ok()). When calling a system call with a user address as argument, there is no need to add set_fs(KERNEL_DS).
Also see the explanation at the KERNEL_DS definition[5] for more details.
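
As a rough illustration of the address check described above, here is a small user space C model. It is not kernel code: the names model_set_fs(), model_access_ok() and the MODEL_* limits are made up for this sketch, and the 48-bit user/kernel split is an assumption chosen to match the addresses in the crash log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limits: user addresses live below 2^48, kernel ones above. */
#define MODEL_USER_DS   ((uint64_t)1 << 48)
#define MODEL_KERNEL_DS (UINT64_MAX)          /* no limit: any address passes */

/* Stand-in for the per-thread address limit that set_fs() changes. */
static uint64_t model_addr_limit = MODEL_USER_DS;

static inline uint64_t model_get_fs(void)
{
    return model_addr_limit;
}

static inline void model_set_fs(uint64_t limit)
{
    /* Only the two limits of this model are meaningful. */
    assert(limit == MODEL_USER_DS || limit == MODEL_KERNEL_DS);
    model_addr_limit = limit;
}

/* Simplified access_ok(): the whole range must lie below the current limit. */
static inline bool model_access_ok(uint64_t addr, size_t size)
{
    if (size > model_addr_limit)
        return false;
    return addr <= model_addr_limit - size;
}
```

With the default limit, model_access_ok() accepts the user address 0x0000ffff890a0010 from the crash log and rejects a kernel address such as 0xffffd35783db0072. After model_set_fs(MODEL_KERNEL_DS) the kernel address passes as well, which is the only reason to raise the limit.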

However, the BeeGFS client kernel module performs all socket operations, such as sock_sendmsg/recvmsg, under set_fs(KERNEL_DS), no matter whether the operation works on user space or kernel space memory. Many of these socket operations are reached via system calls from user space, and those should not normally run under set_fs(KERNEL_DS). This is the root cause of the crash on the new armv8 machines. The crash (in fact inside the copy_from/to_user functions) happens on new ARMv8.2+ hardware that has both the PAN and UAO CPU features: if PAN and UAO are on and fs=KERNEL_DS, user space access is not permitted, and attempting it causes a memory access permission fault:
“Unable to handle kernel access to user memory with fs=KERNEL_DS at virtual address 0000ffff890a0010” (the detailed crash log is in the draft patch at the end of this mail)

Let’s see how we get into this situation. UAO can be toggled on/off via set_fs()[6], whereas PAN cannot be toggled through a kernel API; to keep things simple, we can assume PAN is always on here. Usually the kernel toggles PAN and UAO itself and everything works properly. If we toggle UAO wrongly through an explicit set_fs() (in this case, turning UAO on while serving a system call from user space), the kernel's user access fails with a memory access permission fault and the kernel crashes. Thus armv8’s PAN and UAO make user address checking stricter than on other CPU architectures, e.g. x86. That’s why other CPU platforms and old armv8 machines don’t crash.
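
The interaction above can be summarised in a simplified user space C model. This is a sketch, assuming the v4.18 arm64 behaviour referenced in [6]: without UAO the uaccess routines use ordinary loads/stores and lift PAN around the copy, while with UAO they use unprivileged loads/stores and leave PAN set, and set_fs(KERNEL_DS) turns UAO on so that those accesses become privileged. The struct and function names are hypothetical, and software PAN emulation is ignored.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of the CPU features relevant to the crash. */
struct model_cpu {
    bool has_pan;  /* Privileged Access Never (ARMv8.1) */
    bool has_uao;  /* User Access Override (ARMv8.2) */
};

/*
 * Returns true if copy_from_user()/copy_to_user() on a USER address would
 * take a permission fault in this simplified model.
 */
static inline bool model_user_copy_faults(struct model_cpu cpu,
                                          bool fs_is_kernel_ds)
{
    /* A CPU with UAO but without PAN is not covered by this model. */
    assert(!cpu.has_uao || cpu.has_pan);

    if (!cpu.has_uao) {
        /*
         * Ordinary loads/stores are used; if PAN exists it is switched
         * off for the duration of the copy, so user memory is reachable
         * regardless of the address limit.
         */
        return false;
    }

    /*
     * With UAO, unprivileged loads/stores are used and PAN stays on.
     * Under KERNEL_DS the UAO bit makes them behave as privileged
     * accesses, which PAN then blocks for user memory.
     */
    return fs_is_kernel_ds;
}
```

In this model only the combination "PAN and UAO present, fs=KERNEL_DS, user address" faults, which matches the "+PAN +UAO" pstate shown in the crash log and the observation that older armv8 machines are unaffected.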

How to fix
========

The solution is to separate out the socket operations that run in the kernel space, use set_fs(KERNEL_DS) only for those, and remove the unnecessary set_fs(KERNEL_DS) from the user space socket operations. Or even remove the usage of set_fs() altogether, as the mainline v5.1x kernel does now.
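
The separation can be pictured with a small user space C model. It is only a sketch: the MODEL_* macros assume that ACQUIRE_PROCESS_CONTEXT saves the current limit and switches to KERNEL_DS and that RELEASE_PROCESS_CONTEXT restores it, and all model_* names are hypothetical. A return value of -1 stands in for the fault on a PAN+UAO machine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_fs { MODEL_USER_DS, MODEL_KERNEL_DS };
enum model_buf_kind { MODEL_BUF_USER, MODEL_BUF_KERNEL };

/* Stand-in for the current thread's address limit. */
static enum model_fs model_current_fs = MODEL_USER_DS;

/* Assumed shape of the BeeGFS macros: save, switch to KERNEL_DS, restore. */
#define MODEL_ACQUIRE_PROCESS_CONTEXT(old) \
    do { (old) = model_current_fs; model_current_fs = MODEL_KERNEL_DS; } while (0)
#define MODEL_RELEASE_PROCESS_CONTEXT(old) \
    do { model_current_fs = (old); } while (0)

/*
 * Low-level send. It succeeds only when the address limit matches the
 * kind of buffer being sent; otherwise it returns -1.
 */
static inline long model_low_level_send(enum model_buf_kind kind, size_t len)
{
    bool ok = (kind == MODEL_BUF_KERNEL) ==
              (model_current_fs == MODEL_KERNEL_DS);
    return ok ? (long)len : -1;
}

/* User space path: the buffer comes from a system call, limit untouched. */
static inline long model_socket_send(size_t len)
{
    return model_low_level_send(MODEL_BUF_USER, len);
}

/* Kernel space path: raise the limit only around this one operation. */
static inline long model_kernel_socket_send(size_t len)
{
    enum model_fs oldfs;
    long res;

    MODEL_ACQUIRE_PROCESS_CONTEXT(oldfs);
    res = model_low_level_send(MODEL_BUF_KERNEL, len);
    MODEL_RELEASE_PROCESS_CONTEXT(oldfs);

    assert(model_current_fs == oldfs);  /* limit is restored on exit */
    return res;
}
```

Both model_socket_send() and model_kernel_socket_send() succeed here. If the limit is forced to MODEL_KERNEL_DS before calling model_socket_send(), which corresponds to the behaviour before the patch, the user buffer send fails.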

Here are the things we need to do to work out a long-term fix, IMO:
1) Identify which socket operations are in the kernel space, usually those that take a kernel address or buffer. Unfortunately, there are a lot of sendmsg/recvmsg socket operations in the kernel space.
2) Separate the socket operations in the kernel space. Only use set_fs(KERNEL_DS) for them and remove the unnecessary set_fs(KERNEL_DS) from the user space socket operations. My draft patch is only at this step now.
3) Preferably, name the kernel space socket operation functions Kernel_xxx() so that kernel space socket operations are clearly marked. Also, taking the "universal pointer" type sockptr_t[7] as a reference, handle the kernel/user space cases in one place for socket sendmsg/recvmsg instead of dealing with them everywhere, because these socket operations are per msg/address.
4) Ideally, if possible, work out a long-term fix that removes the set_fs() usages. From kernel v5.11 on, set_fs is removed for some main CPU architectures, such as x86 and arm64. See [8] for how to call system calls from a kernel process without set_fs; usually we need to call the kernel versions of the socket operation functions, kernel_xx()[9].
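
To make step 3 more concrete, here is a user space C sketch of the "universal pointer" idea: the pointer carries a flag saying which address space it belongs to, and a single helper decides how to copy. The type model_bufptr_t and the model_* helpers are hypothetical and only loosely modelled on the kernel's sockptr_t. In a kernel the two branches would be a plain memcpy() and copy_from_user(); here both are simulated with memcpy(), and counters make the dispatch visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* One pointer plus a tag telling which address space it points into. */
typedef struct {
    const void *ptr;
    bool is_kernel;
} model_bufptr_t;

static inline model_bufptr_t model_kernel_bufptr(const void *p)
{
    return (model_bufptr_t){ .ptr = p, .is_kernel = true };
}

static inline model_bufptr_t model_user_bufptr(const void *p)
{
    return (model_bufptr_t){ .ptr = p, .is_kernel = false };
}

/* Counters standing in for the two real copy routines. */
static int model_kernel_copies;
static int model_user_copies;

/*
 * The single place that distinguishes kernel from user buffers, so the
 * callers never need to touch the address limit themselves.
 */
static inline int model_copy_from_bufptr(void *dst, model_bufptr_t src,
                                         size_t len)
{
    assert(dst != NULL && src.ptr != NULL);

    if (src.is_kernel)
        model_kernel_copies++;  /* kernel: would be memcpy() */
    else
        model_user_copies++;    /* kernel: would be copy_from_user() */

    memcpy(dst, src.ptr, len);  /* user space stand-in for both paths */
    return 0;
}
```

The design point is that the caller states once where the buffer lives, when constructing the pointer, rather than switching global state around every socket call.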

[1]: https://www.linaro.org
[2]: “Unable to handle kernel access to user memory with fs=KERNEL_DS at virtual address 0000ffff890a0010” (the detailed crash log is in the draft patch at the end of this mail)
[3]: https://developer.arm.com/documentation/ddi0595/2021-03/AArch64-Registers/PAN--Privileged-Access-Never
[4]: https://developer.arm.com/documentation/ddi0595/2021-03/AArch64-Registers/UAO--User-Access-Override
[5]: https://elixir.bootlin.com/linux/v4.18/source/arch/x86/include/asm/uaccess.h#L16
[6]: https://elixir.bootlin.com/linux/v4.18/source/arch/arm64/include/asm/uaccess.h#L59
[7]: https://lwn.net/ml/linux-kernel/2020072306090...@lst.de/
[8]: https://lwn.net/Articles/832121/
[9]: https://elixir.bootlin.com/linux/v4.18/source/net/socket.c

Best,
Xinliang

Draft patch based on v7.2.2 pasted and attached
=====================================
From db3e47fc3e388ed889f396cbaadcb8d5ac2a9005 Mon Sep 17 00:00:00 2001
From: Xinliang Liu <xinlia...@linaro.org>
Date: Thu, 17 Jun 2021 03:53:20 +0000
Subject: [PATCH v4] Separate socket operations in the kernel space

Separate socket operations in the kernel space. Preferably, name kernel
space socket operation functions Kernel_xxx(). Only use set_fs(KERNEL_DS)
in kernel space socket operations and remove the unnecessary
set_fs(KERNEL_DS) from the user space socket operations.

Usually people add set_fs(KERNEL_DS) before calling a system call with a
kernel address as argument, in order to pass the user address check
(a.k.a. access_ok()). When calling a system call with a user address as
argument, there is no need to add set_fs(KERNEL_DS).
Also see the explanation at the KERNEL_DS definition[1] for more details.

This fixes the crash below, which happens on new ARMv8.2+ hardware that
has both the PAN[2] and UAO[3] CPU features. User space access is not
permitted if PAN and UAO are on and fs=KERNEL_DS, and UAO can be toggled
on/off via set_fs()[4]. Usually the kernel toggles PAN and UAO on/off
automatically and works properly. If we toggle UAO wrongly via an explicit
set_fs() (in this case, turning UAO on while serving a system call from
user space), the kernel's user access fails with a memory access
permission fault and crashes. Thus PAN and UAO make user address checking
stricter.

BTW, from kernel v5.11 on, set_fs is removed for some main CPU arches,
such as x86 and arm64. See [5] for how to call system calls from a kernel
process without set_fs; usually we need to call the kernel versions of
the socket operation functions, kernel_xx()[6].

[1]: https://elixir.bootlin.com/linux/v4.18/source/arch/x86/include/asm/uaccess.h#L16
[2]: https://developer.arm.com/documentation/ddi0595/2021-03/AArch64-Registers/PAN--Privileged-Access-Never
[3]: https://developer.arm.com/documentation/ddi0595/2021-03/AArch64-Registers/UAO--User-Access-Override
[4]: https://elixir.bootlin.com/linux/v4.18/source/arch/arm64/include/asm/uaccess.h#L59
[5]: https://lwn.net/Articles/832121/
[6]: https://elixir.bootlin.com/linux/v4.18/source/net/socket.c

--------CentOS8 crash log-------------
...
[ 1447.277321] XFS (vdb): Mounting V5 Filesystem
[ 1447.404286] XFS (vdb): Ending clean mount
[ 1980.470989] beegfs: mount(2478): BeeGFS mount ready.
[ 2550.404306] Unable to handle kernel access to user memory with fs=KERNEL_DS at virtual address 0000ffff890a0010
[ 2550.410472] Mem abort info:
[ 2550.411605]   ESR = 0x9600000f
[ 2550.412844]   Exception class = DABT (current EL), IL = 32 bits
[ 2550.415515]   SET = 0, FnV = 0
[ 2550.416763]   EA = 0, S1PTW = 0
[ 2550.418085] Data abort info:
[ 2550.419254]   ISV = 0, ISS = 0x0000000f
[ 2550.420803]   CM = 0, WnR = 0
[ 2550.422033] user pgtable: 64k pages, 48-bit VAs, pgdp = 0000000053fceea2
[ 2550.424938] [0000ffff890a0010] pgd=0000000203050003, pud=0000000203050003, pmd=0000000203820003, pte=00e80001c2bb0f53
[ 2550.429307] Internal error: Oops: 9600000f [#1] SMP
[ 2550.431294] Modules linked in: beegfs(OE) rdma_cm iw_cm ib_cm ib_core virtio_gpu drm_kms_helper drm crct10dif_ce ghash_ce sha2_ce fb_sys_fops syscopyarea sysfillrect sysimgblt sha256_arm64 sha1_ce virtio_balloon vfat fat ip_tables xfs libcrc32c virtio_net virtio_blk net_failover failover virtio_mmio sunrpc dm_mirror dm_region_hash dm_log dm_mod
[ 2550.444728] CPU: 2 PID: 2836 Comm: scp Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-305.3.1.el8.aarch64 #1
[ 2550.449792] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[ 2550.452721] pstate: 20c00005 (nzCv daif +PAN +UAO)
[ 2550.454815] pc : __arch_copy_from_user+0x180/0x240
[ 2550.456919] lr : copyin+0x54/0x68
[ 2550.458364] sp : ffff00001434f5a0
[ 2550.459784] x29: ffff00001434f5a0 x28: ffffd35787c1a000
[ 2550.462109] x27: ffffd35783db0072 x26: ffff00001434f7c0
[ 2550.464424] x25: ffff00001434f7d0 x24: 000000000000ff8e
[ 2550.466734] x23: ffffd35787ee0c00 x22: 0000000000000000
[ 2550.469011] x21: ffff00001434fb60 x20: 0000000000000000
[ 2550.471326] x19: 000000000000ff8e x18: 0000000000000001
[ 2550.473624] x17: 0000ffff96bd7cf0 x16: ffff42c4cd132848
[ 2550.475899] x15: 0000ffff96c316f0 x14: 000000000000eed6
[ 2550.478199] x13: 000000002e01a8c0 x12: 000000002e01a8c0
[ 2550.480477] x11: ffff42c4ce476b00 x10: 0000000000000002
[ 2550.482765] x9 : 000000000000ff8e x8 : 0000000000000000
[ 2550.485051] x7 : 00000001f8570000 x6 : ffffd35783db0072
[ 2550.487344] x5 : ffffd35783dc0000 x4 : 0000000000000000
[ 2550.489594] x3 : 0000ffff890a0010 x2 : 000000000000ff0e
[ 2550.491821] x1 : 0000ffff890a0010 x0 : ffffd35783db0072
[ 2550.494098] Process scp (pid: 2836, stack limit = 0x0000000010a319e7)
[ 2550.496806] Call trace:
[ 2550.497862]  __arch_copy_from_user+0x180/0x240
[ 2550.499724]  _copy_from_iter_full+0x80/0x2c0
[ 2550.501540]  tcp_sendmsg_locked+0x808/0xc00
[ 2550.503314]  tcp_sendmsg+0x40/0x60
[ 2550.504758]  inet_sendmsg+0x4c/0x70
[ 2550.506253]  sock_sendmsg+0x4c/0x68
[ 2550.507758]  _StandardSocket_sendto+0xd8/0x178 [beegfs]
[ 2550.509987]  __commkit_writefile_sendData+0x90/0x140 [beegfs]
[ 2550.512406]  FhgfsOpsCommkit_communicate+0x114/0xb28 [beegfs]
[ 2550.514875]  FhgfsOpsCommKit_writefileV2bCommunicate+0x68/0x90 [beegfs]
[ 2550.517673]  FhgfsOpsRemoting_writefileVec+0x2d8/0x538 [beegfs]
[ 2550.520164]  FhgfsOpsRemoting_writefile+0x70/0x98 [beegfs]
[ 2550.522501]  FhgfsOpsHelper_writefileEx+0x60/0xa8 [beegfs]
[ 2550.524842]  __FhgfsOpsHelper_writeCacheFlushed+0x1f8/0x298 [beegfs]
[ 2550.527543]  FhgfsOpsHelper_writeCached+0x104/0x2b0 [beegfs]
[ 2550.529955]  FhgfsOps_write+0x1f4/0x360 [beegfs]
[ 2550.531894]  __vfs_write+0x48/0x90
[ 2550.533364]  vfs_write+0xac/0x1b8
[ 2550.534754]  ksys_write+0x6c/0xd0
[ 2550.536144]  __arm64_sys_write+0x24/0x30
[ 2550.537811]  el0_svc_handler+0xb0/0x180
[ 2550.539416]  el0_svc+0x8/0xc
[ 2550.540646] Code: d503201f d503201f d503201f d503201f (f8400827)
[ 2550.543361] SMP: stopping secondary CPUs
[ 2550.548621] Starting crashdump kernel...
[ 2550.550183] Bye!

--------CentOS8 crash log-------------

Signed-off-by: Xinliang Liu <xinlia...@linaro.org>
---
 .../net/message/nodes/HeartbeatRequestMsgEx.c |   2 +-
 .../net/message/nodes/RemoveNodeMsgEx.c       |   2 +-
 .../common/net/msghelpers/MsgHelperAck.h      |   2 +-
 .../common/net/sock/NetworkInterfaceCard.c    |   2 +-
 .../source/common/net/sock/RDMASocket.c       |  10 --
 client_module/source/common/net/sock/Socket.c |  33 +++++-
 client_module/source/common/net/sock/Socket.h | 103 +++++++++++++++++-
 .../source/common/net/sock/StandardSocket.c   |  28 +----
 .../source/common/nodes/NodeConnPool.c        |   8 +-
 .../source/common/toolkit/MessagingTk.c       |   2 +-
 .../source/common/toolkit/MessagingTk.h       |   4 +-
 client_module/source/components/AckManager.c  |   2 +-
 .../source/components/DatagramListener.c      |   8 +-
 .../source/components/DatagramListener.h      |   2 +-
 .../source/filesystem/FhgfsOpsFile.c          |  12 ++
 .../source/filesystem/FhgfsOpsHelper.c        |   3 +
 .../source/net/filesystem/FhgfsOpsCommKit.c   |  37 ++++++-
 .../net/filesystem/FhgfsOpsCommKitVec.c       |  10 +-
 18 files changed, 199 insertions(+), 71 deletions(-)

diff --git a/client_module/source/common/net/message/nodes/HeartbeatRequestMsgEx.c b/client_module/source/common/net/message/nodes/HeartbeatRequestMsgEx.c
index 7cf186d..bd4ff55 100644
--- a/client_module/source/common/net/message/nodes/HeartbeatRequestMsgEx.c
+++ b/client_module/source/common/net/message/nodes/HeartbeatRequestMsgEx.c
@@ -48,7 +48,7 @@ bool __HeartbeatRequestMsgEx_processIncoming(NetMessage* this, struct App* app,
       sendRes = DatagramListener_sendto(dgramLis, respBuf, respLen, 0, fromAddr);
    }
    else
-      sendRes = Socket_sendto(sock, respBuf, respLen, 0, NULL);
+      sendRes = Kernel_socket_sendto(sock, respBuf, respLen, 0, NULL);
 
    if(unlikely(sendRes <= 0) )
       Logger_logErrFormatted(log, logContext, "Send error. ErrCode: %lld", (long long)sendRes);
diff --git a/client_module/source/common/net/message/nodes/RemoveNodeMsgEx.c b/client_module/source/common/net/message/nodes/RemoveNodeMsgEx.c
index 2086734..d89f10a 100644
--- a/client_module/source/common/net/message/nodes/RemoveNodeMsgEx.c
+++ b/client_module/source/common/net/message/nodes/RemoveNodeMsgEx.c
@@ -114,7 +114,7 @@ bool __RemoveNodeMsgEx_processIncoming(NetMessage* this, struct App* app,
          sendRes = DatagramListener_sendto(dgramLis, respBuf, respLen, 0, fromAddr);
       }
       else
-         sendRes = Socket_sendto(sock, respBuf, respLen, 0, NULL);
+         sendRes = Kernel_socket_sendto(sock, respBuf, respLen, 0, NULL);
 
       if(unlikely(sendRes <= 0) )
          Logger_logErrFormatted(log, logContext, "Send error. ErrCode: %lld", (long long)sendRes);
diff --git a/client_module/source/common/net/msghelpers/MsgHelperAck.h b/client_module/source/common/net/msghelpers/MsgHelperAck.h
index dbc744b..2f919e1 100644
--- a/client_module/source/common/net/msghelpers/MsgHelperAck.h
+++ b/client_module/source/common/net/msghelpers/MsgHelperAck.h
@@ -47,7 +47,7 @@ bool MsgHelperAck_respondToAckRequest(App* app, const char* ackID,
       sendRes = DatagramListener_sendto(dgramLis, respBuf, respLen, 0, fromAddr);
    }
    else
-      sendRes = Socket_sendto(sock, respBuf, respLen, 0, NULL);
+      sendRes = Kernel_socket_sendto(sock, respBuf, respLen, 0, NULL);
 
    if(unlikely(sendRes <= 0) )
       Logger_logErrFormatted(log, logContext, "Send error. ErrCode: %lld", (long long)sendRes);
diff --git a/client_module/source/common/net/sock/NetworkInterfaceCard.c b/client_module/source/common/net/sock/NetworkInterfaceCard.c
index cdf7a95..7b0f2e8 100644
--- a/client_module/source/common/net/sock/NetworkInterfaceCard.c
+++ b/client_module/source/common/net/sock/NetworkInterfaceCard.c
@@ -279,7 +279,7 @@ void __NIC_filterInterfacesForRDMA(NicAddressList* nicList, NicAddressList* outL
       if(!RDMASocket_init(&rdmaSock) )
          continue;
 
-      bindRes = sock->ops->bindToAddr(sock, &nicAddr->ipAddr, 0);
+      bindRes = Kernel_socket_bindToAddr(sock, &nicAddr->ipAddr, 0);
 
       if(bindRes)
       { // we've got an RDMA-capable interface => append it to outList
diff --git a/client_module/source/common/net/sock/RDMASocket.c b/client_module/source/common/net/sock/RDMASocket.c
index 30f8cb6..9d19dc1 100644
--- a/client_module/source/common/net/sock/RDMASocket.c
+++ b/client_module/source/common/net/sock/RDMASocket.c
@@ -185,14 +185,9 @@ ssize_t _RDMASocket_recvT(Socket* this, struct iov_iter* iter, int flags, int ti
    RDMASocket* thisCast = (RDMASocket*)this;
 
    ssize_t retVal;
-   mm_segment_t oldfs;
-
-   ACQUIRE_PROCESS_CONTEXT(oldfs);
 
    retVal = IBVSocket_recvT(&thisCast->ibvsock, iter, flags, timeoutMS);
 
-   RELEASE_PROCESS_CONTEXT(oldfs);
-
    return retVal;
 }
 
@@ -207,14 +202,9 @@ ssize_t _RDMASocket_sendto(Socket* this, struct iov_iter* iter, int flags,
    RDMASocket* thisCast = (RDMASocket*)this;
 
    ssize_t retVal;
-   mm_segment_t oldfs;
-
-   ACQUIRE_PROCESS_CONTEXT(oldfs);
 
    retVal = IBVSocket_send(&thisCast->ibvsock, iter, flags);
 
-   RELEASE_PROCESS_CONTEXT(oldfs);
-
    return retVal;
 }
 
diff --git a/client_module/source/common/net/sock/Socket.c b/client_module/source/common/net/sock/Socket.c
index 12ca2ed..4e72be7 100644
--- a/client_module/source/common/net/sock/Socket.c
+++ b/client_module/source/common/net/sock/Socket.c
@@ -16,14 +16,39 @@ void _Socket_uninit(Socket* this)
 }
 
 
-bool Socket_bind(Socket* this, unsigned short port)
+bool Kernel_socket_bind(Socket* this, unsigned short port)
 {
    struct in_addr ipAddr = { INADDR_ANY };
+   mm_segment_t oldfs;
+   bool ret;
 
-   return this->ops->bindToAddr(this, &ipAddr, port);
+   ACQUIRE_PROCESS_CONTEXT(oldfs);
+   ret = this->ops->bindToAddr(this, &ipAddr, port);
+   RELEASE_PROCESS_CONTEXT(oldfs);
+
+   return ret;
+}
+
+bool Kernel_socket_bindToAddr(Socket* this, struct in_addr* ipAddr, unsigned short port)
+{
+   mm_segment_t oldfs;
+   bool ret;
+
+   ACQUIRE_PROCESS_CONTEXT(oldfs);
+   ret =  this->ops->bindToAddr(this, ipAddr, port);
+   RELEASE_PROCESS_CONTEXT(oldfs);
+
+   return ret;
 }
 
-bool Socket_bindToAddr(Socket* this, struct in_addr* ipAddr, unsigned short port)
+bool Kernel_socket_connectByIP(Socket* this, struct in_addr* ipAddr, unsigned short port)
 {
-   return this->ops->bindToAddr(this, ipAddr, port);
+   mm_segment_t oldfs;
+   bool ret;
+
+   ACQUIRE_PROCESS_CONTEXT(oldfs);
+   ret = this->ops->connectByIP(this, ipAddr, port);
+   RELEASE_PROCESS_CONTEXT(oldfs);
+
+   return ret;
 }
diff --git a/client_module/source/common/net/sock/Socket.h b/client_module/source/common/net/sock/Socket.h
index addc881..e763b24 100644
--- a/client_module/source/common/net/sock/Socket.h
+++ b/client_module/source/common/net/sock/Socket.h
@@ -24,8 +24,10 @@ typedef struct Socket Socket;
 extern void _Socket_init(Socket* this);
 extern void _Socket_uninit(Socket* this);
 
-extern bool Socket_bind(Socket* this, unsigned short port);
-extern bool Socket_bindToAddr(Socket* this, struct in_addr* ipAddr, unsigned short port);
+// socket functions called in kernel space with kernel address
+extern bool Kernel_socket_bind(Socket* this, unsigned short port);
+extern bool Kernel_socket_bindToAddr(Socket* this, struct in_addr* ipAddr, unsigned short port);
+extern bool Kernel_socket_connectByIP(Socket* this, struct in_addr* ipAddr, unsigned short port);
 
 // getters & setters
 static inline NicAddrType_t Socket_getSockType(Socket* this);
@@ -38,9 +40,10 @@ static inline ssize_t Socket_recvExactT(Socket* this, void *buf, size_t len, int
    int timeoutMS);
 static inline ssize_t Socket_recvExactTEx(Socket* this, void *buf, size_t len, int flags,
    int timeoutMS, size_t* outNumReceivedBeforeError);
-
-
-
+static inline ssize_t Kernel_socket_recvExactT(Socket* this, void *buf, size_t len, int flags,
+   int timeoutMS);
+static inline ssize_t Kernel_socket_recvExactTEx(Socket* this, void *buf, size_t len, int flags,
+   int timeoutMS, size_t* outNumReceivedBeforeError);
 
 struct SocketOps
 {
@@ -106,6 +109,27 @@ static inline ssize_t Socket_recvT(Socket* this, void* buf, size_t len, int flag
    return this->ops->recvT(this, &iter, flags, timeoutMS);
 }
 
+// called in the kernel space
+static inline ssize_t Kernel_socket_recvT(Socket* this, void* buf, size_t len, int flags, int timeoutMS)
+{
+   struct iovec iov = { buf, len };
+   struct iov_iter iter;
+   mm_segment_t oldfs;
+   ssize_t recvRes;
+
+   ACQUIRE_PROCESS_CONTEXT(oldfs);
+
+   BEEGFS_IOV_ITER_INIT(&iter, READ, &iov, 1, len);
+   BUG_ON(!(iter.type & ITER_KVEC));
+
+   recvRes =  this->ops->recvT(this, &iter, flags, timeoutMS);
+
+   RELEASE_PROCESS_CONTEXT(oldfs);
+
+   return recvRes;
+}
+
+
 /**
  * Receive with timeout.
  *
@@ -118,6 +142,14 @@ ssize_t Socket_recvExactT(Socket* this, void *buf, size_t len, int flags, int ti
    return Socket_recvExactTEx(this, buf, len, flags, timeoutMS, &numReceivedBeforeError);
 }
 
+// called in the kernel space
+ssize_t Kernel_socket_recvExactT(Socket* this, void *buf, size_t len, int flags, int timeoutMS)
+{
+   size_t numReceivedBeforeError;
+
+   return Kernel_socket_recvExactTEx(this, buf, len, flags, timeoutMS, &numReceivedBeforeError);
+}
+
 /**
  * Receive with timeout, extended version with numReceivedBeforeError.
  *
@@ -156,6 +188,41 @@ ssize_t Socket_recvExactTEx(Socket* this, void *buf, size_t len, int flags, int
    return len;
 }
 
+// called in the kernel space
+ssize_t Kernel_socket_recvExactTEx(Socket* this, void *buf, size_t len, int flags, int timeoutMS,
+   size_t* outNumReceivedBeforeError)
+{
+   ssize_t missingLen = len;
+   ssize_t recvRes;
+
+   do
+   {
+      struct iovec iov = { buf + (len - missingLen), missingLen };
+      struct iov_iter iter;
+      mm_segment_t oldfs;
+
+      ACQUIRE_PROCESS_CONTEXT(oldfs);
+
+      BEEGFS_IOV_ITER_INIT(&iter, READ, &iov, 1, missingLen);
+      BUG_ON(!(iter.type & ITER_KVEC));
+
+      recvRes = this->ops->recvT(this, &iter, flags, timeoutMS);
+
+      RELEASE_PROCESS_CONTEXT(oldfs);
+
+      if(unlikely(recvRes <= 0) )
+         return recvRes;
+
+      missingLen -= recvRes;
+      *outNumReceivedBeforeError += recvRes;
+
+   } while(missingLen);
+
+   // all received if we got here
+
+   return len;
+}
+
 static inline ssize_t Socket_sendto(Socket* this, const void* buf, size_t len, int flags,
    fhgfs_sockaddr_in* to)
 {
@@ -172,5 +239,31 @@ static inline ssize_t Socket_send(Socket* this, const void* buf, size_t len, int
    return Socket_sendto(this, buf, len, flags, NULL);
 }
 
+// called in the kernel space
+static inline ssize_t Kernel_socket_sendto(Socket* this, const void* buf, size_t len, int flags,
+   fhgfs_sockaddr_in* to)
+{
+   struct iovec iov = { (void*) buf, len };
+   struct iov_iter iter;
+   mm_segment_t oldfs;
+   ssize_t sendRes;
+
+   ACQUIRE_PROCESS_CONTEXT(oldfs);
+
+   BEEGFS_IOV_ITER_INIT(&iter, WRITE, &iov, 1, len);
+   BUG_ON(!(iter.type & ITER_KVEC));
+   sendRes =  this->ops->sendto(this, &iter, flags, to);
+
+   RELEASE_PROCESS_CONTEXT(oldfs);
+
+   return sendRes;
+}
+
+// called in the kernel space
+static inline ssize_t Kernel_socket_send(Socket* this, const void* buf, size_t len, int flags)
+{
+   return Kernel_socket_sendto(this, buf, len, flags, NULL);
+}
+
 
 #endif /*SOCKET_H_*/
diff --git a/client_module/source/common/net/sock/StandardSocket.c b/client_module/source/common/net/sock/StandardSocket.c
index 19f2e74..478b293 100644
--- a/client_module/source/common/net/sock/StandardSocket.c
+++ b/client_module/source/common/net/sock/StandardSocket.c
@@ -450,7 +450,6 @@ bool _StandardSocket_connectByIP(Socket* this, struct in_addr* ipaddress, unsign
 
    StandardSocket* thisCast = (StandardSocket*)this;
 
-   mm_segment_t oldfs;
    int connRes;
 
    struct sockaddr_in serveraddr =
@@ -460,17 +459,12 @@ bool _StandardSocket_connectByIP(Socket* this, struct in_addr* ipaddress, unsign
       .sin_port = htons(port),
    };
 
-
-   ACQUIRE_PROCESS_CONTEXT(oldfs);
-
    connRes = thisCast->sock->ops->connect(
       thisCast->sock,
       (struct sockaddr*) &serveraddr,
       sizeof(serveraddr),
       O_NONBLOCK); // non-blocking connect
 
-   RELEASE_PROCESS_CONTEXT(oldfs);
-
    if(connRes)
    {
       if(connRes == -EINPROGRESS)
@@ -582,14 +576,9 @@ bool _StandardSocket_shutdown(Socket* this)
    StandardSocket* thisCast = (StandardSocket*)this;
 
    int sendshutRes;
-   mm_segment_t oldfs;
-
-   ACQUIRE_PROCESS_CONTEXT(oldfs);
 
    sendshutRes = thisCast->sock->ops->shutdown(thisCast->sock, SEND_SHUTDOWN);
 
-   RELEASE_PROCESS_CONTEXT(oldfs);
-
    if( (sendshutRes < 0) && (sendshutRes != -ENOTCONN) )
    {
       printk_fhgfs(KERN_WARNING, "Failed to send shutdown. ErrCode: %d\n", sendshutRes);
@@ -604,22 +593,17 @@ bool _StandardSocket_shutdownAndRecvDisconnect(Socket* this, int timeoutMS)
    bool shutRes;
    char buf[SOCKET_SHUTDOWN_RECV_BUF_LEN];
    int recvRes;
-   mm_segment_t oldfs;
 
    shutRes = this->ops->shutdown(this);
    if(!shutRes)
       return false;
 
-   ACQUIRE_PROCESS_CONTEXT(oldfs);
-
    // receive until shutdown arrives
    do
    {
-      recvRes = Socket_recvT(this, buf, SOCKET_SHUTDOWN_RECV_BUF_LEN, 0, timeoutMS);
+      recvRes = Kernel_socket_recvT(this, buf, SOCKET_SHUTDOWN_RECV_BUF_LEN, 0, timeoutMS);
    } while(recvRes > 0);
 
-   RELEASE_PROCESS_CONTEXT(oldfs);
-
    if(recvRes &&
       (recvRes != -ECONNRESET) )
    { // error occurred (but we're disconnecting, so we don't really care about errors)
@@ -647,7 +631,6 @@ ssize_t _StandardSocket_sendto(Socket* this, struct iov_iter* iter, int flags,
 
    int sendRes;
    size_t len;
-   mm_segment_t oldfs;
 
    struct sockaddr_in toSockAddr;
 
@@ -678,16 +661,12 @@ ssize_t _StandardSocket_sendto(Socket* this, struct iov_iter* iter, int flags,
       toSockAddr.sin_port = to->port;
    }
 
-   ACQUIRE_PROCESS_CONTEXT(oldfs);
-
 #ifndef KERNEL_HAS_SOCK_SENDMSG_NOLEN
    sendRes = sock_sendmsg(thisCast->sock, &msg, len);
 #else
    sendRes = sock_sendmsg(thisCast->sock, &msg);
 #endif
 
-   RELEASE_PROCESS_CONTEXT(oldfs);
-
    if(sendRes >= 0)
       iov_iter_advance(iter, sendRes);
 
@@ -698,7 +677,6 @@ ssize_t StandardSocket_recvfrom(StandardSocket* this, struct iov_iter* iter, int
    fhgfs_sockaddr_in *from)
 {
    int recvRes;
-   mm_segment_t oldfs;
    size_t len;
 
    struct sockaddr_in fromSockAddr;
@@ -723,16 +701,12 @@ ssize_t StandardSocket_recvfrom(StandardSocket* this, struct iov_iter* iter, int
    len = iter->count;
 #endif // LINUX_VERSION_CODE
 
-   ACQUIRE_PROCESS_CONTEXT(oldfs);
-
 #ifdef KERNEL_HAS_RECVMSG_SIZE
    recvRes = sock_recvmsg(this->sock, &msg, len, flags);
 #else
    recvRes = sock_recvmsg(this->sock, &msg, flags);
 #endif
 
-   RELEASE_PROCESS_CONTEXT(oldfs);
-
    if(recvRes > 0)
       iov_iter_advance(iter, recvRes);
 
diff --git a/client_module/source/common/nodes/NodeConnPool.c b/client_module/source/common/nodes/NodeConnPool.c
index f10ac56..895c75a 100644
--- a/client_module/source/common/nodes/NodeConnPool.c
+++ b/client_module/source/common/nodes/NodeConnPool.c
@@ -268,7 +268,7 @@ Socket* NodeConnPool_acquireStreamSocketEx(NodeConnPool* this, bool allowWaiting
       // the actual connection attempt
       __NodeConnPool_applySocketOptionsPreConnect(this, sock);
 
-      connectRes = sock->ops->connectByIP(sock, &nicAddr->ipAddr, port);
+      connectRes = Kernel_socket_connectByIP(sock, &nicAddr->ipAddr, port);
 
       if(connectRes)
       { // connected
@@ -687,7 +687,7 @@ bool __NodeConnPool_applySocketOptionsConnected(NodeConnPool* this, Socket* sock
       sendBuf = (char*)os_kmalloc(sendBufLen);
 
       NetMessage_serialize( (NetMessage*)&authMsg, sendBuf, sendBufLen);
-      sendRes = Socket_send(sock, sendBuf, sendBufLen, 0);
+      sendRes = Kernel_socket_send(sock, sendBuf, sendBufLen, 0);
       if(sendRes <= 0)
       {
          Logger_log(log, Log_WARNING, logContext, "Failed to send authentication");
@@ -708,7 +708,7 @@ bool __NodeConnPool_applySocketOptionsConnected(NodeConnPool* this, Socket* sock
       sendBuf = (char*)os_kmalloc(sendBufLen);
 
       NetMessage_serialize( (NetMessage*)&directMsg, sendBuf, sendBufLen);
-      sendRes = Socket_send(sock, sendBuf, sendBufLen, 0);
+      sendRes = Kernel_socket_send(sock, sendBuf, sendBufLen, 0);
       if(sendRes <= 0)
       {
          Logger_log(log, Log_WARNING, logContext, "Failed to set channel to indirect mode");
@@ -731,7 +731,7 @@ bool __NodeConnPool_applySocketOptionsConnected(NodeConnPool* this, Socket* sock
       if (sendBuf)
       {
          NetMessage_serialize(&peerInfo.netMessage, sendBuf, sendBufLen);
-         sendRes = Socket_send(sock, sendBuf, sendBufLen, 0);
+         sendRes = Kernel_socket_send(sock, sendBuf, sendBufLen, 0);
       }
 
       if(sendRes <= 0)
diff --git a/client_module/source/common/toolkit/MessagingTk.c b/client_module/source/common/toolkit/MessagingTk.c
index a9db63f..339c9e2 100644
--- a/client_module/source/common/toolkit/MessagingTk.c
+++ b/client_module/source/common/toolkit/MessagingTk.c
@@ -584,7 +584,7 @@ FhgfsOpsErr __MessagingTk_requestResponseWithRRArgsComm(App* app,
 
    // send request
 
-   sendRes = Socket_send(sock, rrArgs->outRespBuf, sendBufLen, 0);
+   sendRes = Kernel_socket_send(sock, rrArgs->outRespBuf, sendBufLen, 0);
 
    if(unlikely(sendRes != (ssize_t)sendBufLen) )
       goto socket_exception;
diff --git a/client_module/source/common/toolkit/MessagingTk.h b/client_module/source/common/toolkit/MessagingTk.h
index 8a94623..2eae30f 100644
--- a/client_module/source/common/toolkit/MessagingTk.h
+++ b/client_module/source/common/toolkit/MessagingTk.h
@@ -85,7 +85,7 @@ ssize_t MessagingTk_recvMsgBuf(App* app, Socket* sock, char* bufIn, size_t bufIn
 
    // receive at least the message header
 
-   recvRes = Socket_recvExactTEx(sock, bufIn, NETMSG_MIN_LENGTH, 0, CONN_LONG_TIMEOUT,
+   recvRes = Kernel_socket_recvExactTEx(sock, bufIn, NETMSG_MIN_LENGTH, 0, CONN_LONG_TIMEOUT,
          &numReceived);
 
    if(unlikely(recvRes <= 0) )
@@ -115,7 +115,7 @@ ssize_t MessagingTk_recvMsgBuf(App* app, Socket* sock, char* bufIn, size_t bufIn
       return -EMSGSIZE;
    }
 
-   recvRes = Socket_recvExactTEx(sock, &bufIn[numReceived], msgLength-numReceived, 0,
+   recvRes = Kernel_socket_recvExactTEx(sock, &bufIn[numReceived], msgLength-numReceived, 0,
          CONN_LONG_TIMEOUT, &numReceived);
 
    if(unlikely(recvRes <= 0) )
diff --git a/client_module/source/components/AckManager.c b/client_module/source/components/AckManager.c
index 91cea63..ef6caa9 100644
--- a/client_module/source/components/AckManager.c
+++ b/client_module/source/components/AckManager.c
@@ -175,7 +175,7 @@ void __AckManager_processAckQueue(AckManager* this)
 
          if(likely(sock) )
          { // send msg
-            sendRes = Socket_send(sock, this->ackMsgBuf, msgLen, 0);
+            sendRes = Kernel_socket_send(sock, this->ackMsgBuf, msgLen, 0);
 
             if(unlikely(sendRes != (ssize_t) msgLen) )
             { // comm error => invalidate conn
diff --git a/client_module/source/components/DatagramListener.c b/client_module/source/components/DatagramListener.c
index 6aac705..9ba3a71 100644
--- a/client_module/source/components/DatagramListener.c
+++ b/client_module/source/components/DatagramListener.c
@@ -37,11 +37,17 @@ void __DatagramListener_listenLoop(DatagramListener* this)
       ssize_t recvRes;
       struct iovec iov = { this->recvBuf, DGRAMMGR_RECVBUF_SIZE };
       struct iov_iter iter;
+      mm_segment_t oldfs;
+
+      ACQUIRE_PROCESS_CONTEXT(oldfs);
 
       BEEGFS_IOV_ITER_INIT(&iter, READ, &iov, 1, iov.iov_len);
+      BUG_ON(!(iter.type & ITER_KVEC));
 
       recvRes = StandardSocket_recvfromT(this->udpSock, &iter, 0, &fromAddr, recvTimeoutMS);
 
+      RELEASE_PROCESS_CONTEXT(oldfs);
+
       if(recvRes == -ETIMEDOUT)
       { // timeout: nothing to worry about, just idle
          continue;
@@ -169,7 +175,7 @@ bool __DatagramListener_initSock(DatagramListener* this, unsigned short udpPort)
 
    // bind the socket
 
-   bindRes = Socket_bind(udpSockBase, udpPort);
+   bindRes = Kernel_socket_bind(udpSockBase, udpPort);
    if(!bindRes)
    {
       Logger_logErrFormatted(log, logContext, "Binding UDP socket to port %d failed.", udpPort);
diff --git a/client_module/source/components/DatagramListener.h b/client_module/source/components/DatagramListener.h
index d7dc79b..1e0759e 100644
--- a/client_module/source/components/DatagramListener.h
+++ b/client_module/source/components/DatagramListener.h
@@ -143,7 +143,7 @@ ssize_t DatagramListener_sendto(DatagramListener* this, void* buf, size_t len, i
 
    Mutex_lock(&this->sendMutex);
 
-   sendRes = Socket_sendto(&this->udpSock->pooledSocket.socket, buf, len, flags, to);
+   sendRes = Kernel_socket_sendto(&this->udpSock->pooledSocket.socket, buf, len, flags, to);
 
    Mutex_unlock(&this->sendMutex);
 
diff --git a/client_module/source/filesystem/FhgfsOpsFile.c b/client_module/source/filesystem/FhgfsOpsFile.c
index 45ab3df..342699a 100644
--- a/client_module/source/filesystem/FhgfsOpsFile.c
+++ b/client_module/source/filesystem/FhgfsOpsFile.c
@@ -1195,10 +1195,15 @@ static ssize_t FhgfsOps_buffered_read_iter(struct kiocb *iocb, struct iov_iter *
       {
          ssize_t readRes;
          size_t copyRes;
+         mm_segment_t oldfs;
+
 
          while (iov_iter_count(to) > 0)
          {
+            ACQUIRE_PROCESS_CONTEXT(oldfs);
             readRes = FhgfsOps_read(iocb->ki_filp, kaddr, PAGE_SIZE, &iocb->ki_pos);
+            RELEASE_PROCESS_CONTEXT(oldfs);
+
             if (readRes < 0 && totalReadRes == 0)
             {
                totalReadRes = readRes;
@@ -1584,12 +1589,16 @@ static ssize_t FhgfsOps_buffered_write_iter(struct kiocb *iocb, struct iov_iter
       {
          ssize_t writeRes;
          size_t copyRes;
+         mm_segment_t oldfs;
 
          while (iov_iter_count(from) > 0)
          {
             copyRes = copy_page_from_iter(buffer, 0, PAGE_SIZE, from);
 
+            ACQUIRE_PROCESS_CONTEXT(oldfs);
             writeRes = FhgfsOps_write(iocb->ki_filp, kaddr, copyRes, &iocb->ki_pos);
+            RELEASE_PROCESS_CONTEXT(oldfs);
+
             if (writeRes < 0 && totalWriteRes == 0)
             {
                totalWriteRes = writeRes;
@@ -2040,12 +2049,15 @@ int FhgfsOps_write_end(struct file* file, struct address_space* mapping,
       char* buf = kmap(page);
       RemotingIOInfo ioInfo;
       ssize_t writeRes;
+      mm_segment_t oldfs;
 
       FsFileInfo_getIOInfo(fileInfo, fhgfsInode, &ioInfo);
 
       FhgfsInode_incWriteBackCounter(fhgfsInode);
 
+      ACQUIRE_PROCESS_CONTEXT(oldfs);
       writeRes = FhgfsOpsRemoting_writefile(&buf[offset], copied, pos, &ioInfo);
+      RELEASE_PROCESS_CONTEXT(oldfs);
 
       spin_lock(&inode->i_lock);
       FhgfsInode_setLastWriteBackOrIsizeWriteTime(fhgfsInode);
diff --git a/client_module/source/filesystem/FhgfsOpsHelper.c b/client_module/source/filesystem/FhgfsOpsHelper.c
index 5014df3..24477d1 100644
--- a/client_module/source/filesystem/FhgfsOpsHelper.c
+++ b/client_module/source/filesystem/FhgfsOpsHelper.c
@@ -1339,6 +1339,7 @@ ssize_t FhgfsOpsHelper_writeStatelessInode(FhgfsInode* fhgfsInode, const char __
    FhgfsOpsErr referenceRes;
    ssize_t writeRes;
    FhgfsOpsErr releaseRes;
+   mm_segment_t oldfs;
 
 
    // open file
@@ -1356,7 +1357,9 @@ ssize_t FhgfsOpsHelper_writeStatelessInode(FhgfsInode* fhgfsInode, const char __
    FhgfsInode_getRefIOInfo(fhgfsInode, handleType, FhgfsInode_handleTypeToOpenFlags(handleType),
       &ioInfo);
 
+   ACQUIRE_PROCESS_CONTEXT(oldfs);
    writeRes = FhgfsOpsHelper_writefileEx(fhgfsInode, buf, size, offset, &ioInfo);
+   RELEASE_PROCESS_CONTEXT(oldfs);
    if(unlikely(writeRes < 0) )
    { // error
       FhgfsInode_releaseHandle(fhgfsInode, handleType, NULL);
diff --git a/client_module/source/net/filesystem/FhgfsOpsCommKit.c b/client_module/source/net/filesystem/FhgfsOpsCommKit.c
index e0ba8f9..f6f6f7e 100644
--- a/client_module/source/net/filesystem/FhgfsOpsCommKit.c
+++ b/client_module/source/net/filesystem/FhgfsOpsCommKit.c
@@ -331,7 +331,7 @@ static void __commkit_sendheader_generic(CommKitContext* context,
    if(BEEGFS_SHOULD_FAIL(commkit_sendheader_timeout, 1) )
       sendRes = -ETIMEDOUT;
    else
-      sendRes = Socket_send(info->socket, info->headerBuffer, info->headerSize, 0);
+      sendRes = Kernel_socket_send(info->socket, info->headerBuffer, info->headerSize, 0);
 
    if(unlikely(sendRes != info->headerSize) )
    {
@@ -405,7 +405,7 @@ static void __commkit_recvheader_generic(CommKitContext* context, struct CommKit
 
       if(info->headerSize < NETMSG_MIN_LENGTH)
       {
-         recvRes = Socket_recvT(info->socket, info->headerBuffer + info->headerSize,
+         recvRes = Kernel_socket_recvT(info->socket, info->headerBuffer + info->headerSize,
             NETMSG_MIN_LENGTH - info->headerSize, MSG_DONTWAIT, 0);
          if(recvRes <= 0)
          {
@@ -432,7 +432,7 @@ static void __commkit_recvheader_generic(CommKitContext* context, struct CommKit
 
       if(info->headerSize < msgLength)
       {
-         recvRes = Socket_recvT(info->socket, info->headerBuffer + info->headerSize,
+         recvRes = Kernel_socket_recvT(info->socket, info->headerBuffer + info->headerSize,
             msgLength - info->headerSize, MSG_DONTWAIT, 0);
          if(recvRes <= 0)
          {
@@ -999,14 +999,39 @@ static ssize_t __commkit_readfile_receive(CommKitContext* context, ReadfileState
    return recvRes;
 }
 
-static int __commkit_readfile_recvdata_prefix(CommKitContext* context, ReadfileState* currentState)
+static ssize_t __kernel_commkit_readfile_receive(CommKitContext* context, ReadfileState* currentState,
+   void* buffer, size_t length, bool exact)
+{
+   ssize_t recvRes;
+   Socket* socket = currentState->base.socket;
+
+   if(BEEGFS_SHOULD_FAIL(commkit_readfile_receive_timeout, 1) )
+      recvRes = -ETIMEDOUT;
+   else
+   if(exact)
+      recvRes = Kernel_socket_recvExactT(socket, buffer, length, 0, CONN_LONG_TIMEOUT);
+   else
+      recvRes = Kernel_socket_recvT(socket, buffer, length, 0, CONN_LONG_TIMEOUT);
+
+   if(unlikely(recvRes < 0) )
+   {
+      Logger_logFormatted(context->log, Log_SPAM, context->ops->logContext,
+         "Request details: receive from %s: %lld bytes (error %zi)",
+         Node_getNodeIDWithTypeStr(currentState->base.node), (long long)length, recvRes);
+   }
+
+   return recvRes;
+}
+
+
+static int __kernel_commkit_readfile_recvdata_prefix(CommKitContext* context, ReadfileState* currentState)
 {
    ssize_t recvRes;
    char dataLenBuf[sizeof(int64_t)]; // length info in fhgfs network byte order
    int64_t lengthInfo; // length info in fhgfs host byte order
    DeserializeCtx ctx = { dataLenBuf, sizeof(int64_t) };
 
-   recvRes = __commkit_readfile_receive(context, currentState, dataLenBuf, sizeof(int64_t), true);
+   recvRes = __kernel_commkit_readfile_receive(context, currentState, dataLenBuf, sizeof(int64_t), true);
    if(recvRes < 0)
       return recvRes;
    if (recvRes == 0)
@@ -1054,7 +1079,7 @@ static int __commkit_readfile_recvdata(CommKitContext* context, struct CommKitTa
    struct iovec iov;
 
    if(!currentState->receiveFileData)
-      return __commkit_readfile_recvdata_prefix(context, currentState);
+      return __kernel_commkit_readfile_recvdata_prefix(context, currentState);
 
    if(currentState->data.count == 0)
    {
diff --git a/client_module/source/net/filesystem/FhgfsOpsCommKitVec.c b/client_module/source/net/filesystem/FhgfsOpsCommKitVec.c
index 8a6db35..2bc766f 100644
--- a/client_module/source/net/filesystem/FhgfsOpsCommKitVec.c
+++ b/client_module/source/net/filesystem/FhgfsOpsCommKitVec.c
@@ -132,7 +132,7 @@ void __FhgfsOpsCommKitVec_readfileStagePREPARE(CommKitVecHelper* commHelper,
 
    comm->nodeResult = 0; // ready to read, so set this variable to 0
 
-   sendRes = Socket_send(comm->sock, comm->msgBuf, comm->hdrLen, 0);
+   sendRes = Kernel_socket_send(comm->sock, comm->msgBuf, comm->hdrLen, 0);
 
 #if (BEEGFS_COMMKIT_DEBUG & COMMKIT_DEBUG_READ_SEND )
    if (sendRes > 0 && jiffies % CommKitErrorInjectRate == 0)
@@ -181,7 +181,7 @@ void __FhgfsOpsCommKitVec_readfileStageRECVHEADER(CommKitVecHelper* commHelper,
 
    LOG_DEBUG_TOP_FORMATTED(commHelper->log, LogTopic_COMMKIT, Log_DEBUG, __func__, "enter");
 
-   recvRes = Socket_recvExactT(comm->sock, &dataLenBuf, sizeof(int64_t), 0, CONN_LONG_TIMEOUT);
+   recvRes = Kernel_socket_recvExactT(comm->sock, &dataLenBuf, sizeof(int64_t), 0, CONN_LONG_TIMEOUT);
 
    if(unlikely(recvRes <= 0) )
    { // error
@@ -299,7 +299,7 @@ void __FhgfsOpsCommKitVec_readfileStageRECVDATA(CommKitVecHelper* commHelper,
       requestLength = MIN(pageLen, comm->read.serverSize - receiveSum);
 
       // receive available dataPart
-      recvRes = Socket_recvExactT(comm->sock, pageDataPtr, requestLength, 0, CONN_LONG_TIMEOUT);
+      recvRes = Kernel_socket_recvExactT(comm->sock, pageDataPtr, requestLength, 0, CONN_LONG_TIMEOUT);
 
       LOG_DEBUG_TOP_FORMATTED(commHelper->log, LogTopic_COMMKIT, Log_DEBUG, __func__,
          "requested: %lld; received: %lld",
@@ -667,7 +667,7 @@ void __FhgfsOpsCommKitVec_writefileStageSENDHEADER(CommKitVecHelper* commHelper,
 
    LOG_DEBUG_TOP_FORMATTED(commHelper->log, LogTopic_COMMKIT, Log_DEBUG, __func__, "enter");
 
-   sendRes = Socket_send(comm->sock, comm->msgBuf, comm->hdrLen, 0);
+   sendRes = Kernel_socket_send(comm->sock, comm->msgBuf, comm->hdrLen, 0);
 
 #if (BEEGFS_COMMKIT_DEBUG & COMMKIT_DEBUG_WRITE_HEADER )
       if (sendRes == FhgfsOpsErr_SUCCESS && jiffies % CommKitErrorInjectRate == 0)
@@ -738,7 +738,7 @@ void __FhgfsOpsCommKitVec_writefileStageSENDDATA(CommKitVecHelper* commHelper,
       data = fhgfsPage->data;
 
       // send dataPart blocking
-      sendDataPartRes = Socket_send(comm->sock, data, dataLength, 0);
+      sendDataPartRes = Kernel_socket_send(comm->sock, data, dataLength, 0);
 
       LOG_DEBUG_TOP_FORMATTED(commHelper->log, LogTopic_COMMKIT, Log_DEBUG, __func__,
           "VecIdx: %zu; size: %lld; PgLen: %d, sendRes: %zd",
--
2.25.1
v4-0001-Separate-socket-operations-in-the-kernel-space.patch

Radiance Chou

Jul 10, 2021, 9:51:18 AM7/10/21
to beegfs-user
Hello Xinliang,

I analyzed the patch you submitted.
As far as I can see, it only removes the low-level set_fs(KERNEL_DS) calls and adds them to the outermost calling functions instead.
In the BeeGFS client_module code, how do you distinguish between user-space socket calls and kernel-space socket calls?

Best
yaohui.zhou

Xinliang Liu

Jul 14, 2021, 10:32:40 PM7/14/21
to fhgfs...@googlegroups.com
Hi  yaohui.zhou,



On Sat, 10 Jul 2021 at 21:51, Radiance Chou <ch.z...@gmail.com> wrote:
Hello Xinliang,

I analyzed the patch you submitted.
As far as I can see, it only removes the low-level set_fs(KERNEL_DS) calls and adds them to the outermost calling functions instead.
In the BeeGFS client_module code, how do you distinguish between user-space socket calls and kernel-space socket calls?

Please see what I described in the section 'How to fix', namely: "1) Identify which socket operations are in the kernel space. Usually with a kernel address or buf. Unfortunately, there are a lot of sendmsg/recvmsg socket operations in the kernel space." In other words, a call that passes a kernel address or buffer is treated as a kernel-space socket operation.

Xinliang
 

Yaohui Zhou

Jul 21, 2021, 10:14:04 PM7/21/21
to beegfs-user
Hi Xinliang,

The client CPU in our BeeGFS environment looks like this:
[root@cn72102%TH3 ~]# lscpu
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              16
On-line CPU(s) list: 0-15
Thread(s) per core:  1
Core(s) per socket:  1
Socket(s):           16
NUMA node(s):        8
Model:               2
BogoMIPS:            100.00
NUMA node0 CPU(s):   0,1
NUMA node1 CPU(s):   2,3
NUMA node2 CPU(s):   4,5
NUMA node3 CPU(s):   6,7
NUMA node4 CPU(s):   8,9
NUMA node5 CPU(s):   10,11
NUMA node6 CPU(s):   12,13
NUMA node7 CPU(s):   14,15
Flags:               fp asimd evtstrm crc32 cpuid
CPU MHz:             2300
Model name:          Phytium FT2000+
L1d cache:           32K
L1i cache:           48K
L2 cache:            512K

We have disabled PAN and UAO (including the software TTBR0 PAN emulation) in the kernel configuration.
However, the BeeGFS tests still report problems:

[root@cn72102%TH3 ~]# zgrep -E 'CONFIG_ARM64_PAN|CONFIG_ARM64_UAO|CONFIG_ARM64_SW_TTBR0_PAN' /proc/config.gz
# CONFIG_ARM64_SW_TTBR0_PAN is not set
# CONFIG_ARM64_PAN is not set
# CONFIG_ARM64_UAO is not set
[root@cn72102%TH3 ~]# 

The problem does not occur every time; it is reported only after the test has been running for more than 10 hours.
We now suspect that it is related to a user-mode buffer address being accessed while set_fs(KERNEL_DS) is in effect:

[Thu Jul 22 08:19:49 2021] Unable to handle kernel access to user memory with fs=KERNEL_DS at virtual address 000000003a8ef000
[Thu Jul 22 08:19:49 2021] Mem abort info:
[Thu Jul 22 08:19:49 2021]   ESR = 0x9600004e
[Thu Jul 22 08:19:49 2021]   Exception class = DABT (current EL), IL = 32 bits
[Thu Jul 22 08:19:49 2021]   SET = 0, FnV = 0
[Thu Jul 22 08:19:49 2021]   EA = 0, S1PTW = 0
[Thu Jul 22 08:19:49 2021] Data abort info:
[Thu Jul 22 08:19:49 2021]   ISV = 0, ISS = 0x0000004e
[Thu Jul 22 08:19:49 2021]   CM = 0, WnR = 1
[Thu Jul 22 08:19:49 2021] user pgtable: 4k pages, 48-bit VAs, pgdp = 0000000055c2b623
[Thu Jul 22 08:19:49 2021] [000000003a8ef000] pgd=0000020074421003, pud=00000200002c5003, pmd=00e0020009000fd1
[Thu Jul 22 08:19:49 2021] Internal error: Oops: 9600004e [#1] SMP
[Thu Jul 22 08:19:49 2021] Modules linked in: orcafs(O) zni_net(O) zni_dev(O) knem(O) ip_tables x_tables [last unloaded: orcafs]
[Thu Jul 22 08:19:49 2021] Process IOR-ft (pid: 15237, stack limit = 0x00000000d5fdfa99)
[Thu Jul 22 08:19:49 2021] CPU: 3 PID: 15237 Comm: IOR-ft Tainted: G           O      4.19.46-mt+ #354
[Thu Jul 22 08:19:49 2021] Hardware name: M3000 (DT)
[Thu Jul 22 08:19:49 2021] pstate: 20000005 (nzCv daif -PAN -UAO)
[Thu Jul 22 08:19:49 2021] pc : __arch_copy_to_user+0x110/0x160
[Thu Jul 22 08:19:49 2021] lr : FhgfsOps_access_mem+0x100/0x130 [orcafs]
[Thu Jul 22 08:19:49 2021] sp : ffff0000104b3c50
[Thu Jul 22 08:19:49 2021] x29: ffff0000104b3c50 x28: ffff8380f1784080
[Thu Jul 22 08:19:49 2021] x27: ffff8780f1cc3480 x26: ffff820072784200
[Thu Jul 22 08:19:49 2021] x25: 000000003a8ef000 x24: ffff8780f1cc3480
[Thu Jul 22 08:19:49 2021] x23: 0000ffffffffffff x22: 000000003a8ef000
[Thu Jul 22 08:19:49 2021] x21: ffff000019a39000 x20: 0000000000100000
[Thu Jul 22 08:19:49 2021] x19: ffff8380f1784080 x18: 0000000000000000
[Thu Jul 22 08:19:49 2021] x17: 0000000000000000 x16: 0000000000000000
[Thu Jul 22 08:19:49 2021] x15: 0000000000000000 x14: 0000000000000000
[Thu Jul 22 08:19:49 2021] x13: 0000000000000000 x12: 0000000000000000
[Thu Jul 22 08:19:49 2021] x11: 0000000000000000 x10: 0000000000000000
[Thu Jul 22 08:19:49 2021] x9 : 0000000000000000 x8 : 0000000000000000
[Thu Jul 22 08:19:49 2021] x7 : 0000000000000000 x6 : 000000003a8ef000
[Thu Jul 22 08:19:49 2021] x5 : 000000003a9ef000 x4 : 0000000000000000
[Thu Jul 22 08:19:49 2021] x3 : ffff8400764d79c8 x2 : 00000000000fff80
[Thu Jul 22 08:19:49 2021] x1 : ffff000019a39040 x0 : 000000003a8ef000
[Thu Jul 22 08:19:49 2021] Call trace:
[Thu Jul 22 08:19:49 2021]  __arch_copy_to_user+0x110/0x160
[Thu Jul 22 08:19:49 2021]  FhgfsOps_read+0xa8/0x1c8 [orcafs]
[Thu Jul 22 08:19:49 2021]  __vfs_read+0x30/0x158
[Thu Jul 22 08:19:49 2021]  vfs_read+0x90/0x160
[Thu Jul 22 08:19:49 2021]  ksys_read+0x64/0xd8
[Thu Jul 22 08:19:49 2021]  __arm64_sys_read+0x18/0x20
[Thu Jul 22 08:19:49 2021]  el0_svc_common+0x84/0xf0
[Thu Jul 22 08:19:49 2021]  el0_svc_handler+0x24/0x80
[Thu Jul 22 08:19:49 2021]  el0_svc+0x8/0xc
[Thu Jul 22 08:19:49 2021] Code: a8c12027 a8c12829 a8c1302b a8c1382d (a88120c7)
[Thu Jul 22 08:19:49 2021] ---[ end trace dddd700636b0af2d ]---

I would like to ask you to help us analyze this problem. Do you have any good suggestions?

Xinliang Liu

Jul 21, 2021, 11:53:17 PM7/21/21
to fhgfs...@googlegroups.com
Hi Zhou,

The crash log looks similar to mine, but my crash is easy to reproduce: simply copying a file to or from a BeeGFS directory with cp triggers it.
Does my patch solve your issue? The ESR codes differ:
ESR = 0x9600004e ===> yours
ESR = 0x9600000f ===> mine
I suggest you dig into this code and the source code[1] to find out which condition fails.


Best,
Xinliang 

 

Yaohui Zhou

Jul 26, 2021, 8:56:00 AM7/26/21
to beegfs-user
Hi Xinliang,

Thank you for your guidance; your patch solves this problem.
I analyzed the ESR values: for both 0x4e and 0x0f, the fault status type (fsc_type, the fault status code masked with ESR_ELx_FSC_TYPE) is ESR_ELx_FSC_PERM (0x0C), which means a permission fault.
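
The two ESR values can also be decoded by hand. Below is a minimal sketch in plain user-space C, assuming the ESR_ELx bit layout of a data abort and the mask values from the 4.19 kernel headers. decode_esr(), struct esr_info and the shortened ESR_* macro names are made up for this illustration and are not kernel APIs.

```c
#include <stdbool.h>
#include <stdint.h>

/* ESR_ELx field layout for a data abort (Arm architecture).
 * Mask values follow arch/arm64/include/asm/esr.h in Linux 4.19;
 * the macro names are shortened for this sketch. */
#define ESR_EC_SHIFT     26
#define ESR_EC_MASK      0x3Fu
#define ESR_EC_DABT_CUR  0x25u      /* data abort taken at the current EL */
#define ESR_WNR          (1u << 6)  /* 1 = the faulting access was a write */
#define ESR_FSC_TYPE     0x3Cu      /* fault type, ignoring the level bits */
#define ESR_FSC_LEVEL    0x03u      /* translation table level of the fault */
#define ESR_FSC_PERM     0x0Cu      /* permission fault */

/* Hypothetical helper type: the decoded fields we care about. */
struct esr_info {
    unsigned ec;      /* exception class */
    unsigned level;   /* page table level at which the fault was taken */
    bool is_dabt_cur; /* data abort from the current exception level */
    bool is_perm;     /* permission fault */
    bool is_write;    /* WnR bit: write (true) or read (false) */
};

/* Hypothetical helper: split an ESR value into the fields above. */
static struct esr_info decode_esr(uint32_t esr)
{
    struct esr_info info;

    info.ec = (esr >> ESR_EC_SHIFT) & ESR_EC_MASK;
    info.level = esr & ESR_FSC_LEVEL;
    info.is_dabt_cur = (info.ec == ESR_EC_DABT_CUR);
    info.is_perm = ((esr & ESR_FSC_TYPE) == ESR_FSC_PERM);
    info.is_write = (esr & ESR_WNR) != 0;

    return info;
}
```

With these definitions, decode_esr(0x9600004e) reports a level 2 permission fault on a write, which matches "WnR = 1" in the log above, and decode_esr(0x9600000f) reports a level 3 permission fault on a read.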

Xinliang Liu

Jul 27, 2021, 3:04:49 AM7/27/21
to fhgfs...@googlegroups.com
Hi Zhou,

On Mon, 26 Jul 2021 at 20:56, Yaohui Zhou <ch.z...@gmail.com> wrote:
Hi Xinliang,

Thank you for your guidance; your patch solves this problem.

Happy to hear that.

I analyzed the ESR values: for both 0x4e and 0x0f, the fault status type (fsc_type) is ESR_ELx_FSC_PERM (0x0C), which means a permission fault.

I also took a look at it. Your ESR has WnR = 1, so the kernel faulted while writing to a user address (copy_to_user on the read() path); mine has WnR = 0, that is, a kernel read from a user address. Either way, it means the kernel's checking of accesses to user addresses is working. Separating the kernel-space and user-space socket operations should be the solution.

Xinliang
 

qq z

Sep 11, 2021, 11:10:09 AM9/11/21
to beegfs-user
Hi,

The beegfs-client.log contains:

(3) Sep11 23:04:29 *mount(53926) [NodeConn (acquire stream)] >> Connected: beegfs-meta@***:8005 (protocol: TCP)
(3) Sep11 23:04:29 *mount(53926) [NodeConn (acquire stream)] >> Connected: beegfs-storage@***:8003 (protocol: TCP)
(0) Sep11 23:05:47 *sysbench(54001) [Remoting (read file)] >> Error storage targetID: 301; Msg: Bad memory address; FileHandle: 2F7D9921#4B-613CC5AE-2
(0) Sep11 23:05:47 *sysbench(54000) [Remoting (read file)] >> Error storage targetID: 301; Msg: Bad memory address; FileHandle: 2F7D9934#5E-613CC5AE-2

qq z

Sep 11, 2021, 11:13:45 AM9/11/21
to beegfs-user
Hi,
I applied your patch, but an error occurred when I ran the sysbench random read/write test:
sysbench --test=fileio --threads=2 --file-total-size=512M --file-test-mode=rndrw run
WARNING: the --test option is deprecated. You can pass a script name or path on the command line without any options.
sysbench 1.0.16 (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 2
Initializing random number generator from current time


Extra file open flags: (none)
128 files, 4MiB each
512MiB total file size
Block size 16KiB
Number of IO requests: 0
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random r/w test
Initializing worker threads...

Threads started!

FATAL: Failed to read file! file: 78 pos: 3670016 errno = 14 (Bad address)
FATAL: Failed to read file! file: 97 pos: 2326528 errno = 14 (Bad address)

Xinliang Liu

Sep 12, 2021, 5:47:08 AM9/12/21
to fhgfs...@googlegroups.com, 18138...@163.com
Hi,

Thank you for letting us know about this. We will look into this test case.
errno 14 (EFAULT) suggests that the user-address check, access_ok(), did not pass. Probably some kernel-space socket operations are still not handled properly because the patch does not yet cover them.
In any case, we will work with the BeeGFS team on this patch, refine it, adapt it to newer kernels (v5.10+), and run more tests to make sure this issue does not occur.
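
To make the two failure modes in this thread easier to compare, here is a rough user-space model of them. It is only a sketch and not kernel code. Every model_* and MODEL_* name is made up for illustration, the user/kernel address split is an arbitrary constant, and a user address under KERNEL_DS is modelled as always fatal, as on hardware with PAN and UAO (the log earlier in this thread shows that it can also happen with both disabled).

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only. All model_ and MODEL_ names are hypothetical. */
#define MODEL_USER_TOP     0x0000ffffffffffffULL /* illustrative user/kernel split */
#define MODEL_FATAL_FAULT  1                     /* stands for the kernel Oops */

typedef enum { MODEL_USER_DS, MODEL_KERNEL_DS } model_seg_t;

/* Stands in for the current task's address limit. */
static model_seg_t model_fs = MODEL_USER_DS;

static model_seg_t model_get_fs(void) { return model_fs; }
static void model_set_fs(model_seg_t fs) { model_fs = fs; }

/* Outcome of a copy_to_user()/copy_from_user()-style access to addr. */
static int model_uaccess(uint64_t addr)
{
    bool is_user_addr = addr <= MODEL_USER_TOP;

    /* USER_DS: the access_ok()-style check rejects kernel pointers,
     * which user space would see as errno 14 (EFAULT). */
    if (model_fs == MODEL_USER_DS)
        return is_user_addr ? 0 : -EFAULT;

    /* KERNEL_DS: kernel pointers pass, but touching a user address is
     * modelled as the fatal permission fault seen in the crash logs. */
    return is_user_addr ? MODEL_FATAL_FAULT : 0;
}

/* Socket call made on behalf of user space: the limit is left alone. */
static int model_socket_recv(uint64_t buf)
{
    return model_uaccess(buf);
}

/* Socket call known to use a kernel buffer: raise the limit only around
 * the call and restore it afterwards. */
static int model_kernel_socket_recv(uint64_t kernel_buf)
{
    model_seg_t oldfs = model_get_fs();
    int res;

    model_set_fs(MODEL_KERNEL_DS);
    res = model_uaccess(kernel_buf);
    model_set_fs(oldfs);

    return res;
}
```

In this model, passing a kernel buffer to model_socket_recv() yields -EFAULT, which would correspond to the "Bad address" error from sysbench if a kernel-buffer call site was missed by the patch. Passing a user buffer to model_kernel_socket_recv() yields the fatal fault, which corresponds to the original crash.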

Xinliang
 

qq z

Sep 12, 2021, 9:17:26 PM9/12/21
to beegfs-user
Hi:Xinliang
Thank you for your reply. If there is a solution, please let me know.  Thank you very much.

Xinliang Liu

Sep 12, 2021, 10:02:34 PM9/12/21
to fhgfs...@googlegroups.com
Hi, 

On Mon, 13 Sept 2021 at 09:17, qq z <china...@gmail.com> wrote:
Hi:Xinliang
Thank you for your reply. If there is a solution, please let me know.  Thank you very much.

Sure, but it may take some time. I cannot say when yet; probably not very soon.

Xinliang
 

Yaohui Zhou

Sep 25, 2021, 9:49:30 PM9/25/21
to beegfs-user
Hi,

I noticed something here.
With this patch applied, the problem occurs when FhgfsOps_write calls FhgfsOpsHelper_writeCached. When FhgfsOps_write calls FhgfsOpsRemoting_writefile, it works correctly.

I hope this observation helps to solve the problem!

yaohui.zhou

Xinliang Liu

Sep 26, 2021, 8:55:38 PM9/26/21
to fhgfs...@googlegroups.com
Hi,

On Sun, 26 Sept 2021 at 09:49, Yaohui Zhou <ch.z...@gmail.com> wrote:
Hi,

I noticed something here.
With this patch applied, the problem occurs when FhgfsOps_write calls FhgfsOpsHelper_writeCached. When FhgfsOps_write calls FhgfsOpsRemoting_writefile, it works correctly.

I hope this observation helps to solve the problem!

Sure, thank you for letting us know. We are working on the next version of the patch and hope it will solve the problem.

Xinliang
 