Soft freezing with rapid mount/umount


Lukas Hejtmanek

Jun 12, 2023, 5:14:50 AM
to beegfs-user
Hello,

I have BeeGFS 7.3.3 in Kubernetes with the NetApp provisioning driver (which basically just mounts/umounts and creates the PVC directory). The PVCs are persistent volumes - directories on a single BeeGFS file system that are mounted into containers - which means the single BeeGFS file system is mounted multiple times into different directories. This works fine unless I run a computation that spawns many containers (20-30) per physical node. Each container mounts the same BeeGFS file system; they are short-lived (10 seconds to 2 minutes) and unmount after the computation. This causes many parallel mounts/umounts of the same file system, which leads to a freeze of the mounter. Once that happens, any manipulation of the mounted file system freezes as well (ls, touching a file). Killing the frozen mount resolves the issue and everything is OK afterwards, but it is a bit unfortunate.
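For illustration, the access pattern can be sketched as a skeleton like this. MOUNT_CMD and UMOUNT_CMD are placeholders standing in for the real BeeGFS mount/umount invocations, so the skeleton can be dry-run anywhere; on a real node they would be the actual commands the mounter issues.

```shell
# Sketch of the access pattern that triggers the freeze: many parallel,
# short-lived mount/umount cycles against the same file system.
# MOUNT_CMD/UMOUNT_CMD default to `true` so this can be dry-run as-is.
parallel_mount_cycle() {
  n="$1"
  mount_cmd="${MOUNT_CMD:-true}"
  umount_cmd="${UMOUNT_CMD:-true}"
  for i in $(seq 1 "$n"); do
    (
      $mount_cmd "$i"                 # mount the shared FS for container $i
      sleep "${CYCLE_SECS:-0}"        # the container's short-lived computation
      $umount_cmd "$i"                # unmount when the computation finishes
    ) &
  done
  wait
  echo "completed $n cycles"
}
```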

Is there any mitigation possible?

Joe McCormick

Jun 14, 2023, 4:47:48 PM
to beegfs-user
Hi Lukas,
 
I'm one of the maintainers of the BeeGFS CSI driver, so I can help with this. In the past we've done stress testing around how many parallel PVCs could be handled; we could create hundreds of PVCs concurrently and (so far) never saw anything like this.

Before diving into a recreate I want to understand the problem and sequence of events better. Sorry for the barrage of questions, but this is probably a nuanced issue that has to do with some mix of how the mount operations are happening and how the environment is set up. Whatever questions you're comfortable answering on a public forum would be great.

(1) You mentioned the mounter freezes. Under the hood the driver uses the K8s mount utilities (https://github.com/kubernetes/mount-utils/blob/master/mount_linux.go#L124) which is essentially a wrapper for the mount utility in Linux. Are you saying the driver gets hung while trying to mount a PVC into a container? Or does the mount utility become unusable entirely? Just trying to understand what exactly is freezing and the effects it causes.

(2) You mentioned manipulating the mounted file system freezes. Does the mount operation itself eventually complete, with the resulting mount simply being non-functional?

(3) When you kill the frozen mount how are you actually doing this? Is this like a kill -9 of the mount command? Or forcibly unmounting the mount point?

(4) It sounds like you're doing dynamic provisioning. This means the controller service (csi-beegfs-controller-0) creates a directory then the node service (csi-beegfs-node-*) on the same K8s node as your app container handles mounting/bind mounting it into the app container. Are you having problems with the mounter on the controller service, node service, or both?

(5) Can you provide more information about the workloads running on your cluster that are causing this issue? Are the 20-30 containers all trying to mount the same PVC, or do they each have their own?

Some additional details about the environment would also be helpful:

(6) Can you share the version of Kubernetes you're running, and specifics about your cluster configuration? Things like the number of nodes, resources allocated to each one (memory/CPU), Linux version, BeeGFS CSI driver version, etc.

(7) Could you share the StorageClass and PersistentVolumeClaim configurations for the BeeGFS volume? It would be particularly helpful to see any mountOptions you're using.

(8) Are there any relevant logs you can send along? In particular anything around the time of a frozen mount (or anything suspicious) in:

* The kubectl logs from the CSI driver containers (controller and/or node service depending what is affected).
  * There are multiple containers in the pod so you'll need to do: `kubectl logs <POD> -c beegfs`
* /var/log/beegfs-client.log
* dmesg
* journalctl (probably the least helpful if it would be a lot of work to sanitize).

Thanks!

~Joe

Lukas Hejtmanek

Jun 15, 2023, 6:01:22 PM
to beegfs-user
Hello Joe,

thank you for the reply and support. I will provide as much information as I can.

The workload running in Kubernetes is the Nextflow Sarek pipeline - https://nf-co.re/sarek. We use a single dynamic PVC on a single BeeGFS file system for all the tasks.

This is how it looks when containers are stuck:
nf-1528f8a24fb0108fc8881424c83d1d91-7mrzn   1/1     Running             0          16m   10.42.102.76    kub-b14.priv.cerit-sc.cz   <none>           <none>
nf-260d2f2ea7e3acffde3920590cb7ca52-5kgxd   1/1     Running             0          16m   10.42.85.52     kub-b16.priv.cerit-sc.cz   <none>           <none>
nf-2b8f7f720e89e9624c008962f9f50ad2-5xbqc   1/1     Running             0          16m   10.42.5.73      kub-b15.priv.cerit-sc.cz   <none>           <none>
nf-2c2b5ea5efeee6813d7a84119f215ab7-2gg4q   1/1     Running             0          16m   10.42.85.153    kub-b16.priv.cerit-sc.cz   <none>           <none>
nf-44ba94995aa9c2abc12945d4216d8594-rpvh7   0/1     ContainerCreating   0          16m   <none>          kub-b2.priv.cerit-sc.cz    <none>           <none>
nf-46251e149167ee618a333682e356eef3-5lx98   0/1     ContainerCreating   0          16m   <none>          kub-b2.priv.cerit-sc.cz    <none>           <none>
nf-470e54aca8b66e5c6225a238939deb4f-lvccf   0/1     ContainerCreating   0          16m   <none>          kub-b2.priv.cerit-sc.cz    <none>           <none>
nf-4da6367b4cc86cdc36aae182bd915592-2nmxz   1/1     Running             0          16m   10.42.237.184   kub-b13.priv.cerit-sc.cz   <none>           <none>
nf-5013c053b64b3f59cf2328603a8426c2-pkb2l   1/1     Running             0          16m   10.42.237.87    kub-b13.priv.cerit-sc.cz   <none>           <none>
nf-562bff13b6c123b2f58f52fc7bbb0377-hvw84   1/1     Running             0          16m   10.42.200.118   kub-b3.priv.cerit-sc.cz    <none>           <none>
nf-5bc4f30b4b19931d3511951554f7d446-t82ps   0/1     ContainerCreating   0          16m   <none>          kub-b2.priv.cerit-sc.cz    <none>           <none>
nf-63d8c16c8fdcc82ff2107f45171dc90f-kjsr9   1/1     Running             0          16m   10.42.231.49    kub-b6.priv.cerit-sc.cz    <none>           <none>
nf-84ad675cd2138d27a6cd4194a5afc394-pqmrv   1/1     Running             0          16m   10.42.14.62     kub-b12.priv.cerit-sc.cz   <none>           <none>
nf-8d752ff3e27d9717983ab8b34bee532f-qszkh   0/1     ContainerCreating   0          16m   <none>          kub-b2.priv.cerit-sc.cz    <none>           <none>
nf-93e24e62c07d468598ca49e4f9f69f0f-hmgv4   1/1     Running             0          16m   10.42.85.141    kub-b16.priv.cerit-sc.cz   <none>           <none>
nf-a85e64c68052a803bd834d0cc5f31257-r7blp   1/1     Running             0          16m   10.42.200.5     kub-b3.priv.cerit-sc.cz    <none>           <none>
nf-b288b017e5b0cd2e106ede8823d5217c-shhkv   1/1     Running             0          16m   10.42.231.134   kub-b6.priv.cerit-sc.cz    <none>           <none>
nf-d62365976b123b2ae8933749eaa71009-bsv4d   1/1     Running             0          16m   10.42.102.67    kub-b14.priv.cerit-sc.cz   <none>           <none>
nf-de7fd179e8d12fe93db7a8b35819751d-w7pzg   1/1     Running             0          16m   10.42.85.47     kub-b16.priv.cerit-sc.cz   <none>           <none>
nf-decd9cc13c9b147c5c2f2ebc2289107b-jkjrs   1/1     Running             0          16m   10.42.200.47    kub-b3.priv.cerit-sc.cz    <none>           <none>
nf-f38523c75cbbfaf8d9b8d6ff78253e8b-2n9lr   1/1     Running             0          16m   10.42.237.115   kub-b13.priv.cerit-sc.cz   <none>           <none> 

statx(AT_FDCWD, "/var/lib/rancher/kubelet/plugins/kubernetes.io/csi/beegfs.csi.netapp.com/36fd002bd789918c90422917978b993fcddeeeadfa2a3d5cf67d413d8ddeb394/globalmount/mount", AT_STATX_SYNC_AS_STAT, STATX_MODE, {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_attributes=STATX_ATTR_MOUNT_ROOT, stx_mode=S_IFDIR|0777, stx_size=1, ...}) = 0
openat(AT_FDCWD, "/var/lib/rancher/kubelet/plugins/kubernetes.io/csi/beegfs.csi.netapp.com/36fd002bd789918c90422917978b993fcddeeeadfa2a3d5cf67d413d8ddeb394/globalmount/mount", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
newfstatat(3, "", {st_mode=S_IFDIR|0777, st_size=1, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 528384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ffa70d6a000
getdents64(3,

and never finishes.

This is the definition of the StorageClass:
kubectl get storageclass csi-beegfs-dyn-sc -o yaml
allowVolumeExpansion: false
allowedTopologies:
- matchLabelExpressions:
  - key: cerit.io/beegfs
    values:
    - "true"
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  creationTimestamp: "2023-05-13T11:07:14Z"
  name: csi-beegfs-dyn-sc
  resourceVersion: "946037537"
  uid: 04dbed35-3350-4abd-8582-c3cd4a2e385e
parameters:
  permissions/mode: "0777"
  sysMgmtdHost: beegfs-mgmtd.ib.cerit-sc.cz
  volDirBasePath: k8s/kuba/dyn
provisioner: beegfs.csi.netapp.com
reclaimPolicy: Delete
volumeBindingMode: Immediate

and the PVC:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: beegfs.csi.netapp.com
    volume.kubernetes.io/storage-provisioner: beegfs.csi.netapp.com
  creationTimestamp: "2023-05-13T12:22:47Z"
  finalizers:
  - kubernetes.io/pvc-protection
  name: pvc-acgt-beegfs
  namespace: acgt-ns
  resourceVersion: "946181069"
  uid: d34900ed-90cd-4c48-a8d1-f81022725970
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 100000Gi
  storageClassName: csi-beegfs-dyn-sc
  volumeMode: Filesystem
  volumeName: pvc-d34900ed
status:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 100000Gi
  phase: Bound

Kubernetes version is 1.24.13 (RKE2-based). There is one mgmtd node, two metadata nodes, and 16 storage nodes. Storage is ZFS RAID5 on top of 8 NVMe disks on each node. The system is Ubuntu 22.04 with the standard Ubuntu 5.15.0 kernel.

The problematic process is beegfs-csi-driver; if I kill it (`kill -9 $(pidof beegfs-csi-driver)`), everything resumes. These are the kernel stack traces for all beegfs-csi-driver threads:

Eleven threads show this path-lookup stack:

[<0>] d_alloc_parallel+0x3fb/0x4b0
[<0>] __lookup_slow+0x5f/0x150
[<0>] walk_component+0x145/0x1c0
[<0>] link_path_walk.part.0.constprop.0+0x23f/0x3a0
[<0>] path_lookupat+0x3e/0x1c0
[<0>] filename_lookup+0xcf/0x1d0
[<0>] user_path_at_empty+0x3f/0x60
[<0>] vfs_statx+0x7a/0x130
[<0>] __do_sys_newfstatat+0x36/0x70
[<0>] __x64_sys_newfstatat+0x1c/0x30
[<0>] do_syscall_64+0x5c/0xc0
[<0>] entry_SYSCALL_64_after_hwframe+0x61/0xcb

Thirty-three threads show this futex wait:

[<0>] futex_wait_queue_me+0xa7/0x110
[<0>] futex_wait+0x105/0x260
[<0>] do_futex+0x149/0x1f0
[<0>] __x64_sys_futex+0x78/0x1e0
[<0>] do_syscall_64+0x5c/0xc0
[<0>] entry_SYSCALL_64_after_hwframe+0x61/0xcb

One thread is inside the BeeGFS client module:

[<0>] Thread_sleep+0x78/0x90 [beegfs]
[<0>] __MessagingTk_requestResponseNodeRetry+0x37a/0x580 [beegfs]
[<0>] MessagingTk_requestResponseNodeRetryAutoIntr+0x20/0x30 [beegfs]
[<0>] FhgfsOpsRemoting_statAndGetParentInfo+0x141/0x280 [beegfs]
[<0>] FhgfsOps_lookupIntent+0x38d/0x6e0 [beegfs]
[<0>] __lookup_slow+0x88/0x150
[<0>] walk_component+0x145/0x1c0
[<0>] link_path_walk.part.0.constprop.0+0x23f/0x3a0
[<0>] path_lookupat+0x3e/0x1c0
[<0>] filename_lookup+0xcf/0x1d0
[<0>] user_path_at_empty+0x3f/0x60
[<0>] vfs_statx+0x7a/0x130
[<0>] __do_sys_newfstatat+0x36/0x70
[<0>] __x64_sys_newfstatat+0x1c/0x30
[<0>] do_syscall_64+0x5c/0xc0
[<0>] entry_SYSCALL_64_after_hwframe+0x61/0xcb

And one thread is waiting in epoll:

[<0>] ep_poll+0x2b1/0x380
[<0>] do_epoll_wait+0xba/0xe0
[<0>] do_compat_epoll_pwait.part.0+0x10/0x80
[<0>] __x64_sys_epoll_pwait+0x7f/0x130
[<0>] do_syscall_64+0x5c/0xc0
[<0>] entry_SYSCALL_64_after_hwframe+0x61/0xcb


We have a non-default configuration of timeouts on the server:
sysTargetOfflineTimeoutSecs  = 86400

and on the client:
connCommRetrySecs: "864000"
sysTargetOfflineTimeoutSecs: "86400"

Those huge timeouts are meant to let clients ride out node reboots when nodes need to be rebooted.

Note that client and server nodes are the same machines, i.e., nodes that act as storage servers also mount the file system.

The pods use a securityContext to avoid a recursive chown:
securityContext:
    fsGroup: 1
    fsGroupChangePolicy: OnRootMismatch

root@kub-b2:/proc/fs/beegfs/33F46D-648B7E07-kub-b2.priv.cerit-sc.cz# cat meta_target_state
[Target ID: 1] @ beegfs-meta kub-b1.priv.cerit-sc.cz [ID: 1]: Online / Good
[Target ID: 2] @ beegfs-meta kub-b2.priv.cerit-sc.cz [ID: 2]: Online / Good
root@kub-b2:/proc/fs/beegfs/33F46D-648B7E07-kub-b2.priv.cerit-sc.cz# cat storage_target_state
[Target ID: 1] @ beegfs-storage kub-b1.priv.cerit-sc.cz [ID: 1]: Online / Good
[Target ID: 2] @ beegfs-storage kub-b2.priv.cerit-sc.cz [ID: 2]: Online / Good
[Target ID: 3] @ beegfs-storage kub-b3.priv.cerit-sc.cz [ID: 3]: Online / Good
[Target ID: 4] @ beegfs-storage kub-b4.priv.cerit-sc.cz [ID: 4]: Online / Good
[Target ID: 5] @ beegfs-storage kub-b5.priv.cerit-sc.cz [ID: 5]: Online / Good
[Target ID: 6] @ beegfs-storage kub-b6.priv.cerit-sc.cz [ID: 6]: Online / Good
[Target ID: 7] @ beegfs-storage kub-b7.priv.cerit-sc.cz [ID: 7]: Online / Good
[Target ID: 8] @ beegfs-storage kub-b8.priv.cerit-sc.cz [ID: 8]: Online / Good
[Target ID: 9] @ beegfs-storage kub-b9.priv.cerit-sc.cz [ID: 9]: Online / Good
[Target ID: 10] @ beegfs-storage kub-b10.priv.cerit-sc.cz [ID: 10]: Online / Good
[Target ID: 11] @ beegfs-storage kub-b11.priv.cerit-sc.cz [ID: 11]: Online / Good
[Target ID: 12] @ beegfs-storage kub-b12.priv.cerit-sc.cz [ID: 12]: Online / Good
[Target ID: 13] @ beegfs-storage kub-b13.priv.cerit-sc.cz [ID: 13]: Online / Good
[Target ID: 14] @ beegfs-storage kub-b14.priv.cerit-sc.cz [ID: 14]: Online / Good
[Target ID: 15] @ beegfs-storage kub-b15.priv.cerit-sc.cz [ID: 15]: Online / Good
[Target ID: 16] @ beegfs-storage kub-b16.priv.cerit-sc.cz [ID: 16]: Online / Good
root@kub-b2:/proc/fs/beegfs/33F46D-648B7E07-kub-b2.priv.cerit-sc.cz#

cat config
cfgFile = /var/lib/kubelet/plugins/kubernetes.io/csi/beegfs.csi.netapp.com/36fd002bd789918c90422917978b993fcddeeeadfa2a3d5cf67d413d8ddeb394/globalmount/beegfs-client.conf
logLevel = 3
logType = helperd
logClientID = 0
logHelperdIP =
connUseRDMA = 1
connTCPFallbackEnabled = 0
connMgmtdPortUDP = 8008
connClientPortUDP = 55307
connMgmtdPortTCP = 8008
connHelperdPortTCP = 8006
connMaxInternodeNum = 12
connInterfacesFile = /var/lib/kubelet/plugins/kubernetes.io/csi/beegfs.csi.netapp.com/36fd002bd789918c90422917978b993fcddeeeadfa2a3d5cf67d413d8ddeb394/globalmount/connInterfacesFile
connRDMAInterfacesFile = /var/lib/kubelet/plugins/kubernetes.io/csi/beegfs.csi.netapp.com/36fd002bd789918c90422917978b993fcddeeeadfa2a3d5cf67d413d8ddeb394/globalmount/connRDMAInterfacesFile
connNetFilterFile =
connAuthFile = /var/lib/kubelet/plugins/kubernetes.io/csi/beegfs.csi.netapp.com/36fd002bd789918c90422917978b993fcddeeeadfa2a3d5cf67d413d8ddeb394/globalmount/connAuthFile
connDisableAuthentication = false
connTcpOnlyFilterFile =
connFallbackExpirationSecs = 900
connRDMABufSize = 8192
connRDMABufNum = 70
connCommRetrySecs = 864000
connNumCommRetries = 14419
connUnmountRetries = 1
tunePreferredMetaFile =
tunePreferredStorageFile =
tuneFileCacheType = buffered
tuneFileCacheBufSize = 524288
tuneRemoteFSync = 1
tuneUseGlobalFileLocks = 0
tuneRefreshOnGetAttr = 0
tuneInodeBlockBits = 19
tuneInodeBlockSize = 524288
tuneEarlyCloseResponse = 0
tuneUseGlobalAppendLocks = 0
tuneUseBufferedAppend = 0
tuneStatFsCacheSecs = 10
tuneCoherentBuffers = 1
sysACLsEnabled = 0
sysMgmtdHost = beegfs-mgmtd.ib.cerit-sc.cz
sysInodeIDStyle = md4hash64
sysCreateHardlinksAsSymlinks = 0
sysMountSanityCheckMS = 11000
sysSyncOnClose = 0
sysXAttrsEnabled = 0
sysSessionCheckOnClose = 0
sysSessionChecksEnabled = 1
quotaEnabled = 0
sysFileEventLogMask = none
sysRenameEbusyAsXdev = 0

Also worth mentioning: we had a problem with NFS mounts as well, which has already been resolved:
https://github.com/kubernetes/kubernetes/issues/109047

The symptoms were similar: you could mount/umount a PVC as many times as you wanted and everything was OK, but running a real workflow failed. I think this is because the workload leaves a dirty volume that needs to be synced/flushed.

On Wednesday, June 14, 2023 at 10:47:48 PM UTC+2, Joe McCormick wrote:

Joe McCormick

Jun 16, 2023, 7:15:20 PM
to beegfs-user
Hi Lukas,

So if I'm following, the nextflow sarek pipeline runs and should write the results to a single dynamic PVC. In this example when the pipeline tried to start containers on kub-b2.priv.cerit-sc.cz, they got stuck in ContainerCreating because the CSI driver was unable to mount BeeGFS on that node. Presumably those containers never got a chance to even start writing to BeeGFS, though containers on the other nodes probably did.

(1) If you run `kubectl describe pod <UUID>` against one of the Pods in ContainerCreating, what events are listed? I'm guessing they are sitting at "pod has unbound immediate PersistentVolumeClaims"?

(2) Are the kernel stack traces you provided from all threads of the beegfs-csi-driver on kub-b2.priv.cerit-sc.cz? Or are some from other nodes as well?

(3) In particular, did this trace come from the beegfs-csi-driver node service running on kub-b2.priv.cerit-sc.cz?


[<0>] Thread_sleep+0x78/0x90 [beegfs]
[<0>] __MessagingTk_requestResponseNodeRetry+0x37a/0x580 [beegfs]
[<0>] MessagingTk_requestResponseNodeRetryAutoIntr+0x20/0x30 [beegfs]
[<0>] FhgfsOpsRemoting_statAndGetParentInfo+0x141/0x280 [beegfs]
[<0>] FhgfsOps_lookupIntent+0x38d/0x6e0 [beegfs]
[<0>] __lookup_slow+0x88/0x150
[<0>] walk_component+0x145/0x1c0
[<0>] link_path_walk.part.0.constprop.0+0x23f/0x3a0
[<0>] path_lookupat+0x3e/0x1c0
[<0>] filename_lookup+0xcf/0x1d0
[<0>] user_path_at_empty+0x3f/0x60
[<0>] vfs_statx+0x7a/0x130
[<0>] __do_sys_newfstatat+0x36/0x70
[<0>] __x64_sys_newfstatat+0x1c/0x30
[<0>] do_syscall_64+0x5c/0xc0
[<0>] entry_SYSCALL_64_after_hwframe+0x61/0xcb

That is the one that jumps out because it contains BeeGFS-specific function calls. This thread is sleeping while waiting on a response from one of the server nodes, presumably a metadata node since we're trying to perform the newfstatat() system call. This is one of the typical system calls we see when a file system is mounted.

I also noticed that kub-b2.priv.cerit-sc.cz has 6 containers scheduled, while the other nodes have at most 3. Possibly this issue has to do with the number of parallel requests we're making from this particular client to the same metadata service. On the clients there is a limit on the maximum number of simultaneous connections to the same node (connMaxInternodeNum=12 in this case). There is a similar limit on the metadata server, which defaults to 32. You could try doubling connMaxInternodeNum on both the client and the metadata server, though note this will use more RAM, particularly if you're using RDMA (https://git.beegfs.io/pub/v7/-/blob/master/meta/build/dist/etc/beegfs-meta.conf#L218). In fact, the earlier testing I mentioned probably had this set as high as 128 on the metadata side. Also worth noting: the huge timeouts may be masking a more obvious error, so lowering them to the defaults for testing would be helpful.
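For reference, a sketch of where those knobs live; the file names and values below are assumptions based on a typical BeeGFS CSI driver install, not something confirmed in this thread. On the client side the driver reads overrides from the beegfsClientConf section of its config file (values are quoted strings, matching the style of the client overrides shown earlier):

```yaml
# csi-beegfs-config.yaml (assumed default driver config layout)
config:
  beegfsClientConf:
    connMaxInternodeNum: "24"   # illustrative: double the current 12
```

On the metadata servers the equivalent setting would be a line like `connMaxInternodeNum = 64` in /etc/beegfs/beegfs-meta.conf, followed by a beegfs-meta restart.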

That said, if we were running up against the connMaxInternodeNum limit I'd just expect some operations to be temporarily blocked until others finish. I did a quick test where I created a Deployment with 64 replicas (double the metadata connMaxInternodeNum) of an Alpine container, with each container using touch to create a new file inside the same PVC and then sleeping. On my local Minikube VM it took a minute or two for all the Pods to shift to the Running state, but all the necessary mount points were eventually created.
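A sketch of that kind of stress Deployment, in case it's useful for a repro on your side; the name, image tag, and mount path are illustrative, and the claimName assumes the shared RWX PVC from earlier in this thread:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: beegfs-parallel-touch      # hypothetical name
spec:
  replicas: 64
  selector:
    matchLabels:
      app: beegfs-parallel-touch
  template:
    metadata:
      labels:
        app: beegfs-parallel-touch
    spec:
      containers:
      - name: toucher
        image: alpine:3.18
        # $(POD_NAME) is expanded by Kubernetes from the env var below,
        # so each replica creates a uniquely named file in the shared PVC.
        command: ["sh", "-c", "touch /mnt/vol/$(POD_NAME) && sleep 3600"]
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        volumeMounts:
        - name: vol
          mountPath: /mnt/vol
      volumes:
      - name: vol
        persistentVolumeClaim:
          claimName: pvc-acgt-beegfs   # the shared PVC from this thread
```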

(4) Have you noticed this issue typically only occur with kub-b2.priv.cerit-sc.cz (or a subset of nodes)? Or perhaps with whichever node(s) have the most pods scheduled?

(5) If you look at the output from /proc/mounts on the affected node, do you see any of the bind mounts for a particular Pod? Or do you only see the globalmount?

The way the driver works, each PVC will have its own globalmount of the BeeGFS file system root, in this case at /var/lib/rancher/kubelet/plugins/kubernetes.io/csi/beegfs.csi.netapp.com/36fd002bd789918c90422917978b993fcddeeeadfa2a3d5cf67d413d8ddeb394/globalmount/mount. Then each Pod that needs access to the PVC gets a bind mount of the particular sub-directory representing the PVC (this is how isolation occurs). These would look like: /var/lib/kubelet/pods/<UUID>/volumes/kubernetes.io~csi/pvc-<UUID>/mount.

Just trying to narrow down where we're getting hung up: while we're making the globalmount, or as we're working through the bind mounts.
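As a rough aid for that check, here's a sketch that buckets BeeGFS entries from a /proc/mounts-style table into per-PVC globalmounts and per-Pod bind mounts; the path patterns simply follow the layout described above.

```shell
# Classify BeeGFS mounts from a file in /proc/mounts format
# (device, mountpoint, fstype, ...): count globalmounts vs. per-Pod
# bind mounts, based on the mountpoint path.
classify_beegfs_mounts() {
  awk '$3 == "beegfs" && $2 ~ /\/globalmount\//           { g++ }
       $3 == "beegfs" && $2 ~ /kubernetes\.io~csi\/pvc-/  { b++ }
       END { printf "globalmounts=%d bind_mounts=%d\n", g, b }' "$1"
}
# On the affected node: classify_beegfs_mounts /proc/mounts
```

If the bind-mount count is far below the number of Pods using the PVC on that node, the hang is likely in the bind-mount stage rather than the globalmount.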

(6) Do the number of mounts match the actual number of PVCs being accessed by the Pods on that node, or are there stale/leftover mounts from old PVCs that should be deleted?

This is a long shot, but we had an older issue where what we referred to as "orphaned" mounts could be left over on Kubernetes nodes. We largely mitigated this issue, but there are some situations outside the driver's control where they may come up again. If there are a large number of old mounts lying around, over time this could cause weird behavior.

(7) You mentioned there were often many parallel mounts/umounts of the same file system. Can you provide more detail on that? Are there umount operations occurring against other PVC(s) around the same time this one was mounted?

With the way BeeGFS CSI driver handles PVCs, I wouldn't expect the unmount of one PVC to affect the mount of another PVC. Possibly mounting/unmounting the same PVC from different Pods on the same node could have some interesting side effects in different situations, but I haven't seen anything like that so far. The globalmount should only be removed once all bind mounts for individual Pods are gone.
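The lifecycle described above amounts to simple reference counting. This toy Go sketch mirrors the CSI RPC names but is illustrative only, not the driver's actual bookkeeping:

```go
package main

import "fmt"

// volumeMounts models one PVC on one node: NodeStageVolume creates the
// globalmount, each NodePublishVolume adds a per-Pod bind mount, and
// NodeUnstageVolume may only tear down the globalmount once every bind
// mount is gone.
type volumeMounts struct {
	staged     bool
	bindMounts map[string]bool // pod UID -> has a bind mount
}

func (v *volumeMounts) stage()               { v.staged = true }
func (v *volumeMounts) publish(pod string)   { v.bindMounts[pod] = true }
func (v *volumeMounts) unpublish(pod string) { delete(v.bindMounts, pod) }

// unstage refuses to remove the globalmount while Pods still hold bind mounts.
func (v *volumeMounts) unstage() error {
	if n := len(v.bindMounts); n > 0 {
		return fmt.Errorf("%d bind mounts still present", n)
	}
	v.staged = false
	return nil
}

func main() {
	v := &volumeMounts{bindMounts: map[string]bool{}}
	v.stage()
	v.publish("pod-a")
	fmt.Println(v.unstage()) // refused: pod-a still holds a bind mount
	v.unpublish("pod-a")
	fmt.Println(v.unstage()) // succeeds once no bind mounts remain
}
```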

Mainly wanting to understand what else I may need to do to recreate this on my system. Worth noting I tested with Kubernetes 1.26.3, though my kernel (5.15.0) and BeeGFS version (7.3.3) are identical to yours. I'll poke through and see if I can find any related issues, but at this point I'm less inclined to think this is a bug in K8s.

Thanks,

~Joe

Lukas Hejtmanek

unread,
Jun 16, 2023, 7:38:15 PM6/16/23
to fhgfs...@googlegroups.com
Hi Joe,

> On 17. 6. 2023, at 1:15, Joe McCormick <joe.mc...@thinkparq.com> wrote:
> So if I'm following, the nextflow sarek pipeline runs and should write the results to a single dynamic PVC. In this example when the pipeline tried to start containers on kub-b2.priv.cerit-sc.cz, they got stuck in ContainerCreating because the CSI driver was unable to mount BeeGFS on that node. Presumably those containers never got a chance to even start writing to BeeGFS, though containers on the other nodes probably did.

Yes, the CSI driver was unable to mount, probably because the mount is frozen, as I described: ls hangs, and as you suggest there may be a problem with the connection to metadata.

> (1) If you run `kubectl describe pod <UUID>` against one of the Pods in ContainerCreating, what events are listed? I'm guessing they are sitting at "pod has unbound immediate PersistentVolumeClaims"?

Warning FailedMount 90s (x694 over 26h) kubelet Unable to attach or mount volumes: unmounted volumes=[vol-25], unattached volumes=[vol-25]: timed out waiting for the condition

> (2) Are the kernel stack traces you provided from all threads of the beegfs-csi-driver on kub-b2.priv.cerit-sc.cz? Or are some from other nodes as well?

they all are from kub-b2.

> (3) In particular, did this trace come from the beegfs-csi-driver node service running on kub-b2.priv.cerit-sc.cz?
>
> [<0>] Thread_sleep+0x78/0x90 [beegfs]
> [<0>] __MessagingTk_requestResponseNodeRetry+0x37a/0x580 [beegfs]
> [<0>] MessagingTk_requestResponseNodeRetryAutoIntr+0x20/0x30 [beegfs]
> [<0>] FhgfsOpsRemoting_statAndGetParentInfo+0x141/0x280 [beegfs]
> [<0>] FhgfsOps_lookupIntent+0x38d/0x6e0 [beegfs]
> [<0>] __lookup_slow+0x88/0x150
> [<0>] walk_component+0x145/0x1c0
> [<0>] link_path_walk.part.0.constprop.0+0x23f/0x3a0
> [<0>] path_lookupat+0x3e/0x1c0
> [<0>] filename_lookup+0xcf/0x1d0
> [<0>] user_path_at_empty+0x3f/0x60
> [<0>] vfs_statx+0x7a/0x130
> [<0>] __do_sys_newfstatat+0x36/0x70
> [<0>] __x64_sys_newfstatat+0x1c/0x30
> [<0>] do_syscall_64+0x5c/0xc0
> [<0>] entry_SYSCALL_64_after_hwframe+0x61/0xcb

yes, this one is also from kub-b2.

> That is the one that jumps out because it contains BeeGFS specific function calls. This thread is sleeping waiting on a response from one the server nodes, presumably a metadata node since we're trying to perform the newfstatat() system call. This is one of the typical system calls we see when a filesystem is mounted.

the csi driver sees the pvc mounted:
2023-06-17T01:24:47.616697387+02:00 stderr F I0616 23:24:47.616575 1 beegfs_util.go:210] "msg"="Volume is already mounted" "path"="/var/lib/kubelet/plugins/kubernetes.io/csi/beegfs.csi.netapp.com/36fd002bd789918c90422917978b993fcddeeeadfa2a3d5cf67d413d8ddeb394/globalmount/mount" "reqID"="d04d" "volumeID"="beegfs://beegfs-mgmtd.ib.cerit-sc.cz/k8s/kuba/dyn/pvc-d34900ed"

> I also noticed that kub-b2.priv.cerit-sc.cz has 6 containers scheduled, and at most the other nodes have 3. Possibly this issue has to do with the number of parallel requests we're making from this particular client to the same metadata service. On the clients there is a limit on the max number of simultaneous connections to the same node (connMaxInternodeNum=12 in this case). There is also a similar limit on the metadata server, which defaults to 32. You could try doubling connMaxInternodeNum on the client and metadata server, though note this will use more RAM, particularly if you're using RDMA (https://git.beegfs.io/pub/v7/-/blob/master/meta/build/dist/etc/beegfs-meta.conf#L218). In fact the earlier testing I mentioned probably had this set as high as 128 on the metadata side. Worth noting the huge timeouts may be masking a more obvious error and lowering those for testing to the defaults would be nice.

The number of 6 is even low; there are usually more containers. OK, I'll try increasing the number of connections.

> That said, if we were running up against the connMaxInternodeNum max I'd just expect some operations to be temporarily blocked until others finish. I did a quick test where I created a deployment with 64 replicas (double the metadata connMaxInterNodeNum) of an Alpine container with each container using touch to create a new file inside the same PVC then sleep. On my local Minikube VM it took a minute or two for all the Pods to shift to the running state, but all the necessary mount points were eventually created.

It will recover if I kill the CSI driver; it seems the driver is holding some lock that prevents things from resuming.

> (4) Have you noticed this issue typically only occur with kub-b2.priv.cerit-sc.cz (or a subset of nodes)? Or perhaps with whichever node(s) have the most pods scheduled?

It is not related just to the kub-b2 node; it happens randomly, and sometimes more than one node is affected at once.

> (5) If you look at the output from /proc/mounts on the affected node, do you see any of the bind mounts for a particular Pod? Or do you only see the globalmount?

only the globalmount.

Note, I cannot list or stat anything in that globalmount.

> (6) Do the number of mounts match the actual number of PVCs being accessed by the Pods on that node, or are there stale/leftover mounts from old PVCs that should be deleted?

I don’t think so, but I need to check this.

> (7) You mentioned there were often many parallel mounts/umounts of the same file system. Can you provide more detail on that? Are there umount operations occuring against other PVC(s) around the same time this one was mounted?

Everything mentioned is happening on a single BeeGFS file system with a single PVC. The BeeGFS file system actually has only one PVC so far.


--
Lukáš Hejtmánek

Linux Administrator only because
Full Time Multitasking Ninja
is not an official job title

Joe McCormick

unread,
Jun 19, 2023, 3:46:06 PM6/19/23
to beegfs-user
Hi Lukáš,

Thanks for the additional details. According to the CSI specification, Kubernetes should only attempt to mount BeeGFS once (for each PVC) to globalmount using the NodeStageVolume RPC, then create a bind mount from there for each Pod using the NodePublishVolume RPC (ref: https://github.com/container-storage-interface/spec/blob/master/spec.md#nodestagevolume). The log message you provided suggests Kubernetes may be making extra calls to NodeStageVolume, and I don't see this same message when I try to recreate the issue on my system.

I did find one issue in Kubernetes where NodeStageVolume can be called multiple times for the same volume on the same node (https://github.com/kubernetes/kubernetes/issues/112343). However, that is specific to raw block CSI volumes, and an earlier fix is in place for filesystem-based volumes (https://github.com/kubernetes/kubernetes/commit/6060c57ba3a944bd133968fd669980c97af846c6) which seems to be in 1.24.13. That said, I'm not sure there is a strict mechanism in place to prevent multiple calls to NodeStageVolume in quick succession, and in fact the spec allows NodeStageVolume to be called multiple times if Kubernetes isn't sure whether it has been called for some reason. This is why CSI drivers are expected to handle this call in an idempotent fashion (which the BeeGFS CSI driver does; it won't remount BeeGFS unnecessarily).
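Idempotent staging essentially means checking for an existing mount before mounting. A minimal sketch (the mounter interface here is a stand-in for k8s.io/mount-utils, not the driver's real code):

```go
package main

import "fmt"

// mounter abstracts the mount operations so idempotency can be shown
// without touching a real filesystem.
type mounter interface {
	isMounted(target string) bool
	mount(source, target string) error
}

// nodeStageVolume is idempotent: a repeated call for an already-mounted
// target is a no-op, which matches the "Volume is already mounted" log
// message quoted earlier in this thread.
func nodeStageVolume(m mounter, source, target string) error {
	if m.isMounted(target) {
		return nil // already staged; do not mount again
	}
	return m.mount(source, target)
}

// fakeMounter counts mount calls for demonstration purposes.
type fakeMounter struct {
	mounted map[string]bool
	calls   int
}

func (f *fakeMounter) isMounted(t string) bool { return f.mounted[t] }
func (f *fakeMounter) mount(s, t string) error {
	f.calls++
	f.mounted[t] = true
	return nil
}

func main() {
	f := &fakeMounter{mounted: map[string]bool{}}
	nodeStageVolume(f, "beegfs_nodev", "/globalmount/mount")
	nodeStageVolume(f, "beegfs_nodev", "/globalmount/mount")
	fmt.Println(f.calls) // prints 1: the second call was a no-op
}
```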

Could you provide full logs from one of the BeeGFS CSI node services from the time-frame the issue happened (`kubectl logs csi-beegfs-node-<UUID> -c beegfs`)? For now don't worry about increasing logging verbosity, I'm mainly interested in the GRPC calls/responses which are logged by default, and I'm worried increasing logging verbosity could make the issue hard to reproduce since it seems to be timing related.

Originally I thought this issue was with NodePublishVolume, since I'm not aware of any mechanism to throttle the number of concurrent requests Kubernetes makes to bind mount the directory for a given PVC into each Pod. That is why I thought increasing connMaxInternodeNum could help. Have you had a chance to increase connMaxInternodeNum, and did it make a difference?

Can you also provide details about your Kubernetes node hardware? Specifically the CPU configuration. I'm still unable to recreate the issue on my local system, even if I lower connMaxInternodeNum in my BeeGFS client and metadata configuration to half the number of CPUs on my client K8s node (which affects how many Pods K8s will start and how many volume attachments it will make concurrently).

To your point about the CSI driver seeming to hold some lock: regardless of whether the issue is in NodeStageVolume or NodePublishVolume, what seems to be happening is that some number of commands are blocked and have exhausted all the connections that can be made to a particular BeeGFS metadata service. So when you kill the CSI driver, all those connections are released and you can use the filesystem normally again. When you kill the driver and things recover, does the globalmount start working (stat/ls) without needing to do anything else? Presumably K8s also restarts the driver and then all the bind mounts happen, so the Pods also start working?

Thanks,

~Joe

Lukas Hejtmanek

unread,
Jun 19, 2023, 6:06:11 PM6/19/23
to fhgfs...@googlegroups.com
Hi Joe,

> On 19. 6. 2023, at 21:46, Joe McCormick <joe.mc...@thinkparq.com> wrote:
>
> Could you provide full logs from one of the BeeGFS CSI node services from the time-frame the issue happened (`kubectl logs csi-beegfs-node-<UUID> -c beegfs`)? For now don't worry about increasing logging verbosity, I'm mainly interested in the GRPC calls/responses which are logged by default, and I'm worried increasing logging verbosity could make the issue hard to reproduce since it seems to be timing related.

They are attached. I noticed the problem started about 21:50 CEST 19/6 (the first time column), and it resumed at 2023-06-19T22:04:22.

> Originally I thought this issue was with NodePublishVolume, since I'm not aware there is any mechanism to throttle on the number of concurrent requests Kubernetes makes to bind mount the directory for a given PVC to each Pod. That is why I thought increasing connMaxInterNodeNum could help. Have you had a chance to increase connMaxInterNodeNum and did it make a difference?

Yes, I did; I increased it to 128, but it did not help. However, what did help is lowering:
connCommRetrySecs = 900
sysTargetOfflineTimeoutSecs = 900

because after those 900 seconds, containers that were stuck in ContainerCreating started.

> Can you also provide details about your Kubernetes node hardware? Specifically CPU configuration. I'm still unable to recreate the issue on my local system even if I lower connMaxInterNodeNum on my BeeGFS client and metadata configuration to half the number of CPUs on my client K8s node (which affects how many Pods will start and volume attachment K8s will be made concurrently).

I have no resource limits set for either metadata or storage, so there is basically no limit on the server side. For the client, I have a 128Mi memory limit on the node driver; however, memory utilization is only about 28Mi.

Nodes are 2-socket AMD EPYC 7543, so 64 physical cores per node and 512GB RAM.

>
> To your point about the CSI driver seeming to be holding some lock. Regardless if the issue is in NodeStageVolume or NodePublishVolume, what seems to be happening is we've run some number of commands that are somehow blocked and have exhausted all the connections that can be made to a particular BeeGFS metadata service. So when you kill the CSI driver all those connections are released and now you can use the filesystem normally again. When you kill the driver and things recover, does the globalmount starts working (stat/ls) without needing to do anything else? Presumably K8s also restarts the driver and then all the bind mounts happen so the Pods also start working?

Killing the CSI driver is the only action needed. I am not sure, though, whether stat/ls recovers immediately or needs some time.

0.log

Joe McCormick

unread,
Jun 20, 2023, 2:34:26 PM6/20/23
to beegfs-user
Hi Lukáš,

So NodeUnstageVolume unmounted the volume at 21:50:01.438100475, and less than a second later NodeStageVolume told us to mount again at 21:50:01.449942615. That mount operation seems to have completed successfully, because the subsequent calls to NodeStageVolume indicate the volume is already mounted. If we trace through the first NodeStageVolume request, roughly 900s later we see an error not from the initial mount, but from a follow-on stat operation:

2023-06-19T22:04:22.570187661+02:00 stderr F E0619 20:04:22.570009       1 nodeserver.go:230]  "msg"="Failed attempting to stat node tracking directory" "error"="stat /var/lib/kubelet/plugins/kubernetes.io/csi/beegfs.csi.netapp.com/36fd002bd789918c90422917978b993fcddeeeadfa2a3d5cf67d413d8ddeb394/globalmount/mount/k8s/kuba/dyn/.csi/volumes/pvc-d34900ed/nodes: communication error on send" "fullError"="stat /var/lib/kubelet/plugins/kubernetes.io/csi/beegfs.csi.netapp.com/36fd002bd789918c90422917978b993fcddeeeadfa2a3d5cf67d413d8ddeb394/globalmount/mount/k8s/kuba/dyn/.csi/volumes/pvc-d34900ed/nodes: communication error on send" "path"="/var/lib/kubelet/plugins/kubernetes.io/csi/beegfs.csi.netapp.com/36fd002bd789918c90422917978b993fcddeeeadfa2a3d5cf67d413d8ddeb394/globalmount/mount/k8s/kuba/dyn/.csi/volumes/pvc-d34900ed/nodes" "reqID"="f45a" "volumeID"="beegfs://beegfs-mgmtd.ib.cerit-sc.cz/k8s/kuba/dyn/pvc-d34900ed"

That "node tracking directory" is a workaround (https://github.com/ThinkParQ/beegfs-csi-driver/blob/master/docs/troubleshooting.md#outdated-driver-or-node-unstage-timeout-exceeded) for a Kubernetes issue where it might try to delete a volume before it was unmounted from all nodes. The controller service is what creates and deletes volumes, and it also manages the lifecycle of the nodes/ directory the node service is running stat against. The directory should only be deleted when the PVC is deleted, but if the PVC was recreated (including the nodes directory), that could explain this behavior and why I'm having trouble reproducing it.

Does the workflow actually delete and recreate pvc-d34900ed? Even if it doesn't, it'd be helpful if you could send along logs from the controller service during this timeframe; I'd like to sanity check what the controller service is doing.

Thanks,

~Joe

Lukas Hejtmanek

unread,
Jun 20, 2023, 4:00:25 PM6/20/23
to fhgfs...@googlegroups.com
Hi Joe,

> That "node tracking directory" is a workaround (https://github.com/ThinkParQ/beegfs-csi-driver/blob/master/docs/troubleshooting.md#outdated-driver-or-node-unstage-timeout-exceeded) for that Kubernetes issue where it might try to delete a volume before it was unmounted from all nodes. The controller service is what creates and deletes volumes and also manages the lifecycle of the nodes/ directory the node service is running stat against. The directory should only deleted when the PVC is deleted, but if the PVC was recreated (including the nodes directory), that could explain this behavior and why I'm having trouble reproducing it.
>
> Does the workflow actually delete and recreate the pvc-d34900ed? Even if it doesn't it'd be helpful if you could send along logs from the controller service during this timeframe, I'd like to sanity check what the controller service is doing.

No, the PVC is stable and present all the time. No volume is created or deleted; there is a single PVC throughout. This is certain because there is one pod that "holds" the PVC attached: that pod uses it all the time, so the PVC itself cannot be deleted and recreated while it is in use. The pod is used to start the nextflow pipeline.

Unfortunately, the controller is silent: no verbose logs, only messages from startup. Shall I increase the log level somehow? For the csi-driver or the csi-provisioner?

This is dmesg with timing info:
[Jun19 21:47] beegfs: umount(488261): BeeGFS unmounted.
[ +1.104401] beegfs: enabling unsafe global rkey
[ +0.004626] beegfs: enabling unsafe global rkey
[ +0.004461] beegfs: enabling unsafe global rkey
[ +0.007404] beegfs: enabling unsafe global rkey
[ +0.004718] beegfs: enabling unsafe global rkey
[ +0.004249] beegfs: enabling unsafe global rkey
[ +0.004017] beegfs: enabling unsafe global rkey
[ +0.004253] beegfs: enabling unsafe global rkey
[ +0.004295] beegfs: enabling unsafe global rkey
[ +0.004521] beegfs: enabling unsafe global rkey
[ +0.004994] beegfs: enabling unsafe global rkey
[ +0.004797] beegfs: enabling unsafe global rkey
[ +0.004583] beegfs: enabling unsafe global rkey
[ +0.004950] beegfs: enabling unsafe global rkey
[ +0.005506] beegfs: enabling unsafe global rkey
[ +0.004903] beegfs: enabling unsafe global rkey
[ +0.003845] beegfs: enabling unsafe global rkey
[ +0.004679] beegfs: mount(488823): BeeGFS mount ready.
[Jun19 22:01] beegfs: enabling unsafe global rkey
[ +1.417144] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ +0.000122] IPv6: ADDRCONF(NETDEV_CHANGE): cali523db3e0a51: link becomes ready
[ +0.427424] beegfs: enabling unsafe global rkey
[ +0.925008] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ +0.000112] IPv6: ADDRCONF(NETDEV_CHANGE): cali60e64a51a5c: link becomes ready
[ +0.517912] beegfs: enabling unsafe global rkey
[ +0.000016] beegfs: enabling unsafe global rkey
[ +0.401604] beegfs: enabling unsafe global rkey
[ +0.019406] beegfs: enabling unsafe global rkey
[ +0.129368] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ +0.000149] IPv6: ADDRCONF(NETDEV_CHANGE): cali8c60e1110e4: link becomes ready
[ +0.411540] beegfs: enabling unsafe global rkey
[ +0.010607] beegfs: enabling unsafe global rkey
[ +0.000074] beegfs: enabling unsafe global rkey
[ +0.015936] beegfs: enabling unsafe global rkey
[ +0.000760] beegfs: enabling unsafe global rkey
[ +0.003149] beegfs: enabling unsafe global rkey
[ +0.003943] beegfs: enabling unsafe global rkey
[ +0.000812] beegfs: enabling unsafe global rkey
[ +0.003879] beegfs: enabling unsafe global rkey
[ +0.002547] beegfs: enabling unsafe global rkey
[ +0.004183] beegfs: enabling unsafe global rkey
[ +0.003145] beegfs: enabling unsafe global rkey
[ +0.002674] beegfs: enabling unsafe global rkey
[ +0.000181] beegfs: enabling unsafe global rkey
[ +0.004432] beegfs: enabling unsafe global rkey
[ +0.003933] beegfs: enabling unsafe global rkey
[ +0.000490] beegfs: enabling unsafe global rkey
[ +0.000021] beegfs: enabling unsafe global rkey
[ +0.004215] beegfs: enabling unsafe global rkey
[ +0.000018] beegfs: enabling unsafe global rkey
[ +0.000022] beegfs: enabling unsafe global rkey
[ +0.009685] beegfs: enabling unsafe global rkey
[ +0.001610] beegfs: enabling unsafe global rkey
[ +0.002971] beegfs: enabling unsafe global rkey
[ +0.006074] beegfs: enabling unsafe global rkey
[ +0.005775] beegfs: enabling unsafe global rkey
[ +0.006999] beegfs: enabling unsafe global rkey
[ +0.004661] beegfs: enabling unsafe global rkey
[ +0.003391] beegfs: enabling unsafe global rkey
[ +0.003704] beegfs: enabling unsafe global rkey
[ +0.000850] beegfs: enabling unsafe global rkey
[ +0.000767] beegfs: enabling unsafe global rkey
[ +0.000719] beegfs: enabling unsafe global rkey
[ +0.000011] beegfs: enabling unsafe global rkey
[ +0.004755] beegfs: enabling unsafe global rkey
[ +0.012335] beegfs: enabling unsafe global rkey
[ +0.003536] beegfs: enabling unsafe global rkey
[ +0.000356] beegfs: enabling unsafe global rkey
[ +0.025615] beegfs: enabling unsafe global rkey
[ +0.015285] beegfs: enabling unsafe global rkey
[ +0.007606] beegfs: enabling unsafe global rkey
[ +0.014390] beegfs: enabling unsafe global rkey
[ +0.022770] beegfs: enabling unsafe global rkey
[ +0.010850] beegfs: enabling unsafe global rkey
[ +0.009029] beegfs: enabling unsafe global rkey
[ +0.077486] beegfs: enabling unsafe global rkey
[ +0.015030] beegfs: enabling unsafe global rkey
[ +0.053817] beegfs: enabling unsafe global rkey
[ +0.008146] beegfs: enabling unsafe global rkey
[ +0.145996] beegfs: enabling unsafe global rkey
[ +0.000196] beegfs: enabling unsafe global rkey
[ +0.170787] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ +0.000126] IPv6: ADDRCONF(NETDEV_CHANGE): calica7144903ad: link becomes ready
[ +0.293605] beegfs: enabling unsafe global rkey
[ +0.005498] beegfs: enabling unsafe global rkey
[ +0.014350] beegfs: enabling unsafe global rkey
[ +0.000013] beegfs: enabling unsafe global rkey
[ +0.313616] beegfs: enabling unsafe global rkey
[ +0.043206] beegfs: enabling unsafe global rkey
[ +0.131647] beegfs: enabling unsafe global rkey
[ +0.059039] IPv6: ADDRCONF(NETDEV_CHANGE): cali45bfe5620c7: link becomes ready
[ +0.029478] beegfs: enabling unsafe global rkey
[ +0.251267] beegfs: enabling unsafe global rkey
[ +0.024959] beegfs: enabling unsafe global rkey
[ +0.005367] beegfs: enabling unsafe global rkey
[ +0.454570] beegfs: enabling unsafe global rkey
[ +0.329274] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ +0.000072] IPv6: ADDRCONF(NETDEV_CHANGE): calib855662d412: link becomes ready
[ +1.226132] IPv6: ADDRCONF(NETDEV_CHANGE): califfd1e2bd091: link becomes ready
[ +1.084970] IPv6: ADDRCONF(NETDEV_CHANGE): cali911afef9df4: link becomes ready
[ +1.189186] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ +0.000079] IPv6: ADDRCONF(NETDEV_CHANGE): cali5331c8c53f7: link becomes ready
[ +0.162815] beegfs: enabling unsafe global rkey
[ +2.138266] beegfs: enabling unsafe global rkey
[ +0.000785] beegfs: enabling unsafe global rkey
[ +0.001791] beegfs: enabling unsafe global rkey
[ +0.354556] beegfs: enabling unsafe global rkey
[ +0.315070] beegfs: enabling unsafe global rkey
[ +2.804804] beegfs: enabling unsafe global rkey
[ +6.682594] beegfs: enabling unsafe global rkey
[ +8.621871] beegfs: enabling unsafe global rkey
[ +0.119760] beegfs: enabling unsafe global rkey
[ +0.005202] beegfs: enabling unsafe global rkey
[ +0.000592] beegfs: enabling unsafe global rkey
[Jun19 22:02] beegfs: enabling unsafe global rkey
[ +1.204703] beegfs: enabling unsafe global rkey
[ +0.002081] beegfs: enabling unsafe global rkey
[ +0.000015] beegfs: enabling unsafe global rkey
[ +1.258890] beegfs: enabling unsafe global rkey
[ +0.000241] beegfs: enabling unsafe global rkey
[ +0.001358] beegfs: enabling unsafe global rkey
[ +1.290517] beegfs: enabling unsafe global rkey
[ +0.186285] beegfs: enabling unsafe global rkey
[ +0.340943] beegfs: enabling unsafe global rkey
[ +0.006794] beegfs: enabling unsafe global rkey
[ +0.002601] beegfs: enabling unsafe global rkey
[ +1.775295] beegfs: enabling unsafe global rkey
[ +0.267922] beegfs: enabling unsafe global rkey
[ +3.524590] beegfs: enabling unsafe global rkey
[ +0.001011] beegfs: enabling unsafe global rkey
[ +0.241760] beegfs: enabling unsafe global rkey
[ +0.002395] beegfs: enabling unsafe global rkey
[ +1.105426] beegfs: enabling unsafe global rkey
[ +3.160258] beegfs: enabling unsafe global rkey
[ +0.001251] beegfs: enabling unsafe global rkey
[ +3.408555] beegfs: enabling unsafe global rkey
[ +9.895894] beegfs: enabling unsafe global rkey
[ +8.342054] beegfs: enabling unsafe global rkey
[ +0.000805] beegfs: enabling unsafe global rkey
[ +0.000142] beegfs: enabling unsafe global rkey
[ +2.068748] beegfs: enabling unsafe global rkey
[ +0.002028] beegfs: enabling unsafe global rkey
[ +1.969862] beegfs: enabling unsafe global rkey
[ +0.002583] beegfs: enabling unsafe global rkey
[ +0.001995] beegfs: enabling unsafe global rkey
[ +18.848671] beegfs: enabling unsafe global rkey
[Jun19 22:03] beegfs: enabling unsafe global rkey
[ +0.000587] beegfs: enabling unsafe global rkey
[ +0.502635] beegfs: enabling unsafe global rkey
[Jun19 22:04] beegfs: enabling unsafe global rkey
[ +33.971986] beegfs: enabling unsafe global rkey
[Jun19 22:05] beegfs: enabling unsafe global rkey
[ +38.692220] beegfs: enabling unsafe global rkey
[Jun19 22:06] beegfs: enabling unsafe global rkey
[ +0.000399] beegfs: enabling unsafe global rkey
[ +0.000400] beegfs: enabling unsafe global rkey
[ +8.137573] beegfs: enabling unsafe global rkey
[ +0.616557] beegfs: enabling unsafe global rkey
[ +1.234374] beegfs: enabling unsafe global rkey
[ +0.000016] beegfs: enabling unsafe global rkey
[Jun19 22:07] beegfs: enabling unsafe global rkey
[ +0.000534] beegfs: enabling unsafe global rkey
[ +0.000178] beegfs: enabling unsafe global rkey
[Jun19 22:08] beegfs: enabling unsafe global rkey
[ +3.751615] beegfs: enabling unsafe global rkey
[ +26.882015] beegfs: enabling unsafe global rkey
[Jun19 22:09] beegfs: enabling unsafe global rkey
[Jun19 22:14] beegfs: enabling unsafe global rkey
[ +0.000024] beegfs: enabling unsafe global rkey
[Jun19 22:15] beegfs: enabling unsafe global rkey

Joe McCormick

unread,
Jun 21, 2023, 1:20:41 PM6/21/23
to beegfs-user
Hi Lukáš,

No need to increase the log level, the messages that indicate PVCs are getting created/deleted are logged by default so things are working as expected.

Could you run dmesg with the -T option to convert all the timestamps?

Could you also send over the contents of /var/log/beegfs-client.log from the affected client and the /var/log/beegfs-meta.log from the metas during the 21:50CEST 19/6 timeframe?

I presume BeeGFS traffic isn't using that eth0 link and we can ignore those messages in dmesg?

In the past I've only seen that "communication error on send" message in response to a legitimate connectivity problem, so maybe we need to look closer at the BeeGFS services themselves.

I did write a quick program that runs 10 threads in parallel, each looping through mounting BeeGFS (to separate mount points), creating/deleting separate files in a shared directory, remounting, and then running stat on the shared directory. If I run unmount too soon after writing a file it may understandably error with "target is busy", but I have yet to see any issues with the mount or file operations, or any communication errors. This roughly simulates what the driver is doing around the time frame of the issue.

Thanks,

~Joe

Lukas Hejtmanek

unread,
Jun 21, 2023, 5:01:38 PM6/21/23
to fhgfs...@googlegroups.com


> On 21. 6. 2023, at 19:20, Joe McCormick <joe.mc...@thinkparq.com> wrote:
>
> Hi Lukáš,
>
> No need to increase the log level, the messages that indicate PVCs are getting created/deleted are logged by default so things are working as expected.
>
> Could you run dmesg with the -T option to convert all the timestamps?

[Mon Jun 19 21:37:05 2023] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[Mon Jun 19 21:37:05 2023] IPv6: ADDRCONF(NETDEV_CHANGE): cali8f2a330934c: link becomes ready
[Mon Jun 19 21:47:05 2023] beegfs: umount(488261): BeeGFS unmounted.
[Mon Jun 19 21:47:06 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 21:47:06 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 21:47:06 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 21:47:06 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 21:47:06 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 21:47:06 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 21:47:06 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 21:47:06 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 21:47:06 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 21:47:06 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 21:47:06 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 21:47:06 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 21:47:06 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 21:47:06 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 21:47:07 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 21:47:07 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 21:47:07 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 21:47:07 2023] beegfs: mount(488823): BeeGFS mount ready.
[Mon Jun 19 22:01:26 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:01:28 2023] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[Mon Jun 19 22:01:28 2023] IPv6: ADDRCONF(NETDEV_CHANGE): cali523db3e0a51: link becomes ready
[Mon Jun 19 22:01:28 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:01:29 2023] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[Mon Jun 19 22:01:29 2023] IPv6: ADDRCONF(NETDEV_CHANGE): cali60e64a51a5c: link becomes ready
[Mon Jun 19 22:01:30 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:01:30 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:01:30 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:01:30 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:01:30 2023] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[Mon Jun 19 22:01:30 2023] IPv6: ADDRCONF(NETDEV_CHANGE): cali8c60e1110e4: link becomes ready
[Mon Jun 19 22:01:31 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:01:31 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:01:31 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:01:31 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:01:31 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:01:31 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:01:31 2023] beegfs: enabling unsafe global rkey
[... the line above repeated 45 times in total within the same second ...]
[Mon Jun 19 22:01:31 2023] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[Mon Jun 19 22:01:31 2023] IPv6: ADDRCONF(NETDEV_CHANGE): calica7144903ad: link becomes ready
[Mon Jun 19 22:01:32 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:01:32 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:01:32 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:01:32 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:01:32 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:01:32 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:01:32 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:01:32 2023] IPv6: ADDRCONF(NETDEV_CHANGE): cali45bfe5620c7: link becomes ready
[Mon Jun 19 22:01:32 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:01:33 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:01:33 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:01:33 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:01:33 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:01:33 2023] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[Mon Jun 19 22:01:33 2023] IPv6: ADDRCONF(NETDEV_CHANGE): calib855662d412: link becomes ready
[Mon Jun 19 22:01:35 2023] IPv6: ADDRCONF(NETDEV_CHANGE): califfd1e2bd091: link becomes ready
[Mon Jun 19 22:01:36 2023] IPv6: ADDRCONF(NETDEV_CHANGE): cali911afef9df4: link becomes ready
[Mon Jun 19 22:01:37 2023] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[Mon Jun 19 22:01:37 2023] IPv6: ADDRCONF(NETDEV_CHANGE): cali5331c8c53f7: link becomes ready
[Mon Jun 19 22:01:37 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:01:39 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:01:39 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:01:39 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:01:40 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:01:40 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:01:43 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:01:49 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:01:58 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:01:58 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:01:58 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:01:58 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:02:00 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:02:01 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:02:01 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:02:01 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:02:02 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:02:02 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:02:02 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:02:03 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:02:04 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:02:04 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:02:04 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:02:04 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:02:06 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:02:06 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:02:10 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:02:10 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:02:10 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:02:10 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:02:11 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:02:14 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:02:14 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:02:17 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:02:27 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:02:36 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:02:36 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:02:36 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:02:38 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:02:38 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:02:40 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:02:40 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:02:40 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:02:59 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:03:18 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:03:18 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:03:18 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:04:10 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:04:44 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:05:12 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:05:51 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:06:48 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:06:48 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:06:48 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:06:56 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:06:57 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:06:58 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:06:58 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:07:09 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:07:09 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:07:09 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:08:20 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:08:24 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:08:51 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:09:16 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:14:30 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:14:30 2023] beegfs: enabling unsafe global rkey
[Mon Jun 19 22:15:48 2023] beegfs: enabling unsafe global rkey

> Could you also send over the contents of /var/log/beegfs-client.log from the affected client and the /var/log/beegfs-meta.log from the metas during the 21:50CEST 19/6 timeframe?

I do not have beegfs-client.log, but perhaps the helperd log helps?

2023-06-19T20:03:36.755357936+02:00 stdout F (3) Jun19 18:03:36 *beegfs-csi-driv(215136) [NodeConn (acquire stream)] >> Connected: beegf...@10.16.59.42:8005 (protocol: RDMA)
2023-06-19T21:49:59.157348571+02:00 stdout F (2) Jun19 19:49:59 *umount(488261) [App (stop components)] >> Stopping components...
2023-06-19T21:49:59.158215895+02:00 stdout F (2) Jun19 19:49:59 *beegfs_XNodeSyn(192522) [Deregistration] >> Node deregistration successful.
2023-06-19T21:50:01.161619058+02:00 stdout F (2) Jun19 19:50:01 *umount(488261) [App (wait for component termination)] >> Still waiting for this component to stop: beegfs_AckMgr
2023-06-19T21:50:01.385757312+02:00 stdout F (2) Jun19 19:50:01 *umount(488261) [App (wait for component termination)] >> Component stopped: beegfs_AckMgr
2023-06-19T21:50:01.386086566+02:00 stdout F (1) Jun19 19:50:01 *umount(488261) [App (stop)] >> All components stopped.
2023-06-19T21:50:02.53258485+02:00 stdout F (3) Jun19 19:50:02 *mount(488823) [DatagramListener (init sock)] >> Listening for UDP datagrams: Port 34603
2023-06-19T21:50:02.532815477+02:00 stdout F (1) Jun19 19:50:02 *mount(488823) [App_logInfos] >> BeeGFS Client Version: 7.3.2
2023-06-19T21:50:02.533023692+02:00 stdout F (2) Jun19 19:50:02 *mount(488823) [App_logInfos] >> ClientID: 77577-6490B16A-kub-b15.priv.cerit-sc.cz
2023-06-19T21:50:02.533146184+02:00 stdout F (2) Jun19 19:50:02 *mount(488823) [App_logInfos] >> Usable NICs: ibp225s0(TCP) ibp225s0(RDMA)
2023-06-19T21:50:02.534053534+02:00 stdout F (3) Jun19 19:50:02 *beegfs_XNodeSyn(488829) [NodeConn (acquire stream)] >> Connected: beegfs-...@127.0.0.1:8006 (protocol: TCP)
2023-06-19T21:50:02.535751583+02:00 stdout F (2) Jun19 19:50:02 *beegfs_DGramLis(488828) [Heartbeat incoming] >> New node: beegfs-mgmtd beegfs-mgmtd-0 [ID: 1];
2023-06-19T21:50:02.536154046+02:00 stdout F (3) Jun19 19:50:02 *beegfs_XNodeSyn(488829) [Init] >> Management node found. Downloading node groups...
2023-06-19T21:50:02.536367631+02:00 stdout F (3) Jun19 19:50:02 *beegfs_XNodeSyn(488829) [NodeConn (acquire stream)] >> Connected: beegfs...@10.16.59.2:8008 (protocol: TCP)
2023-06-19T21:50:02.537016772+02:00 stdout F (2) Jun19 19:50:02 *beegfs_XNodeSyn(488829) [Sync] >> Nodes added (sync results): 2 (Type: beegfs-meta)
2023-06-19T21:50:02.537470632+02:00 stdout F (2) Jun19 19:50:02 *beegfs_XNodeSyn(488829) [Sync] >> Nodes added (sync results): 16 (Type: beegfs-storage)
2023-06-19T21:50:02.538287992+02:00 stdout F (3) Jun19 19:50:02 *beegfs_XNodeSyn(488829) [Init] >> Node registration...
2023-06-19T21:50:02.538625913+02:00 stdout F (2) Jun19 19:50:02 *beegfs_XNodeSyn(488829) [Registration] >> Node registration successful.
2023-06-19T21:50:02.539453021+02:00 stdout F (3) Jun19 19:50:02 *beegfs_XNodeSyn(488829) [Init] >> Init complete.
2023-06-19T21:50:02.544177067+02:00 stdout F (3) Jun19 19:50:02 *mount(488823) [NodeConn (acquire stream)] >> Connected: beegf...@10.16.59.41:8005 (protocol: RDMA)
2023-06-19T21:50:02.548934786+02:00 stdout F (3) Jun19 19:50:02 *mount(488823) [NodeConn (acquire stream)] >> Connected: beegfs-...@10.16.59.41:8003 (protocol: RDMA)
2023-06-19T21:50:02.55601015+02:00 stdout F (3) Jun19 19:50:02 *mount(488823) [NodeConn (acquire stream)] >> Connected: beegfs-...@10.16.59.42:8003 (protocol: RDMA)
2023-06-19T21:50:02.560882126+02:00 stdout F (3) Jun19 19:50:02 *mount(488823) [NodeConn (acquire stream)] >> Connected: beegfs-...@10.16.59.43:8003 (protocol: RDMA)
2023-06-19T21:50:02.565309309+02:00 stdout F (3) Jun19 19:50:02 *mount(488823) [NodeConn (acquire stream)] >> Connected: beegfs-...@10.16.59.44:8003 (protocol: RDMA)
2023-06-19T21:50:02.569343547+02:00 stdout F (3) Jun19 19:50:02 *mount(488823) [NodeConn (acquire stream)] >> Connected: beegfs-...@10.16.59.45:8003 (protocol: RDMA)
2023-06-19T21:50:02.573732689+02:00 stdout F (3) Jun19 19:50:02 *mount(488823) [NodeConn (acquire stream)] >> Connected: beegfs-...@10.16.59.46:8003 (protocol: RDMA)
2023-06-19T21:50:02.57790575+02:00 stdout F (3) Jun19 19:50:02 *mount(488823) [NodeConn (acquire stream)] >> Connected: beegfs-...@10.16.59.47:8003 (protocol: RDMA)
2023-06-19T21:50:02.582409618+02:00 stdout F (3) Jun19 19:50:02 *mount(488823) [NodeConn (acquire stream)] >> Connected: beegfs-...@10.16.59.48:8003 (protocol: RDMA)
2023-06-19T21:50:02.587123625+02:00 stdout F (3) Jun19 19:50:02 *mount(488823) [NodeConn (acquire stream)] >> Connected: beegfs-...@10.16.59.49:8003 (protocol: RDMA)
2023-06-19T21:50:02.592202994+02:00 stdout F (3) Jun19 19:50:02 *mount(488823) [NodeConn (acquire stream)] >> Connected: beegfs-...@10.16.59.50:8003 (protocol: RDMA)
2023-06-19T21:50:02.601847018+02:00 stdout F (3) Jun19 19:50:02 *mount(488823) [NodeConn (acquire stream)] >> Connected: beegfs-...@10.16.59.51:8003 (protocol: RDMA)
2023-06-19T21:50:02.601982605+02:00 stdout F (3) Jun19 19:50:02 *mount(488823) [NodeConn (acquire stream)] >> Connected: beegfs-...@10.16.59.52:8003 (protocol: RDMA)
2023-06-19T21:50:02.607062666+02:00 stdout F (3) Jun19 19:50:02 *mount(488823) [NodeConn (acquire stream)] >> Connected: beegfs-...@10.16.59.53:8003 (protocol: RDMA)
2023-06-19T21:50:02.612197851+02:00 stdout F (3) Jun19 19:50:02 *mount(488823) [NodeConn (acquire stream)] >> Connected: beegfs-...@10.16.59.54:8003 (protocol: RDMA)
2023-06-19T21:50:02.616026619+02:00 stdout F (3) Jun19 19:50:02 *mount(488823) [NodeConn (acquire stream)] >> Connected: beegfs-...@10.16.59.55:8003 (protocol: RDMA)
2023-06-19T21:50:02.619892919+02:00 stdout F (3) Jun19 19:50:02 *mount(488823) [NodeConn (acquire stream)] >> Connected: beegfs-...@10.16.59.56:8003 (protocol: RDMA)
2023-06-19T21:50:02.622883309+02:00 stdout F (3) Jun19 19:50:02 *beegfs-csi-driv(215136) [NodeConn (acquire stream)] >> Connect failed: beegf...@10.16.59.42:8005 (protocol: RDMA)
2023-06-19T21:50:02.623057178+02:00 stdout F (1) Jun19 19:50:02 *beegfs-csi-driv(215136) [NodeConn (acquire stream)] >> Connect failed on all available routes: beegfs-meta kub-b2.priv.cerit-sc.cz [ID: 2]
2023-06-19T21:50:02.623163981+02:00 stdout F (2) Jun19 19:50:02 *beegfs-csi-driv(215136) [Messaging (RPC)] >> Unable to connect to: beegfs-meta kub-b2.priv.cerit-sc.cz [ID: 2]
2023-06-19T21:50:02.623268328+02:00 stdout F (3) Jun19 19:50:02 *beegfs-csi-driv(215136) [Messaging (RPC node)] >> Retrying communication with node: beegfs-meta kub-b2.priv.cerit-sc.cz [ID: 2]
2023-06-19T22:04:22.576557958+02:00 stdout F (3) Jun19 20:04:22 *beegfs-csi-driv(60278) [NodeConn (acquire stream)] >> Connected: beegf...@10.16.59.42:8005 (protocol: RDMA)
2023-06-19T22:57:43.836840278+02:00 stdout F (2) Jun19 20:57:43 *umount(829785) [App (stop components)] >> Stopping components...
2023-06-19T22:57:43.837929172+02:00 stdout F (2) Jun19 20:57:43 *beegfs_XNodeSyn(488829) [Deregistration] >> Node deregistration successful.
2023-06-19T22:57:45.389755252+02:00 stdout F (1) Jun19 20:57:45 *umount(829785) [App (stop)] >> All components stopped.

The beegfs-meta log has no new records since the service started.


So this looks like ‘sysTargetOfflineTimeoutSecs’ effectively disables reconnection attempts?

> I presume BeeGFS traffic isn't using that eth0 link and we can ignore those messages in dmesg?

No, BeeGFS traffic is over InfiniBand/RDMA only.

Joe McCormick
Jun 23, 2023, 1:51:32 PM
to beegfs-user
Hi Lukáš,

So for some reason when BeeGFS was remounted the client module couldn't connect to the second metadata service:

2023-06-19T21:50:02.622883309+02:00 stdout F (3) Jun19 19:50:02 *beegfs-csi-driv(215136) [NodeConn (acquire stream)] >> Connect failed: beegf...@10.16.59.42:8005 (protocol: RDMA)
2023-06-19T21:50:02.623057178+02:00 stdout F (1) Jun19 19:50:02 *beegfs-csi-driv(215136) [NodeConn (acquire stream)] >> Connect failed on all available routes: beegfs-meta kub-b2.priv.cerit-sc.cz [ID: 2]
2023-06-19T21:50:02.623163981+02:00 stdout F (2) Jun19 19:50:02 *beegfs-csi-driv(215136) [Messaging (RPC)] >> Unable to connect to: beegfs-meta kub-b2.priv.cerit-sc.cz [ID: 2]
2023-06-19T21:50:02.623268328+02:00 stdout F (3) Jun19 19:50:02 *beegfs-csi-driv(215136) [Messaging (RPC node)] >> Retrying communication with node: beegfs-meta kub-b2.priv.cerit-sc.cz [ID: 2]
2023-06-19T22:04:22.576557958+02:00 stdout F (3) Jun19 20:04:22 *beegfs-csi-driv(60278) [NodeConn (acquire stream)] >> Connected: beegf...@10.16.59.42:8005 (protocol: RDMA)

I don't believe sysTargetOfflineTimeoutSecs is involved here; that is the upper bound on how long the client will wait for updates from the management service on the status of the metadata and storage services before it presumes they're offline. Here the client connected to management and was able to download the list of nodes, and there are no management errors. I suspect having connCommRetrySecs set so high is what caused everything to appear hung indefinitely. Some parameters are set automatically based on the value of others; for example, connNumCommRetries is derived from connCommRetrySecs.
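For reference, a minimal sketch of the relevant knob in /etc/beegfs/beegfs-client.conf (the value shown is the template default, not a recommendation for this setup):

```
# Upper bound, in seconds, on how long an operation keeps retrying before it
# fails with a communication error; connNumCommRetries is derived from this.
connCommRetrySecs = 600
```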

In this case it looks like the client had some sort of issue connecting to that metadata service over RDMA. Provided there isn't a legitimate network issue, I'd be inclined to suspect an RDMA firmware or driver bug (possibly triggered by the rapid unmount/mount). This explains the "Communication error on send" that is eventually returned from the client (and the CSI driver). Probably some issue freeing up resources in time for them to be reused.

At this point, if this system is under support I'd suggest raising a ticket so they can investigate whether there is anything we can do in BeeGFS to handle this better, or whether this is fixed in some newer driver. If not, there are some experimental settings that let you tune the messaging and RDMA timeouts, so that if there is a transient error the connect can potentially be retried before the operation fails: https://git.beegfs.io/pub/v7/-/blob/master/client_module/build/dist/etc/beegfs-client.conf#L297. I would caution against being too aggressive with lowering these, especially in a production environment.

Alternatively, we could try to avoid the remount entirely, since that seems to be the crux of the problem. Two options come to mind:

(1) Try to find a way to delay the mount following an unmount. I'm not aware of a way to tell Kubernetes to wait some period of time before calling NodeStageVolume after it calls NodeUnstageVolume for the same volume/provider. Maybe this is something you can configure in the nextflow pipeline?

(2) Use a DaemonSet to keep the volume attached to all your nodes all the time, avoiding the unmount/remount. This could look like an Alpine container that just runs sleep, so no CPU cycles are consumed.
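For option (2), a minimal sketch of such a DaemonSet, assuming a ReadWriteMany PVC; all names here (beegfs-pin, beegfs-pvc, the mount path) are placeholders for your environment:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: beegfs-pin
spec:
  selector:
    matchLabels:
      app: beegfs-pin
  template:
    metadata:
      labels:
        app: beegfs-pin
    spec:
      containers:
      - name: pin
        image: alpine:3.18
        # Sleep forever: the pod exists only to keep the PVC (and thus the
        # node-level BeeGFS mount) staged on every node.
        command: ["sleep", "infinity"]
        volumeMounts:
        - name: beegfs-vol
          mountPath: /mnt/beegfs
      volumes:
      - name: beegfs-vol
        persistentVolumeClaim:
          claimName: beegfs-pvc
```

With one such pod per node, short-lived job pods share the already-staged mount instead of triggering a fresh mount/umount cycle each time.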

Thanks,

~Joe

Lukas Hejtmanek
Jun 27, 2023, 4:18:14 AM
to fhgfs...@googlegroups.com
Hi Joe,

> On 23. 6. 2023, at 19:51, Joe McCormick <joe.mc...@thinkparq.com> wrote:
>
> Hi Lukáš,
>
> So for some reason when BeeGFS was remounted the client module couldn't connect to the second metadata service:
>
> 2023-06-19T21:50:02.622883309+02:00 stdout F (3) Jun19 19:50:02 *beegfs-csi-driv(215136) [NodeConn (acquire stream)] >> Connect failed: beegf...@10.16.59.42:8005 (protocol: RDMA)
> 2023-06-19T21:50:02.623057178+02:00 stdout F (1) Jun19 19:50:02 *beegfs-csi-driv(215136) [NodeConn (acquire stream)] >> Connect failed on all available routes: beegfs-meta kub-b2.priv.cerit-sc.cz[ID: 2]
> 2023-06-19T21:50:02.623163981+02:00 stdout F (2) Jun19 19:50:02 *beegfs-csi-driv(215136) [Messaging (RPC)] >> Unable to connect to: beegfs-meta kub-b2.priv.cerit-sc.cz [ID: 2]
> 2023-06-19T21:50:02.623268328+02:00 stdout F (3) Jun19 19:50:02 *beegfs-csi-driv(215136) [Messaging (RPC node)] >> Retrying communication with node: beegfs-meta kub-b2.priv.cerit-sc.cz [ID: 2]
> 2023-06-19T22:04:22.576557958+02:00 stdout F (3) Jun19 20:04:22 *beegfs-csi-driv(60278) [NodeConn (acquire stream)] >> Connected: beegf...@10.16.59.42:8005 (protocol: RDMA)
>
> I don't believe sysTargetOfflineTimeoutSecs is involved here, that is the upper bounds the client will wait to get updates from the management service on the status of the metadata and storage services before it presumes they're offline. Here the client connected to management and was able to download the list of nodes and there are no management errors. I suspect having connCommRetrySecs set so high is what caused everything to appear hung indefinitely. Some parameters are set automatically based on the value of others, for example connNumCommRetries is based on connCommRetrySecs.

So, do you think the reconnect succeeded just after the sysTargetOfflineTimeoutSecs interval elapsed? I can change it to a different value and, when it happens again, check whether it really matches that time.

>
> In this case it looks like the client had some sort of issue connecting to that metadata service over RDMA. Provided there isn't a legitimate network issue, I'd be inclined to suspect an RDMA firmware or driver bug (possibly triggered by the rapid unmount/mount). This explains the "Communication error on send" that is eventually returned from the client (and the CSI driver). Probably some issue freeing up resources in time for them to be reused.
>
> At this point if this system is under support I'd suggest raising a ticket so they can investigate if there is anything we can do in BeeGFS to handle this better, or if this is fixed in some newer driver. If not there are some experimental settings that let you tune the messaging and RDMA timeouts that may allow you to tweak things in BeeGFS so if there is some transient error the connect can be potentially retried before the operation fails: https://git.beegfs.io/pub/v7/-/blob/master/client_module/build/dist/etc/beegfs-client.conf#L297. I would caution not to be to aggressive with lowering these, especially in a production environment.

Neither of these timeouts (connMessagingTimeouts and connRDMATimeouts) is set to 900 s, so neither matches the reconnect interval seen in the logs.
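For reference, the gap between the failed connect and the successful reconnect in the log excerpt above works out to about 860 s, close to (but not exactly) a 900 s timeout; GNU date is assumed:

```shell
# Timestamps taken from the helperd log above (local time, 2023-06-19).
t_failed=$(date -d '2023-06-19 21:50:02' +%s)
t_reconnected=$(date -d '2023-06-19 22:04:22' +%s)
gap=$((t_reconnected - t_failed))
echo "${gap}s"   # 860s, i.e. about 14 minutes 20 seconds
```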

> (1) Try to find a way to delay the mount following an unmount. I'm not aware of a way to tell Kubernetes to wait some period of time before calling NodeStageVolume after it calls NodeUnstageVolume for the same volume/provider. Maybe this is something you can configure in the nextflow pipeline?

I cannot do this in the pipeline, but perhaps the CSI driver could postpone the unstage for some time?

> (2) Use a DaemonSet to keep the volume attached attached to all your nodes all the time avoiding the unmount/remount. This could look like an Alpine container that runs sleep so no CPU cycles are consumed.

yes, this could mitigate the problem.

I agree that we might have buggy HW/FW, but BeeGFS should deal with it anyway.

Joe McCormick
Jun 28, 2023, 9:27:56 AM
to beegfs-user
Hi Lukáš,

I don't believe sysTargetOfflineTimeoutSecs is relevant here, but further lowering connCommRetrySecs should unblock things sooner so we can retry. Be careful not to lower it too aggressively; otherwise, operations that would have completed on a busy system may fail instead.

As for postponing the remount, I'm hesitant to introduce a mechanism like that into the CSI driver. In my experience, injecting arbitrary delays/waits into a program as a workaround should be done with caution. The first one may seem harmless, but eventually you end up with a really slow program, since it's unlikely any of the delays will ever be removed once added. Generally it's better to track down and understand/resolve the source of the problem that is prompting us to inject the delay, or to find another workaround if there is a specific workflow triggering the issue, like the DaemonSet option.

I did work with our DevOps team to run the program I wrote to aggressively remount BeeGFS and simulate the other aspects of this particular workflow on our RDMA test systems, but we never saw any issues with the stat following a quick remount. I can add that test to our suite (we had similar tests, but nothing quite so aggressively focused on a quick remount), but unless we can recreate the issue, it's hard to know where to start improving how we handle this behavior. If you're using specific RDMA drivers (OFED vs. inbox), I can double-check that we're testing with the same versions.
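In case it helps with reproducing locally, a hypothetical sketch of such a rapid mount/stat/umount loop is below. This is not Joe's actual test program; the paths and the dry-run default are assumptions. Set DRY_RUN=0 (as root, with a configured beegfs-client) to really exercise it:

```shell
# Rapid remount stress sketch: mount, stat immediately, unmount, repeat.
# Defaults to dry-run (prints the commands instead of executing them).
DRY_RUN=${DRY_RUN:-1}
MNT=${MNT:-/mnt/beegfs-stress}
CFG=${CFG:-/etc/beegfs/beegfs-client.conf}
ITER=${ITER:-5}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

i=1
while [ "$i" -le "$ITER" ]; do
    run mount -t beegfs beegfs_nodev "$MNT" -o "cfgFile=$CFG"
    run stat "$MNT"     # metadata operation right after the mount
    run umount "$MNT"   # immediate unmount to stress connection teardown
    i=$((i + 1))
done
```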

~Joe

Lukas Hejtmanek
Jul 3, 2023, 2:27:26 AM
to fhgfs...@googlegroups.com
Hi Joe,

By the way, is it possible to switch metadata traffic to TCP only? I tried it on an existing file system, but the client can no longer connect to the metadata servers.