InfiniBand-related errors(?) in the logs, but the system still works normally

Kilian Schnelle

Jun 12, 2022, 7:43:32 AM
to beegfs-user
Hello all,

We have been running our BeeGFS system for a short while and everything seems to work fine, but the log files are being flooded with error(?) messages that I can't find a solution for. It is the same on all storage and client nodes, so I will just post an example from one of each. Since the messages are the same everywhere, maybe it is a network problem? Does anyone have an idea?

cheers
Kilian

client:
```

(0) Jun11 15:36:10 *python(645436) [readfileV2 (communication)] >> Communication error in RECVDATA stage. Node: beegfs-storage samson103 [ID: 3] (recv result: -70)

(0) Jun11 15:36:10 *python(645436) [readfileV2 (communication)] >> Communication error. Node: beegfs-storage samson103 [ID: 3]

(0) Jun11 15:38:15 *python(645485) [readfileV2 (communication)] >> Communication error in RECVDATA stage. Node: beegfs-storage samson103 [ID: 3] (recv result: -70)

(0) Jun11 15:38:15 *python(645485) [readfileV2 (communication)] >> Communication error. Node: beegfs-storage samson103 [ID: 3]

```


storage meta: (not many messages here, only occasionally)

```

(2) Jun11 15:20:43 CommSlave49 [IBVSocket.cpp:1782] >> Bad/unexpected completion opcode. wc[i].opcode: 136

(0) Jun11 15:20:43 CommSlave49 [Messaging (RPC)] >> Communication error: Disconnect during send() to: 192.168.127.203:8003; Peer: beegfs-storage samson103 [ID: 3]. (Message type: GetChunkFileAttribs (2017))

(2) Jun11 15:20:43 CommSlave49 [MessagingTk.cpp:281] >> Retrying communication. targetID: 3; message type: GetChunkFileAttribs (2017)

(0) Jun11 15:20:47 CommSlave188 [Messaging (RPC)] >> Communication error: Received disconnect from: 192.168.127.203:8003; Peer: beegfs-storage samson103 [ID: 3]. (Message type: GetChunkFileAttribs (2017))

(2) Jun11 15:20:47 CommSlave188 [MessagingTk.cpp:281] >> Retrying communication. targetID: 3; message type: GetChunkFileAttribs (2017)

```


storage-storage: (this log is really full of these messages)

```

(2) Jun11 15:36:10 Worker1 [IBVSocket.cpp:1782] >> Bad/unexpected completion opcode. wc[i].opcode: 136

(0) Jun11 15:36:10 Worker1 [ReadChunkFileV2Msg incoming] >> SocketException occurred: Disconnect during send() to: 192.168.124.4:50916

(2) Jun11 15:36:10 Worker1 [ReadChunkFileV2Msg incoming] >> Details: sessionID: 79; fileHandle: 20D804FD#11-62A4B5B0-3; offset: 39845888; count: 524288

(3) Jun11 15:36:10 Worker1 [Work (process incoming msg)] >> Problem encountered during processing of a message. Disconnecting: 192.168.124.4:50916

(2) Jun11 15:38:15 Worker8 [IBVSocket.cpp:1782] >> Bad/unexpected completion opcode. wc[i].opcode: 136

(0) Jun11 15:38:15 Worker8 [ReadChunkFileV2Msg incoming] >> SocketException occurred: Disconnect during send() to: 192.168.124.4:38304

(2) Jun11 15:38:15 Worker8 [ReadChunkFileV2Msg incoming] >> Details: sessionID: 79; fileHandle: 20D80801#28-62A4B62E-3; offset: 24641536; count: 524288

(3) Jun11 15:38:15 Worker8 [Work (process incoming msg)] >> Problem encountered during processing of a message. Disconnecting: 192.168.124.4:38304

```
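
To get a feel for how often the opcode message shows up compared with everything else, the log lines can be tallied with a few lines of Python. This is only a minimal sketch: the regular expression is written against the excerpts quoted above, and `count_opcodes` is a hypothetical helper name, not something that ships with BeeGFS.

```python
import re
from collections import Counter

# Matches the "Bad/unexpected completion opcode" lines as they appear in the
# excerpts above; other BeeGFS versions may word the message differently.
OPCODE_RE = re.compile(r"Bad/unexpected completion opcode\. wc\[i\]\.opcode: (\d+)")


def count_opcodes(lines):
    """Count how often each reported wc opcode value occurs in the given log lines."""
    counts = Counter()
    for line in lines:
        match = OPCODE_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Three lines copied from the storage log excerpt above.
sample = [
    "(2) Jun11 15:36:10 Worker1 [IBVSocket.cpp:1782] >> Bad/unexpected completion opcode. wc[i].opcode: 136",
    "(0) Jun11 15:36:10 Worker1 [ReadChunkFileV2Msg incoming] >> SocketException occurred: Disconnect during send() to: 192.168.124.4:50916",
    "(2) Jun11 15:38:15 Worker8 [IBVSocket.cpp:1782] >> Bad/unexpected completion opcode. wc[i].opcode: 136",
]

print(count_opcodes(sample))
```

Passing an open log file instead of `sample` works the same way, since the function only iterates over lines.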

web...@netapp.com

Jun 13, 2022, 9:40:13 AM
to beegfs-user
We recently ran into this on a BeeGFS system. It appears to be related to an issue with the IB/RDMA drivers. If you are running OFED, the OFED 5.6-1.0.3.3 release notes mention the following issue: "On rare occasions, the application did not use any raw WQE feature and unexpectedly got wc opcode IBV_WC_DRIVER2." (https://docs.nvidia.com/networking/display/MLNXOFEDv561033/Bug+Fixes+in+This+Version). The notes indicate that this was discovered in OFED 5.5-1.0.3.2 and fixed in OFED 5.6-1.0.3.3. Inbox drivers are also likely to be affected, but specific versions may be a bit harder to pin down. The following commit to rdma-core (which is included in rdma-core v39.0+) appears to be the one that takes care of it: https://github.com/linux-rdma/rdma-core/commit/4c905646de3e75bdccada4abe9f0d273d76eaf50.

We had good luck upgrading OFED to 5.6 to get rid of the errors, but this is not yet on the BeeGFS support matrix, so that's something to consider.
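
For anyone who wants to check a batch of nodes, the version comparison can be sketched in Python. This assumes the installed version is available as a string such as the one printed by `ofed_info -s`; the exact output format and the helper names here are assumptions. The range is only what the release notes state (discovered in 5.5-1.0.3.2, fixed in 5.6-1.0.3.3), so it says nothing about older releases or inbox drivers.

```python
import re

# Range taken from the release notes quoted above: discovered in
# 5.5-1.0.3.2, fixed in 5.6-1.0.3.3. Whether releases outside this range
# are affected is not stated there.
DISCOVERED_IN = (5, 5, 1, 0, 3, 2)
FIXED_IN = (5, 6, 1, 0, 3, 3)

# Assumed version format: "<major>.<minor>-<a>.<b>.<c>.<d>", optionally
# embedded in a longer string such as "MLNX_OFED_LINUX-5.5-1.0.3.2:".
VERSION_RE = re.compile(r"(\d+)\.(\d+)-(\d+)\.(\d+)\.(\d+)\.(\d+)")


def parse_ofed_version(text):
    """Extract the OFED version from text as a tuple of six integers."""
    match = VERSION_RE.search(text)
    if match is None:
        raise ValueError(f"no OFED version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def in_reported_range(text):
    """Return True if the version lies between 'discovered in' and 'fixed in'."""
    return DISCOVERED_IN <= parse_ofed_version(text) < FIXED_IN


print(in_reported_range("MLNX_OFED_LINUX-5.5-1.0.3.2:"))
print(in_reported_range("MLNX_OFED_LINUX-5.6-1.0.3.3:"))
```

Tuples compare element by element, which avoids the pitfalls of comparing version strings directly.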

Eric Weber
E-Series Solutions Software Engineer
NetApp
