I’ve just noticed the following errors in /var/log/beegfs-storage.log
(0) Aug07 19:46:59 Worker3-2 [WriteChunkFileMsg incoming] >> SocketException occurred: Receive timed out from: x.x.x.x:18267
(2) Aug07 19:46:59 Worker3-2 [WriteChunkFileMsg incoming] >> Details: sessionID: 67; fileHandle: 4709BA7E#1-5989096E-2; offset: 0; count: 524288
(3) Aug07 19:46:59 Worker3-2 [Work (process incoming msg)] >> Problem encountered during processing of a message. Disconnecting: x.x.x.x:18267
They are appearing on all 4 of our storage nodes and, since they started, have been popping up approximately every 10 minutes in the logs. Has anyone seen this before? It appears to cause issues with the client that also prevent it from being restarted, short of rebooting the whole machine.
All of our nodes are running version 6.12 on CentOS 7. Let me know if there’s any other info I can provide.
Thanks,
Kevin
Hi Kevin,
that's very odd, and we haven't seen much of it ourselves.
I just scanned our entire cluster for the string:
"Receive failed from: beegfs-storage:"
and on 2 BeeGFS systems totaling about a PB across ~265 nodes, I got the following errors that look like yours.
Command: [grep "Receive failed from: beegfs-storage" /var/log/beegfs-client-dfs[12].log]
lines  words  chars  #nodes  node(s)
    0      0      0     252  compute-1-1 ...
    4     64    718       1  compute-2-5
   23    162   1380       1  compute-13-20 * (bad hardware)
    1     16    177       1  compute-2-12
    2     32    354       1  compute-8-2
    2     32    359       1  compute-2-7
    1     16    176       1  compute-2-2
    5     80    865       1  hpc-login-1-2
    1     16    173       1  hpc-login-1-3
    1     16    187       1  compute-2-11
    1     16    176       1  compute-7-9
    2     32    368       1  compute-4-1
    4     64    692       1  compute-2-8
(The above shows how many error lines were generated by how many nodes: the wc columns count the grep hits, so 252 nodes had no hits at all, while compute-13-20 has 23 lines of errors and is about to be pulled due to multiple hardware problems.)
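Counts like the above can be gathered cluster-wide with something like this (a sketch; pdsh -a assumes a configured node list, and the client log paths are ours):

  pdsh -a 'grep "Receive failed from: beegfs-storage" /var/log/beegfs-client-dfs[12].log 2>/dev/null | wc'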
So we've seen them, but very rarely. Our IB interfaces are completely free of xmit/rec errors, so these errors may be related to rare packet collisions, or something else entirely?
We have a few ethernet errors, but they don't seem to be closely related to these errors.
This could also be related to the storage nodes. Is there one storage node in particular that is picking up the errors? (A crude measure is just the size of /var/log/beegfs-storage.log on each.)
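For example (the storage-[1-4] hostlist is a placeholder for your 4 storage nodes):

  pdsh -w storage-[1-4] 'wc -l /var/log/beegfs-storage.log'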
What FS are you using?
Can you scan the individual disks with smartctl to see if you have a couple of duds? If you're using a software RAID, it's trivial; if it's an LSI controller, you have to use a peek-thru workaround:
http://moo.nac.uci.edu/~hjm/HOWTOS.html#smartmegaraid
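Roughly something like this (a sketch; the device names and the megaraid disk IDs are assumptions that depend on your controller layout):

  # plain disks or software RAID: query each drive directly
  for d in /dev/sd[a-z]; do smartctl -H "$d"; done

  # behind an LSI MegaRAID: address the physical disks through
  # the passthrough; N in megaraid,N is each drive's device ID
  smartctl -a -d megaraid,0 /dev/sda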
We see about 10-500 errors like this:
(0) Feb01 09:59:15 Worker6-1 [ReadChunkFileV2Msg incoming] >> SocketException occurred: Disconnect during send() to: 10.2.255.222:56363
since Jan on all 10 of our storage servers. So they exist, but they're not frequent.
Also, are you using quotas? If so, are the affected users overrunning their quotas?
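If they are on, per-user usage vs. limits can be checked with the stock tool ('someuser' is just a placeholder):

  beegfs-ctl --getquota --uid someuser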
hjm
What about your network integrity? Is this new equipment or older stuff?
10G or IB? And what are the interface error states?
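Some quick checks (a sketch; eth0-style interface names are placeholders for yours):

  # ethernet counters
  ip -s link show eth0
  ethtool -S eth0 | grep -i err

  # InfiniBand port error counters (from infiniband-diags)
  perfquery
  ibqueryerrors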
If you have dual networks (many sites, like ours, have both 1GbE and IB to all servers, and if your DNS is set up well, the servers will fail over to the other network), what happens if you pull the offending network interface? This will have virtually no effect on a lightly loaded MD server, but you'll definitely see an effect on storage servers while they're on GbE.
What's the output from repeated iterations of beegfs-net? Do you see servers falling off the connection via this util?
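e.g. just loop it; the interval is arbitrary:

  watch -n 10 beegfs-net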
If you have centralized logging set up, can you see any patterns of errors? Do they propagate from one client or server?
(If you don't, you might consider it - it's very handy for things like this.)
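A minimal rsyslog forwarding stanza would do it (assuming a central host named 'loghost'; adjust the name and port to your setup):

  # /etc/rsyslog.d/forward.conf on each node
  *.*  @@loghost:514    # @@ = TCP; use a single @ for UDP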
Do these errors occur during particular applications? Hard to tell, I admit.
hjm
On Thursday, August 10, 2017 9:34:53 AM PDT Kevin Leigeb wrote:
> We use ZFS, so we are quickly notified of any disk anomalies, and we haven't seen anything in recent runs of `zpool status`. Also, we are not using quotas.
>
> These error messages appear across all storage nodes, unfortunately.
>
> I should also add that I've seen this message pertaining to specific targetIDs:
>
> (0) Aug09 16:21:14 *beegfs_Flusher(3921) [Remoting (write file)] >> Error storage targetID: 1702; Msg: Communication error; FileHandle: 5C9DBC66#0-598B6FB1-2
>
> Thanks
>
--
Harry Mangalam, Research CyberInfrastructure Center, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
XSEDE 'Campus Champion' - ask me about your research computing needs.