Strange storage node errors


Kevin Leigeb

Aug 8, 2017, 5:02:08 PM
to fhgfs...@googlegroups.com

I’ve just noticed the following errors in /var/log/beegfs-storage.log:

 

(0) Aug07 19:46:59 Worker3-2 [WriteChunkFileMsg incoming] >> SocketException occurred: Receive timed out from: x.x.x.x:18267

(2) Aug07 19:46:59 Worker3-2 [WriteChunkFileMsg incoming] >> Details: sessionID: 67; fileHandle: 4709BA7E#1-5989096E-2; offset: 0; count: 524288

(3) Aug07 19:46:59 Worker3-2 [Work (process incoming msg)] >> Problem encountered during processing of a message. Disconnecting: x.x.x.x:18267

 

They are appearing on all four of our storage nodes, and since they started they have been popping up in the logs roughly every 10 minutes. Has anyone seen this before? The errors appear to cause issues on the client side that also prevent the client from being restarted, short of rebooting the whole machine.

 

All of our nodes are running version 6.12 on CentOS 7. Let me know if there’s any other info I can provide.

 

Thanks,

Kevin

 

 

kevin....@gmail.com

Aug 9, 2017, 2:52:06 AM
to beegfs-user
At the same time, the client logs are also showing this:

(2) Aug08 16:08:52 *beegfs_Flusher(17149) [writefile (communication)] >> Receive failed from: beegfs-storage bigdata-store-3.glbrc.org [ID: 13] @ 144.92.98.74:8003
(0) Aug08 16:08:52 *beegfs_Flusher(17149) [writefile (communication)] >> Communication error. Node: beegfs-storage bigdata-store-3.glbrc.org [ID: 13]

Harry Mangalam

Aug 10, 2017, 12:19:23 PM
to beegfs-user

Hi Kevin,

 

That's very odd; we certainly haven't seen this very much.

 

I just scanned our entire cluster for the string:

"Receive failed from: beegfs-storage:"

 

and on 2 BeeGFS systems totalling about a PB across ~265 nodes, I got the following errors, which look like yours.

 

Command: [grep "Receive failed from: beegfs-storage" /var/log/beegfs-client-dfs[12].log]

 

lines  words  chars  #nodes  node(s)
    0      0      0     252  compute-1-1 …
    4     64    718       1  compute-2-5
   23    162   1380       1  compute-13-20  * (bad hardware)
    1     16    177       1  compute-2-12
    2     32    354       1  compute-8-2
    2     32    359       1  compute-2-7
    1     16    176       1  compute-2-2
    5     80    865       1  hpc-login-1-2
    1     16    173       1  hpc-login-1-3
    1     16    187       1  compute-2-11
    1     16    176       1  compute-7-9
    2     32    368       1  compute-4-1
    4     64    692       1  compute-2-8

(The first three columns are the wc counts of matching error lines, and the fourth is how many nodes produced that count: 252 nodes had no hits at all, while compute-13-20 has 23 lines of errors and is about to be pulled due to multiple hardware problems.)
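If it helps, this is roughly how I gathered those counts in one pass; a minimal sketch, assuming pdsh-style fan-out and the same client log path on every node:

    # one "node: count" line per node, busiest nodes first
    pdsh -a 'grep -c "Receive failed from: beegfs-storage" /var/log/beegfs-client-dfs1.log' \
      2>/dev/null | sort -t: -k2 -rn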

 

So we've seen them, but very rarely. Our IB interfaces are completely free of xmit/recv errors, so these errors may be related to rare packet collisions or .. ?

 

We have a few ethernet errors but they don't seem to be closely related to these errors.

 

This could also be related to the storage nodes. Is there one storage node in particular that is picking up the errors? (A crude measure is just the size of /var/log/beegfs-storage.log.)
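A quick way to compare them, assuming ssh access to the storage servers (hostnames below are placeholders):

    # print the storage log size on each server; one much bigger log is suspicious
    for h in store-1 store-2 store-3 store-4; do
      printf '%s: ' "$h"
      ssh "$h" 'wc -c < /var/log/beegfs-storage.log'
    done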

 

What FS are you using?

 

Can you scan the individual disks with smartctl to see if you have a couple of duds? If you're using a software RAID, it's trivial; if it's an LSI controller, you have to use a peek-thru workaround:

 

http://moo.nac.uci.edu/~hjm/HOWTOS.html#smartmegaraid
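For example (the megaraid device IDs depend on your controller layout, so treat these as templates, not exact commands):

    # plain disks or software RAID: query each drive directly
    for d in /dev/sd?; do smartctl -H "$d"; done

    # disks hidden behind an LSI/MegaRAID controller: address them by megaraid ID
    smartctl -a -d megaraid,0 /dev/sda
    smartctl -a -d megaraid,1 /dev/sda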

 

We see about 10-500 errors like this:

(0) Feb01 09:59:15 Worker6-1 [ReadChunkFileV2Msg incoming] >> SocketException occurred: Disconnect during send() to: 10.2.255.222:56363

since January on all 10 of our storage servers. So they exist, but they're not frequent.

 

Also, are you using quotas? If so, are the affected users overrunning their quotas?
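(If you are, something like this shows usage against limits; the username is just an example, and it requires quota support to be enabled:)

    beegfs-ctl --getquota --uid someuser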

 

hjm

Kevin Leigeb

unread,
Aug 10, 2017, 12:34:53 PM8/10/17
to beegfs-user
Harry -

We use ZFS, so we are quickly notified of any disk anomalies, and we haven't seen anything in recent runs of `zpool status`. Also, we are not using quotas.
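(The check amounts to something like this; -x prints only pools with problems:)

    zpool status -x   # reports "all pools are healthy" when nothing is wrong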

These error messages appear across all storage nodes, unfortunately. 

I should also add that I've seen this message pertaining to specific targetIDs:

(0) Aug09 16:21:14 *beegfs_Flusher(3921) [Remoting (write file)] >> Error storage targetID: 1702; Msg: Communication error; FileHandle: 5C9DBC66#0-598B6FB1-2

Thanks

harry mangalam

Aug 10, 2017, 1:20:40 PM
to fhgfs...@googlegroups.com, Kevin Leigeb

What about your network integrity? Is this new equipment or older stuff?

 

10G or IB? And what are the interface error states?

 

If you have dual networks (many sites, like ours, have both 1GbE and IB to all servers, and if your DNS is set up well, the servers will fail over to the other network), what happens if you pull the offending network interface? This will have virtually no effect on a lightly loaded MD server, but you'll definitely see an effect on the storage servers while they're on GbE.

 

What's the output from repeated iterations of beegfs-net? Do you see servers falling off the connection via this util?
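Something like the following would show it, assuming the beegfs-utils package is installed on the client:

    # re-run beegfs-net every 10 seconds and watch for connections disappearing
    watch -n 10 beegfs-net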

 

If you have centralized logging set up, can you see any patterns of errors? Do they originate from one client or server?

(If you don't, you might consider it - it's very handy for things like this.)
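A minimal rsyslog sketch for that, assuming a central host named loghost (names and paths are examples):

    # /etc/rsyslog.d/beegfs-forward.conf
    module(load="imfile")
    input(type="imfile" File="/var/log/beegfs-client.log" Tag="beegfs-client:")
    input(type="imfile" File="/var/log/beegfs-storage.log" Tag="beegfs-storage:")
    *.* @loghost:514   # UDP; use @@loghost:514 for TCP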

 

Do these errors occur during particular applications? Hard to tell, I admit.

 

hjm

 

On Thursday, August 10, 2017 9:34:53 AM PDT Kevin Leigeb wrote:

> We use ZFS, so we are quickly notified of any disk anomalies, and we haven't seen anything in recent runs of `zpool status`. Also, we are not using quotas.
>
> These error messages appear across all storage nodes, unfortunately.
>
> I should also add that I've seen this message pertaining to specific targetIDs:
>
> (0) Aug09 16:21:14 *beegfs_Flusher(3921) [Remoting (write file)] >> Error storage targetID: 1702; Msg: Communication error; FileHandle: 5C9DBC66#0-598B6FB1-2
>
> Thanks


--

Harry Mangalam, Research CyberInfrastructure Center, Rm 225 MSTB, UC Irvine

[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487

415 South Circle View Dr, Irvine, CA, 92697 [shipping]

XSEDE 'Campus Champion' - ask me about your research computing needs.

Map to MSTB | Map to Data Center Gate

 

Kevin Leigeb

Aug 10, 2017, 6:25:12 PM
to beegfs-user, kevin....@gmail.com
Switching to version 6.14, which was just released today, has not fixed the issue.

Everything we run is using 10G Ethernet. I was worried that not limiting connections to specific interfaces was the issue, but I was able to recreate the issue even after configuring a connInterfaces file.
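(For reference, the setup is just a file listed in the client config; the interface name below is specific to our boxes:)

    # in /etc/beegfs/beegfs-client.conf:
    connInterfacesFile = /etc/beegfs/connInterfacesFile

    # /etc/beegfs/connInterfacesFile lists one interface per line, preferred first:
    em1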

I have been using dos2unix to recreate the issue, since that is what I was running when I first noticed it. I can successfully run the command on certain nodes, which is the most infuriating part.

Also, once I've "bonked" the connection on a certain client, the only way to restart the client is to reboot the node: the client ends up in a state where the kernel module can't be rmmod'ed, and the beegfs_Flusher process goes into D state and can't be killed.
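(You can see the stuck state directly; the second command needs root and shows where in the kernel the thread is blocked. The PID placeholder is whatever ps reports:)

    # D in the STAT column = uninterruptible sleep, which no signal can touch
    ps -eo pid,stat,comm | grep beegfs
    cat /proc/<pid>/stack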

Kevin Leigeb

Aug 17, 2017, 11:59:17 AM
to beegfs-user, kevin....@gmail.com
I was finally able to figure out the issue here. It turns out there was a network discrepancy: some of the clients were running on a different switch than the BeeGFS nodes. Since we don't control our networking (it's handled at the campus level), there are likely some settings somewhere along the path causing communication issues - not with the metadata nodes, but with the data striped across multiple storage nodes. Moving the clients to the same switch as the nodes fixed the issue for us.

Thanks Harry for the help on this one!

Kevin

Jure Pečar

Apr 10, 2018, 11:33:42 AM
to fhgfs...@googlegroups.com
On Thu, 17 Aug 2017 08:59:16 -0700 (PDT)
Kevin Leigeb <kevin....@gmail.com> wrote:

> I was finally able to figure out the issue here. It turns out there was a
> network discrepancy: some of the clients were running on a different switch
> than the BeeGFS nodes. Since we don't control our networking (it's handled
> at the campus level), there are likely some settings somewhere along the
> path causing communication issues - not with the metadata nodes, but with
> the data striped across multiple storage nodes. Moving the clients to the
> same switch as the nodes fixed the issue for us.

We received a bunch of new nodes for our cluster and managed to recreate the same kind of problem.

We temporarily connected the new nodes (a blade chassis) through a single 10Gb network link, creating a large oversubscription of that link. That is OK for deployment and testing, but when we started doing BeeGFS testing, we periodically saw one of the beegfs-storage processes on the storage servers spewing "SocketException occurred: Receive timed out from: [new node IPs]" for some minutes and then locking up completely, making the whole BeeGFS mount hang on all cluster nodes. Restarting the affected beegfs-storage process gets the whole thing running again.
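(On el6 that restart is just the stock init script shipped with the package:)

    service beegfs-storage restart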

To me this looks like some kind of unhandled connection state in the beegfs-storage process. We're running 6.18, with el6 on the server side and el7 on the HPC nodes.

When we recable these new nodes to 2x100Gb, I'll try to recreate this issue; I hope I won't be able to.


--

Jure Pečar
https://jure.pecar.org
http://f5j.eu

Bogdan Costescu

Apr 11, 2018, 8:59:57 AM
to fhgfs...@googlegroups.com
Hi Jure,

you wrote:

"one of the beegfs-storage processes on storage servers spewing
"SocketException occurred: Receive timed out from: [new nodes ips]"
for some minutes and then locking up completely, making the whole
beegfs mount hang on all cluster nodes. Restarting affected
beegfs-storage process makes the whole thing running again."

Could you please clarify whether it is only the beegfs-storage process that hangs, or the whole storage server? From the first sentence I understand a node lock-up, while the second mentions restarting only the process.

Cheers,
Bogdan

Jure Pečar

Apr 11, 2018, 10:06:33 AM
to fhgfs...@googlegroups.com, Bogdan Costescu
Process only, node is fine.

Maybe I should also mention that beegfs-storage doesn't die on the first kill; it only goes away after a second kill. Does this ring any bells?

Jure Pečar

Apr 17, 2018, 11:01:08 AM
to fhgfs...@googlegroups.com
On Tue, 10 Apr 2018 17:33:39 +0200
Jure Pečar <peg...@nerv.eu.org> wrote:

> We temporarily connected the new nodes (a blade chassis) through a single 10Gb network link,

Of course, the blade chassis has its own switch, and it turned out that switch came with the MTU set to 1500 by default, whereas we run everything (storage servers and nodes) at 9000.

So if you want to recreate this issue to properly debug it, place a switch with MTU 1500 between your BeeGFS client and server, with both ends set to 9000. Then do some I/O, observe the "frame too long" errors on the switch, and figure out why this messes up the internal state of the beegfs-storage process.
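Two quick checks that would have caught this for us (interface and host names are examples):

    # confirm the MTU each interface is actually using
    ip link show eth0 | grep -o 'mtu [0-9]*'

    # end-to-end jumbo frame test: 8972 bytes of payload + 28 bytes of IP/ICMP
    # headers = 9000, with don't-fragment set so a 1500-MTU hop makes it fail
    ping -M do -s 8972 -c 3 storage-server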

Thanks,