Hello,
we have a four-node FhGFS storage setup running version 2011.04-r21. All four
nodes run both the meta and storage services and have IB cards. Most of
our clients connect via IB, but a few use TCP.
Today the mount hung on one of the TCP clients. I looked through all the logs
and couldn't find any change at the system level. There is also no network
connectivity problem between the client and the servers.
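For anyone who wants to repeat the connectivity check from the client, here is a minimal sketch in Python. The function names `tcp_port_open` and `check_servers` are made up for this example. The sketch only tests whether a TCP connection can be opened; it says nothing about whether the FhGFS service behind the port actually answers requests.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_servers(hosts, ports, timeout=5.0):
    """Map each (host, port) pair to whether a TCP connect succeeded."""
    return {(h, p): tcp_port_open(h, p, timeout) for h in hosts for p in ports}
```

It could be called with the addresses and ports that appear in the client log, for example `check_servers(["10.8.16.151", "10.8.16.152", "10.8.16.153", "10.8.16.154"], [8003])`.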
The repeating error message in fhgfs-client.log is:
Failed to receive response from: hpc-storage3.srce (10.8.16.153:8003). Expected response type: 2032
Here is the full client log:
(1) Mar05 19:55:09 Main [App] >> FhGFS Helper Daemon Version: 2011.04-r21
(1) Mar05 19:55:09 Main [App] >> Client log messages will be prefixed with an asterisk (*) symbol.
(3) Mar05 19:55:14 *mount(570) [DatagramListener (init sock)] >> Listening for UDP datagrams: Port 8004
(1) Mar05 19:55:14 *mount(570) [App_logInfos] >> FhGFS Client Version: 2011.04-r21
(2) Mar05 19:55:14 *mount(570) [App_logInfos] >> ClientID: tannat.srce-51363F92-23A
(2) Mar05 19:55:14 *mount(570) [App_logInfos] >> Usable NICs: eth0(TCP) eth1(TCP) eth2(TCP)
(3) Mar05 19:55:14 *fhgfs_HBeatMgr(572) [NodeConn (acquire stream)] >> Establishing new TCP connection to: 127.0.0.1:8006
(3) Mar05 19:55:14 *fhgfs_HBeatMgr(572) [NodeConn (acquire stream)] >> Connected: 127.0.0.1:8006
(2) Mar05 19:55:14 *fhgfs_DGramLis(571) [Heartbeat incoming] >> New node [ID: hpc-storage1.srce; Type: Management; Source: 10.8.16.151]
(3) Mar05 19:55:14 *fhgfs_HBeatMgr(572) [Init] >> Management node found. Downloading node groups...
(3) Mar05 19:55:14 *fhgfs_HBeatMgr(572) [NodeConn (acquire stream)] >> Establishing new TCP connection to: 192.168.1.151:8008
(1) Mar05 19:55:16 *df(580) [Remoting (stat storage targets)] >> No storage targets known.
(3) Mar05 19:55:19 *fhgfs_HBeatMgr(572) [NodeConn (acquire stream)] >> Connect failed: 192.168.1.151:8008
(3) Mar05 19:55:19 *fhgfs_HBeatMgr(572) [NodeConn (acquire stream)] >> Establishing new TCP connection to: 10.8.16.151:8008
(3) Mar05 19:55:19 *fhgfs_HBeatMgr(572) [NodeConn (acquire stream)] >> Connected: 10.8.16.151:8008
(2) Mar05 19:55:19 *fhgfs_HBeatMgr(572) [Sync] >> Nodes added (sync results): 4 (Type: Meta)
(2) Mar05 19:55:19 *fhgfs_HBeatMgr(572) [Sync] >> Nodes added (sync results): 4 (Type: Storage)
(3) Mar05 19:55:19 *fhgfs_HBeatMgr(572) [Init] >> Node registration...
(2) Mar05 19:55:19 *fhgfs_HBeatMgr(572) [Registration] >> Node registration successful.
(3) Mar05 19:55:19 *fhgfs_HBeatMgr(572) [Init] >> Init complete.
(3) Mar05 19:55:28 *fhgfs_Worker/3(575) [NodeConn (acquire stream)] >> Establishing new TCP connection to: 192.168.1.151:8003
(3) Mar05 19:55:28 *fhgfs_Worker/2(574) [NodeConn (acquire stream)] >> Establishing new TCP connection to: 192.168.1.152:8003
(3) Mar05 19:55:28 *fhgfs_Worker/1(573) [NodeConn (acquire stream)] >> Establishing new TCP connection to: 192.168.1.153:8003
(3) Mar05 19:55:33 *fhgfs_Worker/3(575) [NodeConn (acquire stream)] >> Connect failed: 192.168.1.151:8003
(3) Mar05 19:55:33 *fhgfs_Worker/2(574) [NodeConn (acquire stream)] >> Connect failed: 192.168.1.152:8003
(3) Mar05 19:55:33 *fhgfs_Worker/2(574) [NodeConn (acquire stream)] >> Establishing new TCP connection to: 10.8.16.152:8003
(3) Mar05 19:55:33 *fhgfs_Worker/1(573) [NodeConn (acquire stream)] >> Connect failed: 192.168.1.153:8003
(3) Mar05 19:55:33 *fhgfs_Worker/1(573) [NodeConn (acquire stream)] >> Establishing new TCP connection to: 10.8.16.153:8003
(3) Mar05 19:55:33 *fhgfs_Worker/3(575) [NodeConn (acquire stream)] >> Establishing new TCP connection to: 10.8.16.151:8003
(3) Mar05 19:55:33 *fhgfs_Worker/2(574) [NodeConn (acquire stream)] >> Connected: 10.8.16.152:8003
(3) Mar05 19:55:33 *fhgfs_Worker/1(573) [NodeConn (acquire stream)] >> Connected: 10.8.16.153:8003
(3) Mar05 19:55:33 *fhgfs_Worker/3(575) [NodeConn (acquire stream)] >> Connected: 10.8.16.151:8003
(3) Mar05 19:55:33 *fhgfs_Worker/2(574) [NodeConn (acquire stream)] >> Establishing new TCP connection to: 192.168.1.154:8003
(3) Mar05 19:55:38 *fhgfs_Worker/2(574) [NodeConn (acquire stream)] >> Connect failed: 192.168.1.154:8003
(3) Mar05 19:55:38 *fhgfs_Worker/2(574) [NodeConn (acquire stream)] >> Establishing new TCP connection to: 10.8.16.154:8003
(3) Mar05 19:55:38 *fhgfs_Worker/2(574) [NodeConn (acquire stream)] >> Connected: 10.8.16.154:8003
(2) Mar05 19:56:28 *fhgfs_Worker/2(32708) [Messaging (RPC)] >> Failed to receive response from: hpc-storage3.srce (10.8.16.153:8003). Expected response type: 2032
(3) Mar05 19:56:28 *fhgfs_Worker/2(32708) [NodeConn (invalidate stream)] >> Disconnected: 10.8.16.153:8003
(0) Mar05 19:58:05 *fhgfs_RtrWrk/1(32710) [MessagingTk (recv msg)] >> [10.8.16.153:8003] SocketException: ErrCode: 0
(2) Mar05 19:58:05 *fhgfs_RtrWrk/1(32710) [Messaging (RPC)] >> Failed to receive response from: hpc-storage3.srce (10.8.16.153:8003). Expected response type: 2032
(3) Mar05 19:58:05 *fhgfs_RtrWrk/1(32710) [NodeConn (invalidate stream)] >> Disconnected: 10.8.16.153:8003
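To make the pattern in the client log easier to see, here is a rough Python sketch that counts failed connects and missing RPC responses per address. The helper names (`join_entries`, `summarize`) and the regular expressions are my own assumptions, based only on the line layout visible in this log and not on any documented FhGFS log format. Wrapped continuation lines are merged back into their entry first.

```python
import re
from collections import Counter

# An entry starts with "(level) MonDD HH:MM:SS ", as seen in the log above.
ENTRY_START = re.compile(r"^\(\d\) \w{3}\d{2} \d{2}:\d{2}:\d{2} ")
# IPv4 address followed by a port, e.g. 10.8.16.153:8003.
ADDRESS = re.compile(r"\d+\.\d+\.\d+\.\d+:\d+")


def join_entries(lines):
    """Merge wrapped continuation lines into one string per log entry."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if ENTRY_START.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line
    return entries


def summarize(lines):
    """Count connect failures and missing RPC responses per address."""
    connect_failed = Counter()
    no_response = Counter()
    for entry in join_entries(lines):
        found = ADDRESS.search(entry)
        if not found:
            continue
        if "Connect failed:" in entry:
            connect_failed[found.group()] += 1
        elif "Failed to receive response from:" in entry:
            no_response[found.group()] += 1
    return connect_failed, no_response
```

Reading the log by eye, the failed connects are all to 192.168.1.x addresses, each followed by a successful connect to the matching 10.8.16.x address, and the only missing responses are from 10.8.16.153:8003.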
On the server 10.8.16.153, fhgfs-storage.log shows the following (the client IP is 10.8.16.10):
(0) Mar05 19:49:10 Worker8 [WriteLocalFileMsg incoming] >> SocketException occured: SocketDisconnectException: Soft Disconnect from 10.8.16.10:35744
(2) Mar05 19:49:10 Worker8 [WriteLocalFileMsg incoming] >> Details: sessionID: tannat.srce-50A34E82-7FC0; FD: 112; offset: 0; count: 524288;
(3) Mar05 19:49:10 Worker8 [Work (process incoming data)] >> Problem encountered during processing of a message. Disconnecting: 10.8.16.10:35744
...
(3) Mar05 19:53:11 StreamLis [StreamLis] >> Accepted new connection from 10.8.16.10:42984 [SockFD: 326]
...
(3) Mar05 19:55:33 StreamLis [StreamLis] >> Accepted new connection from 10.8.16.10:43225 [SockFD: 356]
...
(3) Mar05 19:58:10 StreamLis [StreamLis] >> Accepted new connection from 10.8.16.10:43246 [SockFD: 477]
...
(3) Mar05 20:04:10 Worker5 [Work (process incoming data)] >> SocketDisconnectException: Soft Disconnect from 10.8.16.10:58164
(3) Mar05 20:04:10 Worker5 [Work (process incoming data)] >> SocketDisconnectException: Soft Disconnect from 10.8.16.10:58329
(0) Mar05 20:04:10 Worker5 [WriteLocalFileMsg incoming] >> SocketException occured: SocketDisconnectException: Soft Disconnect from 10.8.16.10:35802
(2) Mar05 20:04:10 Worker5 [WriteLocalFileMsg incoming] >> Details: sessionID: tannat.srce-50A34E82-7FC0; FD: 112; offset: 0; count: 524288;
(3) Mar05 20:04:10 Worker5 [Work (process incoming data)] >> Problem encountered during processing of a message. Disconnecting: 10.8.16.10:35802
("..." marks omitted messages related to other servers.)
The only thing I haven't tried yet is rebooting the client. Any suggestions?
Thanks in advance,
Emir Imamagic