Client multi_mode to separate metadata / mgmtd services conflict

288 views

Skip to first unread message

Trey Dockendorf

unread,

Jun 17, 2014, 4:42:10 PM6/17/14

to fhgfs...@googlegroups.com

I am in the process of upgrading our cluster and testing configurations on a sub-set of nodes. The "production" cluster is CentOS 5.7 with FhGFS 2012.10.r13. The "development" cluster is CentOS 6.5 with FhGFS 2012.10.r13. The clients in "development" are configured with multi_mode enabled and point one instance to production and one instance to a single metadata/mgmtd server that manages a single storage server. Today I had an issue with my production metadata server's pool going into "emergency" state which took place shortly after I restarted the fhgfs-meta and fhgfs-mgmtd daemons on the development server. My hunch is that somehow these two systems (both on same IB fabric using ib0 / RDMA as primary interface via connInterfacesFile) conflicted and that caused the production to have issues. The production metadata server itself was fine, no issues with storage or any errors visible in dmesg or "ibcheckerrs". The only log entries I found on the production server are below.

Is this a case of a configuration mistake? If I want two separate FhGFS instances (separate metadata/storage/mgmtd) to be on the same client, do the production and development instances need separate ports for all the services, or just the client? I realize now I can achieve this on the client using multiple lines in fhgfs-mounts.conf, but am concerned about this issue today and what the cause could have been.

Notification of fs03 (production metadata) going into emergency state was @ 15:09 local time, our notifications are delayed by about 5 minutes (intentional) so ~15:04 the emergency state occurred.

From fhgfs-meta.log.old-1

(0) Jun17 14:39:46 CommSlave505 [Messaging (RPC)] >> Communication error: Receive timed out from: 192.168.211.248:8003; Peer: fhgfs-storage sa1-1 [ID: 1]. (Message type: UnlinkLocalFile)

(3) Jun17 14:39:46 CommSlave129 [NodeConn (acquire stream)] >> Establishing new RDMA connection to: 192.168.211.248:8003

(3) Jun17 14:39:46 CommSlave505 [NodeConn (invalidate stream)] >> Disconnected: 192.168.211.248:8003

(2) Jun17 14:39:46 CommSlave505 [Messaging (RPC)] >> Retrying communication with node: fhgfs-storage sa1-1 [ID: 1] (Message type: UnlinkLocalFile)

(0) Jun17 14:40:00 CommSlave127 [Messaging (RPC)] >> Communication error: Receive timed out from: 192.168.211.248:8003; Peer: fhgfs-storage sa1-1 [ID: 1]. (Message type: SetLocalAttr)

(3) Jun17 14:40:00 CommSlave318 [NodeConn (acquire stream)] >> Establishing new RDMA connection to: 192.168.211.248:8003

(3) Jun17 14:40:00 CommSlave127 [NodeConn (invalidate stream)] >> Disconnected: 192.168.211.248:8003

(2) Jun17 14:40:00 CommSlave127 [Messaging (RPC)] >> Retrying communication with node: fhgfs-storage sa1-1 [ID: 1] (Message type: SetLocalAttr)

(0) Jun17 14:40:39 CommSlave5 [Messaging (RPC)] >> Communication error: Receive timed out from: 192.168.211.248:8003; Peer: fhgfs-storage sa1-1 [ID: 1]. (Message type: UnlinkLocalFile)

(3) Jun17 14:40:39 CommSlave445 [NodeConn (acquire stream)] >> Establishing new RDMA connection to: 192.168.211.248:8003

(3) Jun17 14:40:39 CommSlave5 [NodeConn (invalidate stream)] >> Disconnected: 192.168.211.248:8003

(2) Jun17 14:40:39 CommSlave5 [Messaging (RPC)] >> Retrying communication with node: fhgfs-storage sa1-1 [ID: 1] (Message type: UnlinkLocalFile)

(1) Jun17 15:16:18 Main [App::signalHandler] >> Received a SIGTERM. Shutting down...

(2) Jun17 15:16:20 Main [App (wait for component termination)] >> Still waiting for this component to stop: Worker1

From fhgfs-mgmtd.log

(0) Jun17 14:53:49 XNodeSync [Messaging (RPC)] >> Communication error: Receive timed out from: 192.168.208.200:8005; Peer: fhgfs-meta fs03 [ID: 1]. (Message type: StatStoragePath)

(2) Jun17 14:53:50 XNodeSync [Messaging (RPC)] >> Retrying communication with node: fhgfs-meta fs03 [ID: 1] (Message type: StatStoragePath)

(2) Jun17 15:03:26 Worker4 [RemoveNodeMsg incoming] >> Node removed: fhgfs-client FC6D-539F0404-c0134.brazos.tamu.edu

(2) Jun17 15:03:28 Worker3 [Node registration] >> New node: fhgfs-client 1988-53A09F10-c0134.brazos.tamu.edu; RDMA; Ver: 2012.10-13; Source: 192.168.200.43:42665

(2) Jun17 15:03:28 Worker3 [Node registration] >> Number of nodes: Meta: 1; Storage: 6; Client: 316; Mgmt: 1

(0) Jun17 15:03:50 XNodeSync [Messaging (RPC)] >> Communication error: Receive timed out from: 192.168.208.200:8005; Peer: fhgfs-meta fs03 [ID: 1]. (Message type: StatStoragePath)

(2) Jun17 15:03:50 XNodeSync [Update capacities (direct)] >> Target moved to emergency pool for unreachable node. Node: fhgfs-meta fs03 [ID: 1]

(2) Jun17 15:04:45 Worker1 [Node registration] >> New node: fhgfs-client 19ED-53A09F5D-c0134.brazos.tamu.edu; RDMA; Ver: 2012.10-13; Source: 192.168.200.43:42673

(2) Jun17 15:04:45 Worker1 [Node registration] >> Number of nodes: Meta: 1; Storage: 6; Client: 317; Mgmt: 1

(0) Jun17 15:18:39 XNodeSync [Messaging (RPC)] >> Communication error: Recv(): Hard disconnect from 192.168.208.200:8005. SysErr: Connection reset by peer; Peer: fhgfs-meta fs03 [ID: 1]. (Message type: StatStoragePath)

(2) Jun17 15:18:39 XNodeSync [Messaging (RPC)] >> Retrying communication with node: fhgfs-meta fs03 [ID: 1] (Message type: StatStoragePath)

(1) Jun17 15:18:39 XNodeSync [NodeConn (acquire stream)] >> Connect failed on all available routes: 192.168.208.200:8005

(2) Jun17 15:18:39 XNodeSync [Messaging (RPC)] >> Unable to connect to: fhgfs-meta fs03 [ID: 1]. (Message type: StatStoragePath)

(2) Jun17 15:18:54 XNodeSync [Update capacities (direct)] >> Target capacity pool assignment updated. Node: fhgfs-meta fs03 [ID: 1]; Pool: Normal

Thanks

- Trey

Bernd Schubert

unread,

Jun 18, 2014, 5:22:11 AM6/18/14

to fhgfs...@googlegroups.com

On 06/17/2014 10:42 PM, Trey Dockendorf wrote:
> I am in the process of upgrading our cluster and testing configurations on
> a sub-set of nodes. The "production" cluster is CentOS 5.7 with
> FhGFS 2012.10.r13. The "development" cluster is CentOS 6.5 with
> FhGFS 2012.10.r13. The clients in "development" are configured with
> multi_mode enabled and point one instance to production and one instance to
> a single metadata/mgmtd server that manages a single storage server. Today
> I had an issue with my production metadata server's pool going into
> "emergency" state which took place shortly after I restarted the fhgfs-meta
> and fhgfs-mgmtd daemons on the development server. My hunch is that
> somehow these two systems (both on same IB fabric using ib0 / RDMA as
> primary interface via connInterfacesFile) conflicted and that caused the
> production to have issues. The production metadata server itself was fine,
> no issues with storage or any errors visible in dmesg or "ibcheckerrs".
> The only log entries I found on the production server are below.
>
> Is this a case of a configuration mistake? If I want two separate FhGFS
> instances (separate metadata/storage/mgmtd) to be on the same client, do
> the production and development instances need separate ports for all the
> services, or just the client? I realize now I can achieve this on the
> client using multiple lines in fhgfs-mounts.conf, but am concerned about
> this issue today and what the cause could have been.

If daemons are running on different servers there shouldn't be any need
for that. A coupling of events of different servers sounds very unlikely.

>
> Notification of fs03 (production metadata) going into emergency state was @
> 15:09 local time, our notifications are delayed by about 5 minutes
> (intentional) so ~15:04 the emergency state occurred.
>
> From fhgfs-meta.log.old-1
>
> (0) Jun17 14:39:46 CommSlave505 [Messaging (RPC)] >> Communication error:
> Receive timed out from: 192.168.211.248:8003; Peer: fhgfs-storage sa1-1
> [ID: 1]. (Message type: UnlinkLocalFile)

Are there any logs on fhgfs-storage sa1-1? Are there any daemon syslogs?
The communication library fhgfs-opentk-lib logs information via syslog
as daemon instance, so depending on you syslog configuration logs might
be saved in /var/log/{daemon.log,messages,syslog}.

Did you try to ping 192.168.211.248 from your meta or management server?
And/or did you try to run 'ibping' between these nodes?

If there are no further information and you would _like_ to further
investigate, you could re-compile fhgfs-opentk-lib with FHGFS_DEBUG=1
and then re-try what you did before.
With FHGFS_DEBUG=1 opentk-lib would log ways more information (to daemon
stdout and sylog).

Thanks,
Bernd

Trey Dockendorf

unread,

Jun 18, 2014, 2:32:34 PM6/18/14

to fhgfs...@googlegroups.com

Bernd,

Thank you for your response. Below is new information from today as
new issues have arose (possible related). Inline responses also to
answer your questions.

Today I've had users reporting that our production FhGFS filesystem is
non-responsive. In actuality it's just so slow users think it's
non-responsive. Our monitoring has failed to pick this up as all the
fhgfs-ctl tools report things are okay. What I've found is this:

* fhgfs-admon-gui shows aggregate storage activity of average read no
higher than 10MB/s and average write no higher than 20MB/s
(essentially idle compared to normal activity)
* There was a period when our metadata server was handling ~40,000
requests per second due to a user's job stuck in an infinite loop
performing "stat" type operations.
* Once the looping jobs were killed metadata requests dropped below
100 per second
* From metadata/mgmtd 'fhgfs-ctl --listpools --nodetype=storage' takes
0.011 seconds to complete
* From metadata/mgmtd 'fhgfs-ctl --listpools --nodetype=storage
--spaceinfo' takes 52 seconds to complete.
* (sa1-1 is ID 1) From sa1-1 'fhgfs-ctl --iostat --nodetype=storage 1'
takes 19 seconds to complete and only 2 of the 10 lines have any
write/read/reqs greater than 0. All lines have a qlen from 175-200
and bsy 10-20.
* sa1-1 'iostat -dmx 1' shows all data drives hitting 100 %util
* The other 5 storage servers are almost completely idle. It appears
sa1-1 is preventing anything from working (our default striping is
across all targets).

As far as I can tell from running 'fhgfs-ctl --clientstats
--nodetype=storage 1', sa1-1 has almost no activity taking place.
It's as if those hung processes are chewing up the drives with no
clients requesting data.

In hopes to get the filesystem usable again I'm power cycling sa1-1,
but any insight or debugging steps are always appreciated.

sa1-1 from 16:49 to 17:37 has these entries in /var/log/messages:

Jun 17 16:49:40 sa1-1 kernel: INFO: task fhgfs-storage/W:19460 blocked
for more than 120 seconds.
Jun 17 16:49:40 sa1-1 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 17 16:49:40 sa1-1 kernel: fhgfs-storage D 000000000000000b 0
19460 1 0x00000080
Jun 17 16:49:40 sa1-1 kernel: ffff880d7acfde08 0000000000000082
ffff88100f4bf540 ffff88100f4bf540
Jun 17 16:49:40 sa1-1 kernel: ffff88100f4bf540 ffff880d7acfc000
ffff88100f4bf540 0000001000008000
Jun 17 16:49:40 sa1-1 kernel: ffff88100f4bfaf8 ffff880d7acfdfd8
000000000000fb88 ffff88100f4bfaf8
Jun 17 16:49:40 sa1-1 kernel: Call Trace:
Jun 17 16:49:40 sa1-1 kernel: [<ffffffff8150f78e>]
__mutex_lock_slowpath+0x13e/0x180
Jun 17 16:49:40 sa1-1 kernel: [<ffffffff8150f62b>] mutex_lock+0x2b/0x50
Jun 17 16:49:40 sa1-1 kernel: [<ffffffff81192189>] do_unlinkat+0xa9/0x1b0
Jun 17 16:49:40 sa1-1 kernel: [<ffffffff811813e2>] ? vfs_write+0x132/0x1a0
Jun 17 16:49:40 sa1-1 kernel: [<ffffffff811922a6>] sys_unlink+0x16/0x20
Jun 17 16:49:40 sa1-1 kernel: [<ffffffff8100b072>]
system_call_fastpath+0x16/0x1b

The metadata/mgmtd server had nothing in syslog (/var/log/messages)
related to fhgfs. Only entries were Puppet and kernel lines such as

Jun 15 05:10:06 fs03 kernel: ib0: multicast join failed for
ff12:401b:ffff:0000:0000:0000:0000:0001, status -22

Those lines are an issue with our IB switch and IIRC someone from
FhGFS clarified them as being a non-issue.

> Did you try to ping 192.168.211.248 from your meta or management server?
> And/or did you try to run 'ibping' between these nodes?
>

At the time when I noticed the issue I did try ping and it worked. I
did not try ibping at that time. Doing ibping now works.

>
> If there are no further information and you would _like_ to further
> investigate, you could re-compile fhgfs-opentk-lib with FHGFS_DEBUG=1 and
> then re-try what you did before.
> With FHGFS_DEBUG=1 opentk-lib would log ways more information (to daemon
> stdout and sylog).
>

Is fhgfs-opentk-lib only associated to the clients? I have the
fhgfs-client installed to all my storage and metadata services with
the fhgfs-client service disabled, to allow for fhgfs-ctl commands to
be used. Would I need to re-compile on storage or metadata if the
client service is not running?

Thanks
- Trey

>
>
> Thanks,
> Bernd
>
> --
> You received this message because you are subscribed to a topic in the
> Google Groups "fhgfs-user" group.
> To unsubscribe from this topic, visit
> https://groups.google.com/d/topic/fhgfs-user/08KSudV6p-k/unsubscribe.
> To unsubscribe from this group and all its topics, send an email to
> fhgfs-user+...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Reply all

Reply to author

Forward

0 new messages