Bernd,
Thank you for your response. Below is new information from today as
new issues have arose (possible related). Inline responses also to
answer your questions.
Today I've had users reporting that our production FhGFS filesystem is
non-responsive. In actuality it's just so slow users think it's
non-responsive. Our monitoring has failed to pick this up as all the
fhgfs-ctl tools report things are okay. What I've found is this:
* fhgfs-admon-gui shows aggregate storage activity of average read no
higher than 10MB/s and average write no higher than 20MB/s
(essentially idle compared to normal activity)
* There was a period when our metadata server was handling ~40,000
requests per second due to a user's job stuck in an infinite loop
performing "stat" type operations.
* Once the looping jobs were killed metadata requests dropped below
100 per second
* From metadata/mgmtd 'fhgfs-ctl --listpools --nodetype=storage' takes
0.011 seconds to complete
* From metadata/mgmtd 'fhgfs-ctl --listpools --nodetype=storage
--spaceinfo' takes 52 seconds to complete.
* (sa1-1 is ID 1) From sa1-1 'fhgfs-ctl --iostat --nodetype=storage 1'
takes 19 seconds to complete and only 2 of the 10 lines have any
write/read/reqs greater than 0. All lines have a qlen from 175-200
and bsy 10-20.
* sa1-1 'iostat -dmx 1' shows all data drives hitting 100 %util
* The other 5 storage servers are almost completely idle. It appears
sa1-1 is preventing anything from working (our default striping is
across all targets).
As far as I can tell from running 'fhgfs-ctl --clientstats
--nodetype=storage 1', sa1-1 has almost no activity taking place.
It's as if those hung processes are chewing up the drives with no
clients requesting data.
In hopes to get the filesystem usable again I'm power cycling sa1-1,
but any insight or debugging steps are always appreciated.
sa1-1 from 16:49 to 17:37 has these entries in /var/log/messages:
Jun 17 16:49:40 sa1-1 kernel: INFO: task fhgfs-storage/W:19460 blocked
for more than 120 seconds.
Jun 17 16:49:40 sa1-1 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 17 16:49:40 sa1-1 kernel: fhgfs-storage D 000000000000000b 0
19460 1 0x00000080
Jun 17 16:49:40 sa1-1 kernel: ffff880d7acfde08 0000000000000082
ffff88100f4bf540 ffff88100f4bf540
Jun 17 16:49:40 sa1-1 kernel: ffff88100f4bf540 ffff880d7acfc000
ffff88100f4bf540 0000001000008000
Jun 17 16:49:40 sa1-1 kernel: ffff88100f4bfaf8 ffff880d7acfdfd8
000000000000fb88 ffff88100f4bfaf8
Jun 17 16:49:40 sa1-1 kernel: Call Trace:
Jun 17 16:49:40 sa1-1 kernel: [<ffffffff8150f78e>]
__mutex_lock_slowpath+0x13e/0x180
Jun 17 16:49:40 sa1-1 kernel: [<ffffffff8150f62b>] mutex_lock+0x2b/0x50
Jun 17 16:49:40 sa1-1 kernel: [<ffffffff81192189>] do_unlinkat+0xa9/0x1b0
Jun 17 16:49:40 sa1-1 kernel: [<ffffffff811813e2>] ? vfs_write+0x132/0x1a0
Jun 17 16:49:40 sa1-1 kernel: [<ffffffff811922a6>] sys_unlink+0x16/0x20
Jun 17 16:49:40 sa1-1 kernel: [<ffffffff8100b072>]
system_call_fastpath+0x16/0x1b
The metadata/mgmtd server had nothing in syslog (/var/log/messages)
related to fhgfs. Only entries were Puppet and kernel lines such as
Jun 15 05:10:06 fs03 kernel: ib0: multicast join failed for
ff12:401b:ffff:0000:0000:0000:0000:0001, status -22
Those lines are an issue with our IB switch and IIRC someone from
FhGFS clarified them as being a non-issue.
> Did you try to ping 192.168.211.248 from your meta or management server?
> And/or did you try to run 'ibping' between these nodes?
>
At the time when I noticed the issue I did try ping and it worked. I
did not try ibping at that time. Doing ibping now works.
>
> If there are no further information and you would _like_ to further
> investigate, you could re-compile fhgfs-opentk-lib with FHGFS_DEBUG=1 and
> then re-try what you did before.
> With FHGFS_DEBUG=1 opentk-lib would log ways more information (to daemon
> stdout and sylog).
>
Is fhgfs-opentk-lib only associated to the clients? I have the
fhgfs-client installed to all my storage and metadata services with
the fhgfs-client service disabled, to allow for fhgfs-ctl commands to
be used. Would I need to re-compile on storage or metadata if the
client service is not running?
Thanks
- Trey
>
>
> Thanks,
> Bernd
>
> --
> You received this message because you are subscribed to a topic in the
> Google Groups "fhgfs-user" group.
> To unsubscribe from this topic, visit
>
https://groups.google.com/d/topic/fhgfs-user/08KSudV6p-k/unsubscribe.
> To unsubscribe from this group and all its topics, send an email to
>
fhgfs-user+...@googlegroups.com.
> For more options, visit
https://groups.google.com/d/optout.