beegfs client 6.1 hangs

Yann Sagon

Dec 13, 2016, 5:19:17 AM
to fhgfs...@googlegroups.com
Dear list,

I upgraded to BeeGFS 6.1 a few days ago. I had some trouble doing it, but it now seems to work correctly for the most part. However, I have a user who "crashes" the compute nodes when he runs a particular task. He told me it worked before.

Here is what I see in dmesg:

Nov 29 22:35:19 node059 kernel: INFO: task ABCsampler:95661 blocked for more than 120 seconds.
Nov 29 22:35:19 node059 kernel:      Not tainted 2.6.32-642.el6.x86_64 #1
Nov 29 22:35:19 node059 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 29 22:35:19 node059 kernel: ABCsampler    D 000000000000000f     0 95661  95655 0x00020080
Nov 29 22:35:19 node059 kernel: ffff880fdc68f940 0000000000000086 ffff880fdc68f8a8 ffffffff81151539
Nov 29 22:35:19 node059 kernel: 0000000000000001 ffff881033dcb540 ffff880840010dc0 ffff880fdc68f9d8
Nov 29 22:35:19 node059 kernel: ffffffff8113c261 ffff880700000004 ffff880f664dd068 ffff880fdc68ffd8
Nov 29 22:35:19 node059 kernel: Call Trace:
Nov 29 22:35:19 node059 kernel: [<ffffffff81151539>] ? zone_statistics+0x99/0xc0
Nov 29 22:35:19 node059 kernel: [<ffffffff8113c261>] ? get_page_from_freelist+0x3d1/0x870
Nov 29 22:35:19 node059 kernel: [<ffffffff8154a645>] rwsem_down_failed_common+0x95/0x1d0
Nov 29 22:35:19 node059 kernel: [<ffffffff8154a7d6>] rwsem_down_read_failed+0x26/0x30
Nov 29 22:35:19 node059 kernel: [<ffffffff810b2bef>] ? ktime_get_ts+0xbf/0x100
Nov 29 22:35:19 node059 kernel: [<ffffffff812a6f74>] call_rwsem_down_read_failed+0x14/0x30
Nov 29 22:35:19 node059 kernel: [<ffffffff81549cd4>] ? down_read+0x24/0x30
Nov 29 22:35:19 node059 kernel: [<ffffffffa06cf100>] __FhgfsOps_revalidateIntent+0xf0/0x570 [beegfs]
Nov 29 22:35:19 node059 kernel: [<ffffffff812a237c>] ? put_dec+0x10c/0x110
Nov 29 22:35:19 node059 kernel: [<ffffffff8113e0c9>] ? __alloc_pages_nodemask+0x129/0x950
Nov 29 22:35:19 node059 kernel: [<ffffffff812a4c30>] ? vsnprintf+0x450/0x5e0
Nov 29 22:35:19 node059 kernel: [<ffffffffa06cf6d4>] FhgfsOps_revalidateIntent+0x154/0x1c0 [beegfs]
Nov 29 22:35:19 node059 kernel: [<ffffffff811a9b66>] do_lookup+0x66/0x230
Nov 29 22:35:19 node059 kernel: [<ffffffff811a7333>] ? generic_permission+0x23/0xb0
Nov 29 22:35:19 node059 kernel: [<ffffffff811aa2c0>] __link_path_walk+0x200/0x1060
Nov 29 22:35:19 node059 kernel: [<ffffffff81159b3d>] ? handle_pte_fault+0x2cd/0xb20
Nov 29 22:35:19 node059 kernel: [<ffffffff810f91f5>] ? call_rcu_sched+0x15/0x20
Nov 29 22:35:19 node059 kernel: [<ffffffff810f920e>] ? call_rcu+0xe/0x10
Nov 29 22:35:19 node059 kernel: [<ffffffff811ab3da>] path_walk+0x6a/0xe0
Nov 29 22:35:19 node059 kernel: [<ffffffff811ab5eb>] filename_lookup+0x6b/0xc0
Nov 29 22:35:19 node059 kernel: [<ffffffff8123ac46>] ? security_file_alloc+0x16/0x20
Nov 29 22:35:19 node059 kernel: [<ffffffff811acac4>] do_filp_open+0x104/0xd20
Nov 29 22:35:19 node059 kernel: [<ffffffff812a749a>] ? strncpy_from_user+0x4a/0x90
Nov 29 22:35:19 node059 kernel: [<ffffffff811ba252>] ? alloc_fd+0x92/0x160
Nov 29 22:35:19 node059 kernel: [<ffffffff81196bd7>] do_sys_open+0x67/0x130
Nov 29 22:35:19 node059 kernel: [<ffffffff811eed2a>] compat_sys_open+0x1a/0x20
Nov 29 22:35:19 node059 kernel: [<ffffffff8105b080>] sysenter_dispatch+0x7/0x2e
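
To pull the useful facts out of a report like the one above (which task hung, and which frames belong to the beegfs module), a small parser can help when many nodes are affected. This is only a minimal sketch, assuming the syslog line format shown above; the function name `parse_hung_tasks` is hypothetical and is not part of BeeGFS or the kernel.

```python
import re

# Header of a hung-task report, e.g.
# "... kernel: INFO: task ABCsampler:95661 blocked for more than 120 seconds."
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)
# One call-trace frame. A leading "?" marks a frame the kernel considers
# unreliable; a trailing "[name]" is the module that owns the symbol.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<unreliable>\?\s+)?(?P<sym>[\w.]+)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(?P<mod>\w+)\])?"
)


def parse_hung_tasks(log_text):
    """Return one dict per hung-task report found in log_text (hypothetical helper)."""
    reports = []
    current = None
    for line in log_text.splitlines():
        header = HUNG_RE.search(line)
        if header:
            current = {
                "comm": header.group("comm"),
                "pid": int(header.group("pid")),
                "seconds": int(header.group("secs")),
                "frames": [],
            }
            reports.append(current)
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None:
            current["frames"].append({
                "symbol": frame.group("sym"),
                "module": frame.group("mod"),  # None for core kernel symbols
                "reliable": frame.group("unreliable") is None,
            })
    return reports


# A few lines copied from the trace above.
sample = """\
Nov 29 22:35:19 node059 kernel: INFO: task ABCsampler:95661 blocked for more than 120 seconds.
Nov 29 22:35:19 node059 kernel: [<ffffffff8154a7d6>] rwsem_down_read_failed+0x26/0x30
Nov 29 22:35:19 node059 kernel: [<ffffffff81549cd4>] ? down_read+0x24/0x30
Nov 29 22:35:19 node059 kernel: [<ffffffffa06cf100>] __FhgfsOps_revalidateIntent+0xf0/0x570 [beegfs]
"""

for report in parse_hung_tasks(sample):
    beegfs_frames = [f["symbol"] for f in report["frames"] if f["module"] == "beegfs"]
    print(report["comm"], report["pid"], beegfs_frames)
```

In the trace above, the innermost reliable frames show the task waiting in `rwsem_down_read_failed`, called from `__FhgfsOps_revalidateIntent` in the beegfs module during a path lookup for `open`.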

I see nothing in the client logs (last line is this one):

(3) Nov29 10:57:39 *R(69937) [NodeConn (acquire stream)] >> Connected: beegf...@192.168.102.8:8005 (protocol: RDMA)

Node allocation is done using SLURM. In this case the job stays in a non-killable state and I need to reboot the node. Before rebooting, I cannot access the file that was in use by the application from this node (it does not respond), but I can access it from the other nodes. From this node I can browse the other files without problems.
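
The `D` in the `ABCsampler    D` line of the trace is the uninterruptible-sleep state, which is why the job cannot be killed. A rough way to list such processes on a node before rebooting it is to read the state field from `/proc/<pid>/stat`. The sketch below assumes a Linux `/proc` layout; the helper names `parse_stat` and `uninterruptible_tasks` are hypothetical, and only processes are scanned, not individual threads.

```python
import os


def parse_stat(stat_line):
    """Split a /proc/<pid>/stat line into (pid, comm, state).

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so the split is done on the LAST ')'.
    """
    left, right = stat_line.rsplit(")", 1)
    pid_str, comm = left.split("(", 1)
    state = right.split()[0]
    return int(pid_str), comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """Return (pid, comm) for every process currently in state 'D'."""
    found = []
    if not os.path.isdir(proc_root):
        return found  # not a Linux-style /proc; nothing to scan
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        stat_path = os.path.join(proc_root, entry, "stat")
        try:
            with open(stat_path, errors="replace") as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited during the scan, or unexpected content
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in uninterruptible_tasks():
        print(pid, comm)
```

A process that shows up here only briefly is normal (ordinary disk or network I/O); one that stays in the list for minutes matches the hung-task report above.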

Do you think it's something that is corrected in version 6.2?

Thanks

--
Yann SAGON
Ingénieur système HPC
24 Rue du Général-Dufour
1211 Genève 4 - Suisse
Tél. : +41 (0)22 379 7737
yann....@unige.ch - www.unige.ch

Frank Kautz

Dec 15, 2016, 5:23:18 AM
to fhgfs...@googlegroups.com
Dear Yann,

we will have a look at this issue.

> Do you think it's something that is corrected in version 6.2?
Once we have had a look, we can tell you more.

kind regards,
Frank

Yann Sagon

Dec 20, 2016, 5:02:33 AM
to fhgfs...@googlegroups.com
Hello,

I have upgraded the nodes to version 6.2 and I still have the same issue. Do you have any news on your side?

By the way, is there a way to see which version of the BeeGFS client is currently running? I ask because I have upgraded BeeGFS on all the nodes, but I have not been able to restart it on all of them yet.

Thanks


Yann Sagon

Dec 20, 2016, 5:08:08 AM
to fhgfs...@googlegroups.com
Answering my own question, this is how to see the client version currently running (sorry for the noise):

[root@node060 ~]# beegfs-ctl --listnodes --showversion --nodetype=client | grep -A1 node060
4A1F-58500266-node060.cluster [ID: 378]

