metadata server crash


Yann Sagon

Feb 11, 2017, 4:35:39 AM
to fhgfs...@googlegroups.com
Dear list,

Last night one of our metadata servers crashed. This is the relevant information from the log:

(0) Feb06 13:11:44 Worker13 [File (store updated xattr metadata)] >> Unable to write FileInode metadata update: inodes/6/35/59-589867D7-2. SysErr: No such file or directory
(0) Feb11 02:20:18 Worker13 [File (store updated xattr metadata)] >> Unable to write FileInode metadata update: inodes/6C/4C/243-589E6698-FA89. SysErr: No such file or directory
(2) Feb11 02:20:20 Worker24 [FileInode (store updated Inode)] >> Failed to write inlined inode: parentID: 1-5857F8E6-7482 entryID: 96C-589E6696-1 fileName: matlab.settings Error: Internal error
(0) Feb11 02:20:20 Worker24 [File (store updated xattr metadata)] >> Unable to write FileInode metadata update: inodes/9/A/14C-589E66C4-2. SysErr: No such file or directory
(2) Feb11 02:20:20 Worker24 [FileInode (store updated Inode)] >> Trying to write as non-inlined inode also failed.

The log ends there.

We are using version 6.1; I'll try to upgrade next week.

Any clue?

Best


--
Yann SAGON
Ingénieur système HPC
24 Rue du Général-Dufour
1211 Genève 4 - Suisse
Tél. : +41 (0)22 379 7737
yann....@unige.ch - www.unige.ch

Sven Breuner

Feb 13, 2017, 5:58:50 AM
to fhgfs...@googlegroups.com, Yann Sagon
Hi Yann,

Are there any error messages, e.g. from the underlying ext4, in dmesg on this server?

Does "crashed" mean that the beegfs-meta service got killed or was the whole
machine down? If a beegfs-service gets terminated unexpectedly, it typically
tries to write a corresponding log message - unless it is really killed in a
hard way, e.g. by machine power off or by something like "kill -9".
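A minimal sketch of the kind of check suggested above: filtering kernel log lines for messages that could explain a hard kill or a filesystem problem. The pattern list is an assumption (common kernel message prefixes for ext4 errors, the OOM killer, and segfaults), not a complete or BeeGFS-specific list, and `suspicious_lines` is a hypothetical helper name.

```python
import re

# Assumed patterns: adjust them to what your kernel actually prints.
SUSPICIOUS = re.compile(
    r"EXT4-fs (error|warning)|Out of memory|oom-killer|segfault",
    re.IGNORECASE,
)


def suspicious_lines(lines):
    """Return the log lines that match one of the assumed patterns."""
    return [line.rstrip("\n") for line in lines if SUSPICIOUS.search(line)]


# Example use: save the output of dmesg to a file, then
#     with open("dmesg.txt") as f:
#         for line in suspicious_lines(f):
#             print(line)
```

An empty result only means none of the assumed patterns matched; it does not prove that the kernel log is clean.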

Best regards,
Sven
> --
> You received this message because you are subscribed to the Google Groups
> "beegfs-user" group.
> To unsubscribe from this group and stop receiving emails from it, send an email
> to fhgfs-user+...@googlegroups.com
> <mailto:fhgfs-user+...@googlegroups.com>.
> For more options, visit https://groups.google.com/d/optout.

Yann Sagon

Feb 14, 2017, 10:56:52 AM
to fhgfs...@googlegroups.com
Nothing about the underlying fs, but there are messages about sockets:

Feb  6 13:45:14 server1 beegfs-meta[3748]: 1015:__IBVSocket_createCommContext: Couldn't create QP (Error: -1)
Feb  6 13:45:14 server1 beegfs-meta[3748]: 464:IBVSocket_accept: creation of CommContext failed
Feb  6 13:45:14 server1 beegfs-meta[3748]: 467:IBVSocket_accept: rdma_reject failed
Feb  6 14:06:13 server1 beegfs-meta[3748]: 1015:__IBVSocket_createCommContext: Couldn't create QP (Error: -1)
Feb  6 14:06:13 server1 beegfs-meta[3748]: 464:IBVSocket_accept: creation of CommContext failed
Feb  6 14:06:13 server1 beegfs-meta[3748]: 467:IBVSocket_accept: rdma_reject failed
[..]
Feb  6 18:11:13 server1 rsyslogd: -- MARK --
[..]
Feb  7 13:03:33 server1 beegfs-meta[3748]: IBVSocket_accept:496: rdma_accept failed
Feb  7 13:03:33 server1 beegfs-meta[3748]: 1015:__IBVSocket_createCommContext: Couldn't create QP (Error: -1)
Feb  7 13:03:33 server1 beegfs-meta[3748]: 464:IBVSocket_accept: creation of CommContext failed
Feb  7 13:03:33 server1 beegfs-meta[3748]: 467:IBVSocket_accept: rdma_reject failed
[...]
Feb  7 14:11:18 server1 rsyslogd: -- MARK --
Feb 11 15:37:59 server1 beegfs-meta[208498]: 467:IBVSocket_accept: rdma_reject failed
Feb 11 15:37:59 server1 beegfs-meta[208498]: IBVSocket_accept:598: Ignoring conn manager event (8: RDMA_CM_EVENT_REJECTED)
[...]
Feb 11 19:01:04 server1 beegfs-meta[208498]: 464:IBVSocket_accept: creation of CommContext failed
Feb 11 19:01:04 server1 beegfs-meta[208498]: 467:IBVSocket_accept: rdma_reject failed

Let me know if you want the full dmesg log.


Today I again see the same kind of error:

Feb 12 03:51:44 server1 rsyslogd: -- MARK --
[...] almost only MARK.
Feb 14 06:11:57 server1 rsyslogd: -- MARK --
Feb 14 06:17:10 server1 beegfs-meta[208498]: 1015:__IBVSocket_createCommContext: Couldn't create QP (Error: -1)
Feb 14 06:17:10 server1 beegfs-meta[208498]: 464:IBVSocket_accept: creation of CommContext failed
Feb 14 06:17:10 server1 beegfs-meta[208498]: 467:IBVSocket_accept: rdma_reject failed
Feb 14 06:18:09 server1 beegfs-meta[208498]: 1015:__IBVSocket_createCommContext: Couldn't create QP (Error: -1)
Feb 14 06:18:09 server1 beegfs-meta[208498]: 464:IBVSocket_accept: creation of CommContext failed
Feb 14 06:18:09 server1 beegfs-meta[208498]: 467:IBVSocket_accept: rdma_reject failed
Feb 14 06:18:39 server1 beegfs-meta[208498]: 1015:__IBVSocket_createCommContext: Couldn't create QP (Error: -1)
Feb 14 06:18:39 server1 beegfs-meta[208498]: 464:IBVSocket_accept: creation of CommContext failed
Feb 14 06:18:39 server1 beegfs-meta[208498]: 467:IBVSocket_accept: rdma_reject failed
Feb 14 06:31:57 server1 rsyslogd: -- MARK --

I have the same kind of errors on other servers too. We are using 4 servers.
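To compare how often this happens on each server, the "Couldn't create QP" lines can be counted per day and per beegfs-meta PID. This is only a sketch that assumes the syslog line layout shown in the excerpts above; `qp_failures_per_day` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches lines such as:
# Feb 14 06:17:10 server1 beegfs-meta[208498]: 1015:...: Couldn't create QP (Error: -1)
LINE = re.compile(
    r"^(\w{3})\s+(\d{1,2}) \d{2}:\d{2}:\d{2} \S+ "
    r"beegfs-meta\[(\d+)\]: .*Couldn't create QP"
)


def qp_failures_per_day(lines):
    """Count QP creation failures keyed by (month, day, pid)."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line)
        if match:
            month, day, pid = match.groups()
            counts[(month, int(day), int(pid))] += 1
    return counts
```

Keeping the PID in the key also shows when the service was restarted, since the PID changes between the Feb 7 and Feb 11 entries above.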

In fact I don't know whether this is normal, as the fs was still available even after these errors.
It was only down when meta crashed on server1. What I mean by crash is:
the machine was not down.
I ran /etc/init.d/beegfs-meta status and got this: service dead, but /var/run/ pid file exists
I only restarted beegfs-meta with /etc/init.d/beegfs-meta, and the storage was working again.
I haven't seen any segfault or similar in dmesg.
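For reference, a "dead but pid file exists" status usually means the PID recorded in the file no longer belongs to a running process. A minimal sketch of that check follows. The PID file path is passed in as a parameter because the real path is not given here, and `pid_file_state` is a hypothetical helper, not part of BeeGFS or of its init script.

```python
import os


def pid_file_state(pid_file):
    """Classify a PID file as 'missing', 'running', or 'stale'."""
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
    except FileNotFoundError:
        return "missing"
    except ValueError:
        return "stale"  # file exists but does not contain a PID
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, nothing is sent
    except ProcessLookupError:
        return "stale"  # no such process: the service died without cleanup
    except PermissionError:
        return "running"  # process exists but belongs to another user
    return "running"
```

A "stale" result matches what was observed: the process is gone but did not remove its PID file, which fits a hard kill rather than a clean shutdown.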

HTH






--
Yann SAGON
Ingénieur système HPC
24 Rue du Général-Dufour
1211 Genève 4 - Suisse
Tél. : +41 (0)22 379 7737