I am mounting lustre through an fstab entry. This fails quite often, the
nodes end up without the lustre mount. Even when I log in, it takes 2-3
tries to get it to mount. This is what I get:
mount /lustre
mount.lustre: mount 10.1.1.1@tcp0:/lustre at /lustre failed: Cannot send after transport endpoint shutdown
This is /var/log/messages:
Nov 15 16:27:43 compute-1-10 kernel: LustreError: 2124:0:(lib-move.c:2441:LNetPut()) Error sending PUT to 12345-10.1.1.1@tcp: -113
Nov 15 16:27:43 compute-1-10 kernel: LustreError: 2124:0:(events.c:66:request_out_callback()) @@@ type 4, status -113 req@d73d7c00 x1352468062535684/t0 o250->M...@MGC10.1.1.1@tcp_0:26/25 lens 368/584 e 0 to 1 dl 1289834868 ref 2 fl Rpc:N/0/0 rc 0/0
Nov 15 16:27:43 compute-1-10 kernel: LustreError: 29069:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@d73d7800 x1352468062535685/t0 o101->M...@MGC10.1.1.1@tcp_0:26/25 lens 296/544 e 0 to 1 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
Nov 15 16:27:43 compute-1-10 kernel: LustreError: 15c-8: MGC10.1.1.1@tcp: The configuration from log 'lustre-client' failed (-108). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Nov 15 16:27:43 compute-1-10 kernel: LustreError: 29069:0:(llite_lib.c:1176:ll_fill_super()) Unable to process log: -108
Nov 15 16:27:43 compute-1-10 kernel: LustreError: 29069:0:(obd_mount.c:2045:lustre_fill_super()) Unable to mount (-108)
I have no errors on the interface, so I assume this is a timing problem.
Can I improve this through some timeout setting?
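(For what it's worth, the mount.lustre man page documents a retry=N option,
and the standard _netdev flag defers the mount until the network is up.
Something like this is what I'd try first - untested sketch, adjust to your
own setup:

```
# client /etc/fstab - _netdev waits for networking before mounting,
# retry=5 makes mount.lustre re-attempt a failed mount up to 5 times
10.1.1.1@tcp0:/lustre  /lustre  lustre  defaults,_netdev,retry=5  0 0
```
)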
Cheers,
Arne
_______________________________________________
Lustre-discuss mailing list
Lustre-...@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
> From the log, we can see that either your MGS node was not ready for
> connection yet, or there's network error between client and the MGS node.
No errors on the server nor on the client. What else can it be? Maybe the
switch is bad - I can see RX errors on most of its interfaces.
> Were you rebooting the MGS at the moment?
No. It's something that happens regularly.
> Since you said there's no errors on the interface, you need to check
> the lnet connection and also verify that the MGS/MDT are up running.
As far as I can tell, everything seems to be set up correctly. I have
quite a simple setup (single network, single interface gbe).
Thanks
Arne
--
Arne Brutschy
Ph.D. Student Email arne.brutschy(AT)ulb.ac.be
IRIDIA CP 194/6 Web iridia.ulb.ac.be/~abrutschy
Universite' Libre de Bruxelles Tel +32 2 650 2273
Avenue Franklin Roosevelt 50 Fax +32 2 650 2715
1050 Bruxelles, Belgium (Fax at IRIDIA secretary)
On 2010-11-16, at 7:25 PM, Arne Brutschy wrote:
> Hello,
>
>> From the log, we can see that either your MGS node was not ready for
>> connection yet, or there's network error between client and the MGS node.
>
> No errors on the server nor on the client. What else can it be? Maybe the
> switch is bad - I can see RX errors on most of its interfaces.
The switch could be the culprit - the error message shows that the client
failed to send a request to the MGS. The network send status was -113
(-EHOSTUNREACH).
I suggest you reexamine your system's network.
> The switch could be the culprit - error message shows client failed to
> send request to MGS. Network sending status was -EHOSTUNREACH.
OK, I've started monitoring the switch - maybe I'll have to replace it.
I'm reluctant to do so, as I am not *sure* that it's the switch.
I guess our overall network layout is bad - it's a single network,
single interface gbe network without any routing happening. Maybe the
switch just gets overloaded...
> I suggest you reexamine the network of your system.
I realized that I did not give any parameters to the lnet module. Might
this have been a problem? If so, why did it work at all?
Anyway, I added a simple
options lnet networks=tcp0(eth0)
to my modprobe.conf on my clients (I had it already on the servers).
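I suppose I can verify the lnet side with lctl after that change (both
commands are standard Lustre tools; the NID below is our MGS from above,
so adjust as needed):

```
# load the modules and bring up the configured lnet network
modprobe lustre
lctl network up
lctl list_nids              # should report the client's 10.1.1.x@tcp NID

# then check lnet-level connectivity to the MGS
lctl ping 10.1.1.1@tcp0     # prints the server's NIDs on success
```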
Thanks
arne
> So I was incorrect to assume you are mounting this for the first time?
> Sounded like other clients have successfully mounted and have written
> data to the OSTs? Status -113 = no route to host so you might want
> to check your "attempt mount" client connectivity to both the MDS/MGS
> and most importantly to the OSS.
Yes, sorry if this wasn't clear before. This system has been in
production for 8 months (~40 users, ~600GB of data). The whole thing
grinds to a halt from time to time, most probably through flaky
networking. I can't really predict when it happens, or what triggers
it.
The mount problem I described here is always present and seems to be
connected to the main problem: random nodes can't mount on the first try
(even if it worked before on that node). Retrying usually works.
> I have encountered the exact same error prior when my mgsnode=xxxx@o2ib0,yyyy@o2ib1 but missing the zzzz@tcp0 on the OSTs. I tried using "tunefs.lustre --erase-param --mgsnode=....." to avoid reformatting it but at the end decided to wipe it out and start from scratch. My client only use tcp to MDS/MGS/OSS via two separate networks so once I append the missing piece on the mgsnode= parameter it mounted immediately.
This is what I did according to my notes:
mds
mkfs.lustre --fsname=lustre --mgs --mdt /dev/sda6
mount -t lustre /dev/sda6 /mdt0
oss0
mkfs.lustre --ost --fsname=lustre --mgsnode=10.1.1.1@tcp0 /dev/sda3
mkfs.lustre --ost --fsname=lustre --mgsnode=10.1.1.1@tcp0 /dev/sdb3
mount -t lustre /dev/sda3 /ost0
mount -t lustre /dev/sdb3 /ost1
oss1
mkfs.lustre --ost --fsname=lustre --mgsnode=10.1.1.1@tcp0 /dev/sda3
mkfs.lustre --ost --fsname=lustre --mgsnode=10.1.1.1@tcp0 /dev/sdb3
mount -t lustre /dev/sda3 /ost0
mount -t lustre /dev/sdb3 /ost1
client
mount -t lustre 10.1.1.1@tcp0:/lustre /lustre
I can try to rewrite the parameters on the file systems and power-cycle
the whole cluster, but I guess I have to wait for quieter times for
this. Reformatting the OSTs would be a major hassle, as I would need to
back up all the data beforehand.
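(If rewriting the parameters does turn out to be necessary, my
understanding from the tunefs.lustre man page is that it can be done on
the unmounted targets without reformatting - hedged sketch, device names
as in my notes above:

```
# on each OSS, with the target unmounted - regenerate the config logs
tunefs.lustre --erase-params --mgsnode=10.1.1.1@tcp0 --writeconf /dev/sda3
tunefs.lustre --erase-params --mgsnode=10.1.1.1@tcp0 --writeconf /dev/sdb3
# then restart: MGS/MDT first, OSTs next, remount the clients last
```
)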