You can see that message in two cases:
1) A low-level network error. That is bad, because the client will reconnect and resend its requests after the error,
which adds extra load to the service nodes.
2) A service node (MDS, OSS) is restarted or hung; in that case the transfer is aborted.
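For what it's worth, these messages are under the "neterror" debug class, so they can be re-enabled at runtime on a node you want to watch. Roughly, using the 1.8-era /proc interface and current lctl (knob names may differ on your version, so treat this as a sketch):

    # show the current debug mask
    lctl get_param debug
    # add the neterror class so it is collected in the debug log
    lctl set_param debug=+neterror
    # on older nodes the same mask lives under /proc
    echo +neterror > /proc/sys/lnet/debug
    # the console (syslog) mask is a separate knob
    cat /proc/sys/lnet/printk
    echo +neterror > /proc/sys/lnet/printk

Unmasking them on a busy production node can be noisy; it is mainly useful while chasing a specific problem.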
On Sep 22, 2010, at 19:20, Aurelien Degremont wrote:
> Hello
>
> I've noticed that Lustre network errors, especially LND errors, are considered maskable errors.
> That means that on a production node, where the debug mask is 0, those specific errors won't be displayed if they happen.
>
> Does that mean that they are harmless?
> Do the upper layers resend their RPCs/packets if an LND reports an error?
>
> In my case, o2iblnd says something like "RDMA failed" (a neterror). Is that a big issue? Were some RPCs lost or not?
>
> Thanks in advance
>
> --
> Aurelien Degremont
Eric Barton wrote:
> It's expected that peers will crash and therefore the low-level
> network should not clutter the logs with noise and the upper
> layers should handle the problem by retrying or doing actual
> recovery.
Ok, so I can read those errors as something like:
- my IB network is not so clean
- but the Lustre upper layers will retry, so this is transparent for them
as long as I do not have too many issues of this kind.
> "RDMA failed" should really only occur when a peer node crashes.
> However it could be a sign that there are deeper problems with
> the network setup or hardware.
Ok, but in my case the nodes do not crash and we still get this kind of issue, like
(this occurs on LNET routers):
Tx -> ... cookie ... sending 1 waiting 0: failed 12
Closing conn to ... : error -5 (waiting)
This happens even though the corresponding node is responding and Lustre works for it.
> If you suspect the network is
> misbehaving, I'd run an LNET self-test. This is well documented
> in the manual (at least to people who already know how it works ;)
> and lets you soak-test the network from any convenient node.
Ok :) I use it often, so that's fine.
But lnet_selftest has difficulty working nicely if you are using different OFED stacks (at least v1.4.2 against v1.5.1).
So it is difficult to use it as a test for my current issue.
Thanks
Aurélien
--
Aurelien Degremont
CEA
In our case the nodes were not restarted, so the InfiniBand network seems to have issues.
But these errors can be ignored as long as they do not appear too often.
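(As a rough yardstick, one can count them in the system log. A sketch, assuming the messages end up in /var/log/messages on the routers; the pattern simply matches the "Closing conn" lines quoted earlier in this thread:

    # count connection teardowns logged so far on this node
    grep -c 'Closing conn to' /var/log/messages
    # or watch new ones arrive
    tail -f /var/log/messages | grep 'Closing conn to'
)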
--
Aurelien Degremont
CEA
Thanks
Liang
On 9/23/10 3:57 PM, Aurelien Degremont wrote:
>
>> If you suspect the network is
>> misbehaving, I'd run an LNET self-test. This is well documented
>> in the manual (at least to people who already know how it works ;)
>> and lets you soak-test the network from any convenient node.
> Ok :) I use it often, so that's fine.
> But lnet_selftest has difficulty working nicely if you are using different OFED stacks (at least v1.4.2 against v1.5.1).
> So it is difficult to use it as a test for my current issue.
--
Cheers
Liang
> Eric Barton wrote:
> > It's expected that peers will crash and therefore the low-level
> > network should not clutter the logs with noise and the upper
> > layers should handle the problem by retrying or doing actual
> > recovery.
>
> Ok, so I can read those errors as something like:
> - my IB network is not so clean
> - but the Lustre upper layers will retry, so this is transparent for them
> as long as I do not have too many issues of this kind.
>
>
> > "RDMA failed" should really only occur when a peer node crashes.
> > However it could be a sign that there are deeper problems with
> > the network setup or hardware.
>
> Ok, but in my case the nodes do not crash and we still get this kind of issue, like
> (this occurs on LNET routers):
> Tx -> ... cookie ... sending 1 waiting 0: failed 12
> Closing conn to ... : error -5 (waiting)
>
> This happens even though the corresponding node is responding and Lustre works for it.
Then I'd suspect the IB network (switches and cabling). If I were you,
I'd really want to root these problems out. While they persist, Lustre
can evict clients spuriously and clients may appear to hang for many
seconds at a time.
> > If you suspect the network is
> > misbehaving, I'd run an LNET self-test. This is well documented
> > in the manual (at least to people who already know how it works ;)
> > and lets you soak-test the network from any convenient node.
>
> Ok :) I use it often, so that's fine.
> But lnet_selftest has difficulty working nicely if you are using different OFED stacks (at least
> v1.4.2 against v1.5.1).
> So it is difficult to use it as a test for my current issue.
Hmm - LNET self-test doesn't care at all what the underlying networks
are, so if networking breaks when you're using different OFED stacks,
I'd suspect the real problem is that OFED version interoperation doesn't
work when the network is under stress. I'm not clear what guarantees
(if any) OFED makes about version interoperation, and even if it's
supposed to work, it could easily be buggy.
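For reference, a minimal soak-test session looks roughly like the one below. It follows the pattern in the lnet_selftest chapter of the manual; the group names and NIDs are placeholders to replace with your own routers/clients/servers:

    # on the console node and every node under test
    modprobe lnet_selftest
    # on the console node
    export LST_SESSION=$$
    lst new_session rw_soak
    lst add_group clients 10.0.0.10@o2ib
    lst add_group servers 10.0.0.20@o2ib
    lst add_batch bulk
    lst add_test --batch bulk --from clients --to servers brw write check=simple size=1M
    lst run bulk
    lst stat clients servers        # Ctrl-C when you have seen enough
    lst end_session

Let it run long enough to put real load on the links; intermittent fabric problems often only show up after a while.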
Cheers,
Eric