[Lustre-discuss] making a client reconnect to OST

Brock Palen

Jan 24, 2008, 10:23:54 AM
to lustre-...@clusterfs.com
I have a client (one of our login nodes) that was evicted by one of
the OSTs but not both of them, so some files are accessible and others
are not. The strange thing is that both OSTs live on the same OSS.

The errors in dmesg are:

LustreError: 11-0: an error occurred while communicating with
141.212.30.181@tcp. The obd_ping operation failed with -107
Lustre: nobackup-OST0001-osc-000001007d548400: Connection to service
nobackup-OST0001 via nid 141.212.30.181@tcp was lost; in progress
operations using this service will wait for recovery to complete.
LustreError: 167-0: This client was evicted by nobackup-OST0001; in
progress operations using this service will fail.
LustreError: 29595:0:(file.c:1052:ll_glimpse_size()) obd_enqueue
returned rc -5, returning -EIO
LustreError: 29629:0:(file.c:1052:ll_glimpse_size()) obd_enqueue
returned rc -5, returning -EIO


OST0000 also lives at 141.212.30.181, so it's strange that only one
of them evicted the client. Is there a way to ask Lustre to restore
this connection? Up till this point, the client would recover quickly,
but this time it's just waiting.

Brock Palen
Center for Advanced Computing
bro...@umich.edu
(734)936-1985
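
The return codes in the dmesg excerpt above are negated Linux errno values. As a reading aid, here is a minimal sketch that names the two codes seen in this thread. The function name `lustre_rc_name` is hypothetical, and only these two codes are covered.

```shell
# Hypothetical helper: name the negated errno values that appear in the
# dmesg excerpt. Only the two codes from this thread are handled.
lustre_rc_name() {
    case "$1" in
        -107) echo "ENOTCONN (transport endpoint is not connected)" ;;
        -5)   echo "EIO (input/output error)" ;;
        *)    echo "unknown" ;;
    esac
}
```

For example, `lustre_rc_name -107` names the code returned by the failed obd_ping, and `lustre_rc_name -5` names the code returned by obd_enqueue.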


_______________________________________________
Lustre-discuss mailing list
Lustre-...@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss

Andreas Dilger

Jan 25, 2008, 2:55:23 PM
to Brock Palen, lustre-...@clusterfs.com
On Jan 24, 2008 10:23 -0500, Brock Palen wrote:
> I have a client (one of our login nodes) that was evicted by one of
> the OSTs but not both of them, so some files are accessible and others
> are not. The strange thing is that both OSTs live on the same OSS.
>
> Is there a way to ask Lustre to restore this? Up
> till this point, the client would recover quickly, but this time it's
> just waiting.

You could try "lctl --device {OSC device in question} recover".

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
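
The suggestion above needs the OSC device number, which `lctl dl` lists on the client. A minimal sketch of looking it up by target name follows. The helper name `find_osc_devno` is hypothetical, and the column layout it parses (index, state, type, name, uuid, refcount) is an assumption about the `lctl dl` output that should be checked against the local Lustre version.

```shell
# Hypothetical helper: print the device number of the OSC whose name begins
# with "<target>-osc". Reads `lctl dl`-style output on stdin.
# Assumed columns: index  state  type  name  uuid  refcount
find_osc_devno() {
    awk -v target="$1" \
        '$3 == "osc" && index($4, target "-osc") == 1 { print $1 }'
}

# Usage on the client, as root (not run here):
#   devno=$(lctl dl | find_osc_devno nobackup-OST0001)
#   lctl --device "$devno" recover
```

Matching on the target name rather than a hard-coded number matters here because both OSTs sit on the same OSS and only one of the two OSC devices is in the evicted state.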

Jim Harm

Jan 25, 2008, 5:10:29 PM
to Andreas Dilger, lustre-...@clusterfs.com
On the client I tried
lctl --device $number deactivate
which worked, followed by
lctl --device $number activate
which I believe should have done the same thing.
This failed without any error notice to me.

I ended up having to umount and mount, which finally reconnected the OST.


--
}}}===============>> LLNL
James E. Harm (Jim); jh...@llnl.gov
System Administrator, ICCD Clusters
(925) 422-4018 Page: 423-7705x57152
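
The deactivate/activate sequence described above can be sketched as a small wrapper. The function name `cycle_osc` and the `LCTL` override variable are hypothetical. Setting `LCTL="echo lctl"` only prints the commands, which is a cautious way to check the device number first. Note that in the report above this sequence did not reconnect the OST, so it is not a guaranteed fix.

```shell
# Hypothetical wrapper around the two commands from the message above.
# LCTL defaults to "lctl"; set LCTL="echo lctl" for a dry run.
cycle_osc() {
    devno=$1
    # Stop sending requests to the OST through this OSC device...
    ${LCTL:-lctl} --device "$devno" deactivate &&
    # ...then re-enable it. Skipped if the deactivate step fails.
    ${LCTL:-lctl} --device "$devno" activate
}

# Dry run:  LCTL="echo lctl" cycle_osc 4
# Real run (as root on the client):  cycle_osc 4
```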

Jim Harm

Feb 4, 2008, 11:31:40 AM
to lustre-...@clusterfs.com
Is there a tool that will really attempt a reconnect from a client to
a single OST?
It would be helpful for those rare cases
when this happens and there is nothing really wrong with either.
I imagine the original cause could be something as simple as repeated
delays on a very busy network.
Other OSTs from the same OSS remained connected to the same client
during this problem.
If umount and mount could be avoided,
it would be less disruptive to other processes on the client.

Andreas Dilger

Feb 4, 2008, 2:35:45 PM
to Jim Harm, lustre-...@clusterfs.com
On Feb 04, 2008 08:31 -0800, Jim Harm wrote:
> Is there a tool that will really attempt a reconnect from a client to
> a single OST?
> It would be helpful for those rare cases
> when this happens and there is nothing really wrong with either.
> I imagine the original cause could be something as simple as repeated
> delays on a very busy network.
> Other OSTs from the same OSS remained connected to the same client
> during this problem.
> If umount and mount could be avoided,
> it would be less disruptive to other processes on the client.

You can use "echo_client" to perform operations on a single OST. See
the lustre-iokit obdfilter-survey for usage details.
