[Lustre-discuss] lustre OSS IP change


Brendon

Jan 11, 2011, 5:23:09 PM
to lustre-...@lists.lustre.org
Hello.

I'm new to Lustre. I'm writing because I changed some IPs and now I
can't get everything mounted.

4 servers total (only the OSS IPs changed):
head - MGS/MDT (no IP change)
oss0 - OSS (changed IP)
oss1 - OSS (changed IP)
oss2 - OSS (changed IP)

brianjm was helpful on IRC and hinted that I needed to read the manual
about changing a NID. I followed the procedure in section 4.3.12,
Changing a Server NID.

On all servers, I made sure everything was unmounted: umount /mnt/mdt
and umount /mnt/lustre

On head, I ran: tunefs.lustre --writeconf /dev/sdb1
On oss0, I ran: tunefs.lustre --writeconf /dev/sda5
On oss1, I ran: tunefs.lustre --writeconf /dev/sda5
On oss2, I ran: tunefs.lustre --writeconf /dev/sda5
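
My reading of that manual section is that once the writeconf has been
run on every target, everything gets remounted in order: MDT first,
then the OSTs, then clients. Roughly, with my devices (mount points
other than /mnt/mdt left as placeholders since they depend on your setup):

On head: mount -t lustre /dev/sdb1 /mnt/mdt
On each oss: mount -t lustre /dev/sda5 <OST mount point>
On a client: mount -t lustre jupiter@tcp0:/mylustre <client mount point>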

I then mounted everything, starting with the MDT on server head.

On head: mount /mnt/mdt -- this didn't throw any errors. I checked the
logs and there were no errors there either.

On oss0: mount /mnt/lustre
mount.lustre: mount jupiter@tcp0:/mylustre at /mnt/lustre failed: No
medium found
This filesystem needs at least 1 OST

Error log on oss0:
Jan 11 14:12:25 compute-oss-0-0 kernel: LustreError: 151-5: There are
no OST's in this filesystem. There must be at least one active OST for
a client to start.
Jan 11 14:12:25 compute-oss-0-0 kernel: LustreError:
4136:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Got rc -108 from
cancel RPC: canceling anyway
Jan 11 14:12:25 compute-oss-0-0 kernel: LustreError:
4136:0:(ldlm_request.c:1575:ldlm_cli_cancel_list())
ldlm_cli_cancel_list: -108
Jan 11 14:12:25 compute-oss-0-0 kernel: LustreError:
4136:0:(obd_mount.c:1951:lustre_fill_super()) Unable to mount (-123)

The same error appears on the other oss servers.

Any help would be greatly appreciated.
-Brendon

Brendon

Jan 11, 2011, 5:47:04 PM
to lustre-...@lists.lustre.org
Since my last post, I realized /etc/fstab didn't have an entry to
mount the OST. It's not clear to me how this was working before,
because I don't recall ever seeing an OST mounted in the output of
"mount". Anyway, continuing on...

On oss0 I ran:
mount -t lustre /dev/sda5 /mnt/ost0
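
The fstab line it was missing would presumably look something like the
below (I believe _netdev is recommended so the mount waits for the
network at boot, but double-check the options for your setup):

/dev/sda5   /mnt/ost0   lustre   defaults,_netdev   0 0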

Logs on oss0 say:
Jan 11 14:33:33 compute-oss-0-0 kernel: LustreError:
4458:0:(ldlm_lib.c:806:target_handle_connect()) mylustre-OST0003:
denying connection for new client 10.1.1.1@tcp (mylustre-mdtlov_UUID):
3 clients in recovery for 300s
Jan 11 14:33:33 compute-oss-0-0 kernel: LustreError:
4458:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error
(-16) req@ffff81021e84da00 x72/t0 o8-><?>@<?>:0/0 lens 240/144 e 0 to
0 dl 1294785313 ref 1 fl Interpret:/0/0 rc -16/0
Jan 11 14:33:58 compute-oss-0-0 kernel: LustreError:
4459:0:(ldlm_lib.c:806:target_handle_connect()) mylustre-OST0003:
denying connection for new client 10.1.1.1@tcp (mylustre-mdtlov_UUID):
3 clients in recovery for 275s
...
Jan 11 14:37:18 compute-oss-0-0 kernel: LustreError:
4468:0:(ldlm_lib.c:806:target_handle_connect()) mylustre-OST0003:
denying connection for new client 10.1.1.1@tcp (mylustre-mdtlov_UUID):
3 clients in recovery for 74s
Jan 11 14:37:18 compute-oss-0-0 kernel: LustreError:
4468:0:(ldlm_lib.c:806:target_handle_connect()) Skipped 1 previous
similar message
Jan 11 14:37:18 compute-oss-0-0 kernel: LustreError:
4468:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error
(-16) req@ffff81021e310800 x91/t0 o8-><?>@<?>:0/0 lens 240/144 e 0 to
0 dl 1294785538 ref 1 fl Interpret:/0/0 rc -16/0
Jan 11 14:37:18 compute-oss-0-0 kernel: LustreError:
4468:0:(ldlm_lib.c:1536:target_send_reply_msg()) Skipped 1 previous
similar message
Jan 11 14:38:33 compute-oss-0-0 kernel: LustreError:
0:0:(ldlm_lib.c:1105:target_recovery_expired()) mylustre-OST0003:
recovery timed out, aborting
Jan 11 14:38:33 compute-oss-0-0 kernel: LustreError:
4471:0:(genops.c:1024:class_disconnect_stale_exports())
mylustre-OST0003: disconnecting 3 stale clients


Logs on head server:
Jan 11 14:33:33 jupiter kernel: LustreError: 11-0: an error occurred
while communicating with 10.1.1.2@tcp. The ost_connect operation
failed with -16
Jan 11 14:34:23 jupiter last message repeated 2 times
Jan 11 14:35:38 jupiter last message repeated 3 times
Jan 11 14:37:18 jupiter last message repeated 3 times
Jan 11 14:37:18 jupiter kernel: LustreError: Skipped 1 previous similar message

Any help is much appreciated.

Brendon

Jan 11, 2011, 6:07:10 PM
to lustre-...@lists.lustre.org
I continued mounting the OSTs on oss1 and oss2 and received the same
errors as on oss0.

I then tried mounting Lustre as a client. That worked, but touching a
file on it caused it to hang and spit out the errors below on server
head. After about a minute the server became unresponsive to ping; I'm
guessing it oops'ed.
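
For the record, the client mount and the test that hung were roughly
this (mount point and file name are from memory):

mount -t lustre jupiter@tcp0:/mylustre /mnt/lustre
touch /mnt/lustre/testfile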

I googled the ost_connect error -16, but haven't found anything
relevant yet.
I'm going to take a break; I've been working on this one all day...
time for a late lunch.

Any insight is much appreciated.
-Brendon

Jan 11 14:54:19 jupiter kernel: LustreError: 11-0: an error occurred
while communicating with 10.1.1.3@tcp. The ost_connect operation
failed with -16
Jan 11 14:54:19 jupiter kernel: LustreError: Skipped 5 previous similar messages
Jan 11 14:57:14 jupiter kernel: LustreError:
7694:0:(osc_create.c:348:osc_create()) mylustre-OST0002-osc: oscc
recovery failed: -110
Jan 11 14:57:14 jupiter kernel: LustreError:
7693:0:(osc_create.c:348:osc_create()) mylustre-OST0001-osc: oscc
recovery failed: -110
Jan 11 14:57:14 jupiter kernel: LustreError:
7694:0:(lov_obd.c:1074:lov_clear_orphans()) error in orphan recovery
on OST idx 2/4: rc = -110
Jan 11 14:57:14 jupiter kernel: LustreError: 11-0: an error occurred
while communicating with 10.1.1.4@tcp. The ost_connect operation
failed with -16
Jan 11 14:57:14 jupiter kernel: LustreError: Skipped 9 previous similar messages

Wojciech Turek

Jan 11, 2011, 6:35:26 PM
to Brendon, lustre-...@lists.lustre.org
Hi Brendon,

Can you please provide the following (a quick way to capture it all is
sketched below):
1) output of ifconfig run on each OSS, the MDS, and at least one client
2) output of lctl list_nids run on each OSS, the MDS, and at least one client
3) output of tunefs.lustre --print --dryrun /dev/<OST_block_device> from each OSS
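
If it is easier, something along these lines run on each node should
capture most of it in one file (the tunefs line applies only on the
OSSs; substitute your actual OST device):

(ifconfig -a; lctl list_nids) > /tmp/lustre-info-$(hostname).txt 2>&1
tunefs.lustre --print --dryrun /dev/<OST_block_device> >> /tmp/lustre-info-$(hostname).txt 2>&1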

Wojciech
--
Wojciech Turek

Senior System Architect

High Performance Computing Service
University of Cambridge

Brendon

Jan 13, 2011, 12:41:44 PM
to lustre-...@lists.lustre.org
On Tue, Jan 11, 2011 at 3:35 PM, Wojciech Turek <wj...@cam.ac.uk> wrote:
> Hi Brendon,
>
> Can you please provide following:
> 1) output of ifconfig run on each OSS MDS and at least one client
> 2) output of lctl list_nids run on each OSS MDS and at least one client
> 3) output of tunefs.lustre --print --dryrun /dev/<OST_block_device> from
> each OSS
>
> Wojciech

After someone looked at the emails I sent out, they grabbed me on IRC.
We had a discussion, and basically their read was that everything
should be working; I just needed to wait for recovery to run and
complete. What I then learned is that, first, a client has to connect
for recovery to start, and second, the code isn't perfect: the MDS
kernel oops'ed twice before recovery finally completed. I was in the
process of disabling panic-on-oops when it finally finished
successfully. Once that was done, I got a clean bill of health.
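
For anyone who hits the same thing: while recovery is running you
should be able to watch its progress in /proc; on these servers the
files look like this (paths from memory, so double-check on your
version):

cat /proc/fs/lustre/obdfilter/*/recovery_status    # on each OSS
cat /proc/fs/lustre/mds/*/recovery_status          # on the MDS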

Just to complete this discussion, I have listed the requested output.
I might still learn something :)

...Looks like I did learn something: oss0 has an issue with its root
FS and it has been remounted read-only, which I discovered when running
tunefs.lustre --print --dryrun /dev/sda5.
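
I expect a remount along these lines would let tunefs.lustre run again,
though since the root FS went read-only on its own I should probably
check dmesg and fsck that disk first:

mount -o remount,rw /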

The fun never ends :)
-Brendon

1) ifconfig info
MDS: # ifconfig
eth0      Link encap:Ethernet  HWaddr 00:15:17:5E:46:64
         inet addr:10.1.1.1  Bcast:10.1.1.255  Mask:255.255.255.0
         inet6 addr: fe80::215:17ff:fe5e:4664/64 Scope:Link
         UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
         RX packets:49140546 errors:0 dropped:0 overruns:0 frame:0
         TX packets:63644404 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:1000
         RX bytes:18963170801 (17.6 GiB)  TX bytes:65261762295 (60.7 GiB)
         Base address:0xcc00 Memory:f58e0000-f5900000

eth1      Link encap:Ethernet  HWaddr 00:15:17:5E:46:65
         inet addr:192.168.0.181  Bcast:192.168.0.255  Mask:255.255.255.0
         inet6 addr: fe80::215:17ff:fe5e:4665/64 Scope:Link
         UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
         RX packets:236738842 errors:0 dropped:0 overruns:0 frame:0
         TX packets:458503163 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:100
         RX bytes:15562858193 (14.4 GiB)  TX bytes:686167422947 (639.0 GiB)
         Base address:0xc880 Memory:f5880000-f58a0000

OSS : # ifconfig
eth0      Link encap:Ethernet  HWaddr 00:1D:60:E0:5B:B2
         inet addr:10.1.1.2  Bcast:10.1.1.255  Mask:255.255.255.0
         inet6 addr: fe80::21d:60ff:fee0:5bb2/64 Scope:Link
         UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
         RX packets:3092588 errors:0 dropped:0 overruns:0 frame:0
         TX packets:3547204 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:1000
         RX bytes:1320521551 (1.2 GiB)  TX bytes:2670089148 (2.4 GiB)
         Interrupt:233

client: # ifconfig
eth0      Link encap:Ethernet  HWaddr 00:1E:8C:39:E4:69
         inet addr:10.1.1.5  Bcast:10.1.1.255  Mask:255.255.255.0
         inet6 addr: fe80::21e:8cff:fe39:e469/64 Scope:Link
         UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
         RX packets:727922 errors:0 dropped:0 overruns:0 frame:0
         TX packets:884188 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:1000
         RX bytes:433349006 (413.2 MiB)  TX bytes:231985578 (221.2 MiB)
         Interrupt:50

2) lctl list_nids

client: lctl list_nids
10.1.1.5@tcp

MDS: lctl list_nids
10.1.1.1@tcp

OSS: lctl list_nids
10.1.1.2@tcp

3) tunefs.lustre --print --dryrun /dev/sda5
OSS0: # tunefs.lustre --print --dryrun /dev/sda5
checking for existing Lustre data: found CONFIGS/mountdata
tunefs.lustre: Can't create temporary directory /tmp/dirCZXt3k:
Read-only file system

tunefs.lustre FATAL: Failed to read previous Lustre data from /dev/sda5 (30)
tunefs.lustre: exiting with 30 (Read-only file system)

OSS1: # tunefs.lustre --print --dryrun /dev/sda5
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

  Read previous values:
Target:     mylustre-OST0001
Index:      1
Lustre FS:  mylustre
Mount type: ldiskfs
Flags:      0x2
             (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.1.1.1@tcp


  Permanent disk data:
Target:     mylustre-OST0001
Index:      1
Lustre FS:  mylustre
Mount type: ldiskfs
Flags:      0x2
             (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.1.1.1@tcp

exiting before disk write.


OSS2: # tunefs.lustre --print --dryrun /dev/sda5
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

  Read previous values:
Target:     mylustre-OST0002
Index:      2
Lustre FS:  mylustre
Mount type: ldiskfs
Flags:      0x2
             (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.1.1.1@tcp


  Permanent disk data:
Target:     mylustre-OST0002
Index:      2
Lustre FS:  mylustre
Mount type: ldiskfs
Flags:      0x2
             (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.1.1.1@tcp

exiting before disk write.

Wojciech Turek

Jan 13, 2011, 1:02:10 PM
to lustre-discuss
Hi Brendon,

So it looks like your Lustre was just stuck in the recovery process after all.
It is a bit concerning that you had kernel panics on the MDS during recovery. Which Lustre version are you running? Do you have stack traces from the kernel panics?
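
If you are not sure which version is running, this on the MDS should
show it (path from memory):

cat /proc/fs/lustre/version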

Wojciech

Wojciech Turek

Jan 13, 2011, 1:40:58 PM
to Brendon, lustre-discuss
The kernel panics were due to bugs in the Lustre version you are running. If you want to avoid this sort of trouble in the future and make your filesystem stable, you should upgrade to 1.8.5.

On 13 January 2011 18:04, Brendon <b...@brendon.com> wrote:
Wojciech-

Before this I had read a bit about Lustre, but not much -- just some
high-level stuff. It was definitely a "crash course".

It looks like version 1.6.5. I don't have stack traces; the kernel
panicked each time and I don't have a console server.

# uname -a
Linux jupiter.nanostellar.com 2.6.18-53.1.14.el5_lustre.1.6.5.1smp #1
SMP Wed Jun 18 19:45:15 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux

-Brendon

On Thu, Jan 13, 2011 at 10:01 AM, Wojciech Turek <wj...@cam.ac.uk> wrote:
> Hi Brendon,
>
> So it looks like you Lustre was just stuck in recovery processes after all.
> It is a bit concerning that you had kernel panics on MDS during recovery.
> Which Lustre version are you using? Do you have stack traces from the kernel
> panics?
>
> Wojciech
>
