I'm new to Lustre. I'm writing because I have changed some IPs and I
can't get everything mounted.
4 servers total (I changed the IPs of just the OSS servers):
head - MGS/MDT (no IP change)
oss0 - OSS (changed IP)
oss1 - OSS (changed IP)
oss2 - OSS (changed IP)
brianjm was helpful on IRC and hinted that I needed to read the manual
about changing a NID. I followed the procedure in 4.3.12, "Changing a
Server NID":
On all servers, I made sure everything was unmounted: umount /mnt/mdt
and umount /mnt/lustre
On head, I ran: tunefs.lustre --writeconf /dev/sdb1
On oss0, I ran: tunefs.lustre --writeconf /dev/sda5
On oss1, I ran: tunefs.lustre --writeconf /dev/sda5
On oss2, I ran: tunefs.lustre --writeconf /dev/sda5
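(For anyone following along, the rest of that procedure is just bringing
the targets back up in order: MGS/MDT first, then each OST, then any
clients. On this setup that would be roughly the following; the /mnt/ostN
mount points are my shorthand, not copied from the actual fstab:
On head: mount -t lustre /dev/sdb1 /mnt/mdt
On oss0: mount -t lustre /dev/sda5 /mnt/ost0
On oss1: mount -t lustre /dev/sda5 /mnt/ost1
On oss2: mount -t lustre /dev/sda5 /mnt/ost2
On a client: mount -t lustre jupiter@tcp0:/mylustre /mnt/lustre )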
I then mounted everything, starting with the MDT on server head.
On head: mount /mnt/mdt -- this didn't throw any errors; I checked the
logs and saw no errors there either.
On oss0: mount /mnt/lustre
mount.lustre: mount jupiter@tcp0:/mylustre at /mnt/lustre failed: No
medium found
This filesystem needs at least 1 OST
Error log on oss0:
Jan 11 14:12:25 compute-oss-0-0 kernel: LustreError: 151-5: There are
no OST's in this filesystem. There must be at least one active OST for
a client to start.
Jan 11 14:12:25 compute-oss-0-0 kernel: LustreError:
4136:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Got rc -108 from
cancel RPC: canceling anyway
Jan 11 14:12:25 compute-oss-0-0 kernel: LustreError:
4136:0:(ldlm_request.c:1575:ldlm_cli_cancel_list())
ldlm_cli_cancel_list: -108
Jan 11 14:12:25 compute-oss-0-0 kernel: LustreError:
4136:0:(obd_mount.c:1951:lustre_fill_super()) Unable to mount (-123)
The same error appears on the other OSS servers.
Any help would be greatly appreciated.
-Brendon
On oss0 I ran:
mount -t lustre /dev/sda5 /mnt/ost0
Logs on oss0 say:
Jan 11 14:33:33 compute-oss-0-0 kernel: LustreError:
4458:0:(ldlm_lib.c:806:target_handle_connect()) mylustre-OST0003:
denying connection for new client 10.1.1.1@tcp (mylustre-mdtlov_UUID):
3 clients in recovery for 300s
Jan 11 14:33:33 compute-oss-0-0 kernel: LustreError:
4458:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error
(-16) req@ffff81021e84da00 x72/t0 o8-><?>@<?>:0/0 lens 240/144 e 0 to
0 dl 1294785313 ref 1 fl Interpret:/0/0 rc -16/0
Jan 11 14:33:58 compute-oss-0-0 kernel: LustreError:
4459:0:(ldlm_lib.c:806:target_handle_connect()) mylustre-OST0003:
denying connection for new client 10.1.1.1@tcp (mylustre-mdtlov_UUID):
3 clients in recovery for 275s
...
Jan 11 14:37:18 compute-oss-0-0 kernel: LustreError:
4468:0:(ldlm_lib.c:806:target_handle_connect()) mylustre-OST0003:
denying connection for new client 10.1.1.1@tcp (mylustre-mdtlov_UUID):
3 clients in recovery for 74s
Jan 11 14:37:18 compute-oss-0-0 kernel: LustreError:
4468:0:(ldlm_lib.c:806:target_handle_connect()) Skipped 1 previous
similar message
Jan 11 14:37:18 compute-oss-0-0 kernel: LustreError:
4468:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error
(-16) req@ffff81021e310800 x91/t0 o8-><?>@<?>:0/0 lens 240/144 e 0 to
0 dl 1294785538 ref 1 fl Interpret:/0/0 rc -16/0
Jan 11 14:37:18 compute-oss-0-0 kernel: LustreError:
4468:0:(ldlm_lib.c:1536:target_send_reply_msg()) Skipped 1 previous
similar message
Jan 11 14:38:33 compute-oss-0-0 kernel: LustreError:
0:0:(ldlm_lib.c:1105:target_recovery_expired()) mylustre-OST0003:
recovery timed out, aborting
Jan 11 14:38:33 compute-oss-0-0 kernel: LustreError:
4471:0:(genops.c:1024:class_disconnect_stale_exports())
mylustre-OST0003: disconnecting 3 stale clients
Logs on the head server:
Jan 11 14:33:33 jupiter kernel: LustreError: 11-0: an error occurred
while communicating with 10.1.1.2@tcp. The ost_connect operation
failed with -16
Jan 11 14:34:23 jupiter last message repeated 2 times
Jan 11 14:35:38 jupiter last message repeated 3 times
Jan 11 14:37:18 jupiter last message repeated 3 times
Jan 11 14:37:18 jupiter kernel: LustreError: Skipped 1 previous similar message
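(Something I only found later: the recovery state those messages refer to
can be watched directly in /proc on each server. On 1.6 it should be
something like the paths below; the MDT0000 target name is my guess at
the default:
On oss0: cat /proc/fs/lustre/obdfilter/mylustre-OST0003/recovery_status
On head: cat /proc/fs/lustre/mds/mylustre-MDT0000/recovery_status
That shows the recovery status, how many clients have reconnected, and
the time remaining. I believe there is also an abort_recov mount option
for the targets if you know the old clients are never coming back, but I
haven't tried it here.)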
Any help is much appreciated.
I then tried mounting Lustre as a client. That worked, but touching a
file on it caused the client to hang and spit out the errors below on
server head. After about a minute the server became unresponsive to
ping. I'm guessing it oopsed.
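(For completeness, the client mount itself was just the standard one,
something along the lines of:
mount -t lustre jupiter@tcp0:/mylustre /mnt/lustre
with the MGS NID and mount point as in the earlier error message.)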
I googled the ost_connect error -16, but haven't found anything
relevant or useful yet.
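(For anyone else searching for these: the numbers in the Lustre messages
look like negated errno values, so -16 is EBUSY (the OST refusing new
connections while it is in recovery), -108 is ESHUTDOWN, -110 is
ETIMEDOUT, -123 is ENOMEDIUM and -30 is EROFS. A quick way to decode one:
python -c 'import os; print(os.strerror(16))'
which prints "Device or resource busy".)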
I'm going to take a break. I've been working this one all day... time
for a late lunch.
Any insight is much appreciated.
-Brendon
Jan 11 14:54:19 jupiter kernel: LustreError: 11-0: an error occurred
while communicating with 10.1.1.3@tcp. The ost_connect operation
failed with -16
Jan 11 14:54:19 jupiter kernel: LustreError: Skipped 5 previous similar messages
Jan 11 14:57:14 jupiter kernel: LustreError:
7694:0:(osc_create.c:348:osc_create()) mylustre-OST0002-osc: oscc
recovery failed: -110
Jan 11 14:57:14 jupiter kernel: LustreError:
7693:0:(osc_create.c:348:osc_create()) mylustre-OST0001-osc: oscc
recovery failed: -110
Jan 11 14:57:14 jupiter kernel: LustreError:
7694:0:(lov_obd.c:1074:lov_clear_orphans()) error in orphan recovery
on OST idx 2/4: rc = -110
Jan 11 14:57:14 jupiter kernel: LustreError: 11-0: an error occurred
while communicating with 10.1.1.4@tcp. The ost_connect operation
failed with -16
Jan 11 14:57:14 jupiter kernel: LustreError: Skipped 9 previous similar messages
After someone looked at the emails I sent out, they grabbed me on IRC.
We had a discussion, and basically their read was that everything should
be working; I just needed to wait for recovery to run and complete. What
I then learned is that, first, a client has to connect before recovery
will actually kick off. Secondly, the code isn't perfect: the MDS kernel
oopsed twice before recovery finally completed. I was in the process of
disabling panic-on-oops when it finished, and once it did I got a clean
bill of health.
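(The panic-on-oops change I was in the middle of making is just the
stock kernel sysctl, nothing Lustre-specific:
echo 0 > /proc/sys/kernel/panic_on_oops
or, to make it stick across reboots, kernel.panic_on_oops = 0 in
/etc/sysctl.conf followed by sysctl -p.)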
Just to complete this discussion, I have listed the requested output
below. I might still learn something :)
...Looks like I did learn something: OSS0 has an issue with its root FS,
which got remounted read-only. I discovered this when running
tunefs.lustre --print --dryrun /dev/sda5.
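(Plan for the read-only root on OSS0, for the record: check dmesg for
the filesystem or I/O error that made the kernel flip it to RO, fsck the
device if the errors look real, and only then mount -o remount,rw / --
remounting rw without finding the underlying disk problem just postpones
the pain. None of that is Lustre-specific.)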
The fun never ends :)
-Brendon
1) ifconfig info
MDS: # ifconfig
eth0 Link encap:Ethernet HWaddr 00:15:17:5E:46:64
inet addr:10.1.1.1 Bcast:10.1.1.255 Mask:255.255.255.0
inet6 addr: fe80::215:17ff:fe5e:4664/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:49140546 errors:0 dropped:0 overruns:0 frame:0
TX packets:63644404 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:18963170801 (17.6 GiB) TX bytes:65261762295 (60.7 GiB)
Base address:0xcc00 Memory:f58e0000-f5900000
eth1 Link encap:Ethernet HWaddr 00:15:17:5E:46:65
inet addr:192.168.0.181 Bcast:192.168.0.255 Mask:255.255.255.0
inet6 addr: fe80::215:17ff:fe5e:4665/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:236738842 errors:0 dropped:0 overruns:0 frame:0
TX packets:458503163 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:15562858193 (14.4 GiB) TX bytes:686167422947 (639.0 GiB)
Base address:0xc880 Memory:f5880000-f58a0000
OSS : # ifconfig
eth0 Link encap:Ethernet HWaddr 00:1D:60:E0:5B:B2
inet addr:10.1.1.2 Bcast:10.1.1.255 Mask:255.255.255.0
inet6 addr: fe80::21d:60ff:fee0:5bb2/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:3092588 errors:0 dropped:0 overruns:0 frame:0
TX packets:3547204 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:1320521551 (1.2 GiB) TX bytes:2670089148 (2.4 GiB)
Interrupt:233
client: # ifconfig
eth0 Link encap:Ethernet HWaddr 00:1E:8C:39:E4:69
inet addr:10.1.1.5 Bcast:10.1.1.255 Mask:255.255.255.0
inet6 addr: fe80::21e:8cff:fe39:e469/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:727922 errors:0 dropped:0 overruns:0 frame:0
TX packets:884188 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:433349006 (413.2 MiB) TX bytes:231985578 (221.2 MiB)
Interrupt:50
2) lctl list_nids
client: lctl list_nids
10.1.1.5@tcp
MDS: lctl list_nids
10.1.1.1@tcp
OSS: lctl list_nids
10.1.1.2@tcp
3) tunefs.lustre --print --dryrun /dev/sda5
OSS0: # tunefs.lustre --print --dryrun /dev/sda5
checking for existing Lustre data: found CONFIGS/mountdata
tunefs.lustre: Can't create temporary directory /tmp/dirCZXt3k:
Read-only file system
tunefs.lustre FATAL: Failed to read previous Lustre data from /dev/sda5 (30)
tunefs.lustre: exiting with 30 (Read-only file system)
OSS1: # tunefs.lustre --print --dryrun /dev/sda5
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target: mylustre-OST0001
Index: 1
Lustre FS: mylustre
Mount type: ldiskfs
Flags: 0x2
(OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.1.1.1@tcp
Permanent disk data:
Target: mylustre-OST0001
Index: 1
Lustre FS: mylustre
Mount type: ldiskfs
Flags: 0x2
(OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.1.1.1@tcp
exiting before disk write.
OSS2: # tunefs.lustre --print --dryrun /dev/sda5
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target: mylustre-OST0002
Index: 2
Lustre FS: mylustre
Mount type: ldiskfs
Flags: 0x2
(OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.1.1.1@tcp
Permanent disk data:
Target: mylustre-OST0002
Index: 2
Lustre FS: mylustre
Mount type: ldiskfs
Flags: 0x2
(OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.1.1.1@tcp
exiting before disk write.
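(One more cheap sanity check after a NID change, which I'll add here for
completeness: lctl ping between every pair of nodes, e.g. from the client
lctl ping 10.1.1.1@tcp
lctl ping 10.1.1.2@tcp
If those come back with the remote node's NIDs, LNET itself is happy and
any remaining problem is above that layer.)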
Wojciech-
Before this I had read a bit about Lustre, but not much, just some
high-level stuff. It was definitely a "crash course".
It looks like version 1.6.5. I don't have stack traces; the kernel
panicked each time and I don't have a console server.
# uname -a
Linux jupiter.nanostellar.com 2.6.18-53.1.14.el5_lustre.1.6.5.1smp #1
SMP Wed Jun 18 19:45:15 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
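(If it helps, the exact version is also visible in /proc:
cat /proc/fs/lustre/version
which should report lustre: 1.6.5.1 on these nodes, going by the kernel
string above.)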
-Brendon
> So it looks like your Lustre was just stuck in the recovery process after all.
> It is a bit concerning that you had kernel panics on MDS during recovery.
> Which Lustre version are you using? Do you have stack traces from the kernel
> panics?
>
> Wojciech
>
> On 13 January 2011 17:41, Brendon <b...@brendon.com> wrote:
> --
> Wojciech Turek