I assume you have a Fibre Channel storage array? Use the array's tools to make sure it works, and use the HBA tools or just dmesg to check whether the HBA and the Fibre Channel links came up.
Look back at old logs if you have them to see how normal boot messages on the storage nodes should look.
-Larry
It really sounds, from your original description, like the storage array has not come back to life. Everything else sounds fine. The global clock sync and scboot messages look fine.
I'm guessing that you can run programs on all the nodes just fine as well, such as
"srun -p sc1 -N 108 hostname" or similar
From the messages, Lustre failed to start only because the Linux kernel on the nodes connected to the storage did not find the disks that Lustre uses. That is the place to start.
Just use ssh to log into the storage nodes and see what happened there. All the usual Linux
commands and utilities for diagnosing storage problems are there.
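As a rough sketch of one such check, the snippet below lists the whole SCSI disks that the kernel has registered, reading a file in /proc/partitions format. The function name list_scsi_disks is made up for illustration and is not a SiCortex tool. If the array's volumes are missing from the output, the kernel never saw them.

```shell
#!/bin/sh
# Hypothetical helper (not part of any SiCortex tooling): print the names of
# whole SCSI disks (sda, sdb, ...) found in a /proc/partitions-style file.
# Partitions such as sda1 are skipped because the pattern allows no digits.
list_scsi_disks() {
    # $1: file to read; defaults to the live /proc/partitions
    awk '$4 ~ /^sd[a-z]+$/ { print $4 }' "${1:-/proc/partitions}"
}

# Example use on a storage node (run by hand, not executed here):
#   list_scsi_disks
```

An empty result on a storage node means no sd devices were attached, which points at the HBA, the cabling, or the array rather than at Lustre.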
On the SSP, you can also look at the syslog logs for the storage nodes in /var/log/nodes.../...
to make sure that the Fibre Channel links to the storage have come up. Here is what they
look like, for example, with QLogic HBAs connected to a Promise storage array:
2011-01-05T11:11:21-05:00 <kern:info> scx-m4n3 [4294767.434000] QLogic Fibre Channel HBA Driver
2011-01-05T11:11:21-05:00 <kern:warning> scx-m4n3 [4294767.508000] PCI: Enabling device 0000:01:00.0 (0000 -> 0003)
2011-01-05T11:11:21-05:00 <kern:warning> scx-m4n3 [4294767.508000] PMI error interrupts enabled
2011-01-05T11:11:21-05:00 <kern:info> scx-m4n3 [4294767.509000] qla2400 0000:01:00.0: Found an ISP2432, irq 23, iobase 0x9000000800000000
2011-01-05T11:11:21-05:00 <kern:info> scx-m4n3 [4294767.510000] qla2400 0000:01:00.0: Configuring PCI space...
2011-01-05T11:11:21-05:00 <kern:info> scx-m4n3 [4294767.511000] qla2400 0000:01:00.0: Configure NVRAM parameters...
2011-01-05T11:11:21-05:00 <kern:info> scx-m4n3 [4294767.524000] qla2400 0000:01:00.0: Verifying loaded RISC code...
2011-01-05T11:11:21-05:00 <kern:info> scx-m4n3 [4294767.682000] qla2400 0000:01:00.0: Allocated (64 KB) for EFT...
2011-01-05T11:11:21-05:00 <kern:info> scx-m4n3 [4294767.682000] qla2400 0000:01:00.0: Allocated (1413 KB) for firmware dump...
2011-01-05T11:11:21-05:00 <kern:info> scx-m4n3 [4294767.684000] qla2400 0000:01:00.0: Waiting for LIP to complete...
2011-01-05T11:11:21-05:00 <kern:info> scx-m4n3 [4294767.990000] qla2400 0000:01:00.0: LIP reset occured (f7e2).
2011-01-05T11:11:21-05:00 <kern:info> scx-m4n3 [4294767.991000] qla2400 0000:01:00.0: LIP occured (f7e2).
2011-01-05T11:11:21-05:00 <kern:info> scx-m4n3 [4294767.992000] qla2400 0000:01:00.0: LOOP UP detected (4 Gbps).
2011-01-05T11:11:21-05:00 <kern:info> scx-m4n3 [4294767.995000] qla2400 0000:01:00.0: Topology - (Loop), Host Loop address 0x0
2011-01-05T11:11:21-05:00 <kern:info> scx-m4n3 [4294768.006000] scsi0 : qla2xxx
2011-01-05T11:11:21-05:00 <kern:info> scx-m4n3 [4294768.007000] qla2400 0000:01:00.0:
2011-01-05T11:11:21-05:00 <kern:warning> scx-m4n3 [4294768.007000] QLogic Fibre Channel HBA Driver: 8.01.07.15-fo
2011-01-05T11:11:21-05:00 <kern:warning> scx-m4n3 [4294768.007000] QLogic QEM2462 - QLE2440
2011-01-05T11:11:21-05:00 <kern:warning> scx-m4n3 [4294768.007000] ISP2432: PCIe (2.5Gb/s x4) @ 0000:01:00.0 hdma+, host#=0, fw=4.00.26 [IP]
2011-01-05T11:11:21-05:00 <kern:notice> scx-m4n3 [4294768.007000] Vendor: Promise Model: VTrak E310f Rev: 0331
2011-01-05T11:11:21-05:00 <kern:notice> scx-m4n3 [4294768.007000] Type: Direct-Access ANSI SCSI revision: 05
2011-01-05T11:11:21-05:00 <kern:info> scx-m4n3 [4294768.007000] qla2400 0000:01:00.0: scsi(0:0:0:0): Enabled tagged queuing, queue depth 32.
2011-01-05T11:11:21-05:00 <kern:notice> scx-m4n3 [4294768.007000] SCSI device sda: 1463812096 512-byte hdwr sectors (749472 MB)
2011-01-05T11:11:21-05:00 <kern:notice> scx-m4n3 [4294768.008000] sda: Write Protect is off
2011-01-05T11:11:21-05:00 <kern:notice> scx-m4n3 [4294768.008000] SCSI device sda: drive cache: write back w/ FUA
2011-01-05T11:11:21-05:00 <kern:notice> scx-m4n3 [4294768.008000] SCSI device sda: 1463812096 512-byte hdwr sectors (749472 MB)
2011-01-05T11:11:21-05:00 <kern:notice> scx-m4n3 [4294768.008000] sda: Write Protect is off
2011-01-05T11:11:21-05:00 <kern:notice> scx-m4n3 [4294768.009000] SCSI device sda: drive cache: write back w/ FUA
2011-01-05T11:11:21-05:00 <kern:info> scx-m4n3 [4294768.009000] sda: unknown partition table
2011-01-05T11:11:21-05:00 <kern:notice> scx-m4n3 [4294768.009000] sd 0:0:0:0: Attached scsi disk sda
2011-01-05T11:11:21-05:00 <kern:notice> scx-m4n3 [4294768.009000] sd 0:0:0:0: Attached scsi generic sg0 type 0
2011-01-05T11:11:21-05:00 <kern:notice> scx-m4n3 [4294768.010000] Vendor: Promise Model: VTrak E310f Rev: 0331
2011-01-05T11:11:21-05:00 <kern:notice> scx-m4n3 [4294768.010000] Type: Direct-Access ANSI SCSI revision: 05
2011-01-05T11:11:21-05:00 <kern:info> scx-m4n3 [4294768.010000] qla2400 0000:01:00.0: scsi(0:0:0:2): Enabled tagged queuing, queue depth 32.
2011-01-05T11:11:21-05:00 <kern:notice> scx-m4n3 [4294768.010000] SCSI device sdb: 1463812096 512-byte hdwr sectors (749472 MB)
2011-01-05T11:11:21-05:00 <kern:notice> scx-m4n3 [4294768.011000] sdb: Write Protect is off
... and other messages for the other logical volumes exported by the array.
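If you would rather not read the whole log by eye, here is a minimal sketch that counts the two milestones that matter in the excerpt above: the "LOOP UP detected" line from the HBA driver and the "Attached scsi disk" lines for each volume. The function name check_fc_log is made up, and you have to pass it the path of the storage node's log yourself. The patterns are copied from the QLogic messages above; other HBA drivers will word things differently.

```shell
#!/bin/sh
# Hypothetical helper: summarize Fibre Channel bring-up from a syslog file.
# Patterns are taken from the qla2xxx messages in the excerpt above and are
# an assumption for any other driver or driver version.
check_fc_log() {
    # $1: path to the syslog file of one storage node
    log=$1
    # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true"
    loops=$(grep -c 'LOOP UP detected' "$log" || true)
    disks=$(grep -c 'Attached scsi disk' "$log" || true)
    echo "loop-up messages: $loops"
    echo "attached disks: $disks"
    # Succeed only if the link came up and at least one disk was attached
    [ "$loops" -gt 0 ] && [ "$disks" -gt 0 ]
}

# Example use on the SSP (run by hand, with the real log path filled in):
#   check_fc_log /path/to/storage-node-log
```

A count of zero for loop-up messages means the link never came up; loop-up with zero attached disks suggests the array is reachable but is not exporting its volumes.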
I suspect that your storage array didn't come on again, and that has nothing to do with
scboot or the rest of the machine. Debug the array and its connections.
You can also just check for blinking lights. I'm assuming you have Fibre Channel cards; they should have lights to show that the links are working. The storage array should have lights to show that it is
alive as well.
-L
-Larry
You debug DNS problems on SiCortex systems with regular Linux methods.
The command dig (man dig) lets you try queries.
The file /etc/resolv.conf tells you what your name servers are.
Check if they are what you expect.
Check if you can ping the DNS server's IP address.
Check if the server is actually working by logging into the server machine and trying dig there.
Restart the server if it isn't.
Often the nodes are set to use the SSP as the DNS server. The SSP runs "dnsmasq" as a combined DNS, DHCP, and TFTP server.
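The first few checks above can be strung together roughly like this. It is only a sketch: the function names list_nameservers and check_nameservers are made up, and the host name you look up is whatever name is failing for you.

```shell
#!/bin/sh
# Hypothetical helpers for the checks described above.

# Print the name server addresses from a resolv.conf-style file.
# Commented-out entries are skipped because their first field is not
# exactly "nameserver".
list_nameservers() {
    # $1: file to read; defaults to /etc/resolv.conf
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# For each configured name server, ping it once and try one query with dig.
# $1: host name to look up (supplied by you)
check_nameservers() {
    for ns in $(list_nameservers); do
        echo "== $ns =="
        ping -c 1 "$ns" || echo "no ping reply from $ns"
        dig "@$ns" "$1" +short || echo "dig query to $ns failed"
    done
}

# Example use (run by hand on a node):
#   list_nameservers
#   check_nameservers some-host-name
```

If the ping works but dig times out, the machine is up and the DNS service on it probably is not, which is the case where you log into the server and restart it.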
-Larry