Cannot mount lustre file system after manual shutdown


Katie Cooper

May 24, 2011, 6:08:27 PM
to SiCortex Users

Hi,

Our server room underwent a cooling update this past weekend and so I
had to bring down my SC648 prior to it. I followed the SysAdmin guide
for a proper manual shutdown including the unmounting of my lustre
file system. Today I rebooted the system, but lustre is not
mounting. When I try running the mount script, I get this error:
"tunefs.lustre FATAL: Device /dev/sd? has not been formatted with
mkfs.lustre"

I also saw these errors/warnings/messages during the reboot process:

"No automount maps defined"

"hostname: Unknown host"

"slurm_update error: Invalid node name"

"Trouble setting nodes to SLURM 'Idle' state"

Any help would be appreciated (I'm freaking out just a tad).

thanks,

Katie

Katie Cooper

May 24, 2011, 6:10:50 PM
to SiCortex Users

Also, the global clock completion table looks like this:

----------------------scboot-monitor----------------------
sec  kernel  fabric  initifs  slurm
119  108     108     108      0
global clock sync complete
144  108     108     108
Node boot complete

Larry Stewart

May 24, 2011, 6:38:29 PM
to sicorte...@googlegroups.com, SiCortex Users
The Lustre messages sound like the disk drives did not become accessible after the reboot.
You can log into the nodes that host the drives and use normal Linux tools to debug that part of it.

I assume you have a Fibre Channel storage array? Use the array tools to make sure it works, and use the HBA tools or just dmesg to check whether the HBA and Fibre Channel links came up.

Look back at old logs if you have them to see how normal boot messages on the storage nodes should look.

-Larry

Aaron Brooks

May 24, 2011, 9:30:43 PM
to sicorte...@googlegroups.com
Katie,

Was the storage array powered down during the maintenance? Was it powered back up prior to the SC648 boot?

-Aaron

Katie Cooper

May 24, 2011, 11:57:43 PM
to SiCortex Users

Hi Aaron & Larry,

I followed the directions on the sysadmin guide (see below). The
power was off in the room for four hours so I'm assuming it was then
powered down. I wasn't there for the power up, but my guess is that
the power was just restored - not sure if it was unplugged and plugged
back in during the process. Would the storage array power back up
on its own once the power was back on or plugged back in, before
the SSP was actually turned on with the switch?

Is there anything to the SLURM and Hostname errors? Could this be a
problem of it not finding itself or where to look? I don't think that
IP addresses, etc, might have changed during the maintenance, but
WSU's website was hacked during the weekend as well (which I think is
also hosted in the same room), so there may have been some changes. I
just don't know. I feel like I might be grasping in the dark here.

thanks,
katie

Katie Cooper

May 25, 2011, 12:08:48 AM
to SiCortex Users
This is what I followed for the shutdown process:

1. To stop a partition from accepting jobs, set its state to “drain”:
sysadmin@ssp # sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
sc1 up infinite 108 idle sc1-m0n[0-26],sc1-m1n[0-26],\
sc1-m2n[0-26],sc1-m3n[0-26]
sysadmin@ssp # scontrol update NodeName=sc1-m0n[0-26],sc1-m1n[0-26],sc1-m2n[0-26],sc1-m3n[0-26] State=drain
sysadmin@ssp # sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
sc1 up infinite 108 drain sc1-m0n[0-26],sc1-m1n[0-26],\
sc1-m2n[0-26],sc1-m3n[0-26]
2. To see if there are any jobs still running on the system, use the
squeue command:
sysadmin@ssp # squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
16429 sc1 sleep-12 j_random R 0:09 4 sc1-m1n[3-21]
3. To kill all jobs remaining on the system, use scancel like this:
sysadmin@ssp # scancel -p sc1
sysadmin@ssp # squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
sysadmin@ssp #
4. When you run scboot to reboot the nodes, scboot resets the
nodes. However, Lustre file systems are not unmounted cleanly -
including both clients and various types of servers. To fix this, run
the script described below, to avoid having to wait for a period for
the recovery process to complete. The recovery contributes about
five minutes to the reboot time.
On the SSP, run the lustre_umount.sh script to unmount each
Lustre file system, as described in Unmounting a Lustre File System
with a Script on page 194.
[Source: System Administration Guide (PN 2905-03 Rev. 01), "Manual System Shutdown", p. 252]
5. Run umount commands to unmount any other file systems you may
have mounted at boot time or later.
6. On the SSP, type the following command to shut down the system:
# shutdown -h now
This command powers down the SSP. However, it does not shut
down the nodes. The MSPs and the nodes continue to receive
power and to run at a basic level.
7. Power down to complete the shutdown:
SC072: To power down the MSP and nodes on an SC072, execute
any required commands to gracefully shut down all applications
and the desktop environment. Then press the power rocker
switch to power off the system.
SC648: To power down the MSPs and nodes on an SC648, disconnect
the main system power cord(s) from the wall.
NOTE: After you have completed all the procedures detailed
above, it is safe to pull the plug to complete the power-off
sequence and to power down the nodes. The power system is
designed to automatically power down the hardware components
in a safe and controlled manner when the power plug is
disconnected from the power source.
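
Steps 2 and 3 of the procedure above amount to confirming that the queue is empty before shutting down. A minimal sketch of that check, assuming squeue prints one header line followed by one line per job, as in the sample output; count_jobs is a hypothetical helper name, not part of SLURM:

```shell
#!/bin/sh
# count_jobs: hypothetical helper. Reads squeue output on stdin and prints
# the number of job lines, skipping the single header line.
count_jobs() {
    awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }'
}

# Usage on the SSP (not run here; sc1 is the partition name from the guide):
#   squeue -p sc1 | count_jobs
```

A result of 0 would suggest it is safe to move on to unmounting Lustre; anything else means jobs are still listed.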

Lawrence Stewart

May 25, 2011, 12:41:17 AM
to sicorte...@googlegroups.com, Lawrence Stewart
Sounds like you did everything correctly. The purpose for all this stuff is simply
to shut down cleanly. If there had been a sudden power failure, none of this
would have gotten done. Just like with desktop machines, the real purpose of carefully shutting
down is to avoid having to do filesystem checks when it comes back up.

It really sounds, from your original description, like the storage array has not come back to life. Everything else sounds fine. The global clock sync and scboot messages look fine.

I'm guessing that you can run programs on all the nodes just fine as well, such as
"srun -p sc1 -N 108 hostname" or similar

From the messages, lustre didn't start only because the disks that it uses were not found by the linux kernel on the nodes connected to the storage. That is the place to start.

Just use ssh to log into the storage nodes and see what happened there. All the usual linux
commands and utilities for diagnosing storage problems are there.

On the ssp, you can also look at the syslog logs for the storage nodes in /var/log/nodes.../...
to make sure that the fibre channel links to the storage have come up. Here is what they
look like, for example, with Qlogic HBAs connected to a Promise storage array:

2011-01-05T11:11:21-05:00 <kern:info> scx-m4n3 [4294767.434000] QLogic Fibre Channel HBA Driver
2011-01-05T11:11:21-05:00 <kern:warning> scx-m4n3 [4294767.508000] PCI: Enabling device 0000:01:00.0 (0000 -> 0003)
2011-01-05T11:11:21-05:00 <kern:warning> scx-m4n3 [4294767.508000] PMI error interrupts enabled
2011-01-05T11:11:21-05:00 <kern:info> scx-m4n3 [4294767.509000] qla2400 0000:01:00.0: Found an ISP2432, irq 23, iobase 0x9000000800000000
2011-01-05T11:11:21-05:00 <kern:info> scx-m4n3 [4294767.510000] qla2400 0000:01:00.0: Configuring PCI space...
2011-01-05T11:11:21-05:00 <kern:info> scx-m4n3 [4294767.511000] qla2400 0000:01:00.0: Configure NVRAM parameters...
2011-01-05T11:11:21-05:00 <kern:info> scx-m4n3 [4294767.524000] qla2400 0000:01:00.0: Verifying loaded RISC code...
2011-01-05T11:11:21-05:00 <kern:info> scx-m4n3 [4294767.682000] qla2400 0000:01:00.0: Allocated (64 KB) for EFT...
2011-01-05T11:11:21-05:00 <kern:info> scx-m4n3 [4294767.682000] qla2400 0000:01:00.0: Allocated (1413 KB) for firmware dump...
2011-01-05T11:11:21-05:00 <kern:info> scx-m4n3 [4294767.684000] qla2400 0000:01:00.0: Waiting for LIP to complete...
2011-01-05T11:11:21-05:00 <kern:info> scx-m4n3 [4294767.990000] qla2400 0000:01:00.0: LIP reset occured (f7e2).
2011-01-05T11:11:21-05:00 <kern:info> scx-m4n3 [4294767.991000] qla2400 0000:01:00.0: LIP occured (f7e2).
2011-01-05T11:11:21-05:00 <kern:info> scx-m4n3 [4294767.992000] qla2400 0000:01:00.0: LOOP UP detected (4 Gbps).
2011-01-05T11:11:21-05:00 <kern:info> scx-m4n3 [4294767.995000] qla2400 0000:01:00.0: Topology - (Loop), Host Loop address 0x0
2011-01-05T11:11:21-05:00 <kern:info> scx-m4n3 [4294768.006000] scsi0 : qla2xxx
2011-01-05T11:11:21-05:00 <kern:info> scx-m4n3 [4294768.007000] qla2400 0000:01:00.0:
2011-01-05T11:11:21-05:00 <kern:warning> scx-m4n3 [4294768.007000] QLogic Fibre Channel HBA Driver: 8.01.07.15-fo
2011-01-05T11:11:21-05:00 <kern:warning> scx-m4n3 [4294768.007000] QLogic QEM2462 - QLE2440
2011-01-05T11:11:21-05:00 <kern:warning> scx-m4n3 [4294768.007000] ISP2432: PCIe (2.5Gb/s x4) @ 0000:01:00.0 hdma+, host#=0, fw=4.00.26 [IP]
2011-01-05T11:11:21-05:00 <kern:notice> scx-m4n3 [4294768.007000] Vendor: Promise Model: VTrak E310f Rev: 0331
2011-01-05T11:11:21-05:00 <kern:notice> scx-m4n3 [4294768.007000] Type: Direct-Access ANSI SCSI revision: 05
2011-01-05T11:11:21-05:00 <kern:info> scx-m4n3 [4294768.007000] qla2400 0000:01:00.0: scsi(0:0:0:0): Enabled tagged queuing, queue depth 32.
2011-01-05T11:11:21-05:00 <kern:notice> scx-m4n3 [4294768.007000] SCSI device sda: 1463812096 512-byte hdwr sectors (749472 MB)
2011-01-05T11:11:21-05:00 <kern:notice> scx-m4n3 [4294768.008000] sda: Write Protect is off
2011-01-05T11:11:21-05:00 <kern:notice> scx-m4n3 [4294768.008000] SCSI device sda: drive cache: write back w/ FUA
2011-01-05T11:11:21-05:00 <kern:notice> scx-m4n3 [4294768.008000] SCSI device sda: 1463812096 512-byte hdwr sectors (749472 MB)
2011-01-05T11:11:21-05:00 <kern:notice> scx-m4n3 [4294768.008000] sda: Write Protect is off
2011-01-05T11:11:21-05:00 <kern:notice> scx-m4n3 [4294768.009000] SCSI device sda: drive cache: write back w/ FUA
2011-01-05T11:11:21-05:00 <kern:info> scx-m4n3 [4294768.009000] sda: unknown partition table
2011-01-05T11:11:21-05:00 <kern:notice> scx-m4n3 [4294768.009000] sd 0:0:0:0: Attached scsi disk sda
2011-01-05T11:11:21-05:00 <kern:notice> scx-m4n3 [4294768.009000] sd 0:0:0:0: Attached scsi generic sg0 type 0
2011-01-05T11:11:21-05:00 <kern:notice> scx-m4n3 [4294768.010000] Vendor: Promise Model: VTrak E310f Rev: 0331
2011-01-05T11:11:21-05:00 <kern:notice> scx-m4n3 [4294768.010000] Type: Direct-Access ANSI SCSI revision: 05
2011-01-05T11:11:21-05:00 <kern:info> scx-m4n3 [4294768.010000] qla2400 0000:01:00.0: scsi(0:0:0:2): Enabled tagged queuing, queue depth 32.
2011-01-05T11:11:21-05:00 <kern:notice> scx-m4n3 [4294768.010000] SCSI device sdb: 1463812096 512-byte hdwr sectors (749472 MB)
2011-01-05T11:11:21-05:00 <kern:notice> scx-m4n3 [4294768.011000] sdb: Write Protect is off
... and other messages for the other logical volumes exported by the array.
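
The log check described here comes down to finding, for each HBA port, whether the driver logged "LOOP UP detected" (as in the sample above) or "Cable is unplugged" (a message that appears elsewhere in this thread). A rough sketch, assuming each syslog entry sits on a single line in the log file and that these two message texts are the only ones of interest; fc_link_status is a hypothetical helper name, and other driver versions may word the messages differently:

```shell
#!/bin/sh
# fc_link_status: hypothetical helper. Reads syslog text from the named
# files (or stdin) and prints "<pci-address> up" or "<pci-address> unplugged"
# for each qla2xxx port, based on the last matching message seen.
fc_link_status() {
    awk '
        /LOOP UP detected/ || /Cable is unplugged/ {
            for (i = 1; i < NF; i++) {
                if ($i ~ /^qla2/) {
                    port = $(i + 1)          # e.g. "0000:01:00.0:"
                    sub(/:$/, "", port)      # drop the trailing colon
                    state[port] = ($0 ~ /LOOP UP detected/) ? "up" : "unplugged"
                    break
                }
            }
        }
        END { for (p in state) print p, state[p] }
    ' "$@" | sort
}
```

It would be run on the SSP against a storage node's log file. The thread elides the exact path under /var/log/nodes..., so none is filled in here.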

I suspect that your storage array didn't come on again, and that has nothing to do with
scboot or the rest of the machine. Debug the array and its connections.

You can also just check for blinking lights. I'm assuming you have Fibre Channel cards; they should have lights to show the links are working. The storage array should have lights to show that it is
alive as well.

-L

Katie Cooper

May 25, 2011, 2:09:34 AM
to SiCortex Users

ah ha!...

could it truly be just an unplugged cable????

2011-05-24T15:05:35-07:00 <syslog:notice> sci-m0n3 syslog-ng[3424]:
syslog-ng starting up; version='2.0.9'
2011-05-24T15:05:35-07:00 <kern:warning> sci-m0n3 [4294680.868000]
[cpu 4] slow irq startup : Fabric Link (21)
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294680.872000] scfab:
initialized 14 DMA devices
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294706.598000] SCIO
version 2008-07-29
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294706.598000] scio:
getting node id from scfab
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294706.598000] scio:
node id is 3
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294706.602000]
qbs_per_peer = 93
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294706.645000]
quota_avail = 9392
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294706.834000]
SCethernet version 2008-01-24
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294707.158000]
xmit_order = 7
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294707.159000]
recv_order = 7
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294727.468000] QLogic
Fibre Channel HBA Driver
2011-05-24T15:05:35-07:00 <kern:warning> sci-m0n3 [4294727.505000]
PCI: Enabling device 0000:01:00.0 (0000 -> 0003)
2011-05-24T15:05:35-07:00 <kern:warning> sci-m0n3 [4294727.505000] PMI
error interrupts enabled
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294727.505000]
qla2400 0000:01:00.0: Found an ISP2432, irq 23, iobase
0x9000000800000000
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294727.507000]
qla2400 0000:01:00.0: Configuring PCI space...
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294727.508000]
qla2400 0000:01:00.0: Configure NVRAM parameters...
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294727.520000]
qla2400 0000:01:00.0: Verifying loaded RISC code...
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294727.678000]
qla2400 0000:01:00.0: Allocated (64 KB) for EFT...
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294727.678000]
qla2400 0000:01:00.0: Allocated (1413 KB) for firmware dump...
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294727.680000]
qla2400 0000:01:00.0: Waiting for LIP to complete...
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294729.721000]
qla2400 0000:01:00.0: Cable is unplugged...
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294729.721000]
scsi0 : qla2xxx
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294729.721000]
qla2400 0000:01:00.0:
2011-05-24T15:05:35-07:00 <kern:warning> sci-m0n3 [4294729.721000]
QLogic Fibre Channel HBA Driver: 8.01.07.15-fo
2011-05-24T15:05:35-07:00 <kern:warning> sci-m0n3 [4294729.721000]
QLogic QEM2462 - QLE2440
2011-05-24T15:05:35-07:00 <kern:warning> sci-m0n3 [4294729.721000]
ISP2432: PCIe (2.5Gb/s x4) @ 0000:01:00.0 hdma+, host#=0, fw=4.00.26
[IP]
2011-05-24T15:05:35-07:00 <kern:warning> sci-m0n3 [4294729.779000]
PCI: Enabling device 0000:01:00.1 (0000 -> 0003)
2011-05-24T15:05:35-07:00 <kern:warning> sci-m0n3 [4294729.779000] PMI
error interrupts enabled
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294729.779000]
qla2400 0000:01:00.1: Found an ISP2432, irq 23, iobase
0x9000000800004000
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294729.781000]
qla2400 0000:01:00.1: Configuring PCI space...
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294729.782000]
qla2400 0000:01:00.1: Configure NVRAM parameters...
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294729.794000]
qla2400 0000:01:00.1: Verifying loaded RISC code...
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294729.952000]
qla2400 0000:01:00.1: Allocated (64 KB) for EFT...
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294729.952000]
qla2400 0000:01:00.1: Allocated (1413 KB) for firmware dump...
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294729.954000]
qla2400 0000:01:00.1: Waiting for LIP to complete...
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294731.995000]
qla2400 0000:01:00.1: Cable is unplugged...
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294731.998000]
scsi1 : qla2xxx
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294731.998000]
qla2400 0000:01:00.1:
2011-05-24T15:05:35-07:00 <kern:warning> sci-m0n3 [4294731.998000]
QLogic Fibre Channel HBA Driver: 8.01.07.15-fo
2011-05-24T15:05:35-07:00 <kern:warning> sci-m0n3 [4294731.998000]
QLogic QEM2462 - QLE2440
2011-05-24T15:05:35-07:00 <kern:warning> sci-m0n3 [4294731.998000]
ISP2432: PCIe (2.5Gb/s x4) @ 0000:01:00.1 hdma+, host#=1, fw=4.00.26
[IP]
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294741.252000]
perfmon: SCB performance counters are enabled
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294741.252000]
perfmon: Per-core partitioning of the SCB performance counters is
enabled
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294741.252000]
perfmon: SCB histogram threshold is 1 (0x1)
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294741.252000]
perfmon: SCB sampling interval is 6 (0x6) or 2048 cclk cycles
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294741.252000]
perfmon: ScbPerfCtl DFL: 0x00000000000005f6 RSV:
0xffffffffffffffff
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294741.252000]
perfmon: ScbPerfHist DFL: 0x0000000000000001 RSV:
0xffffffffffffffff
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294741.252000]
perfmon: ScbPerfBuckNum DFL: 0x0000000000000000 RSV:
0xffffffffffffffff
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294741.252000]
perfmon: ScbPerfEna DFL: 0x0000000000000001 RSV:
0xffffffffffffffff
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294741.252000]
perfmon: ScbPerfBuckets DFL: 0x0000000000000000 RSV:
0xfffffffffffc0000
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294741.252000]
perfmon: ICE9B PMU detected, 48 PMCs, 48 PMDs, 44 counters (31 bits)
2011-05-24T15:05:35-07:00 <kern:info> sci-m0n3 [4294741.254000]
perfmon: ICE9B PMU installed
2011-05-24T15:05:40-07:00 <daemon:crit> sci-m0n3 automount[3536]:
bind_ldap_anonymous: lookup(ldap): Unable to bind to the LDAP server:
(default), error Can't contact LDAP server
2011-05-24T15:05:40-07:00 <daemon:crit> sci-m0n3 automount[3536]:
bind_ldap_anonymous: lookup(ldap): Unable to bind to the LDAP server:
(default), error Can't contact LDAP server
2011-05-24T15:05:41-07:00 <daemon:crit> sci-m0n3 automount[3536]:
bind_ldap_anonymous: lookup(ldap): Unable to bind to the LDAP server:
(default), error Can't contact LDAP server
2011-05-24T15:05:42-07:00 <daemon:notice> sci-m0n3 ntpd[3598]: ntpd
4.2...@1.1520 Fri Jul 4 12:21:10 UTC 2008 (1)
2011-05-24T15:05:42-07:00 <daemon:info> sci-m0n3 ntpd[3599]: precision
= 2.000 usec
2011-05-24T15:05:42-07:00 <daemon:info> sci-m0n3 ntpd[3599]: Listening
on interface #0 wildcard, 0.0.0.0#123 Disabled
2011-05-24T15:05:42-07:00 <daemon:info> sci-m0n3 ntpd[3599]: Listening
on interface #1 msp0, 172.31.100.103#123 Enabled
2011-05-24T15:05:42-07:00 <daemon:info> sci-m0n3 ntpd[3599]: Listening
on interface #2 lo, 127.0.0.1#123 Enabled
2011-05-24T15:05:42-07:00 <daemon:info> sci-m0n3 ntpd[3599]: Listening
on interface #3 sceth, 172.31.200.203#123 Enabled
2011-05-24T15:05:42-07:00 <daemon:info> sci-m0n3 ntpd[3599]: kernel
time sync status 0040
2011-05-24T15:05:44-07:00 <daemon:notice> sci-m0n3 rpc.statd[3655]:
Version 1.1.2 Starting
2011-05-24T15:05:44-07:00 <daemon:notice> sci-m0n3 rpc.statd[3655]:
Flags:
2011-05-24T15:05:44-07:00 <daemon:notice> sci-m0n3 rpc.statd[3655]:
statd running as root. chown /var/lib/nfs/sm to choose different user
2011-05-24T15:05:49-07:00 <auth:info> sci-m0n3 sshd[3790]: Server
listening on 0.0.0.0 port 22.

Lawrence Stewart

May 25, 2011, 8:10:42 AM
to Katie Cooper, sicorte...@googlegroups.com, Lawrence Stewart
Or the thing at the other end of the cable isn't turned on. -L

Katie Cooper

May 25, 2011, 2:18:59 PM
to SiCortex Users

Yep, the switches on the data array were flipped off during the
shutdown. I'm going to make a note to the facilities folks not to
flip off any switches during power downs - just to unplug.

It finally did come back on, but it is running incredibly sluggishly...

Larry Stewart

May 25, 2011, 2:34:19 PM
to sicorte...@googlegroups.com, drc...@gmail.com, SiCortex Users
Sluggish could mean the array has failed a drive and is rebuilding a RAID set. You might use the storage array GUI to check.

-Larry

Katie Cooper

May 27, 2011, 12:11:12 AM
to SiCortex Users
Quick question - I seem to have lost the domain name recognition
as well. Is that something that is stored on Lustre?



Larry Stewart

May 27, 2011, 9:18:11 PM
to sicorte...@googlegroups.com, SiCortex Users
Not usually!

You debug DNS problems on SiCortex systems with regular Linux methods.

The command dig (man dig) lets you try queries.
The file /etc/resolv.conf tells you what your name servers are.

Check if they are what you expect.
Check if you can ping the DNS server IP address.
Check if the server is actually working by logging into the server machine and trying dig there.
Restart the server if it isn't.

Often the nodes are set to use the SSP as the DNS server; the SSP runs "dnsmasq" as a combined DNS, DHCP, and TFTP server.
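
The checklist above can be sketched as a small script. This is an illustration under assumptions, not SiCortex tooling: list_nameservers and check_dns are hypothetical names, the ping -W timeout flag assumes Linux iputils, and the hostname to query is whatever you pass in.

```shell
#!/bin/sh
# list_nameservers: hypothetical helper. Prints the address on every
# "nameserver" line of a resolv.conf-style file (default /etc/resolv.conf).
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# check_dns: hypothetical helper. For each configured name server, report
# whether it answers a ping, then ask it directly for the given name.
check_dns() {
    name=$1
    conf=${2:-/etc/resolv.conf}
    for ns in $(list_nameservers "$conf"); do
        if ping -c 1 -W 2 "$ns" >/dev/null 2>&1; then
            echo "$ns: reachable"
        else
            echo "$ns: no ping reply"
        fi
        # +short prints only the answer; an empty result suggests no record
        dig @"$ns" "$name" +short +time=2 +tries=1
    done
}

# Usage (not run here):
#   check_dns some-host-you-expect-to-resolve
```

Running the same dig query on the server machine itself, as suggested above, helps separate a dead server from a network path problem.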

-Larry

Katie Cooper

May 28, 2011, 10:59:21 AM
to SiCortex Users

sorry, I mistyped... I meant hostname instead of domain name. As in,
when I log into the head node, it no longer prompts me with
persephone# but with the node name sci-m4n6#
