I'm new to this newsgroup, but I need some help. At work we have a Digital
AlphaServer 800, and every day this server powers off. In fact it happens
every 24 hours plus 4/5 (the time the machine takes to reboot).
There is no Unix log that helps me find the problem, but at boot the
AlphaServer BIOS phase displays that dke100 has sense key 'Not Ready'. I
think this is the time the hard disk needs to come up.
16:35.34 failed to send Start Unit to dke100.1.0.5.0
16:35.34 sense key = 'Not Ready' (04|02) from dke100.1.0.5.0
16:35.34 sense key = 'Not Ready' (04|03) from dke100.1.0
Between a startup and the hot reset there is nothing in the
/usr/adm/messages log.
Is there a way to determine what causes this behaviour?
Sorry for my bad English, I'm French.
What version of unix do you run?
It looks as if one disk fails and somehow this causes the system to
fail.
You write "is powering off". Do you mean that the system is actually
without mains power, no fans, no lights?
Or is the system in console mode, you can still type commands on the
>>> prompt?
How do you restart the system?
- press a button, or
- type: b -fl a
Hans
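The two numbers in parentheses on the 'Not Ready' console lines can be looked up in the standard SCSI tables. A minimal sketch, assuming the console prints them as the additional sense code and its qualifier (ASC|ASCQ) in hexadecimal; that reading of the console format is an assumption, and the function name decode_sense_line is made up for this example. Only the ASC 04 entries from the SCSI standard are listed.

```python
import re

# Subset of the standard SCSI ASC/ASCQ table (ASC 0x04 only).
ASC_ASCQ = {
    (0x04, 0x00): "Logical unit not ready, cause not reportable",
    (0x04, 0x01): "Logical unit is in process of becoming ready",
    (0x04, 0x02): "Logical unit not ready, initializing command required",
    (0x04, 0x03): "Logical unit not ready, manual intervention required",
}

# Assumed layout of a console line: sense key = '<key>' (<asc>|<ascq>) from <device>
LINE_RE = re.compile(
    r"sense key = '(?P<key>[^']+)' "
    r"\((?P<asc>[0-9A-Fa-f]{2})\|(?P<ascq>[0-9A-Fa-f]{2})\) "
    r"from (?P<dev>\S+)"
)

def decode_sense_line(line):
    """Return (device, sense key, description), or None if the line has no sense data."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    pair = (int(match.group("asc"), 16), int(match.group("ascq"), 16))
    description = ASC_ASCQ.get(pair, "unknown ASC/ASCQ pair")
    return match.group("dev"), match.group("key"), description

# The three console lines quoted in the original post.
console = [
    "16:35.34 failed to send Start Unit to dke100.1.0.5.0",
    "16:35.34 sense key = 'Not Ready' (04|02) from dke100.1.0.5.0",
    "16:35.34 sense key = 'Not Ready' (04|03) from dke100.1.0",
]
for line in console:
    decoded = decode_sense_line(line)
    if decoded is not None:
        print(decoded)
```

If that reading is right, 04|02 says the unit wants a start command, which fits the failed Start Unit line just before it.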
> On 13 jan, 09:58, Dup <david.du...@groupe3a.fr> wrote:
>> Hi,
>>
>> I'm new to this newsgroup, but I need some help. At work we have a Digital
>> AlphaServer 800, and every day this server powers off. In fact it happens
>> every 24 hours plus 4/5 (the time the machine takes to reboot).
>>
>> There is no Unix log that helps me find the problem, but at boot the
>> AlphaServer BIOS phase displays that dke100 has sense key 'Not Ready'. I
>> think this is the time the hard disk needs to come up.
>>
>> 16:35.34 failed to send Start Unit to dke100.1.0.5.0
>> 16:35.34 sense key = 'Not Ready' (04|02) from dke100.1.0.5.0
>> 16:35.34 sense key = 'Not Ready' (04|03) from dke100.1.0
>>
>> Between a startup and the hot reset there is nothing in the
>> /usr/adm/messages log.
>>
>> Is there a way to determine what causes this behaviour?
>>
>> Sorry for my bad English, I'm French.
>
> What version of unix do you run?
The version of Unix I run is: OSF1 V4.0 878 alpha
> It looks as if one disk fails and somehow this causes the system to
> fail.
Yes, I agree it looks like an HDD failure, but what disturbs me is that it
happens every 24 hours.
> You write "is powering off". Do you mean that the system is actually
> without mains power, no fans, no lights? Or is the system in console
> mode, you can still type commands on the
> >>> prompt?
Sorry, I made a mistake: the system doesn't power off but reboots
immediately (not a clean reboot). It seems to be a hard disk failure but
>
> How do you restart the system?
> - press a button, or
> - type: b -fl a
As I said just before, the system restarts and recovers automatically.
>
> Hans
Sorry for saying something wrong, and thank you.
Second mistake: they gave me the documentation for an AlphaServer 800, but
what we have is an AlphaServer 4000.
In the log I get a lot of errors about SCSI, but they are logged after the
system reboots.
A processor interrupt was generated by the
CACHEA Dynamic Ram controller and
ArBitration engine (DRAB) with an
indication that the CACHE backup battery
has failed or is low (needs charging).
Does anyone know if it can be a battery problem?
David,
I doubt it is a battery problem because once the system runs it has no
need for the battery.
In another post you wrote that the system is an AS 4000, not an 800.
That was not a problem, but it solved the mystery of the DKE device name:
the E means the disk sits on the fifth SCSI controller. An AS 800 has 4
internal disks and not many PCI slots to put a SCSI controller in.
It is rather rare for an AS 800 to have three SCSI controllers; five is
definitely a lot. On a 4000 it is quite possible.
If the system shuts down and reboots, then the failure of the disk
somehow affects Unix. So I'd guess that the DKE100 disk holds Unix data
or data structures, such as a page file.
It is rather odd that this happens every 24 hours. Does that mean it
happens at the same time of day as well?
And is something happening on the system at that time, like a big
batch job that starts?
Perhaps that job uses DKE100, or it uses a lot of memory, which causes
excessive paging to that disk.
On v5.0 there is a utility called sysman station, and this shows the
mounted filesystems and the physical disks that are part of the
filesystem. Can you find out to what filesystem DKE100 belongs?
Hans
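Whether the reboots follow the wall clock or the uptime can be checked by comparing a few boot timestamps. A minimal sketch, assuming the boot times have been noted by hand; the timestamps below are invented for illustration and are not taken from the machine, and the function names and the 10-minute tolerance are arbitrary choices for this example.

```python
from datetime import datetime, timedelta

def boot_intervals(boot_times):
    """Return the gaps between consecutive boot timestamps, oldest first."""
    ordered = sorted(boot_times)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]

def looks_periodic(intervals, period=timedelta(hours=24),
                   tolerance=timedelta(minutes=10)):
    """True if there is at least one gap and every gap is within tolerance of period."""
    return bool(intervals) and all(
        abs(gap - period) <= tolerance for gap in intervals
    )

# Hypothetical boot times, invented for illustration only.
boots = [
    datetime(2000, 1, 10, 16, 30),
    datetime(2000, 1, 11, 16, 35),
    datetime(2000, 1, 12, 16, 40),
]
gaps = boot_intervals(boots)
for gap in gaps:
    print(gap)
print("roughly every 24 hours:", looks_periodic(gaps))
```

If the boot time drifts later by the reboot duration each day, the trigger is probably counting from the previous boot. A cron-style job would fire at the same clock time instead.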
Many RAID controllers have a battery to back up a write cache.
This is necessary to ensure file system integrity in case of
a power failure.
Some RAID systems refuse to work if their battery is empty,
but I don't know much about the RAID controllers used in
Alphaservers.
Dennis
Dennis, you are correct but the OP reports just one disk that logs
errors.
With an empty battery in a RAID array I'd have expected several disks
to fail.
So my initial response was "bad disk" not "flat battery".
Hans
There is no specific job launched at this time, and no crontab entry
either; this is why I don't understand where it comes from. Our maintainer
will come this afternoon to help find the problem (if it's a hardware
problem).
If it were a disk failure, I would expect some Unix logs, but I see
nothing (/usr/adm/messages).
I confirm that the dke100 message appears at boot; it looks like the
console is waiting for the hard disk to come up.
Is DKE100 part of an Advanced File System (AdvFS) set?
Yesterday the maintainer changed our battery, and the system now has 1 day
and 30 minutes of uptime, so it seems that changing the battery fixed our
restart problem.
I'll keep an eye on it.
Hello David,
I suggest you get acquainted with Tru64 unix. Many things differ from
other unices, especially the system management.
The online documentation is found at:
http://h30097.www3.hp.com/docs/pub_page/V40G_DOCS/V40G_DOCLIST.HTM
The V4.0G documentation is sufficient for all the V4.x versions.
A few notes about finding error information on a Tru64 system.
The syslog output, where you can find information logged by the OS, is
spread over the log files found in /var/adm/syslog.dated/
The binary (mostly hardware) errors are logged in the binary error log,
which is not a text file and therefore needs to be read with a special
utility called uerf. Its documentation is found at:
http://h30097.www3.hp.com/docs/base_doc/DOCUMENTATION/V40G_HTML/APS2RFTE/RFPXXXXX.HTM
Usually it is easiest to start reading it from the end (last events)
towards the beginning, as follows:
# uerf -R
I hope this helps you.
Good luck!
Kari
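The same "newest first" approach can be applied to the text logs under /var/adm/syslog.dated/. A minimal sketch, assuming a Python interpreter is available on the machine, or that the log directory has been copied to one that has it; the helper names newest_files and tail are made up for this example and are not Tru64 tools.

```python
import os

def newest_files(root, limit=5):
    """Return up to `limit` regular files under `root`, most recently modified first."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                found.append((os.path.getmtime(path), path))
    found.sort(reverse=True)
    return [path for _mtime, path in found[:limit]]

def tail(path, count=20):
    """Return the last `count` lines of a text file (count must be positive)."""
    with open(path, errors="replace") as handle:
        return handle.readlines()[-count:]

if __name__ == "__main__":
    # Directory named in the post above; adjust if the logs were copied elsewhere.
    for path in newest_files("/var/adm/syslog.dated", limit=3):
        print("==>", path)
        print("".join(tail(path, 10)), end="")
```

After an unclean reboot, the last lines written before the restart are usually the most interesting ones, which is why the sketch prints only the tail of each file.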
Thanks for the documentation links; they will help me maintain our
Tru64 Unix.
To see what binary.errlog reports I use dia, which is the "new" tool for
reading binary.errlog, but maybe uerf is better?
Since the battery change in our disk bay, uptime is 3 days, so there has
been no restart since that day.
Thanks to all for the help. I will do my best to learn Tru64 and the
AlphaServer (management put me on this machine because I know Linux, but
there are quite a few differences ;).
And apologies for my bad English, I'm French :D
David.