I've got an Ultra60 at home running Solaris 9. Because of the heat here
over the last few weeks, the computer has occasionally panicked because
of PCI thermal warnings:
Aug 6 20:48:12 antares pcipsy: [ID 819770 kern.warning] WARNING: pci:
Thermal warning detected!
Aug 6 20:48:12 antares unix: [ID 836849 kern.notice]
Aug 6 20:48:12 antares ^Mpanic[cpu2]/thread=300016ac2a0:
Aug 6 20:48:12 antares unix: [ID 534432 kern.notice] dispatcher invoked
from high-level interrupt handler
And then syncs, dumps and powers off.
My question: can I monitor these environmental readings in a more
intelligent fashion? It would be nice if, for example, I could
program or script an early warning to my Xsession and, when things get
critical, perform a clean shutdown.
Would this be possible?
--
Dean C. Strik Eindhoven University of Technology
C.S...@tue.nl - de...@stack.nl -- http://www.ipnet6.org/
"This isn't right. This isn't even wrong." -- Wolfgang Pauli
> My question: can I monitor these environmental readings in a more
> intelligent fashion? It would be nice if, for example, if I could
> program or script a warning to my Xsession early and, when things get
> critical, perform a clean shutdown.
Yep; use prtdiag and/or prtpicl.
--
Rich Teer, SCNA, SCSA
President,
Rite Online Inc.
Voice: +1 (250) 979-1638
URL: http://www.rite-online.net
Anything in particular? The prtpicl -v output doesn't seem to contain
output related to temperature / thermal status. An E450 has temp-factors
and similar but it's not available on the SUNW,Ultra-60.
> Anything in particular? The prtpicl -v output doesn't seem to contain
> output related to temperature / thermal status. An E450 has temp-factors
> and similar but it's not available on the SUNW,Ultra-60.
Try prtpicl -v -c temperature-sensor:
rich@grover5527# prtpicl -v -c temperature-sensor
cpu (temperature-sensor, 350000038a)
:Label Die
:HighPowerOffThreshold 125
:HighShutdownThreshold 90
:HighWarningThreshold 85
:LowWarningThreshold 0
:LowShutdownThreshold -10
:LowPowerOffThreshold -20
:Temperature 54
:devfs-path /pci@1f,0/pmu@3/i2c@0/temperature@30:die_temp
:_class temperature-sensor
:name cpu
cpu-ambient (temperature-sensor, 3500000396)
:Label Ambient
:HighPowerOffThreshold 70
:HighShutdownThreshold 60
:HighWarningThreshold 40
:LowWarningThreshold 0
:LowShutdownThreshold -10
:LowPowerOffThreshold -20
:Temperature 36
:devfs-path /pci@1f,0/pmu@3/i2c@0/temperature@30:amb_temp
:_class temperature-sensor
:name cpu-ambient
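On machines where prtpicl does report temperature sensors, a warning script mostly needs to compare :Temperature with the thresholds printed in the same block. A minimal sketch in C, assuming the output format shown above. The struct, the function names and the three-state return value are made up for illustration, and treating a reading equal to a threshold as having reached it is an assumption:

```c
#include <stdio.h>
#include <string.h>

/* The properties of one sensor block that matter for an early warning. */
struct sensor {
    int temperature;    /* :Temperature */
    int high_warning;   /* :HighWarningThreshold */
    int high_shutdown;  /* :HighShutdownThreshold */
};

/*
 * Update *s from one property line such as ":Temperature 54".
 * Returns 1 if the line set one of the three fields, 0 otherwise
 * (node headers, :Label, :devfs-path, the Low* thresholds, ...).
 */
int sensor_parse_line(struct sensor *s, const char *line)
{
    char key[64];
    int value;

    if (sscanf(line, " :%63s %d", key, &value) != 2)
        return 0;
    if (strcmp(key, "Temperature") == 0)
        s->temperature = value;
    else if (strcmp(key, "HighWarningThreshold") == 0)
        s->high_warning = value;
    else if (strcmp(key, "HighShutdownThreshold") == 0)
        s->high_shutdown = value;
    else
        return 0;
    return 1;
}

/* 0 = normal, 1 = warn the user, 2 = start a clean shutdown. */
int sensor_state(const struct sensor *s)
{
    if (s->temperature >= s->high_shutdown)
        return 2;
    if (s->temperature >= s->high_warning)
        return 1;
    return 0;
}
```

A caller would feed it the lines of one sensor block (for instance read through popen()) and act on sensor_state() once the block is complete.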
No luck, I'm afraid:
# uname -a
SunOS antares 5.9 Generic_112233-07 sun4u sparc SUNW,Ultra-60
# prtpicl -v -c temperature-sensor
#
> > Try prtpicl -v -c temperature-sensor:
>
> No luck, I'm afraid:
>
> # uname -a
> SunOS antares 5.9 Generic_112233-07 sun4u sparc SUNW,Ultra-60
> # prtpicl -v -c temperature-sensor
Bummer. My output was from a SB 100; it's possible that
the Ultra 60 doesn't have the required temperature sensors.
I tried sending this earlier, but it seems to have gone to /dev/null
I think that is so. I looked at it once and was unable to get this
information on a U60. Nor is it possible on the Ultra 80.
Here are 12 ideas I can think of that might work - in no particular
order of preference. Some are perhaps not too great (silly even), but
might get you thinking. Most won't give you the actual temperature,
but will allow you to shut it down should it get too high.
1) If you are very lucky, there might just be a jumper on the
motherboard that causes an orderly shutdown if opened or closed, but
that is very unlikely. If there is, a thermal switch across that would
be all you need.
Thermal switches, based on a bi-metallic switch, are easy to come by
and cost only about $3 each. You can get them in normally open, or
normally closed. They can handle a lot of current (10 A typical). You
can get a wide range of trip temperatures.
2) Another possible option is reading the disk temperature, as newer
drives have S.M.A.R.T technology. I've had no luck trying to read the
disk temperature with *free software*, although in principle that
should be possible. There are a few Linux programs around to read the
disk temperature (smartmontools is one, I believe). I looked at porting
one such program to Solaris once, but gave up the idea. If you could
measure the disk temperature, that would be a clue that the case
temperature is getting too hot. Of course, it could mean the disk is
failing, but in that case shutting the system down might not be a bad
idea anyway.
Of course, a CPU fan failure would not be detected if the fans cooling
the disks were still running fine.
3) There is commercial software available for Solaris that can read
the disk temperature. I expect that could force a shutdown.
4) I implemented a system on some SS20's,
http://www.medphys.ucl.ac.uk/~davek/sun/cool.html
which seems to have worked well in the +37 deg C we have had in the UK
recently (and a lot hotter in my garage). These SS20 often have dual
processors and 10,000 rpm disks, which at best is a marginal situation
and one I would not wish to use on a system I relied on. But for what
I use these SS20's for, it is acceptable. My approach has been to cut
the +5 V power to the disks, which will cause the system to panic, but
for me that is acceptable. I'd rather have a panicked system than a
damaged one.
Of course you don't want a panicked system, but that is better than a
blown up one.
This has tripped a few times in the last week, but never before.
This needed (in the SS20 at least) no wires cut (read the above link).
That might not be so in the U60. The total outlay was about $6 per
machine.
5) If you have a spare serial port, I suspect it would be relatively
easy to write a bit of code that checks the status of one of the
lines and use that (with a thermal switch) to cause a shutdown -
search the web, as someone might have already done it.
There was a recent post on a Sun newsgroup about using the parallel
port on a Sun to detect a change on the input on one of the pins.
Greg Andrews (I think it was him) thought that would be difficult, but
said the serial port would be a lot easier. So you could use a program run
from cron every 5 minutes which checks the status of the XXX line on
the serial port, where the serial port is connected to a thermal
switch.
6) Another option I can think of would be to use a UPS with small
batteries in it. Build in some form of temperature sensor into the U60
and use that to cut the power line to the UPS. Most UPS's can be
configured to do an orderly shutdown via a serial port. Not very
elegant and would need a good understanding of electrical engineering
to build such a system safely.
7) There are systems designed for PCs that allow several sensors to be
attached. I think you can configure the temperature at which each one
trips. The output drives a buzzer, which would be adequate if the
system is always run attended, but not so if left unattended.
8) Adapt (7), triggering the serial port as in (5).
9) Repeatedly ping another machine, router or whatever, with a thermal
switch in the ethernet line (you might need two in this case, I don't
know much about the wiring of ethernet). If this ping fails, shut the
system down.
10) Have a normally open thermal switch in line with a power line to
the floppy drive, which has a mountable floppy in it. A cron job tries
every 5 mins to mount the floppy, which should always fail since the
floppy has no power. But if it does mount, it means the switch has
closed and a shutdown needs to be performed.
I think the reverse approach, successfully reading the floppy every
5 minutes, would soon damage the drive.
11) Maxim (a nice place, as you can easily get free samples from their
web site) make, or at least I think they make, thermal devices that
report the temperature via a serial port.
12) You could measure just ambient temperature, and use that on the
serial port as in (5). No need to go inside the U60 then. Use a couple
of bi-metallic strips in series (or parallel), with one measuring
ambient and the other exhaust temperature. If exhaust air temperature
rises, it probably indicates fan failure, as theoretically the
difference in temperature between input and output is proportional to
dissipation and inversely proportional to airflow.
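The relation in (12) can be written down as a steady-state heat balance for the air passing through the case. A small worked example follows; the 300 W dissipation and the 0.03 m^3/s airflow are made-up illustrative figures, not measurements from a U60, and the air properties are the usual room-temperature approximations:

```latex
\Delta T = T_{\mathrm{exhaust}} - T_{\mathrm{ambient}}
         = \frac{P}{\rho \, c_p \, \dot{V}}

% Assumed values: P = 300 W, \dot{V} = 0.03 m^3/s,
% \rho \approx 1.2 kg/m^3, c_p \approx 1005 J/(kg K)
\Delta T \approx \frac{300}{1.2 \times 1005 \times 0.03}
         \approx 8.3\ \mathrm{K}

% Halving the airflow (for example, one of two fans failing) doubles the rise:
\Delta T \approx \frac{300}{1.2 \times 1005 \times 0.015}
         \approx 16.6\ \mathrm{K}
```

So under these assumptions a failed fan shows up as a clearly larger exhaust-minus-ambient difference even when the room temperature itself has not changed.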
Well, there are some ideas. Most are stupid, I guess, but they might lead
you to think of others.
Dr. David Kirkby
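Idea (5) above could be sketched roughly as follows in C. The wiring is an assumption for illustration: a normally open thermal switch connected between the port's DTR output and its DCD input, so that DCD is only seen asserted once the switch has closed. The function names are hypothetical; TIOCMGET and the TIOCM_* bits are the standard termio modem-status interface:

```c
#include <sys/ioctl.h>
#include <termios.h>

/*
 * Read the modem-control bits of an already opened serial port.
 * Returns 0 on success, -1 if the ioctl fails (errno is set).
 */
int read_modem_bits(int fd, int *bits)
{
    return ioctl(fd, TIOCMGET, bits) == -1 ? -1 : 0;
}

/*
 * Assumed wiring: switch between DTR and DCD, so carrier detect
 * being asserted means the thermal switch has closed.
 */
int switch_closed(int bits)
{
    return (bits & TIOCM_CAR) != 0;
}
```

A cron job would open the port with O_RDONLY | O_NONBLOCK | O_NOCTTY (without O_NONBLOCK the open itself may wait for carrier), call read_modem_bits(), and start a shutdown if switch_closed() returns 1.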
Thank you for your verbose answer. I'll take a closer look in the
morning when I'm more awake ;) It seems very unfortunate though that
ordinary temperature monitoring is not available.
I'm not sure if disk temperature monitoring gives a good indication of
the system temperature though.
Cheers,
> > Here are 12 ideas I can think of that might work - in no particular
> > order of preference. Some are perhaps not too great (silly even), but
> > might get you thinking. Most won't give you the acutal temperature,
> > but will allow you to shut it down should it get too high.
>
> Thank you for your verbose answer. I'll take a closer look in the
> morning when I'm more awake ;) It seems very unfortunate though that
> ordinary temperature monitoring is not available.
Some are certainly better than others, but I'd be interested in your
views.
> I'm not sure if disk temperature monitoring gives a good indication of
> the system temperature though.
>
> Cheers,
>
> --
> Dean C. Strik Eindhoven University of Technology
Well, your system consists of many parts: CPU, disks, motherboard, PCI
cards. These will likely all be at different temperatures. Hence you
could (should) monitor the temperature at several points.
In the SS20's at least, the disks receive less cooling than the cpu or
motherboard, so measuring the temperature there seemed the sensible
solution.
Unlike the CPU or other components, I can at least determine what is
acceptable from the disk's data sheet. Not knowing what an
acceptable temperature is for your motherboard, CPU etc. will always be
an issue. I guess you will have to measure data over a period of weeks
and find what is normal and what is not.
It might well be true that the disk warms up a lot when it is thrashed,
so setting a maximum temperature for the disk might not be a great
idea. However, to my knowledge, it's the only component inside the U60
that has a temperature sensor on it, so it is the only way this can be
done purely in software.
I checked the specs the other day for the U80 and found it is rated to
+45 deg C ambient. So measuring ambient might be sensible in some
cases.
--
Dr. David Kirkby,
Senior Research Fellow,
Department of Medical Physics,
University College London,
11-20 Capper St, London, WC1E 6JA.
Tel: 020 7679 6408 Fax: 020 7679 6269
Internal telephone: ext 46408
e-mail da...@medphys.ucl.ac.uk
>I think that is so. I looked at it once and was unable to get this
>information on a U60. Neither is is possible on the Ultra 80.
>Here are 12 ideas I can think of that might work - in no particular
>order of preference. Some are perhaps not too great (silly even), but
>might get you thinking. Most won't give you the acutal temperature,
>but will allow you to shut it down should it get too high.
I don't think these systems have readable temperature sensors either;
however, I think the Ultra-2 and up all have overtemp protection
(I witnessed an Ultra-2 go down once when it was put on its side,
blocking the airvents)
Checking the code and some systems here it appears that the indication
that this feature is present and enabled is the "thermal-interrupt"
boolean property in the PCI or Sbus controller; "prtconf -vp|grep thermal"
should list one or more lines like:
thermal-interrupt:
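The same check can be done from a program instead of grep. A minimal sketch, assuming only that the property appears by name in the text output of prtconf -vp; the function name is made up, and lines longer than the buffer are read in pieces, which is good enough for a rough count:

```c
#include <stdio.h>
#include <string.h>

/* Count the lines of a text stream that mention "thermal-interrupt". */
int count_thermal_props(FILE *in)
{
    char line[256];
    int count = 0;

    while (fgets(line, sizeof line, in) != NULL) {
        if (strstr(line, "thermal-interrupt") != NULL)
            count++;
    }
    return count;
}
```

The stream would typically come from popen("prtconf -vp", "r"); a result of zero suggests the overtemperature protection described here is not present or not enabled.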
(Getting the disk temperature seems like a nice way to do this too)
Casper
--
Expressed in this posting are my opinions. They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.
There seem to be two such lines in my U80, suggesting that it won't melt,
which is nice to know.
> (Getting the disk temperature seems like a nice way to do this too)
I think it would be useful. It has the advantage that the sensor is
already there. It's just a shame the Linux tools that exist for measuring
disk temperature are too Linux-specific.
Dr. David Kirkby PhD,
Web page: http://www.medphys.ucl.ac.uk/~davek
>I think it would be useful. It has the advantage the sensor is already
>there. It's just a shame the Linux tools that exist for measuring disk
>temperature are too linux specific.
If I knew where the information was stored, I could possibly write
such a tool for Solaris quite easily.
Casper
I can probably find a bit more info, but won't be able to for a week at
least. Let me know if this is sufficient. Stick your response on the
newsgroup, as I can more easily read that from home, where I'll be
working from for the next week or so.
There is some information on this in the Seagate technical manuals, for
at least one of the Cheetah drives I looked at. The smartmontools
utility mentioned has more information about where the data is stored on
the disk. It's just that the code to get the data out is Linux-specific.
Dr. David Kirkby.
----------
David,
You want LOG SENSE page 0xd. Here is a list of log sense
pages supported by my Fujitsu MAM3184:
# sg_logs /dev/sg0
Supported pages:
0x00 Supported log pages
0x01 Buffer over-run/under-run
0x02 Error counters (write)
0x03 Error counters (read)
0x05 Error counters (verify)
0x06 Non-medium errors
0x0d Temperature
0x0e Start-stop cycle counter
0x0f Application client
0x10 Self-test results
0x2f Informational exceptions (SMART)
0x38
and here is page 0xd :
# sg_logs /dev/sg0 -p=d
Temperature page
Current temperature= 41 C
Reference temperature= 65 C
sg_logs is a utility in sg3_utils for Linux. It
is found at http://www.torque.net/sg
smartmontools also prints out the temperature with the
'-a' switch. I'm working on its SCSI code at the moment.
Hopefully it will be ported to other versions of Unix
soon.
It may be worth checking if SCU prints out LOG SENSE
information.
BTW The SMART "standard" seems to have fallen into a
void. In SCSI documents at www.t10.org look for the
awkward term "informational exceptions". Vendors
refer to an "IEC" document (X3T10/94-190) but I can't
find it anywhere.
---------
> The smartmontools
> utility mentioned has more information about where the data is stored on
> the disk. It's just that the code to get the data out is linux specific.
> smartmontools also prints out the temperature with the
> '-a' switch. I'm working on its SCSI code at the moment.
> Hopefully it will be ported to other versions of Unix
> soon.
As far as I know, smartmontools under linux should run OK (starting
with release 5.1-16) on big endian and 64-bit hardware, such as SPARC.
I don't know of anyone doing a Solaris port of smartmontools, but if one
is done, I would be delighted if the person who did it could join
smartmontools as a developer so that the code could be merged back
into the smartmontools CVS tree.
Cheers,
Bruce
> > smartmontools also prints out the temperature with the
> > '-a' switch. I'm working on its SCSI code at the moment.
> > Hopefully it will be ported to other versions of Unix
> > soon.
The problem with something like smartmontools, which looks a
useful but nontrivial program, is that it will never in a million years
get ported to all the operating systems where people would find it
helpful.
I looked at a Solaris port before (you may recall we had some emails
on it), but I gave up, as my knowledge of SCSI is just too small. The
effort involved for me to find out how to do this would be huge.
Casper Dik has stated on this thread if he knew where the information
was stored, he could possibly write
something to measure disk temperature on Solaris quite easily. (I
assume Casper is talking about SCSI disks, as the thread started about
an Ultra 60). This is my understanding of where the information is
stored.
Looking at version 5.1-17 (the latest as I write this) of
smartmontools
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/smartmontools/sm5/
I see this function in scsicmds.c
int scsiGetTemp(int device, UINT8 *currenttemp, UINT8 *triptemp)
{
UINT8 tBuf[252];
int err;
if ((err = scsiLogSense(device, TEMPERATURE_PAGE, tBuf,
sizeof(tBuf), 0))) {
*currenttemp = 0;
*triptemp = 0;
pout("Log Sense for temperature failed [%s]\n",
scsiErrString(err));
return err;
}
*currenttemp = tBuf[9];
*triptemp = tBuf[15];
return 0;
}
TEMPERATURE_PAGE is defined in the header as 0x0d, so I assume you
read up to 252 bytes of data from this page into tBuf, with the current
temperature located at offset 9 and the trip temperature at offset 15.
Looking at the function scsiPrintTemp(), it would appear you print
these numbers as integers, so it looks like offset 9 contains
the current temperature in Celsius and offset 15 the trip temperature
in Celsius. So the data is stored to a resolution of 1 deg Celsius.
Have I understood that all correctly?
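The layout described above can be turned into a small decoder that works on a buffer already filled by LOG SENSE, independent of how the command was sent. This is a sketch under assumptions: the function name is made up, it relies on the same fixed offsets 9 and 15 as scsiGetTemp() rather than walking the parameter list, and treating 0xff as "not available" reflects my understanding of the SCSI temperature log page rather than anything stated in this thread:

```c
#include <stddef.h>
#include <stdint.h>

#define TEMPERATURE_PAGE 0x0d
#define TEMP_UNAVAILABLE 0xff   /* assumed "no reading" marker */

/*
 * Decode a temperature log page returned by LOG SENSE.
 * Returns 0 and fills *current and *trip (degrees Celsius) on success.
 * *trip is set to -1 if the drive reports no trip temperature.
 * Returns -1 if the buffer is too short, is not page 0x0d, or carries
 * no current reading.
 */
int decode_temperature_page(const uint8_t *buf, size_t len,
                            int *current, int *trip)
{
    if (len < 16 || (buf[0] & 0x3f) != TEMPERATURE_PAGE)
        return -1;
    if (buf[9] == TEMP_UNAVAILABLE)
        return -1;
    *current = buf[9];
    *trip = (buf[15] == TEMP_UNAVAILABLE) ? -1 : buf[15];
    return 0;
}
```

Checking the page code in byte 0 guards against printing two arbitrary bytes as temperatures when a drive answers with something other than the requested page.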
> As far as I know, smartmontools under linux should run OK (starting
> with release 5.1-16) on big endian and 64-bit hardware, such as SPARC.
It appears this is not the case. I can't compile it on my Sun Ultra
80, since it seems to want a Linux header file linux/hdreg.h
> I don't know of anyone doing smartmontools a Solaris port, but if one
> is done, I would be delighted if the person who did it could join
> smartmontools as a developer so that the code could be merged back
> into the smartmontools CVS tree.
If Casper could put something together to just read the temperature,
it might aid you in understanding how it could be done on Solaris.
I can't help feel it would be better if the SCSI code used libscg from
Jörg Schilling (the author of the very portable cdrecord many of us
use). Jörg has stated
that libscg gives a platform independent SCSI transport interface.
Unless the SCSI code is written in a portable way, I think such
utilities are likely to be unavailable on many systems, despite the fact
you have clearly put a lot of work into the tool. It would be very
nice if it could be run on any system cdrecord could be run on, which
seems to be just about anything.
I'd like to help on this, but feel my knowledge of SCSI is too small.
You clearly understand it under Linux, and Casper under Solaris.
Perhaps something useful will come of it all.
--
Dr. David Kirkby,
On the ATA side of smartmontools, provided that gcc and glibc are
available, there is only a single function that needs to be ported to
a new OS to make it work. This is linux_ata_command_interface() in
atacmds.c.
On the SCSI side, the OS dependent parts are more spread out -- but it
might be possible to fix this.
> I looked at a Solaris port before (you may recall we had some emails
> on it), but I gave up, as my knowledge of SCSI is just too small. The
> effort involved for me to find out how to do this would be huge.
<SNIP>
> > As far as I know, smartmontools under linux should run OK (starting
> > with release 5.1-16) on big endian and 64-bit hardware, such as SPARC.
>
> It appears this is not the case. I can't compile it on my Sun Ultra
> 80, since it seems to want a Linux header file linux/hdreg.h
Is this a linux system? I wrote "..smartmontools under linux...". If
so, let me know and I'll move the relevant parts of hdreg.h into
atacmds.h.
> > I don't know of anyone doing smartmontools a Solaris port, but if one
> > is done, I would be delighted if the person who did it could join
> > smartmontools as a developer so that the code could be merged back
> > into the smartmontools CVS tree.
>
> If Casper could put something together to just read the temperature,
> it might aid you in understanding how it could be done on Solaris.
>
> I can't help feel it would be better if the SCSI code used libscg from
> Jörg Schilling (the author of the very portable cdrecord many of us
> use).
I'll ask our SCSI expert if this would be possible.
Cheers,
Bruce
> On the SCSI side, the OS dependent parts are more spread out -- but it
> might be possible to fix this.
I imagine (and may be wrong) that most Linux users use IDE disks and
most real UNIX systems use SCSI disks. I know Sun do some machines
with IDE disks, but certainly every UNIX box I own has SCSI. Hence
unless people make the greater effort to port the SCSI code, there
won't be ports to Solaris, AIX, IRIX, HP-UX, Tru64 etc., and you are
probably severely restricting the user base.
> > > As far as I know, smartmontools under linux should run OK (starting
> > > with release 5.1-16) on big endian and 64-bit hardware, such as SPARC.
> >
> I can't compile it on my Sun Ultra 80
>
> Is this a linux system?
My stupidity, I did not see the linux bit. I'm running Solaris 9 on
SPARC, not Linux on SPARC!
> > I can't help feel it would be better if the SCSI code used libscg from
> > Jörg Schilling (the author of the very portable cdrecord many of us
> > use).
>
> I'll ask our SCSI expert if this would be possible.
I think that would be well worth a try. Jörg Schilling once said to me
there were some template files and that starting from 'readcd' and
adapting that would be sensible. Unfortunately, I hate to say it Jörg,
but the C code is quite poorly commented, so makes life a bit
difficult - for me at least.
I think for complex software, a huge number of README's and man pages
is not the best way of documenting programs either. HTML, FAQ's,
hyperlinks ... etc is needed.
But if smartmontools could be adapted to use Jörg Schilling's code,
that (I believe) would give you a portable system, which does not need
a re-write for each OS.
You did not comment on my understanding of where the temperature data
is stored. If I'm right (see previous post by me), that might be
enough to get Casper to produce a simple temperature-only measuring
program for Solaris, which would go a long way to solving the original
poster's problem, I feel.
--
Dr. David Kirkby,
>> On the SCSI side, the OS dependent parts are more spread out -- but it
>> might be possible to fix this.
>I imagine (and may be wrong) that most linux users use IDE disks and
>most real UNIX systems use SCSI disks. I know Sun do some machines
>with IDE disks, but certainly every UNIX box I own has SCSI. Hence
>unless people make the greater effort to use port the SCSI code, there
>won't be ports to Solaris, AIX, IRIX, HP-UX, Tru64 etc, you are
>probably severely restricting the user base.
Shouldn't SCSI code be easier to port, or to write again, at least on
systems with a proper USCSI implementation (i.e., not Linux if I listen
to Jörg, but Solaris is pretty darn close)?
I've no idea. The problem is there are far more Linux developers than
Solaris ones, so these sorts of programs tend to run on linux only.
Maybe Solaris code would be easier to write, but few know how to do
it.
I don't know if you saw my earlier post, but I think I've worked out
where the information on temperature is stored on the disks, if you
fancy putting something together quickly for that purpose. Log sense
page 0x0d, offset 9 bytes for the temperature and offset 15 bytes for
a trip temperature. Data is stored in Celsius. (See earlier post for a
bit more details).
A portable version of smartmontools would be nice, but a program to
measure disk temperature on Solaris would I'm sure interest many here.
A significant rise in temperature would indicate a problem and allow a
system to be shut down gracefully.
perhaps, but the number of good Linux developers may not be much
different than the number of good Solaris developers. a million
monkeys pounding on keyboards can produce just as many crappy
"version 0.1.3.7b-rc3" reinvented wheels as the linux
community has.
>so these sorts of programs tend to run on linux only.
>Maybe Solaris code would be easier to write, but few know how to do
>it.
the Solaris USCSI interface is easy to use and smartmontools
isolates the os-specific code to one function in scsicmds.c.
it would not be hard to write a do_scsi_cmnd_io for Solaris if
someone had a few free hours.
it took 10 minutes to write a small throw-away program based
on the information in smartmontools' scsicmds.c that uses USCSI to read
just the temperature log sense page and print out the current and trip
temperatures:
% su
Password:
# ./a.out
usage: ./a.out <device>
# ./a.out /dev/rdsk/c0t3d0s2
current temp = 38 C, trip temp = 65 C
# ./a.out /dev/rdsk/c0t1d0s2
current temp = 29 C, trip temp = 65 C
the second disk was spun down (power-managed) and querying the
temperature waited until it spun up. it is noticeably cooler
than the disk that was already running. the ambient case
temperature near the cpu is 29C according to picl so a spun-down
disk is about the same as the ambient system temperature but a
spinning disk is generating its own heat and runs warmer than
the ambient system temperature.
Well you are obviously better than me, as I've been looking at the
uscsi man page and am quite confused by it all.
ioctl(int fildes, int request, struct uscsi_cmd *cmd);
I realise what the first two are, and the third is a pointer to a
structure of type uscsi_cmd. However, working out what the hell one
does with that is not easy, for me at least. Some bits of
it are easy to understand, but others are far less so.
struct uscsi_cmd {
    int     uscsi_flags;        /* read, write, etc. see below */
    short   uscsi_status;       /* resulting status */
    short   uscsi_timeout;      /* Command Timeout */
    caddr_t uscsi_cdb;          /* cdb to send to target */
    caddr_t uscsi_bufaddr;      /* i/o source/destination */
    size_t  uscsi_buflen;       /* size of i/o to take place */
    size_t  uscsi_resid;        /* resid from i/o operation */
    uchar_t uscsi_cdblen;       /* # of valid cdb bytes */
    uchar_t uscsi_rqlen;        /* size of uscsi_rqbuf */
    uchar_t uscsi_rqstatus;     /* status of request sense cmd */
    uchar_t uscsi_rqresid;      /* resid of request sense cmd */
    caddr_t uscsi_rqbuf;        /* request sense buffer */
    void    *uscsi_reserved_5;  /* Reserved for Future Use */
};
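The uscsi_cdb and uscsi_cdblen fields carry the raw SCSI command block, so part of the puzzle is just knowing which bytes go into it. A minimal sketch for the 10-byte LOG SENSE command, kept free of any Solaris headers so it can be tried anywhere; the function name and the PC_CUMULATIVE macro are made up for illustration, while the opcode 0x4d and page 0x0d are the values discussed in this thread:

```c
#include <stdint.h>
#include <string.h>

#define LOG_SENSE     0x4d  /* SCSI opcode */
#define PC_CUMULATIVE 0x40  /* page control 01b in bits 7-6 of byte 2 */

/*
 * Fill a 10-byte LOG SENSE CDB asking for one log page.
 * alloc_len is the number of bytes the caller's data buffer can hold;
 * it is stored big-endian in bytes 7 and 8.
 */
void build_log_sense_cdb(uint8_t cdb[10], int page, uint16_t alloc_len)
{
    memset(cdb, 0, 10);
    cdb[0] = LOG_SENSE;
    cdb[2] = (uint8_t)(PC_CUMULATIVE | (page & 0x3f));
    cdb[7] = (uint8_t)(alloc_len >> 8);
    cdb[8] = (uint8_t)(alloc_len & 0xff);
}
```

With USCSI, the resulting array would be what uscsi_cdb points at, with uscsi_cdblen set to 10, uscsi_bufaddr and uscsi_buflen describing the data buffer, and uscsi_flags set to USCSI_READ because data flows from the disk to the host.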
> it took 10 minutes to write a small throw-away program based
> on the information in smartmontools' scsicmd.c that uses USCSI to read
> just the temperature log sense page and print out the current and trip
> temperatures:
>
> # ./a.out /dev/rdsk/c0t3d0s2
> current temp = 38 C, trip temp = 65 C
Well, how about making that code public? Even in a throw-away state, it
might help some of us understand it a bit more.
Dr. David Kirkby
if you were out of work would you give away your skills and labour
for free? in the spirit of sharing:
#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <sys/scsi/scsi.h>

#define LOG_SENSE 0x4d
#define TEMPERATURE_PAGE 0x0d

/*
 * Send LOG SENSE for one log page through the Solaris USCSI ioctl.
 * known_resp_len is used as the allocation length; buflen is unused.
 * Returns the ioctl() result: 0 on success, -1 with errno set.
 */
int
scsi_log_sense(int fd, int pagenum, uint8_t *pbuf, size_t buflen,
    size_t known_resp_len)
{
    struct uscsi_cmd ucmd;
    struct scsi_extended_sense sense;
    uint8_t cdb[10];
    int status;

    memset(&ucmd, 0, sizeof (ucmd));
    memset(cdb, 0, sizeof (cdb));

    /* Build the 10-byte CDB: opcode, page control + page, length. */
    cdb[0] = LOG_SENSE;
    cdb[2] = 0x40 | (pagenum & 0x3f);
    cdb[7] = (uint8_t)(known_resp_len >> 8);
    cdb[8] = (uint8_t)(known_resp_len & 0xff);

    ucmd.uscsi_cdb = (caddr_t)cdb;
    ucmd.uscsi_cdblen = sizeof (cdb);
    ucmd.uscsi_bufaddr = (caddr_t)pbuf;
    ucmd.uscsi_buflen = known_resp_len;
    ucmd.uscsi_rqbuf = (caddr_t)&sense;
    ucmd.uscsi_rqlen = sizeof (sense);
    ucmd.uscsi_timeout = 15;
    ucmd.uscsi_flags = USCSI_READ;    /* data flows disk -> host */

    status = ioctl(fd, USCSICMD, &ucmd);
    return (status);
}

int
main(int argc, char *argv[])
{
    char *device;
    int fd;
    int err;
    uint8_t tbuf[16];

    if (argc != 2) {
        fprintf(stderr, "usage: %s <device>\n", argv[0]);
        exit(1);
    }
    device = argv[1];

    /* O_NONBLOCK so the open does not depend on the media state. */
    fd = open(device, O_RDONLY | O_NONBLOCK);
    if (fd < 0) {
        perror(device);
        exit(1);
    }

    memset(tbuf, 0, sizeof (tbuf));
    err = scsi_log_sense(fd, TEMPERATURE_PAGE, tbuf, sizeof (tbuf),
        sizeof (tbuf));
    if (err != 0) {
        perror("scsi_log_sense failed");
        exit(1);
    }

    /* Byte 9 is the current temperature, byte 15 the trip temperature. */
    printf("current temp = %d C, trip temp = %d C\n", tbuf[9], tbuf[15]);
    return (0);
}
Depends. If I did something that I had no intention of selling, then
yes I would give it away for free.
If I developed some code I thought I might be able to sell, then no I
would not - I'd try to raise some $$ for it.
> in the spirit of sharing:
> #include <stdlib.h>
> #include <stdio.h>
<snip>
> printf("current temp = %d C, trip temp = %d C\n", tbuf[9], tbuf[15]);
Thanks very much.
That I'm sure will be helpful to many. It can't do your career any harm
either and might even do it some good, if someone was to be looking to
hire someone with your skills. It compiled with no warnings even with
-Wall on Solaris 9. It would not compile on Solaris 2.5, since that
lacks USCSI. Does anyone know when USCSI first came in?
Your program confirmed one thing I have long known - the disks in the
Ultra 80 are kept pretty cool, whereas those in Sun 611 boxes don't
get anywhere near as much cooling. I recall measuring the temperature
of some Seagate disks before (with a thermocouple) in one of those 611
boxes and while it was within the maximum temperature for the disk, it
was outside the temperature limits to achieve the stated MTBF.
Here are the 4 disks on my machine. The machine is virtually idle - the
disks are hardly being used.
The top two disks are internal to the U80, the bottom two are
external. Ambient temperature is an uncomfortably warm 27 deg C.
These two disks are identical 36 GB, 10,000 rpm Seagate ST336704LC's,
inside the Ultra 80.
sparrow /export/home/davek # hdtemp /dev/rdsk/c0t0d0s2
current temp = 40 C, trip temp = 65 C
sparrow /export/home/davek # hdtemp /dev/rdsk/c0t1d0s2
current temp = 36 C, trip temp = 65 C
This next disk is another identical 36 GB, 10,000 rpm Seagate
ST336704LC, but in a Sun 611 external disk box. Note it's quite
significantly (10-14 deg C) warmer than those in the Ultra 80, yet is
the same disk type.
sparrow /export/home/davek # hdtemp /dev/rdsk/c3t3d0s5
current temp = 50 C, trip temp = 65
This is a different disk, a somewhat older (1.6" high) 73 GB
ST173404LC, again in a 611 box. This is clearly running quite close to
its maximum.
sparrow /export/home/davek # hdtemp /dev/rdsk/c3t2d0s2
current temp = 57 C, trip temp = 65 C
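For scripting around the program, its one line of output is easy to parse back into numbers, and the margin to the trip temperature is often more telling than the raw reading. A minimal sketch, assuming the exact output format printed by the program above; both function names are made up:

```c
#include <stdio.h>

/*
 * Parse a line such as "current temp = 57 C, trip temp = 65 C".
 * Returns 0 and fills *current and *trip on success, -1 otherwise
 * (for example on "scsi_log_sense failed: I/O error").
 */
int parse_hdtemp_line(const char *line, int *current, int *trip)
{
    if (sscanf(line, "current temp = %d C, trip temp = %d",
               current, trip) != 2)
        return -1;
    return 0;
}

/* Degrees Celsius left before the drive's own trip point. */
int headroom(int current, int trip)
{
    return trip - current;
}
```

For the readings in this post, the internal disk at 40 C has 25 degrees of headroom, while the 73 GB disk in the 611 box at 57 C has only 8.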
Seems like I still need to rely on my brute force approach of cutting
the power lines to the disk with a thermal switch on my old SS20
running Solaris 2.5.
http://www.medphys.ucl.ac.uk/~davek/sun/cool.html
--
Dr. David Kirkby,
Website: http://www.medphys.ucl.ac.uk/~davek
Author of 'atlc' http://atlc.sourceforge.net/
Thanks!
> Your program confirmed one thing I have long known - the disks in the
> Ultra 80 are kept pretty cool, whereas those in Sun 611 boxes don't
> get anywhere near as much cooling. I recall measuring the temperature
> of some Seagate disks before (with a thermocouple) in one of those 611
> boxes and while it was within the maximum temperature for the disk, it
> was outside the temperature limits to achieve the stated MTBF.
[..]
> sparrow /export/home/davek # hdtemp /dev/rdsk/c0t0d0s2
> current temp = 40 C, trip temp = 65 C
> sparrow /export/home/davek # hdtemp /dev/rdsk/c0t1d0s2
> current temp = 36 C, trip temp = 65 C
> sparrow /export/home/davek # hdtemp /dev/rdsk/c3t3d0s5
> current temp = 50 C, trip temp = 65
My Ultra60 (Solaris 9) (the one I started this thread about) currently
only has a single Fujitsu 36GB disk (MAF3364L SUN36G). The temperature
reading (machine mostly idle):
root@antares:/house/dean/src# ./hdtemp /dev/rdsk/c0t0d0s2
current temp = 50 C, trip temp = 80 C
I think it's interesting how the temperature and trip temperature differ
from your readings.
For fun, I just tried to run this program on 'my' E450 Solaris 9 with
only 2x 9GB drives, but it failed. Could be the drives (IBM
DNES30917SUN9.0G).
root@crab:~# ~dean/src/hdtemp /dev/rdsk/c0t2d0s2
scsi_log_sense failed: I/O error
root@crab:~# ~dean/src/hdtemp /dev/rdsk/c0t3d0s2
scsi_log_sense failed: I/O error
Unfortunately, I have no root access to other SCSI-enabled Sun/Solaris
machines to test it on.
> My Ultra60 (Solaris 9) (the one I started this thread about) currently
> only has a single Fujitsu 36GB disk (MAF3364L SUN36G). The temperature
> reading (machine mostly idle):
>
> root@antares:/house/dean/src# ./hdtemp /dev/rdsk/c0t0d0s2
> current temp = 50 C, trip temp = 80 C
>
> I think it's interesting how the temperature and trip temperature differ
> from your readings.
Yes, all my Seagates have trips of 65 deg C. One of which is running
very close to it !! The person who wrote the program also posted data
with a trip of 65 deg C. The temperature limits on disks vary very
much from one place to another on the disk. I guess it depends on
where the temperature sensor is mounted - and of course on the disk
construction.
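To see at a glance how close a drive is to its trip point, the two
numbers in the hdtemp output can simply be subtracted. This is only a
sketch: temp_margin is a made-up helper name, and it assumes the output
layout shown in this thread (current temperature in field 4, trip
temperature in field 9).

```shell
# Print the headroom (trip minus current, in deg C) for one line of
# hdtemp output such as "current temp = 57 C, trip temp = 65 C".
# Assumption: field 4 is the current temperature, field 9 the trip value.
temp_margin() {
    echo "$1" | awk '{ print $9 - $4 }'
}

# Example (not run here, needs hdtemp and root access):
#   line=`hdtemp /dev/rdsk/c3t2d0s2`
#   temp_margin "$line"
```

With the 57 C reading above and a trip temperature of 65 C, this would
report a margin of 8 deg C.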
> For fun, I just tried to run this program on 'my' E450 Solaris 9 with
> only 2x 9GB drives, but it failed. Could be the drives (IBM
> DNES30917SUN9.0G).
>
> root@crab:~# ~dean/src/hdtemp /dev/rdsk/c0t2d0s2
> scsi_log_sense failed: I/O error
I suspect drives made in the era of 9 GB disks will not have
S.M.A.R.T. technology built into them. I might be wrong, but I think
they are too old. You can probably find out the part number and get a
data sheet on it from the Hitachi web site - Hitachi bought out IBM's
disk division.
I've started to log the temperature of my drives every 15 minutes,
putting that in /var/log/disktemp.log, so I can get some idea about
trends.
I've also set the system up to mail me if the internal disks reach 45
deg C and shut the system down if the internal disks reach 50 deg C.
While such temperatures will not harm the disks, I think they would
indicate a problem in the Ultra 80, since from what little data I
have seen, I can't get the internal disks to go above 41 deg C,
despite the fact the ambient is around 27 deg C.
For the external disks I've set the system to mail me if the
temperatures reach 60 deg C and shut the system down if they reach 65
deg C, which is the trip temperature for my Seagate disks.
So far, the maximum I have reached has been 59 deg C on an external
disk and 41 deg C on an internal disk. It is very warm here now. A
week or two ago, when the temperatures in the UK exceeded a 150 year
old record, I did contemplate shutting my Ultra 80 down, although a
check of its manual says it's okay to 45 deg C ambient.
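For pulling maxima like these out of the log, something along the
following lines might do. It is a sketch under assumptions:
max_logged_temps is a hypothetical name, and each line of
/var/log/disktemp.log is taken to be default date output (six fields)
followed by device/temperature pairs. On Solaris, nawk may be needed in
place of awk.

```shell
# Print the highest temperature logged for each disk, one disk per line.
# Assumption: fields 1-6 are `date` output, then "device temp" pairs.
max_logged_temps() {
    awk '{
        for (i = 7; i < NF; i += 2) {
            t = $(i + 1) + 0
            if (!($i in max) || t > max[$i]) max[$i] = t
        }
    }
    END { for (d in max) print d, max[d] }' "$1" | sort
}

# Example:
#   max_logged_temps /var/log/disktemp.log
```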
I've put my little script below. It's obviously tailor made for my
system and no doubt someone else who's better at writing scripts than
me could do a lot better. But it might give you some ideas, if nothing
else.
#!/bin/sh
# Mail users and if necessary shut down the system when disks are getting
# too hot.
# The temperature at which users are mailed and the system shut down
# depends on whether the disk is internal or not. Generally, the internal
# disks on an Ultra 80 are better cooled than the external ones, so tend
# to run cooler. If these start getting warm, it indicates there is
# probably a fan failure inside the U80. However, these temperatures
# might be quite normal for external disks, which run at higher
# temperatures.
PATH=/usr/bin:/usr/sbin:/usr/local/bin
export PATH
MAIL1=da...@medphys.ucl.ac.uk # First user to mail.
MAIL2=drki...@ntlworld.com # Second user to mail.
MAIL_INTERNAL=45 # Mail users if the internal disk temp exceeds this.
SHUTDOWN_INTERNAL=50 # Shut down if the internal disk temp exceeds this.
MAIL_EXTERNAL=60 # Mail users if the external disk temp exceeds this.
SHUTDOWN_EXTERNAL=65 # Shut down if the external disk temp exceeds this.
INTERNAL_DISK_1=/dev/rdsk/c0t0d0s2
INTERNAL_DISK_2=/dev/rdsk/c0t1d0s2
EXTERNAL_DISK_1=/dev/rdsk/c3t2d0s2
EXTERNAL_DISK_2=/dev/rdsk/c3t3d0s2
TEMP_INTERNAL_DISK_1=`hdtemp $INTERNAL_DISK_1 | awk '{print $4}'`
TEMP_INTERNAL_DISK_2=`hdtemp $INTERNAL_DISK_2 | awk '{print $4}'`
TEMP_EXTERNAL_DISK_1=`hdtemp $EXTERNAL_DISK_1 | awk '{print $4}'`
TEMP_EXTERNAL_DISK_2=`hdtemp $EXTERNAL_DISK_2 | awk '{print $4}'`
# Check the first internal disk.
if [ $TEMP_INTERNAL_DISK_1 -gt $SHUTDOWN_INTERNAL ] ; then
  echo "Internal disk $INTERNAL_DISK_1 at $TEMP_INTERNAL_DISK_1 deg C so are SHUTTING DOWN" |
    logger -p local1.emerg -t disktemp
  echo "Internal disk $INTERNAL_DISK_1 at $TEMP_INTERNAL_DISK_1 deg C so are SHUTTING DOWN" |
    mailx -c $MAIL1 -s "SHUTTING DOWN - internal disk too hot" $MAIL2
  sleep 5
  poweroff
elif [ $TEMP_INTERNAL_DISK_1 -gt $MAIL_INTERNAL ] ; then
  echo "Internal disk $INTERNAL_DISK_1 at $TEMP_INTERNAL_DISK_1 deg C which is very warm" |
    logger -p local1.warning -t disktemp
  echo "Internal disk $INTERNAL_DISK_1 at $TEMP_INTERNAL_DISK_1 deg C which is very warm" |
    mailx -c $MAIL1 -s "WARNING disk very warm" $MAIL2
fi
# Check the second internal disk.
if [ $TEMP_INTERNAL_DISK_2 -gt $SHUTDOWN_INTERNAL ] ; then
  echo "Internal disk $INTERNAL_DISK_2 at $TEMP_INTERNAL_DISK_2 deg C so are SHUTTING DOWN" |
    logger -p local1.emerg -t disktemp
  echo "Internal disk $INTERNAL_DISK_2 at $TEMP_INTERNAL_DISK_2 deg C so are SHUTTING DOWN" |
    mailx -c $MAIL1 -s "SHUTTING DOWN - internal disk too hot" $MAIL2
  sleep 5
  poweroff
elif [ $TEMP_INTERNAL_DISK_2 -gt $MAIL_INTERNAL ] ; then
  echo "Internal disk $INTERNAL_DISK_2 at $TEMP_INTERNAL_DISK_2 deg C which is very warm" |
    logger -p local1.warning -t disktemp
  echo "Internal disk $INTERNAL_DISK_2 at $TEMP_INTERNAL_DISK_2 deg C which is very warm" |
    mailx -c $MAIL1 -s "WARNING disk very warm" $MAIL2
fi
# Check the first external disk.
if [ $TEMP_EXTERNAL_DISK_1 -gt $SHUTDOWN_EXTERNAL ] ; then
  echo "External disk $EXTERNAL_DISK_1 at $TEMP_EXTERNAL_DISK_1 deg C so are SHUTTING DOWN" |
    logger -p local1.emerg -t disktemp
  echo "External disk $EXTERNAL_DISK_1 at $TEMP_EXTERNAL_DISK_1 deg C so are SHUTTING DOWN" |
    mailx -c $MAIL1 -s "SHUTTING DOWN - external disk too hot" $MAIL2
  sleep 5
  poweroff
elif [ $TEMP_EXTERNAL_DISK_1 -gt $MAIL_EXTERNAL ] ; then
  echo "External disk $EXTERNAL_DISK_1 at $TEMP_EXTERNAL_DISK_1 deg C which is very warm" |
    logger -p local1.warning -t disktemp
  echo "External disk $EXTERNAL_DISK_1 at $TEMP_EXTERNAL_DISK_1 deg C which is very warm" |
    mailx -c $MAIL1 -s "WARNING disk very warm" $MAIL2
fi
# Check the second external disk.
if [ $TEMP_EXTERNAL_DISK_2 -gt $SHUTDOWN_EXTERNAL ] ; then
  echo "External disk $EXTERNAL_DISK_2 at $TEMP_EXTERNAL_DISK_2 deg C so are SHUTTING DOWN" |
    logger -p local1.emerg -t disktemp
  echo "External disk $EXTERNAL_DISK_2 at $TEMP_EXTERNAL_DISK_2 deg C so are SHUTTING DOWN" |
    mailx -c $MAIL1 -s "SHUTTING DOWN - external disk too hot" $MAIL2
  sleep 5
  poweroff
elif [ $TEMP_EXTERNAL_DISK_2 -gt $MAIL_EXTERNAL ] ; then
  echo "External disk $EXTERNAL_DISK_2 at $TEMP_EXTERNAL_DISK_2 deg C which is very warm" |
    logger -p local1.warning -t disktemp
  echo "External disk $EXTERNAL_DISK_2 at $TEMP_EXTERNAL_DISK_2 deg C which is very warm" |
    mailx -c $MAIL1 -s "WARNING disk very warm" $MAIL2
fi
# Save data in a file for trend analysis
echo `date` $INTERNAL_DISK_1 $TEMP_INTERNAL_DISK_1 \
  $INTERNAL_DISK_2 $TEMP_INTERNAL_DISK_2 \
  $EXTERNAL_DISK_1 $TEMP_EXTERNAL_DISK_1 \
  $EXTERNAL_DISK_2 $TEMP_EXTERNAL_DISK_2 >> /var/log/disktemp.log
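The four near-identical checks in the script could be folded into one
helper. The following is only a sketch, not a drop-in replacement:
classify_temp is a made-up name, and it merely reports which action
applies, using the same -gt comparisons as the script. It also reports
"unknown" when hdtemp returns nothing usable, as with the
"scsi_log_sense failed" error seen earlier in this thread.

```shell
# classify_temp TEMP MAIL_LIMIT SHUTDOWN_LIMIT
# Prints one of: shutdown, warn, ok, unknown.
classify_temp() {
    temp=$1
    warn=$2
    crit=$3
    # An empty or non-numeric reading means hdtemp gave us nothing usable.
    case "$temp" in
        ''|*[!0-9]*) echo unknown; return 0 ;;
    esac
    if [ "$temp" -gt "$crit" ]; then
        echo shutdown
    elif [ "$temp" -gt "$warn" ]; then
        echo warn
    else
        echo ok
    fi
}

# Example, using the internal-disk limits from the script:
#   state=`classify_temp "$TEMP_INTERNAL_DISK_1" 45 50`
```

The caller would then log, mail or power off according to the word
printed, so the messages need writing only once.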
This may very well be, yes. Fortunately I trust my E450 to keep cooled,
and prtdiag and prtpicl provide good temperature indications.
I'll put my 9GB Cheetah back in my U60 tomorrow; I sure hope this drive
supports the monitoring.
> I've also set the system up to mail me if the internal disks reach 45
> deg C and shut the system down if the internal disks reach 50 deg C.
> While such temperatures will not harm the disks, I think they would
> indicate a problem in the Ultra 80, since from what little data I
> have seen, I can't get the internal disks to go above 41 deg C,
> despite the fact the ambient is around 27 deg C.
>
> For the external disks I've set the system to mail me if the
> temperatures reach 60 deg C and shut the system down if they reach 65
> deg C, which is the trip temperature for my Seagate disks.
I have something like this in mind for my U60. Time will show how well
the disk temperature maps back to the PCI temperature sensor readings
that effectively panic my machine.
> So far, the maximum I have reached has been 59 deg C on an external
> disk and 41 deg C on an internal disk. It is very warm here now. A
> week or two ago, when the temperatures in the UK exceeded a 150 year
> old record, I did contemplate shutting my Ultra 80 down, although a
> check of its manual says it's okay to 45 deg C ambient.
I didn't check the manual for my U60, but during that week I had to keep
it offline mostly as thermal warnings (read: panics) could occur every
afternoon, environmental temperature around 40 degC (my room is too hot
really).
To my annoyance, the system was able to stay cooler (and be up longer)
when the cover was removed from the U60. So much for my conception of
covers and air flow...
My computer had a thermal panic tonight while I was having dinner.
I powered it on when I returned. It panicked 5 minutes later. I booted
again, and a panic in a few minutes again. I removed the cover, and it
stayed up...
> I've put my little script below. It's obviously tailor made for my
> system and no doubt someone else who's better at writing scripts than
> me could do a lot better. But it might give you some ideas, if nothing
> else.
I'll run something like this after I add the second disk and will report
how it turns out after a few days.
> I'll put my 9GB Cheetah back in my U60 tomorrow; I sure hope this drive
> supports the monitoring.
Check the Seagate web site, but I would not be too surprised if the
disk does not support S.M.A.R.T., given it's only 9 GB and hence quite old.
I did some tests whilst leaving the disks virtually idle, then thrashing
them as much as possible. Temperature increases of around 3 deg C were
noted, suggesting to me that significant changes in disk temperature
would be due to effects other than disk usage.
Of course, I don't have the time or facilities to verify this all to
be sure of the facts.
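For anyone wanting to repeat the idle-versus-busy comparison, a rough
sketch follows. Assumptions: HDTEMP, read_temp and temp_rise_under_load
are names invented here; hdtemp prints the current temperature in field
4 as shown in this thread; and a sequential dd read stands in for real
thrashing, which it only approximates. Drive temperature lags the load,
so a short run will understate the rise.

```shell
# Command used to read the temperature; can be overridden for testing.
HDTEMP=${HDTEMP:-hdtemp}

# Print the current temperature (deg C) of the given raw device.
read_temp() {
    $HDTEMP "$1" | awk '{ print $4 }'
}

# temp_rise_under_load DEVICE PASSES MEGABYTES
# Reads MEGABYTES from DEVICE, PASSES times over, then prints the
# change in temperature (after minus before).
temp_rise_under_load() {
    disk=$1
    passes=$2
    mb=$3
    before=`read_temp "$disk"`
    i=0
    while [ "$i" -lt "$passes" ]; do
        dd if="$disk" of=/dev/null bs=1024k count="$mb" 2>/dev/null
        i=`expr "$i" + 1`
    done
    after=`read_temp "$disk"`
    # awk rather than expr here, since expr exits non-zero on a result of 0
    echo "$after $before" | awk '{ print $1 - $2 }'
}

# Example (as root):
#   temp_rise_under_load /dev/rdsk/c0t0d0s2 20 1024
```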
> I didn't check the manual for my U60, but during that week I had to keep
> it offline mostly as thermal warnings (read: panics) could occur every
> afternoon, environmental temperature around 40 degC (my room is too hot
> really).
That is a very warm room.
I did something tonight for a test. I was going to the pub for a BBQ,
so knew I would be out several hours. I shut the system down using
'poweroff', but did not touch the power on external disks. I.e. I
simulated what I could do automatically, without manual intervention.
Since switching on, around 20 mins ago, the internal disks have risen
in temperature by about 5 deg C, but the external ones (to which power
was kept applied), have changed very little (one increased by 1 deg C,
the other no change). This suggests to me that powering off a Sun does
not significantly reduce the heat produced on external disks, so if
external disks are getting too warm, executing 'poweroff' does not
help much. Perhaps there are SCSI commands to spin a disk down, which
could be executed before shutting down a system. But clearly just
shutting a system down has little or no effect on the temperature of
external disks.
> My computer had a thermal panic tonight while I was having dinner.
> I powered it on when I returned. It panicked 5 minutes later. I booted
> again, and a panic in a few minutes again. I removed the cover, and it
> stayed up...
If the room is at 40 deg C, with little air movement, one really is
asking for problems.
> > I've put my little script below. It's obviously tailor made for my
> I'll run something like this after I add the second disk and will report
> how it turns out after a few days.
Let us know. I think disk temperature is at least a reasonable, if
not a good, indicator of a system's susceptibility to panics caused by
excess temperature.
Dr. David Kirkby
I wanted to, but since I didn't have the drive here I didn't know the
part number yet. But it's a high drive, probably too old indeed.
>> I didn't check the manual for my U60, but during that week I had to keep
>> it offline mostly as thermal warnings (read: panics) could occur every
>> afternoon, environmental temperature around 40 degC (my room is too hot
>> really).
>
> That is a very warm room.
[..]
> If the room is at 40 deg C, with little air movement, one really is
> asking for problems.
That was in the 'hot' week, and about the maximum. I generally have 4 or
5 computers on 24/7, but powered down most, although it didn't make much
of a difference.
It's about 27 degC during the day now, which is ok. But I have had the
occasional thermal warnings outside that period, like today, while the
temperature was ok.
> Let us know. I think disk temperature is at least a reasonable, if
> not a good, indicator of a system's susceptibility to panics caused by
> excess temperature.
Fortunately the panic is entirely deterministic: it happens only when the
thermal warning is received. The hardware isn't flipping bits or anything yet.