Problems SC648

Jeremy Hallum

Dec 9, 2011, 3:38:02 PM

to SiCortex Users

Hi all,

We're having some problems with our SC648 hanging completely at
irregular intervals. Can anyone provide some advice as to what
might be wrong?

Symptoms: The entire cluster (except for the ssp) freezes
completely and does not allow any logins. Slurm reports every node down
except for the node that connects to the outside world. That node is
reported idle, but it is not accessible either. No node passes a ping test.

When we run the SiCortex diags with the command:

/opt/sicortex/diags/g2_control_planel/run_diags sci

after the cluster has been rebooted, there are no errors. However,
when we run run_diags after an outage, we get the following errors:

Passed : msp_check.pl()
Passed : msp_flash.pl()
Passed : temp_check.pl()
running: power_check.pl()
ERROR on MSP 2 43.000C (Thu Dec 8 10:51:30 2011) Temperature Sensor reading too High.
pol_19 (U1_R6, B_1.8V) 1.466V 4.685A 80.978C
FAILED : power_check.pl()
Passed : pol_check.pl()
Passed : jtag_test.pl()
Passed : plls_lock.pl()
Passed : smart_bist.pl()
Passed : iclk_pll_lock.pl()
Passed : ssp_ice9_comm.pl()
Passed : pcitest.pl()
Passed : attn_test.pl()
Passed : attn_int_test.pl()
Passed : led_check.pl(loop=10)
Passed : led_node.pl(loop=10)
Passed : msp_check.pl()
running: bootPathTest.pl(cpu=0)
ERROR on Node 2.26 43.000C (Thu Dec 8 11:05:34 2011)
bootPathTest.pl Test Timeout.

FAILED : bootPathTest.pl(cpu=0)
running: bootPathTest.pl(cpu=1)
ERROR on Node 2.26 41.000C (Thu Dec 8 11:16:10 2011)
bootPathTest.pl Test Timeout.

FAILED : bootPathTest.pl(cpu=1)
running: bootPathTest.pl(cpu=2)
ERROR on Node 2.26 41.000C (Thu Dec 8 11:26:45 2011)
bootPathTest.pl Test Timeout.

FAILED : bootPathTest.pl(cpu=2)
running: bootPathTest.pl(cpu=3)
ERROR on Node 2.26 40.000C (Thu Dec 8 11:37:19 2011)
bootPathTest.pl Test Timeout.

FAILED : bootPathTest.pl(cpu=3)
running: bootPathTest.pl(cpu=4)
ERROR on Node 2.26 40.000C (Thu Dec 8 11:47:53 2011)
bootPathTest.pl Test Timeout.

The Test Timeouts continue through to the end of the diag run.

When the error occurs in the diags, policyd logs show the following:
[2011-12-08 10:51:11] WARNING: sicortex.policyd.Temperature: TIMEOUT on sci-msp2 MspEnv_Temp_Node_18: no reading for 180 seconds
[2011-12-08 10:51:11] WARNING: sicortex.policyd.Temperature: TIMEOUT on sci-msp2 MspEnv_Temp_Node_19: no reading for 180 seconds
[2011-12-08 10:51:11] WARNING: sicortex.policyd.Temperature: TIMEOUT on sci-msp2 MspEnv_Temp_Node_22: no reading for 180 seconds
[2011-12-08 10:51:11] WARNING: sicortex.policyd.Temperature: TIMEOUT on sci-msp2 MspEnv_Temp_Node_10: no reading for 180 seconds
[2011-12-08 10:51:11] WARNING: sicortex.policyd.Temperature: TIMEOUT on sci-msp2 MspEnv_Temp_Node_11: no reading for 180 seconds
[2011-12-08 10:51:11] WARNING: sicortex.policyd.Temperature: TIMEOUT on sci-msp2 MspEnv_Temp_Node_12: no reading for 180 seconds
[2011-12-08 10:51:11] WARNING: sicortex.policyd.Temperature: TIMEOUT on sci-msp2 MspEnv_Temp_Node_13: no reading for 180 seconds
[2011-12-08 10:51:11] WARNING: sicortex.policyd.Temperature: TIMEOUT on sci-msp2 MspEnv_Temp_Node_14: no reading for 180 seconds
[2011-12-08 10:51:11] WARNING: sicortex.policyd.Temperature: TIMEOUT on sci-msp2 MspEnv_Temp_Node_15: no reading for 180 seconds
[2011-12-08 10:51:11] WARNING: sicortex.policyd.Temperature: TIMEOUT on sci-msp2 MspEnv_Temp_Node_16: no reading for 180 seconds
[2011-12-08 10:51:11] WARNING: sicortex.policyd.Temperature: TIMEOUT on sci-msp2 MspEnv_Temp_Node_17: no reading for 180 seconds
[2011-12-08 10:51:11] WARNING: sicortex.policyd.Temperature: TIMEOUT on sci-msp2 MspEnv_Power_Pol_23: no reading for 180 seconds
[2011-12-08 10:51:11] WARNING: sicortex.policyd.Temperature: TIMEOUT on sci-msp2 MspEnv_Power_Pol_22: no reading for 180 seconds
[2011-12-08 10:51:11] WARNING: sicortex.policyd.Temperature: TIMEOUT on sci-msp2 MspEnv_Power_Pol_07: no reading for 180 seconds
[2011-12-08 10:51:11] WARNING: sicortex.policyd.Temperature: TIMEOUT on sci-msp2 MspEnv_Temp_Node_26: no reading for 180 seconds
[2011-12-08 10:51:11] WARNING: sicortex.policyd.Temperature: TIMEOUT on sci-msp2 MspEnv_Temp_AD_12: no reading for 180 seconds

After that, the temperature on that msp stayed abnormally high for
the remainder of the test. Is that the cause of my problem, or a symptom
of a larger problem on that board?
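
For anyone sifting through a long policyd log, grouping the TIMEOUT warnings by msp can make it easier to see whether a single module is involved. The following is only a rough sketch: it assumes each warning sits on one line (mail wrapping undone) in the layout quoted in this message, and the names TIMEOUT_RE and timeouts_by_msp are made up for illustration.

```python
import re
from collections import defaultdict

# Assumed layout of one policyd warning, taken from the lines quoted in this thread:
# [timestamp] WARNING: <logger>: TIMEOUT on <msp> <sensor>: no reading for <n> seconds
TIMEOUT_RE = re.compile(
    r"\[(?P<when>[^\]]+)\] WARNING: \S+ TIMEOUT on (?P<msp>\S+) "
    r"(?P<sensor>\w+): no reading for (?P<secs>\d+) seconds"
)

def timeouts_by_msp(lines):
    """Map each msp name to the list of sensors that timed out on it."""
    grouped = defaultdict(list)
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            grouped[match["msp"]].append(match["sensor"])
    return dict(grouped)

sample = [
    "[2011-12-08 10:51:11] WARNING: sicortex.policyd.Temperature: TIMEOUT "
    "on sci-msp2 MspEnv_Temp_Node_18: no reading for 180 seconds",
    "[2011-12-08 10:51:11] WARNING: sicortex.policyd.Temperature: TIMEOUT "
    "on sci-msp2 MspEnv_Power_Pol_23: no reading for 180 seconds",
]
print(timeouts_by_msp(sample))
```

In the excerpt above, every warning names sci-msp2, which is the same module the diags complain about.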

Does anyone have any ideas about what the problem might be?

Thanks
Jeremy

Lawrence Stewart

Dec 10, 2011, 8:54:38 PM

to sicorte...@googlegroups.com, Jeremy Hallum, Lawrence Stewart

Hanging completely is a symptom of interconnect failure brought on by a failure in the switch portion of one of the node chips. Normally the interconnect is reliable: if bit errors happen on the links, they are detected and the transfer is automatically retried. However, if a link fails hard, or a node chip fails, then traffic will "back up" into preceding links and switches and quickly jam the entire machine.

To solve the problem, the usual approach is to identify the bad links or nodes, mark them disabled, and reboot the cluster. Alternatively, it might be possible to repair the problem if it is, for example, a bad power regulator.

SiCortex systems actually have another network, and it might be possible to "ping" each node over
the "control" network even after the machine has hung. You probably couldn't log in, since the interconnect is used for root filesystem access, but you might be able to ping each node and establish which ones
are down, and therefore candidates to be disabled.

The "control" network consists of a 100 Mbit ethernet between the ssp and the four (in an SC648) module service processors (msp0, msp1, msp2, msp3). These have addresses like "sc1-msp0", "sc1-msp1", "sc1-msp2", and "sc1-msp3". Each msp, in turn, operates a point-to-point protocol link to each of the 27 node chips on its module (this actually is a bit-banged path that works over the JTAG chip debugging link, rather slowly).
From the SSP, you can ping

sc1-msp0-n0
sc1-msp0-n1
...
sc1-msp0-n26

to reach the 27 nodes on module 0, without using the high speed interconnect.
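
The naming scheme above can be turned into a sweep of the whole control network. This is a minimal sketch, assuming the cluster prefix "sc1", four modules of 27 nodes, and a Linux-style ping that accepts -c and -W; the function names are hypothetical and not part of any SiCortex tool.

```python
import subprocess

def control_hostnames(cluster="sc1", modules=4, nodes_per_module=27):
    """Build control-network names such as sc1-msp0-n0 ... sc1-msp3-n26."""
    return [
        f"{cluster}-msp{m}-n{n}"
        for m in range(modules)
        for n in range(nodes_per_module)
    ]

def is_reachable(host, timeout_s=5):
    """Send one echo request with the system ping (Linux-style flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def unreachable_nodes(hosts, probe=is_reachable):
    """Return the hosts that do not answer; probe can be swapped for a dry run."""
    return [host for host in hosts if not probe(host)]

# On the SSP, after a hang, one might run:
#   print(unreachable_nodes(control_hostnames()))
```

Because the control path is slow, a generous timeout is probably wise; the 5 seconds used here is a guess.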

You might also look at the last entries in the various log files to see if there are any clues about links
going down. Nodes will report these things, but they only check every 10 seconds or so, and the machine
could lock up more quickly than that. Look at /var/log/sc1/sc1-m*n*.console and /var/log/sc1/mfd.log
(the master fabric daemon log). Also check /var/log/nodes-201112/*, which are the node syslog files,
and /var/log/msp-env/<date-based logs>, which record environmental events outside normal ranges.
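
One way to pull the last few lines from each of those logs in one go is sketched below. It assumes the console and mfd paths named above exist under the "sc1" cluster name and are readable from the SSP; last_lines and tails are illustrative helper names only.

```python
import glob
from collections import deque

# Glob patterns copied from this message; adjust the cluster name for your site.
LOG_PATTERNS = [
    "/var/log/sc1/sc1-m*n*.console",
    "/var/log/sc1/mfd.log",
]

def last_lines(path, count=5):
    """Return the final `count` lines of a text file, reading it as a stream."""
    with open(path, errors="replace") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]

def tails(patterns=LOG_PATTERNS, count=5):
    """Map each matching log file to its last few lines."""
    return {
        path: last_lines(path, count)
        for pattern in patterns
        for path in sorted(glob.glob(pattern))
    }

# Example: for path, lines in tails().items(): print(path, *lines, sep="\n  ")
```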

From the diagnostic tests, it seems likely that one of your modules has a bad power regulator, or something like that. pol19 (point of load regulator 19) may be acting up. You also may have trouble with
sc1-m2n26, since that one shows up in testing.

There is also a very useful utility on the ssp, mspenv, that will read out all the voltages, currents, and
temperature sensors. If you run this on a good board and also on the suspect one, it will likely have
clues:

mspenv -m sc1-msp2 -n 27 -t ;; read temperature data
mspenv -m sc1-msp2 -n 27 -p ;; read power data

and then run these again on, say, sc1-msp0.

If there <is> a bad regulator, then it might affect only a few nodes, or all of them, because there are several regulators for each supply voltage and, since the different voltage busses have different current draw, each regulator may feed a different number of chips. I have the module schematics, so given the POL number I can figure out which node chips are affected.

If there is a bad POL, then it might be possible to fix it, by replacing the POL. Tony Hudon may still have the equipment to attempt this, in his basement or garage or whatever.

Otherwise, you will have to mark disabled the parts of the machine that aren't working and press on.
In the worst case, you can simply unplug the bad module and run without it. It would be best in that case to replace the bad module with a jumper module, to pass through the links. I know that some SiCortex sites have spare jumper modules, so you may be able to get one by asking politely.

It might be the case that the "hang" is really caused by, say, an overtemperature signal from one of
the power regulators, which would cause the MSP to automatically shut down the power system on that module,
which would hang the whole machine due to interconnect problems. Are the lights still on on all modules
when it hangs? I am fairly sure the msp logs some strident messages somewhere when it does this.
Probably in the msp-env logs.

Let me or the list know what you find.

-Larry Stewart
Serissa Research

Lawrence Stewart

Dec 10, 2011, 9:04:35 PM

to sicorte...@googlegroups.com, Jeremy Hallum, Lawrence Stewart

Here's the kind of output mspenv produces (from MIT's SC648).

The mspenv -t sensors are the node chips and some general board-level sensors, reported in "milli degrees centigrade" (mC):
[root@sicortex-ssp msp-env]# mspenv -m sc1-msp2 -t -n 27
1323567793 env sc1-msp2 MspEnv_Temp_Node_00 36750 mC
1323567793 env sc1-msp2 MspEnv_Temp_Node_01 42250 mC
1323567793 env sc1-msp2 MspEnv_Temp_Node_02 62500 mC
1323567793 env sc1-msp2 MspEnv_Temp_Node_03 37000 mC
1323567793 env sc1-msp2 MspEnv_Temp_Node_04 44250 mC
1323567793 env sc1-msp2 MspEnv_Temp_Node_05 57750 mC
1323567793 env sc1-msp2 MspEnv_Temp_Node_06 42500 mC
1323567793 env sc1-msp2 MspEnv_Temp_Node_07 51000 mC
1323567793 env sc1-msp2 MspEnv_Temp_Node_08 52250 mC
1323567793 env sc1-msp2 MspEnv_Temp_Node_09 62250 mC
1323567793 env sc1-msp2 MspEnv_Temp_Node_10 59500 mC
1323567793 env sc1-msp2 MspEnv_Temp_Node_11 57750 mC
1323567793 env sc1-msp2 MspEnv_Temp_Node_12 38250 mC
1323567793 env sc1-msp2 MspEnv_Temp_Node_13 54250 mC
1323567793 env sc1-msp2 MspEnv_Temp_Node_14 59750 mC
1323567793 env sc1-msp2 MspEnv_Temp_Node_15 59000 mC
1323567793 env sc1-msp2 MspEnv_Temp_Node_16 62500 mC
1323567793 env sc1-msp2 MspEnv_Temp_Node_17 49000 mC
1323567793 env sc1-msp2 MspEnv_Temp_Node_18 49250 mC
1323567793 env sc1-msp2 MspEnv_Temp_Node_19 51500 mC
1323567793 env sc1-msp2 MspEnv_Temp_Node_20 59000 mC
1323567793 env sc1-msp2 MspEnv_Temp_Node_21 54500 mC
1323567793 env sc1-msp2 MspEnv_Temp_Node_22 47250 mC
1323567793 env sc1-msp2 MspEnv_Temp_Node_23 64250 mC
1323567793 env sc1-msp2 MspEnv_Temp_Node_24 65250 mC
1323567793 env sc1-msp2 MspEnv_Temp_Node_25 61500 mC
1323567793 env sc1-msp2 MspEnv_Temp_Node_26 60250 mC
1323567793 env sc1-msp2 MspEnv_Temp_AD_00 47000 mC
1323567793 env sc1-msp2 MspEnv_Temp_AD_01 46000 mC
1323567793 env sc1-msp2 MspEnv_Temp_AD_02 41000 mC
1323567793 env sc1-msp2 MspEnv_Temp_AD_03 42000 mC
1323567793 env sc1-msp2 MspEnv_Temp_AD_04 29000 mC
1323567793 env sc1-msp2 MspEnv_Temp_AD_05 26000 mC
1323567793 env sc1-msp2 MspEnv_Temp_AD_06 42000 mC
1323567793 env sc1-msp2 MspEnv_Temp_AD_07 44000 mC
1323567793 env sc1-msp2 MspEnv_Temp_AD_08 38000 mC
1323567793 env sc1-msp2 MspEnv_Temp_AD_09 36000 mC
1323567793 env sc1-msp2 MspEnv_Temp_AD_10 29000 mC
1323567793 env sc1-msp2 MspEnv_Temp_AD_11 34000 mC
1323567793 env sc1-msp2 MspEnv_Temp_AD_12 32000 mC
1323567793 env sc1-msp2 MspEnv_Temp_AD_13 28000 mC
[root@sicortex-ssp msp-env]#

The "power" readings include voltage, temperature, and sometimes current from
the various power regulators.
[root@sicortex-ssp msp-env]# mspenv -m sc1-msp2 -c -n 27
1323568670 env sc1-msp2 MspEnv_Power_Pol_00 1099 mV
1323568670 env sc1-msp2 MspEnv_Power_Pol_00 NM mA
1323568670 env sc1-msp2 MspEnv_Power_Pol_00 51174 mC
1323568670 env sc1-msp2 MspEnv_Power_Pol_01 1099 mV
1323568670 env sc1-msp2 MspEnv_Power_Pol_01 NM mA
1323568670 env sc1-msp2 MspEnv_Power_Pol_01 54311 mC
1323568670 env sc1-msp2 MspEnv_Power_Pol_02 1099 mV
1323568670 env sc1-msp2 MspEnv_Power_Pol_02 NM mA
1323568670 env sc1-msp2 MspEnv_Power_Pol_02 46468 mC
1323568670 env sc1-msp2 MspEnv_Power_Pol_03 1078 mV
1323568670 env sc1-msp2 MspEnv_Power_Pol_03 NM mA
1323568670 env sc1-msp2 MspEnv_Power_Pol_03 44900 mC
1323568670 env sc1-msp2 MspEnv_Power_Pol_04 1121 mV
1323568670 env sc1-msp2 MspEnv_Power_Pol_04 NM mA
1323568670 env sc1-msp2 MspEnv_Power_Pol_04 54311 mC
1323568670 env sc1-msp2 MspEnv_Power_Pol_05 1099 mV
1323568670 env sc1-msp2 MspEnv_Power_Pol_05 NM mA
1323568670 env sc1-msp2 MspEnv_Power_Pol_05 35488 mC
1323568670 env sc1-msp2 MspEnv_Power_Pol_06 1099 mV
1323568670 env sc1-msp2 MspEnv_Power_Pol_06 NM mA
1323568670 env sc1-msp2 MspEnv_Power_Pol_06 29214 mC
1323568670 env sc1-msp2 MspEnv_Power_Pol_07 1121 mV
1323568670 env sc1-msp2 MspEnv_Power_Pol_07 NM mA
1323568670 env sc1-msp2 MspEnv_Power_Pol_07 43331 mC
1323568670 env sc1-msp2 MspEnv_Power_Pol_08 1099 mV
1323568670 env sc1-msp2 MspEnv_Power_Pol_08 NM mA
1323568670 env sc1-msp2 MspEnv_Power_Pol_08 49606 mC
1323568670 env sc1-msp2 MspEnv_Power_Pol_09 1121 mV
1323568670 env sc1-msp2 MspEnv_Power_Pol_09 NM mA
1323568670 env sc1-msp2 MspEnv_Power_Pol_09 51174 mC
1323568670 env sc1-msp2 MspEnv_Power_Pol_10 1078 mV
1323568670 env sc1-msp2 MspEnv_Power_Pol_10 NM mA
1323568670 env sc1-msp2 MspEnv_Power_Pol_10 41763 mC
1323568670 env sc1-msp2 MspEnv_Power_Pol_11 1121 mV
1323568670 env sc1-msp2 MspEnv_Power_Pol_11 NM mA
1323568670 env sc1-msp2 MspEnv_Power_Pol_11 48821 mC
1323568670 env sc1-msp2 MspEnv_Power_Pol_12 1121 mV
1323568670 env sc1-msp2 MspEnv_Power_Pol_12 NM mA
1323568670 env sc1-msp2 MspEnv_Power_Pol_12 40978 mC
1323568670 env sc1-msp2 MspEnv_Power_Pol_13 1121 mV
1323568670 env sc1-msp2 MspEnv_Power_Pol_13 NM mA
1323568670 env sc1-msp2 MspEnv_Power_Pol_13 34704 mC
1323568670 env sc1-msp2 MspEnv_Power_Pol_14 1790 mV
1323568670 env sc1-msp2 MspEnv_Power_Pol_14 NM mA
1323568670 env sc1-msp2 MspEnv_Power_Pol_14 41763 mC
1323568670 env sc1-msp2 MspEnv_Power_Pol_15 1833 mV
1323568670 env sc1-msp2 MspEnv_Power_Pol_15 NM mA
1323568670 env sc1-msp2 MspEnv_Power_Pol_15 46468 mC
1323568670 env sc1-msp2 MspEnv_Power_Pol_16 1790 mV
1323568670 env sc1-msp2 MspEnv_Power_Pol_16 NM mA
1323568670 env sc1-msp2 MspEnv_Power_Pol_16 44900 mC
1323568670 env sc1-msp2 MspEnv_Power_Pol_17 1811 mV
1323568670 env sc1-msp2 MspEnv_Power_Pol_17 NM mA
1323568670 env sc1-msp2 MspEnv_Power_Pol_17 33920 mC
1323568670 env sc1-msp2 MspEnv_Power_Pol_18 1768 mV
1323568670 env sc1-msp2 MspEnv_Power_Pol_18 NM mA
1323568670 env sc1-msp2 MspEnv_Power_Pol_18 49606 mC
1323568670 env sc1-msp2 MspEnv_Power_Pol_19 1790 mV
1323568670 env sc1-msp2 MspEnv_Power_Pol_19 NM mA
1323568670 env sc1-msp2 MspEnv_Power_Pol_19 43331 mC
1323568670 env sc1-msp2 MspEnv_Power_Pol_20 1790 mV
1323568670 env sc1-msp2 MspEnv_Power_Pol_20 NM mA
1323568670 env sc1-msp2 MspEnv_Power_Pol_20 37057 mC
1323568670 env sc1-msp2 MspEnv_Power_Pol_21 1747 mV
1323568670 env sc1-msp2 MspEnv_Power_Pol_21 NM mA
1323568670 env sc1-msp2 MspEnv_Power_Pol_21 49606 mC
1323568670 env sc1-msp2 MspEnv_Power_Pol_22 1768 mV
1323568670 env sc1-msp2 MspEnv_Power_Pol_22 NM mA
1323568670 env sc1-msp2 MspEnv_Power_Pol_22 46468 mC
1323568670 env sc1-msp2 MspEnv_Power_Pol_23 1811 mV
1323568670 env sc1-msp2 MspEnv_Power_Pol_23 NM mA
1323568670 env sc1-msp2 MspEnv_Power_Pol_23 34704 mC
1323568670 env sc1-msp2 MspEnv_Power_Pol_24 1790 mV
1323568670 env sc1-msp2 MspEnv_Power_Pol_24 NM mA
1323568670 env sc1-msp2 MspEnv_Power_Pol_24 32351 mC
1323568670 env sc1-msp2 MspEnv_Power_Pol_25 1186 mV
1323568670 env sc1-msp2 MspEnv_Power_Pol_25 NM mA
1323568670 env sc1-msp2 MspEnv_Power_Pol_25 59017 mC
1323568670 env sc1-msp2 MspEnv_Power_Pol_26 1186 mV
1323568670 env sc1-msp2 MspEnv_Power_Pol_26 NM mA
1323568670 env sc1-msp2 MspEnv_Power_Pol_26 48037 mC
1323568670 env sc1-msp2 MspEnv_Power_Dpm_Ibv 10483 mV
1323568670 env sc1-msp2 MspEnv_Power_Dpm_Ibv NM mA
1323568670 env sc1-msp2 MspEnv_Power_Dpm_Ibv NM mC
1323568670 env sc1-msp2 MspEnv_Power_Ad_Vibv_96 10350 mV
1323568670 env sc1-msp2 MspEnv_Power_Ad_Vibv_96 NM mA
1323568670 env sc1-msp2 MspEnv_Power_Ad_Vibv_96 NM mC
1323568670 env sc1-msp2 MspEnv_Power_Ad_Vddc_10 1115 mV
1323568670 env sc1-msp2 MspEnv_Power_Ad_Vddc_10 NM mA
1323568670 env sc1-msp2 MspEnv_Power_Ad_Vddc_10 NM mC
1323568670 env sc1-msp2 MspEnv_Power_Ad_Vddf_12 1225 mV
1323568670 env sc1-msp2 MspEnv_Power_Ad_Vddf_12 NM mA
1323568670 env sc1-msp2 MspEnv_Power_Ad_Vddf_12 NM mC
1323568670 env sc1-msp2 MspEnv_Power_Ad_Vddr_18 1865 mV
1323568670 env sc1-msp2 MspEnv_Power_Ad_Vddr_18 NM mA
1323568670 env sc1-msp2 MspEnv_Power_Ad_Vddr_18 NM mC
1323568670 env sc1-msp2 MspEnv_Power_Ad_Vddm_25 2570 mV
1323568670 env sc1-msp2 MspEnv_Power_Ad_Vddm_25 NM mA
1323568670 env sc1-msp2 MspEnv_Power_Ad_Vddm_25 NM mC
1323568670 env sc1-msp2 MspEnv_Power_Ad_Vdd_33 3360 mV
1323568670 env sc1-msp2 MspEnv_Power_Ad_Vdd_33 NM mA
1323568670 env sc1-msp2 MspEnv_Power_Ad_Vdd_33 NM mC
1323568670 env sc1-msp2 MspEnv_Power_Ad_Iibv_A NM mV
1323568670 env sc1-msp2 MspEnv_Power_Ad_Iibv_A 14715 mA
1323568670 env sc1-msp2 MspEnv_Power_Ad_Iibv_A NM mC
1323568670 env sc1-msp2 MspEnv_Power_Ad_Iibv_B NM mV
1323568670 env sc1-msp2 MspEnv_Power_Ad_Iibv_B 15120 mA
1323568670 env sc1-msp2 MspEnv_Power_Ad_Iibv_B NM mC
[root@sicortex-ssp msp-env]#

So for example, POL 19 reports

1323568670 env sc1-msp2 MspEnv_Power_Pol_19 1790 mV
1323568670 env sc1-msp2 MspEnv_Power_Pol_19 NM mA
1323568670 env sc1-msp2 MspEnv_Power_Pol_19 43331 mC

It is running at 1.8 V and 43 degrees C, which is fine. The environmental readings
usually don't include current, because the POL vendor told us that actually reading
the current could glitch the output, so we stopped polling it.
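
If you want to compare boards mechanically rather than by eye, the output is easy to parse. Below is a small sketch that assumes every data line has the six whitespace-separated fields seen above (epoch seconds, the word "env", host, sensor, value, unit) and treats NM as "no value"; Reading and parse_mspenv are names invented for this example.

```python
from typing import NamedTuple, Optional

class Reading(NamedTuple):
    timestamp: int         # seconds since the Unix epoch
    host: str              # e.g. "sc1-msp2"
    sensor: str            # e.g. "MspEnv_Power_Pol_19"
    value: Optional[int]   # None when mspenv prints NM
    unit: str              # "mV", "mA" or "mC"

def parse_mspenv(text):
    """Parse lines of the form '<epoch> env <host> <sensor> <value> <unit>'."""
    readings = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6 or fields[1] != "env":
            continue  # skip shell prompts, blank lines, annotated lines
        stamp, _, host, sensor, value, unit = fields
        readings.append(
            Reading(int(stamp), host, sensor,
                    None if value == "NM" else int(value), unit)
        )
    return readings

sample = """\
1323568670 env sc1-msp2 MspEnv_Power_Pol_19 1790 mV
1323568670 env sc1-msp2 MspEnv_Power_Pol_19 NM mA
1323568670 env sc1-msp2 MspEnv_Power_Pol_19 43331 mC
"""
for r in parse_mspenv(sample):
    if r.value is not None:
        # Drop the leading "m" and scale milli-units to V or C.
        print(r.sensor, r.value / 1000, r.unit[1:])
```

Running the same parser over output from a good board and from the suspect one gives two lists that can be compared sensor by sensor.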

-L

Jeremy Hallum

Dec 14, 2011, 2:58:37 PM

to SiCortex Users

Hi Larry,

Thanks for all of the great help. It does look like a bad POL (19).
The whole module fails a ping test over the control network, and
POL 19 indeed runs quite a bit hotter than the rest of the POLs
across the cluster:

1323886909 env sci-msp2 MspEnv_Power_Pol_15 1488 mV
1323886909 env sci-msp2 MspEnv_Power_Pol_15 NM mA
1323886909 env sci-msp2 MspEnv_Power_Pol_15 60586 mC
1323886909 env sci-msp2 MspEnv_Power_Pol_16 1466 mV
1323886909 env sci-msp2 MspEnv_Power_Pol_16 NM mA
1323886909 env sci-msp2 MspEnv_Power_Pol_16 37841 mC
1323886909 env sci-msp2 MspEnv_Power_Pol_17 1466 mV
1323886909 env sci-msp2 MspEnv_Power_Pol_17 NM mA
1323886909 env sci-msp2 MspEnv_Power_Pol_17 32351 mC
1323886909 env sci-msp2 MspEnv_Power_Pol_18 1488 mV
1323886909 env sci-msp2 MspEnv_Power_Pol_18 NM mA
1323886909 env sci-msp2 MspEnv_Power_Pol_18 35488 mC
1323886909 env sci-msp2 MspEnv_Power_Pol_19 1466 mV
1323886909 env sci-msp2 MspEnv_Power_Pol_19 NM mA
1323886909 env sci-msp2 MspEnv_Power_Pol_19 79409 mC <---
1323886909 env sci-msp2 MspEnv_Power_Pol_20 1445 mV
1323886909 env sci-msp2 MspEnv_Power_Pol_20 NM mA
1323886909 env sci-msp2 MspEnv_Power_Pol_20 40194 mC

It sounds to me like it's the last case you listed, where the POL is
running hot and causing the board power system to be shut down.
However, no logs explicitly say this, which seems odd to me, but
whatever.
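
For what it's worth, the "hotter than the rest" comparison can be made mechanical. This is just a sketch: the temperatures are the six readings quoted above, and the 30 C margin over the median is an arbitrary illustration, not a SiCortex limit.

```python
from statistics import median

# POL temperatures in millidegrees C, copied from the readings in this message.
pol_temps_mc = {
    "Pol_15": 60586,
    "Pol_16": 37841,
    "Pol_17": 32351,
    "Pol_18": 35488,
    "Pol_19": 79409,
    "Pol_20": 40194,
}

def hot_outliers(temps_mc, margin_mc=30000):
    """Return sensors running more than margin_mc above the median reading."""
    typical = median(temps_mc.values())
    return {
        name: value
        for name, value in temps_mc.items()
        if value - typical > margin_mc
    }

print(hot_outliers(pol_temps_mc))  # {'Pol_19': 79409}
```

With these six readings only Pol_19 is flagged; Pol_15 is warm but stays within the margin.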

Can you get me the board schematics, in case we need to replace it?
If you wouldn't mind privately asking Mr. Hudon if it's ok for me to
contact him about this, I would love that as well (or if anyone else
has a spare power regulator for an SC648 lying around, I'd love to hear
from you!).

Thanks very much for your help Larry, we really appreciate it.

-jeremy
