
SCO OpenServer 5.0.7 on HP Server with 2 CPU dual-core - Kernel Panic because of multiple CPU


OpenServer SCO

Oct 15, 2010, 7:18:10 PM
Hello,

For 3 years now, we have been running SCO OpenServer 5.0.7 MP5 on an HP
server with two dual-core CPUs.
Suddenly, the server began to crash with a PANIC or DOUBLE PANIC as
soon as more than a few people are connected to it.
We replaced all the hardware components, but it still crashes.

After a while, I had the idea of disabling CPU2, CPU3 and CPU4 with
cpuonoff. Since then everything has been running fine, but the server
is no longer powerful enough because it is missing 3 cores.

As soon as I try to enable a second CPU again, it crashes.

What can the problem be?

Regards,
July

Pat Welch

Oct 17, 2010, 3:55:23 AM

First, post the PANIC and DOUBLE PANIC messages from the screen before
they possibly scroll off.

Which model HP server, including the "g" number?

Things to try:

Did you replace the system board? If once, try again - HP is sometimes
not as diligent as it should be about making sure motherboards are
thoroughly tested before being put back into parts stock.

Rotate RAM sticks - maybe managing threads/CPU's is using a flaky stick
for cache, and rotating the order of the sticks may make the problem
worse or better.

Make sure any replaced CPU's are rated and matched appropriately. If the
replacement CPU's are not well matched that is likely the problem.

Carefully inspect all the cables.

Replace the internal power supply, or if you have redundant supplies,
swap them in their slots.

Network problems can cause fast context switching as cpu's/cores are
assigned to handle net traffic, possibly exceeding the ability of SCO to
assign threads. Have any network changes occurred around the time the
problem appeared (new router, firewall changes, new ISP, new PC's added,
etc.)?

--
----------------------------------------------------
Pat Welch, UBB Computer Services, a WCS Affiliate
SCO Authorized Partner
Microlite BackupEdge Certified Reseller
Unix/Linux/Windows/Hardware Sales/Support
(209) 745-1401 Cell: (209) 251-9120
E-mail: pat...@inreach.com
----------------------------------------------------

OpenServer SCO

Oct 18, 2010, 3:45:46 AM
Thanks a lot for answering so fast; the situation is critical.

lock timeout
lockbstack = F0651C20, depth = 1, lockbp = F0651C2C
0 shlockb lock F02C2444 flag 00 owner 00 readers 00000001 from F00FAA73

PANIC: lock timeout; caller=F010F5F2, lock=F05C5A4C, owner = CPU3
Unable to freeze processor 2, proceeding...
Unable to freeze processor 3, proceeding...
Trying to dump 1005463 pages to dumpdev hd (1/41), 12569 pages per '.'
_________________________________________________________________
OR,
lock timeout
CPU2:
PANIC: lock timeout: caller= F0104C6C, lock=F05C5A4C, owner=CPU1

Unexpected trap in kernel mode :
cr0 0x8001003B cr2 0x000003D7 cr3 0x00002000 tlb 0x00000000
ss 0x00007888 uesp 0xF0695C04 efl 0x00010246 ipl 0x00000000
cs 0x00000150 eip 0xF0695B81 err 0x00000000 trap 0x0000000E
eax 0xF0834800 ecx 0x000003E7 edx 0xE0008900 ebx 0x00000000
esp 0xE00008C4 ebp 0xE0000A00 esi 0x00000003 edi 0x00000000
ds 0x00000160 es 0x00000160 fs 0x00000000 gs 0x00000000
cpu 0x00000001

PANIC: k_trap - Kernel mode trap type 0x0000000E
Unable to freeze processor 2, proceeding...

DOUBLE PANIC: k_trap - Kernel mode trap type 0x0000000D
Unable to freeze processor 2, proceeding...
____________________________________________________________
Server model: HP DL385 G1

All the hardware was changed.
In fact, we have a second server with exactly the same hardware (same
CPUs, motherboard, RAM, etc.).
We simply moved the disks from one server to the other, but the problem
remains.
That is why, in my view, hardware is not the problem.

Nothing changed when the problem occurred. It was 4 pm, people had been
working as usual since 7 am, and suddenly the server began to crash as
soon as people connected to it. With one or two connections the server
stays up and running, so it really seems linked to the number of
connections.
After moving the disks to a new server: same behaviour.
After disabling CPU2, CPU3 and CPU4, people can connect and work. We
have now been running for 2 days with only 1 CPU and no more crashes,
but we cannot continue like that; we don't have enough performance.
So why is it no longer possible to work with my 4 cores?

Thanks again for your help

mbennett

Oct 18, 2010, 11:13:24 AM

Run 'scoadmin license' and make sure the additional CPU license is
still registered.

Bela Lubkin

Oct 18, 2010, 11:19:10 AM
July wrote:

> lock timeout
> lockbstack = F0651C20, depth = 1, lockbp = F0651C2C
> 0 shlockb lock F02C2444 flag 00 owner 00 readers 00000001 from F00FAA73
>
> PANIC: lock timeout; caller=F010F5F2, lock=F05C5A4C, owner = CPU3
> Unable to freeze processor 2, proceeding...
> Unable to freeze processor 3, proceeding...
> Trying to dump 1005463 pages to dumpdev hd (1/41), 12569 pages per '.'
> _________________________________________________________________
> OR,
> lock timeout
> CPU2:
> PANIC: lock timeout: caller=F0104C6C, lock=F05C5A4C, owner=CPU1
>
> Unexpected trap in kernel mode :
> cr0 0x8001003B cr2 0x000003D7 cr3 0x00002000 tlb 0x00000000
> ss 0x00007888 uesp 0xF0695C04 efl 0x00010246 ipl 0x00000000
> cs 0x00000150 eip 0xF0695B81 err 0x00000000 trap 0x0000000E
> eax 0xF0834800 ecx 0x000003E7 edx 0xE0008900 ebx 0x00000000
> esp 0xE00008C4 ebp 0xE0000A00 esi 0x00000003 edi 0x00000000
> ds 0x00000160 es 0x00000160 fs 0x00000000 gs 0x00000000
> cpu 0x00000001
>
> PANIC: k_trap - Kernel mode trap type 0x0000000E
> Unable to freeze processor 2, proceeding...
>
> DOUBLE PANIC: k_trap - Kernel mode trap type 0x0000000D
> Unable to freeze processor 2, proceeding...

This has a bunch of useful addresses in it, which unfortunately are not
given in symbolic form. You can translate them by using /etc/scodb (as
root):

# scodb
scodb:1> F02C2444
[answer]
scodb:2> F00FAA73
[answer]
...

Translate these addresses (use copy/paste to avoid typos in both the
inputs & answers): F00FAA73 F0104C6C F010F5F2 F02C2444 F05C5A4C F0695B81
F0695C04 F0834800.

Post the translations, they may tell what the problem is.
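
To avoid retyping, the addresses can be pulled out of saved panic text
mechanically. A minimal Python sketch, assuming the panic output was
captured into a string and that the addresses of interest are the
8-digit hex values starting with F0, as in the output quoted above.
The function name kernel_addresses is made up for illustration; it is
not part of scodb or any SCO tool.

```python
import re

# Assumption: kernel addresses in this panic output are 8 hex digits
# starting with "F0", optionally prefixed with "0x". This matches the
# values seen in this thread; it is not a general rule.
ADDR_RE = re.compile(r"\b(?:0x)?(F0[0-9A-F]{6})\b")


def kernel_addresses(panic_text):
    """Return the unique kernel-looking addresses in panic_text, sorted."""
    return sorted(set(ADDR_RE.findall(panic_text)))


# A few lines copied from the panic output quoted above.
sample = """\
0 shlockb lock F02C2444 flag 00 owner 00 readers 00000001 from F00FAA73
PANIC: lock timeout; caller=F010F5F2, lock=F05C5A4C, owner = CPU3
PANIC: lock timeout: caller= F0104C6C, lock=F05C5A4C, owner=CPU1
ss 0x00007888 uesp 0xF0695C04 efl 0x00010246 ipl 0x00000000
cs 0x00000150 eip 0xF0695B81 err 0x00000000 trap 0x0000000E
eax 0xF0834800 ecx 0x000003E7 edx 0xE0008900 ebx 0x00000000
"""

for addr in kernel_addresses(sample):
    print(addr)
```

Each printed address can then be pasted at the scodb prompt one at a
time.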

>Bela<

OpenServer SCO

Oct 19, 2010, 6:02:35 AM
# scodb
dumpfile = /dev/mem
namelist = /unix
Cannot open stun file 'stun.def' or '/etc/conf/pack.d/scodb/defs/stun.def'
Attempting to create '/etc/conf/pack.d/scodb/defs/stun.def'
stunfile = /etc/conf/pack.d/scodb/defs/stun.def
varifile = /etc/conf/pack.d/scodb/defs/vari.def
Warning: scodb can only perform ansi output
PID 38FE: scodb
scodb:1> F00FAA73
F00FAA73 in_slowtimo+27 text
scodb:2> F0104C6C
F0104C6C tcp_linput+218 text
scodb:3> F010F5F2
F010F5F2 tcp_slowtimo+1E6 text
scodb:4> F02C2444
F02C2444 in_lock_timer data
scodb:5> F05C5A4C
F05C5A4C tcb+7C data
scodb:6> F0695B81
F0695B81 net0cardinfo.route_table+1 data
scodb:7> F0695C04
F0695C04 net0cardinfo.down_queue+78 data
scodb:8> F0834800
F0834800 syssegs+34800 data
scodb:9>
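
With the translations in hand, the symbolic names can be substituted
back into the panic text. A small Python sketch of that substitution,
using only the address/symbol pairs from the scodb session above; the
symbolize helper and the leading "&" notation are illustrative choices,
not scodb features.

```python
import re

# Address -> symbol pairs copied from the scodb session above.
SYMBOLS = {
    "F00FAA73": "in_slowtimo+27",
    "F0104C6C": "tcp_linput+218",
    "F010F5F2": "tcp_slowtimo+1E6",
    "F02C2444": "in_lock_timer",
    "F05C5A4C": "tcb+7C",
    "F0695B81": "net0cardinfo.route_table+1",
    "F0695C04": "net0cardinfo.down_queue+78",
    "F0834800": "syssegs+34800",
}

# Any 8-digit uppercase hex value, with or without a "0x" prefix.
ADDR_RE = re.compile(r"\b(?:0x)?([0-9A-F]{8})\b")


def symbolize(text):
    """Replace known addresses with &symbol; leave everything else alone."""
    def repl(match):
        symbol = SYMBOLS.get(match.group(1))
        return "&" + symbol if symbol else match.group(0)
    return ADDR_RE.sub(repl, text)


line = "PANIC: lock timeout; caller=F010F5F2, lock=F05C5A4C, owner = CPU3"
print(symbolize(line))
```

Values that are not in the table (flags, register contents, CPU
numbers) pass through unchanged.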

Bela Lubkin

Oct 21, 2010, 4:47:15 AM
Substituting in symbolic names, July wrote:

> lock timeout
> lockbstack = F0651C20, depth = 1, lockbp = &lockbstack+C
> 0 shlockb lock &in_lock_timer flag 00 owner 00 readers 00000001 from &in_slowtimo+27
>
> PANIC: lock timeout; caller=&tcp_slowtimo+1E6, lock=&tcb+7C, owner = CPU3
> Unable to freeze processor 2, proceeding...
> Unable to freeze processor 3, proceeding...
> Trying to dump 1005463 pages to dumpdev hd (1/41), 12569 pages per '.'
> _________________________________________________________________
> OR,
> lock timeout
> CPU2:
> PANIC: lock timeout: caller=&tcp_linput+218, lock=&tcb+7C, owner=CPU1
>
> Unexpected trap in kernel mode :
> cr0 0x8001003B cr2 0x000003D7 cr3 0x00002000 tlb 0x00000000
> ss 0x00007888 uesp &net0cardinfo.down_queue+78 efl 0x00010246 ipl 0x00000000
> cs 0x00000150 eip &net0cardinfo.route_table+1 err 0x00000000 trap 0x0000000E
> eax &syssegs+34800 ecx 0x000003E7 edx 0xE0008900 ebx 0x00000000
> esp 0xE00008C4 ebp 0xE0000A00 esi 0x00000003 edi 0x00000000
> ds 0x00000160 es 0x00000160 fs 0x00000000 gs 0x00000000
> cpu 0x00000001

The addresses given for UESP and EIP appear to be inside a data
structure, which is not a good sign...
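
For readers decoding the trap frame: on x86, vector 0x0E is a page
fault and 0x0D is a general protection fault, and for a page fault CR2
holds the faulting address. A rough Python sketch of that decoding,
assuming the "trap" and "err" fields in the dump are the raw hardware
vector and error code (an assumption about how the kernel prints them);
describe_trap is a made-up helper name.

```python
# x86 exception vectors that appear in this thread (architecture-defined).
TRAP_NAMES = {0x0D: "general protection fault", 0x0E: "page fault"}


def describe_trap(trap, err, cr2):
    """Describe an x86 trap; decode the error code for page faults only."""
    name = TRAP_NAMES.get(trap, "trap 0x%02X" % trap)
    if trap != 0x0E:
        return name
    # Page-fault error code bits: bit 0 = page was present,
    # bit 1 = access was a write, bit 2 = fault came from user mode.
    cause = "protection violation" if err & 0x1 else "non-present page"
    access = "write" if err & 0x2 else "read"
    mode = "user" if err & 0x4 else "kernel"
    return "%s: %s-mode %s, %s, address 0x%08X" % (name, mode, access, cause, cr2)


# Values from the first trap in the dump: trap 0x0E, err 0, cr2 0x3D7.
print(describe_trap(0x0E, 0x00000000, 0x000003D7))
# The DOUBLE PANIC reported trap type 0x0D.
print(describe_trap(0x0D, 0, 0))
```

Under those assumptions the first trap reads as a kernel-mode access to
an unmapped address very close to zero, which fits with execution
having wandered into a data structure.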

I don't recognize the structure `net0cardinfo'. What kind of NIC and
NIC driver is this system using?

The panic is a lock timeout in the TCP/IP stack, generally caused by a
deadlock between two routines on different CPUs. You've also collected
evidence of both routines: tcp_slowtimo() and tcp_linput().
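
As a generic illustration of how two routines can each end up waiting
on a lock the other holds until a timeout fires, here is a user-space
Python sketch. It is not the OpenServer kernel's locking code: the two
locks, the worker names cpu_a and cpu_b, and the 0.2 second timeout are
all hypothetical stand-ins.

```python
import threading

lock_1 = threading.Lock()          # hypothetical stand-ins for two kernel locks
lock_2 = threading.Lock()
rendezvous = threading.Barrier(2)  # keeps the two workers in step
timeouts = []                      # names of workers whose second acquire timed out


def worker(name, first, second):
    with first:
        rendezvous.wait()                # both workers now hold their first lock
        if second.acquire(timeout=0.2):  # analogous to the kernel's lock timeout
            second.release()
        else:
            timeouts.append(name)        # the kernel would panic at this point
        rendezvous.wait()                # hold the first lock until both have tried


# The two workers take the same locks in opposite order.
t_a = threading.Thread(target=worker, args=("cpu_a", lock_1, lock_2))
t_b = threading.Thread(target=worker, args=("cpu_b", lock_2, lock_1))
t_a.start()
t_b.start()
t_a.join()
t_b.join()

print(sorted(timeouts))
```

Both workers time out because neither releases its first lock while the
other is still waiting. With only one CPU enabled the two routines can
never run at the same instant, which would be consistent with the
crashes stopping once the extra cores were disabled.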

Without getting into too much more analysis: something is wrong with
your NIC. Try swapping units or using a different brand/model.

>Bela<

OpenServer SCO

Oct 21, 2010, 6:21:39 AM
Hi,
Thank you very much for all your information and help.

Here is our network card:
HW HP NC7782 Gigabit Server Adapter - PCI Bus# 3,Device# 6,Function#
    SCO TCP/IP
    - SCO NFS Runtime System

name=bcme0 vec=5 dma=- chip=BCM5704 mem=F7FF0000 addr=00:17:08:50:78:34

The drivers come from the
HP ProLiant Extended Feature Supplement (ver 5.74a).

FYI, we had to disable ASF on the network card because it was not
supported by the driver (a known problem in the SCO OpenServer knowledge
base).

What's strange is that the NIC is on the motherboard, and on the day the
problem occurred we swapped motherboards, but again, same behaviour.
That is why I assumed it cannot be a hardware problem...

Pat Welch

Oct 21, 2010, 7:33:12 PM
On 10/21/2010 3:21 AM, OpenServer SCO wrote:
> On Oct 21, 10:47 am, Bela Lubkin <fi...@armory.com> wrote:
>> Substituting in symbolic names, July wrote:
>>
>>> lock timeout
>>> lockbstack = F0651C20, depth = 1, lockbp = &lockbstack+C
>>> 0 shlockb lock &in_lock_timer flag 00 owner 00 readers 00000001 from &in_slowtimo+27
>>
>>> PANIC: lock timeout; caller=&tcp_slowtimo+1E6, lock=&tcb+7C, owner = CPU3
>>> Unable to freeze processor 2, proceeding...
>>> Unable to freeze processor 3, proceeding...
>>> Trying to dump 1005463 pages to dumpdev hd (1/41), 12569 pages per '.'
>>> _________________________________________________________________
>>> OR,
>>> lock timeout
>>> CPU2:
>>> PANIC: lock timeout: caller=&tcp_linput+218, lock=&tcb+7C, owner=CPU1
>>
>>> Unexpected trap in kernel mode :
>>> cr0 0x8001003B cr2 0x000003D7 cr3 0x00002000 tlb 0x00000000
>>> ss 0x00007888 uesp &net0cardinfo.down_queue+78 efl 0x00010246 ipl 0x00000000
>>> cs 0x00000150 eip &net0cardinfo.route_table+1 err 0x00000000 trap 0x0000000E
>>> eax &syssegs+34800 ecx 0x000003E7 edx 0xE0008900 ebx 0x00000000

I would update your EFS version to 5.790a, found here:

http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareDescription.jsp?lang=en&cc=US&swItem=MTX-8a01f00748d5406cb5f41e9298

(Make it all one line if your Email client has wrapped it around)

And be sure to do the separate step of updating the NIC drivers after
it's installed (it will prompt you with the name of the program,
hpnicinstall I think.)

Be sure to check ASF status and disable it again if needed.

Bela Lubkin

Oct 23, 2010, 1:24:57 AM
July wrote:

> Here is our Network Card
> HW HP NC7782 Gigabit Server Adapter - PCI Bus# 3,Device# 6,Function#
> name=bcme0 vec=5 dma=- chip=BCM5704 mem=F7FF0000 addr=00:17:08:50:78:34
>
> Drivers are Coming from
> hp ProLiant Extended Feature Supplement (ver 5.74a)
>
> What's strange is that the NIC is on the motherboard, and the day the
> problem occured, we swap motherboards but again, same behaviour.
> So, that's why I supposed that it cannot be a hardware problem ...

That's a reasonable conclusion. So, besides the things that Pat Welch
recommended, consider whether there have been any environmental changes
around the system. New routers, router firmware upgrade, changes to
network topology, firewall changes, new machines, software or protocols
on the network...?

>Bela<

OpenServer SCO

Oct 27, 2010, 8:47:00 AM
Thanks a lot for your help.
Currently we are unable to upgrade the EFS because I'm not on site, and
I'm the only one who can do it.
Anyway, we've found a workaround until we can try the EFS upgrade.

We installed a single dual-core CPU, and it seems to work even with both
cores enabled, as if the problem is having a 2nd physical processor.

As soon as I have upgraded the EFS, I'll post my feedback here.

It is really strange, because everything on this site has been stable
for 3 years: no new server, no new router, no new network, and no new
users.

Regards,
July
