
lock timeout CPU1 on SCO 5.0.7v


Steve M. Fabac, Jr.

Apr 23, 2013, 3:44:14 PM
I have a client whom I am helping to move his 5.0.7 system to a VMWare system
running in a remote data center. A full backup of the physical 5.0.7 system
with MP5 was restored as a guest OS under VMWare, and then the SCO 5.0.7v upgrade
was installed and licensed.

The 5.0.7v VM is configured to see two CPU cores. The 5.0.7 physical server
is a Dell T310 with a 4-core Xeon processor.

The 5.0.7v system is running under vCenter Server: V5.1.0 Build 947673
and ESXi: V5.1.0 Build 1021289 (model and version provided to me as shown
upon my request to the VMWare system administrator).

The client has done some preliminary testing on the 5.0.7v VM but has not
moved the company operation from the 5.0.7 physical server to the 5.0.7v VM.

Now the problem: The 5.0.7v VM, after idling for several days from 4/15 to 4/19,
became unreachable via telnet over the VPN link to the data center, and when I
checked the vSphere console screen I saw:

login: lock timeout
CPU2:
PANIC: Lock timeout: caller=F022F7E1, lock=F0290CEC, owner=CPU1
CPU2:
CPU2: Dump not completed

** Shutdown Complete **
** Safe to Power Off this VM **


** Safe to Power Off **
-or-
** Press Any Key to Reboot **

There is nothing about the panic logged to /usr/adm/syslog.

After rebooting the 5.0.7v VM, I logged in via SSH and checked
to see when the system panicked. Syslog only shows when the system
was booted/rebooted on 4/15 and 4/19.
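The boot times that bracket the crash window can be pulled straight out of syslog. A minimal sketch, assuming the restart lines look like the "unix syslogd: restart" entry quoted later in this post; the function name list_restarts is hypothetical:

```shell
# List every syslogd restart recorded in a syslog file.
# The default path is the one used in this post; pass another file as $1.
list_restarts() {
    grep 'syslogd: restart' "${1:-/usr/adm/syslog}"
}
```

Running list_restarts with no argument prints one line per boot; the panic must have happened between the last checkdate log entry and the next restart line.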

# cd /etc/rc2.d/messages

49541 100640 root 60 Apr 19 09:48:49 2013 P75cron.log
14114 100640 root 132 Apr 19 09:49:23 2013 S85tcp.log

# cat P75cron.log
! *** cron started *** pid = 307 Fri Apr 19 09:48:48 2013
#
# cat S85tcp.log
Starting TCP services: prngd inetd snmpd sshd 19 Apr 09:49:23 ntpdate[420]: step
time server 129.6.15.28 offset 32.685633 sec
ntpd
#


Because I have had issues with NTP not working on another client's
5.0.7 system (not 5.0.7v) running under a cloud hosting service, I
had installed the same /usr/bin/checkdate script on the 5.0.7v system
and run it every 10 minutes as a cron job.
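The script itself is not shown in this post. As an illustration only, a checkdate-style script could look something like the sketch below. The function layout and the LOG and REMOTE_DATE_CMD variables are assumptions; the log path, the "507v vs 507" header and the "rcmd ounix date" command are taken from this post.

```shell
#!/bin/sh
# Hypothetical reconstruction, NOT the actual /usr/bin/checkdate.
# Appends one record per run: header, local time, remote time, blank line.
checkdate() {
    log=${LOG:-/usr/adm/check_date.log}                # assumed variable name
    remote_cmd=${REMOTE_DATE_CMD:-"rcmd ounix date"}   # assumed variable name
    {
        echo "507v vs 507"
        date                                           # local (5.0.7v) clock
        # Left unquoted on purpose so the command and its arguments split.
        $remote_cmd 2>&1 || echo "remote date failed"
        echo
    } >> "$log"
}

# When saved as a standalone script, end the file with a call:
# checkdate
#
# Assumed crontab entry for a 10-minute schedule:
# 0,10,20,30,40,50 * * * * /usr/bin/checkdate
```

Because the local date call comes first, its timestamp shows how late the job started, and the gap to the remote timestamp shows how long the remote call took plus any clock offset between the two machines.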

There is a VPN route between the client's office hosting the physical
5.0.7 server and the remote data center hosting the ESXi host and 5.0.7v VM.

The log file created by my checkdate script running on the 5.0.7v system
shows:

507v vs 507
Wed Apr 17 20:20:00 CDT 2013
Wed Apr 17 20:20:04 CDT 2013

507v vs 507
Wed Apr 17 20:30:03 CDT 2013
Wed Apr 17 20:30:04 CDT 2013
<-- NOTE! missing 20:40, 20:50, 21:00 !?!?!?
507v vs 507
Wed Apr 17 21:12:39 CDT 2013
Wed Apr 17 21:13:12 CDT 2013
<-- NOTE: missing 21:20
507v vs 507
Wed Apr 17 21:31:12 CDT 2013
Wed Apr 17 21:31:45 CDT 2013
<-- NOTE: missing 21:40
507v vs 507
Wed Apr 17 21:50:00 CDT 2013
Wed Apr 17 21:50:32 CDT 2013

507v vs 507
Fri Apr 19 09:50:32 CDT 2013 <-- after VM rebooted: Apr 19 09:48:11 unix syslogd: restart
Fri Apr 19 09:50:33 CDT 2013

507v vs 507
Fri Apr 19 10:00:00 CDT 2013
Fri Apr 19 10:00:01 CDT 2013

So the system went down (PANIC: lock timeout) sometime after 4/17 at 21:50 and, because the
system was idle and not used in production, this was not noticed until 4/19 when I was asked to
investigate the issue.

Note that the times reported in /usr/adm/check_date.log should be at about hh:mm:00
for each data point. But for the 21:10:00 crontab run, the date command returned 21:12:39
and the "rcmd ounix date" command returned 21:13:12. Then again for the 21:30:00 run, date
returned 21:31:12 and rcmd returned 21:31:45.

Note that at 21:50:00 date executed more quickly and returned 21:50:00 but "rcmd ounix
date" took an additional 32 seconds (network delay?, or something on the 507v VM
affecting execution times?).

These delays between the expected crontab execution time and the date command's response show
that the system was busy (or VMWare was busy - not enough clock cycles for the 507v VM?)
and that the /usr/bin/checkdate script took a long time to execute and report the local
and remote servers' times.

Whatever was causing the long execution time is likely to have contributed to the lock
timeout event.
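The late starts and missing runs described above can also be pulled out of the log mechanically. A rough sketch, assuming the record layout shown earlier (a "507v vs 507" header followed by a local and a remote date line), a 10-minute schedule, and an awk that supports user-defined functions; the name analyze_checkdate is hypothetical:

```shell
# Report, for each record, how late the run started after its 10-minute
# mark and how far the remote timestamp trails the local one. Also flags
# missing runs within the same day.
# Assumption: a run never starts more than 10 minutes late; otherwise it
# is attributed to the wrong slot.
analyze_checkdate() {
    awk '
    function secs(t,   p) { split(t, p, ":"); return p[1]*3600 + p[2]*60 + p[3] }
    /^507v vs 507/ { n = 0; next }
    NF >= 6 && $4 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ {
        n++
        if (n == 1) {                      # local date line
            day = $2 " " $3
            loctime = $4
            loc = secs($4)
            slot = loc - (loc % 600)       # nominal cron start
            if (day == prevday && slot - prevslot > 600)
                printf "%s: %d run(s) missing before %s\n", day, (slot - prevslot) / 600 - 1, loctime
            prevday = day; prevslot = slot
        } else if (n == 2) {               # remote date line
            printf "%s local=%s start_lag=%ds remote_minus_local=%ds\n", day, loctime, loc % 600, secs($4) - loc
        }
    }' "$@"
}
```

Usage: analyze_checkdate /usr/adm/check_date.log. A start_lag of more than a few seconds means the job started late; a remote_minus_local value that jumps from a second or two to around 30 seconds points at a slow rcmd round trip or a clock offset between the two machines.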

A Google search on lock timeout turned up a post about a 5.0.7 system with lock timeout and
kernel panic issues, and a recommendation by Bella to run scodb to dig out the kernel
symbols associated with the PANIC error message: http://preview.tinyurl.com/c4o5u4x

When I run scodb I see:

# scodb
dumpfile = /dev/mem
namelist = /unix
Cannot open stun file 'stun.def' or '/etc/conf/pack.d/scodb/defs/stun.def'
Attempting to create '/etc/conf/pack.d/scodb/defs/stun.def'
stunfile = /etc/conf/pack.d/scodb/defs/stun.def
varifile = /etc/conf/pack.d/scodb/defs/vari.def
Could not obtain cursor position
PID 522C: scodb
scodb:1> F022F7E1
F022F7E1 os_gettimeofday_internal+11 text
scodb:2> F0290CEC
F0290CEC lock_clock data
scodb:3>

I expect that the os_gettimeofday_internal+11 and lock_clock are SCO OS symbols and functions
and are not from third party drivers.

Run patchck:
INSTALLED currently on unix.XXXXXX.com
--------------------------------------------------------------------
oss667b oss667b - OpenServer Supplement oss667b
oss672a oss672a - OpenServer Supplement oss672a
oss673a oss673a - OpenServer Supplement oss673a
oss674a oss674a - OpenServer Supplement oss674a
oss676a oss676a - OpenServer Supplement oss676a
rs507d Maintenance Pack 5 for OSR5.0.7
libdnet
vmtools
patchck patchck 13030602 - Package Installation Tool
sysinfo sysinfo 11071102 - Information Gathering Tool
j2jre131 Java 2 JRE 1.3.1_22
j2plg131 Java 2 Plugin 1.3.1_22
j2sdk131 Java 2 SDK 1.3.1_22
j2jre142 Java 2 JRE 1.4.2_19
j2plg142 Java 2 Plugin 1.4.2_19
j2sdk142 Java 2 SDK 1.4.2_19
eeg eeG 5.1.2 - Intel PRO/1000 series Driver
bnxii bnxii 4.6.0 - Broadcom NetXtremeII Driver
Press_<ENTER>_to_Continue


------------------------- DOWNLOAD PACKAGES ---------------------------
MISSING packages available for DOWNLOAD:
------------------------- DOWNLOAD PACKAGES ---------------------------
1 --- SYSINFO sysinfo 13030602 - Information Gathering Tool

DOWNLOAD now?
[Y/N Default=Y] N

How do I identify (and fix) the problem that caused the lock timeout on CPU1?


--
Steve Fabac
S.M. Fabac & Associates
816/765-1670