Hey all, I had issues this morning with hung processes. The usual tools failed to help.
Our office uses a separate software package for billing. They don't provide a direct interface for extracting data to be used in VistA. So we have a script that outputs a batch report of 3 weeks worth of appts and billing codes etc that we then ingest this report weekly into VistA. The report file contains thousands and thousands of lines which are compared against the VistA database to see if the information is already known, and if not, then filed. My suspicion is that it is this process that was causing problems. (But I'm not sure, see more below.)
Last night I noticed that VistA was very sluggish, even from the roll-and-scroll interface. CPRS would allow log in, but then was unusable and eventually froze and had to manually aborted via Task Manager. I let this continue running overnight, but this morning, the situation was unchanged.
From past experience, I thought this was a problem with a Taskman task. I tried to shut taskman down, but it wouldn't. Here is monitor output:
Checking Taskman. Current $H=66500,22783 (Jan 26, 2023@06:19:43)
RUN NODE=66499,79187 (Jan 25, 2023@21:59:47)
Taskman is late by 29981 seconds. shutting down.
Checking the Status List:
Node weight status time $J
EHR:ubuntu64 RUN T-1@21:59:47 238795 Main Loop
Checking the Schedule List:
Taskman has 67 tasks scheduled.
17 of them are overdue. First task is 31783 seconds late.
Checking the IO Lists: Last TM scan: 33211 sec,
There are no tasks waiting for devices.
Checking the Job List:
There are 8 tasks waiting for partitions.
Checking the Task List:
There are 3 tasks currently running.
Checking Sub-Managers:
On node EHR:ubuntu64 there are no free Sub-Manager(s). Status: Stop
Enter monitor action: UPDATE//
==============================================================
When I told it to stop, and to instruct running tasks to stop, I got this output. There were 3 "busy" tasks.
Signaling active Tasks to STOP.
TASK: 8713261 Stop status: 0^Busy
TASK: 8739272 Stop status: 0^Busy
TASK: 8739343 Stop status: 0^Busy
==============================================================
I then tried ^ZJOB, but when I entered 'S' to get system status, I was told that "Someone else is running the System status now", and I couldn't proceed. Looking at the code, I see that the 'S' option runs ^ZSY. Looking at that code, I found this command.
L +^XUTL("XUSYS","COMMAND"):1 I '$T G LW
And it was this lock that seemed to be failing. I dumped out the ^XUTL("XUSYS",*) global and got this:
yottadb>zwr ^XUTL("XUSYS",*)
^XUTL("XUSYS",220662,0)="66499,73800"
^XUTL("XUSYS",220662,"LOCK")="^%ZTSCH(""TASK"",8713261)"
^XUTL("XUSYS",220662,"NM")="BTask 8713261"
^XUTL("XUSYS",238795,0)="66499,75925"
^XUTL("XUSYS",238795,"NM")="Taskman EHR 1"
^XUTL("XUSYS",238806,0)="66499,75950"
^XUTL("XUSYS",238806,"LOCK")="^%ZTSCH(""SUBLK"",""EHR:ubuntu64"",238806)"
^XUTL("XUSYS",238816,0)="66499,75951"
^XUTL("XUSYS",238816,"LOCK")="^%ZTSCH(""SUBLK"",""EHR:ubuntu64"",238816)"
^XUTL("XUSYS",238816,"NM")="Sub 238816"
^XUTL("XUSYS",238822,0)="66499,75955"
^XUTL("XUSYS",238822,"LOCK")="^%ZTSCH(""SUBLK"",""EHR:ubuntu64"",238822)"
^XUTL("XUSYS",238822,"NM")="Sub 238822"
^XUTL("XUSYS",238830,0)="66499,75955"
^XUTL("XUSYS",238830,"LOCK")="^%ZTSCH(""SUBLK"",""EHR:ubuntu64"",238830)"
^XUTL("XUSYS",238830,"NM")="Sub 238830"
^XUTL("XUSYS",238836,0)="66499,75958"
^XUTL("XUSYS",238836,"LOCK")="^%ZTSCH(""SUBLK"",""EHR:ubuntu64"",238836)"
^XUTL("XUSYS",238836,"NM")="Sub 238836"
^XUTL("XUSYS",238838,0)="66499,75958"
^XUTL("XUSYS",238838,"LOCK")="^%ZTSCH(""SUBLK"",""EHR:ubuntu64"",238838)"
^XUTL("XUSYS",238838,"NM")="Sub 238838"
^XUTL("XUSYS",238847,0)="66499,75958"
^XUTL("XUSYS",238847,"LOCK")="^%ZTSCH(""SUBLK"",""EHR:ubuntu64"",238847)"
^XUTL("XUSYS",238847,"NM")="Sub 238847"
^XUTL("XUSYS",238855,0)="66499,75958"
^XUTL("XUSYS",238855,"LOCK")="^%ZTSCH(""SUBLK"",""EHR:ubuntu64"",238855)"
^XUTL("XUSYS",238855,"NM")="Sub 238855"
^XUTL("XUSYS",238857,0)="66499,75958"
^XUTL("XUSYS",238857,"LOCK")="^%ZTSCH(""SUBLK"",""EHR:ubuntu64"",238857)"
^XUTL("XUSYS",238857,"NM")="Sub 238857"
^XUTL("XUSYS",238863,0)="66499,75958"
^XUTL("XUSYS",238863,"LOCK")="^%ZTSCH(""SUBLK"",""EHR:ubuntu64"",238863)"
^XUTL("XUSYS",238863,"NM")="Sub 238863"
^XUTL("XUSYS",238870,0)="66499,75958"
^XUTL("XUSYS",238870,"LOCK")="^%ZTSCH(""SUBLK"",""EHR:ubuntu64"",238870)"
^XUTL("XUSYS",238870,"NM")="Sub 238870"
^XUTL("XUSYS",238874,0)="66499,75958"
^XUTL("XUSYS",238874,"LOCK")="^%ZTSCH(""SUBLK"",""EHR:ubuntu64"",238874)"
^XUTL("XUSYS",238874,"NM")="Sub 238874"
^XUTL("XUSYS",238882,0)="66499,75965"
^XUTL("XUSYS",238882,"LOCK")="^%ZTSCH(""TASK"",8739272)"
^XUTL("XUSYS",238882,"NM")="Task 8739272"
^XUTL("XUSYS",238896,0)="66499,75947"
^XUTL("XUSYS",238896,"LOCK")="^%ZTSCH(""SUBLK"",""EHR:ubuntu64"",238896)"
^XUTL("XUSYS",238896,"NM")="Sub 238896"
^XUTL("XUSYS",238904,0)="66499,75965"
^XUTL("XUSYS",238904,"LOCK")="^%ZTSCH(""TASK"",8739343)"
^XUTL("XUSYS",238904,"NM")="BTask 8739343"
^XUTL("XUSYS",238943,0)="66499,76043"
^XUTL("XUSYS",238943,"LOCK")="^XUTL(""XUSYS"",238943,1)"
^XUTL("XUSYS",238943,"NM")="ip3.161:0"
^XUTL("XUSYS",336721,0)="66500,21072"
^XUTL("XUSYS",336721,"LOCK")="^XUTL(""XUSYS"",336721,1)"
^XUTL("XUSYS",336721,"NM")="ip3.164:0"
^XUTL("XUSYS","CNT")=19
^XUTL("XUSYS","COMMAND")="Status"
==============================================================
This didn't give me much more information. So I copied ZSY.m to TMGZSY.m and modified the code to ignore the failing locks, just so I could get more information, and got this output:
GT.M Mumps users on 26-Jan-23 06:58:01
Proc. id Proc. name PS Device Routine MODE CPU time
-------- --------------- --- -------- -------- -------
220662 BTask 8713261 hib LOCK+3^DILF -direct 00:03:07
/dev/null CLOSED :
0 OPEN RMS STREAM NOWRAP :
0-out OPEN RMS STREAM NOWRAP :
236183 S+ pts/0 +1^GTM$DMOD -dir 00:00:02
/dev/pts/0 OPEN TERMINAL NOPAST NOESCA NOREADS TYP
238795 Taskman EHR 1 hib RECORD+7^%ZTLOAD1 -direct 00:02:26
0 OPEN RMS STREAM NOWRAP :
0-out OPEN RMS STREAM NOWRAP :
238882 Task 8739272 hib RECORD+7^%ZTLOAD1 -direct 00:02:26
/dev/null OPEN :
0 OPEN RMS STREAM NOWRAP :
0-out OPEN RMS STREAM NOWRAP :
238904 BTask 8739343 hib RECORD+7^%ZTLOAD1 -direct 00:02:26
0 OPEN RMS STREAM NOWRAP :
0-out OPEN RMS STREAM NOWRAP :
238943 ip3.161:0 hib LOCK+3^DILF -run 00:02:26
/dev/null OPEN :
0 OPEN SOCKET NOWRAP TOTAL=1 CURRENT=0 :
SOCKET[0]=h1674698838000 DESC=0 CONNECTED
ZDELAY ZBFSIZE=1024 ZIBFSIZE=0 NO
========================================
And finally I can see 3 errant records which seem to be stuck at RECORD+7^%ZTLOAD1
This RECORD^%ZTLOAD1 is as below:
53 RECORD ;build record
54 S ZTINC=$G(^%ZOSF("$INC"),1) ;Set to 1 if this system has $INCREMENT, otherwise 0.
55 S ZTGOT=0
56 I 'ZTINC D ;For System that don't have $INC (GT.M, DTM, MSM)
57 . ;Find a free entry, Claim it and Lock it.
58 . L +^%ZTSK(-1):0 S ZTSK=^%ZTSK(-1) ;This is just a starting point
59 . F S ZTSK=ZTSK+1 I '$D(^%ZTSK(ZTSK)) D Q:ZTGOT
60 . . L +^%ZTSK(ZTSK):$G(DILOCKTM,3) Q:'$T ;Can we lock it
61 . . I $D(^%ZTSK(ZTSK)) L -^%ZTSK(ZTSK) ;Already claimed
62 . . S ^%ZTSK(ZTSK,.1)=0,^%ZTSK(-1)=ZTSK,ZTGOT=1 ;Claim it
63 . . Q
64 . L -^%ZTSK(-1) ;
65 . Q
and RECORD+7 seems to be line 60, which is an attempt to get a lock on ^%ZTSK(ZTSK)
I didn't evoke the command to get more information from the errant tasks, for fear that I would break something with my skipping over locks. So I wasn't able to get any further information, and I ultimately had to just manually abort the tasks from Linux in a way I'm not happy about. (See other thread discussion).
As I compose the post, I see JOBSET^ZSY is where the information is gotten from the other task:
JOBSET ;Get data from a Linux job
S (IMAGE,INAME,UNAME,PS,TNAME,JTYPE,CTIME,LTIME,RTN)=""
S (%J,PID,PROCID)=$P(%LINE,U)
S TNAME=$P(%LINE,U,2) S:TNAME="?" TNAME="" ; TTY, ? if none
S PS=$P(%LINE,U,3) ; process STATE
S PS=$S(PS="D":"lef",PS="R":"com",PS="S":"hib",1:PS)
S CTIME=$P(%LINE,U,4) ;cpu time
S JTYPE=$P(%LINE,U,6),ACCESS(JTYPE)=JTYPE
ZSYSTEM "mupip intrpt "_%J
S UNAME=$G(^XUTL("XUSYS",%J,"NM"))
S RTN="" ; Routine, get at display time
S SI=$S(MODE=0:PID,MODE=1:CTIME,1:PID)
I IMAGE D
. S INAME="mumps",I=$GET(SORT(INAME,SI))+1,SORT(INAME,SI)=I
. S SORT(INAME,SI,I)=PROCID_"~"_UNAME_"~"_PS_"~"_TNAME_"~"_JTYPE_"~"_CTIME_"~"_LTI
E S I=$GET(SORT(SI))+1,SORT(SI)=I,SORT(SI,I)=PROCID_"~"_UNAME_"~"_PS_"~"_TNAME_"~
S USERS=USERS+1
Q
The mechanism seems to be to issue an interrupt to the other task. That task then uses its own value for $ZINTERRUPT do determine what to do. My environmental setup file exports this:
export gtm_zinterrupt='I $$JOBEXAM^ZU($ZPOSITION)'
And this code is as follows:
JOBEXAM(%ZPOS) ;
N %reference S %reference=$REFERENCE
S ^XUTL("XUSYS",$J,0)=$H,^XUTL("XUSYS",$J,"INTERRUPT")=$G(%ZPOS)
I %ZPOS["^" S ^XUTL("XUSYS",$J,"codeline")=$T(@%ZPOS)
K ^XUTL("XUSYS",$J,"JE")
I $G(^XUTL("XUSYS","COMMAND"))'="EXAM" ZSHOW "SD":^XUTL("XUSYS",$J,"JE")
I $G(^XUTL("XUSYS","COMMAND"))="EXAM" ZSHOW "*":^XUTL("XUSYS",$J,"JE")
I $G(^XUTL("XUSYS",$J,"CMD"))="HALT" ;To do.
Q 1
So apparently the interrupt looks for a command in a global, and outputs results of ZSHOW back to the global for review.
By the way, ^XTER doesn't have any information because no actual error state was caught.
Questions:
1) What do you all think the underlying problem was? A race condition with Taskman?
2) ^ZSY is dependent on getting a lock, because every task looks to a common point when interrupted. But in my case, this prevented the utility from helping me. I would like to modify my code such that I can proceed despite lock failure. My system is small and I know there is no on else using ^ZSY
3) Why did my database fail to get a lock in the first place? Is there some overlap between this ^ZSY system and Taskman? They seem to be using the same global for communication. I notice from the Taskman monitor, I can enter 'S' to get system status. So perhaps ^ZSY is originally a Taskman utility.
3) What should I do in this situation in the future?
Thanks
Kevin T