Question for the old timers: How did you do locks prior to the M90 standard?

154 views
Skip to first unread message

Sam Habiel

unread,
May 13, 2024, 2:38:21 AMMay 13
to hardhats
I am teaching an M class, and looking into teaching locks.

I only ever learned and used the M90 version of locks; you may know that the pre-90 version is now prohibited in the SAC: for good reason: it consists of gaining exclusive locks. You cannot do an incremental lock (lock +^x). Any lock you gain means that you are unlocking something else.

So practically speaking... that's the question... how were locks used in what was then DHCP? Was it assumed that Fileman would do locks, or were all applications required to do their own locking?

--Sam

Ed de Moel

unread,
May 13, 2024, 6:11:38 PMMay 13
to Hardhats
That's a long time ago...
I'm not sure I remember correctly, but what I recall is that the SAC didn't allow the use of the LOCK command back then, and that FileMan took care of all locks internally. Maybe George knows better...

Sam Habiel

unread,
May 13, 2024, 10:12:20 PMMay 13
to hard...@googlegroups.com
Thank you Ed.

PS: I keep referencing the Annotated Standard site that you maintain. Thank you for maintaining it.

--Sam

--
--
http://groups.google.com/group/Hardhats
To unsubscribe, send email to Hardhats+u...@googlegroups.com

---
You received this message because you are subscribed to the Google Groups "Hardhats" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hardhats+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/hardhats/9c8b15cf-d282-4e90-a2aa-1b61033bf92an%40googlegroups.com.

George Timson

unread,
May 14, 2024, 6:37:47 PMMay 14
to Hardhats
Hola, guys!   I'm sorry that I can't add anything from personal recollection to what Ed de Moel wrote.  As he said, it was a long time ago...

Apropos of teaching MUMPS, though, I remember a joke about a supposed MUMPS syllabus:
LESSON #1: The MUMPS language without the LOCK command.
LESSON #2: The LOCK command.
LESSON #3: The LOCK command done correctly.

--George

Jamie Crumley

unread,
May 14, 2024, 6:49:53 PMMay 14
to hard...@googlegroups.com
I don't know about DHCP, so this probably isn't helpful, but in my previous life, before incremental locks...

We did a lot of planning, of course, but if we added a new lock, we respecified any that were supposed to be held. 

So L ^A(1,1)  Then you need to keep that but add a different file you would 
L ^A(1,1),^B(1,2)  

Oddly, it's been so many years I'm struggling to remember the exact syntax. Lol. And I'm not in front of a keyboard.

--
--
http://groups.google.com/group/Hardhats
To unsubscribe, send email to Hardhats+u...@googlegroups.com

---
You received this message because you are subscribed to the Google Groups "Hardhats" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hardhats+u...@googlegroups.com.

David Wicksell

unread,
May 14, 2024, 7:29:44 PMMay 14
to Hardhats
Hi Jamie,

 The syntax you want is:

LOCK (^A(1,1),^B(1,2))

 In your example, you have two lock arguments; execution of the second lock argument would end up
removing the lock reference in the first lock argument. In order to treat multiple lock references as one lock
argument, you must surround them in parentheses. If it is not possible to acquire every lock in a parenthesized
list of lock references, then none of them will be acquired. It's all or nothing.

 You mentioned you didn't remember the syntax, so that is why I decided to respond. This syntax goes all the way
back to the 1977 standard. I suspect that DHCP/VistA code would have done exactly what you suggest, given that
without incremental locking, there is no other way to hold two different lock references at once in the same job.

David Wicksell
Fourth Watch Software LC

Greg Kreis

unread,
May 15, 2024, 1:42:31 PMMay 15
to hard...@googlegroups.com

Hmmm...   One could have a standard variable with a list of locks that gets passed along and is added to or removed from and then re-locked with indirection. Though I don't ever remember seeing that in VistA. I don't know how MUMPS would handle a possible race condition where you are dropping all locks and then asking to re-lock a list. Is it an atomic operation or two distinct operations (release and acquire) in one syntax?

To view this discussion on the web visit https://groups.google.com/d/msgid/hardhats/3501b686-ea99-43f5-b5f9-309eede2e0dan%40googlegroups.com.
-- 
=================
Greg Kreis
Pioneer Data Systems, Inc.
770-509-2583
http://www.PioneerDataSys.com

David Wicksell

unread,
May 15, 2024, 1:58:59 PMMay 15
to Hardhats
Hi Greg,

 Yep, you could use indirection like that. The standard does not specify that scenario as an atomic operation. If you had a list of locks, and you added one to it, and then went to lock it with the old-style locks, using a lock list with parentheses, while at the same time, another job was waiting for one or more of the locks in the list, then your job would release all of its locks, the other job would acquire them, and the re-locking of the new list with the added lock or locks, would fail, or hang, trying to re-acquire them.

 That's one of many reasons why incremental locking was added in the 1990 standard. Though it can be argued that jobs shouldn't hog locks in scenarios like that, and when a job releases all of its locks, in order to add a new one for the next step of processing, other jobs can then continue to do their work that is guarded by those locks, and there is less job lock resource starvation. Thankfully we have a more powerful interlocking facility now, with more options to support more workflows. Thank you.

David Wicksell
Fourth Watch Software LC

Frederick D. S. Marshall

unread,
May 15, 2024, 3:39:55 PMMay 15
to hard...@googlegroups.com
David has it exactly right.

Here is an example from 1990 in Kernel version 6.5, specifically Task Manager, in routine %ZTLOAD1, which implements Taskman API service REQ^%ZTLOAD. It requires some explanation to fully describe the complex problem it's trying to solve, so this is a long email message. But I hope it's useful, because this example and its history illustrate most of the issues involving the multilock syntax. It includes some well crafted algorithms that were good examples to follow, as well as a couple mistakes I made back in 1990 that are bad examples to be avoided, all of which contributes to fully illuminating the issues involved with this syntax, and it culminates by describing an entirely different and simpler strategy that worked better.

For those unfamiliar with Vista Taskman, a task is a job an application wants started at a certain time and with certain resources (such as perhaps exclusive access to a file or network channel). Taskman's main API service is called to create a task record and to schedule it to run, and Taskman, which runs perpetually as a background job, watches its list of tasks and schedules so it will notice the new task and ensure it starts as specified.

This REQ^%ZTLOAD service, when passed a task ID and some modified properties, must dequeue an existing task from whatever resources it is currently scheduled to run with, then modify the task to match the new properties, then requeue it. The dequeuing step is the tricky part, because Taskman operated with what was in Vista in 1990 an unprecedented degree of parallel processing, with potentially hundreds of M jobs any one of which might or might not try to manipulate this task at the same time REQ^%ZTLOAD is running. From the perspective of typical system loads today, the loads back then were a minor amount of parallelism, but at the time it was pushing hard against the boundaries of what PDP-11s and the existing Vista algorithms could handle.

There are many ways for such an algorithm to go wrong, as VA found out the hard way.

Just a few years before, in Taskman 5.01, which was the version in the field when I began working on Taskman, a similar degree of parallelism existed, but the strategies the old Taskman code tried to use to manage it were not up to the challenge. Scheduled tasks frequently disappeared without ever running or were corrupted and came out wrong, because of defects in the algorithms at the time for how tasks were created and managed. Users were taught to just repeat their requests and create a new task if the old one didn't run when they expected it to, since it might have vanished. Given the delays caused by limited hardware and algorithm problems, this often meant that when tasks weren't disappearing or being corrupted they were running more than once, due to understandably frustrated and impatient users queuing them again.

It took me six rewrites and three years of work on Taskman 6, released in 1989, to untangle the old Taskman code, which suffered from severe patchitis as well as never having been written in a way that fully addressed the challenges of so much parallelism. Although Taskman 6 was a huge improvement, in it I avoided the use of the kind of multilock syntax David described, because I shared a widespread fear of deadlock situations.

Much discussion and several articles in MUG Quarterly had been devoted to the dangers of having one job try to lock two nodes in one order at the same time a second job was trying to lock them in the other order, resulting in both jobs being stuck indefinitely and unable to proceed because the other job had what they needed to complete the command. Dr. Ruth Dayhoff in particular had written a fantastic and well illustrated article that discussed various M locking strategies and their pros and cons, and this was a big influence on my thinking about how locking should work in Taskman. Back then there was so much awareness of deadlocks that you could hardly mention the word "lock" to another M programmer without them asking how you were going to avoid deadlocks. It was the zeitgeist.

So where multiple locks were needed, a rare situation for most algorithms back then, they were almost always broken up into separate lock commands, each with its own timeout and check to ensure the algorithm did something intelligent if the lock it wanted was not available. This is how all Taskman 6's multiple-lock algorithms were written.

That was especially necessary in Taskman 6, because when I inherited Taskman the convention throughout Vista was for people to create, schedule, and otherwise manipulate tasks through direct sets into Taskman's globals. Taskman's database had no encapsulation at all, which was a disaster for long-term planning and development. That was how Dr. Dave Wilson of Oklahoma City had designed the original Taskman, back when it was locally created and run at just that one VA hospital, and it remained true when the Kernel development team adopted his work nationally. This meant I had no control whatsoever over the order in which different jobs might lock, pretty much guaranteeing a multilock syntax would deadlock with some other job sooner or later. It also meant I couldn't develop Taskman's global structures in new directions without breaking more or less every Vista application, an intolerable degree of design paralysis, so I committed early in my work on Taskman to changing this situation. Taskman's database needed to be encapsulated to get control over the data and algorithms and make its evolution possible.

The seriousness of this problem was recognized by Greg Kreis, who created the "Taskman Loader," aka ^%ZTLOAD, Taskman's first API service, the one everyone uses now to create tasks. His work was released in Taskman version 5.01, ca. 1986, and it was popular because it made creating tasks so much simpler, but at the time I began work on Taskman in 1986 no widespread effort had yet been made to systematically eliminate the old direct global sets, so both the deadlock risk and the design paralysis problems remained overwhelming.

For Taskman 6, therefore, I avoided the multilock syntax.

But with the help of Kernel team leader Hans von Blanckensee I identified a complete list of all the kinds of additional task manipulations being done throughout Vista and extended Taskman's new API to include a service for each one, thus removing any need for direct manipulation of Taskman's globals. We then launched a nationwide effort to replace all code anywhere in Vista that directly accessed Taskman's globals with a call to the equivalent Taskman 6 API service. To make it as easy as possible for all those busy Vista programmers to convert their applications as swiftly as possibly, I searched all of Vista's code for direct global references, identified what the replacement code should be, and reached out to the developers in question to walk each of them personally through the process of replacing their code. A flurry of national patches was swiftly released, so by the time Taskman 6.5 was going to be released in 1990 I knew there would not be a single direct access of Taskman's globals left outside Taskman's code.

(As an aside, this illustrates the speed with which even widespread and revolutionary architectural changes can be made in Vista, thus putting the lie to the claim that because Vista is "old" and "tangled" VA has no choice but to bankrupt itself trying to adopt Cerner. A healthy community of Vista developers is fully capable of revolutionizing Vista in short order; any claim to the contrary is disproven by this actual empirical experience.)

The result was that with Taskman 6.5 for the first time it was safe for me to introduce the multilock syntax wherever Taskman needed to lock two different globals at once, because for the first time ever the Taskman developer had complete control over every such lock. This multiple-lock situation came up frequently, because the main task record is stored in global ^%ZTSK (the task global), but the scheduling information that contains the queues where active tasks wait in line to begin is over in ^%ZTSCH (the schedule global), and there are multiple queues there a task might be waiting in.

Dequeuing a task, editing it, and then requeuing it with the new properties thus had been a fraught process, given all the other jobs that might get their fingers into the middle of such an operation, corrupting the results, as had happened so often back with Taskman 5.01. Even with the freshly rewritten Taskman 6, and even after all the other Vista apps got their fingers out of Taskman's globals, a problem still remained that Taskman himself had many parallel jobs running faster than a human being could possible react. Although Taskman 6 eliminated the possibility of tasks simply vanishing or being corrupted, multistep operations such as requeue could still fail "cleanly" between two locks, because some other Taskman job got a look at the incomplete record in between the completion of the two separate lock commands and took action based on that inconsistent database state. So Taskman needed the multilock syntax, and during Taskman 6.5 development for the first time it was a potentially safe option.

Even with Taskman in complete control of all locks of its own globals, it was still all too easy to create deadlocks just by writing the same multilock in two different Taskman algorithms if I didn't always lock the nodes in the same order. With Taskman's extreme parallelism it is a guarantee sooner or later that any code fragment anywhere in Taskman's codebase will run at the same time as any other Taskman code fragment, sometimes hundreds or thousands of times in quick succession.

So for Taskman 6.5, to make the new multilock syntax safe I adopted strict conventions for the order of locking. Any time two nodes were to be locked in a multilock syntax, in every algorithm throughout Taskman where that pair occurred I would always lock them in the exact same order, thus preventing any deadlock situation. For the majority of such situations, the pair to be locked was (1) the task's record in ^%ZTSK and (2) a list where the task was scheduled, somewhere in ^%ZTSCH.

I generalized that pattern with a second strict convention that Taskman was never to take an action that altered the status of any task until it held a lock on that task in ^%ZTSK. This made it possible to interrupt all task-management operations on any single task anywhere in Task Manager simply by locking that task's record. Our interrupt lock would either succeed before those other operations began, if we got to the lock first, or it would succeed after those operations completed, if they had the lock first, in either case to create an early partial simulation of a database transaction using locking. This was the first step in a migration for Taskman that's still underway and will culminate in the introduction of true transaction processing and the elimination of the majority of Taskman's LOCK commands, significantly reducing bottlenecks and increasing performance. That completion is still a few years away, but back in 1990 it meant operations such as requeuing a task could complete successfully even if many other jobs were contending for access to that task.

So here are two stanzas from %ZTLOAD1 version 6.5 in 1990 that illustrate what David described and Sam and Jamie were asking about. DQ dequeues a task. FINAL applies the edits to the dequeued task and then requeues it with that revised definition, before cleaning up the symbol table.

 ;
DQ ;Dequeue Task (Ensure No Duplicates)
 I $D(^%ZTSCH("JOB")) L (^%ZTSK(ZTSK),^%ZTSCH("JOB")) S %ZTU="" F %ZTL=0:0 S %ZTU=$O(^%ZTSCH("JOB",%ZTU)),%ZTP="" Q:%ZTU=""  F %ZTL=0:0 S %ZTP=$O(^%ZTSCH("JOB",%ZTU,%ZTP)) Q:%ZTP=""  I $D(^(%ZTP,ZTSK)) K ^(ZTSK)
 I $D(^%ZTSCH("IO")) L (^%ZTSK(ZTSK),^%ZTSCH("IO")) S %ZTIO="" F %ZTL=0:0 S %ZTIO=$O(^%ZTSCH("IO",%ZTIO)),ZTUCI="" Q:%ZTIO=""  F %ZTL=0:0 S ZTUCI=$O(^%ZTSCH("IO",%ZTIO,ZTUCI)) Q:ZTUCI=""  I $D(^(ZTUCI,ZTSK)) D DQ^%ZTM3
 L ^%ZTSK(ZTSK) S %ZTD=0 F %ZTL=0:0 S %ZTD=$O(^%ZTSCH(%ZTD)) Q:'%ZTD  I $D(^%ZTSCH(%ZTD,ZTSK)) K ^(ZTSK)
 ;
FINAL ;Requeue, Cleanup, And Quit
 L ^%ZTSK(ZTSK) S ZTSK(0)=$D(^%ZTSK(ZTSK)) I ZTSK(0) S ^%ZTSK(ZTSK,0)=ZTREC,^(.1)=ZTREC1,^(.2)=ZTREC2,^(.25)=ZTREC25 L ^%ZTSCH(ZTDTH,ZTSK) S ^%ZTSCH(ZTDTH,ZTSK)=""
 L  K %H,%T,%Y,%ZTD,%ZTF,%ZTH,%ZTI,%ZTL,%ZTP,%ZTU,X,Y,ZTA1,ZTA4,ZTA5,ZTC1,ZTC2,ZTDESC,ZTDTH,ZTDUZ,ZTI,ZTIO,ZTREC,ZTREC1,ZTREC2,ZTREC25,ZTRTN Q
 ;


This was a highly designed and tuned passage of code, so let's walk through the locks to make clear what it's doing and why. Many of the essential lessons about locking are contained in these few lines and their history.

In DQ, the first line dequeues the task from the list of tasks that have been validated and are now waiting for a job to run in, the JOB list. When this line begins execution, %ZTLOAD1 holds no locks. The line is skipped if there's no JOB list, since that means our task can't be waiting in line within it. As per the Standard's definition of the multilock syntax, the lock itself only succeeds when both nodes are available to lock. As per the Taskman 6.5 strict locking convention—combined with the Vista Standards and Conventions, to which we had recently added the requirement to keep out of Taskman's globals and use its API instead—no other job in Vista (not Taskman nor the other apps) can possibly hold a lock on ^%ZTSCH("JOB") and seek to add ^%ZTSK(ZTSK) to it, so deadlock is impossible. (ZTSK itself is our task's ID number.)

Note that in ^%ZTSK we are not locking the whole global but only the specific task we're interested in, to avoid bottlenecking all task management during this operation, yet in ^%ZTSCH we're locking the entire JOB list, so only one Taskman 6.5 job at a time can modify the JOB list. By the time of the next version, Taskman 7 in 1992, I decided it was unnecessary overkill to lock the entire JOB list and introduced the ability to have multiple Taskman jobs working on that list at the same time, but back in 1990 with Taskman 6.5 I was still worried about controlling the order of operations to eliminate gaps in coverage, so I decided to create this bottleneck (which I later had to remove in Taskman 7). I was aware of the problem at the time, but I couldn't just lock our task's node in the JOB list because neither of the next two subscripts in that list is the task ID—that only comes third next after those two subscripts—so as a result of this structure we couldn't know which node to lock until after we searched the list to find it, so in the interests of transactional integrity we locked the whole JOB list. It worked, but on highly loaded systems throughout VA we saw the effects of the bottleneck, which motivated changing this algorithm in Taskman 7.

The rest of line 1 searches for the task in the JOB list and removes it wherever it finds it.

The second line in DQ does the exact same thing with the IO list, which is the list of tasks that are waiting for a needed I/O device to become available. Structurally line 1 is parallel to line 2, other than the different subscripts that make up the IO list. And here, too, we lock the entire IO list, creating another single-threaded access that proved a bottleneck until rewritten in Taskman 7.

The main point of interest here is the multilock itself, which executes while the two locks from line 1's multilock are still being held. As per the Standard, the lack of an incremental or decremental prefix for LOCK (introduced in M3, the third edition of the M Standard in 1990, but not in use in most of Vista back then, including Taskman 6.5) means this second lock will entirely release the lock from line 1, but it will only do so after both locks in the line 2 multilock can be acquired. In effect, this is a miniature transaction; either all the steps in this exchange of locks succeed, or we don't even begin to make the change, so we can never end up with some partially locked situation where we have lost some locks and acquired other but still not yet acquired the full set.

What makes this multilock interesting is that the first node in the list is identical to the first node in the list on line 1 (due to the strict convention in Taskman 6.5 I mentioned above). The effect of this is to do something quite impossible without the multilock syntax, until the advent of incremental and decremental locking—to release one of two locks we currently hold while simultaneously acquiring another lock. Out of fears of deadlocks back in the 1970s, the original definition of LOCK defines it as a two-part operation—(1) release all locks currently held, and (2) acquire the new lock(s) specified in the argument of the LOCK command. That "release everything" step doesn't happen until step 2 is possible, so they happen together, but the effect militates strongly against the incremental accumulation of a variety of locks, since every LOCK command in effect resets our lock list back to empty, aside from what we're explicitly adding back in. But by using the multilock syntax in this way, I was able to simulate decremental and incremental locking.

Taskman was not the only Vista app back then that used multilocks, but most avoided it completely, whereas it was rife throughout Taskman 6.5 and essential to the design of its locking strategy, which is why I chose this example.

Thus the multilock on line 1 acquires two locks, ^%ZTSK(ZTSK) and ^%ZTSCH("JOB"). The multilock on line 2 keeps the ^%ZTSK(ZTSK) lock, but releases ^%ZTSCH("JOB") as soon as it can acquire ^%ZTSCH("IO"). If we were using M3's incremental and decremental locks throughout %ZTLOAD3, this multilock syntax on line 2 might be the equivalent of L (-^%ZTSCH("JOB"),+^%ZTSCH("IO")), at least under our specific situation where we are holding no other locks except ^%ZTSK(ZTSK). (If we had any other locks, the behavior would be quite different, since with the Taskman 6.5 multilock syntax we would release all the other locks, but with the M3 decremental and incremental multilock syntax we would keep all those other locks).

This Taskman 6.5 code continuous holds ^%ZTSK(ZTSK) without any interruption, so under the strict locking conventions that it follows all other Taskman jobs remain unable to insert themselves into our operation by modifying our task in any way, since they would first have to acquire a lock on ^%ZTSK(ZTSK), which they can't do so long as we continuously hold it. This issue of ensuring the continuity of this lock while we shift from list to list in our processing was a fundamental design goal of Taskman 6.5's locking strategy, and you see it in action in the way these multilocks are used. It's why at the time I was willing to suffer minor bottlenecks (given the low level of Taskman parallelism in 1990 compared to today).

Ironically, this Taskman 6.5 locking strategy was so much more reliable than Taskman 5.0s had been that Vista applications began relying on Taskman more heavily. That increased the parallelism, which increased the pain of the tradeoff bottlenecks, which made this strategy obsolete and drove the development of a very different locking strategy in Taskman 7, an illustration of how success breeds failures that drive new successes.

Meanwhile, DQ line 3 continues the pattern of lines 1 and 2 but with a twist.

This line is concerned with the main Taskman schedule list, where queued tasks wait until the time they are scheduled to start. It's subscripted by time and then by task ID, as one would expect for a queue of tasks sorted by time, but unlike the JOB and IO lists it has no initial subscript to set it apart from the other scheduling lists. It's distinguished similarly to how File Manager distinguishes the main data section of a file from its indexes, because the main records have numeric subscripts whereas the index subscripts start with a nonnumeric alpha character and thus numerically evaluate to 0. So too the first subscript of Taskman's main schedule contains $HOROLOG values, which numerically evaluate as non-zero, whereas all ^%ZTSCH's other lists have alpha subscripts that evaluate to 0. This was Dr. Dave Wilson's original design, and it couldn't be changed at the time.

It directly impacted Taskman's locking algorithms, because it meant that if I wanted line 3 to do what lines 1 and 2 had done, I would have to write L (^%ZTSK(ZTSK),^%ZTSCH), but unlike lines 1 and 2 this would not just lock the main schedule list but would lock all of Taskman's schedule lists, thus creating a much larger bottleneck than lines 1 and 2 had done. Indeed, an early draft of Taskman 6.5 did have line 3 here containing that exact multilock, and the performance was dreadful.

After significant experimentation and performance measurement, I realized that I could replace it with the single lock argument shown in the code sample above, which would result in releasing the lock on ^%ZTSCH("IO") while retaining the lock on ^%ZTSK(ZTSK). I rewrote the various loops through the main schedule throughout Taskman to be tolerant of finding a node in the main schedule whose record over in ^%ZTSK was locked, and I discovered this new approach to locking worked great. It not only avoided the huge bottleneck from the earlier draft, it also avoided the smaller bottlenecks that lines 1 and 2 were still causing. This showed me that I could also rewrite all the loops and locks on ^%ZTSCH("JOB") and ^%ZTSCH("IO") throughout Taskman the same way to eliminate those bottlenecks, too, to achieve a much higher capacity for parallelism in Taskman, but by this point we were out of time to do further development on Taskman 6.5, which was entering testing and verification, so I left those conversions for Taskman 7. The resulting Taskman 6.5 code here in routine %ZTLOAD1 in 1990 is thus a hybrid locking algorithm, mainly following an intermediate multilock syntax approach but partly migrated to a newer, simpler, less bottlenecked approach.

The lesson about locking here is important.

Thinking through the possible consequences of various locking algorithms in a highly parallel algorithm is a complex undertaking. It is all too easy to lose oneself in getting the details right to avoid catastrophe, which means one can miss opportunities for radical improvements in performance that lie very close to the strategy one is currently fixated on. Indeed, it is inevitable, due to the structure of the human mind and the nature of empirically driven learning. It's essential to eliminate shame and reward and punishment from the working conditions of software engineers grappling with such complex problems. No amount of intelligence or education can make it impossible to make these kinds of mistakes. They are an essential part of any empirical learning process—and only empirical learning processes are capable of revealing all the consequences of any given strategy when dealing with such complex algorithms.

A corollary here specific to M locking is that one should always use the simplest locking strategy that solves all the problems that need to be solved. In Taskman 7, almost all multilocks were removed from Taskman, and performance improved significantly. In Taskman 9, over half of Taskman's locks were removed, and performance increased a hundredfold. The right number will rarely be zero locks, but some small number of carefully chosen locks in a freer algorithm can coordinate a higher degree of parallelism than a larger number of locks that follow an overcontrolling algorithm.

To wrap up the discussion of this example, the remaining locks in %ZTLOAD1 6.5 are not multilocks but single locks, but one of them used to be a multilock and the other should have been, plus they're closely related, and they contain two errors, so we need to inspect them.

Indeed, a failure to consider the lines in FINAL when I was fixing DQ line 3 back in 1990 was precisely the cause of the first error. Designing or redesigning M locking algorithms should never be rushed through just to meet an arbitrary deadline, because it can cause very difficult-to-debug problems. In this case, the first error is trivial, but only because I was lucky back in 1990; in my rush then I could easily have caused serious problems across VA, as I did in other places in Taskman at other times by hurrying to meet a deadline.

Line 1 of FINAL opens with L ^%ZTSK(ZTSK), which in effect is a no-op, because our job already holds this lock from DQ line 3. So why is this pointless command here? Remember that the lock from DQ line 3 used to be L (^%ZTSK(ZTSK),^%ZTSCH), at which time this first lock on FINAL line 1 had the effect of releasing the lock on ^%ZTSCH while retaining the lock on ^%ZTSK(ZTSK). When I changed the older multilock on DQ line 3 to the newer single lock, I should have removed the first single lock on FINAL line 1, but instead it shipped to the field in this state. Fortunately, this oversight during my rush to get Taskman 6.5 released in 1990 had no negative consequences, other than cluttering up the code.

However, the second lock on that line is not so innocent.

The effect here of L ^%ZTSCH(ZTDTH,ZTSK) is not only to acquire a lock on the specific node in the main schedule list for our task but also to prematurely release the lock on the task's record, thus making it possible for some other Taskman job to delete the record with a KILL command, leaving that main schedule node an orphan with no record to refer back to. That other job might think it's safe to kill the main record because at that instant in time the task exists over in ^%ZTSK but is not actively present on any of Taskman's schedule lists. FINAL line 1 is about to add our task to the main schedule list, after that lock, but it hasn't yet, and that momentary logical database corruption sends a dangerous signal to all other Vista jobs.

And indeed, over the next couple years we did experience problems with tasks occasionally disappearing right in the middle of their being rescheduled by this code, because of this gap in the consistency of the database. That was exactly the kind of problem I'd worked so hard on for three years in developing Taskman 6 to eliminate, and here by accident while rushing to meet a deadline I reintroduced it.

When analyzing locking algorithms, a major part of the work consists of searching for such gaps in the coverage, considering what all could go wrong during those gaps, and redesigning the algorithms to eliminate those possibilities. Taskman includes a cleanup job whose purpose is to delete unreferenced tasks, so even momentarily leaving a task incomplete, unreferenced by the schedule lists, as this problematic lock does, create the illusion that the record is no longer necessary and therefore ensures that the cleanup job will from time to time delete that vulnerable task right at that moment.

Unlike the problem with the first lock on this line, this second defect was not caused by the shift in locking of DQ line 3 described above. It was an independent error made during the initial rewrite of this routine for Taskman 6.5. In Taskman 6, %ZTLOAD1 did not use multilock syntax at all but only single locks, and the strategy created numerous gaps in coverage of the task while it was being dequeued, edited, and requeued, resulting in numerous problems caused by parallel Taskman jobs receiving those confusing signals about the state of the task. Taskman 6.5 largely rewrote this algorithm and reinvented its locking strategies, including adding multilocks, as part of a Taskman-wide rewrite aimed at eliminating as many such gaps in coverage as possible.

Wally Fort was not so keen on refactoring back then as I was. As he used to remind me, one consequence of doing so much refactoring was the introduction of new bugs, and that's exactly what happened with this second lock on FINAL line 1. This second lock should have read L (^%ZTSK(ZTSK),^%ZTSCH(ZTDTH,ZTSK)), so it didn't prematurely release the lock on the task record, but the entire lock was new, not present in Taskman 6. My brain was too full trying to keep track of other parts of the design, and I forgot to preserve that lock on the task record, accidentally releasing it early while acquiring the lock on the new schedule node I was about to create. And this one did cause occasional problems in the field until Taskman 7 replaced it. But since Taskman was such a mess in version 5.01 when I inherited it, I quickly learned that the occasional bug introduced during refactoring was a small price to pay for the truckloads of bugs eliminated by the refactoring process, including the many difficult bugs that can only be fixed by changing the underlying architecture, for which refactoring is the only way forward.

To conclude the example, FINAL line 2 has the argumentless LOCK that releases all held locks.

In the next year, the Vista Standards and Conventions were changed to mandate the use of incremental and decremental locks, now that M3 had been certified. That helped drive the rewrite of this code for Taskman 7. All these single and multi- and argumentless locks we've discussed vanished from Taskman entirely (except in a few rare spots), to be replaced with newly redesigned incremental-locking algorithms. When that change was made, partly as a result of discovering that the change on DQ line 3 discussed above worked so well, at the same time that I rolled out incremental and decremental locks I also radically simplified Taskman's locking strategy. I'm going to describe that here briefly, because it's based on a very useful but often overlooked feature of locking.

Although metaphorically we often explain lock and unlock operations by comparing them to OPEN and CLOSE commands, this can do more harm than good, so it's essential to point this out while you're teaching locks. It's not just that OPEN and CLOSE restrict access whereas lock and unlock don't, except by convention; that is widely understood. Far more importantly it's that the ID of the device you want to own has to be the argument of the OPEN or CLOSE, but the global node whose access you want to coordinate does not have to be the argument of the LOCK command, nor do any of its parent or descendant nodes. People who think of LOCK as analogous to OPEN will tend to do that—I thought that way during the design of the locking algorithms for Taskman 6 and 6.5, and at the time I was quickly becoming one of the Vista community's locking experts—but it's often a mistake, sometimes a critical one. It was a mistake in Taskman 6 and 6.5.

As Dr. Dayhoff pointed out in her article, there are a variety of strategies that can be pursued when designing locking algorithms, and some of the less obvious ones are some of the best. They would be more obvious if we didn't overemphasize describing LOCK as being like OPEN. If instead we emphasized it as an interjob signaling system, it would be easier for us to think about designing those signals independently of the specific global structure whose access we want to coordinate.

The discovery of how well the revised locking algorithm in DQ line 3 worked led me in Taskman 7 to almost entirely eliminate strategies that depended on simultaneously locking ^%ZTSK and a node in ^%ZTSCH, which was so fundamental to the previous versions. I had discovered for myself the practical application of what Dr. Dayhoff had been trying to tell us in her article, but which at the time I read it I was conceptually unprepared to understand, that I was making Taskman locking too complex by making it mirror the specific nodes I was trying to coordinate access to, as though they were I/O devices being coordinated.

But I just didn't need all those locks.

Instead, I could simplify the strict conventions I'd adopted in Taskman 6.5, so that in Taskman 7 the rule was that any time I want to change the status of a task I just lock its record. Period. I don't need to lock and unlock the various scheduling nodes I'm removing the task from or adding it to. Just the one lock in the right place was the new signal the task is in flux and other jobs need to keep their paws off it until the lock is released.

With this new strategy, there was no need for all those multilocks, nor did I need to use incremental locks in the new way introduced in M3 to simulate them. Instead, I stripped over half the locks out of Taskman 7 and massively improved performance while sweeping away all the gaps in coverage that had plagued earlier versions, resulting in far more logical database integrity. For comparison with DQ version 6.5, here's DQ version 7.

 ;
DQ ;make sure task is not pending
 S ZT1="" F ZT=0:0 S ZT1=$O(^%ZTSCH(ZT1)) Q:'ZT1  I $D(^(ZT1,ZTSK))#2 S ZT2=^(ZTSK) K ^(ZTSK) I ZT2]"" S $P(^%ZTSK(ZTSK,.2),U)=ZT2
 S ZT1="JOB",ZT2="" F ZT=0:0 S ZT2=$O(^%ZTSCH(ZT1,ZT2)) Q:ZT2=""  I $D(^(ZT2,ZTSK))#2 K ^(ZTSK)
 S ZT1="LINK",ZT2="" F ZT=0:0 S ZT2=$O(^%ZTSCH(ZT1,ZT2)),ZT3="" Q:ZT2=""  F ZT=0:0 S ZT3=$O(^%ZTSCH(ZT1,ZT2,ZT3)) Q:ZT3=""  I $D(^(ZT3,ZTSK))#2 K ^(ZTSK)
 S ZT1="IO",ZT2="" F ZT=0:0 S ZT2=$O(^%ZTSCH(ZT1,ZT2)),ZT3="" Q:ZT2=""  F ZT=0:0 S ZT3=$O(^%ZTSCH(ZT1,ZT2,ZT3)) Q:ZT3=""  I $D(^(ZT3,ZTSK))#2 D IODQ
 I $D(^%ZTSK(ZTSK,0))[0 L -^%ZTSK(ZTSK) S ZTSK(0)=0 K ZT,ZT1,ZT2,ZT3 Q
 ;


You can see at a glance that all the locks have been stripped out, even though the number of scheduling lists we have to check has increased. We enter this stanza already holding a lock on ^%ZTSK(ZTSK), and we keep holding it until the end of this routine, with no other locks needed. This drastically reduces the opportunities for bottlenecks—including, given how intensively parallel Taskman was becoming by this time, reducing the hidden bottleneck of getting access to the lock table itself. With fewer Taskman jobs pounding on the lock table for what ultimately proved unnecessary checks, the lock table as a whole was freed up for more essential uses by other Vista jobs.

When Taskman 9 or 10 completely eliminates most locking from Taskman, replacing it with transactions, the migration that I started back in 1986 will be complete, reducing locking bottlenecks then to the smallest possible number, increasing Vista system performance while improving database integrity.

I wrote an early draft of the Taskman 9 code back in 2018 for EHS of Jordan, because the number of parallel Taskman jobs in their national Vista system had grown far beyond anything VA has ever experienced, to the point where performance was bogging down due to Taskman's locking algorithms. Once the algorithms were further refined to let us strip out even more of the locks, along with a few other changes, Taskman's performance measurably jumped to a hundred times what VA experiences. (This code would not work as is for VA due to defects in Cache's implementation of locking, but on GT.M and YottaDB and any other Standard-compliant M implementation the difference was miraculous.) Since then EHS has continued adding more and more hospitals to their national Vista system, and Taskman 9's new locking strategies are continuing to hold up under pressure.

In general, as a result of this history, I would not recommend any attempt to create an abstract universal locking package, to which you pass an indirect list of nodes to lock. The concrete, specific details of how lock nodes are actually related to their use in the algorithms that depend on them absolutely determine whether a given locking strategy will work well or poorly or even be a catastrophe for the system. For example, the difference between the VA's Taskman 8 locking algorithm and the new draft Taskman 9 one running in Jordan is actually the difference between performance so bad the Minister of Health threatens to shut down the entire project versus performance so good the project thrives for the next six years and counting. Only accurate, detailed human understanding and experienced human judgment is capable of making such distinctions. The kind of "close enough" that a generic Locker routine might impose or today's generation of AIs might design is the kind of "not actually very close at all" that bogs down systems, gets projects canceled, and is used to justify massive swindles like the ongoing Cerner disaster at VA.

Strategically, an important lesson of this history is to reduce the amount of M locking to the smallest that will do the job, and likewise use the simplest algorithms that meet your design goals. The best locking algorithms are often not the most obvious ones, but they will be the cleanest. Every unnecessary complication in locking multiplies into downstream consequences, as the number of parallel jobs increases the consequences of poorly designed contention-resolution algorithms.

And in turn, this requires knowing our application, all of it, every last line and variable, so we can understand and take ownership of the consequence of our choices. Until we know our whole application, we don't really understand what the job is or what the right design goals should be, so our programming just consists of half-blind guessing and patching, which makes us victims of the Law of Unintended Consequences. As bad as patchitis usually is, when it comes to locking algorithms patchitis is the Devil. I speak from long experience and offer those experiences with Taskman in this essay as an object lesson, successes and failures both, in the hope that readers will take seriously the importance of a complete understanding of the code they propose to modify, in locking even more so than most other considerations.

I hope this has been helpful.

Yours truly,
Rick

Postscript: I could have written a similar lengthy account of how locking changed over time with File Manager, because of my experiences on the Fileman team in the 1990s and 2010s, but I chose Taskman because Fileman locking, although equally complex, is much less intense. Fileman locks tend to be spread out all over the database, across Vista's many files, whereas Taskman's locking tends to be concentrated on the same handful of globals over and over and over, thus better illustrating how these algorithms behave under the intense pounding pressure of massive parallelism.

Frederick D. S. Marshall
Executive Director, Vista Expertise Network
819 North 49th Street, Suite 203, Seattle, Washington 98103
rick.m...@vistaexpertise.net
+1 (206) 465-5765

Jamie Crumley

unread,
May 15, 2024, 6:24:30 PMMay 15
to hard...@googlegroups.com
Perfect! Thanks so much. I can't believe I'd forgotten. Lol. But it has been many years. 

John McCormack

unread,
May 16, 2024, 5:58:37 PMMay 16
to hard...@googlegroups.com
Deja vu... shades of Rick's FORUM Tombs.

Sam Habiel

unread,
May 16, 2024, 11:24:54 PMMay 16
to hard...@googlegroups.com
I wish all of Rick's FORUM tomes can be collected and kept for posterity! I am still struggling to read this one. Thank you Rick.

--Sam

Ar “art”

unread,
May 16, 2024, 11:53:41 PMMay 16
to Hardhats
I agree this information helps novice how to solve the problem 

thank you Rick !!!!

Nancy Anthracite

unread,
May 17, 2024, 5:00:43 AMMay 17
to hard...@googlegroups.com, John McCormack
I suspect the spell checker got you and you meant tome!

--
Nancy Anthracite
> > Much discussion and several articles in /MUG Quarterly/ had been devoted
> >> *
> >> *
> >> *LOCK (^A(1,1),^B(1,2))*
> >> ---
> >> You received this message because you are subscribed to the
Google Groups "Hardhats" group.
> >> To unsubscribe from this group and stop receiving emails from it,
send an email to hardhats+u...@googlegroups.com.
> >>
> >> To view this discussion on the web visit
> >> https://groups.google.com/d/msgid/hardhats/
CABHT962qXpk7YWqtxQ-1_RgOT7Ud37mpqXmeVe3yQeeH%3DjPSyQ%40mail.gmail.com
> >> <https://groups.google.com/d/msgid/hardhats/
CABHT962qXpk7YWqtxQ-1_RgOT7Ud37mpqXmeVe3yQeeH%3DjPSyQ%40mail.gmail.com?
utm_medium=email&utm_source=footer>.
> >>
> >>
> >> ---
> >> You received this message because you are subscribed to the Google Groups
"Hardhats" group.
> >> To unsubscribe from this group and stop receiving emails from it, send an
email to hardhats+u...@googlegroups.com.
> >> To view this discussion on the web visit
> >> https://groups.google.com/d/msgid/hardhats/3501b686-ea99-43f5-b5f9-309eede2e0dan%40googlegroups.com
> >> <https://groups.google.com/d/msgid/hardhats/3501b686-ea99-43f5-b5f9-309eede2e0dan%40googlegroups.com?utm_medium=email&utm_source=footer>.
> >
> >
> > ---
> > You received this message because you are subscribed to the Google Groups
"Hardhats" group.
> > To unsubscribe from this group and stop receiving emails from it, send an
email to hardhats+u...@googlegroups.com.
> > To view this discussion on the web visit
> > https://groups.google.com/d/msgid/hardhats/bf5e2a9d-01fa-4ef1-96f3-e24768a83758%40vistaexpertise.net
> > <https://groups.google.com/d/msgid/hardhats/bf5e2a9d-01fa-4ef1-96f3-e24768a83758%40vistaexpertise.net?utm_medium=email&utm_source=footer>.
>
>




Reply all
Reply to author
Forward
0 new messages