Hi Kim,
I took a closer look at the kernel module locks and canceling and see
a couple (minor) potential issues. Would you mind commenting on these?
ach_klinux rdlock/wrlock()
==========================
Wait queue behavior
-------------------
At first glance, it looks like kernel wait queues may to lead to the
following race:
1. reader checks condition, gets false
2. writer sets condition to true
3. writer wakes the wait queue
4. readers waits queue
Now, it seems the reader has missed the writers update. Pthreads gets
around this race by holding a mutex while the condition is updated and
checked. However, from the kernel's wait.h, it seems the actual
operations instead would be:
0. reader adds itself to queue
1. reader checks condition, gets false
2. writer sets condition to true
3. writer wakes the wait queue
4. readers goes to wait in queue, immediately returning due to wake
from writer
Kim, could you please confirm that this is correct?
Interrupted calls
-----------------
Another small issue: mutex_lock_interruptible() may return -EINTR if a
signal is received. If this happens, we should probably retry the
lock (or just use mutex_lock() instead) or return a suitable status
code to userspace. Similarly for wait_event_interruptible(),
wait_event_interruptible_timeout(), and wrlock().
A quick test shows that when a signal is received during a wait in
ach_get(), the caller receives ACH_FAILED_SYSCALL with errno=EINTR.
To be consistent with user-mode channels (which continue waiting
through signals), we would need to retry the system call. Does that
seem like the best behavior? One could still break on signals by
waiting for kernel channels in select(), etc.
ach_klinux ach_cancel()
=======================
I think the kernel mode ach_cancel() may be susceptible to deadlocks.
The user mode ach_cancel() is really a fancy way of interrupting a
pthread condition wait. If called from another thread
(async_unsafe=true), one can just take the mutex, set a flag, and
broadcast the condition to wakeup the other threads. However, if
called from signal handler (async_unsafe=false), we cannot issue any
pthread_* calls because they are not in the list of async safe
functions. Specifically, we cannot take the mutex because if also
held outside the signal handler, this will deadlock. Updating the
condition without the mutex held is racy. The workaround is to call
fork() (which is async safe) from the signal handler and then take the
mutex_lock in the child process.
In the kernel mode ach_cancel(), taking the lock from a signal handler
should have a similar deadlock risk. This is what happens when
unsafe=0, so I think it is incorrect to take the lock. Actually, if
Linux wait queues do the test-and-wait atomically (as above), then it
seems that we can always omit the mutex lock here, just setting
chan->cancel=1 and calling wake_up(). Is this correct?
Cheers,
--
Neil T. Dantam
http://www.Neil.Dantam.name