Hunting a possible fsemaphore-post/wait bug

Dominik Pantůček

May 23, 2020, 12:51:29 PM
to Racket Users
Hello again with futures!

I started working on futures-based workers and quickly got stuck with a
deadlock that I think does not originate in my code (although it involves
two fsemaphores and 8 futures, so I'll refrain from strong opinions here).

I implemented a very simple futures-friendly queue using mutable pairs
and created a minimal deadlocking example[1]. I am running Racket 3m
7.7.0.4, which includes fixes for the futures-related bugs I discovered
recently.
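
Roughly, the queue looks like the sketch below (a simplified illustration
written for this message; the exact code is in the paste [1] and differs in
details). One fsemaphore serves as the lock around the mutable pairs, the
other counts queued items:

#lang racket/base
(require racket/future)

;; Illustrative sketch only -- the real code is in the paste [1].
(struct mfqueue (lock count [head #:mutable] [tail #:mutable]))

(define (make-mfqueue)
  (mfqueue (make-fsemaphore 1) (make-fsemaphore 0) #f #f))

(define (mfqueue-enqueue! q v)
  (define cell (mcons v #f))
  (fsemaphore-wait (mfqueue-lock q))      ; take the lock
  (if (mfqueue-tail q)
      (set-mcdr! (mfqueue-tail q) cell)
      (set-mfqueue-head! q cell))
  (set-mfqueue-tail! q cell)
  (fsemaphore-post (mfqueue-lock q))      ; release the lock
  (fsemaphore-post (mfqueue-count q)))    ; signal one more item

(define (mfqueue-dequeue! q)
  (fsemaphore-wait (mfqueue-count q))     ; wait until an item is available
  (fsemaphore-wait (mfqueue-lock q))      ; take the lock
  (let ([cell (mfqueue-head q)])
    (set-mfqueue-head! q (mcdr cell))
    (unless (mfqueue-head q)
      (set-mfqueue-tail! q #f))
    (fsemaphore-post (mfqueue-lock q))    ; release the lock
    (mcar cell)))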

Sometimes the code just runs fine and shows the numbers of worker
iterations performed in the different futures (as traced by the 'fid'
argument). But sometimes it locks up in a state where there is one last
number in the queue (0, zero) and yet fsemaphore-count for the count
fsemaphore returns 0, which means the semaphore was decremented twice
somewhere. The code is really VERY simple, and I do not see a race
condition in it that would allow any code path to decrement the
fsema-count fsemaphore twice once the worker future receives 0.

I am able to reproduce the behavior with racket3m running under gdb and
get the stack traces for all the threads pretty consistently. The
deadlock is apparently at:

2 Thread 0x7ffff7fca700 (LWP 46368) "mfqueue.rkt"
futex_wait_cancelable (private=<optimized out>, expected=0,
futex_word=0x5555559d8e78) at ../sysdeps/nptl/futex-internal.h:183

But that is just where the issue is showing up. The real question is how
the counter gets decremented twice (given that fsemaphores should be
futures-safe).

Any hints would be VERY much appreciated!


Cheers,
Dominik

[1] http://pasterack.org/pastes/28883

Matthew Flatt

May 23, 2020, 1:24:23 PM
to Dominik Pantůček, Racket Users
I'm not sure this is the problem that you're seeing, but I see a
problem with the example. It boils down to the fact that futures do not
provide concurrency.

That may sound like a surprising claim, because the whole point of
futures is to run multiple things at a time. But futures merely offer
best-effort parallelism; they do not provide any guarantee of
concurrency.

As a consequence, trying to treat an fsemaphore as a lock can go wrong.
If a future manages to take an fsemaphore lock, but the future is not
demanded by the main thread --- or in a chain of future demands that
are demanded by the main thread --- then nothing obliges the future to
continue running; it can hold the lock forever.
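
For instance (a toy illustration, not your program), a future can take a
lock fsemaphore and then hit an operation that is not future-safe, such as
`sleep`, which suspends the future until it is touched. Without a `touch`,
the main thread can then block on that fsemaphore indefinitely:

#lang racket/base
(require racket/future)

(define lock (make-fsemaphore 1))

(define f
  (future
   (lambda ()
     (fsemaphore-wait lock)   ; the future may grab the lock...
     (sleep 0.1)              ; ...then suspend here (sleep is not future-safe)
     (fsemaphore-post lock))))

(sleep 0.1)                   ; give the future a chance to start
(fsemaphore-wait lock)        ; can block forever, since nothing touches f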

(I put the blame on fsemaphores. Adding fsemaphores to the future
system was something like adding mutation to a purely functional
language. The addition makes certain things possible, but it also
breaks local reasoning that the original design was supposed to
enable.)

In your example program, I see

(define workers (do-start-workers))
(displayln "started")
(for ((i 10000))
  (mfqueue-enqueue! mfq 1))

where `do-start-workers` creates a chain of futures, but there's no
`touch` on the root future while the loop calls `mfqueue-enqueue!`.
Therefore, the loop can block on an fsemaphore because some future has
taken the lock but stopped running for whatever reason.

In this case, adding `(thread (lambda () (touch workers)))` before the
loop after "started" might fix the example. In other words, you can use
the `thread` concurrency construct in combination with the `future`
parallelism construct to ensure progress. I think this will work
because all futures in the program end up in a linear dependency chain.
If there were a tree of dependencies, then I think you'd need a
`thread` for each `future` to make sure that every future has an active
demand.
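
Concretely, the suggested change amounts to something like this (using the
names from your paste):

(define workers (do-start-workers))
(displayln "started")

;; keep an active demand on the future chain from a Racket thread
(thread (lambda () (touch workers)))

(for ((i 10000))
  (mfqueue-enqueue! mfq 1))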

If you're seeing a deadlock at the `(touch workers)`, though, my
explanation doesn't cover what you're seeing. I haven't managed to
trigger the deadlock myself.

Dominik Pantůček

May 23, 2020, 2:08:23 PM
to racket...@googlegroups.com

On 23. 05. 20 19:24, Matthew Flatt wrote:
> I'm not sure this is the problem that you're seeing, but I see a
> problem with the example. It boils down to the fact that futures do not
> provide concurrency.
>
> That may sound like a surprising claim, because the whole point of
> futures is to run multiple things at a time. But futures merely offer
> best-effort parallelism; they do not provide any guarantee of
> concurrency.

It might be surprising as well - but that is exactly the reason I am
using futures for this task. Up until now I have always built a data set
and then set up a futures tree to process that data set - possibly in
parallel, but with no assumptions about whether it would actually happen
in parallel.

What I am trying to achieve now is to allow the futures scheduler to
start speculatively working on the data set while it is being created.
And yes, I don't want to make any assumptions about it actually starting
or even running in parallel. I just want to provide the futures runtime
with the best possible setup, so it can do its best.

Using OS threads with explicit scheduling is something I want to
investigate as well - but generally I do not see why this shouldn't work
as expected (with the great advantage that it can fall back to
program-level single-threaded execution rather than OS threads
interleaved on the same core, transparently to the program).

>
> As a consequence, trying to treat an fsemaphore as a lock can go wrong.
> If a future manages to take an fsemaphore lock, but the future is not
> demanded by the main thread --- or in a chain of future demands that
> are demanded by the main thread --- then nothing obliges the future to
> continue running; it can hold the lock forever.

Duly noted. I'll look into that possibility. But honestly, I've never
encountered such behavior.

>
> (I put the blame on fsemaphores. Adding fsemaphores to the future
> system was something like adding mutation to a purely functional
> language. The addition makes certain things possible, but it also
> breaks local reasoning that the original design was supposed to
> enable.)

I understand and - another surprise - I agree. The example is very
"imperative", but it is really a bare-bones example of the problem.

>
> In your example program, I see
>
> (define workers (do-start-workers))
> (displayln "started")
> (for ((i 10000))
>   (mfqueue-enqueue! mfq 1))
>
> where `do-start-workers` creates a chain of futures, but there's no
> `touch` on the root future while the loop calls `mfqueue-enqueue!`.
> Therefore, the loop can block on an fsemaphore because some future has
> taken the lock but stopped running for whatever reason.
>
> In this case, adding `(thread (lambda () (touch workers)))` before the
> loop after "started" might fix the example. In other words, you can use
> the `thread` concurrency construct in combination with the `future`
> parallelism construct to ensure progress. I think this will work
> because all futures in the program end up in a linear dependency chain.
> If there were a tree of dependencies, then I think you'd need a
> `thread` for each `future` to make sure that every future has an active
> demand.

The same thing happens if I call (do-start-workers) after filling the
queue using mfqueue-enqueue!.

>
> If you're seeing a deadlock at the `(touch workers)`, though, my
> explanation doesn't cover what you're seeing. I haven't managed to
> trigger the deadlock myself.

Should I open an issue on GitHub and provide the gdb backtraces?

Btw, none of this behaviour is seen with Racket CS (so far), and the
futures scheduler speculates very aggressively with the CS variant
(which is definitely what I want).

I am running 3m v7.7.0.6 compiled from sources on Ubuntu 20.04, if that
helps.



And (as usual) - thank you very much!

Dominik

Matthew Flatt

May 23, 2020, 9:38:21 PM
to Dominik Pantůček, racket...@googlegroups.com
At Sat, 23 May 2020 18:51:23 +0200, Dominik Pantůček wrote:
> But that is just where the issue is showing up. The real question is how
> the counter gets decremented twice (given that fsemaphores should be
> futures-safe).

I found a configuration that triggered the bug often enough on my
machine. Yes, it was a dumb mistake in the implementation of
fsemaphores. In your example, it wasn't so much that the counter
fsemaphore was decremented twice, but that the lock fsemaphore could go
negative --- so it didn't actually lock.

I've pushed a repair.


FWIW, after looking at this example more, I see why it's still not
enough in principle to make a thread that waits on the result of
`do-start-workers` or to run `do-start-workers` after the enqueue
loops. The `start-workers` function can run a dequeue loop after
starting a future to do the same, and before touching that future. So,
the dependencies aren't as linear as I thought before.

If your real code looks anything like this, consider using a CAS
operation, like `box-cas!`, instead of an fsemaphore as a lock for the
queue. The queue's use of an fsemaphore for counting and signaling
seems fine, though.
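
One shape that could take (a sketch with made-up names, not code from the
example) is a small `box-cas!`-based lock around the queue operations:

;; A box used as a CAS lock; `with-queue-lock` is a made-up helper, and a
;; robust version would also release the lock when `thunk` escapes.
(define lock (box #f))   ; #f = free, #t = held

(define (with-queue-lock thunk)
  (let spin ()
    (unless (box-cas! lock #f #t)  ; try to take the lock
      (spin)))                     ; retry until it is free
  (begin0
    (thunk)
    (box-cas! lock #t #f)))        ; release the lock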

In any case, the example worked well to expose the fsemaphore bug.

Dominik Pantůček

May 24, 2020, 3:10:31 AM
to Matthew Flatt, racket...@googlegroups.com

On 24. 05. 20 3:38, Matthew Flatt wrote:
> At Sat, 23 May 2020 18:51:23 +0200, Dominik Pantůček wrote:
>> But that is just where the issue is showing up. The real question is how
>> the counter gets decremented twice (given that fsemaphores should be
>> futures-safe).
>
> I found a configuration that triggered the bug often enough on my
> machine. Yes, it was a dumb mistake in the implementation of
> fsemaphores. In your example, it wasn't so much that the counter
> fsemaphore was decremented twice, but that the lock fsemaphore could go
> negative --- so it didn't actually lock.
>
> I've pushed a repair.

Thank you for the repair (and good idea with the test case).

>
>
> FWIW, after looking at this example more, I see why it's still not
> enough in principle to make a thread that waits on the result of
> `do-start-workers` or to run `do-start-workers` after the enqueue
> loops. The `start-workers` function can run a dequeue loop after
> starting a future to do the same, and before touching that future. So,
> the dependencies aren't as linear as I thought before.
>
> If your real code looks anything like this, consider using a CAS
> operation, like `box-cas!`, instead of an fsemaphore as a lock for the
> queue. The queue's use of an fsemaphore for counting and signaling
> seems fine, though.
>

Yes, the real code is a binary tree of futures. However, each future
performs a lot of fixnum/flonum operations.

At first I was surprised that you are basically suggesting using
spinlocks (busy-wait loops) instead of futex-backed (at least on Linux)
fsemaphores. That is a waste of CPU time... But yes, CAS-based locking
outperforms fsemaphore-based locking and distributes the work far more
evenly. Now that I see it in action and look at the source, it kind of
makes sense: the spinlock does not need to wait long and does not cause
any re-scheduling (which is the reason why I get such uneven counts with
fsemaphore-based locking).

> In any case, the example worked well to expose the fsemaphore bug.
>

You can probably expect to see more in the future :)



Dominik

Alexis King

May 24, 2020, 6:00:01 AM
to Dominik Pantůček, Racket Users
On May 24, 2020, at 02:10, Dominik Pantůček <dominik....@trustica.cz> wrote:

At first I was surprised that you are basically suggesting using
spinlocks (busy-wait loops) instead of futex-backed (at least on Linux)
fsemaphores. That is a waste of CPU time.

Performing CAS operations in a loop isn’t really directly comparable to spinlocking. With a spinlock, the blocked thread is spending cycles doing absolutely nothing for an arbitrarily long time. The blocked thread is completely at the mercy of the thread that holds the lock, which may not release it for a lengthy duration even if the blocked thread only needs to acquire it briefly.

In contrast, lock-free concurrency approaches that use atomic operations never spend CPU time busy-waiting. All threads are always doing useful work; the worst-case scenario is that work is duplicated. The burden here is reversed: if one thread needs to use the shared resource for a short period of time while a long-running computation is in progress on another thread, it doesn’t have to wait; it just wins and continues with its business. It’s the long-running computation that loses here, since it has to throw away the work it did and retry.

This means lock-free approaches are bad if you have (a) lots of contention over a shared resource, plus (b) long-running computations using the resource, which would be prohibitively expensive to duplicate. But if contention is low and/or the operation is sufficiently cheap, the overhead of work duplication will be irrelevant. (This is why lock-free concurrency is sometimes called “optimistic concurrency”: you’re optimistically hoping nobody else needs the resource while you’re using it.) Furthermore, note that acquiring a mutex must ultimately use atomic operations internally, so if there is no contention, the lock-free approach will fundamentally be faster (or at least no slower) than the locking approach.

Now consider what blocking on a mutex must do in addition to the atomic operation: it must return to the thread scheduler, since a blocked thread must be added to a wakeup list and unscheduled. This incurs all the overhead of context switching, which is small, but it isn’t zero. Compare that to the cost of a single enqueue or dequeue operation, which involves just a couple of reads and a single write. Soon you end up in a situation where the cost of context switching outweighs any work you might duplicate even if there is thread contention, and now you’re beating the mutex under all usage patterns!

tl;dr: Even though lock-free approaches and spinlocks are superficially similar in that they involve “spinning in a loop,” it isn’t fair to conflate them! Optimistic concurrency is a bad idea if you have operations that need access to the shared resource for a long time, since you’ll end up duplicating tons of work, but if the operation is cheap, you lose nothing over a mutex-based approach and potentially gain a little extra performance.

Alexis

Alexis King

May 24, 2020, 11:58:14 AM
to Dominik Pantůček, Matthew Flatt, Racket Users
I realized a little while after writing my previous message that I was probably misinterpreting you. I was envisioning you using box-cas! on a box containing a functional queue, so there would be no need to synchronize. You’d just pull the queue out of the box, functionally update it, and use box-cas! to store the new result back. But your original program is using an imperative queue, so it seems plausible you really were using box-cas! to implement a spinlock. Then you would indeed be busy-waiting, after all!
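
For reference, the pattern I had in mind looks roughly like this (a sketch
with made-up names; the box holds an immutable list treated as a LIFO work
bag to keep the sketch short, but the CAS loop is the same for a real queue):

;; Read the current value, build an updated one, and box-cas! it back,
;; retrying if another future or thread won the race in the meantime.
(define items (box '()))

(define (push! v)
  (let retry ()
    (define old (unbox items))
    (unless (box-cas! items old (cons v old))
      (retry))))

(define (pop!)   ; returns #f when empty
  (let retry ()
    (define old (unbox items))
    (cond
      [(null? old) #f]
      [(box-cas! items old (cdr old)) (car old)]
      [else (retry)])))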

Rereading Matthew’s original email, I have no idea if he was suggesting to use a spinlock or to do something closer to what I had in mind. Either way, I thought it sounded fun to try to implement a lock-free imperative queue in Racket. Then (theoretically) you’d get the best of both worlds: less duplicated work and less allocation than the functional approach (and allocation is expensive for futures!), but no busy-waiting. Here is my solution: https://gist.github.com/lexi-lambda/c54c91867f931b56123e3c595d8e445a

As far as I can tell, it works nicely, and it’s not terribly complicated. Inspecting the behavior of my test cases in the futures visualizer, it seems to be doing good things. But I found I needed to use Matthew’s threads + futures trick, and in the process, I discovered a small subtlety needed to make it actually work. Here’s a self-contained program that reproduces the issue I ran into:

#lang racket
(require racket/logging)

(define (slow-sum)
  (for/sum ([j (in-range 100000000)]) 1))

(define (go)
  (define workers
    (for/list ([i (in-range 8)])
      (thread (λ () (touch (future slow-sum))))))
  (for-each thread-wait workers))

(with-logging-to-port
 #:logger (current-logger)
 (current-output-port)
 (thunk (go) (newline) (newline) (newline) (go))
 'debug
 'future)

This runs the `go` procedure twice back-to-back, printing future trace messages to stdout. The first execution is good; futures start up in parallel:

future: id -1, process 0: created; time: 1590334415654.675049
future: id 1, process 1: started work; time: 1590334415654.749023
future: id 1, process 0: paused for touch; time: 1590334415654.894043
future: id -1, process 0: created; time: 1590334415654.912109
future: id 2, process 2: started work; time: 1590334415654.934082
future: id 2, process 0: paused for touch; time: 1590334415655.214111
future: id 1, process 1: completed; time: 1590334415878.041992
future: id 1, process 1: ended work; time: 1590334415878.049072
future: id 1, process 0: resumed for touch; time: 1590334415878.070068
future: id 2, process 2: completed; time: 1590334415878.167969
future: id 2, process 2: ended work; time: 1590334415878.173096
future: id 2, process 0: resumed for touch; time: 1590334415878.217041

(I reduced the number of futures here from 8 to 2 for this run to keep the log messages from being uselessly verbose for an email.)

However, on the second execution, I get no parallelism at all:

future: id -1, process 0: created; time: 1590334415878.292969
future: id 3, process 0: started work; time: 1590334415878.298096
future: id -1, process 0: created; time: 1590334415903.156006
future: id 4, process 0: started work; time: 1590334415903.163086
future: id 3, process 0: completed; time: 1590334416300.748047
future: id 3, process 0: ended work; time: 1590334416300.749023
future: id 4, process 0: completed; time: 1590334416322.684082
future: id 4, process 0: ended work; time: 1590334416322.684082

What’s going on? The problem is that `touch` is getting called before the futures have a chance to get started (possibly because the VM is now warmed up and things run faster?). Normally, that would only happen if the futures were really, really short: the first `touch` would run the first future on the main thread, and the other futures would start up in parallel while that first one is running. But in this program, each future is started on its own (green) thread, so I essentially created a race between the thread scheduler and the future scheduler.

It seems splitting the future creation from the thread creation is enough to make this issue go away:

(define (go)
  (define futures (for/list ([i (in-range 8)]) (future slow-sum)))
  (define workers (for/list ([f (in-list futures)])
                    (thread (λ () (touch f)))))
  (for-each thread-wait workers))

Now the futures always start up in parallel. This seems like it’s probably pretty reliable, so it isn’t really a problem, but I found the behavior surprising. It suggests the thread and future schedulers are blissfully unaware of one another: the thread scheduler is perfectly happy to run dozens of futures concurrently on the main OS thread rather than kick some of them onto another core. Normally this is something that work-stealing could fix, but it doesn’t seem like the futures scheduler ever steals work from futures executing on the main thread (presumably because those aren’t “real” futures at all).

Perhaps someone will find this information useful or interesting... or perhaps not. :)

Alexis