Do futures actually work? What are they useful for?

Alexis King

unread,

Sep 11, 2017, 4:46:18 PM9/11/17

to Racket-Users List

Yesterday, a user asked a question on Stack Overflow[1] about attempting
to parallelize a solution to the n-queens problem. The program in
question is short, so I can reproduce it here in its entirety:

#lang racket

; following returns true if queens are on diagonals:
(define (check-diagonals bd)
(for*/or ([r1 (in-range (length bd))]
[r2 (in-range (add1 r1) (length bd))])
(= (abs (- r1 r2))
(abs (- (list-ref bd r1)
(list-ref bd r2))))))

; set board size:
(define N 8)
; start searching:
(for ([brd (in-permutations (range N))])
(unless (check-diagonals brd)
(displayln brd)))

I make no comment on the efficiency of the algorithm (or even its
correctness), as I didn’t write it. Either way, it does seem like the
sort of thing that could be parallelized. The above program runs very
quickly, but if N is increased to something larger, like 11, it takes
an appreciable amount of time:

$ time racket n-queens.rkt
[program output elided]
14.44 real 13.92 user 0.32 sys

Fortunately, this seems like a textbook use case for fine-grained
parallel evaluation of subexpressions. Theoretically, it seems like it
should be possible to use for/async from racket/future to parallelize
the top-level for loop. However, making that change does not improve
performance at all; in fact, it significantly reduces it. Merely running
this version with N=9 takes ~6.5 seconds; with N=10, it takes ~55.

This is, however, not too surprising. Running the code with the futures
visualizer (using N=7) indicates that displayln is not legal from within
a future, preventing any parallel execution from ever actually taking
place. Presumably, we can fix this by creating futures that just compute
the results, then print them serially:

(define results-futures
(for/list ([brd (in-permutations (range N))])
(future (λ () (cons (check-diagonals brd) brd)))))

(for ([result-future (in-list results-futures)])
(let ([result (touch result-future)])
(unless (car result)
(displayln (cdr result)))))

With this change, we get a small speedup over the naïve attempt with
for/async, but we’re still far slower than the serial version. Now, with
N=9, it takes ~4.58 seconds, and with N=10, it takes ~44.

Taking a look at the futures visualizer for this program (again, with
N=7), there are now no blocks, but there are some syncs (on
jit_on_demand and allocate memory). However, after some time spent
jitting, execution seems to get going, and it actually runs a lot of
futures in parallel[2]! After a little bit of that, however, the
parallel execution seems to run out of steam, and things seem to start
running relatively serially again[3].

I wasn’t really sure what was going on here, but I thought maybe it was
because some of the futures are just too *small*. It seems possible that
the overhead of scheduling thousands of futures (or, in the case of
N=10, millions) was far outweighing the actual runtime of the work being
done in the futures. Fortunately, this seems like something that could
be solved by just grouping the work into chunks, something that’s
relatively doable using in-slice:

(define results-futures
(for/list ([brds (in-slice 10000 (in-permutations (range N)))])
(future
(λ ()
(for/list ([brd (in-list brds)])
(cons (check-diagonals brd) brd))))))

(for* ([results-future (in-list results-futures)]
[result (in-list (touch results-future))])
(unless (car result)
(displayln (cdr result))))

It seems that my suspicion was correct, because that change helps a
whole lot. Now, running the parallel version of the program only takes
~3.9 seconds for N=10, a speedup of more than a factor of 10 over the
previous version using futures. However, unfortunately, that’s still
*slower* than the completely serial version, which only takes ~1.4
seconds. Increasing N to 11 makes the parallel version take ~44 seconds,
and if the chunk size provided to in-slice is increased to 100,000, it
takes even longer, ~55 seconds.

Taking a look at the future visualizer for that version of the program,
with N=11 and a chunk size of 1,000,000, I see that there seem to be
some periods of extended parallelism, but the futures get blocked a lot
on memory allocation[4]. This makes sense, since now each future is
handling much more work, but it means the futures are synchronizing
constantly, likely leading to the significant performance overhead I’m
seeing.

At this point, I’m not sure there’s much else I know how to tweak to
improve future performance. I tried cutting down allocation by using
vectors instead of lists and specialized fixnum operations, but for
whatever reason, that seemed to completely destroy parallelism. I
thought that maybe that was because the futures were never starting up
in parallel, so I replaced future with would-be-future, but the results
were confusing to me, and I didn’t really understand what they meant.

Either way, this is a little troubling, since this task seems extremely
contrived, so I would hope it would be something that futures could
easily cope with. I think I’ve heard mixed thoughts on whether or not
futures have “panned out” — my understanding is that they seem promising
but really been able to be applied practically, in contrast with places,
which have been used to good effect for the parallel build process, for
example. Is there something that I can do to get a real performance
boost with futures on this problem, or is it a lost cause?

Furthermore, if this problem really is a lost cause, what are futures
actually useful for? The documentation would imply that they are useful
for lots of number crunching using machine integers or flonums, but
that’s a pretty specific domain. I would hope that they would be able
to work with simple list-based operations as well, since operations that
are truly exclusively numeric are relatively uncommon in my experience.

Thanks,
Alexis

[1]: https://stackoverflow.com/q/46144576/465378
[2]: http://i.imgur.com/zdXt4Ei.png
[3]: http://i.imgur.com/e6yPlJ8.png
[4]: http://i.imgur.com/hzPDsy9.png

Alex Harsanyi

unread,

Sep 11, 2017, 6:38:18 PM9/11/17

to Racket Users

On Tuesday, September 12, 2017 at 4:46:18 AM UTC+8, Alexis King wrote:

Furthermore, if this problem really is a lost cause, what are futures
actually useful for? The documentation would imply that they are useful
for lots of number crunching using machine integers or flonums, but
that’s a pretty specific domain. I would hope that they would be able
to work with simple list-based operations as well, since operations that
are truly exclusively numeric are relatively uncommon in my experience.

This has been my experience as well : futures seem to block on pretty much any operation that is not a `fl+`, `fl-` or similar. I also noticed that accessing data stored in lists and vectors (even if these are not modified), seems to block the futures. This is even more frustrating, since even after you get parallel implementation working, a simple and seemingly harmless change to the code can silently break the parallelism.

In my case, I just gave up on them and optimized the algorithms for "single thread" execution. So far, this has been fast enough in my use case.

Best Regards,

Alex.

Message has been deleted

Jon Zeppieri

unread,

Sep 11, 2017, 6:45:45 PM9/11/17

to mrmyers.ra...@gmail.com, Racket Users

> On Sep 11, 2017, at 6:39 PM, mrmyers.ra...@gmail.com wrote:
>
> As far as I'm aware, futures usually shouldn't improve performance outside of networking or hardware-latency type situations. The main goal of futures is just time-sharing, not improving performance. It doesn't genuinely do things in parallel, it just interleaves the execution of several things at once.

This isn't true. Futures are for parallelism; they just happen to be defeated by many, many operations. More to the point, they're not for interleaving work. Racket's threads are for that.

Message has been deleted

James Swaine

unread,

Sep 11, 2017, 7:29:51 PM9/11/17

to mrmyers.ra...@gmail.com, Racket Users

They do work. They do exactly as advertised in the documentation. As far as their usefulness goes, I think it is fair to say that they are not as useful as they could be, or put another way, the set of operations that are future-safe is too small to make them practical for many folks.

On Mon, Sep 11, 2017 at 5:53 PM <mrmyers.ra...@gmail.com> wrote:

Oh. That does seem troubling then.

--
You received this message because you are subscribed to the Google Groups "Racket Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to racket-users...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Matthias Felleisen

unread,

Sep 12, 2017, 10:57:04 AM9/12/17

to Racket Users

A word of caution on parallelism in general. Not too long ago, someone in CS at an Ivy League school studied the use of parallelism across different uses and applications. The goal was to find out how much the ‘hype’ about parallelism affected people. My recollection is that they found close to 20 projects where professors (in CE, EE, CS, Bio, Econ, etc) told their grad students/PhDs to use parallelism to run programs faster. They checked all of them and for N - 1 or 2, the projects ran faster once the parallelism was removed. Significantly faster.

People routinely underestimate the communication cost that parallelism introduces.

;; - - -

A word on futures. As James said, they work as advertised but if you do read the fine print, you need to understand that in Racket, too many operations block futures.

This obstacle will require a decent amount of labor on run-time system.

Even if we overcame this obstacle, we would soon run into others that people in the 80s and 90s encountered and addressed with various techniques (work stealing load balancing etcetc). We would need to implement all of this to catch up and provide a sufficient degree of convenience for futures.

Then it would be *really good thing* for Racket programmers and would even turn into a research project.

PRs welcome.

Philip McGrath

unread,

Sep 12, 2017, 11:47:21 AM9/12/17

to Matthias Felleisen, Racket Users

It may well be much too early to say, but is there a chance that the planned move to Chez will result in fewer operations blocking futures? From what I read of the plans (https://groups.google.com/d/msg/racket-dev/2BV3ElyfF8Y/UaHcKAHcCAAJ):

Chez exposes OS-level threads with limited safety guarantees. An implementation of futures can probably take advantage of threads with thread-unsafe primitives wrapped to divert to a a barrier when used in a future.
In other words, while we made futures work for Racket by ensuring that various future-safe primitives specifically cooperate, it looks like we can do the opposite on Chez: wrap future-unsafe primitives specifically. If that works, no changes are needed to Chez.

-Philip

--
You received this message because you are subscribed to the Google Groups "Racket Users" group.

To unsubscribe from this group and stop receiving emails from it, send an email to racket-users+unsubscribe@googlegroups.com.

Alexis King

unread,

Sep 12, 2017, 4:57:46 PM9/12/17

to Matthias Felleisen, James Swaine, Racket Users

> On Sep 12, 2017, at 7:57 AM, Matthias Felleisen <matt...@ccs.neu.edu>

> wrote:
>
> A word of caution on parallelism in general. Not too long ago, someone
> in CS at an Ivy League school studied the use of parallelism across
> different uses and applications. The goal was to find out how much the
> ‘hype’ about parallelism affected people. My recollection is that they
> found close to 20 projects where professors (in CE, EE, CS, Bio, Econ,
> etc) told their grad students/PhDs to use parallelism to run programs
> faster. They checked all of them and for N - 1 or 2, the projects ran
> faster once the parallelism was removed. Significantly faster.

Hah! I’ve heard similar anecdotes before, but this is an especially
amusing one. I’d love to have a citation for something like this if one
exists.

> A word on futures. As James said, they work as advertised but if you
> do read the fine print, you need to understand that in Racket, too
> many operations block futures.
>
> This obstacle will require a decent amount of labor on run-time
> system.

Yes, this has described my experience pretty well. I did two things
since my experiment: I read the paper on the future visualizer, and I
reimplemented the same experiment in Haskell. The former was helpful —
it gave me a little bit more perspective on the way they’re intended to
be used — and the latter mostly just provides a bit of a baseline for
what I feel I could feasibly hope for.

I rewrote the Racket program in Haskell, trying to do as direct a
translation as possible. Here’s the program, adjusted very slightly to
make it easy to add parallelism:

import Data.List (permutations)
import Data.Maybe (catMaybes)

checkDiagonals :: [Int] -> Bool
checkDiagonals bd =
or $ flip map [0 .. length bd - 1] $ \r1 ->
or $ flip map [r1 + 1 .. length bd - 1] $ \r2 ->
abs (r1 - r2) == abs ((bd !! r1) - (bd !! r2))

n :: Int
n = 11

main :: IO ()
main =
let results = flip map (permutations [0 .. n-1]) $ \brd ->
if checkDiagonals brd then Nothing else Just brd
in mapM_ print (catMaybes results)

I was able to easily add some parallelism using the
Control.Parallel.Strategies library. I added a line to the main function
that introduced some parallel evaluation:

import Control.Parallel.Strategies
import Data.List.Split (chunksOf)

main :: IO ()
main =
let results =
concat . withStrategy (parBuffer 10 rdeepseq) . chunksOf 100 $
flip map (permutations [0 .. n-1]) $ \brd ->
if checkDiagonals brd then Nothing else Just brd
in mapM_ print (catMaybes results)

It took some time to figure out the right chunk and rolling buffer
sizes, but these values gave me a consistent 30-40% speedup over the
original, sequential program.

Now, obviously, Haskell’s runtime is considerably more suited for
parallel programming than Racket’s, so this comparison is hardly fair.
But it helped me to see for myself that, despite having 4 cores (8 with
hyperthreading) at my disposal, I wasn’t able to get even a 50% speedup,
which is in line with your comments about the overhead of parallel
programming.

Still, I wonder how out of reach this sort of performance boost is for
Racket. As Philip said, I’m especially curious to learn more details
about how the Chez rewrite will impact Racket’s support for parallelism.
Might we get access to a lightweight thread primitive that provides
parallelism in addition to concurrency, a la GHC’s threads?

Either way, thank you, James and Matthias, for your frank feedback.

Reply all

Reply to author

Forward