Yesterday, a user asked a question on Stack Overflow[1] about attempting
to parallelize a solution to the n-queens problem. The program in
question is short, so I can reproduce it here in its entirety:
#lang racket
; following returns true if queens are on diagonals:
(define (check-diagonals bd)
(for*/or ([r1 (in-range (length bd))]
[r2 (in-range (add1 r1) (length bd))])
(= (abs (- r1 r2))
(abs (- (list-ref bd r1)
(list-ref bd r2))))))
; set board size:
(define N 8)
; start searching:
(for ([brd (in-permutations (range N))])
(unless (check-diagonals brd)
(displayln brd)))
I make no comment on the efficiency of the algorithm (or even its
correctness), as I didn’t write it. Either way, it does seem like the
sort of thing that could be parallelized. The above program runs very
quickly, but if N is increased to something larger, like 11, it takes
an appreciable amount of time:
$ time racket n-queens.rkt
[program output elided]
14.44 real 13.92 user 0.32 sys
Fortunately, this seems like a textbook use case for fine-grained
parallel evaluation of subexpressions. Theoretically, it seems like it
should be possible to use for/async from racket/future to parallelize
the top-level for loop. However, making that change does not improve
performance at all; in fact, it significantly reduces it. Merely running
this version with N=9 takes ~6.5 seconds; with N=10, it takes ~55.
This is, however, not too surprising. Running the code with the futures
visualizer (using N=7) indicates that displayln is not legal from within
a future, preventing any parallel execution from ever actually taking
place. Presumably, we can fix this by creating futures that just compute
the results, then print them serially:
(define results-futures
(for/list ([brd (in-permutations (range N))])
(future (λ () (cons (check-diagonals brd) brd)))))
(for ([result-future (in-list results-futures)])
(let ([result (touch result-future)])
(unless (car result)
(displayln (cdr result)))))
With this change, we get a small speedup over the naïve attempt with
for/async, but we’re still far slower than the serial version. Now, with
N=9, it takes ~4.58 seconds, and with N=10, it takes ~44.
Taking a look at the futures visualizer for this program (again, with
N=7), there are now no blocks, but there are some syncs (on
jit_on_demand and allocate memory). However, after some time spent
jitting, execution seems to get going, and it actually runs a lot of
futures in parallel[2]! After a little bit of that, however, the
parallel execution seems to run out of steam, and things seem to start
running relatively serially again[3].
I wasn’t really sure what was going on here, but I thought maybe it was
because some of the futures are just too *small*. It seems possible that
the overhead of scheduling thousands of futures (or, in the case of
N=10, millions) was far outweighing the actual runtime of the work being
done in the futures. Fortunately, this seems like something that could
be solved by just grouping the work into chunks, something that’s
relatively doable using in-slice:
(define results-futures
(for/list ([brds (in-slice 10000 (in-permutations (range N)))])
(future
(λ ()
(for/list ([brd (in-list brds)])
(cons (check-diagonals brd) brd))))))
(for* ([results-future (in-list results-futures)]
[result (in-list (touch results-future))])
(unless (car result)
(displayln (cdr result))))
It seems that my suspicion was correct, because that change helps a
whole lot. Now, running the parallel version of the program only takes
~3.9 seconds for N=10, a speedup of more than a factor of 10 over the
previous version using futures. However, unfortunately, that’s still
*slower* than the completely serial version, which only takes ~1.4
seconds. Increasing N to 11 makes the parallel version take ~44 seconds,
and if the chunk size provided to in-slice is increased to 100,000, it
takes even longer, ~55 seconds.
Taking a look at the future visualizer for that version of the program,
with N=11 and a chunk size of 1,000,000, I see that there seem to be
some periods of extended parallelism, but the futures get blocked a lot
on memory allocation[4]. This makes sense, since now each future is
handling much more work, but it means the futures are synchronizing
constantly, likely leading to the significant performance overhead I’m
seeing.
At this point, I’m not sure there’s much else I know how to tweak to
improve future performance. I tried cutting down allocation by using
vectors instead of lists and specialized fixnum operations, but for
whatever reason, that seemed to completely destroy parallelism. I
thought that maybe that was because the futures were never starting up
in parallel, so I replaced future with would-be-future, but the results
were confusing to me, and I didn’t really understand what they meant.
Either way, this is a little troubling, since this task seems extremely
contrived, so I would hope it would be something that futures could
easily cope with. I think I’ve heard mixed thoughts on whether or not
futures have “panned out” — my understanding is that they seem promising
but really been able to be applied practically, in contrast with places,
which have been used to good effect for the parallel build process, for
example. Is there something that I can do to get a real performance
boost with futures on this problem, or is it a lost cause?
Furthermore, if this problem really is a lost cause, what are futures
actually useful for? The documentation would imply that they are useful
for lots of number crunching using machine integers or flonums, but
that’s a pretty specific domain. I would hope that they would be able
to work with simple list-based operations as well, since operations that
are truly exclusively numeric are relatively uncommon in my experience.
Thanks,
Alexis
[1]:
https://stackoverflow.com/q/46144576/465378
[2]:
http://i.imgur.com/zdXt4Ei.png
[3]:
http://i.imgur.com/e6yPlJ8.png
[4]:
http://i.imgur.com/hzPDsy9.png