trying to use futures for some calculations


Alex Harsanyi

Jun 17, 2020, 4:50:44 AM
to Racket Users

I am trying to speed up an algorithm using futures, but I am getting some unexpected results (and no real speed improvements), and I was wondering if someone more experienced could have a look at the code and tell me what I am doing wrong.

I put up the code in this repository: https://github.com/alex-hhh/cp3-exeriments; unfortunately, it is the simplest meaningful example that I could come up with.  Most of the functions, however, are just support functions, and there are six implementations of the same algorithm.

Basically, the problem I am trying to solve is fitting a model to existing data, and this is done by evaluating 2.5 million combinations of parameters.  My best non-futures-based algorithm can do this in about 3 seconds (8 seconds in DrRacket).

Given that each of these 2.5 million combinations is completely independent of the others, they could all be done in parallel.  With that in mind, I "sliced" the combinations into 30 groups, tried to perform each "slice" in its own future, and selected the best among the 30 results produced by these futures.
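
In outline, the structure is something like the following sketch (for illustration only -- this is not the repository code, and evaluate-slice is just a toy stand-in for the real per-slice search):

(require racket/future)

;; Toy stand-in for the per-slice search: return the best (smallest)
;; value found in a (start . end) range.
(define (evaluate-slice slice)
  (for/fold ([best +inf.0])
            ([i (in-range (car slice) (cdr slice))])
    (min best (sin (exact->inexact i)))))

(define (search-in-futures slices)
  ;; run every slice in its own future ...
  (define fs
    (for/list ([s (in-list slices)])
      (future (λ () (evaluate-slice s)))))
  ;; ... then wait for all of them and keep the best partial result
  (apply min (map touch fs)))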

Unfortunately, while the futures versions of the algorithm produce the correct result, they run at the same speed as the non-futures version.  `processor-count` returns 8 on my machine, so I would expect at least some level of parallelism.

As a first step, I tried using `would-be-future` to see if it reported any operations that might block, but nothing was printed out.

I also tried using the futures visualizer, and I found the following:

* the code appears to be blocking on primitive operations, such as +, -, < etc.
* I suspect these operations are inside the code generated by the `for` loops, so I am not sure how to remove them without making the code even more difficult to read.
* there seems to be a lot more time spent in the garbage collector when running the futures visualizer than without it (DrRacket runs with unlimited memory)

So I am wondering if someone who is more familiar with futures can look at the code and provide some hints about what can be done to make this code run in parallel (or if it cannot, I would like to understand why).

This is already a long message, so I will not add further details here, but the repository at https://github.com/alex-hhh/cp3-exeriments has an explanation of what every function does, and I am happy to provide further clarifications if needed.

Thanks,
Alex.

Brian Adkins

Jun 17, 2020, 9:56:13 AM
to Racket Users
On Wednesday, June 17, 2020 at 4:50:44 AM UTC-4, Alex Harsanyi wrote:

I am trying to speed up an algorithm using futures, but I am getting some unexpected results (and no real speed improvements), and I was wondering if someone more experienced could have a look at the code and tell me what I am doing wrong.
[...]

I would *love* to be proven wrong on this, but I think it's rare to be able to get decent parallelization in practice using futures. You may have better results using places, but it will depend on how the amount of processing for a unit of work compares to the overhead of communicating with the places; i.e., you may get better results with 2 places than with 8 due to place communication overhead. In your case, if it's easy for each place to compute its own set of parameters, then the place overhead may be small, since I think each place would simply need to communicate its best value.
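
For illustration, a minimal place-based sketch of that idea might look something like this (the details are assumptions rather than tested code, and evaluate-slice is just a toy stand-in for the real per-slice search):

#lang racket
(require racket/place)

;; Toy stand-in for the per-slice search: scan a (start . end) range and
;; return the smallest value found, as a single flonum.
(define (evaluate-slice slice)
  (for/fold ([best +inf.0])
            ([i (in-range (car slice) (cdr slice))])
    (min best (sin (exact->inexact i)))))

;; Start one place per slice; each place receives its slice over the
;; channel and replies with just its best value, so the only cross-place
;; communication is one message in and one message out.
(define (start-worker slice)
  (define p
    (place ch
      (place-channel-put ch (evaluate-slice (place-channel-get ch)))))
  (place-channel-put p slice)
  p)

(module+ main
  (define slices '((0 . 1250000) (1250000 . 2500000)))
  (define workers (map start-worker slices))
  (displayln (apply min (map place-channel-get workers))))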

Brian Adkins

Jun 17, 2020, 10:08:06 AM
to Racket Users
On Wednesday, June 17, 2020 at 4:50:44 AM UTC-4, Alex Harsanyi wrote:

I am trying to speed up an algorithm using futures, but I am getting some unexpected results (and no real speed improvements), and I was wondering if someone more experienced could have a look at the code and tell me what I am doing wrong.

I put up the code in this repository: https://github.com/alex-hhh/cp3-exeriments; unfortunately, it is the simplest meaningful example that I could come up with.  Most of the functions, however, are just support functions, and there are six implementations of the same algorithm.
[...]

This is entirely unrelated to your question, but I'm curious about your ranges. Andy Coggan defines the following:

Neuromuscular:  < 30 seconds
Anaerobic:  30 seconds to 3 minutes
VO2max:  3 minutes to 8 minutes
Lactate Threshold: 8 to 30 minutes
Tempo:  60 to 180 minutes
Endurance:  60 to 300 minutes

I'm curious about your using 2 to 5 minutes for anaerobic, in particular because I'm targeting 5 minutes as an important benchmark, and I've considered that to be high-level aerobic with a strong anaerobic contribution.

Sam Tobin-Hochstadt

Jun 17, 2020, 10:24:53 AM
to Alex Harsanyi, Racket Users
I have not yet done much investigation into this, but:

- on Racket BC, operations like `+` do indeed block, and effectively
you need to replace them with lower-level operations that don't (such
as `unsafe-fl+`). Typed Racket can help with this, or you can do it
all by hand. As you note, that makes the code more painful to read.
- on Racket CS, operations like `+` do not block, and I see much
better speedup. I changed the third range to 720-800 to get answers
quicker, and I got numbers like:

[samth@homer:/tmp/cp3-exeriments (master) plt] racketcs main.rkt
cp3-baseline:
cpu time: 1947 real time: 1947 gc time: 10
cp3-precomputed:
cpu time: 399 real time: 399 gc time: 4
cp3-precomputed-more:
cpu time: 475 real time: 475 gc time: 3
cp3-futures:
cpu time: 4285 real time: 740 gc time: 11
cp3-precomputed-futures:
cpu time: 785 real time: 138 gc time: 3
cp3-precomputed-more-futures:
cpu time: 876 real time: 153 gc time: 4

So a more than 2x increase in cpu time, but a more than 2x decrease in
wall-clock time.

Certainly more investigation is needed to figure out why things take
so much longer total, but this seems like a promising speedup.

- The futures-visualizer uses logging and a functional graphics
library, both of which will allocate a lot more. You can use
`trace-futures` and `show-visualizer` to separate out the gui display
from execution, which might help.
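
Something along these lines should do it (a sketch only; check the future-visualizer docs for the exact signatures -- the #:timeline keyword here is from memory, and run-computation is a placeholder for your actual code):

(require future-visualizer
         future-visualizer/trace)

;; record trace events without the GUI attached
(define timeline
  (trace-futures
   (run-computation)))

;; bring up the visualizer on the recorded timeline afterwards
(show-visualizer #:timeline timeline)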

Sam

Matthew Flatt

Jun 17, 2020, 10:36:45 AM
to Sam Tobin-Hochstadt, Alex Harsanyi, Racket Users
At Wed, 17 Jun 2020 10:24:37 -0400, Sam Tobin-Hochstadt wrote:
> - on Racket BC, operations like `+` do indeed block

... which happens when mixing, say, fixnum and flonum arguments, but not
when operating on all fixnums or all flonums.

In this case, it may be the `in-range` with flonum bounds that results
in `+` with fixnum 1 and a flonum.
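
For example, the difference is roughly this (a sketch, not the code from the repository):

;; With flonum bounds and no step, `in-range` defaults to a step of 1
;; (a fixnum), so each iteration adds a fixnum to a flonum:
(for ([x (in-range 1.0 100.0)])
  (void x))

;; Passing an explicit flonum step keeps the addition all-flonum:
(for ([x (in-range 1.0 100.0 1.0)])
  (void x))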

Robby Findler

Jun 17, 2020, 11:23:28 AM
to Brian Adkins, Racket Users
While this may be true, it is also the case that the design of futures is such that incremental work on the primitives turns into an incremental ability to parallelize programs. So while it is likely to be more work today, it may also be the case that people putting in effort to help their own programs will help us turn the corner here. Perhaps this is a place where an interested contributor could help us out a lot!

Robby

Dominik Pantůček

Jun 17, 2020, 11:49:56 AM
to racket...@googlegroups.com
I've looked at it only briefly (it's the end of the semester and grading
is due soon).

>
>
> I would *love* to be proven wrong on this, but I think it's rare to
> be able to get decent parallelization in practice using futures. You
> may have better results using places, but it will depend on how the
> amount of processing for a unit compares to the overhead of
> communicating with the places i.e. you may get better results with 2
> places than with 8 due to place communication overhead. In your
> case, if it's easy for the places to input their own sets of
> parameters, then the place overhead may be small since I think each
> place would simply need to communicate its best value.
>

This is not even remotely true: I am using futures to get 100%
utilization on all available cores. The current situation, though, is
that it takes quite some effort to leverage futures to get there.

A few generic remarks first. Arbitrary partitioning does not work well
with futures. I always partition the work based on the processor-count
with something like:

(define futures-depth
  (make-parameter (inexact->exact (ceiling (log (processor-count) 2)))))

(define-syntax (define-futurized stx)
  (syntax-case stx ()
    ((_ (proc start end) body ...)
     #'(begin
         (define max-depth (futures-depth))
         (define (proc start end (depth 0))
           (cond ((fx< depth max-depth)
                  ;; split the range in half: search the left half in a
                  ;; future, the right half directly, then touch
                  (define mid (fx+ start (fxrshift (fx- end start) 1)))
                  (let ((f (future
                            (λ ()
                              (proc start mid (fx+ depth 1))))))
                    (proc mid end (fx+ depth 1))
                    (touch f)))
                 (else
                  body ...)))))))

Of course, all those fx+, fx- and fx< must be the unsafe versions from
racket/unsafe/ops.
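
For illustration, a hypothetical use of the macro could look like this (do-one-candidate is a placeholder for the real per-combination work; results would be accumulated through side effects, e.g. into an flvector):

(define-futurized (search-range start end)
  (for ([i (in-range start end)])
    (do-one-candidate i)))

(search-range 0 2500000)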

The second problem is the allocation of flonums. The inner part of the
loop looks like it triggers the allocator rather often, even with flonum
inlining. With CS, just forget about this until the inlined-flonums work
is merged. In the meantime, you can drop the for/fold and use an flvector
to store and accumulate whatever you need. Using the futures visualizer
is a good start.
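
The general shape of that change might be something like this (a toy sketch, not the actual loop from the repository):

(require racket/flonum)

;; Instead of threading a flonum accumulator through for/fold, write the
;; running value into a preallocated flvector slot.
(define (sum-of-squares lo hi step)   ; lo, hi and step are flonums
  (define acc (make-flvector 1 0.0))
  (let loop ([x lo])
    (when (fl< x hi)
      (flvector-set! acc 0 (fl+ (flvector-ref acc 0) (fl* x x)))
      (loop (fl+ x step))))
  (flvector-ref acc 0))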

I'll look into it later this week. But generally you need to stick to
unsafe ops and avoid the allocator.

>
>
> While this may be true, it is also the case that the design of futures
> is such that incremental work on the primitives turns into incremental
> ability to parallelize programs. So while it is likely to be more work
> today, it may also be the case that people putting effort in to help
> their own programs will help us turn the corner here. Perhaps this is a
> place where an interested contributor can help us out a lot!

It is on the list :)


Dominik

Jos Koot

Jun 17, 2020, 12:23:47 PM
to Robby Findler, Brian Adkins, Racket Users

It’s been a long time, but I worked with parallelization on a CDC 205, in both assembler and Fortran (CDC folded shortly after). IIRC it did not protect against simultaneously updating the same memory location. But what I do recall is that it is very important for parallel processes to use distinct parts of memory. So I wonder: give each future or place its own copy of the data it needs, and produce data in locations local to that place or future before returning them. Of course, it may happen that a procedure is called for which Racket cannot know which parts of memory it will access, but for (very) primitive arithmetic functions this should be an avoidable problem, I think.

Just my single one cent.

Jos


Jos Koot

Jun 17, 2020, 12:34:07 PM
to Robby Findler, Brian Adkins, us...@racket-lang.org, Racket Users

In addition, some time ago I used futures, and they parallelized very well when each future worked on a distinct part of the same vector. I did not need unsafe operations for that.
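
For illustration, that pattern might look roughly like this (a sketch with made-up work, not my original code): each future writes only to its own disjoint range of a shared vector, so no two futures touch the same slot.

(require racket/future)

(define (parallel-fill! vec work)
  (define n (vector-length vec))
  (define mid (quotient n 2))
  (define (fill! from to)
    (for ([i (in-range from to)])
      (vector-set! vec i (work i))))
  ;; one half in a future, the other half on the main thread
  (define f (future (λ () (fill! 0 mid))))
  (fill! mid n)
  (touch f)
  vec)

(parallel-fill! (make-vector 10 0) (λ (i) (* i i)))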

Jos

 

Sam Tobin-Hochstadt

Jun 17, 2020, 1:14:50 PM
to Matthew Flatt, Alex Harsanyi, Racket Users
I tried this out by adding 1.0 as the third argument to `in-range` in
all cases. The performance in Racket BC improved, but there's still
no parallelism. In Racket CS, it appears to have made things slower,
so I need to investigate more.

Sam

Dominik Pantůček

Jun 17, 2020, 4:45:38 PM
to racket...@googlegroups.com
Hi Alex,

I finally got to investigate the issue in depth, and there are two major
problems blocking your implementation from running the futures in parallel.

1) Allocations of boxed flonums. I tried to get rid of those by
allocating "scratchpad" flvectors and mapping everything onto them. The
future scheduling started to look different; however, there are still
many allocations because of the loops in your cp3-futures and
evaluate-cost functions.

I didn't finish the work, though, because I noticed another strange
thing: the Typed Racket code in the lambda returned by spline (your
mmax-fn argument) blocks parallel execution almost entirely on
... type checks.

2) All those =, < and friends just block it.

So how to fix this? Well, it is relatively easy albeit quite a lot of
manual work.

(I am looking only at the cp3-futures function when talking about
possible improvements).

* Change all the for loops to named let loops and use preallocated
flvectors to hold all the values (see the sketch after this list).
* Switch to racket/unsafe/ops for everything inside futures (this is not
strictly necessary, but it takes away a few possible surprises).
* Restructure the way you split the work into 30 futures and just use a
binary tree of futures, as suggested earlier.
* Use racket/unsafe/ops from regular Racket to implement the spline
interpolation. I would also move the coefficients into one huge flvector
and forget about lists altogether. This is a very specific workload.
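
A rough sketch of the first two bullets (not the actual cp3 code; cost stands in for the real evaluation):

(require racket/flonum
         racket/unsafe/ops)

;; Named-let loop over a flonum grid, keeping the running best cost and
;; its argument in a preallocated flvector, using unsafe ops throughout.
(define (best-over-grid lo hi step cost)   ; all flonums; cost : flonum -> flonum
  (define best (make-flvector 2 +inf.0))   ; slot 0: best cost, slot 1: argmin
  (let loop ([x lo])
    (when (unsafe-fl< x hi)
      (define c (cost x))
      (when (unsafe-fl< c (unsafe-flvector-ref best 0))
        (unsafe-flvector-set! best 0 c)
        (unsafe-flvector-set! best 1 x))
      (loop (unsafe-fl+ x step))))
  (values (flvector-ref best 0) (flvector-ref best 1)))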

And do not worry about GCs. Once you get rid of all allocations inside
futures, the GCs will disappear.

Also bear in mind that my suggestions are rather low level. By following
them you will get the algorithm to scale over multiple cores if you
really need it. Just remember it will be at a huge cost to readability.
You can slightly mitigate this by some refactoring and custom syntax,
but that is even more work and I would really consider whether you need
the parallelism for a computation that takes a few seconds anyway.

Of course, if you plan to use your algorithm on a much bigger data set,
this is the way to go (including the custom syntax).


Cheers,
Dominik

Alex Harsanyi

Jun 18, 2020, 2:06:33 AM
to Racket Users

Hi Dominik,

Thanks for taking the time to look into this.  For most of your suggestions, I already suspected this would be the case, as I have attempted to use futures several times in the past, but it is good to know that other people are of the same opinion.

I looked at some other suggestions in this thread, and they do make small improvements to the speed, but nothing spectacular.  The biggest improvement (about 30%) comes from using Typed Racket -- this is attractive, as the code looks nice and has type annotations too.  For realistic data sets, the best TR version still takes about 5 seconds.  While 5 seconds is not a lot, the code will be added to a GUI application, which means that I will need to implement some kind of progress bar for the user to look at while the results come in...

Given these experiments and this discussion, and from my personal point of view, futures are not attractive: they require a lot of effort to get any performance benefits out of them, and simple, "innocent" code changes can result in a complete loss of performance.  Also, the resulting code is not particularly elegant or maintainable, especially once I start using the unsafe operations.

Out of curiosity, I also implemented the straightforward CP3 search (cp3-baseline) in C++ and it runs in 0.7 seconds.

However, for now, I will go with the Typed Racket solution and just add a progress bar to the GUI.

Thanks everyone for looking at this code and providing suggestions.

Alex.

Alex Harsanyi

Jun 18, 2020, 2:20:57 AM
to Racket Users
Perhaps my choice of interval names is not ideal, so I will try to explain it:

I am trying to fit a function which uses 3 parameters, so I need three data points to solve for these parameters.  The three points are drawn from three intervals, and rather than calling the intervals ivl1, ivl2, and ivl3, I gave them names which seemed roughly appropriate.

As for the values I provided, they seem ok for getting a good fit, but in the final application, the user will be able to specify any range they want (as long as the ranges don't overlap).

Alex.