Parallel merge-sort leveraging futures

57 views
Skip to first unread message

Dominik Pantůček

unread,
Oct 7, 2019, 9:01:25 AM10/7/19
to Racket Users
Hello,

over the course of past few months I have been tasked with solving a
(real-world) problem of sorting the sums of all possible combinations of
numbers. Some boring accounting stuff revolving around invoices. As the
total number of combinations is 2^N where N is the number of source
numbers, this task got "interesting" once there were more than 24
numbers - that is on my laptop with:

model name : Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz

4 cores, 8 hyper-threads and 16GB RAM.

Long story short, I had to implement a custom merge-sort for fxvectors
and managed to leverage futures for fully parallel execution. Now in
some spare time, I slightly polished the code and released it as
"futures-sort" package.

The source code is available on GitHub[1] and it is already in Racket
package index[2].

The package provides four sorting functions - two for vector? and two
for fxvector?. There are two variants of each, one with progress
reporting and one without. For the original task I needed to show
progress using racket/gui gauge% and it might be helpful for others
(it's just a callback that gets current and total number of merges
performed regularly).

I would really appreciate any feedback regarding the naming conventions,
code style and generally anything else.

Yes, the tests are not there (yet).

The same algorithm without futures runs on par with Racket's default
sorting routines (tested for sets upto 2^30 - only 16G of memory here)
and for fixnums it is even (very slightly but consistently) faster than
vector-sort alone. With parallelism using futures it runs roughly N
times faster where N is the number reported by (processor-count).

I have already some benchmark results available and once I get more data
I am more than happy to share it as well.

The parallelism is configurable as parameter.

And yes, all of this can be seen in the documentation which is included
but I still have to look into how to build it - I was sort of assuming
this works automatically based on what I see for example in
scribble-math[3][4]. Any help with getting this right is also more than
welcome.


Lastly - thanks go out to Jens Axel Søgaard for pointing out that
fixnums and fxvectors might help (and why) and also to other helpful
folks on #racket at Freenode.


Cheers,
Dominik


[1] https://github.com/dzoep/futures-sort
[2] https://pkgd.racket-lang.org/pkgn/package/futures-sort
[3] https://docs.racket-lang.org/scribble-math/index.html
[4] https://pkgd.racket-lang.org/pkgn/package/scribble-math

Alex Harsanyi

unread,
Oct 7, 2019, 8:17:47 PM10/7/19
to Racket Users
Hi Dominik,

I tried to use your package and you are missing a dependency for the `scribble-math` package in your info.rkt file, this has to be installed separately otherwise your package won't install.

However, I tried to use the `vector-futures-sort!` function and got an error:

> (vector-futures-sort! #(3 1 7 6 4) <)
. . vector-futures-sort!: broke its own contract
  promised: vector?
  produced: #<void>
  in: the res result of
      (->i
       ((unsorted vector?))
       ((compare procedure?))
       (res vector?))
  contract from: <pkgs>/futures-sort/main.rkt
  blaming: <pkgs>/futures-sort/main.rkt
   (assuming the contract is correct)
  at: <pkgs>/futures-sort/main.rkt:40.2

I would also be interested to know what performance gains did you get by using
futures.  I experimented with futures a few years ago and noticed that they
will block when accessing shared data structures.  My understanding is that
all the futures will wait for each other while trying to access and set
elements in the vector and this the execution will be mostly serial with no
performance gains.  I am curious if anything has changed with regards to this
in the last few years.

Alex.

Sam Tobin-Hochstadt

unread,
Oct 8, 2019, 12:58:27 AM10/8/19
to Alex Harsanyi, Racket Users
I've submitted a pull request fixing those errors and supporting
running in serial mode.

One thing I notice is that it's substantially faster than
`vector-sort!` even when run in serial mode on traditional Racket (but
not on Racket CS), so perhaps this should be integrated (or there are
improvements that we can adopt).

Alex, just to clarify a couple things about futures: they block when
performing operations that are either sensitive to the current
continuation or need single-threaded access to the runtime. Vector
operations, as you see in Dominik's code, are not in this category.
Mutable hash table operations are blocking. Here's some numbers with
performance on sorting a vector with random fixnums using this package
on a 4 core machine with hyperthreading, "classic" is `vector-sort!`.
Note that the futures-sort library creates 2^k futures.

k: 0
cpu time: 3402 real time: 3399 gc time: 45
k: 1
cpu time: 3487 real time: 1834 gc time: 46
k: 2
cpu time: 3581 real time: 1097 gc time: 69
k: 4
cpu time: 5745 real time: 1014 gc time: 69
k: 8
cpu time: 5742 real time: 992 gc time: 45
k: 16
cpu time: 8111 real time: 2189 gc time: 279
'classic
cpu time: 4390 real time: 4388 gc time: 98

Here are similar numbers for Racket CS:

k: 0
cpu time: 2960 real time: 2960 gc time: 33
k: 1
cpu time: 3021 real time: 1594 gc time: 33
k: 2
cpu time: 3462 real time: 1154 gc time: 36
k: 4
cpu time: 4381 real time: 929 gc time: 36
k: 8
cpu time: 4406 real time: 889 gc time: 34
k: 16
cpu time: 7124 real time: 1655 gc time: 440
'classic
cpu time: 2749 real time: 2749 gc time: 51
> --
> You received this message because you are subscribed to the Google Groups "Racket Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to racket-users...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/racket-users/cc4e09cd-e968-4edf-8046-90e3cadb38fb%40googlegroups.com.

Dominik Pantůček

unread,
Oct 8, 2019, 3:44:18 AM10/8/19
to racket...@googlegroups.com
Hi,

On 08. 10. 19 6:58, Sam Tobin-Hochstadt wrote:
> I've submitted a pull request fixing those errors and supporting
> running in serial mode.

thank you for that, it's already fixed. I am actually pulling this from
our internal git repository and I imported older info.rkt into the
Github one.

For the contracts - now it is void? instead of suggested any/c. The
functions are modeled like vector-sort! - which returns #<void>.

Btw, is it necessary to bump the version on pkgd.racket-lang.org in
order to get it updated? When I `raco install' the package from its
directory, it works fine now. When I use the github repository using
plain `raco pkg install futures-sort' it still uses the old version.
Just removing and re-installing does not change anything.

>
> One thing I notice is that it's substantially faster than
> `vector-sort!` even when run in serial mode on traditional Racket (but
> not on Racket CS), so perhaps this should be integrated (or there are
> improvements that we can adopt).

We mostly discussed that with Jens Axel on #racket - main advantage at
the lowest level is the usage of fixnums for everything. This however
means that the vector's size to be sorted must be `fixnum?'. Which I
think is always the case on all supported architectures (addressable
memory and such).
Is there a package for these benchmarks? I created some benchmarks based
on sorting vectors of varying size from 1 to 2^27 elements. It would be
really nice to test this on a number of systems to get some empirical
data here. The GC can sometimes surprise (me, at least). My benchmarks
used (current-inexact-milliseconds) before and after each run which is
not something very exact. I'll post some graphs on our company blog on
Thursday probably.

Thank you Sam and Alex!


Cheers,
Dominik

Sam Tobin-Hochstadt

unread,
Oct 8, 2019, 9:49:10 AM10/8/19
to Dominik Pantůček, Racket Users
On Tue, Oct 8, 2019 at 3:44 AM Dominik Pantůček
<dominik....@trustica.cz> wrote:
> Btw, is it necessary to bump the version on pkgd.racket-lang.org in
> order to get it updated? When I `raco install' the package from its
> directory, it works fine now. When I use the github repository using
> plain `raco pkg install futures-sort' it still uses the old version.
> Just removing and re-installing does not change anything.

It re-runs once an hour. You can also ask it to refresh your packages.

>
> >
> > One thing I notice is that it's substantially faster than
> > `vector-sort!` even when run in serial mode on traditional Racket (but
> > not on Racket CS), so perhaps this should be integrated (or there are
> > improvements that we can adopt).
>
> We mostly discussed that with Jens Axel on #racket - main advantage at
> the lowest level is the usage of fixnums for everything. This however
> means that the vector's size to be sorted must be `fixnum?'. Which I
> think is always the case on all supported architectures (addressable
> memory and such).

Yes, `vector-length` always produces a fixnum.

> Is there a package for these benchmarks? I created some benchmarks based
> on sorting vectors of varying size from 1 to 2^27 elements. It would be
> really nice to test this on a number of systems to get some empirical
> data here. The GC can sometimes surprise (me, at least). My benchmarks
> used (current-inexact-milliseconds) before and after each run which is
> not something very exact. I'll post some graphs on our company blog on
> Thursday probably.

My code is here: https://gist.github.com/2c84e31f602b9ad95331a7e29b075294

Sam

Dominik Pantůček

unread,
Oct 10, 2019, 5:28:36 AM10/10/19
to racket...@googlegroups.com
Hello,

thanks everyone for helping me putting this into a reasonable shape.

My short-term plans with futures-sort are currently to check how various
construct could be improved (or'ing of touched futures is really just a
quick hack for example). Also I'd like to investigate why it didn't work
for me when I used the futures example in documentation[1] - that is to
perform one half of the computation in current context and then touch
future with the other half.

Feature-wise I'd like to implement a version that returns a fresh
vector. However I want to avoid any unnecessary copying so I want ot use
similar approach as with swapping/keeping for cnt=1 or 2. With this
ready it should be a faster drop-in replacement for all (fx)vector
sorting functions. I am not sure if similar approach is viable for lists.

And yes, it needs real tests. However I am not sure what kind of
reproducible tests to add there. Especially that the parallelism may not
be available on given platform. Maybe creating reasonably small vector
of random numbers with static seed and force two "parallel" futures no
matter what is available?

Also today I published on our company blog[2] (no need to follow this
link) some benchmarking results obtained on ThinkPad x280[3] (4 cores, 8
HT) and on a VM running on Intel Xeon with 16 cores / 32 HT with 16
VCPUs assigned to the VM[4]. I measured 10 runs for each N where N was
n*2^18 for n in 1 to 255 which covers the range from 0 to 2^24 or
roughly 16M.

All times are measured using time-apply and the resulting data set was
processed with Gnuplot and empirical data were fitted using
f(x)=a*n*log2(n).

A few interesting remarks - once we reach the HTs (and not just cores),
the memory starts to be a real bottleneck. Hence the small difference
between depth=2 and depth=3 on i7.

Also for depth=3 and depth=4 sometimes Racket just hung and I had to
restart the tests. I will try to create a minimal working example where
futures can block indefinitely. Does anyone have similar experience for
example with pmapf?


Cheers,
Dominik


[1]
https://docs.racket-lang.org/guide/parallelism.html#%28part._effective-futures%29
[2]
https://trustica.cz/en/2019/10/10/parallel-merge-sort-leveraging-futures-in-racket/
[3]
https://trustica.cz/wp-content/uploads/2019/10/futures-sort-telperion.png
[4] https://trustica.cz/wp-content/uploads/2019/10/futures-sort-builder.png
Reply all
Reply to author
Forward
0 new messages