Meeting Summary


Kevin Scaldeferri

Sep 9, 2008, 2:14:12 PM
to Portland Functional Programming Study Group
I don't know if anyone was actually taking notes... here's my rough
memory of the discussion. Others feel free to augment with stuff I
might have forgotten.

Overall topic was the Computer Language Benchmarks Game: http://shootout.alioth.debian.org/
. Until recently they only had kinda old, single-core hardware, which
is maybe not that interesting or relevant for most of us these days.
A couple months ago, they got a 4-core machine to run benchmarks on.
But with only the old entries running, this still isn't that
interesting, as none of the code uses the other 3 cores. In the last
week or two, parallelized Haskell and Erlang entries started to be
written and submitted.

(Aside: there's actually 3 variations being run on the new machine:
64bit, 32bit, and 32bit "set affinity mask 8", which apparently forces
the code to run on a single core. There are some interesting
differences between the results on these 3. Erlang seems to do better
on 32bit than 64bit, and takes a big hit with the affinity set,
although I think that setting "-S 1" to only run a single scheduler
thread would fix that.)

(Aside 2: Per the FAQ, excessive cleverness is generally disallowed in
this contest. You must implement the same algorithm. In some cases,
e.g. pidigits, this appears to completely eliminate the possibility of
parallelization. In general, it will be interesting to see what sort
of strategies are allowed or disallowed by the judges.)

We started out looking at the binarytrees benchmark in Erlang.
Without going into too much explanation of what this benchmark
actually _does_, we notice that a good portion of the work is a loop
that deals with trees for depths from 4 to Max. So, the obvious
strategy is to parallelize the loop. One minor obstacle is that the
original code intersperses IO with pure computation; a naive
parallelization will not preserve the same ordering of the output.
So, first refactor to a map of the computation over the sequence
[4..Max] and a loop over the results to print the output. Then we can
change the map to pmap and *presto* it's running in parallel.
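
(For reference, since pmap isn't in the Erlang standard library,
here's a minimal order-preserving version along the lines of what the
entry uses -- a sketch, not the exact submitted code:

```erlang
%% Spawn one process per element, tag each result with a unique ref,
%% and collect results in ref order, so the output list ordering
%% matches the input list regardless of completion order.
pmap(F, Xs) ->
    Parent = self(),
    Refs = [begin
                Ref = make_ref(),
                spawn(fun() -> Parent ! {Ref, F(X)} end),
                Ref
            end || X <- Xs],
    [receive {Ref, Result} -> Result end || Ref <- Refs].
```

Because results are received by ref rather than in arrival order, the
IO loop over the results can stay exactly as written.)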

On my 2-core MBP, I get 191% CPU utilization and run time drops from
1:54 to 1:08. Pretty good. Igal ran the benchmark on an 8-core EC2
instance and it was a little less impressive: 254% CPU and runtime
dropping from 1:46 to 0:44. EC2 doesn't seem to be doing a very good
job of distributing the scheduler threads between virtual cores; some
cores were completely idle the whole time while some had 2 threads
running on them all the time. Speculation ensued about whether this
was related to virtualization or not; Bart offered up use of some non-
virtualized multi-core research machines at PSU for future
investigations. (The official contest hardware also only gets about
250% CPU, but also sees the total CPU time nearly double, leading to
fairly modest gains. Need to understand why this is happening.)

Presumably Amdahl's law is also a factor here. There's some stuff
before and after the loop that we aren't attempting to parallelize,
which will limit our possible speedup. At the moment that seems
secondary to the scheduling issues, though. Could potentially address
this by partitioning the individual tree computations out to multiple
processes.


There's also a Haskell parallel binarytrees entry with the same
strategy. A standard map is replaced with
Control.Parallel.Strategies.parMap.


We moved on to look at the mandelbrot benchmark, which creates an NxN
bitmap of the mandelbrot set. The original version contains a nested
loop for the x and y coordinates. We converted the loop over the y
coordinate into a parallel map. Again, the 2-core machine did pretty
well, with 177% CPU usage, and dropping from 5:13 to 3:02. With more
cores, things are not as impressive. We got only about 25% speedup on
the 8-core EC2 instance. Again, we saw bad distribution of threads to
cores, but also there is a substantial sequential portion of the
computation as the results for each row are sent back to the parent
process and written to the output one-by-one. There's also a huge
increase in memory usage to buffer up the entire result. I also
pointed out that the code uses about 128 times more memory than you
might like, because each bit requires a 64bit integer and a 64bit list
pointer.

I speculated that using the built-in Erlang binary data type might
help, both in reducing memory use and also because they don't require
a copy to send from one process to another. (There is a shared memory
arena for binaries.) Stay tuned for updates on this.
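
(As a sketch of the idea, untested against the actual entry: pack each
row into a bitstring instead of a list of integers.

```erlang
%% A row stored as a list of 0/1 integers costs roughly two 64-bit
%% words per pixel (the integer plus the cons-cell pointer).  Packed
%% into a bitstring it costs one bit per pixel.  Binaries larger than
%% 64 bytes also live in a shared reference-counted heap, so sending
%% one to another process copies a small reference, not the data.
pack_row(Bits) ->
    << <<B:1>> || B <- Bits >>.
```

The bitstring is also already in the format the benchmark has to
write out, which should shrink the sequential output phase too.)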


We wrapped up with a quick run-down of the potential strategies for
parallelizing the other benchmarks:

chameneos: already uses multiple processes in the obvious way, but
does really badly on the multi-core machine. (It's pretty competitive
on the single-core machines.) I suspect there's an issue with
process-scheduler affinity (or rather, the lack of it). Might
actually do better with '-S 1' at this point, which is a little
disappointing.

fannkuch: unclear what the judges will allow, but using 1 process to
generate the permutations, then handing them off to multiple worker
processes to do the flipping, might work well

fasta: I have no good ideas here

k-nucleotide: has 2 obvious loops that can be parallelized.

n-body: standard approach would be to create a process for each body.
Not sure if inter-process communication costs will swamp the benefits
for such a small number of bodies, though.

pidigits: appears impossible to parallelize this particular algorithm.

regex-dna: has some obvious bits of computation to run in parallel.
Might be some subtleties in terms of when to do the list <-> binary
conversions.

reverse-complement: should probably be modified to use binaries
instead of lists before attempting to parallelize.

spectral-norm: matrix operations _ought_ to be easily parallelized,
but the rules here might make it tricky.


Then we retired to the pub where we discussed whether OCaml is
"bigger" than Haskell; whether F# will dominate the future of
functional programming or die a horrible death; why companies are
afraid to use functional languages; and why off-site backups are a
really, really good idea.

Kevin Scaldeferri

Sep 9, 2008, 4:29:17 PM
to pdx...@googlegroups.com

On Sep 9, 2008, at 11:14 AM, Kevin Scaldeferri wrote:

> I speculated that using the built-in Erlang binary data type might
> help, both in reducing memory use and also because they don't require
> a copy to send from one process to another. (There is a shared memory
> arena for binaries.) Stay tuned for updates on this.


I just benchmarked a version that uses binary objects to store the
lines, in addition to the parallel map. I'm now getting near-perfect
parallelization on my MBP. It uses 197% CPU, and shows a 1.93x speedup
over the non-parallel binary variant (and 2.14x speedup over the
original list based version). The one-process-per-row version is
still quite memory heavy (about 500MB), which may simply be the cost
of 6400 processes. The "shard" variant that spawns a number of
processes equal to the number of cores uses much less memory (20MB),
but is a couple percent slower on this machine. Curious to see what
happens on more cores.
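
(Roughly what the shard variant does -- a self-contained sketch with
assumed helper names, not the submitted code:

```erlang
%% Split the work into one chunk per scheduler thread, spawn one
%% process per chunk, map F sequentially within each chunk, then
%% reassemble the results in order.
pmap_shards(F, Xs) ->
    Parent = self(),
    N = erlang:system_info(schedulers),
    Chunks = chunk(Xs, ceil_div(length(Xs), N)),
    Refs = [begin
                Ref = make_ref(),
                spawn(fun() -> Parent ! {Ref, lists:map(F, Chunk)} end),
                Ref
            end || Chunk <- Chunks],
    lists:append([receive {Ref, Res} -> Res end || Ref <- Refs]).

%% Break a list into sublists of at most Len elements.
chunk([], _Len) -> [];
chunk(Xs, Len) ->
    {Chunk, Rest} = lists:split(min(Len, length(Xs)), Xs),
    [Chunk | chunk(Rest, Len)].

ceil_div(A, B) -> (A + B - 1) div B.
```

The memory saving follows from having only N worker processes alive
at once instead of one per row.)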

-kevin

Isaac Gouy

Sep 10, 2008, 12:23:01 PM
to pdxfunc


Kevin Scaldeferri wrote:

> (Aside: there's actually 3 variations being run on the new machine:
> 64bit, 32bit, and 32bit "set affinity mask 8", which apparently forces
> the code to run on a single core. There are some interesting
> differences between the results on these 3. Erlang seems to do better
> on 32bit than 64bit, and takes a big hit with the affinity set,
> although I think that setting "-S 1" to only run a single scheduler
> thread would fix that.)

Not too surprisingly, as of an hour ago, there are now 4 variations:

- 32bit "set affinity mask 8"
- 32bit
- 64bit "set affinity mask 8"
- 64bit


-snip-
> spectral-norm: matrix operations _ought_ to be easily parallelized,
> but the rules here might make it tricky.

1.0 C++ GNU g++ #3 5.75 100% 100% 99% 98%
http://shootout.alioth.debian.org/u32q/benchmark.php?test=spectralnorm&lang=all

1.4 C++ GNU g++ #3 22.72 0% 0% 0% 100%
http://shootout.alioth.debian.org/u32/benchmark.php?test=spectralnorm&lang=all

Kevin Scaldeferri

Sep 10, 2008, 5:06:51 PM
to pdx...@googlegroups.com

On Sep 10, 2008, at 9:23 AM, Isaac Gouy wrote:
>
> -snip-
>> spectral-norm: matrix operations _ought_ to be easily parallelized,
>> but the rules here might make it tricky.
>
> 1.0 C++ GNU g++ #3 5.75 100% 100% 99% 98%
> http://shootout.alioth.debian.org/u32q/benchmark.php?test=spectralnorm&lang=all
>
> 1.4 C++ GNU g++ #3 22.72 0% 0% 0% 100%
> http://shootout.alioth.debian.org/u32/benchmark.php?test=spectralnorm&lang=all
>

Thanks! That's basically the same approach we discussed at the
meeting, so it's good to have validation that it will be allowed.


-kevin

Isaac Gouy

Sep 13, 2008, 12:28:52 PM
to pdxfunc
fyi


"Which parallel problem would you add and why?"

http://alioth.debian.org/forum/forum.php?thread_id=14502&forum_id=2840


"Ask Proggit: Which parallel problem would you add now The Computer
Language Benchmarks Game is measured on quad-core hardware, and why?"

http://www.reddit.com/r/programming/comments/718qf/ask_proggit_which_parallel_problem_would_you_add/