Places code not using all the CPU

Paulo Matos

Oct 1, 2018, 5:14:11 AM
to Racket Users

Hi,

I am not sure whether this is an issue with places or something else. My
devops-fu is poor and I am not even sure how to debug something like
this, so maybe someone with more knowledge than me can chime in
and hint at a possible debugging method.

I was running some benchmarks and noticed something odd for the first
time (although it doesn't mean it was ok before, just that this is the
first time I am actually analysing this issue).

My program (the master) creates N places (the workers). Each
place starts by issuing a Rosette call, which triggers a call to
the Z3 SMT solver. So N instances of Z3 run and, once they are done,
each place runs pure Racket code that implements a graph search algorithm.
The N worker places are actually in a sync call waiting for messages
from the master, and the work is done by a thread on the worker
place. The master waits either for the timeout to expire or for a
solution to be sent by a worker.
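
For readers following along, the layout described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the original program: `start-worker` and `run-master` are hypothetical names, and the Rosette/Z3 call and the graph search are replaced by a trivial sum.

```racket
#lang racket

;; Hypothetical worker place: the work runs in a Racket thread while the
;; place's main thread blocks in `sync`, waiting for messages from the master.
(define (start-worker id)
  (place/context ch
    (thread
     (lambda ()
       ;; Stand-in for the real search: sum the integers below one million.
       (define answer (for/sum ([i (in-range 1000000)]) i))
       (place-channel-put ch (list 'solution id answer))))
    ;; Keep the place alive until the master says 'stop.
    (let loop ()
      (unless (eq? (sync ch) 'stop)
        (loop)))))

;; Hypothetical master: start n workers, then wait for the first solution
;; or for the timeout, whichever comes first.
(define (run-master n timeout-secs)
  (define workers (for/list ([i (in-range n)]) (start-worker i)))
  ;; A place descriptor is also a place channel, so it can be used with sync.
  ;; The result is the first message received, or #f on timeout.
  (define first-result (apply sync/timeout timeout-secs workers))
  (for ([w (in-list workers)])
    (place-channel-put w 'stop)
    (place-wait w))
  first-result)

(module+ main
  (run-master 4 60))
```

The place-creating call sits inside `module+ main` on purpose: the enclosing module is instantiated again in every new place, so creating places at module level would make each place try to create more.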

The interesting thing is that when the Z3 instances are running I get
all my 16 CPUs (on a dedicated machine) working at 100%. When the racket
code is running the search, they are all holding off at around 60%-80%
with a huge portion of it in the kernel (red bars in htop).

Since the Z3 calls come before the threads inside the places are started
and we get to the sync call, is it possible something bad is happening
in the sync call that uses the kernel so much? Take a look at htop
during Z3 and during the search - screenshots attached.

Are there any suggestions on what the problem might be or how I could
start to understand why the kernel is so active?

Kind regards,


--
Paulo Matos
2018-10-01-105711_1831x138_scrot.png
2018-10-01-105848_1837x139_scrot.png

Paulo Matos

Oct 1, 2018, 5:21:26 AM
to racket...@googlegroups.com
I attach yet another example where this behaviour is much more
noticeable. This is on a 64-core dedicated machine on Amazon AWS.
2018-09-28-131705_611x362_scrot.png

Paulo Matos

Oct 5, 2018, 5:43:49 AM
to racket...@googlegroups.com
All,

A quick update on this problem which is in my critical path.
I just noticed, in an attempt to reproduce it, that during the package
setup part of the racket compilation procedure the same happens.

I am running `make CPUS=24 in-place` on a 36-CPU machine and I see that
not only does the racket process status sometimes go from 'R' to 'D' (which
also happens in my case), but the CPUs are never really working at 100%, with
a lot of the work being done at kernel level.

Has anyone ever noticed this?

Matthew Flatt

Oct 5, 2018, 8:16:04 AM
to Paulo Matos, racket...@googlegroups.com
It's difficult to be sure from your description, but it sounds like the
problem may just be the usual one of scaling parallelism when
communication is involved.

Red is probably synchronization. It might be synchronization due to the
communication you have between places, it might be synchronization on
Racket's internal data structures, or it might be that the OS has to
synchronize actions from multiple places within the same process (e.g.,
multiple places are allocating and calling OS functions like mmap and
mprotect, which the OS has to synchronize within a process). We've
tried to minimize sharing among places, and it's important that they
can GC independently, but there are still various forms of sharing to
manage internally. In contrast, running separate processes for Z3
should scale well, especially if the Z3 task is compute-intensive with
minimal I/O --- a best-case scenario for the OS.

A parallel `raco setup` runs into similar issues. In recent development
builds, you might experiment with passing `--processes` to `raco setup`
to have it use separate processes instead of places within a single OS
process, but I think you'll still find that it tops out well below your
machine's compute capacity. Partly, dependencies constrain parallelism.
Partly, the processes have to communicate more and there's a lot of
I/O.

Paulo Matos

Oct 5, 2018, 9:36:17 AM
to Matthew Flatt, racket...@googlegroups.com


On 05/10/2018 14:15, Matthew Flatt wrote:
> It's difficult to be sure from your description, but it sounds like the
> problem may just be the usual one of scaling parallelism when
> communication is involved.
>

Matthew, thanks for the reply.

The interesting thing here is that there is no communication between
places _most of the time_. It works as a star topology where every
worker only communicates with the master and the master with all workers.

This communication is relatively rare, as in a message sent every few
minutes.

> Red is probably synchronization. It might be synchronization due to the
> communication you have between places, it might be synchronization on
> Racket's internal data structures, or it might be that the OS has to
> synchronize actions from multiple places within the same process (e.g.,
> multiple places are allocating and calling OS functions like mmap and
> mprotect, which the OS has to synchronize within a process). We've
> tried to minimize sharing among places, and it's important that they
> can GC independently, but there are still various forms of sharing to
> manage internally. In contrast, running separate processes for Z3
> should scale well, especially if the Z3 task is compute-intensive with
> minimal I/O --- a best-case scenario for the OS.
>

So, here you have pointed out something that's surprising to me:
"OS has to synchronize actions from multiple places within the same
process (e.g., multiple places are allocating and calling OS functions
like mmap and mprotect, which the OS has to synchronize within a process)."

I thought each place was its own process similar to issuing a call of
racket itself on the body of the place. Now it seems somehow places are
all in the same process... in which case they'll probably even share
mutexes, although these low level details are a bit foggy in my mind.

> A parallel `raco setup` runs into similar issues. In recent development
> builds, you might experiment with passing `--processes` to `raco setup`
> to have it use separate processes instead of places within a single OS
> process, but I think you'll still find that it tops out well below your
> machine's compute capacity. Partly, dependencies constrain parallelism.
> Partly, the processes have to communicate more and there's a lot of
> I/O.

Again, I am really surprised that you mention that places are not
separate processes. Documentation does say they are separate racket
virtual machines, how is this accomplished if not by using separate
processes?

My workers are really doing Z3-style work: number crunching and lots of
searching. No IO (writing to disk) or communication, so I would expect
them to really max out all CPUs.

--
Paulo Matos

Matthew Flatt

Oct 5, 2018, 10:32:48 AM
to Paulo Matos, racket...@googlegroups.com
At Fri, 5 Oct 2018 15:36:04 +0200, Paulo Matos wrote:
> Again, I am really surprised that you mention that places are not
> separate processes. Documentation does say they are separate racket
> virtual machines, how is this accomplished if not by using separate
> processes?

Each place is an OS thread within the Racket process. The virtual
machine is essentially instantiated once in each thread, where things
that look like global variables at the C level are actually
thread-local variables to make them place-specific. Still, there is
some sharing among the threads.
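
One way to see this from Racket itself is to compare what a new place observes with what the main place observes. The sketch below is illustrative only: `counter` and `probe-place` are hypothetical names, and it assumes `getpid` from `racket/os`. On a build with places enabled, the new place should report a fresh copy of the module-level `counter` (the module is instantiated again in the place) together with the same OS process id as the main place.

```racket
#lang racket
(require racket/os) ; assumed source of `getpid`

;; Module-level state: every place instantiates this module separately,
;; so every place gets its own box.
(define counter (box 0))

;; Start a place and ask it for its view of `counter` and its process id.
(define (probe-place)
  (define p
    (place ch
      (place-channel-put ch (list (unbox counter) (getpid)))))
  (define reply (place-channel-get p))
  (place-wait p)
  reply)

(module+ main
  ;; Mutate the main place's copy only.
  (set-box! counter 42)
  (printf "main place: counter=~a pid=~a\n" (unbox counter) (getpid))
  (match-define (list c pid) (probe-place))
  (printf "new place:  counter=~a pid=~a\n" c pid))
```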

> My workers are really doing Z3-style work: number crunching and lots of
> searching. No IO (writing to disk) or communication, so I would expect
> them to really max out all CPUs.

My best guess is that it's memory-allocation bottlenecks, probably at
the point of using mmap() and mprotect(). Maybe things don't scale well
beyond the 4-core machines that I use.

On my machines, the enclosed program can max out CPU use with system
time being a small fraction. It scales ok from 1 to 4 places (i.e.,
real time increased only some). The machine's cores are hyperthreaded,
and the example maxes out CPU utilization at 8 --- but it takes twice
as long in real time, so the hardware threads don't help much in this
case. Running two processes with 4 places takes about the same real
time as running one process with 8 places, as does 2 processes with 2
places.

Do you see similar effects, or does this little example stop scaling
before the number of processes matches the number of cores?
p.rkt

Sam Tobin-Hochstadt

Oct 5, 2018, 10:52:10 AM
to Matthew Flatt, Paulo Matos, racket...@googlegroups.com
I tried this same program on my desktop, which also has 4 (i7-4770)
cores with hyperthreading. Here's what I see:

[samth@huor:~/work/grant_parallel_compilers/nsf_submissions (master) plt] time r ~/Downloads/p.rkt 1
N: 1, cpu: 5808/5808.0, real: 5804
[samth@huor:~/work/grant_parallel_compilers/nsf_submissions (master) plt] time r ~/Downloads/p.rkt 2
N: 2, cpu: 12057/6028.5, real: 6063
[samth@huor:~/work/grant_parallel_compilers/nsf_submissions (master) plt] time r ~/Downloads/p.rkt 3
N: 3, cpu: 23377/7792.333333333333, real: 7914
[samth@huor:~/work/grant_parallel_compilers/nsf_submissions (master) plt] time r ~/Downloads/p.rkt 4
N: 4, cpu: 41155/10288.75, real: 10357
[samth@huor:~/work/grant_parallel_compilers/nsf_submissions (master) plt] time r ~/Downloads/p.rkt 6
N: 6, cpu: 89932/14988.666666666666, real: 15687
[samth@huor:~/work/grant_parallel_compilers/nsf_submissions (master) plt] time r ~/Downloads/p.rkt 8
N: 8, cpu: 165152/20644.0, real: 21104

Real time goes up about 80% from 1-4 places, and then doubles again
from 4 to 8. System time for 8 places is also about 10x what it is for
2 places, but only gets up to 2 seconds.
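
The ratios behind that summary can be recomputed from the `real` column. The sketch below assumes that each place in p.rkt does the same fixed amount of work, so that ideal scaling would keep real time flat as places are added; `runs`, `real-ms`, and `slowdown` are names made up for this illustration. From the numbers above, 4 places take roughly 1.78 times as long as 1 place, and 8 places roughly 2.04 times as long as 4.

```racket
#lang racket

;; (places . real-ms) pairs copied from the runs above.
(define runs
  '((1 . 5804) (2 . 6063) (3 . 7914) (4 . 10357) (6 . 15687) (8 . 21104)))

;; Real time in milliseconds for a given number of places.
(define (real-ms n) (cdr (assv n runs)))

;; Ratio of real times between two place counts.
(define (slowdown from to)
  (exact->inexact (/ (real-ms to) (real-ms from))))

(module+ main
  (for ([run (in-list runs)])
    (define n (car run))
    (printf "~a places: ~ax the 1-place real time\n"
            n (~r (slowdown 1 n) #:precision 2))))
```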

Paulo Matos

Oct 5, 2018, 11:56:02 AM
to Sam Tobin-Hochstadt, Matthew Flatt, racket...@googlegroups.com
I was trying to create a much more elaborate example when Matthew sent
his tiny one which is enough to show the problem.

I started a 64-core machine on AWS to show the issue.

I see a massive degradation as the number of places increases.

I use this slightly modified code:
#lang racket

(define (go n)
  (place/context p
    (let ([v (vector 0.0)])
      (let loop ([i 3000000000])
        (unless (zero? i)
          (vector-set! v 0 (+ (vector-ref v 0) 1.0))
          (loop (sub1 i)))))
    (printf "Place ~a done~n" n)
    n))

(module+ main
  (define cores
    (command-line
     #:args (cores)
     (string->number cores)))

  (time
   (map place-wait
        (for/list ([i (in-range cores)])
          (printf "Starting core ~a~n" i)
          (go i)))))

Here are the results in the video (might take a few minutes until it is live):
https://youtu.be/cDe_KF6nmJM

The guide says about places:
"The place form creates a place, which is effectively a new Racket
instance that can run in parallel to other places, including the initial
place."

I think this is misleading at the moment. If this behaviour can be
'fixed' then great, if not I will have to redesign my system to use
'subprocess' to start another racket process and a footnote should be
added to places in documentation to alert the users about this behaviour.

Matthew, Sam, do you understand why this is happening?
--
Paulo Matos

Matthew Flatt

Oct 5, 2018, 1:23:58 PM
to Paulo Matos, Sam Tobin-Hochstadt, racket...@googlegroups.com
At Fri, 5 Oct 2018 17:55:47 +0200, Paulo Matos wrote:
> Matthew, Sam, do you understand why this is happening?

I still think it's probably allocation, and probably specifically
contention on the process's page table. Do you see different behavior with
a non-allocating variant (via `--no-alloc` below)?

We should certainly update the documentation with information about the
limits of parallelism via places.

----------------------------------------

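#lang racket

;; Each place counts down from 3 billion, updating one vector slot per step.
;; With alloc? true the slot holds a flonum, so every addition boxes a fresh
;; flonum; with alloc? false it holds a fixnum and the loop allocates nothing.
(define (go n alloc?)
  (place/context p
    (let ([v (vector (if alloc? 0.0 0))]
          [inc (if alloc? 1.0 1)])
      (let loop ([i 3000000000])
        (unless (zero? i)
          (vector-set! v 0 (+ (vector-ref v 0) inc))
          (loop (sub1 i)))))
    (printf "Place ~a done~n" n)
    n))

(module+ main
  (define alloc? #t)
  (define cores
    (command-line
     #:once-each
     [("--no-alloc") "Non-allocating variant" (set! alloc? #f)]
     #:args (cores)
     (string->number cores)))

  ;; Start one place per requested core and time how long they all take.
  (time
   (map place-wait
        (for/list ([i (in-range cores)])
          (printf "Starting core ~a~n" i)
          (go i alloc?)))))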
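
If a worker genuinely needs floating-point arithmetic, one possible way to keep such a loop from allocating is an `flvector` from `racket/flonum`, which stores unboxed doubles. The sketch below is an assumption about what might help, not something measured in this thread; `count-up` is a hypothetical name and the loop mirrors the one above with a smaller iteration count.

```racket
#lang racket
(require racket/flonum)

;; Same counting loop as above, but the accumulator lives in an flvector,
;; so the stored value is an unboxed double rather than a boxed flonum.
(define (count-up iterations)
  (define v (flvector 0.0))
  (let loop ([i iterations])
    (unless (zero? i)
      (flvector-set! v 0 (fl+ (flvector-ref v 0) 1.0))
      (loop (sub1 i))))
  (flvector-ref v 0))

(module+ main
  (count-up 1000000))
```

Whether this removes the kernel-side contention in a real workload would still have to be checked with the place-based benchmark.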

Neil Van Dyke

Oct 5, 2018, 3:12:08 PM
to racket...@googlegroups.com

> if not I will have to redesign my system to use 'subprocess'

Expanding on this, for students on the list... Having many worker host
processes is not necessarily a bad thing.  It can be more programmer
work, but it simplifies the parallelism in a way (e.g., "let the Linux
kernel worry about it" :), and it potentially gives you better isolation
and resilience for some kinds of defects (in native code used via FFI,
in Racket code, and even in the suspiciously sturdy Racket VM/backend).

If appropriate for your application, you can also consider a worker
pool, with a health metric, sometimes reusing workers to avoid process
startup times, and sometimes retiring, and perhaps sometimes benching
workers for an induced big GC if that makes sense compared to
retiring&starting/unpooling, and maybe sometimes quarantining workers
for debugging/dumps while keeping the system running.  You can also
spread your workers across multiple hosts, not just CPUs/cores.

You can even use the worker pool to introduce new changes to a running
system (being very rapid, or as an additional mechanism beyond normal
testing for production), and do A/B performance/correctness of changes,
and change rollback.

If your data to be communicated to/from a worker is relatively small and
won't be a bottleneck, you can simply push it through the stdin and
stdout of each process; otherwise, you can get judicious/clever with the
many available host OS mechanisms.
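
As a rough sketch of the stdin/stdout approach, the code below starts a second `racket` process, writes one datum to its stdin, and reads one datum back from its stdout. Everything here is illustrative: `double-in-subprocess` is a hypothetical name, the worker expression is a stand-in for real work, and `find-exe` from `compiler/find-exe` is assumed to locate the running `racket` executable.

```racket
#lang racket
(require compiler/find-exe) ; assumed source of `find-exe`

;; Send n to a worker process over its stdin and read the reply from its stdout.
(define (double-in-subprocess n)
  (define racket-path
    (or (find-exe) (error "cannot find the racket executable")))
  ;; Passing #f three times asks `subprocess` to create pipes for the
  ;; worker's stdout, stdin, and stderr.
  (define-values (proc from-worker to-worker worker-err)
    (subprocess #f #f #f racket-path "-e" "(write (* 2 (read)))"))
  (write n to-worker)
  ;; Closing the pipe signals end-of-input, so the worker's `read` can finish.
  (close-output-port to-worker)
  (define reply (read from-worker))
  (close-input-port from-worker)
  (close-input-port worker-err)
  (subprocess-wait proc)
  reply)

(module+ main
  (double-in-subprocess 21))
```

Any value that survives `write` and `read` can be exchanged this way; larger or more frequent transfers may call for one of the other OS mechanisms mentioned above.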

(Students: Being able to get our hands dirty and engineer systems beyond
a framework, when necessary, is one of the reasons we get CS/SE/EE/CE
degrees and broad&deep experience, rather than only collect a binder
full of Certified Currently-Popular JS Framework Technician certs. 
Those oppressive student loans, and/or years of self-guided open source
experience, might not be in vain. :)

George Neuner

Oct 6, 2018, 2:03:29 AM
to Matthew Flatt, Paulo Matos, racket...@googlegroups.com
As Matthew said, this may be a case where multiple processes are better.

One thing that likely is vastly different between your two systems is the memory architecture.  On Paulo's many-core machine, each group of [probably] 6 CPUs will have its own physical bank of memory which is close to it and which it uses preferentially.  Access to a different bank may be very costly.  Paulo's machine may be spending a much greater percentage of time moving data between VM instances that are located in different memory regions ... something Matthew can't see on his quad-core. 

Paulo, you might take a look at how memory is being allocated [not sure what tools you have for this] and see what happens if you restrict the process to running on various groups of CPUs.  It may be that some banks of your memory are "closer" than others.

Hope this helps,
George

James Platt

Oct 8, 2018, 3:39:14 PM
to Racket Users
I wonder if this has anything to do with mitigation for Spectre, Meltdown or the other speculative execution vulnerabilities that have been identified recently. I understand that some or all of the patches affect the performance of multi-CPU processing in general.

James

Philip McGrath

Oct 8, 2018, 4:12:43 PM
to James, Racket Users
This is much closer to the metal than where I usually spend my time, but, if it turns out that multiple OS processes are better than OS threads in this case, Distributed Places might provide an easier path to move to multiple processes than using `subprocess` directly: http://docs.racket-lang.org/distributed-places/index.html

On Mon, Oct 8, 2018 at 7:39 PM James Platt <jbiom...@gmail.com> wrote:
> I wonder if this has anything to do with mitigation for Spectre, Meltdown or the other speculative execution vulnerabilities that have been identified recently. I understand that some or all of the patches affect the performance of multi-CPU processing in general.
>
> James

George Neuner

Oct 9, 2018, 2:03:26 AM
to racket...@googlegroups.com
It's possible but unlikely. The mitigations do slow down processing,
but they don't much affect CPU usage as measured by the OS. The OP
claimed to be seeing large variations in CPU usage as more cores were
involved.

George

Paulo Matos

Oct 9, 2018, 2:37:57 AM
to Matthew Flatt, Sam Tobin-Hochstadt, racket...@googlegroups.com
Hi all,

Apologies for the delay in sending this email but I have been trying to
implement and test an alternative and wanted to be sure it works before
sending this off.

So, as Matthew suggested, this problem has to do with memory allocation.
The --no-alloc option in Matthew's suggested snippet does not show the
delay I usually see in the thread CPU usage although thread creation is
still quite slow past around 20 places.

I started developing loci [1] to solve this problem instance yesterday
and I got it to a point where I can prove that subprocesses solve the
problem I am seeing. No point attaching a screenshot of htop with all
bars full to 100%... that's what happens. Also, process creation is
almost instantaneous and there's no delay compared to threads.

In the evening after I had almost everything sorted, Sam suggested on
Slack that I try distributed-places and use them locally. I haven't
tried this and I cannot say if it works better or worse but it seems
certainly harder to use than loci as my library uses the same API as places.

Part of the development was pretty quick because I noticed Matthew had
been playing with this before:
https://github.com/racket/racket/blob/master/pkgs/racket-benchmarks/tests/racket/benchmarks/places/place-processes.rkt
(might be worth noting that the code doesn't work with current racket)

I will be adding contracts, tests and documentation throughout the week and
then replace places in my system with loci so I can dog-food the
library. Next step is to add remote loci at which point I will want to
compare with distributed-places and possibly improve on it.

If anyone has comments, suggestions or complaints on the library please
let me know but keep in mind it's barely a day old.

Paulo Matos


1: https://github.com/LinkiTools/racket-loci
https://pkgd.racket-lang.org/pkgn/search?q=loci
--
Paulo Matos

Paulo Matos

Oct 9, 2018, 2:38:34 AM
to James Platt, Racket Users
I just confirmed that this is due to memory allocation locking in the
kernel. If your places do no allocation then all is fine.

Paulo Matos

On 08/10/2018 21:39, James Platt wrote:
> I wonder if this has anything to do with mitigation for Spectre, Meltdown or the other speculative execution vulnerabilities that have been identified recently. I understand that some or all of the patches affect the performance of multi-CPU processing in general.
>
> James
>

--
Paulo Matos

Paulo Matos

Oct 9, 2018, 2:40:03 AM
to racket...@googlegroups.com


On 08/10/2018 22:12, Philip McGrath wrote:
> This is much closer to the metal than where I usually spend my time,
> but, if it turns out that multiple OS processes are better than OS
> threads in this case, Distributed Places might provide an easier path to
> move to multiple processes than using `subprocess` directly:
> http://docs.racket-lang.org/distributed-places/index.html
>

Sam mentioned trying that yesterday, but I had developed the loci library
before I got to try them. Looking at the API, I can only say that at the
moment my library is certainly easier to use on localhost. Once I
get to implementing remote loci I will look into distributed places
and try to improve on that.

--
Paulo Matos

Paulo Matos

Oct 9, 2018, 3:12:12 AM
to Matthew Flatt, Sam Tobin-Hochstadt, racket...@googlegroups.com


On 05/10/2018 19:23, Matthew Flatt wrote:
>
> We should certainly update the documentation with information about the
> limits of parallelism via places.
>

Added PR:
https://github.com/racket/racket/pull/2304


--
Paulo Matos