Round-robin scheduling and forkIO

76 views
Skip to first unread message

Andreas

unread,
Feb 2, 2012, 6:28:18 PM2/2/12
to parallel-haskell
I have two somewhat related questions...

When forkIO is executed, which HEC gets the new thread on its run
queue? Is it the same HEC as the HEC executing the thread which called
forkIO, or some other HEC?

From the "Runtime support for multicore haskell" paper, it sounds like
each HEC is scheduling the threads on its run queue in round-robin
order, but that the threads as a whole might not be scheduled in round
robin order. Is that right? Here is an example: Say HEC 1 has just 1
thread on its run queue, and HEC 2 has 99 threads on its run queue,
and suppose all the threads are always runnable. Then HEC 1 will
always run thread 1 while HEC 2 will round robin schedule the other 99
threads. There would be no pushing of threads from HEC 1 to HEC 2
since HEC 1 is never idle (since thread 1 is always runnable). As a
result the threads are very unfairly scheduled. Can this happen?

I don't know my way around the GHC RTS source code yet. If someone has
a pointer to where about in the source tree I can find the relevant
code for these issues, I'd appreciate it :)

Thanks,
Andreas

Simon Marlow

unread,
Feb 3, 2012, 3:51:37 AM2/3/12
to andreas...@gmail.com, parallel-haskell
On 02/02/2012 23:28, Andreas wrote:
> I have two somewhat related questions...
>
> When forkIO is executed, which HEC gets the new thread on its run
> queue? Is it the same HEC as the HEC executing the thread which called
> forkIO, or some other HEC?

Initially, the thread goes onto the run queue of the current HEC, i.e.
the HEC of the thread that called forkIO.

> From the "Runtime support for multicore haskell" paper, it sounds like
> each HEC is scheduling the threads on its run queue in round-robin
> order, but that the threads as a whole might not be scheduled in round
> robin order. Is that right?

Yes.

> Here is an example: Say HEC 1 has just 1
> thread on its run queue, and HEC 2 has 99 threads on its run queue,
> and suppose all the threads are always runnable. Then HEC 1 will
> always run thread 1 while HEC 2 will round robin schedule the other 99
> threads. There would be no pushing of threads from HEC 1 to HEC 2
> since HEC 1 is never idle (since thread 1 is always runnable). As a
> result the threads are very unfairly scheduled. Can this happen?

Yes. It wouldn't be hard to do something more sensible here, I can give
pointers if anyone is interested.

> I don't know my way around the GHC RTS source code yet. If someone has
> a pointer to where about in the source tree I can find the relevant
> code for these issues, I'd appreciate it :)

rts/Schedule.c is the place to start, and the function that does
load-balancing is called schedulePushWork().

Cheers,
Simon

Andreas

unread,
Feb 4, 2012, 1:29:02 PM2/4/12
to parallel-haskell
Thanks!

I wrote a program to illustrate the problem. The program forks 100
worker threads, using 2 cores (i.e. using "-N2"). Each worker executes
a busy loop, and yields every 100k iterations of the loop. I let them
run for 10 seconds and then stop them. I keep track of how many
iterations each worker performs on each HEC. I've run it on OS X with
Intel Core 2 Duo and on Ubuntu with 12 core AMD Opteron. The results
differ in the details, but I can see the effect in both cases.
Essentially, the HEC of the main thread is initially busy forking
workers and running some workers and the other HEC gets pushed a
couple of the worker threads and gets to work. The vast majority of
the worker threads end up on the main HEC, presumably because the
other HEC was pushed a couple worker threads and always looks busy.
This condition persists until the end of the run when the threads on
the non-main HEC are killed and it looks idle and finally gets half
the remaining worker threads. Looking at some runs in threadscope
confirm this behavior. For example, here is the output from a run on
ubuntu in which only the second worker is pushed to HEC 1 initially,
and all the other workers are on HEC 0 (the HEC of the main thread):

worker thread, iterations, HEC 0 iterations, HEC 1 iterations, Start
HEC, #migrations
0 3800000 3800000 0 0 0
1 342654968 0 342654968 1 0
2 3604192 3604192 0 0 0
3 3567629 3567629 0 0 0
4 3597662 3597662 0 0 0
5 3663841 3631111 32730 0 1
6 3649443 3649443 0 0 0
7 3700000 3663948 36052 0 1
8 3685822 3685822 0 0 0
9 3800000 3757705 42295 0 1
10 3500238 3500238 0 0 0
11 3635728 3592768 42960 0 1
12 3690563 3690563 0 0 0
13 3765518 3660221 105297 0 1
14 3296291 3296291 0 0 0
15 3732768 3629047 103721 0 1
16 3600000 3600000 0 0 0
17 3678602 3562204 116398 0 1
18 3615676 3615676 0 0 0
19 3632768 3500000 132768 0 1
20 3783449 3783449 0 0 0
21 3732732 3597121 135611 0 1
22 3660493 3660493 0 0 0
23 3472668 3329877 142791 0 1
24 3832768 3695491 137277 0 1
25 3858206 3858206 0 0 0
26 3790262 3591394 198868 0 1
27 3597755 3597755 0 0 0
28 2700000 2500257 199743 0 1
29 3659636 3659636 0 0 0
30 3649542 3445468 204074 0 1
31 3800000 3800000 0 0 0
32 3898268 3633025 265243 0 1
33 3715282 3715282 0 0 0
34 3576062 3323007 253055 0 1
35 3986268 3986268 0 0 0
36 3833166 3586498 246668 0 1
37 3800000 3800000 0 0 0
38 3900000 3600000 300000 0 1
39 3990157 3990157 0 0 0
40 3834466 3583695 250771 0 1
41 2500238 2500238 0 0 0
42 3731317 3384904 346413 0 1
43 3947112 3947112 0 0 0
44 3800000 3480977 319023 0 1
45 3837537 3837537 0 0 0
46 3727112 3348518 378594 0 1
47 3900000 3900000 0 0 0
48 3867654 3496865 370789 0 1
49 3650735 3650735 0 0 0
50 3900000 3490435 409565 0 1
51 4083587 4083587 0 0 0
52 3900000 3491568 408432 0 1
53 3569919 3569919 0 0 0
54 3966496 3633025 333471 0 1
55 3985703 3985703 0 0 0
56 4098286 3655774 442512 0 1
57 3841104 3841104 0 0 0
58 3835063 3433025 402038 0 1
59 4029830 4029830 0 0 0
60 3832732 3375927 456805 0 1
61 3827932 3827932 0 0 0
62 3665518 3227455 438063 0 1
63 3754800 3754800 0 0 0
64 3996944 3500000 496944 0 1
65 3594195 3594195 0 0 0
66 3863905 3333025 530880 0 1
67 4028058 4028058 0 0 0
68 4032750 3529016 503734 0 1
69 3988622 3988622 0 0 0
70 4097156 3500000 597156 0 1
71 3987896 3987896 0 0 0
72 4269986 3671635 598351 0 1
73 4130200 4130200 0 0 0
74 4088747 3594405 494342 0 1
75 3936875 3936875 0 0 0
76 4134046 3565793 568253 0 1
77 3991758 3991758 0 0 0
78 4031446 3465793 565653 0 1
79 4017390 4017390 0 0 0
80 4266238 3675499 590739 0 1
81 4169156 4169156 0 0 0
82 4076276 3465793 610483 0 1
83 3893278 3893278 0 0 0
84 2964185 2357183 607002 0 1
85 3955880 3955880 0 0 0
86 4074210 3465793 608417 0 1
87 4100238 4100238 0 0 0
88 4300000 3600257 699743 0 1
89 4260040 4260040 0 0 0
90 4231236 3505662 725574 0 1
91 4088619 4088619 0 0 0
92 4028173 3376528 651645 0 1
93 3997459 3997459 0 0 0
94 4161812 3433025 728787 0 1
95 3993143 3993143 0 0 0
96 4059937 3355283 704654 0 1
97 4029024 4029024 0 0 0
98 4397647 3662246 735401 0 1
99 4018040 4018040 0 0 0

My test program is this:

import GHC.Conc
import Data.Array.MArray
import Data.Array.Unboxed
import Data.Array.IO
import Control.Monad
import Text.Printf
import Data.IORef
import Data.List (zip4)

main :: IO ()
main =
do -- init counters
capCountArrays <- sequence (replicate numThreads (newArray (0,
numCores - 1) 0)) :: IO [IOUArray Int Int]
startCapRefs <- sequence (replicate numThreads (newIORef
(-1)))
capSwitchRefs <- sequence (replicate numThreads (newIORef 0))
mainTId <- myThreadId
(mainCap,_) <- threadCapability mainTId
printf "main HEC: %2d\n" mainCap

-- fork workers
tids <- sequence [ forkIO (myThreadId >>= busyLoop capCounts n
startCap capSwitch)
| (n,startCap,capSwitch,capCounts) <- zip4 [0..]
startCapRefs capSwitchRefs capCountArrays ]

-- let workers run for a while
threadDelay testTime

-- stop workers
sequence_ [killThread tid | tid <- tids]

-- report stats
uarrs <- mapM freeze capCountArrays :: IO [UArray Int Int]

-- stats on each worker thread
printf "worker thread, iterations, HEC 0 iterations, HEC 1
iterations, Start HEC, #migrations\n"
forM_ (zip4 [(0::Int)..] startCapRefs capSwitchRefs uarrs) $ \
(n,startCapRef, capSwitchRef, arr) ->
do startCap <- readIORef startCapRef
switches <- readIORef capSwitchRef
printf "%2d %12d %12d %12d %2d %4d\n" n (arr!0 + arr!1) (arr!
0) (arr!1) startCap switches

busyLoop :: IOUArray Int Int -> Int -> IORef Int -> IORef Int ->
ThreadId -> IO ()
busyLoop capcounts myTid startCapRef switchCountRef tid =
do (cap,_) <- threadCapability tid
writeIORef startCapRef cap
go 0 cap
where go n cap | n == yieldLimit = yield >> go 0 cap
| otherwise = do (cap',_) <- threadCapability tid
x <- readArray capcounts cap'
writeArray capcounts cap' (x + 1)
when (cap /= cap') (do x <-
readIORef switchCountRef
let y = x + 1
y `seq`
writeIORef switchCountRef y)
go (n+1) cap'

yieldLimit :: Int
yieldLimit = 100000

numCores :: Int
numCores = 2

numThreads :: Int
numThreads = 100

testTime :: Int
testTime = 10 * 1000000

Regards,
Andreas
Reply all
Reply to author
Forward
0 new messages