Thanks!
I wrote a program to illustrate the problem. The program forks 100
worker threads on 2 cores (i.e. with "-N2"). Each worker runs a busy
loop and yields every 100k iterations. I let the workers run for 10
seconds and then stop them, keeping track of how many iterations each
worker performs on each HEC. I've run it on OS X with an Intel Core 2
Duo and on Ubuntu with a 12-core AMD Opteron. The results differ in
the details, but I can see the effect in both cases.
Essentially, the HEC of the main thread is initially busy forking
workers and running some of them, while the other HEC is pushed a
couple of the worker threads and gets to work. The vast majority of
the worker threads end up on the main HEC, presumably because the
other HEC, having been pushed a couple of worker threads, always
looks busy. This condition persists until the end of the run, when
the threads on the non-main HEC are killed, that HEC looks idle, and
it finally receives half of the remaining worker threads. Looking at
some runs in ThreadScope confirms this behavior. For example, here is
the output from a run on Ubuntu in which only the second worker is
pushed to HEC 1 initially, and all the other workers stay on HEC 0
(the HEC of the main thread):
worker thread, iterations, HEC 0 iterations, HEC 1 iterations, Start HEC, #migrations
0 3800000 3800000 0 0 0
1 342654968 0 342654968 1 0
2 3604192 3604192 0 0 0
3 3567629 3567629 0 0 0
4 3597662 3597662 0 0 0
5 3663841 3631111 32730 0 1
6 3649443 3649443 0 0 0
7 3700000 3663948 36052 0 1
8 3685822 3685822 0 0 0
9 3800000 3757705 42295 0 1
10 3500238 3500238 0 0 0
11 3635728 3592768 42960 0 1
12 3690563 3690563 0 0 0
13 3765518 3660221 105297 0 1
14 3296291 3296291 0 0 0
15 3732768 3629047 103721 0 1
16 3600000 3600000 0 0 0
17 3678602 3562204 116398 0 1
18 3615676 3615676 0 0 0
19 3632768 3500000 132768 0 1
20 3783449 3783449 0 0 0
21 3732732 3597121 135611 0 1
22 3660493 3660493 0 0 0
23 3472668 3329877 142791 0 1
24 3832768 3695491 137277 0 1
25 3858206 3858206 0 0 0
26 3790262 3591394 198868 0 1
27 3597755 3597755 0 0 0
28 2700000 2500257 199743 0 1
29 3659636 3659636 0 0 0
30 3649542 3445468 204074 0 1
31 3800000 3800000 0 0 0
32 3898268 3633025 265243 0 1
33 3715282 3715282 0 0 0
34 3576062 3323007 253055 0 1
35 3986268 3986268 0 0 0
36 3833166 3586498 246668 0 1
37 3800000 3800000 0 0 0
38 3900000 3600000 300000 0 1
39 3990157 3990157 0 0 0
40 3834466 3583695 250771 0 1
41 2500238 2500238 0 0 0
42 3731317 3384904 346413 0 1
43 3947112 3947112 0 0 0
44 3800000 3480977 319023 0 1
45 3837537 3837537 0 0 0
46 3727112 3348518 378594 0 1
47 3900000 3900000 0 0 0
48 3867654 3496865 370789 0 1
49 3650735 3650735 0 0 0
50 3900000 3490435 409565 0 1
51 4083587 4083587 0 0 0
52 3900000 3491568 408432 0 1
53 3569919 3569919 0 0 0
54 3966496 3633025 333471 0 1
55 3985703 3985703 0 0 0
56 4098286 3655774 442512 0 1
57 3841104 3841104 0 0 0
58 3835063 3433025 402038 0 1
59 4029830 4029830 0 0 0
60 3832732 3375927 456805 0 1
61 3827932 3827932 0 0 0
62 3665518 3227455 438063 0 1
63 3754800 3754800 0 0 0
64 3996944 3500000 496944 0 1
65 3594195 3594195 0 0 0
66 3863905 3333025 530880 0 1
67 4028058 4028058 0 0 0
68 4032750 3529016 503734 0 1
69 3988622 3988622 0 0 0
70 4097156 3500000 597156 0 1
71 3987896 3987896 0 0 0
72 4269986 3671635 598351 0 1
73 4130200 4130200 0 0 0
74 4088747 3594405 494342 0 1
75 3936875 3936875 0 0 0
76 4134046 3565793 568253 0 1
77 3991758 3991758 0 0 0
78 4031446 3465793 565653 0 1
79 4017390 4017390 0 0 0
80 4266238 3675499 590739 0 1
81 4169156 4169156 0 0 0
82 4076276 3465793 610483 0 1
83 3893278 3893278 0 0 0
84 2964185 2357183 607002 0 1
85 3955880 3955880 0 0 0
86 4074210 3465793 608417 0 1
87 4100238 4100238 0 0 0
88 4300000 3600257 699743 0 1
89 4260040 4260040 0 0 0
90 4231236 3505662 725574 0 1
91 4088619 4088619 0 0 0
92 4028173 3376528 651645 0 1
93 3997459 3997459 0 0 0
94 4161812 3433025 728787 0 1
95 3993143 3993143 0 0 0
96 4059937 3355283 704654 0 1
97 4029024 4029024 0 0 0
98 4397647 3662246 735401 0 1
99 4018040 4018040 0 0 0
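(The per-worker HEC counts above are collected with threadCapability.
As a minimal, self-contained sketch of that probe -- the MVar and the
names here are mine, for illustration, not part of the test program
below -- a forked thread can report the HEC it starts on like this:)

```haskell
import Control.Concurrent

-- Minimal sketch of the measurement technique: a forked thread asks
-- threadCapability which HEC (capability) it is currently running on,
-- and whether it is pinned to it.
main :: IO ()
main = do
  done <- newEmptyMVar
  _ <- forkIO $ do
         tid <- myThreadId
         (cap, pinned) <- threadCapability tid
         putMVar done (cap, pinned)
  (cap, pinned) <- takeMVar done
  putStrLn ("worker started on HEC " ++ show cap
            ++ (if pinned then " (pinned)" else " (unpinned)"))
```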
My test program is this:
import GHC.Conc
import Data.Array.MArray
import Data.Array.Unboxed
import Data.Array.IO
import Control.Monad
import Text.Printf
import Data.IORef
import Data.List (zip4)

main :: IO ()
main =
  do -- init counters
     capCountArrays <- sequence (replicate numThreads
                         (newArray (0, numCores - 1) 0)) :: IO [IOUArray Int Int]
     startCapRefs  <- sequence (replicate numThreads (newIORef (-1)))
     capSwitchRefs <- sequence (replicate numThreads (newIORef 0))
     mainTId <- myThreadId
     (mainCap,_) <- threadCapability mainTId
     printf "main HEC: %2d\n" mainCap
     -- fork workers
     tids <- sequence [ forkIO (myThreadId >>=
                                  busyLoop capCounts n startCap capSwitch)
                      | (n,startCap,capSwitch,capCounts)
                          <- zip4 [0..] startCapRefs capSwitchRefs capCountArrays ]
     -- let workers run for a while
     threadDelay testTime
     -- stop workers
     sequence_ [killThread tid | tid <- tids]
     -- report stats
     uarrs <- mapM freeze capCountArrays :: IO [UArray Int Int]
     -- stats on each worker thread
     printf "worker thread, iterations, HEC 0 iterations, HEC 1 iterations, Start HEC, #migrations\n"
     forM_ (zip4 [(0::Int)..] startCapRefs capSwitchRefs uarrs) $
       \(n, startCapRef, capSwitchRef, arr) ->
         do startCap <- readIORef startCapRef
            switches <- readIORef capSwitchRef
            printf "%2d %12d %12d %12d %2d %4d\n"
                   n (arr!0 + arr!1) (arr!0) (arr!1) startCap switches

busyLoop :: IOUArray Int Int -> Int -> IORef Int -> IORef Int
         -> ThreadId -> IO ()
busyLoop capcounts _workerId startCapRef switchCountRef tid =
  do (cap,_) <- threadCapability tid
     writeIORef startCapRef cap
     go 0 cap
  where go n cap | n == yieldLimit = yield >> go 0 cap
                 | otherwise =
                     do (cap',_) <- threadCapability tid
                        x <- readArray capcounts cap'
                        writeArray capcounts cap' (x + 1)
                        when (cap /= cap') $
                          do x <- readIORef switchCountRef
                             let y = x + 1
                             y `seq` writeIORef switchCountRef y
                        go (n+1) cap'

yieldLimit :: Int
yieldLimit = 100000

numCores :: Int
numCores = 2

numThreads :: Int
numThreads = 100

testTime :: Int
testTime = 10 * 1000000
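As an aside, not part of the experiment itself: if the scheduler's
push-based balancing leaves the workers this unevenly distributed, one
workaround is to pin each worker to a capability round-robin with
forkOn instead of forkIO. A small sketch -- the worker body and names
here are my own, for illustration only:

```haskell
import Control.Concurrent
import Control.Monad (forM, forM_)

-- Sketch: distribute workers round-robin over the available HECs by
-- pinning each one with forkOn.  numWorkers is a small stand-in for
-- the 100 workers in the test program.
main :: IO ()
main = do
  caps <- getNumCapabilities
  let numWorkers = 4
  dones <- forM [0 .. numWorkers - 1] $ \n -> do
    done <- newEmptyMVar
    _ <- forkOn (n `mod` caps) $ do
           tid <- myThreadId
           (cap, _) <- threadCapability tid
           putMVar done (n, cap)
    return done
  forM_ dones $ \done -> do
    (n, cap) <- takeMVar done
    putStrLn ("worker " ++ show n ++ " pinned to HEC " ++ show cap)
```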
Regards,
Andreas