How do I limit maximum number of OS threads?

Alexey Borzenkov

Mar 27, 2011, 8:47:32 AM
to golan...@googlegroups.com
Hi,

I'm looking into Go and love its goroutines. However, what I find puzzling is how it decides to spawn new OS threads and, more importantly, how to limit the maximum number of OS threads.

The reason is that whenever a goroutine blocks, a new OS thread seems to come up, and this can quickly get out of hand if there are many goroutines that are likely to block, for example:

package main

import (
    "fmt"
    "time"
)

func main() {
    for i := 0; i < 256; i++ {
        go func(i int) {
            fmt.Printf("Sleeping in %d\n", i)
            time.Sleep(60 * 1e9)
        }(i)
        <-time.After(100 * 1e6)
    }
}

On my Mac OS X 10.6 it crashes after spawning 130 goroutines with the message:

runtime: failed to create new OS thread (have 132 already; errno=12)

Of course this is an extreme example, and reality is not as severe, but it's still unnerving. I wanted to write a custom proxy server (with multiple proxied connections over a single tcp connection) in Go, and since net.Dial uses a blocking connect() call I'm worried that if there's a sudden spike in connections (and I do want to support hundreds, maybe thousands of connections a second) it might just crash.

Now, to be fair, I tried doing a thousand simultaneous net.Dial calls and the number of OS threads was surprisingly small (around 25), but maybe there's something else at play. The question remains, however: is there a way to limit the number of OS threads? If not, maybe there should be; I think it's better for the program to block for a little while than to simply crash because it's out of resources (in this case, the number of threads).

What do you think?

John Asmuth

Mar 27, 2011, 9:52:27 AM
to golan...@googlegroups.com
Funny story, but someone had almost this exact same issue a little while ago.

The problem is that time.Sleep will block the current thread, requiring the runtime to put other goroutines on a new one. What you should do is use <-time.After(60*1e9) in both parts of the code. It's much more friendly, since it won't block the whole thread, and other goroutines will be able to use it too.
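
For example, your first program with that change would look something like this (just a sketch, I haven't run it):

package main

import (
    "fmt"
    "time"
)

func main() {
    for i := 0; i < 256; i++ {
        go func(i int) {
            fmt.Printf("Sleeping in %d\n", i)
            // <-time.After parks the goroutine instead of blocking an OS thread
            <-time.After(60 * 1e9)
        }(i)
        <-time.After(100 * 1e6)
    }
}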

- John

Alexey Borzenkov

Mar 27, 2011, 10:16:26 AM
to golan...@googlegroups.com
No, I already know about time.After. I used time.Sleep specifically because it blocks. If time.Sleep didn't block, I'd use syscall.Sleep to make sure it blocked, because I wanted to illustrate the problem related to blocking.

I'm talking about there being no upper limit on the number of OS threads, which is the issue for me. For example, I just found that the reason I had a relatively small number of OS threads with net.Dial is dns resolution and the firewall (when I turned the firewall off, 1024 simultaneous net.Dial calls resulted in 75 threads). With the firewall disabled and the dns lookup moved out of the goroutines, this example crashes as well:

package main

import (
    "fmt"
    "net"
    "sync"
)

func main() {
    n := 1024
    _, addrs, err := net.LookupHost("kitsu.ru")
    if err != nil {
        fmt.Printf("ERROR: %s\n", err)
        return
    }
    addr := fmt.Sprintf("%s:80", addrs[0])
    fmt.Printf("Stress testing %s with %d goroutines\n", addr, n)
    wg := new(sync.WaitGroup)
    wg.Add(n)
    for i := 0; i < n; i++ {
        go func(i int) {
            defer wg.Done()
            c, err := net.Dial("tcp", "", addr)
            if c != nil {
                defer c.Close()
            }
            fmt.Printf("Connect in %d: %v\n", i, err)
        }(i)
    }
    wg.Wait()
}

I think I know how to work around the issue (for example with channels and a dedicated, limited number of "connector" goroutines), but the issue doesn't disappear just because I work around it. It seems like there's no upper limit on OS threads (at least I can't seem to find any in runtime/proc.c :-/) and I wonder why.
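
To be concrete, by "connector" goroutines I mean roughly this (an untested sketch, using the same release-era net.Dial form as my example above, with a made-up target address):

package main

import (
    "fmt"
    "net"
    "sync"
)

// At most maxConnectors net.Dial calls are in flight at any time,
// so at most that many goroutines can block in connect().
const maxConnectors = 8

func main() {
    addrs := make(chan string)
    wg := new(sync.WaitGroup)
    wg.Add(maxConnectors)
    for i := 0; i < maxConnectors; i++ {
        go func() {
            defer wg.Done()
            for addr := range addrs {
                c, err := net.Dial("tcp", "", addr)
                fmt.Printf("Connect to %s: %v\n", addr, err)
                if c != nil {
                    c.Close()
                }
            }
        }()
    }
    for i := 0; i < 1024; i++ {
        addrs <- "127.0.0.1:80" // made-up target
    }
    close(addrs)
    wg.Wait()
}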

If this is a conscious design decision then I'm ok with it. :) But if it's not, then maybe it's a bug.

Alexey Borzenkov

Mar 27, 2011, 11:05:36 AM
to golan...@googlegroups.com
Hmm, I looked further into runtime/proc.c, and from the comments it looks like it was a design choice after all. Limiting the total number of m's didn't work at first, but later I realized that I wasn't putting g back on the queue, and this patch appears to work:

diff -r c5c62aeb6267 src/pkg/runtime/proc.c
--- a/src/pkg/runtime/proc.c	Mon Mar 07 16:18:24 2011 +1100
+++ b/src/pkg/runtime/proc.c	Sun Mar 27 19:01:17 2011 +0400
@@ -452,6 +452,10 @@
 
 	// Find the m that will run g.
 	if((m = mget(g)) == nil){
+		if (runtime·sched.mcount >= 4) {
+			gput(g);
+			break;
+		}
 		m = runtime·malloc(sizeof(M));
 		// Add to runtime·allm so garbage collector doesn't free m
 		// when it is just in a register (R14 on amd64).

The number 4 is of course hard-coded, but it could be taken from an environment variable.

Also, while 4 m's are enough for my net.Dial example, the tests fail with "all goroutines are asleep - deadlock!". :(

Alexey Borzenkov

Mar 27, 2011, 12:28:44 PM
to golan...@googlegroups.com
Ok, if anyone is interested, it seems I found out how to limit the number of OS threads. It's not very efficient for a small gomaxthreads, though, because it drains the g queue and then puts most of the goroutines back in the same order. This patch makes it possible to specify GOMAXTHREADS=n to restrict the maximum number of OS threads; it doesn't work with cgo for some reason, though. Simple programs work with GOMAXTHREADS as low as 2. The test suite passes with GOMAXTHREADS=3 (there's a timeout in http with GOMAXTHREADS=2). Here's my patch against release:

diff -r c5c62aeb6267 src/pkg/runtime/proc.c
--- a/src/pkg/runtime/proc.c	Mon Mar 07 16:18:24 2011 +1100
+++ b/src/pkg/runtime/proc.c	Sun Mar 27 20:24:18 2011 +0400
@@ -62,6 +62,7 @@
 	M	*mhead;		// ms waiting for work
 	int32	mwait;		// number of ms waiting for work
 	int32	mcount;		// number of ms that have been created
+	int32	mcountmax;	// max number of ms that can be created
 	int32	mcpu;		// number of ms executing on cpu
 	int32	mcpumax;	// max number of ms allowed on cpu
 	int32	msyscall;	// number of ms in system calls
@@ -119,6 +120,11 @@
 	// so that we don't need to call malloc when we crash.
 	// runtime·findfunc(0);
 
+	runtime·gomaxthreads = 0;
+	p = runtime·getenv("GOMAXTHREADS");
+	if(p != nil && (n = runtime·atoi(p)) != 0)
+		runtime·gomaxthreads = n >= 2 ? n : 2;
+	runtime·sched.mcountmax = runtime·gomaxthreads;
 	runtime·gomaxprocs = 1;
 	p = runtime·getenv("GOMAXPROCS");
 	if(p != nil && (n = runtime·atoi(p)) != 0)
@@ -444,6 +450,9 @@
 matchmg(void)
 {
 	G *g;
+	G *head = nil;
+	G *tail = nil;
+	int32 mcountmax = runtime·iscgo ? 0 : runtime·sched.mcountmax;
 
 	if(m->mallocing || m->gcing)
 		return;
@@ -452,6 +461,16 @@
 
 	// Find the m that will run g.
 	if((m = mget(g)) == nil){
+		if (mcountmax > 0 && runtime·sched.mcount >= mcountmax) {
+			// Cannot create new ms, reschedule for later
+			g->schedlink = nil;
+			if (head == nil)
+				head = g;
+			else
+				tail->schedlink = g;
+			tail = g;
+			continue;
+		}
 		m = runtime·malloc(sizeof(M));
 		// Add to runtime·allm so garbage collector doesn't free m
 		// when it is just in a register (R14 on amd64).
@@ -481,6 +500,10 @@
 		}
 		mnextg(m, g);
 	}
+	while ((g = head) != nil) {
+		head = g->schedlink;
+		gput(g);
+	}
 }
 
 // Scheduler loop: find g to run, run it, repeat.
diff -r c5c62aeb6267 src/pkg/runtime/runtime.h
--- a/src/pkg/runtime/runtime.h	Mon Mar 07 16:18:24 2011 +1100
+++ b/src/pkg/runtime/runtime.h	Sun Mar 27 20:24:18 2011 +0400
@@ -363,6 +363,7 @@
 M*	runtime·allm;
 int32	runtime·goidgen;
 extern	int32	runtime·gomaxprocs;
+extern	int32	runtime·gomaxthreads;
 extern	uint32	runtime·panicking;
 extern	int32	runtime·gcwaiting;	// gc is waiting to run
 int8*	runtime·goos;

Ian Lance Taylor

Mar 27, 2011, 1:19:01 PM
to golan...@googlegroups.com
Alexey Borzenkov <sna...@gmail.com> writes:

> Ok, if anyone is interested, it seems I found out how to limit the
> number of OS threads. It's not very efficient for a small gomaxthreads,
> though, because it drains the g queue and then puts most of the
> goroutines back in the same order. This patch makes it possible to
> specify GOMAXTHREADS=n to restrict the maximum number of OS threads;
> it doesn't work with cgo for some reason, though. Simple programs work
> with GOMAXTHREADS as low as 2. The test suite passes with GOMAXTHREADS=3
> (there's a timeout in http with GOMAXTHREADS=2). Here's my patch
> against release:

The general problem with this is that the program can deadlock in some
cases. I'm not saying that you haven't identified a real problem, but
your patch doesn't entirely solve it.

The specific case of connect we could actually handle by putting the
socket into nonblocking mode before calling connect and then extending
the epoll/kqueue code to handle completed connections.
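
Roughly, in terms of the raw syscalls (just a sketch, not the real netfd code; the address and error handling are illustrative):

package main

import (
    "fmt"
    "syscall"
)

func main() {
    // Create the socket and put it into non-blocking mode before connecting.
    fd, err := syscall.Socket(syscall.AF_INET, syscall.SOCK_STREAM, 0)
    if err != nil {
        fmt.Println("socket:", err)
        return
    }
    defer syscall.Close(fd)
    syscall.SetNonblock(fd, true)

    // A non-blocking connect normally returns EINPROGRESS immediately.
    addr := &syscall.SockaddrInet4{Port: 80, Addr: [4]byte{127, 0, 0, 1}}
    err = syscall.Connect(fd, addr)
    if err == syscall.EINPROGRESS {
        // Here the runtime would register fd with its epoll/kqueue loop,
        // resume the goroutine once the socket becomes writable, and then
        // read SO_ERROR to learn whether the connect actually succeeded.
        fmt.Println("connect in progress; wait for writability via the poller")
        return
    }
    fmt.Println("connect:", err)
}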

Ian

Alexey Borzenkov

Mar 27, 2011, 1:37:25 PM
to golan...@googlegroups.com, Ian Lance Taylor
Yes, I actually thought about making connect non-blocking. For that, syscall.Connect() likely needs to move into newFD in fd.go and fd_windows.go, and it's relatively easy to do for the last release (except for adding a new ConnectEx syscall for Windows, which I didn't quite figure out). But then I looked in trunk, and net/file.go uses newFD as well, so it's not that easy anymore, and it's hard to do something organic/beautiful instead of a mess.

On the other hand, I just realized that I can create thousands of threads in Python without any problems, and it doesn't even have any performance impact (aside from virtual memory). This makes it rather odd that Go can only create 131 threads and yet manages to grind my Mac to a crawl. :-/

Maybe there's something wrong with threading after all...

Alexey Borzenkov

Mar 27, 2011, 1:47:12 PM
to golan...@googlegroups.com, Ian Lance Taylor
Ok, now that is totally odd. It appears that 131 threads consume more than 126TB (terabytes!) of virtual memory on my Mac OS X 10.6 (as seen in Activity Monitor), which is totally wrong. I thought Go was supposed to "start with little stack space", not almost a terabyte per thread! O.o

Alexey Borzenkov

Mar 27, 2011, 2:42:51 PM
to golan...@googlegroups.com, Ian Lance Taylor
Oh my god, this is ridiculous. If you look at http://fxr.watson.org/fxr/source/bsd/kern/pthread_synch.c?v=xnu-1228 you will see that bsdthread_create uses the stack parameter as the STACK SIZE when PTHREAD_CUSTOM is not specified. This means that Go will allocate the goroutine stack and then USE ITS ADDRESS AS THE STACK SIZE. There's even a comment in pkg/runtime/darwin/amd64/sys.s saying:

// TODO(rsc): why do we get away with 0 flags here but not on 386?

Of course we get away with it: on amd64 there is tons of virtual memory, so a 1TB stack is almost nothing! O.o ...except when the stack grows to 126TB, then we're suddenly in trouble. :-/ The fix is actually trivial:

diff -r c5c62aeb6267 src/pkg/runtime/darwin/amd64/sys.s
--- a/src/pkg/runtime/darwin/amd64/sys.s	Mon Mar 07 16:18:24 2011 +1100
+++ b/src/pkg/runtime/darwin/amd64/sys.s	Sun Mar 27 22:31:15 2011 +0400
@@ -138,8 +138,7 @@
 	MOVQ	mm+16(SP), SI	// "arg"
 	MOVQ	stk+8(SP), DX	// stack
 	MOVQ	gg+24(SP), R10	// "pthread"
-// TODO(rsc): why do we get away with 0 flags here but not on 386?
-	MOVQ	$0, R8	// flags
+	MOVQ	$0x01000000, R8	// flags
 	MOVQ	$0, R9	// paranoia
 	MOVQ	$(0x2000000+360), AX	// bsdthread_create
 	SYSCALL

Now I can have 2560 blocking goroutines, which is really, really cool and practically infinite. :)

P.S. On the other hand, now I wonder: why 2560?

Devon H. O'Dell

Mar 27, 2011, 2:53:15 PM
to golan...@googlegroups.com
2011/3/27 Alexey Borzenkov <sna...@gmail.com>:

> Oh my god, this is ridiculous. If you look
> at http://fxr.watson.org/fxr/source/bsd/kern/pthread_synch.c?v=xnu-1228 you
> will see that bsdthread_create uses the stack parameter as the STACK SIZE
> when PTHREAD_CUSTOM is not specified. This means that Go will allocate the
> goroutine stack and then USE ITS ADDRESS AS THE STACK SIZE. There's even a
> comment in pkg/runtime/darwin/amd64/sys.s saying:
> // TODO(rsc): why do we get away with 0 flags here but not on 386?

Given this citation, FreeBSD may need a similar fix. I'll take a look.

--dho

Alexey Borzenkov

Mar 27, 2011, 3:11:57 PM
to Devon H. O'Dell, golan...@googlegroups.com
On Sun, Mar 27, 2011 at 10:53 PM, Devon H. O'Dell <devon...@gmail.com> wrote:
> 2011/3/27 Alexey Borzenkov <sna...@gmail.com>:
>> Oh my god, this is ridiculous. If you look
>> at http://fxr.watson.org/fxr/source/bsd/kern/pthread_synch.c?v=xnu-1228 you
>> will see that bsdthread_create uses the stack parameter as the STACK SIZE
>> when PTHREAD_CUSTOM is not specified. This means that Go will allocate the
>> goroutine stack and then USE ITS ADDRESS AS THE STACK SIZE. There's even a
>> comment in pkg/runtime/darwin/amd64/sys.s saying:
>> // TODO(rsc): why do we get away with 0 flags here but not on 386?
> Given this citation, FreeBSD may need a similar fix. I'll take a look.

No, freebsd uses different syscalls and the arguments are different: it
specifies the stack base and stack size explicitly.

However, there might be another bug. In pkg/runtime/proc.c there's a
call runtime·newosproc(m, m->g0, m->g0->stackbase, runtime·mstart), but
freebsd's runtime·newosproc uses g->stackbase for stack_base and
stk - g->stackbase for the stack size. Doesn't that mean it uses 0 as
stack_size? (Besides, the formula looks completely wrong, given that
both newosproc's stk and the goroutine's stackbase point closer to the
stack end.) Don't know if it matters though...
