Opening network connections blocks system thread

Venkat T V

Feb 9, 2024, 10:29:15 AM
to golang-nuts
Hi,

I am debugging an issue where a server opening a large number of connections on startup sometimes dies with "program exceeds 10000-thread limit". I know file IO operations could lock up an OS thread. I am still seeing this crash after eliminating file IO, and it looks like "syscall.connect" is a blocking call that could tie up an OS thread. This is on Linux with Go 1.21.7.

I wrote a small program to test this out. Running this with "go run osthreads.go -parallel 500 -threads 5" does trigger crashes sometimes, and I see goroutines blocked on "syscall.connect" and "syscall.fcntl". Could I get confirmation that this is expected behavior and Connect is a blocking syscall?

===
package main

import (
	"flag"
	"fmt"
	"net"
	"runtime"
	"runtime/debug"
	"sync"
	"time"
)

func main() {
	numThreads := flag.Int("threads", 10, "number of threads (in addition to GOMAXPROCS)")
	parallelism := flag.Int("parallel", 100, "number of parallel goroutines to start")
	flag.Parse()

	maxThreads := runtime.GOMAXPROCS(-1) + *numThreads
	fmt.Printf("GOMAXPROCS=%d, max threads=%d\n", runtime.GOMAXPROCS(-1), maxThreads)
	debug.SetMaxThreads(maxThreads)

	// Server that does not accept any connections
	listener, err := net.Listen("tcp", "127.0.0.1:9090")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer listener.Close()

	wg := sync.WaitGroup{}
	startSignal := make(chan struct{})

	// Spawn all goroutines
	for i := 0; i < *parallelism; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			<-startSignal

			conn, err := net.DialTimeout("tcp", "127.0.0.1:9090", time.Second)
			if err != nil {
				fmt.Printf("%d: error: %s\n", id, err)
				return
			}
			defer conn.Close()
			time.Sleep(time.Second)
		}(i)
	}

	time.Sleep(time.Second)

	// Start them all at once
	close(startSignal)
	wg.Wait()
}

===

Kurtis Rader

Feb 9, 2024, 2:57:06 PM
to Venkat T V, golang-nuts
The connect() syscall is normally blocking. It doesn't return until the connection is established or an error occurs. It can be made non-blocking by putting the file-descriptor into non-blocking mode before the connect() call. However, that then requires either an async callback or another syscall to check whether the connection was established or an error occurred. Neither approach is idiomatic Go. 

--
Kurtis Rader
Caretaker of the exceptional canines Junior and Hank

TheDiveO

Feb 9, 2024, 4:08:17 PM
to golang-nuts
The socket created somewhat deeper inside dial is set to non-blocking before the non-blocking connect syscall is fired.

There are no "callbacks" from kernel to user space. Go's netpoller employs different OS kernel mechanisms, depending on the particular OS at hand: for instance, epoll or its follow-up in Linux land, or kqueue IIRC on macOS. Heavily simplified, epoll is this other syscall that blocks until events become available. As for the connect part with a parallel goroutine waiting on the context, see https://cs.opensource.google/go/go/+/master:src/net/fd_unix.go;drc=2057ad02bd8387378a2d1fd637e955e126f698bf;l=55

Ian Lance Taylor

Feb 9, 2024, 5:10:29 PM
to Kurtis Rader, Venkat T V, golang-nuts
On Fri, Feb 9, 2024 at 11:57 AM Kurtis Rader <kra...@skepticism.us> wrote:
>
> The connect() syscall is normally blocking. It doesn't return until the connection is established or an error occurs. It can be made non-blocking by putting the file-descriptor into non-blocking mode before the connect() call. However, that then requires either an async callback or another syscall to check whether the connection was established or an error occurred. Neither approach is idiomatic Go.

That is true, but the Go standard library's net package does use
non-blocking calls to connect internally when implementing net.Dial
and friends.

I don't have any problem running the original program on Linux 6.5.13.

Ian

Kurtis Rader

Feb 10, 2024, 12:17:21 AM
to Ian Lance Taylor, Venkat T V, golang-nuts
On Fri, Feb 9, 2024 at 2:10 PM Ian Lance Taylor <ia...@golang.org> wrote:
> On Fri, Feb 9, 2024 at 11:57 AM Kurtis Rader <kra...@skepticism.us> wrote:
> >
> > The connect() syscall is normally blocking. It doesn't return until the connection is established or an error occurs. It can be made non-blocking by putting the file-descriptor into non-blocking mode before the connect() call. However, that then requires either an async callback or another syscall to check whether the connection was established or an error occurred. Neither approach is idiomatic Go.
>
> That is true, but the Go standard library's net package does use
> non-blocking calls to connect internally when implementing net.Dial
> and friends.
>
> I don't have any problem running the original program on Linux 6.5.13.

Yes, I realized after my previous reply that Go could obviously use non-blocking connect() calls coupled with select(), poll(), or similar mechanisms to wake up a goroutine blocked waiting for a net.Dial() (or equivalent) connection to complete. That minimizes the number of OS threads required to handle in-flight network connections, without exposing an async callback or a mechanism to explicitly start a connection and test at a later time whether it has been established or has failed.

What might be happening for the O.P. is that the systems they are connecting to are not explicitly accepting or rejecting the connections in a timely manner, leaving a huge number of goroutines blocked waiting for net.Dial() to complete. The systems they are connecting to may simply be discarding the TCP SYN packets due to firewall rules or something similar. This is going to be hard for the Go community to help with, since it is fundamentally not an issue with Go itself.

Brian Candler

Feb 10, 2024, 4:47:16 AM
to golang-nuts
I can't see how a large number of waiting goroutines would cause an increase in the number of OS threads, which was the OP's original problem (hitting the 10,000 thread limit).

What the OP is implying - but we have not seen good evidence for yet - is that in some circumstances a non-blocking connect() becomes blocking. But even then, how would Go know to allocate an OS thread for it *before* calling it??

I suspect something else is going on. Does the original program with the 10,000 thread problem make any use of external C code, directly or indirectly?