Avoiding resource leaks with racing channels

151 views
Skip to first unread message

Daniel Eloff

unread,
Jul 9, 2019, 4:14:23 AM7/9/19
to golang-nuts
If a select statement has multiple channels ready when it runs, then it will choose one at a random. So if you fire something across a channel that holds a resource, like an open file descriptor - you have no guarantees that the other end of the channel receives it. The (possibly full) channel will get garbage collected later and the resource will leak in that case.

Some code that explains things better than my clumsy prose:

Receiver:
 // Wait on the channel, or for timeout
 
select {
 
case fd := <-channel:
     
return fd, nil
 
case <-time.After(queue.timeout):
     
return nil, ErrTimeoutElapsed
 
}

Sender:
 channel <- fd

What happens when the timeout races with the channel send? I think it's possible the select handles the timeout in that case and leaves the channel containing a connection alone.

Am I right that this is a problem? How might I fix this code?

Thanks,
Dan

Burak Serdar

unread,
Jul 9, 2019, 4:42:35 AM7/9/19
to Daniel Eloff, golang-nuts
Interesting problem. Maybe you shouldn't send file descriptors like this?

One solution could be:

case <- time.After(queue.timeout):
close(channel)
return nil, errTimeOutElapsed

and recover from panic on the sending side.




>
> Thanks,
> Dan
>
> --
> You received this message because you are subscribed to the Google Groups "golang-nuts" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/5cda1189-2761-4110-8e9e-1d7311fd97fe%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Matt Harden

unread,
Jul 9, 2019, 4:52:38 AM7/9/19
to Burak Serdar, Daniel Eloff, golang-nuts
Only a buffered channel can "hold" anything. If the channel is unbuffered, then you are guaranteed that another goroutine has at least received the item you sent when the send statement returns.

Tamás Gulácsi

unread,
Jul 9, 2019, 4:54:29 AM7/9/19
to golang-nuts
// Wait on the channel, or for timeout
select {
case fd := <-channel:
return fd, nil
case <-time.After(queue.timeout):
// Check again

select {
case fd := <-channel:
return fd, nil
default:
return nil, ErrTimeoutElapsed
}
}

Burak Serdar

unread,
Jul 9, 2019, 5:02:11 AM7/9/19
to Tamás Gulácsi, golang-nuts
I believe this is wrong. If timeout happens first and the fd is sent
after the second select runs, the sender is stuck.

Two selects and another cancel channel might work:

sending side:

select {
case <- cancel:
close(fd)
case channel<- fd:
}

receiving side:

select {
case fd:=<-channel:
return fd,nil
case <-time.After(...) :
close(cancel)
return nil, ErrTimeout
}

}

>
> --
> You received this message because you are subscribed to the Google Groups "golang-nuts" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/9ca56ba4-dcc2-4e1a-a0e8-5e2463042455%40googlegroups.com.

Ian Lance Taylor

unread,
Jul 9, 2019, 5:06:30 AM7/9/19
to Daniel Eloff, golang-nuts
There are many approaches. Here is a simple one:

select {
case fd := <-channel:
return fd, nil
case <-time.After(queue.timeout):
go func() {
<-channel.Close()
}()
return nil, ErrTimeoutElapsed
}

Another approach is to use a context.Context on the sending side, and
cancel the Context if the timeout occurs. I won't write that out, but
see https://blog.golang.org/context .

Ian

Dan Eloff

unread,
Jul 9, 2019, 4:27:32 PM7/9/19
to golang-nuts
I couldn't use <-channel.Close since in my case the goroutine isn't guaranteed to have something sent, so that would leak goroutines. I added a cleanup goroutine to scan timed-out channels and close them, which solves this problem. But now I can use that close as a signal to the receiver than the a timeout happened, and eliminate the select and the race entirely. The close can in rare cases race with the sender, but that's easily enough fixed:

// TrySend tries to send on a possibly closed channel and handles the panic if necessary.
// Returns true if conn was successfully sent over the channel.
func (waiter *Waiter) TrySend(conn Connection) (sent bool) {
    defer func() {
        r := recover()
        sent = r != nil
    }()
    waiter.channel <- conn
    return
}

So I guess the best thing to do in these cases is don't combine select with sending unmanaged resources over a channel. It's probably worth warning about this problem in the docs for select? It's not an obvious gotcha.

Ian Lance Taylor

unread,
Jul 9, 2019, 5:00:16 PM7/9/19
to Dan Eloff, golang-nuts
On Tue, Jul 9, 2019 at 9:27 AM Dan Eloff <dan....@gmail.com> wrote:
>
> So I guess the best thing to do in these cases is don't combine select with sending unmanaged resources over a channel. It's probably worth warning about this problem in the docs for select? It's not an obvious gotcha.

In my opinion the best place for this kind of discussion is a blog
post or talk. Such as https://blog.golang.org/pipelines or
https://www.youtube.com/watch?v=5zXAHh5tJqQ .

Ian

Daniel Eloff

unread,
Jul 9, 2019, 5:36:30 PM7/9/19
to golang-nuts

In my opinion the best place for this kind of discussion is a blog
post or talk. 

I disagree strongly. If there's a mode of operation that's dangerous when I'm operating a car or machinery I want it to not just be called out in the manual, but called attention to in a big bold font. Likewise with software engineering if there's a dangerous edge case it should be called out right there in the official documentation. This goes double for situations like this where the behavior is surprising or not intuitive. The number of replies to this thread with people suggesting broken workarounds due to race conditions between sending and timeout, suggests not only that this behavior is not intuitive, but it's also extremely fraught.

The great thing is Go is open-source so I'll just create a commit with what I feel is the appropriate change to the documentation, and submit it for review.

Cheers,
Dan

Ian Lance Taylor

unread,
Jul 9, 2019, 6:32:03 PM7/9/19
to Daniel Eloff, golang-nuts
On Tue, Jul 9, 2019 at 10:36 AM Daniel Eloff <dan....@gmail.com> wrote:
>>
>>
>> In my opinion the best place for this kind of discussion is a blog
>> post or talk.
>
>
> I disagree strongly. If there's a mode of operation that's dangerous when I'm operating a car or machinery I want it to not just be called out in the manual, but called attention to in a big bold font. Likewise with software engineering if there's a dangerous edge case it should be called out right there in the official documentation. This goes double for situations like this where the behavior is surprising or not intuitive. The number of replies to this thread with people suggesting broken workarounds due to race conditions between sending and timeout, suggests not only that this behavior is not intuitive, but it's also extremely fraught.

And not only that, it's complicated. The language spec is not the
right place to dig into the complexities of how to use select safely
while avoiding race conditions. There is just too much to say. And
there are no docs for select other than the language spec.

Ian

Daniel Eloff

unread,
Jul 9, 2019, 6:46:26 PM7/9/19
to golang-nuts

And not only that, it's complicated.  The language spec is not the
right place to dig into the complexities of how to use select safely
while avoiding race conditions.  There is just too much to say.  And
there are no docs for select other than the language spec.


Well here's the real problem. This should be in the docs for select, and not in the language spec. A language spec defines the rules, and doesn't talk much about their consequences or usage of the features. So I agree it can't go into the language spec, but I don't think the solution is to banish this kind of thing from the official documentation either.

First step's first, we'd need something like a practitioner's handbook, or Go programming guidelines where the select feature is discussed before we could add this to the docs. I think that's worthwhile, but not my fight. I'll relegate this to a blog post instead.

Cheers,
Dan

Jesper Louis Andersen

unread,
Jul 10, 2019, 11:39:19 AM7/10/19
to Daniel Eloff, golang-nuts
On Tue, Jul 9, 2019 at 6:14 AM Daniel Eloff <dan....@gmail.com> wrote:
If a select statement has multiple channels ready when it runs, then it will choose one at a random. So if you fire something across a channel that holds a resource, like an open file descriptor - you have no guarantees that the other end of the channel receives it. The (possibly full) channel will get garbage collected later and the resource will leak in that case.


To me, the real question is about "who owns the resource?"

If the channel is unbuffered, there are two parties, S and R (Sender and Receiver). If the channel is buffered, it is another party, C (channel). The delivery chain is really S -> C -> R. Whereas in the unbuffered case, rendezvous means an atomic exchange of the resource (S -> R). Clearly, cleanup is the responsibility of the owner at any point in time. But the extra owner, C, means that you will have to handle the case where the resource is in limbo between the two systems. Since a channel cannot run code, you will have to either let S or R handle it, or introduce a proxy, P, who handles eventual cleanup on behalf of C.

I think not introducing C in the first place is the best way to go if possible. Resource handling is a place where things need to be correct before being asynchronous.

Dan Eloff

unread,
Jul 10, 2019, 1:12:57 PM7/10/19
to Jesper Louis Andersen, golang-nuts
If the channel is unbuffered, there are two parties, S and R (Sender and Receiver). If the channel is buffered, it is another party, C (channel). The delivery chain is really S -> C -> R. Whereas in the unbuffered case, rendezvous means an atomic exchange of the resource (S -> R). Clearly, cleanup is the responsibility of the owner at any point in time. But the extra owner, C, means that you will have to handle the case where the resource is in limbo between the two systems. Since a channel cannot run code, you will have to either let S or R handle it, or introduce a proxy, P, who handles eventual cleanup on behalf of C.

Note in this case the channel is unbuffered, but there is no atomic exchange because of the select statement which is inherently a race between channels. If there are sends on multiple channels at close to the same time, one will randomly be chosen and the other will eventually get garbage collected with whatever was sent on it, unless you jump through hoops to avoid that situation.

 

Dan Eloff

unread,
Jul 10, 2019, 1:24:25 PM7/10/19
to Jesper Louis Andersen, matt....@gmail.com, golang-nuts
Maybe I'm wrong here in my understanding of unbuffered channels, but I don't think so:

Matt says earlier: "Only a buffered channel can "hold" anything. If the channel is unbuffered, then you are guaranteed that another goroutine has at least received the item you sent when the send statement returns."

I think at least in the simple case of `channel <- fd` this cannot be true, since that operation can only fail by panicking, and I beleive it will only panic if the channel is nil or closed. Now if you used a non-blocking send with a select, that would be a different story.

So if you send over that channel it blocks
the receiver wakes and runs the select
but sees both channels ready
picks the timeout channel at random

Now one of two things must happen, either the sender blocks forever because nobody read the sent value, or the value gets lost to space and both receiver and sender continue on their merry ways.

Am I wrong?

-Dan

Michael Jones

unread,
Jul 10, 2019, 2:54:55 PM7/10/19
to Dan Eloff, Jesper Louis Andersen, Matt Harden, golang-nuts
unbuffered means nothing is sent until is is simultaneously received, so there is no limbo or race or uncertainty. one sender "wins" the select and the others remain blocked waiting.

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.


--
Michael T. Jones
michae...@gmail.com

jake...@gmail.com

unread,
Jul 10, 2019, 4:30:53 PM7/10/19
to golang-nuts


On Tuesday, July 9, 2019 at 12:27:32 PM UTC-4, Dan Eloff wrote:
I couldn't use <-channel.Close since in my case the goroutine isn't guaranteed to have something sent, so that would leak goroutines. I added a cleanup goroutine to scan timed-out channels and close them, which solves this problem. But now I can use that close as a signal to the receiver than the a timeout happened, and eliminate the select and the race entirely. The close can in rare cases race with the sender, but that's easily enough fixed:

// TrySend tries to send on a possibly closed channel and handles the panic if necessary.
// Returns true if conn was successfully sent over the channel.
func (waiter *Waiter) TrySend(conn Connection) (sent bool) {
    defer func() {
        r := recover()
        sent = r != nil
    }()
    waiter.channel <- conn
    return
}


So I guess the best thing to do in these cases is don't combine select with sending unmanaged resources over a channel. It's probably worth warning about this problem in the docs for select? It's not an obvious gotcha.

The best thing to do is to not use unmanaged resources. This is obviously not always possible, but when you have to use unmanaged resources, and you care about cleaning them up, you must use a lot of care. This is not a problem specific to select. It is possible to correctly use select with unmanaged resources. It just takes more thought, and a really solid understanding of the language and the principals of concurrency and ownership to get it right.

Dan Eloff

unread,
Jul 10, 2019, 4:45:40 PM7/10/19
to golang-nuts
On Wed, Jul 10, 2019 at 7:54 AM Michael Jones <michae...@gmail.com> wrote:
unbuffered means nothing is sent until is is simultaneously received, so there is no limbo or race or uncertainty. one sender "wins" the select and the others remain blocked waiting.

So I'm correct then: "Now one of two things must happen, either the sender blocks forever because nobody read the sent value"

The implications of that are that the sending and receiving code are tightly coupled. It is not generally safe to send on a channel without knowing how the receiver receives it, otherwise you can block forever like in this case where the receiver is using a select and the timeout wins. It's very easy to make your Go program leak goroutines that way - and I bet a lot of serious software makes that mistake in practice. In this case the sender would need to loop doing a non-blocking send because the receiver is using a select, and then handle the case where the fd didn't get sent within a reasonable time period (which makes no sense because no both sender and receiver have a timeout baked in.)

Basically a simple send and receive are not too bad, but once you move beyond that the world gets complex and fraught very quickly. Multi-threaded programming is hard, and Go doesn't wave that burden away. No tools that I've seen wave that away, so it's not really a failing of Go, it speaks more to the inherit complexity of the domain.

Burak Serdar

unread,
Jul 10, 2019, 4:58:37 PM7/10/19
to Dan Eloff, golang-nuts
On Wed, Jul 10, 2019 at 10:45 AM Dan Eloff <dan....@gmail.com> wrote:
>
> On Wed, Jul 10, 2019 at 7:54 AM Michael Jones <michae...@gmail.com> wrote:
>>
>> unbuffered means nothing is sent until is is simultaneously received, so there is no limbo or race or uncertainty. one sender "wins" the select and the others remain blocked waiting.
>
>
> So I'm correct then: "Now one of two things must happen, either the sender blocks forever because nobody read the sent value"
>
> The implications of that are that the sending and receiving code are tightly coupled. It is not generally safe to send on a channel without knowing how the receiver receives it, otherwise you can block forever like in this case where the receiver is using a select and the timeout wins. It's very easy to make your Go program leak goroutines that way - and I bet a lot of serious software makes that mistake in practice. In this case the sender would need to loop doing a non-blocking send because the receiver is using a select, and then handle the case where the fd didn't get sent within a reasonable time period (which makes no sense because no both sender and receiver have a timeout baked in.)

Sender would not need a loop if you use a cancel channel:

select {
case channel<-fd:
case <-cancel:
close(fd)
}

I think the problem is, as someone else already mentioned, resource
ownership. If fd cannot be sent, it still belongs to the sender, and
it has to have a way of cleaning it up. So the receiver has to let the
sender know that it is no longer interested in the resource.

>
> Basically a simple send and receive are not too bad, but once you move beyond that the world gets complex and fraught very quickly. Multi-threaded programming is hard, and Go doesn't wave that burden away. No tools that I've seen wave that away, so it's not really a failing of Go, it speaks more to the inherit complexity of the domain.
>
> --
> You received this message because you are subscribed to the Google Groups "golang-nuts" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/CADz32dnDZt_npnZvCyfcGKOZ-sXHz-0V59hbhu%3DQbz5WTV3B0w%40mail.gmail.com.

Jesper Louis Andersen

unread,
Jul 10, 2019, 5:15:13 PM7/10/19
to Dan Eloff, golang-nuts
On Wed, Jul 10, 2019 at 6:45 PM Dan Eloff <dan....@gmail.com> wrote:
On Wed, Jul 10, 2019 at 7:54 AM Michael Jones <michae...@gmail.com> wrote:
unbuffered means nothing is sent until is is simultaneously received, so there is no limbo or race or uncertainty. one sender "wins" the select and the others remain blocked waiting.

So I'm correct then: "Now one of two things must happen, either the sender blocks forever because nobody read the sent value"


If the sender is written as

channel <- fd

as you propose, then indeed, the sender will block forever. However, this is easily fixed via a select on the sender side as well with a timeout, or a context.Context that can cancel. If the send on the channel is _not_ selected, you still own the resource and have to clean up.

More advanced event systems, such as Concurrent ML, has a withNACK guard for this case. If a given event is not selected, its withNACK thunk is run, allowing for cleanup. But in your case and Go, you can just have a variable or such to handle the case and clean up properly.

You are right that a lot of concurrent programming is hard, especially in the presence of errors and faults. Hence, simple strategies first. And then you need to have a sketch of a proof present for more complicated interactions, or a model in TLA+ if you want it to be water-tight. However, given what AMD just launched, there is little hope for MIMD style operation now. SIMD style can still be done with a sequential but parallel program.


--
J.

Jesper Louis Andersen

unread,
Jul 10, 2019, 5:25:47 PM7/10/19
to Jake Montgomery, golang-nuts
On Wed, Jul 10, 2019 at 6:30 PM <jake...@gmail.com> wrote:

The best thing to do is to not use unmanaged resources.

Would be the Erlang solution. You cannot change the owner of an open file, and if the process owning it terminates, the file is closed. You can change ownership of a TCP socket, and then messages from the socket in the mailbox which has not yet been received are also transferred to the new owner in the same order, preserving linearizability guarantees.

--
J.

Dan Eloff

unread,
Jul 10, 2019, 7:08:38 PM7/10/19
to Jesper Louis Andersen, golang-nuts
Yeah, agreed. I've been deep into concurrent programming for a long time now, and into lock-free programming as well which is the most fraught kind of programming I've ever done. Parallel is the future, it has been that way for a long time, but it's only getting more and more obvious.

I think in this specific case, the timeout should have been handled on the sending side in a select,  almost identically to the receiver code I posted. If the timer channel triggers, then close the channel to indicate to the receiver that it can wake up and it has timed out. Then the sender can go ahead and clean up the resource which it still owns. Doing it on the receiver side is fraught with problems.

I solved it with a dedicated go routine that scans for timed out waiters and expires them by closing the channel, but that meant the sender now needs to handle the rare panic if it sends on a closed channel - not the end of the world, but not as clean.

Roman Kuprov

unread,
Jul 10, 2019, 9:24:42 PM7/10/19
to golang-nuts
Would the Context package not work for this issue?
(kinda new gopher here)


On Wednesday, July 10, 2019 at 1:08:38 PM UTC-6, Dan Eloff wrote:
Yeah, agreed. I've been deep into concurrent programming for a long time now, and into lock-free programming as well which is the most fraught kind of programming I've ever done. Parallel is the future, it has been that way for a long time, but it's only getting more and more obvious.

I think in this specific case, the timeout should have been handled on the sending side in a select,  almost identically to the receiver code I posted. If the timer channel triggers, then close the channel to indicate to the receiver that it can wake up and it has timed out. Then the sender can go ahead and clean up the resource which it still owns. Doing it on the receiver side is fraught with problems.

I solved it with a dedicated go routine that scans for timed out waiters and expires them by closing the channel, but that meant the sender now needs to handle the rare panic if it sends on a closed channel - not the end of the world, but not as clean.

Burak Serdar

unread,
Jul 10, 2019, 9:55:28 PM7/10/19
to Roman Kuprov, golang-nuts
On Wed, Jul 10, 2019 at 3:24 PM Roman Kuprov <kup...@gmail.com> wrote:
>
> Would the Context package not work for this issue?
> (kinda new gopher here)

It is similar to using a cancel channel. Context communicates the
cancellation by closing a channel returned by Done().
> --
> You received this message because you are subscribed to the Google Groups "golang-nuts" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/7a3021d7-3637-46f2-84a2-58b4b338ffa2%40googlegroups.com.
Reply all
Reply to author
Forward
0 new messages