Re: [go-nuts] Good patterns for fast->slow process concurrency

Jeremy Wall

Feb 22, 2013, 10:17:34 AM
to ohoo...@gmail.com, golang-nuts


On Feb 22, 2013 8:39 AM, <ohoo...@gmail.com> wrote:
>
> Hi all,
>
> I've read and watched the various talks on Go concurrency and generally I have not had a big problem with it until now. I suspect it is because all of the programs I have been writing up until this point have a pattern where the slowest stage occurs first, and subsequent stages are faster so I know that there will never be any back-pressure from workers further in the chain. Now I have to write something that I guess is somewhat similar to the spidering problem, but I'm having trouble figuring out how to adequately implement it with both good performance and safety.
>
> Here is a working proof of concept that still suffers from at least one problem: https://gist.github.com/ohookins/5013420
>
> Basically the problem is that I have some large number of URLs to hit, way more than I can just throw goroutines at in one loop. If I try to do that, some part of the program hits operating system limitations (some searching I did suggests socket limits while doing DNS lookups, but I'm not sure) and the whole thing blows up. I want to have a small number of workers processing the requests, but can't figure out how to achieve these goals:
> - cycle through the workers to find an available one for the next query

Send messages to your workers using a buffered channel.

> - wait for all the workers to finish before we proceed/exit/etc
> - ensure that no more HTTP client requests than the OS can tolerate for socket limitations are *actually* active

The buffer size will control how many items of work can be waiting to start.

Only start as many goroutines as your buffer size. That will throttle the amount of work in flight. Signal that there is no more work by closing the channel.

Use sync.WaitGroup to wait on your workers before exiting.
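
Roughly this shape, as an untested sketch (the worker count and URLs are made up):

package main

import (
	"fmt"
	"sync"
)

func main() {
	const numWorkers = 10
	jobs := make(chan string, numWorkers) // buffered; sends block once it fills

	var wg sync.WaitGroup
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for url := range jobs { // loop ends when jobs is closed and drained
				fmt.Println("fetching", url)
			}
		}()
	}

	for _, url := range []string{"http://localhost/a", "http://localhost/b"} {
		jobs <- url
	}
	close(jobs) // signal no more work
	wg.Wait()   // wait for the workers before exiting
}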

>
> This last one is the most tricky for me to think of a solution for. My test just uses very simple requests to a local http server, but I can't tell if all the workers are being utilised, or just a small number which are finishing their work rapidly - it's not clear to me what algorithm Go might be using to send the requests to the workers' input channel.
>
> Sorry for the long-winded question, and please feel free to point me in the direction of some existing good write-ups if you know of them.
>
> Best Regards,
> Oliver
>

Oliver

Feb 22, 2013, 11:18:03 AM
to Jeremy Wall, golang-nuts
On 22 February 2013 16:17, Jeremy Wall <jw...@google.com> wrote:


On Feb 22, 2013 8:39 AM, <ohoo...@gmail.com> wrote:
> [...]
> - cycle through the workers to find an available one for the next query

Send messages to your workers using a buffered channel.


I think I didn't phrase my question very well. The work units come in batches of 1000, so there is nothing for any of the workers to do until a batch arrives. Once it does, the items are stored in a slice and are all available immediately, as fast as the workers can process them, so buffering the channel won't do anything but let me dequeue a few more items from the slice at a time.

Just as in my gist, where I have just a single (static) URL to retrieve, the URLs I want to process in the real program are for all intents and purposes already there and ready when the workers fire up. So it is really a problem of distributing that work, which is all immediately available, to a smaller number of workers.
 

> - wait for all the workers to finish before we proceed/exit/etc
> - ensure that no more HTTP client requests than the OS can tolerate for socket limitations are *actually* active

The buffer size will control how many items of work can be waiting to start.

Only start as many goroutines as your buffer size. That will throttle the amount of work in flight. Signal that there is no more work by closing the channel.

Use sync.WaitGroup to wait on your workers before exiting.


This I like. I've been using a done channel (perhaps excessively) before and sync.WaitGroup seems a lot cleaner. Thanks for that suggestion! 

Jeremy Wall

Feb 22, 2013, 11:35:13 AM
to Oliver, golang-nuts
On Fri, Feb 22, 2013 at 10:18 AM, Oliver <ohoo...@gmail.com> wrote:

[...]

I think I didn't phrase my question very well. The work units come in batches of 1000, so there is nothing for any of the workers to do until a batch arrives. Once it does, the items are stored in a slice and are all available immediately, as fast as the workers can process them, so buffering the channel won't do anything but let me dequeue a few more items from the slice at a time.

You could change the architecture to use a buffered channel instead of a slice to batch the work for the workers. The combination of a channel and a limited group of goroutines gives you a natural throttle. Buffering the channel is just an optimization that lets you send up to the buffer size of work without blocking until a receiver is ready. Leaving it out will still keep the throttle and automatically select from the ready goroutines for you.
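
In code, the dispatch side is just a loop draining the slice. A sketch, assuming a jobs channel and workers like I described earlier:

// sketch: the batch stays a slice; you just drain it into the channel
func dispatch(urls []string, jobs chan<- string) {
	for _, u := range urls {
		jobs <- u // blocks whenever every worker is busy, which is the throttle
	}
	close(jobs) // lets the workers' range loops finish
}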


Just as in my gist, where I have just a single (static) URL to retrieve, the URLs I want to process in the real program are for all intents and purposes already there and ready when the workers fire up. So it is really a problem of distributing that work, which is all immediately available, to a smaller number of workers.

I was on my phone on a train so I didn't really look at your gist :-)
 
 

[...]


This I like. I've been using a done channel (perhaps excessively) before and sync.WaitGroup seems a lot cleaner. Thanks for that suggestion!

You're welcome.

Oliver

Feb 22, 2013, 2:33:26 PM
to Jeremy Wall, golang-nuts
On 22 February 2013 17:35, Jeremy Wall <jw...@google.com> wrote:

[...]

You could change the architecture to use a buffered channel instead of a slice to batch the work for the workers. The combination of a channel and a limited group of goroutines gives you a natural throttle. Buffering the channel is just an optimization that lets you send up to the buffer size of work without blocking until a receiver is ready. Leaving it out will still keep the throttle and automatically select from the ready goroutines for you.

Again, I probably should have been clearer on this. The input comes from deserializing the XML body of another HTTP response from another part of the program. The 1000-entry slice is actually in the struct deserialized from the XML, so it will be in slice form no matter what I do.
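
For illustration only, the decode looks roughly like this; the element names here are invented:

import (
	"encoding/xml"
	"io"
)

// hypothetical shape of the struct the XML body decodes into
type Batch struct {
	URLs []string `xml:"entry>url"`
}

func decodeBatch(body io.Reader) ([]string, error) {
	var b Batch
	if err := xml.NewDecoder(body).Decode(&b); err != nil {
		return nil, err
	}
	return b.URLs, nil // the ~1000-entry slice, ready to hand to the workers
}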

I'm still not sure what I gain from the buffered channel, aside from freeing up the loop that reads from the slice to the channel a little early. I'm choosing to wait until the entire batch is processed by the slower routines before retrieving another batch, so it can't finish up and return early. Perhaps I'm having trouble visualising the flow.

Tamás Gulácsi

Feb 22, 2013, 3:11:38 PM
to golan...@googlegroups.com
I think you are right that a channel is unneeded for the data, but it is good for the job dispatch! Otherwise you'd need a mutex on the slice for the goroutines taking URLs all at once.

An unbuffered channel makes an easy idiomatic dispatch queue.
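
For contrast, a rough sketch of the locking you'd otherwise need (types invented, import "sync" assumed):

// workers pull the next index off one shared slice under a mutex
type urlQueue struct {
	mu   sync.Mutex
	urls []string
	next int
}

func (q *urlQueue) take() (string, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.next >= len(q.urls) {
		return "", false
	}
	u := q.urls[q.next]
	q.next++
	return u, true
}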

Jeremy Wall

Feb 22, 2013, 3:44:49 PM
to Tamás Gulácsi, golang-nuts
On Fri, Feb 22, 2013 at 2:11 PM, Tamás Gulácsi <tgula...@gmail.com> wrote:
I think you are right that a channel is unneeded for the data, but it is good for the job dispatch! Otherwise you'd need a mutex on the slice for the goroutines taking URLs all at once.

An unbuffered channel makes an easy idiomatic dispatch queue.

Exactly.

Hǎiliàng

Feb 23, 2013, 2:29:13 AM
to golan...@googlegroups.com, ohoo...@gmail.com
What you're trying to do is quite similar to my attempt a few days ago.

Just search for "First attempt of programming concurrency" in golang-nuts, and see https://gist.github.com/hailiang/4723260.

Finally I have chosen this method:
1. Use a single unbuffered channel to dispatch the jobs, with multiple worker goroutines receiving from it.
2. Launch a limited number of goroutines up front, rather than one per job as each arrives.
3. Close the input channel to tell all the workers to exit their for-range loops.
4. Use sync.WaitGroup to wait for all those workers to finish.
5. http.Client is safe for concurrent use, so one instance can be shared by all the worker goroutines.
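
Putting those five points together, a minimal sketch (the worker count and URLs are placeholders):

package main

import (
	"fmt"
	"net/http"
	"sync"
)

func main() {
	urls := []string{"http://localhost:8080/"} // stand-in for the real batch

	jobs := make(chan string) // 1. single unbuffered dispatch channel
	client := &http.Client{}  // 5. one shared client, safe for concurrent use

	const numWorkers = 10
	var wg sync.WaitGroup
	for i := 0; i < numWorkers; i++ { // 2. fixed set of workers, launched up front
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs { // 3. range exits when jobs is closed
				resp, err := client.Get(u)
				if err != nil {
					fmt.Println(err)
					continue
				}
				resp.Body.Close()
				fmt.Println(u, resp.Status)
			}
		}()
	}

	for _, u := range urls {
		jobs <- u
	}
	close(jobs) // 3. notify the workers there is no more work
	wg.Wait()   // 4. wait for them all to finish
}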

Hǎiliàng


On Friday, February 22, 2013 10:15:07 PM UTC+8, ohoo...@gmail.com wrote:
[...]

Oliver

Feb 23, 2013, 10:07:52 AM
to Hǎiliàng, golang-nuts
On 23 February 2013 08:29, Hǎiliàng <hwan...@gmail.com> wrote:
[...]

This is exactly how I'm doing it now, and the code is concise enough that it feels like the right way to do it. Thanks for the additional information on your implementation!