Handling dynamic and unknown number of wait groups?


Inspectre Gadget

Jun 29, 2016, 6:35:31 PM
to golang-nuts
Hey everyone,

Here’s my issue; I’ll try to keep it short:

I have written a program that accepts a URL, spiders that URL’s domain and scheme (http/https), and reports back to the console every input field found along the way. The purpose is largely web application security testing: input fields are the most common vulnerability entry points (sinks), and this program automates that part of the reconnaissance phase.


The issue lies in the last for loop in the main() function. If you run this program, it checks the queue and the workers so frequently that it is bound to hit a moment when there are both no workers working and no URLs in the queue (as shown by the console output statements before it exits). Never mind that network latency exacerbates the problem. The number of URLs actually checked varies on every run, which causes serious inconsistencies and prevents the program from being at all reliable.


I fixed it by removing all concurrency from network requests, leaving it only in the internal HTML processing functions.

So, the question is: how does one run efficient concurrent code when the number of goroutines to wait on is dynamic and unknown at program initialization?

I have tried:
  • Using “worker pools”, which consist of channels of workers. The for loop checks the length of the URL queue and the number of workers available. If the URL queue is empty and all the workers are available, then it exits the loop.
  • Dynamically adding wait groups (wg.Add(1)) every time I pull a URL from the URL queue. I can’t set the wait group numbers before the loop, because I can never know how many URLs are going to be checked.

So I have tried using both channels and wait groups, checked alongside the URL queue length, to determine whether more concurrent network requests are needed. In both cases, the for loop checks the values so fast that it eventually stumbles on a momentarily unsatisfied loop condition and exits. This usually results in the program either hanging as it waits on wait groups that never finish, or exiting prematurely as more URLs are added to the queue after the fact.

I would really like to know if there is a way to actually do this well in Go.

Cheers,

Inspectre 

Ian Lance Taylor

Jun 29, 2016, 9:25:46 PM
to Inspectre Gadget, golang-nuts
First thing I noticed in your code is

for len(urlQueue) > 0 {
... <- urlQueue ...

Never do that; it is racy: len(urlQueue) can be zero at the instant you check even though another goroutine is about to send more URLs. Instead, do

for url := range urlQueue {
}
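
For range to terminate, something must close the channel once no more
URLs will be sent. A minimal runnable sketch of the shape (the seed
list and the print are stand-ins, not the original program):

package main

import "fmt"

func main() {
    seeds := []string{"https://example.com/a", "https://example.com/b"}

    urlQueue := make(chan string)
    go func() {
        defer close(urlQueue) // closing the channel is what ends the range loop
        for _, u := range seeds {
            urlQueue <- u
        }
    }()

    // Exits cleanly once urlQueue is closed and drained; no racy len() check.
    for url := range urlQueue {
        fmt.Println("processing", url)
    }
}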

Ian

Nick Craig-Wood

Jun 30, 2016, 5:32:14 AM
to Inspectre Gadget, golang-nuts
Here is some code I wrote which demonstrates a solution. This is a
recursive concurrent directory lister which I think is an identical
problem to yours - you start with one directory - you find new
directories in the course of listing them which you also want to list,
but you want to limit the concurrency and not return until they are all
done.

https://github.com/ncw/rclone/blob/master/dircache/list.go#L27

I used another WaitGroup: whenever I put anything into the job queue
I added 1 to it, and whenever I removed something I called .Done() on it.
This gives you something to wait on at the end.

Note I use an extra goroutine to put the new directories into the queue
- this is to avoid a deadlock when all the workers are busy trying to do
the same.
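
A stripped-down sketch of that shape (discover is a hypothetical
stand-in for the directory listing; the real code is at the link above):

package main

import (
    "fmt"
    "sync"
)

// discover is a hypothetical stand-in for listing one directory and
// returning the subdirectories found in it.
func discover(dir string) []string {
    if dir == "root" {
        return []string{"root/a", "root/b"}
    }
    return nil
}

func main() {
    jobs := make(chan string, 16)
    var pending sync.WaitGroup // counts queued-but-unfinished jobs

    enqueue := func(dir string) {
        pending.Add(1) // add BEFORE the job becomes visible to a worker
        // The send happens in its own goroutine so that a worker which
        // is itself enqueueing can never deadlock the whole pool.
        go func() { jobs <- dir }()
    }

    var workers sync.WaitGroup
    for i := 0; i < 4; i++ { // limited concurrency
        workers.Add(1)
        go func() {
            defer workers.Done()
            for dir := range jobs {
                for _, sub := range discover(dir) {
                    enqueue(sub) // new work found while working
                }
                pending.Done() // this job is now fully processed
            }
        }()
    }

    enqueue("root")
    pending.Wait() // zero only when everything queued has finished
    close(jobs)    // lets the workers' range loops end
    workers.Wait()
    fmt.Println("all done")
}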


--
Nick Craig-Wood <ni...@craig-wood.com> -- http://www.craig-wood.com/nick

Nick Craig-Wood

Jun 30, 2016, 5:46:00 AM
to Inspectre Gadget, golang-nuts
I forgot to add - don't call `wg.Add(1)` inside the goroutine - it is
racy! If you are trying to start a pool of goroutines, you can find
that none of them gets scheduled before you run the wg.Wait(), so Wait
returns immediately.
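
A minimal illustration of the safe ordering:

package main

import (
    "fmt"
    "sync"
)

func main() {
    urls := []string{"a", "b", "c"}
    var wg sync.WaitGroup
    for _, u := range urls {
        wg.Add(1) // here, in the spawning goroutine...
        go func(u string) {
            defer wg.Done()
            fmt.Println("fetched", u)
        }(u)
        // ...NOT inside the spawned goroutine, where it could run
        // after main has already reached wg.Wait() below.
    }
    wg.Wait()
}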

Chris Randles

Jun 30, 2016, 8:35:32 AM
to golang-nuts
When dealing with an unknown amount of input, I find https://godoc.org/go4.org/syncutil#Gate a useful alternative to sync.WaitGroup.
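
A sketch of how that might look, assuming the NewGate/Start/Done API
shown in the godoc link (the fetch here is a stand-in):

package main

import (
    "fmt"

    "go4.org/syncutil"
)

func main() {
    urls := []string{"https://example.com/a", "https://example.com/b", "https://example.com/c"}

    gate := syncutil.NewGate(2) // at most 2 fetches in flight
    results := make(chan string)

    for _, u := range urls {
        gate.Start() // blocks until a slot is free
        go func(u string) {
            defer gate.Done()
            results <- u // stand-in for the real fetch result
        }(u)
    }

    // We know exactly len(urls) results are coming, so just drain them.
    for range urls {
        fmt.Println("done:", <-results)
    }
}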

Michael Jones

Jun 30, 2016, 12:57:44 PM
to golang-nuts
Inspectre,

Here is a slight modification that may suit your needs:

https://play.golang.org/p/dppJOkPcvG

Change summary:

1. Added a terse option (‘-t’) so I could see the URLs and errors only.
2. Changed the URL queue to a channel so that…
3. …the main loop (lines 139–148) would be clearer (matches Ian’s suggestion)
4. Added a concurrency limit so as not to overwhelm websites and local open-file limits
5. Answered the “how do you know when to stop” question in one of several logical ways.

My answer for #5 was to consider that the processing was done when the initial root URLs had been processed—that is, when the dataRouter call had finished. Each one of these, in turn, is only done when its next-level dataRouter call is finished. So that sets up the necessary structure, and my choice of implementation was to split it this way:

5a. Create a global URLsInProcess wait group
5b. Increment it as each new URL is added in addURL
5c. Decrement it only after each URL is completely processed in dataRouter
5d. Wait for it to reach zero in main

This structure means that the program will wait as long as necessary for each URL to complete. Each is a commitment. The “now completed” logic of (5c) is in a defer so it runs in error cases. However, this makes the program dependent on every attempt completing somehow, so if the URL Get() should fail to time out on a server that dies, the program will hang. Not a likely problem, but not as robust as you would want in production.
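
A skeleton of that structure (addURL and dataRouter are the names from
the change list above; the bodies here are simplified stand-ins, not
the playground code):

package main

import (
    "fmt"
    "sync"
)

var urlsInProcess sync.WaitGroup      // 5a: one global wait group
var urlQueue = make(chan string, 100) // the channel from change 2

// addURL queues a URL for processing (5b: increment on entry).
func addURL(u string) {
    urlsInProcess.Add(1)
    urlQueue <- u
}

// dataRouter fully processes one URL (5c: the decrement is deferred
// so it also runs on error paths).
func dataRouter(u string) {
    defer urlsInProcess.Done()
    fmt.Println("processed", u)
    // ...parse the page here and call addURL for each new link found...
}

func main() {
    go func() {
        for u := range urlQueue {
            go dataRouter(u) // one goroutine per URL, as in the original
        }
    }()
    addURL("https://example.com/")
    urlsInProcess.Wait() // 5d: returns only when every URL completed
}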

Another weakness to consider is that the throttling here (16 concurrent connections) really has two parts. The first is how many make sense on your computer based on OS and process limits as well as numbers of CPUs. The second is how many connections a remote server will tolerate before it decides you are a DoS threat. If you were to implement multiple roots, or follow links off-site, then it would make sense to separate these two limits—expansive local resources could be split across multiple destination servers to keep the per-remote-server load at a friendly limit.*

Finally, I did not change the logic of the program to use a worker pool. It still spawns a goroutine for every URL visited. That is not bad, but neither is it necessary. Making this change would have made changes 1–5 less clear, and the goal here is clarity.

Hope this helps,
Michael

Michael Jones
michae...@gmail.com

* Which is important at Google, for example, since nobody would want 20k+ Google servers to simultaneously crawl their website, no matter how eager they are to be included in the index.

kant...@gmail.com

Nov 2, 2017, 11:40:58 AM
to golang-nuts
I am new to Go, and I have read numerous times in various articles that one can technically create any number of goroutines and the runtime takes care of mapping them to kernel threads. If so, why is WaitGroup designed in a way that makes one specify ahead of time how many goroutines will be created? Why can't WaitGroup be more dynamic, so that I can add a goroutine to the wait group as I create it? To put the question another way: why are people even thinking about pools of goroutines in the first place, given that goroutines are user-level threads? When someone talks about a pool of goroutines I feel like I am back in C++ or any other language that creates pools of kernel threads. Does the Go runtime not keep its promise? Again, I am new, so please enlighten me.

Andy Balholm

Nov 2, 2017, 12:02:06 PM
to kant...@gmail.com, golang-nuts
You can add goroutines to a WaitGroup as you create them. There is nothing that keeps you from calling Add more than once.
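
For example, a minimal sketch where the count is not known up front:

package main

import (
    "fmt"
    "sync"
)

func main() {
    work := make(chan string, 3)
    work <- "x"
    work <- "y"
    work <- "z"
    close(work)

    var wg sync.WaitGroup
    for item := range work { // how many items? we don't need to know
        wg.Add(1) // grow the group one goroutine at a time
        go func(s string) {
            defer wg.Done()
            fmt.Println("handled", s)
        }(item)
    }
    wg.Wait()
}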

Andy


Henrik Johansson

Nov 2, 2017, 12:52:51 PM
to Andy Balholm, kant...@gmail.com, golang-nuts

And technically a WaitGroup is not tracking goroutines but completed things: each goroutine could finish 10 things.

howar...@gmail.com

Nov 2, 2017, 12:54:20 PM
to golang-nuts
You absolutely can use goroutines without waitgroups! It all depends on what you are doing with them. Where waitgroups come in is when you need to do something only after *all* of the goroutines on a specific task are done.

Chances are that if you are not using waitgroups, you are either using channels or you are only interested in the side-effects of the goroutines and are running a long-running service that does not need to have a clear end-point.

For example, there is a spot in one of my code-bases where I fire off a few optimization routines in parallel - each one takes a channel to feed a result into, and I know going in exactly how many results there will be. I am going to be comparing them and keeping the best, so I don't care which routine returns its result first, last, etc. - I simply pull the known number of results out of the channel, knowing that each receive will block until a result arrives. This is a known number of routines though, and a waitgroup would be a viable alternative here.
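
A sketch of that pattern (optimize is a made-up stand-in):

package main

import "fmt"

// optimize is a made-up stand-in for one optimization routine.
func optimize(seed int, results chan<- int) {
    results <- seed * seed
}

func main() {
    const n = 4 // the number of goroutines is known going in
    results := make(chan int)
    for i := 1; i <= n; i++ {
        go optimize(i, results)
    }
    best := <-results
    for i := 1; i < n; i++ { // pull exactly n results; each receive
        if r := <-results; r > best { // blocks until one arrives
            best = r
        }
    }
    fmt.Println("best:", best)
}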

I also have two other kinds of programs that don't use waitgroups. One sort builds a chain of channels with a goroutine pushing data through, and all the other goroutines just flow data through the channel using range until the channel closes, then they evaporate. That doesn't need a waitgroup because the goroutines aren't working on pieces of the data that have to be put back together at the end; instead each operates on *all* the data as it flows through. The other sort fires off a goroutine in response to an incoming connection, handles that connection, and then goes away. No need for waitgroups there, unless they in turn fire off a collective operation.

A waitgroup is just a way of saying, "I need to be able to wait until all these goroutines have done their stuff and finished." If you don't need to wait for a group of them to finish, you don't need a waitgroup.

David Collier-Brown

Nov 2, 2017, 9:02:05 PM
to golang-nuts
You don't even need waitgroups if the main() function is the last stage in the pipeline: cf. https://leaflessca.wordpress.com/2017/01/04/forwards-and-backwords-pipes-in-go/
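
A minimal sketch of that shape:

package main

import "fmt"

func main() {
    nums := make(chan int)
    squares := make(chan int)

    go func() { // first stage: produce
        defer close(nums)
        for i := 1; i <= 3; i++ {
            nums <- i
        }
    }()
    go func() { // middle stage: transform
        defer close(squares)
        for n := range nums {
            squares <- n * n
        }
    }()
    // main is the last stage: when this range ends, every upstream
    // goroutine has already finished - no WaitGroup required.
    for s := range squares {
        fmt.Println(s)
    }
}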
