Go Tour #69 how to use "go Crawl"

495 views
Skip to first unread message

dlin

unread,
Oct 6, 2011, 7:43:32 PM10/6/11
to golan...@googlegroups.com
http://go-tour.appspot.com/#69

func Crawl(url string, depth int, fetcher Fetcher) {
for _, u := range urls {
// Crawl(u, depth-1, fetcher)
                go Crawl(u, depth-1, fetcher)  // I just think use 'go' keyword here, but can NOT work
}
return
}

Sonia Keys

unread,
Oct 6, 2011, 8:01:19 PM10/6/11
to golan...@googlegroups.com
Hint:  You will need a way to get your main goroutine to wait for these parallel goroutines to complete.  Otherwise the whole program terminates immediately without waiting for the answers to come in.

dlin

unread,
Oct 6, 2011, 8:47:04 PM10/6/11
to golan...@googlegroups.com
I've think about two methods.  Both are not Go idiomatic methods.

Method 1: declare a global variable.
When there is new url requires to get, increase the variable.
When job done, decrease the variable.
But, it require lock.

Method 2: use channel, but don't know how many signal should I wait.

func Crawl(url string, depth int, fetcher Fetcher, ok chan bool) {
        // ... omit ...
for _, u := range urls {
// Crawl(u, depth-1, fetcher)
                go Crawl(u, depth-1, fetcher, ch)  // I just think use 'go' keyword here, but can NOT work
}
       ok <- true
return
}

main() {
   okch := make(chan bool)
    Crawl("http://first_url", 4, fetcher, okch)
    for i:=0; i<n; i++ {   // I don't know how to get correct 'n' here
      <-okch
   }
}

Kyle Lemons

unread,
Oct 6, 2011, 9:38:24 PM10/6/11
to golan...@googlegroups.com
I've think about two methods.  Both are not Go idiomatic methods.

Method 1: declare a global variable.
When there is new url requires to get, increase the variable.
When job done, decrease the variable.
But, it require lock.

You don't need a global variable if you do it with a closure.  You would require a lock, though.
 
Method 2: use channel, but don't know how many signal should I wait.

When you're not sure how many "done" signals are necessary, often using sync.WaitGroup can be helpful.  In this case though, there are some even more fun options.
 
func Crawl(url string, depth int, fetcher Fetcher, ok chan bool) {
        // ... omit ...
// <- len(urls) 
for _, u := range urls {
// Crawl(u, depth-1, fetcher)
                go Crawl(u, depth-1, fetcher, ch)  // I just think use 'go' keyword here, but can NOT work
}
       ok <- true
// <- -1 
return
}

main() {
   okch := make(chan bool)
    Crawl("http://first_url", 4, fetcher, okch)
    for i:=0; i<n; i++ {   // I don't know how to get correct 'n' here

sum := 1
for diff := range active { sum += diff; if diff == 0 { break } }
 
      <-okch
   }
}

Instead of having a done chan bool, you could have an active chan int, and send +1 before you call a new fetcher and send -1 at the end of a fetcher (marked above in your code), then your loop is just waiting for the sum of all values sent to be zero.

One reason I don't suggest using a waitgroup in this instance is that you will probably want to use channels for controlling what URLs get fetched (to avoid duplicates).  If you went with WaitGroup, you would probably just interlock the map.

~K

Skip Tavakkolian

unread,
Oct 7, 2011, 2:14:23 AM10/7/11
to Kyle Lemons, golan...@googlegroups.com
My solution followed a similar "convergent evolution" -- to paraphrase Rob.

I've put it here (as to not spoil the fun for everyone). I'd like some
feedback particularly regarding possible race for seen[url]
setting/checking and increment/decrement of crawlers.

https://docs.google.com/document/d/1UXNMRe3pCmVMHwjSWY7TXWLEoTsrLRluxuJFGsUlgek/edit?hl=en_US

-Skip

Kyle Lemons

unread,
Oct 7, 2011, 2:30:39 AM10/7/11
to Skip Tavakkolian, golan...@googlegroups.com
I've put it here (as to not spoil the fun for everyone). I'd like some
feedback particularly regarding possible race for seen[url]
setting/checking and increment/decrement of crawlers.

https://docs.google.com/document/d/1UXNMRe3pCmVMHwjSWY7TXWLEoTsrLRluxuJFGsUlgek/edit?hl=en_US

Your use of the seen map is racy; to work properly in a multithreaded environment, it would need to be interlocked (or redesigned).  Also, you check that the channel is not closed on each receive, even though you don't close it. I personally used the mainline function as the owner goroutine of the map, being the only goroutine spawning off workers to fetch the URLs in parallel, receiving back from them any URLs they find.  Only new URLs are turned around and spawned off as workers again.  In this way, I avoided having multiple goroutines modifying the map.
~K

Skip Tavakkolian

unread,
Oct 7, 2011, 4:53:35 AM10/7/11
to Kyle Lemons, golan...@googlegroups.com
thanks. yes, i see it. your approach is correct; when i tried
GOMAXPROCS > 1 with mine it became very obvious. the document now has
both versions.

dlin

unread,
Oct 8, 2011, 12:54:53 PM10/8/11
to golan...@googlegroups.com
I try to write again.

throw: all goroutines are asleep - deadlock!

My wrong code here.

func Crawl(url string, depth int, fetcher Fetcher) {
    reqch := make(chan int)
    seen := make(map[string]bool)
    var crawl func(string, int)
    crawl = func(url string, depth int) {
        defer func(){reqch <- -1}()
        if depth <= 0 {
            return
        }
        if _,ok := seen[url] ; ok {
            return
        }
        seen[url] = true
        body, urls, err := fetcher.Fetch(url)
        if err != nil {
            fmt.Println(err)
            return
        }
        fmt.Printf("found: %s %q\n", url, body)
        reqch <- len(urls)
        for _, u := range urls {
            go crawl(u, depth-1)
        }
    }
    go crawl(url, depth)
    actsum := 1
    for diff :=range reqch {
        actsum += diff
        if actsum == 0 {
            break
        }
    }
}

dlin

unread,
Oct 8, 2011, 7:43:54 PM10/8/11
to golan...@googlegroups.com
It is very strange,  I recompiled this code again this morning.
GOMAXPROCS=2 ./a69 # works
./a69 # works.

But, I'm wonder then 'seen' is just a simple map without lock.  Will it crash at random time?

Eric I.

unread,
Oct 15, 2011, 12:59:06 AM10/15/11
to golang-nuts
I believe I have a race-free, mutex-free solution in the spirit of Go.
It sounds like it is similar to Kyle Lemons' with respect to the
mainline function being the sole manipulator of the map. This solution
also exits cleanly when there are no more URLs to fetch.

https://docs.google.com/document/d/12srV-uSQqmSmqrAGRVe7_5SRV5oD3vAEBxoV8MJsjzE/edit?hl=en_US

Eric

Kyle Lemons

unread,
Oct 15, 2011, 4:07:10 PM10/15/11
to Eric I., golang-nuts
I believe I have a race-free, mutex-free solution in the spirit of Go.
It sounds like it is similar to Kyle Lemons' with respect to the
mainline function being the sole manipulator of the map. This solution
also exits cleanly when there are no more URLs to fetch.

Yep, that's very similar to mine.  The contents of your Crawl are similar to the closure in my case, and the contents of your main are similar to my Crawl (I wanted to insulate the caller from having to know anything about how Crawl does what it does).

Guilherme Salgado

unread,
Nov 26, 2011, 1:37:06 PM11/26/11
to golang-nuts
I came up with a slight different solution to this and I'm wondering
if there's anything in it that makes it less idiomatic than the
(admittedly simpler) solution that Kyle posted at <https://
gist.github.com/1269252>. Basically, in my implementation, for every
depth level it fires one goroutine/channel for every URL found, waits
for all of them to complete and assembles a new slice, for the next
iteration, containing only the URLs that haven't been fetched yet.
Besides that, I've also changed the Fetch() API to take a channel as
argument and put the results there, but I'm under the impression it's
more common in Go code to implement network operations synchronously
and use a wrapper (like getPage() in Kyle's implementation) when you
want to use them asynchronously. Is that really the case?

Here's mine, btw: <https://gist.github.com/1396075>


On Oct 15, 5:07 pm, Kyle Lemons <kev...@google.com> wrote:
> > I believe I have a race-free, mutex-free solution in the spirit of Go.
> > It sounds like it is similar to Kyle Lemons' with respect to the
> > mainline function being the sole manipulator of the map. This solution
> > also exits cleanly when there are no more URLs to fetch.
>

> Yep, that's very similar to mine <https://gist.github.com/1269252>.  The

Gustavo Niemeyer

unread,
Nov 28, 2011, 6:23:45 AM11/28/11
to Guilherme Salgado, golang-nuts
Hey Salgado,

> I came up with a slight different solution to this and I'm wondering
> if there's anything in it that makes it less idiomatic than the
> (admittedly simpler) solution that Kyle posted at <https://

(...)


> and use a wrapper (like getPage() in Kyle's implementation) when you
> want to use them asynchronously. Is that really the case?

Not necessarily. It's also a common practice to send a channel to get
results back as you did in your implementation. Both yours and Kyle's
are in similar ground in terms of idioms.

In terms of the algorithm itself, your version has a slight problem in
that it is firing a fetcher for all of the urls found at each depth of
the iteration at once. This could quickly go to pretty excessive
levels. Kyle's version also doesn't have a good hold on the amount of
concurrency going on, but it bounds to an individual page at least. A
real algorithm for something like that would control the amount of
concurrency a bit more tightly.


--
Gustavo Niemeyer
http://niemeyer.net
http://niemeyer.net/plus
http://niemeyer.net/twitter
http://niemeyer.net/blog

-- I'm not absolutely sure of anything.

roger peppe

unread,
Nov 28, 2011, 6:48:38 AM11/28/11
to Gustavo Niemeyer, Guilherme Salgado, golang-nuts
On 28 November 2011 11:23, Gustavo Niemeyer <gus...@niemeyer.net> wrote:
> Hey Salgado,
>
>> I came up with a slight different solution to this and I'm wondering
>> if there's anything in it that makes it less idiomatic than the
>> (admittedly simpler) solution that Kyle posted at <https://
> (...)
>> and use a wrapper (like getPage() in Kyle's implementation) when you
>> want to use them asynchronously. Is that really the case?
>
> Not necessarily. It's also a common practice to send a channel to get
> results back as you did in your implementation. Both yours and Kyle's
> are in similar ground in terms of idioms.
>
> In terms of the algorithm itself, your version has a slight problem in
> that it is firing a fetcher for all of the urls found at each depth of
> the iteration at once. This could quickly go to pretty excessive
> levels. Kyle's version also doesn't have a good hold on the amount of
> concurrency going on, but it bounds to an individual page at least. A
> real algorithm for something like that would control the amount of
> concurrency a bit more tightly.

Some time ago, I posted here a version which is similar to the
above and limits the number of concurrent fetches.
For the record, there's a goplay version here:

http://play.golang.org/p/O6nQH3E_Z8

Guilherme Salgado

unread,
Dec 1, 2011, 2:19:11 PM12/1/11
to Gustavo Niemeyer, golang-nuts
Hi Gustavo!

On Mon, Nov 28, 2011 at 8:23 AM, Gustavo Niemeyer <gus...@niemeyer.net> wrote:
Hey Salgado,

> I came up with a slight different solution to this and I'm wondering
> if there's anything in it that makes it less idiomatic than the
> (admittedly simpler) solution that Kyle posted at <https://
(...)
> and use a wrapper (like getPage() in Kyle's implementation) when you
> want to use them asynchronously. Is that really the case?

Not necessarily. It's also a common practice to send a channel to get
results back as you did in your implementation. Both yours and Kyle's
are in similar ground in terms of idioms.

Oh, cool, I don't know where I got that idea from.
 

In terms of the algorithm itself, your version has a slight problem in
that it is firing a fetcher for all of the urls found at each depth of
the iteration at once. This could quickly go to pretty excessive

Well, I was told goroutines are cheap, so I thought I'd let them go crazy! :P

Joking aside, yes, I knew it'd fire one fetcher for every url found on a given depth level and that this is probably not acceptable on anything but toy scripts like this one.

levels. Kyle's version also doesn't have a good hold on the amount of
concurrency going on, but it bounds to an individual page at least. A
real algorithm for something like that would control the amount of
concurrency a bit more tightly.
 
I was under the impression that Kyle's version doesn't have any bounds whatsoever given that as soon as there's a response on the channel it fires one goroutine for each URL found. But maybe I missed something?

Cheers,
Guilherme

Kyle Lemons

unread,
Dec 1, 2011, 9:06:52 PM12/1/11
to Guilherme Salgado, Gustavo Niemeyer, golang-nuts
Well, I was told goroutines are cheap, so I thought I'd let them go crazy! :P

The goroutines are cheap.  10k goroutines is fine, but you probably don't want 10k HTTP requests going on simultaneously.
 
I was under the impression that Kyle's version doesn't have any bounds whatsoever given that as soon as there's a response on the channel it fires one goroutine for each URL found. But maybe I missed something?

Ours have similar checks, but at different times.  I think I can construct pathological cases for both.  In a real system, you'd want to limit the number of outgoing HTTP requests in both cases.

~K 

dlin

unread,
Jun 9, 2013, 9:25:08 PM6/9/13
to golan...@googlegroups.com
For Go 1.1, here is a updated version of rog's code in previous post.

Reply all
Reply to author
Forward
0 new messages