I've think about two methods. Both are not Go idiomatic methods.Method 1: declare a global variable.When there is new url requires to get, increase the variable.When job done, decrease the variable.But, it require lock.
Method 2: use channel, but don't know how many signal should I wait.
func Crawl(url string, depth int, fetcher Fetcher, ok chan bool) {// ... omit ...
for _, u := range urls {// Crawl(u, depth-1, fetcher)
go Crawl(u, depth-1, fetcher, ch) // I just think use 'go' keyword here, but can NOT work
}ok <- true
return}main() {okch := make(chan bool)Crawl("http://first_url", 4, fetcher, okch)for i:=0; i<n; i++ { // I don't know how to get correct 'n' here
<-okch}}
I've put it here (as to not spoil the fun for everyone). I'd like some
feedback particularly regarding possible race for seen[url]
setting/checking and increment/decrement of crawlers.
https://docs.google.com/document/d/1UXNMRe3pCmVMHwjSWY7TXWLEoTsrLRluxuJFGsUlgek/edit?hl=en_US
-Skip
I've put it here (as to not spoil the fun for everyone). I'd like some
feedback particularly regarding possible race for seen[url]
setting/checking and increment/decrement of crawlers.
https://docs.google.com/document/d/1UXNMRe3pCmVMHwjSWY7TXWLEoTsrLRluxuJFGsUlgek/edit?hl=en_US
I believe I have a race-free, mutex-free solution in the spirit of Go.
It sounds like it is similar to Kyle Lemons' with respect to the
mainline function being the sole manipulator of the map. This solution
also exits cleanly when there are no more URLs to fetch.
Here's mine, btw: <https://gist.github.com/1396075>
On Oct 15, 5:07 pm, Kyle Lemons <kev...@google.com> wrote:
> > I believe I have a race-free, mutex-free solution in the spirit of Go.
> > It sounds like it is similar to Kyle Lemons' with respect to the
> > mainline function being the sole manipulator of the map. This solution
> > also exits cleanly when there are no more URLs to fetch.
>
> Yep, that's very similar to mine <https://gist.github.com/1269252>. The
> I came up with a slight different solution to this and I'm wondering
> if there's anything in it that makes it less idiomatic than the
> (admittedly simpler) solution that Kyle posted at <https://
(...)
> and use a wrapper (like getPage() in Kyle's implementation) when you
> want to use them asynchronously. Is that really the case?
Not necessarily. It's also a common practice to send a channel to get
results back as you did in your implementation. Both yours and Kyle's
are in similar ground in terms of idioms.
In terms of the algorithm itself, your version has a slight problem in
that it is firing a fetcher for all of the urls found at each depth of
the iteration at once. This could quickly go to pretty excessive
levels. Kyle's version also doesn't have a good hold on the amount of
concurrency going on, but it bounds to an individual page at least. A
real algorithm for something like that would control the amount of
concurrency a bit more tightly.
--
Gustavo Niemeyer
http://niemeyer.net
http://niemeyer.net/plus
http://niemeyer.net/twitter
http://niemeyer.net/blog
-- I'm not absolutely sure of anything.
Some time ago, I posted here a version which is similar to the
above and limits the number of concurrent fetches.
For the record, there's a goplay version here:
Hey Salgado,
(...)
> I came up with a slight different solution to this and I'm wondering
> if there's anything in it that makes it less idiomatic than the
> (admittedly simpler) solution that Kyle posted at <https://
> and use a wrapper (like getPage() in Kyle's implementation) when youNot necessarily. It's also a common practice to send a channel to get
> want to use them asynchronously. Is that really the case?
results back as you did in your implementation. Both yours and Kyle's
are in similar ground in terms of idioms.
In terms of the algorithm itself, your version has a slight problem in
that it is firing a fetcher for all of the urls found at each depth of
the iteration at once. This could quickly go to pretty excessive
levels. Kyle's version also doesn't have a good hold on the amount of
concurrency going on, but it bounds to an individual page at least. A
real algorithm for something like that would control the amount of
concurrency a bit more tightly.
Well, I was told goroutines are cheap, so I thought I'd let them go crazy! :P
I was under the impression that Kyle's version doesn't have any bounds whatsoever given that as soon as there's a response on the channel it fires one goroutine for each URL found. But maybe I missed something?