i think this is quite a similar solution to the one i came up with:
type result struct {
	url   string
	body  string
	urls  []string
	depth int
	err   error
}

func fetch(url string, depth int, r chan<- result) {
	body, urls, err := fetcher.Fetch(url)
	r <- result{url, body, urls, depth, err}
}
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher) {
	m := map[string]bool{url: true}
	rc := make(chan result)
	go fetch(url, depth, rc)
	nproc := 1
	for nproc > 0 {
		r := <-rc
		nproc--
		if r.depth > 0 {
			for _, u := range r.urls {
				if !m[u] {
					go fetch(u, r.depth-1, rc)
					nproc++
					m[u] = true
				}
			}
		}
		if r.err != nil {
			fmt.Println(r.err)
		} else {
			fmt.Printf("found: %s %q\n", r.url, r.body)
		}
	}
}
there's a straightforward idiom that you can use if you
want to limit the number of concurrently running goroutines:
use a buffered channel with as much buffer space as the
number of goroutines you want running.
since the type struct{} takes no space in memory, the buffer
can be as large as you like without worrying about the
space used.
you can use this in one of two ways - start full or start empty.
to start full, you first prime the buffered channel by filling
it with values. this is the more generally useful technique,
as the values themselves can carry state (for instance
a thread-id, some cache context, or in this case perhaps
an http.Client instance).
// starting full
var token = make(chan Token, maxprocs)

func init() {
	for i := 0; i < maxprocs; i++ {
		token <- someToken(i)
	}
}

// to start a process
t := <-token
go func() {
	doSomething(t)
	token <- t
}()
in my code, i would probably use the start empty approach
as it's a few fewer lines of code. i'd just change my fetch function
to look something like this:
var limiter = make(chan struct{}, 10)

func fetch(url string, depth int, r chan<- result) {
	limiter <- struct{}{}
	go func() {
		body, urls, err := fetcher.Fetch(url)
		<-limiter
		r <- result{url, body, urls, depth, err}
	}()
}
then in the body of Crawl, i'd call fetch directly rather than
doing go fetch(...).
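for concreteness, here's roughly what that looks like; it's the same
Crawl as above except that the two go fetch(...) calls become plain
fetch(...) calls, so the send on limiter happens before any new
goroutine is started (just a sketch):

// sketch: same Crawl as before, but calling the limited fetch directly;
// the limiter send inside fetch now blocks this loop whenever ten
// fetches are already in flight.
func Crawl(url string, depth int, fetcher Fetcher) {
	m := map[string]bool{url: true}
	rc := make(chan result)
	fetch(url, depth, rc)
	nproc := 1
	for nproc > 0 {
		r := <-rc
		nproc--
		if r.depth > 0 {
			for _, u := range r.urls {
				if !m[u] {
					fetch(u, r.depth-1, rc)
					nproc++
					m[u] = true
				}
			}
		}
		if r.err != nil {
			fmt.Println(r.err)
		} else {
			fmt.Printf("found: %s %q\n", r.url, r.body)
		}
	}
}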
i'd be interested to see a "definitive" solution written
by someone in the Go team.
I have rewritten the Crawl function. This time, only a limited number
of urls can get fetched at a time. However, there is still no limit on
the number of goroutines that could be started. Is that a problem? If
so, how to fix this?
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher) {
	type moreUrls struct {
		depth int
		urls  []string
	}
	more := make(chan moreUrls)
	getPage := func(url string, depth int) {
		body, urls, err := fetcher.Fetch(url)
		if err != nil {
			fmt.Println(err)
		} else {
			fmt.Printf("found[%d:%s] %q\n", depth, url, body)
		}
		more <- moreUrls{depth + 1, urls}
	}
	outstanding := 1
	go getPage(url, 0)
	visited := map[string]bool{url: true}
	for outstanding > 0 {
		next := <-more
		outstanding--
		if next.depth > depth {
			continue
		}
		for _, url := range next.urls {
			if _, seen := visited[url]; seen {
				continue
			}
			visited[url] = true
			outstanding++
			go getPage(url, next.depth)
		}
	}
}
func main() {}
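one way to put a hard bound on the number of goroutines themselves
(rather than just on the number of concurrent fetches) is a small fixed
pool of worker goroutines reading jobs from a channel, with the main
loop keeping the queue of pending urls. this is only a rough sketch of
that idea, not a definitive solution; nworkers, job and pageResult are
made-up names, and depth counts down here as in the earlier version:

func Crawl(url string, depth int, fetcher Fetcher) {
	type job struct {
		url   string
		depth int // remaining depth, counting down
	}
	type pageResult struct {
		depth int
		urls  []string
	}
	const nworkers = 10
	work := make(chan job)
	results := make(chan pageResult)

	// fixed pool of workers: at most nworkers fetch goroutines ever exist.
	for i := 0; i < nworkers; i++ {
		go func() {
			for j := range work {
				body, urls, err := fetcher.Fetch(j.url)
				if err != nil {
					fmt.Println(err)
				} else {
					fmt.Printf("found: %s %q\n", j.url, body)
				}
				results <- pageResult{j.depth, urls}
			}
		}()
	}

	visited := map[string]bool{url: true}
	pending := []job{{url, depth}}
	inflight := 0
	for len(pending) > 0 || inflight > 0 {
		// only enable the send case when there is a pending job;
		// a send on a nil channel blocks forever, so the select
		// then just waits for a result.
		var send chan job
		var next job
		if len(pending) > 0 {
			send = work
			next = pending[0]
		}
		select {
		case send <- next:
			pending = pending[1:]
			inflight++
		case r := <-results:
			inflight--
			if r.depth > 0 {
				for _, u := range r.urls {
					if !visited[u] {
						visited[u] = true
						pending = append(pending, job{u, r.depth - 1})
					}
				}
			}
		}
	}
	close(work)
}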
Kyle, do you mean that the code I posted will generate unbounded parallel fetches?
I've used a buffered channel, so I don't see why that would happen.