A Tour of Go #69 Exercise: Web Crawler -- Too many goroutines?


Peter Kleiweg

Oct 8, 2011, 5:15:00 PM
to golang-nuts
Below is my solution to the Web Crawler exercise of A Tour of Go.

This works, but I'm not sure it is correct. A new goroutine is started
for each URL to fetch. The number of goroutines stays limited only
because the search depth is limited. But what if this were a real
web crawler? The number of simultaneous goroutines would keep
growing. I presume there is a limit to how many goroutines you can
start?

I welcome any comments on my solution. This is the first time I tried
something like this.


// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher) {
	// DONE: Fetch URLs in parallel.
	// DONE: Don't fetch the same URL twice.

	if depth < 1 {
		return
	}

	type Target struct {
		url   string
		depth int
	}

	targets := make(chan *Target)

	seen := make(map[string]bool)

	// This marks the end of a series of new targets
	eot := &Target{depth: -1}

	go func() {
		targets <- &Target{url, depth}
		targets <- eot
	}()

	for counter := 1; counter > 0; {
		t := <-targets
		if t.depth < 0 {
			counter--
		} else {
			if !seen[t.url] {
				seen[t.url] = true
				counter++
				go func() {
					body, urls, err := fetcher.Fetch(t.url)
					if err != nil {
						fmt.Println(err)
					} else {
						fmt.Printf("found: %s %q\n", t.url, body)
						if d := t.depth - 1; d > 0 {
							for _, u := range urls {
								targets <- &Target{u, d}
							}
						}
					}
					targets <- eot
				}()
			}
		}
	}
	return
}

roger peppe

Oct 9, 2011, 4:55:23 AM
to Peter Kleiweg, golang-nuts

i think this is quite a similar solution to the one i came up with:

type result struct {
	url   string
	body  string
	urls  []string
	depth int
	err   os.Error
}

func fetch(url string, depth int, r chan<- result) {
	body, urls, err := fetcher.Fetch(url)
	r <- result{url, body, urls, depth, err}
}

// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher) {
	m := map[string]bool{url: true}
	rc := make(chan result)

	go fetch(url, depth, rc)
	nproc := 1
	for nproc > 0 {
		r := <-rc
		nproc--
		if r.depth > 0 {
			for _, u := range r.urls {
				if !m[u] {
					go fetch(u, r.depth-1, rc)
					nproc++
					m[u] = true
				}
			}
		}
		if r.err != nil {
			fmt.Println(r.err)
		} else {
			fmt.Printf("found: %s %q\n", r.url, r.body)
		}
	}
}


there's a straightforward idiom you can use if you
want to limit the number of concurrently running goroutines:
use a buffered channel with as much buffer space as
you want running goroutines.

since the type struct{} takes no space in memory,
the channel's buffer can be as large as you like without
worrying about the space used.

you can use this in one of two ways - start full or start empty.
to start full, you first prime the buffered channel by filling
it with values. this is the most generally useful technique
as the values themselves can be useful (for instance
a thread-id, some cache context, or in this case perhaps
an http.Client instance).

// starting full

var token = make(chan Token, maxprocs)

func init() {
	for i := 0; i < maxprocs; i++ {
		token <- someToken(i)
	}
}

// to start a process
t := <-token
go func() {
	doSomething(t)
	token <- t
}()

in my code, i would probably use the start empty approach
as it's a few less lines of code. i'd just change my fetch function
to look something like this:

var limiter = make(chan struct{}, 10)

func fetch(url string, depth int, r chan<- result) {
	limiter <- struct{}{}
	go func() {
		body, urls, err := fetcher.Fetch(url)
		<-limiter
		r <- result{url, body, urls, depth, err}
	}()
}

then in the body of Crawl, i'd call fetch directly rather than
doing go fetch(...).

i'd be interested to see a "definitive" solution written
by someone in the Go team.

Message has been deleted
Message has been deleted

Andrew Gerrand

Oct 9, 2011, 6:39:57 PM
to golan...@googlegroups.com, Peter Kleiweg
Third time's a charm. Here's a much simpler version.

As you can see, there are many different ways of approaching this.

Andrew

--

package main

import (
	"os"
	"fmt"
)

type Fetcher interface {
	// Fetch returns the body of URL and
	// a slice of URLs found on that page.
	Fetch(url string) (body string, urls []string, err os.Error)
}

type crawler struct {
	f   Fetcher
	req chan string
	res chan *result
}

type result struct {
	url  string
	body string
	urls []string
	err  os.Error
}

func Crawl(url string, depth int, f Fetcher, workers int) {
	c := &crawler{f, make(chan string), make(chan *result)}
	for i := 0; i < workers; i++ {
		go c.run()
	}
	go c.fetch(url)
	fetched, n := map[string]bool{url: true}, 1
	for res := range c.res {
		n--
		if res != nil {
			if res.err != nil {
				fmt.Println(res.err)
			} else {
				fmt.Printf("found: %s %q\n", res.url, res.body)
			}
			for _, url := range res.urls {
				if !fetched[url] {
					go c.fetch(url)
					n++
				}
				fetched[url] = true
			}
		}
		if n == 0 {
			return
		}
	}
}

func (c *crawler) fetch(url string) {
	c.req <- url
}

func (c *crawler) run() {
	for url := range c.req {
		body, urls, err := c.f.Fetch(url)
		c.res <- &result{url, body, urls, err}
	}
}

func main() {
	Crawl("http://golang.org/", 4, fetcher, 10)
}

// fakeFetcher is Fetcher that returns canned results.
type fakeFetcher map[string]*fakeResult

type fakeResult struct {
	body string
	urls []string
}

func (f *fakeFetcher) Fetch(url string) (string, []string, os.Error) {
	if res, ok := (*f)[url]; ok {
		return res.body, res.urls, nil
	}
	return "", nil, fmt.Errorf("not found: %s", url)
}

// fetcher is a populated fakeFetcher.
var fetcher = &fakeFetcher{
	"http://golang.org/": &fakeResult{
		"The Go Programming Language",
		[]string{},
	},
	"http://golang.org/pkg/": &fakeResult{
		"Packages",
		[]string{},
	},
	"http://golang.org/pkg/fmt/": &fakeResult{
		"Package fmt",
		[]string{},
	},
	"http://golang.org/pkg/os/": &fakeResult{
		"Package os",
		[]string{},
	},
}

Peter Kleiweg

Oct 9, 2011, 6:55:46 PM
to golang-nuts
This version keeps crawling even when the depth limit is exceeded.

Sameer Ajmani

Oct 9, 2011, 7:52:43 PM
to golang-nuts
My solution, which should respect depth properly, is below. If you
want to bound the number of pending fetches, you can use the idiom
Roger Peppe mentioned: create a channel with buffer size == max
pending, push onto that buffered channel where the code does
pending++, and draw from it where the code does pending-- (one way
to apply this is sketched after the code below).

type response struct {
	url   string
	depth int
	body  string
	urls  []string
	err   os.Error
}

func Fetch(fetcher Fetcher, url string, depth int, responses chan *response) {
	body, urls, err := fetcher.Fetch(url)
	responses <- &response{url, depth, body, urls, err}
}

// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher) {
	if depth == 0 {
		return
	}
	responses := make(chan *response, 1)
	go Fetch(fetcher, url, depth, responses)
	// Using maps as sets is inefficient, though convenient. See
	// http://blog.golang.org/2011/06/profiling-go-programs.html
	seen := map[string]bool{url: true}
	pending := 1
	for pending > 0 {
		r := <-responses
		pending--
		if r.err != nil {
			fmt.Println(r.err)
			continue
		}
		fmt.Printf("found: %s %q\n", r.url, r.body)
		if r.depth == 0 {
			continue
		}
		for _, u := range r.urls {
			if !seen[u] {
				go Fetch(fetcher, u, r.depth-1, responses)
				seen[u] = true
				pending++
			}
		}
	}
}

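For what it's worth, here is a minimal sketch of that bounding idiom grafted onto the Fetch helper above. The gate channel, the maxPending constant, and the limit of 10 are assumptions for illustration, not part of the solution. Doing the acquire inside the fetch goroutine, rather than in Crawl's loop, keeps the main loop free to drain responses:

const maxPending = 10 // assumed limit on simultaneous fetches

var gate = make(chan struct{}, maxPending) // counting semaphore: one slot per in-flight fetch

func Fetch(fetcher Fetcher, url string, depth int, responses chan *response) {
	gate <- struct{}{} // acquire a slot; blocks once maxPending fetches are in flight
	body, urls, err := fetcher.Fetch(url)
	<-gate // release the slot before reporting back
	responses <- &response{url, depth, body, urls, err}
}

This bounds the number of simultaneous calls to fetcher.Fetch; goroutines beyond the limit simply park on the gate, so the goroutine count itself is still unbounded.
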
Peter Kleiweg

Oct 9, 2011, 8:06:23 PM
to golang-nuts, pkle...@xs4all.nl
On Oct 8, 11:15 pm, Peter Kleiweg <pklei...@xs4all.nl> wrote:
> Below is my solution to the Web Crawler exercise of A Tour of Go.
>
> This works, but I'm not sure it is correct. A new goroutine is started
> for each URL to fetch. The number of goroutines stays limited only
> because the search depth is limited. But what if this were a real
> web crawler? The number of simultaneous goroutines would keep
> growing. I presume there is a limit to how many goroutines you can
> start?

I have rewritten the Crawl function. This time, only a limited number
of urls can be fetched at a time. However, there is still no limit on
the number of goroutines that could be started. Is that a problem? If
so, how do I fix it?


// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher) {
	// DONE: Fetch URLs in parallel.
	// DONE: Don't fetch the same URL twice.

	if depth < 1 {
		return
	}

	type Target struct {
		url   string
		depth int
	}

	targets := make(chan *Target)
	queue := make(chan *Target)

	seen := make(map[string]bool)

	// This marks the end of a series of new targets
	eot := &Target{depth: -1}

	// Start some fetchers, each fetcher fetches at most 1 url at a time
	for i := 0; i < 3; i++ {
		go func() {
			for {
				t := <-queue
				body, urls, err := fetcher.Fetch(t.url)
				if err != nil {
					fmt.Println(err)
				} else {
					fmt.Printf("found: %s %q\n", t.url, body)
					if d := t.depth - 1; d > 0 {
						for _, u := range urls {
							targets <- &Target{u, d}
						}
					}
				}
				targets <- eot
			}
		}()
	}

	queue <- &Target{url, depth}
	for counter := 1; counter > 0; {
		t := <-targets
		if t.depth < 0 {
			counter--
		} else {
			if !seen[t.url] {
				seen[t.url] = true
				counter++
				go func() {
					queue <- t
				}()
			}
		}
	}
	return
}

Peter Kleiweg

Oct 9, 2011, 8:26:32 PM
to golang-nuts
On Oct 10, 1:52 am, Sameer Ajmani <ajm...@gmail.com> wrote:

>         // Using maps as sets is inefficient, though convenient.  See
>         // http://blog.golang.org/2011/06/profiling-go-programs.html

What is the alternative? Binary trees?

Kyle Lemons

Oct 10, 2011, 6:56:26 PM
to Peter Kleiweg, golang-nuts
> I have rewritten the Crawl function. This time, only a limited number
> of urls can be fetched at a time. However, there is still no limit on
> the number of goroutines that could be started. Is that a problem? If
> so, how do I fix it?

Don't worry about the number of goroutines you create. If you are worried about outgoing connections, use the suggestion from an earlier post: a buffered channel from which you pull a value at the beginning and push one back after (or push one before and pull one after, depending on your preference).

My solution is as follows:

// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher) {
	type moreUrls struct {
		depth int
		urls  []string
	}
	more := make(chan moreUrls)

	getPage := func(url string, depth int) {
		body, urls, err := fetcher.Fetch(url)
		if err != nil {
			fmt.Println(err)
		} else {
			fmt.Printf("found[%d:%s] %q\n", depth, url, body)
		}
		more <- moreUrls{depth + 1, urls}
	}

	outstanding := 1
	go getPage(url, 0)

	visited := map[string]bool{url: true}
	for outstanding > 0 {
		next := <-more
		outstanding--
		if next.depth > depth {
			continue
		}
		for _, url := range next.urls {
			if _, seen := visited[url]; seen {
				continue
			}
			visited[url] = true
			outstanding++
			go getPage(url, next.depth)
		}
	}
}

func main() {
	Crawl("http://golang.org/", 4, fetcher)
}

dlin

Oct 11, 2011, 9:57:19 AM
to golan...@googlegroups.com
My solution is here. It just uses a buffered channel. But I don't know if I'm right; I still haven't tried it with a real fetcher.

func Crawl(url string, depth, workerCnt int, fetcher Fetcher) {
    reqch := make(chan int)
    seen := make(map[string]bool)
    working := make(chan bool, workerCnt) // TIPS here

    var crawl func(string, int)
    crawl = func(url string, depth int) {
        defer func() { reqch <- -1 }()
        if depth <= 0 {
            return
        }
        if _, ok := seen[url]; ok {
            return
        }
        seen[url] = true
        working <- true              // TIPS here
        defer func() { <-working }() // TIPS here
        body, urls, err := fetcher.Fetch(url)
        if err != nil {
            fmt.Println(err)
            return
        }
        fmt.Printf("found: %s %q\n", url, body)
        reqch <- len(urls)
        for _, u := range urls {
            go crawl(u, depth-1)
        }
    }
    go crawl(url, depth)
    actsum := 1
    for diff := range reqch {
        actsum += diff
        if actsum == 0 {
            break
        }
    }
}

Kyle Lemons

Oct 11, 2011, 1:35:42 PM
to golan...@googlegroups.com
You still produce an unbounded number of goroutines and parallel fetches.

dlin

Oct 11, 2011, 6:30:46 PM
to golan...@googlegroups.com
Kyle, do you mean the code I posted will generate an unbounded number of parallel fetches?
I've used a buffered channel, so I don't see why.

I agree it generates an unbounded number of goroutines. Your algorithm will generate fewer goroutines.
But if a web page contains more than 10000 urls, both your algorithm and mine will generate more than 10000 goroutines.

Kyle Lemons

Oct 11, 2011, 6:40:27 PM
to golan...@googlegroups.com
> Kyle, do you mean the code I posted will generate an unbounded number of parallel fetches?
I apparently misread where you had the send/recv for your working channel. It looks like it will bound the number of active requests correctly, so I apologize.
 
> I've used a buffered channel, so I don't see why.

Mine also produces an unbounded number of parallel requests (and goroutines). If I were to actually implement this, however, I would put the counting semaphore inside the fetcher itself.
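
For illustration, here is a minimal sketch of that idea, assuming the Fetcher interface (with os.Error) as defined in Andrew's post earlier in the thread; the limitedFetcher type, newLimitedFetcher, and the limit passed to it are hypothetical names, not an existing API:

// limitedFetcher wraps another Fetcher and bounds how many
// Fetch calls may run at the same time.
type limitedFetcher struct {
	f    Fetcher
	gate chan struct{} // counting semaphore; capacity = max concurrent fetches
}

func newLimitedFetcher(f Fetcher, limit int) *limitedFetcher {
	return &limitedFetcher{f: f, gate: make(chan struct{}, limit)}
}

func (l *limitedFetcher) Fetch(url string) (string, []string, os.Error) {
	l.gate <- struct{}{}        // acquire a slot (blocks when the limit is reached)
	defer func() { <-l.gate }() // release the slot when the fetch returns
	return l.f.Fetch(url)
}

Any of the Crawl variants in this thread could then be given newLimitedFetcher(fetcher, 10) in place of fetcher, without changing their own logic.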
