Re: [go-nuts] My solution to Web Crawler excercise

439 views
Skip to first unread message

Kyle Lemons

unread,
Aug 27, 2012, 5:37:37 PM8/27/12
to Thomas Gawehns, golan...@googlegroups.com
Looks pretty good.

On Mon, Aug 27, 2012 at 2:57 AM, Thomas Gawehns <thomas....@gmail.com> wrote:
Hi folks,
I'm new to go and must admit i like it. But somehow my solution to the web crawler exercise looks to simple:

func Crawl(url string, depth int, fetcher Fetcher) {

visitedUrls := make(map[string]bool, 100)
This is accessed concurrently without a lock.  With GOMAXPROCS=1, this probably won't blow up, but it's technically a data race and thus not "correct."
 
var InnerCrawl func(url string, depth int, 
fetcher Fetcher, finish chan bool)
InnerCrawl = func (url string, depth int, 
fetcher Fetcher, finish chan bool) {
defer func() { finish<- true } ()
done := make(chan bool)
if depth <= 0 || visitedUrls[url] {
return
}
visitedUrls[url] = true
body, urls, err := fetcher.Fetch(url)
if err != nil {
fmt.Println(err)
return
}
fmt.Printf("found: %s %q\n", url, body)
for _, u := range urls {
go InnerCrawl(u, depth-1, fetcher, done)
}
for _ = range urls {
        <-done 
    }
}
done := make(chan bool)
go InnerCrawl(url, depth, fetcher, done)
<-done
return
}

Is there some flaw in it???
 rgds Thomas

Thomas Gawehns

unread,
Aug 28, 2012, 4:11:29 AM8/28/12
to golan...@googlegroups.com
Thx Kyle,
for pointing the data race issue.

I assumed the map data structure is sort of thread safe  so concurrent read/write access won't blow it up. So that you could have two threads trying to insert key value pairs with the same key without crashing. 

If that's the case my solution doesn't lock but allows for possibly multiple spidering the same url.  

Anyway what does GOMAXPROCS=n mean? Is a define constant or an environment variable?

Rgds Thomas

minux

unread,
Aug 28, 2012, 4:18:12 AM8/28/12
to Thomas Gawehns, golan...@googlegroups.com
On Tue, Aug 28, 2012 at 4:11 PM, Thomas Gawehns <thomas....@gmail.com> wrote:
I assumed the map data structure is sort of thread safe  so concurrent read/write access won't blow it up. So that you could have two
no, it's not. you need to provide your own locking. The reason is explained here:
 
Anyway what does GOMAXPROCS=n mean? Is a define constant or an environment variable?

Thomas Gawehns

unread,
Aug 28, 2012, 4:31:19 AM8/28/12
to minux, golan...@googlegroups.com


2012/8/28 minux <minu...@gmail.com>


On Tue, Aug 28, 2012 at 4:11 PM, Thomas Gawehns <thomas....@gmail.com> wrote:
I assumed the map data structure is sort of thread safe  so concurrent read/write access won't blow it up. So that you could have two
no, it's not. you need to provide your own locking. The reason is explained here:
 
That a pity cause you don't need to lock all access. It is only the write operation to be implemented to not break the data structure to allow for reading while writing  And then synchonize on writing to the same key only.

Is it possible to extend the map type? Or use some C/C++ code which would implement these goodies?
--

Dipl.-Inf. Thomas Gawehns                                          www.gawehns.de
Weichselstrasse 3, 90419 Nürnberg       dashabeichgesehen.blogspot.com
Gerberstrasse 3, 78050 VS-Villingen

0173 37 44 383

minux

unread,
Aug 28, 2012, 4:38:14 AM8/28/12
to Thomas Gawehns, golan...@googlegroups.com
On Tue, Aug 28, 2012 at 4:31 PM, Thomas Gawehns <thomas....@gmail.com> wrote:
2012/8/28 minux <minu...@gmail.com>
On Tue, Aug 28, 2012 at 4:11 PM, Thomas Gawehns <thomas....@gmail.com> wrote:
I assumed the map data structure is sort of thread safe  so concurrent read/write access won't blow it up. So that you could have two
no, it's not. you need to provide your own locking. The reason is explained here:
That a pity cause you don't need to lock all access. It is only the write operation to be implemented to not break the data structure to allow for reading while writing  And then synchonize on writing to the same key only.
you need to lock for both reading and writing.

Is it possible to extend the map type? Or use some C/C++ code which would implement these goodies?
just embed a sync.Mutex (or sync.RWMutex) and map into a struct, and (optional)
define method such as Get/Set on it.

Thomas Gawehns

unread,
Aug 28, 2012, 5:50:26 AM8/28/12
to minux, golan...@googlegroups.com
Found a solution without data race and without mutex:

func Crawl(url string, depth int, fetcher Fetcher) {

var nextUrlToVisit func(in chan string, out chan string)
nextUrlToVisit = func(in chan string, out chan string) {
visitedUrls := make(map[string]bool, 100)
for {
url := <-in
if( url == "" ){
return
}
if(visitedUrls[url]){
out<- ""
}else{
out<- url
}
}
}
in := make(chan string)
out := make(chan string)
var InnerCrawl func(url string, depth int, 
fetcher Fetcher, finish chan bool)
InnerCrawl = func (url string, depth int, 
fetcher Fetcher, finish chan bool) {
defer func() { finish<- true } ()
done := make(chan bool)
if depth <= 0 {
return
}
in<- url
url = <-out
if(url == ""){
return
}
// put in planned url to visit and 
// read url to do crawl or nothing to return
//
body, urls, err := fetcher.Fetch(url)
if err != nil {
fmt.Println(err)
return
}
fmt.Printf("found: %s %q\n", url, body)
for _, u := range urls {
go InnerCrawl(u, depth-1, fetcher, done)
}
for _ = range urls {
        <-done 
    }
}
done := make(chan bool)
go nextUrlToVisit(in, out)
go InnerCrawl(url, depth, fetcher, done)
<-done
return
}

What do you think?

roger peppe

unread,
Aug 28, 2012, 5:53:05 AM8/28/12
to Thomas Gawehns, minux, golan...@googlegroups.com
On 28 August 2012 09:31, Thomas Gawehns <thomas....@gmail.com> wrote:
>
>
> 2012/8/28 minux <minu...@gmail.com>
>>
>>
>> On Tue, Aug 28, 2012 at 4:11 PM, Thomas Gawehns <thomas....@gmail.com>
>> wrote:
>>>
>>> I assumed the map data structure is sort of thread safe so concurrent
>>> read/write access won't blow it up. So that you could have two
>>
>> no, it's not. you need to provide your own locking. The reason is
>> explained here:
>> http://golang.org/doc/go_faq.html#atomic_maps
>>
>
> That a pity cause you don't need to lock all access. It is only the write
> operation to be implemented to not break the data structure to allow for
> reading while writing And then synchonize on writing to the same key only.

Even if maps were safe to read and write concurrently, your solution
would still be
incorrect.

Think about these lines:

if depth <= 0 || visitedUrls[url] {
return
}
visitedUrls[url] = true

If two goroutines both observe that visitedUrls
is false for the same url, they will both carry on to
fetch it, so you'll be fetching the same URL twice.

You *could* use a mutex around the map, but you could
also use a more Go-like solution which avoids mutexes
and uses channel communication to the same end.

Here's an example: http://play.golang.org/p/J06a5tlNax

roger peppe

unread,
Aug 28, 2012, 6:12:31 AM8/28/12
to Thomas Gawehns, minux, golan...@googlegroups.com
On 28 August 2012 10:50, Thomas Gawehns <thomas....@gmail.com> wrote:
> Found a solution without data race and without mutex:
>
> func Crawl(url string, depth int, fetcher Fetcher) {
>
> var nextUrlToVisit func(in chan string, out chan string)
> nextUrlToVisit = func(in chan string, out chan string) {
> visitedUrls := make(map[string]bool, 100)
> for {
> url := <-in
> if( url == "" ){
> return
> }
> if(visitedUrls[url]){
> out<- ""
> }else{
> out<- url
> }
> }
> }

The above is pretty much equivalent to using a mutex.
(I assume you actually wanted to add urls to the visitedUrls map)

When I'm designing concurrent Go code, I often try to think:
"what piece of code *wants* this data?". If I'm creating a goroutine
solely to guard access to a particular piece of data, with no
extra functionality, then I tend to think I'm doing something wrong.

Sometimes, of course, several goroutines *do* need access to the
same data, in which case a mutex is a fine thing, but I don't think
that's the case here.

> in := make(chan string)
> out := make(chan string)
> var InnerCrawl func(url string, depth int,
> fetcher Fetcher, finish chan bool)
> InnerCrawl = func (url string, depth int,
> fetcher Fetcher, finish chan bool) {

Another minor point: it's not that usual in Go to
write recursive closures - a separate function is almost
as easy to write, and somewhat easier to read.

Thomas Gawehns

unread,
Aug 28, 2012, 6:23:25 AM8/28/12
to roger peppe, minux, golan...@googlegroups.com
You're right, it looks like a mutex version. Thx for the critics, made me reflect on that issue.




Vaibhav

unread,
Apr 17, 2020, 4:48:03 PM4/17/20
to golang-nuts
Hi,
I didn't want to create a new thread with same topic so though of making a reply here.
I made a solution for the same problem https://play.golang.org/p/GP8qVB9FPfx without the use of lock and somehow it is working, but ends up going in the deadlock! What is going wrong here?

func Crawl(root string, r_depth int, fetcher Fetcher) {
ch := make(chan struct {
string
int
})
var wg sync.WaitGroup
crawl := func(url string, depth int) {
defer wg.Done()
if depth <= 0 {
return
}
body, urls, err := fetcher.Fetch(url)
if err != nil {
fmt.Println(err)
return
}
fmt.Printf("found: %s %q\n", url, body)
for _, u := range urls {
ch <- struct {
string
int
}{u, depth - 1}
}
}
urlMap := map[string]bool{root: true}
wg.Add(1)
go crawl(root, r_depth)
for url := range ch {
if _, ok := urlMap[url.string]; !ok {
wg.Add(1)
go crawl(url.string, url.int)
urlMap[url.string] = true
}
}
wg.Wait()
}

Regards,
Vaibhav

On Tuesday, August 28, 2012 at 3:07:37 AM UTC+5:30, Kyle Lemons wrote:
Looks pretty good.

Will S

unread,
Apr 17, 2020, 7:24:34 PM4/17/20
to Vaibhav, golang-nuts
Your deadlock comes from the channel never being closed so the for loop in your main go routine never ends.
You need to add some logic to figure out when to close the channel, or break out of the loop (and then close the channel).

- Will

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/a0aebb00-cb03-4bdb-a409-c24f723868d5%40googlegroups.com.

Vaibhav Kaushik

unread,
Apr 18, 2020, 4:28:15 PM4/18/20
to Will S, golang-nuts


> On 18-Apr-2020, at 3:46 AM, Will S <92f...@gmail.com> wrote:
>
> 
> Your deadlock comes from the channel never being closed so the for loop in your main go routine never ends.

This makes sense. Thank you.

> You need to add some logic to figure out when to close the channel, or break out of the loop (and then close the channel).

I tried doing that. Creating different channels for each crawl call and so on, but after some time realised I might be making a over complicated solution for a simple problem. Using channel when there might not be a need.

Regards,
Vaibhav
Reply all
Reply to author
Forward
0 new messages