Simple web crawler question. How to avoid crawling visited links?


gdz...@gmail.com

Sep 24, 2017, 10:13:16 AM
to golang-nuts
Hi, I am learning Go concurrency and trying to build a simple website crawler. I managed to crawl all the links on the pages of a website to any depth, but I still have one problem to tackle: how do I avoid re-crawling links that have already been visited?

Here is my code. Hope you guys can shed some light. Thank you in advance.

package main

import (
    "fmt"
    "log"
    "net/http"
    "os"
    "strings"

    "golang.org/x/net/html"
)

func main() {
    if len(os.Args) != 2 {
        fmt.Println("Usage: crawl [URL].")
        os.Exit(1)
    }

    url := os.Args[1]
    if !strings.HasPrefix(url, "http://") {
        url = "http://" + url
    }

    for link := range newCrawl(url, 1) {
        fmt.Println(link)
    }
}

func newCrawl(url string, num int) chan string {
    ch := make(chan string, 20)

    go func() {
        crawl(url, 1, ch)
        close(ch)
    }()

    return ch
}

func crawl(url string, n int, ch chan string) {
    if n < 1 {
        return
    }
    resp, err := http.Get(url)
    if err != nil {
        // log.Fatalf prints the error and exits, so no separate os.Exit is needed.
        log.Fatalf("Can not reach the site. Error = %v\n", err)
    }

    b := resp.Body
    defer b.Close()

    z := html.NewTokenizer(b)

    nextN := n - 1
    for {
        token := z.Next()

        switch token {
        case html.ErrorToken:
            return
        case html.StartTagToken:
            current := z.Token()
            if current.Data != "a" {
                continue
            }
            result, ok := getHrefTag(current)
            if !ok {
                continue
            }

            hasProto := strings.HasPrefix(result, "http")
            if hasProto {
                done := make(chan struct{})
                go func() {
                    crawl(result, nextN, ch)
                    close(done)
                }()
                <-done
                ch <- result
            }
        }
    }
}

func getHrefTag(token html.Token) (result string, ok bool) {
    for _, a := range token.Attr {
        if a.Key == "href" {
            result = a.Val
            ok = true
            break
        }
    }
    return
}

Michael Jones

Sep 24, 2017, 1:14:52 PM
to gdz...@gmail.com, golang-nuts
You must remember where you've been. For example, you might:

a. Convert each candidate URL to a canonical form (an absolute URL).

b. Look up the canonical URL in a map before visiting. If it is not there, insert it and visit; if it is there, do nothing.

That is enough.
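
A minimal sketch of those two steps, assuming the whole crawl runs in a single goroutine (canonicalize and shouldVisit are names invented here; net/url does the resolution):

package main

import (
    "fmt"
    "net/url"
)

// visited records every canonical URL that has already been crawled.
var visited = make(map[string]bool)

// canonicalize resolves href against the page it was found on, so relative
// and absolute forms of the same link end up identical.
func canonicalize(base, href string) (string, error) {
    b, err := url.Parse(base)
    if err != nil {
        return "", err
    }
    u, err := b.Parse(href)
    if err != nil {
        return "", err
    }
    u.Fragment = "" // treat "#section" variants as the same page
    return u.String(), nil
}

// shouldVisit reports whether canon is new, and marks it visited if so.
func shouldVisit(canon string) bool {
    if visited[canon] {
        return false
    }
    visited[canon] = true
    return true
}

func main() {
    for _, href := range []string{"/about", "http://example.com/about", "/about#team"} {
        canon, err := canonicalize("http://example.com/", href)
        if err != nil {
            continue
        }
        fmt.Println(canon, shouldVisit(canon))
    }
}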





--
Michael T. Jones
michae...@gmail.com

Sam Vilain

Sep 24, 2017, 1:43:33 PM
to Michael Jones, golang-nuts, gdz...@gmail.com
Why not be even more concurrent?

Pass "to visit" links to a channel.

The reader of that channel holds the map, de-dupes, and passes new links on to a worker channel.

Multiple workers dequeue the worker channel and feed the links they find back into the "to visit" channel.
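
Roughly, as a toy sketch (the workers are stubbed out, and shutdown is handled only by closing the channels):

package main

import "fmt"

func main() {
    toVisit := make(chan string) // every link found anywhere is sent here
    work := make(chan string)    // de-duplicated links for the workers

    // Only this goroutine touches the map, so no mutex is needed.
    go func() {
        seen := make(map[string]bool)
        for u := range toVisit {
            if !seen[u] {
                seen[u] = true
                work <- u
            }
        }
        close(work)
    }()

    // Stand-in for the workers: in the real crawler they would fetch u and
    // send every link they find back into toVisit.
    go func() {
        for _, u := range []string{"http://a", "http://b", "http://a"} {
            toVisit <- u
        }
        close(toVisit)
    }()

    for u := range work {
        fmt.Println("crawl:", u)
    }
}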

Sam


Aaron

Sep 25, 2017, 8:35:43 AM
to golang-nuts
I have come up with a fix using a mutex, but I am not sure how to do it with channels.

package main

import (
    "fmt"
    "log"
    "net/http"
    "os"
    "strings"
    "sync"

    "golang.org/x/net/html"
)

var lock = sync.RWMutex{}

func main() {
    if len(os.Args) != 2 {
        fmt.Println("Usage: crawl [URL].")
        os.Exit(1)
    }

    url := os.Args[1]
    if !strings.HasPrefix(url, "http://") {
        url = "http://" + url
    }

    n := 0

    for link := range newCrawl(url, 1) {
        n++
        fmt.Println(link)
    }

    fmt.Printf("Total links: %d\n", n)
}

func newCrawl(url string, num int) chan string {
    visited := make(map[string]bool)
    ch := make(chan string, 20)

    go func() {
        crawl(url, 3, ch, &visited)
        close(ch)
    }()

    return ch
}

func crawl(url string, n int, ch chan string, visited *map[string]bool) {
    if n < 1 {
        return
    }
    resp, err := http.Get(url)
    if err != nil {
        // log.Fatalf prints the error and exits, so no separate os.Exit is needed.
        log.Fatalf("Can not reach the site. Error = %v\n", err)
    }

    b := resp.Body
    defer b.Close()

    z := html.NewTokenizer(b)

    nextN := n - 1
    for {
        token := z.Next()

        switch token {
        case html.ErrorToken:
            return
        case html.StartTagToken:
            current := z.Token()
            if current.Data != "a" {
                continue
            }
            result, ok := getHrefTag(current)
            if !ok {
                continue
            }

            hasProto := strings.HasPrefix(result, "http")
            if hasProto {
                lock.RLock()
                ok := (*visited)[result]
                lock.RUnlock()
                if ok {
                    continue
                }
                done := make(chan struct{})
                go func() {
                    crawl(result, nextN, ch, visited)
                    close(done)
                }()
                <-done
                lock.Lock()
                (*visited)[result] = true
                lock.Unlock()
                ch <- result
            }
        }
    }
}

func getHrefTag(token html.Token) (result string, ok bool) {
    for _, a := range token.Attr {
        if a.Key == "href" {
            result = a.Val
            ok = true
            break
        }
    }
    return
}

Tim T

Sep 25, 2017, 10:44:45 AM
to golang-nuts
I did something similar for an interview some months ago. If you want to get some inspiration: https://github.com/TimTosi/crawlr

Michael Jones

Sep 25, 2017, 10:47:15 AM
to Aaron, golang-nuts
I suggest that you first make it work in the simple way and then make it concurrent.

However, one lock-free concurrent way to think of this is as follows:

1. Start with a list of URLs (in code, on the command line, etc.).
2. Spawn a goroutine that writes each of them to a channel of strings, perhaps called PENDING.
3. Spawn a goroutine that reads URL strings from PENDING and, for each URL not already in the map of processed URLs, adds it to the map and writes it to a second channel of strings, WORK.
4. Spawn a set of goroutines that read from WORK, fetch each URL, do whatever it is that you need to do, and write any URLs found there to PENDING.

That is enough. As written, you still have the challenge of knowing when the workers are done and PENDING is empty; that is when you exit. There are other ways to do this, but the point is to state with emphasis what an earlier email said: keep the map in its own goroutine, the one that decides which URLs should be processed.
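
A rough sketch of those four steps, using a count of outstanding URLs to answer the "when are we done" question (crawlOne is a stub standing in for the real fetch-and-parse; the channel names follow the description above):

package main

import "fmt"

// crawlOne stands in for the real fetch: given a URL it returns the links
// found on that page. Stubbed here so the sketch runs on its own.
func crawlOne(u string) []string {
    graph := map[string][]string{
        "http://a": {"http://b", "http://c"},
        "http://b": {"http://a", "http://c"},
        "http://c": {},
    }
    return graph[u]
}

func main() {
    pending := make(chan []string) // lists of URLs reported back by workers
    work := make(chan string)      // de-duplicated URLs for workers to fetch

    // Step 4: workers read WORK, fetch, and report found links on PENDING.
    for i := 0; i < 4; i++ {
        go func() {
            for u := range work {
                pending <- crawlOne(u)
            }
        }()
    }

    // Steps 1-3 live in this loop: it owns the map, seeds the queue, and
    // counts URLs handed out but not yet reported back (n), so it knows
    // when everything is finished.
    seen := make(map[string]bool)
    var queue []string
    n := 0

    for _, u := range []string{"http://a"} { // step 1: the seed list
        if !seen[u] {
            seen[u] = true
            queue = append(queue, u)
        }
    }

    for n > 0 || len(queue) > 0 {
        // Enable the send case only when there is something to hand out.
        var out chan string
        var next string
        if len(queue) > 0 {
            out = work
            next = queue[0]
        }
        select {
        case out <- next:
            queue = queue[1:]
            n++
        case links := <-pending:
            n--
            for _, u := range links {
                if !seen[u] {
                    seen[u] = true
                    queue = append(queue, u)
                    fmt.Println("queued", u)
                }
            }
        }
    }
    close(work) // lets the idle workers exit
}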


Michael Jones

Sep 25, 2017, 6:18:43 PM
to Aaron, golang-nuts

Aaron

Sep 25, 2017, 11:16:35 PM
to golang-nuts
Wow, this has everything a crawler needs and more. Is it good practice to split the code into so many files? I thought a Go program is usually simple enough to keep most of the code in one file, but if code reuse is the main concern, maybe this layout is good.

Aaron

Sep 25, 2017, 11:46:01 PM
to golang-nuts
Thank you Mike. I have read the book and done that exercise, but the code in that example does not crawl down to a certain depth of the website and stop there, so the requirements are a bit different. I can't use the approach from the example directly, or more precisely, I don't know how to use it in my code.

Michael Jones

Sep 26, 2017, 10:30:47 AM
to Aaron, golang-nuts
To be sure, examples are a starting place.

To crawl to a certain depth, you need to know the depth of each URL. That means the URL being passed around stops being "just a string" and becomes a struct carrying, at the least, the URL as a string and the link depth as an integer. Following a link means incrementing the depth, checking it against the limit, and also checking for the presence of the string part in the map.
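
A hypothetical shape for that, with invented names:

// The work queue carries a URL together with its distance from the seed.
type job struct {
    url   string
    depth int
}

// maybeQueue decides whether a link found at depth parentDepth should be
// crawled, applying both the depth limit and the visited map.
func maybeQueue(href string, parentDepth, maxDepth int, seen map[string]bool, work chan<- job) {
    d := parentDepth + 1
    if d > maxDepth || seen[href] {
        return
    }
    seen[href] = true
    work <- job{url: href, depth: d}
}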

Crawling to a certain distance from a website, in the sense of how many hops to other websites, needs similar processing on the hostname.

Crawling with rate limiting, so as not to overburden a distant host's servers, means having a separate work queue per hostname. That is the natural place to do rate limits, time windows, or other such processing.
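
A sketch of that idea, with one goroutine and ticker per hostname; queueFor, the buffer size, and the interval are illustrative assumptions, and the time package is assumed to be imported:

// queueFor returns a per-host queue that releases at most one URL per
// interval into the shared work channel.
func queueFor(work chan<- string, interval time.Duration) chan string {
    q := make(chan string, 100)
    go func() {
        tick := time.NewTicker(interval)
        defer tick.Stop()
        for u := range q {
            <-tick.C // wait for this host's next slot
            work <- u
        }
    }()
    return q
}

// The dispatcher keeps one such queue per hostname, e.g.:
//
//     q, ok := queues[host]
//     if !ok {
//         q = queueFor(work, time.Second)
//         queues[host] = q
//     }
//     q <- candidateURL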

The real meaning of the example is to point out that the work queue -- a channel here -- can be written to by the (go)routines that read from it. This is a useful notion in this case.

Good luck!

To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Michael Jones

Sep 26, 2017, 5:57:38 PM
to Aaron, golang-nuts
Aaron, this is the simplest change I could make to that program to show you an example. It does what you want.


It shows what I was saying about replacing a string with a struct of string and int.

Depth 1:
