Looking for a sample that uses exp/html to parse a table


Jens-Uwe Mager

Feb 27, 2012, 4:15:32 PM
to golan...@googlegroups.com
I am trying to parse an HTML table using exp/html, and I am having a hard time extracting the table rows. Is there sample code anywhere that uses this package to read a table row by row? I would like to insert the result into a sqlite3 table.

Nigel Tao

Feb 27, 2012, 7:27:45 PM
to Jens-Uwe Mager, golan...@googlegroups.com

Is the problem that parsing the <table><tr> is failing, or is it a
question of how to walk a correctly-parsed tree of DOM Nodes to
extract the rows?

Jens-Uwe Mager

Feb 27, 2012, 10:00:47 PM
to golan...@googlegroups.com, Jens-Uwe Mager
Well, for example, the following snippet for extracting data from a Wikipedia table seems quite clumsy, so I was wondering whether there is something easier:

package main

import (
    "bytes"
    "exp/html"
    "flag"
    "fmt"
    "os"
)

func nodeText(n *html.Node) string {
    var b bytes.Buffer
    err := html.Render(&b, n)
    if err != nil {
        panic(err.Error())
    }
    return b.String()
}

func main() {
    flag.Parse()
    for _, fname := range flag.Args() {
        fi, err := os.Open(fname)
        if err != nil {
            panic(err.Error())
        }
        doc, err := html.Parse(fi)
        if err != nil {
            panic(err.Error())
        }
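        // Find the first <table> element that carries a class attribute.
        var table *html.Node
        var f func(*html.Node)
        f = func(n *html.Node) {
            if table != nil {
                // Already found; without this check a later table would overwrite it.
                return
            }
            if n.Type == html.ElementNode && n.Data == "table" {
                //fmt.Printf("table = %#v\n", n)
                for _, a := range n.Attr {
                    if a.Key == "class" {
                        //fmt.Printf("class = %#v\n", a.Val)
                        table = n
                        return
                    }
                }
            }
            for _, c := range n.Child {
                f(c)
            }
        }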
        f(doc)
        if table == nil {
            panic("no table in html")
        }
        fmt.Printf("table = %#v\n", table)
        var rows []*html.Node
        f = func(n *html.Node) {
            if n.Type == html.ElementNode && n.Data == "tr" {
                rows = append(rows, n)
            } else {
                for _, c := range n.Child {
                    f(c)
                }
            }
        }
        f(table)
        //fmt.Printf("rows = %#v\n", rows)
        for _, row := range rows {
            fmt.Printf("row = %#v\n", row)
            var td []*html.Node
            f = func(n *html.Node) {
                //fmt.Printf("td n = %#v\n", n)
                if n.Type == html.ElementNode && n.Data == "td" {
                    td = append(td, n)
                } else {
                    for _, c := range n.Child {
                        f(c)
                    }
                }
            }
            f(row)
            if len(td) == 0 {
                continue
            }
            //fmt.Printf("td = %#v\n", td)
            for _, d := range td {
                fmt.Printf("td = %#v\n", nodeText(d))
            }
            //break
        }
        fi.Close()
    }
}

I am still working out how to extract the node content properly; the nodeText function is only there for debugging.
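
Editor's sketch: nodeText above renders a node back to HTML, so markup such as <b> ends up in the result. If only the text is wanted, one option is to concatenate the text nodes below the cell. The Node type here is a simplified stand-in written for this sketch, not the exp/html type, and the names text and cellText are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// NodeType and Node are simplified stand-ins for the exp/html types used in
// the thread. They are assumptions made so this sketch runs on its own.
type NodeType int

const (
	TextNode NodeType = iota
	ElementNode
)

type Node struct {
	Type  NodeType
	Data  string // tag name for elements, text content for text nodes
	Child []*Node
}

// text appends the content of every text node under n to b, in document order.
func text(n *Node, b *strings.Builder) {
	if n.Type == TextNode {
		b.WriteString(n.Data)
		return
	}
	for _, c := range n.Child {
		text(c, b)
	}
}

// cellText returns the plain text below n with surrounding whitespace trimmed.
func cellText(n *Node) string {
	var b strings.Builder
	text(n, &b)
	return strings.TrimSpace(b.String())
}

func main() {
	// Hand-built equivalent of <td>b1 is <b>bold</b></td>.
	td := &Node{Type: ElementNode, Data: "td", Child: []*Node{
		{Type: TextNode, Data: "b1 is "},
		{Type: ElementNode, Data: "b", Child: []*Node{
			{Type: TextNode, Data: "bold"},
		}},
	}}
	fmt.Printf("%q\n", cellText(td))
}
```

With a real parser the same recursion would presumably walk the parser's own node type instead of this stand-in.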

Jeremy Wall

Feb 27, 2012, 10:25:46 PM
to Jens-Uwe Mager, golan...@googlegroups.com
You may be interested in go-html-transform (http://code.google.com/p/go-html-transform/), which makes scraping HTML with CSS selector queries fairly easy.

Nigel Tao

Feb 27, 2012, 10:37:31 PM
to Jens-Uwe Mager, golan...@googlegroups.com
How about:

package main

import (
    "exp/html"
    "fmt"
    "log"
    "os"
    "strings"
)

const data = `<html><body>
foo bar
<table>
<tr><td>a1</td><td>b1 is <b>bold</b></td></tr>
<tr><td>a2</td><td>b2</td></tr>
</table>
baz
</body></html>`

const (
    nothing = iota
    inTable
    inTR
    inTD
)

func main() {
    n, err := html.Parse(strings.NewReader(data))
    if err != nil {
        log.Fatal(err)
    }
    walk(n, nothing)
}

func walk(n *html.Node, state int) {
    if state == inTD {
        html.Render(os.Stdout, n)
        return
    }
    hasTD := false
    if n.Type == html.ElementNode {
        switch {
        case state == nothing && n.Data == "table":
            state++
            fmt.Println("---- table ----")
        case state == inTable && n.Data == "tr":
            state++
            fmt.Println("-- row --")
        case state == inTR && n.Data == "td":
            state++
            hasTD = true
        }
    }
    for _, c := range n.Child {
        walk(c, state)
    }
    if hasTD {
        fmt.Println()
    }
}

Output:

$ go run main.go
---- table ----
-- row --
a1
b1 is <b>bold</b>
-- row --
a2
b2

Jens-Uwe Mager

Feb 28, 2012, 12:11:01 AM
to golan...@googlegroups.com, Jens-Uwe Mager
That looks quite a bit shorter; I will try it this way.

Andy W. Song

Apr 9, 2012, 11:04:29 PM
to Jeremy Wall, Jens-Uwe Mager, golan...@googlegroups.com
I just filed a memory leak issue here: https://code.google.com/p/go-html-transform/issues/detail?id=3

I guess it's worth letting the list know. I'm sort of surprised that with GC this kind of memory leak can still happen. 

Thanks
Andy
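
Editor's sketch: a garbage collector only frees memory that is no longer reachable, so anything still referenced stays alive. The cause of the reported leak is not described in this thread; the following is only a generic illustration of how memory can be retained in Go, and it is not a claim about go-html-transform. The functions header and headerCopy are hypothetical names.

```go
package main

import "fmt"

// header returns the first n bytes of buf by reslicing. The result shares
// buf's backing array, so the whole array stays reachable, and therefore
// uncollected, for as long as the returned slice is kept.
func header(buf []byte, n int) []byte {
	return buf[:n]
}

// headerCopy returns an independent copy of the first n bytes, so the large
// buffer can be collected once nothing else refers to it.
func headerCopy(buf []byte, n int) []byte {
	out := make([]byte, n)
	copy(out, buf)
	return out
}

func main() {
	page := make([]byte, 1<<20) // stands in for a large parsed document

	a := header(page, 16)
	b := headerCopy(page, 16)

	// The capacity shows how much backing storage each slice keeps alive.
	fmt.Println(len(a), cap(a))
	fmt.Println(len(b), cap(b))
}
```

Other common sources of retention include long-lived maps or caches that are never pruned and goroutines that never exit.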


--
---------------------------------------------------------------
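Where there is a will, there is a way: with cauldrons smashed and boats sunk, the hundred and two passes of Qin fell to Chu in the end.
Heaven does not fail those who persevere: sleeping on brushwood and tasting gall, three thousand soldiers of Yue were enough to swallow Wu.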

Andy W. Song

Apr 9, 2012, 11:05:16 PM
to Jeremy Wall, Jens-Uwe Mager, golan...@googlegroups.com
By the way, the package is very convenient to use. Thanks for the effort.

Andy

Jeremy Wall

Apr 10, 2012, 8:19:12 AM
to Andy W. Song, Jens-Uwe Mager, golang-nuts

Thanks for the report. I'll see about tracking it down.

Jeremy Wall

Apr 10, 2012, 7:19:35 PM
to Andy W. Song, Jens-Uwe Mager, golang-nuts
FYI, I think this is now fixed in: http://code.google.com/p/go-html-transform/source/detail?r=e9c854c95f16b88aa8af97f746d9fc20fbfef4ca

Would you mind verifying that you no longer see the problem?
