Regexp to extract a href tags

DEXTER

May 19, 2012, 9:14:37 AM
to golang-nuts
Hi,
I would like to extract all the anchor tags in a web page.

    webpage, err := ioutil.ReadAll(res.Body)
    page := string(webpage)

    const pattern string = "([<a.*\\/a>]+)"
    r := regexp.MustCompile(pattern)
    result := r.FindAllString(page, -1)

This prints:
< > < a a a > < a > < a > < a / a > < > a a</ > < a a > < a a a a . a
a a a a a a . > < a a a a a a a a a a a a a a a a > < / a a > // a . .
a a a </ > < / a a > //

I would like to get an array of strings containing all the anchor tags.

Is there any documentation on regexp, similar to what PHP maintains?

Thanks
Dexter

Chris Broadfoot

May 19, 2012, 10:08:12 AM
to DEXTER, golang-nuts
Try this, perhaps:

I would strongly recommend against parsing HTML5 with regular expressions.
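
For reference, the pattern in the question fails because the square brackets turn it into a character class: `[<a.*\/a>]+` matches any run of the characters `<`, `a`, `.`, `*`, `/` and `>`, which is exactly the fragmented output shown above. What follows is only a minimal sketch of a pattern that works on simple, well-formed pages; it is not the snippet originally posted in this reply, and `findAnchors` is a made-up helper name. It still breaks on nested or malformed markup, which is the reason for the warning above. The syntax accepted by Go's regexp package is documented in the regexp/syntax package.

```go
package main

import (
	"fmt"
	"regexp"
)

// anchorRe matches from an opening <a ...> tag to the nearest closing </a>.
// (?i) makes the match case-insensitive, (?s) lets . match newlines, \b keeps
// tags such as <abbr> from matching, and .*? is non-greedy so each anchor is
// returned separately.
var anchorRe = regexp.MustCompile(`(?is)<a\b[^>]*>.*?</a>`)

// findAnchors is a hypothetical helper used only for this sketch.
func findAnchors(page string) []string {
	return anchorRe.FindAllString(page, -1)
}

func main() {
	page := `<p><a href="http://golang.org/">Go</a> and <A HREF="/doc/">docs</A></p>`
	for _, a := range findAnchors(page) {
		fmt.Println(a)
	}
}
```

With the sample page in `main`, each anchor element comes back as one string instead of a stream of single characters.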

Archos

May 19, 2012, 12:42:40 PM
to golang-nuts
Use the package [html][1].

    import (
        "bytes"
        "exp/html"
    )

    var (
        anchorTag = []byte{'a'}
    )

    // NewTokenizer takes an io.Reader (for example res.Body), not a string.
    tkzer := html.NewTokenizer(page)

    for {
        switch tkzer.Next() {
        case html.ErrorToken:
            // HANDLE ERROR (tkzer.Err() is io.EOF at end of input)
            return

        case html.StartTagToken:
            tag, hasAttr := tkzer.TagName()
            if hasAttr && bytes.Equal(anchorTag, tag) { // <a> with attributes
                // HANDLE ANCHOR
            }
        }
    }

[1]: http://weekly.golang.org/pkg/exp/html/

Archos

May 19, 2012, 12:46:39 PM
to golang-nuts


And to match only the href attributes:

    var (
        anchorTag = []byte{'a'}
        hrefTag   = []byte("href")
        httpTag   = []byte("http")
    )

    // * * *
    // HANDLE ANCHOR
    // TagAttr returns one attribute per call, so loop until none remain.
    for more := true; more; {
        var key, val []byte
        key, val, more = tkzer.TagAttr()
        if bytes.Equal(hrefTag, key) && bytes.HasPrefix(val, httpTag) { // href, http(s)
            // HREF TAG
        }
    }
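
Putting the two steps above together (find the anchor start tags, then pick out href values beginning with "http"), here is a self-contained sketch. Since exp/html is not in the stable distribution, this version deliberately swaps in encoding/xml from the standard library, run in non-strict mode with its HTML auto-close and entity tables. That is an assumption made for the sake of a runnable example, not the exp/html API, and it copes far less well with badly broken pages than a real HTML parser. `extractLinks` is a made-up helper name.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// extractLinks is a hypothetical helper: it returns the href value of every
// <a> element whose href starts with "http" (so both http and https).
func extractLinks(page string) []string {
	d := xml.NewDecoder(strings.NewReader(page))
	// Non-strict mode tolerates common HTML habits such as unclosed <br>
	// tags and named entities like &nbsp;.
	d.Strict = false
	d.AutoClose = xml.HTMLAutoClose
	d.Entity = xml.HTMLEntity

	var links []string
	for {
		tok, err := d.Token()
		if err != nil {
			// io.EOF or a syntax error: return what was found so far.
			return links
		}
		start, ok := tok.(xml.StartElement)
		if !ok || !strings.EqualFold(start.Name.Local, "a") {
			continue
		}
		// Unlike the single TagAttr call above, check every attribute.
		for _, attr := range start.Attr {
			if strings.EqualFold(attr.Name.Local, "href") && strings.HasPrefix(attr.Value, "http") {
				links = append(links, attr.Value)
			}
		}
	}
}

func main() {
	page := `<p><a href="http://golang.org/">Go</a><br>` +
		`<a href="/doc/">docs</a> <a class="x" href="https://example.com/">ex</a></p>`
	fmt.Println(extractLinks(page))
}
```

The relative link `/doc/` is skipped by the "http" prefix check, the same filter as in the snippet above.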

Rodrigo Moraes

May 19, 2012, 1:20:59 PM
to golang-nuts
On May 19, 11:08 am, Chris Broadfoot wrote:
> I would strongly recommend against parsing HTML5 with regular expressions.

Here's a classic:
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

-- rodrigo

Mack Johnson

May 19, 2012, 1:49:55 PM
to golan...@googlegroups.com
[1]: http://weekly.golang.org/pkg/exp/html/

How do I get this package? It is not in my Go distribution. Is there a way to get it using the goinstall command?

Patrick Mylund Nielsen

May 19, 2012, 9:55:42 PM
to Mack Johnson, golan...@googlegroups.com
The best way is to install Go (weekly) from source:
http://golang.org/doc/install/source

(exp packages aren't available in the stable distribution)

Brad Fitzpatrick

May 19, 2012, 11:51:13 PM
to Patrick Mylund Nielsen, Mack Johnson, golan...@googlegroups.com
No, don't install weekly.  That's older than go1 itself.

Just copy the files somewhere.

Peter Bourgon

May 20, 2012, 3:28:30 AM
to Mack Johnson, golan...@googlegroups.com
I've cloned this particular package:

go get -v github.com/peterbourgon/exp-html



Patrick Mylund Nielsen

May 20, 2012, 8:32:26 AM
to Brad Fitzpatrick, Mack Johnson, golan...@googlegroups.com
Oh, whoops. Missed that this stopped.

Patrick Mylund Nielsen

May 20, 2012, 8:38:35 AM
to Brad Fitzpatrick, Mack Johnson, golan...@googlegroups.com
Perhaps http://golang.org/doc/install/source should be updated? It
still refers to weekly being updated "about once per week." Or is it
the plan to resume these updates?


Carlos Castillo

May 22, 2012, 10:24:15 AM
to golan...@googlegroups.com, Mack Johnson
Optionally you could go get the original:

go get code.google.com/p/go/src/pkg/exp/html

The upside is that go get -u will always bring it up to date; the downside is that you now have the entire Go source in your GOPATH under code.google.com/p/go/

Peter S

May 22, 2012, 10:39:49 AM
to Carlos Castillo, golan...@googlegroups.com, Mack Johnson
> Optionally you could go get the original:
> go get code.google.com/p/go/src/pkg/exp/html

AFAIK this doesn't work with the stable versions of go (go1, go1.0.1), at least it definitely doesn't work for me (go1.0.1 binary distribution on linux/64). IIUC "go get" uses the "go version" (i.e. go1, go1.0.1) as the tag when pulling the sources, and for those tags the exp packages are excluded. (Which is probably the main reason people have been setting up external mirrors in the first place.)

Peter

Carlos Castillo

May 22, 2012, 11:03:52 AM
to golan...@googlegroups.com
Ah, you are right.
--
Carlos Castillo