HTML5 parser that works?


John Nagle

Jan 16, 2013, 3:47:21 PM
to golan...@googlegroups.com
Is there an HTML5 parser for Go that works?

Tried

"code.google.com/p/go-html-transform/h5"

The parser dereferences nil on some documents,
from bogusCommentHandler. Issue filed:

http://code.google.com/p/go-html-transform/issues/detail?id=9

Project seems to have been abandoned.

There's a fork:

"github.com/darkhelmet/go-html-transform/h5"

However, just dropping in that as a replacement
results in a parse tree that only contains "html",
"head", "meta", and "title" items. Not clear that
the fork is even supposed to work yet.

John Nagle

Kyle Lemons

Jan 16, 2013, 3:49:24 PM
to John Nagle, golang-nuts
Have you tried exp/html (at tip)?

http://tip.golang.org/pkg/exp/html/

Jeremy Wall

Jan 16, 2013, 3:54:08 PM
to John Nagle, golang-nuts
That project is not quite abandoned. I just haven't had much to add to
it yet. :-)

I do have plans to replace the h5 parser with the one at exp/html
though once it's released since maintaining an html5 parser is not
really my cup of tea. I'll take a look at your issue and get a fix
ready soonish.

John Nagle

Jan 16, 2013, 4:04:34 PM
to golan...@googlegroups.com
On 1/16/2013 12:49 PM, Kyle Lemons wrote:
> Have you tried exp/html (at tip)?
>
> http://tip.golang.org/pkg/exp/html/

The comment there

"Package html implements an HTML5-compliant tokenizer and parser.
INCOMPLETE."

indicates a package not ready for prime time.

John Nagle

Andy Balholm

Jan 16, 2013, 4:47:11 PM
to golan...@googlegroups.com, na...@animats.com

Andy Balholm

Jan 17, 2013, 11:12:34 AM
to golan...@googlegroups.com, na...@animats.com
The docs at tip.golang.org/pkg/exp/html have now been updated to reflect the current state of the html package.

Kevin Gillette

Jan 20, 2013, 12:40:30 AM
to golan...@googlegroups.com, na...@animats.com
Indeed. I've used exp/html for several projects, and it's, by far, the cleanest, smoothest-feeling HTML scanner/parser I've worked with in a long time. I'm particularly happy with the scanner, since it's easy to use and lets me get useful data out of the relative structure (general descendant relationships, data oriented) in one pass, which is much simpler for many tasks than needing to deal with the absolute structure (precise child relationships, often mixed data and display).

Rodrigo Moraes

Jan 21, 2013, 7:28:22 AM
to golang-nuts
exp/html looks very solid, guys. Great work. The API is terrific, and
it seems to do a good job handling malformed HTML.

As Kevin said, a pull API like this is more appropriate to handle
mixed data and display -- to extract content from a tag salad with no
proper structure.

While experimenting with it this morning I created this tiny package:

http://godoc.org/github.com/moraes/sadbox/htmlfilter

-- rodrigo

John Nagle

Apr 30, 2013, 1:25:02 PM
to golan...@googlegroups.com
On 1/17/2013 8:12 AM, Andy Balholm wrote:
> The docs at tip.golang.org/pkg/exp/html have now been updated to reflect
> the current state of the html package.

http://tip.golang.org/pkg/exp/html/

Error 404: Not found

There's an entertaining discussion on Stack Overflow of how
this was moved and broken in Go 1:

http://stackoverflow.com/questions/9986329/any-smart-method-to-get-exp-html-back-after-go1

If it's that hard to even install the package, it won't have many
users and will probably still be broken for the hard cases.

I'm going back to Python until the Go crowd figures out how to parse
HTML.

John Nagle


Aram Hăvărneanu

Apr 30, 2013, 1:32:39 PM
to John Nagle, golang-nuts
> I'm going back to Python until the Go crowd figures out how to parse
> HTML.

Thanks for letting us know.

--
Aram Hăvărneanu

DisposaBoy

Apr 30, 2013, 2:03:42 PM
to golan...@googlegroups.com
It's called exp/* for a reason. If you don't know why it's there and why exp/ wasn't part of the binary releases, or are unwilling to keep up with the relevant packages' progress, then you probably shouldn't be using anything in exp/. Others have mentioned where the pkg went, and if you want to know more details then search the MLs. The move was announced. With that said, you're evidently some kinda troll so tara!

Dan Kortschak

Apr 30, 2013, 5:00:53 PM
to golan...@googlegroups.com
I think we just got a "Dear John" letter.

Nigel Tao

Apr 30, 2013, 7:36:39 PM
to John Nagle, golang-nuts
On Wed, May 1, 2013 at 3:25 AM, John Nagle <na...@animats.com> wrote:
> I'm going back to Python until the Go crowd figures out how to parse
> HTML.

As others have said, exp/html has moved to
code.google.com/p/go.net/html. The exp/* packages were created before
the go tool was invented. Now that we have that tool, it's trivial to
install the html package:

go get code.google.com/p/go.net/html

and in your code

import "code.google.com/p/go.net/html"

func foo() {
	doc, err := html.Parse(etc)
	etc
}

Since you mentioned Python, here's an anecdote from almost a year ago.
When parsing a reasonably sized HTML page (a static version of
http://golang.org/doc/go1.html), Go's html library was 30x faster than
Python's html5lib:
https://groups.google.com/forum/#!msg/golang-dev/qgMCit53-2c/FvafhcI3jb4J

John Nagle

May 1, 2013, 12:21:07 AM
to golan...@googlegroups.com
On 4/30/2013 4:36 PM, Nigel Tao wrote:
> On Wed, May 1, 2013 at 3:25 AM, John Nagle <na...@animats.com> wrote:
>> I'm going back to Python until the Go crowd figures out how to parse
>> HTML.
>
> As others have said, exp/html has moved to
> code.google.com/p/go.net/html. The exp/* packages were created before
> the go tool was invented. Now that we have that tool, it's trivial to
> install the html package:

The "html" package seems to be OK so far, except that it doesn't
understand character sets. The "h5" package was more troublesome.
There are two versions, neither of which seems to be ready for prime
time. I've stopped using it.

Has anyone implemented section 8.2.2.1 of the
HTML5 spec, "Determining the character encoding"?

Ref:
http://www.w3.org/html/wg/drafts/html/master/syntax.html#determining-the-character-encoding

(Yes, it's awful. But it's standardized, and
every browser does it.)

John Nagle
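
For illustration, one small piece of that algorithm, the byte order mark check, can be sketched in Go with only the standard library. This is a minimal sketch, not the spec's full procedure: the name sniffBOM and the returned labels are made up here, and the meta-tag prescan and the other steps are not attempted.

```go
package main

import (
	"bytes"
	"fmt"
)

// sniffBOM reports the encoding implied by a leading byte order mark,
// or "" when the input does not start with one. Hypothetical helper:
// it covers only the BOM check, not the rest of the algorithm.
func sniffBOM(b []byte) string {
	switch {
	case bytes.HasPrefix(b, []byte{0xEF, 0xBB, 0xBF}):
		return "UTF-8"
	case bytes.HasPrefix(b, []byte{0xFE, 0xFF}):
		return "UTF-16BE"
	case bytes.HasPrefix(b, []byte{0xFF, 0xFE}):
		return "UTF-16LE"
	}
	return ""
}

func main() {
	// A page that starts with the UTF-8 byte order mark.
	page := []byte("\xEF\xBB\xBF<!DOCTYPE html><title>hi</title>")
	fmt.Printf("%q\n", sniffBOM(page))
}
```

An empty result only means no BOM was found; a caller would still have to consult the other sources of encoding information.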

Donovan Hide

May 1, 2013, 4:20:00 AM
to John Nagle, golang-nuts
> Has anyone implemented section 8.2.2.1 of the
> HTML5 spec, "Determining the character encoding"?

See Andy Balholm's post

https://groups.google.com/forum/?fromgroups=#!topic/golang-nuts/Qq5hTQyPuLg

I'm sure you'd win some more favours if you contributed some code to fix the problems you encounter rather than slamming other people's code. Open source gains from collaboration, not just criticism.

Cheers,
Donovan.

Jeremy Wall

May 1, 2013, 11:37:56 AM
to John Nagle, golang-nuts
On Tue, Apr 30, 2013 at 11:21 PM, John Nagle <na...@animats.com> wrote:
> The "html" package seems to be OK so far, except that it doesn't
> understand character sets. The "h5" package was more troublesome.
> There are two versions, neither of which seems to be ready for prime
> time. I've stopped using it.

h5 was always a stopgap until Go's standard html library was ready. It handled what I needed it to handle that the standard lib didn't. It now wraps the standard lib instead of implementing its own parser.
 




John Nagle

May 1, 2013, 2:09:49 PM
to golan...@googlegroups.com
On 5/1/2013 1:20 AM, Donovan Hide wrote:
>>
>> Has anyone implemented section 8.2.2.1 of the
>> HTML5 spec, "Determining the character encoding"?
>>
>
> See Andy Balholm's post
>
> https://groups.google.com/forum/?fromgroups=#!topic/golang-nuts/Qq5hTQyPuLg

That may be a port of some similar code in Java. There's
something with the same name in the Grizzly web server. There was
also something like it in

https://github.com/fern4lvarez/go-metainspector/blob/master/scraper.go

but that project seems to have been deleted.

It's a high-maintenance piece of code, and hard to test.

John Nagle

Andy Balholm

May 1, 2013, 2:15:10 PM
to golan...@googlegroups.com, na...@animats.com
It isn't a port from Java, or from anyone else's code. It is a translation of http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#determining-the-character-encoding from English, or legalese, or whatever you call the language it's written in, to Go.
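
To give a feel for what such a translation involves, one input the spec section consults is the encoding declared by the transport layer, which for HTTP means the charset parameter of the Content-Type header. A minimal sketch of reading that parameter with the standard library's mime.ParseMediaType follows. The helper name charsetFromContentType is hypothetical and this is not the code from the post referenced above; it handles only this one input.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// charsetFromContentType extracts the charset parameter from a
// Content-Type header value, lowercased. It returns "" when the header
// cannot be parsed or carries no charset. Hypothetical helper.
func charsetFromContentType(header string) string {
	_, params, err := mime.ParseMediaType(header)
	if err != nil {
		return ""
	}
	return strings.ToLower(params["charset"])
}

func main() {
	fmt.Println(charsetFromContentType("text/html; charset=ISO-8859-1"))
}
```

A real implementation would also have to check that the declared label names a supported encoding before trusting it.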

Andy Balholm

May 1, 2013, 2:21:18 PM
to golan...@googlegroups.com, na...@animats.com
As far as the maintenance problems go, it would be nice to have it in the standard library, so that the maintenance gets centralized. But that doesn't really make sense unless the standard library gets a package for handling character encodings (which would be nice, as long as it doesn't get imported by everything and bloat the size of binaries that don't really use it).
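
As a rough idea of what the simplest piece of such a package would do, here is a sketch that decodes ISO-8859-1 bytes to a UTF-8 string using only the standard library. It relies on the fact that each ISO-8859-1 byte value equals its Unicode code point. The function name latin1ToUTF8 is made up for illustration, and the sketch implements strict ISO-8859-1 only; browsers commonly treat that label as windows-1252, which differs in the 0x80-0x9F range and would need a lookup table.

```go
package main

import "fmt"

// latin1ToUTF8 decodes strict ISO-8859-1 bytes into a Go (UTF-8) string.
// Each input byte maps to the Unicode code point with the same value.
// Hypothetical helper, not part of any existing package.
func latin1ToUTF8(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

func main() {
	// 0xE9 is "é" in ISO-8859-1.
	fmt.Println(latin1ToUTF8([]byte{'c', 'a', 'f', 0xE9}))
}
```

Most other legacy encodings need per-encoding tables, which is where the binary-size concern comes from.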