HTML Parser

Ryan Slade

Aug 8, 2012, 4:24:29 AM
to golan...@googlegroups.com
Hi

Any recommendations for an HTML parser? I found this answer on Stack Overflow:

Anyone here know the status of the standard package or can recommend any alternatives? I'd just like to know my options before choosing one.

Thanks
Ryan

Chris Broadfoot

Aug 8, 2012, 4:25:55 AM
to Ryan Slade, golan...@googlegroups.com

Patrick Mylund Nielsen

Aug 8, 2012, 4:34:16 AM
to Ryan Slade, golan...@googlegroups.com
exp/html is great:  http://tip.golang.org/pkg/exp/html/ 

Jeremy Wall

Aug 8, 2012, 10:59:14 AM
to Ryan Slade, golang-nuts
I'm the author of go-html-transform and it mostly works for the stuff
I use it for. If you use it let me know how it went :-)

That said, I'm hoping Nigel will finish the exp/html package soon so I
can stop fixing bugs and maintaining h5 and use his instead. I'm not
sure how close to completion the standard package is. Maybe Nigel can
chime in on the thread and let us know.

Andy Balholm

Aug 8, 2012, 11:32:33 AM
to golan...@googlegroups.com, Ryan Slade
On Wednesday, August 8, 2012 7:59:14 AM UTC-7, Jeremy Wall wrote:
> That said I'm hoping Nigel will finish the exp/html package soon so I
> can stop fixing bugs and maintaining h5 and use his instead. I'm not
> sure how close to completion the standard package is. Maybe nigel can
> chime on the thread and let us know.

Nigel and I have been working on getting the exp/html package to pass the parser test suite. There are only five failing tests left, so it probably won't be long till it passes. I'm not sure how much more work it will take after passing the tests before the package is taken out of exp, but it is quite usable right now—just no guarantee of API stability.

$ grep -r '^FAIL' testlogs
testlogs/tests_innerHTML_1.dat.log:FAIL "<textarea><option>"
testlogs/tests_innerHTML_1.dat.log:FAIL "</html><!--abc-->"
testlogs/webkit01.dat.log:FAIL "<ul><li><div id='foo'/>A</li><li>B<div>C</div></li></ul>"
testlogs/webkit02.dat.log:FAIL "<html><body><img src=\"\" border=\"0\" alt=\"><div>A</div></body></html>"
testlogs/webkit02.dat.log:FAIL "<isindex action=\"x\">"
 

Jeremy Wall

Aug 8, 2012, 12:00:06 PM
to Andy Balholm, golan...@googlegroups.com, Ryan Slade
I look forward to your announcement :-)

Nigel Tao

Aug 8, 2012, 8:10:57 PM
to Andy Balholm, golan...@googlegroups.com, Ryan Slade
On 9 August 2012 01:32, Andy Balholm <andyb...@gmail.com> wrote:
> Nigel and I have been working on getting the exp/html package to pass the
> parser test suite. There are only five failing tests left, so it probably
> won't be long till it passes.

I'd like to take this opportunity to thank Andy for all the work he's
done on exp/html. We wouldn't be anywhere near this close to passing
100% of the html5lib / webkit test suite without him.


> I'm not sure how much more work it will take
> after passing the tests before the package is taken out of exp, but it is
> quite usable right now—just no guarantee of API stability.

It is certainly usable right now. Moving out of exp would mean
freezing the API, and I don't think the API is quite right yet.
Specifically, html.Node is currently a struct type; I think it needs
to be an interface type so that programs can provide different
implementations according to their needs. For example, a simple
"scrape the links from this html file" would probably be happy with
the default node implementation. Someone trying to implement a
full-blown browser would probably need nodes to contain fields to
support layout and JavaScript access, but package (exp/)html shouldn't
have to mandate a particular css or javascript implementation.

There may be other performance-related API changes. The html.Attribute
type should probably use the exp/html/atom mechanism. Namespace
representation might also change.
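
To make the struct-versus-interface point above concrete, here is a minimal sketch of what an interface-based node could look like. It is not the exp/html API: the `Node` interface, `basicNode`, `layoutNode`, and `links` are all hypothetical names invented for illustration. It only shows how a link scraper could work against the interface while a browser-like program adds its own layout fields.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a hypothetical interface (not part of exp/html). It exposes only
// what a tree walker needs, so programs can supply their own node types.
type Node interface {
	Tag() string                    // element name, "" for non-element nodes
	Attr(key string) (string, bool) // attribute lookup
	Children() []Node
}

// basicNode is a hypothetical default implementation, enough for scraping.
type basicNode struct {
	tag   string
	attrs map[string]string
	kids  []Node
}

func (n *basicNode) Tag() string { return n.tag }

func (n *basicNode) Attr(key string) (string, bool) {
	v, ok := n.attrs[key]
	return v, ok
}

func (n *basicNode) Children() []Node { return n.kids }

// layoutNode shows how a browser-like program could carry extra fields while
// still satisfying Node through the embedded *basicNode.
type layoutNode struct {
	*basicNode
	Width  int
	Height int
}

// links walks any Node implementation and collects the href of each <a>.
func links(n Node) []string {
	var out []string
	if n.Tag() == "a" {
		if href, ok := n.Attr("href"); ok {
			out = append(out, href)
		}
	}
	for _, c := range n.Children() {
		out = append(out, links(c)...)
	}
	return out
}

func main() {
	doc := &basicNode{tag: "body", kids: []Node{
		&basicNode{tag: "a", attrs: map[string]string{"href": "/one"}},
		&layoutNode{
			basicNode: &basicNode{tag: "a", attrs: map[string]string{"href": "/two"}},
			Width:     80,
			Height:    20,
		},
	}}
	fmt.Println(strings.Join(links(doc), " "))
}
```

The walker never needs to know which implementation it is given, which is the flexibility the interface idea is after. The cost is a method call per access instead of a plain field read.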

matt

Mar 8, 2013, 8:16:04 AM
to golan...@googlegroups.com, Andy Balholm, Ryan Slade
What happened to exp/html? I can't see it on the tip?

Daniel Morsing

Mar 8, 2013, 8:29:00 AM
to matt, golan...@googlegroups.com, Andy Balholm, Ryan Slade
On Fri, Mar 8, 2013 at 2:16 PM, matt <matthew....@gmail.com> wrote:
> What happened to exp/html? I can't see it on the tip?
>

It was moved to the go.net sub-repository

https://code.google.com/p/go/source/browse?repo=net

mister...@gmail.com

May 30, 2014, 8:55:03 AM
to golan...@googlegroups.com
Two years on, the state of things is quite good, but not awesome.

I'm a newcomer to Golang, as many are in 2014. go.net/html and Cascadia make a good combination for HTML parsing and querying, but it is a little slow, at least compared to the web rivals (node.js above all).
I'll also experiment with GoQuery and report my findings :)

Martin Angers

May 30, 2014, 2:04:44 PM
to golan...@googlegroups.com
goquery is a jQuery-like, higher-level abstraction over cascadia and net/html. It uses both of these packages internally, so you won't get better performance.

Valerio Coltrè

May 31, 2014, 6:24:43 AM
to golan...@googlegroups.com
You are quite right; I didn't dig into the dependencies to find this out.
Anyway, I really can't understand how the HTML parser can be so slow. On my machine, parsing a short web page takes ~150ms with GoQuery and only ~50ms with cheerio (node.js).

Do you think the HTML parser could be rewritten with more attention to performance?

Martin Angers

May 31, 2014, 10:48:23 AM
to golan...@googlegroups.com
Depending on what you need to do, you can use net/html's tokenizer API. Part of that API reuses the same buffer so that it doesn't allocate on each token. It should be very fast compared to Parse(), but may not be an option for your use case. Parse() fully parses the document and loads the whole tree into memory. I don't know what cheerio does.

Andy Balholm

May 31, 2014, 10:58:17 AM
to Valerio Coltrè, golan...@googlegroups.com

On May 31, 2014, at 3:24 AM, Valerio Coltrè <mister...@gmail.com> wrote:

> Anyway i really can't understand how the html parser can be so slow, on my machine parsing a short-length web page takes ~150ms with GoQuery, and only ~50ms with cheerio (node.js)
>
> Do you think the HTML parser can be rewrite with more attention to performance?

Can you share your benchmark setup? I find it hard to believe that the html package would take 150 ms to parse a short web page (except maybe on a Raspberry Pi or something like that).

Anyway, the HTML parser was written with quite a bit of attention to performance. Nigel rejected several of my ideas for how to do things because they were too slow.

Is cheerio (or htmlparser2, the parser it uses) HTML5-compliant?

Valerio Coltrè

May 31, 2014, 5:35:35 PM
to golan...@googlegroups.com, mister...@gmail.com
So after all, I was very wrong, luckily :)

Andy helped me track this down. I was wrongly assuming that the http request method returned the whole body; instead, it returns as soon as the response headers are done.
With this in mind, you can buffer the entire content of the page and then parse the HTML.

And yes, Golang does indeed perform much better than node.js (or Python), as anyone would expect :)