Do you use exp/html?


Nigel Tao

Feb 1, 2013, 10:03:22 PM
to golang-nuts, Andy Balholm
(Design discussions would normally be sent to
golan...@googlegroups.com (BCC'ed), but I am trawling wide for
feedback).

The exp/html package in tip provides a spec-compliant HTML5 parser. As
Go 1.1 is approaching, this package will likely either be promoted to
html, or move to the go.net sub-repository. If the former, this will
require freezing its API against incompatible changes, as per
http://golang.org/doc/go1compat.html. It is unlikely that exp/html
will gain additional features before Go 1.1 is released, but ideally
the API that we freeze will still allow adding compatible features in
the future.

If you have had any problems with the API, any feature requests, or
comments in general, then now is the time to speak up. Below is a list
of known concerns.

0. Should Node be a struct or an interface?

1. There aren't enough hooks to support <script> tags, including ones
that call document.write. On the other hand, we do not want to mandate
a particular JS implementation.

2. It is not proven that the Node type can support the DOM API.

3. Even without scripting, it is not proven that the Node type can
support rendering: it is not obvious how to attach style and layout
information. On the other hand, we do not want to mandate a particular
style and layout implementation.

4. The parser assumes that the input is UTF-8. It is possible that
this is perfectly reasonable and the io.Reader given to it can be
responsible for auto-detecting the encoding and converting to UTF-8,
but it has not yet been proven. For example, there may be subtle
interaction with document.write.

5. The parser doesn't return the parse tree until it is complete. A
renderer may want to render a partially downloaded page if the network
is slow. It may also want to start the fetch of an <img>'s or
<script>'s src before parsing is complete. Do we want to support
incremental rendering, or does the complexity outweigh the benefit?
Should the API be that the caller pushes bytes to a parser multiple
times, instead of or alternatively to giving a parser an io.Reader
once?
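
Purely for illustration, a push-style API might look something like the
sketch below. None of these names exist today, and this is not a
proposal:

// Hypothetical: feed input to the parser as it arrives from the
// network, inspecting the partial parse tree between pushes.
p := html.NewPushParser()
for moreInput {
	p.Push(buf)      // append more bytes of input
	render(p.Tree()) // re-render the partial parse tree
}
doc, err := p.Close() // signal EOF and finish parsing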

6. The Node struct type has a Namespace string field for SVG or MathML
elements. These are rare, and could also be folded into the existing
Data string field. Eliminating the Namespace field might save a little
bit of memory.
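
For reference, the Node struct currently looks roughly like this
(modulo doc comments):

type Node struct {
	Parent, FirstChild, LastChild, PrevSibling, NextSibling *Node

	Type      NodeType
	DataAtom  atom.Atom
	Data      string
	Namespace string
	Attr      []Attribute
}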

7. The exp/html/atom list of atoms (and their hashes) needs to be
finalized. Relatedly, should an element Node provide API to look up an
attribute by atom (e.g. atom.Href, atom.Id)?

8. Is Tokenizer.Raw worth keeping? Does anyone use it? Its presence
may constrain future refactoring and optimization of the tokenizer.

9. A Parser reaches into a Tokenizer to set a tokenizer's internal
state based on parser state. For example, how "<![CDATA[foo]]>" is
tokenized depends on whether or not we are in "foreign content" such
as SVG or MathML. Similarly, 'raw text' tokens are allowed for a
<title> inside regular HTML, but not for a <title> inside SVG inside
HTML. Ideally, a Tokenizer should not need to expose its state and
tokenization of an io.Reader is the same regardless of whether a
parser is driving that tokenizer, but that may turn out to be
impossible given the complexities of the HTML5 spec.

10. Should there be additional API to ease walking the Node tree? If
so, what should it look like?
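
For concreteness, a minimal helper of the sort in question, written
against the current struct fields, could be as simple as this sketch
(not existing API):

// walk visits every node in the tree in depth-first order.
func walk(n *html.Node, visit func(*html.Node)) {
	visit(n)
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		walk(c, visit)
	}
}

The open question is whether the package should provide this, and/or
something richer: matching, selection, mutation helpers.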

11. A radical option is to remove the existing support for parsing
foreign content: SVG and MathML. It would mean losing 100% compliance
with the HTML5 specification, but it would also significantly simplify
the implementation (e.g. see issues 6 and 9 above, and things like
element tags are case-insensitive for HTML in general, but
case-sensitive for SVG inside HTML). Ideally, we would retain the
option to re-introduce SVG and MathML support in a future version, if
the benefits were re-assessed to outweigh the costs.

Steve McCoy

Feb 1, 2013, 10:50:16 PM
to golan...@googlegroups.com
The tough parts of HTML5 are that it's constantly changing and that (as you've mentioned) other specs like JavaScript and MathML are tied into it. If it became a core part of Go's stdlib, that would be nice, but I don't know if it could satisfy contemporary and future users. This is another tough question that ties into #5 and others: depending on how far you go, you could end up implementing most of a browser, aside from the rendering. So I think sticking to the HTML side of things would be simplest and most stable.

One thing I can say with confidence relates to #10: I think XPath, or something like it, is the most convenient way of dealing with this style of markup.

Andy Balholm

Feb 1, 2013, 10:58:48 PM
to golan...@googlegroups.com, Andy Balholm
At this point, the package is well-suited to "static" uses: web scraping, building trees and rendering them as HTML, transforming content, etc. The question we face is: is that all we want it to be used for?

If we are content to have a package for static HTML manipulation, we will probably be OK if we freeze the API for Go 1.1. But if we want the package to keep growing into something you could build a browser around, we will need to move it to go.net so that it can continue to develop without being tied to a limited API.

For me personally, an API freeze would be beneficial, because I use it for static HTML manipulation, and I would just as soon avoid having to update my code all of the time. But for the project as a whole, it might be better to keep moving, so that we don't end up needing to have two HTML libraries.

Yunge

Feb 2, 2013, 1:12:54 AM
to golan...@googlegroups.com, Andy Balholm
Like Andy, I'm using exp/html just for web scraping for now.

But Go could become a foundation for a browser or an OS, or for the "world" (maybe it is too early to say).

Yunge

Feb 2, 2013, 1:49:14 AM
to golan...@googlegroups.com, Andy Balholm
I suggest the Go team talk to the Chromium/Chrome OS team. Just a thought.


jimmy frasche

Feb 2, 2013, 2:41:38 AM
to Andy Balholm, golang-nuts
Admittedly I haven't had a chance to use exp/html yet but I've been meaning to.

I don't know if it's worth it to make the library so capable that you
could use it in a web browser's codebase. That's a noble goal and all,
but the vast majority of people are going to be using it for scraping
and checking and manipulating. A simple, bulletproof parser is going
to serve the vast majority of users just as well as a
precision-engineered jet engine. In theory, I'd love a pure Go version
of something like phantomjs that I can import, but I'd take the html
API as is now, and I don't know how useful something in between those
two is.

My only real concern with the API as it is today may be a misreading
of a comment on Render. The last paragraph (I rewrote the HTML tags to
[tag] to avoid confusing anybody's e-mail client):

> Programmatically constructed trees are typically also 'well-formed',
> but it is possible to construct a tree that looks innocuous but, when
> rendered and re-parsed, results in a different tree. A simple example
> is that a solitary text node would become a tree containing [html],
> [head] and [body] elements. Another example is that the programmatic
> equivalent of "a[head]b[/head]c" becomes "[html][head][head/][body]abc[/body][/html]"

Does this imply (I am very sorry I haven't tested this) that if I
Render a tree from ParseFragment, I get a complete HTML document
spat out? If so, the API is seriously lacking a RenderFragment
function. It's unfortunately not uncommon in CMS-y applications to
store a document fragment produced by some awful WYSIWYG editor in
the DB. Without the ability to render a fragment of HTML, it can't be
used for any automatic stripping of non-whitelisted tags or any of a
myriad of like tasks. That would be really unfortunate and would limit
its usefulness to me significantly. I hope I'm just misreading the
docs.

As for the points, I can't speak to all of them, since as I have said
I haven't actually used the API yet, but

0 - A struct seems fine to me. Maybe having typed nodes instead of
typed tags would be cleaner in some ways, but it seems like a lot of
bother for not much gain.

5 - I'm sure there are plenty of people who would benefit from a
streaming API aside from browser implementers. If that means go.net
instead of the stdlib so be it, unless it's possible to add a new API
to the package later without breaking compatibility. The only use I
would have for such a thing is making semantically correct "teasers"
and even then I'd probably just parse the whole document fragment and
cache the results and move on to more interesting things.

7 - Do you mean something like the DOM's getElementById? It could be
useful, but stuff like that is really inadequate compared to a full
query language. See my comment on point 10.

8 - Reading that method's documentation makes me nervous. I don't see
the use but I do see potential danger. That looks like something that
shouldn't be exported. If there is a good use for it that others have
found I'll stick to just personally avoiding it, however.

10 - The two main contenders would be XPath and a Sizzle-esque CSS
selector parser + an equivalent of querySelectorAll. Both would be
great (*cough* querySelectorAll *cough*). It would be easier to
implement them in the html package but it shouldn't be too difficult
to have them be 3rd party libs and if they get good enough they could
always be merged in in a future release.

11 - I'd rather have a parser that I know won't break when given some
obscure document fragment. If it's valid I want to be able to process
it safely. I do not envy you having to work with the W3C's "everything
has to touch everything else" specs. My hat's off to you. Thanks for
all the great work so far.

Rodrigo Moraes

Feb 2, 2013, 8:43:34 AM
to golang-nuts
I see that most of your concerns are related to the DOM/tree aspects.
I've been using the package for a couple of weeks to extract content
from pages, and I've mostly used only the tokenizer. I found the API
quite good and the implementation solid. I've parsed some really,
really ugly and messed-up HTML. I've only used Parse()/Render() to
preprocess and "normalize" the HTML (that is, to remove unclosed tags)
before parsing.

So here are two observations.

- Sorry if this sounds silly. Is the performance gain from having
Tokenizer.Next() return a TokenType instead of simply a Token really
that large? I believe that's why it is the way it is, and if so it may
be reasonable. I just thought that returning a Token would be simpler
and more convenient. The only complication in the API is the "Next
{Raw} [ Token | Text | TagName {TagAttr} ]" part, and even that is
easy to get.
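
(To be concrete, the loop I have in mind, using the current API, is
something like this sketch, where r is an io.Reader:)

z := html.NewTokenizer(r)
for {
	tt := z.Next() // returns a TokenType...
	if tt == html.ErrorToken {
		break // z.Err() reports io.EOF at the end of the input
	}
	tok := z.Token() // ...and Token() materializes the full Token
	fmt.Println(tok.Type, tok.Data)
}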

- Maybe the Tokenizer should belong to a sub-package? It really seems
that there are two separate things here: the pull parser (aka
Tokenizer) and the DOM-ish API. The former is the base for the latter
(or for new implementations of the latter!), so maybe it should belong
to its own fundamental package.

That's it for now. Great work, Nigel.

-- rodrigo

Andy Balholm

Feb 2, 2013, 11:31:02 AM
to golan...@googlegroups.com, Andy Balholm
On Friday, February 1, 2013 11:41:38 PM UTC-8, soapboxcicero wrote:
> My only real concern with the API as it is today may be a misreading
> of a comment on Render. The last paragraph (I rewrote the HTML tags to
> [tag] to avoid confusing anybody's e-mail client):
>
>> Programmatically constructed trees are typically also 'well-formed',
>> but it is possible to construct a tree that looks innocuous but, when
>> rendered and re-parsed, results in a different tree. A simple example
>> is that a solitary text node would become a tree containing [html],
>> [head] and [body] elements. Another example is that the programmatic
>> equivalent of "a[head]b[/head]c" becomes "[html][head][head/][body]abc[/body][/html]"
>
> Does this imply (I am very sorry I haven't tested this) that if I
> Render a tree from ParseFragment, I get a complete HTML document
> spat out? If so, the API is seriously lacking a RenderFragment
> function. It's unfortunately not uncommon in CMS-y applications to
> store a document fragment produced by some awful WYSIWYG editor in
> the DB. Without the ability to render a fragment of HTML, it can't be
> used for any automatic stripping of non-whitelisted tags or any of a
> myriad of like tasks. That would be really unfortunate and would limit
> its usefulness to me significantly. I hope I'm just misreading the
> docs.

As you hoped, you are misreading the docs. If you render a single html.Node, 
you get just that node. The following program prints "<img/>":

package main

import (
	"exp/html"
	"exp/html/atom"
	"os"
)

func main() {
	n := &html.Node{
		Data:     "img",
		DataAtom: atom.Img,
		Type:     html.ElementNode,
	}
	html.Render(os.Stdout, n)
}
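
Similarly, you can round-trip a fragment by parsing it with
ParseFragment and rendering each returned node in turn. A sketch
(assuming the fragment string should be parsed in a <div> context):

nodes, err := html.ParseFragment(strings.NewReader(fragment), &html.Node{
	Type:     html.ElementNode,
	Data:     "div",
	DataAtom: atom.Div,
})
if err != nil {
	// handle the error
}
for _, n := range nodes {
	html.Render(os.Stdout, n)
}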
 
> 10 - The two main contenders would be XPath and a Sizzle-esque CSS
> selector parser + an equivalent of querySelectorAll. Both would be
> great (*cough* querySelectorAll *cough*). It would be easier to
> implement them in the html package but it shouldn't be too difficult
> to have them be 3rd party libs and if they get good enough they could
> always be merged in in a future release.

Do you mean something like code.google.com/p/cascadia?
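
For example, a sketch of how cascadia is used (doc is a parsed
*html.Node):

sel := cascadia.MustCompile(`div.article a[href]`)
for _, n := range sel.MatchAll(doc) {
	// n is an *html.Node matched by the CSS selector
}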

John Nagle

Feb 2, 2013, 12:37:03 PM
to golan...@googlegroups.com, golan...@googlegroups.com
On 2/1/2013 7:03 PM, Nigel Tao wrote:
> 4. The parser assumes that the input is UTF-8. It is possible that
> this is perfectly reasonable and the io.Reader given to it can be
> responsible for auto-detecting the encoding and converting to UTF-8,
> but it has not yet been proven. For example, there may be subtle
> interaction with document.write.

The HTML5 spec allows multiple character encodings, and there
is a defined procedure for this:

http://www.w3.org/html/wg/drafts/html/master/syntax.html#determining-the-character-encoding

Yes, it's a huge pain, and sometimes the parser has to start
over from the beginning of the document after encountering a
"charset" element. Every browser does it. If you want to
process real-world HTML, you have to do it. (I have a web
crawler running in Python, and I have to deal with this.)

A parser which does not do this is not "standards-compliant".

John Nagle


Andy Balholm

Feb 2, 2013, 4:19:17 PM
to golan...@googlegroups.com, golan...@googlegroups.com, na...@animats.com
On Saturday, February 2, 2013 9:37:03 AM UTC-8, John Nagle wrote:
> The HTML5 spec allows multiple character encodings, and there
> is a defined procedure for this:
>
> http://www.w3.org/html/wg/drafts/html/master/syntax.html#determining-the-character-encoding
>
> Yes, it's a huge pain, and sometimes the parser has to start
> over from the beginning of the document after encountering a
> "charset" declaration. Every browser does it. If you want to
> process real-world HTML, you have to do it. (I have a web
> crawler running in Python, and I have to deal with this.)
>
> A parser which does not do this is not "standards-compliant".

The html package itself does not implement the spec for character-set detection,
because there is no support in the standard library for non-UTF-8 charsets.
It would definitely be more convenient to have the charset detection built in,
but this is how I do it:

// Imports needed by this snippet.
import (
	"bytes"
	"mime"
	"strings"
	"unicode/utf8"

	"code.google.com/p/cascadia"
	"exp/html"
)

var metaCharsetSelector = cascadia.MustCompile(`meta[charset], meta[http-equiv="Content-Type"]`)

// findCharset returns the character encoding to be used to interpret the
// page's content.
func findCharset(declaredContentType string, content []byte) (charset string) {
	defer func() {
		if ce := compatibilityEncodings[charset]; ce != "" {
			charset = ce
		}
	}()

	cs := charsetFromContentType(declaredContentType)
	if cs != "" {
		return cs
	}

	if len(content) > 1024 {
		content = content[:1024]
	}

	if len(content) >= 2 {
		if content[0] == 0xfe && content[1] == 0xff {
			return "utf-16be"
		}
		if content[0] == 0xff && content[1] == 0xfe {
			return "utf-16le"
		}
	}

	if len(content) >= 3 && content[0] == 0xef && content[1] == 0xbb && content[2] == 0xbf {
		return "utf-8"
	}

	if strings.Contains(declaredContentType, "html") || declaredContentType == "" {
		// Look for a <meta> tag giving the encoding.
		tree, err := html.Parse(bytes.NewBuffer(content))
		if err == nil {
			for _, n := range metaCharsetSelector.MatchAll(tree) {
				a := make(map[string]string)
				for _, attr := range n.Attr {
					a[attr.Key] = attr.Val
				}
				if charsetAttr := a["charset"]; charsetAttr != "" {
					return strings.ToLower(charsetAttr)
				}
				if strings.EqualFold(a["http-equiv"], "Content-Type") {
					cs = charsetFromContentType(a["content"])
					if cs != "" {
						return cs
					}
				}
			}
		}
	}

	// Try to detect UTF-8.
	// First eliminate any partial rune that may be split by the 1024-byte boundary.
	for i := len(content) - 1; i >= 0 && i > len(content)-4; i-- {
		b := content[i]
		if b < 128 {
			break
		}
		if utf8.RuneStart(b) {
			content = content[:i]
			break
		}
	}
	if utf8.Valid(content) {
		return "utf-8"
	}

	return "windows-1252"
}

func charsetFromContentType(t string) string {
	t = strings.ToLower(t)
	_, params, _ := mime.ParseMediaType(t)
	return params["charset"]
}

// compatibilityEncodings contains character sets that should be misinterpreted
// for compatibility. The encodings that are commented out are not yet
// implemented by the Mahonia library.
var compatibilityEncodings = map[string]string{
	// "euc-kr":         "windows-949",
	// "euc-jp":         "cp51932",
	"gb2312":     "gbk",
	"gb_2312-80": "gbk",
	// "iso-2022-jp":    "cp50220",
	"iso-8859-1":  "windows-1252",
	"iso-8859-9":  "windows-1254",
	"iso-8859-11": "windows-874",
	// "ks_c_5601-1987": "windows-949",
	// "shift_jis":      "windows-31j",
	"tis-620":  "windows-874",
	"us-ascii": "windows-1252",
}

Kees Varekamp

Feb 2, 2013, 7:08:26 PM
to golan...@googlegroups.com, Andy Balholm
+1 vote for "Just using it to read static html - loving it the way it is."

Except perhaps also +1 vote for some sort of built-in XPath-like query mechanism.

Kees

Dave Cheney

Feb 2, 2013, 7:15:14 PM
to Steve McCoy, golan...@googlegroups.com
If it hasn't already been suggested, I think some time in the go.net
subrepo would help gain confidence that the API is correct and
complete.

Patrick Mylund Nielsen

Feb 2, 2013, 7:19:01 PM
to Dave Cheney, Steve McCoy, golang-nuts
Yeah, I agree with this. I use exp/html in several (useful) production applications, and I love how clean it is, but I can't say for sure that everything is perfect. HTML5 itself is becoming so complicated that the browsers don't even agree on how to implement it. I also don't feel like there is a lot of pressure to get it into the standard library ASAP.

Nigel Tao

Feb 2, 2013, 8:12:52 PM
to Rodrigo Moraes, golang-nuts
On Sun, Feb 3, 2013 at 12:43 AM, Rodrigo Moraes
<rodrigo...@gmail.com> wrote:
> - Sorry if this sounds silly. Is the performance gain from having
> Tokenizer.Next() return a TokenType instead of simply a Token really that large?

The performance impact is significant. Token.Data is a string, so
having Tokenizer.Next return a Token would require []byte to string
conversions on every step. This creates a lot more garbage, which is
unnecessary garbage if e.g. all you want is to scrape the <a> tags and
ignore everything else.

$ go test -test.bench='Low|High' exp/html
PASS
BenchmarkLowLevelTokenizer     2000    964264 ns/op   81.06 MB/s    5066 B/op     25 allocs/op
BenchmarkHighLevelTokenizer    1000   1563499 ns/op   49.99 MB/s  103414 B/op   3221 allocs/op
ok      exp/html        3.849s
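
For example, a sketch of scraping <a> tags in the low-level style,
without converting []byte to string for tokens we don't care about:

z := html.NewTokenizer(r)
for {
	tt := z.Next()
	if tt == html.ErrorToken {
		break
	}
	if tt != html.StartTagToken {
		continue
	}
	name, hasAttr := z.TagName()
	if atom.Lookup(name) != atom.A {
		continue // not an <a> tag; no garbage created
	}
	for hasAttr {
		var key, val []byte
		key, val, hasAttr = z.TagAttr()
		if string(key) == "href" {
			fmt.Printf("%s\n", val) // the href attribute's value
		}
	}
}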


> - Maybe the Tokenizer should belong to a sub-package?

Maybe, but point 9 in my OP describes some of the layering violations
that the parser commits to influence tokenizer state. Also, supporting
document.write from script will surely affect the tokenizer. I'm not
sure we can consider its API finalized.


> That's it for now. Great work, Nigel.

Kudos should also go to Andy Balholm, who has written a significant
chunk of exp/html.

Nigel Tao

Feb 2, 2013, 8:27:53 PM
to John Nagle, golang-nuts
On Sun, Feb 3, 2013 at 9:11 AM, John Nagle <na...@animats.com> wrote:
> Until you try parsing large amounts of real-world HTML, it's
> hard to appreciate just how awful some of what's out there is.

I am not saying that we can ignore non-UTF-8 encodings. I think we all
agree that a large amount of real-world HTML is like that.

What I am saying is that it may be feasible for this to be done by an
io.Reader implementation instead of by html.Parser or html.Tokenizer
per se, and still be spec-compliant. For example, bufio.Reader wraps
another io.Reader so that not every package needs to do its own
buffering. Andy Balholm's code snippet is a step towards proof by
example that a similar approach to encoding conversion is feasible. If
you know that the input HTML is UTF-8, then you don't need to pay any
cost, otherwise you can wrap the input in an autodetect.Reader (or
whatever the hypothetical package would be). As I said in the OP, how
this plays with document.write remains an open question.
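
A sketch of the shape I mean, in a hypothetical package autodetect
(detectEncoding and newConvertingReader are made up too):

// NewReader wraps r so that its contents are served as UTF-8. It
// sniffs the first 1024 bytes to guess the encoding. Hypothetical,
// not real API.
func NewReader(r io.Reader) io.Reader {
	br := bufio.NewReader(r)
	peek, _ := br.Peek(1024)    // may be shorter near EOF
	enc := detectEncoding(peek) // BOM, <meta charset>, heuristics
	if enc == "utf-8" {
		return br // no conversion cost for UTF-8 input
	}
	return newConvertingReader(br, enc) // decodes enc, emits UTF-8
}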

John Nagle

Feb 2, 2013, 9:47:45 PM
to golan...@googlegroups.com
On 2/2/2013 5:12 PM, Nigel Tao wrote:
> On Sun, Feb 3, 2013 at 12:43 AM, Rodrigo Moraes
> <rodrigo...@gmail.com> wrote:

>> - Maybe the Tokenizer should belong to a sub-package?
>
> Maybe, but point 9 in my OP describes some of the layering violation
> that the parser does to influence token state.

Painfully true. You can't reliably tokenize HTML without
information from the parsing level.

Also painfully true: the awful cases in HTML parsing aren't rare.

The good news is that the HTML5 spec, painful though it
is, covers all this stuff. There's no longer a need to have separate
"Netscape/Mozilla" and "Internet Explorer" parsing modes.

John Nagle

Patrick Mylund Nielsen

Feb 2, 2013, 9:50:15 PM
to John Nagle, golang-nuts
We'll see.


Rodrigo Moraes

Feb 3, 2013, 9:23:49 AM
to golang-nuts
On Feb 2, 11:12 pm, Nigel Tao wrote:
> The performance impact is significant. Token.Data is a string, so
> having Tokenizer.Next return a Token would require []byte to string
> conversions on every step. This creates a lot more garbage, which is
> unnecessary garbage if e.g. all you want is to scrape the <a> tags and
> ignore everything else.

Token could provide access to strings using methods only. Still, it
would allocate struct data unnecessarily, so this is just a thought.
The Next() API is not a big deal.
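
What I mean is something like this (hypothetical, not a proposal):

// A hypothetical Token that delays the []byte-to-string conversion
// until the caller actually asks for the data.
type Token struct {
	typ TokenType
	raw []byte
}

func (t Token) Type() TokenType { return t.typ }
func (t Token) Data() string    { return string(t.raw) } // converted lazily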

> Maybe, but point 9 in my OP describes some of the layering violations
> that the parser commits to influence tokenizer state.

Fair enough. I guess making a couple of fields public or adding some
hooks would help there, but then you compromise the API in the long
run.

-- rodrigo

Andrew Gerrand

Feb 3, 2013, 10:24:36 PM
to Nigel Tao, golang-nuts, Andy Balholm
I've been using exp/html for scraping and re-writing HTML documents and it has been great, thanks!

My 2c:

On 2 February 2013 14:03, Nigel Tao <nige...@golang.org> wrote:
> 0. Should Node be a struct or an interface?
>
> 1. There aren't enough hooks to support <script> tags, including ones
> that call document.write. On the other hand, we do not want to mandate
> a particular JS implementation.
>
> 2. It is not proven that the Node type can support the DOM API.
>
> 3. Even without scripting, it is not proven that the Node type can
> support rendering: it is not obvious how to attach style and layout
> information. On the other hand, we do not want to mandate a particular
> style and layout implementation.

It is very hard to design an idiomatic Go API for these tasks (supporting the DOM, rendering, JS interpretation, etc.) without actually attempting the tasks themselves. This is infeasible for Go 1.1. If the html package is to be part of Go 1.1, you should just clean up the API you already have rather than try to redesign it.

Let people with greater needs design a richer API as the need arises. We can roll that into the html package later, and possibly provide a separate, parallel Node API if this cannot be done while preserving backwards compatibility.
 
> 5. The parser doesn't return the parse tree until it is complete. A
> renderer may want to render a partially downloaded page if the network
> is slow. It may also want to start the fetch of an <img>'s or
> <script>'s src before parsing is complete. Do we want to support
> incremental rendering, or does the complexity outweigh the benefit?
> Should the API be that the caller pushes bytes to a parser multiple
> times, instead of or alternatively to giving a parser an io.Reader
> once?

A streaming API could be provided later, if necessary. The existing one-shot API is still useful. Don't worry about this now.
 
> 6. The Node struct type has a Namespace string field for SVG or MathML
> elements. These are rare, and could also be folded into the existing
> Data string field. Eliminating the Namespace field might save a little
> bit of memory.

What kind of memory savings are we talking about here? 8-16 bytes per node? How much harder would it be for SVG users to access the namespace data through the Data field?
 
> 8. Is Tokenizer.Raw worth keeping? Does anyone use it? Its presence
> may constrain future refactoring and optimization of the tokenizer.

I don't use it. 
 
> 10. Should there be additional API to ease walking the Node tree? If
> so, what should it look like?

I think we need more time to work on this. External libraries may do the job for now.
 
> 11. A radical option is to remove the existing support for parsing
> foreign content: SVG and MathML. It would mean losing 100% compliance
> with the HTML5 specification, but it would also significantly simplify
> the implementation (e.g. see issues 6 and 9 above, and things like
> element tags are case-insensitive for HTML in general, but
> case-sensitive for SVG inside HTML). Ideally, we would retain the
> option to re-introduce SVG and MathML support in a future version, if
> the benefits were re-assessed to outweigh the costs.

I wouldn't miss them if they were gone. But if we do want this to be a fully-compliant HTML5 parser, we still need to preserve the ability to do the nasty layering stuff. Taking it out now might cause more trouble later.

Andrew 

Kyle Lemons

Feb 5, 2013, 4:16:40 PM
to Andrew Gerrand, Nigel Tao, golang-nuts, Andy Balholm
On Sun, Feb 3, 2013 at 7:24 PM, Andrew Gerrand <a...@golang.org> wrote:
> I've been using exp/html for scraping and re-writing HTML documents and it has been great, thanks!

+1; such tasks have been quite straightforward, though admittedly verbose.

> My 2c:
>
> On 2 February 2013 14:03, Nigel Tao <nige...@golang.org> wrote:
>> 0. Should Node be a struct or an interface?
>>
>> 1. There aren't enough hooks to support <script> tags, including ones
>> that call document.write. On the other hand, we do not want to mandate
>> a particular JS implementation.
>>
>> 2. It is not proven that the Node type can support the DOM API.

Tangentially, when I am writing JavaScript, I always end up writing jQuery, because I find its selectors (based largely on CSS selectors) to be both flexible and powerful; they also generally let me change the structure of the page without breaking code, because I can refer to things symbolically and semantically instead of having to explicitly drill down based on a particular structure. Much of this seems like it could be provided in a higher-level API outside the stdlib, built upon a standard HTML5 tokenizer.

>> 3. Even without scripting, it is not proven that the Node type can
>> support rendering: it is not obvious how to attach style and layout
>> information. On the other hand, we do not want to mandate a particular
>> style and layout implementation.
>
> It is very hard to design an idiomatic Go API for these tasks (supporting the DOM, rendering, JS interpretation, etc.) without actually attempting the tasks themselves. This is infeasible for Go 1.1. If the html package is to be part of Go 1.1, you should just clean up the API you already have rather than try to redesign it.
>
> Let people with greater needs design a richer API as the need arises. We can roll that into the html package later, and possibly provide a separate, parallel Node API if this cannot be done while preserving backwards compatibility.
>
>> 5. The parser doesn't return the parse tree until it is complete. A
>> renderer may want to render a partially downloaded page if the network
>> is slow. It may also want to start the fetch of an <img>'s or
>> <script>'s src before parsing is complete. Do we want to support
>> incremental rendering, or does the complexity outweigh the benefit?
>> Should the API be that the caller pushes bytes to a parser multiple
>> times, instead of or alternatively to giving a parser an io.Reader
>> once?
>
> A streaming API could be provided later, if necessary. The existing one-shot API is still useful. Don't worry about this now.
>
>> 6. The Node struct type has a Namespace string field for SVG or MathML
>> elements. These are rare, and could also be folded into the existing
>> Data string field. Eliminating the Namespace field might save a little
>> bit of memory.
>
> What kind of memory savings are we talking about here? 8-16 bytes per node? How much harder would it be for SVG users to access the namespace data through the Data field?
>
>> 8. Is Tokenizer.Raw worth keeping? Does anyone use it? Its presence
>> may constrain future refactoring and optimization of the tokenizer.
>
> I don't use it.

I don't use it either, though I have a vague recollection that I used it for something at some point (so, probably for debugging).

>> 10. Should there be additional API to ease walking the Node tree? If
>> so, what should it look like?
>
> I think we need more time to work on this. External libraries may do the job for now.
>
>> 11. A radical option is to remove the existing support for parsing
>> foreign content: SVG and MathML. It would mean losing 100% compliance
>> with the HTML5 specification, but it would also significantly simplify
>> the implementation (e.g. see issues 6 and 9 above, and things like
>> element tags are case-insensitive for HTML in general, but
>> case-sensitive for SVG inside HTML). Ideally, we would retain the
>> option to re-introduce SVG and MathML support in a future version, if
>> the benefits were re-assessed to outweigh the costs.
>
> I wouldn't miss them if they were gone. But if we do want this to be a fully-compliant HTML5 parser, we still need to preserve the ability to do the nasty layering stuff. Taking it out now might cause more trouble later.
>
> Andrew

Nigel Tao

Feb 5, 2013, 5:36:53 PM
to golang-nuts, Andy Balholm
On Sat, Feb 2, 2013 at 2:03 PM, Nigel Tao <nige...@golang.org> wrote:
> The exp/html package in tip provides a spec-compliant HTML5 parser. As
> Go 1.1 is approaching, this package will likely either be promoted to
> html, or move to the go.net sub-repository.

I have decided to move exp/html to the go.net sub-repo as
code.google.com/p/go.net/html, but otherwise keep it as is. That means
that there will be no API differences in package html (without the
exp) in the standard library between Go 1.0 and Go 1.1.

Thanks to everyone for their feedback.