(Design discussions would normally be sent to
golan...@googlegroups.com (BCC'ed), but I am trawling wide for
feedback).
The exp/html package in tip provides a spec-compliant HTML5 parser. As
Go 1.1 is approaching, this package will likely either be promoted to
html or moved to the go.net sub-repository. If the former, this will
require freezing its API against incompatible changes, as per
http://golang.org/doc/go1compat.html. It is unlikely that exp/html
will gain additional features before Go 1.1 is released, but ideally
the API that we freeze will still allow adding compatible features in
the future.
If you have had any problems with the API, any feature requests, or
comments in general, then now is the time to speak up. Below is a list
of known concerns.
0. Should Node be a struct or an interface? (A strawman sketch of both
shapes is appended at the end of this message.)
1. There aren't enough hooks to support <script> tags, including ones
that call document.write. On the other hand, we do not want to mandate
a particular JS implementation.
2. It is not proven that the Node type can support the DOM API.
3. Even without scripting, it is not proven that the Node type can
support rendering: it is not obvious how to attach style and layout
information. On the other hand, we do not want to mandate a particular
style and layout implementation.
4. The parser assumes that the input is UTF-8. It is possible that
this is perfectly reasonable and the io.Reader given to it can be
responsible for auto-detecting the encoding and converting to UTF-8,
but it has not yet been proven. For example, there may be subtle
interaction with document.write. (A toy transcoding reader is sketched
at the end of this message.)
5. The parser doesn't return the parse tree until it is complete. A
renderer may want to render a partially downloaded page if the network
is slow. It may also want to start the fetch of an <img>'s or
<script>'s src before parsing is complete. Do we want to support
incremental rendering, or does the complexity outweigh the benefit?
Should the API be that the caller pushes bytes to a parser multiple
times, instead of (or in addition to) giving a parser an io.Reader
once? (A strawman push-style interface is sketched at the end of this
message.)
6. The Node struct type has a Namespace string field for SVG or MathML
elements. These are rare, and could also be folded into the existing
Data string field. Eliminating the Namespace field might save a little
bit of memory. (Both shapes are sketched at the end of this message.)
7. The exp/html/atom list of atoms (and their hashes) needs to be
finalized. Relatedly, should an element Node provide an API to look up
an attribute by atom (e.g. atom.Href, atom.Id)? (A possible helper is
sketched at the end of this message.)
8. Is Tokenizer.Raw worth keeping? Does anyone use it? Its presence
may constrain future refactoring and optimization of the tokenizer.
9. A Parser reaches into a Tokenizer to set a tokenizer's internal
state based on parser state. For example, how "<![CDATA[foo]]>" is
tokenized depends on whether or not we are in "foreign content" such
as SVG or MathML. Similarly, 'raw text' tokens are allowed for a
<title> inside regular HTML, but not for a <title> inside SVG inside
HTML. Ideally, a Tokenizer would not need to expose its state, and
tokenizing an io.Reader would give the same result regardless of
whether a parser is driving that tokenizer, but that may turn out to be
impossible given the complexities of the HTML5 spec.
10. Should there be additional API to ease walking the Node tree? If
so, what should it look like? (One possible Walk function is sketched
at the end of this message.)
11. A radical option is to remove the existing support for parsing
foreign content: SVG and MathML. It would mean losing 100% compliance
with the HTML5 specification, but it would also significantly simplify
the implementation (e.g. see issues 6 and 9 above, and the fact that
element tags are case-insensitive for HTML in general but
case-sensitive for SVG inside HTML). Ideally, we would retain the
option to re-introduce SVG and MathML support in a future version, if
the benefits were re-assessed to outweigh the costs.
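
Appendix: a few strawman sketches to make some of the items above more
concrete. Nothing below is proposed API; names, fields and signatures
are invented unless the item above already mentions them.

For item 0, the two broad shapes for Node. The struct fields here only
roughly follow the current exp/html type; the point is the trade-off,
not the exact fields, and the two declarations are alternatives, not
meant to coexist.

    // Option A, a struct: one concrete type, cheap to allocate and link,
    // but hard for callers (renderers, DOM layers) to extend without
    // wrapping it in their own types.
    type Node struct {
        Type      NodeType
        Data      string
        Namespace string
        Attr      []Attribute
        Child     []*Node
    }

    // Option B, an interface: callers could supply their own
    // implementations (and hang style, layout or scripting state off
    // them), at the cost of an extra indirection and a larger API surface.
    type Node interface {
        Type() NodeType
        Data() string
        Attr() []Attribute
        Children() []Node
    }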
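
For item 4, one way the "caller's io.Reader does the transcoding"
answer could look: a toy reader that converts ISO-8859-1 (Latin-1)
input to UTF-8 before the parser ever sees it. Real charset handling
(BOMs, HTTP headers, <meta charset>, and any interaction with
document.write) is much more involved; this only shows that the
conversion can live outside the parser.

    import (
        "bufio"
        "io"
    )

    // latin1Reader converts Latin-1 bytes to UTF-8 on the fly.
    type latin1Reader struct {
        src     *bufio.Reader
        pending byte // second UTF-8 byte of a rune split across Reads
    }

    func newLatin1Reader(r io.Reader) *latin1Reader {
        return &latin1Reader{src: bufio.NewReader(r)}
    }

    func (r *latin1Reader) Read(p []byte) (n int, err error) {
        for n < len(p) {
            if r.pending != 0 {
                p[n] = r.pending
                r.pending = 0
                n++
                continue
            }
            b, e := r.src.ReadByte()
            if e != nil {
                if n > 0 && e == io.EOF {
                    return n, nil
                }
                return n, e
            }
            if b < 0x80 {
                p[n] = b
            } else {
                // Latin-1 0x80-0xFF map to the same code points in
                // Unicode, which take two bytes in UTF-8.
                p[n] = 0xC0 | b>>6
                r.pending = 0x80 | b&0x3F
            }
            n++
        }
        return n, nil
    }

The parser would then be handed newLatin1Reader(conn) instead of conn.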
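
For item 5, a strawman push-style interface. None of these names exist
in exp/html; this is only the shape such an API might take, instead of
or alongside parsing an io.Reader in one call.

    // IncrementalParser is a hypothetical alternative (or complement) to
    // handing the parser an io.Reader once.
    type IncrementalParser interface {
        // Write feeds more input; it may be called any number of times
        // as bytes arrive from the network.
        Write(p []byte) (n int, err error)
        // Tree returns the tree parsed so far. Callers must treat it as
        // read-only and expect it to grow on later calls; this is what an
        // incremental renderer or an eager <img>/<script> fetcher wants.
        Tree() *Node
        // Close signals end of input and returns the completed document.
        Close() (*Node, error)
    }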
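
For item 6, the two possible shapes for a namespaced element, showing
only the Data and Namespace fields (other fields elided; the exact
encoding of a folded Data value, space, colon or otherwise, would still
need to be chosen):

    // As it is today: a separate field.
    circleA := Node{Data: "circle", Namespace: "svg"}

    // Folded into Data: one field and a slightly smaller Node, but every
    // consumer that cares about namespaces must now split the string.
    circleB := Node{Data: "svg circle"}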
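
For item 7, one possible attribute-lookup helper. It assumes the
attribute list stays the current string-keyed Attr field (Key/Val
strings) and that exp/html/atom keeps a Lookup function; whether this
should instead be a method on Node, or whether Attribute should carry
an atom directly, is exactly the open question.

    // AttrByAtom returns the value of the attribute whose key corresponds
    // to a. Hypothetical; not in exp/html today.
    func AttrByAtom(n *Node, a atom.Atom) (val string, ok bool) {
        for _, attr := range n.Attr {
            if atom.Lookup([]byte(attr.Key)) == a {
                return attr.Val, true
            }
        }
        return "", false
    }

    // Usage: href, ok := AttrByAtom(n, atom.Href)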
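
For item 10, one possible shape for a tree walker, loosely modeled on
ast.Inspect. It assumes children are exposed as a []*Node slice; if
they are sibling pointers instead, only the loop changes.

    // Walk calls fn for n and then, if fn returns true, for each of n's
    // children in depth-first order. Hypothetical; not in exp/html today.
    func Walk(n *Node, fn func(*Node) bool) {
        if n == nil || !fn(n) {
            return
        }
        for _, c := range n.Child {
            Walk(c, fn)
        }
    }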