HTML document generation with Go via tree rewriting


David Christensen

Sep 30, 2011, 10:51:36 PM
to golang-nuts
golang-nuts:

I am a Go wannabe who is wondering if it is possible to create HTML
documents in Go as follows:

1. Convert HTML "templates" (input) into tree data structures.

2. Modify the trees programmatically.

3. Convert the processed trees into HTML documents (output).


I would like HTML input and output to pass World Wide Web Consortium
(W3C) markup validation for at least one type of HTML 4.01:

http://validator.w3.org/


Here is one solution, expressed in Perl:


http://search.cpan.org/~tbone/HTML-Seamstress-6.111950/lib/HTML/Seamstress.pm


Package template seems to operate textually on annotated HTML templates:

http://golang.org/pkg/template/


I don't know what to make of package exp/template/html:

http://golang.org/pkg/exp/template/html/


Package html seems to provide a means to convert HTML into a tree (func
Parse; step #1, above), but I don't see a way to turn a tree back into
HTML (step #3):

http://golang.org/pkg/html/


Any thoughts or suggestions?


TIA,

David

Terrence Brannon

Sep 30, 2011, 11:07:59 PM
to David Christensen, golang-nuts
That form of templating is known as [push-style templating](http://www.perlmonks.org/?node_id=674225) and HTML::Seamstress is a system I developed for doing so, though it most certainly is not the only one :)

If they have an XML processing library, then as long as the HTML is xHTML, you should be good to.... GO

Mike Samuel

Oct 2, 2011, 5:56:26 PM
to golang-nuts


On Sep 30, 7:51 pm, David Christensen <dpchr...@holgerdanske.com>
wrote:
> golang-nuts:
>
> I am a Go wannabe who is wondering if it is possible to create HTML
> documents in Go as follows:
>
> 1.  Convert HTML "templates" (input) into tree data structures.
>
> 2.  Modify the trees programmatically.
>
> 3.  Convert the processed trees into HTML documents (output).
>
> I would like HTML input and output to pass World Wide Web Consortium
> (W3C) markup validation for at least one type of HTML 4.01:
>
>      http://validator.w3.org/
>
> Here is one solution, expressed in Perl:
>
> http://search.cpan.org/~tbone/HTML-Seamstress-6.111950/lib/HTML/Seams...
>
> Package template seems to operate textually on annotated HTML templates:
>
>      http://golang.org/pkg/template/
>
> I don't know what to make of package exp/template/html:
>
>      http://golang.org/pkg/exp/template/html/

That package wraps package template.

The docs to which you linked on golang.org are out of date, so do not
reflect most of the work on it. See
code.google.com/p/go/source/browse/src/pkg/exp/template/html/doc.go
for the bleeding edge version of the docs.

It does not manipulate DOM trees, in part, because it seeks to prevent
injection problems in CSS, JS, and URI content. Tree building and
rendering is a great way to produce validating output, but does not by
itself prevent script injection.


> Package html seems to provide a means to convert HTML into a tree (func
> Parse; step #1, above), but I don't see a way to turn a tree back into
> HTML (step #3):
>
>      http://golang.org/pkg/html/

That package says

"Package html implements an HTML5-compliant tokenizer and parser."

so you should not expect any renderer added to that package to
validate as strict HTML 4.01.

David Christensen

Oct 2, 2011, 8:22:15 PM
to golan...@googlegroups.com
On Sep 30, 7:51 pm, David Christensen <dpchr...@holgerdanske.com>
> Package html seems to provide a means to convert HTML into a tree
> (func Parse; step #1, above), but I don't see a way to turn a tree back
> into HTML (step #3):


On 10/02/2011 02:56 PM, Mike Samuel wrote:
> That package says
> "Package html implements an HTML5-compliant tokenizer and parser."
> so you should not expect any renderer added to that package to
> validate as strict HTML 4.0.1.

So, where do I find a renderer for package html (validating or not)?


David

Nigel Tao

Oct 2, 2011, 9:51:35 PM
to David Christensen, golan...@googlegroups.com
On 3 October 2011 11:22, David Christensen <dpch...@holgerdanske.com> wrote:
> On 10/02/2011 02:56 PM, Mike Samuel wrote:
>> That package says
>>     "Package html implements an HTML5-compliant tokenizer and parser."
>> so you should not expect any renderer added to that package to
>> validate as strict HTML 4.0.1.
>
> So, where do I find a renderer for package html (validating or not)?

Thinking out loud, I wouldn't mind seeing something like

func (*Node) Render(w io.Writer) os.Error

but it's not at the top of my list of things to do right now.
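
A minimal sketch of what such a method might look like. This is an illustration under stated assumptions, not the html package's code: the Node type below is a hypothetical, simplified stand-in (no attributes, no void or raw-text elements), and it returns Go's later built-in error type rather than os.Error.

```go
package main

import (
	"html"
	"io"
	"os"
	"strings"
)

// NodeType and Node are hypothetical, simplified stand-ins for illustration;
// they are not the html package's real types.
type NodeType int

const (
	TextNode NodeType = iota
	ElementNode
)

type Node struct {
	Type     NodeType
	Data     string // tag name for elements, text for text nodes
	Children []*Node
}

// Render writes n and its subtree to w as markup. Text is escaped;
// attributes, void elements, and raw-text elements such as script are
// deliberately not handled in this sketch.
func (n *Node) Render(w io.Writer) error {
	if n.Type == TextNode {
		_, err := io.WriteString(w, html.EscapeString(n.Data))
		return err
	}
	if _, err := io.WriteString(w, "<"+n.Data+">"); err != nil {
		return err
	}
	for _, c := range n.Children {
		if err := c.Render(w); err != nil {
			return err
		}
	}
	_, err := io.WriteString(w, "</"+n.Data+">")
	return err
}

// renderString is a convenience wrapper that renders into a string.
func renderString(n *Node) (string, error) {
	var sb strings.Builder
	err := n.Render(&sb)
	return sb.String(), err
}

func main() {
	doc := &Node{Type: ElementNode, Data: "p", Children: []*Node{
		{Type: TextNode, Data: "a < b"},
	}}
	// Writes <p>a &lt; b</p> to standard output.
	if err := doc.Render(os.Stdout); err != nil {
		panic(err)
	}
}
```

Taking an io.Writer keeps the method usable for files, network connections, and in-memory buffers alike, and the first write error stops the traversal.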

David Christensen

Oct 2, 2011, 11:58:26 PM
to golang-nuts
On 3 October 2011 11:22, David Christensen wrote:
> So, where do I find a renderer for package html (validating or not)?

On 10/02/2011 06:51 PM, Nigel Tao wrote:
> Thinking out loud, I wouldn't mind seeing something like
> func (*Node) Render(w io.Writer) os.Error
> but it's not at the top of my list of things to do right now.

If you, or another knowledgeable person, are willing to guide a rookie
Go programmer in implementing that function, I can volunteer a few hours
per week to work on it. Please e-mail me off-list or telephone me if
you're interested.


David Christensen
dpch...@holgerdanske.com
(209) 830-8249

Mike Samuel

Oct 3, 2011, 2:20:31 AM
to golang-nuts


On Oct 2, 8:58 pm, David Christensen <dpchr...@holgerdanske.com>
wrote:
Before you start coding maybe it would be good to be explicit about
how pedantic you want to be? When do you plan on producing an
os.Error instead of writing a fully-formed validating HTML document to
w?

It is possible to programmatically produce a DOM that cannot be
rendered to HTML.

You should definitely punt on 6 below. Whether you check for the
others or punt depends on why clients require validating output.

(1) Tags are only going to be interpreted as being contained in the
elements in which they appear to be contained if they are not preceded
by certain other tags.

<!DOCTYPE html><title>x</title>

validates and produces a document with title "x", but

<!DOCTYPE html>x<title>x</title>

does not.

(2) A single SCRIPT element containing a text node, <script>%s</script>,
cannot be rendered when the text node is the valid JavaScript program:

"</script>"</script>/

Any semantics preserving fixup would require understanding JS.

(3) Similarly, but when the valid JavaScript program is

a<!--b

because that would violate http://dev.w3.org/html5/markup/aria/syntax.html#escaping-text-span

"The text in style, script, title, and textarea elements must not
have an escaping text span
start that is not followed by an escaping text span end."

(4) Text in comments has certain restrictions.
http://dev.w3.org/html5/markup/aria/syntax.html#comments

"""
The text part of comments has the following restrictions:

* must not start with a ">" character
* must not start with the string "->"
* must not contain the string "--"
* must not end with a "-" character
"""

(5) It is possible to produce a validating document that does not
validate on some browsers. For example,

<!--[if IE]><<![endif]-->

obeys all the requirements in HTML5, but is not seen as a
validating document by IE.

(6) It is possible to produce a document that passes the validator but
that will trigger errors on any browser that loads it.

<!DOCTYPE html><html><head><title>x</title><script>document.write('</body>')</script></head><body></body></html>

(7) A plaintext element cannot appear anywhere except last in a
pre-order traversal, cannot appear inside an element that has a mandatory
end tag, and must not be followed by an end-tag.

(8) Only certain DOCTYPEs are valid HTML doctypes.

(9) It is unclear whether

<!DOCTYPE html><title></title><meta http-equiv="content-type" content="text/plain">

is actually an HTML document, but it doesn't validate because of
an interaction between two attributes.
http://www.w3.org/TR/2008/WD-html5-20080610/the-root.html#encoding

(10) The NUL character cannot appear in certain places, not even
escaped according to the mapping in
http://www.w3.org/TR/html5/tokenization.html#consume-a-character-reference

XHTML moots 1,2,3,5,7,8 but 10 becomes worse.
http://www.w3.org/TR/2008/REC-xml-20081126/#NT-Char
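
Items (2), (3), (4), and (10) are mechanical enough to sketch in Go. What follows is a rough illustration, not anything from the html package: the function names commentTextOK, scriptTextOK, and xmlCharOK are made up for this example, and scriptTextOK is a deliberately conservative approximation that may reject scripts a real serializer could still emit safely.

```go
package main

import (
	"fmt"
	"strings"
)

// commentTextOK reports whether s satisfies the four comment-text
// restrictions quoted in item (4).
func commentTextOK(s string) bool {
	return !strings.HasPrefix(s, ">") &&
		!strings.HasPrefix(s, "->") &&
		!strings.Contains(s, "--") &&
		!strings.HasSuffix(s, "-")
}

// scriptTextOK is a conservative approximation of items (2) and (3): it
// rejects text containing "</script" (case-insensitively) and text whose
// last "<!--" is not followed by "-->".
func scriptTextOK(js string) bool {
	if strings.Contains(strings.ToLower(js), "</script") {
		return false
	}
	if i := strings.LastIndex(js, "<!--"); i >= 0 && !strings.Contains(js[i+4:], "-->") {
		return false
	}
	return true
}

// xmlCharOK reports whether every rune in s matches the XML 1.0 Char
// production, which excludes NUL and most other control characters
// (item 10). Invalid UTF-8 decodes as U+FFFD and is therefore accepted.
func xmlCharOK(s string) bool {
	for _, r := range s {
		switch {
		case r == 0x9, r == 0xA, r == 0xD:
		case r >= 0x20 && r <= 0xD7FF:
		case r >= 0xE000 && r <= 0xFFFD:
		case r >= 0x10000 && r <= 0x10FFFF:
		default:
			return false
		}
	}
	return true
}

func main() {
	// Each of these inputs violates one of the rules, so each prints false.
	fmt.Println(commentTextOK("a--b"))
	fmt.Println(scriptTextOK("a<!--b"))
	fmt.Println(xmlCharOK("a\x00b"))
}
```

A renderer could run checks like these before writing and return an error instead of emitting markup that would parse back into a different tree.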

David Christensen

Oct 3, 2011, 9:42:19 PM
to golan...@googlegroups.com
On 10/02/2011 06:51 PM, Nigel Tao wrote:
> Thinking out loud, I wouldn't mind seeing something like
> func (*Node) Render(w io.Writer) os.Error
> but it's not at the top of my list of things to do right now.


On Oct 2, 8:58 pm, David Christensen wrote:
> If you, or another knowledgeable person, are willing to guide a rookie
> Go programmer in implementing that function, I can volunteer a few hours
> per week to work on it. Please e-mail me off-list or telephone me if
> you're interested.


On 10/02/2011 11:20 PM, Mike Samuel wrote:
> Before you start coding maybe it would be good to be explicit about
> how pedantic you want to be? ...

It doesn't surprise me that rendering HTML from a data structure is a
non-trivial task. Thank you for pointing out just how messy it can get.
:-)


I'm no expert on HTML, DOM, parsing, tree representation, rendering,
etc.; I'm just looking for a Go solution to match the capabilities of
Perl. The missing piece seems to be Go equivalent of the HTML::Element
as_HTML() object method:


http://search.cpan.org/~jfearn/HTML-Tree-4.2/lib/HTML/Element.pm#$h-%3Eas_HTML%28%29_or_$h-%3Eas_HTML%28$entities%29


If you or anyone would care to take the lead on the task of implementing:

func (*Node) Render(w io.Writer) os.Error

for the Go package html, I'm offering my help.


David

Nigel Tao

Oct 3, 2011, 11:14:06 PM
to Mike Samuel, golang-nuts
On 3 October 2011 17:20, Mike Samuel <mikes...@gmail.com> wrote:
> Before you start coding maybe it would be good to be explicit about
> how pedantic you want to be?  When do you plan on producing an
> os.Error instead of writing a fully-formed validating HTML document to
> w?
>
> It is possible to programmatically produce a DOM that cannot be
> rendered to HTML.
>
> You should definitely punt on 6 below.  Whether you check for the
> others or punt depends on why clients require validating output.
>
> (6) It is possible to produce a document that passes the validator but
> that will trigger errors on any browser that loads it.
>
>    <!DOCTYPE html><html><head><title>x</
> title><script>document.write('</body>')</script></head><body></body></
> html>

Solving (6) would require solving the halting problem, no? How about a
goal like:

With javascript disabled, if n is a *Node (i.e. a document tree) that
resulted from Parse, then Parse(Render(n)) would result in a clone of
that tree. (Excuse the pseudo-syntax for connecting an io.Reader to an
io.Writer).

The converse wouldn't have to be true: e.g. Render(Parse(rawHTML))
would convert "a < b" to "<html><head></head><body>a &lt;
b</body></html>".
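
The same asymmetry can be seen at the level of a single text node using only the standard library's html.EscapeString and html.UnescapeString. This is a text-level analogue offered as a sketch, not the Parse/Render pair under discussion; roundTrips and converseHolds are hypothetical helper names.

```go
package main

import (
	"fmt"
	"html"
)

// roundTrips reports whether unescaping the escaped form of s gives back s.
// This mirrors the proposed goal Parse(Render(n)) == n for one text node.
func roundTrips(s string) bool {
	return html.UnescapeString(html.EscapeString(s)) == s
}

// converseHolds reports whether escaping the unescaped form of s gives
// back s. This mirrors Render(Parse(raw)) == raw, which need not hold.
func converseHolds(s string) bool {
	return html.EscapeString(html.UnescapeString(s)) == s
}

func main() {
	fmt.Println(html.EscapeString("a < b")) // a &lt; b
	fmt.Println(roundTrips("a < b"))        // true
	// "&#60;" is another spelling of "<", so it comes back as "&lt;".
	fmt.Println(converseHolds("a &#60; b")) // false
}
```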

Jeremy Wall

Oct 3, 2011, 11:51:51 PM
to David Christensen, golang-nuts

https://code.google.com/p/go-html-transform/ is my attempt at
something like that. Feel free to crib off of it, modify it, or use
it. It has a dependency on Go's html library though so it comes with a
few caveats. Mainly that if your source html has javascript in it
then the html package has a tendency to fail to parse it.

It's been on my todo list to see about patching the html parser to
support the script insertion modes but I just haven't had time yet.

Mike Samuel

Oct 4, 2011, 1:31:59 AM
to Nigel Tao, golang-nuts
2011/10/3 Nigel Tao <nige...@golang.org>:

> On 3 October 2011 17:20, Mike Samuel <mikes...@gmail.com> wrote:
>> Before you start coding maybe it would be good to be explicit about
>> how pedantic you want to be?  When do you plan on producing an
>> os.Error instead of writing a fully-formed validating HTML document to
>> w?
>>
>> It is possible to programmatically produce a DOM that cannot be
>> rendered to HTML.
>>
>> You should definitely punt on 6 below.  Whether you check for the
>> others or punt depends on why clients require validating output.
>>
>> (6) It is possible to produce a document that passes the validator but
>> that will trigger errors on any browser that loads it.
>>
>>    <!DOCTYPE html><html><head><title>x</
>> title><script>document.write('</body>')</script></head><body></body></
>> html>
>
> Solving (6) would require solving the halting problem, no? How about a
> goal like:

No it wouldn't. It only requires solving the halting problem if you
are unwilling to reject any program that would not actually produce
non-validating HTML, that is, if you cannot tolerate false positives.

For example, if you whitelist free variables to disallow unfiltered
access to document, reflective operators like eval, and use of the []
operator with non-numeric arguments, then you can prevent script from
making validating HTML non-validating. http://www.adsafe.org/

It's not worth doing, which is why I said above "You should definitely punt on 6 below."
I just listed it as one more example of why "validating HTML" is
rarely a useful property to have.

> With javascript disabled, if n is a *Node (i.e. a document tree) that
> resulted from Parse, then Parse(Render(n)) would result in a clone of
> that tree. (Excuse the pseudo-syntax for connecting an io.Reader to an
> io.Writer).

> The converse wouldn't have to be true: e.g. Render(Parse(rawHTML))
> would convert "a < b" to "<html><head></head><body>a &lt;
> b</body></html>".

You could always implement render to take advantage of
http://www.w3.org/TR/html5/syntax.html#optional-tags :

"Certain tags can be omitted.

Omitting an element's start tag does not mean the element is not
present; it is implied, but it is still there. An HTML document always
has a root html element, even if the string <html> doesn't appear
anywhere in the markup.

An html element's start tag may be omitted if the first thing inside
the html element is not a comment.

An html element's end tag may be omitted if the html element is not
immediately followed by a comment.

A head element's start tag may be omitted if the element is empty, or
if the first thing inside the head element is an element.

..."

Andy Balholm

Oct 4, 2011, 10:54:54 AM
to golan...@googlegroups.com
I'm working on a project that will involve parsing HTML, modifying the tree, and writing it back out too. At this point, I've been concentrating on the parsing side, but the rendering is something I'll need too, so I could probably help with that.

Andy

David Christensen

Oct 5, 2011, 1:29:36 AM
to golan...@googlegroups.com
On 10/03/2011 08:14 PM, Nigel Tao wrote:
> How about a goal like:
> With javascript disabled, if n is a *Node (i.e. a document tree) that
> resulted from Parse, then Parse(Render(n)) would result in a clone of
> that tree. (Excuse the pseudo-syntax for connecting an io.Reader to an
> io.Writer).
> The converse wouldn't have to be true: e.g. Render(Parse(rawHTML))
> would convert "a < b" to "<html><head></head><body>a &lt;
> b</body></html>".

That would be just fine for what I have in mind. :-)


On 10/03/2011 10:31 PM, Mike Samuel wrote:
> ... It only requires solving the halting problem if you are unwilling
> to reject any program that does not produce non-validating HTML.
> For example, if you whitelist free variables to disallow unfiltered
> access to document, reflective operators like eval, and use of the []
> operator with non-numeric arguments, then you can prevent script from
> making validating HTML non-validating. http://www.adsafe.org/ It's not
> worth doing which I said above "You should definitely punt on 6 below."

Rather than trying to prevent applications from breaking Render() with
JavaScript, I would be happy with a first generation Render() that only
concerns itself with HTML and simply passes through JavaScript (CSS,
whatever) untouched. If that's too hard, then we could write a Render()
that drops JavaScript or anything else that causes problems.


> I just listed it as one more example of why "validating HTML" is
> rarely a useful property to have.

I'm not sure of the exact capabilities of package html Parse(), but I
think it would be useful if the Go/HTML rewriting system accepted a wide
range of HTML input, validating or otherwise. For example, HTML
snippets entered via text editors and/or textarea widgets, HTML
documents created with word processors and HTML editors, and HTML
created by other programming tools (Go, Perl CGI, PHP, Microsoft
ASP/.NET, whatever).


I think it would be useful to pick a specific HTML standard and specific
validation tool for the output of Render(). This will give us a target
to shoot for, and a yardstick for measuring progress. We are free to
choose whichever standard and tool suits us.


> You could always implement render to take advantage of
> http://www.w3.org/TR/html5/syntax.html#optional-tags :
> "Certain tags can be omitted.
> Omitting an element's start tag does not mean the element is not
> present; it is implied, but it is still there. An HTML document always
> has a root html element, even if the string <html> doesn't appear
> anywhere in the markup.
> An html element's start tag may be omitted if the first thing inside
> the html element is not a comment.
> An html element's end tag may be omitted if the html element is not
> immediately followed by a comment.
> A head element's start tag may be omitted if the element is empty, or
> if the first thing inside the head element is an element.

Wouldn't it be easier to have Render() output <html>, <head>, </head>,
<body>, </body>, </html> tags every time? We could add the special
cases to Render() later. (Does Parse() understand them?)


On 10/03/2011 08:51 PM, Jeremy Wall wrote:
> https://code.google.com/p/go-html-transform/ is my attempt at
> something like that.


On 10/04/2011 07:54 AM, Andy Balholm wrote:
> I'm working on a project that will involve parsing HTML, modifying
> the tree, and writing it back out too.

Great! :-)


David

biz...@gmail.com

May 29, 2015, 11:40:17 AM
to golan...@googlegroups.com
It's years later now, but I stumbled upon this while trying to solve exactly the same problem, and the solution is (now) very simple.

So, in case anyone else stumbles upon the thread:  parse with html.ParseFragment, and render back with html.Render.


cheers

-- kf.