I am a Go wannabe who is wondering if it is possible to create HTML
documents in Go as follows:
1. Convert HTML "templates" (input) into tree data structures.
2. Modify the trees programmatically.
3. Convert the processed trees into HTML documents (output).
I would like HTML input and output to pass World Wide Web Consortium
(W3C) markup validation for at least one type of HTML 4.01:
Here is one solution, expressed in Perl:
http://search.cpan.org/~tbone/HTML-Seamstress-6.111950/lib/HTML/Seamstress.pm
Package template seems to operate textually on annotated HTML templates:
http://golang.org/pkg/template/
I don't know what to make of package exp/template/html:
http://golang.org/pkg/exp/template/html/
Package html seems to provide a means to convert HTML into a tree (func
Parse; step #1, above), but I don't see a way to turn a tree back into
HTML (step #3):
Any thoughts or suggestions?
TIA,
David
On 10/02/2011 02:56 PM, Mike Samuel wrote:
> That package says
> "Package html implements an HTML5-compliant tokenizer and parser."
> so you should not expect any renderer added to that package to
> validate as strict HTML 4.01.
So, where do I find a renderer for package html (validating or not)?
David
Thinking out loud, I wouldn't mind seeing something like
func (*Node) Render(w io.Writer) os.Error
but it's not at the top of my list of things to do right now.
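
A minimal sketch of what such a method could look like, using only the standard library. The Node, Attr and NodeType types below are hypothetical stand-ins, not the actual types of package html, and the sketch returns error (the Go 1 replacement for os.Error). It always writes both start and end tags and ignores void elements, comments, doctypes and raw-text elements such as script.

```go
package main

import (
	"html"
	"io"
	"os"
	"strings"
)

// NodeType, Attr and Node are hypothetical stand-ins for the tree that
// a parser would produce; the real package's types may differ.
type NodeType int

const (
	TextNode NodeType = iota
	ElementNode
)

type Attr struct{ Key, Val string }

type Node struct {
	Type     NodeType
	Data     string // tag name for elements, text for text nodes
	Attr     []Attr
	Children []*Node
}

// Render writes n and its subtree to w as HTML. Text and attribute
// values are escaped; every element gets a start tag and an end tag.
func (n *Node) Render(w io.Writer) error {
	if n.Type == TextNode {
		_, err := io.WriteString(w, html.EscapeString(n.Data))
		return err
	}
	if _, err := io.WriteString(w, "<"+n.Data); err != nil {
		return err
	}
	for _, a := range n.Attr {
		s := " " + a.Key + `="` + html.EscapeString(a.Val) + `"`
		if _, err := io.WriteString(w, s); err != nil {
			return err
		}
	}
	if _, err := io.WriteString(w, ">"); err != nil {
		return err
	}
	for _, c := range n.Children {
		if err := c.Render(w); err != nil {
			return err
		}
	}
	_, err := io.WriteString(w, "</"+n.Data+">")
	return err
}

// renderString is a convenience wrapper (also hypothetical) for demos.
func renderString(n *Node) string {
	var sb strings.Builder
	if err := n.Render(&sb); err != nil {
		return ""
	}
	return sb.String()
}

func main() {
	doc := &Node{Type: ElementNode, Data: "p", Attr: []Attr{{"class", "note"}},
		Children: []*Node{{Type: TextNode, Data: "a < b"}}}
	if err := doc.Render(os.Stdout); err != nil {
		os.Exit(1)
	}
}
```

Taking an io.Writer rather than returning a string lets the caller stream straight to a file or an HTTP response.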
On 10/02/2011 06:51 PM, Nigel Tao wrote:
> Thinking out loud, I wouldn't mind seeing something like
> func (*Node) Render(w io.Writer) os.Error
> but it's not at the top of my list of things to do right now.
If you, or another knowledgeable person, are willing to guide a rookie
Go programmer in implementing that function, I can volunteer a few hours
per week to work on it. Please e-mail me off-list or telephone me if
you're interested.
David Christensen
dpch...@holgerdanske.com
(209) 830-8249
On Oct 2, 8:58 pm, David Christensen wrote:
> If you, or another knowledgeable person, are willing to guide a rookie
> Go programmer in implementing that function, I can volunteer a few hours
> per week to work on it. Please e-mail me off-list or telephone me if
> you're interested.
On 10/02/2011 11:20 PM, Mike Samuel wrote:
> Before you start coding maybe it would be good to be explicit about
> how pedantic you want to be? ...
It doesn't surprise me that rendering HTML from a data structure is a
non-trivial task. Thank you for pointing out just how messy it can get.
:-)
I'm no expert on HTML, DOM, parsing, tree representation, rendering,
etc.; I'm just looking for a Go solution to match the capabilities of
Perl. The missing piece seems to be a Go equivalent of the HTML::Element
as_HTML() object method:
If you or anyone would care to take the lead on the task of implementing:
func (*Node) Render(w io.Writer) os.Error
for the Go package html, I'm offering my help.
David
Solving (6) would require solving the halting problem, no? How about a
goal like:
With javascript disabled, if n is a *Node (i.e. a document tree) that
resulted from Parse, then Parse(Render(n)) would result in a clone of
that tree. (Excuse the pseudo-syntax for connecting an io.Reader to an
io.Writer).
The converse wouldn't have to be true: e.g. Render(Parse(rawHTML))
would convert "a < b" to "<html><head></head><body>a &lt;
b</body></html>".
https://code.google.com/p/go-html-transform/ is my attempt at
something like that. Feel free to crib off of it, modify it, or use
it. It has a dependency on Go's html library though so it comes with a
few caveats, mainly that if your source HTML has JavaScript in it,
the html package has a tendency to fail to parse it.
It's been on my todo list to see about patching the html parser to
support the script insertion modes but I just haven't had time yet.
No, it wouldn't. It only requires solving the halting problem if you
are unwilling to reject any program that does not produce
non-validating HTML.
For example, if you whitelist free variables to disallow unfiltered
access to document, reflective operators like eval, and use of the []
operator with non-numeric arguments, then you can prevent script from
making validating HTML non-validating. http://www.adsafe.org/
It's not worth doing, as I said above: "You should definitely punt on 6 below."
I just listed it as one more example of why "validating HTML" is
rarely a useful property to have.
> With javascript disabled, if n is a *Node (i.e. a document tree) that
> resulted from Parse, then Parse(Render(n)) would result in a clone of
> that tree. (Excuse the pseudo-syntax for connecting an io.Reader to an
> io.Writer).
> The converse wouldn't have to be true: e.g. Render(Parse(rawHTML))
> would convert "a < b" to "<html><head></head><body>a &lt;
> b</body></html>".
You could always implement render to take advantage of
http://www.w3.org/TR/html5/syntax.html#optional-tags :
"Certain tags can be omitted.
Omitting an element's start tag does not mean the element is not
present; it is implied, but it is still there. An HTML document always
has a root html element, even if the string <html> doesn't appear
anywhere in the markup.
An html element's start tag may be omitted if the first thing inside
the html element is not a comment.
An html element's end tag may be omitted if the html element is not
immediately followed by a comment.
A head element's start tag may be omitted if the element is empty, or
if the first thing inside the head element is an element.
..."
That would be just fine for what I have in mind. :-)
On 10/03/2011 10:31 PM, Mike Samuel wrote:
> ... It only requires solving the halting problem if you are unwilling
> to reject any program that does not produce non-validating HTML.
> For example, if you whitelist free variables to disallow unfiltered
> access to document, reflective operators like eval, and use of the []
> operator with non-numeric arguments, then you can prevent script from
> making validating HTML non-validating. http://www.adsafe.org/ It's not
> worth doing, as I said above: "You should definitely punt on 6 below."
Rather than trying to prevent applications from breaking Render() with
JavaScript, I would be happy with a first generation Render() that only
concerns itself with HTML and simply passes through JavaScript (CSS,
whatever) untouched. If that's too hard, then we could write a Render()
that drops JavaScript or anything else that causes problems.
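
As a rough sketch of the pass-through idea: the text-writing step could skip escaping when the parent element is script or style. The names rawTextElements and renderText are hypothetical, and the sketch assumes the script or style text never contains its own closing tag, which a real renderer would have to check.

```go
package main

import (
	"fmt"
	"html"
)

// rawTextElements lists the elements whose text content this sketch
// writes out untouched. Assumption: only script and style need this.
var rawTextElements = map[string]bool{"script": true, "style": true}

// renderText returns text as it would be written inside an element
// named parent: unchanged for script/style, HTML-escaped otherwise.
// It does not guard against "</script>" appearing inside the text.
func renderText(parent, text string) string {
	if rawTextElements[parent] {
		return text
	}
	return html.EscapeString(text)
}

func main() {
	fmt.Println(renderText("script", "if (a < b) { run(); }"))
	fmt.Println(renderText("p", "a < b"))
}
```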
> I just listed it as one more example of why "validating HTML" is
> rarely a useful property to have.
I'm not sure of the exact capabilities of package html Parse(), but I
think it would be useful if the Go/HTML rewriting system accepted a wide
range of HTML input, validating or otherwise. For example, HTML
snippets entered via text editors and/or textarea widgets, HTML
documents created with word processors and HTML editors, and HTML
created by other programming tools (Go, Perl CGI, PHP, Microsoft
ASP/.NET, whatever).
I think it would be useful to pick a specific HTML standard and specific
validation tool for the output of Render(). This will give us a target
to shoot for, and a yardstick for measuring progress. We are free to
choose whichever standard and tool suits us.
> You could always implement render to take advantage of
> http://www.w3.org/TR/html5/syntax.html#optional-tags :
> "Certain tags can be omitted.
> Omitting an element's start tag does not mean the element is not
> present; it is implied, but it is still there. An HTML document always
> has a root html element, even if the string <html> doesn't appear
> anywhere in the markup.
> An html element's start tag may be omitted if the first thing inside
> the html element is not a comment.
> An html element's end tag may be omitted if the html element is not
> immediately followed by a comment.
> A head element's start tag may be omitted if the element is empty, or
> if the first thing inside the head element is an element.
Wouldn't it be easier to have Render() output <html>, <head>, </head>,
<body>, </body>, </html> tags every time? We could add the special
cases to Render() later. (Does Parse() understand them?)
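
To give a feel for what those special cases would involve, here is a sketch of the two start-tag rules quoted above as predicates over a hypothetical Node type (not package html's own). A first-generation Render() that always writes the tags would simply never call them. The sketch ignores attributes; as far as I know, a start tag that carries attributes can never be omitted.

```go
package main

import "fmt"

// NodeType and Node are hypothetical stand-ins for the parsed tree.
type NodeType int

const (
	TextNode NodeType = iota
	ElementNode
	CommentNode
)

type Node struct {
	Type     NodeType
	Data     string
	Children []*Node
}

// canOmitHTMLStartTag applies the quoted rule: the html start tag may
// be omitted if the first thing inside the element is not a comment.
func canOmitHTMLStartTag(htmlEl *Node) bool {
	return len(htmlEl.Children) == 0 || htmlEl.Children[0].Type != CommentNode
}

// canOmitHeadStartTag applies the quoted rule: the head start tag may
// be omitted if head is empty or its first child is an element.
func canOmitHeadStartTag(head *Node) bool {
	return len(head.Children) == 0 || head.Children[0].Type == ElementNode
}

func main() {
	title := &Node{Type: ElementNode, Data: "title"}
	head := &Node{Type: ElementNode, Data: "head", Children: []*Node{title}}
	doc := &Node{Type: ElementNode, Data: "html", Children: []*Node{head}}

	fmt.Println(canOmitHTMLStartTag(doc), canOmitHeadStartTag(head))
}
```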
On 10/03/2011 08:51 PM, Jeremy Wall wrote:
> https://code.google.com/p/go-html-transform/ is my attempt at
> something like that.
On 10/04/2011 07:54 AM, Andy Balholm wrote:
> I'm working on a project that will involve parsing HTML, modifying
> the tree, and writing it back out too.
Great! :-)
David