html5 validation

Eric Hellman

Jan 14, 2022, 4:42:27 PM
to standar...@googlegroups.com
Project Gutenberg is in the process of adopting HTML5 for source files.

One big advantage of HTML5 is the availability of the MUCH more powerful validator at https://validator.w3.org/nu/

As an example, PG XHTML files have many tables that don't meet the HTML4 spec but still pass XML schema validation, because XML validation can't check things like whether table rows have a legal number of cells or whether table cells overlap. The HTML5 validator flags all sorts of errors that are invisible to XML validators.
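
To make that concrete, here is a rough Python sketch of the kind of row-width check that a schema can't express. It is only an illustration, not how the Nu validator works internally: the class and function names are made up, it assumes integer `colspan`/`rowspan` values, and it treats every `<tr>` in the document as part of a single table.

```python
from html.parser import HTMLParser


class TableRowWidths(HTMLParser):
    """Record the effective column width of each <tr> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.widths = []   # effective width of each completed row
        self._carry = {}   # column index -> rows still covered by a cell
        self._col = None   # next candidate column in the open row, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()  # handles a previous row with no </tr>
            self._col = 0
        elif tag in ("td", "th") and self._col is not None:
            attrs = dict(attrs)
            colspan = int(attrs.get("colspan") or 1)
            rowspan = int(attrs.get("rowspan") or 1)
            # Skip columns still occupied by a cell spanning down from above.
            while self._carry.get(self._col, 0) > 0:
                self._col += 1
            for offset in range(colspan):
                self._carry[self._col + offset] = rowspan
            self._col += colspan

    def handle_endtag(self, tag):
        if tag == "tr":
            self._close_row()

    def _close_row(self):
        if self._col is None:
            return
        occupied = [c for c, left in self._carry.items() if left > 0]
        self.widths.append(max(occupied) + 1 if occupied else 0)
        # One row consumed: age every active span by one row.
        self._carry = {c: left - 1 for c, left in self._carry.items() if left > 1}
        self._col = None


def uneven_rows(markup):
    """Return indexes of rows whose width differs from the first row's."""
    parser = TableRowWidths()
    parser.feed(markup)
    parser.close()
    parser._close_row()  # count a final row that was never closed
    expected = parser.widths[0] if parser.widths else 0
    return [i for i, width in enumerate(parser.widths) if width != expected]
```

Calling `uneven_rows()` on a table whose second row is one cell short returns `[1]`, while a table that makes up the difference with a `rowspan` in the first row returns an empty list.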

EPUBCheck doesn't appear to do the advanced HTML5 validation. It doesn't check any CSS values either: you can set "elephant: 4-legged" as a CSS declaration, for example.

Unfortunately the "nu" validator won't let you add important EPUB attributes like 'epub:type', even in the XML serialization. It thus reports errors for all SE XHTML files.

I was wondering if SE folks have thought about taking advantage of full HTML5 validation. I imagine that a customized instance of the nu validator could be stood up to process EPUB-ready XHTML5 to everyone's benefit. We've also thought about doing things like 'data-epub-type' attributes that get converted to 'epub:type' attributes during conversion to EPUB. Or maybe such a thing exists somewhere?
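
A minimal sketch of what that 'data-epub-type' conversion step might look like, using only Python's standard library. The function name is hypothetical, it assumes the input is well-formed XHTML, and `xml.etree.ElementTree` does not preserve the doctype, so a real pipeline would have to re-add it.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
EPUB_NS = "http://www.idpf.org/2007/ops"

# Serialize XHTML as the default namespace and use the conventional
# "epub" prefix for the EPUB namespace.
ET.register_namespace("", XHTML_NS)
ET.register_namespace("epub", EPUB_NS)


def data_attrs_to_epub_type(xhtml):
    """Rename every data-epub-type attribute to a namespaced epub:type."""
    root = ET.fromstring(xhtml)
    for element in root.iter():
        value = element.attrib.pop("data-epub-type", None)
        if value is not None:
            element.set(f"{{{EPUB_NS}}}type", value)
    return ET.tostring(root, encoding="unicode")
```

The idea is that the source files would carry only `data-*` attributes, which the nu validator accepts, and the namespaced attribute would appear only in the generated EPUB.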

Anyway, I would welcome feedback on this.


Eric Hellman
twitter: @gluejar

Alex Cabal

Jan 14, 2022, 5:59:47 PM
to standar...@googlegroups.com
Hi Eric! Our `se recompose-epub` tool actually outputs HTML5 by default.
It also has the `--xhtml` option, which outputs XHTML5 instead. We use
`se recompose-epub --xhtml` when creating the "single page" ebooks you
linked to below, which is why it's served as XHTML5 from the website. In
fact our entire website is served as XHTML5.

All of our ebooks are created using XHTML5 as the source format, so
decomposing them to HTML5 is very easy. All of our ebooks should pass the
validator out of the box when using the `se recompose-epub` tool to
output HTML5. If a particular one doesn't, then we have to correct that
ebook, but I think they will all pass.

For example:

> git clone https://github.com/standardebooks/a-a-milne_the-red-house-mystery
> se recompose-epub --output ebook.html a-a-milne_the-red-house-mystery

(Though now that I actually check, it looks like there is a minor error
with spaces in base64-encoded images, which fails the validator but
still renders. I'll fix this in the next tools release.)

Let me know if you have any questions!

On 1/14/22 4:42 PM, Eric Hellman wrote:
> Project Gutenberg is in the process of adopting HTML5 for source files.
>
> One big advantage of HTML5 is the availability of the MUCH more powerful
> validator at https://validator.w3.org/nu/
>
> As an example, PG XHTML files have many tables that don't meet the HTML4
> spec, but pass XML schema validation because you can't use XML
> validation to do things like check that table rows have a legal number
> of cells or whether table cells overlap. The HTML5 validator flags all
> sorts of errors invisible to XML validators.
>
> EPubCheck doesn't appear to do the advanced HTML5 validation. It doesn't
> check any css values - you can set "elephant: 4-legged" as a css
> property, for example
>
> Unfortunately the "nu" validator won't let you add important EPUB
> attributes like 'epub:type', even in the xml serialization. It thus
> reports errors for all SE xhtml files.
> https://validator.w3.org/nu/?showsource=yes&doc=https%3A%2F%2Fstandardebooks.org%2Febooks%2Fwilliam-shakespeare%2Fthe-taming-of-the-shrew%2Ftext%2Fsingle-page
> (same thing if you upload an xhtml file to invoke the xhtml validator)
>
> I was wondering if SE folks have thought about taking advantage of full
> HTML5 validation. I imagine that a customized instance of the nu
> validator could be stood up to process EPUB ready XHTML5 to everyone's
> benefit. We've also thought about doing things like 'data-epub-type'
> attributes that get converted to epub:type attributes during conversion
> to EPUB. Or maybe such a thing exists somewhere?
>
> Anyway, I would welcome feedback on this.
>
>
> Eric Hellman
> twitter: @gluejar
>
> --
> You received this message because you are subscribed to the Google
> Groups "Standard Ebooks" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to standardebook...@googlegroups.com
> <mailto:standardebook...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/standardebooks/088C2B4B-8AE9-4D30-B86A-D572166A7398%40hellman.net
> <https://groups.google.com/d/msgid/standardebooks/088C2B4B-8AE9-4D30-B86A-D572166A7398%40hellman.net?utm_medium=email&utm_source=footer>.

Alex Cabal

Jan 14, 2022, 6:04:04 PM
to standar...@googlegroups.com
I've fixed the base64 image error in master (commit bbb9d1f), so if you
install the toolset from master, `se recompose-epub` should output 100%
valid HTML5.
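
For anyone hitting the same validator error in their own output, here is a minimal sketch of one way to normalize such data URIs. This is not the change made in bbb9d1f; the function name is hypothetical, and it assumes double-quoted attribute values.

```python
import re

# Matches a base64 data URI up to the closing double quote of its attribute.
DATA_URI = re.compile(r'(data:[^;,"]+;base64,)([^"]*)')


def strip_base64_whitespace(markup):
    """Remove spaces and newlines from inside base64 data URI payloads."""
    return DATA_URI.sub(
        lambda match: match.group(1) + re.sub(r"\s+", "", match.group(2)),
        markup,
    )
```

Whitespace outside of data URIs is left alone, so ordinary text content is unaffected.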

Alex Cabal

Jan 14, 2022, 6:33:24 PM
to standar...@googlegroups.com
Ooh... Now that I'm looking at this, it looks like the Nu Checker
finally has a command line version! This must be a recent development.
We can integrate this into se build or se lint now!

Alex Cabal

Jan 15, 2022, 5:12:23 PM
to standar...@googlegroups.com
Thinking about this more - and I admit that my knowledge of doctypes and
DTDs is a little shaky - I think the reason Nu barfs on epub flavored
XHTML is because it sees `<!doctype html>` and starts validating a
strict XHTML5 document. Of course XHTML5 doesn't have anything to say
about epub's special vocabulary so the validator thinks it's all invalid.

The question then is, what doctype *would* work to validate XHTML5 + epub?

This SO answer sheds a little light:
https://stackoverflow.com/questions/10075019/xhtml5-and-custom-namespaces-not-passing-validation

So it would seem that in order to have Nu validate against XHTML5 + epub
we'd have to specify some kind of custom DTD. However, AFAIK (X)HTML5
only uses <!doctype html> without a DTD.

So, I'm not sure where to go from there. Is there a way for us to say,
"this document should be validated like XHTML5 plus the epub namespace?"
It looks like XML can have internal DTDs:
<https://xmlwriter.net/xml_guide/doctype_declaration.shtml> so maybe a
solution is to insert an internal DTD into an XHTML5 document before
passing to Nu?

If not, then is that a question for the epub people or Nu validator
people? Can we even do this ourselves, or does some standards body need
to create a DTD that gets plugged in to Nu?

Thoughts?

Alex Cabal

Jan 17, 2022, 1:11:18 PM
to standar...@googlegroups.com
I was successfully able to bundle v.Nu with the SE toolset, so now when
we run `se build --check` it will also run v.Nu to validate XHTML5. The
errors related to epub namespaces are easy to ignore before returning
actual errors to the user. This is in commit 0b46231.

Still curious to hear if there's a better way to specify some kind of
XHTML5 + epub DTD when invoking v.Nu. But for now, just discarding epub
namespace errors works fine.
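
As a rough illustration of the "discard epub namespace errors" approach (this is not the actual code from commit 0b46231): the sketch below assumes the checker's report is available as JSON with a top-level `messages` list, and that any message mentioning `epub` is one to drop. The function name and the substring test are assumptions for the example.

```python
import json


def real_messages(vnu_json, ignored_substrings=("epub",)):
    """Return the checker messages that don't mention an ignored substring."""
    report = json.loads(vnu_json)
    return [
        message
        for message in report.get("messages", [])
        if not any(
            text in message.get("message", "") for text in ignored_substrings
        )
    ]
```

The same filter could take extra substrings for any other message that a project decides to ignore. A plain substring match is crude, so a real implementation would probably want to match the exact message patterns instead.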

Eric Hellman

Jan 17, 2022, 4:22:13 PM
to standar...@googlegroups.com
There's probably a way to add the epub schema to a preset - whatever mechanism they use to add MathML 3.0 + RDFa 1.1 should work for the epub additions.

I don't think they use a DTD at all. XML Schema seems more likely based on the error messages. But I haven't dug at all.

The epub people I've talked to say epub:type is supported nowhere whereas aria:type covers a lot of the same ground, is built into the nu validator, and is used by a number of accessibility apps. So I'm leaning towards omitting epub:type in favor of aria:type and maybe the epub namespace serves no useful purpose.


Alex Cabal

Jan 17, 2022, 4:29:42 PM
to standar...@googlegroups.com
Is there really an aria:type, i.e. an aria XML namespace? The
application of aria I've heard of is the `role=` and `aria-*=`
attributes for HTML.

They cover *some* of the same ground, but epub:type has a richer
book-specific vocabulary, and we're making books after all. For example
there's no `epigraph` ARIA role. Unless I'm looking in the wrong place.

I know epub is probably going to lean towards replacing epub:type with
aria and I think that's a pity...

Eric Hellman

Jan 17, 2022, 5:31:54 PM
to standar...@googlegroups.com
right! aria-type. (coming back to this after a few weeks on other projects)

I think that for SE, using epub:* is the right thing to be doing, with the understanding that sooner or later SE will have to move on, and will have done most of the work.

For PG/DP getting folks to start doing the extra work will be the main hill we have to climb, and being able to point at the accessibility benefits will be key.

Alex Cabal

Jan 22, 2022, 11:03:31 AM
to standar...@googlegroups.com
Quick update, we've released a version of the toolset that features v.Nu
validation, and rebuilt the corpus. All ebooks pass! (Note that we
ignore the "consider adding headers to section" warning, because some
sections, like dedications and frontispieces, often don't have headers;
and best practice right now seems to be to rely on the <title> element
in this case.)