Some HTML5 elements missing in html.tssl schema file


markus

Sep 4, 2012, 9:08:11 AM
to tagsoup...@googlegroups.com
Hello,

Apache Tika uses TagSoup for parsing HTML. In Apache Tika we're in the process of making a ContentHandler capable of reading microdata properties. This works very well, except for some HTML5 elements: those elements and their attributes are never passed to the startElement() method of the ContentHandler. We traced the problem to those elements being missing from TagSoup's html.tssl, and as a workaround we used a dirty fix to tell TagSoup to return those elements as well:

        // HTML5 elements that TagSoup's schema does not know about.
        String[] html5Elements = { "article", "aside", "audio", "bdi",
          "command", "datalist", "details", "embed", "summary", "figure",
          "figcaption", "footer", "header", "hgroup", "keygen", "mark",
          "meter", "nav", "output", "progress", "section", "source", "time",
          "track", "video" };

        // Register each element with content model M_ANY so that
        // TagSoup reports it instead of dropping it.
        for (String html5Element : html5Elements) {
          HTML_SCHEMA.elementType(html5Element, HTMLSchema.M_ANY, 255, 0);
        }

We would prefer to have these elements added to the html.tssl configuration, but we're not sure how the schema and elements should be configured.

Thanks,
Markus

[1]: https://issues.apache.org/jira/browse/TIKA-985
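
For readers unfamiliar with the setup, here is a minimal sketch of the kind of ContentHandler described above, assuming it only needs to record elements that carry an itemprop attribute. The class names MicrodataSketch and MicrodataHandler are hypothetical and this is not Tika's actual handler; it uses only the SAX classes bundled with the JDK and calls startElement() by hand rather than going through TagSoup.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class MicrodataSketch {

    /** Hypothetical handler: records "element@itemprop" for each start tag that has itemprop. */
    static class MicrodataHandler extends DefaultHandler {
        final List<String> seen = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // An element the parser never reports can never reach this method,
            // which is why unknown HTML5 elements lose their microdata.
            String prop = atts.getValue("itemprop");
            if (prop != null) {
                seen.add(localName + "@" + prop);
            }
        }
    }

    public static void main(String[] args) {
        MicrodataHandler handler = new MicrodataHandler();

        // Simulate the event a parser would send for <time itemprop="datePublished">.
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "itemprop", "itemprop", "CDATA", "datePublished");
        handler.startElement("http://www.w3.org/1999/xhtml", "time", "time", atts);

        System.out.println(handler.seen);
    }
}
```

If the parser does not know the time element and never emits the event, the handler has nothing to record, which matches the symptom described in this post.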

John Cowan

Sep 4, 2012, 10:45:09 PM
to markus, tagsoup...@googlegroups.com
markus scripsit:

> We would prefer to have these elements added to the html.tssl configuration
> but we're not sure how the schema and elements should be configured.

It would be great to work with you to add these to html.tssl. In order
to know how to extend TSSL, what's needed is the following information
on each element:

Its name (obviously)

The element groups it belongs to

What element groups, if any, constitute its content model

What parent element should be provided if it appears without a suitable
parent (e.g. the parent of p is body, and the parent of body is html)

What attributes it has (other than CDATA #IMPLIED attributes, which need
not be declared)

In the TSSL context, an element group is a choice group. So for example
the M_BLOCK group consists of address, blockquote, center, del, dir,
div, dl, form, h1-h6, hr, ins, menu, ol, p, pre, listing, xmp, table,
ul, and noframes. The elements that can *contain* M_BLOCK elements
(possibly along with other element groups) are body, applet, blockquote,
center, del, div, dd, form, button, fieldset, iframe, ins, map, noscript,
object, td, th, li, and noframes.
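
The per-element information listed above can be sketched as a small, self-contained Java model. This is only an illustration under stated assumptions: the class ContentModelSketch, the group M_FIGCAPTION, the bit values, and the group assignments are all hypothetical and are not copied from html.tssl. Containment is modeled as a bitmask test, where an element may contain a child if its content model shares at least one group with the groups the child belongs to.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ContentModelSketch {

    // Hypothetical group bits; real schema constants and values may differ.
    static final int M_BLOCK = 1 << 0;
    static final int M_INLINE = 1 << 1;
    static final int M_FIGCAPTION = 1 << 2; // invented group, for illustration only

    /** One element: its name, the groups it belongs to, its content model, its default parent. */
    static final class ElementType {
        final String name;
        final int memberOf;
        final int model;
        final String parent;

        ElementType(String name, int memberOf, int model, String parent) {
            this.name = name;
            this.memberOf = memberOf;
            this.model = model;
            this.parent = parent;
        }

        /** True if this element's content model includes any group the child belongs to. */
        boolean canContain(ElementType child) {
            return (model & child.memberOf) != 0;
        }
    }

    /** Builds a tiny illustrative schema; the assignments are assumptions, not html.tssl. */
    static Map<String, ElementType> sampleSchema() {
        Map<String, ElementType> s = new LinkedHashMap<>();
        add(s, "body", 0, M_BLOCK | M_INLINE, "html");
        add(s, "figure", M_BLOCK, M_BLOCK | M_INLINE | M_FIGCAPTION, "body");
        add(s, "figcaption", M_FIGCAPTION, M_BLOCK | M_INLINE, "figure");
        add(s, "h1", M_BLOCK, M_INLINE, "body");
        add(s, "a", M_INLINE, M_INLINE, "body");
        return s;
    }

    private static void add(Map<String, ElementType> s, String name,
                            int memberOf, int model, String parent) {
        s.put(name, new ElementType(name, memberOf, model, parent));
    }

    public static void main(String[] args) {
        Map<String, ElementType> s = sampleSchema();
        System.out.println("figure contains figcaption: "
            + s.get("figure").canContain(s.get("figcaption")));
        System.out.println("body contains figcaption: "
            + s.get("body").canContain(s.get("figcaption")));
        System.out.println("a contains h1: "
            + s.get("a").canContain(s.get("h1")));
    }
}
```

In this sketch, giving figcaption its own group lets it appear only inside figure, and an inline-only content model on a is what would push a block-level heading out of the anchor.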

If you need more information from me, just let me know.

--
We call nothing profound co...@ccil.org
that is not wittily expressed. John Cowan
--Northrop Frye (improved)

markus

Sep 5, 2012, 11:18:23 AM
to tagsoup...@googlegroups.com, markus, co...@mercury.ccil.org
Hi John,

I made a first attempt at adding the missing elements to the schema. Some new groups may be needed to accommodate the multimedia elements video and audio, because the track and source elements can become members of that group. The same may be true for the figcaption element, which can appear only in the figure element, and the footer and header elements can neither be members of nor contain each other.

Most elements have the mixed type but some are empty; I think I got them all correct. I also think I've got most, if not all, attributes correct.

Please improve.

Thanks,
Markus
html5.tssl

markus

May 27, 2013, 4:58:04 AM
to tagsoup...@googlegroups.com
Hi John,

Do you see any chance of incorporating the listed HTML5 elements into the html.tssl file for a future TagSoup version?

Many thanks,
Markus

markus

Jul 25, 2013, 10:39:25 AM
to tagsoup...@googlegroups.com
I've now noticed that I missed allowing h1..h6 inside elements like anchors. Sometimes we see HTML pages where the heading is placed inside an anchor, but TagSoup's current html.tssl forces the heading out of the anchor.