CSS4J stripping entities?

Vincent Massol

Oct 7, 2018, 6:52:19 AM
to css4j
Hi Carlos,

We've noticed an issue on XWiki. It seems that after we execute CSS4J we get our HTML entities removed.

Example input to CSS4J:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head>
<title>
  Main - Home
</title>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type" />
<meta content="en" name="language" />

</head><body class="exportbody" id="body" pdfcover="0" pdftoc="0">

<div id="xwikimaincontainer">
<div id="xwikimaincontainerinner">

<div id="xwikicontent">
      <p>Cl&eacute;ment Aubin</p>
          </div>
</div>
</div>

</body></html>

Output after applying CSS4J (0.41.3):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head style="display: none; ">
<title style="display: none; ">
  Main - Home
</title>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type" style="display: none; "/>
<meta content="en" name="language" style="display: none; "/>

</head><body class="exportbody" id="body" pdfcover="0" pdftoc="0" style="display: block; margin-top: 8px; margin-right: 8px; margin-bottom: 8px; margin-left: 8px; unicode-bidi: embed; ">

<div id="xwikimaincontainer" style="display: block; unicode-bidi: embed; ">
<div id="xwikimaincontainerinner" style="display: block; unicode-bidi: embed; ">

<div id="xwikicontent" style="display: block; unicode-bidi: embed; ">
      <p style="display: block; margin-top: 3pt; margin-bottom: 3pt; unicode-bidi: embed; ">Clment Aubin</p>
          </div>
</div>
</div>

</body></html>

Notice the &eacute; which has been removed.

Note: https://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd includes xhtml-lat1.ent, so the &eacute; entity is defined.

Actually our code is:

String applyCSS(String html, String css, XWikiContext context)
{
    LOGGER.debug("Applying the following CSS [{}] to HTML [{}]", css, html);
    try {
        //System.setProperty("org.w3c.css.sac.parser", "org.apache.batik.css.parser.Parser");

        // Prepare the input
        Reader re = new StringReader(html);
        InputSource source = new InputSource(re);
        SAXReader reader = new SAXReader(XHTMLDocumentFactory.getInstance());
        reader.setEntityResolver(new DefaultEntityResolver());
        XHTMLDocument document = (XHTMLDocument) reader.read(source);

        // Set the base URL so that CSS4J can resolve URLs in CSS. Use the current document in the XWiki Context
        document.setBaseURL(new URL(context.getDoc().getExternalURL("view", context)));

        // Apply the style sheet
        document.addStyleSheet(new org.w3c.css.sac.InputSource(new StringReader(css)));
        applyInlineStyle(document.getRootElement());
        OutputFormat outputFormat = new OutputFormat("", false);
        if ((context == null) || (context.getWiki() == null)) {
            outputFormat.setEncoding("UTF-8");
        } else {
            outputFormat.setEncoding(context.getWiki().getEncoding());
        }
        StringWriter out = new StringWriter();
        XMLWriter writer = new XMLWriter(out, outputFormat);
        writer.write(document);
        String result = out.toString();
        // Debug output
        if (LOGGER.isDebugEnabled()) {
            LOGGER.debug("HTML with CSS applied [{}]", result);
        }
        return result;
    } catch (Exception e) {
        LOGGER.warn("Failed to apply CSS [{}] to HTML [{}]", css, html, e);
        return html;
    }
}

We use CSS4J's DefaultEntityResolver class which has xhtml-lat1.ent added:

public DefaultEntityResolver() {
    this.dtdNameToFilename.put("-//W3C//DTD XHTML 1.0 Strict//EN", "w3c/xhtml1-strict.dtd");
    this.dtdNameToFilename.put("-//W3C//DTD XHTML 1.0 Transitional//EN", "w3c/xhtml1-transitional.dtd");
    this.dtdNameToFilename.put("-//W3C//DTD XHTML 1.1//EN", "w3c/xhtml11.dtd");
    this.dtdNameToFilename.put("-//W3C//ENTITIES Latin 1 for XHTML//EN", "w3c/xhtml-lat1.ent");
    this.dtdNameToFilename.put("-//W3C//ENTITIES Symbols for XHTML//EN", "w3c/xhtml-symbol.ent");
    this.dtdNameToFilename.put("-//W3C//ENTITIES Special for XHTML//EN", "w3c/xhtml-special.ent");
    this.dtdNameToURL.put("-//W3C//DTD XHTML 1.0 Strict//EN", "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd");
    this.dtdNameToURL.put("-//W3C//DTD XHTML 1.0 Transitional//EN", "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd");
    this.dtdNameToURL.put("-//W3C//DTD XHTML 1.1//EN", "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd");
    this.dtdNameToURL.put("-//W3C//ENTITIES Latin 1 for XHTML//EN", "http://www.w3.org/TR/xhtml11/DTD/xhtml-lat1.ent");
    this.dtdNameToURL.put("-//W3C//ENTITIES Symbols for XHTML//EN", "http://www.w3.org/TR/xhtml11/DTD/xhtml-symbol.ent");
    this.dtdNameToURL.put("-//W3C//ENTITIES Special for XHTML//EN", "http://www.w3.org/TR/xhtml11/DTD/xhtml-special.ent");
}

WDYT? Is there something we don't do correctly?

Thanks a lot


Vincent Massol

Oct 7, 2018, 10:06:50 AM
to css4j
ok so I've found where the problem is...

If I write:

SAXReader reader = new SAXReader(XHTMLDocumentFactory.getInstance());

Then the entity is stripped.

But if I write:

SAXReader reader = new SAXReader();

Then the entity is not removed.

So it seems there's something in CSS4J's XHTMLDocumentFactory that causes the entity to be stripped.

Still debugging but posting in case you have an idea.

Thanks

Vincent Massol

Oct 8, 2018, 8:18:55 AM
to css4j

admin

Oct 8, 2018, 1:38:46 PM
to cs...@googlegroups.com
(Responding here instead of at the SF ticket, as the details were posted here)

I tested with dom4j 2.1.1 and 2.1.0. My findings:

dom4j 2.1.1:

1) I see the problem with entities that are in "ENTITIES Latin 1" but not with entities in e.g. "ENTITIES Special". If you test with &gt; or &lt; it works.
2) DefaultEntityResolver.resolveEntity is NOT being called, which is intriguing.

dom4j 2.1.0 (I assume prior versions have similar behaviour):

1) Unable to reproduce the issue.
2) DefaultEntityResolver.resolveEntity is being called.

The most interesting thing is that I cannot reproduce your result where plain DOM4J does not filter the entities but CSS4J + DOM4J does.
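
One way to probe the "resolveEntity is NOT being called" observation without dom4j or css4j is a plain JAXP sketch like the one below. It assumes the JDK's built-in Xerces-based parser, which recognises the `load-external-dtd` feature; the class `ExternalDtdDemo`, the `Probe` handler and the one-line stand-in DTD are made up for illustration and are not part of css4j or dom4j. Whether dom4j 2.1.1 actually changes this feature is not established in this thread. The sketch only illustrates that a parser told to skip the external DTD has no reason to consult the entity resolver, so `&eacute;` has no declaration to expand to.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class ExternalDtdDemo {
    // Xerces feature name; assumed to be supported by the parser in use.
    static final String LOAD_EXTERNAL_DTD =
        "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    // Cut-down version of the XHTML from the first message.
    static final String XHTML =
        "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
        + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
        + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
        + "<p>Cl&eacute;ment Aubin</p></body></html>";

    /** Collects text, skipped entity names and the number of resolver calls. */
    static class Probe extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        final StringBuilder skipped = new StringBuilder();
        int resolverCalls;

        @Override
        public InputSource resolveEntity(String publicId, String systemId) {
            resolverCalls++;
            // Hypothetical stand-in for the bundled DTD: declares only eacute.
            return new InputSource(new StringReader("<!ENTITY eacute \"&#233;\">"));
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void skippedEntity(String name) {
            skipped.append(name);
        }
    }

    /** Parses the sample with external DTD loading switched on or off. */
    static Probe parse(boolean loadExternalDtd) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setFeature(LOAD_EXTERNAL_DTD, loadExternalDtd);
            Probe probe = new Probe();
            reader.setContentHandler(probe);
            reader.setEntityResolver(probe);
            reader.parse(new InputSource(new StringReader(XHTML)));
            return probe;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (boolean load : new boolean[] {true, false}) {
            Probe p = parse(load);
            System.out.println("load-external-dtd=" + load
                + " resolverCalls=" + p.resolverCalls
                + " skipped=[" + p.skipped + "]"
                + " text=[" + p.text + "]");
        }
    }
}
```

With the feature enabled, the resolver should be consulted once and the text should come back with the accented character. With it disabled, the resolver should not be consulted at all and the text should lose the character, which matches the symptom reported above. Comparing the two runs against the behaviour of each dom4j version may help decide whether the regression lies in how the underlying XMLReader is configured.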

Tests that pass with dom4j 2.1.0 but fail with 2.1.1 (put these inside css4j's XHTMLDocumentFactoryTest.java unit test in the css4j-dom4j module):

    @Test
    public void testEntities1() throws Exception {
        SAXReader reader = new SAXReader();
        reader.setEntityResolver(new DefaultEntityResolver());
        Reader re = DOMCSSStyleSheetFactoryTest.sampleXHTMLReader();
        org.dom4j.Document dom4jdocument = reader.read(re);
        re.close();
        org.dom4j.Element dom4jelm = dom4jdocument.elementByID("entity");
        assertNotNull(dom4jelm);
        assertEquals("span", dom4jelm.getName());
        assertEquals("<>", dom4jelm.getText());
        assertEquals(2, dom4jelm.nodeCount());
        assertEquals(org.dom4j.Node.TEXT_NODE, dom4jelm.node(0).getNodeType());
        assertEquals("<", dom4jelm.node(0).getText());
        assertEquals(org.dom4j.Node.TEXT_NODE, dom4jelm.node(1).getNodeType());
        assertEquals(">", dom4jelm.node(1).getText());
        //
        XHTMLElement elm = xhtmlDoc.getElementById("entity");
        NodeList nl = elm.getChildNodes();
        assertNotNull(nl);
        assertEquals(2, nl.getLength());
        Node ent0 = nl.item(0);
        assertEquals(Node.TEXT_NODE, ent0.getNodeType());
        assertEquals("<", ent0.getNodeValue());
        Node ent1 = nl.item(1);
        assertEquals(Node.TEXT_NODE, ent1.getNodeType());
        assertEquals(">", ent1.getNodeValue());
    }

    @Test
    public void testEntities2() throws Exception {
        SAXReader reader = new SAXReader();
        reader.setEntityResolver(new DefaultEntityResolver());
        Reader re = DOMCSSStyleSheetFactoryTest.sampleXHTMLReader();
        org.dom4j.Document document = reader.read(re);
        re.close();
        org.dom4j.Element dom4jelm = document.elementByID("entiacute");
        assertNotNull(dom4jelm);
        assertEquals("span", dom4jelm.getName());
        assertEquals("ítem", dom4jelm.getText());
        assertEquals(1, dom4jelm.nodeCount());
        assertEquals(org.dom4j.Node.TEXT_NODE, dom4jelm.node(0).getNodeType());
        assertEquals("ítem", dom4jelm.node(0).getText());
        //
        XHTMLElement elm = xhtmlDoc.getElementById("entiacute");
        assertNotNull(elm);
        assertEquals("span", elm.getTagName());
        String text = elm.getText();
        assertEquals("ítem", text);
        NodeList nl = elm.getChildNodes();
        assertNotNull(nl);
        assertEquals(1, nl.getLength());
        Node ent0 = nl.item(0);
        assertEquals(Node.TEXT_NODE, ent0.getNodeType());
        assertEquals("ítem", ent0.getNodeValue());
    }


In order to run the tests, you need the latest xhtmlsample.html file:



admin

Oct 8, 2018, 1:55:29 PM
to css4j
Those two tests are now in git, in case you want to use the full tree(s). At css4j's homepage https://carte.sourceforge.io/css4j/ there are instructions to build from git.

Vincent Massol

Oct 9, 2018, 4:45:29 AM
to css4j
Issue now reported on dom4j's github issues at https://github.com/dom4j/dom4j/issues/51

admin

Oct 9, 2018, 5:43:56 AM
to css4j
> Tests that pass with dom4j 2.1.0 but fail with 2.1.1

This wording is not accurate: dom4j 2.1.0 passes both tests while 2.1.1 passes the first but fails the second. Hence the statement "If you test with &gt; or &lt; it works".
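
A likely reason the first test is unaffected is that `&lt;` and `&gt;` are among the five entities predefined by XML itself, so a parser can expand them without reading any DTD. The JDK-only sketch below illustrates that; the class name `PredefinedEntityDemo` and the helper `textOf` are made up for this example and belong to neither css4j nor dom4j.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class PredefinedEntityDemo {
    /** Returns the concatenated character data of a small XML document. */
    static String textOf(String xml) {
        try {
            StringBuilder text = new StringBuilder();
            SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void characters(char[] ch, int start, int length) {
                        text.append(ch, start, length);
                    }
                });
            return text.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // No DOCTYPE at all: the predefined entities are still expanded.
        System.out.println(textOf("<span id=\"entity\">&lt;&gt;</span>"));
    }
}
```

Entities such as `&iacute;` or `&eacute;` have no such built-in definition and depend on the DTD being read, which would be consistent with only the second test failing.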