How can an HTML5Parser be integrated?


Richard Gomes

Jun 17, 2012, 11:07:19 PM
to scale...@googlegroups.com
Hello,

I'd like to integrate a parser able to accept malformed XML, namely: HTML.

I found the code below which does the job.
If I understood correctly, it would be necessary to pass this custom parser to scales.xml.XmlParser.loadXml, using a Loaner[SAXParser].
Could someone please clarify how this can be done?

Thanks a lot


This is the code
------------------------

import org.xml.sax.InputSource
import scala.xml._
import parsing._
import java.io.InputStreamReader

case class HTML5Parser() extends NoBindingFactoryAdapter {

  override def loadXML(source : InputSource, _p: SAXParser) = loadXML(source)

  def loadXML(source : InputSource) = {
    import nu.validator.htmlparser.{sax,common}
    import sax.HtmlParser
    import common.XmlViolationPolicy

    val reader = new HtmlParser
    reader.setXmlPolicy(XmlViolationPolicy.ALLOW)
    reader.setContentHandler(this)
    reader.parse(source)
    rootElem
  }
}

Chris Twiner

Jun 18, 2012, 7:30:51 PM
to scale...@googlegroups.com
Hi Richard,

Unfortunately I've not yet integrated XmlReaders, only SAXParsers
(via JAXP); this is something I was going to do in the next release.
However, since you're asking, I'll push out an RC7 that also accepts a
Loaner[XmlReader], hopefully tomorrow night (it also
includes a host of performance improvements).

In the meantime I've added it to the repo
(experiment/treeProxyBuilders branch) revision 333ea03.

Cheers,
Chris

Chris Twiner

Jun 19, 2012, 8:30:48 PM
to scale...@googlegroups.com
Hi Richard,

RC7 is published, with not one, but two posterous updates (wrong
project title in the first :>).

I've also added a test case for nu.validator as well, docs:
http://scala-scales.googlecode.com/svn/sites/scales/scales-xml_2.9.2/0.3-RC7/FullParsing.html#Direct_SAX_XMLReader_Usage

Cheers,
Chris

Richard Gomes

Jun 22, 2012, 8:27:54 PM
to scale...@googlegroups.com
Hi Chris,

Thanks a lot for your prompt answer.
I'm sorry for the delay... I was pulling my hair out over the tests.

In a nutshell, I created the test below.
The expected results are indicated in comments after println(...).


object XPathTest extends App {

    val urlListing = new java.net.URL("http://biz.yahoo.com/p/").openStream

    import scales.utils._
    import ScalesUtils._
    import scales.xml._
    import scales.xml.jaxen._
    import ScalesXml._

    val doc = loadXmlReader(urlListing, parsers = NuValidatorFactoryPool)
    val root = top(doc)
    val xpath = ScalesXPath("/html/body/p[2]/table[2]/tbody/tr/td/table/tbody/tr/td[1]/a/font")

    println(ScalesXPath("//*").evaluate(root).size)             // 477 :: this one works! :)
    println(ScalesXPath("//table").evaluate(root).size)         //   5 :: fails :(
    println(ScalesXPath("//table//table").evaluate(root).size)  //   1 :: fails :(
    println(xpath.evaluate(root).size)                          //   9 :: fails :(

    object NuValidatorFactoryPool extends scales.utils.SimpleUnboundedPool[org.xml.sax.XMLReader] {
      def create = {

        import nu.validator.htmlparser.{sax,common}
        import sax.HtmlParser
        import common.XmlViolationPolicy

        val reader = new HtmlParser
        reader.setXmlPolicy(XmlViolationPolicy.ALLOW)
        reader.setXmlnsPolicy(XmlViolationPolicy.ALLOW)
        reader
      }
    }
}


I'm not obtaining all expected results, which makes me think that jaxen or scales-xml are not processing xpaths properly.

I've noticed that scales-xml links against jaxen 1.1.3 (which, by the way, has strange dependencies on findbugs and cobertura).
I found that jaxen has a newer revision, 1.1.4, which in theory is more correct and robust than 1.1.3.
I then recompiled it, using Maven2 and Gradle, and made sure that I replaced the dependency on jaxen:jaxen:1.1.3 with my own build of jaxen. Unfortunately, the new jaxen made no difference to the results.

For your information:
     git clone https://github.com/frgomes/jaxen
     cd jaxen/jaxen
     mvn clean install  // com.github.frgomes:jaxen:1.1.4


Could you please clarify what I'm doing wrong?

Thanks a lot :)

Richard Gomes

Chris Twiner

Jun 23, 2012, 6:46:15 AM
to scale...@googlegroups.com
I'll have a look, many thanks for the test case :-)

The jaxen dependencies are a bit strange, but unfortunately they don't
publish directly to Maven.

Chris Twiner

Jun 23, 2012, 8:12:41 AM
to scale...@googlegroups.com
As a starting point, it seems that nu.validator is adding a namespace:

elem(root).name // {http://www.w3.org/1999/xhtml}html

val ns = Namespace("http://www.w3.org/1999/xhtml")
assertEquals(5, root.\\*(ns("table")).size)
assertEquals(1, root.\\*(ns("table")).\\*(ns("table")).size)

I'll dig a bit further anyway as I get 10 more elems, and no font
under an anchor.
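
The effect can be reproduced without Scales or Jaxen. The following is only a minimal sketch using the JDK's own XPath 1.0 engine on a hand-written stand-in document; the object name, the `count` helper and the sample markup are made up for illustration and are not part of the thread's test case.

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.xpath.{XPathConstants, XPathFactory}
import org.w3c.dom.NodeList
import org.xml.sax.InputSource

object NamespaceMismatchSketch {
  // Stand-in for the parsed page: every element lives in the XHTML namespace,
  // which is what nu.validator produces.
  val xhtml =
    """<html xmlns="http://www.w3.org/1999/xhtml"><body>""" +
    """<table><tr><td><table/></td></tr></table>""" +
    """</body></html>"""

  // Counts the nodes selected by an XPath 1.0 expression (JDK engine, not Jaxen).
  def count(expr: String): Int = {
    val factory = DocumentBuilderFactory.newInstance()
    factory.setNamespaceAware(true) // keep namespace information in the DOM
    val doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xhtml)))
    val nodes = XPathFactory.newInstance().newXPath()
      .evaluate(expr, doc, XPathConstants.NODESET).asInstanceOf[NodeList]
    nodes.getLength
  }

  def main(args: Array[String]): Unit = {
    println(count("//*"))     // 6: the wildcard matches elements in any namespace
    println(count("//table")) // 0: an unprefixed name only matches elements in no namespace
  }
}
```

This mirrors the pattern in the test case: `//*` succeeds while every path that names an element comes back empty.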

Richard Gomes

Jun 23, 2012, 9:36:38 PM
to scale...@googlegroups.com
Yes, there are nonsensical compile/runtime dependencies on findbugs, cobertura and junit.
It's possible to remove these dependencies with Maven and Gradle, anyway.

I created the jaxen repository on GitHub for the sake of exercise and tests.
If you find it useful I can share some thoughts with you later, but for the moment be aware that I stripped all revisions but HEAD because of the amount of rubbish in source control, such as .jar and .class files.

Thanks

Richard Gomes

Richard Gomes

Jun 24, 2012, 7:18:10 AM
to scale...@googlegroups.com
Hi Chris,

More test cases follow:

    println("=== these tests pass")
    println(root.\\*(ns("table")).size)
    println(root.\\*(ns("table")).\\*(ns("table")).size)
    println(root.\\*(ns("table")).\\*(ns("table")).\\*(ns("tbody")).\\*(ns("tr")).size)
    println(root.\\*(ns("table")).\\*(ns("table")).\\*(ns("tbody")).\\*(ns("tr")).\\*(ns("td")).size)
    println(root.\\*(ns("table")).pos(2).\\*(ns("table")).\\*(ns("tbody")).\\*(ns("tr")).\\*(ns("td")).pos(1).\\*(ns("a")).size)
    println(root.\\*(ns("table")).pos(2).\\*(ns("table")).\\*(ns("tbody")).\\*(ns("tr")).\\*(ns("td")).pos(1).\\*(ns("a")).\\*(ns("font")).size)

    val almost = root.\\*(ns("html")).\\*(ns("body")).\\*(ns("p")).pos(2).\\*(ns("table")).pos(2).\\*(ns("tbody")).\\*(ns("tr")).\\*(ns("td")).\\*(ns("table")).\\*(ns("tbody")).\\*(ns("tr")).\\*(ns("td")).pos(1).\\*(ns("a")).\\*(ns("font"))
    println(almost.size)


    println("=== these tests fail : how do I advance step by step?")
    println(root.*(ns("html")).*(ns("body")).*(ns("p")).*(2).*(ns("table")).*(2).*(ns("tbody")).*(ns("tr")).*(ns("td")).*(ns("table")).*(ns("tbody")).*(ns("tr")).*(ns("td")).*(1).*(ns("a")).*(ns("font")).size)
    println(root.*(ns("html")).*(ns("body")).*(ns("p")).*(2).*(ns("table")).*(2).*(ns("tbody")).*(ns("tr")).*(ns("td")).*(ns("table")).*(ns("tbody")).*(ns("tr")).*(ns("td")).*(1).*(ns("a")).*(ns("font")).text.size)
    println(root.\*(ns("html")).\*(ns("body")).\*(ns("p")).pos(2).\*(ns("table")).pos(2).\*(ns("tbody")).\*(ns("tr")).\*(ns("td")).\*(ns("table")).\*(ns("tbody")).\*(ns("tr")).\*(ns("td")).pos(1).\*(ns("a")).\*(ns("font")).size)


    println("=== failed to obtain text()")
    val result = almost.text
    println(result.size)


It prints:


=== these tests pass
4
1
10
90
9
9
9
=== these tests fail : how do I advance step by step?
0
0
0
=== failed to obtain text()
0

Thanks

-- Richard

Chris Twiner

Jun 25, 2012, 2:10:36 PM
to scale...@googlegroups.com
Hi Richard,

It's very close :-) Individual steps also use \ as per normal XPath,
the only quirks being \+, which allows non-elem or non-attribute nodes to be
matched (I will probably add a \_text etc. for 0.5 to smooth out this
inconsistency), and there not being a document root node. So:

val xpath =
root.*(ns("html")).\*(ns("body")).\*(ns("p")).pos(2).\*(ns("table")).pos(2).\*(ns("tbody")).\*(ns("tr")).
\*(ns("td")).\*(ns("table")).\*(ns("tbody")).\*(ns("tr")).\*(ns("td")).pos(1).\*(ns("a")).\*(ns("font"))

val xpath_with_star_pos =
root.*(ns("html")).\*(ns("body")).\*(ns("p")).*(2).\*(ns("table")).*(2).\*(ns("tbody")).\*(ns("tr")).
\*(ns("td")).\*(ns("table")).\*(ns("tbody")).\*(ns("tr")).\*(ns("td")).*(1).\*(ns("a")).\*(ns("font"))

Both will return 9 fonts. As you noticed, \*[1] is \*(1) and
predicates are simply chained, which is why ....\*(ns("font")).text
returns 0: an elem is not a text node. ....\*(ns("font")).\+.text
then returns the 9 text children.

With the Jaxen interface that's then:

val strPath = ScalesXPath("/ns:html/ns:body/ns:p[2]/ns:table[2]/ns:tbody/ns:tr/ns:td/ns:table/ns:tbody/ns:tr/ns:td[1]/ns:a/ns:font/text()",
ns.prefixed("ns"))

// all the 9 text nodes
val strRes = strPath.evaluate(root)
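
The same two points (a prefix bound to the XHTML namespace, and `text()` selecting text nodes rather than elements) can be tried with nothing but the JDK. This is only a sketch under assumptions: it uses `javax.xml.xpath` instead of ScalesXPath, and the object name, `select` helper and sample markup are invented for illustration.

```scala
import java.io.StringReader
import javax.xml.XMLConstants
import javax.xml.namespace.NamespaceContext
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.xpath.{XPathConstants, XPathFactory}
import org.w3c.dom.{Node, NodeList}
import org.xml.sax.InputSource

object PrefixedXPathSketch {
  val XhtmlNs = "http://www.w3.org/1999/xhtml"

  // Hypothetical sample document, all elements in the XHTML namespace.
  val page =
    """<html xmlns="http://www.w3.org/1999/xhtml"><body>""" +
    """<a href="#"><font>Energy</font></a><a href="#"><font>Utilities</font></a>""" +
    """</body></html>"""

  // Binds the arbitrary prefix "ns" to the XHTML namespace.
  object XhtmlContext extends NamespaceContext {
    def getNamespaceURI(prefix: String): String =
      if (prefix == "ns") XhtmlNs else XMLConstants.NULL_NS_URI
    def getPrefix(uri: String): String =
      if (uri == XhtmlNs) "ns" else null
    def getPrefixes(uri: String): java.util.Iterator[String] = {
      val prefixes = new java.util.ArrayList[String]()
      if (uri == XhtmlNs) prefixes.add("ns")
      prefixes.iterator()
    }
  }

  // Evaluates an XPath 1.0 expression and returns the selected nodes in document order.
  def select(expr: String): Seq[Node] = {
    val factory = DocumentBuilderFactory.newInstance()
    factory.setNamespaceAware(true)
    val doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(page)))
    val xpath = XPathFactory.newInstance().newXPath()
    xpath.setNamespaceContext(XhtmlContext)
    val nodes = xpath.evaluate(expr, doc, XPathConstants.NODESET).asInstanceOf[NodeList]
    (0 until nodes.getLength).map(i => nodes.item(i))
  }

  def main(args: Array[String]): Unit = {
    // Element nodes: two <font> elements.
    println(select("/ns:html/ns:body/ns:a/ns:font").size)
    // Text nodes: the character data inside those elements.
    println(select("/ns:html/ns:body/ns:a/ns:font/text()").map(_.getNodeValue))
  }
}
```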

hope that helps,

Cheers,
Chris

Richard Gomes

Jun 25, 2012, 7:48:21 PM
to scale...@googlegroups.com
Hi Chris,

Thanks a lot for your response.

I'm interested in the XPath representation, available via ScalesXPath.


val strPath = ScalesXPath("/ns:html/ns:body/ns:p[2]/ns:table[2]/ns:tbody/ns:tr/ns:td/ns:table/ns:tbody/ns:tr/ns:td[1]/ns:a/ns:font/text()",
ns.prefixed("ns"))

But I'd like to get rid of all those 'extra' ns: prefixes on every step.
Any idea how this can be done?

Thanks

Richard Gomes

Chris Twiner

Jun 26, 2012, 3:52:58 AM
to scale...@googlegroups.com

Hi Richard,

This is a basic tenet of xpaths, they are always namespace aware.  Which leaves changing the parsing itself.

Alas I have also not been able to configure Nu.validator to turn this off.  Perhaps using JTagSoup or other similar libraries will be a better fit.  You can include them as Loaners in the same way.

Cheers,
Chris

Richard Gomes

Jun 27, 2012, 3:53:23 AM
to scale...@googlegroups.com
Hi Chris,

I understand what the dogma is but, from the point of view of usability, I'd say that the majority of users would like to simply ignore namespaces.
Just my humble opinion.

Thanks
-- Richard

Bart Schuller

Jun 27, 2012, 6:54:53 AM
to scale...@googlegroups.com
Actually,

What I'd really like is XPath2, where it's possible to configure the interpretation of unprefixed QNames so that we can have our cake and eat it too. Just configure once that we're working with the xhtml namespace. This is probably easier done using the string-based xpaths than with the DSL, but finding a good library will be a problem (Jaxen has never been upgraded to XPath2).

Regards,

Bart.

Chris Twiner

Jun 27, 2012, 2:16:52 PM
to scale...@googlegroups.com
I understand the sentiment, I've hacked JXPath to add a default ""
remapping. I've also added a transformer from _:localname to mean any
namespace for string xpaths.

Both were made to meet users' expectations, but the dogma was there for
a good reason: the result was then very surprising for other users.

The good thing about the _: syntax was that it's not standard XPath, so
there was no mistaking it.

If this is really something of interest I could do a \*:* for the DSL
(the string path transformation is proprietary though) as a shortcut
to \*( hasLocalNameX("local") ). Or if there is a shortcut you'd
prefer we can try it out.

Chris Twiner

Jun 27, 2012, 2:30:39 PM
to scale...@googlegroups.com
Crazily, this just gave me an idea. The Jaxen API is pretty well split
out. I could allow a re-wrapping for namespaces to equal "", a simple
function to override local lookup.

I've had the RC7 out for a week now, and not seen issues with it, so I
may push a final out and then try this.

So for example:

val path = ScalesXPath("/local/anotherlocal", nslist).withLocalOnly{
namespace-uri => true }

would then match any namespace, or you could configure a default ""
re-mapping instead.

Richard Gomes

Jun 27, 2012, 5:05:44 PM
to scale...@googlegroups.com
I will not pretend I understand what you said. I don't understand the inner details and implications.
What I know about XPath is that I can use it in Firefox via plugins for Firebug. I also developed an application in Python using the lxml library which is full of XPath specs. I never had to care about namespaces. I believe most users never cared about namespaces.

For this reason, I dare say the default should be 'no-namespaces'.
IMHO, if a user needs namespaces, then the namespace should be specified as a parameter to ScalesXPath.
Otherwise not. If you don't need namespaces, simply ignore the matter: no parameter, no additional code, nothing.

Thanks

-- Richard Gomes

Chris Twiner

Jun 29, 2012, 4:01:35 AM
to scale...@googlegroups.com

I will put a branch out tomorrow to demonstrate what I mean.  However I'll not break XPath usage by default, especially not given my previous experience of doing that.

If you are using a document without explicit namespaces you won't need prefixes.  This is also detailed on the lxml site for example.

Nu.validator adds a namespace and forces a standards-compliant library (such as Jaxen) to need that namespace (or to test via the local-name function), which is what you are seeing. (According to their docs, it would seem to be a bug that you can't turn this off.)
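
As a rough illustration of the local-name route, here is a minimal sketch that again uses the JDK's XPath 1.0 engine on an invented sample rather than Jaxen and the real page; the object name and the `count` helper are hypothetical. The `local-name()` function itself is standard XPath 1.0.

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.xpath.{XPathConstants, XPathFactory}
import org.w3c.dom.NodeList
import org.xml.sax.InputSource

object LocalNameSketch {
  // Invented sample: nested tables, all in the XHTML namespace.
  val xhtml =
    """<html xmlns="http://www.w3.org/1999/xhtml"><body>""" +
    """<table><tr><td><table/></td></tr></table>""" +
    """</body></html>"""

  // Counts the nodes selected by an XPath 1.0 expression.
  def count(expr: String): Int = {
    val factory = DocumentBuilderFactory.newInstance()
    factory.setNamespaceAware(true)
    val doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xhtml)))
    val nodes = XPathFactory.newInstance().newXPath()
      .evaluate(expr, doc, XPathConstants.NODESET).asInstanceOf[NodeList]
    nodes.getLength
  }

  def main(args: Array[String]): Unit = {
    // No prefixes and no namespace bindings: match on the local part only.
    println(count("//*[local-name()='table']"))                             // 2
    println(count("//*[local-name()='table']//*[local-name()='table']"))    // 1
  }
}
```

The price is verbosity: every step needs its own predicate, which is the inconvenience the prefix-free proposals in this thread try to remove.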

Currently on the master branch there is a tagsoup loaner*, which allows for non-namespaced HTML parsing. The XPaths work; it's just that the tbody elements don't exist.

* it required an abstraction over reading sax properties, so it isn't compatible with RC 7, but will be in the final.

Chris Twiner

Jun 29, 2012, 4:25:07 PM
to scale...@googlegroups.com
The code is on experiment/ignoreNamespaces.

Using the example (and the nu.validator):

val noPrefixesPath =
ScalesXPath("/html/body/p[2]/table[2]/tbody/tr/td/table/tbody/tr/td[1]/a/font").
withNameConversion(ScalesXPath.localOnly)

// with tagsoup
// /html/body/p[2]/table[2]/tr/td/table/tr/td[1]/a/font

val xpath = root.*:*("html").\*:*("body").\*:*("p").pos(2).\*:*("table").pos(2).\*:*("tbody").\*:*("tr").
\*:*("td").\*:*("table").\*:*("tbody").\*:*("tr").\*:*("td").pos(1).\*:*("a").\*:*("font")

The *:*(X) would imply any namespace/prefix with the local name X.
And the withNameConversion takes a QName => QName so it can be used to
coerce qnames as well as simply stripping namespaces and prefixes.
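
To make the QName => QName idea concrete, here is a small sketch of what such conversion functions could look like. It is an assumption-laden illustration: it uses the JDK's `javax.xml.namespace.QName` rather than Scales' own QName type, and the object and function names are hypothetical, not the library's API.

```scala
import javax.xml.namespace.QName

object NameConversionSketch {
  // Hypothetical "local only" conversion: drop namespace and prefix, keep the local part.
  val localOnly: QName => QName =
    q => new QName(q.getLocalPart)

  // Hypothetical coercion: move names without a namespace into a default one.
  def defaultTo(ns: String): QName => QName =
    q => if (q.getNamespaceURI.isEmpty) new QName(ns, q.getLocalPart) else q

  def main(args: Array[String]): Unit = {
    val xhtmlTable = new QName("http://www.w3.org/1999/xhtml", "table", "h")
    println(localOnly(xhtmlTable))                                            // table
    println(defaultTo("http://www.w3.org/1999/xhtml")(new QName("table")))    // {http://www.w3.org/1999/xhtml}table
  }
}
```

Applying one such function to both the names in the path and the names in the document before comparing them is one way an engine could ignore namespaces without changing XPath semantics by default.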

It's simple enough that I'm thinking of including it in the 0.3 final.