scala> { val t1 = System.currentTimeMillis; val r =
"/tmp/Stanford_University".parseXmlFile; val t2 =
System.currentTimeMillis; println("T:" + (t2 - t1)) }
T:77
Features:
1) Error-correcting; will parse invalid XML and rebalance tags, etc.
2) Zipper for traversing the XML tree structure.
3) A cursor shifting function (on top of the zipper) that logs its
operations (using Writer monad) so they can be inspected after document
transformation has been done.
4) Pretty-printing for displaying an XML document structure.
5) Partial lenses for union types and regular lenses for record types to
ease the composition to perform a single operation. There are lots of
record types in the XML library and a few union types.
6) Example usages to demonstrate its capability. It would be good to add
a couple more to exploit the lenses. I will do this if I get a chance.
--
Tony Morris
http://tmorris.net/
On Thu, 15 Mar 2012 03:42:29 -0400, Haoyi Li wrote:
> I suppose I could use scalaz.xml, or anti-xml, but I was wondering if
> there was some easy way to get the default xml working. It's just for
> some simple web-scraping, not a serious XML processing task, so I was
> hoping to keep third party modules to a minimum. If it continues
> taking 3 minutes per scrape, though, I'll probably go with some other
> xml library.
>> > I'm looking at the source code for a wiki article. In particular,
>> > *this* wiki article:
>> >
>> > http://en.wikipedia.org/wiki/Stanford_University [1]
>> >
>> > It's about 450kb of html and javascript and all that jazz, which is
>> > significant, but it's not overwhelming. However, when i make a web
>> > request, and the time comes for me to do:
>> >
>> > val xml = XML.loadString(response.body)
>> >
>> > it takes on the order of 2 minutes in order to parse the whole thing
>> > into XML (response.body is a string). Does anyone know what could be
>> > causing it? Stranger still, it doesn't seem to be maxing out my CPU
>> > and making my fans spin up, neither does it seem to be making large
>> > numbers of pagefaults. CPU utilization is near zero, so I have no idea
>> > what could be causing it to take so long. Any ideas?
>> >
>> > -Haoyi
>>
>> --
>> Tony Morris
>> http://tmorris.net/ [2]
>
>
>
> Links:
> ------
> [1] http://en.wikipedia.org/wiki/Stanford_University
> [2] http://tmorris.net/
> [3] mailto:tonym...@gmail.com
Have a look at the network usage; SAX is probably trying to download the DTDs etc. Unfortunately there isn't a simple, general way to disable that short of installing an entity resolver or similar.
The document itself is very quick to parse. Scales Xml by default won't do any better, but use another SAX parser (one with DTD loading disabled, like TagSoup) and it's parsed immediately, as you'd expect. See the JIRA issue SI-2725 for more fun.
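For what it's worth, the entity-resolver route mentioned above can be sketched with nothing but the JDK's SAX classes. This is only a rough illustration, not something posted in the thread: the `OfflineSax` object and its `parse` method are hypothetical names, and it assumes the parser consults the handler's `resolveEntity` before fetching the external DTD, which is what the JDK's bundled parser does as far as I know. The resolver hands back an empty document, so nothing is downloaded.

```scala
import java.io.StringReader
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.{Attributes, InputSource}
import org.xml.sax.helpers.DefaultHandler

// Hypothetical helper, for illustration only.
object OfflineSax {
  // Parses `doc` and returns (element names in document order,
  // system IDs the parser asked the resolver for).
  def parse(doc: String): (List[String], List[String]) = {
    val elements  = List.newBuilder[String]
    val requested = List.newBuilder[String]

    val handler = new DefaultHandler {
      // Answer every external-entity request (the DTD included) with an
      // empty stream, so the parser never opens a network connection.
      override def resolveEntity(publicId: String, systemId: String): InputSource = {
        requested += systemId
        new InputSource(new StringReader(""))
      }

      override def startElement(uri: String, localName: String,
                                qName: String, attrs: Attributes): Unit =
        elements += qName
    }

    // SAXParser.parse installs the handler as the entity resolver as well.
    SAXParserFactory.newInstance().newSAXParser()
      .parse(new InputSource(new StringReader(doc)), handler)

    (elements.result(), requested.result())
  }
}
```

One caveat with this trick: because the DTD is replaced by an empty one, a page that relies on entities declared there (`&nbsp;` and friends) will most likely fail to parse, so it is a better fit for documents that only use the predefined XML entities.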
What do you mean by using a naked println, not even wrapped in a monad or
applicative of some kind... ?
Mmmmm, what? Doesn't this work?
import scala.xml.Elem
import scala.xml.factory.XMLLoader
import javax.xml.parsers.SAXParser

// Drop-in replacement for scala.xml.XML's loading methods, using a custom parser.
object MyXML extends XMLLoader[Elem] {
  override def parser: SAXParser = {
    val f = javax.xml.parsers.SAXParserFactory.newInstance()
    f.setNamespaceAware(false)
    // Apache Xerces feature; other SAX implementations may not recognise it.
    f.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true)
    f.newSAXParser()
  }
}
(from http://stackoverflow.com/a/1099921/53013)
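A variation that may be closer to what is wanted here: as far as I know, `disallow-doctype-decl` makes Xerces reject a document that contains a DOCTYPE outright, whereas the `nonvalidating/load-external-dtd` feature just tells it not to fetch the external DTD. The sketch below uses only JDK classes so it can be tried without scala.xml on the classpath; `NoDtdSax` and `elementNames` are hypothetical names, and it assumes a Xerces-derived parser (such as the one bundled with the JDK) that recognises the feature.

```scala
import java.io.StringReader
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.{Attributes, InputSource}
import org.xml.sax.helpers.DefaultHandler

// Hypothetical helper, for illustration only.
object NoDtdSax {
  // Xerces-specific feature: when false, the external DTD is not loaded.
  private val LoadExternalDtd =
    "http://apache.org/xml/features/nonvalidating/load-external-dtd"

  // Returns the element names of `doc` in document order.
  def elementNames(doc: String): List[String] = {
    val factory = SAXParserFactory.newInstance()
    factory.setNamespaceAware(false)
    // Throws if the underlying parser does not know this feature.
    factory.setFeature(LoadExternalDtd, false)

    val names = List.newBuilder[String]
    val handler = new DefaultHandler {
      override def startElement(uri: String, localName: String,
                                qName: String, attrs: Attributes): Unit =
        names += qName
    }

    factory.newSAXParser()
      .parse(new InputSource(new StringReader(doc)), handler)
    names.result()
  }
}
```

The same `setFeature` line could presumably be dropped into the `parser` override of an `XMLLoader[Elem]` like the one above, in place of the `disallow-doctype-decl` line.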
> On Mar 15, 2012 7:30 AM, "Haoyi Li" <haoy...@gmail.com> wrote:
>>
>> I'm looking at the source code for a wiki article. In particular, *this*
>> wiki article:
>>
>> http://en.wikipedia.org/wiki/Stanford_University
>>
>> It's about 450kb of html and javascript and all that jazz, which is
>> significant, but it's not overwhelming. However, when i make a web request,
>> and the time comes for me to do:
>>
>> val xml = XML.loadString(response.body)
>>
>> it takes on the order of 2 minutes in order to parse the whole thing into
>> XML (response.body is a string). Does anyone know what could be causing it?
>> Stranger still, it doesn't seem to be maxing out my CPU and making my fans
>> spin up, neither does it seem to be making large numbers of pagefaults. CPU
>> utilization is near zero, so I have no idea what could be causing it to take
>> so long. Any ideas?
>>
>> -Haoyi
--
Daniel C. Sobral
I travel to the future all the time.
Which is great when you are using a compatible VM or JRE. The "apache" bit in the feature URL is the clue.
Yeah good point. Open a bug.
The MyXML object I created does almost everything that scala.xml.XML
does, except the saving part. So where you'd use XML.loadXXX(...),
you'd use MyXML.loadXXX(...). So the example would be:
val xml = MyXML.loadString(response.body)
Yeah, well, it is based on a Java library, and the Java library has no
standard way of toggling this off. Paulp tried once, and broke Scala
on OpenJDK, for example.
val parser = (new org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl).newSAXParser
val xmlLoader = scala.xml.XML.withSAXParser(parser)
val html = xmlLoader.load(new java.net.URL("http://en.wikipedia.org/wiki/Stanford_University"))
Cheers
-- Brian Schlining
A year later I ran into this problem myself, found this thread while searching for a solution, and tried it. It doesn't seem to work any more.
I think the longevity of this thread is an excellent rebuttal to the people who say "Why not just use the default XML module? It's a built-in!". Some libraries just want to watch the world burn.