Scala XML taking horrendously long to parse XHTML file?

Haoyi Li

Mar 15, 2012, 2:29:51 AM

to scala-user

I'm looking at the source code for a wiki article. In particular, *this* wiki article:

http://en.wikipedia.org/wiki/Stanford_University

It's about 450 KB of HTML and JavaScript and all that jazz, which is significant but not overwhelming. However, when I make a web request and the time comes for me to do:

val xml = XML.loadString(response.body)

it takes on the order of 2 minutes to parse the whole thing into XML (response.body is a string). Does anyone know what could be causing it? Stranger still, it doesn't seem to be maxing out my CPU and making my fans spin up, nor does it seem to be causing large numbers of page faults. CPU utilization is near zero, so I have no idea what could be making it take so long. Any ideas?

-Haoyi

Tony Morris

Mar 15, 2012, 2:39:49 AM

to scala...@googlegroups.com

You might like to use scalaz.xml (on scalaz-seven branch):

scala> {
     |   val t1 = System.currentTimeMillis
     |   val r = "/tmp/Stanford_University".parseXmlFile
     |   val t2 = System.currentTimeMillis
     |   println("T:" + (t2 - t1))
     | }
T:77

Features:
1) Error-correcting; will parse invalid XML and rebalance tags, etc.
2) Zipper for traversing the XML tree structure.
3) A cursor shifting function (on top of the zipper) that logs its
operations (using Writer monad) so they can be inspected after document
transformation has been done.
4) Pretty-printing for displaying an XML document structure.
5) Partial lenses for union types and regular lenses for record types to
ease the composition to perform a single operation. There are lots of
record types in the XML library and a few union types.
6) Example usages to demonstrate its capability. It would be good to add
a couple more to exploit the lenses. I will do this if I get a chance.

--
Tony Morris
http://tmorris.net/

Haoyi Li

Mar 15, 2012, 3:42:29 AM

to tmo...@tmorris.net, scala...@googlegroups.com

I suppose I could use scalaz.xml, or anti-xml, but I was wondering if there was some easy way to get the default xml working. It's just for some simple web-scraping, not a serious XML processing task, so I was hoping to keep third party modules to a minimum. If it continues taking 3 minutes per scrape, though, I'll probably go with some other xml library.

Haoyi Li

Mar 15, 2012, 4:14:57 AM

to tmo...@tmorris.net, scala...@googlegroups.com

Update: I just tried anti-xml, and it stack-overflowed when I had it parse that big HTML page =( I guess I'm going to have to figure out scalaz7's XML library and give it a shot.

HKjolhede

Mar 15, 2012, 4:21:53 AM

to scala-user

You could take a look at TagSoup. It's a Java library, but pretty
fast. The resulting DOM can then be queried with XPath.


Henrik Kjölhede

Mar 15, 2012, 3:52:09 AM

to scala...@googlegroups.com

You could try TagSoup (ccil.org/~cowan/XML/tagsoup). Combined with
XPath it is pretty fast and easy to use.


Tony Morris

Mar 15, 2012, 4:25:35 AM

to Haoyi Li, scala...@googlegroups.com

Let me know if you need a hand. I will be on IRC in about 45 minutes.

irc://freenode.net/#scalaz

Chris Twiner

Mar 15, 2012, 5:54:09 AM

to Haoyi Li, scala-user

Have a look at the network usage; SAX is probably trying to download the DTDs etc. Unfortunately there isn't a simple, general way to disable that, short of entity resolvers etc.

The document itself is very quick to parse. Scales Xml by default won't do any better, but use another SAX parser (with DTD loading disabled, like TagSoup) and it's parsed immediately, as you'd expect. See the JIRA issue SI-2725 for more fun.
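For reference, a minimal sketch of the entity-resolver route mentioned above, using only the JDK's SAX API from Scala. The names `NoDtdFetch` and `CountingHandler` are made up for illustration. The handler answers every external-entity lookup with an empty stream, so the parser has no reason to contact w3.org. It assumes the page contains no entity references (such as `&nbsp;`) that the skipped DTD would have declared.

```scala
import java.io.{ByteArrayInputStream, StringReader}
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.{Attributes, InputSource}
import org.xml.sax.helpers.DefaultHandler

// Hypothetical helper, not part of scala.xml: counts elements while
// refusing to fetch any external DTD or entity.
object NoDtdFetch {
  final class CountingHandler extends DefaultHandler {
    var elements = 0

    override def startElement(uri: String, localName: String,
                              qName: String, attrs: Attributes): Unit =
      elements += 1

    // Called for the external DTD subset; an empty source stands in for it,
    // so no connection to the system identifier is opened.
    override def resolveEntity(publicId: String, systemId: String): InputSource =
      new InputSource(new StringReader(""))
  }

  def countElements(markup: String): Int = {
    val handler = new CountingHandler
    val parser = SAXParserFactory.newInstance().newSAXParser()
    // SAXParser.parse installs the handler as content handler and entity resolver.
    parser.parse(new ByteArrayInputStream(markup.getBytes("UTF-8")), handler)
    handler.elements
  }
}
```

The same `resolveEntity` override could presumably be placed on whatever handler a real scraper uses; the counting is only there to give the sketch something observable to do.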

Haoyi Li

Mar 15, 2012, 10:48:27 AM

to Chris Twiner, scala-user

Ahhhh, downloading DTDs. That may be the key here, given that trying to get the damn

http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd

manually through Chrome from the w3.org website takes on the order of minutes! I knew there must have been some blocking IO going on (given the mysteriously slow rates and the abysmal CPU usage) but was dumbfounded as to where "blocking IO" could fit into parsing XML, particularly of the "blocking for !two minutes! IO" kind.

Could I just download the file locally and munge the DOCTYPE via regexes to point to the local file to speed it up? I'm not that familiar with the intricacies of XML/HTML/XHTML, but even downloading that 31.3 file off my Dropbox for example would not take two minutes. Why the hell does getting it from W3 take two minutes anyway?
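A rough sketch of that DOCTYPE-munging idea, assuming the page uses one of the XHTML 1.0 DTDs under http://www.w3.org/TR/xhtml1/DTD/. `LocalDoctype` and the `file:///opt/dtd/` directory are hypothetical. Note that xhtml1-transitional.dtd pulls in three `.ent` files by relative path, so those would need to sit in the same local directory.

```scala
import scala.util.matching.Regex

// Hypothetical helper: points an XHTML 1.0 DOCTYPE at a local copy of the DTD.
object LocalDoctype {
  // Group 1: the DOCTYPE up to the opening quote of the system identifier.
  // Group 2: the DTD file name plus its closing quote.
  private val W3cSystemId: Regex =
    """(<!DOCTYPE[^>]*?")http://www\.w3\.org/TR/xhtml1/DTD/([^"]+")""".r

  // Returns the markup unchanged when no matching DOCTYPE is found.
  def pointAt(markup: String, localDirUri: String): String =
    W3cSystemId.replaceFirstIn(
      markup,
      "$1" + Regex.quoteReplacement(localDirUri) + "$2")
}
```

Usage would be something like `XML.loadString(LocalDoctype.pointAt(response.body, "file:///opt/dtd/"))`. Only the first DOCTYPE is rewritten, and a DOCTYPE with an internal subset is not handled.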

Thanks for the offer, Tony. Unfortunately I went to bed shortly after sending that; I'll look you up if I end up needing help!

As it stands now, I used Scala Futures to push through a whole bunch of the XHTML pages at once. It didn't speed things up 1730 times (I have about that many pages, and I *think* I put them to run in parallel) but it did speed things up considerably. With any luck, in an hour or so I'll have scraped (scrapped?) all the data I need for now, so any optimization would be for future reference.
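A sketch of what that kind of fan-out might look like with today's `scala.concurrent` API (which is not necessarily what was used here). `ParallelScrape` and the `scrape` function are placeholders. It also hints at why the speed-up stays well short of 1730x: at most `parallelism` pages are in flight at any moment.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object ParallelScrape {
  // `scrape` is a placeholder for "fetch and parse one page".
  // Results come back in the same order as `urls`.
  def scrapeAll[A](urls: Seq[String], parallelism: Int)(scrape: String => A): Seq[A] = {
    // A fixed pool caps how many blocking requests run at once.
    val pool = Executors.newFixedThreadPool(parallelism)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try Await.result(Future.traverse(urls)(url => Future(scrape(url))), Duration.Inf)
    finally pool.shutdown()
  }
}
```

Since each parse spends most of its time blocked on IO, a dedicated pool sized for blocking work seems a safer assumption than the default global execution context.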

-Haoyi

Simon Ochsenreither

Mar 15, 2012, 10:57:07 AM

to scala...@googlegroups.com, Chris Twiner

The slowness is by design. Application developers are supposed to cache the schemas they plan to use and ship their applications with a catalog of them.

See http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic/ for more information.
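One way to approximate such a catalog with plain SAX is an `EntityResolver` that maps public identifiers to local copies. This is only a sketch: `BundledDtdResolver` is a made-up name, and where the local copies live is up to the application.

```scala
import java.net.URL
import org.xml.sax.{EntityResolver, InputSource}

// Hypothetical resolver: serves DTDs from copies shipped with the application.
// `catalog` maps a DOCTYPE public identifier to the location of the local copy.
final class BundledDtdResolver(catalog: Map[String, URL]) extends EntityResolver {
  override def resolveEntity(publicId: String, systemId: String): InputSource =
    catalog.get(publicId) match {
      // Using the local URL as the system id lets relative references inside
      // the DTD (its .ent files) resolve next to the local copy.
      case Some(local) => new InputSource(local.toString)
      // null tells the parser to fall back to the original system identifier.
      case None => null
    }
}
```

It would be installed on the underlying reader with `setEntityResolver` before parsing. Anything missing from the catalog still goes out to the network, so the map needs an entry for every DTD the application expects to meet.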

Haoyi Li

Mar 15, 2012, 11:14:19 AM

to Simon Ochsenreither, scala...@googlegroups.com, Chris Twiner

I suppose that makes sense. Does scala.xml come with a catalog of schemas then? It seems to me that it's going to be a pretty common concern (a huge number of XML documents will have a DTD) and having everyone either

A) be incredibly slow or

B) write their own DTD caching infrastructure

seems like the wrong way of doing things. What's the correct way of doing this? Surely parsing XML with DTDs is a topic that's been beaten to death.

-Haoyi

Razvan Cojocaru

Mar 15, 2012, 11:43:41 AM

to tmo...@tmorris.net, scala...@googlegroups.com

Tony - that is totally not FP. Now that I read the first chapter and I know
what it is :)

What do you mean by using a naked println, not even wrapped in a monad or
applicative of some kind... ?

Josh Suereth

Mar 15, 2012, 12:14:24 PM

to Haoyi Li, Simon Ochsenreither, scala...@googlegroups.com, Chris Twiner

I think that'd be a great idea. The XML library is short of contributors/maintainers, if anyone wants to step up and help out.

- Josh

Daniel Sobral

Mar 15, 2012, 12:21:43 PM

to Chris Twiner, Haoyi Li, scala-user

On Thu, Mar 15, 2012 at 06:54, Chris Twiner <chris....@gmail.com> wrote:
> Have a look at the network usage, sax is probably trying to download the
> dtds etc. Unfortunately there isn't a simple way to generally disable it
> outside of entity resolvers etc.

Mmmmm, what? Doesn't this work?

import scala.xml.Elem
import scala.xml.factory.XMLLoader
import javax.xml.parsers.SAXParser

object MyXML extends XMLLoader[Elem] {
  // Supply our own SAX parser instead of the loader's default one.
  override def parser: SAXParser = {
    val f = javax.xml.parsers.SAXParserFactory.newInstance()
    f.setNamespaceAware(false)
    // Apache Xerces feature URI; other SAX implementations may not recognise it.
    f.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true)
    f.newSAXParser()
  }
}

(from http://stackoverflow.com/a/1099921/53013)


--
Daniel C. Sobral

I travel to the future all the time.

Chris Twiner

Mar 15, 2012, 2:29:36 PM

to Daniel Sobral, scala-user, Haoyi Li

Which is great when you are using a compatible VM or JRE. The "apache" bit is the clue.

Haoyi Li

Mar 15, 2012, 5:23:01 PM

to Chris Twiner, Daniel Sobral, scala-user

@Daniel: sorry if I don't understand you; my experience in this field basically involves some automated regex-based web scraping/crawling, some simple XML construction/deconstruction using Scala's XML stuff, and a year and a half working with HTML on web pages.

I have absolutely no idea how the code you gave fits into the "i wants to extract data from HTML page" process; could you elaborate on how it would fit into the rest of Scala's XML machinery? Or give a full example of taking a simple XML string with a nasty slow-loading DTD and converting it into an XML data structure? I'm feeling way out of my depth here D=

Thanks!

-Haoyi

Tony Morris

Mar 15, 2012, 5:38:15 PM

to Razvan Cojocaru, scala...@googlegroups.com

Yeah good point. Open a bug.

Daniel Sobral

Mar 15, 2012, 8:22:56 PM

to Haoyi Li, Chris Twiner, scala-user

On Thu, Mar 15, 2012 at 18:23, Haoyi Li <haoy...@gmail.com> wrote:
> @Daniel: sorry if I don't understand you; my experience in this field
> basically involves some automated regex-based web scraping/crawling, some
> simple XML construction/deconstruction using Scala's XML stuff, and a year
> and a half working with HTML on web pages.
>
> I have completely no idea how the code you gave fits into the "i wants to
> extract data from HTML page" process; could you elaborate on how it would
> fit in to the rest of scala's XML machinery? Or a full example of taking a
> simple XML string with a nasty slow-loading DTD and converting it into a XML
> data structure? I'm feeling way out of my depth here D=

The MyXML object I created does almost everything that scala.xml.XML
does, except the saving part. So where you'd use XML.loadXXX(...),
you'd use MyXML.loadXXX(...). So the example would be:

val xml = MyXML.loadString(response.body)

Haoyi Li

Mar 15, 2012, 8:46:15 PM

to Daniel Sobral, Chris Twiner, scala-user

Ok, I'll give it a shot! But mostly just out of interest. For now I went and used JSoup, which seems to work perfectly well and pretty fast (I guess anything seems fast compared to 2-minute web requests).

I'm still rather confuzzled as to why the default behavior of Scala's XML is so far from optimal; I thought the point of having this stuff built in (with literals too!) was that I wouldn't need to go hunting for third-party libraries when tasks like these come up.

-Haoyi

Daniel Sobral

Mar 15, 2012, 10:17:52 PM

to Haoyi Li, Chris Twiner, scala-user

On Thu, Mar 15, 2012 at 21:46, Haoyi Li <haoy...@gmail.com> wrote:
> Ok, I'll give it a shot! But mostly just out of interest. For now I went and
> used JSoup, which seems to work perfectly well and pretty fast (i guess
> anything seems fast compared to 2 minute web requests).
>
> I'm still rather confuzzled as to why the default behavior for Scala's XML
> to be so far from optimal; I thought the point of having this stuff in built
> (with literals too!) was that I wouldn't need to go hunting for third party
> libraries when tasks like these come up.

Yeah, well, it is based on a Java library, and the Java library has no
standard way of toggling this off. Paulp tried once, and broke Scala
on OpenJDK, for example.

Matthew Pocock

Mar 16, 2012, 7:44:46 AM

to Daniel Sobral, Chris Twiner, Haoyi Li, scala-user

I had to set http://apache.org/xml/features/nonvalidating/load-external-dtd to make SVG parse in sane time without blocking. It's a PITA.

M

--

Dr Matthew Pocock

Integrative Bioinformatics Group, School of Computing Science, Newcastle University

mailto: turingate...@gmail.com

gchat: turingate...@gmail.com

msn: matthew...@yahoo.co.uk

irc.freenode.net: drdozer

skype: matthew.pocock

tel: (0191) 2566550

mob: +447535664143

Brian Schlining

Mar 16, 2012, 12:21:55 PM

to scala-user

>
> It's about 450kb of html and javascript and all that jazz, which is significant, but it's not overwhelming. However, when i make a web request, and the time comes for me to do:
>
> val xml = XML.loadString(response.body)
>

TagSoup!! http://home.ccil.org/~cowan/XML/tagsoup/

val parser = (new org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl).newSAXParser
val xmlLoader = scala.xml.XML.withSAXParser(parser)
val html = xmlLoader.load(new java.net.URL("http://en.wikipedia.org/wiki/Stanford_University"))

Cheers

-- Brian Schlining

Franklin Chen

Jul 31, 2013, 5:35:10 PM

to scala...@googlegroups.com, Chris Twiner, Haoyi Li

A year later, upon encountering this problem myself and searching online for a solution, I found this, and tried it and it doesn't seem to work any more.

Alex Cruise

Jul 31, 2013, 11:26:52 PM

to Franklin Chen, scala-user, Chris Twiner, Haoyi Li

On Wed, Jul 31, 2013 at 2:35 PM, Franklin Chen <franklin...@gmail.com> wrote:

> A year later, upon encountering this problem myself and searching online for a solution, I found this, and tried it and it doesn't seem to work any more.

Did you try adding f.setValidating(false)? The W3C doesn't like people retrieving the DTDs over and over and over again, so they put them behind a very long delay. When the parser is set to non-validating, hopefully it won't bother to try to load the DTD.

If that doesn't work, there's also a "feature" "http://apache.org/xml/features/nonvalidating/load-external-dtd".
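Putting those two suggestions together, a loader along these lines might do the trick. It is a sketch only: `OfflineXML` is a made-up name, it assumes the `XMLLoader` trait still exposes an overridable `parser` as in the code earlier in the thread, and the feature URI is the Xerces-specific one, so `setFeature` can throw on parsers that do not recognise it.

```scala
import javax.xml.parsers.{SAXParser, SAXParserFactory}
import scala.xml.Elem
import scala.xml.factory.XMLLoader

// Hypothetical loader: same API as scala.xml.XML for loading, but the
// underlying parser is told not to validate and not to load external DTDs.
object OfflineXML extends XMLLoader[Elem] {
  override def parser: SAXParser = {
    val f = SAXParserFactory.newInstance()
    f.setNamespaceAware(false)
    f.setValidating(false)
    // Xerces-specific; throws on parsers that do not recognise the feature.
    f.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false)
    f.newSAXParser()
  }
}
```

It would then be used as `OfflineXML.loadString(response.body)` in place of `XML.loadString(response.body)`.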

-0xe1a

RJ Regenold

Aug 7, 2013, 10:03:23 AM

to scala...@googlegroups.com, Chris Twiner, Haoyi Li

I just ran into this as well. Here is how I ended up getting around DTD validation (note: I'm using scales 0.6.0-M1 and scalaz 7.0.1):

https://gist.github.com/rjregenold/6174166

Haoyi Li

Aug 7, 2013, 11:38:49 AM

to RJ Regenold, scala-user, Chris Twiner

I think the longevity of this thread is an excellent rebuttal to the people who say "Why not just use the default XML module? It's a built-in!". Some libraries just want to watch the world burn.

Alex Cruise

Aug 7, 2013, 6:31:55 PM

to Haoyi Li, RJ Regenold, scala-user, Chris Twiner

On Wed, Aug 7, 2013 at 8:38 AM, Haoyi Li <haoy...@gmail.com> wrote:

> I think the longevity of this thread is an excellent rebuttal to the people who say "Why not just use the default XML module? It's a built-in!". Some libraries just want to watch the world burn.

I just created https://issues.scala-lang.org/browse/SI-7726. :)

I'm looking forward to the decoupling of XML from mainline; it will hopefully make it much, much easier to get XML bugs fixed promptly. Assuming the compiler-to-library XML interface is sane. I really should look at that. :)

-0xe1a
