Amara parser is too slow or is it a bug?

Luis Miguel Morillas

Nov 5, 2011, 2:54:46 AM

to akar...@googlegroups.com, Uche Ogbuji

I'm going to give a talk about scraping with Amara at the Libre
Software World Conference. I wanted to use the official conference
pages (http://www.libresoftwareworldconference.com/ and
http://www.libresoftwareworldconference.com/programa/horario.html) but
I can't because it takes 124 seconds to parse these pages :-(

Can you try parsing these pages? Is it a bug in the Amara parsers? I
think that's far too much time. How can I debug it?

Regards,

-- luismiguel

Luis Miguel Morillas

Nov 5, 2011, 3:40:11 AM

to akar...@googlegroups.com, Uche Ogbuji

A comparison with lxml.html gives these results:

lxml 0.738401889801 secs.
amara 124.755213976 secs.
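A comparison like this is easiest to trust when both parsers are fed the same, already downloaded page, so that network time is not counted. The following is only a sketch of such a harness, not the script that produced the numbers above; the function name `time_parsers` and its parameters are made up for illustration.

```python
from timeit import default_timer


def time_parsers(parsers, data, repeat=3):
    """Return {label: best wall-clock seconds} for each parser callable.

    `parsers` maps a label to a function that takes the raw page contents.
    `data` is the page, downloaded once beforehand, so every parser sees
    identical input and no network latency is measured.
    """
    results = {}
    for label, parse in parsers.items():
        best = float("inf")
        for _ in range(repeat):
            start = default_timer()
            parse(data)
            best = min(best, default_timer() - start)
        results[label] = best
    return results
```

Passing something like `{'lxml': lxml.html.fromstring, 'amara': amara.bindery.html.parse}` as `parsers` would be the natural use here, assuming both callables accept the page contents directly.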

Regards,

-- luismiguel

Uche Ogbuji

Nov 5, 2011, 10:53:40 AM

to Luis Miguel Morillas, akar...@googlegroups.com

I'm not sure. I'd have to look at it. I would definitely expect lxml.html to parse significantly more quickly, because its entire stack, just about, is in C, whereas Amara uses html5lib. But I wouldn't expect that much of a difference, so it might be a bug. The best way to tell is to compare with html5lib's basic DOM parser:

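import urllib2  # Python 2 standard library
import html5lib

url = 'http://www.libresoftwareworldconference.com/programa/horario.html'
stream = urllib2.urlopen(url)
# Build a plain DOM tree with html5lib alone, bypassing Amara
tree_builder = html5lib.treebuilders.getTreeBuilder('dom')
parser = html5lib.html5parser.HTMLParser(tree=tree_builder)
doc = parser.parse(stream)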
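Beyond comparing against the plain DOM builder, profiling the slow call is another way to see where the time goes. Here is a minimal sketch using the standard library's cProfile; the helper name `profile_call` is hypothetical, and the code is written for a current Python 3 interpreter (on Python 2, `io.StringIO` would need to be swapped for `StringIO.StringIO`).

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile; return (result, text report).

    The report lists the 15 entries with the highest cumulative time,
    which is usually enough to spot a pathological hot spot.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(15)
    return result, out.getvalue()
```

Calling it as `profile_call(html.parse, url)`, with `html` imported from `amara.bindery`, and printing the report should show whether the time is spent inside html5lib, in Amara's tree building, or in fetching the page.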

--
Uche Ogbuji http://uche.ogbuji.net
Weblog: http://copia.ogbuji.net
Poetry ed @TNB: http://www.thenervousbreakdown.com/author/uogbuji/
Founding Partner, Zepheira http://zepheira.com
Linked-in: http://www.linkedin.com/in/ucheogbuji
Articles: http://uche.ogbuji.net/tech/publications/
Friendfeed: http://friendfeed.com/uche
Twitter: http://twitter.com/uogbuji
http://www.google.com/profiles/uche.ogbuji

Uche Ogbuji

Nov 5, 2011, 11:03:42 AM

to Luis Miguel Morillas, akar...@googlegroups.com

I just did a quick and dirty test. I see:

$ time python -c "from amara.bindery import html; html.parse('http://www.libresoftwareworldconference.com/programa/horario.html')"

real 0m2.161s
user 0m0.619s
sys 0m0.096s

So about 2 seconds on my MacBook Pro, much of which is Python start-up overhead; that is about the performance I'd expect. So I guess the next logical question is: which versions are you using (Python, Amara, etc.)? Also, can you compare against the raw html5lib results, as I suggested above?
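For collecting the version information, something along these lines may help. It is a sketch for a current Python 3 interpreter using `importlib.metadata`; on a Python 2 setup, `pkg_resources.get_distribution(name).version` would be the rough equivalent. The distribution names passed in at the bottom ("Amara", "html5lib", "lxml") are assumptions about how the packages were installed.

```python
import sys
from importlib import metadata


def report_versions(packages):
    """Return a dict with the Python version and each package's version."""
    report = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed under this distribution name
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    # Distribution names are assumed; adjust to match the local install.
    for name, version in report_versions(["Amara", "html5lib", "lxml"]).items():
        print("%s: %s" % (name, version))
```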

Thanks.
