Is the Amara parser too slow, or is it a bug?


Luis Miguel Morillas

Nov 5, 2011, 2:54:46 AM
to akar...@googlegroups.com, Uche Ogbuji
I'm going to give a talk about scraping with Amara at the Libre
Software World Conference. I wanted to use the official conference
pages (http://www.libresoftwareworldconference.com/ and
http://www.libresoftwareworldconference.com/programa/horario.html) but
I can't, because it takes 124 seconds to parse these pages :-(

Can you try parsing these pages? Is it a bug in Amara's parsers? That
seems like far too long to me. How can I debug it?

Regards,

-- luismiguel

Luis Miguel Morillas

Nov 5, 2011, 3:40:11 AM
to akar...@googlegroups.com, Uche Ogbuji
2011/11/5 Luis Miguel Morillas <mori...@gmail.com>:

A comparison with lxml.html gives these results:

lxml 0.738401889801 secs.
amara 124.755213976 secs.


Regards,

-- luismiguel

Uche Ogbuji

Nov 5, 2011, 10:53:40 AM
to Luis Miguel Morillas, akar...@googlegroups.com
I'm not sure. I'd have to look at it. I would definitely expect lxml.html to parse significantly more quickly, because just about its entire stack is in C, whereas Amara uses html5lib. But I wouldn't expect that much of a difference, so it might be a bug. The best way to tell is to compare with html5lib's basic DOM parser:

    import urllib2
    import html5lib

    # Build a plain DOM tree with html5lib alone, bypassing Amara
    stream = urllib2.urlopen(url)
    tree_builder = html5lib.treebuilders.getTreeBuilder('dom')
    parser = html5lib.html5parser.HTMLParser(tree=tree_builder)
    doc = parser.parse(stream)
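
To make that comparison fair, it may help to download the page once and time only the parsing step, so that network latency does not end up in the numbers. The following is a minimal sketch of such a harness, not a tested benchmark: the helper names `time_parse` and `available_parsers` and the dictionary labels are made up for illustration, and the html5lib entry simply reuses the DOM tree builder calls shown above.

```python
from timeit import default_timer


def time_parse(parse_fn, data, repeat=1):
    """Return the best wall-clock time in seconds for parse_fn(data)."""
    best = None
    for _ in range(repeat):
        start = default_timer()
        parse_fn(data)
        elapsed = default_timer() - start
        if best is None or elapsed < best:
            best = elapsed
    return best


def available_parsers():
    """Collect parse callables for whichever libraries can be imported.

    The dictionary keys are illustrative labels, not official names.
    """
    parsers = {}
    try:
        import html5lib

        def parse_html5lib_dom(data):
            # Same construction as the snippet above: plain DOM tree builder.
            builder = html5lib.treebuilders.getTreeBuilder('dom')
            return html5lib.html5parser.HTMLParser(tree=builder).parse(data)

        parsers['html5lib-dom'] = parse_html5lib_dom
    except ImportError:
        pass
    try:
        import lxml.html
        parsers['lxml.html'] = lxml.html.fromstring
    except ImportError:
        pass
    # Assumption: where Amara is installed, amara.bindery.html.parse could be
    # registered here as well; the thread only shows it being called with a URL.
    return parsers
```

One possible way to use it: read the page into a string once (for example with `urllib2.urlopen(url).read()`), then call `time_parse(fn, data)` for each entry in `available_parsers()` and compare the results. If the html5lib figure is also far above a few seconds, the slowdown is probably not in Amara itself.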


--
Uche Ogbuji                       http://uche.ogbuji.net
Weblog: http://copia.ogbuji.net
Poetry ed @TNB: http://www.thenervousbreakdown.com/author/uogbuji/
Founding Partner, Zepheira        http://zepheira.com
Linked-in: http://www.linkedin.com/in/ucheogbuji
Articles: http://uche.ogbuji.net/tech/publications/
Friendfeed: http://friendfeed.com/uche
Twitter: http://twitter.com/uogbuji
http://www.google.com/profiles/uche.ogbuji

Uche Ogbuji

Nov 5, 2011, 11:03:42 AM
to Luis Miguel Morillas, akar...@googlegroups.com
I just did a quick and dirty test. I see:

$ time python -c "from amara.bindery import html; html.parse('http://www.libresoftwareworldconference.com/programa/horario.html')"

real 0m2.161s
user 0m0.619s
sys 0m0.096s

So about 2 seconds on my MacBook Pro, much of which is Python start-up overhead, which is about the performance I'd expect. So I guess the next logical question is: which versions are you using (Python, Amara, html5lib, etc.)? Also, can you compare the raw html5lib results, as I suggested earlier?
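
For collecting the version information, something like the sketch below might do. It assumes each package exposes a `__version__` attribute, which is not guaranteed for every library or release, so it falls back to a placeholder when the attribute is missing. The helper name `report_versions` is hypothetical.

```python
import importlib
import sys


def report_versions(module_names):
    """Map 'python' and each module name to a version string or placeholder."""
    versions = {'python': sys.version.split()[0]}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except Exception as exc:
            # Covers missing packages as well as packages that fail on import.
            versions[name] = 'unavailable (%s)' % type(exc).__name__
            continue
        # Assumption: the package publishes __version__; many do, not all.
        versions[name] = str(getattr(module, '__version__', 'unknown'))
    return versions


if __name__ == '__main__':
    report = report_versions(['amara', 'html5lib', 'lxml'])
    for name, version in sorted(report.items()):
        print('%s: %s' % (name, version))
```

Posting that output alongside the timings would make it easier to tell whether the difference comes from an older release.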

Thanks.