Can you try parsing these pages? Is it a bug in the amara parsers? I
think it's too much time. How can I debug it?
Regards,
-- luismiguel
A comparison with lxml.html gives these results:
lxml 0.738401889801 secs.
amara 124.755213976 secs.
Regards,
-- luismiguel
On Sat, Nov 5, 2011 at 1:40 AM, Luis Miguel Morillas <mori...@gmail.com> wrote:
> 2011/11/5 Luis Miguel Morillas <mori...@gmail.com>:
> > I'm going to give a talk about scraping with amara at the Libre
> > Software World Conference. I wanted to use the official conference
> > pages (http://www.libresoftwareworldconference.com/ and
> > http://www.libresoftwareworldconference.com/programa/horario.html) but
> > I can't because it takes 124 seconds to parse these pages :-(
> >
> > Can you try parsing these pages? Is it a bug in the amara parsers? I
> > think it's too much time. How can I debug it?
>
> A comparison with lxml.html gives these results:
>
> lxml 0.738401889801 secs.
> amara 124.755213976 secs.

I'm not sure. I'd have to look at it. I would definitely expect
lxml.html to parse significantly more quickly, because nearly its
entire stack is in C, whereas Amara uses html5lib. But I wouldn't
expect that much of a difference, so it might be a bug. The best way
to tell is to compare with html5lib's basic DOM parser:

    import urllib2
    import html5lib

    # url is the page being tested
    stream = urllib2.urlopen(url)
    tree_builder = html5lib.treebuilders.getTreeBuilder('dom')
    parser = html5lib.html5parser.HTMLParser(tree=tree_builder)
    doc = parser.parse(stream)
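
To make that comparison repeatable, something like the following might help. It is only a rough sketch in Python 3, not a definitive benchmark: `time_parsers` and `available_parsers` are hypothetical helper names, and the Amara call is left out because it depends on the installed version, so you would add it to the dictionary yourself. Each parser gets the same bytes from memory, so download time does not distort the numbers.

```python
import io
import time


def time_parsers(data, parsers, repeat=1):
    """Return {name: best wall-clock seconds} for each parser callable.

    Every call receives a fresh BytesIO over the same bytes, so network
    time is excluded and all parsers see identical input.
    """
    timings = {}
    for name, parse in parsers.items():
        best = None
        for _ in range(repeat):
            start = time.perf_counter()
            parse(io.BytesIO(data))
            elapsed = time.perf_counter() - start
            best = elapsed if best is None else min(best, elapsed)
        timings[name] = best
    return timings


def available_parsers():
    """Collect whichever parsers can be imported here (hypothetical helper)."""
    parsers = {}
    try:
        import html5lib
        # Plain html5lib with its minidom tree builder, no Amara involved.
        parsers["html5lib dom"] = lambda s: html5lib.parse(s, treebuilder="dom")
    except ImportError:
        pass
    try:
        import lxml.html
        parsers["lxml.html"] = lambda s: lxml.html.parse(s)
    except ImportError:
        pass
    return parsers


# Tiny inline sample so the sketch runs offline; in practice, download the
# conference page once and pass its bytes instead.
sample = b"<html><body><table><tr><td>talk</td></tr></table></body></html>"
for name, seconds in time_parsers(sample, available_parsers()).items():
    print("%s %f secs." % (name, seconds))
```

If plain html5lib is also slow on the real page, the time is going into html5lib itself; if it is fast, the difference points at the Amara tree-building layer on top of it.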