Break out of parse for large pages

Little_Grungy

Mar 6, 2008, 3:27:23 PM
to beautifulsoup
Hi,

Relatively new to BS but finding it very useful! My problem is this: I
need to parse many (~30) huge documents and compare them. I am using
regex to find what I need, but it still takes a very long time to
complete a full cycle. I have noticed that the information I need
tends to be near the top of each large page, yet I still have
to wait for the whole parse tree to be constructed even though I don't
use most of it.

Is there any way to speed up the process by breaking out of the tree
construction or do I have no choice but to wait for it to complete?

Can anyone suggest anything with BS or anything else that might help?

Thanks

Kent Johnson

Mar 6, 2008, 3:55:57 PM
to beauti...@googlegroups.com
Little_Grungy wrote:
> Is there any way to speed up the process by breaking out of the tree
> construction or do I have no choice but to wait for it to complete?
>
> Can anyone suggest anything with BS or anything else that might help?

http://www.crummy.com/software/BeautifulSoup/documentation.html#Improving%20Performance%20by%20Parsing%20Only%20Part%20of%20the%20Document

Besides the suggestions there, maybe you could make a SoupStrainer
subclass that allows you to short-circuit when you find what you want.
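
A minimal sketch of the short-circuit idea, for illustration only. It does not use SoupStrainer; it swaps in the standard library's html.parser (Python 3) and feeds the page in chunks, so the loop can stop as soon as the wanted tag has been seen. FirstMatchParser, find_first, the chunk size, and the sample page are hypothetical, not part of BeautifulSoup.

```python
from html.parser import HTMLParser


class FirstMatchParser(HTMLParser):
    """Remembers the attributes of the first start tag with the wanted name."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.found = None

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, but not values.
        if self.found is None and tag == self.wanted:
            self.found = dict(attrs)


def find_first(html, wanted, chunk_size=4096):
    """Return (attrs, characters_fed); stop feeding once the tag is seen."""
    parser = FirstMatchParser(wanted)
    for start in range(0, len(html), chunk_size):
        parser.feed(html[start:start + chunk_size])
        if parser.found is not None:
            return parser.found, min(start + chunk_size, len(html))
    parser.close()  # flush anything still buffered
    return parser.found, len(html)


# Hypothetical sample: the interesting tag near the top of a large page.
page = (
    '<html><head><meta name="robots" content="NOINDEX, NOFOLLOW"></head><body>'
    + '<p>filler</p>' * 50000
    + '</body></html>'
)

attrs, fed = find_first(page, 'meta')
print(attrs, fed, len(page))
```

Only the first chunk is parsed here because the meta tag sits in the head. No tree is built at all, so this only suits cases where the tag's own attributes are enough.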

If your source is valid XHTML:
http://effbot.org/zone/element-iterparse.htm
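
A rough sketch of what the iterparse approach could look like with the standard library's xml.etree.ElementTree (Python 3), assuming the input really is well-formed XHTML. The function name first_title, the choice of <title> as the target, and the sample document are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def first_title(source):
    """Return the text of the first <title>, abandoning the parse at that point."""
    for event, elem in ET.iterparse(source, events=("end",)):
        # Accept the tag with or without the XHTML namespace.
        if elem.tag in ("title", XHTML_NS + "title"):
            return elem.text  # returning here leaves the rest unread
    return None


# Hypothetical sample: a small head followed by a very large body.
sample = (
    b"<html><head><title>Report 17</title></head><body>"
    + b"<p>filler</p>" * 100000
    + b"</body></html>"
)

stream = io.BytesIO(sample)
print(first_title(stream))
print(stream.tell(), "of", len(sample), "bytes read")
```

iterparse reads the source in blocks and yields events as it goes, so returning from the loop early means the remainder of the file is never read or parsed.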

lxml has some ability to parse broken HTML and it also has an
incremental parse:
http://codespeak.net/lxml/parsing.html#parsing-html
http://codespeak.net/lxml/parsing.html#iterparse-and-iterwalk

Kent

John Glazebrook

Mar 7, 2008, 7:03:40 AM
to beauti...@googlegroups.com
Hello,

How would I match the following 2 meta name="robots" lines:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="robots" oops="NOINDEX, NOFOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<meta name="robots" CONTENT="NOINDEX, NOFOLLOW">
<meta http-equiv="reply-to" content="wooo">

at the moment I am doing:

for robots in soup('meta',{'name':'robots'}):
    print robots
    try:
        print robots['content']
    except KeyError:
        print 'oops'

for robots in soup('meta',{'name':'ROBOTS'}):
    etc......

which I know is wrong :-( but I can't figure out how to make all the attributes lowercase, or how to make the search case-insensitive.

I looked up META tags at the W3C and it doesn't say anything about the case of characters inside the attributes. I assume <meta name='John'> would want to preserve case...

Any help would be muchly appreciated :-)

John Glazebrook

marty

Mar 7, 2008, 7:21:56 AM
to beautifulsoup
Hey,

I thought BS made everything lowercase, but if it doesn't, my spontaneous
idea would be to use re.compile('robots', re.IGNORECASE)

Cheers

Kent Johnson

Mar 7, 2008, 7:54:25 AM
to beauti...@googlegroups.com
marty wrote:
> Hey,
>
> I thought BS made everything lowercase, but if it doesn't, my spontaneous
> idea would be to use re.compile('robots', re.IGNORECASE)

BS makes attribute names lowercase but not the values. Your solution
works; in more detail it is:
soup.findAll('meta', {'name':re.compile('robots', re.IGNORECASE)})
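
One caveat worth illustrating: as far as I can tell, BeautifulSoup applies a compiled pattern with search(), so an unanchored 'robots' also matches values that merely contain that word. The sketch below uses only the re module (Python 3) to compare the pattern above with an anchored variant. The names loose, strict, matching, and the sample values are hypothetical.

```python
import re

# The pattern from the post above, and an anchored variant.
loose = re.compile('robots', re.IGNORECASE)
strict = re.compile(r'^robots$', re.IGNORECASE)


def matching(pattern, values):
    """Return the values the pattern accepts when applied with search()."""
    return [value for value in values if pattern.search(value)]


# Made-up attribute values, including one that only contains "robots".
values = ['robots', 'ROBOTS', 'Robots', 'norobots-hint', 'description']

print(matching(loose, values))
print(matching(strict, values))

# The anchored pattern would drop into the same call, for example:
# soup.findAll('meta', {'name': re.compile(r'^robots$', re.IGNORECASE)})
```

Both patterns accept every capitalisation of "robots"; only the anchored one rejects 'norobots-hint'.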

Kent

John Glazebrook

Mar 7, 2008, 10:26:47 AM
to beauti...@googlegroups.com

>> Kent

Excellent - works like a charm. Thanks :-)

I have blogged this answer, so hopefully other people will find it and benefit :-)

Kind Regards,

John Glazebrook