Besides the suggestions there, maybe you could make a SoupStrainer
subclass that allows you to short-circuit when you find what you want.
If your source is valid XHTML:
http://effbot.org/zone/element-iterparse.htm
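
A minimal sketch of that incremental approach, assuming the page really is well-formed XHTML and using the standard library's xml.etree.ElementTree.iterparse. The function name find_robots_content and the sample document are made up for illustration; the loop returns as soon as it sees the robots meta tag or the end of <head>, so it does not walk the rest of the tree.

```python
import io
import xml.etree.ElementTree as ET

def find_robots_content(source):
    """Return the content of the first robots meta tag, or None.

    `source` is a filename or a binary file-like object holding
    well-formed XHTML (an assumption; broken HTML will raise ParseError).
    """
    for event, elem in ET.iterparse(source, events=("end",)):
        # Strip an XML namespace such as {http://www.w3.org/1999/xhtml}
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag == "meta" and elem.get("name", "").lower() == "robots":
            return elem.get("content")  # short-circuit: stop iterating
        if tag == "head":
            return None  # meta tags only live in <head>, so give up here
    return None

# Hypothetical sample document, for illustration only.
xhtml = b"""<html xmlns="http://www.w3.org/1999/xhtml"><head>
<meta name="ROBOTS" content="NOINDEX, NOFOLLOW" />
<title>example</title></head><body><p>body text</p></body></html>"""

print(find_robots_content(io.BytesIO(xhtml)))
```

Note that XML attribute names are case-sensitive, so this only looks up lowercase name and content, which is what XHTML requires anyway.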
lxml has some ability to parse broken HTML and it also has an
incremental parse:
http://codespeak.net/lxml/parsing.html#parsing-html
http://codespeak.net/lxml/parsing.html#iterparse-and-iterwalk
Kent
How would I match the following 2 meta = "robots" lines:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="robots" oops="NOINDEX, NOFOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<meta name="robots" CONTENT="NOINDEX, NOFOLLOW">
<meta http-equiv="reply-to" content="wooo">
at the moment I am doing:
for robots in soup('meta',{'name':'robots'}):
    print robots
    try:
        print robots['content']
    except KeyError:
        print 'oops'
for robots in soup('meta',{'name':'ROBOTS'}):
    etc......
which I know is wrong :-( but I can't figure out how to make all the attribute values lower case, or how to make the search case-insensitive.
I looked up META tags on the W3C site and it doesn't say anything about the case of characters inside the attributes. I assume <meta name='John'> would want to preserve case....
Any help would be muchly appreciated :-)
John Glazebrook
BS makes attribute names lowercase but not the values. Your solution
works; in more detail it is:
soup.findAll('meta', {'name':re.compile('robots', re.IGNORECASE)})
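
To see the same idea without BeautifulSoup installed, here is a rough sketch using only the standard library's html.parser, which also lowercases tag and attribute names but leaves values alone. The class name RobotsMetaFinder is invented for this example, and it is a stand-in for the findAll call above rather than how BeautifulSoup works internally.

```python
import re
from html.parser import HTMLParser

# Case-insensitive pattern for the *value* of the name attribute.
ROBOTS = re.compile(r"robots", re.IGNORECASE)

class RobotsMetaFinder(HTMLParser):
    """Collect the content attribute of every robots meta tag."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # html.parser has already lowercased `tag` and the attribute names.
        if tag != "meta":
            return
        attrs = dict(attrs)
        # fullmatch so that a value like "norobots" is not accepted.
        if ROBOTS.fullmatch(attrs.get("name") or ""):
            # None is recorded when the tag has no content attribute.
            self.found.append(attrs.get("content"))

html = """
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="robots" oops="NOINDEX, NOFOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<meta name="robots" CONTENT="NOINDEX, NOFOLLOW">
<meta http-equiv="reply-to" content="wooo">
"""

finder = RobotsMetaFinder()
finder.feed(html)
print(finder.found)
```

All three robots lines match regardless of case; the one with oops= instead of content= shows up as None, which plays the role of the KeyError branch in the original loop.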
Kent
Excellent - works like a charm. Thanks :-)
I have blogged this answer - so hopefully other people may find it and benefit :-)
Kind Regards,
John Glazebrook