Trouble Parsing some whacked out html


Daniel W

Aug 19, 2012, 12:04:08 PM
to beauti...@googlegroups.com
Thanks in advance for any guidance.  Leonard, you're doing god's work. 

I'm struggling to parse the HTML I get from this page, a Macy's product page. The HTML is attached.
When I run it through BeautifulSoup, I get something that ends with "&gt;&lt;/BODY&gt;&lt;/HTML&gt;</iframe></span></fb:like></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></body></html>", and calls like soup.find('ul', {'class':'bullets'}) don't find anything when they 'should', i.e. there is a ul with class bullets in there. When I open the page in Chrome and inspect elements, the HTML looks legit and Chrome doesn't seem to have trouble parsing it.

My setup:
- I'm running Firefox 3.0.18 in headless mode using Xvfb on a Linux EC2 server.
- I've tried both Beautiful Soup 3 and 4, with both the built-in parser and the html5lib parser, all on Python 2.6.8.

Questions:
- Is there anything about my setup that could be causing this? The old Firefox?
- Is there any preprocessing I can do to 'fix' the HTML so the soup becomes navigable?

troublesome_html.rtf

Daniel W

Aug 19, 2012, 9:58:01 PM
to beauti...@googlegroups.com
I did some more digging, and I think I've solved my own problem.
When I grab the text from my Linux setup, I see this (excerpt):

<TEXTAREA rows="5" cols="1500" id="messageArea"/><DIV class="hLine"/><DIV id="captcha"><IMG border="0" id="captchaImg"/></DIV> <DIV class="floatLeft"><LABEL><SPAN id="captchaLabel">*Please type the characters above: </SPAN> 

...and when I do 
foo=soup.findAll('textarea',{'id':'messageArea'})
foo contains basically the rest of the document: the parser isn't seeing the textarea as closed.

When I view page source on my local desktop running Chrome, I see:

<textarea id="messageArea" cols="1500" rows="5"></textarea><div class="hLine"></div>
<div id="captcha"><img id="captchaImg" border="0"></div>
<div class="floatLeft">
<label>
<span id="captchaLabel">*Please type the characters above: </span>

Note the closure of the textarea. 

So, I added this to my preprocessing of the html text before souping it:

text=re.sub(re.compile(r'<textarea.+?>', re.I|re.DOTALL), '', text)

...which kills the textarea tags, because I don't need 'em.  Seems to have fixed the problem.
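
If the textarea elements are needed later, a possible alternative to deleting them is to rewrite the self-closed form so it matches what Chrome shows. The likely cause of the runaway match is that parsers following HTML rules ignore the trailing slash on a non-void element such as textarea, so `<TEXTAREA .../>` is read as an opening tag only and everything after it becomes the textarea's text content. The following is only a sketch under that assumption; the names `SELF_CLOSED_TEXTAREA` and `close_textareas` are made up for illustration and are not part of Beautiful Soup.

```python
import re

# Hypothetical helper, not part of Beautiful Soup.
# Matches a self-closed textarea tag such as <TEXTAREA rows="5" id="x"/>
# and captures its attributes in group 1.
SELF_CLOSED_TEXTAREA = re.compile(r'<textarea\b([^>]*?)\s*/>', re.IGNORECASE)


def close_textareas(text):
    """Rewrite <textarea .../> as <textarea ...></textarea>.

    Textareas that already have an explicit closing tag are left untouched.
    """
    return SELF_CLOSED_TEXTAREA.sub(r'<textarea\1></textarea>', text)


# Usage with a fragment from the excerpt above.
sample = '<TEXTAREA rows="5" cols="1500" id="messageArea"/><DIV class="hLine"/>'
fixed = close_textareas(sample)
print(fixed)
```

The result can then be passed to BeautifulSoup as before. Other self-closed non-void tags in the Firefox output, such as `<DIV class="hLine"/>`, are not touched by this sketch and may still nest differently from what Chrome shows.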