Trouble parsing some whacked-out HTML

Daniel W

Aug 19, 2012, 12:04:08 PM

to beauti...@googlegroups.com

Thanks in advance for any guidance. Leonard, you're doing god's work.

I'm struggling to parse the HTML I get from this page, a Macy's product page. The HTML is attached.

When I run it through BeautifulSoup, I get something that ends with "></BODY></HTML></iframe></span></fb:like></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></body></html>", and calls like soup.find('ul', {'class':'bullets'}) don't work when they should, i.e. there is a ul with class bullets in there. When I open the page in Chrome and inspect elements, the HTML looks legitimate, and Chrome doesn't seem to have trouble parsing it.

My setup:

- I'm running Firefox 3.0.18 in headless mode using Xvfb on a Linux EC2 server.

- I've tried both Beautiful Soup 3 and 4, with both the built-in parser and the html5lib parser, all in Python 2.6.8.

Questions:

- Is there anything about my setup that could be causing this? Old Firefox?

- Is there any pre-processing I can do to 'fix' the HTML so the soup becomes navigable?

troublesome_html.rtf

Daniel W

Aug 19, 2012, 9:58:01 PM

to beauti...@googlegroups.com

I did some more digging, and I think I've solved my own problem.

When I grab the text from my Linux setup, I see this (excerpt):

<TEXTAREA rows="5" cols="1500" id="messageArea"/><DIV class="hLine"/><DIV id="captcha"><IMG border="0" id="captchaImg"/></DIV> <DIV class="floatLeft"><LABEL><SPAN id="captchaLabel">*Please type the characters above: </SPAN>

...and when I do

foo = soup.findAll('textarea', {'id': 'messageArea'})

foo contains basically the rest of the document: the parser isn't seeing the closure of the textarea.
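
One likely explanation, offered as an assumption rather than something confirmed in this thread: HTML parsers ignore the trailing "/" on elements that are not void, so <TEXTAREA .../> is read as an opening tag and everything after it becomes the textarea's content. A rough way to spot such tags in the raw text before parsing is sketched below. The helper name find_self_closed_non_void and the regex are illustrative only, and the regex will misfire on unusual markup such as unquoted attribute values that end in a slash.

```python
import re

# Elements that HTML treats as void (they never take a closing tag).
VOID = frozenset(['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
                  'link', 'meta', 'param', 'source', 'track', 'wbr'])

# A tag name, optional attributes, then a trailing "/>".
SELF_CLOSED = re.compile(r'<([a-zA-Z][a-zA-Z0-9:]*)\b[^<>]*/>')


def find_self_closed_non_void(html):
    """Return lowercased names of tags written as <x .../> that are not void."""
    names = []
    for match in SELF_CLOSED.finditer(html):
        name = match.group(1).lower()
        if name not in VOID:
            names.append(name)
    return names


# Excerpt taken from the markup quoted above.
excerpt = ('<TEXTAREA rows="5" cols="1500" id="messageArea"/>'
           '<DIV class="hLine"/><DIV id="captcha">'
           '<IMG border="0" id="captchaImg"/></DIV>')
suspects = find_self_closed_non_void(excerpt)
print(suspects)
```

On the excerpt this flags the textarea and the hLine div but not the img, since img is a void element.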

When I view page source on my local desktop running Chrome, I see:

<label>

<span id="captchaLabel">*Please type the characters above: </span>

Note the closure of the textarea.

So, I added this to my preprocessing of the html text before souping it:

text = re.sub(re.compile(r'<textarea.+?>', re.I | re.DOTALL), '', text)

...which kills the textarea tags, because I don't need 'em. Seems to have fixed the problem.
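
For anyone adapting this, here is a slightly stricter sketch of the same pre-processing idea. It is a variant, not the exact code used above: it also removes closing </textarea> tags, and it uses [^>]* so a match cannot run past the end of the tag, which the original pattern could do on a bare <textarea> with no attributes. The helper name strip_textarea_tags is made up for illustration, and the pattern assumes no attribute value contains a ">" character.

```python
import re

# Matches <textarea ...>, <TEXTAREA .../> and </textarea>, case-insensitively.
# [^>]* stops at the first ">", so the match stays inside a single tag.
TEXTAREA_TAG = re.compile(r'</?textarea\b[^>]*>', re.IGNORECASE)


def strip_textarea_tags(html):
    """Return html with textarea tags removed; any text between them is kept."""
    return TEXTAREA_TAG.sub('', html)


# Excerpt taken from the markup quoted earlier in this message.
sample = ('<TEXTAREA rows="5" cols="1500" id="messageArea"/>'
          '<DIV class="hLine"/><DIV id="captcha">'
          '<IMG border="0" id="captchaImg"/></DIV>')
cleaned = strip_textarea_tags(sample)
print(cleaned)
```

The cleaned string can then be handed to BeautifulSoup as before. Only the tags are removed, so any text that sat inside a textarea stays in the document.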
