<tr"> tag error


Tom

Jul 24, 2012, 9:22:55 AM
to beauti...@googlegroups.com
BS4 crashed at a <tr"> tag, reporting it in the error as: HTMLParseError: malformed start tag...

Is this an instance where Beautiful Soup can't parse a page and I need to use lxml or something of the sort? Any suggestions on how to correct this error, if it's possible at all?

Thanks,
Tom

Leonard Richardson

Jul 24, 2012, 10:16:18 AM
to beauti...@googlegroups.com
You should tell Beautiful Soup to use the lxml parser instead of
Python's built-in parser:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#other-parser-problems

"HTMLParser.HTMLParseError: malformed start tag or
HTMLParser.HTMLParseError: bad end tag - Caused by giving Python’s
built-in HTML parser a document it can’t handle. Any other
HTMLParseError is probably the same problem. Solution: Install lxml or
html5lib."
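
In code, that means passing the parser name as the second argument to the BeautifulSoup constructor. A minimal sketch, assuming lxml or html5lib may or may not be installed; pick_parser and make_soup are hypothetical helper names, not part of Beautiful Soup:

```python
import importlib.util


def pick_parser(preferred=("lxml", "html5lib")):
    """Return the first installed parser name, else Python's built-in one."""
    for name in preferred:
        if importlib.util.find_spec(name) is not None:
            return name
    return "html.parser"


def make_soup(markup):
    # Imported here so pick_parser() stays usable without Beautiful Soup.
    from bs4 import BeautifulSoup
    # The second argument names the parser Beautiful Soup should use.
    return BeautifulSoup(markup, pick_parser())


print(pick_parser())
```

Trying lxml first matches the advice in this thread; the built-in parser is only the fallback.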

Leonard

Tom

Jul 30, 2012, 12:19:13 PM
to beauti...@googlegroups.com, leon...@segfault.org
Hey Leonard,
I'm now parsing with html5lib and it is working. However, it turns out I need the text from that <tr"> tag, go figure. Typically that tag looks like <tr class="even "> and I've been getting the text from it easily. But there are multiple instances where <tr class="even "> shows up as <tr"> instead. I'm not sure if it's a server error or what, but all the data/text associated with that class is still there; it's just preceded by a malformed tag. Below is an example of a good <tr class="even "> tag vs. a bad <tr"> tag:

Good:  <tr class="even "><th scope=col><a href="http://rivals.yahoo.com/ncaa/football/recruiting/commitments/2013/virginia-8;_ylt=AootV7c9bC6K1MuWPoaKXMNHPZB4" >Virginia</a></th><td>None</td><td class="offered">Offered</td><td>None</td><td></td></tr>
Bad:   <tr"><th scope=col><a href="http://rivals.yahoo.com/ncaa/football/recruiting/commitments/2013/westvirginia-17;_ylt=As9wOKU0mLGxzNX.IYTwS1xHPZB4" >West Virginia</a></th><td>None</td><td class="offered">Offered</td><td>None</td><td></td></tr>

Is there any way to fix or replace that malformed tag?

I was looking around here in your documentation:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#insert

Thanks,
Tom

Leonard Richardson

Jul 30, 2012, 1:34:21 PM
to beauti...@googlegroups.com
I'm not 100% sure what you're asking, but my advice is to parse the markup
with lxml instead of html5lib. Here's how html5lib parses the markup
you gave me:

<html>
<head>
</head>
<body>
<tr">
NoneOfferedNone
</tr">
</body>
</html>

I'm guessing your problem has to do with the fact that html5lib loses
the <th> tag and the <td> tags.

Here's how lxml parses the same markup:

<html>
<body>
<tr>
<th scope="col">
<a href="http://rivals.yahoo.com/ncaa/football/recruiting/commitments/2013/westvirginia-17;_ylt=As9wOKU0mLGxzNX.IYTwS1xHPZB4">
West Virginia
</a>
</th>
<td>
None
</td>
<td class="offered">
Offered
</td>
<td>
None
</td>
<td>
</td>
</tr>
</body>
</html>

Note that the <th> and <td> tags are preserved.
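
If switching parsers is not an option, another approach is to repair the stray quote in the raw markup before handing it to any parser. A minimal sketch using only the standard library; the helper name repair_stray_quote_tags and the pattern are illustrative assumptions, not part of Beautiful Soup, and the pattern only covers a tag name followed directly by a quote, as in <tr">:

```python
import re

# A tag name immediately followed by a stray double quote, e.g. <tr">.
# Well-formed tags such as <tr class="even "> have a space after the
# name, so they are left alone.
STRAY_QUOTE_TAG = re.compile(r'<([A-Za-z][A-Za-z0-9]*)">')


def repair_stray_quote_tags(markup):
    """Rewrite tags like <tr"> as <tr> before parsing."""
    return STRAY_QUOTE_TAG.sub(r"<\1>", markup)


bad = '<tr"><th scope=col>West Virginia</th><td>None</td></tr>'
fixed = repair_stray_quote_tags(bad)
print(fixed)
```

The repaired string can then be passed to Beautiful Soup as usual.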

Leonard

Thomas Booth

Jul 30, 2012, 8:39:36 PM
to beauti...@googlegroups.com
Hey Leonard,
Thanks for your reply. I meant that the <tr"> tag should be
<tr class="even ">. But for whatever reason that <tr"> tag is spread
throughout the table. Here is the code for the table I am trying to
parse:
<tr class="even ">
<th scope="col">
<a href="http://rivals.yahoo.com/ncaa/football/recruiting/commitments/2010/arizona-57">
Arizona
</a>
</th>
<td>
None
</td>
<td class="offered">
Offered
</td>
<td>
None
</td>
<td>
<a href="http://footballrecruiting.rivals.com/viewcoach.asp?coach=525&amp;sport=1&amp;year=2010">
Tim Kish
</a>
</td>
</tr>
<tr> ### This tag is <tr"> in html5lib and <tr> in lxml. How can I target this tag?
<th scope="col">
<a href="http://rivals.yahoo.com/ncaa/football/recruiting/commitments/2010/arizonastate-58">
Arizona St.
</a>
</th>
<td>
None
</td>
<td class="offered">
Offered
</td>
<td>
None
</td>
<td>
<a href="http://footballrecruiting.rivals.com/viewcoach.asp?coach=337&amp;sport=1&amp;year=2010">
Matt Lubick
</a>
</td>
</tr>
<tr class="even ">
<th scope="col">
<a href="http://rivals.yahoo.com/ncaa/football/recruiting/commitments/2010/auburn-75">
Auburn
</a>
</th>
<td>
None
</td>
<td class="offered">
Offered
</td>
<td>
None
</td>
<td>
<a href="http://footballrecruiting.rivals.com/viewcoach.asp?coach=621&amp;sport=1&amp;year=2010">
Curtis Luper
</a>
</td>
</tr>
<tr> ### here is another example....
<th scope="col">
<a href="http://rivals.yahoo.com/ncaa/football/recruiting/commitments/2010/california-59">
California
</a>
</th>
<td>
None
</td>
<td class="offered">
Offered
</td>
<td>
None
</td>
<td>
<a href="http://footballrecruiting.rivals.com/viewcoach.asp?coach=2140&amp;sport=1&amp;year=2010">
Tosh Lupoi
</a>
</td>
</tr>

Here is my code below...

evens = soup.find_all('tr', 'even ')
# This just grabs all the tr tags throughout the page... and in html5lib
# I try <tr"> and nothing happens.
todd = soup.find_all('tr')
evenss = evens + todd
for n in evenss:
    r = n.find_all('td')
    a = n.find_all('a')
    data18 = a[0].text
    data19 = r[0].text
    data20 = r[1].text
    data21 = r[2].text



##### Basically, the code only grabs half of the data because the
<tr class="even "> tag changes throughout the table.
FYI, here is the URL; the table is called college choices:
http://rivals.yahoo.com/ncaa/football/recruiting/player-Ronald-Powell-67362;_ylt=AgOlT31Q3EmLywDUQAxBMl9IPZB4



I appreciate your help,
Thanks,
Tom

Leonard Richardson

Jul 31, 2012, 9:49:17 AM
to beauti...@googlegroups.com
Let me try to summarize your problem:

1. Your document has a <tbody> tag that contains a number of <tr> tags.
2. Some of the <tr> tags have a CSS class "even".
3. Some of the <tr> tags have no CSS class (because the markup is invalid).
4. Some of the <tr> tags have other CSS classes such as "committed".

You want to get the <tr> tags in #2 and #3, but not #4.

The simplest way to do this is to get all the <tr> tags and filter
out the ones you don't want.

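for tr in soup.tbody.find_all('tr'):
    # tr.get('class') is None for rows with no class attribute,
    # so fall back to an empty list before testing membership.
    if 'committed' not in tr.get('class', []):
        ...

You can filter out those tags ahead of time, by defining a function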
that excludes the CSS classes you don't want:

def exclude_css_classes(cls):
    return cls is None or "committed" not in cls

for tr in soup.tbody.find_all("tr", exclude_css_classes):
    ...

Or you can go the other way, and define a function that only includes
the CSS classes you want:

def include_css_classes(cls):
    return cls is None or "even" in cls

for tr in soup.tbody.find_all("tr", include_css_classes):
    ...

Documentation on passing a function into a find() method:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-function
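
To see why these functions check for None first, it can help to call one directly with the kinds of values it would receive. A small sketch, assuming a tag with no class attribute is passed in as None and other tags are passed their class names as strings; the sample values are made up for illustration:

```python
def include_css_classes(cls):
    # None stands for a <tr> with no class attribute at all.
    return cls is None or "even" in cls


# Hypothetical inputs: a class-less row, an "even" row, a "committed" row.
samples = (None, "even", "committed")
results = [include_css_classes(value) for value in samples]

for value, keep in zip(samples, results):
    print(repr(value), "->", keep)
```

Without the None check, the class-less rows would raise a TypeError on the "in" test instead of being kept.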

Hope this helps,
Leonard