From: Leonard Richardson <leona...@segfault.org>
Date: Mon, 30 Jul 2012 13:34:21 -0400
Subject: Re: <tr"> tag error
I'm not 100% sure what you're asking, but my advice is to parse the markup
with lxml instead of html5lib. Here's how html5lib parses the markup
you gave me:

<html>
 <head>
 </head>
 <body>
  <tr">
   <a href="http://rivals.yahoo.com/ncaa/football/recruiting/commitments/2013/wes...">
    West Virginia
   </a>
   NoneOfferedNone
  </tr">
 </body>
</html>

I'm guessing your problem has to do with the fact that html5lib loses
the <th> tag and the <td> tags.

Here's how lxml parses the same markup:

<html>
 <body>
  <tr>
   <th scope="col">
    <a href="http://rivals.yahoo.com/ncaa/football/recruiting/commitments/2013/wes...">
     West Virginia
    </a>
   </th>
   <td>
    None
   </td>
   <td class="offered">
    Offered
   </td>
   <td>
    None
   </td>
   <td>
   </td>
  </tr>
 </body>
</html>

Note that the <th> and <td> tags are preserved.
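
If you would rather repair the stray quote before parsing, here is a minimal sketch using only the standard library. It assumes the malformed tags always look like the one example in this thread (a tag name followed directly by a double quote and then ">"). The names STRAY_QUOTE_TAG and repair_stray_quote_tags are made up for illustration, and the example.com URL is a placeholder for the truncated link above.

```python
import re

# Assumption: a malformed tag is a tag name followed immediately by a stray
# double quote and then ">", as in <tr">. Well-formed tags such as
# <tr class="even "> have whitespace after the name and are not matched.
STRAY_QUOTE_TAG = re.compile(r'<([A-Za-z][A-Za-z0-9]*)"\s*>')


def repair_stray_quote_tags(markup):
    """Rewrite tags like <tr"> as <tr>, leaving all other markup untouched."""
    return STRAY_QUOTE_TAG.sub(r"<\1>", markup)


# Placeholder rows modelled on the "Good" and "Bad" examples in this thread.
good = ('<tr class="even "><th scope=col><a href="http://example.com/va">'
        'Virginia</a></th><td>None</td></tr>')
bad = ('<tr"><th scope=col><a href="http://example.com/wv">'
       'West Virginia</a></th><td>None</td></tr>')

repaired = repair_stray_quote_tags(bad)
unchanged = repair_stray_quote_tags(good)
```

The repaired string can then be handed to Beautiful Soup as usual, for example BeautifulSoup(repaired, "lxml"). Since lxml already copes with the <tr"> tag on its own, this step is only worth adding if you have to stay with a parser that does not.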

Leonard

On Mon, Jul 30, 2012 at 12:19 PM, Tom <boot...@gmail.com> wrote:
> Hey Leonard,
>      So I'm now parsing with html5lib and it is working. However, it turns
> out I need the text from that <tr"> tag, go figure. Typically that tag
> looks like <tr class="even "> and I've been getting the text from it
> easily. However, there are multiple instances where that <tr class="even ">
> looks like this: <tr">. I am not sure if it's a server error or what, but all
> the data/text associated with that class is still there; it's just preceded by
> a malformed tag. Below is an example of a good <tr class="even "> tag vs. a
> bad <tr"> tag:

> Good:  <tr class="even "><th scope=col><a
> href="http://rivals.yahoo.com/ncaa/football/recruiting/commitments/2013/vir..."
>>Virginia</a></th><td>None</td><td
> class="offered">Offered</td><td>None</td><td></td></tr>
> Bad:   <tr"><th scope=col><a
> href="http://rivals.yahoo.com/ncaa/football/recruiting/commitments/2013/wes..."
>>West Virginia</a></th><td>None</td><td
> class="offered">Offered</td><td>None</td><td></td></tr>

> Is there any way to fix or replace that malformed tag?

> I was looking around here in your documentation:
> http://www.crummy.com/software/BeautifulSoup/bs4/doc/#insert

> Thanks,
> Tom

> On Tuesday, July 24, 2012 10:16:18 AM UTC-4, Leonard Richardson wrote:

>> On Tue, Jul 24, 2012 at 9:22 AM, Tom <boot...@gmail.com> wrote:
>> > BS4 crashed at a <tr"> tag, reporting "HTMLParseError: malformed start
>> > tag"...

>> > Is this an instance where Beautiful Soup can't parse a page and I need to
>> > use lxml or something of the sort? Any suggestions on how to correct this
>> > error, if at all possible?

>> You should tell Beautiful Soup to use the lxml parser instead of
>> Python's built-in parser:

>> http://www.crummy.com/software/BeautifulSoup/bs4/doc/#other-parser-pr...

>> "HTMLParser.HTMLParseError: malformed start tag or
>> HTMLParser.HTMLParseError: bad end tag - Caused by giving Python’s
>> built-in HTML parser a document it can’t handle. Any other
>> HTMLParseError is probably the same problem. Solution: Install lxml or
>> html5lib."

>> Leonard
