Re: HTML halts BeautifulSoup

sste...@gmail.com

Feb 3, 2010, 10:06:34 AM
to beauti...@googlegroups.com
On Feb 3, 2010, at 7:30 AM, Telemat wrote:

> I came across this line that makes BeautifulSoup fail:
>
> <EMBED NAME="IncrediFlash" SRC="http://www.heroturko.org/
> globaldomain.swf?317338960" BGCOLOR="#000000"
> WIDTH="150" HEIGHT="250" TYPE="application/x-shockwave-flash"
> pluginspage=http://www.macromedia.com/go/getflashplayer">
> </EMBED>

Which versions of Python and BeautifulSoup are you using, and in what context?

> Hope you can fix it.

Hope you can provide a complete bug report.

S

Aaron DeVore

Feb 3, 2010, 12:29:59 PM
to beauti...@googlegroups.com
I didn't test this, but maybe it's the missing quote mark on this part:

pluginspage=http://www.macromedia.com/go/getflashplayer"

Maybe a search and replace before putting it into Beautiful Soup would be good?

src = src.replace('pluginspage=h', 'pluginspage="h')
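A minimal sketch of that pre-filter, applied to the markup from the report. The function name `quote_pluginspage` is made up for illustration, and the snippet has been joined onto one line for readability. It only repairs the string; passing the result to Beautiful Soup is left out.

```python
def quote_pluginspage(src):
    """Add the missing opening quote to an unquoted pluginspage URL."""
    return src.replace('pluginspage=h', 'pluginspage="h')


# Markup from the original report, joined onto one line for readability.
broken = ('<EMBED NAME="IncrediFlash" '
          'SRC="http://www.heroturko.org/globaldomain.swf?317338960" '
          'BGCOLOR="#000000" WIDTH="150" HEIGHT="250" '
          'TYPE="application/x-shockwave-flash" '
          'pluginspage=http://www.macromedia.com/go/getflashplayer">'
          '</EMBED>')

# `cleaned` is the string that would then be handed to Beautiful Soup.
cleaned = quote_pluginspage(broken)
print(cleaned)
```

After the replacement every double quote in the tag has a partner again, which is the property the parser needs.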

-Aaron DeVore



Aaron DeVore

Feb 4, 2010, 1:48:10 PM
to beauti...@googlegroups.com
On Wed, Feb 3, 2010 at 10:08 PM, Telemat <kdo...@gmail.com> wrote:
> Hi
>
> It's actually
> data = data.replace('pluginspage=', 'pluginspage="')
>
> Works like a charm. Thanks for the tip

Excellent! For the record, this will be either difficult or impossible
to fix as part of HTMLParser. The problem is deciding what should
happen when the value of the attribute really is foo". Case in point:

<tag attr=value">

Is that 'value"'? That's how sgmllib handles it. Or is it 'value'?
What happens in this case:

<tag attr="value>

sgmllib handles it as '"value'. HTMLParser silently ignores the tag
and the rest of the document. Manual filtering is really the only way
to handle this error.
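To see how a given parser reads these cases, a small probe like the one below can help. It is only a sketch: it uses Python 3's `html.parser.HTMLParser` (sgmllib no longer exists in Python 3), so what it reports may differ from the 2010 behavior described above. The names `StartTagCollector` and `start_tags` are made up for illustration.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Record every start tag and its attributes as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, attrs))


def start_tags(markup):
    """Return the (tag, attrs) pairs the parser reports for `markup`."""
    collector = StartTagCollector()
    collector.feed(markup)
    collector.close()
    return collector.seen


# One well-formed tag, then the two ambiguous cases from this message.
for sample in ('<tag attr="value">', '<tag attr=value">', '<tag attr="value>'):
    print(repr(sample), '->', start_tags(sample))
```

Running it against the parser actually in use shows whether a stray quote ends up inside the attribute value or whether the tag is dropped, which tells you how aggressive the manual filtering needs to be.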

-Aaron DeVore

Aaron DeVore

Feb 6, 2010, 1:17:42 AM
to beauti...@googlegroups.com
On Wed, Feb 3, 2010 at 10:08 PM, Telemat <kdo...@gmail.com> wrote:
> Hi
>
> It's actually
> data = data.replace('pluginspage=', 'pluginspage="')

I just looked back at this message and noticed a problem. 'pluginspage='
also matches valid, already-quoted pluginspage attributes:

<tag pluginspage="...">

goes to

<tag pluginspage=""...">

That's why I instead matched 'pluginspage=h'.
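The difference between the two patterns can be checked directly. The sketch below also shows a regex variant that does not depend on the URL starting with "h"; that regex and the helper name `fix_pluginspage` are assumptions for illustration, not something from this thread, and `example.com` is a placeholder URL.

```python
import re

# Hypothetical helper: quote a pluginspage value only when it has no opening
# quote but is followed by a stray closing quote.
_UNQUOTED_PLUGINSPAGE = re.compile(r'(pluginspage=)([^\s"\'>]+)"', re.IGNORECASE)


def fix_pluginspage(html):
    """Wrap an unquoted pluginspage value in a matching pair of quotes."""
    return _UNQUOTED_PLUGINSPAGE.sub(r'\1"\2"', html)


valid = '<tag pluginspage="http://example.com/">'
broken = '<tag pluginspage=http://example.com/">'

# The broad pattern also rewrites markup that was already valid.
print(valid.replace('pluginspage=', 'pluginspage="'))
# The narrower pattern leaves valid markup alone.
print(valid.replace('pluginspage=h', 'pluginspage="h'))
# The regex variant leaves valid markup alone and repairs the broken tag.
print(fix_pluginspage(valid))
print(fix_pluginspage(broken))
```

The regex only fires when the value is followed by a closing quote, so a legitimately unquoted value with no stray quote is left untouched as well.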

-Aaron DeVore
