Re: BS4 Parser seems broken on website with dubious encoding cited in HEAD

Leonard Richardson

Nov 3, 2012, 11:37:19 AM
to beautifulsoup
Hi, Alec,

It looks like you've encountered bug 972466:

https://bugs.launchpad.net/beautifulsoup/+bug/972466

This bug is caused by a workaround for a bug in lxml's parser. The
workaround was removed in Beautiful Soup 4.0.3.

You have a couple options:

1. Use the html5lib parser instead of the lxml parser. Since you're on
Python 2.7.3, Python's built-in parser will also work well.
2. Upgrade Beautiful Soup to version 4.0.3 or later.
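
For option 2, one way to tell whether an installed copy already includes the fix is to compare its version string against 4.0.3. This is only a minimal sketch, assuming ordinary dotted version strings (possibly with a packaging suffix such as "-1"); `version_tuple` and `has_fix` are hypothetical helper names, not part of Beautiful Soup:

```python
import re

# First release without the lxml workaround, per the message above.
FIXED_IN = (4, 0, 3)


def version_tuple(version):
    """Turn a string like '4.0.2-1' into (4, 0, 2).

    Only the first three numeric fields are kept, so a distribution
    suffix such as '-1' is ignored.
    """
    return tuple(int(field) for field in re.findall(r"\d+", version)[:3])


def has_fix(version):
    """Return True if this version is 4.0.3 or later."""
    return version_tuple(version) >= FIXED_IN
```

With Beautiful Soup installed, `has_fix(bs4.__version__)` should report whether an upgrade is still needed.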

Leonard



On Sat, Nov 3, 2012 at 11:23 AM, Alec Muffett <alec.m...@gmail.com> wrote:
> Hi,
>
> Python 2.7.3
> BeautifulSoup 4.0.2-1 (Ubuntu)
>
> I am doing the following:
>
>> from bs4 import BeautifulSoup
>> import urllib2
>> url = 'https://cybersecuritychallenge.org.uk/'
>> data = urllib2.urlopen(url).read()
>> soup = BeautifulSoup(data)
>> print soup
>
>
> ...and the result is a mess:
>
>> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
>> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
>> <html xmlns="http://www.w3.org/1999/xhtml"><head><meta content="IE=7"
>> http-equiv="X-UA-Compatible"/><meta content="text/html; charset=utf-8"
>> http-equiv="Content-Type"/><meta content="en-GB"
>> http-equiv="Content-Language"/><title>Cyber Security Challenge</title><meta
>> content="" name="description"/><meta content=""
>> name="keywor"/></head><body><p>d s " / &gt;
>> b a s e h r e f = " h t t p s : / /
>> c y b e r s e c u r i t y c h a l l e
>> n g e . o r g . u k / " / &gt;
>> l i n k r e l = " s t y l e s h e e
>> t " h r e f = " c s s / s t y l e s
>> . p h p " m e d i a = " a l l " /
>> &gt;
>
>
> ...which seems to be due to the encoding cited in the head, below.
> Certainly if I remove the encoding then the document parses correctly.
>
> Not merely have the characters been spaced-out like bad unicode, but also
> there are chunks of the document body missing.
>
> I've tried setting various forms of from_encoding to coerce BS into doing
> the right thing, but it remains broken.
>
> I'm stuck, considering chucking out my code and going to lxml.
>
> Can anyone please tell me how to force BS4 to do the sensible thing and
> ignore the broken encoding at this / any other URL?
>
> -a
>
> HEAD follows:
> ----
> <head>
> <meta http-equiv="X-UA-Compatible" content="IE=7" />
> <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
> <meta http-equiv="Content-Language" content="en-GB">
> <title>Cyber Security Challenge</title>
> <meta content="" name="description" />
> <meta content="" name="keywords" />
> <base href="https://cybersecuritychallenge.org.uk/" />
> <link rel="stylesheet" href="css/styles.php" media="all" />
> <link rel="alternate" type="application/rss+xml"
> href="https://cybersecuritychallenge.org.uk/rss.xml" title="Cyber Security
> Challenge UK News Feed" />
> <script type="text/javascript" src="js/jquery-1.6.1.min.js"></script>
> <script type="text/javascript"
> src="js/jquery.prettyphoto/jquery.prettyPhoto.js"></script>
> </head>
>

Alec Muffett

Nov 3, 2012, 11:49:52 AM
to beauti...@googlegroups.com, leon...@segfault.org

> You have a couple options:
> 1. Use the html5lib parser instead of the lxml parser. Since you're on
> Python 2.7.3, Python's built-in parser will also work well.

Yay, thank you Leonard; since I am a SoupNewbie this begs a question that I still am wondering about: is there (is this?) a way to put a different parser on the front of BS4, or is BS4 effectively standalone, ie: that I would have to dump all my BS code and rewrite to adopt html5lib as a parser?

I looked at the lxml webpages, and it seems they have some degree of integration with BS, but it seems unidirectional?

> 2. Upgrade Beautiful Soup to version 4.0.3 or later.

Shall do this.

Thank you!

Leonard Richardson

Nov 3, 2012, 1:00:42 PM
to beautifulsoup
Alec,

> Yay, thank you Leonard; since I am a SoupNewbie this begs a question that I
> still am wondering about: is there (is this?) a way to put a different
> parser on the front of BS4, or is BS4 effectively standalone, ie: that I
> would have to dump all my BS code and rewrite to adopt html5lib as a parser?

BS4 uses the best parser you have installed. Right now for you that's
lxml; that's why you encounter the bug.
To use html5lib instead you would specify that as an argument to the
constructor:

BeautifulSoup(markup, "html5lib")

See the docs for more information:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
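
To make the parser choice explicit rather than depending on whichever one happens to be installed, something along these lines may help. It is a sketch only: `pick_parser` and `make_soup` are hypothetical helpers, and the preference order is an assumption. "html.parser" is the name Beautiful Soup uses for Python's built-in parser.

```python
import importlib.util


def pick_parser(preferred=("html5lib", "html.parser")):
    """Return the first parser name whose backing module can be found.

    For these parsers the name passed to BeautifulSoup happens to match
    the name of the module that implements it, so the name is looked up
    directly. The default order is an assumption, not a recommendation
    taken from the Beautiful Soup documentation.
    """
    for name in preferred:
        if importlib.util.find_spec(name) is not None:
            return name
    raise RuntimeError("none of the requested parsers is installed")


def make_soup(markup, parser=None):
    """Build a soup with an explicitly named parser."""
    # Imported here so pick_parser() can be used without bs4 installed.
    from bs4 import BeautifulSoup

    return BeautifulSoup(markup, parser or pick_parser())
```

Calling `make_soup(data)` would then use html5lib when it is available and fall back to the built-in parser otherwise, so the lxml code path is never taken.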

Leonard