Hi, Alec,
It looks like you've encountered bug 972466:
https://bugs.launchpad.net/beautifulsoup/+bug/972466
This bug is caused by a workaround for a bug in lxml's parser. The
workaround was removed in Beautiful Soup 4.0.3.
You have a couple of options:
1. Use the html5lib parser instead of the lxml parser. Since you're on
Python 2.7.3, Python's built-in parser (html.parser) will also work well.
2. Upgrade Beautiful Soup to version 4.0.3 or later.
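
Both options can be sketched roughly as follows. This is only an illustration, not code from the bug report: the helper names (parse_version, is_affected, make_soup) are made up for this example, and the sketch assumes the installed version string is available as bs4.__version__ and that the workaround was removed in 4.0.3, as noted above.

```python
import re


def parse_version(version):
    """Turn a version string such as '4.0.2' or '4.0.2-1' into a tuple of ints."""
    upstream = version.split("-")[0]  # drop a distro suffix such as '-1'
    parts = []
    for piece in upstream.split("."):
        match = re.match(r"\d+", piece)  # keep only the leading digits of each piece
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def is_affected(version, fixed_in="4.0.3"):
    """Return True if `version` is older than the release that removed the workaround."""
    return parse_version(version) < parse_version(fixed_in)


def make_soup(data, parser="html.parser"):
    """Parse `data` with an explicitly named parser instead of the default choice.

    "html.parser" selects Python's built-in parser; "html5lib" works too
    if that package is installed.
    """
    # Imported here so the version helpers above can be used without bs4 installed.
    from bs4 import BeautifulSoup
    return BeautifulSoup(data, parser)
```

For the setup in your message, is_affected("4.0.2-1") is true, so either upgrade or call make_soup(data) (or make_soup(data, "html5lib")) rather than BeautifulSoup(data).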
Leonard
On Sat, Nov 3, 2012 at 11:23 AM, Alec Muffett <alec.m...@gmail.com> wrote:
> Hi,
>
> Python 2.7.3
> BeautifulSoup 4.0.2-1 (Ubuntu)
>
> I am doing the following:
>
>> from bs4 import BeautifulSoup
>> import urllib2
>> url = 'https://cybersecuritychallenge.org.uk/'
>> data = urllib2.urlopen(url).read()
>> soup = BeautifulSoup(data)
>> print soup
>
>
> ...and the result is a mess:
>
>> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
>> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
>> <html xmlns="http://www.w3.org/1999/xhtml"><head><meta content="IE=7"
>> http-equiv="X-UA-Compatible"/><meta content="text/html; charset=utf-8"
>> http-equiv="Content-Type"/><meta content="en-GB"
>> http-equiv="Content-Language"/><title>Cyber Security Challenge</title><meta
>> content="" name="description"/><meta content=""
>> name="keywor"/></head><body><p>d s " / >
>> b a s e h r e f = " h t t p s : / /
>> c y b e r s e c u r i t y c h a l l e
>> n g e . o r g . u k / " / >
>> l i n k r e l = " s t y l e s h e e
>> t " h r e f = " c s s / s t y l e s
>> . p h p " m e d i a = " a l l " /
>> >
>
>
> ...which seems to be due to the encoding cited in the head, below.
> Certainly if I remove the encoding then the document parses correctly.
>
> Not only have the characters been spaced out like bad Unicode, but
> chunks of the document body are also missing.
>
> I've tried setting various forms of from_encoding to coerce BS into doing
> the right thing, but it remains broken.
>
> I'm stuck, considering chucking out my code and going to lxml.
>
> Can anyone please tell me how to force BS4 to do the sensible thing and
> ignore the broken encoding at this / any other URL?
>
> -a
>
> HEAD follows:
> ----
> <head>
> <meta http-equiv="X-UA-Compatible" content="IE=7" />
> <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
> <meta http-equiv="Content-Language" content="en-GB">
> <title>Cyber Security Challenge</title>
> <meta content="" name="description" />
> <meta content="" name="keywords" />
> <base href="https://cybersecuritychallenge.org.uk/" />
> <link rel="stylesheet" href="css/styles.php" media="all" />
> <link rel="alternate" type="application/rss+xml"
> href="https://cybersecuritychallenge.org.uk/rss.xml" title="Cyber Security Challenge UK News Feed" />
> <script type="text/javascript" src="js/jquery-1.6.1.min.js"></script>
> <script type="text/javascript"
> src="js/jquery.prettyphoto/jquery.prettyPhoto.js"></script>
> </head>