Hi,
Python 2.7.3
BeautifulSoup 4.0.2-1 (Ubuntu)
I am doing the following:
from bs4 import BeautifulSoup
...and the result is a mess:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><meta content="IE=7"
http-equiv="X-UA-Compatible"/><meta content="text/html; charset=utf-8"
http-equiv="Content-Type"/><meta content="en-GB"
http-equiv="Content-Language"/><title>Cyber Security Challenge</title><meta
content="" name="description"/><meta content=""
name="keywor"/></head><body><p>d s " / >
b a s e h r e f = " h t t p s : / /
c y b e r s e c u r i t y c h a l l e
n g e . o r g . u k / " / >
l i n k r e l = " s t y l e s h e e
t " h r e f = " c s s / s t y l e s
. p h p " m e d i a = " a l l " / >
...which seems to be due to the charset declared in the page's <head>
(reproduced below). Certainly if I remove that charset declaration then the
document parses correctly.
Not only have the characters been spaced out, as if the text had been
decoded with the wrong Unicode encoding, but chunks of the document body
are also missing.
I've tried setting various forms of from_encoding to coerce BS into doing
the right thing, but it remains broken.
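
For anyone trying to reproduce this, here is a minimal sketch of one way to take encoding detection out of BS4's hands: decode the raw bytes yourself and pass the resulting Unicode string to BeautifulSoup, which should then have nothing left to guess. The helper name decode_html, the candidate list, and the sample markup are all made up for illustration and are not taken from the page in question. This is not confirmed to cure the output shown above.

```python
# -*- coding: utf-8 -*-
# Sketch only: decode_html, the candidate encodings, and the sample
# markup are hypothetical, not taken from the page discussed above.


def decode_html(raw_bytes, candidates=("utf-8", "iso-8859-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return raw_bytes.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: fall back to the first candidate and
    # replace the undecodable bytes rather than raising.
    return raw_bytes.decode(candidates[0], "replace"), candidates[0]


# Sample page whose <meta> claims ISO-8859-1 while the bytes are UTF-8.
raw = (b'<html><head><meta http-equiv="Content-Type" '
       b'content="text/html; charset=ISO-8859-1" />'
       b'<title>Caf\xc3\xa9</title></head><body></body></html>')

text, used = decode_html(raw)

try:
    from bs4 import BeautifulSoup
    # The alternative is to hand BS4 the bytes plus a hint:
    #     BeautifulSoup(raw, from_encoding="utf-8")
    # Here it gets Unicode instead, so the <meta> charset is not consulted.
    soup = BeautifulSoup(text, "html.parser")
    title = soup.title.string
except ImportError:
    # bs4 not installed: the decoding step above still stands on its own.
    soup = None
    title = None
```

Trying utf-8 before iso-8859-1 is deliberate: ISO-8859-1 accepts every byte sequence, so it only makes sense as the last candidate.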
I'm stuck, and am considering chucking out my code and switching to lxml.
Can anyone please tell me how to force BS4 to do the sensible thing and
ignore the broken encoding at this / any other URL?
-a
HEAD follows:
----
<head>
<meta http-equiv="X-UA-Compatible" content="IE=7" />
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
<meta http-equiv="Content-Language" content="en-GB">
<title>Cyber Security Challenge</title>
<meta content="" name="description" />
<meta content="" name="keywords" />
<base href="https://cybersecuritychallenge.org.uk/" />
<link rel="stylesheet" href="css/styles.php" media="all" />
<link rel="alternate" type="application/rss+xml" href="https://cybersecuritychallenge.org.uk/rss.xml" title="Cyber Security Challenge UK News Feed" />
<script type="text/javascript" src="js/jquery-1.6.1.min.js"></script>
<script type="text/javascript" src="js/jquery.prettyphoto/jquery.prettyPhoto.js"></script>
</head>