BS4 Parser seems broken on website with dubious encoding cited in HEAD
Alec Muffett  
From: Alec Muffett <alec.muff...@gmail.com>
Date: Sat, 3 Nov 2012 08:23:41 -0700 (PDT)
Local: Sat, Nov 3 2012 11:23 am
Subject: BS4 Parser seems broken on website with dubious encoding cited in HEAD

Hi,

Python 2.7.3
BeautifulSoup 4.0.2-1 (Ubuntu)

I am doing the following:

  from bs4 import BeautifulSoup
  import urllib2
  url = 'https://cybersecuritychallenge.org.uk/'
  data = urllib2.urlopen(url).read()
  soup = BeautifulSoup(data)
  print soup

...and the result is a mess:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><meta content="IE=7"
http-equiv="X-UA-Compatible"/><meta content="text/html; charset=utf-8"
http-equiv="Content-Type"/><meta content="en-GB"
http-equiv="Content-Language"/><title>Cyber Security Challenge</title><meta
content="" name="description"/><meta content=""
name="keywor"/></head><body><p>d   s   "       /   &gt;  
b   a   s   e       h   r   e   f   =   "   h   t   t   p   s   :   /   /
  c   y   b   e   r   s   e   c   u   r   i   t   y   c   h   a   l   l   e
  n   g   e   .   o   r   g   .   u   k   /   "       /   &gt;  
l   i   n   k       r   e   l   =   "   s   t   y   l   e   s   h   e   e
  t   "       h   r   e   f   =   "   c   s   s   /   s   t   y   l   e   s
  .   p   h   p   "       m   e   d   i   a   =   "   a   l   l   "       /
  &gt;  

...which seems to be due to the encoding cited in the head, below.
Certainly if I remove the encoding then the document parses correctly.

Not merely have the characters been spaced-out like bad unicode, but also
there are chunks of the document body missing.

I've tried setting various forms of from_encoding to coerce BS into doing
the right thing, but it remains broken.

I'm stuck, considering chucking out my code and going to lxml.

Can anyone please tell me how to force BS4 to do the sensible thing and
ignore the broken encoding at this / any other URL?

    -a

HEAD follows:
----
<head>
<meta http-equiv="X-UA-Compatible" content="IE=7" />
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
<meta http-equiv="Content-Language" content="en-GB">
<title>Cyber Security Challenge</title>
<meta content="" name="description" />
<meta content="" name="keywords" />
<base href="https://cybersecuritychallenge.org.uk/" />
<link rel="stylesheet" href="css/styles.php" media="all" />
<link rel="alternate" type="application/rss+xml"
href="https://cybersecuritychallenge.org.uk/rss.xml" title="Cyber Security
Challenge UK News Feed" />
<script type="text/javascript" src="js/jquery-1.6.1.min.js"></script>
<script type="text/javascript"
src="js/jquery.prettyphoto/jquery.prettyPhoto.js"></script>
</head>
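
To confirm which charset a page like this declares before handing it to any parser, one option is to read the meta tags with Python's standard library. This is only a minimal sketch in Python 3, not part of the original report: `MetaCharsetFinder` and `declared_charset` are hypothetical helper names, and the sample bytes are copied from the Content-Type line of the HEAD above.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Hypothetical helper: records the first charset named in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # Only look at <meta> tags, and stop after the first declaration.
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        # HTML5 form: <meta charset="...">
        if attrs.get("charset"):
            self.charset = attrs["charset"].strip().lower()
            return
        # Older form: <meta http-equiv="Content-Type" content="...; charset=...">
        http_equiv = (attrs.get("http-equiv") or "").lower()
        content = (attrs.get("content") or "").lower()
        if http_equiv == "content-type" and "charset=" in content:
            self.charset = content.split("charset=", 1)[1].split(";")[0].strip()


def declared_charset(raw_bytes):
    """Return the charset declared in the markup, or None if there is none."""
    finder = MetaCharsetFinder()
    # Tag and attribute names are ASCII, so a lossy ASCII decode is enough here.
    finder.feed(raw_bytes.decode("ascii", errors="replace"))
    return finder.charset


head = (b'<head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=ISO-8859-1" /></head>')
print(declared_charset(head))
```

The result can then be compared with the charset in the HTTP response headers, or with whatever the parser reports, to see where the disagreement starts.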


 
Leonard Richardson  
From: Leonard Richardson <leona...@segfault.org>
Date: Sat, 3 Nov 2012 11:37:19 -0400
Local: Sat, Nov 3 2012 11:37 am
Subject: Re: BS4 Parser seems broken on website with dubious encoding cited in HEAD
Hi, Alec,

It looks like you've encountered bug 972466:

https://bugs.launchpad.net/beautifulsoup/+bug/972466

This bug is caused by a workaround for a bug in lxml's parser. The
workaround was removed in Beautiful Soup 4.0.3.

You have a couple options:

1. Use the html5lib parser instead of the lxml parser. Since you're on
Python 2.7.3, Python's built-in parser will also work well.
2. Upgrade Beautiful Soup to version 4.0.3 or later.

Leonard


 
Alec Muffett  
From: Alec Muffett <alec.muff...@gmail.com>
Date: Sat, 3 Nov 2012 08:49:52 -0700 (PDT)
Local: Sat, Nov 3 2012 11:49 am
Subject: Re: BS4 Parser seems broken on website with dubious encoding cited in HEAD

> You have a couple options:
> 1. Use the html5lib parser instead of the lxml parser. Since you're on
> Python 2.7.3, Python's built-in parser will also work well.

Yay, thank you Leonard; since I am a SoupNewbie this begs a question that I
still am wondering about: is there (is this?) a way to put a different
parser on the front of BS4, or is BS4 effectively standalone, ie: that I
would have to dump all my BS code and rewrite to adopt html5lib as a parser?

I looked at the lxml webpages, it seems like they have some degree of
integration with BS, but it seems unidirectional?

> 2. Upgrade Beautiful Soup to version 4.0.3 or later.


Shall do this.

Thank you!


 
Leonard Richardson  
From: Leonard Richardson <leona...@segfault.org>
Date: Sat, 3 Nov 2012 13:00:42 -0400
Local: Sat, Nov 3 2012 1:00 pm
Subject: Re: BS4 Parser seems broken on website with dubious encoding cited in HEAD
Alec,

> Yay, thank you Leonard; since I am a SoupNewbie this begs a question that I
> still am wondering about: is there (is this?) a way to put a different
> parser on the front of BS4, or is BS4 effectively standalone, ie: that I
> would have to dump all my BS code and rewrite to adopt html5lib as a parser?

BS4 uses the best parser you have installed. Right now for you that's
lxml; that's why you encounter the bug.
To use html5lib instead you would specify that as an argument to the
constructor:

BeautifulSoup(markup, "html5lib")

See the docs for more information:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-pa...

Leonard
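
For readers following along, the constructor argument described above can be tried on a small document. This is a minimal sketch in Python 3, assuming Beautiful Soup 4 is installed; the markup is an illustrative stand-in built from the HEAD quoted earlier in the thread (the `<p>Hello</p>` body is invented for the example). It names "html.parser", Beautiful Soup's name for Python's built-in parser, so nothing else needs installing; "html5lib" works the same way once the html5lib package is installed.

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for the page: bytes that declare ISO-8859-1 in the HEAD.
markup = b"""<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
<title>Cyber Security Challenge</title>
</head><body><p>Hello</p></body></html>"""

# Name the parser explicitly rather than letting Beautiful Soup pick the
# "best" one installed (lxml, in the case discussed in this thread).
soup = BeautifulSoup(markup, "html.parser")

print(soup.title.string)
print(soup.p.get_text())
```

The rest of the Beautiful Soup code stays the same whichever parser is named; only the second constructor argument changes.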


 