Error when writing prettified html page to a file with f.write(...) with beautifulsoup4 (4.1.1) --- Why?
Mark T  
From: Mark T <v...@it.uu.se>
Date: Sat, 11 Aug 2012 14:46:12 -0700 (PDT)
Local: Sat, Aug 11 2012 5:46 pm
Subject: Error when writing prettified html page to a file with f.write(...) with beautifulsoup4 (4.1.1) --- Why?

The following code snippet works (*gives no errors*) when using python 2.6
and beautifulsoup (3.2.0)

       from BeautifulSoup import BeautifulSoup
       ...
       ...
       ...

       # Read the entire webpage into htmltext
       htmltext = webpage.read()
       # Squeeze it all together
       soup = BeautifulSoup(''.join(htmltext))
       strsoup = str(soup)
       # Remove chars. that define tabs, returns (leave newlines (\n))
       #text = re.sub(r"[\t]?[\r]?[\n]?",'',strsoup)
       text = re.sub(r"[\t]?[\r]?",'',strsoup)
       # Replace two or more white spaces by a single blank
       text = re.sub(r"[\s]{2,}",' ',text)
       if SaveFile:
           # Now save it to the current folder
           filename = 'NASDAQ_'+symbol+'_'+timestamp+'.html'
           f = open(filename,'w+')      
           f.write(text)
           f.close()
           filename = 'NASDAQ_'+symbol+'_'+timestamp+'_prettify.html'
           prettysoup = soup.prettify()
           f = open(filename,'w+')
           f.write(prettysoup)
           f.close()

But the following fails!

       from bs4 import BeautifulSoup             # for python 2.7 and greater
       ...
       ...
       ...

       # Read the entire webpage into htmltext
       htmltext = webpage.read()
       # Squeeze it all together
       soup = BeautifulSoup(''.join(htmltext))
       strsoup = str(soup)
       # Remove chars. that define tabs, returns (leave newlines (\n))
       #text = re.sub(r"[\t]?[\r]?[\n]?",'',strsoup)
       text = re.sub(r"[\t]?[\r]?",'',strsoup)
       # Replace two or more white spaces by a single blank
       text = re.sub(r"[\s]{2,}",' ',text)
       if SaveFile:
           # Now save it to the current folder
           filename = 'NASDAQ_'+symbol+'_'+timestamp+'.html'
           f = open(filename,'w+')      
           f.write(text)
           f.close()
           filename = 'NASDAQ_'+symbol+'_'+timestamp+'_prettify.html'
           prettysoup = soup.prettify()
           f = open(filename,'w+')
           f.write(prettysoup)
           f.close()

The f.write(prettysoup) generates the following error:

    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in
    position 28329: ordinal not in range(128)

Note that the only difference between these two code snippets is the
replacement of beautifulsoup with beautifulsoup4.

Why does this error occur when writing the prettified html file with
beautifulsoup4 and how can it be corrected?


 
Leonard Richardson  
From: Leonard Richardson <leona...@segfault.org>
Date: Sat, 11 Aug 2012 18:43:37 -0400
Local: Sat, Aug 11 2012 6:43 pm
Subject: Re: Error when writing prettified html page to a file with f.write(...) with beautifulsoup4 (4.1.1) --- Why?

> Why does this error occur when writing the prettified html file with
> beautifulsoup4 and how can it be corrected?

If you examine the string before writing it to a file, you should see
that it's a bytestring under Beautiful Soup 3, and a Unicode string
under Beautiful Soup 4. BS3's handling of Unicode was inconsistent.
BS4 won't convert Unicode strings to bytestrings unless you explicitly
tell it to.

Your error happens when Python attempts to encode a Unicode character
(EM DASH, in this case) into your system encoding, but your system
encoding doesn't include that character. This class of error is
discussed here:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#miscellaneous

I've updated the documentation to cover the case where the error
happens while writing to a file.
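
A minimal sketch of that failure, not taken from the thread and not using Beautiful Soup itself: the string `pretty` is a hypothetical stand-in for the Unicode output of prettify(), and the explicit encode("ascii") call stands in for the implicit conversion Python 2 attempts when a Unicode string is written to a file opened without an encoding. The helper name `ascii_error` is also made up for illustration.

```python
# Hypothetical stand-in for the Unicode string returned by prettify() under bs4.
# It contains U+2014 (EM DASH), the character named in the traceback.
pretty = u"<p>NASDAQ \u2014 quote</p>"


def ascii_error(text):
    """Return the UnicodeEncodeError raised when text is encoded as ASCII,
    or None if every character fits in ASCII."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError as exc:
        return exc
    return None


# The em dash has no ASCII representation, so an error object comes back.
print(ascii_error(pretty))
# A string with only ASCII characters encodes without error.
print(ascii_error(u"<p>NASDAQ quote</p>"))
```

The same string encodes without complaint once a codec that covers the character, such as UTF-8, is chosen.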

You can get the BS3 behavior by calling prettify(encoding="utf8"), or
you can encode the Unicode string to a UTF-8 bytestring before writing
it to a file.
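
A rough sketch of the second option, under stated assumptions: `pretty` is again a hypothetical stand-in for the Unicode string from prettify(), and the helper names and file names are invented for this example. The first helper encodes to UTF-8 and writes bytes; the second is a variation not mentioned above that lets `io.open` do the encoding. A bytestring obtained from prettify(encoding="utf8") could presumably be written the same way as in the first helper, without the extra encode step.

```python
import io
import os
import tempfile

# Hypothetical stand-in for the Unicode string returned by prettify() under bs4.
pretty = u"<p>NASDAQ \u2014 quote</p>"


def write_encoded(path, text):
    """Encode the Unicode string to UTF-8 bytes, then write the bytes."""
    with open(path, "wb") as f:
        f.write(text.encode("utf-8"))


def write_with_codec(path, text):
    """Open the file with an explicit encoding and write the Unicode string."""
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(text)


def read_back(path):
    """Read the raw bytes and decode them as UTF-8 for comparison."""
    with open(path, "rb") as f:
        return f.read().decode("utf-8")


# Write the same text both ways into a scratch directory.
tmpdir = tempfile.mkdtemp()
path_a = os.path.join(tmpdir, "encoded.html")
path_b = os.path.join(tmpdir, "codec.html")
write_encoded(path_a, pretty)
write_with_codec(path_b, pretty)
```

Either way the decision about the output encoding is made explicitly in the script rather than left to the interpreter's default.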

Leonard


 
Virgil Stokes  
From: Virgil Stokes <v...@it.uu.se>
Date: Sun, 12 Aug 2012 12:04:10 +0200
Local: Sun, Aug 12 2012 6:04 am
Subject: Re: Error when writing prettified html page to a file with f.write(...) with beautifulsoup4 (4.1.1) --- Why?
On 12-Aug-2012 00:43, Leonard Richardson wrote:

Ok Leonard,

Thanks very much for your "on target" answer :-)


 