The following code snippet works (*gives no errors*) with Python 2.6
and BeautifulSoup 3.2.0:
from BeautifulSoup import BeautifulSoup
...
...
...
# Now read the entire webpage into htmltext
htmltext = webpage.read()
# Squeeze it all together
soup = BeautifulSoup(''.join(htmltext))
strsoup = str(soup)
# Remove chars. that define tabs, returns (leave newlines (\n))
#text = re.sub(r"[\t]?[\r]?[\n]?",'',strsoup)
text = re.sub(r"[\t]?[\r]?",'',strsoup)
# Replace two or more white spaces by a single blank
text = re.sub(r"[\s]{2,}",' ',text)
if SaveFile:
    # Now save it to the current folder
    filename = 'NASDAQ_'+symbol+'_'+timestamp+'.html'
    f = open(filename,'w+')
    f.write(text)
    f.close()
    filename = 'NASDAQ_'+symbol+'_'+timestamp+'_prettify.html'
    prettysoup = soup.prettify()
    f = open(filename,'w+')
    f.write(prettysoup)
    f.close()
But the following fails!
from bs4 import BeautifulSoup # for python 2.7 and greater
...
...
...
# Now read the entire webpage into htmltext
htmltext = webpage.read()
# Squeeze it all together
soup = BeautifulSoup(''.join(htmltext))
strsoup = str(soup)
# Remove chars. that define tabs, returns (leave newlines (\n))
#text = re.sub(r"[\t]?[\r]?[\n]?",'',strsoup)
text = re.sub(r"[\t]?[\r]?",'',strsoup)
# Replace two or more white spaces by a single blank
text = re.sub(r"[\s]{2,}",' ',text)
if SaveFile:
    # Now save it to the current folder
    filename = 'NASDAQ_'+symbol+'_'+timestamp+'.html'
    f = open(filename,'w+')
    f.write(text)
    f.close()
    filename = 'NASDAQ_'+symbol+'_'+timestamp+'_prettify.html'
    prettysoup = soup.prettify()
    f = open(filename,'w+')
    f.write(prettysoup)
    f.close()
The f.write(prettysoup) call raises the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in
position 28329: ordinal not in range(128)
Note that the only difference between these code snippets is the
replacement of beautifulsoup with beautifulsoup4.
Why does this error occur when writing the prettified html file with
beautifulsoup4 and how can it be corrected?
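
One possible explanation, offered as an assumption rather than a confirmed diagnosis: under beautifulsoup4, prettify() returns a unicode string when no encoding is passed, so writing it to a file opened with plain open() makes Python 2 encode it implicitly as ASCII, which fails on u'\u2014'. The sketch below shows one way around that by opening the file with an explicit encoding. It uses only the standard library; the prettysoup string is a stand-in for the real soup.prettify() result, and the file name is hypothetical.

```python
import io
import os
import tempfile

# Stand-in for soup.prettify() under bs4: a unicode string containing
# U+2014 (em dash), the character named in the error message.
prettysoup = u'<p>NASDAQ \u2014 quote</p>\n'

# Hypothetical file name; the original builds it from symbol and timestamp.
filename = os.path.join(tempfile.mkdtemp(), 'NASDAQ_prettify.html')

# io.open encodes unicode text as UTF-8 on write, so no implicit
# ASCII conversion takes place.
with io.open(filename, 'w', encoding='utf-8') as f:
    f.write(prettysoup)

# Read the file back to check that the em dash survived the round trip.
with io.open(filename, 'r', encoding='utf-8') as f:
    roundtrip = f.read()
```

If this assumption holds, passing an encoding to prettify() directly, for example soup.prettify('utf-8'), should also produce a byte string that the original open(filename,'w+') call can write unchanged.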