FILE : parseShift.py
import urllib.request as url
from html.parser import HTMLParser
class myParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start of %s tag : %s" % (tag, attrs))
test = myParser()
handle = url.urlretrieve("http://localhost/shift.html")
handleTemp = open( handle[0] , encoding="Shift-JIS" )
test.feed( handleTemp.read() )
handleTemp.close()
FILE : shift.html (encoded Shift-JIS)
<p class="thisisclass (not_in_japanese) reading_this_should_be_ok">Some
random japanese
<p><strong>東方プロジェクト</strong> <a href="#" title="キャプテン・ムラサ">Link</a>
OUTPUT
Start of p tag : [('class', 'thisisclass (not_in_japanese) reading_this_should_be_ok')]
Start of p tag : []
Start of strong tag : []
Traceback (most recent call last):
  File "D:\Dorian\Python\parseShift.py", line 12, in <module>
    test.feed( handleTemp.read() )
  File "C:\Python31\lib\html\parser.py", line 108, in feed
    self.goahead(0)
  File "C:\Python31\lib\html\parser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "C:\Python31\lib\html\parser.py", line 268, in parse_starttag
    self.handle_starttag(tag, attrs)
  File "D:\Dorian\Python\parseShift.py", line 6, in handle_starttag
    print("Start of %s tag : %s" % (tag, attrs))
  File "C:\Python31\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 44-52: character maps to <undefined>
any help?
Dorian
any workaround?
Dorian
Your problem is the last line. The parsing itself works, but your terminal
uses cp1252, which cannot represent the Japanese text, so print() raises
an exception.
Either change your terminal encoding to a suitable encoding, or write the
text to an encoded file instead (see the 'encoding' option of the open()
function for that).
Stefan
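As a rough illustration of the "write to an encoded file" suggestion, here
is a minimal sketch. The names FileParser and dump_tags and the choice of
UTF-8 for the output file are assumptions made for this example, not part
of the original script.

```python
from html.parser import HTMLParser


class FileParser(HTMLParser):
    """Hypothetical variant of myParser that writes to a text stream."""

    def __init__(self, out):
        super().__init__()
        self.out = out

    def handle_starttag(self, tag, attrs):
        # Same message as the original print(), sent to the file instead.
        self.out.write("Start of %s tag : %s\n" % (tag, attrs))


def dump_tags(html_text, path, encoding="utf-8"):
    # The output file gets an explicit encoding, so the console code page
    # (cp1252 in the traceback) is never involved.
    with open(path, "w", encoding=encoding) as out:
        parser = FileParser(out)
        parser.feed(html_text)
        parser.close()
```

Calling dump_tags(handleTemp.read(), "tags.txt") in place of test.feed(...)
should then produce a UTF-8 file that any Unicode-aware editor can display.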
The text should already have been converted from Shift-JIS to Unicode
(by open() with encoding="Shift-JIS") before HTMLParser sees it, so the
"print" is outputting Unicode.
John Nagle
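To make the decode/encode distinction concrete, here is a small sketch
that separates the two steps. The helper names can_encode and console_safe
are made up for this example, and cp1252 is used only because it is the
codec named in the traceback.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the target codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def console_safe(text, encoding="cp1252"):
    # Replace characters the console codec lacks with \uXXXX escapes
    # instead of raising UnicodeEncodeError.
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


title = "キャプテン・ムラサ"

# Decoding (Shift-JIS bytes -> str) round-trips without trouble.
raw = title.encode("shift_jis")
decoded = raw.decode("shift_jis")

# Encoding (str -> console bytes) is the step that fails under cp1252.
print(can_encode(decoded, "shift_jis"))
print(can_encode(decoded, "cp1252"))
print(console_safe(decoded))
```

The escaped form is not readable Japanese, but it lets the script finish
on a console that cannot show the characters.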