Trouble with UnicodeEncodeError and email

Florian Lindner

Jan 8, 2014, 8:14:29 AM

to pytho...@python.org

Hello!

I've written some tiny script using Python 3 and it used to work perfectly. Then I realized it needs to run on my Debian Stable server too, which offers only Python 2. Ok, most backporting was a matter of minutes, but I'm becoming desperate on some Unicode error...

I use scikit-learn to train a filter on a set of email messages:

vectorizer = CountVectorizer(input='filename', decode_error='replace', strip_accents='unicode',
                             preprocessor=self.mail_preprocessor, stop_words='english')

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

The vectorizer gets a list of filenames, reads the files, and passes their contents to the preprocessor:

def mail_preprocessor(self, message):
    # Filter POPFile cruft by matching date string at the beginning.
    print("Type:", type(message)) # imported from __future__
    pop_reg = re.compile(r"^[0-9]{4}/[0-1][1-9]/[0-3]?[0-9]")
    message = [line for line in message.splitlines(True) if not pop_reg.match(line)]
    xxx = "".join(message)
    msg = email.message_from_string(xxx) # <-- CRASH here

    msg_body = ""

    for part in msg.walk():
        if part.get_content_type() in ["text/plain", "text/html"]:
            body = part.get_payload(decode=True)
            soup = BeautifulSoup(body)
            msg_body += soup.get_text(" ", strip=True)

    if "-----BEGIN PGP MESSAGE-----" in msg_body:
        msg_body = ""

    msg_body += " ".join(email.utils.parseaddr(msg["From"]))
    try:
        msg_body += " " + msg["Subject"]
    except TypeError: # Can't convert 'NoneType' object to str implicitly
        pass
    msg_body = msg_body.lower()
    return msg_body

Type: <type 'unicode'>

Traceback (most recent call last):
  File "flofify.py", line 182, in <module>
    main()
  File "flofify.py", line 161, in main
    model.train()
  File "flofify.py", line 73, in train
    vectors = vectorizer.fit_transform(data[:,1])
  File "/usr/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 780, in fit_transform
    vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
  File "/usr/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 715, in _count_vocab
    for feature in analyze(doc):
  File "/usr/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 229, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "flofify.py", line 119, in mail_preprocessor
    msg = email.message_from_string(xxx)
  File "/usr/lib/python2.7/email/__init__.py", line 57, in message_from_string
    return Parser(*args, **kws).parsestr(s)
  File "/usr/lib/python2.7/email/parser.py", line 82, in parsestr
    return self.parse(StringIO(text), headersonly=headersonly)
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 1624: ordinal not in range(128)

I've tried various modifications like encoding/decoding the message argument to utf-8.

Any help?

Thanks!

Florian

Chris Angelico

Jan 8, 2014, 8:26:15 AM

to pytho...@python.org

On Thu, Jan 9, 2014 at 12:14 AM, Florian Lindner <mailin...@xgm.de> wrote:
> I've written some tiny script using Python 3 and it used to work perfectly. Then I realized it needs to run on my Debian Stable server too, which offers only Python 2. Ok, most backporting was a matter of minutes, but I'm becoming desperate on some Unicode error...

Are you sure it does? The current Debian stable is Wheezy, which comes
with a package 'python3' in the repository, which will install 3.2.3.
(The previous Debian stable, Squeeze, has 3.1.3 under the same name.)
You may need to change your shebang, but that's all you'd need to do.
Or are you unable to install new packages? If so, I strongly recommend
getting Python 3 added, as it's going to spare you a lot of Unicode
headaches.
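
If Python 2 really is the only option, the traceback suggests that the 2.7 parser wraps the text in a byte-oriented StringIO, which forces an implicit ASCII encode of the unicode string. Here is a minimal sketch of one workaround, assuming the text has already been decoded (as the "Type: <type 'unicode'>" output indicates) and that UTF-8 is an acceptable byte encoding. The helper name parse_mail_text is hypothetical, not part of the email package:

```python
from __future__ import print_function

import email
import sys

PY2 = sys.version_info[0] == 2
TEXT_TYPE = type(u"")  # `unicode` on Python 2, `str` on Python 3


def parse_mail_text(text, encoding="utf-8"):
    """Parse already-decoded message text into an email Message.

    Hypothetical helper, for illustration only. On Python 2 the text is
    encoded to a byte string before parsing, so the parser never has to
    fall back to an implicit ASCII encode. On Python 3 the text is
    passed through unchanged.
    """
    if PY2 and isinstance(text, TEXT_TYPE):
        text = text.encode(encoding)
    return email.message_from_string(text)


if __name__ == "__main__":
    # U+FFFD is the replacement character named in the traceback.
    raw = u"From: a@example.com\nSubject: Test\n\nBody \ufffd text\n"
    msg = parse_mail_text(raw)
    print(msg["From"], msg.get_content_type())
```

Note that on Python 2 the headers and payload then come back as byte strings, so they would likely need an explicit .decode("utf-8", "replace") before being joined with the unicode text that BeautifulSoup returns.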

Mind you, I compile my own Py3 for Wheezy, since I like to be on the
bleeding edge. But that's not for everyone. :)

ChrisA

Florian Lindner

Jan 8, 2014, 1:09:19 PM

to pytho...@python.org

Well, I thought I had scanned the repos, but obviously not... I had to install BeautifulSoup and scikit-learn manually. Now some other Unicode issues have arisen, but I first need to sort out how they are connected to my mail delivery agent.

Thx a lot,

Florian