I'm having trouble decoding email subjects. Here are some examples:
> =?koi8-r?B?4tnT1NLP19nQz8zOyc3PIMkgzcHMz9rB1NLB1M7P?=
> [Fwd: re:Flags Of The World, Us States, And Military]
> =?ISO-8859-2?Q?=E9rdekes?=
> =?UTF-8?B?aGliw6Fr?=
I know that "=?UTF-8?B" means UTF-8 + base64 encoding, but I wonder if
there is a standard method in the "email" package to decode these
subjects? I do not want to reinvent the wheel.
Thanks,
Laszlo
> def decode_header(self, headervalue):
>     val, encoding = decode_header(headervalue)[0]
>     if encoding:
>         return val.decode(encoding)
>     else:
>         return val
However, there are malformed emails and I have to put them into the
database. What should I do with this:
Return-Path: <imi...@exalumnos.com>
X-Original-To: in...@designasign.biz
Delivered-To: dap...@localhost.com
Received: from 195.228.74.135 (unknown [122.46.173.89])
    by shopzeus.com (Postfix) with SMTP id F1C071DD438;
    Tue, 18 Mar 2008 05:43:27 -0400 (EDT)
Date: Tue, 18 Mar 2008 12:43:45 +0200
Message-ID: <60285728...@optometrist.com>
From: "Euro Dice Casino" <imi...@exalumnos.com>
To: tho...@designasign.biz
Subject: With 2’500 Euro of Welcome Bonus you can’t miss the chance!
MIME-Version: 1.0
Content-Type: text/html; charset=iso-8859-1
Content-Transfer-Encoding: 7bit
There is no encoding given in the subject but it contains 0x92. When I
try to insert this into the database, I get:
ProgrammingError: invalid byte sequence for encoding "UTF8": 0x92
All right, this probably was a spam email and I should simply discard
it. Probably the spammer used this special character in order to prevent
mail filters detecting "can't" and "2500". But I guess there will be
other important (ham) emails with bad encodings. How should I handle this?
Thanks,
Laszlo
Try this code:
from email.header import decode_header

def getheader(header_text, default="ascii"):
    """Decode the specified header"""
    headers = decode_header(header_text)
    header_sections = [unicode(text, charset or default)
                       for text, charset in headers]
    return u"".join(header_sections)
I get the following output for your strings:
Быстровыполнимо и малозатратно
érdekes
hibák
Regards,
Ryan Ginstrom
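For anyone reading this on Python 3, here is a rough sketch of the same idea. It assumes Python 3, where decode_header returns bytes for encoded parts and str for a header that has no encoded words at all, and where the unicode() builtin no longer exists. The name get_header and the errors="replace" choice are mine, not part of the code above.

```python
from email.header import decode_header


def get_header(header_text, default="ascii"):
    """Decode an RFC 2047 header value into a single str (Python 3 sketch)."""
    parts = []
    for text, charset in decode_header(header_text):
        # Encoded words come back as bytes; a header with no encoded
        # words at all comes back as a plain str and needs no decoding.
        if isinstance(text, bytes):
            text = text.decode(charset or default, errors="replace")
        parts.append(text)
    return "".join(parts)


if __name__ == "__main__":
    for raw in ("=?ISO-8859-2?Q?=E9rdekes?=", "=?UTF-8?B?aGliw6Fr?="):
        print(get_header(raw))
```

Note that an unknown charset label would still raise LookupError here; whether to catch that depends on how forgiving you want to be.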
> I know that "=?UTF-8?B" means UTF-8 + base64 encoding, but I wonder if
> there is a standard method in the "email" package to decode these
> subjects?
The standard library function email.Header.decode_header will parse these
headers into an encoded bytestring paired with the appropriate encoding
specification, if any. For example:
>>> raw_headers = [
...     '=?koi8-r?B?4tnT1NLP19nQz8zOyc3PIMkgzcHMz9rB1NLB1M7P?=',
...     '[Fwd: re:Flags Of The World, Us States, And Military]',
...     '=?ISO-8859-2?Q?=E9rdekes?=',
...     '=?UTF-8?B?aGliw6Fr?=',
... ]
>>> from email.Header import decode_header
>>> for raw_header in raw_headers:
...     for header, encoding in decode_header(raw_header):
...         if encoding is None:
...             print header.decode()
...         else:
...             print header.decode(encoding)
...
Быстровыполнимо и малозатратно
[Fwd: re:Flags Of The World, Us States, And Military]
érdekes
hibák
Jeffrey
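As a side note on the "standard method" question: the email package also has make_header, which can reassemble the pairs returned by decode_header. The sketch below assumes Python 3, where calling str() on the resulting Header object gives the decoded text. The helper name header_to_text is made up for illustration.

```python
from email.header import decode_header, make_header


def header_to_text(raw_header):
    """Decode an RFC 2047 header value using only the email package."""
    # decode_header splits the value into (data, charset) pairs;
    # make_header rebuilds a Header object from those pairs, and
    # str() on that object yields the decoded Unicode text.
    return str(make_header(decode_header(raw_header)))


if __name__ == "__main__":
    print(header_to_text("=?UTF-8?B?aGliw6Fr?="))
```

This leaves the charset handling to the library, so a mislabeled or malformed header can still raise an exception that the caller has to deal with.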
Maybe with some heuristics about the types of mistakes made by
do-it-yourself e-mail header constructors. For example, 'iso-8859-1' often
should be construed as 'cp1252':
>>> import unicodedata as ucd
>>> ucd.name('\x92'.decode('iso-8859-1'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
>>> ucd.name('\x92'.decode('cp1252'))
'RIGHT SINGLE QUOTATION MARK'
>>>
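One way such a heuristic might be written, as a sketch only and assuming Python 3 bytes objects: bytes in the range 0x80 to 0x9F decode to unnamed control characters under iso-8859-1, so their presence in text that claims to be iso-8859-1 hints that cp1252 was really meant. The function name has_c1_bytes is hypothetical.

```python
import unicodedata


def has_c1_bytes(raw):
    """Return True if raw contains any byte in the range 0x80-0x9F.

    Under iso-8859-1 these bytes are control characters, so finding
    them in ordinary text suggests the data is actually cp1252.
    """
    return any(0x80 <= byte <= 0x9F for byte in raw)


if __name__ == "__main__":
    subject = b"you can\x92t miss the chance!"
    codec = "cp1252" if has_c1_bytes(subject) else "iso-8859-1"
    text = subject.decode(codec)
    print(codec, unicodedata.name(text[7]))
```

This only detects the suspicious range; it does not prove the data is cp1252, so treat the result as a hint rather than a guarantee.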
>However, there are malformed emails and I have to put them into the
>database. What should I do with this:
[...]
>There is no encoding given in the subject but it contains 0x92. When I
>try to insert this into the database, I get:
This is indeed a malformed email. The Content-Type header specifies
iso-8859-1, but this looks like Windows code page 1252, where byte
\x92 is the right single quotation mark (Unicode U+2019).
As the majority of the mail clients out there are Windows-based, and as
far as I can tell many of them get the encoding wrong, I'd simply try to
decode as CP1252 on error, especially if the Content-Type claims
iso-8859-1. Many Windows mail clients consider iso-8859-1 equivalent to
1252 (it's not; the former has no printable characters in the range \x80
to \x9f, the latter does.)
Regards,
Gertjan.
--
Gertjan Klein <gkl...@xs4all.nl>
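The "decode as CP1252 on error" suggestion could be sketched roughly as follows, assuming Python 3 and raw header bytes. The function name decode_lenient, the order of the fallbacks, and the final latin-1 step are my own assumptions, not something prescribed in this thread. Python's cp1252 codec rejects a handful of undefined bytes (0x81, for instance), which is why a last resort that cannot fail is included.

```python
import codecs


def decode_lenient(raw, declared=None):
    """Decode raw bytes to str, treating iso-8859-1 as cp1252 (sketch)."""
    candidates = []
    if declared:
        try:
            name = codecs.lookup(declared).name  # normalize aliases
        except LookupError:
            name = None  # unknown charset label: ignore it
        if name == "iso8859-1":
            name = "cp1252"  # assume the sender really meant cp1252
        if name:
            candidates.append(name)
    candidates += ["utf-8", "cp1252"]
    for codec in candidates:
        try:
            return raw.decode(codec)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this cannot fail.
    return raw.decode("latin-1")


if __name__ == "__main__":
    print(decode_lenient(b"you can\x92t miss the chance!", "iso-8859-1"))
```

The result is always a str, so it can be encoded as UTF-8 before going into the database. The price is that a wrong guess produces mojibake instead of an error.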