Decode email subjects into unicode

Laszlo Nagy

Mar 18, 2008, 5:44:13 AM

to pytho...@python.org

Hi All,

I'm having trouble decoding email subjects. Here are some examples:

> =?koi8-r?B?4tnT1NLP19nQz8zOyc3PIMkgzcHMz9rB1NLB1M7P?=
> [Fwd: re:Flags Of The World, Us States, And Military]
> =?ISO-8859-2?Q?=E9rdekes?=
> =?UTF-8?B?aGliw6Fr?=

I know that "=?UTF-8?B" means UTF-8 + base64 encoding, but I wonder if
there is a standard method in the "email" package to decode these
subjects? I do not want to reinvent the wheel.

Thanks,

Laszlo

Laszlo Nagy

Mar 18, 2008, 6:09:32 AM

to pytho...@python.org

Sorry, in the meantime I found that "email.Header.decode_header" can be used
to convert the subject into unicode:

> def decode_header(self, headervalue):
>     val, encoding = decode_header(headervalue)[0]
>     if encoding:
>         return val.decode(encoding)
>     else:
>         return val

However, there are malformed emails and I have to put them into the
database. What should I do with this:

Return-Path: <imi...@exalumnos.com>
X-Original-To: in...@designasign.biz
Delivered-To: dap...@localhost.com
Received: from 195.228.74.135 (unknown [122.46.173.89])
by shopzeus.com (Postfix) with SMTP id F1C071DD438;
Tue, 18 Mar 2008 05:43:27 -0400 (EDT)
Date: Tue, 18 Mar 2008 12:43:45 +0200
Message-ID: <60285728...@optometrist.com>
From: "Euro Dice Casino" <imi...@exalumnos.com>
To: tho...@designasign.biz
Subject: With 2’500 Euro of Welcome Bonus you can’t miss the chance!
MIME-Version: 1.0
Content-Type: text/html; charset=iso-8859-1
Content-Transfer-Encoding: 7bit

There is no encoding given in the subject but it contains 0x92. When I
try to insert this into the database, I get:

ProgrammingError: invalid byte sequence for encoding "UTF8": 0x92

All right, this probably was a spam email and I should simply discard
it. Probably the spammer used this special character in order to prevent
mail filters detecting "can't" and "2500". But I guess there will be
other important (ham) emails with bad encodings. How should I handle this?

Thanks,

Laszlo

Ryan Ginstrom

Mar 18, 2008, 6:24:50 AM

to pytho...@python.org

> On Behalf Of Laszlo Nagy

> > =?koi8-r?B?4tnT1NLP19nQz8zOyc3PIMkgzcHMz9rB1NLB1M7P?=
> > [Fwd: re:Flags Of The World, Us States, And Military]
> > =?ISO-8859-2?Q?=E9rdekes?= =?UTF-8?B?aGliw6Fr?=

Try this code:

from email.header import decode_header

def getheader(header_text, default="ascii"):
    """Decode the specified header"""

    headers = decode_header(header_text)
    header_sections = [unicode(text, charset or default)
                       for text, charset in headers]
    return u"".join(header_sections)

I get the following output for your strings:

Быстровыполнимо и малозатратно
érdekeshibák

Regards,
Ryan Ginstrom
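
For readers on Python 3, where the `unicode` builtin no longer exists, the same idea can be sketched as follows. This is only an illustration under stated assumptions: the function name `decode_subject` and the choice to replace undecodable bytes rather than raise are mine, not part of the original post. In Python 3, `decode_header` returns `bytes` for encoded words and may return `str` for a header with no encoded words, so both cases are handled.

```python
from email.header import decode_header


def decode_subject(raw, default="ascii"):
    """Decode an RFC 2047 header value into a str (hypothetical helper)."""
    parts = []
    for text, charset in decode_header(raw):
        if isinstance(text, bytes):
            try:
                # Assumption: undecodable bytes are replaced, not raised.
                text = text.decode(charset or default, errors="replace")
            except LookupError:
                # Unknown charset name in the header: fall back to the default.
                text = text.decode(default, errors="replace")
        parts.append(text)
    return "".join(parts)


if __name__ == "__main__":
    print(decode_subject("=?ISO-8859-2?Q?=E9rdekes?="))
    print(decode_subject("=?UTF-8?B?aGliw6Fr?="))
```

Joining all decoded parts, rather than taking only the first one, matters for subjects that mix several encoded words or mix encoded and plain text.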

Jeffrey Froman

Mar 18, 2008, 12:24:03 PM

to

Laszlo Nagy wrote:

> I know that "=?UTF-8?B" means UTF-8 + base64 encoding, but I wonder if
> there is a standard method in the "email" package to decode these
> subjects?

The standard library function email.Header.decode_header will parse these
headers into an encoded bytestring paired with the appropriate encoding
specification, if any. For example:

>>> raw_headers = [
...     '=?koi8-r?B?4tnT1NLP19nQz8zOyc3PIMkgzcHMz9rB1NLB1M7P?=',
...     '[Fwd: re:Flags Of The World, Us States, And Military]',
...     '=?ISO-8859-2?Q?=E9rdekes?=',
...     '=?UTF-8?B?aGliw6Fr?=',
... ]
>>> from email.Header import decode_header
>>> for raw_header in raw_headers:
...     for header, encoding in decode_header(raw_header):
...         if encoding is None:
...             print header.decode()
...         else:
...             print header.decode(encoding)
...
Быстровыполнимо и малозатратно
[Fwd: re:Flags Of The World, Us States, And Military]
érdekes
hibák

Jeffrey
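
As a possible alternative on Python 3, the standard library's `email.header.make_header` can rebuild a `Header` object from the pairs that `decode_header` returns, and converting that object with `str()` yields the decoded text. The wrapper name `header_to_text` below is a hypothetical helper for illustration; note that this approach raises on malformed input instead of substituting characters.

```python
from email.header import decode_header, make_header


def header_to_text(raw):
    """Return the decoded text of a raw header value (hypothetical helper)."""
    # decode_header yields (value, charset) pairs; make_header reassembles
    # them into a Header whose str() form is the decoded Unicode text.
    return str(make_header(decode_header(raw)))


if __name__ == "__main__":
    for raw in (
        "[Fwd: re:Flags Of The World, Us States, And Military]",
        "=?ISO-8859-2?Q?=E9rdekes?=",
        "=?UTF-8?B?aGliw6Fr?=",
    ):
        print(header_to_text(raw))
```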

John Machin

Mar 18, 2008, 2:57:40 PM

to

On Mar 18, 9:09 pm, Laszlo Nagy <gand...@shopzeus.com> wrote:
> Sorry, meanwhile i found that "email.Headers.decode_header" can be used
> to convert the subject into unicode:
>
> > def decode_header(self, headervalue):
> >     val, encoding = decode_header(headervalue)[0]
> >     if encoding:
> >         return val.decode(encoding)
> >     else:
> >         return val
>
> However, there are malformed emails and I have to put them into the
> database. What should I do with this:
>
> Return-Path: <imit...@exalumnos.com>
> X-Original-To: i...@designasign.biz
> Delivered-To: dapi...@localhost.com
> Received: from 195.228.74.135 (unknown [122.46.173.89])
> by shopzeus.com (Postfix) with SMTP id F1C071DD438;
> Tue, 18 Mar 2008 05:43:27 -0400 (EDT)
> Date: Tue, 18 Mar 2008 12:43:45 +0200
> Message-ID: <60285728...@optometrist.com>
> From: "Euro Dice Casino" <imit...@exalumnos.com>
> To: tho...@designasign.biz
> Subject: With 2'500 Euro of Welcome Bonus you can't miss the chance!
> MIME-Version: 1.0
> Content-Type: text/html; charset=iso-8859-1
> Content-Transfer-Encoding: 7bit
>
> There is no encoding given in the subject but it contains 0x92. When I
> try to insert this into the database, I get:
>
> ProgrammingError: invalid byte sequence for encoding "UTF8": 0x92
>
> All right, this probably was a spam email and I should simply discard
> it. Probably the spammer used this special character in order to prevent
> mail filters detecting "can't" and "2500". But I guess there will be
> other important (ham) emails with bad encodings. How should I handle this?

Maybe with some heuristics about the types of mistakes made by do-it-
yourself e-mail header constructors. For example, 'iso-8859-1' often
should be construed as 'cp1252':

>>> import unicodedata as ucd
>>> ucd.name('\x92'.decode('iso-8859-1'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
>>> ucd.name('\x92'.decode('cp1252'))
'RIGHT SINGLE QUOTATION MARK'
>>>

Gertjan Klein

Mar 19, 2008, 5:11:18 AM

to

Laszlo Nagy wrote:

>However, there are malformed emails and I have to put them into the
>database. What should I do with this:

[...]

>There is no encoding given in the subject but it contains 0x92. When I
>try to insert this into the database, I get:

This is indeed malformed email. The content type in the header specifies
iso-8859-1, but this looks like Windows code page 1252, where byte
\x92 is the right single quotation mark (Unicode U+2019).

As the majority of the mail clients out there are Windows-based, and as
far as I can tell many of them get the encoding wrong, I'd simply try to
decode as CP1252 on error, especially if the content-type claims
iso-8859-1. Many Windows mail clients consider iso-8859-1 equivalent to
1252 (it's not; the former has no printable characters in the byte
ranges \x8n and \x9n, while the latter assigns printable characters to
most of them.)

Regards,
Gertjan.

--
Gertjan Klein <gkl...@xs4all.nl>
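
The fallback suggested in this post and the previous one could be sketched in Python 3 roughly as below. Everything here is an assumption made for illustration: the function name `lenient_decode`, the list of spellings treated as iso-8859-1, and the final fallback to latin-1 (which maps every byte to some character and therefore never fails) are not from the thread. Python's cp1252 codec leaves a few bytes such as 0x81 undefined, which is why a last resort is still needed.

```python
# Spellings treated as "claims iso-8859-1" (an illustrative, incomplete list).
LATIN1_NAMES = {"iso-8859-1", "iso8859-1", "latin-1", "latin1"}


def lenient_decode(raw, declared=None):
    """Decode header bytes, preferring cp1252 over iso-8859-1 (hypothetical helper)."""
    candidates = []
    if declared is None:
        candidates.append("ascii")      # no charset given: try plain ASCII first
    elif declared.lower() not in LATIN1_NAMES:
        candidates.append(declared)     # trust any other declared charset first
    candidates.append("cp1252")         # heuristic from the thread
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue
    # latin-1 accepts every byte value, so this cannot raise.
    return raw.decode("latin-1")


if __name__ == "__main__":
    subject = b"With 2\x92500 Euro of Welcome Bonus you can\x92t miss the chance!"
    print(lenient_decode(subject, "iso-8859-1"))
```

The resulting `str` can then be encoded as UTF-8 for the database, which avoids the "invalid byte sequence" error for this message. Whether a cp1252 guess is appropriate for a given mail corpus remains a judgment call.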

Laszlo Nagy

Mar 19, 2008, 6:24:45 AM

to Gertjan Klein, pytho...@python.org

Thank you very much!

akoma...@gmail.com

Apr 18, 2017, 7:27:08 PM

to

I'm sorry to intrude on this conversation, but I was wondering if I could get some help with a partial email address (n********7@m***.ru). This is how it appears when it's obviously being hidden, and I need the full address so I can get a hacker off my back and get my accounts returned to me. My business needs these accounts to run. Help please!