
Decode email subjects into unicode


Laszlo Nagy

Mar 18, 2008, 5:44:13 AM
to pytho...@python.org
Hi All,

I'm having trouble decoding email subjects. Here are some examples:

> =?koi8-r?B?4tnT1NLP19nQz8zOyc3PIMkgzcHMz9rB1NLB1M7P?=
> [Fwd: re:Flags Of The World, Us States, And Military]
> =?ISO-8859-2?Q?=E9rdekes?=
> =?UTF-8?B?aGliw6Fr?=


I know that "=?UTF-8?B" means UTF-8 + base64 encoding, but I wonder if
there is a standard method in the "email" package to decode these
subjects? I do not want to re-invent the wheel.

Thanks,

Laszlo

Laszlo Nagy

Mar 18, 2008, 6:09:32 AM
to pytho...@python.org
Sorry, meanwhile I found that "email.Header.decode_header" can be used
to convert the subject into unicode:

> def decode_header(self, headervalue):
>     val, encoding = decode_header(headervalue)[0]
>     if encoding:
>         return val.decode(encoding)
>     else:
>         return val

However, there are malformed emails and I have to put them into the
database. What should I do with this:


Return-Path: <imi...@exalumnos.com>
X-Original-To: in...@designasign.biz
Delivered-To: dap...@localhost.com
Received: from 195.228.74.135 (unknown [122.46.173.89])
    by shopzeus.com (Postfix) with SMTP id F1C071DD438;
    Tue, 18 Mar 2008 05:43:27 -0400 (EDT)
Date: Tue, 18 Mar 2008 12:43:45 +0200
Message-ID: <60285728...@optometrist.com>
From: "Euro Dice Casino" <imi...@exalumnos.com>
To: tho...@designasign.biz
Subject: With 2’500 Euro of Welcome Bonus you can’t miss the chance!
MIME-Version: 1.0
Content-Type: text/html; charset=iso-8859-1
Content-Transfer-Encoding: 7bit

There is no encoding given in the subject but it contains 0x92. When I
try to insert this into the database, I get:

ProgrammingError: invalid byte sequence for encoding "UTF8": 0x92

All right, this probably was a spam email and I should simply discard
it. Probably the spammer used this special character in order to prevent
mail filters detecting "can't" and "2500". But I guess there will be
other important (ham) emails with bad encodings. How should I handle this?

Thanks,

Laszlo

Ryan Ginstrom

Mar 18, 2008, 6:24:50 AM
to pytho...@python.org
> On Behalf Of Laszlo Nagy

> > =?koi8-r?B?4tnT1NLP19nQz8zOyc3PIMkgzcHMz9rB1NLB1M7P?=
> > [Fwd: re:Flags Of The World, Us States, And Military]
> > =?ISO-8859-2?Q?=E9rdekes?= =?UTF-8?B?aGliw6Fr?=

Try this code:

from email.header import decode_header

def getheader(header_text, default="ascii"):
    """Decode the specified header"""

    headers = decode_header(header_text)
    header_sections = [unicode(text, charset or default)
                       for text, charset in headers]
    return u"".join(header_sections)

I get the following output for your strings:

Быстровыполнимо и малозатратно
érdekeshibák

Regards,
Ryan Ginstrom
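
A note for readers on Python 3, where the unicode() builtin used above no longer exists: a rough equivalent might look like the sketch below. It assumes that decode_header returns bytes for headers that contain encoded words and a plain str for headers that contain none. The errors="replace" argument is an addition for illustration, not part of the original reply, so that one bad byte does not abort the whole header.

```python
from email.header import decode_header


def getheader(header_text, default="ascii"):
    """Decode an RFC 2047 header value into a single str (Python 3 sketch)."""
    sections = []
    for text, charset in decode_header(header_text):
        if isinstance(text, bytes):
            # Parts of a header that contains encoded words arrive as bytes.
            sections.append(text.decode(charset or default, errors="replace"))
        else:
            # A header without any encoded words comes back as str already.
            sections.append(text)
    return "".join(sections)


if __name__ == "__main__":
    print(getheader("=?UTF-8?B?aGliw6Fr?="))
```

An unknown charset label would still raise LookupError here; whether to catch it depends on how strict the caller wants to be.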

Jeffrey Froman

Mar 18, 2008, 12:24:03 PM
Laszlo Nagy wrote:

> I know that "=?UTF-8?B" means UTF-8 + base64 encoding, but I wonder if
> there is a standard method in the "email" package to decode these
> subjects?

The standard library function email.Header.decode_header will parse these
headers into a list of encoded bytestrings, each paired with the appropriate
encoding specification, if any. For example:

>>> raw_headers = [
...     '=?koi8-r?B?4tnT1NLP19nQz8zOyc3PIMkgzcHMz9rB1NLB1M7P?=',
...     '[Fwd: re:Flags Of The World, Us States, And Military]',
...     '=?ISO-8859-2?Q?=E9rdekes?=',
...     '=?UTF-8?B?aGliw6Fr?=',
... ]
>>> from email.Header import decode_header
>>> for raw_header in raw_headers:
...     for header, encoding in decode_header(raw_header):
...         if encoding is None:
...             print header.decode()
...         else:
...             print header.decode(encoding)
...
Быстровыполнимо и малозатратно
[Fwd: re:Flags Of The World, Us States, And Military]
érdekes
hibák


Jeffrey
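
The session above is Python 2 (print statement, email.Header). On Python 3 the module is spelled email.header, and one possible shortcut is to let make_header reassemble the decoded parts instead of decoding each part by hand. The following is only a sketch; the helper name decode_subject is made up for illustration.

```python
from email.header import decode_header, make_header


def decode_subject(raw_header):
    """Return the header as one str, letting make_header join the parts."""
    # decode_header splits the value into (data, charset) pairs;
    # make_header builds a Header object from them, and str() renders it.
    return str(make_header(decode_header(raw_header)))


if __name__ == "__main__":
    for raw in ("=?ISO-8859-2?Q?=E9rdekes?=", "=?UTF-8?B?aGliw6Fr?="):
        print(decode_subject(raw))
```

This keeps the charset handling inside the standard library, at the cost of less control over what happens when a part cannot be decoded.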

John Machin

Mar 18, 2008, 2:57:40 PM
On Mar 18, 9:09 pm, Laszlo Nagy <gand...@shopzeus.com> wrote:
> Sorry, meanwhile I found that "email.Header.decode_header" can be used
> to convert the subject into unicode:
>
> > def decode_header(self, headervalue):
> >     val, encoding = decode_header(headervalue)[0]
> >     if encoding:
> >         return val.decode(encoding)
> >     else:
> >         return val
>
> However, there are malformed emails and I have to put them into the
> database. What should I do with this:
>
> Return-Path: <imit...@exalumnos.com>
> X-Original-To: i...@designasign.biz
> Delivered-To: dapi...@localhost.com
> Received: from 195.228.74.135 (unknown [122.46.173.89])
>     by shopzeus.com (Postfix) with SMTP id F1C071DD438;
>     Tue, 18 Mar 2008 05:43:27 -0400 (EDT)
> Date: Tue, 18 Mar 2008 12:43:45 +0200
> Message-ID: <60285728...@optometrist.com>
> From: "Euro Dice Casino" <imit...@exalumnos.com>
> To: tho...@designasign.biz
> Subject: With 2'500 Euro of Welcome Bonus you can't miss the chance!
> MIME-Version: 1.0
> Content-Type: text/html; charset=iso-8859-1
> Content-Transfer-Encoding: 7bit
>
> There is no encoding given in the subject but it contains 0x92. When I
> try to insert this into the database, I get:
>
> ProgrammingError: invalid byte sequence for encoding "UTF8": 0x92
>
> All right, this probably was a spam email and I should simply discard
> it. Probably the spammer used this special character in order to prevent
> mail filters detecting "can't" and "2500". But I guess there will be
> other important (ham) emails with bad encodings. How should I handle this?

Maybe with some heuristics about the types of mistakes made by
do-it-yourself e-mail header constructors. For example, 'iso-8859-1' often
should be construed as 'cp1252':

>>> import unicodedata as ucd
>>> ucd.name('\x92'.decode('iso-8859-1'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
>>> ucd.name('\x92'.decode('cp1252'))
'RIGHT SINGLE QUOTATION MARK'
>>>
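
One way to turn that heuristic into code, sketched here for Python 3, is to override a latin-1 label only when the raw bytes actually contain values in the 0x80-0x9F range, which are control codes in ISO-8859-1 but mostly printable in cp1252. The function name effective_charset is hypothetical, and the rule is a guess about sender behaviour, not something the email package provides.

```python
import codecs

# Bytes 0x80-0x9F: C1 control codes in ISO-8859-1, mostly printable in cp1252.
C1_RANGE = range(0x80, 0xA0)


def effective_charset(declared, raw):
    """Return 'cp1252' when a latin-1 label is probably wrong (heuristic)."""
    try:
        canonical = codecs.lookup(declared).name
    except LookupError:
        return declared  # unknown label: leave the decision to the caller
    if canonical == "iso8859-1" and any(b in C1_RANGE for b in raw):
        return "cp1252"
    return declared


if __name__ == "__main__":
    raw = b"you can\x92t miss the chance"
    print(raw.decode(effective_charset("iso-8859-1", raw)))
```

Legitimate ISO-8859-1 text almost never contains C1 control codes, which is why the override is unlikely to damage correctly labelled mail.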

Gertjan Klein

Mar 19, 2008, 5:11:18 AM
Laszlo Nagy wrote:

>However, there are malformed emails and I have to put them into the
>database. What should I do with this:

[...]


>There is no encoding given in the subject but it contains 0x92. When I
>try to insert this into the database, I get:

This is indeed malformed email. The content type in the header specifies
iso-8859-1, but this looks like Windows code page 1252, where character
\x92 is a right single quotation mark (Unicode U+2019).

As the majority of the mail clients out there are Windows-based, and as
far as I can tell many of them get the encoding wrong, I'd simply try to
decode as CP1252 on error, especially if the content-type claims
iso-8859-1. Many Windows mail clients consider iso-8859-1 equivalent to
1252 (it's not; the former has no printable characters in the range
\x80-\x9f, the latter does.)

Regards,
Gertjan.

--
Gertjan Klein <gkl...@xs4all.nl>
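
The "decode as CP1252 on error" idea could be sketched in Python 3 roughly as follows. The function name decode_lenient and the order of the candidates are assumptions made for illustration. Note that Python's iso-8859-1 codec accepts every byte value, so a message labelled iso-8859-1 never reaches the fallback; the chain mainly helps with unlabelled or mislabelled UTF-8 and ASCII data, such as the subject line in question.

```python
def decode_lenient(raw, declared=None):
    """Try the declared charset, then UTF-8, then cp1252, then latin-1."""
    candidates = [c for c in (declared, "utf-8", "cp1252") if c]
    for charset in candidates:
        try:
            return raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: move on to the next candidate.
            continue
    # latin-1 maps every byte to a code point, so this cannot fail.
    return raw.decode("latin-1")


if __name__ == "__main__":
    subject = b"With 2\x92500 Euro of Welcome Bonus you can\x92t miss the chance!"
    print(decode_lenient(subject))
```

The result is always a str that can be stored as UTF-8 in the database, although for badly damaged input the text may not be what the sender intended.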

Laszlo Nagy

Mar 19, 2008, 6:24:45 AM
to Gertjan Klein, pytho...@python.org
Thank you very much!

akoma...@gmail.com

Apr 18, 2017, 7:27:08 PM
I'm sorry to intrude on this conversation, but I was wondering if I could get some help with a partial email address (n********7@m***.ru). This is how it appears when it's obviously being hidden, and I need the full address so I can get a hacker off my back and get my accounts returned to me. My business needs these accounts to run. Help please!