
MIME decode


Oleg Broytmann

Oct 12, 2001, 10:31:10 AM
Hello!

MIME decode

WHAT IS IT

Mail users, especially in non-English countries, often find that mail
messages arrive in different formats, with different content types, in
different encodings and charsets. Usually this is a good thing because it
allows us to use appropriate formats/encodings/whatever. Sometimes, though,
some unification is desirable. For example, one may want to put mail messages
into an archive, build HTML indices, run a search indexer, etc. In such
situations converting messages to text in one character set and skipping some
binary attachments is highly desirable.

Here is the solution - mimedecode.py.

This is a program to decode MIME messages. The program expects one input
file (either on the command line or on stdin), which is treated as an RFC822
message and decoded to stdout. If the file is not an RFC822 message, it is
just piped to stdout unchanged. If it is a simple RFC822 message, it is
decoded as one part. If it is a MIME message with multiple parts
("attachments"), all parts are decoded. Decoding can be controlled by
command-line options.


WHERE TO GET
Master site: http://phd.pp.ru/Software/Python/#mimedecode

Faster mirror: http://phd.by.ru/Software/Python/#mimedecode

Requires: Python 2.0+, configured mailcap database.

Documentation (also included in the package):
http://phd.pp.ru/Software/Python/mimedecode.txt
http://phd.by.ru/Software/Python/mimedecode.txt


AUTHOR
Oleg Broytmann <p...@phd.pp.ru>

COPYRIGHT
Copyright (C) 2001 PhiloSoft Design

LICENSE
GPL


Detailed manual

NAME
mimedecode.py - decode MIME message.


SYNOPSIS
mimedecode.py [-h|--help] [-V|--version] [-cCfFsS] [-beit mask] [filename]


DESCRIPTION
First, the Subject and Content-Disposition headers are examined. If either
of them exists, it is decoded according to RFC2047. The Content-Disposition
header itself is not decoded - only its "filename" parameter. Encoding header
parameters this way violates the RFC, but it is widely deployed anyway,
especially in the M$ Ophice GUI (often referred to as "Windoze") world, where
programmers are usually ignorant lamers who have never even heard of RFCs.
The correct parameter encoding is specified by RFC2231. This program decodes
RFC2231-encoded parameters; continuation parameters (header*1, header*2,
etc.) are not yet supported.
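
As an illustration of the header step described above, here is a minimal
sketch using the modern Python 3 standard "email" package. It is not the code
of mimedecode.py (which targets Python 2.0); the function names
decode_rfc2047 and decoded_subject_and_filename are made up for this example.

```python
from email import message_from_string
from email.header import decode_header


def decode_rfc2047(value, fallback="us-ascii"):
    """Decode RFC 2047 encoded-words in a header value into a str."""
    parts = []
    for chunk, charset in decode_header(value):
        # decode_header yields bytes for encoded words, str for plain headers
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or fallback, "replace")
        parts.append(chunk)
    return "".join(parts)


def decoded_subject_and_filename(raw_message):
    """Return (subject, filename) with RFC 2047 / RFC 2231 encoding undone."""
    msg = message_from_string(raw_message)
    subject = msg.get("Subject")
    if subject is not None:
        subject = decode_rfc2047(subject)
    # get_filename() already understands RFC 2231 "filename*=" parameters
    filename = msg.get_filename()
    if filename is not None:
        # also accept the non-standard RFC 2047 form inside the parameter
        filename = decode_rfc2047(filename)
    return subject, filename
```

Only the "filename" parameter is touched; the rest of the Content-Disposition
header is left alone, as in the description above.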

Then the body of the message (or of the current part) is decoded. Decoding
starts by looking at the Content-Transfer-Encoding header. If the header
specifies a non-8bit encoding (usually base64 or quoted-printable), the body
is converted to 8bit. Then, if the content type is multipart
(multipart/related or multipart/mixed, e.g.), every part is decoded
recursively. If it is not multipart, the mailcap database is consulted to find
a way to convert the body to plain text. (I have no idea how mailcap could be
configured on said M$ Ophice GUI, please don't ask me; real OS users can
consult my example at http://phd.pp.ru/Software/dotfiles/mailcap.html). The
decoding process uses the first copiousoutput filter it can find. If there is
no filter, the body is passed as is.
Then the Content-Type header is consulted for the charset. If it is not equal
to the current default charset, the body text is recoded using Unicode codecs.
Finally, the message headers and body are flushed to stdout.
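
The body steps above (undo the transfer encoding, recurse into multipart,
recode the charset) can be sketched roughly as follows with the Python 3
standard "email" package. This is an illustration under assumptions, not the
program's own code: the name recode_bodies is invented here, the mailcap
filter step is left out, and only text/* parts are recoded.

```python
from email import message_from_string


def recode_bodies(raw_message, default_charset="utf-8"):
    """Return a list of (content_type, body_bytes) for every leaf part."""
    results = []

    def visit(part):
        if part.is_multipart():
            # multipart/*: decode every sub-part recursively
            for sub in part.get_payload():
                visit(sub)
            return
        # decode=True undoes base64 / quoted-printable transfer encoding
        raw = part.get_payload(decode=True) or b""
        if part.get_content_maintype() == "text":
            charset = part.get_content_charset() or default_charset
            try:
                text = raw.decode(charset, "replace")
            except LookupError:  # unknown charset name in the header
                text = raw.decode(default_charset, "replace")
            # recode into the default charset via Unicode
            raw = text.encode(default_charset, "replace")
        results.append((part.get_content_type(), raw))

    visit(message_from_string(raw_message))
    return results
```

Going through Unicode (decode, then encode) mirrors the "recoded using Unicode
codecs" step; characters missing from the target charset are replaced rather
than raising an error, which is a choice made for this sketch only.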


OPTIONS
-h
--help
Print brief usage help and exit.

-V
--version
Print version and exit.

-c
Recode different character sets to current default charset; this is
the default.

-C
Do not recode character sets.

-f
Decode "filename" parameter of Content-Disposition header; this is
the default.

-F
Do not decode filenames.

-s
Decode Subject header; this is the default.

-S
Do not decode Subject.

-b mask
Append mask to the list of binary content types; if the message to
decode has a part of this type the program will pass the part as is,
without any additional processing.

-e mask
Append mask to the list of error content types; if the message to
decode has a part of this type the program will raise ValueError.

-i mask
Append mask to the list of content types to ignore; if the message to
decode has a part of this type the program will not pass it; instead,
the line "\nMessage body of type `%s' skipped.\n" will be issued.

-t mask
Append mask to the list of content types to convert to text; if the
message to decode has a part of this type the program will consult
mailcap database, find first copiousoutput filter and convert the
part.
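
To make "find first copiousoutput filter" concrete, here is a simplified
sketch of such a lookup over mailcap text. It is an assumption-laden
illustration, not the program's implementation: the function name
first_copiousoutput is invented, and the parsing ignores backslash-escaped
semicolons, continuation lines and "test=" conditions that real mailcap files
may contain.

```python
def first_copiousoutput(mailcap_text, content_type):
    """Return the command of the first matching copiousoutput entry, or None."""
    wanted = content_type.lower()
    major = wanted.split("/", 1)[0]
    for line in mailcap_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2:
            continue  # malformed entry: no command
        pattern, command, flags = fields[0].lower(), fields[1], fields[2:]
        # an entry matches exactly, as "type/*", or as a bare "type"
        if pattern not in (wanted, major + "/*", major):
            continue
        if "copiousoutput" in flags:
            return command
    return None
```

The returned command still contains the "%s" placeholder, which the caller
would have to replace with the name of a temporary file holding the part.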

The last 4 options (-beit) require more explanation. They allow a user
to control body decoding with great flexibility. Think about the mail
archive mentioned above; suppose its maintainer wants to store only texts
there: convert Postscript/PDF to text, pass HTML and images as is, and ignore
everything else. Easy:

mimedecode.py -t application/postscript -t application/pdf \
-b text/html -b 'image/*' -i '*/*'

When the program decodes a message (or a part of it), it consults the
Content-Type header. The content type is searched for in all 4 lists, in the
order "text-binary-ignore-error". If it is found, the appropriate action is
performed. If not, the program searches the same lists for the "type/*" mask
(the type of "text/html" is just "text"). If that is found, the appropriate
action is performed. If not, the program searches the same lists for the
"*/*" mask. If that is found, the appropriate action is performed. If not,
the program uses the default action, which is to decode everything to text
(if mailcap specifies filters).
Initially all 4 lists are empty, so without any additional parameters
the program always uses the default decoding.
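
The lookup order described above can be written down in a few lines. This is
a minimal sketch of the rule as stated in this manual, not an excerpt from
mimedecode.py; the function name choose_action and the action labels are
made up for the example.

```python
def choose_action(content_type, text=(), binary=(), ignore=(), error=()):
    """Pick an action for content_type using the -t/-b/-i/-e mask lists."""
    # lists are consulted in the order "text-binary-ignore-error"
    lists = (("text", text), ("binary", binary),
             ("ignore", ignore), ("error", error))
    ctype = content_type.lower()
    major = ctype.split("/", 1)[0]
    # exact type first, then "type/*", then "*/*"
    for mask in (ctype, major + "/*", "*/*"):
        for action, masks in lists:
            if mask in masks:
                return action
    # nothing matched: default is to decode to text (if mailcap has a filter)
    return "text"
```

With the lists from the example command line (-t application/postscript
-t application/pdf -b text/html -b 'image/*' -i '*/*'), this sketch returns
"text" for application/pdf, "binary" for text/html and image/png, and
"ignore" for audio/mpeg.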


ENVIRONMENT
LANG
LC_ALL
LC_CTYPE
Define the current locale settings. Usually used to determine the current
default charset.
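
One possible way to derive a default charset from these variables is sketched
below. It assumes the usual POSIX precedence (LC_ALL, then LC_CTYPE, then
LANG) and the "language_TERRITORY.codeset@modifier" value format; the function
name charset_from_env and the "us-ascii" fallback are assumptions of this
example, and how mimedecode.py itself reads the locale is not shown here. In
current Python the standard call locale.getpreferredencoding() answers the
same question for the running process.

```python
def charset_from_env(env, fallback="us-ascii"):
    """Guess the default charset from locale variables in the mapping env."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if not value:
            continue  # unset or empty: try the next variable
        # drop an "@modifier" suffix such as "@euro"
        value = value.split("@", 1)[0]
        if "." in value:
            # the codeset follows the first dot, e.g. "ru_RU.KOI8-R"
            return value.split(".", 1)[1]
        return fallback  # locale such as "C" names no codeset
    return fallback
```

Passing a mapping instead of reading os.environ directly keeps the function
easy to try out, e.g. charset_from_env({"LANG": "ru_RU.KOI8-R"}).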


BUGS
The program may output an incorrect MIME message. The purpose of the
program is to decode whatever can be decoded, not to produce
absolutely correct MIME output. The incorrect parts are obvious - decoded
Subject headers and filenames.
Decoding of mail header parameters is incomplete - continuations in
RFC2231-encoded parameters (header*1, header*2, etc.) are not parsed yet.


NO WARRANTIES
This program is distributed in the hope that it will be
useful, but WITHOUT ANY WARRANTY; without even the implied
warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
PURPOSE. See the GNU General Public License for more
details.
Oleg.
----
Oleg Broytmann http://phd.pp.ru/ p...@phd.pp.ru
Programmers don't die, they just GOSUB without RETURN.
