do you use the formail program?

Eli the Bearded

Oct 24, 2018, 5:53:00 PM

Do you use the formail program? If so maybe you could help test
something.

In a fit of frustration with RFC-2047 MIME encoded words, I added
code to formail that will decode them. (It cannot encode them, at
least yet.)

MIME words, if you don't know, are the things that look like this:

=?CHARSET?Q?Quoted=2Dprintable_content?=

They are frequently used in Subject: and From: lines (and rarely in
other headers) to encode non-ASCII content in a 7-bit safe way.
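
To make that concrete, here is a minimal Python sketch using the standard
library's email.header.decode_header. Python is used only for illustration
and has nothing to do with formail's C code. Filling in "us-ascii" for the
CHARSET placeholder is an assumption made so the example word is decodable.

```python
from email.header import decode_header

# The example word from above, with "us-ascii" substituted for the
# CHARSET placeholder (an assumption, purely for illustration).
word = "=?us-ascii?Q?Quoted=2Dprintable_content?="

# decode_header returns a list of (bytes, charset) pairs, one per chunk.
raw, charset = decode_header(word)[0]
decoded = raw.decode(charset)

# In the Q encoding, "=2D" is a hex-escaped "-" and "_" stands for a space.
print(decoded)
```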

In my modified formail, everything works as in a stock 3.22 version
except that an additional flag has been added. When you use that
flag to specify an acceptable charset, formail will decode words
matching that charset on read, and all further actions with the content
will operate on the decoded version.
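
The "decode only what matches" behaviour can be sketched roughly as follows.
This is a Python illustration of the idea described above, not a port of the
formail code; the function name decode_matching and the sample header are
made up, and RFC 2047 details such as dropping whitespace between adjacent
encoded words are deliberately ignored.

```python
import base64
import binascii
import quopri
import re

# =?charset?encoding?payload?=  (encoding is Q or B, case-insensitive)
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([QqBb])\?([^?\s]*)\?=")

def decode_matching(header, wanted):
    """Decode only the encoded words whose charset equals `wanted`."""
    def replace(match):
        charset, encoding, payload = match.groups()
        if charset.lower() != wanted.lower():
            return match.group(0)          # different charset: leave as-is
        try:
            if encoding.upper() == "B":
                raw = base64.b64decode(payload, validate=True)
            else:
                # header=True makes "_" decode to a space, as RFC 2047 wants
                raw = quopri.decodestring(payload.encode("ascii"), header=True)
            return raw.decode(charset)
        except (binascii.Error, LookupError, UnicodeError):
            return match.group(0)          # undecodable: leave as-is
    return ENCODED_WORD.sub(replace, header)

header = "Subject: =?utf-8?Q?Caf=C3=A9?= =?koi8-r?B?7sUg?="
print(decode_matching(header, "utf-8"))
```

Only the utf-8 word is rewritten; the koi8-r word passes through untouched.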

This is beta quality code and I'm looking for people to test it and
flush out any problems with it.

https://github.com/Eli-the-Bearded/procmail-formail

Note that this contains a complete procmail source package, but only
the formail program (and manpage) has been modified.

git clone https://github.com/Eli-the-Bearded/procmail-formail
cd procmail-formail
make
cp new/formail $YOUR_PREFERRED_BIN_LOCATION
cp new/formail.1 $YOUR_PREFERRED_MAN_LOCATION

Then add "-M charset" to formail's flags to activate the decoder.

(I'm also vaguely aware there have been security patches to the
procmail package since 3.22 came out in 2001 which are not incorporated
here.)

Elijah
------
procmail was clearly written by someone who hates whitespace in code

Jorgen Grahn

Oct 26, 2018, 1:57:16 AM

On Wed, 2018-10-24, Eli the Bearded wrote:
> Do you use the formail program? If so maybe you could help test
> something.
>
> In a fit of frustration with RFC-2047 MIME encoded words, I added
> code to formail that will decode them.

...

> This is beta quality code and I'm looking for people to test it and
> flush out any problems with it.
>
> https://github.com/Eli-the-Bearded/procmail-formail

I don't use formail (yet), but I'm fairly interested in running this
on my mailboxes to see if they trigger any problems. If I find the
time, I'll let you know the result.

Note that mailing list archives are good test data.

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

Grant Taylor

Oct 31, 2018, 6:03:38 PM

On 10/24/2018 03:53 PM, Eli the Bearded wrote:
> Do you use the formail program?

Yes.

> If so maybe you could help test something.

Maybe.

> In a fit of frustration with RFC-2047 MIME encoded words, I added code
> to formail that will decode them. (It cannot encode them, at least yet.)

Intriguing.

> MIME words, if you don't know, are the things that look like this:
>
> =?CHARSET?Q?Quoted=2Dprintable_content?=

To be honest, I hadn't given them much thought. I actually searched my
mailbox and found a bunch of occurrences of MIME words. (5739 Subjects
with MIME words.)

So, I think I am more interested in this than I was when I originally
read your post.

3596 utf-8
262 windows-1252
246 iso-8859-1
104 koi8-r
63 windows-1251
16 iso-8859-9
11 gb2312
10 windows-1254
9 us-ascii
4 windows-1256
2 iso-8859-5
2 iso-2022-jp
2 gbk
1 iso-8859-8
1 iso-8859-7
1 iso-8859-3
1 big5
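
The post does not say how this tally was produced. One hypothetical way to
get a similar count, sketched in Python (the function name tally_charsets and
the sample Subject lines are invented for illustration):

```python
import re
from collections import Counter

# Captures the charset part of =?charset?Q?...?= or =?charset?B?...?=
CHARSET_RE = re.compile(r"=\?([^?\s]+)\?[QqBb]\?[^?\s]*\?=")

def tally_charsets(subjects):
    """Count Subjects per charset, counting each charset once per Subject."""
    counts = Counter()
    for subject in subjects:
        for charset in {c.lower() for c in CHARSET_RE.findall(subject)}:
            counts[charset] += 1
    return counts

# Made-up sample data standing in for a real mailbox.
subjects = [
    "=?UTF-8?Q?Caf=C3=A9?=",
    "=?utf-8?B?Q2Fmw6k=?= =?utf-8?Q?_au_lait?=",
    "=?koi8-r?B?7sUg?=",
    "no encoded words here",
]
counts = tally_charsets(subjects)
for charset, n in counts.most_common():
    print(n, charset)
```

For a real mbox file, the Subject values could be pulled with the standard
mailbox module (mailbox.mbox(path), then msg["Subject"] for each message).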

> That are frequently used in Subject: and From: lines (and rarely in
> other headers) to encode non-ASCII content in a 7-bit safe way.

ACK

> In my modified formail, everything works as in a stock 3.22 version
> except that an additional flag has been added. When you use that flag to
> specify an acceptable charset, formail will decode words matching that
> charset on read, and all further actions with the content will operate
> on the decoded version.

I like the idea, but I get the impression that I need to specify the
source encoding. Which as you can see above, there are a number of them.

Or am I misunderstanding you? Do you mean that you specify the target
encoding? Like I would want something like all of the above to be
decoded to utf-8 for processing in formail?

> This is beta quality code and I'm looking for people to test it and
> flush out any problems with it.

Fair enough.

--
Grant. . . .
unix || die

Eli the Bearded

Nov 1, 2018, 2:54:43 PM

In comp.mail.mime, Grant Taylor <gta...@tnetconsulting.net> wrote:
> On 10/24/2018 03:53 PM, Eli the Bearded wrote:

>> =?CHARSET?Q?Quoted=2Dprintable_content?=

> So, I think I am more interested in this than I was when I originally
> read your post.
>
> 3596 utf-8
> 262 windows-1252
> 246 iso-8859-1
> 104 koi8-r

...

> I like the idea, but I get the impression that I need to specify the
> source encoding. Which as you can see above, there are a number of them.
>
> Or am I misunderstanding you? Do you mean that you specify the target
> encoding? Like I would want something like all of the above to be
> decoded to utf-8 for processing in formail?

There's no misunderstanding. It extracts only when the encoding matches
because it really doesn't make sense to extract an encoding you can't
understand. In all likelihood, /you/ understand (some) utf-8,
windows-1252, and iso-8859-1 when they are displayed by a compatible
terminal, but your terminal window will only understand one of those,
and koi8-r will not be understood by you or your terminal.

IF your terminal understands some form of Unicode (UTF-8, UTF-16,
UTF-32, UTF-7) then any charset you encounter can be converted to
your preferred Unicode charset. But if your terminal is ISO-8859-1
(or Windows-1252), then chances are the content in any of the other
encodings /cannot/ be converted to your charset. Windows-1252 in
particular is ISO-8859-1 plus some other characters, so ISO-8859-1
will display as Windows-1252, but not always the other way around.
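
A quick way to see which conversions are possible, sketched in Python with
its built-in codecs (the helper name can_encode and the sample strings are
made up; "cp1252" is Python's name for Windows-1252):

```python
def can_encode(text, charset):
    """Return True if every character of `text` exists in `charset`."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

samples = ["Café", "€5", "Не удается доставить"]
for target in ("utf-8", "cp1252", "iso-8859-1"):
    print(target, [can_encode(s, target) for s in samples])

# Printable ISO-8859-1 bytes mean the same thing in Windows-1252 ...
same = b"Caf\xe9".decode("iso-8859-1") == b"Caf\xe9".decode("cp1252")
# ... but Windows-1252 extras such as the euro sign have no ISO-8859-1 form.
euro_fits_latin1 = can_encode("€", "iso-8859-1")
print(same, euro_fits_latin1)
```

UTF-8 accepts all three samples, Windows-1252 accepts the first two, and
ISO-8859-1 only the first.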

>> This is beta quality code and I'm looking for people to test it and
>> flush out any problems with it.
> Fair enough.

I wouldn't say no to someone wiring in iconv to convert between
charsets. To someone who wanted to do that, I'd suggest building it
into the self-contained mime.c that I wrote.

I can see two useful ways to do this:

a) A decoder function that takes a target charset and decodes MIME
words that can be safely re-encoded into the target charset. If
a MIME word uses characters outside the target charset, the
original string is left unmodified.

b) A decoder function that takes a target charset and lossily, if
needed, decodes MIME words somehow indicating when characters
have been omitted as untranslatable.
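
The difference between (a) and (b) can be sketched in Python, with the
built-in codecs standing in for iconv. This is an illustration of the two
policies only, not code from mime.c; the function names are hypothetical,
each call handles a single encoded word, and "?" as the marker for dropped
characters is just what Python's "replace" error handler produces.

```python
from email.header import decode_header

def decode_word(word):
    """Decode one encoded word to text; return plain input unchanged."""
    raw, charset = decode_header(word)[0]
    if charset is None:        # not an encoded word
        return word
    return raw.decode(charset)

def decode_strict(word, target):
    """Option (a): decode only if the result fits the target charset."""
    text = decode_word(word)
    try:
        text.encode(target)
    except UnicodeEncodeError:
        return word            # leave the original string unmodified
    return text

def decode_lossy(word, target):
    """Option (b): always decode, marking untranslatable characters."""
    text = decode_word(word)
    return text.encode(target, errors="replace").decode(target)

word = "=?utf-8?Q?Caf=C3=A9_=E2=82=AC5?="    # "Café €5" in UTF-8
print(decode_strict(word, "iso-8859-1"))     # euro sign does not fit
print(decode_lossy(word, "iso-8859-1"))
print(decode_strict(word, "cp1252"))
```

With an ISO-8859-1 target the strict variant returns the word undecoded,
while the lossy variant yields the text with the euro sign replaced.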

The procmail / formail code is a nightmare to edit. I don't know if you
are familiar with C, but here's how main() starts for formail:

int main(lastm,argv)int lastm;const char*const argv[];
{ int i,split=0,force=0,bogus=1,every=0,headreply=0,digest=0,nowait=0,keepb=0,
minfields=(char*)progid-(char*)progid,conctenate=0,babyl=0,babylstart,
berkeley=0,forgetclen;
long maxlen,ctlength;FILE*idcache=0;pid_t thepid;
size_t j,lnl,escaplen;char*chp,*namep,*escap=ESCAP;
struct field*fldp,*fp2,**afldp,*fdate,*fcntlength,*fsubject,*fFrom_;
charset = NULL;
if(lastm) /* sanity check, any argument at all? */
#define Qnext_arg() if(!*chp&&!(chp=(char*)*++argv))goto usg
while(chp=(char*)*++argv)
{ if((lastm= *chp++)==FM_SKIP)
goto number;
else if(lastm!=FM_TOTAL)
goto usg;
for(;;)
{ switch(lastm= *chp++)
{ case FM_TRUST:headreply|=1;
continue;
case FM_REPLY:areply=1;

White space is eschewed at every opportunity, and ugly C-isms abound:
"chp=(char*)*++argv" and the like. This makes adding features a bit
tricky. (That line with "charset = NULL;" is an addition of mine.)

My next version of this project puts the decoder into procmail proper,
again gated by a charset. There will be no sense in trying to
match an ISO-8859-1 regular expression against a KOI8-R header.

I personally use a UTF-8 terminal which can display basically anything
in Unicode, but the content that is in koi8-r (Cyrillic) I probably
can't read and won't want to see anyway. Checking my mail log, I have
several dozen koi8-r entries that begin 7sUg1cTBxdTT0SDEz9PUwdfJ1NgK
which is "Не удается доставить" in UTF-8 and which translates to
"unable to deliver". All of them were joe-job bounce messages.
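
For the curious, that string can be checked with a few lines of Python: it
is base64 (the B encoding) over KOI8-R bytes, and the decoded text ends in a
newline.

```python
import base64

payload = "7sUg1cTBxdTT0SDEz9PUwdfJ1NgK"

raw = base64.b64decode(payload)      # KOI8-R bytes
text = raw.decode("koi8-r")          # now a Unicode string

print(text.rstrip("\n"))
print(text.encode("utf-8"))          # the same text as UTF-8 bytes
```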

Another option is to not modify the C code at all but handle this
in procmail if you care.

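# RFC 2047 defines only two encodings, Q and B, hence [qb] below.
:0
* ^Subject:.*=\?[a-z0-9][a-z0-9.-]+\?[qb]\?[^? ]*\?=
* ^Subject:.*=\?\/[a-z0-9][a-z0-9.-]+
{
# MATCH holds the charset name captured after \/ above.
CHARSET=$MATCH
Subject=`formail -x Subject: -M "$CHARSET" | iconv -f "$CHARSET" -t UTF-8`
}
:0 E
{
Subject=`formail -x Subject:`
}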

That still leaves the allowed but "hard to imagine why except for
pathological tests" case of multiple different charsets used in
different MIME words in one header, but it's probably rare enough to
not worry about.

Elijah
------
seriously considering running a pretty-printer over this code, diffs be damned