Getting RFC 2047 encoding right

Arnt Gulbrandsen

Dec 7, 2003, 11:49:02 AM

to IETF RFC-822 list

Hi,

I have a tiny little problem that you chaps may be able to help me with.

Suppose that a mail client receives a message whose subject is encoded
according to RFC 2047. Suppose further that the message is decoded and
stored in a database somewhere - in RAM, on disk, on a server
somewhere. Some time later, the same or a different program sends a
reply.

Now, to be kind and courteous that program should use the same subject
field, perhaps prefaced by "Re: " (or "Auto: "), such that if the
recipient threads based on subject, everything works, no matter whether
the recipient supports RFC 2047 or not.

That implies using the same character set, q/b encoding etc. as the original.

To be kind and courteous, the program should use a widely supported
character set when 2047-encoding (whether it's composing original
messages or replies), and use as few 2047-encoded words as possible.

To be kind and courteous, the program should observe all other rules set
out in RFC 2047 and 2822 (including ones which were broken by an
earlier message).

To be reliable and bug-free, the algorithm that does all this should be
simple and straightforward.

There's a conflict here. How do you all address this?

--Arnt

Charles Lindsey

Dec 8, 2003, 10:36:52 AM

to ietf...@imc.org

In <HuU1P1DxTyF2k...@prosecco.oryx.com> Arnt Gulbrandsen
<arnt@gulbrandsen.priv.no> writes:

>Now, to be kind and courteous that program should use the same subject
>field, perhaps prefaced by "Re: " (or "Auto: "), such that if the
>recipient threads based on subject, everything works, no matter whether
>the recipient supports RFC 2047 or not.

By and large, user agents should (or SHOULD) decode the RFC2047 stuff
before doing anything "clever" with the header. Note that all the encoded
stuff is really for human consumption, so it should not matter (but of
course it does). The only thing that is even mentioned in any
standards-like document regarding interpreting the Subject-header is
"Re: " (it has a mention in RFC 2822, for example, and is likely to have
a slightly firmer mention in USEFOR).

However, one of the tasks the USEFOR WG is undertaking is a companion "Best
Current Practice" document, and in the initial draft for that I have
written that if you add a "Re: " to a Subject-header, then it SHOULD be in
the clear _before_ the start of any RFC 2047-encoded stuff. That seems
like a good practice to me, since it means that agents that want to strip
it off (before producing sorted lists of headers, for example) can at least
do that correctly.

>That implies using the same character set, q/b encoding etc. as the original.

Best practice is probably to store the original, and decode it on the fly
whenever it is needed for humans.

--
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133 Web: http://www.cs.man.ac.uk/~chl
Email: c...@clerew.man.ac.uk Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9 Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5

Keith Moore

Dec 8, 2003, 10:58:06 AM

to Arnt Gulbrandsen, mo...@cs.utk.edu, ietf...@imc.org

The correct way to reply to a message that has an RFC 2047-encoded subject
is to use the same RFC 2047-encoded subject that appeared in the message
being replied-to. That's very easy to do - just prepend "Re: " to that
subject line, and rewrap the header (leaving encoded-words intact) if
necessary to accommodate RFC 2047 line-length restrictions.

> Suppose that a mail client receives a message whose subject is encoded
> according to RFC 2047. Suppose further that the message is decoded and
> stored in a database somewhere - in RAM, on disk, on a server
> somewhere.

RFC 2047 encoded words were not intended to be used in this way. Decoding
is only supposed to happen prior to display or printing, not prior to
storage.

Keith
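
The reply rule described in this message can be sketched in Python as follows. This is only an illustration under stated assumptions: the names reply_subject and fold_header are made up here, the raw (still encoded) subject is assumed to be available, and collapsing runs of whitespace while unfolding is a simplification that the thread does not prescribe.

```python
import re

# RFC 2047 limits a header line that contains an encoded-word to 76 characters.
MAX_LINE = 76


def reply_subject(raw_subject: str) -> str:
    """Build the raw Subject value for a reply without decoding anything."""
    # Unfold: join folded lines with single spaces (simplification: this
    # also collapses runs of whitespace inside unencoded text).
    unfolded = " ".join(raw_subject.split())
    # Do not stack a second "Re: " on a subject that already has one.
    if not re.match(r"(?i)re:\s", unfolded):
        unfolded = "Re: " + unfolded
    return unfolded


def fold_header(name: str, value: str, limit: int = MAX_LINE) -> str:
    """Fold only at whitespace, so encoded-words are never split or altered."""
    first = name + ":"
    lines, current = [], first
    for token in value.split():
        if current != first and len(current) + 1 + len(token) > limit:
            lines.append(current)
            current = " " + token   # continuation lines start with a space
        else:
            current += " " + token
    lines.append(current)
    return "\r\n".join(lines)


raw = "=?iso-8859-1?q?Pr=E9sentation?="
print(fold_header("Subject", reply_subject(raw)))
```

Because the encoded-word is copied byte for byte, a recipient that cannot decode RFC 2047 still sees the same token in the original and in the reply.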

Keith Moore

Dec 8, 2003, 10:58:46 AM

to Charles Lindsey, mo...@cs.utk.edu, ietf...@imc.org

> >Now, to be kind and courteous that program should use the same subject
> >field, perhaps prefaced by "Re: " (or "Auto: "), such that if the
> >recipient threads based on subject, everything works, no matter whether
> >the recipient supports RFC 2047 or not.
>
> By and large, user agents should (or SHOULD) decode the RFC2047 stuff
> before doing anything "clever" with the header.

Nope, user agents SHOULD NOT decode RFC 2047 except for display purposes.

> However, one of the tasks the USEFOR WG is undertaking is a companion "Best
> Current Practice" document, and in the initial draft for that I have
> written that if you add a "Re: " to a Subject-header, then it SHOULD be in
> the clear _before_ the start of any RFC 2047-encoded stuff.

Yes, but this happens naturally if you just prepend "Re: " to whatever
subject happens to be in the message being replied to (encoded or not).

> Best practice is probably to store the original, and decode it on the fly
> whenever it is needed for humans.

right.

Keith

Pete Resnick

Dec 8, 2003, 12:31:11 PM

to Keith Moore, Charles Lindsey, mo...@cs.utk.edu, ietf...@imc.org

On 12/8/03 at 10:43 AM -0500, Keith Moore wrote:

>Nope, user agents SHOULD NOT decode RFC 2047 except for display purposes.

Do you consider sorting a "display purpose"? How about searching
given some user input?

pr
--
Pete Resnick <http://www.qualcomm.com/~presnick/>
QUALCOMM Incorporated - Direct phone: (858)651-4478, Fax: (858)651-1102

Keith Moore

Dec 8, 2003, 12:59:25 PM

to Pete Resnick, mo...@cs.utk.edu, c...@clerew.man.ac.uk, ietf...@imc.org

> On 12/8/03 at 10:43 AM -0500, Keith Moore wrote:
>
> >Nope, user agents SHOULD NOT decode RFC 2047 except for display purposes.
>
> Do you consider sorting a "display purpose"? How about searching
> given some user input?

seems fine to me.

Pete Resnick

Dec 8, 2003, 1:07:27 PM

to Keith Moore, mo...@cs.utk.edu, c...@clerew.man.ac.uk, ietf...@imc.org

Well, in that case I think you're being a bit naive when you say
earlier, "Decoding is only supposed to happen prior to display or
printing, not prior to storage." Requiring a search to do 2047 (and
therefore likely 822) parsing on each and every searched message in a
large message store is much too processor intensive in some
environments. Now, maybe that would require a shadow database of
pre-parsed messages along with the un-parsed 2047 fields to use in
replies, but that starts to become a significant burden too. In
either case, allowing the user to edit the subject of a reply
probably means de-coding and re-encoding the 2047, which runs into
exactly the kind of problem that Arnt was getting at in his original
message.

It would be nice to think that "decode only for display purposes"
would be a complete answer. It's not.

Keith Moore

Dec 8, 2003, 1:25:35 PM

to Pete Resnick, mo...@cs.utk.edu, c...@clerew.man.ac.uk, ietf...@imc.org

> >>Do you consider sorting a "display purpose"? How about searching
> >>given some user input?
> >
> >seems fine to me.
>
> Well, in that case I think you're being a bit naive when you say
> earlier, "Decoding is only supposed to happen prior to display or
> printing, not prior to storage."

point taken.

> Requiring a search to do 2047 (and
> therefore likely 822) parsing on each and every searched message in a
> large message store is much too processor intensive in some
> environments.

it's hardly unusual for search engines to construct indices to speed
searching. I don't see any problem with the search indices being built
with decoded header fields, but it's cleaner if the actual messages
are not altered.

> In
> either case, allowing the user to edit the subject of a reply
> probably means de-coding and re-encoding the 2047, which runs into
> exactly the kind of problem that Arnt was getting at in his original
> message.

yes, but once the subject is edited, it's not the same subject any
longer, and you don't expect it to match up with previous subjects in
the thread.

> It would be nice to think that "decode only for display purposes"
> would be a complete answer. It's not.

no, but it's close. the point is that if you try to decode these things
too early then you break things.

Keith
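
The arrangement suggested here, an index built from decoded header fields while the stored messages stay untouched, might look roughly like the Python sketch below. The class SubjectIndex and its layout are hypothetical; decode_header and make_header are from the standard library's email.header module.

```python
from collections import defaultdict
from email.header import decode_header, make_header


class SubjectIndex:
    """Raw subjects are stored untouched; only the index holds decoded text."""

    def __init__(self):
        self.raw_subject = {}              # message id -> raw Subject value
        self.by_word = defaultdict(set)    # lower-cased word -> message ids

    def add(self, message_id: str, raw: str) -> None:
        # Keep the original bytes for later replies.
        self.raw_subject[message_id] = raw
        # Decode once, at indexing time, rather than on every search.
        decoded = str(make_header(decode_header(raw)))
        for word in decoded.lower().split():
            self.by_word[word].add(message_id)

    def search(self, word: str) -> set:
        return set(self.by_word.get(word.lower(), ()))


index = SubjectIndex()
index.add("m1", "=?iso-8859-1?q?Caf=E9_menu?=")
index.add("m2", "Lunch menu")
print(sorted(index.search("menu")))
```

The index is derived data: it can be thrown away and rebuilt from raw_subject at any time, which is the sense in which the two copies do not have to be kept in sync by hand.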

Arnt Gulbrandsen

Dec 8, 2003, 1:40:08 PM

to Keith Moore, Pete Resnick, ietf...@imc.org, c...@clerew.man.ac.uk

Keith Moore writes:
>> Do you consider sorting a "display purpose"? How about searching
>> given some user input?
>
> seems fine to me.

Well, I do search, and it's extremely expensive to decode all the
messages in order to do a search. To search, I must decode when the
message goes into the data store.

Storing two instances of the same header is also not a good answer.
First of all, it's bad practice. Never keep two separate variables that
are supposed to stay in sync.

Second, it only solves the common case - if I go down that path I have
problems again soon enough. Suppose I want to answer a message which
had an overlong encoded-word. Should I blithely emit overlong
encoded-words? Suppose I want to answer with "subject: re: <original>
<ticket id>", then I risk having two encoded-words separated only by
whitespace, and must do magic in order to preserve that space.

I don't like adding hacks like the "store twice" if I have the same
problem again five minutes later.

--Arnt

Keith Moore

Dec 8, 2003, 1:50:56 PM

to Arnt Gulbrandsen, mo...@cs.utk.edu, pres...@qualcomm.com, ietf...@imc.org, c...@clerew.man.ac.uk

> Well, I do search, and it's extremely expensive to decode all the
> messages in order to do a search. To search, I must decode when the
> message goes into the data store.
>
> Storing two instances of the same header is also not a good answer.
> First of all, it's bad practice. Never keep two separate variables
> that are supposed to stay in sync.

Every fast search engine I know of creates indices out of the data
being searched. I don't see why searching mail should be different.
You do have to rebuild the index for that message if the message
is changed, but that should be a rare case.

> Second, it only solves the common case - if I go down that path I have
> problems again soon enough. Suppose I want to answer a message which
> had an overlong encoded-word. Should I blithely emit overlong
> encoded-words?

For that matter, suppose you want to answer a message that has some
other kind of invalid header field - maybe one that isn't encoded at
all, and has illegal characters. The basic answer is that what you do
with illegal input is generally not specified - but clearly you aren't
expected to make the subject of the reply match the subject of the
message being replied to in that case.

> Suppose I want to answer with "subject: re: <original>
> <ticket id>", then I risk having two encoded-words separated only by
> whitespace, and must do magic in order to preserve that space.

why not just use an ASCII ticket id?

Arnt Gulbrandsen

Dec 9, 2003, 1:08:06 PM

to Keith Moore, pres...@qualcomm.com, ietf...@imc.org, c...@clerew.man.ac.uk

Keith Moore writes:
> For that matter, suppose you want to answer a message that has some
> other kind of invalid header field - maybe one that isn't encoded at
> all, and has illegal characters.

Right.

> The basic answer is that what you do with illegal input is generally
> not specified - but clearly you aren't expected to make the subject
> of the reply match the subject of the message being replied to in
> that case.

Right.

What I cannot see is how to make something reasonable, correct and
fairly simple.

In most cases I have code that is right when the input is good, and not
wrong when the input is bad. RFC 2047 just doesn't seem to make that
simple.

>> Suppose I want to answer with "subject: re: <original> <ticket id>",
>> then I risk having two encoded-words separated only by whitespace,
>> and must do magic in order to preserve that space.
>
> why not just use an ASCII ticket id?

Why should I make "always ASCII" a requirement for that case, in code
that otherwise allows all of Unicode?

--Arnt

Keith Moore

Dec 9, 2003, 5:12:08 PM

to Arnt Gulbrandsen, Keith Moore, pres...@qualcomm.com, ietf...@imc.org, c...@clerew.man.ac.uk

>> The basic answer is that what you do with illegal input is generally
>> not specified - but clearly you aren't expected to make the subject
>> of the reply match the subject of the message being replied to in
>> that case.
>
> Right.
>
> What I cannot see is how to make something reasonable, correct and
> fairly simple.
>
> In most cases I have code that is right when the input is good, and
> not wrong when the input is bad. RFC 2047 just doesn't seem to make
> that simple.

Well for untagged text basically you just have to guess the charset.
ISO-2022-* and UTF-8 can be distinguished from other charsets simply
and fairly reliably, and you can make guesses at some of the others
using heuristics. It's difficult to tune the heuristics, and subject
lines are too brief for them to work really well. But I really don't
see how RFC 2047 makes determining the charset label of untagged text
any worse than it inherently is.

>>> Suppose I want to answer with "subject: re: <original> <ticket
>>> id>", then I risk having two encoded-words separated only by
>>> whitespace, and must do magic in order to preserve that space.
>>
>> why not just use an ASCII ticket id?
>
> Why should I make "always ASCII" a requirement for that case, in code
> that otherwise allows all of Unicode?

For the same reason that you should probably avoid using some forms of
email addresses even though they are perfectly valid - such as "Keith
\"Mr. Cynic\" Moore"@cs.utk.edu - corner cases that are seldom seen
often fail in practice.

If you want to be entirely reliable your code to detect ticket-ids has
to be able to find them whether or not they're embedded in
encoded-words. And it's not as if you can't put a ticket-id into an
encoded-word, though (as you point out) you might have to encode a space
("_" or "=20" in Q encoding) as the first character of that encoded-word.
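
The whitespace problem discussed in this exchange can be reproduced with a small Python sketch. RFC 2047 says that whitespace separating two adjacent encoded-words is ignored on display, so a separating space has to travel inside one of the words. The helper names below (q_encode, q_decode, decode_for_display) are invented for illustration, and only Q encoding is handled.

```python
import re

# One encoded-word: =?charset?q?text?=  (only Q encoding in this sketch)
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?[qQ]\?([^?\s]*)\?=")


def q_decode(text: str) -> bytes:
    """Undo Q encoding: '_' means space, '=XX' is one hex-encoded byte."""
    data = text.replace("_", " ").encode("ascii")
    return re.sub(rb"=([0-9A-Fa-f]{2})",
                  lambda m: bytes([int(m.group(1), 16)]), data)


def q_encode(text: str, charset: str) -> str:
    """Q-encode text as one encoded-word; only ASCII letters/digits stay literal."""
    out = []
    for byte in text.encode(charset):
        ch = chr(byte)
        if ch.isascii() and ch.isalnum():
            out.append(ch)
        elif ch == " ":
            out.append("_")
        else:
            out.append("=%02X" % byte)
    return "=?%s?q?%s?=" % (charset, "".join(out))


def decode_for_display(raw: str) -> str:
    """Decode encoded-words, dropping whitespace that separates two of them."""
    result, pos, previous_was_encoded = [], 0, False
    for match in ENCODED_WORD.finditer(raw):
        gap = raw[pos:match.start()]
        if not (previous_was_encoded and gap.isspace()):
            result.append(gap)
        result.append(q_decode(match.group(2)).decode(match.group(1)))
        pos, previous_was_encoded = match.end(), True
    result.append(raw[pos:])
    return "".join(result)


original = q_encode("caf\u00e9", "iso-8859-1")    # =?iso-8859-1?q?caf=E9?=
ticket = "[\u00f8-4711]"                          # a non-ASCII ticket id
# Naive: the space between the two encoded-words vanishes on display.
naive = "Re: " + original + " " + q_encode(ticket, "iso-8859-1")
# Fix: carry the space inside the second encoded-word, as a leading "_".
fixed = "Re: " + original + " " + q_encode(" " + ticket, "iso-8859-1")
```

Decoding naive yields the subject and ticket id run together, while fixed keeps them one space apart. Note that the first encoded-word is still the original one, untouched.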

Charles Lindsey

Dec 9, 2003, 6:14:53 PM

to ietf...@imc.org

In <20031208104132...@cs.utk.edu> Keith Moore <mo...@cs.utk.edu> writes:

>The correct way to reply to a message that has an RFC 2047-encoded subject
>is to use the same RFC 2047-encoded subject that appeared in the message
>being replied-to. That's very easy to do - just prepend "Re: " to that
>subject line, and rewrap the header (leaving encoded-words intact) if
>necessary to accommodate RFC 2047 line-length restrictions.

Sure. That's the "obvious" way to do it, but since when have implementors
of User Agents invariably done the "obvious" thing? So it needs to be
said somewhere. Usefor has simply taken the view that "somewhere" should
be BCP rather than standards-track.

--
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133 Web: http://www.cs.man.ac.uk/~chl
Email: c...@clerew.man.ac.uk Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9 Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5

Arnt Gulbrandsen

Dec 10, 2003, 12:37:38 PM

to Charles Lindsey, ietf...@imc.org

Charles Lindsey writes:
> In <20031208104132...@cs.utk.edu> Keith Moore
> <mo...@cs.utk.edu> writes:
>
>> The correct way to reply to a message that has an RFC 2047-encoded
>> subject is to use the same RFC 2047-encoded subject that appeared in
>> the message being replied-to. That's very easy to do - just prepend
>> "Re: " to that subject line, and rewrap the header (leaving
>> encoded-words intact) if necessary to accommodate RFC 2047
>> line-length restrictions.
>
> Sure. That's the "obvious" way to do it, but since when have
> implementors of User Agents invariably done the "obvious" thing?

The "obvious" way for most MUAs is to fire up an editor and use the
subject line it gets back from the editor. And that's what most of them
do.

--Arnt

Keith Moore

Dec 10, 2003, 4:57:01 PM

to Arnt Gulbrandsen, Keith Moore, Charles Lindsey, ietf...@imc.org

> The "obvious" way for most MUAs is to fire up an editor and use the
> subject line it gets back from the editor. And that's what most of
> them do.

I guess it depends on what you mean by "fire up an editor". I wasn't
aware that "most" MUAs these days used external programs to do message
editing.

I do suspect that "most" MUAs these days initialize a buffer with the
(decoded) subject field and give the user a chance to edit it, along
with to/cc/etc. fields and the message body.

In that case things would work better if the MUA could detect whether
the user chose to change any of these fields that were initialized from
header fields that contained encoded-words, and if no changes were
made, to use the original (encoded) field text. For address fields,
you'd want to do this on a per-address, rather than per-field basis.

Philip Hazel

Dec 11, 2003, 4:09:01 AM

to Keith Moore, Arnt Gulbrandsen, Charles Lindsey, ietf...@imc.org

On Wed, 10 Dec 2003, Keith Moore wrote:

> I wasn't aware that "most" MUAs these days used external programs to
> do message editing.

I've been using an external editor from my MUA for more than 10 years,
but not, as it happens, for the Subject: line; just for the body.

--
Philip Hazel University of Cambridge Computing Service,
ph...@cus.cam.ac.uk Cambridge, England. Phone: +44 1223 334714.

Arnt Gulbrandsen

Dec 11, 2003, 4:46:33 AM

to Keith Moore, IETF RFC-822 list, Charles Lindsey

Keith Moore writes:
>> The "obvious" way for most MUAs is to fire up an editor and use the
>> subject line it gets back from the editor. And that's what most of
>> them do.
>
> I guess it depends on what you mean by "fire up an editor".

Something like this, usually:

a = new QLineEdit(...); // http://doc.trolltech.com/3.0/qlineedit.html
if ( reply )
    a->setText( "Re: " + orig->subject() );
else if ( forwarding )
    a->setText( "Fwd: " + orig->subject() );
a->show();

> I wasn't aware that "most" MUAs these days used external programs to
> do message editing.

I didn't say program. An editor in a third-party DLL/shared library is
external, just like a program, because in both cases the MUA author has
little or no control over how it works.

If the editor were internal the MUA could do 2047-decoding for display
purposes and keep the raw data as its basic storage. But since the
editor is external, the MUA must do 2047-decoding and hand the result
to the editor. Later, when the editor hands it back, the "obvious" way
is to 2047-encode the editor's result and use it. Then there's only one
encoder to write and test, and it's used for original messages, for
forwarding and for replies. Less to write, less to test, fewer bugs.

> I do suspect that "most" MUAs these days initialize a buffer with the
> (decoded) subject field and give the user a chance to edit it, along
> with to/cc/etc. fields and the message body.

Yes.

--Arnt

Keith Moore

Dec 11, 2003, 10:17:11 AM

to Arnt Gulbrandsen, Keith Moore, IETF RFC-822 list, Charles Lindsey

> If the editor were internal the MUA could do 2047-decoding for display
> purposes and keep the raw data as its basic storage. But since the
> editor is external, the MUA must do 2047-decoding and hand the result
> to the editor. Later, when the editor hands it back, the "obvious" way
> is to 2047-encode the editor's result and use it. Then there's only one
> encoder to write and test, and it's used for original messages, for
> forwarding and for replies. Less to write, less to test, fewer bugs.

if there's no good way to save state (like the original header fields)
unchanged across that edit, so that the encoder has the ability to tell
whether the human user changed those fields from those originally
specified, then I can understand the difficulty. If I were writing an
MUA that used such an arrangement for editing header fields I'd try to
find some way to save that state and make it available to the encoder -
perhaps by using hidden fields that the human can't change, or by
storing the original state in a parent process and having the editor
run in a subprocess, or something.

Charles Lindsey

Dec 11, 2003, 12:28:46 PM

to ietf...@imc.org

In <E92FBCA6-2B5A-11D8...@cs.utk.edu> Keith Moore
<moore@cs.utk.edu> writes:

>I guess it depends on what you mean by "fire up an editor". I wasn't
>aware that "most" MUAs these days used external programs to do message
>editing.

Yes, I think "fire up editors" got lost in the rush for "plug and play
software" that forces you to use the built-in (and usually woefully
inadequate) editor.

>In that case things would work better if the MUA could detect whether
>the user chose to change any of these fields that were initialized from
>header fields that contained encoded-words, and if no changes were
>made, to use the original (encoded) field text. For address fields,
>you'd want to do this on a per-address, rather than per-field basis.

Indeed, but I doubt that many currently take that trouble. They have to
decode the field in order to display it. The easy way out is to reencode
it before sending. If it is a Subject starting with "Re: ", they might
well try to include the "Re: " inside the encoding. Even (especially) if
it was a "Re: " they had added themselves. Somebody needs to tell them
that would be a Bad Thing.

--
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133 Web: http://www.cs.man.ac.uk/~chl
Email: c...@clerew.man.ac.uk Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9 Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5

Arnt Gulbrandsen

Dec 11, 2003, 12:30:50 PM

to Keith Moore, IETF RFC-822 list, Charles Lindsey

Keith Moore writes:
> if there's no good way to save state (like the original header fields)
> unchanged across that edit, so that the encoder has the ability to
> tell whether the human user changed those fields from those
> originally specified, then I can understand the difficulty.

Of course there are such ways.

> If I were writing an MUA that used such an arrangement for editing
> header fields I'd try to find some way to save that state and make it
> available to the encoder - perhaps by using hidden fields that the
> human can't change, or by storing the original state in a parent
> process and having the editor run in a subprocess, or something.

Sure. I've been down that road myself. Not once, but twice. My
decode2047() was liberal, and my encode2047() was conservative, and I
found myself in a position where the code would nevertheless generate
monstrosities like "Re: =?latin_1?q?=80?=" (which should be "Re:
=?iso-8859-15?q?=A4?=" but can be decoded).

After spending a few hours trying to rewrite the decoder such that I
could know whether the raw text was something a conservative encoder
could make, I decided that the result wasn't worth the complexity.

I can understand why a lot of people don't even start down that road.

--Arnt

Keith Moore

Dec 11, 2003, 1:20:02 PM

to Arnt Gulbrandsen, Keith Moore, IETF RFC-822 list, Charles Lindsey

> My decode2047() was liberal, and my encode2047() was conservative, and
> I found myself in a position where the code would nevertheless
> generate monstrosities like "Re: =?latin_1?q?=80?=" (which should be
> "Re: =?iso-8859-15?q?=A4?=" but can be decoded).

I don't think I understand what you are saying. Would you really use
"latin_1" as a charset name? Given that it's nonstandard, how could
that be conservative?

I assume the 0x80 is a non-break-space? Other than using a nonstandard
charset, what is it that makes
=?latin_1?q?=80?= a monstrosity?

And how does this relate to the problem of not changing encoded-words
from the subject message?

> After spending a few hours trying to rewrite the decoder such that I
> could know whether the raw text was something a conservative encoder
> could make, I decided that the result wasn't worth the complexity.

why would the decoder need to care?

to me this seems fairly simple:

struct msg *newmsg, *savemsg;

/* initialize two message structures: one for the message to be edited,
   another to save decoded fields from the message being replied-to */

newmsg = new_message ();
savemsg = new_message ();

/* keep a separate copy, in case edit_message () modifies the string
   in place */
newmsg->subject = prepend_Re (decode_2047 (origmsg->subject));
savemsg->subject = strdup (newmsg->subject);
...
edit_message (newmsg);
...
/* now that the message has been edited, encode any header fields from
   the edited message that contain non-ASCII chars; if the user didn't
   change the field, use the original encoding */

if (strcmp (savemsg->subject, newmsg->subject) == 0)
    newmsg->subject = prepend_Re (origmsg->subject);
else
    newmsg->subject = encode_2047 (newmsg->subject);

what am I missing?

Arnt Gulbrandsen

Dec 12, 2003, 5:51:08 AM

to Keith Moore, IETF RFC-822 list, Charles Lindsey

Keith Moore writes:
> I don't think I understand what you are saying. Would you really use
> "latin_1" as a charset name? Given that it's nonstandard, how could
> that be conservative?

I'd never do that. But code that simply copies the received subject would.

> I assume the 0x80 is a non-break-space? Other than using a
> nonstandard charset, what is it that makes =?latin_1?q?=80?= a
> monstrosity?

Neither latin_1 nor 0x80 are defined, yet I've seen both in real life.
0x80 is not allocated in ISO 8859 character sets, it's a Microsoft
extension for the euro sign. (Non-break-space is 0xA0.)

> And how does this relate to the problem of not changing encoded-words
> from the subject message?

The MUA has a choice. Either it can be conservative in what it generates
and liberal in what it accepts, or it can blindly generate whatever it
accepts, or it can make a smart judgment.

The first is easiest to program. The second is only slightly harder and
is what Charles said is the "obvious way", to which I took exception.
The third is a great deal of work.

> why would the decoder need to care?

See below.

> to me this seems fairly simple:

...

> if (strcmp (savemsg->subject, newmsg->subject) == 0)
>     newmsg->subject = prepend_Re (origmsg->subject);
> else
>     newmsg->subject = encode_2047 (newmsg->subject);
>
> what am I missing?

If origmsg->subject is "=?latin_1?q?=80?=" and the user doesn't change
the subject, newmsg->subject is "Re: =?latin_1?q?=80?=". If the decoder
knows whether the string could have been generated by a reasonably
conservative generator, that case can be avoided. A very tricky
decision.

(And if I've injured the English language again, I apologize. I suppose
I'm too old to learn new languages without forgetting bits of the ones
I know, or mixing them up. Sad.)

--Arnt
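
The byte values mentioned in this message can be checked with Python's built-in codecs. The function decode_lenient below is a hypothetical example of a "liberal" decoder that knows about the Windows-1252 habit of using 0x80 for the euro sign. It is a sketch of one possible heuristic, not a description of any particular MUA.

```python
import unicodedata


def decode_lenient(data: bytes, label: str) -> str:
    """Decode as labelled, but read C1 bytes in "Latin-1" text as Windows-1252."""
    # Fuzzy label match: "latin_1", "Latin1" and "ISO-8859-1" all qualify.
    normalized = label.lower().replace("_", "").replace("-", "")
    has_c1 = any(0x80 <= byte <= 0x9F for byte in data)
    if normalized in ("latin1", "iso88591") and has_c1:
        return data.decode("cp1252", errors="replace")
    return data.decode(label)


# Read strictly as ISO 8859-1, 0x80 is a control code, not a printable character.
assert unicodedata.category(b"\x80".decode("iso-8859-1")) == "Cc"
# Windows-1252 puts the euro sign (U+20AC) at 0x80 ...
assert decode_lenient(b"\x80", "latin_1") == "\u20ac"
# ... while ISO 8859-15 puts it at 0xA4.
assert "\u20ac".encode("iso-8859-15") == b"\xa4"
# The no-break space is 0xA0 in ISO 8859-1.
assert b"\xa0".decode("iso-8859-1") == "\u00a0"
```

A decoder like this accepts "=?latin_1?q?=80?=" on input, which is exactly why a reply that copies the raw field unchanged would re-emit a charset label and byte value that a conservative encoder would never have produced.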

Keith Moore

Dec 12, 2003, 7:20:05 AM

to Arnt Gulbrandsen, Keith Moore, IETF RFC-822 list, Charles Lindsey

On Dec 12, 2003, at 5:42 AM, Arnt Gulbrandsen wrote:

> Keith Moore writes:
>> I don't think I understand what you are saying. Would you really use
>> "latin_1" as a charset name? Given that it's nonstandard, how could
>> that be conservative?
>
> I'd never do that. But code that simply copies the received subject
> would.

I see. Well, I don't think you're expected to fix that. Furthermore,
if your code doesn't know what "latin_1" is, I don't see how you could
translate it into anything better anyway. Leaving it as-is at least
allows for the possibility that it's valid, and your software just
hasn't learned about it yet.

> The MUA has a choice. Either it can be conservative in what it
> generates and liberal in what it accepts, or it can blindly generate
> whatever it accepts, or it can make a smart judgment.

To me, being conservative in what it generates means not decoding and
reencoding things it doesn't understand from a message being replied-to
- it means keeping things as they are.

>> if (strcmp (savemsg->subject, newmsg->subject) == 0)
>>     newmsg->subject = prepend_Re (origmsg->subject);
>> else
>>     newmsg->subject = encode_2047 (newmsg->subject);
>>
>> what am I missing?
>
> If origmsg->subject is "=?latin_1?q?=80?=" and the user doesn't change
> the subject, newmsg->subject is "Re: =?latin_1?q?=80?=". If the decoder
> knows whether the string could have been generated by a reasonably
> conservative generator, that case can be avoided. A very tricky
> decision.

If you don't know what latin_1 is, then you can't make any sense of the
0x80 anyway. You might as well keep it in the reply.

There are almost certainly still some unregistered charsets in use in
some communities, and some new charsets are being added from time to
time. You can't really expect your software to be aware of all of
them. It makes sense to make your software tolerant of charsets it
doesn't understand yet.

Arnt Gulbrandsen

Dec 12, 2003, 8:22:52 AM

to Keith Moore, IETF RFC-822 list, Charles Lindsey

Keith Moore writes:
> On Dec 12, 2003, at 5:42 AM, Arnt Gulbrandsen wrote:
>> The MUA has a choice. Either it can be conservative in what it
>> generates and liberal in what it accepts, or it can blindly generate
>> whatever it accepts, or it can make a smart judgment.
>
> To me, being conservative in what it generates means not decoding and
> reencoding things it doesn't understand from a message being
> replied-to - it means keeping things as they are.

I agree. But what "are" things when a MUA sends a reply? IMO, things
"are" whatever the user sees on-screen when issuing the "send" command,
and that is what the MUA should keep.

Suppose the original message had "Subject:
=?latin_1?q?The_price_is_=80216?=". The MUA fuzzily matches latin_1 to
the IANA-defined alias
latin1, knows about the Microsoft breakage, and presents the user with
"Subject: The price is €216".

The user types a reply indicating acceptance and hits the send button.
What should the MUA use as encoded subject?

1. "Subject: Re: The price is =?latin_1?q?=80216?=", ie. something
matching what the original sender meant, but not necessarily matching
what the user saw and agreed to.

2. "Subject: Re: The price is =?iso-8859-15?q?=A4216?=", ie. something
that unambiguously describes what the user chose to send.

> If you don't know what latin_1 is, then you can't make any sense of
> the 0x80 anyway. You might as well keep it in the reply.

I disagree. The reply is a message from the user to the recipient(s),
and should faithfully encode whatever the user saw and typed. The MUA
SHOULD NOT substitute some other text of unknown meaning for its user's
text.

> There are almost certainly still some unregistered charsets in use in
> some communities, and some new charsets are being added from time to
> time. You can't really expect your software to be aware of all of
> them. It makes sense to make your software tolerant of charsets it
> doesn't understand yet.

Sure. But misrepresenting the user does not make sense to me.

--Arnt

Keith Moore

Dec 12, 2003, 12:56:31 PM

to Arnt Gulbrandsen, Keith Moore, IETF RFC-822 list, Charles Lindsey

>>> The MUA has a choice. Either it can be conservative in what it
>>> generates and liberal in what it accepts, or it can blindly generate
>>> whatever it accepts, or it can make a smart judgment.
>>
>> To me, being conservative in what it generates means not decoding and
>> reencoding things it doesn't understand from a message being
>> replied-to - it means keeping things as they are.
>
> I agree. But what "are" things when a MUA sends a reply? IMO, things
> "are" whatever the user sees on-screen when issuing the "send" command,
> and that is what the MUA should keep.

I see your point. I suppose I would say that if the MUA does reliably
understand the charsets in the original message, it should use the same
encoding for the reply if the subject isn't changed. But if the
receiving MUA has to use heuristics to guess what the encoding is, maybe
it's reasonable to re-encode the result to say "this is what the
responder thought the subject was".

Michael Bell

Dec 13, 2003, 4:13:10 AM

to Keith Moore, Arnt Gulbrandsen, IETF RFC-822 list

It could be argued (or shot down <g>) that the truly conservative approach would be then to add a header

X-OriginalSubject: <raworiginalsubject> here

so the user has a HOPE of recovering it if the MUA munged it accidentally. Of course that only helps for a first hop between confused and angry MUAs. And not at all for the average user.

A truly creative but criminally insane MUA would simply always change the subject to

Subject: =?q?iso-8859-1?Read the damn message?=

Can we have an RFC on that <g,d,r>

Keith Moore

Dec 13, 2003, 7:42:59 AM

to Michael Bell, Keith Moore, Arnt Gulbrandsen, IETF RFC-822 list

> It could be argued (or shot down <g>) that the truly conservative
> approach would be then to add a header
>
> X-OriginalSubject: <raworiginalsubject> here
>
> so the user has a HOPE of recovering it if the MUA munged it
> accidentally.

surely it's easier to say that MUAs shouldn't mung subjects
accidentally?

(actually I've thought for a long time that it would be useful to have
a Subject-Was: field, that would automatically include the Subject of
the message being replied-to if the person composing the reply changed
the Subject from that of the message being replied to.)

Bruce Lilly

Jan 4, 2004, 4:41:52 AM

to Arnt Gulbrandsen, IETF RFC-822 list

Arnt Gulbrandsen wrote:

> Hi,
>
> I have a tiny little problem that you chaps may be able to help me with.
>
> Suppose that a mail client receives a message whose subject is encoded
> according to RFC 2047. Suppose further that the message is decoded and
> stored in a database somewhere - in RAM, on disk, on a server
> somewhere. Some time later, the same or a different program sends a
> reply.
>
> Now, to be kind and courteous that program should use the same subject
> field, perhaps prefaced by "Re: " (or "Auto: "), such that if the
> recipient threads based on subject, everything work, no matter whether
> the recipient supports RFC 2047 or not.
>
> That implies using the same character set, q/b encoding etc. as the
> original.
>
> To be kind and courteous, the program should use a widely supported
> character set when 2047-encoding (whether it's composing original
> messages or replies), and use as few 2047-encoded words as possible.
>
> To be kind and courteous, the program should observe all other rules
> set out in RFC 2047 and 2822 (including ones which were broken by an
> earlier message).
>
> To be reliable and bug-free, the algorithm that does all this should
> be simple and straightforward.
>
> There's a conflict here. How do you all address this?

In reverse order:

The nature of the problem precludes a truly simple algorithm; it's a
complex issue.

Some errors can be repaired; others cannot. Attempts to repair errors
might be more-or-less successful. Successful repair is contingent on
being able to unambiguously determine what was intended.

As far as practicable, original content should be preserved. E.g. in a
reply, the address given in the original message's Reply-To (or From)
field should be used verbatim (same case, same display name if present,
including the same encoded-words if used) in the To field of the reply.

I would extend that to the Subject field, and go so far as to say that
"Re:", "Auto:", etc. are best avoided. Incidentally, collating
(colloquially "sorting") by subject is not the same as "threading"; the
latter entails use of References and/or In-Reply-To fields with
Message-ID fields to follow a related "thread" of messages (Consider a
collection of 10 messages with "Subject: Help" and 50 with "Subject: Re:
Help" -- collating by Subject (with or w/o stripping "Re: ") doesn't
group responses with originals).
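
The difference between collating and threading can be shown with a small Python sketch. The message data below is made up for illustration, and thread_root is a hypothetical helper that only follows In-Reply-To; a real threader would also consult References.

```python
from collections import defaultdict


def thread_root(msg_id, parent_of):
    """Follow In-Reply-To links upward until a message with no known parent."""
    seen = set()
    while msg_id in parent_of and msg_id not in seen:
        seen.add(msg_id)              # guard against reference loops
        msg_id = parent_of[msg_id]
    return msg_id


# (message-id, in-reply-to, subject); all values invented for this example
messages = [
    ("<a1@example>", None, "Help"),
    ("<b1@example>", None, "Help"),
    ("<a2@example>", "<a1@example>", "Re: Help"),
    ("<b2@example>", "<b1@example>", "Re: Help"),
]
parent_of = {mid: parent for mid, parent, _ in messages if parent}

threads = defaultdict(list)       # grouped by Message-ID links
by_subject = defaultdict(list)    # grouped by Subject text
for mid, _, subject in messages:
    threads[thread_root(mid, parent_of)].append(mid)
    by_subject[subject].append(mid)

print(dict(threads))     # two threads, each an original plus its own reply
print(dict(by_subject))  # both originals lumped together, both replies lumped together
```

Grouping by Subject puts the two unrelated "Help" messages together and separates each from its reply; following the Message-ID links does not.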

Non-transient storage of a message is best done in RFC 2822/MIME format,
possibly with lossless compression if the tradeoff between space and
compression/decompression effort warrants it, and possibly with
encryption where necessary or desirable. That does not preclude some
additional metadata regarding the message, but the original message
ought to be 100% recoverable for use in replies, etc. Even conversion of
CRLF to local line endings can be troublesome (consider a multipart MIME
message with a binary part containing the octets 0x0D 0x0A).
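
A short Python illustration of the line-ending hazard, using the 8-octet PNG signature as a stand-in for a binary part (it happens to contain both a CRLF pair and a bare LF). The two replace calls are a deliberately naive model of "convert to local line endings on storage, convert back on sending", not a description of any particular MUA.

```python
# The PNG signature: 0x89 'P' 'N' 'G' 0x0D 0x0A 0x1A 0x0A
binary_part = b"\x89PNG\r\n\x1a\n"

stored = binary_part.replace(b"\r\n", b"\n")   # CRLF -> local LF when storing
restored = stored.replace(b"\n", b"\r\n")      # LF -> CRLF when sending again

print(restored == binary_part)            # False
print(len(binary_part), len(restored))    # 8 9
```

Once the CRLF has been folded into LF, the stored copy no longer records which LF octets were originally preceded by CR, so the original cannot be reconstructed.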

In <vyDCOksjW0T5JTUD+zj/Vw....@prosecco.oryx.com>:

> If origmsg->subject is "=?latin_1?q?=80?=" and the user doesn't change
> the subject, newmsg->subject is "Re: =?latin_1?q?=80?=". If the
> decoder knows whether the string could have been generated by a
> reasonably conservative generator, that case can be avoided. A very
> tricky decision.

Observing RFC 2047 rules, "=?latin_1?q?=80?=" should remain unchanged --
it's NOT an encoded-word. RFC 2047 section 3 requires that the charset
name be one that is allowed in a MIME charset parameter for media type
text/plain or that it be registered for use with text/plain. The rules
for text/plain are in RFC 2046 section 4.1.2 which states (in part):

"  No character set name other than those defined above may be used in
   Internet mail without the publication of a formal specification and
   its registration with IANA, or by private agreement, in which case
   the character set name must begin with "X-".
"

The "defined above" text refers to the us-ascii and iso-8859-X charsets.
"latin_1" is not registered nor is it in the initial set of
MIME-compatible charsets in RFC 2046 (all of which are now registered),
and it obviously does not begin with "X-", therefore the RFC 822 atom
containing "latin_1" is NOT an encoded-word. It should be displayed
verbatim and should remain unchanged.
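
A decoder that follows this rule could look roughly like the Python sketch below. KNOWN_CHARSETS is a tiny illustrative allow-list, not the IANA registry, and decode_or_keep is a hypothetical name. The allow-list is checked before any codec lookup because Python's own codec table would happily accept "latin_1".

```python
import base64
import re

# Illustrative allow-list only; a real one would follow the IANA registry.
KNOWN_CHARSETS = {"us-ascii", "iso-8859-1", "iso-8859-15", "utf-8"}

# =?charset[*language]?encoding?payload?=  with a printable-ASCII payload
ENCODED_WORD = re.compile(
    r"=\?([^?*\s]+)(?:\*[^?\s]+)?\?([bBqQ])\?([!->@-~]*)\?=")


def _q_decode(payload: str) -> bytes:
    """Undo Q encoding: underscore is space, =XX is one octet."""
    data = payload.replace("_", " ").encode("ascii")
    return re.sub(rb"=([0-9A-Fa-f]{2})",
                  lambda m: bytes([int(m.group(1), 16)]), data)


def decode_or_keep(token: str) -> str:
    """Decode a token only if it is an encoded-word in a known charset."""
    m = ENCODED_WORD.fullmatch(token)
    if m is None or m.group(1).lower() not in KNOWN_CHARSETS:
        return token                      # not an encoded-word: keep verbatim
    charset, encoding, payload = m.group(1), m.group(2).lower(), m.group(3)
    try:
        raw = _q_decode(payload) if encoding == "q" else base64.b64decode(payload)
        return raw.decode(charset)
    except (ValueError, UnicodeDecodeError):
        return token                      # undecodable payload: keep verbatim


print(decode_or_keep("=?latin_1?q?=80?="))         # =?latin_1?q?=80?=
print(decode_or_keep("=?iso-8859-15?q?=A4216?="))  # €216
```

Anything the function does not recognise comes back byte for byte, so a reply built from it can carry the original token unchanged.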

In <vblVCzk7yuOTm...@libertango.oryx.com>:

> Suppose the original message had "Subject: =?latin_1?q?The price is
> =80216". The MUA fuzzily matches latin_1 to the IANA-defined alias
> latin1, knows about the Microsoft breakage, and presents the user with
> "Subject: The price is €216".

That's where things go wrong. There's no encoded-word --
"=?latin_1?q?The" and "=80216" should be displayed verbatim. Even if the
spaces in the subject were replaced with underscores, and the subject
ended with "?=" -- as would be the case in a real encoded-word -- the
subject would have to be displayed verbatim as it still would not
contain a valid encoded-word. And as Keith Moore has pointed out, if it
so happened that "latin_1" were valid, but not recognized by the MUA in
question, it should still be displayed verbatim (because that MUA has no
way to know what to display). In this case the "fuzzily matches" is
wrong, and two wrongs don't make a right.

> The reply is a message from the user to the recipient(s), and should
> faithfully encode whatever the user saw and typed. The MUA SHOULD NOT
> substitute some other text of unknown meaning for its user's text.

And that's exactly *why* "fuzzily matches" is wrong -- it involves
substituting text of unknown meaning for the original text.

In <xqRenBEr5R5kj...@prosecco.oryx.com>:

> Something like this, usually:
>
> a = new QLineEdit(...); // http://doc.trolltech.com/3.0/qlineedit.html
> if ( reply )
>     a->setText( "Re: " + orig->subject() );
> else if ( forwarding )
>     a->setText( "Fwd: " + orig->subject() );
> a->show();

Why the special case for "Fwd: "? Where is that standardized? Why not
"FW: "?

> If the editor were internal the MUA could do 2047-decoding for display
> purposes and keep the raw data as its basic storage. But since the
> editor is external, the MUA must do 2047-decoding and hand the result
> to the editor. Later, when the editor hands it back, the "obvious" way
> is to 2047-encode the editor's result and use it. Then there's only one
> encoder to write and test, and it's used for original messages, for
> forwarding and for replies. Less to write, less to test, fewer bugs.

Editing of the subject field is contrary to "to be kind and courteous
that program should use the same subject field". That's a design
decision for an MUA author. Clearly, eliminating editing means "[l]ess
to write, less to test, fewer bugs". If editing is desired, that need
not mean that the entire field is decoded, then re-encoded; real-world
editing usually means that some portion(s) of the text are added,
deleted, or changed, while much is unchanged.
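
The simplest version of that idea is to keep the raw field next to the decoded text handed to the editor, and to re-encode only if the editor hands back something different. The Python sketch below does this for the whole field rather than per portion. subject_for_reply is a hypothetical name, choosing utf-8 for edited text is an assumption, and the standard library's decoder is used for brevity even though it is more lenient than the rules discussed above.

```python
from email.header import Header, decode_header, make_header


def subject_for_reply(raw_original: str, edited: str) -> str:
    """Return the raw original if the user left it alone, else re-encode."""
    shown = str(make_header(decode_header(raw_original)))  # what the editor displayed
    if edited == shown:
        return raw_original                  # untouched: preserve octets exactly
    return Header(edited, "utf-8").encode()  # changed: encode what the user typed


raw = "=?iso-8859-1?q?Gr=FC=DFe?="
print(subject_for_reply(raw, "Grüße") == raw)   # True
print(subject_for_reply(raw, "Grüße!"))         # a freshly encoded utf-8 word
```

A finer-grained variant would compare the edited text against the decoded text word by word and keep the raw form of every token that survived the edit.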

"Re: " is an interesting case. To be kind, courteous, and RFC 2277
conforming, one should indicate the language. So that should probably be
"=?us-ascii*la?q?Re:?=" (or the B encoded equivalent, and/or using the
ISO 3-letter code for Latin, "lat", and/or using any other
MIME-compatible charset, and/or another capitalization variant...). And
the reply should have appropriate References and In-Reply-To fields,
assuming that the original had a Message-ID.
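
The "charset*language" form used here is the RFC 2231 extension to encoded-words. A rough Python sketch of taking such a token apart follows; split_encoded_word is a hypothetical helper, and it checks syntax only, not whether the charset or language tag is registered.

```python
import re

# =?charset[*language]?encoding?payload?=
ENCODED_WORD = re.compile(
    r"=\?([^?*\s]+)(?:\*([^?\s]+))?\?([bBqQ])\?([!->@-~]*)\?=")


def split_encoded_word(token: str):
    """Return (charset, language, encoding, payload), or None if the syntax is off."""
    m = ENCODED_WORD.fullmatch(token)
    if m is None:
        return None
    charset, language, encoding, payload = m.groups()
    return charset.lower(), language, encoding.lower(), payload


print(split_encoded_word("=?us-ascii*la?q?Re:?="))
# ('us-ascii', 'la', 'q', 'Re:')
print(split_encoded_word("=?iso-8859-1?Q?Re:?="))
# ('iso-8859-1', None, 'q', 'Re:')
```

A decoder that ignores the language tag still has to strip it before looking up the charset, otherwise "us-ascii*la" is treated as an unknown charset name.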

Subject is supposed to be an unstructured field, but things like "Re: "
impose unnecessary and useless structure. Consider

  Subject: FW: Sv: Fwd: Re^2: =?us-ascii*en-us?q?Auto:_?=
   =?iso-8859-1*lat?q?Re:?= RE: Auto: cmsg sendsys

Do you really want to have to be able to recognize and handle every type
of hack to the subject field, in every possible combination of
capitalization, in encoded or unencoded form? What do any of them
indicate that isn't already evident via MIME packaging, Resent- fields,
References, In-Reply-To, and Message-ID fields, and/or Auto-Submitted
fields?