
What order is coding applied to header lines?


~greg

Jan 9, 2009, 2:45:41 PM

Header lines (and I'm thinking in particular of From: headers)
can involve different kinds of quoting and escaping.

I'm thinking in particular of \.-character-escaping,
"..."-substring quoting, (...)-substring-comments,
and quoted-printable areas.

I'll call these, generically, the different "codings."

And after thinking about it for a while, it occurs to me
that the order these different kinds of coding are applied
(and therefore the order they should be decoded in)
can make a big difference in how the decoded line
parses. (I'm thinking in particular of parsing
addresses and friendly names out of From: header lines.)

But I have no idea what that order is.

My current guess is that quoted-printable should be
decoded first, since it's probably the last thing applied
by a MUA to get the mail through SMTP.

As for \.-escaped-characters vs "..."-quoted-substrings,
my guess is that the escaped characters have higher
precedence than quoted substrings. That is, they are to
be interpreted first, as escaped characters, before
attempting to parse out quoted substrings.

The difference is, for example, that
"x \" @ \" z"
should be interpreted as
x "@" z
and not as
"x \" @ \ " z"

This obviously would make a big difference in how an address
is parsed out. But I don't know if it's right.
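
For what it's worth, the quoted-string grammar in RFC 2822 only recognizes a backslash escape (a "quoted-pair") while the string is being scanned, so escapes and quotes are handled together in one left-to-right pass, not in two separate passes. A minimal sketch in Python, where `unquote_quoted_string` is a hypothetical helper and not a library function:

```python
def unquote_quoted_string(text):
    """Return (content, remainder) for a value that starts with a "..." string.

    Hypothetical helper: backslash escapes are resolved while scanning,
    so an escaped quote never terminates the string.
    """
    if not text.startswith('"'):
        raise ValueError("expected an opening double quote")
    out = []
    i = 1
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            out.append(text[i + 1])  # quoted-pair: keep the next char literally
            i += 2
        elif ch == '"':
            return "".join(out), text[i + 1:]  # unescaped quote closes the string
        else:
            out.append(ch)
            i += 1
    raise ValueError("unterminated quoted-string")


content, rest = unquote_quoted_string(r'"x \" @ \" z"')
print(repr(content))  # 'x " @ " z'
print(repr(rest))     # ''
```

Read this way, the whole example is a single quoted string, so the @ inside it is not an address delimiter.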

As for "..."-quotes vs (...)-comments, I have no idea,
but my guess is that, after escaped characters are
protected (or removed), then, scanning the line
from left to right, whichever of " or ( is encountered first
will turn on quoting or commenting respectively,
which is then turned off at the next " or ), respectively.
And that neither quoting nor commenting nest
in themselves, or in each other. In other words,
they're relatively simple.
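
One correction to the guess above, as far as the RFC 2822 grammar goes: quoted strings do not nest, but comments do, and a backslash escape is honoured inside both. A rough scanner sketch under those rules (the name `split_tokens` and the output shape are made up for illustration):

```python
def split_tokens(value):
    """Split a structured header value into ('quoted'|'comment'|'text', str) parts.

    Hypothetical sketch: a quoted-pair (backslash + char) is honoured in
    both contexts, comments may nest, and quotes do not nest.
    """
    parts, buf, i, n = [], [], 0, len(value)

    def flush():
        if buf:
            parts.append(("text", "".join(buf)))
            buf.clear()

    while i < n:
        ch = value[i]
        if ch == '"':
            flush()
            j, out = i + 1, []
            while j < n and value[j] != '"':
                if value[j] == "\\" and j + 1 < n:
                    j += 1  # skip the backslash, keep the escaped char
                out.append(value[j])
                j += 1
            parts.append(("quoted", "".join(out)))
            i = j + 1
        elif ch == "(":
            flush()
            depth, j, out = 1, i + 1, []
            while j < n and depth:
                c = value[j]
                if c == "\\" and j + 1 < n:
                    j += 1
                    out.append(value[j])
                elif c == "(":
                    depth += 1
                    out.append(c)
                elif c == ")":
                    depth -= 1
                    if depth:  # only inner closing parens belong to the text
                        out.append(c)
                else:
                    out.append(c)
                j += 1
            parts.append(("comment", "".join(out)))
            i = j
        else:
            buf.append(ch)
            i += 1
    flush()
    return parts


print(split_tokens('"a (b" <x@y.example> (c (d) "e)'))
# [('quoted', 'a (b'), ('text', ' <x@y.example> '), ('comment', 'c (d) "e')]
```

Note that the "(" inside the quoted string and the '"' inside the comment are both plain characters: whichever construct opens first wins until it closes.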

Finally, apart from the very strict syntax for the domain
part of addresses, it may be that any of this can occur,
in principle, anywhere on a From-line. But that's probably
wrong. (Or, as Nixon put it, "but it would be wrong.")

~~

Can anyone help me - with any of this - please?

~greg

Jorgen Grahn

Jan 10, 2009, 7:46:05 AM
On Fri, 9 Jan 2009 14:45:41 -0500, ~greg <g...@remove-comcast.net> wrote:
>
> Headers lines (and I'm thinking in particular of From: headers)
> can involve different kinds of quoting and escaping.
>
> I'm thinking in particular of \.-character-escaping,
> "..."-substring quoting, (...)-substring-comments,
> and quoted-printable areas.
>
> I'll call these, generically, the different "codings."
>
> And after thinking about it for awhile, it occurs to me
> that the order these different kinds of coding are applied
> (and therefore the order they should be decoded in)
> can make a big difference in how the decoded line
> parses. (I'm thinking in particular of parsing
> addresses and friendly names out of From: header lines.)
>
> But I have no idea what that order is.
>
> My current guess is that quoted-printable should be
> decoded first, since it's probably the last thing applied
> by a MUA to get the mail through SMTP.

I cannot help with the details, but QP is defined by the MIME RFCs,
right? When someone talks SMTP, it's SMTP they have to follow, so the
encodings defined in RFC 2821 (or indirectly from it) are the last
things applied -- and the first you decode on the receiving end.
Not QP.
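
One encoding that SMTP itself does define is the "transparency" rule for the DATA phase (often called dot-stuffing): the sender prepends an extra period to any line that begins with a period, and the receiver strips it again. A rough sketch with hypothetical function names, assuming the lines have already been split on CRLF:

```python
def dot_stuff(lines):
    """Sender side: double a leading '.' so no body line looks like the end marker."""
    return ["." + line if line.startswith(".") else line for line in lines]


def dot_unstuff(lines):
    """Receiver side: drop the one leading '.' that the sender added.

    A real receiver first checks for the lone "." line that ends the data;
    that check is left out of this sketch.
    """
    return [line[1:] if line.startswith(".") else line for line in lines]


body = ["Hello", ".", "..hidden", "bye"]
wire = dot_stuff(body)
print(wire)  # ['Hello', '..', '...hidden', 'bye']
assert dot_unstuff(wire) == body
```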

/Jorgen

--
// Jorgen Grahn <grahn@ Ph'nglui mglw'nafh Cthulhu
\X/ snipabacken.se> R'lyeh wgah'nagl fhtagn!

~greg

Jan 14, 2009, 5:51:54 AM

Jorgen Grahn > ...
> ~greg > ...

>>
>> Headers lines (and I'm thinking in particular of From: headers)
>> can involve different kinds of quoting and escaping.
>>
>> I'm thinking in particular of \.-character-escaping,
>> "..."-substring quoting, (...)-substring-comments,
>> and quoted-printable areas.
>>
>> I'll call these, generically, the different "codings."
>>
>> And after thinking about it for awhile, it occurs to me
>> that the order these different kinds of coding are applied
>> (and therefore the order they should be decoded in)
>> can make a big difference in how the decoded line
>> parses. (I'm thinking in particular of parsing
>> addresses and friendly names out of From: header lines.)
>>
>> But I have no idea what that order is.
>>
>> My current guess is that quoted-printable should be
>> decoded first, since it's probably the last thing applied
>> by a MUA to get the mail through SMTP.
>
> I cannot help with the details, but QP is defined by the MIME RFCs,
> right? When someone talks SMTP, it's SMTP they have to follow, so the
> encodings defined in RFC 2821 (or indirectly from it) are the last
> things applied -- and the first you decode on the receiving end.
> Not QP.
>
> /Jorgen

~~~

Sorry for the delayed response!
(But I suppose it's par for the course around here.)

I am climbing a very steep learning curve right now about these things.
Everything I've learnt about this stuff, I've learnt in the last week.
And so therefore I feel I am entitled to act like an insufferable authority about it.

You are wrong!
MIME simply encodes a stream of 8 bit bytes into 7 bit bytes,
via either Q (essentially quoted-printable, the =?char-set?Q?xxxxxxxxxx?= stuff)
or B (base64) encodings. It is a very blunt instrument intended, essentially,
simply to overcome the original SMTP limitation of 7bit bytes.
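
To make the Q and B forms concrete, here is a small sketch that decodes the same word written both ways, using Python's standard `email.header` module. The sample word and the helper name `decode_words` are just illustrations:

```python
from email.header import decode_header, make_header


def decode_words(text):
    """Decode any RFC 2047 encoded-words in text and return a plain str."""
    return str(make_header(decode_header(text)))


q_word = "=?iso-8859-1?Q?caf=E9?="  # Q: quoted-printable-like, Latin-1 bytes
b_word = "=?utf-8?B?Y2Fmw6k=?="     # B: base64, UTF-8 bytes

print(decode_words(q_word))  # café
print(decode_words(b_word))  # café
```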

RFC 2821 on the other hand
(which incidentally was obsoleted by RFC 5321 in Oct 2008)
deals entirely with things that conform to SMTP's 7-bit bytes.
What I meant by SMTP "encodings" (i.e., pre-MIME stuff)
were its "comments" (things in headers in parentheses)
and its \-character escaping, which are things
that are used in "structured" headers (like "From:", vs
"unstructured" headers like "Subject:") so as not to confuse
the specific syntaxes of the specific headers.
(eg
From: "Me@Play"\@home<mailbox@domain>(having <big> fun)
)

The upshot is that MIME "envelopes" must definitely
be removed before any RFC 2821 email syntax parsing can be done.

What was confusing me was thinking that Q and B encoding
in a header may sometimes be used in just the same way,
and for the same purpose, that \-escaping and ()-commenting
is used, --to avoid collisions with the header syntax.
But I now realize that it's never used that way. And it
wouldn't work anyway.

I suppose that the reason I thought that it might is that
MIME can be used in headers without there being any
MIME headers! In other words, MIME Q and B
encoding can be used in an email header
even if the email doesn't contain a
MIME-Version: 1.0
header. Or any other MIME-specific header.
The MIME-RFCs say so: ...
"One reason for this is that the mail reader
is not expected to parse the entire message header
before displaying lines that may contain 'encoded-word's."

(and because email headers can occur in any random order.)
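
A quick way to see this, sketched with Python's standard `email` package (the message text is invented for the example): the parser decodes an encoded-word in Subject: even though the message carries no MIME-Version: header at all.

```python
import email
from email import policy

# A made-up message with an encoded-word but no MIME headers of any kind.
raw = (
    "From: someone@example.org\n"
    "Subject: =?utf-8?B?SGVsbG8=?=\n"
    "\n"
    "plain body\n"
)

msg = email.message_from_string(raw, policy=policy.default)
print(msg["MIME-Version"])  # None
print(msg["Subject"])       # Hello
```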


But I never saw that mentioned in any informal discussion about MIME!
And so I guess I was thinking that Q and B encoding in headers
must have pre-dated MIME -- going back perhaps even to RFC 821.
I must have thought that its status in headers was similar to
basic \-character-escaping and ()-commenting, and that therefore
there'd be an issue about their relative precedences.

~~

Thanks for responding!

~greg

Jorgen Grahn

Jan 14, 2009, 11:53:38 AM

Once again I'll reply without doing any research ... but it still
seems backwards to me. MIME is a layer *on top* of SMTP; that
invariably means that when receiving a message, you have to start
decoding from the bottom up -- in this case start with SMTP.

If the MIME in turn contained a JPEG image, would you pick the raw
text from port 25 and feed it into a JPEG decoder?

~greg

Jan 14, 2009, 12:45:17 PM

"Jorgen Grahn" > ...

>
> Once again I'll reply without doing any research ... but it still
> seems backwards to me. MIME is a layer *on top* of SMTP; that
> invariably means that when receiving a message, you have to start
> decoding from the bottom up -- in this case start with SMTP.
>
> If the MIME in turn contained a JPEG image, would you pick the raw
> text from port 25 and feed it into a JPEG decoder?
>
> /Jorgen


I am quite certain that we're just using words differently.

If by "raw text" you mean MIME base64 encoded-data,
then you have to decode that first, before doing anything else with it.

In general, when several encodings (or envelopes) are applied to anything in a certain order,
then it has to be decoded (or de-enveloped) in the reverse order.
(Think of the 7 layers of the OSI model.)
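
As a toy illustration of that general rule (not of any particular mail format), here are two layers applied in one order and removed in the opposite order. The function names are hypothetical:

```python
import base64


def wrap(text):
    data = text.encode("utf-8")    # layer 1: characters -> bytes
    return base64.b64encode(data)  # layer 2: bytes -> 7-bit-safe ASCII


def unwrap(blob):
    data = base64.b64decode(blob)  # undo layer 2 first
    return data.decode("utf-8")    # then undo layer 1


assert unwrap(wrap("naïve")) == "naïve"
```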

~greg.

~greg

Jan 14, 2009, 1:06:00 PM
Let me try once again.

~~~~~

"Jorgen Grahn" > ...


>
> Once again I'll reply without doing any research ... but it still
> seems backwards to me. MIME is a layer *on top* of SMTP; that
> invariably means that when receiving a message, you have to start
> decoding from the bottom up -- in this case start with SMTP.
>
> If the MIME in turn contained a JPEG image, would you pick the raw
> text from port 25 and feed it into a JPEG decoder?
>
> /Jorgen

~~~~~

As I said before,
"MIME" can be used in headers without there being any MIME headers.
That is to say, you can find
=?character-set?Q|B?coded_bytes?=

sections in any email header line, even if the email doesn't contain any
MIME-Version: 1.0
header. Or any other MIME header.

Again, quoting from some MIME RFC or other,

"One reason for this is that the mail reader
is not expected to parse the entire message header
before displaying lines that may contain 'encoded-word's."

So that makes me think that what you must mean by "SMTP"
is the preliminary interpretation of header lines, generally.

And that what you must think of as MIME, are the MIME encoded
entities, or bodies, which are declared in the MIME headers.

And so yes, you do have to interpret those headers first
(which are usually mostly in 7-bit US-ASCII, so maybe
that's what you think of as the SMTP decoding) before you can
deal with the MIME bodies (or "entities", or "attachments").

Whereas what I was talking about was the kind of MIME
that occurs in the headers themselves, in
=?character-set?Q|B?coded_text?=
stretches. Which is part of the MIME spec too,
although, as I said, I've never seen it much talked
about in informal discussions. (Or not clearly enough
differentiated from the other kind of MIME, anyway.)

~greg


~greg

Jan 15, 2009, 9:52:06 PM

Jorgen Grahn >...

> ~greg > ...


>> The upshot is that MIME "envelopes" must definitely
>> be removed before any RCF2821 email syntax parsing can done.

I was wrong about that!

> Once again I'll reply without doing any research ... but it still
> seems backwards to me. MIME is a layer *on top* of SMTP; that
> invariably means that when receiving a message, you have to start
> decoding from the bottom up -- in this case start with SMTP.

And you were right!

That is, if this is what you meant: ... .

6.2. Display of 'encoded-word's

Any 'encoded-word's so recognized are decoded, and if possible,
the resulting unencoded text is displayed in the original character set.

NOTE: Decoding and display of encoded-words occurs *after*
a structured field body is parsed into tokens. It is therefore
possible to hide 'special' characters in encoded-words which,
when displayed, will be indistinguishable from 'special' characters
in the surrounding text. For this and other reasons, it is NOT
generally possible to translate a message header containing
'encoded-word's to an unencoded form which can be parsed
by an RFC 822 mail reader.

-- MIME (Multipurpose Internet Mail Extensions) Part Three:
-- Message Header Extensions for Non-ASCII Text
-- http://tools.ietf.org/html/rfc2047
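
The NOTE can be demonstrated with a small sketch using Python's standard `email` helpers. The address and the helper name `decode_words` are invented for the example; the encoded-word hides a comma, which is a 'special' in an address list.

```python
from email.header import decode_header, make_header
from email.utils import getaddresses


def decode_words(text):
    """Decode any RFC 2047 encoded-words in text and return a plain str."""
    return str(make_header(decode_header(text)))


field = "=?utf-8?Q?Doe=2C_Jane?= <jane@example.org>"

# Parse first, decode after: one mailbox, and the comma stays inside the name.
right = [(decode_words(name), addr) for name, addr in getaddresses([field])]
print(right)  # [('Doe, Jane', 'jane@example.org')]

# Decode first, parse after: the decoded comma now looks like a list separator.
wrong = getaddresses([decode_words(field)])
print(len(wrong))  # 2
```

So tokenizing the structured field comes first, and decoding is applied only to the individual tokens, for display.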


But then you said >

> If the MIME in turn contained a JPEG image, would you pick the raw
> text from port 25 and feed it into a JPEG decoder?

Which I don't follow.

You seemed to want to analogize jpeg compression syntax (or "jpeg format")
with SMTP syntax (or "email structured header syntax".)

However, you said before to "start with SMTP", then MIME,
(which, it turns out, was right. (and I was wrong.))

But then you seem to be implying that it is obvious
that one must decode the MIME first before one
can send jpeg-data off to a JPEG renderer.
Which is, of course, true. But that's the
opposite order from what you said before.
MIME first, then JPEG, is the opposite
order from SMTP first, then MIME.

So I didn't find your analogy helpful.

But thank you for responding! That's definitely what motivated me to break down
and read some RFCs. (Which, for some reason, I seemed to need some motivating to do.)

~greg


Jorgen Grahn

Jan 21, 2009, 4:39:02 AM
On Wed, 14 Jan 2009 12:45:17 -0500, ~greg <g...@remove-comcast.net> wrote:
>
> "Jorgen Grahn" > ...
...

>> If the MIME in turn contained a JPEG image, would you pick the raw
>> text from port 25 and feed it into a JPEG decoder?
...

> I am quite certain that we're just using words differently.
>
> If by "raw text" you mean MIME base64 encoded-data, [...]

With "raw text from port 25" I meant a sequence of octets which
primarily follow the SMTP RFC.

But I don't know -- maybe in practice you parse SMTP and MIME in one
step, and maybe the specs are cleverly worded so you can do it in
either order.

(And I assumed an actual SMTP receiver. If you parse mail from a file
in e.g. mbox format, you're not really looking at an SMTP message.
For example, my mbox files do not have CRLF line endings.)

I'll stop here. I was hoping more knowledgeable people would reply
and support (or correct) me, but that hasn't happened yet.
