Update:
I now have the relevant mbox file and can reproduce the issue.
Linus helpfully identified two nuances in the offending mbox file.
First, the encoded name in the From header is quoted:
From: "=?UTF-8?q?Christian=20K=C3=B6nig?=" <ckoenig.lei...@gmail.com>
Second, although the Content-Type header indicates that the body is
UTF-8, there is a stray latin-1 "ö" character in the body.
The v0.18 release and git master fail in different ways.
v0.18 imports successfully if the quotes are removed from the From
header. The stray latin-1 character is apparently not a problem.
git master imports if the stray latin-1 character is removed from the
body. However, there is a python2 vs python3 difference w.r.t. whether
the imported patch's author retains the encoded words:
- python3: quotes are okay, author ends up correct
- python2: quotes are not okay, author ends up incorrect, with the
  encoded words left undecoded:
Author: =?UTF-8?q?Christian=20K=C3=B6nig?= <ckoenig.lei...@gmail.com>
So it appears that python2's email library treats encoded words inside
quotes literally whereas python3's email library unwraps the quotes. I
have not yet dug into the RFCs to figure out which behavior is correct,
but, in this case at least, it seems that the python3 behavior is more
desirable.
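For illustration, here is a minimal python3 sketch of getting the desired
author name out of a quoted, encoded From header using only the standard
email package. It is not stgit's actual code, and the address
author@example.com is a placeholder, since the real one is elided above.

```python
from email.header import decode_header, make_header
from email.utils import parseaddr

# Placeholder address; the display name matches the header quoted above.
raw_from = '"=?UTF-8?q?Christian=20K=C3=B6nig?=" <author@example.com>'

# parseaddr splits the header value into (display name, address) and
# drops the double quotes around the display name.
name, addr = parseaddr(raw_from)

# decode_header finds the RFC 2047 encoded words in the unquoted name;
# make_header joins the decoded pieces into a single Header object.
author = str(make_header(decode_header(name)))

print(author)
print(addr)
```

Stripping the quotes before decoding means the result should not depend on
how a given email library treats encoded words inside a quoted string.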
W.r.t. the stray latin-1 character, I believe that v0.18 works because
it uses the python2 email library in such a manner that email bodies are
treated as a stream of bytes, i.e. never encoded or decoded, whereas on
git master we now attempt to decode email bodies and are thus exposed to
mis-encoded emails.
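To make the failure mode concrete, here is a small python3 sketch of one
possible way to decode a body that claims to be UTF-8 but contains a stray
latin-1 byte. The helper name decode_body_lenient is hypothetical, and this
is only one option, not necessarily what stgit will end up doing.

```python
def decode_body_lenient(raw, charset="utf-8"):
    """Hypothetical helper: decode body bytes, tolerating stray bytes."""
    try:
        # Well-formed bodies take the strict path.
        return raw.decode(charset)
    except UnicodeDecodeError:
        # surrogateescape maps each undecodable byte to a lone surrogate
        # code point, so the original bytes can be recovered on encode.
        return raw.decode(charset, errors="surrogateescape")


# A body with one correctly encoded name and one stray latin-1 name.
body = ("Signed-off-by: Christian König\n".encode("utf-8")
        + "Christian König\n".encode("latin-1"))

text = decode_body_lenient(body)

# Encoding with the same error handler reproduces the input bytes.
roundtrip = text.encode("utf-8", errors="surrogateescape")
```

The surrogateescape handler is chosen here because it keeps the mis-encoded
byte intact rather than replacing it, which fits the goal of importing the
mbox as-is.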
I believe the goal for stgit is to import this mbox file as-is, even if
the mbox file is incorrect in the ways noted. I will be attempting to
modify stgit to that end.
Thanks,
Pete