utf-8 fixes

Simon Josefsson

Oct 13, 2001, 5:09:36 PM

to Per Abrahamsen, bu...@gnus.org

The following patch changes some UTF-8 related stuff, for READING
only: It removes article-decode-group-name. I wrongly must have
thought that UTF-8 only applied to Newsgroups: header, but I think the
entire set of headers is in UTF-8 according to USEFOR. So we should
use `gnus-group-charset-alist' and `gnus-default-charset' instead.

For example, see Erland's post to the UTF-8 group on news.gnksa.org
with raw UTF-8 and Latin-1 in Subject:. With this patch both look
fine. I can't see any other noticeable changes.

What do you think?

I still don't like some stuff in message.el; I'll try to explain it
in another mail.

2001-10-13 Simon Josefsson <j...@extundo.com>

* gnus-art.el (gnus-article-decode-hook): Remove a-d-group-name.
(article-decode-group-name): Remove.

* gnus.el (gnus-group-charset-alist): Default to UTF-8 if
available.
(gnus-default-charset): Ditto. Also fix doc.

Index: gnus.el
===================================================================
RCS file: /usr/local/cvsroot/gnus/lisp/gnus.el,v
retrieving revision 6.55
diff -u -r6.55 gnus.el
--- gnus.el 2001/09/24 17:35:22 6.55
+++ gnus.el 2001/10/13 20:51:53
@@ -1556,7 +1556,7 @@
"Return the default charset of GROUP."
:variable gnus-group-charset-alist
:variable-default
- '(("\\(^\\|:\\)hk\\>\\|\\(^\\|:\\)tw\\>\\|\\<big5\\>" cn-big5)
+ `(("\\(^\\|:\\)hk\\>\\|\\(^\\|:\\)tw\\>\\|\\<big5\\>" cn-big5)
("\\(^\\|:\\)cn\\>\\|\\<chinese\\>" cn-gb-2312)
("\\(^\\|:\\)fj\\>\\|\\(^\\|:\\)japan\\>" iso-2022-jp-2)
("\\(^\\|:\\)tnn\\>\\|\\(^\\|:\\)pin\\>\\|\\(^\\|:\\)sci.lang.japan" iso-2022-7bit)
@@ -1568,7 +1568,11 @@
("\\(^\\|:\\)alt.chinese.text.big5\\>" chinese-big5)
("\\(^\\|:\\)soc.culture.vietnamese\\>" vietnamese-viqr)
("\\(^\\|:\\)\\(comp\\|rec\\|alt\\|sci\\|soc\\|news\\|gnu\\|bofh\\)\\>" iso-8859-1)
- (".*" iso-8859-1))
+ (".*" ,(if (or (and (fboundp 'find-coding-system)
+ (find-coding-system 'utf-8))
+ (and (fboundp 'coding-system-p) (coding-system-p 'utf-8)))
+ 'utf-8
+ 'iso-8859-1)))
:variable-document
"Alist of regexps (to match group names) and default charsets to be used when reading."
:variable-group gnus-charset
@@ -1723,10 +1727,14 @@
(defvar gnus-plugged t
"Whether Gnus is plugged or not.")

-(defcustom gnus-default-charset 'iso-8859-1
+(defcustom gnus-default-charset
+ (if (or (and (fboundp 'find-coding-system) (find-coding-system 'utf-8))
+ (and (fboundp 'coding-system-p) (coding-system-p 'utf-8)))
+ 'utf-8
+ 'iso-8859-1)
"Default charset assumed to be used when viewing non-ASCII characters.
This variable is overridden on a group-to-group basis by the
-gnus-group-charset-alist variable and is only used on groups not
+`gnus-group-charset-alist' variable and is only used on groups not
covered by that variable."
:type 'symbol
:group 'gnus-charset)
Index: gnus-art.el
===================================================================
RCS file: /usr/local/cvsroot/gnus/lisp/gnus-art.el,v
retrieving revision 6.113
diff -u -r6.113 gnus-art.el
--- gnus-art.el 2001/10/12 17:01:18 6.113
+++ gnus-art.el 2001/10/13 20:51:54
@@ -638,8 +638,7 @@
(face :value default)))))

(defcustom gnus-article-decode-hook
- '(article-decode-charset article-decode-encoded-words
- article-decode-group-name)
+ '(article-decode-charset article-decode-encoded-words)
"*Hook run to decode charsets in articles."
:group 'gnus-article-headers
:type 'hook)
@@ -1747,29 +1746,6 @@
(save-restriction
(article-narrow-to-head)
(funcall gnus-decode-header-function (point-min) (point-max)))))
-
-(defun article-decode-group-name ()
- "Decode group names in `Newsgroups:'."
- (let ((inhibit-point-motion-hooks t)
- buffer-read-only
- (method (gnus-find-method-for-group gnus-newsgroup-name)))
- (when (and (or gnus-group-name-charset-method-alist
- gnus-group-name-charset-group-alist)
- (gnus-buffer-live-p gnus-original-article-buffer))
- (when (mail-fetch-field "Newsgroups")
- (nnheader-replace-header "Newsgroups"
- (gnus-decode-newsgroups
- (with-current-buffer
- gnus-original-article-buffer
- (mail-fetch-field "Newsgroups"))
- gnus-newsgroup-name method)))
- (when (mail-fetch-field "Followup-To")
- (nnheader-replace-header "Followup-To"
- (gnus-decode-newsgroups
- (with-current-buffer
- gnus-original-article-buffer
- (mail-fetch-field "Followup-To"))
- gnus-newsgroup-name method))))))

(defun article-de-quoted-unreadable (&optional force read-charset)
"Translate a quoted-printable-encoded article.

Simon Josefsson

Oct 13, 2001, 6:17:01 PM

to Per Abrahamsen, bu...@gnus.org

Simon Josefsson <j...@extundo.com> writes:

> I still don't like some stuff in message.el; I'll try to explain it
> in another mail.

Thinking more about this, I changed my mind: the current approach is
probably no worse than any other method I can think of. The bug
you wrote about in message.el would have to be fixed for crossposting
to work, but I doubt it is worth the effort.

Anyway, with the patch below I can Gcc utf-8 headers to nnimap groups
as well. I've committed it.

2001-10-14 Simon Josefsson <j...@extundo.com>

* gnus-msg.el (gnus-inews-do-gcc): Port header encoded-word
charset magic from message.el.

--- gnus-msg.el.~6.48.~ Sat Oct 6 21:14:52 2001
+++ gnus-msg.el Sun Oct 14 00:04:32 2001
@@ -1241,9 +1241,28 @@
(message-encode-message-body)
(save-restriction
(message-narrow-to-headers)
- (let ((mail-parse-charset message-default-charset)
+ (let* ((mail-parse-charset message-default-charset)
+ (newsgroups-field (save-restriction
+ (message-narrow-to-headers-or-head)
+ (message-fetch-field "Newsgroups")))
+ (followup-field (save-restriction
+ (message-narrow-to-headers-or-head)
+ (message-fetch-field "Followup-To")))
+ ;; BUG: We really need to get the charset for
+ ;; each name in the Newsgroups and Followup-To
+ ;; lines to allow crossposting between group
+ ;; namess with incompatible character sets.
+ ;; -- Per Abrahamsen <abr...@dina.kvl.dk> 2001-10-08.
+ (group-field-charset
+ (gnus-group-name-charset method newsgroups-field))
+ (followup-field-charset
+ (gnus-group-name-charset method (or followup-field "")))
(rfc2047-header-encoding-alist
- (cons '("Newsgroups" . default)
+ (append
+ (when group-field-charset
+ (list (cons "Newsgroups" group-field-charset)))
+ (when followup-field-charset
+ (list (cons "Followup-To" followup-field-charset)))
rfc2047-header-encoding-alist)))
(mail-encode-encoded-word-buffer)))
(goto-char (point-min))

Per Abrahamsen

Oct 15, 2001, 4:33:19 AM

to Simon Josefsson, bu...@gnus.org

> I wrongly must have thought that UTF-8 only applied to Newsgroups:
> header, but I think the entire set of headers is in UTF-8 according
> to USEFOR.

It is, but not according to local conventions. I *want* the ability
to say "raw 8bit is latin-1 in headers, except for group names, where
it is UTF8." Your old code carefully gave me an opportunity to say
that. Your new code doesn't.

> What do you think?

Please unapply the patch; your old code was much better.

[ gnus-group-charset-alist ]

> - (".*" iso-8859-1))
> + (".*" ,(if (or (and (fboundp 'find-coding-system)
> + (find-coding-system 'utf-8))
> + (and (fboundp 'coding-system-p) (coding-system-p 'utf-8)))
> + 'utf-8
> + 'iso-8859-1)))

This part is OK but dangerous: almost all West European hierarchies
should then be listed explicitly as "Latin-1".

Per Abrahamsen

Oct 15, 2001, 4:38:33 AM

to Simon Josefsson, bu...@gnus.org

Simon Josefsson <j...@extundo.com> writes:

> Simon Josefsson <j...@extundo.com> writes:
>
>> I still don't like some stuff in message.el; I'll try to explain it
>> in another mail.
>
> Thinking more about this, I changed my mind: the current approach is
> probably no worse than any other method I can think of. The bug
> you wrote about in message.el would have to be fixed for crossposting
> to work, but I doubt it is worth the effort.

The bug would be totally irrelevant with your patch to gnus-art.el, as
it would make the `gnus-group-name-charset-group-alist' variable
useless. It does not make sense to use one character set for reading
(gnus-group-charset-alist) and another for posting
(gnus-group-name-charset-group-alist).

Simon Josefsson

Oct 15, 2001, 6:37:24 AM

to Per Abrahamsen, bu...@gnus.org

On Mon, 15 Oct 2001, Per Abrahamsen wrote:

> > I wrongly must have thought that UTF-8 only applied to Newsgroups:
> > header, but I think the entire set of headers is in UTF-8 according
> > to USEFOR.
>
> It is, but not according to local conventions. I *want* the ability
> to say "raw 8bit is latin-1 in headers, except for group names, where
> it is UTF8."

This is extremely ugly. Is this really what is going to be used? Do
any other clients support this?

> Your old code carefully gave me an opportunity to say
> that. Your new code doesn't.

I understand now, right. So the old code is good, but the new code may be
good as well (depending on the outcome of the discussion below).

> > What do you think?
>
> Please unapply the patch; your old code was much better.

I haven't applied it..

> [ gnus-group-charset-alist ]
>
> > - (".*" iso-8859-1))
> > + (".*" ,(if (or (and (fboundp 'find-coding-system)
> > + (find-coding-system 'utf-8))
> > + (and (fboundp 'coding-system-p) (coding-system-p 'utf-8)))
> > + 'utf-8
> > + 'iso-8859-1)))
>
> This part is OK but dangerous: almost all West European hierarchies
> should then be listed explicitly as "Latin-1".

Are there any "official" resources on what charset different (European)
hierarchies should use? I see more and more UTF-8 in the Swedish groups.

Defaulting to UTF-8, with exceptions for European groups as Latin-1, is
probably more "universal", assuming we know what hierarchy delimiter the
European groups use.

Incorrectly defaulting to Latin-1 would be very wrong outside Europe,
while incorrectly defaulting to UTF-8 would work everywhere even though
local policy may say otherwise (in which case the policy should be
incorporated into `gnus-group-charset-alist').

Per Abrahamsen

Oct 15, 2001, 7:05:01 AM

to Simon Josefsson, bu...@gnus.org

Simon Josefsson <j...@extundo.com> writes:

> On Mon, 15 Oct 2001, Per Abrahamsen wrote:
>
>> > I wrongly must have thought that UTF-8 only applied to Newsgroups:
>> > header, but I think the entire set of headers is in UTF-8 according
>> > to USEFOR.
>>
>> It is, but not according to local conventions. I *want* the ability
>> to say "raw 8bit is latin-1 in headers, except for group names, where
>> it is UTF8."
>
> This is extremely ugly. Is this really what is going to be used?

Yes, in a transition period.

> Do any other clients support this?

I doubt it. But we can do better than other clients.

> Are there any "official" resources on what charset different (European)
> hierarchies should use?

I think most hierarchies with rules prefer RFC 2047 over raw 8bit. The
exceptions are listed in "gnus-group-posting-charset-alist". They are
fr,no (latin-1) and fido7,relcom (koi8).

The point being, apart from these groups, raw 8-bit characters in
headers are against the rules, and we should therefore just use the
"guess" that is most likely to be right for that hierarchy.

For the four above, we _must_ of course treat raw 8-bit characters as
the same character sets as we generate.

For dk,de,se,swnet (as a start) we should specify latin-1. And ask
people in the ding list to contribute more to that list.

Unless we can start guessing? Most latin-1 text is invalid utf-8, so
one could say utf-8 if valid, otherwise latin-1. But that would be
work. Or use the mule heuristics, which I know nothing about.
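
The "utf-8 if valid, otherwise latin-1" guess could look roughly like the
sketch below. It assumes a recent Emacs in which bytes that are not valid
UTF-8 come out of the decoder as raw `eight-bit' characters. The function
names are made up for illustration and are not part of Gnus.

```elisp
(require 'seq)

(defun my-guess-raw-8bit-charset (bytes)
  "Guess the charset of the unibyte string BYTES.
Return `utf-8' if BYTES decodes cleanly as UTF-8, else `iso-8859-1'."
  (let ((decoded (decode-coding-string bytes 'utf-8)))
    ;; Assumption: invalid UTF-8 bytes survive decoding as raw
    ;; `eight-bit' characters, so finding one means BYTES is not UTF-8.
    (if (seq-some (lambda (ch) (eq (char-charset ch) 'eight-bit))
                  decoded)
        'iso-8859-1
      'utf-8)))

(defun my-decode-raw-8bit (bytes)
  "Decode the unibyte string BYTES using the charset guessed above."
  (decode-coding-string bytes (my-guess-raw-8bit-charset bytes)))
```

Pure ASCII input is reported as utf-8 here, which is harmless since both
charsets decode it identically.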

> Defaulting to UTF-8, with exceptions for European groups as Latin-1, is
> probably more "universal", assuming we know what hierarchy delimiter the
> European groups use.

It would be more geopolitically correct. Also, it is much nicer
defaulting to the standard (USEFOR), and listing the exceptions, than
the other way.

Simon Josefsson

Oct 15, 2001, 8:07:13 AM

to Per Abrahamsen, bu...@gnus.org

On Mon, 15 Oct 2001, Per Abrahamsen wrote:

> > Are there any "official" resources on what charset different (European)
> > hierarchies should use?
>
> I think most hierarchies with rules prefer RFC 2047 over raw 8bit. The
> exceptions are listed in "gnus-group-posting-charset-alist". They are
> fr,no (latin-1) and fido7,relcom (koi8).
>
> The point being, apart from these groups, raw 8-bit characters in
> headers are against the rules, and we should therefore just use the
> "guess" that is most likely to be right for that hierarchy.
>
> For the four above, we _must_ of course treat raw 8-bit characters as
> the same character sets as we generate.
>
> For dk,de,se,swnet (as a start) we should specify latin-1. And ask
> people in the ding list to contribute more to that list.
>
> Unless we can start guessing? Most latin-1 text is invalid utf-8, so
> one could say utf-8 if valid, otherwise latin-1. But that would be
> work. Or use the mule heuristics, which I know nothing about.

There may already be some Mule heuristics going on: with my patch, both
of Erland's messages on news.gnksa.org (raw 8859-1 and raw UTF-8 in
Subject:) displayed fine. It may be that Emacs UTF-8 recognizes 8859-1
chars and displays them automagically.

I think we should apply my patch, except for the removal of
article-decode-group-name, and add dk,de,se,swnet as Latin-1 hierarchies.
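
As a rough illustration, such an entry could be added from a user's init
file as below. This is only a sketch: the regexp copies the style of the
existing entries in the patch, and it assumes that the first matching
entry in `gnus-group-charset-alist' wins, so the new entry is pushed to
the front, ahead of the catch-all ".*" entry.

```elisp
;; Sketch: treat raw 8-bit text in dk, de, se and swnet as Latin-1
;; when reading.  Assumes the first matching alist entry is used.
(with-eval-after-load 'gnus
  (push '("\\(^\\|:\\)\\(dk\\|de\\|se\\|swnet\\)\\>" iso-8859-1)
        gnus-group-charset-alist))
```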

It would be nice if UTF-8 was used for reading even in the Latin-1
hierarchies though, since it seems to work as well for Latin-1 but would
also support UTF-8. I wonder if the UTF-8 support in XEmacs works the
same; I'll check tonight.

Per Abrahamsen

Oct 15, 2001, 8:52:52 AM

to Simon Josefsson, bu...@gnus.org

Simon Josefsson <j...@extundo.com> writes:

> There may already be some Mule heuristics going on: with my patch, both
> of Erland's messages on news.gnksa.org (raw 8859-1 and raw UTF-8 in
> Subject:) displayed fine. It may be that Emacs UTF-8 recognizes 8859-1
> chars and displays them automagically.

If that is intended behaviour, I'll vote we should make UTF-8 default
when reading unspecified 8bit characters everywhere but fr, no, fido7
and relcom.

Can you create a self-contained testcase for the behavior we can send
to emacs...@gnu.org and ask?

Simon Josefsson

Oct 15, 2001, 9:23:20 AM

to Per Abrahamsen, bu...@gnus.org

On Mon, 15 Oct 2001, Per Abrahamsen wrote:

> Simon Josefsson <j...@extundo.com> writes:
>
> > There may already be some Mule heuristics going on: with my patch, both
> > of Erland's messages on news.gnksa.org (raw 8859-1 and raw UTF-8 in
> > Subject:) displayed fine. It may be that Emacs UTF-8 recognizes 8859-1
> > chars and displays them automagically.
>
> If that is intended behaviour, I'll vote we should make UTF-8 default
> when reading unspecified 8bit characters everywhere but fr, no, fido7
> and relcom.

As it shouldn't affect posting, another variable is probably needed.
`gnus-charset-override-for-reading-alist' with `((latin-1 . utf-8))' or
something, maybe.
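
A minimal sketch of how such a reading-only override might work. Both
names below are hypothetical and nothing like them exists in Gnus. The
sketch uses `iso-8859-1' rather than `latin-1' as the key, because that is
the symbol `gnus-group-charset-alist' uses in the patch.

```elisp
(defvar my-charset-override-for-reading-alist '((iso-8859-1 . utf-8))
  "Hypothetical alist of (CONFIGURED . ACTUAL) charsets.
When the charset configured for a group is CONFIGURED, decode raw
8-bit text as ACTUAL instead.  Only reading is affected.")

(defun my-charset-for-reading (charset)
  "Return the charset to decode with when the group default is CHARSET."
  (or (cdr (assq charset my-charset-override-for-reading-alist))
      charset))
```

Charsets without an entry, such as koi8-r, pass through unchanged.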

> Can you create a self-contained testcase for the behavior we can send
> to emacs...@gnu.org and ask?

I'll try.

I wonder if the same applies to latin-2 etc as well.

I also wonder about the failure-rate on the heuristic.

Per Abrahamsen

Oct 15, 2001, 10:01:21 AM

to Simon Josefsson, bu...@gnus.org

Simon Josefsson <j...@extundo.com> writes:

> As it shouldn't affect posting, another variable is probably needed.
> `gnus-charset-override-for-reading-alist' with `((latin-1 . utf-8))' or
> something, maybe.

It won't affect posting, posting is controlled by a different option,
namely "gnus-group-posting-charset-alist".

Or did I misunderstand you?

> I wonder if the same applies to latin-2 etc as well.

How? It can't detect that the original was latin-2 rather than
latin-1, unless it applies language specific heuristics (i.e. a high
frequency of Polish words suggest latin-2...) which I very much doubt.

The question is if the 8-bit "fallback" for utf-8 is "latin-1" or "the
default character set". I suspect the first, since

1) latin-1 is the first 256 Unicode characters.

2) there is no reason the "default character set" should be an 8bit
character set, it could just as well be an iso 2022 variant.

> I also wonder about the failure-rate on the heuristic.

Pretty low, I suspect. I mean, it will often guess "latin-1" where
neither "latin-1" nor "utf-8" is correct, but rarely guess "utf-8"
when "latin-1" was intended. Most latin-1 characters translate to
byte sequences in utf-8 that almost never occur in latin-1 text.
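
To make that concrete, here is a small sketch (the function name is made
up for illustration) that encodes text as UTF-8 and then misreads the
bytes as latin-1. A single "ä" (U+00E4) becomes the two bytes #xC3 #xA4,
which read as latin-1 give "Ã¤", a pair that is rare in real latin-1 text.

```elisp
(defun my-utf-8-read-as-latin-1 (text)
  "Encode TEXT as UTF-8, then decode the resulting bytes as latin-1.
This shows what UTF-8 looks like when it is mistaken for latin-1."
  (decode-coding-string (encode-coding-string text 'utf-8)
                        'iso-8859-1))

;; The UTF-8 bytes of U+00E4 as a list of integers.
(append (encode-coding-string "\u00e4" 'utf-8) nil)   ; => (195 164)
```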

Simon Josefsson

Oct 15, 2001, 10:24:34 AM

to Per Abrahamsen, bu...@gnus.org

On Mon, 15 Oct 2001, Per Abrahamsen wrote:

> Simon Josefsson <j...@extundo.com> writes:
>
> > As it shouldn't affect posting, another variable is probably needed.
> > `gnus-charset-override-for-reading-alist' with `((latin-1 . utf-8))' or
> > something, maybe.
>
> It won't affect posting, posting is controlled by a different option,
> namely "gnus-group-posting-charset-alist".

Um, no, that variable only says which charsets are allowed to be
unencoded. It does not say which charset to use for posting. (The latter
is what `gnus-group-charset-alist' is for.)

> > I wonder if the same applies to latin-2 etc as well.
>
> How? It can't detect that the original was latin-2 rather than
> latin-1, unless it applies language specific heuristics (i.e. a high
> frequency of Polish words suggest latin-2...) which I very much doubt.
>
> The question is if the 8-bit "fallback" for utf-8 is "latin-1" or "the
> default character set". I suspect the first, since
>
> 1) latin-1 is the first 256 Unicode characters.
>
> 2) there is no reason the "default character set" should be an 8bit
> character set, it could just as well be an iso 2022 variant.

I'll try to post a raw Latin-2 Subject: and we'll see what happens.

Per Abrahamsen

Oct 15, 2001, 11:28:45 AM

to Simon Josefsson, bu...@gnus.org

Simon Josefsson <j...@extundo.com> writes:

> On Mon, 15 Oct 2001, Per Abrahamsen wrote:
>
>> Simon Josefsson <j...@extundo.com> writes:
>>
>> > As it shouldn't affect posting, another variable is probably needed.
>> > `gnus-charset-override-for-reading-alist' with `((latin-1 . utf-8))' or
>> > something, maybe.
>>
>> It won't affect posting, posting is controlled by a different option,
>> namely "gnus-group-posting-charset-alist".
>
> Um, no, that variable only says which charsets are allowed to be
> unencoded. It does not say which charset to use for posting. (The latter
> is what `gnus-group-charset-alist' is for.)

Here is the doc string for gnus-group-charset-alist:

"Alist of regexps (to match group names) and default charsets to be used WHEN READING."

(my emphasis).

I always assumed Gnus just used whatever character set the user used
for typing his message when posting. Is that not the case?

Simon Josefsson

Oct 15, 2001, 1:49:35 PM

to Per Abrahamsen, bu...@gnus.org

Per Abrahamsen <abr...@dina.kvl.dk> writes:

>>> > As it shouldn't affect posting, another variable is probably needed.
>>> > `gnus-charset-override-for-reading-alist' with `((latin-1 . utf-8))' or
>>> > something, maybe.
>>>
>>> It won't affect posting, posting is controlled by a different option,
>>> namely "gnus-group-posting-charset-alist".
>>
>> Um, no, that variable only says which charsets are allowed to be
>> unencoded. It does not say which charset to use for posting. (The latter
>> is what `gnus-group-charset-alist' is for.)
>
> Here is the doc string for gnus-group-charset-alist:
>
> "Alist of regexps (to match group names) and default charsets to be used WHEN READING."
>
> (my emphasis).
>
> I always assumed Gnus just used whatever character set the user used
> for typing his message when posting. Is that not the case?

You are right. :-) This m17n/i18n stuff is making my head spin.