Automatic recognition of some specific coding systems

Jürgen Hartmann

Feb 24, 2015, 12:52:54 PM

to help-gn...@gnu.org

Most of the text files that I have to work with are encoded with one
of the coding systems

utf-8-unix
latin-9-unix
cp850-dos

Therefore, it was very convenient for me that the version 22.3 of
Emacs was (after some minor configuration) perfectly able to
automatically recognize these coding systems from the contents of the
respective files. As I understand, this was possible because in that
old Emacs version these three systems were associated with three
different coding categories, i.e.

coding-category-utf-8
coding-category-iso-8-1 (meanwhile deprecated)
coding-category-ccl

respectively.

Now switching to Emacs 24.4, I found that two of these coding systems,

latin-9-unix
cp850-dos

were bunched together into the category

coding-category-charset

presumably with the consequence that I have to choose which one of
these two systems will not be automatically recognized any more.

Is this conclusion correct?

If yes, this would be a big regression from my point of view, so I am
very interested in any kind of workaround.
(If this has to be done by reimplementing cp850 via CCL, it would be
great to get some (link to a) tutorial on this topic.)

It is clear that there are coding systems that can not be
distinguished just by analyzing the encoded text. But from the
experience with former versions of Emacs I know that this particular
problem is not ill-posed.

Therefore, I would greatly appreciate any help.

Juergen

Eli Zaretskii

Feb 24, 2015, 1:28:37 PM

to help-gn...@gnu.org

> From: Jürgen Hartmann <juergen_...@hotmail.com>
> Date: Tue, 24 Feb 2015 16:31:46 +0100

>
> Most of the text files that I have to work with are encoded with one
> of the coding systems
>
> utf-8-unix
> latin-9-unix
> cp850-dos
>
> Therefore, it was very convenient for me that the version 22.3 of
> Emacs was (after some minor configuration) perfectly able to
> automatically recognize these coding systems from the contents of the
> respective files. As I understand, this was possible because in that
> old Emacs version these three systems were associated with three
> different coding categories, i.e.
>
> coding-category-utf-8
> coding-category-iso-8-1 (meanwhile deprecated)
> coding-category-ccl
>
> respectively.
>
> Now switching to Emacs 24.4, I found that two of these coding systems,
>
> latin-9-unix
> cp850-dos
>
> were bunched together into the category
>
> coding-category-charset
>
> presumably with the consequence that I have to choose which one of
> these two systems will not be automatically recognized any more.
>
> Is this conclusion correct?

No, I don't think so. There's no direct relation between categories
and recognition of encoding.

If you have specific problems, i.e. if Emacs doesn't recognize the
encoding of some file(s), please post the details. (I'd suggest trying
in "emacs -Q" first, because some problems might be caused by your
customizations that need to be removed or adapted to the new version.)
Then people here could review the problems and advise you about
possible solutions, or ask you to file a bug report.

But in general, there shouldn't be any regressions in recognizing
encodings.

Jürgen Hartmann

Feb 24, 2015, 5:30:59 PM

to help-gn...@gnu.org

Thank you, Eli Zaretskii, for your speedy answer:

>> Is this conclusion correct?
>
> No, I don't think so. There's no direct relation between categories
> and recognition of encoding.

I am very glad to hear that.

> If you have specific problems, i.e. if Emacs doesn't recognize the
> encoding of some file(s), please post the details. (I'd suggest trying
> in "emacs -Q" first, because some problems might be caused by your
> customizations that need to be removed or adapted to the new version.)
> Then people here could review the problems and advise you about
> possible solutions, or ask you to file a bug report.
>
> But in general, there shouldn't be any regressions in recognizing
> encodings.

OK. I will try to give a specific example - it is rather artificial
but representative:

Consider a utf-8-unix encoded text file, meaningfully named
utf-8-unix, that just contains the seven German special characters

äöüßÄÖÜ ("a"o"u"s"A"O"U)

in one single line followed by a newline character. Now we make two
copies of this file and recode them to the other coding systems of
interest:

cp utf-8-unix latin-9-unix
recode ..l9 latin-9-unix

cp utf-8-unix cp850-dos
recode ..pc cp850-dos
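
The same three test files can presumably also be produced from within
Emacs, without recode. A minimal sketch, assuming the coding-system
names used in this thread are defined in the running Emacs; the target
directory (temporary-file-directory) is an assumption and not part of
the recipe above:

```elisp
;; Write the seven German characters plus a newline once per coding
;; system.  Each file is named after its coding system and is placed in
;; `temporary-file-directory' (an assumption; adjust as needed).
(let ((text "äöüßÄÖÜ\n"))
  (dolist (cs '(utf-8-unix latin-9-unix cp850-dos))
    (let ((coding-system-for-write cs))
      (write-region text nil
                    (expand-file-name (symbol-name cs)
                                      temporary-file-directory)))))
```

The -dos variant converts the trailing newline to CR LF on writing, so
the cp850-dos file should match what "recode ..pc" produces.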

Visiting all three files in an Emacs session that was freshly started
by means of

emacs -Q

- thank you for that important hint - yields a perfect recognition of
the respective coding in the case of

utf-8-unix
latin-9-unix (recognized as latin-1-unix, equivalent here)

but the recognition fails for the cp850-dos encoded file, as it is
recognized as

raw-text-dos

encoded and its contents are displayed as

\204\224\201\341\216\231\232

Looking on the contents of the variable coding-category-list, it has
the form

(coding-category-utf-8 coding-category-charset ...
coding-category-raw-text ...)

where the values of the variables coding-category-utf-8 and
coding-category-charset are utf-8 and iso-latin-1 respectively.
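
For reference, the detection can also be probed without visiting a
file. A small sketch for Emacs 23 or later, to be evaluated in the
*scratch* buffer; it only inspects the current state and changes
nothing:

```elisp
;; The cp850 bytes for "äöüßÄÖÜ" followed by CR LF, written as a
;; unibyte string with octal escapes.
(let ((bytes "\204\224\201\341\216\231\232\r\n"))
  (list (coding-system-priority-list)     ; current priority order
        (detect-coding-string bytes)      ; all candidates, by priority
        (detect-coding-string bytes t)))  ; only the top candidate
```

Evaluating this before and after a call to prefer-coding-system shows
how the priority order changes the candidate that Emacs picks.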

If I start again with a new Emacs session (emacs -Q), but this time
performing the commands

prefer-coding-system cp850
prefer-coding-system utf-8

prior to visiting the files, the codings of

utf-8-unix
cp850-dos

are recognized correctly, while the file latin-9-unix is recognized as
cp850-unix encoded and its contents are displayed as some cryptic symbols.

The coding-category-list and the variable coding-category-utf-8 have
the same values as before, but the variable coding-category-charset
contains cp850 this time.

So my problem is to find a configuration of Emacs 24.4 that yields a
correct automatic recognition of all three coding systems

   utf-8-unix
   latin-9-unix
   cp850-dos

when the files of the example above are visited. One has to keep in
mind that this was perfectly possible with Emacs 22.3, as I just
verified again.

Sorry for the rather long post, but I hope that I could state my
problem more precisely.

Juergen

Eli Zaretskii

Feb 25, 2015, 11:19:14 AM

to help-gn...@gnu.org

> From: Jürgen Hartmann <juergen_...@hotmail.com>
> Date: Tue, 24 Feb 2015 23:30:49 +0100

>
> Consider a utf-8-unix encoded text file, meaningfully named
> utf-8-unix, that just contains the seven German special characters
>
>    äöüßÄÖÜ   ("a"o"u"s"A"O"U)
>
> in one single line followed by a newline character. Now we make two
> copies of this file and recode them to the other coding systems of
> interest:
>
>    cp utf-8-unix latin-9-unix
>    recode ..l9 latin-9-unix
>
>    cp utf-8-unix cp850-dos
>    recode ..pc cp850-dos
>
> Visiting all three files in an Emacs session that was freshly started
> by means of
>
>    emacs -Q
>
> - thank you for that important hint - yields a perfect recognition of
> the respective coding in the case of
>
>    utf-8-unix
>    latin-9-unix   (recognized as latin-1-unix, equivalent here)
>
> but the recognition fails for the cp850-dos encoded file, as it is
> recognized as
>
>    raw-text-dos
>
> encoded and its contents are displayed as
>
>    \204\224\201\341\216\231\232

That's true, but I see the same behavior in Emacs 22.3, if I invoke it
with "emacs -q" (lowercase 'q', since 22.x didn't support -Q), so
there's no change in behavior here.

> So my problem is to find a configuration of Emacs 24.4 that yields a
> correct automatic recognition of all three coding systems
>
>    utf-8-unix
>    latin-9-unix
>    cp850-dos
>
> when the files of the example above are visited. One has to keep in
> mind that this was perfectly possible with Emacs 22.3, as I just
> verified again.

How exactly did you verify with v22.3? As I wrote above, I see the
same behavior in that version. Did you invoke it with -q? If not,
there are some customizations of yours that modify the default
behavior, and the question becomes how to express the same
customizations in Emacs 24.

Jürgen Hartmann

Feb 25, 2015, 12:53:50 PM

to help-gn...@gnu.org

Thank you, Eli Zaretskii, for repeatedly digging into this problem:

>> encoded and its contents is displayed as
>>
>> \204\224\201\341\216\231\232
>
> That's true, but I see the same behavior in Emacs 22.3, if I invoke it
> with "emacs -q" (lowercase 'q', since 22.x didn't support -Q), so
> there's no change in behavior here.

That is right: I had to do some minor configuration to get Emacs 22.3
to correctly recognize these three coding systems. See below.

> How exactly did you verify with v22.3? As I wrote above, I see the
> same behavior in that version. Did you invoke it with -q? If not,
> there are some customizations of yours that modify the default
> behavior, and the question becomes how to express the same
> customizations in Emacs 24.

To set up a clean stage, I just recompiled Emacs 22.3 from the vanilla
GNU sources, and started one session with -q and another with -Q,
receiving the same result in both cases.

For the tests I used the same sample text files

utf-8-unix
latin-9-unix
cp850-dos

that I described in my previous post.

As you already described, without any customization the automatic
recognition fails in the case of the cp850-dos encoded text file, as
its coding is recognized as raw-text-dos. So far we get the same
result as in the Emacs 24.4 case.

But if one issues the commands

(check-coding-system 'cp850)
(setq coding-category-ccl 'cp850)
(update-coding-systems-internal)

in the *scratch* buffer (Lisp Interaction mode) of Emacs 22.3 right
after starting the session, all three coding systems will be perfectly
recognized when the text files are visited.

After this customization, the contents of the variable
coding-category-list has the form

(coding-category-utf-8 coding-category-iso-8-1 coding-category-ccl ...)

where the values of the variables coding-category-utf-8,
coding-category-iso-8-1, and coding-category-ccl are mule-utf-8,
iso-latin-1, and cp850 respectively.

You are perfectly right stating that the question to be addressed now
is how to port these customization commands to the contemporary
version 24.4 of Emacs: In that version the coding system cp850 is no
longer implemented via CCL and it is associated with the coding
category coding-category-charset--the same category that the systems
latin-1 and latin-9 are associated with. Furthermore, the command
update-coding-systems-internal is no longer available, but this
might be a minor detail.
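
The category assignment can be checked from Lisp. A hedged sketch,
assuming that the property list returned by coding-system-plist carries
a :category entry (this appears to be the case in Emacs 23 and later,
but it is not a documented guarantee):

```elisp
;; Pair each coding system with the category recorded in its plist.
(mapcar (lambda (cs)
          (cons cs (plist-get (coding-system-plist cs) :category)))
        '(utf-8 latin-1 latin-9 cp850))
```

Going by the observations above, the last three entries should report
the same category in Emacs 24.4.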

I am rather clueless here, so any help is most welcome.

Juergen

Eli Zaretskii

Feb 25, 2015, 3:29:03 PM

to help-gn...@gnu.org

> From: Jürgen Hartmann <juergen_...@hotmail.com>
> Date: Wed, 25 Feb 2015 18:53:39 +0100

>
> You are perfectly right stating that the question to be addressed now
> is how to port these customization commands to the contemporary
> version 24.4 of Emacs: In that version the coding system cp850 is no
> longer implemented via CCL and it is associated with the coding
> category coding-category-charset--the same category that the systems
> latin-1 and latin-9 are associated with. Furthermore, the command
> update-coding-systems-internal is no longer available, but this
> might be a minor detail.
>
> I am rather clueless here, so any help is most welcome.

Try this:

(set-coding-system-priority 'utf-8 'cp850)

Jürgen Hartmann

Feb 25, 2015, 6:23:59 PM

to help-gn...@gnu.org

@Eli Zaretskii: Thank you very much for your hint:

> Try this:
>
> (set-coding-system-priority 'utf-8 'cp850)

After doing this, the coding systems

utf-8
cp850

get correctly recognized, but

latin-9-unix

gets wrongly recognized as cp850-unix encoded.

If I modify the lisp expression to

(set-coding-system-priority 'utf-8 'latin-9)

it is utf-8 and latin-9 that are properly recognized while the test
file

cp850-dos

gets detected as iso-latin-9-dos encoded.

If I pass all three coding systems to set-coding-system-priority,

(set-coding-system-priority 'utf-8 'latin-9 'cp850) or
(set-coding-system-priority 'utf-8 'cp850 'latin-9)

it turns out that the function set-coding-system-priority ignores the third
coding system in these cases, because it belongs to the same coding
category as the coding system named in the second place. The source
code src/coding.c comments this in the lines 9972 and 9973 like this:

/* Ignore this coding system because a coding system of the
same category already had a higher priority. */

So I fear that we can not use this function to establish the
simultaneous recognizability of all three coding systems.

By the way, could you verify that this is possible with Emacs 22.3
with the customization described in my previous post?

Juergen

Eli Zaretskii

Feb 26, 2015, 11:36:01 AM

to help-gn...@gnu.org

> From: Jürgen Hartmann <juergen_...@hotmail.com>
> Date: Thu, 26 Feb 2015 00:23:50 +0100

>
> > Try this:
> >
> >   (set-coding-system-priority 'utf-8 'cp850)
>
> After doing this, the coding systems
>
>    utf-8
>    cp850
>
> get correctly recognized, but
>
>    latin-9-unix
>
> gets wrongly recognized as cp850-unix encoded.
>
> If I modify the lisp expression to
>
>    (set-coding-system-priority 'utf-8 'latin-9)
>
> it is utf-8 and latin-9 that are properly recognized while the test
> file
>
>    cp850-dos
>
> gets detected as iso-latin-9-dos encoded.

I feared that might be the result.

> If I pass all three coding systems to set-coding-system-priority,
>
>    (set-coding-system-priority 'utf-8 'latin-9 'cp850)   or
>    (set-coding-system-priority 'utf-8 'cp850 'latin-9)
>
> it turns out that the function set-coding-system-priority ignores the third
> coding system in these cases, because it belongs to the same coding
> category as the coding system named in the second place. The source
> code src/coding.c comments this in the lines 9972 and 9973 like this:
>
>     /* Ignore this coding system because a coding system of the
>        same category already had a higher priority. */

Yes, I know. That's why I only mentioned 2 of them.

It looks like what you want is beyond the current capabilities of
Emacs's auto-detection of encoding. See below for some alternatives.

Having said that...

> By the way, could you verify that this is possible with Emacs 22.3
> with the customization described in my previous post?

...no, it doesn't work for me. The latin-9 file is decoded using my
locale's encoding (which isn't latin-9), and cp850 file is still
raw-text.

So I think some other factor(s) is/are at work on your system. Your
locale's encoding is certainly one of them, but I think there should
be something else, either in your customizations or somewhere else.

In general, even if Emacs 22.3 was capable of doing the job, I think it
was by sheer luck, and is anyway fragile, since the same
customizations don't work for me (and AFAIU, aren't supposed to work).
So I would suggest exploring alternative ways of doing this in Emacs
24 reliably. Some possibilities you may wish to explore:

. Put a 'coding: cp850' cookie in the cp850 files

. If the names of the cp850 files all match some common pattern, you
can use modify-coding-system-alist to tell Emacs to decode them by
cp850

. Similarly, if the cp850 files' contents match some common regexp,
you can customize auto-coding-regexp-alist to force their decoding
by cp850

Of course, you can always turn the table, and do the above for
latin-9, while keeping cp850 in set-coding-system-priority call. It
all depends which one of these 2 lends itself better to one of these
methods.

I believe that if one of these alternatives can do the job for you,
the result will be much more reliable.
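
As an illustration of the file-name alternative, a minimal sketch for
an init file. The ".dos.txt" suffix is purely hypothetical and stands
for whatever naming pattern the cp850 files might share:

```elisp
;; Hypothetical convention: cp850 files end in ".dos.txt".  Files whose
;; names match are read and written as cp850 with DOS line endings.
(modify-coding-system-alist 'file "\\.dos\\.txt\\'" 'cp850-dos)

;; All other files still go through content-based detection, with
;; UTF-8 tried first and Latin-9 as the preferred 8-bit charset.
(set-coding-system-priority 'utf-8 'latin-9)
```

Swapping the roles, with latin-9 in the alist and cp850 in the priority
call, follows the same pattern.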

Jürgen Hartmann

Feb 26, 2015, 5:34:16 PM

to help-gn...@gnu.org

@Eli Zaretskii: Thank you very much for your profound assessment:

> It looks like what you want is beyond the current capabilities of
> Emacs's auto-detection of encoding. See below for some alternatives.
>
> Having said that...
>
>> By the way, could you verify that this is possible with Emacs 22.3
>> with the customization described in my previous post?
>
> ...no, it doesn't work for me. The latin-9 file is decoded using my
> locale's encoding (which isn't latin-9), and cp850 file is still
> raw-text.

Oops, this is an important finding indeed.

> So I think some other factor(s) is/are at work on your system. Your
> locale's encoding is certainly one of them, but I think there should
> be something else, either in your customizations or somewhere else.

I just repeated the tests with Emacs 22.3 using the POSIX locale,

LC_ALL=C ./emacs -q

and you are right: the cp850 file was recognized as raw-text now. The
locale I used before was

de_DE.UTF-8

The more I get involved in this topic the more I see that it is much
more complex than I thought at first glance.

> In general, even if Emacs 22.3 was capable of doing the job, I think it
> was by sheer luck, and is anyway fragile, since the same
> customizations don't work for me (and AFAIU, aren't supposed to work).
> So I would suggest exploring alternative ways of doing this in Emacs
> 24 reliably.

This sounds reasonable to me. Besides the aspect of reliability, which
is of course the most important one, doing so might also yield a
solution that is likely to survive future updates.

> Some possibilities you may wish to explore:
>
> . Put a 'coding: cp850' cookie in the cp850 files

I would rather avoid altering the files' content for this technical reason.

>   . If the names of the cp850 files all match some common pattern, you
>     can use modify-coding-system-alist to tell Emacs to decode them by
>     cp850

Unfortunately, in my case there is no such pattern in the file names
that would allow one to tell which coding the respective file might use.

>   . Similarly, if the cp850 files' contents match some common regexp,
>     you can customize auto-coding-regexp-alist to force their decoding
>     by cp850

That one might do the trick: In my case the only files (at least in
the big picture) that use the DOS EOL variant are those encoded with
cp850 and vice versa. So one could think about a regular expression
that matches this unique EOL pattern.
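
A first, deliberately naive sketch of that idea. It assumes that the
regexps in auto-coding-regexp-alist are searched in the part of the
file that Emacs inspects when visiting it, and that really no other
kind of file uses CR LF line endings:

```elisp
;; Treat any file in which a CR LF pair is found as cp850 with DOS
;; line endings.  This is far too broad for general use: a UTF-8 file
;; with DOS line endings would be misdecoded as cp850.
(add-to-list 'auto-coding-regexp-alist '("\r\n" . cp850-dos))
```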

> Of course, you can always turn the table, and do the above for
> latin-9, while keeping cp850 in set-coding-system-priority call. It
> all depends which one of these 2 lends itself better to one of these
> methods.
>
> I believe that if one of these alternatives can do the job for you,
> the result will be much more reliable.

I also think so.

So, I have to play around a little bit to get acquainted with the
construction of regular expressions for Emacs. I will be back when I
have gained a deeper insight, or a concrete solution at best.

Meanwhile I would like to thank you, Eli Zaretskii, very much for your
time and effort that you spent to provide me with this thorough
analysis and your valuable suggestions.

Juergen

Yuri Khan

Feb 26, 2015, 8:50:54 PM

to Jürgen Hartmann, help-gn...@gnu.org

On Tue, Feb 24, 2015 at 9:31 PM, Jürgen Hartmann
<juergen_...@hotmail.com> wrote:
> Most of the text files that I have to work with are encoded with one
> of the coding systems
>
> utf-8-unix
> latin-9-unix
> cp850-dos

> […]

Now that Eli has suggested a direction of your search, I’ll go in and
suggest another.

The general problem you’re solving is that of encoding detection.
There exist ready-made solutions for that, e.g. by computing byte
frequencies and matching them against known character frequencies in
your language. One of these is called enca.

Googling for “emacs enca” yields a post by Dmitriyi Paduchikh in
gnu.emacs.sources, dated 2007.

https://lists.gnu.org/archive/html/gnu-emacs-sources/2007-06/msg00037.html

Jürgen Hartmann

Feb 27, 2015, 7:13:02 AM

to help-gn...@gnu.org

Thank you, Yuri Khan, for widening the perspective:

> The general problem you’re solving is that of encoding detection.
> There exist ready-made solutions for that, e.g. by computing byte
> frequencies and matching them against known character frequencies in
> your language. One of these is called enca.
>
> Googling for “emacs enca” yields a post by Dmitriyi Paduchikh in
> gnu.emacs.sources, dated 2007.
>
> https://lists.gnu.org/archive/html/gnu-emacs-sources/2007-06/msg00037.html

Using Google is always good advice that I will gratefully follow
once more with respect to this broader background.

Actually I didn't know Enca at all up to now: A language-based attempt
to recognize encoding is an interesting idea.

Unfortunately, Enca can not be used in my special case, because--I
didn't mention this before, sorry--the text files to handle are mostly
in English and German. For the former ones encoding is not an issue,
and for the latter the language German is not supported by Enca.

Enca 1.14 for example only supports

Belarussian
Bulgarian
Czech
Estonian
Croatian
Hungarian
Lithuanian
Latvian
Polish
Russian
Slovak
Slovene
Ukrainian
Chinese

But for people that use any of these languages this might be a
promising option.

Apart from that--and this might be helpful in my case also--the idea
to use an external software to detect encoding is very charming, and
maybe it is possible to adapt the lisp snippets contained in your link
to other programs. E.g.

find -bi ...

is capable of identifying file encodings, although it recognizes cp850
rather non-specifically as "unknown-8bit".

So thank you very much for your suggestions.

Juergen

Jürgen Hartmann

Feb 27, 2015, 7:26:00 AM

to help-gn...@gnu.org

Sorry, I made a mistake in my previous post: Of course it should read

file -bi ...

instead of

find -bi ...
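
For experimenting with such an external detector from within Emacs, a
small sketch may help. The helper name is hypothetical, and the code
assumes that a "file" program supporting the -b and -i options is on
exec-path:

```elisp
(defun my-file-mime-info (filename)
  "Return the first output line of `file -bi FILENAME'.
Hypothetical helper; signals an error if `file' exits with a
non-zero status."
  (car (process-lines "file" "-bi" (expand-file-name filename))))

;; Example call, assuming the test file is in the current directory:
;; (my-file-mime-info "cp850-dos")
```

Mapping the reported charset onto an Emacs coding system would still
have to be done separately.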

Juergen

Eli Zaretskii

Feb 28, 2015, 11:56:05 AM

to help-gn...@gnu.org

> From: Jürgen Hartmann <juergen_...@hotmail.com>
> Date: Thu, 26 Feb 2015 23:34:05 +0100

>
> >   . Similarly, if the cp850 files' contents match some common regexp,
> >     you can customize auto-coding-regexp-alist to force their decoding
> >     by cp850
>
> That one might do the trick: In my case the only files (at least in
> the big picture) that use the DOS EOL variant are those encoded with
> cp850 and vice versa. So one could think about a regular expression
> that matches this unique EOL pattern.

A more reliable test might be characters whose codepoints are between
128 and 159: those should generally be absent from ISO-8859 encodings.
(Emacs doesn't use this fact for good reasons, but in your specific
case those reasons should not matter, I think.)
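
A possible sketch of such a test. One caveat is added here: UTF-8
continuation bytes also fall into the range 128..159 (the second byte
of "ß" is 159, for example), so the regexp below only accepts such a
byte at the very start of the text or directly after an ASCII byte,
which cannot happen in valid UTF-8. The variable name is hypothetical,
and the whole thing is a heuristic: cp850 text whose non-ASCII bytes
are all 160 or above is not caught.

```elisp
;; A byte in 128..159 (octal 200..237) that is either the first byte
;; of the text or directly preceded by an ASCII byte.
(defvar my-cp850-hint-regexp
  "\\(?:\\`\\|[\000-\177]\\)[\200-\237]"
  "Heuristic regexp hinting at cp850 content (hypothetical name).")

(add-to-list 'auto-coding-regexp-alist
             (cons my-cp850-hint-regexp 'cp850))

;; Try the regexp on the sample line in all three encodings; only the
;; cp850 bytes are meant to produce a match.
(mapcar (lambda (cs)
          (cons cs (string-match-p
                    my-cp850-hint-regexp
                    (encode-coding-string "äöüßÄÖÜ\n" cs))))
        '(utf-8-unix latin-9-unix cp850-dos))
```

Files that do not match fall through to the normal detection, so
(set-coding-system-priority 'utf-8 'latin-9) can stay in place for
them.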

Jürgen Hartmann

Mar 3, 2015, 5:58:34 PM

to help-gn...@gnu.org

Sorry for the delay in my response: I was busy recovering my system
from a nasty hard disk failure. But now mail service is back up again...

So, thank you, Eli Zaretskii, for giving this solution the right twist:

>> So one could think about a regular expression
>> that matches this unique EOL pattern.
>
> A more reliable test might be characters whose codepoints are between
> 128 and 159: those should generally be absent from ISO-8859 encodings.
> (Emacs doesn't use this fact for good reasons, but in your specific
> case those reasons should not matter, I think.)

That's great: I didn't recognize this distinctive feature. Of course this is
by far the better test: It is more specific, since it is per se related to
the actual task. I think this is the approach to favor.

When my system is restored again, I will try to implement it, reporting the
findings.

Juergen