utf-8-unix
latin-9-unix
cp850-dos
Therefore, it was very convenient for me that version 22.3 of
Emacs was (after some minor configuration) perfectly able to
automatically recognize these coding systems from the contents of the
respective files. As I understand it, this was possible because in that
old Emacs version these three systems were associated with three
different coding categories, i.e.
coding-category-utf-8
coding-category-iso-8-1 (meanwhile deprecated)
coding-category-ccl
respectively.
Now switching to Emacs 24.4, I found that two of these coding systems,
latin-9-unix
cp850-dos
were bunched together into the category
coding-category-charset
presumably with the consequence that I have to choose which one of
these two systems will not be automatically recognized any more.
Is this conclusion correct?
If yes, this would be a big regression from my point of view, so I am
very interested in any kind of workaround.
(If this has to be done by reimplementing cp850 via CCL, it would be
great to get some (link to a) tutorial on this topic.)
It is clear that there are coding systems that cannot be
distinguished just by analyzing the encoded text. But from my
experience with former versions of Emacs I know that this particular
problem is not ill-posed.
Therefore, I would greatly appreciate any help.
Juergen
>> Is this conclusion correct?
>
> No, I don't think so. There's no direct relation between categories
> and recognition of encoding.
I am very glad to hear that.
> If you have specific problems, i.e. if Emacs doesn't recognize the
> encoding of some file(s), please post the details. (I'd suggest to
> try in "emacs -Q" first, because some problems might be caused by your
> customizations that need to be removed or adapted to the new version.)
> Then people here could review the problems and advise you about
> possible solutions, or ask you to file a bug report.
>
> But in general, there shouldn't be any regressions in recognizing
> encodings.
OK. I will try to give a specific example - it is rather artificial
but representative:
Consider an utf-8-unix encoded text file, meaningfully named
utf-8-unix, that just contains the seven German special characters
äöüßÄÖÜ ("a"o"u"s"A"O"U)
in one single line followed by a newline character. Now we make two
copies of this file and recode them to the other coding systems of
interest:
cp utf-8-unix latin-9-unix
recode ..l9 latin-9-unix
cp utf-8-unix cp850-dos
recode ..pc cp850-dos
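
If recode is not at hand, the same three test files can probably be produced from within Emacs itself. The following is only a sketch: the target directory is an assumption, and the coding system names are the ones mentioned in this thread.

```elisp
;; Sketch: write the same test line in three encodings.
;; The directory (temporary-file-directory) is an assumption.
(let ((text "äöüßÄÖÜ\n"))
  (dolist (spec '(("utf-8-unix"   . utf-8-unix)
                  ("latin-9-unix" . iso-latin-9-unix)
                  ("cp850-dos"    . cp850-dos)))
    ;; Bind the coding system only for this one write.
    (let ((coding-system-for-write (cdr spec)))
      ;; With a string as first argument, `write-region' writes
      ;; that string instead of buffer contents.
      (write-region text nil
                    (expand-file-name (car spec)
                                      temporary-file-directory)))))
```

The -dos variant should take care of writing CR LF line endings, so no separate conversion step is needed.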
Visiting all three files in an Emacs session that was freshly started
by means of
emacs -Q
- thank you for that important hint - yields a perfect recognition of
the respective coding in the case of
utf-8-unix
latin-9-unix (recognized as latin-1-unix, equivalent here)
but the recognition fails for the cp850-dos encoded file, as it is
recognized as
raw-text-dos
encoded and its content is displayed as
\204\224\201\341\216\231\232
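
These octal escapes are simply the raw cp850 bytes of the seven test characters. A quick check in the *scratch* buffer, as a sketch, decodes them by hand:

```elisp
;; The octal escapes form a unibyte string holding the raw bytes
;; of the cp850-dos test file (without the line ending).
(decode-coding-string "\204\224\201\341\216\231\232" 'cp850)

;; Reverse direction: encoding the text as cp850 should yield
;; the same seven bytes again.
(encode-coding-string "äöüßÄÖÜ" 'cp850)
```

Evaluating the first form should give back the seven German special characters, which shows that the file content is intact and only the automatic detection went wrong.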
Looking at the contents of the variable coding-category-list, it has
the form
(coding-category-utf-8 coding-category-charset ...
coding-category-raw-text ...)
where the values of the variables coding-category-utf-8 and
coding-category-charset are utf-8 and iso-latin-1 respectively.
If I start again with a new Emacs session (emacs -Q), but this time
performing the commands
prefer-coding-system cp850
prefer-coding-system utf-8
prior to visiting the files, the codings of
utf-8-unix
cp850-dos
are recognized correctly, while the file latin-9-unix is recognized as
cp850-unix encoded and its content is displayed as some cryptic symbols.
The coding-category-list and the variable coding-category-utf-8 have
the same values as before, but the variable coding-category-charset
contains cp850 this time.
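
For reference, the two interactive commands above can also be written as Lisp forms, and the resulting detection order can be inspected afterwards. This is a sketch for an "emacs -Q" session:

```elisp
;; Same effect as the two M-x prefer-coding-system invocations;
;; the system preferred last ends up with the highest priority.
(prefer-coding-system 'cp850)
(prefer-coding-system 'utf-8)

;; Inspect the resulting order used for automatic detection.
(coding-system-priority-list)     ; the full list
(coding-system-priority-list t)   ; only the highest-priority entry
```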
So my problem is to find a configuration of Emacs 24.4 that yields a
correct automatic recognition of all three coding systems
utf-8-unix
latin-9-unix or
cp850-dos
when the files of the example above are visited. One has to keep in
mind that this was perfectly possible with Emacs 22.3, as I just
verified again.
Sorry for the rather long post, but I hope that I could state my
problem more precisely.
Juergen
>> encoded and its contents is displayed as
>>
>> \204\224\201\341\216\231\232
>
> That's true, but I see the same behavior in Emacs 22.3, if I invoke it
> with "emacs -q" (lowercase 'q', since 22.x didn't support -Q), so
> there's no change in behavior here.
That is right: I had to do some minor configuration to get Emacs 22.3
to correctly recognize these three coding systems. See below.
> How exactly did you verify with v22.3? As I wrote above, I see the
> same behavior in that version. Did you invoke it with -q? If not,
> there are some customization of yours that modify the default
> behavior, and the question becomes how to express the same
> customizations in Emacs 24.
To set up a clean stage, I just recompiled Emacs 22.3 from the vanilla
GNU sources, and started one session with -q and another with -Q,
receiving the same result in both cases.
For the tests I used the same sample text files
utf-8-unix
latin-9-unix
cp850-dos
that I described in my previous post.
As you already described, without any customization the automatic
recognition fails in the case of the cp850-dos encoded text file, as
its coding is recognized as raw-text-dos. So far we get the same
result as in the Emacs 24.4 case.
But if one issues the commands
(check-coding-system 'cp850)
(setq coding-category-ccl 'cp850)
(update-coding-systems-internal)
in the *scratch* buffer (Lisp Interaction mode) of Emacs 22.3 right
after starting the session, all three coding systems will be perfectly
recognized when the text files are visited.
After this customization, the contents of the variable
coding-category-list has the form
(coding-category-utf-8 coding-category-iso-8-1 coding-category-ccl ...)
where the values of the variables coding-category-utf-8,
coding-category-iso-8-1, and coding-category-ccl are mule-utf-8,
iso-latin-1, and cp850 respectively.
You are perfectly right in stating that the question to be addressed now
is how to port these customization commands to the contemporary
version 24.4 of Emacs: in that version the coding system cp850 is no
longer implemented via CCL and it is associated with the coding
category coding-category-charset--the same category that the systems
latin-1 and latin-9 are associated with. Furthermore, the command
update-coding-systems-internal is no longer available, but this
might be a minor detail.
I am rather clueless here, so any help is most welcome.
Juergen
> Try this:
>
> (set-coding-system-priority 'utf-8 'cp850)
After doing this, the coding systems
utf-8
cp850
get correctly recognized, but
latin-9-unix
gets wrongly recognized as cp850-unix encoded.
If I modify the lisp expression to
(set-coding-system-priority 'utf-8 'latin-9)
it is utf-8 and latin-9 that are properly recognized while the test
file
cp850-dos
gets detected as iso-latin-9-dos encoded.
If I pass all three coding systems to set-coding-system-priority,
(set-coding-system-priority 'utf-8 'latin-9 'cp850) or
(set-coding-system-priority 'utf-8 'cp850 'latin-9)
it turns out that the function set-coding-system-priority ignores the third
coding system in these cases, because it belongs to the same coding
category as the coding system named in second place. The source
file src/coding.c comments on this in lines 9972 and 9973 as follows:
/* Ignore this coding system because a coding system of the
same category already had a higher priority. */
So I fear that we cannot use this function to establish the
simultaneous recognizability of all three coding systems.
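
The shared category can be made visible from Lisp. As a sketch, the following forms list the coding type of each of the three systems and then the priority list after a three-argument call:

```elisp
;; Pair each coding system with its coding type. If latin-9 and
;; cp850 report the same type, they compete for a single slot.
(mapcar (lambda (cs) (cons cs (coding-system-type cs)))
        '(utf-8 latin-9 cp850))

;; Set all three and look at what the priority list contains.
(set-coding-system-priority 'utf-8 'latin-9 'cp850)
(coding-system-priority-list)
```

If the observation above holds, only one of latin-9 and cp850 should show up near the head of the resulting list.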
By the way, could you verify that this is possible with Emacs 22.3
with the customization described in my previous post?
Juergen
> It looks like what you want is beyond the current capabilities of
> Emacs's auto-detection of encoding. See below for some alternatives.
>
> Having said that...
>
>> By the way, could you verify, that this is possible with Emacs 22.3
>> with the customization described in my previous post?
>
> ...no, it doesn't work for me. The latin-9 file is decoded using my
> locale's encoding (which isn't latin-9), and cp850 file is still
> raw-text.
Oops, this is an important finding indeed.
> So I think some other factor(s) is/are at work on your system. Your
> locale's encoding is certainly one of them, but I think there should
> be something else, either in your customizations or somewhere else.
I just repeated the tests with Emacs 22.3 using the POSIX locale,
LC_ALL=C ./emacs -q
and you are right: the cp850 file was recognized as raw-text now. The
locale I used before was
de_DE.UTF-8
The more I get involved in this topic the more I see that it is much
more complex than I thought at first glance.
> In general, even if Emacs 22.3 was capable to do the job, I think it
> was by sheer luck, and is anyway fragile, since the same
> customizations don't work for me (and AFAIU, aren't supposed to work).
> So I would suggest to explore alternative ways of doing this in Emacs
> 24 reliably.
This sounds reasonable to me. Besides the aspect of reliability, which
is of course the most important one, doing so might also yield a
solution that is likely to survive future updates.
> Some possibilities you may wish to explore:
>
> . Put a 'coding: cp850' cookie in the cp850 files
I would rather avoid altering the files' content for this technical reason.
> . If the names of the cp850 files all match some common pattern, you
> can use modify-coding-system-alist to tell Emacs to decode them by
> cp850
Unfortunately, in my case there is no such pattern in the file names
that would allow one to tell which coding the respective file might use.
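
For readers whose file names do follow a pattern, the suggestion might look roughly like this. The directory name and extension are purely hypothetical:

```elisp
;; Hypothetical pattern: every .txt file below a directory named
;; "dos-archive" is read and written as cp850 with DOS line ends.
(modify-coding-system-alist 'file
                            "/dos-archive/.*\\.txt\\'"
                            'cp850-dos)
```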
> . Similarly, if the cp850 files' contents match some common regexp,
> you can customize auto-coding-regexp-alist to force their decoding
> by cp850
That one might do the trick: In my case the only files (at least in
the big picture) that use the DOS EOL variant are those encoded with
cp850 and vice versa. So one could think about a regular expression
that matches this unique EOL pattern.
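
A minimal, untested sketch of this idea follows. It assumes that the regexp is searched for in the undecoded beginning of the file, and it relies entirely on the premise that only cp850 files use DOS line endings:

```elisp
;; Any file whose scanned beginning contains CR LF is decoded as
;; cp850-dos. The entry is appended so that the existing entries
;; (e.g. for byte order marks) are still tried first.
(add-to-list 'auto-coding-regexp-alist
             '("\r\n" . cp850-dos)
             t)
```

This would also catch a UTF-8 or latin-9 file that happens to have DOS line endings, so it is only safe where the premise really holds.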
> Of course, you can always turn the table, and do the above for
> latin-9, while keeping cp850 in set-coding-system-priority call. It
> all depends which one of these 2 lends itself better to one of these
> methods.
>
> I believe that if one of these alternatives can do the job for you,
> the result will be much more reliable.
I also think so.
So, I have to play around a little bit to get acquainted with the
construction of regular expressions for Emacs. I will be back when I
have gained a deeper insight, or a concrete solution at best.
Meanwhile I would like to thank you, Eli Zaretskii, very much for your
time and effort that you spent to provide me with this thorough
analysis and your valuable suggestions.
Juergen
> The general problem you’re solving is that of encoding detection.
> There exist ready-made solutions for that, e.g. by computing byte
> frequencies and matching them against known character frequencies in
> your language. One of these is called enca.
>
> Googling for “emacs enca” yields a post by Dmitriyi Paduchikh in
> gnu.emacs.sources, dated 2007.
>
> https://lists.gnu.org/archive/html/gnu-emacs-sources/2007-06/msg00037.html
Using Google is always good advice, which I will gratefully follow
once more with respect to this broader background.
Actually I didn't know Enca at all until now: a language-based attempt
to recognize the encoding is an interesting idea.
Unfortunately, Enca cannot be used in my special case, because--I
didn't mention this before, sorry--the text files to handle are mostly
in English and German. For the former, encoding is not an issue,
and for the latter, the German language is not supported by Enca.
Enca 1.14 for example only supports
Belarussian
Bulgarian
Czech
Estonian
Croatian
Hungarian
Lithuanian
Latvian
Polish
Russian
Slovak
Slovene
Ukrainian
Chinese
But for people that use any of these languages this might be a
promising option.
Apart from that--and this might be helpful in my case also--the idea
of using external software to detect the encoding is very appealing, and
maybe it is possible to adapt the lisp snippets contained in your link
to other programs. E.g.
find -bi ...
is capable of identifying file encodings, although it recognizes cp850
rather non-specifically as "unknown-8bit".
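
Calling such an external program from Emacs could be sketched as follows. The helper name is hypothetical, and the sketch assumes that the file(1) utility (the command meant here, see the correction that follows) is installed and on the PATH:

```elisp
;; Hypothetical helper, not part of Emacs.
(defun my-file-mime-info (file)
  "Return the output of `file -bi FILE' as a string, or nil on failure."
  (with-temp-buffer
    ;; `call-process' inserts the program's output into this buffer
    ;; and returns the exit status.
    (when (eq 0 (call-process "file" nil t nil "-bi"
                              (expand-file-name file)))
      ;; Drop the trailing newline that `file' prints.
      (replace-regexp-in-string "\n\\'" "" (buffer-string)))))
```

The returned string would still have to be mapped to an Emacs coding system, and an "unknown-8bit" answer would need a fallback rule of its own.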
So thank you very much for your suggestions.
Juergen
Correction to my previous post: I meant
file -bi ...
instead of
find -bi ...
Juergen
So, thank you, Eli Zaretskii, for giving this solution the right twist:
>> So one could think about a regular expression
>> that matches this unique EOL pattern.
>
> A more reliable test might be characters whose codepoints are between
> 128 and 159: those should generally be absent from ISO-8859 encodings.
> (Emacs doesn't use this fact for good reasons, but in your specific
> case those reasons should not matter, I think.)
That's great: I hadn't noticed this distinctive feature. Of course this is
by far the better test: it is more specific, since it is inherently related to
the actual task. I think this is the approach to favor.
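
One caveat when turning this test into code: UTF-8 files also contain bytes in the range 128 to 159 (for example, "ß" is encoded as the bytes C3 9F), so a plain range test would misfire on them. The untested sketch below therefore only accepts such a byte when it directly follows an ASCII byte or starts the buffer, which cannot happen in valid UTF-8. The function name is hypothetical, and the sketch assumes that the functions in auto-coding-functions see the undecoded bytes of the file.

```elisp
;; Hypothetical helper; the name and the returned coding system
;; are assumptions made for this sketch.
(defun my-auto-coding-cp850 (size)
  "Return `cp850-dos' if the SIZE bytes after point look like cp850."
  (save-excursion
    (let ((end (min (point-max) (+ (point) size))))
      ;; In valid UTF-8 a byte in 0x80..0x9F only ever follows
      ;; another non-ASCII byte. Such a byte right after an ASCII
      ;; byte, or at the very start, rules out UTF-8; its mere
      ;; presence makes an ISO-8859 encoding unlikely.
      (when (re-search-forward
             "\\(?:\\`\\|[[:ascii:]]\\)[\200-\237]" end t)
        'cp850-dos))))

(add-to-list 'auto-coding-functions #'my-auto-coding-cp850)
```

With this in place, (set-coding-system-priority 'utf-8 'latin-9) could remain responsible for the other two encodings. A cp850 file whose only bytes in that range follow other non-ASCII bytes would slip through, so this is a heuristic rather than a proof.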
When my system is restored again, I will try to implement it, reporting the
findings.
Juergen