Single unrecognized character wrecks entire display

Alexandre Oberlin

Aug 18, 2012, 12:35:06 PM

Hi all,

I have a file created by translation software. The file initially
displayed well in Emacs, but after some edits in the translation software it
no longer did and reverted to raw text. I got this message when I tried
to save in utf-8:
utf-8-mac cannot encode these: \351 \350 \351 \342 \351 \234 \350 \350
\311 \240

In short, all the non-ASCII characters had become unrecognized. However, once
I removed a \234 character, which should display as œ, I could save the file
and recover a correct display in Emacs.

Is there a way to quickly spot the offending character within Emacs in
such cases?

Alexandre

Peter Dyballa

Aug 18, 2012, 6:33:25 PM

to Alexandre Oberlin, help-gn...@gnu.org

On 18.08.2012 at 18:35, Alexandre Oberlin wrote:

> Is there a way to quickly spot the offending character within Emacs in
> such cases?

Yes! When you try to save a faulty "text" in UTF-8, GNU Emacs will name the faulty codes. \234 is such a faulty code. In hex it's (U+00)9C – obviously an 8-bit control character. \240, or U+00A0, on the other hand is a text character again.

œ is U+0153 or \523. You need to update your translator.
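
For readers who want to check the notation, here is a minimal Python sketch (not part of the original exchange) that converts the octal escapes Emacs prints into the hexadecimal code points used above and reports each code point's Unicode general category. The helper name `octal_escape_to_int` is made up for illustration.

```python
import unicodedata


def octal_escape_to_int(escape: str) -> int:
    # Turn an Emacs-style octal escape (a backslash followed by octal digits)
    # into the integer it denotes.
    return int(escape.lstrip("\\"), 8)


for escape in (r"\234", r"\240", r"\523"):
    value = octal_escape_to_int(escape)
    # General categories: "Cc" = control character, "Zs" = space separator,
    # "Ll" = lowercase letter.
    category = unicodedata.category(chr(value))
    print(f"{escape} = U+{value:04X} (category {category})")
```

The same conversion works in the other direction with `oct(0x153)`.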

--
Greetings

Pete

Windows is a bit like Beaujolais nouveau: with each new vintage you know it will be disgusting, but you have some anyway, out of masochism.

Alexandre Oberlin

Aug 22, 2012, 5:36:21 AM

Thank you Peter for your answer.

> Yes! When you try to save a faulty "text" in UTF-8, GNU Emacs will name
> the faulty codes.

The problem is that it names a full list of bad characters when only one
is actually bad.

utf-8-mac cannot encode these: \351 \350 \351 \342 \351 \234 \350 \350
\311 \240

How can I spot the truly non-UTF-8 character without trying them all?

> Windows is a bit like Beaujolais nouveau: with each new vintage you know
> it will be disgusting, but you have some anyway, out of masochism.

Masochism or obligation?

Alexandre

On Sun, 19 Aug 2012 00:33:25 +0200, Peter Dyballa <Peter_...@web.de>
wrote:


Stefan Monnier

Aug 22, 2012, 11:01:04 AM

> Is there a way to quickly spot the offending character within Emacs in
> such cases?

Good question. Please send it as a feature request via
M-x report-emacs-bug.

As for an answer, C-x RET r utf-8 RET might do the trick (basically,
the single offending byte caused Emacs to decide that the file is not
using utf-8 and to read it as a binary file; so by forcing the use of
utf-8 you should get all the utf-8 encoded chars to appear correctly
and the invalid byte to appear as \234).
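
Outside Emacs, the same idea (decode as UTF-8 anyway and see which bytes fall out) can be sketched in a few lines of Python. This is only an illustration under the assumption that the rest of the file really is valid UTF-8; the function name `invalid_utf8_offsets` and the sample bytes are invented for the example.

```python
def invalid_utf8_offsets(data: bytes) -> list[tuple[int, int]]:
    """Return (byte_offset, byte_value) for each byte that is not valid UTF-8."""
    bad = []
    offset = 0
    # With errors="surrogateescape", every undecodable byte 0xNN is turned
    # into the lone surrogate U+DCNN, so it can be recognised afterwards.
    for ch in data.decode("utf-8", errors="surrogateescape"):
        code_point = ord(ch)
        if 0xDC80 <= code_point <= 0xDCFF:
            bad.append((offset, code_point - 0xDC00))
            offset += 1
        else:
            offset += len(ch.encode("utf-8"))
    return bad


# Hypothetical sample: valid UTF-8 text with one stray byte \234 (0x9C).
sample = "café ".encode("utf-8") + b"\x9c" + b"uvre"
for offset, value in invalid_utf8_offsets(sample):
    print(f"byte offset {offset}: \\{value:o}")
```

If every accented character is reported, the file is probably not UTF-8 at all but some single-byte encoding.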

Stefan

Peter Dyballa

Aug 22, 2012, 11:18:36 AM

to Alexandre Oberlin, help-gn...@gnu.org

On 22.08.2012 at 11:36, Alexandre Oberlin wrote:

> The problem is that it names a full list of bad characters when only one
> is actually bad.
> utf-8-mac cannot encode these: \351 \350 \351 \342 \351 \234 \350 \350
> \311 \240
> How can I spot the truly non-UTF-8 character without trying them all?

When you set read-quoted-char-radix to 8 you can search for these "characters" in the text by:

C-s C-q 3 5 1 RET

Hopefully! I think the problem is that your converter (can't you use something reliable like iconv or recode?) makes mistakes. \240, or A0 in hex, exists as the partner of another byte (with C2 it forms NO-BREAK SPACE, with C3 it's LATIN SMALL LETTER A WITH GRAVE, …), and \234, or 9C, forms LATIN CAPITAL LETTER U WITH DIAERESIS with C3, etc. I think what GNU Emacs wants to tell you, and what I did not understand the first time, is that some characters are evidently not encoded correctly, so that these "isolated" *bytes* are left over: they don't fit into the regular 2-, 3- or even 4-byte codes of the UTF-8 encoding – and of course none of them is an ASCII character encoded by one byte (i.e., itself).
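
The byte pairs mentioned here can be checked with a short Python sketch. It is an illustration added for clarity, not something from the original post; `decodes_as_utf8` is a made-up helper name.

```python
import unicodedata


def decodes_as_utf8(raw: bytes) -> bool:
    """Report whether the byte string is well-formed UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Two-byte sequences: a lead byte (C2 or C3) followed by a continuation byte.
for raw in (b"\xc2\xa0", b"\xc3\xa0", b"\xc3\x9c"):
    ch = raw.decode("utf-8")
    print(raw.hex(" "), "->", f"U+{ord(ch):04X}", unicodedata.name(ch))

# The same continuation bytes on their own are not valid UTF-8.
for raw in (b"\xa0", b"\x9c"):
    print(raw.hex(), "alone is valid UTF-8:", decodes_as_utf8(raw))
```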

The utf-8-mac encoding in GNU Emacs is UTF-8 that uses ^M or CR as end of line character (UNIX uses ^J or Line Feed).
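
To see which line-ending convention a file actually uses, independent of Emacs, one could count the terminators directly. The following Python sketch is only an illustration; the function name `eol_counts` and the labels are invented here.

```python
def eol_counts(data: bytes) -> dict[str, int]:
    """Count line terminators by convention in raw file contents."""
    crlf = data.count(b"\r\n")
    return {
        "dos (CR LF)": crlf,
        # A CR or LF that is part of a CR LF pair is not counted again.
        "mac (CR)": data.count(b"\r") - crlf,
        "unix (LF)": data.count(b"\n") - crlf,
    }


# Hypothetical sample with bare CR line endings.
print(eol_counts(b"first line\rsecond line\rthird line\r"))
```

For a real file, pass the result of `open(path, "rb").read()`, where `path` is whatever file you want to inspect.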

Can you give us some more details about the original source and the converter, and how it works (command line options)? How do you open the file in GNU Emacs? How does it behave when you launch GNU Emacs with -Q, i.e., with none of your possibly faulty customisation? For example, from the command line:

env LC_CTYPE=UTF-8 LANG=fr_FR.UTF-8 emacs -Q &

or

env LC_CTYPE=UTF-8 LANG=fr_FR.UTF-8 /Applications/Emacs.app/Contents/MacOS/Emacs -Q &

GNU Emacs should then automatically switch to some UTF-8 encoding – whether the line endings are Apple, UNIX or MS should not matter much. If the input is faulty, you should see searchable octal codes.

--
Greetings

Pete

A lot of us are working harder than we want, at things we don't like to do. Why? ...In order to afford the sort of existence we don't care to live.
– Bradford Angier

Alexandre Oberlin

Aug 24, 2012, 9:13:45 AM

> C-x RET r utf-8 RET might do the trick

Thanks Stefan, but this does not work.

Alexandre

Alexandre Oberlin

Aug 24, 2012, 9:46:47 AM

Thank you Peter for your detailed reply.

On Wed, 22 Aug 2012 17:18:36 +0200, Peter Dyballa <Peter_...@web.de>
wrote:

> When you set read-quoted-char-radix to 8 you can search for these
> "characters" in the text by:
>
> C-s C-q 3 5 1 RET
>

Nice command for finding any other occurrences once you’ve found the
culprit!

> Hopefully! I think the problem is that your converter (can't you use
> something reliable like iconv or recode?) makes mistakes.

iconv acts just the same. It tells me the 13th character is faulty (\351),
while only the 40th is (\234).

> \240 or A0 in hex exists as partner of another byte (with C2 it
> constructs NO-BREAK SPACE, with C3 it's LATIN SMALL LETTER A WITH GRAVE,
> …), \234 or 9C builds with C3 LATIN CAPITAL LETTER U WITH DIAERESIS etc.
> I think what GNU Emacs wants to tell you and what I did not understand
> the first time is, that some characters obviously are not encoded
> correctly so that these "isolated" *bytes* are left over, they don't fit
> into regular 2- or 3- or even 4-byte codes of the UTF-8 encoding – and
> of course none of them is an ASCII character encoded by one byte (i.e.,
> itself).

Clear.

> Can you give us some more details of the original source and the
> converter, and its working principle (command line options)?

It is the target document output of Trados Studio 2009 (well-known
translation software running on Windows only). The original is in English
and has no such problems, but opens in Emacs as raw text Mac.

> How do you open it in GNU Emacs?

C-x C-f

> How does it behave when you had launched GNU Emacs
>
> env LC_CTYPE=UTF-8 LANG=fr_FR.UTF-8 emacs -Q &

Same.

I’m using GNU Emacs on Linux, and on Cygwin when I use Trados. The behavior
is the same on both.

Cheers,

Alexandre

Alexandre Oberlin

Aug 24, 2012, 10:54:04 AM

On Fri, 24 Aug 2012 15:46:47 +0200, Alexandre Oberlin <ple...@nospam.com>
wrote:

> The original is in English and has no such problems, but opens in emacs
> as raw text Mac.

Wrong! Actually the original also has offending characters in geographic
names:
\207 and \222.

This is getting very weird.
There are 2 different issues:
1. Will the file display correctly?
2. Will it save in utf-8?
I could save in utf-8 only by manually specifying utf-8-mac when saving
(changing the buffer’s mode to utf-8-mac is not enough). However, the
display remains garbled.
I could get the other accented characters to display correctly only by
erasing \207 and \234, though \222 did no harm in that regard.
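
One way to investigate bytes like these is to look at what they would mean under a few single-byte encodings. The Python sketch below does that; the candidate list is purely a guess for illustration, since nothing in this thread establishes which encoding the translation software actually wrote, and `show_candidates` is a made-up helper name.

```python
# Candidate encodings are an assumption for illustration only.
CANDIDATES = ("latin-1", "cp1252", "mac_roman")


def show_candidates(byte_value: int) -> dict[str, str]:
    """Decode a single byte under each candidate encoding."""
    result = {}
    for encoding in CANDIDATES:
        try:
            result[encoding] = bytes([byte_value]).decode(encoding)
        except UnicodeDecodeError:
            # Some encodings leave certain byte values undefined.
            result[encoding] = "<undefined>"
    return result


for octal in ("207", "222", "234"):
    value = int(octal, 8)
    decoded = {enc: repr(ch) for enc, ch in show_candidates(value).items()}
    print(f"\\{octal} (0x{value:02X}):", decoded)
```

If one candidate turns every offending byte into a plausible character for the surrounding word, that is a hint (not proof) about the file's real encoding.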

Alexandre

Peter Dyballa

Aug 24, 2012, 11:01:49 AM

to Alexandre Oberlin, help-gn...@gnu.org

On 24.08.2012 at 15:46, Alexandre Oberlin wrote:

> iconv acts just the same. It tells me the 13th character is faulty (\351), while only the 40th is (\234)

Alexandre,

I think you're making the same mistake here as I did before! \351 is not the number of the character in Unicode but a byte in the UTF-8 stream. The UTF encodings are multi-byte encodings, so the byte \351 cannot stand on its own for the character \351 (233 decimal, or E9 hexadecimal). Iconv and GNU Emacs evidently find single isolated bytes spread through the text. This could also explain the different counts: characters vs. bytes (13th vs. 40th).
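
The characters-versus-bytes distinction can be illustrated with a small Python sketch. It only demonstrates the general point that positions diverge once a text contains multi-byte characters; whether that accounts for the 13th/40th discrepancy in this file is not established here. The helper `byte_offset_of_char` and the sample string are invented for the example.

```python
def byte_offset_of_char(text: str, char_index: int, encoding: str = "utf-8") -> int:
    """Return the 0-based byte offset at which the character at char_index starts."""
    return len(text[:char_index].encode(encoding))


# Hypothetical sample text with several accented characters.
sample = "Référence générale"
for index, ch in enumerate(sample):
    if ord(ch) > 127:
        byte_position = byte_offset_of_char(sample, index) + 1
        print(f"character {index + 1} ({ch}) starts at byte {byte_position}")
```

In a single-byte encoding such as Latin-1 the two counts always coincide, which `byte_offset_of_char(sample, index, "latin-1")` would show.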

Could you try a native MS Losedos GNU Emacs?
Could you send me such a translation output privately, before GNU Emacs or iconv has changed anything? Could it be that this output is not plain text but some structured format containing the odd bytes you mentioned initially, which might switch fonts or emphasis, or mark where a paragraph ends or a footnote starts?

--
Greetings

Pete

"By filing this bug report you have challenged the honor of my family. Prepare to die!"

Peter Dyballa

Aug 24, 2012, 11:08:59 AM

to Alexandre Oberlin, help-gn...@gnu.org

On 24.08.2012 at 16:54, Alexandre Oberlin wrote:

> On Fri, 24 Aug 2012 15:46:47 +0200, Alexandre Oberlin <ple...@nospam.com> wrote:
>> The original is in English and has no such problems, but opens in emacs as raw text Mac.
> Wrong! Actually the original also has offending characters in geographic names:
> \207 and \222.

This could be easily cured: pass it to a converter which can remove inappropriate bytes. Iconv has the -c option. You could use it to convert from UTF-8 to UTF-8-MAC, from the untouched original file to a working copy. Then translate the working copy.
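
For comparison, a roughly similar clean-up can be sketched in Python. This is not how iconv is implemented and may differ from it in detail; it simply discards every byte that does not form valid UTF-8. The function name `drop_invalid_utf8` is made up for the example.

```python
def drop_invalid_utf8(data: bytes) -> bytes:
    """Return a copy of data with bytes that are not valid UTF-8 removed."""
    # errors="ignore" silently skips undecodable bytes during decoding.
    return data.decode("utf-8", errors="ignore").encode("utf-8")


# Hypothetical sample: valid UTF-8 text with one stray byte \234 (0x9C).
dirty = "café".encode("utf-8") + b"\x9c" + b"!"
print(drop_invalid_utf8(dirty).decode("utf-8"))
```

Note that this loses information: a dropped byte such as \234 may have stood for a real character (œ, according to the first message), so it is safer to run this on a working copy and keep the original untouched.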

--
Greetings

Pete

Build a man a fire and he'll be warm for a night, but set a man on fire and he'll be warm for the rest of his life.

Stefan Monnier

Aug 24, 2012, 11:07:39 PM

>> C-x RET r utf-8 RET might do the trick
> Thanks Stefan, but this does not work.

This operation should work, so if it "does not work" you probably found
a bug. If it did do something, but just not what you wanted, then it
would help to spell out what it did and in which way this didn't satisfy
your needs.

>> Good question. Please send it as a feature request via
>> M-x report-emacs-bug.

And this suggestion was really the most important one: IIUC you have
solved your immediate problem and are looking for a way to avoid such
pain in the future, so M-x report-emacs-bug is one of the best ways to
do that.

Stefan