Baffling behaviour with some non-ASCII symbols


Ian Fantom

May 11, 2025, 11:46:09 AM
to latexus...@googlegroups.com

I have a curious anomaly.

I am still using XeLaTeX, and polyglossia for Esperanto. I presume this could apply to any language or UTF-8 file using non-ASCII symbols.

I compiled a booklet successfully, with each chapter in a separate .tex file, and including them with eg:

\input{uea_enketo_enkonduko.tex}

Everything worked fine. I did a preprint with Amazon kdp.

Then I added a chapter in exactly the same way as I had done with the others:

\input{antaux_kaj_post_la_enketo.tex} % 2025-05-11

The first few lines of the file are:

- -------------

\chapter[Antaŭ kaj Post la Enketo]{Antaŭ kaj Post la Enketo}

Test: ĉĥĝĵŝŭ

Antaŭ kaj Post la Enketo

Ian Fantom

- ----------------

I rebuilt, and the pdf was produced just fine:

However, the display of the tex file in the editor had changed to:

- -----------

\chapter[AntaÅ­ kaj Post la Enketo]{AntaÅ­ kaj Post la Enketo}

Test: Ä‰Ä¥Ä ÄµÅ Å­

AntaÅ­ kaj Post la Enketo

Ian Fantom
- --------------

I can't edit this with the editor using ĉĥĝĵŝŭ from the keyboard!

This applies only to this one file. If I copy a para from this .tex file and paste it onto the main file, or onto another file, then the special characters do show properly in the editor.

I've tried removing the '\chapter' line and get the same result. I've tried removing the '.tex' extension and get the same result. I'm at a loss to see any difference whatsoever between this file and the files for the other chapters. It seems some transliteration program has intervened to produce 8-bit sequences instead of the Unicode letters.

I'm baffled!

Regards,

Ian


Peter Flynn

May 11, 2025, 7:53:23 PM
to latexus...@googlegroups.com
On 11/05/2025 16:45, Ian Fantom wrote:
> I have a curious anomaly.

Don't we all :-)

> I am still using XeLaTeX, and polyglossia for Esperanto. I presume this
> could apply to any language or UTF-8 file using non-ASCII symbols.

Pretty much.

[...]
> However, the display of the tex file in the editor had changed to:
>
> - -----------
>
> \chapter[AntaÅ­ kaj Post la Enketo]{AntaÅ­ kaj Post la Enketo}

I have seen this happen before.

Something, somewhere, has changed the encoding of your file from UTF-8
to some obsolescent encoding like Windows-1252, Mac Roman 8, or DEC
Multinational.

You're using Linux, amirite? You can find out the encoding by typing

file antaux_kaj_post_la_enketo.tex

> I can't edit this with the editor using ĉĥĝĵŝŭ from the keyboard!

Be very careful. Something, somewhere on your system has rewritten the
file into this encoding.

Have you at any time opened and saved the file in an editor that you
don't normally use?
Have you changed editor recently?
Have you done anything to the file with any of the regular Unix text
utilities (sed, awk, etc)? They are nowadays encoding-sensitive, but
some features just read 8-bit bytes, not multibyte characters.

> This applies only to this one file.

You can use the 'file' command on the other files and see their
encoding. It should be Unicode UTF-8.

> If I copy a para from this .tex file and paste it onto the main
> file, or onto another file, then the special characters do show
> properly in the editor.

What do you do to do this? You said above you couldn't edit this file.

Do you have both documents open in different panes or windows in the
same editor?
Or does your editor run a separate instance to open another file?
Or are you opening one file, copying the text, closing the file, then
opening another file and pasting the text?

> I've tried removing the '\chapter' line and get the same result. I've
> tried removing the '.tex' extension and get the same result.

It has nothing to do with any of that, or with LaTeX.

> It seems some transliteration program has
> intervened to produce 8-bit sequences instead of the Unicode letters.

The intervening program has changed the file encoding so that an editor
will no longer see the multibyte characters as such, but will interpret
each byte as a separate character in one of the older encoding systems.

You might be able to change it all back using the iconv utility, but you
need to use the file command first to see what encoding the file is
currently in, and check what encoding all the other files are. Assuming
the file has been put into ISO 8859-1 (Latin-1), and the others are all
UTF-8, you can type

iconv -f ISO-8859-1 -t UTF-8 -o test.tex antaux_kaj_post_la_enketo.tex

Check that test.tex is editable and has all the right characters, then
edit your workflow to use test.tex and check the book compiles OK, then
finally, copy test.tex to antaux_kaj_post_la_enketo.tex, overwriting the
broken copy.
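For reference, the same repair can be sketched in Python (a hypothetical helper of my own, assuming as above that the broken file really is ISO-8859-1 and the target is UTF-8):

```python
from pathlib import Path

def latin1_to_utf8(src: str, dst: str) -> None:
    """Re-encode a file from ISO-8859-1 to UTF-8,
    like `iconv -f ISO-8859-1 -t UTF-8`."""
    # Latin-1 maps every byte 0x00-0xFF to exactly one character,
    # so this decode can never fail.
    text = Path(src).read_bytes().decode("iso-8859-1")
    Path(dst).write_bytes(text.encode("utf-8"))
```

As with iconv, write to a fresh file and check it before overwriting the original.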

I'd be interested to know the output of the file commands on the various
files.

Peter

Ian Fantom

May 12, 2025, 6:23:56 AM
to latexus...@googlegroups.com

Many thanks, Peter. I wasn't expecting such a knowledgeable reply - it was such an unexpected and weird problem! So it's interesting that you've met something similar before.

I tried:

$ file antaux_kaj_post_la_enketo.tex
antaux_kaj_post_la_enketo.tex: LaTeX document, UTF-8 Unicode text, with very long lines

> If I copy a para from this .tex file and paste it onto the main
> file, or onto another file, then the special characters do show
> properly in the editor.
You: What do you do to do this? You said above you couldn't edit this file.

vi antaux_kaj_post_la_enketo.tex

Then use the right mouse button to copy a para, and paste it into the tex editor with the right button.

Ah - I've just seen an ambiguity: Right button, CTRL-v, Right button no formatting. I'll try all of those. ... I got exactly the same with:

TEST: cp using cntrl-v:
Test: ĉĥĝĵŝŭ

TEST: cp using right mouse click 'Paste': Test: ĉĥĝĵŝŭ

TEST: cp using right mouse click 'Paste as Latex': Test: ĉĥĝĵŝŭ

I typed the 'Test: ĉĥĝĵŝŭ' directly into the file using vi, before copying.

I haven't used any other utilities on the file. I created it by 'Save as' in LibreOffice Writer, then I tried again by reading the file in the text editor Mousepad. The result was the same. I tried playing with the end-of-line marker on saving: LF | CR LF. Same result.

I'm still baffled!

Best wishes,

Ian

Ian Fantom

May 12, 2025, 6:32:40 AM
to latexus...@googlegroups.com
I've just tried copying a test text from the tex file using vi, back to
the test file (using vi). The result is the same as before: appears as
ASCII characters in the tex editor, but as proper Unicode characters in
the pdf.

Regards,

Ian


Peter Flynn

May 12, 2025, 11:10:26 AM
to latexus...@googlegroups.com
On 12/05/2025 11:23, Ian Fantom wrote:
> Many thanks, Peter. I wasn't expecting such a knowledgeable reply -
> it was such an unexpected and weird problem! So it's interesting
> that you've met something similar before.

There was a period during which there were a lot of people using old
system software and old editing software which assumed ISO-8859-1, then
moving to new versions which assumed UTF-8, so there were quite a lot of
instances.

Your problem is curious because (I assume) nothing changed in the
background, so it must be down to some version of something that doesn't
support UTF-8.


> $ file antaux_kaj_post_la_enketo.tex
> antaux_kaj_post_la_enketo.tex: LaTeX document, UTF-8 Unicode text, with
> very long lines

Oh. So as far as the operating system is concerned, it *is* a UTF-8
document.

That means whatever software processed it did not ask the OS at file
open time what the encoding was, but just blundered ahead reading bytes,
not characters.

> You: What do you do to do this? You said above you couldn't edit this file.
>
> vi antaux_kaj_post_la_enketo.tex

That would definitely be a candidate, except that last time I asked, vi
was said to be fully UTF-8 compliant. And TBH I'd be very surprised if
it wasn't...but I don't know from which version their UTF-8 compliance
started.

> Then use the right mouse button to copy a para, and paste it into the
> tex editor with the right button.

Which TeX editor are you using?
A Ctrl-C...Ctrl-V copy and paste should work. If it doesn't then the
fault lies with the operating system.

> Ah - I've just seen an ambiguity: Right button, CTRL-v, Right button no
> formatting.

I'm not familiar with right button = copy. Do you mean click (left) and
select (highlight) the text (in vi) and then right-click performs a copy?

> I'll try all of those. ... I got exactly the same with:
>
> TEST: cp using cntrl-v:
> Test: ĉĥĝĵŝŭ
>
> TEST: cp using right mouse click 'Paste': Test: ĉĥĝĵŝŭ
>
> TEST: cp using right mouse click 'Paste as Latex': Test: ĉĥĝĵŝŭ
>
> I typed the 'Test: ĉĥĝĵŝŭ' directly into the file using vi, before copying.

Looks like pasting is OK, so maybe the culprit is copying.

> I haven't used any other utilities on the file. I created it by 'Save
> as' in LibreOffice Writer, then I tried again by reading the file in the
> text editor Mousepad. The result was the same. I tried playing the the
> end-of-line marker on saving: LF | CR LF. Same result.
>
> I'm still baffled!

It sounds like the text has been on a long journey. I have so far
avoided this in recent times by checking and double-checking beforehand
that the software path was going to be UTF-8 for all saves and reads.
But sometimes they just trip you up, or the encoding is a drop-down
initially set to something other than UTF-8 and it's up to you to notice
that it's there (LibreOffice Calc saving CSV is a prime example).

On 12/05/2025 11:32, Ian Fantom wrote:
> I've just tried copying a test text from the tex file using vi, back
> to the test file (using vi). The result is the same as before:
> appears as ASCII characters in the tex editor, but as proper Unicode
> characters in the pdf.
I'd need to see an example of the ASCII characters because AFAIK you
can't encode anything non-ASCII with just ASCII: the signal character is
always going to be >127 decimal (i.e. non-ASCII). So for example, an
e-acute (é) in UTF-8 is the two bytes 0xC3 0xA9, which is C3 (capital A
tilde, Ã) plus A9 (copyright sign, ©), whereas e-acute in ISO-8859-1 is
just the single byte 0xE9.

A file just stores bytes, and it's only the file attribute which
provides the encoding, so if the file is UTF-8 and has the bytes 0xC3
0xA9 for é, then a non-compliant editor will indeed display Ã© because
it never asks what encoding to use. Alternatively, if the file was
(wrongly) stored with the ISO-8859-1 flag set but still contained those
two bytes, the editor would likewise display Ã©.

BUT...once it gets passed to XeLaTeX or LuaLaTeX using fontspec, those
two bytes get (correctly) interpreted as a single é character, so the
PDF comes out OK.
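Peter's point can be checked in a few lines of Python (my illustration, not from the thread): the UTF-8 bytes for ŭ, read back as Latin-1, produce exactly the "AntaÅ­" garbling seen in the editor, and because the mis-reading is lossless it can be undone:

```python
text = "Antaŭ"
raw = text.encode("utf-8")       # ŭ (U+016D) becomes the two bytes 0xC5 0xAD
garbled = raw.decode("latin-1")  # each byte misread as one Latin-1 character
print(garbled)                   # AntaÅ­  (0xC5 = Å, 0xAD = soft hyphen)

# Reverse the mis-reading to recover the original text:
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)                  # Antaŭ
```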

Have you tried other TeX editors?

Peter

Ian Fantom

May 12, 2025, 4:39:59 PM
to latexus...@googlegroups.com

I've just tried another test. I copied a section on screen from LibreOffice Writer directly into the tex file in the TeX editor. It showed correctly in the TeX editor, but the special characters appeared as question marks in the pdf file.

I've just been looking at the man page for 'file'. It seems there's no flag to say what the encoding is, but 'file' figures it out from the text, and concludes that my file is UTF-8. I wonder how the Tex editor does it.
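Since a plain file carries no encoding flag, `file` (and presumably the editor) can only guess by scanning the bytes. A minimal sketch of that kind of sniffing (my own illustration, not `file`'s or TeXstudio's actual algorithm):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Guess whether raw bytes are valid UTF-8 by trying to decode them."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_utf8("Test: ĉĥĝĵŝŭ".encode("utf-8")))  # True
print(looks_like_utf8(b"caf\xe9"))  # False: a lone 0xE9 is invalid UTF-8
```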

My tex editor is:

TeXstudio 2.12.22 (Build: 2.12.22+debian-1build1) Using Qt Version 5.12.8, compiled with Qt 5.12.5 R

Perhaps the Tex editor guesses from the contents, too, and comes to a different conclusion for that particular file.

I'll look for any difference that might have thrown the Tex equivalent of 'file'.

I haven't yet ventured into trying out other Tex editors!

Another observation is that the Tex editor rewrites the file using '^M^M' for line endings rather than '^L' or '^M^L'. I'm not sure what the implication is, but it might mean it thinks it's a Microsoft format? The '^M^M' in the Unix file doesn't create a new line, and so the whole document appears as one paragraph with the occasional 'MM' shown in blue.

Regards,

Ian

Peter Flynn

May 12, 2025, 7:26:59 PM
to latexus...@googlegroups.com
On 12/05/2025 21:39, Ian Fantom wrote:
> I've just tried another test. I copied from the screen a section
> from the LibreOffice Writer directly to the tex file in the Tex
> editor. It showed correctly in the Tex editor, but the special
> characters appeared as question marks in the pdf file.

It sounds as if the editor is interpreting the file contents differently.

> I've just been looking at the man page for 'file'. It seems there's
> no flag to say what the encoding is, but 'file' figures it out from
> the text, and concludes that my file is UTF-8. I wonder how the Tex
> editor does it.

I may have misunderstood the way in which the operating system does
this. The file command does it by interpretation, but (in my limited
understanding) when an editor or other program (eg TeX) opens a file,
the file-open function can provide information about the file encoding.
Perhaps this is the information that the file command accesses.

> Perhaps the Tex editor guesses from the contents, too, and comes to a
> different conclusion for that particular file.

If that is the case, I don't know how to resolve the problem.

> Another observation is that the Tex editor rewrites the file using
> '^M^M' for line endings rather than '^L' or '^M^L'. I'm not sure what
> the implication is, but it might mean it thinks it's a Microsoft format?

This is very weird. I have never seen that line-end sequence before. The
normal ones are:

CR on Mac OS (^M)
LF on Unix/Linux (^J)
CRLF on Windows (^M^J)

I hope you don't mean Ctrl-L, as that's a formfeed, not a line
terminator. It's used as a page separator in many systems.

I do know at least one editor (I think it was an old DOS editor) that
used ^J^M (LFCR) as the line-end, much to everyone's confusion.

> The '^M^M' in the Unix file doesn't create a new line, and so the whole
> document appears as one paragraph with the occasional 'MM' shown in blue.

Right. CRCR is bogus. On a Mac, it's a double line-end.

Peter

Ian Fantom

May 13, 2025, 5:33:40 AM
to latexus...@googlegroups.com
Thanks for confirming that I'm not just going crazy!

I tried removing anything in the tex file that may not have been present
in previous tex files (using vi):

backslashes, rounded double and single quote marks, and all the ^M's.

No change. I tried adding to the first line '% UTF-8 utf-8 utf8' to give
any automatic detector a clue! No change.

I looked up the TexStudio website, and found
(https://texstudio-org.github.io/configuration.html):

- -----------------

Configuring the editor

You may change the default encoding for new files (“Configure TeXstudio”
-> “Editor” -> “Editor Font Encoding”) if you don’t want utf8 as
encoding. Don’t forget to set the same encoding in the preamble of your
documents. (e.g. \usepackage[utf8]{inputenc}, if you use utf-8).

TeXstudio can auto detect utf-8 and latin1 encoded files, but if you use
a different encoding in your existing documents you have to specify it
in the configuration dialog before opening them. (and then you also have
to disable the auto detection)

- ------------------------

My default was already set to utf8. I disabled 'auto detection' on the
configuration page.

Same result.

I commented out the offending file in the main tex file (using
TeXstudio), rebuilt, then reinstated it, just in case each run was
depending on the results of the encoding detector from the previous run.
No change.

So I'm about to contact the TexStudio people to see what they have to say!

Cheers,

Ian

Peter Flynn

May 13, 2025, 9:03:43 AM
to latexus...@googlegroups.com
On 13/05/2025 10:33, Ian Fantom wrote:
> Thanks for confirming that I'm not just going crazy!

:-)

> I tried removing anything in the tex file that may not have been present
> in previous tex files (using vi):
>
> backslashes, rounded double and single quote marks,

The backslashes are ASCII characters which are identical in UTF-8 and
ISO-8859-1, so removing those won't make any difference. (The curly
quotes are non-ASCII, though, so removing them was a reasonable test.)

> and all the ^M's.

All the double ones, anyway.

> No change. I tried adding to the first line '% UTF-8 utf-8 utf8' to give
> any automatic detector a clue! No change.

To do that, there is a specific format for the top two lines of the file:

% !TEX TS-program = lualatex
% !TEX encoding = UTF-8 Unicode

>> You may change the default encoding for new files (“Configure
>> TeXstudio” -> “Editor” -> “Editor Font Encoding”) if you don’t
>> want utf8 as encoding.

That makes it sound as if UTF-8 is the default, which is what we want.

>> Don’t forget to set the same encoding in the preamble of your
>> documents. (e.g. \usepackage[utf8]{inputenc}, if you use utf-8).

Irrelevant if you use fontspec because inputenc and fontenc aren't used.

>> TeXstudio can auto detect utf-8 and latin1 encoded files, but if
>> you use a different encoding in your existing documents you have
>> to specify it in the configuration dialog before opening them.
>> (and then you also have to disable the auto detection)

But you're not, are you? Did you run the file command on all your LaTeX
documents to see if they are all UTF-8?

>> My default was already set to utf8. I disabled 'auto detection' on
>> the configuration page.
>
> Same result.

Sounds OK.

> I commented out the offending file in the main tex file (using
> TexStudio), rebuilt, then reinstated it, just in case the each run was
> depending on the results of the encoding detector from the previous run.
> No change.

Good test but I don't think anything like that gets passed to the next
run EXCEPT generated files that get reprocessed, eg

.toc, .lof,. and .lot files (contents)
.ind and .idx files (indexes)
.gls and .glo files (glossaries)

If you're using makeindex you might check to see if that is picking up
something from the index files. Sometimes these generated files lack
enough of a signature to be detected as UTF-8 (or indeed anything), so
they may default to something unwanted. Even one isolated ISO-8859-1
8-bit character (invalid in UTF-8) could be enough to make the system
treat the whole file as ISO-8859-1.
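A hypothetical way to hunt for such a stray byte (a diagnostic sketch of my own, not something any TeX tool runs) is to ask where UTF-8 decoding first fails:

```python
def first_invalid_utf8(data: bytes):
    """Return the offset of the first byte that breaks UTF-8 decoding,
    or None if the whole buffer decodes cleanly."""
    try:
        data.decode("utf-8")
        return None
    except UnicodeDecodeError as exc:
        return exc.start

print(first_invalid_utf8("Antaŭ kaj Post".encode("utf-8")))  # None
print(first_invalid_utf8(b"ok \xe9 oops"))                   # 3
```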

> So I'm about to contact the TexStudio people to see what they have to say!

Could be worth a try, but first of all make sure all the files are UTF-8
(use iconv on the suspect file to convert it) and then reprocess the
whole book from the command line (eg using latexmk) so that you don't
use the editor at all, and check that it is OK.

Peter

Ian Fantom

May 13, 2025, 1:16:41 PM
to latexus...@googlegroups.com
Eureka! I did what you suggested in using iconv:

$ iconv -t UTF-8 < antaux_kaj_post_la_enketo.tex.txt > output.txt

(antaux_kaj_post_la_enketo.tex.txt is as converted from LibreOffice
Writer in utf-8)

copied output.txt to the original antaux_kaj_post_la_enketo.tex

rebuilt using the editor, and all is perfect!

The special characters appear correctly both in the editor and in the
pdf file.

So it does appear that there was something weird in the encoding.

At least it appears that I can now overcome the problem when it arises,
though I don't understand the logic of it. Changing the configuration
options didn't seem to do anything, when it should have done.

And yes, all my .tex files did test as UTF-8.

Actually, the book I'm getting out is a minimal one - a shortish report,
with no index - chosen partly to test out the system!

Many thanks for your efforts, Peter.