Encoding remarks

Merciadri Luca

Sep 27, 2010, 5:31:08 AM

Hi,

Here is a text I wrote some months ago about encoding, LaTeX, and
BibTeX. It is divided into three `big' parts:

1) Inputenc package;
2) Proper encoding of the document file;
2.1) Directly writing characters without commands,
2.2) Converting a file to the good encoding;
3) Dealing with BiB files.

Here is the text. If you have something to add, to correct, etc.,
please tell me! Thanks.

==
If you are collaborating with other people who use different OSes, or
simply migrating from one platform to another, you may have trouble
with accents and special characters, especially if the language is
French, Polish, or another language which makes extensive use of
special characters.

As Computer Science is, like every science, complicated, people
sometimes mix up the words which are related to *encoding*. We shall
give here a brief summary of how to deal with encoding and
LaTeX.


1) The inputenc Package: the Encoding of the Document
-----------------------------------------------------
Every LaTeX document should have, in its preamble,

===
\usepackage[encoding]{inputenc}
===

According to Mc Kichan, the inputenc package maps certain characters
to their corresponding TeX macros according to the encoding option you
select. On a standard Linux platform, you may replace *encoding* by
*utf8x*, to use the e*x*tended UTF-8 character set.

Note that the *utf8x* option asks *inputenc* to load the *ucs*
package, which is no longer maintained. Consequently, a compromise has
to be found between *utf8* and *utf8x*: if *utf8x* is not needed
(i.e. *utf8* is sufficient), you may only use *utf8*. However, if
*utf8x* always works, do not change!


On Microsoft Windows, users tend to use ISO-8859-1, which is commonly
referred to as Latin-1. This one is generally intended for ``Western
European'' languages. In this case, you need to replace encoding by
latin1.

On Microsoft Windows, the character encoding of the files is often
CP1252 (Windows-1252), not CP1251. CP1251 is an eight-bit character
encoding designed to cover languages that use the Cyrillic alphabet,
such as Russian, Bulgarian, and Serbian Cyrillic; it is not suited to
languages such as French. CP1252 is largely compatible with latin1: it
differs only in the range 0x80-0x9F, where latin1 has control
characters and CP1252 has printable ones (e.g. the euro sign and
typographic quotes), so the two seldom clash.

In modern applications, the ``Unicode'' standard is the preferred
character set. Consequently, it is recommended to always use utf8x,
whatever the platform, as it provides the best (i.e. the most
complete) set of characters. Sticking with utf8x means you never have
to change your character encoding, as utf8x is the future, for many
reasons which are beyond the scope of this booklet. To use utf8x as
the encoding, simply give it as the option of inputenc.


2) The Proper Encoding of the Document File
-------------------------------------------
To avoid clashes, the best thing is to keep your document in the same
encoding as the encoding encoding, which is the option of inputenc. It
is easier to do when you are working with Linux.

2.1) Directly Writing Characters Without Commands
-------------------------------------------------
Directly writing characters without their associated commands (for
example ``\'e'' written with an e and an acute accent) is strongly
discouraged. With examples such as the ``\'e'' there is no problem, but
when you begin using French quotes like « Mot », directly typing
« and », they may never be rendered, cause errors, etc., depending on
your local implementation. There are actually three kinds of people:

a. Those who stick with commands. These are the best ones: commands
will always be valid, and, if they are deprecated one day, using
renewcommand or other structures will pose no problem,
b. Those who use commands only when necessary. These are people who
try to see which characters from their keyboard are directly rendered
and which are not; they typeset the former directly and use commands
for the latter. This is not the best approach, as it is extremely
tedious, difficult, error-prone, and not the aim of LaTeX,
c. Those who use various tricks to make LaTeX behave as they want,
even if they want things that are contrary to the
state-of-the-art. This is not the best solution, as their commands
have a great chance of becoming useless later.

Let's take an example: if | is not implemented in your architecture,
it is sure that, for example, |x| will be considered as x. But if you
put bars around the x, there must be a reason. Consequently, it is
better for these bars not to disappear from your page. If you want to
be sure that they will always be there, a better idea is to use
commands: write \lvert x \rvert.

The best thing which can be recommended is evidently to stick with
commands, as demonstrated above. One sometimes needs to include
packages for other symbols, but this is better than using the two
other approaches, namely using commands only when necessary, and
using various nonstandard tricks.

2.2) Converting a File to the Good Encoding
-------------------------------------------
If, say, you are dealing with documents in another encoding, the best
thing is to use the following procedure, assuming you are working
with Linux (or with such a virtual machine):

a. Find their current encoding,
b. Know what their future encoding will be,
c. Be sure that the encodings are compatible (i.e. that both
character sets contain the same symbols, even if they can be
expressed differently). If this is not the case, you will lose
information,
d. Execute the following, where fileinoldencoding.tex is a sample
file and fileinnewencoding.tex is the same file in its new encoding:
===
iconv -f oldencoding fileinoldencoding.tex -o fileinnewencoding.tex
===
where oldencoding could be, for example, windows-1252. This is the
same as redirecting the output using
===
iconv -f oldencoding fileinoldencoding.tex > fileinnewencoding.tex
===
You might make this process automatic, e.g. by creating a shell script
(here, bash):
===
#!/bin/bash
LIST=`ls *.tex`
for i in $LIST;
do iconv -f windows-1252 $i -o $i.utf8;
mv $i.utf8 $i;
done
===
and executing this file in a folder containing .tex files. Here, the
new files' names will contain a .utf8 extension. You can evidently
modify this script or the aforementioned commands as you want, to use
another encoding. By default, the encoding is utf8, but if you want
encoding newencoding, use
===
-t newencoding
===,
e. Open the file in an editor, the editor being set to open files in
the output encoding,
f. If you see strange characters, there is a problem: check the
procedure. If everything seems normal, you can modify the file and
save the modifications, keeping everything in the new encoding,
g. Compile the file(s) with the correct inputenc declaration, as
explained above.
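
Steps c, d and f above can be sketched as a small shell function. This
is only an illustration under assumptions: the function name
convert_checked and the file names are hypothetical, the source
encoding must already be known, and the round-trip comparison only
shows that the conversion was reversible, not that the assumed source
encoding was the right one.

```shell
#!/bin/sh
# convert_checked FROM TO INFILE OUTFILE   (hypothetical helper)
# Converts INFILE from encoding FROM to encoding TO, then converts the
# result back and compares it byte for byte with the original.
convert_checked() {
    from=$1 to=$2 src=$3 dst=$4
    # Give both -f and -t explicitly; redirection leaves INFILE intact.
    iconv -f "$from" -t "$to" "$src" > "$dst" || return 1
    # Round trip: if nothing was lost, we get the original bytes back.
    iconv -f "$to" -t "$from" "$dst" | cmp -s - "$src"
}

# Example: a one-line file containing e-acute (byte 0xE9) in windows-1252.
printf 'caf\351\n' > sample-cp1252.tex
if convert_checked WINDOWS-1252 UTF-8 sample-cp1252.tex sample-utf8.tex; then
    echo "conversion looks lossless"
else
    echo "conversion failed or lost information"
fi
```

Writing to a second file, rather than renaming over the original,
keeps the source available until the result has been checked in an
editor.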

3) Dealing With BiB Files
-------------------------
If you are using BibTeX, you may also have problems, especially if you
are switching from Microsoft Windows to Linux, or dealing with files
moving between both OSes. The main problem is that BibTeX is not
really good at dealing with Unicode (and consequently utf-8,
etc.). Consequently, the best suggestion, if you do not want to use
alternatives to BibTeX, is to

a. Follow the same recommendations as before for characters: always
write \'e for ``\'e'' and the same for other ones,
b. Either keep the .bib file in latin1 encoding, and consequently use
===
\begingroup
\inputencoding{latin1}
\bibliography{bibliography}
\endgroup
===
in your .tex file, or convert your .bib file to utf-8. In this case,
you need neither the inputencoding declaration nor the group
markers. Note that it is really important to use \'e and other such
commands before the conversion. If you do not work this way,
strange characters are likely to appear after the conversion. That is
one of the reasons which justify the use of such commands. If you
make a document under Microsoft Windows, there will not be any
problem as long as you keep the same encoding,
c. Note that the same remark applies even if you keep your file in
latin1 and use the above code: use \'e for ``\'e'' and other
commands to display related symbols. If you stick with Microsoft
Windows, there will not, roughly, be any problem, and you can type
everything with your keyboard.
==

--
Merciadri Luca
See http://www.student.montefiore.ulg.ac.be/~merciadri/
--

Repetitio Mater Memoriae.

Erik Quaeghebeur

Sep 27, 2010, 11:16:10 AM

On 27-09-10 11:31, Merciadri Luca wrote:

>
> Note that the *utf8x* option asks *inputenc* to load the *ucs*
> package, which is no longer maintained. Consequently, a compromise has
> to be found between *utf8* and *utf8x*: if *utf8x* is not needed
> (i.e. *utf8* is sufficient), you may only use *utf8*. However, if
> *utf8x* always works, do not change!

Well, I've just installed tl2010, and it seems biblatex has become
incompatible with ucs, so I needed to revert to plain utf8, which
luckily worked for the document at hand. However, if ucs is unmaintained
and incompatible with other packages, it is perhaps better to steer
people away from using it if they want their source files to be
future-proof.

Erik

Merciadri Luca

Sep 27, 2010, 12:12:18 PM

Erik Quaeghebeur <use...@equaeghe.nospammail.net> writes:

But utf8x is extended. If utf8 is sufficient for you, no problem, but
I don't actually know how much it extends utf8.

The teacher has not taught, until the student has learned.



Robin Fairbairns

Sep 27, 2010, 5:55:26 PM

Merciadri Luca <Luca.Me...@student.ulg.ac.be> writes:

> Erik Quaeghebeur <use...@equaeghe.nospammail.net> writes:
>
>> On 27-09-10 11:31, Merciadri Luca wrote:
>>>
>>> Note that the *utf8x* option asks *inputenc* to load the *ucs*
>>> package, which is no longer maintained. Consequently, a compromise has
>>> to be found between *utf8* and *utf8x*: if *utf8x* is not needed
>>> (i.e. *utf8* is sufficient), you may only use *utf8*. However, if
>>> *utf8x* always works, do not change!
>>
>> Well, I've just installed tl2010, and it seems biblatex has become
>> incompatible with ucs, so I needed to revert to plain utf8, which
>> luckily worked for the document at hand. However, if ucs is unmaintained
>> and incompatible with other packages, it is perhaps better to steer
>> people away from using it if they want their source files to be
>> future-proof.
>
> But utf8x is extended. If utf8 is sufficient for you, no problem, but
> I don't actually know how much it extends utf8.

as a rule of thumb, utf8 covers the code ranges for which there's a
defined latex encoding, while utf8x covers code ranges for which there's
a usable font.

from my point of view (as a western european) the only language that's
problematic is (modern) greek (which i've studied, and occasionally
write). it's sad that no latex encoding was ever produced (there have
been proposals, but none that would actually work).

it seems clear to me that significant work using utf-8 needs the
services of xetex or luatex.
--
Robin Fairbairns, Cambridge

Merciadri Luca

Sep 28, 2010, 8:47:03 AM

Robin Fairbairns <rf...@sxp10.cl.cam.ac.uk> writes:

Ok. Thanks for your clarifications. If I understand you correctly, utf8x
should only help with writing Greek, while utf8 supports everything
utf8x supports, except Greek?

Anything to add to my text? This is an introductory text about
encoding and LaTeX.

Thanks.

The whole dignity of man lies in the power of thought.


Robin Fairbairns

Sep 28, 2010, 11:15:25 AM

Merciadri Luca <Luca.Me...@student.ulg.ac.be> writes:

> Robin Fairbairns <rf...@sxp10.cl.cam.ac.uk> writes:
>
>> [lots of stuff]


>
> Ok. Thanks for your clarifications. If I understand you correctly, utf8x
> should only help with writing Greek, while utf8 supports everything
> utf8x supports, except Greek?

no. i stated what utf8 supports (scripts for which there is a latex
encoding). of the languages that utf8x supports and utf8 doesn't, only
greek bothers me. utf8x supports all sorts of things that are almost
certainly irrelevant to my tex _usage_.

> Anything to add to my text? This is an introductory text about
> encoding and LaTeX.

i've lost track. i'll go back and check...

... ah, yes. i remembered being a bit confused, and am slightly worried
by some of the approaches you take. i'm fantastically tired, but i'll
see if i can provide some alternative suggestions, later.

who's your introductory text for, exactly?
--
Robin Fairbairns, Cambridge

Merciadri Luca

Sep 28, 2010, 12:01:46 PM

Robin Fairbairns <rf...@sxp10.cl.cam.ac.uk> writes:

Okay. Nice thing.

> who's your introductory text for, exactly?

For beginners, but it would be published in TUGboat.

In the land of the blind, the one-eyed man is king.



Philipp Lehman

Sep 28, 2010, 12:27:36 PM

Merciadri Luca wrote:

There have been a few comments about (La)TeX and input encodings, I'll
add a few things not mentioned yet:

> 1) The inputenc Package: the Encoding of the Document
> -----------------------------------------------------

I'd add inputenx (a drop-in replacement of inputenc) and luainputenc
(for 8-bit encodings on LuaTeX). Also mention XeTeX and LuaTeX in
native UTF-8 mode. This is in fact your best option if you want UTF-8.

> On Microsoft Windows, users tend to use ISO-8859-1, which is
> commonly referred to as Latin-1.

Isn't that rather cp1252 (inputenc: "ansinew" or "cp1252")?

> In modern applications, the ``Unicode'' standard is a preferred
> character set. Consequently, it is recommended to always use utf8x,
> whatever the platform, as it provides the best (i.e. the most
> complete) set of characters. Sticking with utf8x allows you to never
> change your character encoding as utf8x is the future, for many
> reasons which come out of the scope of this booklet. It is possible
> to use utf8x as encoding, by simply replacing it in the option of
> inputenc.

If you want to get more 'technical', you may want to mention that the
UTF-8 decoder of utf8x is more intrusive than the one of utf8. utf8
has an expandable scanner, the one in utf8x is non-expandable. There's
more potential for conflicts with other packages in the latter case.

Rationale: use utf8 unless you really need utf8x (e.g., for Greek). In
the latter case, I'd recommend switching to XeTeX or LuaTeX, though.

If you're switching to UTF-8 because you're sharing documents (i.e.,
the LaTeX sources) with others and want to avoid problems with
Latin1/15 vs. CP1252 vs. MacRoman (or similar for Eastern European
encodings), inputenc + utf8 will work fine. If you need 'real' Unicode
support (e.g., when mixing scripts), you had better use a Unicode-savvy
engine.
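
To make the difference between these encodings concrete, the following
shell sketch prints the bytes that the same two characters occupy in
each of them. It assumes an iconv that knows the names ISO-8859-15 and
WINDOWS-1252 (as the one in GNU libc does); the helper name bytes_in is
made up for this illustration.

```shell
#!/bin/sh
# bytes_in ENCODING: read UTF-8 text on stdin and print its bytes in
# ENCODING as lowercase hex (hypothetical helper).
bytes_in() {
    iconv -f UTF-8 -t "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

# e-acute and the euro sign, written as octal escapes of their UTF-8
# bytes so that this script itself stays pure ASCII.
e_acute='\303\251'
euro='\342\202\254'

for enc in UTF-8 ISO-8859-15 WINDOWS-1252; do
    printf '%-13s e-acute: ' "$enc"; printf "$e_acute" | bytes_in "$enc"
    printf '%-13s euro:    ' "$enc"; printf "$euro"    | bytes_in "$enc"
done
```

The accented letter has the same single byte in Latin-9 and CP1252,
but the euro sign does not, which is why a wrong inputenc declaration
can go unnoticed until a less common character turns up.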

> 2.2) Converting a File to the Good Encoding

> iconv -f oldencoding fileinoldencoding.tex -o fileinnewencoding.tex

afaik, "recode" is also quite common.

> 3) Dealing With BiB Files
> -------------------------

I strongly recommend rewriting this section. Have a look at section
2.4.3 in this manual:

http://www.ctan.org/texarchive/macros/latex/exptl/biblatex/doc/biblatex.pdf

That's the user manual of the biblatex package but section 2.4.3
discusses encoding issues relevant to all users working with BibTeX.

> a. Follow the same recommendations as before for characters: always
> write \'e for ``\'e'' and the same for other ones,

Some assorted remarks:

1) Note that BibTeX requires "{\'e}". You might want to mention that
this will effectively break the kerning of all accented characters...

2) Make it absolutely clear that BibTeX can't handle UTF-8. This is
probably the second most common misunderstanding when it comes to
BibTeX. There is no way around this restriction. BibTeX can't deal
with multi-byte encodings. It's just not going to work. If you want to
see a trivial example which yields broken output, just ask.

3) Mention bibtex8, a drop-in BibTeX replacement which supports 8-bit
input. While it can't handle UTF-8 either, it can sort 8-bit input in
a way that is actually useful in languages other than English, when
supplied with a suitable csf file.
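
Remarks 1 and 2 amount to keeping .bib files in pure ASCII with braced
accent commands. A rough shell filter for that is sketched below. The
name bib_escape is hypothetical, only three characters are handled,
and the input is assumed to be UTF-8.

```shell
#!/bin/sh
# bib_escape: read UTF-8 on stdin and replace a few accented letters by
# braced accent commands, the form BibTeX expects (hypothetical helper;
# the table is deliberately tiny and would need extending).
bib_escape() {
    # Build the UTF-8 byte sequences with printf so that this script
    # stays pure ASCII.
    e_acute=$(printf '\303\251')   # e with acute accent
    e_grave=$(printf '\303\250')   # e with grave accent
    u_uml=$(printf '\303\274')     # u with diaeresis
    sed -e "s/$e_acute/{\\\\'e}/g" \
        -e "s/$e_grave/{\\\\\`e}/g" \
        -e "s/$u_uml/{\\\\\"u}/g"
}

# Example: an author name containing e-acute and u-diaeresis.
printf 'Andr\303\251 M\303\274ller\n' | bib_escape
```

A whole database could be filtered with something like
bib_escape < refs-utf8.bib > refs-ascii.bib, where both file names are
placeholders.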

--
Sender address blackholed, do not reply directly.
You can still reach me by email at: lehman gmx net.

Merciadri Luca

Sep 28, 2010, 3:11:29 PM

Philipp Lehman <devnull....@spamgourmet.com> writes:

Thanks for all your suggestions. You might find a recent version at

http://www.student.montefiore.ulg.ac.be/~merciadri/to_display/encoding_remarks.pdf.

I would like to give a precise way to avoid encoding problems, such as
the following rationale, assuming the reader does not want to play
around with either LuaTeX or XeTeX:

==
* If you're using Linux, thus working under UTF-8, use
===
\usepackage[utf8]{inputenc}
===
or use
===
\usepackage[utf8x]{inputenc}
===
if you're typing Greek or another problematic language and utf8 does not work.

* If you're using Windows, thus working under CP1252, use
===
\usepackage[cp1252]{inputenc}
===

* If files are encoded using ISO-8859-1 (which is Latin-1), use
===
\usepackage[latin1]{inputenc}
===

* In any case (any platform, any encoding), always use ASCII notation.

* Another solution is to use LuaTeX or XeTeX (built-in UTF-8).
==

This summary does not give any clue about bib-stuff, because the
related section now appears to be sufficiently short.

First deserve, then desire.



Philipp Lehman

Sep 28, 2010, 3:24:52 PM

Erik Quaeghebeur wrote:

> Well, I've just installed tl2010, and it seems biblatex has become
> incompatible with ucs

FWIW, it never really worked (because of the non-expandable scanner
which doesn't work with \MakeCapital and \MakeSentenceCase). I've
merely added the error message.

Merciadri Luca

Sep 28, 2010, 4:17:18 PM

Apart from that, I now understand why the following construction:

==
\begingroup
\inputencoding{latin1}
\bibliography{bibliography}
\endgroup
==

gives a correct compilation of some of my bib files: they are
internally encoded as latin1 (despite my working under Linux, i.e.
UTF-8, Debian), and treated as such. As a result, BibTeX does not need
to mess around with UTF-8 files. Strange approach, though.

It is better to die on one's feet than live on one's knees.



Philipp Stephani

Sep 28, 2010, 5:40:05 PM

(I have to answer here because the original post disappeared from my
news reader/server)

Thanks for this contribution; it touches several really important
points. However, I've got to add a few remarks:

Merciadri Luca <Luca.Me...@student.ulg.ac.be> writes:

> Every LaTeX document should have, in its preamble,
> \usepackage[encoding]{inputenc}

This is only true for documents that are intended for compilation with
pdfTeX. Documents compiled with XeTeX or LuaTeX should never load the
inputenc package.

> According to Mc Kichan, the inputenc package maps certain characters
> to their corresponding TeX macros according to the encoding option you
> select.

More precisely, it maps input characters to the LICR.

> On a standard Linux platform,

Although there are popularity differences, text encoding is nowadays
unrelated to the operating system since all modern operating system
components (windowing systems, font systems, drawing engines...) are
Unicode-based or at least Unicode-capable.

> you may replace *encoding* by
> *utf8x*, to use the e*x*tended UTF-8 character set.

> Note that the *utf8x* option asks *inputenc* to load the *ucs*
> package, which is no longer maintained.

And that is why the utf8x option should be avoided if possible. If you
are writing a text intended for beginners, don't mention it at all! The
ucs package breaks important packages such as csquotes. If you want
basic UTF-8 support in pdfTeX (e.g., for accented Latin characters),
load utf8enc.def. If you want real Unicode support, use XeTeX or
LuaTeX.

> Consequently, a compromise has
> to be found between *utf8* and *utf8x*: if *utf8x* is not needed
> (i.e. *utf8* is sufficient), you may only use *utf8*. However, if
> *utf8x* always works, do not change!

There is no compromise. Never use utf8x.

> On Microsoft Windows, users tend to use ISO-8859-1,

It's rather Windows-1252. Even if users have historically used
Windows-1252 and similar encodings a lot, they are obsolete and
everybody (including Microsoft) recommends against using them. Only
Unicode encodings (UTF-8, UTF-16, UTF-32, GB 18030...) are state of the
art. Since UTF-8 is the only Unicode encoding that is at least
partially supported by the inputenc package (and thus by pdfTeX-based
LaTeX documents), it is the only choice that can be recommended.

> In modern applications, the ``Unicode'' standard is a preferred
> character set. Consequently, it is recommended to always use utf8x,

utf8 instead of utf8x, but otherwise correct. To keep it simple, don't
mention other encodings at all.

> 2) The Proper Encoding of the Document File
> -------------------------------------------

The file encoding has been discussed in section 1, what is the purpose
of section 2?

> To avoid clashes, the best thing is to keep your document in the same
> encoding as the encoding encoding,

What is an "encoding encoding"?

> which is the option of inputenc. It
> is easier to do when you are working with Linux.

No, it depends on the editor. Saving a text file as UTF-8 is very easy
with Notepad, for example.

> 2.1) Directly Writing Characters Without Commands
> -------------------------------------------------
> Directly writing characters without their associated commands (for
> example ``\'e'' written with a e and an acute accent) is strongly
> discouraged.

No! Writing characters instead of LICR commands is recommended because
it is natural. That's the whole purpose of the inputenc package.

> With examples such as the ``\'e'' there is no problem, but
> when you begin using French quotes like « Mot » directly typing \verb
> « and », it may never be rendered, cause errors, etc., depending on
> your local implementation.

Maybe, but usually it works just fine if the encoding is declared
correctly. For quotation marks the csquotes package is the preferred
solution.

> There are actually three kinds of persons:
> a. Those who stick with commands. These are the best ones: commands
> will always be valid, and, if deprecated one day, using renewcommand
> or other structures will make no problem,

See above, using the LICR is very unnatural and makes code unreadable.

> b. Those who use commands only when necessary. These are persons who
> try to see which character from their keyboard is directly rendered,
> which one is not, and, for those which are rendered, they typeset
> them directly, and, for those which are not rendered (as now), they
> use commands. This is not the best approach, as it is extremely
> tedious, difficult, error-prone, and not the aim of LaTeX,

This *is* the best approach. It is natural to assume that a character
that is keyed in generates appropriate output unless it belongs to a
special category (like \ or %). If the application does otherwise, it
is a software bug.

> c. Those who use various tricks to make LaTeX behave as they want,
> even if they want things that are contrary to the
> state-of-the-art. This is not the best solution, as their commands
> have a great chance of becoming useless later.

Depends on what the tricks are. The inputenc package is a gross hack
(like most of LaTeX), but it's stable enough to be of general use.
XeTeX and LuaTeX have Unicode support built in, and using it is devoid
of problems.

> Let's take an example: if | is not implemented in your architecture,

A bit odd: | is an ASCII character that has no special meaning in usual
TeX setups, and I'd be surprised if it causes problems.

> it is sure that, for example, |x| will be considered as x, or, if you
> put bars around the x, there must be reasons. Consequently, it is
> better for these bars not to disappear from your page. If you want to
> be sure that they will always be here, a best idea is to use commands:
> write \lvert x \rvert.

That's an entirely different matter. You should use \lvert x \rvert
instead of |x| not because | might be unavailable, but because of
spacing issues. This is only relevant for math, and \lvert in text mode
would cause an error.

> 2.2) Converting a File to the Good Encoding
> -------------------------------------------
> If, say, you are dealing with documents in another encoding, the best
> thing is to use the following procedure, assuming you are working
> with Linux (or with such a virtual machine):
> a. Find their current encoding,

Which often isn't trivial. Unix-like systems generally have the "file"
utility, but I don't know of any easy Windows solution.
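
Where the file utility is unavailable, one rough check needs only
iconv: a file that fails to decode as UTF-8 is certainly not UTF-8,
which narrows the choice down to the 8-bit encodings. The sketch below
uses a made-up function name; it cannot tell Latin-1 from Windows-1252,
and pure ASCII files pass as UTF-8 too.

```shell
#!/bin/sh
# is_utf8 FILE: succeed if FILE decodes cleanly as UTF-8
# (hypothetical helper; pure ASCII also counts as valid UTF-8).
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

printf 'caf\303\251\n' > demo-utf8.tex     # e-acute as UTF-8 (2 bytes)
printf 'caf\351\n'     > demo-latin1.tex   # e-acute as Latin-1 (1 byte)

for f in demo-utf8.tex demo-latin1.tex; do
    if is_utf8 "$f"; then
        echo "$f: valid UTF-8"
    else
        echo "$f: not UTF-8, probably some 8-bit encoding"
    fi
done
```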

> b. Know what their future encoding will be,

Usually UTF-8.

> c. Be sure that the encodings are compatible (i.e. that both
> character sets contain the same symbols, even if they can be
> expressed differently). If this is not the case, you will lose
> information,

Compatibility here means that the target character set is a superset of
the source character set, not that they are equal.
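
This superset condition can be tested mechanically: iconv exits with a
non-zero status when it meets a character that the target encoding
cannot represent (GNU iconv does so, at least, when neither -c nor
//TRANSLIT is used). A small sketch, in which the function name fits_in
is hypothetical:

```shell
#!/bin/sh
# fits_in FROM TO: read text in encoding FROM on stdin and succeed only
# if every character can be represented in encoding TO.
fits_in() {
    iconv -f "$1" -t "$2" > /dev/null 2>&1
}

# The euro sign (UTF-8 bytes E2 82 AC) exists in Latin-9 but not in
# Latin-1.
if printf '\342\202\254' | fits_in UTF-8 ISO-8859-15; then
    echo "euro sign fits in ISO-8859-15"
fi
if ! printf '\342\202\254' | fits_in UTF-8 ISO-8859-1; then
    echo "euro sign does not fit in ISO-8859-1"
fi
```

Converting towards UTF-8 always passes this check for a correctly
declared source encoding, since Unicode is a superset of the 8-bit
character sets discussed here.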

> d. Execute, a sample file being fileinoldencoding.tex, and the same
> file, in its new encoding being fileinnewencoding.tex:
> ===
> iconv -f oldencoding fileinoldencoding.tex -o fileinnewencoding.tex
> ===

The -t flag should be given explicitly, otherwise the target encoding
defaults to the encoding defined by the current locale, which is usually
UTF-8 nowadays, but can in principle be anything.

> where oldencoding could be, for example, windows-1252 . This is the
> same as redirecting the flux using
> ===
> iconv -f oldencoding fileinoldencoding.tex > fileinnewencoding.tex
> ===

If it's the same, then you don't need to mention it. Here I'd prefer
the second option because the first is a nonstandard Linux-specific
addition.

> You might make this process automatic, e.g. by creating a shell file
> (here it is bash):
> ===
> #!/bin/bash
> LIST=`ls *.tex`

Please no ls in shell scripts:
    LIST=(*.tex)

> for i in $LIST;

    for i in "${LIST[@]}"
or better
    for i in *.tex
but usually the file names are given on the command line:
    for i

> do iconv -f windows-1252 $i -o $i.utf8;
> mv $i.utf8 $i;

    iconv -f windows-1252 -t utf-8 -- "$i" > "$i.utf8" && mv -- "$i.utf8" "$i"

> done
> ===
> and executing this file in a folder containing .tex files. Here, the
> new files' names will contain a .utf8 extension.

No, because they are renamed afterwards; and the original files are lost.
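
Putting these corrections together, a version that keeps the originals
might look like the sketch below. The function name to_utf8, the .orig
backup suffix and the fixed source encoding windows-1252 are
assumptions made for the example, not part of the script under
discussion.

```shell
#!/bin/sh
# to_utf8 FILE...: convert each FILE from windows-1252 to UTF-8 in
# place, keeping the untouched original as FILE.orig.
to_utf8() {
    for f in "$@"; do
        # Convert into a temporary file first; on failure leave FILE alone.
        if iconv -f WINDOWS-1252 -t UTF-8 -- "$f" > "$f.utf8"; then
            mv -- "$f" "$f.orig" && mv -- "$f.utf8" "$f"
        else
            echo "to_utf8: could not convert $f" >&2
            rm -f -- "$f.utf8"
        fi
    done
}

# Typical call from the directory holding the sources:
#   to_utf8 *.tex
```

Running it twice on the same file would convert already converted
text a second time, so the backups should be checked before any rerun.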

> You can evidently
> modify this script or the aforementioned commands as you want, to use
> another encoding. By default, the encoding is utf8,

No, the default encoding is defined by the current locale, and can be
anything.

> but if you want
> encoding newencoding, use
> ===
> -t newencoding
> ===,
> e. Open the file in an editor, the editor being set to open files in
> the output encoding,
> f. If you see strange characters, there is a problem, and check the
> procedure. If everything seems normal, you can modify the file, save
> the modifications, but everything under the new encoding,
> g. Compile the file(s) with the good inputenc declaration, as
> explained above.
> 3) Dealing With BiB Files
> -------------------------
> If you are using BiBTeX, you may also have problems, especially if you
> are switching from Microsoft Windows to Linux,

Again, this is unrelated to the operating system (which treats all files
as blobs without semantics), but caused by the lack of Unicode (or even
non-ASCII) support in BibTeX.

> or dealing with files
> transiting between both OSes. The main problem is that BiBTeX is not
> really good at dealing with Unicode (and consequently utf-8 ,
> etc.).

"Not really good" is too polite :-)
Because it doesn't know anything about UTF-8 or non-fixed-width
encodings in general, BibTeX can even tear apart code unit sequences
that encode a single code point, leading to garbage that even text
editors fail to display.
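
The effect is easy to reproduce outside BibTeX. The sketch below cuts
a UTF-8 string after a fixed number of bytes, which is roughly what a
byte-oriented program does when it takes the first "letter" of a name,
and then checks whether the result is still valid UTF-8. The helper
names are invented for the example.

```shell
#!/bin/sh
# first_bytes N: copy only the first N bytes of stdin (byte-oriented,
# like a program that knows nothing about multi-byte characters).
first_bytes() {
    dd bs=1 count="$1" 2>/dev/null
}

# valid_utf8: succeed if stdin is well-formed UTF-8.
valid_utf8() {
    iconv -f UTF-8 -t UTF-8 > /dev/null 2>&1
}

# The name Emile with a capital E-acute, which takes two bytes in
# UTF-8 (C3 89); written with octal escapes to keep this file ASCII.
name='\303\211mile'

printf "$name" | first_bytes 2 | valid_utf8 && echo "2 bytes: still valid"
printf "$name" | first_bytes 1 | valid_utf8 || echo "1 byte: broken sequence"
```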

> Consequently, the best suggestion, if you do not want to use
> other alternatives to BiBTeX

Why not use alternatives to BibTeX? Nowadays I'd recommend biblatex +
biber to all beginners, given that working binaries of biber have been
produced.

>, is to
> a. Follow the same recommendations as before for characters: always
> write \'e for ``\'e'' and the same for other ones,

But in braces!

> b. Either keep the .bib file in latin1 encoding,

That requires BibTeX8, classical BibTeX cannot handle anything that is
not ASCII, meaning that sorting produces garbage etc.

> and consequently using
> ===
> \begingroup
> \inputencoding{latin1}
> \bibliography{bibliography}
> \endgroup
> ===

Or biblatex's bibencoding option.

> in your .tex file, or convert your .bib file in utf-8. In this case,
> you do not need the inputencoding declaration, and the group
> markers.

But you need biber then (or maybe BibTeXu?)

> Note that it is really important to use \'e and other sets
> of commands before the conversion. If you do not work this way,
> strange characters are likely to occur after the conversion. That is
> one of the reasons which justifies the use of such commands.

But only because of BibTeX and accordingly only in bibliography
databases.

> If you
> make a document under Microsoft Windows, there will not be any
> problem until you keep on with the same encoding,

I don't quite understand where to put that sentence. You can get
problems even if you stick with one encoding, and you might be able to
avoid problems in spite of using multiple encodings, and all this is
barely Windows-related.

--
Change “LookInSig” to “tcalveu” to answer by mail.

Merciadri Luca

Sep 29, 2010, 2:12:24 AM

Philipp Stephani <Look...@arcor.de> writes:

> (I have to answer here because the original post disappeared from my
> news reader/server)
>
> Thanks for this contribution; it touches several really important
> points. However, I've got to add a few remarks:

First, thanks for all these suggestions.

> This is only true for documents that are intended for compilation with
> pdfTeX. Documents compiled with XeTeX or LuaTeX should never load the
> inputenc package.

Okay. But inputenc is also important for generating dvi, ps and PDF,
the PDF being not necessarily generated using pdfTeX.

Okay for the rest. You might find a recent version at the URL that I
gave in an older message, i.e. at

<http://www.student.montefiore.ulg.ac.be/~merciadri/to_display/encoding_remarks.pdf>

(provided Montefiore's students' server is not down in some seconds).

It is better to die on one's feet than live on one's knees.



Guenter Milde

Sep 29, 2010, 6:50:32 AM9/29/10
to
On 2010-09-27, Merciadri Luca wrote:

> Here is a text I wrote some months ago about encoding, LaTeX, and
> BiBTeX.

...


> Here is the text. If you have something to add, to correct, etc.,
> please tell it! Thanks.

...


> 1) The inputenc Package: the Encoding of the Document
> -----------------------------------------------------------
> Every¹ LaTeX document should have, in its preamble,

\usepackage[<encoding>]{inputenc}

where the specified <encoding> **must** match the encoding of the
document!

On a modern system where Unicode in utf-8 encoding is the default,
replace <encoding> by *utf8*.

¹ Exceptions to this rule:

a) documents that only use 7-bit ASCII characters

b) documents intended for processing with XeLaTeX or LuaTeX
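
A minimal sketch of a preamble that respects exception b), assuming a TeX
distribution recent enough to ship the iftex package with its \ifPDFTeX
test; the use of fontspec in the other branch is likewise an assumption
about a typical XeLaTeX/LuaLaTeX setup:

```latex
\documentclass{article}
\usepackage{iftex}
\ifPDFTeX
  % 8-bit engine: declare the encoding of the source file
  \usepackage[utf8]{inputenc}
  \usepackage[T1]{fontenc}
\else
  % XeLaTeX or LuaLaTeX: the source is read as UTF-8 natively,
  % so inputenc must not be loaded
  \usepackage{fontspec}
\fi
\begin{document}
Un été à Liège.
\end{document}
```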

> 2) The Proper Encoding of the Document File
> ----------------------------------------

- To avoid clashes, the best thing is to keep your document in the same
- encoding as the encoding encoding, which is the option of inputenc.
+ To avoid clashes, always keep your document in the
+ encoding which is the option of inputenc.
Never forget to change the inputenc option if you change the
document encoding.


> 2.1) Directly Writing Characters Without Commands
> -------------------------------------------------

...

> a. Those who stick with commands. These are the best ones: commands
> will always be valid, and, if deprecated one day, using renewcommand
> or other structures will make no problem,

on the downside, the LaTeX source will be more difficult to read,
write, and edit (especially on keyboards where the \, {, and }
characters are only accessible via AltGr+<some key>, like with German
ones).

> b. Those who use commands only when necessary. These are persons who
> try to see which character from their keyboard is directly rendered,
> which one is not, and, for those which are rendered, they typeset
> them directly, and, for those which are not rendered (as now), they
> use commands.

- This is not the best approach, as it is extremely
- tedious, difficult, error-prone, and not the aim of LaTeX,

This "mixed" approach results in documents that are easy to typeset,
read, and edit for users with a similar keyboard. Beware of
problems with legacy applications that can only handle 7-bit input
(bibtex, some e-mail systems) and make sure the specified encoding
matches the one actually used.

...

d. Those who use Unicode whenever possible.

While sometimes difficult to input (a good text editor will help),
this results in readable sources that will work across OS boundaries.

Günter

Merciadri Luca

Sep 29, 2010, 5:12:03 PM9/29/10
to

Guenter Milde <mi...@users.berlios.de> writes:

Thanks. I did the related modifications, and the document is
consultable at the aforementioned URL.

If it's too good to be true, then it probably is.



Philipp Stephani

Sep 29, 2010, 6:50:41 PM9/29/10
to
Merciadri Luca <Luca.Me...@student.ulg.ac.be> writes:

> Philipp Stephani <Look...@arcor.de> writes:
>
>> (I have to answer here because the original post disappeared from my
>> news reader/server)
>>
>> Thanks for this contribution; it touches several really important
>> points. However, I've got to add a few remarks:
> First, thanks for all these suggestions.
>
>> This is only true for documents that are intended for compilation with
>> pdfTeX. Documents compiled with XeTeX or LuaTeX should never load the
>> inputenc package.
> Okay. But inputenc is also important for generating dvi, ps and PDF,
> the PDF being not necessarily generated using pdfTeX.

On current distributions, pdfTeX is also used to produce DVIs. AFAIK
the only program that isn't based on either pdfTeX, XeTeX or LuaTeX is
tex, which is the original Knuth TeX. E.g., here is my version
information:

/ $ tex --version
TeX 3.1415926 (TeX Live 2010)
kpathsea version 6.0.0
Copyright 2010 D.E. Knuth.
There is NO warranty. Redistribution of this software is
covered by the terms of both the TeX copyright and
the Lesser GNU General Public License.
For more information about these matters, see the file
named COPYING and the TeX source.
Primary author of TeX: D.E. Knuth.

/ $ latex --version
pdfTeX 3.1415926-1.40.11-2.2 (TeX Live 2010)
kpathsea version 6.0.0
Copyright 2010 Peter Breitenlohner (eTeX)/Han The Thanh (pdfTeX).
There is NO warranty. Redistribution of this software is
covered by the terms of both the pdfTeX copyright and
the Lesser GNU General Public License.
For more information about these matters, see the file
named COPYING and the pdfTeX source.
Primary author of pdfTeX: Peter Breitenlohner (eTeX)/Han The Thanh (pdfTeX).
Compiled with libpng 1.2.40; using libpng 1.2.40
Compiled with zlib 1.2.3; using zlib 1.2.3
Compiled with xpdf version 3.02pl4

Guenter Milde

Sep 30, 2010, 3:52:59 AM9/30/10
to
On 2010-09-27, Merciadri Luca wrote:
> Erik Quaeghebeur <use...@equaeghe.nospammail.net> writes:
>> Op 27-09-10 11:31, Merciadri Luca schreef:

>> Well, I've just installed tl2010, and it seems biblatex has become


>> incompatible with ucs, so I needed to revert to plain utf8, which
>> luckily worked for the document at hand. However, if ucs is unmaintained
>> and incompatible with other packages, it is perhaps better to steer
>> people away from using it if they want their source files to be
>> future-proof.

> But utf8x is extended. If utf8 is sufficient for you, no problem, but
> I don't actually know how much it extends utf8.

In a strict sense, utf8x does not extend utf8 but predates it.

OTOH, it covers a wider range of Unicode chars (most notably Greek)
and may hence be described as "extended Unicode support".

The second big bonus is automatic font-encoding switches with the
package "autofe" so that e.g. writing

Hello Σαμ, how about going to Москва?

just works. (Without "autofe", you need to indicate language switches
or at least font-encoding switches in the LaTeX source.)
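
For comparison, a sketch of the same sentence with explicit language
switches and plain utf8 instead of utf8x and "autofe". It assumes a
current TeX distribution with the Greek and Cyrillic babel and font
support installed; the choice of babel's \foreignlanguage is an
assumption, and other ways of switching exist:

```latex
\documentclass{article}
\usepackage[utf8]{inputenc}
% One font encoding per script; the last one (T1) is the default
\usepackage[LGR,T2A,T1]{fontenc}
% The last language (english) is the main document language
\usepackage[greek,russian,english]{babel}
\begin{document}
Hello \foreignlanguage{greek}{Σαμ}, how about going to
\foreignlanguage{russian}{Москва}?
\end{document}
```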

Unfortunately, the "autofe" package is even more likely to clash with
newer versions of other packages.

Günter

Merciadri Luca

Sep 30, 2010, 5:26:55 PM9/30/10
to

Philipp Stephani <Look...@arcor.de> writes:

Thanks for this info.

The teacher has not taught, until the student has learned.



Merciadri Luca

Sep 30, 2010, 5:28:07 PM9/30/10
to

Guenter Milde <mi...@users.berlios.de> writes:

As a direct consequence of what you say, I won't mention autofe in
the text, simply because of its `clashing probability.'

Failure is not falling down, you fail when you don't get back up.



Dominik Waßenhoven

Oct 1, 2010, 4:49:27 AM10/1/10
to
Merciadri Luca wrote:

> \usepackage[winansi]{inputenc}

\usepackage[ansinew]{inputenc}

or

\usepackage[cp1252]{inputenc}

Inputenc does not know any »winansi« option.

Regards,
Dominik.-
--
UK-TeX-FAQ: http://www.tex.ac.uk/cgi-bin/texfaq2html
minimal example: http://www.minimalbeispiel.de/mini-en.html
biblatex styles: http://biblatex.dominik-wassenhoven.de/?en

Merciadri Luca

Oct 1, 2010, 5:00:00 PM10/1/10
to

Dominik Waßenhoven <dom...@web.de> writes:

> Merciadri Luca wrote:
>
>> \usepackage[winansi]{inputenc}
>
> \usepackage[ansinew]{inputenc}
>
> or
>
> \usepackage[cp1252]{inputenc}
>
> Inputenc does not know any »winansi« option.

Thanks.

Give a man a fish and you feed him for a day; teach a man to fish
and you feed him for a lifetime.



Merciadri Luca

Oct 3, 2010, 2:13:23 PM10/3/10
to

Here is the preprint article. Something to add/modify?

Consider reading from section 5 please.

<http://www.student.montefiore.ulg.ac.be/~merciadri/to_display/encoding_remarks.pdf>

Fall down seven times, stand up eight.


Dominik Waßenhoven

Oct 4, 2010, 3:03:07 AM10/4/10
to
Merciadri Luca wrote:

> Something to add/modify?

Pointing to this discussion, I would add either the Message-ID of your
post starting this thread
(87r5gf8...@merciadriluca-station.MERCIADRILUCA) or a link to the
GoogleGroups site with this thread
(http://groups.google.com/group/comp.text.tex/browse_thread/thread/e0b0bec4e0dbf8b5/defbbc95cd273119)

5.1.2
Typo: “among others languages” -> “among other languages”

5.1.4
the Windows encoding is CP1252, not CP1251

5.3
“we did not speak neither …” -> “we did neither speak …”

Merciadri Luca

Oct 5, 2010, 9:25:19 AM10/5/10
to

Dominik Waßenhoven <dom...@web.de> writes:

> Merciadri Luca wrote:
>
>> Something to add/modify?
>
> Pointing to this discussion, I would add either the Message-ID of your
> post starting this thread
> (87r5gf8...@merciadriluca-station.MERCIADRILUCA) or a link to the
> GoogleGroups site with this thread
> (http://groups.google.com/group/comp.text.tex/browse_thread/thread/e0b0bec4e0dbf8b5/defbbc95cd273119)

It would put a very long and extremely disgracious bibtex entry, when
googling for the name of the thread redirects you directly here. No?

> 5.1.2
> Typo: “among others languages” -> “among other languages”

Sorry, pure typo.

> 5.1.4
> the Windows encoding is CP1252, not CP1251

Okay, thanks.

> 5.3
> “we did not speak neither …” -> “we did neither speak …”

Thanks. Semantically, does it mean the same? Wouldn't the second
sentence emphasize on something?

Thanks.

Laugh and the world laughs with you ... Cry and you will find no one
with tears.



Dominik Waßenhoven

Oct 5, 2010, 10:32:13 AM10/5/10
to
Merciadri Luca wrote:

>> Pointing to this discussion, I would add either the Message-ID of your
>> post starting this thread
>> (87r5gf8...@merciadriluca-station.MERCIADRILUCA) or a link to the
>> GoogleGroups site with this thread
>> (http://groups.google.com/group/comp.text.tex/browse_thread/thread/e0b0bec4e0dbf8b5/defbbc95cd273119)
> It would put a very long and extremely disgracious bibtex entry, when
> googling for the name of the thread redirects you directly here. No?

You could provide a short link, e.g. with http://bit.ly or
http://tinyurl.com.

>> 5.3
>> “we did not speak neither …” -> “we did neither speak …”
> Thanks. Semantically, does it mean the same? Wouldn't the second
> sentence emphasize on something?

In my understanding, no. But I am not a native speaker, so I might be
corrected.

Best,

Robin Fairbairns

Oct 5, 2010, 11:36:02 AM10/5/10
to
Dominik Waßenhoven <dom...@web.de> writes:

> Merciadri Luca wrote:
>
>>> Pointing to this discussion, I would add either the Message-ID of your
>>> post starting this thread
>>> (87r5gf8...@merciadriluca-station.MERCIADRILUCA) or a link to the
>>> GoogleGroups site with this thread
>>> (http://groups.google.com/group/comp.text.tex/browse_thread/thread/e0b0bec4e0dbf8b5/defbbc95cd273119)
>> It would put a very long and extremely disgracious bibtex entry, when
>> googling for the name of the thread redirects you directly here. No?
>
> You could provide a short link, e.g. with http://bit.ly or
> http://tinyurl.com.
>
>>> 5.3
>>> “we did not speak neither …” -> “we did neither speak …”
>> Thanks. Semantically, does it mean the same? Wouldn't the second
>> sentence emphasize on something?

no. the original sentence has a double negative, so would mean "we did
speak either".

i would say "we neither spoke..." (but i don't have the original to
hand).

(i don't read things on-screen: i did try special glasses with lenses
designed for the job, but they didn't really work. so i continue to
print things to read -- shameful, but true.)

> In my understanding, no. But I am not a native speaker, so I might be
> corrected.

you do as well as many native speakers, imo.
--
Robin Fairbairns, Cambridge

Merciadri Luca

Oct 5, 2010, 11:41:58 AM10/5/10
to

Robin Fairbairns <rf...@sxp10.cl.cam.ac.uk> writes:

> Dominik Waßenhoven <dom...@web.de> writes:
>
>> Merciadri Luca wrote:
>>
>>>> Pointing to this discussion, I would add either the Message-ID of your
>>>> post starting this thread
>>>> (87r5gf8...@merciadriluca-station.MERCIADRILUCA) or a link to the
>>>> GoogleGroups site with this thread
>>>> (http://groups.google.com/group/comp.text.tex/browse_thread/thread/e0b0bec4e0dbf8b5/defbbc95cd273119)
>>> It would put a very long and extremely disgracious bibtex entry, when
>>> googling for the name of the thread redirects you directly here. No?
>>
>> You could provide a short link, e.g. with http://bit.ly or
>> http://tinyurl.com.
>>
>>>> 5.3
>>>> “we did not speak neither …” -> “we did neither speak …”
>>> Thanks. Semantically, does it mean the same? Wouldn't the second
>>> sentence emphasize on something?
>
> no. the original sentence has a double negative, so would mean "we did
> speak either".

You're right. That's what I learnt many years ago. Robin, nothing
special to add for the document?

The whole dignity of man lies in the power of thought.


Philipp Stephani

Oct 5, 2010, 5:53:10 PM10/5/10
to
Merciadri Luca <Luca.Me...@student.ulg.ac.be> writes:

>> 5.1.4
>> the Windows encoding is CP1252, not CP1251
> Okay, thanks.

Well, "the" Windows encoding is UTF-16. The traditional legacy encoding
is Windows-1252 only in Western Europe, other regions have used
different encodings.

>
>> 5.3
>> “we did not speak neither …” -> “we did neither speak …”
> Thanks. Semantically, does it mean the same? Wouldn't the second
> sentence emphasize on something?

perhaps "we didn't speak ... either" or "neither did we speak ..."? But
I'm not a native speaker.
