Encoding remarks


Merciadri Luca Sep 27, 2010 2:31 AM
Posted in group: comp.text.tex
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi,

Here is a text I wrote some months ago about encoding, LaTeX, and
BibTeX. It is divided into three `big' parts:

1) Inputenc package;
2) Proper encoding of the document file;
 2.1) Directly writing characters without commands,
 2.2) Converting a file to the good encoding;
3) Dealing with BiB files.

Here is the text. If you have something to add or to correct, please
tell me! Thanks.

==
If you are collaborating with other people who use different OSes, or
simply migrating from one platform to another, you may have trouble
with accents and special characters, especially if the language is
French, Polish, or another language which makes extensive use of
special characters.

As Computer Science is, like every science, complicated, people
sometimes mix up the terms related to *encoding*. Here is a brief
summary of how to deal with encoding in LaTeX.


1) The inputenc Package: the Encoding of the Document
- -----------------------------------------------------------
Every LaTeX document should have, in its preamble,

===
\usepackage[encoding]{inputenc}
===

According to Mc Kichan, the inputenc package maps certain characters
to their corresponding TeX macros according to the encoding option you
select. On a standard Linux platform, you may replace *encoding* by
*utf8x*, to use the e*x*tended UTF-8 character set.

Note that the *utf8x* option asks *inputenc* to load the *ucs*
package, which is no longer maintained. Consequently, a compromise has
to be found between *utf8* and *utf8x*: if *utf8x* is not needed
(i.e. *utf8* is sufficient), use plain *utf8*. However, if *utf8x*
already works for your documents, there is no need to change.


On Microsoft Windows, users tend to use ISO-8859-1, which is commonly
referred to as Latin-1. This one is generally intended for ``Western
European'' languages. In this case, you need to replace encoding by
latin1.

On Western European installations of Microsoft Windows, the default
character encoding of the files is CP1252 (also called Windows-1252),
for which inputenc provides the cp1252 and ansinew options. This is an
eight-bit character encoding designed to cover Western European
languages (French, ...). It should not be confused with CP1251, which
covers languages that use the Cyrillic alphabet, such as Russian,
Bulgarian and Serbian. CP1252 is quite compatible with latin1: the two
differ only in the range 0x80-0x9F, where CP1252 has printable
characters (the euro sign, curly quotes, ...) and latin1 has control
codes, so they do not often clash.

In modern applications, the ``Unicode'' standard is the preferred
character set. Consequently, it is recommended to always use utf8x (or
utf8, when it is sufficient), whatever the platform, as it provides
the best (i.e. the most complete) set of characters. Sticking with it
means that you never have to change your character encoding again, as
Unicode is the future, for many reasons which are beyond the scope of
this booklet. To use it, simply give it as the option of inputenc.


2) The Proper Encoding of the Document File
- ----------------------------------------
To avoid clashes, the best thing is to save your document file in the
same encoding as the one you give as the option of inputenc. This is
easier to do when you are working with Linux.
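
As a rough sketch of how this could be checked, the two hypothetical
bash functions below (the names are mine, not from any package) read
the option given to inputenc and test whether the bytes of the file
are plausible for it. Only the UTF-8 case can really be verified,
since any sequence of bytes is valid latin1. Note also that a
commented-out \usepackage line would be picked up as well.

```shell
#!/bin/bash
# declared_inputenc FILE: print the option of the first
# \usepackage[...]{inputenc} line found in FILE.
declared_inputenc() {
  sed -n 's/.*\\usepackage\[\([^]]*\)]{inputenc}.*/\1/p' "$1" | head -n 1
}

# matches_declaration FILE: succeed if the bytes of FILE are plausible
# for the declared encoding. Only utf8/utf8x is actually checked;
# any other declaration gives no verdict and counts as success.
matches_declaration() {
  case "$(declared_inputenc "$1")" in
    utf8|utf8x) iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1 ;;
    *) return 0 ;;
  esac
}
```

For example, `matches_declaration paper.tex || echo "encoding clash"`
would warn about a file that declares utf8 but was saved as latin1.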

 2.1) Directly Writing Characters Without Commands
 -------------------------------------------------
 Directly writing characters without their associated commands (for
 example, typing the letter e with an acute accent instead of \'e) is
 strongly discouraged. With a character such as the accented e there
 is usually no problem, but when you begin using French quotes like
 « Mot », directly typing « and », they may not be rendered at all,
 may cause errors, etc., depending on your local implementation. There
 are actually three kinds of people:

 a. Those who stick with commands. These are the best ones: commands
 will always be valid, and, if they are deprecated one day, using
 \renewcommand or other structures will cause no problem,
 b. Those who use commands only when necessary. These are people who
 check which characters from their keyboard are rendered directly and
 which ones are not; they type the former directly and use commands
 for the latter. This is not the best approach, as it is extremely
 tedious, difficult, error-prone, and not the aim of LaTeX,
 c. Those who use various tricks to make LaTeX behave as they want,
 even if they want things that are contrary to the
 state of the art. This is not the best solution, as their commands
 have a great chance of becoming useless later on.

 Let's take an example: if | is not available in your setup, |x| may
 well be rendered as plain x. Yet if you put bars around the x, there
 must be a reason, so it is better for these bars not to disappear
 from your page. If you want to be sure that they will always be
 there, the best idea is to use commands: write \lvert x \rvert (both
 commands are provided by the amsmath package).

 The best thing which can be recommended is obviously to stick with
 commands, as demonstrated above. One sometimes needs to load extra
 packages for other symbols, but this is better than the two other
 approaches, namely using commands only when necessary, and using
 various nonstandard tricks.

 2.2) Converting a File to the Good Encoding
 -------------------------------------------
 If, say, you are dealing with documents in another encoding, the best
 thing is to use the following procedure, assuming you are working
 with Linux (or with a Linux virtual machine):

 a. Find their current encoding,
 b. Decide what their future encoding will be,
 c. Be sure that the encodings are compatible (i.e. that both
 character sets contain the same symbols, even if they are encoded
 differently). If this is not the case, you will lose information,
 d. Run the following command, where fileinoldencoding.tex is the
 original file and fileinnewencoding.tex is the same file in its new
 encoding:
 ===
 iconv -f oldencoding fileinoldencoding.tex -o fileinnewencoding.tex
 ===
 where oldencoding could be, for example, windows-1252. This is the
 same as redirecting the output using
 ===
 iconv -f oldencoding fileinoldencoding.tex > fileinnewencoding.tex
 ===
 You might make this process automatic, e.g. by creating a shell file
 (here it is bash):
 ===
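 #!/bin/bash
 # Convert every .tex file of the current folder to UTF-8, in place.
 for i in *.tex; do
   # Write to a temporary .utf8 file; replace the original on success.
   iconv -f windows-1252 -t utf-8 "$i" -o "$i.utf8" && mv "$i.utf8" "$i"
 done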
 ===
 and executing this file in a folder containing .tex files. Here, the
 .utf8 extension is only used for a temporary file: each original
 file ends up replaced by its converted version. You can of course
 modify this script or the aforementioned commands as you want, to use
 another encoding. If no target is given, iconv converts to the
 encoding of the current locale (UTF-8 on most recent Linux systems);
 if you want encoding newencoding, use
 ===
 -t newencoding
 ===,
 e. Open the file in an editor, the editor being set to open files in
 the output encoding,
 f. If you see strange characters, there is a problem: go through the
 procedure again. If everything seems normal, you can modify the file
 and save the modifications, always in the new encoding,
 g. Compile the file(s) with the right inputenc declaration, as
 explained above.
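
One pitfall of the procedure above is converting the same file twice,
which corrupts every accented character. A minimal sketch of a guard
against this, assuming GNU iconv, is given below. The function names
is_valid_utf8 and to_utf8 are hypothetical, and windows-1252 is only a
default guess for the old encoding.

```shell
#!/bin/bash
# is_valid_utf8 FILE: succeed if FILE decodes cleanly as UTF-8.
# (A pure ASCII file is valid UTF-8 too.)
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# to_utf8 FILE [OLDENCODING]: convert FILE to UTF-8 in place,
# unless it already is valid UTF-8. OLDENCODING defaults to
# windows-1252, which is an assumption about where the file comes from.
to_utf8() {
  local file=$1 from=${2:-WINDOWS-1252}
  if is_valid_utf8 "$file"; then
    echo "$file: already valid UTF-8, left untouched"
    return 0
  fi
  # Convert into a temporary file; replace the original only on success.
  iconv -f "$from" -t UTF-8 "$file" > "$file.utf8" && mv "$file.utf8" "$file"
}
```

Calling `to_utf8 chapter1.tex` a second time then leaves the file as
it is instead of converting it again.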

3) Dealing With BiB Files
- -------------------------
If you are using BibTeX, you may also have problems, especially if you
are switching from Microsoft Windows to Linux, or dealing with files
travelling between both OSes. The main problem is that BibTeX is not
really good at dealing with Unicode (and consequently UTF-8,
etc.). Consequently, the best suggestion, if you do not want to use
an alternative to BibTeX, is to

 a. Follow the same recommendations as before for characters: always
 write \'e for the letter e with an acute accent, and likewise for
 other characters,
 b. Either keep the .bib file in latin1 encoding, and consequently use
 ===
 \begingroup
 \inputencoding{latin1}
 \bibliography{bibliography}
 \endgroup
 ===
 in your .tex file, or convert your .bib file to UTF-8. In this case,
 you need neither the \inputencoding declaration nor the group
 markers. Note that it is really important to use \'e and similar
 commands before the conversion. If you do not work this way,
 strange characters are likely to appear after the conversion. That is
 one of the reasons which justify the use of such commands. If you
 write a document under Microsoft Windows, there will not be any
 problem as long as you keep the same encoding,
 c. Note that the same remark applies even if you keep your file in
 latin1 and use the above code: use \'e for the accented e, and other
 commands to display related symbols. If you stick with Microsoft
 Windows, there will, roughly, not be any problem, and you can type
 everything with your keyboard.
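
To see where a .bib file still contains characters typed directly
rather than written as commands, something like the following might
help. It is only a sketch: find_non_ascii is a hypothetical name, and
the -a option (treat the file as text) is a GNU/BSD grep extension.

```shell
#!/bin/bash
# find_non_ascii FILE: print, with line numbers, every line of FILE
# containing a byte outside printable ASCII and whitespace.
# Succeeds if such a line exists, fails if the file is clean.
find_non_ascii() {
  LC_ALL=C grep -a -n '[^[:print:][:space:]]' "$1"
}
```

Running `find_non_ascii bibliography.bib` before the conversion lists
the entries which should be rewritten with \'e and friends.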
==

- --
Merciadri Luca
See http://www.student.montefiore.ulg.ac.be/~merciadri/
- --

Repetitio Mater Memoriae.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Processed by Mailcrypt 3.5.8 <http://mailcrypt.sourceforge.net/>

iEYEARECAAYFAkygZFwACgkQM0LLzLt8MhwvPgCeKfQXFakAwn2HNNqvWpzA+U1R
5zUAnjVoGQWwGknrOs48Kmd3f7HqzsiI
=Jt91
-----END PGP SIGNATURE-----