## Encoding remarks

Merciadri Luca Sep 27, 2010 2:31 AM
Posted in group: comp.text.tex
Hi,

Here is a text I wrote some months ago about encoding, LaTeX, and
BibTeX. It is divided into three `big' parts:

1) Inputenc package;
2) Proper encoding of the document file;
   2.1) Directly writing characters without commands,
   2.2) Converting a file to the right encoding;
3) Dealing with .bib files.

Here is the text. If you have something to add, to correct, etc.,
please tell me! Thanks.

==

If you are collaborating with other people using different OSes, or
simply migrating from one platform to another, you may have trouble
with accents and special characters, especially if the language is
French, Polish, or another language which makes extensive use of
special characters.

As Computer Science is, like every science, complicated, people
sometimes mix up the words which are related to *encoding*. We shall
give here a brief summary of how you need to deal with encoding and
LaTeX.

1) The inputenc Package: the Encoding of the Document
-----------------------------------------------------

Every LaTeX document should have, in its preamble,

===
\usepackage[encoding]{inputenc}
===

According to Mc Kichan, the inputenc package maps certain characters
to their corresponding TeX macros according to the encoding option you
select. On a standard Linux platform, you may replace *encoding* by
*utf8x*, to use the e*x*tended UTF-8 character set.

Note that the *utf8x* option asks *inputenc* to load the *ucs*
package, which is no longer maintained. Consequently, a compromise has
to be found between *utf8* and *utf8x*: if *utf8x* is not needed
(i.e. *utf8* is sufficient), use plain *utf8*. However, if *utf8x*
already works for you, do not change!

On Microsoft Windows, users tend to use ISO-8859-1, which is commonly
referred to as Latin-1. This one is generally intended for ``Western
European'' languages. In this case, you need to replace *encoding* by
*latin1*.

Strictly speaking, on Western European installations of Microsoft
Windows, the default character encoding of files is CP1252
(Windows-1252), for which the inputenc option is *cp1252*. It is an
eight-bit character encoding covering Western European languages
(French, ...). Do not confuse it with CP1251, which is designed for
languages that use the Cyrillic alphabet, such as Russian, Bulgarian
and Serbian Cyrillic, and is widely used for encoding the Bulgarian,
Serbian and Macedonian languages. CP1252 is largely compatible with
latin1, and they do not often clash: they differ only in the range
0x80-0x9F, where CP1252 places printable characters such as the euro
sign and the curly quotes.

In modern applications, the ``Unicode'' standard is the preferred
character set. Consequently, it is recommended to always use utf8x (or
utf8 when it is sufficient, as discussed above), whatever the
platform, as it provides the best (i.e. the most complete) set of
characters. Sticking with it means that you never have to change your
character encoding again, as Unicode is the future, for many reasons
which are beyond the scope of this text. To use it, simply give utf8x
as the option of inputenc.

2) The Proper Encoding of the Document File
-------------------------------------------

To avoid clashes, the best thing is to keep your document file in the
same encoding as the one given as the option of inputenc. This is
easier to do when you are working with Linux.

2.1) Directly Writing Characters Without Commands
-------------------------------------------------

Directly writing characters without their associated commands (for
example, typing ``é'', an e with an acute accent, instead of \'e) is
strongly discouraged. With characters such as ``é'' there is usually
no problem, but when you begin using French quotes, as in « Mot »,
and type « and » directly, they may not be rendered at all, may cause
errors, etc., depending on your local setup.

There are actually three kinds of people:

  a. Those who stick with commands. These are the best ones: commands
     will always be valid, and, if one is deprecated one day,
     redefining it with \renewcommand or a similar construct will be
     no problem,

  b. Those who use commands only when necessary. These are people who
     try to see which characters from their keyboard are directly
     rendered and which ones are not. They type the former directly
     and use commands for the latter. This is not the best approach,
     as it is extremely tedious, difficult, error-prone, and not the
     aim of LaTeX,

  c. Those who use various tricks to make LaTeX behave as they want,
     even if they want things that are contrary to the
     state-of-the-art. This is not the best solution, as their tricks
     are very likely to become useless later on.

Let's take an example: if | is not supported on your system, |x| may
well be rendered as a plain x. Yet, if you put bars around the x,
there must be a reason, so it is better for these bars not to
disappear from your page. If you want to be sure that they will always
be there, a better idea is to use commands: write \lvert x \rvert
(these two commands are provided by the amsmath package).

The best thing which can be recommended is evidently to stick with
commands, as demonstrated above. One sometimes needs to load packages
for other symbols, but this is better than the two other approaches,
namely using commands only when necessary, and using various
non-standard tricks.

2.2) Converting a File to the Right Encoding
--------------------------------------------

If, say, you are dealing with documents in another encoding, the best
thing is to use the following procedure, assuming you are working with
Linux (or with a Linux virtual machine):

  a. Find their current encoding,

  b. Know what their future encoding will be,

  c. Be sure that the encodings are compatible (i.e. that both
     character sets contain the same symbols, even if they are
     encoded differently). If this is not the case, you will lose
     information,

  d. Execute the following, the sample file being
     fileinoldencoding.tex, and the same file in its new encoding
     being fileinnewencoding.tex:

===
iconv -f oldencoding fileinoldencoding.tex -o fileinnewencoding.tex
===

     where oldencoding could be, for example, windows-1252. This is
     the same as redirecting the output using

===
iconv -f oldencoding fileinoldencoding.tex > fileinnewencoding.tex
===

     You might make this process automatic, e.g. by creating a shell
     script (here it is bash):

===
#!/bin/bash
# Convert every .tex file of the current folder from windows-1252 to UTF-8.
for i in *.tex; do
    # Write to a temporary .utf8 file; replace the original only if iconv succeeded.
    iconv -f windows-1252 -t utf-8 "$i" -o "$i.utf8" && mv "$i.utf8" "$i"
done
===

     and executing this script in a folder containing .tex files.
     Here, the .utf8 extension is only used for a temporary file:
     each original file is overwritten by its converted version, so
     keep a backup. You can evidently modify this script or the
     aforementioned commands as you want, to use another encoding. If
     -t is omitted, as in the first two commands, the output encoding
     is the one of your current locale (UTF-8 on most modern Linux
     systems). If you want the encoding newencoding, use

===
-t newencoding
===

  e. Open the file in an editor, the editor being set to open files
     in the output encoding,

  f. If you see strange characters, something went wrong: check the
     procedure again. If everything seems normal, you can modify the
     file and save the modifications, making sure that everything is
     saved in the new encoding,

  g. Compile the file(s) with the right inputenc declaration, as
     explained above.

3) Dealing With .bib Files
--------------------------

If you are using BibTeX, you may also have problems, especially if you
are switching from Microsoft Windows to Linux, or dealing with files
travelling between both OSes. The main problem is that BibTeX is not
really good at dealing with Unicode (and consequently UTF-8,
etc.). Consequently, the best suggestion, if you do not want to use
an alternative to BibTeX, is to

  a. Follow the same recommendations as before for characters: always
     write \'e for ``é'', and do the same for other characters,

  b. Either keep the .bib file in latin1 encoding, and consequently
     use

===
\begingroup
\inputencoding{latin1}
\bibliography{bibliography}
\endgroup
===

     in your .tex file, or convert your .bib file to UTF-8. In the
     latter case, you need neither the \inputencoding declaration nor
     the group markers. Note that it is really important to use \'e
     and the other commands before the conversion. If you do not work
     this way, strange characters are likely to appear after the
     conversion. That is one of the reasons which justify the use of
     such commands. If you write a document under Microsoft Windows,
     there will not be any problem as long as you keep the same
     encoding,

  c. Note that the same remark applies even if you keep your file in
     latin1 and use the above code: use \'e for ``é'', and other
     commands to display related symbols. If you stick with Microsoft
     Windows, there will, roughly, not be any problem, and you can
     type everything with your keyboard.

==

-- 
Merciadri Luca
See http://www.student.montefiore.ulg.ac.be/~merciadri/
-- 
Repetitio Mater Memoriae.
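
A possible complement to step c of section 2.2 (being sure that no information is lost): the following is only a sketch, assuming a bash shell where iconv and cmp are available. The function name roundtrip_ok is hypothetical and not part of any existing tool. It converts a file to the target encoding and back again, and succeeds only if the result is byte-for-byte identical to the original, which indicates that every character of this particular file exists in both encodings.

```bash
#!/bin/bash
# Hypothetical helper (not part of iconv): returns 0 if FILE survives the
# round trip FROM -> TO -> FROM byte for byte, and non-zero otherwise.
roundtrip_ok() {
    local from="$1" to="$2" file="$3"
    # If either conversion fails, its output is truncated or empty,
    # so the final comparison with the original fails as well.
    iconv -f "$from" -t "$to" "$file" 2>/dev/null \
        | iconv -f "$to" -t "$from" 2>/dev/null \
        | cmp -s - "$file"
}
```

For instance, running roundtrip_ok windows-1252 utf-8 fileinoldencoding.tex before the iconv command of step d tells you whether the conversion is safe for that file. The check is per file rather than per character set, which is a deliberate simplification.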