
Archive-name: internationalization/iso-8859-1-charset
Posting-Frequency: monthly
Version: 2.9887


ISO 8859-1 National Character Set FAQ

Michael K. Gschwind

<mi...@vlsivie.tuwien.ac.at>


DISCLAIMER: THE AUTHOR MAKES NO WARRANTY OF ANY KIND WITH REGARD TO
THIS MATERIAL, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.

Note: Most of this was tested on a Sun 10 running SunOS 4.1.*; other
systems might differ slightly.

This FAQ discusses topics related to the use of ISO 8859-1 based 8 bit
character sets. It discusses how to use European (and Latin American)
national character sets on UNIX-based systems and the Internet.

If you need to use a character set other than ISO 8859-1, much of
what is described here will be of interest to you. However, you will
need to find appropriate fonts for your character set (see section 17)
and input mechanisms adapted to your language.

1. Which coding should I use for accented characters?
Use the internationally standardized ISO-8859-1 character set to type
accented characters. This character set contains all characters
necessary to type all major (West) European languages. This encoding
is also the preferred encoding on the Internet. ISO 8859-X character
sets use the characters 0xa0 through 0xff to represent national
characters, while the characters in the 0x20-0x7f range are those used
in the US-ASCII (ISO 646) character set. Thus, ASCII text is a proper
subset of all ISO 8859-X character sets.

The characters 0x80 through 0x9f are earmarked as extended control
characters, and are not used for encoding printable characters. A
practical reason for this is interoperability with 7 bit devices (or
with faulty software that strips the 8th bit): if these positions held
printable characters, stripping the 8th bit would turn them into
control characters (0x00 to 0x1f), which could put the receiving
device into an undefined state. (When the 8th bit gets stripped from
the characters at 0xa0 to 0xff, a wrong character is represented, but
this cannot change the state of a terminal or other device.)
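
For example, the character ä is encoded at 0xE4; stripping the 8th bit
yields 0x64, the letter 'd'. You can simulate this effect with tr (a
minimal sketch, assuming an 8 bit clean terminal and shell):
----------------------------------
$ echo 'Käse' | tr '\240-\377' '\040-\177'   # clear the 8th bit
Kdse
----------------------------------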

This character set is also used by AmigaDOS, MS-Windows, VMS (DEC MCS
is practically equivalent to ISO 8859-1) and (practically all) UNIX
implementations. MS-DOS normally uses a different character set and
is not compatible with this character set. (It can, however, be
translated to this format with various tools. See section 5.)

Footnote: Supposedly, IBM code page 819 is fully ISO 8859-1 compliant.


ISO 8859-1 supports the following languages:
Afrikaans, Basque, Catalan, Danish, Dutch, English, Faeroese, Finnish,
French, Galician, German, Icelandic, Irish, Italian, Norwegian,
Portuguese, Spanish and Swedish.

(Reportedly, Welsh cannot be handled due to the missing ŵ and ŷ (w and
y with circumflex).)

(It has been called to my attention that Albanian can be written with
ISO 8859-1 also. However, from a standards point of view, ISO 8859-2
is the appropriate character set for Balkan countries.)

ISO 8859-1 is just one part of the ISO-8859 standard, which specifies
several character sets:
8859-1 Europe, Latin America
8859-2 Eastern Europe
8859-3 SE Europe/miscellaneous (Esperanto, Maltese, etc.)
8859-4 Scandinavia/Baltic (mostly covered by 8859-1 also)
8859-5 Cyrillic
8859-6 Arabic
8859-7 Greek
8859-8 Hebrew
8859-9 Latin5, same as 8859-1 except for Turkish instead of Icelandic
8859-10 Latin6, for Lappish/Nordic/Eskimo languages

Unicode is advantageous because one character set suffices to encode
all the world's languages; however, very few programs (and even fewer
operating systems) support wide characters. Thus, only 8 bit wide
character sets (such as the ISO 8859-X sets) can be used with these
systems. Unfortunately, some programmers still insist on using the
`spare' eighth bit for clever tricks, crippling these programs such
that they can process only US-ASCII characters.


Footnote: Some people have complained about missing characters,
e.g. French users about a missing 'oe'. Note that oe is
not a character, but a typographical ligature (a combination of two
characters for typographical purposes). Ligatures are not
part of the ISO 8859-X standard. (Although 'oe' used to
be in the draft 8859-1 standard before it was unmasked as
a `mere' ligature.)

Two stories exist for the removal of the oe:
(1) argues that in the final session, the French admitted
that oe was only a ligature. This prompted the
committee to remove it.
(2) argues that the French member missed the session and the
members from the other countries simply decided to
remove it. (If this is true, where were the Swiss and
Belgians?)

Note that the oe ligature is different from the 'historical
ligature' æ, which is now considered a letter in Nordic
countries and cannot be replaced by the letters 'ae'.

2. Getting your terminal to handle ISO characters.
Terminal drivers normally do not pass 8 bit characters. To enable
proper handling of ISO characters, add the following lines to your
.cshrc:
----------------------------------
tty -s
if ($status == 0) stty cs8 -istrip -parenb
----------------------------------
If you don't use csh, add equivalent code to your shell's start up
file.

Note that it is necessary to check whether your standard I/O streams
are connected to a terminal. Only then should you reconfigure the
terminal driver. Note that tty checks stdin, but stty changes stdout.
This is OK in normal code, but if the .cshrc is executed in a pipe,
you may get spurious warnings :-(

If you use the Bourne Shell or descendants (sh, ksh, bash,
zsh), use this code in your startup (e.g. .profile) file:
----------------------------------
tty -s
if [ $? = 0 ]; then
stty cs8 -istrip -parenb >&0
fi
----------------------------------

Footnote: In the /bin/sh version, we redirect stdout to stdin, so both
tty and stty operate on stdin. This resolves the problem discussed for
the /bin/csh script version. A possible workaround for csh users is to
use the following code in .cshrc, which spawns a Bourne shell (/bin/sh)
to handle the redirection:
----------------------------------
tty -s
if ($status == 0) sh -c "stty cs8 -istrip -parenb >&0"
----------------------------------

3. Getting the locale setting right.
For the ctype macros (and by extension, applications you are running
on your system) to correctly identify accented characters, you
may have to set the ctype locale to an ISO 8859-1 conforming
configuration. On SunOS, this may be done by placing
------------------------------------
setenv LANG C
setenv LC_CTYPE iso_8859_1
------------------------------------
in your .login script (if you use the csh). An equivalent statement
will adjust the ctype locale for non-csh users.
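
For Bourne shell users, a minimal equivalent for the .profile would be:
------------------------------------
LANG=C; export LANG
LC_CTYPE=iso_8859_1; export LC_CTYPE
------------------------------------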

The process is the same for other operating systems, e.g. on HP/UX use
'setenv LANG german.iso88591'; on IRIX 5.2 use 'setenv LANG de'; on Ultrix 4.3
use 'setenv LANG GER_DE.8859' and on OSF/1 use 'setenv LANG
de_DE.88591'. The examples given here are for German. Other
languages work too, depending on your operating system. Check out
'man setlocale' on your system for more information.

*****If you can confirm or deny this, please let me know.*****
Currently, each system vendor has his own set of locale names, which
makes portability a bit problematic. Supposedly there is some X/Open
document specifying a

<language>_<country>.<character_encoding>

syntax for environment variables specifying a locale, but I'm unable
to confirm this.

While many vendors now use the <language>_<country> encoding, there
are many different encodings for languages and countries.

Many vendors seem to use some derivative of this encoding:
It looks as if <language> is the two-letter code for the language from
ISO 639, and <country> is the two-letter code for the country from ISO
3166, but I don't know of any standard specifying <character_encoding>.

An appropriate name source for the <character_encoding> part of the
locale name would be to use the character set names specified in RFC
1345 which contains names for all standardized character sets.
(Preferably, the canonical name and all aliases should be accepted,
with the canonical name being the first choice.) Using this
well-known character set repository as name source would bring an end
to conflicting names, without the need to introduce yet another
character set directory with the inherent dangers of inconsistency and
duplicated effort.
*****If you can confirm or deny this, please let me know.*****

Footnote on HP/UX systems:
As of 10.0, you can use either german.iso88591 or de_DE.iso88591 (a
name more in line with other vendors and developing standards for
locale names). For a complete listing of locale names, see the text
file /usr/lib/nls/config. Or, on HP-UX 10.0, execute locale -a . This
command will list all locales currently installed on your system.

4. Selecting the right font under X11 for xterm (and other applications)
To actually display accented characters, you need to select a font
which contains bitmaps for ISO 8859-1 characters in the
correct character positions. The names of these fonts normally
have the suffix "iso8859-1". Use the command
# xlsfonts
to list the fonts available on your system. You can preview a
particular font with the
# xfd -fn <fontname>
command.
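
For example, to list only the ISO 8859-1 fonts, you can pass a
wildcard pattern to xlsfonts (a sketch):
# xlsfonts -fn '*-iso8859-1'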

Add the appropriate font selection to your ~/.Xdefaults file, e.g.:
----------------------------------------------------------------------------
XTerm*Font: -adobe-courier-medium-r-normal--18-180-75-75-m-110-iso8859-1
Mosaic*XmLabel*fontList: -*-helvetica-bold-r-normal-*-14-*-*-*-*-*-iso8859-1
----------------------------------------------------------------------------

While X11 is further along than most system software when it comes to
internationalization, it still contains many bugs. A number of bug
fixes can be found at URL http://www.dtek.chalmers.se:80/~maf/i18n/.

Footnote: The X11R5 distribution has some fonts which are labeled as
ISO fonts, but which contain only the US-ASCII characters.

5. Translating between different international character sets.
While ISO 8859-1 is an international standard, not everybody uses this
encoding. Many computers use their own, vendor-specific character sets
(most notably Microsoft for MS-DOS). If you want to edit or view files
written in a different encoding, you will have to translate them to an
ISO 8859-1 based representation.

There are several PD/free character set translators available on the
Internet, the most notable being 'recode'. recode is available from
URL ftp://prep.ai.mit.edu/u2/emacs. recode is covered by FSF
copyright and is freely redistributable.

The general format of the program call is:

recode [OPTION]... [BEFORE]:[AFTER] [FILE]

Each FILE will be read assuming it is coded with charset BEFORE and
will be recoded over itself so as to use the charset AFTER. If there
is no such FILE, the program instead acts as a filter and recodes
standard input to standard output.

Some recodings are not reversible, so after you have converted the
file (recode overwrites the original file with the new version!), you
may never be able to reconstruct the original file. A safer way of
changing the encoding of a file is to use the filter mechanism of
recode and invoke it as follows:

recode [OPTION]... [BEFORE]:[AFTER] <[OLDFILE] >[NEWFILE]
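
For example, to convert an MS-DOS text file to ISO 8859-1 while
keeping the original (the file names are hypothetical; the charset
names are those used by recode):

recode ibm-pc:latin-1 <report.dos >report.iso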

Under SunOS, the dos2unix and unix2dos programs (distributed with
SunOS) will translate between MS-DOS and ISO 8859-1 formats.

It is somewhat more difficult to convert German, `Duden'-conformant
Ersatzdarstellung (ä = ae, ß = ss or sz etc.) into the ISO 8859-1
character set. The German dictionary available as URL
ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit/dicts/deutsch.tar.gz also
contains a UNIX shell script which can handle all conversions except
ones involving ß (German scharfes-s), as the change for `ss' is more
complicated.

A more sophisticated program to translate Duden Ersatzdarstellung to
ISO 8859-1 is Gustaf Neumann's diac program (version 1.3 or later)
which can translate all ASCII sequences to their respective ISO 8859-1
character set representation. 'diac' is available as URL
ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit/diac.

Translating ISO 8859-1 to ASCII can be performed with a little sed
script tailored to your needs (a sketch follows the list below). But
be aware that
* No one-to-one mapping between Latin 1 and ASCII strings is possible.
* Text layout may be destroyed by multi-character substitutions,
especially in tables.
* Different replacements may be in use for different languages,
so no single standard replacement table will make everyone happy.
* Truncation or line wrapping might be necessary to fit textual data
into fields of fixed width.
* Reversing this translation may be difficult or impossible.
* You may be introducing ambiguities into your data.
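
With these caveats in mind, here is a minimal sed sketch for German
text (one possible replacement table among many; adapt it to your
needs):
---------------------
sed -e 's/ä/ae/g' -e 's/ö/oe/g' -e 's/ü/ue/g' \
    -e 's/Ä/Ae/g' -e 's/Ö/Oe/g' -e 's/Ü/Ue/g' \
    -e 's/ß/ss/g'
---------------------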

6. Printing accented characters.

6.1 PostScript printers
If you want to print accented characters on a PostScript printer, you
may need a PS filter which can handle ISO characters.

Our PostScript filter of choice is a2ps, more recent versions of
which can handle ISO 8859-1 characters with the -8 option. a2ps V4.3
is available as URL ftp://imag.imag.fr/archive/postscript/a2ps.V4.3.tar.gz.

If you use the pps postscript filter, use the 'pps -ISO' option for
pps to handle ISO 8859-1 characters properly.


6.2 Other (non-PS) printers:
If you want to print to non-PS printers, your success rate depends on
the encoding the printer uses. Several alternatives are possible:

* Your printer accepts ISO 8859-1:
You're lucky. No conversion is needed, just send your files to the
printer.


* Your printer supports a PC-compatible font:
You can use the recode tool to translate from ISO 8859-1 to this
encoding. (If you are using a SunOS based computer, you can also use
the unix2dos utility which is part of the standard distribution.)
Just add the appropriate invocation as a built-in filter to your
printer driver.

At our site, we use the following configuration to print ISO 8859-1
characters on an IBM Proprinter XL :

/etc/printcap

lp|isolp|Line Printer with ISO-8859-1:\
:lp=/dev/null:\
:sd=/usr/spool/lpd/lp:mx#0:if=/usr/spool/lpd/iso2dos.sh:rs:
rawlp|Lineprinter:\
:lp=:rm=lphost.vlsivie.tuwien.ac.at:rp=lp:sd=/usr/spool/lpd/rawlp:rs:

/usr/spool/lpd/iso2dos.sh

#!/bin/sh
# lpd input filter: recode ISO 8859-1 input to the IBM PC character
# set and pass it on to the raw printer queue.
if /usr/local/gnu/bin/recode latin-1:ibm-pc | /usr/ucb/lpr -Prawlp
then
exit 0
else
exit 1
fi


* Your printer uses a national ISO 646 variant (7 bit ASCII
with some special characters replaced by national characters):
You will have to use a translation tool; this tool would
then be installed in the printer driver and translate character
conventions before sending a file to the printer. The recode
program supports many national ISO 646 norms. (If you add such a
translation, please submit it to the maintainers of recode, so that
it can benefit everybody.)

Unfortunately, you will not be able to display all characters with
the built-in character set. Most printers have user-definable
bit-map characters, which you can use to print all ISO characters.
You just have to generate a bitmap for any particular character and
send this bitmap to the printer. The syntax for these characters
varies, but a few conventions have gained universal acceptance
(e.g., many printers can process Epson-compatible escape sequences).


* Your printer supports a strange format:
If your printer supports some other strange format (e.g. HP Roman8,
DEC MCS, Atari, NeXTStep, EBCDIC or what have you), you have to add a
filter which will translate ISO 8859-1 to this encoding before
sending your data to the printer. 'recode' supports many of these
character sets already. If you have to write your own conversion
tool, consider this as a good starting base. (If you add support for
any new character sets, please submit your code changes to the
maintainers of recode).

If your printer supports DEC MCS, it is nearly equivalent to ISO
8859-1 (DEC MCS is actually a former ISO 8859-1 draft standard; the
only characters which are missing are the Icelandic characters, eth
and thorn, at locations 0xD0, 0xF0, 0xDE and 0xFE). Since the
difference is only a few characters, you could probably get by with
just sending ISO 8859-1 to the printer.


* Your printer supports ASCII only:
You have several options:
+ If your printer supports user-defined characters, you can print all
ISO characters not supported by ASCII by sending the appropriate
bitmaps. You will need a filter to convert ISO 8859-1 characters
to the appropriate bitmaps. (A good starting point would be recode.)
+ Add a filter to the printer driver which will strip the accent
characters and just print the unaccented characters. (This
character set is supported by recode under the name `flat' ASCII.)
+ Add a filter which will generate escape sequences (such as
" <BACKSPACE> a for Umlaut-a (ä), etc.) to be printed. Recode
supports this encoding under the name `ascii-bs'.

Footnote: For more information on character translation and the
'recode' tool, see section 5.

7. TeX and ISO 8859-1
If you want to write TeX without having to type {\"a}-style escape
sequences, you can either get a TeX version configured to read 8-bit
ISO characters, or you can translate between ISO and TeX codings.

The latter is arduous if done by hand, but can be automated if you use
emacs. If you use Emacs 19.23 or higher, simply add the following line
to your .emacs startup file. This mode will perform the necessary
translations for you automatically:
------------------
(require 'iso-cvt)
------------------

If you are using pre-19.23 versions of emacs, get the "gm-lingo.el"
lisp file via URL ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit. Load
gm-lingo from your .emacs startup file and this mode will perform the
necessary translations for you automatically.

If you want to configure TeX to read 8 bit characters, check out the
configuration files available in URL
ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit.

In LaTeX 2.09 (or earlier), use the isolatin or isolatin1 styles to
include support for ISO latin1 characters. Use the following
documentstyle definition:
\documentstyle[isolatin]{article}

isolatin.sty and isolatin1 are available from all CTAN servers and
from URL ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit. (The isolatin1
version on vlsivie is more complete than the one on CTAN servers.)

There are several possibilities in LaTeX 2e to provide comprehensive
support for 8 bit characters:

The preferred method is to use the inputenc package with the latin1
option. Use the following package invocation to achieve this:
\usepackage[latin1]{inputenc}

The inputenc package should be the first package to be included in the
document. For a more detailed discussion, check out URL
ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit/latex2e.ps (in German).
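
A minimal LaTeX 2e document using this method might look as follows
(the body text is just an example):
----------------------------------
\documentclass{article}
\usepackage[latin1]{inputenc}
\begin{document}
Grüße aus Österreich!
\end{document}
----------------------------------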

Alternatively, the styles used for earlier LaTeX versions (see above)
can also be used with 2e. To do this, use the commands:
\documentclass{article}
\usepackage{isolatin}


You can also get the latex-mode to handle opening and closing quotes
correctly for your language. This can be achieved by defining the
emacs variables 'tex-open-quote' and 'tex-close-quote'. You can
either set these variables in your ~/.emacs startup file or as a
buffer-local variable in your TeX file if you want to define quotes on
a per-file basis.

For German TeX quotes, use:
-----------
(setq tex-open-quote "\"`")
(setq tex-close-quote "\"'")
-----------

If you want to use French quotes (guillemets), use:
-----------
(setq tex-open-quote "«")
(setq tex-close-quote "»")
-----------

BibTeX has some problems with 8 bit characters, especially when they
are used as keys. BibTeX 1.0, when it eventually comes out (most
likely some time in 1996), will support 8-bit characters.

8. ISO 8859-1 and emacs
Emacs 19 (as opposed to Emacs 18) can automatically handle 8 bit
characters. (If you have a choice, upgrade to Emacs version 19.23,
which has the most complete ISO support.) Emacs 19 has extensive
support for ISO 8859-1. If your display supports ISO 8859-1 encoded
characters, add the following line to your .emacs startup file:
-----------------------------
(standard-display-european t)
-----------------------------

If you want to display ISO-8859-1 encoded files by using TeX-like
escape sequences (e.g. if your terminal supports only ASCII
characters), you should add the following line to your .emacs file
(DON'T DO THIS IF YOUR TERMINAL SUPPORTS ISO OR SOME OTHER ENCODING OF
NATIONAL CHARACTERS):
--------------------
(require 'iso-ascii)
--------------------

If your terminal supports a non-ISO 8859-1 encoding of national
characters (e.g. 7 bit national variant ISO 646 character sets,
aka. `national ASCII' variants), you should configure your own display
table. The standard emacs distribution contains a configuration
(iso-swed.el) for terminals which have ASCII in the G0 set and a
Swedish/Finnish version of ISO 646 in the G1 set. If you want to
create your own display table configuration, take a look at this
sample configuration and at disp-table.el for available support
functions.


Emacs can also accept 8 bit ISO 8859-1 characters as input. These
character codes might either come from a national keyboard (and
driver) which generates ISO-compliant codes, or may have been entered
by use of a COMPOSE-character mechanism.
If you use such an input format, execute the following expression in
your .emacs startup file to enable Emacs to understand them:
-------------------------------------------------
(set-input-mode (car (current-input-mode))
                (nth 1 (current-input-mode))
                0)
-------------------------------------------------

In order to configure emacs to handle commands operating on words
properly (such as 'Beginning of word', etc.), you should also add the
following line to your .emacs startup file:
-------------------------------
(require 'iso-syntax)
-------------------------------

This lisp script will change character attributes such that ISO 8859-1
characters are recognized as such by emacs.


For further information on using ISO 8859-1 with emacs, also see the
Emacs manual section on "European Display" (available as hypertext
document by typing C-h i in emacs or as a printed version).


If you need to edit text in a non-European language (Arabic, Chinese,
Cyrillic-based languages, Ethiopic, Korean, Thai, Vietnamese, etc.),
MULE (URL ftp://etlport.etl.go.jp/pub/mule) is a Multilingual
Enhancement to GNU Emacs which supports these languages.

9. Typing ISO with US-style keyboards.
Many computer users use US-ASCII keyboards, which do not have keys for
national characters. You can use escape sequences to enter these
characters. For ASCII terminals (or PCs), check the documentation of
your terminal for particulars.


9.1 US-keyboards under X11
Under X Windows, the COMPOSE multi-language support key can be used to
enter accented characters. Thus, when running X11 on a SunOS-based
computer (or any other X11R4 or X11R5 server supporting COMPOSE
characters), you can type three character sequences such as
COMPOSE " a -> ä
COMPOSE s s -> ß
COMPOSE ` e -> è
to type accented characters.

Note that this COMPOSE capability has been removed as of X11R6,
because it does not adequately support all the languages in the world.
Instead, compose processing is supposed to be performed in the client
using an `input method', a mechanism which has been available since
X11R5. (In the short term, this is a step backward for European
users, as few clients support this type of processing at the moment.
It is unfortunate that the X Consortium did not implement a mechanism
which allows for a smoother transition. Even the xterm terminal
emulator supplied by the X Consortium itself does not yet support this
mechanism!)

Input methods are controlled by the locale environment variables (LANG
and LC_xxx). The values for these variables are equivalent (or at
least should be made equivalent by any sane vendor) to those expected
by the ANSI/POSIX locale library. For a list of possible settings see
section 3.

9.2 US-keyboards and emacs
9.2.1 Using ALT for composing national characters
There are several modes to enter Umlaut characters under emacs when
using a US-style keyboard. One such mode is iso-transl, which is
distributed with the standard emacs distribution. This mode uses the
Alt-key for entering diacritical marks (accents et al.).

To activate iso-transl mode, add the following line to your .emacs
setup file:
(require 'iso-transl)

As of emacs 19.29, Alt-sequences optimized for a particular language
are available. Use the following call in .emacs to select your
favorite keybindings:
(iso-transl-set-language "German")

If you do not have an Alt-key on your keyboard, you can use the C-x 8
prefix to access the same capabilities.

For pre-19.29 versions, similar functionality is available as the
extended iso-transl mode (iso-transl+), which allows the definition
of language-specific shortcuts; it is available as URL
ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit/iso-transl+.shar. This file
also includes sample configurations for the German and Spanish
languages.


9.2.2 Electric Accents
An alternative to using Alt-sequences for entering diacritical marks
is the use of `electric accents', such as used on old type writers or
under many MS Windows programs. With this method, typing an accent
character will place this accent on the next character entered. One
mode which supports this entry method is the iso-acc minor mode which
comes with the standard emacs distribution. Just add
------------------
(require 'iso-acc)
------------------
to your emacs startup script, and you can turn the '`~/^" keys into
electric accents by typing 'M-x iso-accents-mode' in a specific
buffer. To type the ç (c with cedilla) and ß (German scharfes s)
characters, type ~c and "s, respectively.

Footnote: When starting up under X11, Emacs looks for a Meta key; if
it finds no Meta key, it will use the Alt key instead. The way to
solve this problem is to define a Meta key using the xmodmap utility
which comes with X11.

10. File names with ISO characters
If your OS is 8 bit clean, you can use ISO characters in file names.
(This is possible under SunOS.)
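
For example, on an 8 bit clean system (the file name is chosen purely
for illustration):
----------------------------------
touch Grüße
ls G*
----------------------------------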

11. Command names with ISO 8859-1
If your OS supports file names with ISO characters, and your shell is
8 bit clean, you can use command names containing ISO characters. If
your shell does not handle ISO characters correctly, use one of the
many PD shells which do (e.g. tcsh, an extended csh). These are
available from a multitude of ftp sites around the world.

See section 14 on application specific information for a discussion of
various shells.

12. Spell checking
Ispell 3.1 has by far the best understanding of non-English
languages and can be configured to handle 8-bit characters
(thus, it can handle ISO-8859-1 encoded files).

Ispell 3.1 now comes with hash tables for several languages (English,
German, French,...). It is available via URL ftp://ftp.cs.ucla.edu/pub.
Ispell also contains a list of international dictionaries and
information about their availability in the file ispell/languages/Where.

To choose a dictionary for ispell, use the `-d <dictionary>'
option. The `-T <input-encoding>' option should be set to `-T
latin1' if you want to use ISO 8859-1 as input encoding.
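
For example, to spell check a Latin-1 encoded German file (assuming
the German dictionary is installed under the name `deutsch'; the file
name is hypothetical):
----------------------------------
ispell -d deutsch -T latin1 brief.txt
----------------------------------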

If you use ispell inside emacs (using the ispell.el mode) to spell
check a buffer, you can choose language and input encoding either
using the `M-x ispell-change-dictionary' function, or by choosing the
`Spell' item in the `Edit' pull-down menu. This will present you with
a choice of dictionaries (cum input encodings): all languages are
listed twice, such as in `Deutsch' and `Deutsch8'. `Deutsch8' is the
setting which will use the German dictionary and the 8 bit ISO 8859-1
input encoding.

Alternatively, ispell.el lets you specify the dictionary to use for a
particular file at the end of that file by adding a line such as
----
Local IspellDict: castellano8
----

The following sites also have dictionaries for ispell available via
anonymous ftp:
language    site                      file name
French      ireq-robot.hydro.qc.ca    /pub/ispell
French      ftp.inria.fr              /INRIA/Projects/algo/INDEX/iepelle
French      ftp.inria.fr              /gnu/ispell3.0-french.tar.gz
German      ftp.vlsivie.tuwien.ac.at  /pub/8bit/dicts/deutsch.tar.gz
Spanish     ftp.eunet.es              /pub/unix/text/TeX/spanish/ispell
Portuguese  http://www.di.uminho.pt/~jj/pln/pln.html

Some spell checkers use strange encodings for accented characters. If
you have to use one of these spell checkers, you may have to run
recode before invoking the spell checker to generate a file using your
spell checker's coding conventions. After running the spell checker,
you have to translate the file back to ISO with recode.

Of course, this can be automated with a shell script (here looping
over its file arguments):
---------------------
for i in "$@"
do
  recode <options to generate spell checker encoding from ISO> $i tmp.file
  spell_check tmp.file
  recode <options to generate ISO from spell checker encoding> tmp.file $i
done
---------------------

Footnote: Ispell 4.* is not a superset of ispell 3.*. Ispell 4.* was
           developed independently from a common ancestor and DOES NOT
           support any internationalization; it is restricted to the
           English language.

13. TCP and ISO 8859-1
TCP was specified by US-Americans, for US-Americans. TCP still carries
this heritage: while the TCP/IP protocol itself *is* 8 bit clean, no
effort was made to support the transfer of non-English characters in
many application level protocols (mail, news, etc.). Some of these
protocols still only specify the transfer of 7-bit data, leaving
anything else implementation dependent.

Since the TCP/IP protocol itself transfers 8 bit data correctly,
writing applications based on TCP/IP does not lead to any loss of
encoding information.


13.1 FTP and ISO 8859-1
Transmitting data via FTP is an interesting issue, depending on what
system you use, how the relevant RFCs are interpreted, and what is
actually implemented.

If you transfer data between two hosts using the same ISO 8859-1
representation (such as two Unix hosts), the safest solution is to
specify 'binary' transmission mode.
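
In most command line FTP clients, this is done with the `binary'
command:
----------------------------------
ftp> binary
200 Type set to I.
----------------------------------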

Note, however, that use of the binary mode for text files will disable
translation between the line-ending conventions of different operating
systems. You might have to provide some filter to convert between the
LF-only convention of Unix and the CR-LF convention of VMS and MS
Windows when you copy from one of these systems to another.

If the FTP server and client computers use different encodings, there
are two possible approaches:
* Transfer all data as binary data, then convert the format using a
conversion tool such as recode to translate the transferred data.
* Specify an ASCII connection, and have your FTP server and client
convert the encoding automatically.

While the first approach always works, it is somewhat cumbersome if
you transmit a lot of data. The second solution is much more
comfortable, but it depends on your client (and server) to take care
of the appropriate character translations. Since there is no
universal standard for network characters beyond ASCII (NVT-ASCII as
specified in RFC 854), this depends on the attitude of your software
vendor.

Most Apple Macintosh network software is configured to treat all
network data as having ISO 8859-1 encoding and automatically
translates from and to the internal MacOS data representation. (This
can be problematic, if you want to send or receive text using the
Macintosh character set. The correct solution to this problem is
to use MIME.)

MS-DOS programs are much less well-behaved, and you have to test
whether your particular FTP program performs conversion.

An additional issue with the automatic translation is how to translate
unavailable characters. If FTP is used to store and retrieve data,
the original file should be re-constructable after conversion. If
data is to be printed or processed, different encodings (e.g. graphic
approximation of characters) may be necessary. (See the section on
character set translation for a full discussion of encoding
transformations.)

A second, optional parameter is possible for 'type ascii' commands,
which specifies whether the data is for non-printing or printing
purposes. Ideally, FTP servers on non-ISO 8859-1 systems would use
this parameter to determine whether to use an invertible encoding or
graphical and/or logical approximation during translation. (Although
RFC 959, section 3.1.1.5 does not require this.)


13.2 Mail and ISO 8859-1
Most Internet eMail standards come from a time when the Internet was a
mostly-US phenomenon. Other countries did have access to the net, but
much of the communication was in English nevertheless. With the
propagation of the Internet, these standards have become a problem for
languages which cannot be represented in a 7 bit ISO 646 character
set.

Using ISO 646, which uses a slightly different character set for each
language, also poses a problem when crossing a language barrier, as
the interpretation of characters will change. As a result, most
countries use the ISO 646 standard commonly referred to as US-ASCII
and will use escape sequences such as 'e (é) or "a (ä) to refer to
national characters. The exceptions to this rule are the Nordic
countries (more so Sweden and Finland, less so Denmark and Norway,
I'm told), where the national ISO 646 variant has garnered a
formidable following and is a common reference point for all Nordic
users.

There are several languages for which there are not enough
replacement characters to code all national variants (e.g. French).

Footnote:
Hence, French has not followed the Nordic track. The French
net-convention is e' instead of 'e ("l''el'ephant" makes for strange
spelling), and many think that this writing is very ugly anyway and
drop the accents altogether, but this sometimes makes the text funny,
and at the least incorrect.


As this situation is clearly unsatisfactory, several methods of
sending mails encoded in national character sets have been developed.
We start with a discussion of the mail delivery infrastructure and
will then look at some high-level protocols which can protect mail
users and their messages from the shortcomings of the underlying mail
protocols.

Footnote: Many other email standards exist for proprietary systems.
If you use one of these mail systems, it is the responsibility of the
mail gateway to translate your messages to an appropriate Internet
mail message when you send a message to the Internet.


13.2.1 Mail Transfer Agents and the Internet Mail Infrastructure
The original SMTP specification in RFC 821
specified the transfer of only 7 bit messages. Many sendmail
implementations have since been made 8 bit transparent (see RFC 1428),
but some SMTP handling agents still conform strictly to the
(somewhat outdated) RFC 821 and intentionally cut off the 8th bit.
This behavior stymies all efforts to transfer messages containing
national characters. Thus, only if all SMTP agents between mail
originator and mail recipient are 8 bit clean will messages be
transferred correctly. Otherwise, accented characters are mapped to
some ASCII character (e.g. Umlaut a -> 'd'), but the rest of the
message is still transferred correctly.

A new, enhanced (and compatible) SMTP standard, ESMTP, has been
released as RFC 1425. This standard defines and standardizes 8 bit
extensions. This should be the mail protocol of choice for newly
shipped versions of sendmail.

Much of the European and Latin American network infrastructure
supports the transfer of 8 bit mail messages; the success rate is
somewhat lower for the US.

DEC Ultrix sendmail still implements the somewhat outdated RFC 821 to
the letter, and thus cuts off the eighth bit of all mail passing
through it. Thus ISO encoded mail will always lose the accent marks
when transferred through a DEC host.

If your computer is running DEC Ultrix and you want it to handle 8 bit
characters properly, you can get the source for a more recent version
of sendmail via ftp (see section 14.9). OR, you can simply
call DEC, complain that their standard mail system cannot handle
international 8 bit mail, encourage them to implement 8 bit
transparent SMTP, or (even better) ESMTP, and ask for the sendmail
patch which makes their current sendmail 8 bit transparent.
(Reportedly, such a patch is available from DEC for those who ask.)
In the meantime, an 8 bit transparent sendmail MIPS binary for Ultrix
is available as URL
ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit/mips.sendmail.8bit.

If you want to change MTAs, the popular smail PD-MTA is also 8 bit
clean.


13.2.2 High-level protocols
In the Good Old Days, messages were 7-bit US-ASCII only. When users
wanted to transfer 8 bit data (binaries or compressed files, for
example), it was their responsibility to translate them to a 7 bit
form which could be sent. At the other end, the recipient had to
unpack the data using the same protocol. The encoding mechanism
commonly used for this purpose is uuencode/uudecode.

Today, a standard, MIME (Multipurpose Internet Mail Extensions),
exists which automatically packs and unpacks data as is
required. This standard can take advantage of different underlying
protocol capabilities and automatically transform messages to
guarantee delivery. This standard can also be used to include
multimedia data types in your mail messages.

The MIME standard defines a mail transfer protocol which can handle
different character sets and multimedia mail, independent of the
network infrastructure. This protocol should eventually solve
problems with 7-bit mailers etc. Unfortunately, no mail transfer
agents (mail routers) and few end user mail readers support this
standard. Source for MIME support (the `metamail' package) in
various mail readers is available as URL
ftp://thumper.bellcore.com/pub/nsb. MIME is specified in RFC 1521 and
RFC 1522 which are available from ftp.uu.net. There is also a MIME
FAQ which is available as URL
ftp://ftp.ics.uci.edu/mh/contrib/multimedia/mime-faq.txt.gz. (This
file is in compressed format. You will need the GNU gunzip program to
decompress this file.)

PS: Newer versions of sendmail support ESMTP negotiation and can pass
8 bit data. However, they do not (yet?) support downgrading of 8 bit
MIME messages.


13.3 News and ISO 8859-1
As with mail, the Usenet news protocol specification is 7 bit based,
but the infrastructure has largely been upgraded to 8 bit service.
Thus, accented characters are transferred correctly between much of
Europe (and Latin America).

ISO 8859-1 is _the_ standard for typing accented characters in most
newsgroups (this may be different for MS-DOS centered newsgroups ;-),
and is preferred in most European newsgroup hierarchies, such as at.*
or de.*.

For those who speak French, there is an excellent FAQ on using ISO
8859-1 coded characters on Usenet by François Yergeau (URL
ftp://ftp.ulaval.ca/contrib/yergeau/faq-accents). This FAQ is
regularly posted in soc.culture.french and other relevant newsgroups.


13.4 WWW (and other information servers)
The WWW protocol can transfer 8 bit data without any problems, and you
can advertise ISO-8859-1 encoded data from your server. The display
of data is dependent upon the user's client. xmosaic (freely
available from NCSA for most UNIX platforms) uses an ISO-8859-1
compliant font by default and will display data correctly.


13.5 rlogin
For rlogin to pass 8 bit data correctly, invoke it with 'rlogin -8' or
'rlogin -L'.

14. Some applications and ISO 8859-1
14.1 bash
You need version 1.13 or higher and set the locale correctly (see
section 3). Also, to configure the `readline' input function of bash
to handle 8 bit characters correctly, you have to set some environment
variables in the readline startup file .inputrc:
-------------------------------------------------------
set meta-flag On
set convert-meta Off
set output-meta On
-------------------------------------------------------

Before bash version 1.13, bash used the eighth bit of characters to
mark whether or not they were quoted when performing word expansions.
While this was not a problem in a 7-bit US-ASCII environment, this was
a major restriction for users working in a non-English environment.

These readline variables have the following meaning (and default
values):
meta-flag (Off)
If set to On, readline will enable eight-bit input
(that is, it will not strip the high bit from the char-
acters it reads), regardless of what the terminal
claims it can support.
convert-meta (On)
If set to On, readline will convert characters with the
eighth bit set to an ASCII key sequence by stripping
the eighth bit and prepending an escape character (in
effect, using escape as the meta prefix).
output-meta (Off)
If set to On, readline will display characters with the
eighth bit set directly rather than as a meta-prefixed
escape sequence.

Bash is available from prep.ai.mit.edu in /pub/gnu.


14.2 elm
Elm automatically supports the handling of national character sets,
provided the environment is configured correctly. If you configure
elm without MIME support, you can receive, display, enter and send 8
bit ISO 8859-1 messages (if your environment supports this character
set).

When you compile elm with MIME support, you have two options:
* you can compile elm to use 8 bit ISO-8859-1 as transport encoding:
If you use this encoding, even people without MIME compliant mailers
will be able to read your mail messages, if they use the same
character set. The eighth bit may, however, be cut off by 7 bit MTAs
(mail transfer agents), and mutilated mail might be received by the
recipient, regardless of whether she uses MIME or not. (This
problem should be eased when 8 bit mailers are upgraded to
understand how to translate 8 bit mails to 7 bit encodings when they
encounter a 7 bit mailer.)

* you can compile elm to use 7 bit US-ASCII `quoted printable' as
transport encoding:
this encoding ensures that you can transfer your mail containing
national characters without having to worry about 7 bit MTAs. A
MIME compliant mail reader at the other end will translate your
message back to your national character set. Recipients without
MIME compliant mail readers will however see mutilated messages:
national characters will have been replaced by sequences of the type
'=FF' (with FF being the ISO code (in hexadecimal) of the national
character being encoded).


14.3 GNUS
GNUS is a newsreader based on emacs. It is 8 bit transparent and
contains all national character support available in emacs 19.


14.4 less
Version 237 and later automatically displays latin1 characters, if
your locale is configured correctly.

If your OS does not support the locale mechanism, or if you use a
version of less older than 237, set the LESSCHARSET environment
variable with 'setenv LESSCHARSET latin1'.

14.5 metamail
To configure the metamail package for ISO 8859-1 input/output, set the
MM_CHARSET environment variable with 'setenv MM_CHARSET ISO-8859-1'.
Also, set the MM_AUXCHARSETS variable with 'setenv MM_AUXCHARSETS
iso-8859-1'.


14.6 nn
Add the line
-----------------
set data-bits 8
-----------------
to your ~/.nn/init (or the global configuration file) in order for nn
to be able to process 8 bit characters.


14.7 nroff
The GNU replacement for nroff, groff, has an option to generate ISO
8859-1 coded output, instead of plain ASCII. Thus, you can preview
nroff documents with correctly displayed accented characters. Invoke
groff with the 'groff -Tlatin1' option to achieve this.
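
For example, to preview a manual page with accented characters intact
(the file name is hypothetical):
----------------------------------
groff -Tlatin1 -man foo.1 | less
----------------------------------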

Groff is free software. It is available from URL
ftp://prep.ai.mit.edu/pub/gnu and many other GNU archives around the
world.


14.8 pgp
PGP (Phil Zimmermann's Pretty Good Privacy) uses Latin1 as canonical
form to transmit crypted data. Your host computer's local character
set should be configured in the configuration file
${PGPPATH}/config.txt by setting the CHARSET parameter. If you are
using ISO 8859-1 as your native character set, CHARSET should be set
to LATIN1; on MS-DOS computers with code page 850, set 'CHARSET =
CP850'. This will make PGP automatically translate all crypted texts
from/to the LATIN1 canonical form. A setting of 'CHARSET = NOCONV'
can be used to inhibit all translations.
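
For example, the relevant line in ${PGPPATH}/config.txt on an ISO
8859-1 system would simply be:
----------------------------------
CHARSET = LATIN1
----------------------------------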

When PGP is used to code Cyrillic text, KOI8 is regarded as canonical
form (use 'CHARSET = KOI8'). If you use the ALT_CODES encoding for
Cyrillic (popular on PCs), set 'CHARSET = ALT_CODES' and it will
automatically be converted to KOI8.

Footnote: Note that PGP treats KOI8 as LATIN1, even though it is a
completely different character set (Russian), because trying to
convert KOI8 to either LATIN1 or CP850 would be futile anyway.


14.* samba
To make samba work with ISO 8859-1, use the following line in the
[global] section:
valid chars = 0xa0 0xa1 0xa2 0xa3 0xa4 0xa5 0xa6 0xa7 0xa8 0xa9 0xaa 0xab 0xac 0xad 0xae 0xaf 0xb0 0xb1 0xb2 0xb3 0xb4 0xb5 0xb6 0xb7 0xb8 0xb9 0xba 0xbb 0xbc 0xbd 0xbe 0xbf 0xc0:0xe0 0xc1:0xe1 0xc2:0xe2 0xc3:0xe3 0xc4:0xe4 0xc5:0xe5 0xc6:0xe6 0xc7:0xe7 0xc8:0xe8 0xc9:0xe9 0xca:0xea 0xcb:0xeb 0xcc:0xec 0xcd:0xed 0xce:0xee 0xcf:0xef 0xd0:0xf0 0xd1:0xf1 0xd2:0xf2 0xd3:0xf3 0xd4:0xf4 0xd5:0xf5 0xd6:0xf6 0xd7 0xf7 0xd8:0xf8 0xd9:0xf9 0xda:0xfa 0xdb:0xfb 0xdc:0xfc 0xdd:0xfd 0xde:0xfe 0xdf 0xff


14.9 sendmail
BSD Sendmail Version 8 has a flag in the configuration file, set to
True or False, which determines whether v8 passes any 8-bit data it
encounters (presumably to match the behavior of other 8-bit
transparent MTAs and to meet the needs of non-ASCII users) or strips
it to 7 bits to conform to SMTP. The source code for an 8 bit
clean sendmail is available as URL ftp://ftp.cs.berkeley.edu/ucb/sendmail.
A pre-compiled binary for DEC MIPS systems running Ultrix is available
as URL ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit/mips.sendmail.8bit.


14.10 tcsh
You need version 6.04 or higher, and your locale has to be set
properly (see section 3). Tcsh also needs to be compiled with the
national language support feature, see the config.h file in the tcsh
source directory. Tcsh is an extended csh and is available as URL
ftp://ftp.deshaw.com/pub/tcsh.

If tcsh has been configured correctly, it will allow national
characters in ENVIRONMENT variables, shell variables, file names, etc.

----------------------------------
set BenötigteDateien=/etc/rc
cat $BenötigteDateien > /dev/null
----------------------------------


14.11 vi
Support for 8 bit character sets depends on the OS. It works under
SunOS 4.1.*, but on OSF/1 vi gets confused about the current cursor
position in the presence of 8 bit characters. Some versions of vi
require an 8bit locale to work with 8 bit characters.


All major replacements for vi seem to support 8 bit characters:

14.11.1 vile ('VI Like Emacs')
Vile (by Paul Fox) can be told that the usual range of 8th-bit
characters is printable with "set printing-low 160" and "set
printing-high 255". By either executing these commands in vile or by
placing them in ~/.exrc, vile will not use the usual octal or hex
expansion for these characters. vile is available from
ftp://id.wing.net/pub/pgf/vile.


Normally, 8 bit characters are printed either in hex (the default) or
in octal ("set unprintable-as-octal"); they look like "\xC7" or "\307"
on your screen.

vile was the first vi rewrite to provide multi-window/multi-buffer
operation. Since it was derived from MicroEMACS, it retains fully
rebindable keys and a built-in macro language. The ftp site is
id.wing.net:/pub/pgf/vile. The current version is 5.2, and the
program is quite mature (5 years old). There is also an X-aware
version, which makes full use of the mouse, with scrollbars, etc.
Initialization commands go in the ~/.vilerc file.

vile does not require correct locale settings; its 8-bit support is
fairly primitive but self-contained. The pertinent sections of vile's
documentation follow.

------------------------------------
from vile's Help file:

8-Bit Operation
---------------

vile allows input, manipulation, and display of all 256 possible
byte-wide characters. (Double-wide characters are not supported.)

Output
------
By default, characters with the high bit set (decimal value 128 or
greater) will display as hex (or octal; see "unprintable-as-octal"
above) sequences, e.g. \xA5. A range of characters which should
display as themselves (that is, characters understood by the user's
display terminal) may be given using the "printing-low" and
"printing-high" settings (see above). Useful values for these
settings are 160 and 255, which correspond to the printable range
of the ISO-Latin-1 character set.

Input
-----
If the user's input device can generate all characters, and if the
terminal settings are such that these characters pass through
unmolested (Using "stty cs8 -parenb -istrip" works for me, on an
xterm. Real serial lines may take more convincing, at both ends.),
then vile will happily incorporate them into the user's text, or
act on them if they are bound to functions. Users who have no need
to enter 8-bit text may want access to the meta-bound functions
while in insert mode as well as command mode. The mode
"meta-insert-bindings" controls whether functions bound to meta-
keys (characters with the high bit set) are executed only in
command mode, or in both command and insert modes. In either case,
if a character is _not_ bound to a function, then it will be
self-inserting when in insert mode. (To bind to a meta key in the
.vilerc file, one may specify it as itself, or in hex or octal, or
with the shorthand 'M-c', where c is the corresponding character
without the high bit set.)

------------------------------------
also from vile's Help file, these are the settable modes which affect
8-bit operation:

meta-insert-bindings (mib) Controls behavior of 8-bit characters
during insert. Normally, key-bindings are only operational
when in command mode: when in insert mode, all characters
are self-inserting. If this mode is on, and a meta-character
is typed which is bound to a function, then that function
binding will be honored and executed from within insert
mode. Any unbound meta-characters will remain self-inserting.
(B)

printing-low The integer value representing the first of the
printable set of "high bit" (i.e. 8-bit) characters.
Defaults to 0. Most foreign (relative to me!) users would
set this to 160, the first printable character in the upper
range of the ISO 8859/1 character set. (U)

printing-high The integer value representing the last character of
the printable set of "high bit" (i.e. 8-bit) characters.
Defaults to 0. Set this to 255 for ISO 8859/1
compatibility. (U)

unprintable-as-octal (uo) If an 8-bit character is non-printing, it
will normally be displayed in hex. This setting will force
octal display. Non-printing characters whose 8th bit is
not set are always displayed in control character (e.g. '^C')
notation. (B)

14.11.2 vim
vim was developed on an Amiga in Europe and supports a mechanism
similar to vile's. vim supports digraphs for entering 8-bit
characters; the output convention is similar to vile's (raw or
nothing).

Details are unknown. (If you know more about vim, please let me know.
A request to comp.editors should yield additional information.)

14.11.3 nvi
A recent vi rewrite which should also support 8 bit characters.
(Keith Bostic (bos...@cs.berkeley.edu) is the author and should know
more about nvi.)

15. Terminals
15.1 X11 Terminal Emulators
See section 4 on X11 for bug fixes for X11 clients.

15.1.1 xterm
If you are using X11 and xterm as your terminal emulator, you should
place the following line in ~/.Xdefaults (this seems to be required in
some releases of X11, not in all):
-------------------------
XTerm*EightBitInput: True
-------------------------

15.1.2 rxvt
rxvt is another terminal emulator used for X11, mostly under
Linux. Invoke rxvt with the 'rxvt -8' command line.


15.2 VT2xx, VT3xx
The character encoding used in VT2xx terminals is a preliminary
version of the ISO-8859-1 standard (DEC MCS), so some characters (the
more obscure ones) differ slightly. However, these terminals can be
used with ISO 8859-1 characters without problems.

The newer VT3xx terminals use the official ISO 8859-1 standard.

The international versions of the VT[23]xx terminals have a COMPOSE
key which can be used to enter accented characters, e.g.
<COMPOSE><e><'> will give an e with acute accent (é).


15.3 Various UNIX terminals
Some terminals support down-loadable fonts. If characters sent to
these terminals can be 8 bits wide, you can down-load your own ISO
character set. To see how this can be achieved, take a look at the
file /pub/culture/russian/comp/cyril-term on nic.funet.fi.


15.4 MS-DOS PCs
MS-DOS PCs normally use a different encoding for accented characters,
so there are two options:

* you can use a terminal emulator which will translate between the
different encodings. If you use the PROCOMM PLUS, TELEMATE or
TELIX modem programs, you can down-load the translation tables
from URL ftp://oak.oakland.edu/pub/msdos/modem/xlate.zip. (You need
to install CP850 for this to work.)

* you can reconfigure your MS-DOS PC to use an ISO-8859-1 code page.
Either install IBM code page 819 (see section 19), or you can get
the free ISO 8859-X support files from the anonymous ftp archive
ftp://ftp.uni-erlangen.de/pub/doc/ISO/charsets, which contains data
on how to do this (and other ISO-related stuff). The README file
contains an index of the files you need.

Note that many terminal emulations for PCs strip the 8th bit when in
text transmission mode. If you are using such a program to dial up
a computer, you may have to configure your terminal program to
transmit all 8 bits.


16. Programming applications which support the use of ISO 8859-1
For information on how to write applications with support for
localization (to the ISO 8859-1 and other character representations)
check out URL ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit/ISO-programming.

17. Other relevant i18n FAQs
This is a list of other FAQs on the net which might be of interest.
Topic Newsgroup(s) Comments
Nordic graphemes soc.culture.nordic interesting stuff about
handling nordic letters
accents sur Usenet soc.culture.french,... Accents on Usenet (French)
+ more
Programming for I18N comp.unix.questions,... see section 16.
International fonts ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit/ISO-fonts
Discusses international fonts
and where to find them
I18N on WWW http://www.vlsivie.tuwien.ac.at/mike/i18n.html
German-HowTo for Linux ftp://ftp.univie.ac.at/systems/linux/sunsite/docs/HOWTO/German-HOWTO

Using 8 bit characters ftp://ftp.ulg.ac.be/pub/docs/iso8859/* (1)

Much charactersets info ftp://kermit.columbia.edu/kermit/charsets/
http://www.columbia.edu/kermit/ (2)

(1) written to "convey" the problem to the ASCII programmer, hence
more theoretical background.
(2) Kermit is second to none (in time and quality) for character sets
support and deserves a pointer in this FAQ.

18. Operating Systems and ISO 8859-1
18.1 UNIX
Most Unix implementations use the ISO 8859-1 character set, or at
least have an option to use it. Some systems may also support other
encodings, e.g. Roman8 (HP/UX), DEC MCS (DEC Ultrix, see the section
on VMS), etc.


18.2 NeXTSTEP
NeXTSTEP uses a proprietary character set.


18.3 MS DOS
IBM code page 819 _is_ ISO 8859-1. Code Page 850 has the same
characters as ISO 8859-1, BUT the characters are in different
locations (i.e., you can translate 1-to-1, but you do have to
translate the characters.)


18.4 MS-Windows
Microsoft Windows uses an ISO 8859-1 compatible character set (Code
Page 1252), as delivered in the US, Europe (except Eastern Europe) and
Latin America. In Windows 3.1, Microsoft has added additional characters
in the 0x80-0x9F range.


18.5 DEC VMS
DEC VMS uses the DEC MCS character set, which is practically
equivalent to ISO 8859-1 (it is a former ISO 8859-1 draft standard).
The only characters which differ between DEC MCS and ISO 8859-1 are
the Icelandic characters (eth and thorn) at locations 0xD0, 0xF0, 0xDE
and 0xFE.


19. Table of ISO 8859-1 Characters
This section gives an overview of the ISO 8859-1 character set. The
ISO 8859-1 character set consists of the following four blocks:

00 1F CONTROL CHARACTERS
20 7E BASIC LATIN
80 9F EXTENDED CONTROL CHARACTERS
A0 FF LATIN-1 SUPPLEMENT

The control characters and basic latin blocks are similar to those
used in the US national variant of ISO 646 (US-ASCII), so they are not
listed here. Nor is the second block of control characters listed,
for which no functions have yet been defined.

+----+-----+---+------------------------------------------------------
|Hex | Dec |Chr| Description ISO/IEC 10646-1:1993(E)
+----+-----+---+------------------------------------------------------
| | | |
| A0 | 160 | | NO-BREAK SPACE
| A1 | 161 | ¡ | INVERTED EXCLAMATION MARK
| A2 | 162 | ¢ | CENT SIGN
| A3 | 163 | £ | POUND SIGN
| A4 | 164 | ¤ | CURRENCY SIGN
| A5 | 165 | ¥ | YEN SIGN
| A6 | 166 | ¦ | BROKEN BAR
| A7 | 167 | § | SECTION SIGN
| A8 | 168 | ¨ | DIAERESIS
| A9 | 169 | © | COPYRIGHT SIGN
| AA | 170 | ª | FEMININE ORDINAL INDICATOR
| AB | 171 | « | LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
| AC | 172 | ¬ | NOT SIGN
| AD | 173 | ­ | SOFT HYPHEN
| AE | 174 | ® | REGISTERED SIGN
| AF | 175 | ¯ | MACRON
| | | |
| B0 | 176 | ° | DEGREE SIGN
| B1 | 177 | ± | PLUS-MINUS SIGN
| B2 | 178 | ² | SUPERSCRIPT TWO
| B3 | 179 | ³ | SUPERSCRIPT THREE
| B4 | 180 | ´ | ACUTE ACCENT
| B5 | 181 | µ | MICRO SIGN
| B6 | 182 | ¶ | PILCROW SIGN
| B7 | 183 | · | MIDDLE DOT
| B8 | 184 | ¸ | CEDILLA
| B9 | 185 | ¹ | SUPERSCRIPT ONE
| BA | 186 | º | MASCULINE ORDINAL INDICATOR
| BB | 187 | » | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
| BC | 188 | ¼ | VULGAR FRACTION ONE QUARTER
| BD | 189 | ½ | VULGAR FRACTION ONE HALF
| BE | 190 | ¾ | VULGAR FRACTION THREE QUARTERS
| BF | 191 | ¿ | INVERTED QUESTION MARK
| | | |
| C0 | 192 | À | LATIN CAPITAL LETTER A WITH GRAVE ACCENT
| C1 | 193 | Á | LATIN CAPITAL LETTER A WITH ACUTE ACCENT
| C2 | 194 | Â | LATIN CAPITAL LETTER A WITH CIRCUMFLEX ACCENT
| C3 | 195 | Ã | LATIN CAPITAL LETTER A WITH TILDE
| C4 | 196 | Ä | LATIN CAPITAL LETTER A WITH DIAERESIS
| C5 | 197 | Å | LATIN CAPITAL LETTER A WITH RING ABOVE
| C6 | 198 | Æ | LATIN CAPITAL LIGATURE AE
| C7 | 199 | Ç | LATIN CAPITAL LETTER C WITH CEDILLA
| C8 | 200 | È | LATIN CAPITAL LETTER E WITH GRAVE ACCENT
| C9 | 201 | É | LATIN CAPITAL LETTER E WITH ACUTE ACCENT
| CA | 202 | Ê | LATIN CAPITAL LETTER E WITH CIRCUMFLEX ACCENT
| CB | 203 | Ë | LATIN CAPITAL LETTER E WITH DIAERESIS
| CC | 204 | Ì | LATIN CAPITAL LETTER I WITH GRAVE ACCENT
| CD | 205 | Í | LATIN CAPITAL LETTER I WITH ACUTE ACCENT
| CE | 206 | Î | LATIN CAPITAL LETTER I WITH CIRCUMFLEX ACCENT
| CF | 207 | Ï | LATIN CAPITAL LETTER I WITH DIAERESIS
| | | |
| D0 | 208 | Ð | LATIN CAPITAL LETTER ETH
| D1 | 209 | Ñ | LATIN CAPITAL LETTER N WITH TILDE
| D2 | 210 | Ò | LATIN CAPITAL LETTER O WITH GRAVE ACCENT
| D3 | 211 | Ó | LATIN CAPITAL LETTER O WITH ACUTE ACCENT
| D4 | 212 | Ô | LATIN CAPITAL LETTER O WITH CIRCUMFLEX ACCENT
| D5 | 213 | Õ | LATIN CAPITAL LETTER O WITH TILDE
| D6 | 214 | Ö | LATIN CAPITAL LETTER O WITH DIAERESIS
| D7 | 215 | × | MULTIPLICATION SIGN
| D8 | 216 | Ø | LATIN CAPITAL LETTER O WITH STROKE
| D9 | 217 | Ù | LATIN CAPITAL LETTER U WITH GRAVE ACCENT
| DA | 218 | Ú | LATIN CAPITAL LETTER U WITH ACUTE ACCENT
| DB | 219 | Û | LATIN CAPITAL LETTER U WITH CIRCUMFLEX ACCENT
| DC | 220 | Ü | LATIN CAPITAL LETTER U WITH DIAERESIS
| DD | 221 | Ý | LATIN CAPITAL LETTER Y WITH ACUTE ACCENT
| DE | 222 | Þ | LATIN CAPITAL LETTER THORN
| DF | 223 | ß | LATIN SMALL LETTER SHARP S
| | | |
| E0 | 224 | à | LATIN SMALL LETTER A WITH GRAVE ACCENT
| E1 | 225 | á | LATIN SMALL LETTER A WITH ACUTE ACCENT
| E2 | 226 | â | LATIN SMALL LETTER A WITH CIRCUMFLEX ACCENT
| E3 | 227 | ã | LATIN SMALL LETTER A WITH TILDE
| E4 | 228 | ä | LATIN SMALL LETTER A WITH DIAERESIS
| E5 | 229 | å | LATIN SMALL LETTER A WITH RING ABOVE
| E6 | 230 | æ | LATIN SMALL LIGATURE AE
| E7 | 231 | ç | LATIN SMALL LETTER C WITH CEDILLA
| E8 | 232 | è | LATIN SMALL LETTER E WITH GRAVE ACCENT
| E9 | 233 | é | LATIN SMALL LETTER E WITH ACUTE ACCENT
| EA | 234 | ê | LATIN SMALL LETTER E WITH CIRCUMFLEX ACCENT
| EB | 235 | ë | LATIN SMALL LETTER E WITH DIAERESIS
| EC | 236 | ì | LATIN SMALL LETTER I WITH GRAVE ACCENT
| ED | 237 | í | LATIN SMALL LETTER I WITH ACUTE ACCENT
| EE | 238 | î | LATIN SMALL LETTER I WITH CIRCUMFLEX ACCENT
| EF | 239 | ï | LATIN SMALL LETTER I WITH DIAERESIS
| | | |
| F0 | 240 | ð | LATIN SMALL LETTER ETH
| F1 | 241 | ñ | LATIN SMALL LETTER N WITH TILDE
| F2 | 242 | ò | LATIN SMALL LETTER O WITH GRAVE ACCENT
| F3 | 243 | ó | LATIN SMALL LETTER O WITH ACUTE ACCENT
| F4 | 244 | ô | LATIN SMALL LETTER O WITH CIRCUMFLEX ACCENT
| F5 | 245 | õ | LATIN SMALL LETTER O WITH TILDE
| F6 | 246 | ö | LATIN SMALL LETTER O WITH DIAERESIS
| F7 | 247 | ÷ | DIVISION SIGN
| F8 | 248 | ø | LATIN SMALL LETTER O WITH STROKE
| F9 | 249 | ù | LATIN SMALL LETTER U WITH GRAVE ACCENT
| FA | 250 | ú | LATIN SMALL LETTER U WITH ACUTE ACCENT
| FB | 251 | û | LATIN SMALL LETTER U WITH CIRCUMFLEX ACCENT
| FC | 252 | ü | LATIN SMALL LETTER U WITH DIAERESIS
| FD | 253 | ý | LATIN SMALL LETTER Y WITH ACUTE ACCENT
| FE | 254 | þ | LATIN SMALL LETTER THORN
| FF | 255 | ÿ | LATIN SMALL LETTER Y WITH DIAERESIS
+----+-----+---+------------------------------------------------------

Footnote: ISO 10646 calls Æ a `ligature', but this is a
letter in (at least some) Scandinavian languages. Thus, it
is not in the same, merely typographic `ligature' class as
`oe' ({\oe} in {\LaTeX} convention), which was not included
in the ISO 8859-1 standard.

***Tentative info***
Supposedly the Danish press, some months ago, reported that ISO has
changed the standard so from now on æ and Æ are classified as
letters.

If you can confirm or deny this, please let me know...
***Tentative info***

20. History
In April 1965, ECMA (the European Computer Manufacturers Association)
standardized ECMA-6. This character set is also (and more commonly)
known under the names ISO 646, US-ASCII and DIN 66003.

However, this standard only contained the basic Latin alphabet, with
no provisions for the national characters in use all across Europe. These
characters were later added by replacing several special characters
from the US-ASCII alphabet (such as {[|]}\ etc.). These variants were
local to each country and were called `national ISO 646 variants'.
Portability from one country to another was low, as each country had
its own national variant, and some of the replaced special characters were
still needed (e.g. for programming in C), which made this an
altogether unsatisfactory solution.

In 1981, IBM released the IBM PC with an 8 bit character set, code
page 437. The order of the added characters was somewhat confusing,
to say the least. However, in 1982 the first hardware to use a more
satisfactory character set appeared: the DEC VT220 and VT240
terminals, which used the DEC MCS (Multinational Character Set).

This character set was very similar to ISO 6937/2, which is
essentially equivalent to today's ISO 8859-1. In March 1985, ECMA
standardized ECMA-94, which later came to be known as ISO 8859-1
through 8859-4. However, ISO 8859-1 was officially standardized by ISO
only in 1987.

1987 also saw the release of MS-DOS 3.3, which used Code Page 850.
Code Page 850 contains all characters from ISO 8859-1, making a
lossless conversion possible. Code Page 819, which was released later,
goes one step further, as it is fully ISO 8859-1 compliant.

The ISO 8859-X standard was designed to allow as much interoperability
between character sets as possible. Thus, all ISO 8859-X character
sets are supersets of US-ASCII and all of them will render
English text properly. Also, there is considerable overlap between
several character sets: a text written in German using the ISO 8859-1
character set can be correctly rendered in ISO 8859-2, the Eastern
European character set, where German is the primary foreign language
(-3, -4, -9, -10 supposedly also can display German text without
changes).

While ISO 8859-X was designed for considerable portability, texts are
still mostly confined to their own character set, and portability to
other cultural areas remains a problem. One solution is to use a
meta-protocol (such as -> MIME) which specifies the character set
in which a text was written and which causes the correct character
set to be used in displaying the text.

A different approach to overloading the character set as done in the
ISO 8859-X standards (where the locations 0xa0 to 0xff are used to
encode national characters) is to use wider characters. This is the
approach employed in Unicode (which is an encoding of the Basic
Multilingual Plane (BMP) of ISO/IEC 10646). The downside to this
approach is that most of the software available today only accepts 8
bit wide characters (7 bit if you have bad luck :-( ), so the Unicode
approach is problematic. This 8 bit restriction permeates nearly all
code in use today, including system software (file systems,
process identifiers, etc.!). To ease this problem somewhat, a
representation which maps Unicode characters to a variable length 8
bit based encoding has been introduced; this encoding is called
UTF-8, and a minimal sketch of its rules follows. More information
about Unicode can be obtained from URL http://unicode.org.
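
To illustrate, here is a minimal sketch in C of the UTF-8 encoding
rules for the 16-bit BMP range: characters below 0x80 are stored as
a single byte (so pure ASCII files are valid UTF-8 without any
change), and all other characters are spread over two or three
bytes.

    /* Minimal sketch: encode one BMP character (0x0000-0xFFFF) as
     * UTF-8.  Returns the number of bytes written to buf. */
    int utf8_encode(unsigned int c, unsigned char *buf)
    {
        if (c < 0x80) {             /* US-ASCII: 1 byte */
            buf[0] = (unsigned char) c;
            return 1;
        } else if (c < 0x800) {     /* e.g. Latin-1 0xA0-0xFF: 2 bytes */
            buf[0] = (unsigned char) (0xC0 | (c >> 6));
            buf[1] = (unsigned char) (0x80 | (c & 0x3F));
            return 2;
        } else {                    /* rest of the BMP: 3 bytes */
            buf[0] = (unsigned char) (0xE0 | (c >> 12));
            buf[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
            buf[2] = (unsigned char) (0x80 | (c & 0x3F));
            return 3;
        }
    }

For example, the ISO 8859-1 character 0xE4 (a with diaeresis) becomes
the two-byte sequence 0xC3 0xA4.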

21. Glossary: Acronyms, Names, etc.
i18n I<-- 18 letters -->n = Internationalization
e13n Europeanization
l10n Localization
ANSI American National Standards Institute, the US member of ISO
ASCII American Standard Code for Information Interchange
CP Code Page
CP850 Code Page 850, the most widely used MS DOS code page
CR Carriage Return
CTAN server Comprehensive TeX Archive Network, the world's largest
repository for TeX related material. It consists of three
sites mirroring each other: ftp.shsu.edu, ftp.tex.ac.uk,
ftp.dante.de. The current configuration, including known
mirrors, can be obtained by fingering cta...@ftp.shsu.edu
DEC Digital Equipment Corp.
DIN Deutsche Industrie Norm (German Industry Norm)
DOS Disk Operating System
EBCDIC Extended Binary Coded Decimal Interchange Code
---a proprietary IBM character set used on mainframes
ECMA European Computer Manufacturers Association
emacs Editing Macros, a family of popular text editors
ESMTP Enhanced SMTP
Esperanto A synthetic, ``universal'' language developed by
Dr. Zamenhof in 1887.
FSF Free Software Foundation
FTP File Transfer Protocol
GNU GNU's not Unix, an FSF project
HP Hewlett Packard
HP/UX HP Unix
IBM International Business Machines Corp.
IEEE Institute of Electrical and Electronics Engineers
INRIA Institut National de Recherche en Informatique et en Automatique
IP Internet Protocol
ISO International Organization for Standardization
KOI8 Kod Obmena Informatsiey, 8 bit---a popular encoding for Cyrillic on UNIX workstations
\LaTeX{} A macro package for \TeX{}
LF Linefeed
MCS DEC's Multinational Character Set---the former ISO 8859-1 draft standard
MIME Multipurpose Internet Mail Extensions
MS-DOS Microsoft's program loader
MTA mail transfer agent
MUA mail user agent
OS Operating System
OSF the Open Software Foundation
OSF/1 the Open Software Foundation's Unix, Revision 1
PGP Pretty Good Privacy, an encryption package
POSIX Portable Operating System Interface (an IEEE UNIX standard)
PS PostScript, Adobe's printer language
RFC Request for Comments, an Internet standard
sed stream editor, a UNIX file manipulation utility
SMTP Simple Mail Transfer Protocol
TCP Transmission Control Protocol
\TeX{} Donald Knuth's typesetting program
UDP User Datagram Protocol
URL a WWW Uniform Resource Locator
US-ASCII the US national variant of ISO 646, see ASCII
VMS Virtual Memory System---DEC's proprietary OS
W3 WWW
WWW World Wide Web
X11 X Window System

22. Comments
This FAQ is somewhat Sun-centered, though I have tried to include
other machine types. If you have figured out how to configure your
machine type, please let me (mi...@vlsivie.tuwien.ac.at) know so that I
can include it in future revisions of this FAQ.

23. Home location of this document
23.1 www
You can find this and other i18n documents under URL
http://www.vlsivie.tuwien.ac.at/mike/i18n.html.

23.2 ftp
The most recent version of this document is available via anonymous
ftp from ftp.vlsivie.tuwien.ac.at under the file name
/pub/8bit/FAQ-ISO-8859-1

-----------------

Copyright © 1994,1995 Michael Gschwind (mi...@vlsivie.tuwien.ac.at)

This document may be copied for non-commercial purposes, provided this
copyright notice appears. Publication in any other form requires the
author's consent. (Distribution or publication with a product requires
the author's consent, as does publication in any book, journal or
other work.)

Dieses Dokument darf unter Angabe dieser urheberrechtlichen
Bestimmungen zum Zwecke der nicht-kommerziellen Nutzung beliebig
vervielfältigt werden. Die Publikation in jeglicher anderer Form
erfordert die Zustimmung des Autors. (Verteilung oder Publikation mit
einem Produkt erfordert die Zustimmung des Autors, wie auch die
Veröffentlichung in Büchern, Zeitschriften, oder anderen Werken.)


Michael Gschwind, Institut f. Technische Informatik, TU Wien
snail: Treitlstraße 3-182-2 || A-1040 Wien || Austria
email: mi...@vlsivie.tuwien.ac.at PGP key available via www (or email)
www : URL:http://www.vlsivie.tuwien.ac.at/mike/mike.html
phone: +(43)(1)58801 8156 fax: +(43)(1)586 9697


Dick Dawson

May 11, 1996

In article <internationalization/iso-8859-1-cha...@rtfm.mit.edu>,
<mi...@vlsivie.tuwien.ac.at> wrote:

[..]

Thanks for posting the article on 8859/x. Long but informative.

A few comments:

My interests are primarily western European languages so 8859/1
generally does the job.

I usually run a unix host remotely via telco running Telix comm
software on msdos. Not my preference, an expediency. Telix will run
translate tables; I've got translation from 8859/1 to msdos 'code
page' 850. cp850 satisfies most western European languages and as
far as I can tell contains all characters of 8859/1. But there is no
'hooked-o' used in Icelandic. Looks like 'o' with a backwards comma
descender. Any suggestions?

When possible I run Sun Xwindows computers that have 8859/1 as their
default character sets. I intend to set up linux and X11 soon at
home and expect to be able to run X11 on it.

On UNICODE and/or ISO-10646: Any hope of translate software for these
coding schemes that will convert the 16bit words to 8bit words for
use with older equipment?
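
For the Latin-1 range, at least, such a down-conversion is trivial,
since ISO 10646 positions 0x0000-0x00FF coincide with ISO 8859-1. A
minimal, hypothetical sketch in C (assuming the 16-bit words have
already been read into integers; everything outside the 8-bit range
is replaced by a substitute character):

    /* Minimal sketch: fold a 16-bit ISO 10646 character down to
     * ISO 8859-1.  Positions 0x00-0xFF are identical in the two;
     * any other character is replaced by a substitute. */
    unsigned char ucs2_to_latin1(unsigned int c)
    {
        if (c < 0x100)
            return (unsigned char) c;   /* Latin-1 subset: keep */
        return '?';                     /* no Latin-1 equivalent */
    }

Anything smarter, such as approximating characters outside Latin-1
by their base letters, needs per-character tables.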

Dick
dda...@mailbox.syr.edu
http://web.syr.edu/~ddawson

Fridrik Skulason

May 15, 1996

In <4n192f$a...@newstand.syr.edu> dda...@gamera.syr.edu (Dick Dawson) writes:

>But there is no
>'hooked-o' used in Icelandic. Looks like 'o' with a backwards comma
>descender. Any suggestions?

uh...suggestions for what ? I'm afraid I don't understand the problem.
You are right that Icelandic does not have any "hooked-o" (whatever
that is)....our special characters are A-acute, E-Acute, I-acute, O-acute,
U-Acute, Y-Acute, Thorn, Eth, O-umlaut, and AE (and as a side note, the
Icelandic alphabet does not have C, Q, W or Z)

-frisk


--
Fridrik Skulason Frisk Software International phone: +354-5-617273
Author of F-PROT E-mail: fr...@complex.is fax: +354-5-617274

Erik Dutton

May 22, 1996

On 15 May 1996 12:18:54 -0000, fr...@complex.is (Fridrik Skulason)
wrote:

>uh...suggestions for what ? I'm afraid I don't understand the problem.
>You are right that Icelandic does not have any "hooked-o" (whatever
>that is)....

probably a cedilla (cf. Français)

Erik Dutton
edu...@worldnet.att.net

Cat

May 22, 1996, to Fridrik Skulason

Fridrik Skulason wrote:
>
> In <4n192f$a...@newstand.syr.edu> dda...@gamera.syr.edu (Dick Dawson) writes:
>
> >But there is no
> >'hooked-o' used in Icelandic. Looks like 'o' with a backwards comma
> >descender. Any suggestions?
>
> uh...suggestions for what ? I'm afraid I don't understand the problem.
> You are right that Icelandic does not have any "hooked-o" (whatever
> that is)....our special characters are A-acute, E-Acute, I-acute, O-acute,
> U-Acute, Y-Acute, Thorn, Eth, O-umlaut, and AE (and as a side note, the
> Icelandic alphabet does not have C, Q, W or Z)

Well, the character in question was used by some mediaeval scribes and is
an essential part of the standardized form of the Old Icelandic character
set that was devised by the German philologist Wimmer in the latter half of
the 19th century and is still used in some scholarly editions of the Sagas.
In the Wimmer character set, there was no 'ö', but instead the 'o' with a
descender similar to that of 'ç'; and the 'ø'. Both those characters are
represented today with 'ö'. Also, there were two types of 'æ', the 'æ' and
the 'œ'.

Elias (e...@itn.is, posting from my gf's account)

Johan Olofsson

May 23, 1996

Dick Dawson (dda...@gamera.syr.edu) writes:

>> But there is no
>> 'hooked-o' used in Icelandic. Looks like 'o' with a backwards comma
>> descender. Any suggestions?

Fridrik Skulason (fr...@complex.is)


> uh...suggestions for what ? I'm afraid I don't understand the problem.
> You are right that Icelandic does not have any "hooked-o" (whatever
> that is)....our special characters are A-acute, E-Acute, I-acute, O-acute,
> U-Acute, Y-Acute, Thorn, Eth, O-umlaut, and AE (and as a side note, the
> Icelandic alphabet does not have C, Q, W or Z)

Elias (e...@itn.is) responds:

Well, the character in question was used by some mediaeval scribes and is
an essential part of the standardized form of the Old Icelandic character
set that was devised by the German philologist Wimmer in the latter half of
the 19th century and is still used in some scholarly editions of the Sagas.
In the Wimmer character set, there was no 'ö', but instead the 'o' with a
descender similar to that of 'ç'; and the 'ø'. Both those characters are
represented today with 'ö'. Also, there were two types of 'æ', the 'æ' and
the 'œ'.

This might be of rather limited importance, since the "hooked-o" might be
represented by 'ö' while the 'ø' can continue to represent itself.

See for instance http://www.lysator.liu.se/runeberg/eddan/

From that site I quote:


Notes to the Old Norse version

The "o-cedilla" is an o with a comma (,) below it. This letter was used
in Old Norse, but is not represented in the ISO 8859-1 character set.
In modern day Icelandic, it has been replaced by "o-dieresis", an o
with two dots on top (ö). In the electronic edition, we have chosen to
do the same.


kind regards!

Johan


ps
...but what is this last quoted letter of yours supposed to be???
It seems to be a non-valid code in iso-8859-1.
But your header says:


Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
X-Mailer: Mozilla 3.0b3 (Win95; I)


???


--
e-mail: johan.o...@magnus.ct.se
s-mail: Majeldsvägen 8a, 582 63 LINKÖPING, Sweden


Jari Oksanen

May 23, 1996

Quite obviously these "Nordic" character sets show no interest in Sami
languages which have characters which cannot be represented under ISO 8859-1
or ordinary 255 character ascii sets. For instance, the c with a hat.
--- Jari Oksanen Tromssa, Ruija / Romsa, Norga / Tromsø, Norge


Johan Olofsson

May 23, 1996

In article <jari.414...@ibg.uit.no>
ja...@ibg.uit.no (Jari Oksanen) writes:

> Quite obviously these "Nordic" character sets show no interest in Sami
> languages which have characters which cannot be represented under ISO 8859-1
> or ordinary 255 character ascii sets. For instance, the c with a hat.

How is it with the following character sets? I don't have the data here right
now, but I think I remember the Baltic languages to be covered by ISO 8859-2 ?

Erik Naggum

May 23, 1996

[Jari Oksanen]

| Quite obviously these "Nordic" character sets show no interest in Sami
| languages which have characters which cannot be represented under ISO
| 8859-1 or ordinary 255 character ascii sets. For instance, the c with a
| hat.

there are other character sets that do. only 96 additional characters can
be fitted in the ISO 4873-conforming 8-bit coded character sets. you will
therefore have to find the set that matches your needs, and write software
that understands that it may need to change character sets on the fly.


Peter Kerr

May 24, 1996

> >You are right that Icelandic does not have any "hooked-o" (whatever
> >that is)....
>
> probably a cedilla (cf. Français)

or what I call a "d with its stem bent back"
called, I believe, "eth" and not accessible from my keyboard

--
Peter Kerr bodger
School of Music chandler
University of Auckland neo-Luddite

Cat

May 24, 1996, to Peter Kerr

Peter Kerr wrote:
>
> > >You are right that Icelandic does not have any "hooked-o" (whatever
> > >that is)....
> >
> > probably a cedilla (cf. Français)
>
> or what I call a "d with its stem bent back"
> called, I believe, "eth" and not accessible from my keyboard

No, those are two different characters. One is a type of o-umlaut not
used today except in scholarly editions and the other is the edh
(situated next right to 'p' on the Icelandic keyboard) which is
originally an Anglo-Saxon (or Old English) letter that was derived from
the Greek delta.

And yes, it is not accessible from the International version of MacOS.
As far as I can understand, it is due to some lame copyright dispute
between Apple in Iceland and Apple Inc. Why Apple Inc. did not design
their MacOS in accordance with ISO 8859-1 as default, regardless of the
lameness of apple.is, is a mystery to me.

Elias (e...@itn.is)

Michael Salmon

May 24, 1996

Peter Kerr wrote:
>
> > >You are right that Icelandic does not have any "hooked-o" (whatever
> > >that is)....
> >
> > probably a cedilla (cf. Français)
>
> or what I call a "d with its stem bent back"
> called, I believe, "eth" and not accessible from my keyboard

If you have a compose key it is probably <compose> d -, in xterm you can use <meta>p.

--
© 1995,1996 Michael Salmon
All opinions expressed in this article remain the property of
Michael Salmon. Permission is hereby granted for use in
followup articles, FAQ's and digests.

Johan Olofsson

May 24, 1996

ja...@ibg.uit.no (Jari Oksanen) writes:

> > Quite obviously these "Nordic" character sets show no interest in Sami
> > languages which have characters which cannot be represented under ISO 8859-1
> > or ordinary 255 character ascii sets. For instance, the c with a hat.

> How is it with the following character sets? I don't have the data here right
> now, but I think I remember the Baltic languages to be covered by ISO 8859-2 ?


A URL showing the different ISO 8859-X sets is found at
"http://www.cs.tu-berlin.de/~czyborra/charsets/"

Markus Kuhn

May 25, 1996

Let's hope the days of the 8-bit coded character sets are numbered.
Anyone interested in character sets should first of all have a look at
Unicode (ISO 10646, the Universal Character Set), before complaining
about the old stuff (ISO 8859, etc.).

Some people argue that the 30 000 characters of the all-in-one Unicode
charset are much too expensive to implement for European users who
only want to cover their around 1000 latin/greek/cyrillic/technical
symbols. That's right. Please have a look at the European Standard ENV
1973:1995 which defines the

Minimum European Subset (MES)

of Unicode. This character set contains only 926 characters but covers
all European languages and many others. The 926 character subset is
very easy and cheap to implement. In addition, if you think you need
additional characters for your far-east customers or for scientific
applications, just look into ISO 10646 to see the upwards compatible
extension of MES.

For more information and an ISO 10646/MES character list, please have
a look at

<URL:http://www.indigo.ie/egt/standards/mes.html>

Folks who wonder how 16-bit Unicode could be used best on
ASCII-oriented system like Unix clearly should have a look at the
UTF-8 encoding. Some start information is available on

<URL:ftp://ftp.informatik.uni-erlangen.de/pub/doc/ISO/charsets/>

Forget the many ISO 8859 versions. Unicode/MES and UTF-8 is definitely
the way to go!

If you are responsible for buying computer software or equipment used
for handling person names or bibliographic data in Europe, make sure
the Unicode/MES character set is supported.

Markus

--
Markus Kuhn, Computer Science student -- University of Erlangen,
Internet Mail: <msk...@cip.informatik.uni-erlangen.de> - Germany
WWW Home: <http://wwwcip.informatik.uni-erlangen.de/user/mskuhn>

Osmo Ronkanen

May 27, 1996

In article <4o79sm$5...@cortex.dialin.rrze.uni-erlangen.de>,
Markus Kuhn <msk...@cip.informatik.uni-erlangen.de> wrote:
>Let's hope the days of the 8-bit coded character sets are numbered.
>Anyone interested in character sets should first of all have a look at
>Unicode (ISO 10646, the Universal Character Set), before complaining
>about the old stuff (ISO 8859, etc.).
>
>Some people argue that the 30 000 characters of the all-in-one Unicode
>charset are much too expensive to implement for European users who
>only want to cover their around 1000 latin/greek/cyrillic/technical
>symbols. That's right. Please have a look at the European Standard ENV
>1973:1995 which defines the
>
> Minimum European Subset (MES)
>
>of Unicode. This character set contains only 926 characters but covers
>all European languages and many others. The 926 character subset is
>very easy and cheap to implement. In addition, if you think you need
>additional characters for your far-east customers or for scientific
>applications, just look into ISO 10646 to see the upwards compatible
>extension of MES.
>

I looked at that and I do not see the point. Why should I pay anything in
the form of size or speed so that I could get Cyrillic or Greek
characters? I do not know Greek or Russian and therefore I do not need
those characters. Should I have a need to write Greek or Russian names,
I'd use the Latin alphabet as one should do.

Yes, Russians could use them, but they already have their codes
optimized for their need. Why should they use system that even with the
UTF-8 encoding wastes two bytes per character. Even in Finnish some
characters would use 16 bits when I know can do all in 8-bits with
Latin-1 or PC character set.

The majority of all communication is always local, so local
communication should be the one that decides. International
communication is secondary. This UTF works fine for Americans but not for
Europeans. Why should I compromise something to get the same character set
as the Russians or Greeks? Why should I have special drivers for printer,
keyboard and screen so that I can store my text files in a way that
consumes more space?

I do see the value of Unicode etc. in international communications and
translations between character sets, but I really do not see it as a
replacement for 8 bit codes. There is simply so much hardware and
software that transforming is not easy. I think a good thing would be if
operating systems were aware of various character sets, like having an
attribute for each text file that tells the set. In this way Unicode
etc. could better live in harmony with current sets and could be used
only when necessary.

>For more information and an ISO 10646/MES character list, please have
>a look at
>
> <URL:http://www.indigo.ie/egt/standards/mes.html>
>
>Folks who wonder how 16-bit Unicode could be used best on
>ASCII-oriented system like Unix clearly should have a look at the
>UTF-8 encoding. Some start information is available on
>
> <URL:ftp://ftp.informatik.uni-erlangen.de/pub/doc/ISO/charsets/>
>

Best for whom? IMO it looked like it was designed best for those who
think of non-ASCII characters as some rare special characters whose
space one can afford to waste.

>Forget the many ISO 8859 versions. Unicode/MES and UTF-8 is definitely
>the way to go!
>

Sorry, but some people depend daily on existing 8-bit codes and they cannot
just forget them. Some people still use 7-bit national character sets,
even though that change would be quite small compared to a change to 16
bits.

>If you are responsible for buying computer software or equipment used
>for handling person names or bibliographic data in Europe, make sure
>the Unicode/MES character set is supported.


Osmo


Markus Kuhn

May 28, 1996

ronk...@cc.helsinki.fi (Osmo Ronkanen) writes:

>I think a good thing would be if
>operating systems were aware of various character sets, like having an
>attribute for each text file that tells the set.

Sorry, I disagree. I strongly feel that the wonderful beauty and
simplicity of having one *single* character set instead of complicated
attributes and conversion tools by far overrules the minor increased
memory usage of Unicode. Raw uncompressed non-ASCII text files
constitute less than 5% of typical harddisks. Unicode is the first
chance for a true simplification in the field of character sets since
US-ASCII became popular.

Have you ever tried to implement a cross-application cut&paste
facility (like that of xterm) that supports a character set switching
system like ISO 2022 or your attributed files? This is a *very*
healthy exercise as far as old-fashioned opinions about sticking with
the 8-bit sets are concerned.

Erik Naggum

May 28, 1996

[Markus Kuhn]

| Raw uncompressed non-ASCII text files constitute less than 5% of typical
| harddisks.

do you have any references for this number?

and what does "raw, uncompressed non-ASCII text files" mean?

--
SIGNATURE -- biological interference

Scott Schwartz

May 29, 1996

msk...@unrza3.dialin.rrze.uni-erlangen.de (Markus Kuhn) writes:
| Folks who wonder how 16-bit Unicode could be used best on
| ASCII-oriented system like Unix clearly should have a look at the
| UTF-8 encoding. Some start information is available on
|
| <URL:ftp://ftp.informatik.uni-erlangen.de/pub/doc/ISO/charsets/>

See also:
<URL:http://plan9.att.com/plan9/doc/utf.html>

Unix ports of the Plan 9 editor, terminal emulator, and X11 fonts are
freely available. Look around in <URL:ftp://ftp.ecf.toronto.edu/pub/plan9/>


Osmo Ronkanen

May 29, 1996

In article <4of365$1...@cortex.dialin.rrze.uni-erlangen.de>,
Markus Kuhn <msk...@cip.informatik.uni-erlangen.de> wrote:
>ronk...@cc.helsinki.fi (Osmo Ronkanen) writes:
>
>>I think a good thing would be if
>>operating systems were aware of various character sets, like having an
>>attribute for each text file that tells the set.
>
>Sorry, I disagree. I strongly feel that the wonderful beauty and
>simplicity of having one *single* character set instead of complicated
>attributes and conversion tools by far overrules the minor increased
>memory usage of Unicode. Raw uncompressed non-ASCII text files
>constitute less than 5% of typical harddisks.

Why non-ASCII? Are you saying that after the switch there would be both
ASCII and then Unicode?

Hard disks are not the only resource there is. There are also memory,
serial ports etc. Either those resources are wasted as well or then all
programs become more complicated especially if the system cannot
translate between different character sets.

If the transition to Unicode is made completely, then it affects all text
no matter where it is stored. Compression helps somewhat, but it does not remove
the extra resource need completely. Also I do not think I should have to
use compression just to get rid of something that was inserted in my
files without any good need.

>Unicode is the first
>chance for a true simplification in the field of character sets since
>US-ASCII became popular.
>
>Have you ever tried to implement a cross-application cut&paste
>facility (like that of xterm) that supports a character set switching
>system like ISO 2022 or your attributed files?

What is the problem? The paste buffer could be in Unicode and the system
could provide the translations.

>This is a *very*
>healthy exercise as far as old-fashioned opinions about sticking with
>the 8-bit sets are concerned.


Osmo

Markus Kuhn

May 30, 1996

ronk...@cc.helsinki.fi (Osmo Ronkanen) writes:

>>Sorry, I disagree. I strongly feel that the wonderful beauty and
>>simplicity of having one *single* character set instead of complicated
>>attributes and conversion tools by far overrules the minor increased
>>memory usage of Unicode. Raw uncompressed non-ASCII text files
>>constitute less than 5% of typical harddisks.

>Why non-ASCII? Are you saying that after the switch there would be both
>ASCII and then Unicode?

My field of experience is Unix systems and on Unix, the UTF-8
encoding of Unicode is a very nice and elegant solution. As long as
you use only ASCII characters, there is no difference between UTF-8
and ASCII. That was the only reason why I excluded pure-ASCII files.
But I hope the discussion below will make clear why the difference
between ASCII and non-ASCII text is not very relevant at all.

Text strings are not a major resource factor anywhere. If they are,
then you use data compression anyway, and then switching to Unicode
does not make resource utilization much worse.

>Hard disks are not the only resource there is. There are also memory,
>serial ports etc. Either those resources are wasted as well or then all
>programs become more complicated especially if the system cannot
>translate between different character sets.

Let's concentrate on personal computers:

RS-232 serial ports are practically only used to connect modems and
mice to them. Mice don't transmit text and every reasonable modem
implements data compression which almost eliminates the difference
between the various character encodings as far as resource
utilization is concerned.

Look on your harddisk! Look in your RAM! Readable text is *not* what
fills the major fraction of typical harddisks and RAM chips today. The
overwhelming majority of resources are required to handle executable
code, libraries, fonts, caches, and frame buffers of GUIs.

You don't believe me? I just examined the 16 Mb of my own system with
"string /dev/mem | wc -c" this showed that only around 10% of the RAM
looked like ASCII text at all. Examination of the output of "string
/dev/mem" at random spots made clear that this measurement method is
much too pessimistic, because much less than 1/4 of the text that the
Unix "string" command identified as text was actual human readable
ASCII text that would be a candidate vor Unicode conversion.

Summary: Less than 3% of the RAM of my workstation would be affected
by a general ASCII -> Unicode switch. During the measurement, the
executing processes were the system daemons, an X server, Emacs, my
newsreader, a Web browser, xterm, shell, and various tiny little
tools. A very typical workstation environment.

So when switching to a 16-bit character encoding, I would need 3% more
RAM. Sounds harmless compared to what switching to a new improved
software release usually costs in RAM.

Start your favourite word processor and read in a long text document.
Your wordprocessor will usually require at least 2-3 Mb RAM (recent
versions much more), but your text file is rarely longer than 80 kb.
Again we have a <4% fraction of RAM which is occupied by ASCII encoded
text.

>If the transition to Unicode is made completely, then it affects all text
>no matter where it is stored. Compression helps somewhat, but it does not remove
>the extra resource need completely. Also I do not think I should have to
>use compression just to get rid of something that was inserted in my
>files without any good need.

If this bugs you, then I have very bad news for you:

English text has an entropy of less than two bits per character! The
numbers for most other scripts are similar. You have been able to live
with the factor four waste of the ASCII encoding over decades now and
it never seems to have troubled you. If it troubled you, then you
certainly started using more efficient text encodings than plain ASCII
(LZW, deflate, LZSS, etc.). Have there been similar discussions when
we switched from the TELEX character set IA2 to ASCII in order to
introduce these wonderful modern lowercase letters at the cost of
giving up 5 bit per character efficiency?

What is wrong with using Unicode? Going from a factor 4 to a factor 8
waste with the benefit of eliminating all the character encoding
transformation pain does not sound too bad for me in a time where the
resources of computers are occupied by multimedia applications that
need a factor of 10 000 more resources. The argument that a 16-bit
character set is wasting resources becomes fully ridiculous when you
fire up your MPEG video application, which requires for a single
second of low quality video the amount of memory that can also store
more Unicode text than you can read per day!

Convinced? :-)

RAM and mass storage prices are reduced by >50% every two years but
how many decades have we been frustrated by computers which were not
able to encode the characters that we needed urgently? How many weeks
have I spent writing theoretically unnecessary character set
conversion tools that are only necessary in order to cope with the
expensive 8-bit character set legacy?

Erik Naggum

May 31, 1996

[Markus Kuhn]

| Text strings are not a major resource factor anywhere.

you back this up with evidence that is anecdotal at best. I say it's
nonsense. text is _the_ major resource factor in computing today. it is
not even beaten by sound, images, or video.

my system currently uses 375M of disk space. I don't want to pay the cost
of data compression on my text files (which is considerable, especially
when it comes to searching or obtaining random access into them). I went
through it right now, and I find that 211M of it is text, probably what you
would call "raw, uncompressed text". it happens to be mostly ASCII, but I
use ISO Latin 1 for my native tongue. I did the "strings /dev/mem | wc -c"
thing, and of my 32M of memory, 9.5M were considered text by `strings'.

my system is populated with source code and documentation and articles and
books I'm writing. my editor has to "unpack" UTF-8 into either 16- or
32-bit quantities. editing UTF-8 with random access is _very_ painful.
therefore, you will need to use the still limited CPU resource to decode
and encode, as well as compress and uncompress the text, _all_ the time,
instead of doing something useful.

I don't want your silly notions of the unimportance of my resources to
dictate any future waste of my resources. OK?

ISO 10646 is a patently silly standard in that its creators were unable to
perceive the advantages of backward compatibility and went ahead with a
"16-bit standard", which, of course, were neither sufficient nor useful.
UTF-8 is spitting in the face of anything outside of ASCII.

ISO 10646 was a very good idea as long as it maintained conformance to ISO
4873, which would have given us the ability to use scripts natively in
"8-bit mode" and escapes to other sets much more easily than with ISO 2022.

but programmers don't understand character sets, and so they went for ISO
10646 a.k.a. Unicode. we blew it, guys. it's time to start over. this
time, we get rid of the programmers and get some computer scientists in on
the team -- people who aren't terrified of "stateful encodings" because
they don't know how to program them.

it also appears that you want to talk to the Microsoft worshippers, despite
your claim to experience from Unix systems. I don't care what Microsoft
worshippers do -- they will have lost their religion by 2005, anyway. I do
care about having the information survive their fling with Microsoft, and I
_don't_ care for stupid incompatibilities in the fundamental representation
of that information, the likes of ISO 10646. someday soon, Microsoftists
will realize that their god has stolen their information from them, and
will demand further sacrifices to get it back, *huge* sacrifices. that's
when the world of computer scientists and international standards bodies
should be prepared to have a solution ready. that solution is _not_ some
silly thing like ISO 10646. more probably, it's a system that allows you
to use _any_ encoding you like, because that's what computers already do,
and still interoperate successfully through descriptions of your favorite
encoding. this should enable users to preserve precious RAM and disk, and
us to preserve the information from the vagaries of vendors.

ISO 10646 is a _vendor_ standard. that's another good reason to avoid it.

--
sometimes a .sig is just a .sig

Tony Gaunt

May 31, 1996


In article <30425584...@arcana.naggum.no>, Erik Naggum (er...@naggum.no) writes:
>[Markus Kuhn]

[Everything snipped]

Er.... anyone want to speak English here? Norwegian? Danish?
Swedish? Finnish? Serbo-Croat?? Anything, except this obscure
language which I don't understand?

Tarra sithee. ;-)

Tony

===================================================================
Absence is the best defence.

(Nils-Fredrik Nielsen)

Gary Capell

Jun 1, 1996

Erik Naggum <er...@naggum.no> writes:

>my system currently uses 375M of disk space. I don't want to pay the
>cost of data compression on my text files (which is considerable,
>especially when it comes to searching or obtaining random access into
>them).

I don't understand. You seem to favour using one or more special 8bit
encodings and translating between them when necessary, then speak about
searching all your files. How are you going to search through all your
files unless they all use the same character encoding? Or are you
assuming you'll never deal with text that doesn't fit in your regional
8-bit character set? Or am I missing something basic?

>my system is populated with source code and documentation and articles
>and books I'm writing. my editor has to "unpack" UTF-8 into either
>16- or 32-bit quantities. editing UTF-8 with random access is _very_
>painful. therefore, you will need to use the still limited CPU resource to
>decode and encode, as well as compress and uncompress the text, _all_
>the time, instead of doing something useful.

IMO a universal character set is worth some expenditure of resources.
I don't believe there's anything stopping folks from using whatever
character set they feel like internally, and only using Unicode to
communicate with other folks.

>I don't want your silly notions of the unimportance of my resources to
>dictate any future waste of my resources. OK?

My most important resource is time. If people send me text with a
character set that isn't Unicode, they're wasting that resource.
Internally, you can use whatever character set makes you happy.

>but programmers don't understand character sets, and so they went for
>ISO 10646 a.k.a. Unicode. we blew it, guys. it's time to start over.

Yeah, that's what we need, _more_ standards, lots and lots of them.
--
http://www.cs.su.oz.au/~gary/

Erik Naggum

Jun 1, 1996

[Gary Capell]

| I don't understand. You seem to favour using one or more special 8bit
| encodings and translating between them when necessary, then speak about
| searching all your files. How are you going to search through all your
| files unless they all use the same character encoding? Or are you
| assuming you'll never deal with text that doesn't fit in your regional
| 8-bit character set? Or am I missing something basic?

my argument was not that I had any more panaceas than you do, only that one
datapoint on the relative usage of "raw uncompressed text" was not quite as
general as it was implied to be, to put it very politely. specifically, my
argument was that data compression is a waste of resources (you seem to
have a special thing for "time", but only your own, not mine) that does not
compare favorably to the space it saves.

also: try to understand that not all people use tools that compare bytes.
some of us use tools that compare _characters_. obviously, when searching
files, translating character sets is typically covered by "when necessary".
your hostile questions are therefore completely irrelevant, too.

| IMO a universal character set is worth some expenditure of resources.

I'm not criticizing your opinions. opine whatever you want. my argument
is: don't _force_ your _opinions_ on me. I hope this is clear, now.

| I don't believe there's anything stopping folks from using whatever
| character set they feel like internally, and only using Unicode to
| communicate with other folks.

then why do you want to restrict them to Unicode when communicating with
"other folks"? why can we not compare internal sets and optimize for the
cases where they are the same, just as, e.g., modern FTP clients do when
transferring text files between Unix systems instead of the wasteful and
expensive translation of line termination conventions? if every file was
associated with (or contained) an identifier stating its character set,
on-the-fly conversion would be so much simpler, and information would win.
if everybody had to convert to Unicode first, that would be an enormous
waste of bandwidth and computer power, especially if UTF-8 is considered to
be a winner.

| My most important resource is time. If people send me text with a
| character set that isn't Unicode, they're wasting that resource.

you admit that you have failed to solve, but still recognize, the problem
that I'm trying to address: the fact that people are _not_ using Unicode,
and, considering the flaws of that standard and its handful of encoding
schemes, _should_ not be using Unicode. you will have to have your time
wasted a lot more than I will, because I plan for the existing world, not
some Unicoded dreamscape. Unicode (ISO 10646) is but _another_ encoding in
the mess of character sets. it solves exactly _nothing_, and creates a
whole slew of new problems that it also leaves to others to handle. talk
about wasting people's time, eh?

there are some 800 coded character sets in use in the real world. if we
can collapse them to a question of mapping codes to symbolic names that can
be manipulated as such, we only need a system for mapping each set to the
set of symbolic names, possibly including aliases among the names. this is
not hard. it's a day's work to write such a "character subsystem" in any
reasonably powerful programming language. it would, of course, be useful
if those languages supported the distinction between representation and
value, which popular languages such as C and C++ most definitely do not,
but it should be possible to fake it even in popular programming languages.
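
for instance, a minimal, hypothetical sketch in C of such a
name-based mapping (the table fragments are invented for
illustration; a real subsystem would load the complete name list of
every registered set):

    #include <string.h>

    /* each coded character set is described by a table mapping code
     * values to symbolic character names; conversion then goes
     * code -> name -> code. */
    typedef const char *nametab[256];

    static nametab latin1, cp850;

    static void init(void)
    {
        latin1[0xE4] = "LATIN SMALL LETTER A WITH DIAERESIS";
        cp850[0x84]  = "LATIN SMALL LETTER A WITH DIAERESIS";
        /* ... a real subsystem loads all entries of all sets ... */
    }

    /* convert one code value between two sets; returns -1 if the
     * character has no equivalent in the target set */
    static int convert(nametab from, nametab to, int code)
    {
        int i;
        const char *n = from[code];
        if (n == NULL)
            return -1;
        for (i = 0; i < 256; i++)
            if (to[i] != NULL && strcmp(to[i], n) == 0)
                return i;
        return -1;
    }

aliases among the names and precomputed 256-entry mapping tables are
refinements of the same idea.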

| >but programmers don't understand character sets, and so they went for
| >ISO 10646 a.k.a. Unicode. we blew it, guys. it's time to start over.
|
| Yeah, that's what we need, _more_ standards, lots and lots of them.

we need standards that do it right, not any specific quantity of them. the
quantity of standards per se is neither a problem nor a solution to any
problems. Unicode (ISO 10646) does it more wrong than any other character
set standard before it, and only adds to the number, as well.

this does not mean that the work done is not valuable: I consider the
naming scheme in ISO 10646 to be very important. unfortunately, no member
of the committee that made them up thinks so, at least not in public
documents emanating from the committee. therefore, I cannot trust ISO
10646 to be any more useful than a brick the size and the weight of the
document itself. this is truly tragic, because a massive amount of work
was put into that standard.

ISO 10646 is a _magnificent_ waste of time, money, and intellect.

Erland Sommarskog

Jun 1, 1996

Markus Kuhn (msk...@cip.informatik.uni-erlangen.de) wrote:

>Some people argue that the 30 000 characters of the all-in-one Unicode
>charset are much too expensive to implement for European users who
>only want to cover their around 1000 latin/greek/cyrillic/technical
>symbols. That's right. Please have a look at the European Standard ENV
>1973:1995 which defines the
>
> Minimum European Subset (MES)
>
>of Unicode. This character set contains only 926 characters but covers
>all European languages and many others. The 926 character subset is
>very easy and cheap to implement. In addition, if you think you need
>additional characters for your far-east customers or for scientific
>applications, just look into ISO 10646 to see the upwards compatible
>extension of MES.

I'd question the usefulness of this subset. It certainly does not
cover all scripts in use in Europe. To put it bluntly, it disregards
the fact that Europe has all sorts of people living there these
days. Anyone who has walked around Leicester Square can't have failed
to notice that the names of streets are also available in Chinese, and
I believe in other parts of London you find Indian scripts. Whether
you actually find any official signs in Arabic in France I don't know,
but they have a sizable Arabic population there. In Bulgaria there are
a few islands of Armenians. (You can even find Georgian in Bulgaria,
but that's on an icon in a monastery.)

And while it does simplify things to avoid complicated scripts
like Chinese and Arabic, I can't see that it buys anything to skip
over Armenian and Georgian. (And looking over the Cyrillic section,
I get a nagging feeling that letters needed for languages such as
Mordvinian or Chechen are missing as well.)

Where do I send my protest list? :-)


--
Erland Sommarskog, Stockholm, som...@algonet.se
F=F6r =F6vrigt anser jag att QP b=F6r f=F6rst=F6ras.
B=65sid=65s, I think QP should b=65 d=65stroy=65d.

Johan Olofsson

Jun 2, 1996

Osmo Ronkanen wrote:

> I looked at that and I do not see the point. Why should I pay anything in
> the form of size or speed so that I could get Cyrillic or Greek
> characters? I do not know Greek or Russian and therefore I do not need
> those characters. Should I have a need to write Greek or Russian names,
> I'd use the Latin alphabet as one should do.

As long as we are bound to 8-bit character sets, problems will arise each time
we try to combine two languages needing different character sets in the same
text or in the same program.

Do you remember how we recently couldn't produce a proper version of the
Sapmi national anthem, since we here use iso-8859-1 and not iso-8859-10?
[ We == soc.culture.nordic ]
--
e-mail: j...@lysator.liu.se

Osmo Ronkanen

Jun 2, 1996

In article <4ol01l$n...@cortex.dialin.rrze.uni-erlangen.de>,
Markus Kuhn <msk...@cip.informatik.uni-erlangen.de> wrote:
>ronk...@cc.helsinki.fi (Osmo Ronkanen) writes:
>
>>In article <4of365$1...@cortex.dialin.rrze.uni-erlangen.de>,
>>Markus Kuhn <msk...@cip.informatik.uni-erlangen.de> wrote:
>
>>>Sorry, I disagree. I strongly feel that the wonderful beauty and
>>>simplicity of having one *single* character set instead of complicated
>>>attributes and conversion tools by far overrules the minor increased
>>>memory usage of Unicode. Raw uncompressed non-ASCII text files
>>>constitute less than 5% of typical harddisks.
>
>>Why non-ASCII? Are you saying that after the switch there would be both
>>ASCII and then Unicode?
>
>My field of experience is Unix systems and on Unix, the UTF-8
>encoding of Unicode is a very nice and elegant solution. As long as
>you use only ASCII characters, there is no difference between UTF-8
>and ASCII. That was the only reason why I excluded pure-ASCII files.
>But I hope the discussion below will make clear why the difference
>between ASCII and non-ASCII text is not very relevant at all.
>

As I understand it, UTF-8 is intended as a temporary solution. At the
end there lies a true 16-bit character set where the idea of using 8
bits per character sounds like using 4 bits per number. In that system
even ASCII characters use as much.

UTF-8 has some merit in it, but it works only for streams: one cannot
index directly into characters that are stored in UTF-8 format. Also
UTF-8 is most effective on English text; on East European text it is
almost as effective, as the share of accented characters is so small.
However, with languages like Russian and Greek it is less effective.
Do you think those people can afford the waste of resources?
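
Indexing indeed degenerates to a linear scan: to find the n-th
character one has to skip continuation bytes from the start of the
string. A minimal, hypothetical sketch in C:

    /* Minimal sketch: return the byte offset of character number n
     * (counting from 0) in a UTF-8 string of len bytes, or -1.
     * Continuation bytes have the bit pattern 10xxxxxx and are not
     * counted as characters. */
    long utf8_index(const unsigned char *s, long len, long n)
    {
        long i, chars = 0;
        for (i = 0; i < len; i++)
            if ((s[i] & 0xC0) != 0x80) {    /* start of a character */
                if (chars == n)
                    return i;
                chars++;
            }
        return -1;                          /* fewer than n+1 chars */
    }

So character number n costs O(n) byte inspections instead of O(1).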

>Text strings are not a major resource factor anywhere. If they are,
>then you use data compression anyway, and then switching to Unicode
>does not make resource utilization much worse.
>
>>Hard disks are not the only resource there is. There are also memory,
>>serial ports etc. Either those resources are wasted as well or then all
>>programs become more complicated especially if the system cannot
>>translate between different character sets.
>
>Let's concentrate on personal computers:
>
>RS-232 serial ports are practically only used to connect modems and
>mice to them. Mice don't transmit text and every reasonable modem
>implements data compression which almost eliminates the difference
>between the various character encodings as far as resource
>utilization is concerned.
>

The data compression in modems is not that good. I for example typically
compress text files with pkzip for downloading.

>Look on your harddisk! Look in your RAM! Readable text is *not* what
>fills the major fraction of typical harddisks and RAM chips today. The
>overwhelming majority of resources are required to handle executable
>code, libraries, fonts, caches, and frame buffers of GUIs.
>

Yeah fonts, nice to mention that. When one switches to 16-bit sets then
the space needed for fonts does not increase just two fold, it increases
256-fold. Even with MES it increases over 3-fold. I would really love
to waste my resources in storing Russian and Chinese characters.

>You don't believe me? I just examined the 16 Mb of my own system with
>"string /dev/mem | wc -c" this showed that only around 10% of the RAM
>looked like ASCII text at all. Examination of the output of "string
>/dev/mem" at random spots made clear that this measurement method is
>much too pessimistic, because much less than 1/4 of the text that the
>Unix "string" command identified as text was actual human readable
>ASCII text that would be a candidate vor Unicode conversion.
>
>Summary: Less then 3% of the RAM of my workstation would be affected
>from a general ASCII -> Unicode switch. During the measurement, the
>executing processed were the system daemons, an X server, Emacs, my
>newsreader, a Web browser, xterm, shell, and various tiny little
>tools. A very typical workstation environment.
>

Are you sure that your workstation is typical? Are programmers typical
computer users? In some systems text is the major waste of resources.
Think about Usenet.

>So when switching to a 16-bit character encoding, I would need 3% more
>RAM. Sounds harmless compared to what switching to a new improved
>software release usually costs be RAM.
>
>Start your favourite word processor and read in a long text document.
>Your wordprocessor will usually require at least 2-3 Mb RAM (recent
>versions much more), but your text file is rarely longer than 80 kb.
>Again we have a <4% fraction of RAM which is occupied by ASCII encoded
>text.

The word processor also has help files, dictionaries etc.

>
>>If the transition do Unicode is made completely then all text no matter
>>where it is stored. Compression helps somewhat, but it does not remove
>>the extra resource need completely. Also I do not think I should have to
>>use compression just to get rid of something that was inserted in my
>>files without any good need.
>
>If this bug you, then I have very bad news for you:
>
>English text has an entropy of less than two bits per character! The
>numbers for most other scripts are similar. You have been able to live
>with the factor four waste of the ASCII encoding over decades now and
>it never seems to have troubled you. If it troubled you, then you
>certainly started using more efficient text encodings than plain ASCII
>(LZW, deflate, LZSS, etc.). Have there been similar discussions when
>we switched from the TELEX character set IA2 to ASCII in order to
>introduce these wonderful modern lowercase letters at the cost of
>giving up 5 bit per character efficiency?
>
>What is wrong with using Unicode? Going from a factor 4 to a factor 8
>waste with the benefit of eliminating all the character encoding
>transformation pain does not sound too bad for me in a time where the
>resources of computers are occupied by multimedia applications that
>need a factor of 10 000 more resources. The argument that a 16-bit
>character set is wasting resources becomes fully ridiculous when you
>fire up your MPEG video application, which requires for a single
>second of low quality video the amount of memory that can also store
>more Unicode text than you can read per day!
>

When you drive a car, only one millionth of the energy held by the
gasoline (remember E=mc^2) is used, so what would it matter if someone
sold you gasoline that gave you half the energy of normal gas?
(Meaning: the entropy of English text is absolutely irrelevant to the
question of whether one should use 8 or 16 bits to store it.)

As for MPEGs, I do not use them. If I want to see moving pictures, I
watch TV. If I posted some MPEGs to this group I would get angry
responses, and for a reason.

>Convinced? :-)
>
>RAM and mass storage prices are reduced by >50% every two years but
>how many decades have we been frustrated by computers which were not
>able to encode the characters that we needed urgently?

Who is "we"? I have no need for Chinese characters.

>How many weeks
>have I spent writing theoretically unnecessary character set
>conversion tools that are only necessary in order to cope with the
>expensive 8-bit character set legacy?

How hard are those conversion programs to write anyway? Conversion
between Latin-1 and Unicode in particular is easy if only the Latin-1
subset is used. So the question is basically why I should store an
extra null after each character (or before it, if the computer is big
endian). So unless idiots build the computers of the future, there
will be an eight-bit option for text. And after that, various people
from Eastern Europe will ask why they have to use 16 bits.

I say that codes like Unicode have a place in the future, but not as
the only character sets. 8-bit character sets will live on.

In fact, if the system knows Unicode, it takes just 512 bytes to define
any 8-bit character set. Why wouldn't such definitions be used?
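
A minimal sketch in C of what such a definition looks like: Latin-1
happens to occupy the first 256 code positions of Unicode, so its
table is simply the identity (the names here are mine, not from any
standard library):

    /* A complete definition of an 8-bit character set, given Unicode:
       256 entries of 16 bits each = 512 bytes. */
    unsigned short latin1_to_unicode[256];

    void init_table(void)
    {
        int i;
        for (i = 0; i < 256; i++)
            latin1_to_unicode[i] = i;  /* Latin-1 maps 1:1 into Unicode */
    }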

Osmo

Scott Schwartz

unread,
Jun 2, 1996, 3:00:00 AM6/2/96
to Osmo Ronkanen

ronk...@cc.helsinki.fi (Osmo Ronkanen) writes:
| As I understand it, UTF-8 is intended as a temporary solution.

Not according to the designers of Plan 9, which uses it throughout.

| The UTF-8 has some idea in it, but it works only for streams; one cannot
| index characters that are stored in UTF-8 format.

True, the nth character won't be the nth byte, in general. But I've
never found that to be a problem. A similar argument is that Unix
doesn't have record-oriented files, so you can't seek to the nth
line. In the end, it works well enough.
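
To make the cost concrete, here is a minimal sketch (a hypothetical
helper, not from any standard library) of indexing the nth character
of a UTF-8 string: a linear scan that skips continuation bytes,
instead of the constant-time array lookup a fixed-width encoding would
give you:

    #include <stddef.h>

    /* Return a pointer to the n-th character (not byte) of a UTF-8
       string, or NULL if the string has fewer than n+1 characters.
       Continuation bytes have the form 10xxxxxx and are skipped. */
    const char *utf8_index(const char *s, size_t n)
    {
        for (; *s; s++)
            if (((unsigned char)*s & 0xc0) != 0x80)  /* start of a char */
                if (n-- == 0)
                    return s;
        return NULL;
    }

The scan is linear rather than constant time, but for typical strings
it is cheap, which is why it rarely matters in practice.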

| Yeah fonts, nice to mention that. When one switches to 16-bit sets then
| the space needed for fonts does not increase just twofold, it increases
| 256-fold. Even with MES it increases over 3-fold. I would really love
| to waste my resources in storing Russian and Chinese characters.

With a halfway decent window system you only load the glyphs that you
need to display, and you have exactly one copy of each glyph on disk,
which you need independently of what character set you use, so it is
not a problem.

| So the question is basically why I should store an extra
| null after each character (or before it, if the computer is big endian).

Don't. That's what utf-8 is for.


Kai Henningsen

unread,
Jun 2, 1996, 3:00:00 AM6/2/96
to

ronk...@cc.helsinki.fi (Osmo Ronkanen) wrote on 29.05.96 in <4oh9rn$a...@kruuna.helsinki.fi>:

> >ronk...@cc.helsinki.fi (Osmo Ronkanen) writes:
> >
> >>I think a good thing would be if
> >>operating systems were aware of various character sets, like having an
> >>attribute for each text file that tells the set.
> >

> >Sorry, I disagree. I strongly feel that the wonderful beauty and
> >simplicity of having one *single* character set instead of complicated
> >attributes and conversion tools by far overrules the minor increased
> >memory usage of Unicode. Raw uncompressed non-ASCII text files
> >constitute less than 5% of typical harddisks.
>
> Why non-ASCII? Are you saying that after the switch there would be both
> ASCII and then Unicode?

No. He's saying that a UTF-8 file that contains only characters also
found in ASCII is in fact the same as an ASCII file; in other words,
the encoding of ASCII (characters 0x00-0x7f) in UTF-8 is, in fact, 0x00-
0x7f.

It's only when you include non-ASCII chars that you get differences.
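
A minimal encoder sketch makes the point (covering only the 16-bit
range, ignoring surrogates; the function name is mine):

    /* Encode one character code (up to 0xFFFF) as UTF-8; returns the
       number of bytes written to buf (1 to 3). */
    int ucs_to_utf8(unsigned int c, unsigned char *buf)
    {
        if (c < 0x80) {            /* ASCII: a single, unchanged byte */
            buf[0] = c;
            return 1;
        } else if (c < 0x800) {    /* includes all of Latin-1 */
            buf[0] = 0xc0 | (c >> 6);
            buf[1] = 0x80 | (c & 0x3f);
            return 2;
        } else {
            buf[0] = 0xe0 | (c >> 12);
            buf[1] = 0x80 | ((c >> 6) & 0x3f);
            buf[2] = 0x80 | (c & 0x3f);
            return 3;
        }
    }

The first branch is the whole argument: pure ASCII text comes out
byte-for-byte identical.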

> Hard disks are not the only resource there is. There are also memory,
> serial ports etc. Either those resources are wasted as well, or else
> all programs become more complicated, especially if the system cannot
> translate between different character sets.

I don't think pure text makes for a significant amount of memory usage.
Even when people use lots of text, there's probably a lot of formatting
info and management data structures and fonts and so on around at the same
time.

As for serial ports, if and when you shovel lots of text across them, you
will definitely want to use compression. After that, I don't think
there's that much overhead left as long as you use UTF-8 - of course, raw
Unicode or even UCS (ISO 10646) is worse. But then, that's one of the
reasons we _have_ UTF-8.

> If the transition to Unicode is made completely, then all text grows,
> no matter where it is stored. Compression helps somewhat, but it does
> not remove the extra resource need completely. Also I do not think I
> should have to use compression just to get rid of something that was
> inserted in my files without any good need.

The "good need" is there all right.

> >Have you ever tried to implement a cross-application cut&paste
> >facility (like that of xterm) that supports a character set switching
> >system like ISO 2022 or your attributed files?
>
> What is the problem? The paste buffer could be in Unicode and the system
> could provide the translations.

This sounds like cutting off your finger to spite your hand. If you want
to keep a byte-oriented character set with occasional multi-byte
sequences, why don't you use the simpler UTF-8?

Of course, there _are_ ISO 2022 escape sequences to announce UTF-8,
Unicode, ISO 10646, and so on for the masochistically minded ...

Kai
--
Current reorgs: news.groups, news.admin.net-abuse.* (see nana.misc)
Internet: k...@khms.westfalen.de
Bang: major_backbone!khms.westfalen.de!kai
http://www.westfalen.de/private/khms/

Gary Capell

unread,
Jun 3, 1996, 3:00:00 AM6/3/96
to

ronk...@cc.helsinki.fi (Osmo Ronkanen) writes:

>Also, UTF is most effective on English text; on East European text
>it is almost as effective, as the share of accented characters is so
>small. However, with languages like Russian and Greek it is less
>effective. Do you think those people can afford the waste of resources?

You don't have to _store_ your text as UTF-encoded Unicode. Feel
free to _store_ and _use_ text in whatever encoding you like. It
would just be nice if we had one universal character set to _exchange_
text. Unicode seems to fit the bill. And if we're going to pick one
transfer encoding for this, we should pick the one that minimizes the
total resources for all electronic communications, and for now, that
means optimizing for ASCII. It's not (meant to be) imperialism, just
pragmatism.

If you and your friend make private arrangements to exchange data using
some other character set, well and good. But to put data in a form that
_everyone_ should be able to read, I'd suggest UTF-8 encoded unicode.
(Except, of course, at the moment almost _no one_ is ready to read
this. ahem)

>Yeah fonts, nice to mention that. When one switches to 16-bit sets
>then the space needed for fonts does not increase just twofold, it
>increases 256-fold. Even with MES it increases over 3-fold. I would
>really love to waste my resources in storing Russian and Chinese
>characters.

You don't have to. Just because your application understands a large
character set doesn't mean it has to have glyphs for every one of those
characters. I use a Unicode editor (http://www.cs.su.oz.au/~gary/wily/)
where I can specify different fonts for different ranges of the
character set. I don't have (or need) fonts for the whole range (but it
would be nice to have them).

>Are you sure that your workstation is typical? Are programmers typical
>computer users? In some systems text is the major waste of resources.
>Think about Usenet.

Yes, let's think about Usenet. Should everyone on Usenet have a
character set converter for every possible 8-bit encoding, or should
they all be able to read one character set (Unicode)? Again, feel free
to immediately convert that Unicode into your local 8-bit encoding as
soon as you grab the text from Usenet (assuming all the characters in
everything you look at fit into your 8-bit encoding), but let's have a
_single_ standard for character _exchange_.

>I say that codes like Unicode have a place in the future, but not as
>the only character sets. 8-bit character sets will live on.

Sure, if you like. Just keep them for _local_ use, and don't _expect_
anyone else to know about your 8-bit character set.
--
http://www.cs.su.oz.au/~gary/

Markus Kuhn

unread,
Jun 3, 1996, 3:00:00 AM6/3/96
to

ronk...@cc.helsinki.fi (Osmo Ronkanen) writes:

>As I understand it, UTF-8 is intended as a temporary solution. At the
>end there lies a true 16-bit character set where the idea of using 8
>bits per character sounds like using 4 bits per number. In this system
>even ASCII characters use as much.

I hope that for the software environment which I use in order to do
most of my work (Linux/GNU), UTF-8 will become the permanent solution,
like it has already become for the Plan9 operating system.

>The UTF-8 has some idea in it, but it works only for streams; one cannot
>index characters that are stored in UTF-8 format.

What do you mean by this? With the Unix way of encoding ASCII text
files, you also can't index characters by row/column coordinates
because Unix stores text files with a variable line length format. I
know ("antique" ;-) operating systems that store text files in a fixed
80 character/line format (where lines are called 'cards' or
'records'), but I see no particular technical advantage of fixed line
length and fixed character length formats. At least not for
>developing editors, parsers, and databases, the applications I am
familiar with.

You got used to variable line length text formats, so I hope you will
also get used to variable character length text formats.

>However,
>with languages like Russian and Greek it is less effective. Do you think
>those people can afford the waste of resources?

Latin users will be happy with UTF-8, and Far East users are already
using 16-bit character sets today (which is ok for them, because a
Chinese character carries much more information than a Latin
character).

Unicode covers 331 Greek characters (i.e., characters which contain
the word GREEK in their name). They don't fit completely into 8-bit
space anyway, so either you use a combining character encoding in
order to get all characters, or you use a 16-bit encoding. If you want
to have all 226 Cyrillic characters from Unicode together with the
ASCII characters (at least punctuation, control, digits) in an 8-bit
set without combining characters, you have a similar situation.

>Yeah fonts, nice to mention that. When one switches to 16-bit sets then
>the space needed for fonts does not increase just twofold, it increases
>256-fold.

Come on, you can surely find more clever implementation techniques
yourself before you start protesting. Even X11 can already fragment
fonts into subset collections which are loaded only when needed, and
this is exactly what is used for the Unicode fonts that are available
for X11. Large fonts can be handled by a GUI in virtual memory such
that only the recently displayed characters are located in expensive
RAM. I am not familiar with what Microsoft did in its GUI libraries
and servers, but if they did use a considerably more silly approach, I
would not be too surprised. Their problem; I don't use their
products anyway.

>Even with MES it increases over 3-fold. I would really love
>to waste my resources in storing Russian and Chinese characters.

Ok, let's leave the PC/workstation scenario: an application where
memory *is* a critical resource factor is the cellular phone with LCD
display and e-mail display capability. A clever implementation of MES
would use combining characters internally, which reduces the MES
overhead considerably.

>Are you sure that your workstation is typical? Are programmers typical
>computer users? In some systems text is the major waste of resources.
>Think about Usenet.

My workstation operates as a Usenet server, but postings are
transmitted and stored using the gzip/deflate algorithm, of course. I
agree, raw ASCII would be a major waste of resources here.

>The word processor also has help files, dictionaries etc.

On my workstation, all manuals are stored in compressed form and are
quickly decompressed on the fly when someone wants to read them (see
/usr/man/preformat/cat1/man.1.gz). Compared to formatting and
displaying the help page, decompression takes an almost unmeasurable
amount of time. Help pages don't need random access, so there is no
reason not to compress them (and even random access can be implemented
by compressed block addressing). If your word processor really stores
help texts in raw ASCII, that only demonstrates some lack of
competence on the part of its developers, or they just want to
impress their users with huge harddisk space requirements.

There are many freeware compression libraries out there that can be
added to existing applications very easily. It took me only minutes to
add on-the-fly stream compression to applications on a good operating
system. A quick and dirty example under Unix: just transform

f = fopen("file", "r");

into

f = popen("/usr/bin/zcat file.gz", "r");

and you have added compression support (remember to close such a
stream with pclose() instead of fclose()).

>When you drive a car, only one millionth of the energy held by the
>gasoline (remember E=mc^2) is used, so what would it matter if someone
>sold you gasoline that gave you half the energy of normal gas?
>(Meaning: the entropy of English text is absolutely irrelevant to the
>question of whether one should use 8 or 16 bits to store it.)

Very bad argument. :-) I don't have a car, and with my racing bike, the
energy I need per kilometer and transported kilogram is lower than
that of an albatross or any other biological or mechanical means of
transportation. Either do it right (use a good bike, use data
compression), or don't worry about efficiency at all.

>How hard are those conversion programs to write anyway?

It depends on how much effort you spend on correctly handling the case
in which the destination character set does not contain the character
you want to transform. Good conversion programs try to avoid losing
too much information without rendering the text ugly. Have a look at
<ftp://ftp.informatik.uni-erlangen.de/pub/doc/ISO/charsets/iso2asc.txt>
if you want to see some of my early work on the simple subject of
converting Latin-1 to 7-bit ASCII.

>I say that codes like Unicode have a place in the future, but not as
>the only character sets. 8-bit character sets will live on.

The way in which Unicode is encoded does not matter too much.
The important aspect of Unicode is that we have a good table that
assigns a 16-bit number to every character and that this table is more
complete than any other we have. Whether you encode your text in
UCS-2, UTF-1, UTF-7, UTF-8, UCS-4, UTF-16, or whatever tricky encoding
you might invent for special applications of Unicode does not matter.
I can convert between all these Unicode encodings with less than 20
lines of highly efficient C code without any transformation tables and
without any loss of information. That is important, and that gives
Unicode the right to become the only character set. Whoever said that
using Unicode is limited to the 16-bit encoding UCS-2???
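
As an illustration, a decoder for the 16-bit range fits comfortably in
that budget (a sketch only: surrogate pairs and error recovery are
left out, and the name is mine):

    /* Decode one UTF-8 sequence (1-3 bytes, i.e. the 16-bit range)
       into a character number; returns the number of bytes consumed,
       or 0 for a malformed sequence. */
    int utf8_to_ucs(const unsigned char *s, unsigned int *c)
    {
        if (s[0] < 0x80) {
            *c = s[0];
            return 1;
        }
        if ((s[0] & 0xe0) == 0xc0 && (s[1] & 0xc0) == 0x80) {
            *c = ((s[0] & 0x1f) << 6) | (s[1] & 0x3f);
            return 2;
        }
        if ((s[0] & 0xf0) == 0xe0 && (s[1] & 0xc0) == 0x80
            && (s[2] & 0xc0) == 0x80) {
            *c = ((s[0] & 0x0f) << 12) | ((s[1] & 0x3f) << 6)
                 | (s[2] & 0x3f);
            return 3;
        }
        return 0;
    }

Because no mapping table is involved, the round trip is lossless.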


Conclusion:

Conversion between different *sets* is very problematic. Conversion
between different *encodings* of the *same* set is trivial, because
information loss does not have to be handled.

This is the reason why I think Unicode/ISO 10646 should become the
only character *set*, but why I do not try to convince you that either
UCS-2 or UTF-8 should become the only character *encoding*.

Erik Naggum

unread,
Jun 3, 1996, 3:00:00 AM6/3/96
to

[Markus Kuhn]

| You got used to variable line length text formats, so I hope you will
| also get used to variable character length text formats.

hear, hear! now, let's get used to variable character encoding text
formats, too. while we're at it, so to speak. none of this "fixed
encoding" nonsense that some programmers think is a panacea. UTF-8 is but
another encoding, and we should recognize that fully, instead of pretending
that it will somehow "replace" all the others.

| Conversion between different *sets* is very problematic.

nonsense. all you need is an invertible mapping from each set to a
namespace. this is _truly_ simple, and requires N tables for N sets.
the stupid ways people have ordinarily done conversion are very
problematic, since they tend to think that N*(N-1) is a good number of
tables for N sets. it isn't.

| Conversion between different *encodings* of the *same* set is trivial,
| because information loss has not to be handled.

not so. some encodings of Unicode require only 16 bits to process every
character. others require 21 bits (with the new hi/lo-pair thing).
imagine trying to decode an encoding that overflows on you. that's serious
information loss if there ever was one. (not that C/C++ programmers will
notice -- those languages don't even offer overflow detection.)

| This is the reason which I think Unicode/ISO 10646 should become the
| only character *set*, but why I do not try to convince you that either
| UCS-2 or UTF-8 should become the only character *encoding*.

so look at all other coded sets as variant encodings of your one true set.

how hard can it be?

Christopher Fynn

unread,
Jun 3, 1996, 3:00:00 AM6/3/96
to

There is a kind of Minimum European Subset already being used
in all of the PCs running European-language versions of
Windows '95 - the so-called WGL4 character set, which is a
subset of Unicode and a superset of the UGL character set.
--
Christopher J Fynn <cf...@sahaja.demon.co.uk>
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

mi...@vlsivie.tuwien.ac.at

unread,
Jun 3, 1996, 3:00:00 AM6/3/96
to

Archive-name: internationalization/iso-8859-1-charset
Posting-Frequency: monthly
Version: 2.9888


ISO 8859-1 National Character Set FAQ

Michael K. Gschwind

<mi...@vlsivie.tuwien.ac.at>


DISCLAIMER: THE AUTHOR MAKES NO WARRANTY OF ANY KIND WITH REGARD TO
THIS MATERIAL, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.

Note: Most of this was tested on a Sun 10, running SunOS 4.1.* - other
systems might differ slightly

This FAQ discusses topics related to the use of ISO 8859-1 based 8 bit
character sets. It discusses how to use European (Latin American)
national character sets on UNIX-based systems and the Internet.

If you need to use a character set other than ISO 8859-1, much of
what is described here will be of interest to you. However, you will
need to find appropriate fonts for your character set (see section 17)
and input mechanisms adapted to your language.

1. Which coding should I use for accented characters?
Use the internationally standardized ISO-8859-1 character set to type
accented characters. This character set contains all characters
necessary to type all major (West) European languages. This encoding
is also the preferred encoding on the Internet. ISO 8859-X character
sets use the characters 0xa0 through 0xff to represent national
characters, while the characters in the 0x20-0x7f range are those used
in the US-ASCII (ISO 646) character set. Thus, ASCII text is a proper
subset of all ISO 8859-X character sets.

The characters 0x80 through 0x9f are earmarked as extended control
characters, and are not used for encoding characters. These characters
are not currently used to specify anything. A practical reason for
this is interoperability with 7 bit devices (or when the 8th bit gets
stripped by faulty software): a device would then interpret the
character as some control character and be put into an undefined state.
(When the 8th bit gets stripped from the characters at 0xa0 to 0xff, a
wrong character is represented, but this cannot change the state of a
terminal or other device.)

This character set is also used by AmigaDOS, MS-Windows, VMS (DEC MCS
is practically equivalent to ISO 8859-1) and (practically all) UNIX
implementations. MS-DOS normally uses a different character set and
is not compatible with this character set. (It can, however, be
translated to this format with various tools. See section 5.)

Footnote: Supposedly, IBM code page 819 is fully ISO 8859-1 compliant.


ISO 8859-1 supports the following languages:
Afrikaans, Basque, Catalan, Danish, Dutch, English, Faeroese, Finnish,
French, Galician, German, Icelandic, Irish, Italian, Norwegian,
Portuguese, Spanish and Swedish.

(Reportedly, Welsh cannot be handled due to missing \^{w} and \^{y}.)

(It has been called to my attention that Albanian can be written with
ISO 8859-1 also. However, from a standards point of view, ISO 8859-2
is the appropriate character set for Balkan countries.)

ISO 8859-1 is just one part of the ISO-8859 standard, which specifies
several character sets:
8859-1 Europe, Latin America, Caribbean, Canada, Africa
8859-2 Eastern Europe
8859-3 SE Europe/miscellaneous (Esperanto, Maltese, etc.)
8859-4 Scandinavia/Baltic (mostly covered by 8859-1 also)
8859-5 Cyrillic
8859-6 Arabic
8859-7 Greek
8859-8 Hebrew
8859-9 Latin5, same as 8859-1 except for Turkish instead of Icelandic
8859-10 Latin6, for Lappish/Nordic/Eskimo languages

Unicode is advantageous because one character set suffices to encode
all the world's languages; however, very few programs (and even fewer
operating systems) support wide characters. Thus, only 8 bit wide
character sets (such as the ISO 8859-X sets) can be used with these
systems. Unfortunately, some programmers still insist on using the
`spare' eighth bit for clever tricks, crippling these programs such
that they can process only US-ASCII characters.


Footnote: Some people have complained about missing characters,
e.g. French users about a missing 'oe'. Note that oe is
not a character, but a typographical ligature (a combination of two
characters for typographical purposes). Ligatures are not
part of the ISO 8859-X standard. (Although 'oe' used to
be in the draft 8859-1 standard before it was unmasked as a
`mere' ligature.)

Two stories exist for the removal of the oe:
(1) argues that in the final session, the French admitted
that oe was only a ligature. This prompted the
committee to remove it.
(2) argues that the French member missed the session and the
members from the other countries simply decided to
remove it. (If this is true, where were the Swiss and
Belgians?)

Note that the oe ligature is different from the 'historical
ligature' æ, which is now considered a letter in Nordic
countries and cannot be replaced by the letters 'ae'.

A semi-official statement about the missing oe:
4. The present part 1 reflects the position of AFNOR of 1987. It may be
that this is regretted now, but no action can be taken before AFNOR
makes clear what it wants now. Canada may try to convince AFNOR
that something should be done, but as far as I know the SC2-FRANCE is
no longer active. They do not respond to letter ballots, nor to
E-mail.


2. Getting your terminal to handle ISO characters.
Terminal drivers normally do not pass 8 bit characters. To enable
proper handling of ISO characters, add the following lines to your
.cshrc:
----------------------------------
tty -s
if ($status == 0) stty cs8 -istrip -parenb
----------------------------------
If you don't use csh, add equivalent code to your shell's start up
file.

Note that it is necessary to check whether your standard I/O streams
are connected to a terminal. Only then should you reconfigure the
terminal driver. Note that tty checks stdin, but stty changes stdout.
This is OK in normal code, but if the .cshrc is executed in a pipe,
you may get spurious warnings :-(

If you use the Bourne Shell or descendants (sh, ksh, bash,
zsh), use this code in your startup (e.g. .profile) file:
----------------------------------
tty -s
if [ $? = 0 ]; then
stty cs8 -istrip -parenb >&0
fi
----------------------------------

Footnote: In the /bin/sh version, we redirect stdout to stdin, so both
tty and stty operate on stdin. This resolves the problem discussed in
the /bin/csh script version. For csh users, a possible workaround is
to use the following code in .cshrc, which spawns a Bourne shell
(/bin/sh) to handle the redirection:
----------------------------------
tty -s
if ($status == 0) sh -c "stty cs8 -istrip -parenb >&0"
----------------------------------

3. Getting the locale setting right.
For the ctype macros (and by extension, applications you are running
on your system) to correctly identify accented characters, you
may have to set the ctype locale to an ISO 8859-1 conforming
configuration. On SunOS, this may be done by placing
------------------------------------
setenv LANG C
setenv LC_CTYPE iso_8859_1
------------------------------------
in your .login script (if you use the csh). An equivalent statement
will adjust the ctype locale for non-csh users.

The process is the same for other operating systems, e.g. on HP/UX use
'setenv LANG german.iso88591'; on IRIX 5.2 use 'setenv LANG de'; on Ultrix 4.3
use 'setenv LANG GER_DE.8859' and on OSF/1 use 'setenv LANG
de_DE.88591'. The examples given here are for German. Other
languages work too, depending on your operating system. Check out
'man setlocale' on your system for more information.

*****If you can confirm or deny this, please let me know.*****
Currently, each system vendor has his own set of locale names, which
makes portability a bit problematic. Supposedly there is some X/Open
document specifying a

<language>_<country>.<character_encoding>

syntax for environment variables specifying a locale, but I'm unable
to confirm this.

While many vendors now use the <language>_<country> encoding, there
are many different encodings for languages and countries.

Many vendors seem to use some derivative of this encoding:
It looks as if <language> is the two-letter code for the language from
ISO 639, and <country> is the two-letter code for the country from ISO
3166, but I don't know of any standard specifying <character_encoding>.

An appropriate name source for the <character_encoding> part of the
locale name would be to use the character set names specified in RFC
1345 which contains names for all standardized character sets.
(Preferably, the canonical name and all aliases should be accepted,
with the canonical name being the first choice.) Using this
well-known character set repository as name source would bring an end
to conflicting names, without the need to introduce yet another
character set directory with the inherent dangers of inconsistency and
duplicated effort.
*****If you can confirm or deny this, please let me know.*****

Footnote on HP/UX systems:
As of 10.0, you can use either german.iso88591 or de_DE.iso88591 (a
name more in line with other vendors and developing standards for
locale names). For a complete listing of locale names, see the text
file /usr/lib/nls/config. Or, on HP-UX 10.0, execute 'locale -a'. This
command will list all locales currently installed on your system.

4. Selecting the right font under X11 for xterm (and other applications)
To actually display accented characters, you need to select a font
which contains bitmaps for ISO 8859-1 characters in the
correct character positions. The names of these fonts normally
have the suffix "iso8859-1". Use the command
# xlsfonts
to list the fonts available on your system. You can preview a
particular font with the
# xfd -fn <fontname>
command.

Add the appropriate font selection to your ~/.Xdefaults file, e.g.:
----------------------------------------------------------------------------
XTerm*Font: -adobe-courier-medium-r-normal--18-180-75-75-m-110-iso8859-1
Mosaic*XmLabel*fontList: -*-helvetica-bold-r-normal-*-14-*-*-*-*-*-iso8859-1
----------------------------------------------------------------------------

While X11 is further along than most system software when it comes to
internationalization, it still contains many bugs. A number of bug
fixes can be found at URL http://www.dtek.chalmers.se:80/~maf/i18n/.

Footnote: The X11R5 distribution has some fonts which are labeled as
ISO fonts, but which contain only the US-ASCII characters.

5. Translating between different international character sets.
While ISO 8859-1 is an international standard, not everybody uses this
encoding. Many computers use their own, vendor-specific character sets
(most notably Microsoft for MS-DOS). If you want to edit or view files
written in a different encoding, you will have to translate them to an
ISO 8859-1 based representation.

There are several PD/free character set translators available on the
Internet, the most notable being 'recode'. recode is available from
URL ftp://prep.ai.mit.edu/u2/emacs. recode is covered by FSF
copyright and is freely redistributable.

The general format of the program call is:

recode [OPTION]... [BEFORE]:[AFTER] [FILE]

Each FILE will be read assuming it is coded with charset BEFORE and
will be recoded over itself so as to use the charset AFTER. If there
is no such FILE, the program acts as a filter and recodes standard
input to standard output.

Some recodings are not reversible, so after you have converted the
file (recode overwrites the original file with the new version!), you
may never be able to reconstruct the original file. A safer way of
changing the encoding of a file is to use the filter mechanism of
recode and invoke it as follows:

recode [OPTION]... [BEFORE]:[AFTER] <[OLDFILE] >[NEWFILE]

Under SunOS, the dos2unix and unix2dos programs (distributed with
SunOS) will translate between MS-DOS and ISO 8859-1 formats.

It is somewhat more difficult to convert German, `Duden'-conformant
Ersatzdarstellung (ä = ae, ß = ss or sz, etc.) into the ISO 8859-1
character set. The German dictionary available as URL
ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit/dicts/deutsch.tar.gz also
contains a UNIX shell script which can handle all conversions except
those involving ß (German scharfes s), as the conversion from `ss' is
more complicated.

A more sophisticated program to translate Duden Ersatzdarstellung to
ISO 8859-1 is Gustaf Neumann's diac program (version 1.3 or later)
which can translate all ASCII sequences to their respective ISO 8859-1
character set representation. 'diac' is available as URL
ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit/diac.

Translating ISO 8859-1 to ASCII can be performed with a little sed
script tailored to your needs (a minimal example appears after this
list). But be aware that
* No one-to-one mapping between Latin 1 and ASCII strings is possible.
* Text layout may be destroyed by multi-character substitutions,
especially in tables.
* Different replacements may be in use for different languages,
so no single standard replacement table will make everyone happy.
* Truncation or line wrapping might be necessary to fit textual data
into fields of fixed width.
* Reversing this translation may be difficult or impossible.
* You may be introducing ambiguities into your data.
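
For illustration, here is a minimal filter in the same spirit (a sed
script would do equally well; the replacement table is a tiny
German-oriented sample, not a standard):
----------------------------------------------------------------------------
#include <stdio.h>

/* Minimal Latin-1 -> ASCII filter. Note the multi-character
   substitutions that can destroy column alignment in tables. */
int main(void)
{
    int c;
    while ((c = getchar()) != EOF)
        switch (c) {
        case 0xe4: fputs("ae", stdout); break;  /* a umlaut */
        case 0xf6: fputs("oe", stdout); break;  /* o umlaut */
        case 0xfc: fputs("ue", stdout); break;  /* u umlaut */
        case 0xdf: fputs("ss", stdout); break;  /* sharp s */
        default:   putchar(c);
        }
    return 0;
}
----------------------------------------------------------------------------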

6. Printing accented characters.

6.1 PostScript printers
If you want to print accented characters on a postscript printer, you
may need a PS filter which can handle ISO characters.

Our Postscript filter of choice is a2ps, the more recent version of
which can handle ISO 8859-1 characters with the -8 option. a2ps V4.3
is available as URL ftp://imag.imag.fr/archive/postscript/a2ps.V4.3.tar.gz.

If you use the pps postscript filter, use the 'pps -ISO' option for
pps to handle ISO 8859-1 characters properly.


6.2 Other (non-PS) printers:
If you want to print to non-PS printers, your success rate depends on
the encoding the printer uses. Several alternatives are possible:

* Your printer accepts ISO 8859-1:
You're lucky. No conversion is needed, just send your files to the
printer.


* Your printer supports a PC-compatible font:
You can use the recode tool to translate from ISO 8859-1 to this
encoding. (If you are using a SunOS based computer, you can also use
the unix2dos utility which is part of the standard distribution.)
Just add the appropriate invocation as a built-in filter to your
printer driver.

At our site, we use the following configuration to print ISO 8859-1
characters on an IBM Proprinter XL :

/etc/printcap

lp|isolp|Line Printer with ISO-8859-1:\
:lp=/dev/null:\
:sd=/usr/spool/lpd/lp:mx#0:if=/usr/spool/lpd/iso2dos.sh:rs:
rawlp|Lineprinter:\
:lp=:rm=lphost.vlsivie.tuwien.ac.at:rp=lp:sd=/usr/spool/lpd/rawlp:rs:

/usr/spool/lpd/iso2dos.sh

#!/bin/sh
if /usr/local/gnu/bin/recode latin-1:ibm-pc | /usr/ucb/lpr -Prawlp
then
exit 0
else
exit 1
fi


* Your printer uses a national ISO 646 variant (7 bit ASCII
with some special characters replaced by national characters):
You will have to use a translation tool; this tool would
then be installed in the printer driver and translate character
conventions before sending a file to the printer. The recode
program supports many national ISO 646 norms. (If you do add support
for a new variant, please submit it to the maintainers of recode, so
that it can benefit everybody.)

Unfortunately, you will not be able to display all characters with
the built-in character set. Most printers have user-definable
bit-map characters, which you can use to print all ISO characters.
You just have to generate a pix-map for any particular character and
send this bitmap to the printer. The syntax for these characters
varies, but a few conventions have gained universal acceptance
(e.g., many printers can process Epson-compatible escape sequences).


* Your printer supports a strange format:
If your printer supports some other strange format (e.g. HP Roman8,
DEC MCS, Atari, NeXTStep, EBCDIC or what have you), you have to add a
filter which will translate ISO 8859-1 to this encoding before
sending your data to the printer. 'recode' supports many of these
character sets already. If you have to write your own conversion
tool, consider this as a good starting base. (If you add support for
any new character sets, please submit your code changes to the
maintainers of recode).

If your printer supports DEC MCS, this is nearly equivalent to ISO
8859-1 (actually, DEC MCS is a former ISO 8859-1 draft standard; the
only characters which are missing are the Icelandic characters eth
and thorn at locations 0xD0, 0xF0, 0xDE and 0xFE) - the difference is
only a few characters. You could probably get by with just sending
ISO 8859-1 to the printer.


* Your printer supports ASCII only:
You have several options:
+ If your printer supports user-defined characters, you can print all
ISO characters not supported by ASCII by sending the appropriate
bitmaps. You will need a filter to convert ISO 8859-1 characters
to the appropriate bitmaps. (A good starting point would be recode.)
+ Add a filter to the printer driver which will strip the accent
characters and just print the unaccented characters. (This
character set is supported by recode under the name `flat' ASCII.)
+ Add a filter which will generate escape sequences (such as
" <BACKSPACE> a for Umlaut-a (ä), etc.) to be printed. Recode
supports this encoding under the name `ascii-bs'.

Footnote: For more information on character translation and the
'recode' tool, see section 5.

7. TeX and ISO 8859-1
If you want to write TeX without having to type {\"a}-style escape
sequences, you can either get a TeX version configured to read 8-bit
ISO characters, or you can translate between ISO and TeX codings.

The latter is arduous if done by hand, but can be automated if you use
emacs. If you use Emacs 19.23 or higher, simply add the following line
to your .emacs startup file. This mode will perform the necessary
translations for you automatically:
------------------
(require 'iso-cvt)
------------------

If you are using pre-19.23 versions of emacs, get the "gm-lingo.el"
lisp file via URL ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit. Load
gm-lingo from your .emacs startup file and this mode will perform the
necessary translations for you automatically.

If you want to configure TeX to read 8 bit characters, check out the
configuration files available in URL
ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit.

In LaTeX 2.09 (or earlier), use the isolatin or isolatin1 styles to
include support for ISO latin1 characters. Use the following
documentstyle definition:
\documentstyle[isolatin]{article}

isolatin.sty and isolatin1 are available from all CTAN servers and
from URL ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit. (The isolatin1
version on vlsivie is more complete than the one on CTAN servers.)

There are several possibilities in LaTeX 2e to provide comprehensive
support for 8 bit characters:

The preferred method is to use the inputenc package with the latin1
option. Use the following package invocation to achieve this:
\usepackage[latin1]{inputenc}

The inputenc package should be the first package to be included in the
document. For a more detailed discussion, check out URL
ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit/latex2e.ps (in German).

Alternatively, the styles used for earlier LaTeX versions (see above)
can also be used with 2e. To do this, use the commands:
\documentclass{article}
\usepackage{isolatin}


You can also get the latex-mode to handle opening and closing quotes
correctly for your language. This can be achieved by defining the
emacs variables 'tex-open-quote' and 'tex-closing-quote'. You can
either set these variables in your ~/.emacs startup file or as a
buffer-local variable in your TeX file if you want to define quotes on
a per-file basis.

For German TeX quotes, use:
-----------
(setq tex-open-quote "\"`")
(setq tex-closing-quote "'\"")
-----------

If you want to use French quotes (guillemets), use:
-----------
(setq tex-open-quote "«")
(setq tex-closing-quote "»")
-----------

Bibtex has some problems with 8 bit characters, esp. when they are
used as keys. BibTeX 1.0, when it eventually comes out (most likely
some time in 1996), will support 8-bit characters.

8. ISO 8859-1 and emacs
Emacs 19 (as opposed to Emacs 18) can automatically handle 8 bit
characters. (If you have a choice, upgrade to Emacs version 19.23,
which has the most complete ISO support.) Emacs 19 has extensive
support for ISO 8859-1. If your display supports ISO 8859-1 encoded
characters, add the following line to your .emacs startup file:
-----------------------------
(standard-display-european t)
-----------------------------

If you want to display ISO-8859-1 encoded files by using TeX-like escape
sequences (e.g. if your terminal supports only ASCII characters), you
should add the following line to your .emacs file (DON'T DO THIS IF
YOUR TERMINAL SUPPORTS ISO OR SOME OTHER ENCODING OF NATIONAL
CHARACTERS):
--------------------
(require 'iso-ascii)
--------------------

If your terminal supports a non-ISO 8859-1 encoding of national
characters (e.g. 7 bit national variant ISO 646 character sets,
aka. `national ASCII' variants), you should configure your own display
table. The standard emacs distribution contains a configuration
(iso-swed.el) for terminals which have ASCII in the G0 set and a
Swedish/Finnish version of ISO 646 in the G1 set. If you want to
create your own display table configuration, take a look at this
sample configuration and at disp-table.el for available support
functions.


Emacs can also accept 8 bit ISO 8859-1 characters as input. These
character codes might either come from a national keyboard (and
driver) which generates ISO-compliant codes, or may have been entered
by use of a COMPOSE-character mechanism.
If you use such an input format, execute the following expression in
your .emacs startup file to enable Emacs to understand them:
-------------------------------------------------
(set-input-mode (car (current-input-mode))
(nth 1 (current-input-mode))
0)
-------------------------------------------------

In order to configure emacs to handle commands operating on words
properly (such as 'Beginning of word', etc.), you should also add the
following line to your .emacs startup file:
-------------------------------
(require 'iso-syntax)
-------------------------------

This lisp script will change character attributes such that ISO 8859-1
characters are recognized as such by emacs.


For further information on using ISO 8859-1 with emacs, also see the
Emacs manual section on "European Display" (available as hypertext
document by typing C-h i in emacs or as a printed version).


If you need to edit text in a non-European language (Arabic, Chinese,
Cyrillic-based languages, Ethiopic, Korean, Thai, Vietnamese, etc.),
MULE (URL ftp://etlport.etl.go.jp/pub/mule) is a Multilingual
Enhancement to GNU Emacs which supports these languages.

9. Typing ISO with US-style keyboards.
Many computer users use US-ASCII keyboards, which do not have keys for
national characters. You can use escape sequences to enter these
characters. For ASCII terminals (or PCs), check the documentation of
your terminal for particulars.


9.1 US-keyboards under X11
Under X Windows, the COMPOSE multi-language support key can be used to
enter accented characters. Thus, when running X11 on a SunOS-based
computer (or any other X11R4 or X11R5 server supporting COMPOSE
characters), you can type three character sequences such as
COMPOSE " a -> ä
COMPOSE s s -> ß
COMPOSE ` e -> è
to type accented characters.

Note that this COMPOSE capability has been removed as of X11R6,
because it does not adequately support all the languages in the world.
Instead, compose processing is supposed to be performed in the client
using an `input method', a mechanism which has been available since
X11R5. (In the short term, this is a step backward for European
users, as few clients support this type of processing at the moment.
It is unfortunate that the X Consortium did not implement a mechanism
which allows for a smoother transition. Even the xterm terminal
emulator supplied by the X Consortium itself does not yet support this
mechanism!)

Input methods are controlled by the locale environment variables (LANG
and LC_xxx). The values for these variables are equivalent (or at
least should be made equivalent by any sane vendor) to those expected
by the ANSI/POSIX locale library. For a list of possible settings see
section 3.

9.2 US-keyboards and emacs
9.2.1 Using ALT for composing national characters
There are several modes to enter Umlaut characters under emacs when
using a US-style keyboard. One such mode is iso-transl, which is
distributed with the standard emacs distribution. This mode uses the
Alt-key for entering diacritical marks (accents et al.).

To activate iso-transl mode, add the following line to your .emacs
setup file:
(require 'iso-transl)

As of emacs 19.29, Alt-sequences optimized for a particular language
are available. Use the following call in .emacs to select your
favorite keybindings:
(iso-transl-set-language "German")

If you do not have an Alt-key on your keyboard, you can use the C-x 8
prefix to access the same capabilities.

For pre-19.29 versions, similar functionality is available as the
extended iso-transl mode (iso-transl+), which allows the definition of
language-specific shortcuts; it is available as URL
ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit/iso-transl+.shar. This file
also includes sample configurations for the German and Spanish
languages.


9.2.2 Electric Accents
An alternative to using Alt-sequences for entering diacritical marks
is the use of `electric accents', such as used on old type writers or
under many MS Windows programs. With this method, typing an accent
character will place this accent on the next character entered. One
mode which supports this entry method is the iso-acc minor mode which
comes with the standard emacs distribution. Just add
------------------
(require 'iso-acc)
------------------
to your emacs startup script, and you can turn the '`~/^" keys into
electric accents by typing 'M-x iso-accents-mode' in a specific
buffer. To type the ç (c with cedilla) and ß (German scharfes s)
characters, type ~c and "s, respectively.

Footnote: When starting up under X11, Emacs looks for a Meta key, and
if it finds no Meta key, it will use the Alt key instead. The way to
solve this problem is to define a Meta key using the xmodmap utility
which comes with X11.

10. File names with ISO characters
If your OS is 8 bit clean, you can use ISO characters in file names.
(This is possible under SunOS.)

11. Command names with ISO 8859-1
If your OS supports file names with ISO characters, and your shell is
8 bit clean, you can use command names containing ISO characters. If
your shell does not handle ISO characters correctly, use one of the
many PD shells which do (e.g. tcsh, an extended csh). These are
available from a multitude of ftp sites around the world.

See section 14 on application specific information for a discussion of
various shells.

12. Spell checking
Ispell 3.1 has by far the best understanding of non-English
languages and can be configured to handle 8-bit characters
(Thus, it can handle ISO-8859-1 encoded files).

Ispell 3.1 now comes with hash tables for several languages (English,
German, French,...). It is available via URL ftp://ftp.cs.ucla.edu/pub.
Ispell also contains a list of international dictionaries and
information about their availability in the file ispell/languages/Where.

To choose a dictionary for ispell, use the `-d <dictionary>'
option. The `-T <input-encoding>' option should be set to `-T
latin1' if you want to use ISO 8859-1 as the input encoding.

If you use ispell inside emacs (using the ispell.el mode) to spell
check a buffer, you can choose language and input encoding either
using the `M-x ispell-change-dictionary' function, or by choosing the
`Spell' item in the `Edit' pull-down menu. This will present you with
a choice of dictionaries (cum input encodings): all languages are
listed twice, such as in `Deutsch' and `Deutsch8'. `Deutsch8' is the
setting which will use the German dictionary and the 8 bit ISO 8859-1
input encoding.

Alternatively, ispell.el lets you specify the dictionary to use for a
particular file at the end of that file by adding a line such as
----
Local IspellDict: castellano8
----

The following sites also have dictionaries for ispell available via
anonymous ftp:
language site file name
French ireq-robot.hydro.qc.ca /pub/ispell
French ftp.inria.fr /INRIA/Projects/algo/INDEX/iepelle
French ftp.inria.fr /gnu/ispell3.0-french.tar.gz
German ftp.vlsivie.tuwien.ac.at /pub/8bit/dicts/deutsch.tar.gz
Spanish ftp.eunet.es /pub/unix/text/TeX/spanish/ispell
Portuguese http://www.di.uminho.pt/~jj/pln/pln.html

Some spell checkers use strange encodings for accented characters. If
you have to use one of these spell checkers, you may have to run
recode before invoking the spell checker to generate a file using your
spell checker's coding conventions. After running the spell checker,
you have to translate the file back to ISO with recode.

Of course, this can be automated with a shell script which processes
each file in turn:
---------------------
for i in "$@"
do
recode <options to generate spell checker encoding from ISO> < "$i" > tmp.file
spell_check tmp.file
recode <options to generate ISO from spell checker encoding> < tmp.file > "$i"
done
---------------------

Footnote: Ispell 4.* is not a superset of ispell 3.*. Ispell 4.* was
developed independently from a common ancestor; it DOES NOT
support any internationalization and is restricted to the
English language.

13. TCP and ISO 8859-1
TCP was specified by US-Americans, for US-Americans. TCP still carries
this heritage: while the TCP/IP protocol itself *is* 8 bit clean, no
effort was made to support the transfer of non-English characters in
many application level protocols (mail, news, etc.). Some of these
protocols still only specify the transfer of 7-bit data, leaving
anything else implementation dependent.

Since the TCP/IP protocol itself transfers 8 bit data correctly,
writing applications based on TCP/IP does not lead to any loss of
encoding information.


13.1 FTP and ISO 8859-1
Transmitting data via FTP is an interesting issue, depending on what
system you use, how the relevant RFCs are interpreted, and what is
actually implemented.

If you transfer data between two hosts using the same ISO 8859-1
representation (such as two Unix hosts), the safest solution is to
specify 'binary' transmission mode.
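
With the standard command line ftp client, this is a matter of issuing
the 'binary' command before transferring (a typical session; the exact
server messages may differ):
----------------------------------
ftp> binary
200 Type set to I.
ftp> get somefile.txt
----------------------------------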

Note, however, that use of the binary mode for text files will disable
translation between the line-ending conventions of different operating
systems. You might have to provide some filter to convert between the
LF-only convention of Unix and the CR-LF convention of VMS and MS
Windows when you copy from one of these systems to another.

If the FTP server and client computers use different encodings, there
are two possible approaches:
* Transfer all data as binary data, then convert the format using a
conversion tool such as recode to translate the transferred data.
* Specify an ASCII connection, and have your FTP server and client
convert the encoding automatically.

While the first approach always works, it is somewhat cumbersome if
you transmit a lot of data. The second solution is much more
comfortable, but it relies on your client (and server) to take care of
the appropriate character translations. Since there is no universal
standard for network characters beyond ASCII (NVT-ASCII as specified
in RFC 854), this depends on the attitude of your software vendor.

Most Apple Macintosh network software is configured to treat all
network data as having ISO 8859-1 encoding and automatically
translates from and to the internal MacOS data representation. (This
can be problematic, if you want to send or receive text using the
Macintosh character set. The correct solution to this problem is
to use MIME.)

MS-DOS programs are much less well behaved, and you have to test
whether your particular FTP program performs conversion.

An additional issue with the automatic translation is how to translate
unavailable characters. If FTP is used to store and retrieve data,
the original file should be reconstructable after conversion. If
data is to be printed or processed, different encodings (e.g. graphic
approximation of characters) may be necessary. (See the section on
character set translation for a full discussion of encoding
transformations.)

A second, optional parameter is possible for 'type ascii' commands,
which specifies whether the data is for non-printing or printing
purposes. Ideally, FTP servers on non-8859-1 systems would use this
parameter to determine whether to use an invertible encoding or
graphical and/or logical approximation during translation. (Although
RFC 959, section 3.1.1.5 does not require this.)


13.2 Mail and ISO 8859-1
Most Internet eMail standards come from a time when the Internet was a
mostly-US phenomenon. Other countries did have access to the net, but
much of the communication was in English nevertheless. With the
propagation of the Internet, these standards have become a problem for
languages which cannot be represented in a 7 bit ISO 646 character
set.

Using ISO 646, which uses a slightly different character set for each
language, also poses a problem when crossing a language barrier, as
the interpretation of characters will change. As a result, most
countries use the ISO 646 standard commonly referred to as US-ASCII
and will use escape sequences such as 'e (é) or "a (ä) to refer to
national characters. The exceptions to this rule are the Nordic
countries (more so in Sweden and Finland, less so in Denmark and
Norway, I'm being told), where the national ISO 646 variant has
garnered a formidable following and is a common reference point for
all Nordic users.

There are several languages for which there are not enough
replacement characters to code all national variants (e.g. French).

Footnote:
Hence, French has not followed the Nordic track. The French
net-convention is e' instead of 'e ("l''el'ephant" is a strange
spelling), and many think that this is very ugly writing anyway and
drop the accents altogether, but this sometimes makes the text funny
or at least incorrect.


As this situation is clearly unsatisfactory, several methods of
sending mails encoded in national character sets have been developed.
We start with a discussion of the mail delivery infrastructure and
will then look at some high-level protocols which can protect mail
users and their messages from the shortcomings of the underlying mail
protocols.

Footnote: Many other email standards exist for proprietary systems.
If you use one of these mail systems, it is the responsibility of the
mail gateway to translate your messages to an appropriate Internet
mail message when you send a message to the Internet.


13.2.1 Mail Transfer Agents and the Internet Mail Infrastructure
The original mail transfer protocol specification (SMTP, RFC 821)
specified the transfer of only 7 bit messages. Many sendmail
implementations have been made 8 bit transparent (see RFC 1428), but
some SMTP handling agents still conform strictly to the
(somewhat outdated) RFC 821 and intentionally cut off the 8th bit.
This behavior stymies all efforts to transfer messages containing
national characters. Thus, only if all SMTP agents between mail
originator and mail recipient are 8 bit clean will messages be
transferred correctly. Otherwise, accented characters are mapped to
some ASCII character (e.g. Umlaut a -> 'd'), but the rest of the
message is still transferred correctly.

A new, enhanced (and compatible) SMTP standard, ESMTP, has been
released as RFC 1425. This standard defines and standardizes 8 bit
extensions. This should be the mail protocol of choice for newly
shipped versions of sendmail.

Much of the European and Latin American network infrastructure
supports the transfer of 8 bit mail messages, the success rate is
somewhat lower for the US.

DEC Ultrix sendmail still implements the somewhat outdated RFC 821 to
the letter, and thus cuts off the eighth bit of all mail passing
through it. ISO encoded mail will therefore always lose its accent
marks when transferred through a DEC host.

If your computer is running DEC Ultrix and you want it to handle 8 bit
characters properly, you can get the source for a more recent version
of sendmail via ftp (see section 14.9). OR, you can simply
call DEC, complain that their standard mail system cannot handle
international 8 bit mail, encourage them to implement 8 bit
transparent SMTP, or (even better) ESMTP, and ask for the sendmail
patch which makes their current sendmail 8 bit transparent.
(Reportedly, such a patch is available from DEC for those who ask.)
In the meantime, an 8 bit transparent sendmail MIPS binary for Ultrix
is available as URL
ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit/mips.sendmail.8bit.

If you want to change MTAs, the popular smail PD-MTA is also 8 bit
clean.


13.2.2 High-level protocols
In the Good Old Days, messages were 7-bit US-ASCII only. When users
wanted to transfer 8 bit data (binaries or compressed files, for
example), it was their responsibility to translate them to a 7 bit
form which could be sent. At the other end, the recipient had to
unpack the data using the same protocol. The encoding mechanism
commonly used for this purpose is uuencode/uudecode.

Today, a standard, MIME (Multi-purpose Internet Mail
Extensions), exists which automatically packs and unpacks data as
required. This standard can take advantage of different underlying
protocol capabilities and automatically transform messages to
guarantee delivery. This standard can also be used to include
multimedia data types in your mail messages.

The MIME standard defines a mail transfer protocol which can handle
different character sets and multimedia mail, independent of the
network infrastructure. This protocol should eventually solve
problems with 7-bit mailers etc. Unfortunately, no mail transfer
agents (mail routers) and few end user mail readers support this
standard. Source for supporting MIME (the `metamail' package) in
various mail readers is available in URL
ftp://thumper.bellcore.com/pub/nsb. MIME is specified in RFC 1521 and
RFC 1522 which are available from ftp.uu.net. There is also a MIME
FAQ which is available as URL
ftp://ftp.ics.uci.edu/mh/contrib/multimedia/mime-faq.txt.gz. (This
file is in compressed format. You will need the GNU gunzip program to
decompress this file.)
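
For illustration, a mail message carrying ISO 8859-1 text as 8 bit
data is marked with header lines like the following (the header names
and values are those defined in RFC 1521):
-------------------------------------------------------
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
-------------------------------------------------------
A MIME aware mail reader uses this information to select the display
character set or to recode the message for the local environment.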

PS: Newer versions of sendmail support ESMTP negotiation and can pass
8 bit data. However, they do not (yet?) support downgrading of 8 bit
MIME messages.
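
The ESMTP negotiation mentioned above is visible at the start of the
SMTP dialogue: the client sends EHLO instead of HELO, and the server
advertises the 8BITMIME extension (defined in RFC 1426). A sketch of
such an exchange (the host names are made up):
-------------------------------------------------------
220 mail.example.com ESMTP ready
EHLO client.example.com
250-mail.example.com
250 8BITMIME
MAIL FROM:<user@client.example.com> BODY=8BITMIME
250 OK
-------------------------------------------------------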


13.3 News and ISO 8859-1
As with mail, the Usenet news protocol specification is 7 bit based,
but much of the infrastructure has been upgraded to 8 bit service. Thus,
accented characters are transferred correctly between much of Europe
(and Latin America).

ISO 8859-1 is _the_ standard for typing accented characters in most
newsgroups (this may be different for MS-DOS centered newsgroups ;-),
and it is the preferred encoding in most European newsgroup
hierarchies, such as at.* or de.*.

For those who speak French, there is an excellent FAQ on using ISO
8859-1 coded characters on Usenet by François Yergeau (URL
ftp://ftp.ulaval.ca/contrib/yergeau/faq-accents). This FAQ is
regularly posted in soc.culture.french and other relevant newsgroups.


13.4 WWW (and other information servers)
The WWW protocol can transfer 8 bit data without any problems, and
you can serve ISO-8859-1 encoded data from your server. Whether the
data is displayed correctly depends on the user's client. xmosaic
(freely available from the NCSA for most UNIX platforms) uses an
ISO-8859-1 compliant font by default and will display the data
correctly.
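
A WWW server can announce the character set of a document explicitly
through the charset parameter of the Content-Type header; a minimal
sketch of one HTTP response header line:
-------------------------------------------------------
Content-Type: text/html; charset=ISO-8859-1
-------------------------------------------------------
In the absence of this parameter, ISO-8859-1 is the documented
default for HTTP.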


13.5 rlogin
For rlogin to pass 8 bit data correctly, invoke it with 'rlogin -8' or
'rlogin -L'.

14. Some applications and ISO 8859-1
14.1 bash
You need version 1.13 or higher, and the locale must be set correctly
(see section 3). Also, to configure the `readline' input function of
bash to handle 8 bit characters correctly, you have to set some
variables in the readline startup file ~/.inputrc:
-------------------------------------------------------
set meta-flag On
set convert-meta Off
set output-meta On
-------------------------------------------------------

Before bash version 1.13, bash used the eighth bit of characters to
mark whether or not they were quoted when performing word expansions.
While this was not a problem in a 7-bit US-ASCII environment, this was
a major restriction for users working in a non-English environment.

These readline variables have the following meaning (and default
values):
meta-flag (Off)
If set to On, readline will enable eight-bit input
(that is, it will not strip the high bit from the char-
acters it reads), regardless of what the terminal
claims it can support.
convert-meta (On)
If set to On, readline will convert characters with the
eighth bit set to an ASCII key sequence by stripping
the eighth bit and prepending an escape character (in
effect, using escape as the meta prefix).
output-meta (Off)
If set to On, readline will display characters with the
eighth bit set directly rather than as a meta-prefixed
escape sequence.

Bash is available from prep.ai.mit.edu in /pub/gnu.


14.2 elm
Elm automatically supports the handling of national character sets,
provided the environment is configured correctly. If you configure
elm without MIME support, you can receive, display, enter and send 8
bit ISO 8859-1 messages (if your environment supports this character
set).

When you compile elm with MIME support, you have two options:
* you can compile elm to use 8 bit ISO-8859-1 as transport encoding:
If you use this encoding even people without MIME compliant mailers
will be able to read your mail messages, if they use the same
character set. The eighth bit may, however, be cut off by 7 bit MTAs
(mail transfer agents), and mutilated mail might be received by the
recipient, regardless of whether she uses MIME or not. (This
problem should be eased when 8 bit mailers are upgraded to
understand how to translate 8 bit mails to 7 bit encodings when they
encounter a 7 bit mailer.)

* you can compile elm to use 7 bit US-ASCII `quoted printable' as
transport encoding:
this encoding ensures that you can transfer your mail containing
national characters without having to worry about 7 bit MTAs. A
MIME compliant mail reader at the other end will translate your
message back to your national character set. Recipients without
MIME compliant mail readers will however see mutilated messages:
national characters will have been replaced by sequences of the type
'=FF' (with FF being the ISO code (in hexadecimal) of the national
character being encoded).
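
For example, the German word 'für' (u with diaeresis, ISO code 0xFC)
would appear in a quoted printable encoded message as:
-----------------
f=FCr
-----------------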


14.3 GNUS
GNUS is a newsreader based on emacs. It is 8 bit transparent and
contains all national character support available in emacs 19.


14.4 less
Versions 237 and later automatically display Latin-1 characters if
your locale is configured correctly.

If your OS does not support the locale mechanism, or if you use a
version of less older than 237, set the LESSCHARSET environment
variable with 'setenv LESSCHARSET latin1'.

14.5 metamail
To configure the metamail package for ISO 8859-1 input/output, set the
MM_CHARSET environment variable with 'setenv MM_CHARSET ISO-8859-1'.
Also, set the MM_AUXCHARSETS variable with 'setenv MM_AUXCHARSETS
iso-8859-1'.


14.6 nn
Add the line
-----------------
set data-bits 8
-----------------
to your ~/.nn/init (or the global configuration file) in order for nn
to be able to process 8 bit characters.


14.7 nroff
The GNU replacement for nroff, groff, has an option to generate ISO
8859-1 coded output, instead of plain ASCII. Thus, you can preview
nroff documents with correctly displayed accented characters. Invoke
groff with the 'groff -Tlatin1' option to achieve this.
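
For example, to preview a manual page with correctly displayed
accented characters (assuming less is set up as described in section
14.4, and a hypothetical page foo.1):
-----------------
groff -man -Tlatin1 foo.1 | less
-----------------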

Groff is free software. It is available from URL
ftp://prep.ai.mit.edu/pub/gnu and many other GNU archives around the
world.


14.8 pgp
PGP (Phil Zimmermann's Pretty Good Privacy) uses Latin1 as the
canonical form for transmitting encrypted data. Your host computer's
local character set should be configured in the configuration file
${PGPPATH}/config.txt by setting the CHARSET parameter. If you are
using ISO 8859-1 as your native character set, CHARSET should be set
to LATIN1; on MS-DOS computers with code page 850, set 'CHARSET =
CP850'. This will make PGP automatically translate all encrypted
texts from/to the LATIN1 canonical form. A setting of 'CHARSET =
NOCONV' can be used to inhibit all translations.
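
For example, on a UNIX host using ISO 8859-1 natively,
${PGPPATH}/config.txt would contain a line like the following (the
'#' comment convention is that of the PGP configuration file):
-----------------
# local character set; PGP converts to/from the LATIN1 canonical form
CHARSET = LATIN1
-----------------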

When PGP is used to code Cyrillic text, KOI8 is regarded as canonical
form (use 'CHARSET = KOI8'). If you use the ALT_CODES encoding for
Cyrillic (popular on PCs), set 'CHARSET = ALT_CODES' and it will
automatically be converted to KOI8.

Footnote: Note that PGP treats KOI8 as LATIN1, even though it is a
completely different character set (Russian), because trying to
convert KOI8 to either LATIN1 or CP850 would be futile anyway.


14.* samba
To make samba work with ISO 8859-1, use the following line in the
[global] section:
valid chars = 0xa0 0xa1 0xa2 0xa3 0xa4 0xa5 0xa6 0xa7 0xa8 0xa9 0xaa 0xab 0xac 0xad 0xae 0xaf 0xb0 0xb1 0xb2 0xb3 0xb4 0xb5 0xb6 0xb7 0xb8 0xb9 0xba 0xbb 0xbc 0xbd 0xbe 0xbf 0xc0:0xe0 0xc1:0xe1 0xc2:0xe2 0xc3:0xe3 0xc4:0xe4 0xc5:0xe5 0xc6:0xe6 0xc7:0xe7 0xc8:0xe8 0xc9:0xe9 0xca:0xea 0xcb:0xeb 0xcc:0xec 0xcd:0xed 0xce:0xee 0xcf:0xef 0xd0:0xf0 0xd1:0xf1 0xd2:0xf2 0xd3:0xf3 0xd4:0xf4 0xd5:0xf5 0xd6:0xf6 0xd7 0xf7 0xd8:0xf8 0xd9:0xf9 0xda:0xfa 0xdb:0xfb 0xdc:0xfc 0xdd:0xfd 0xde:0xfe 0xdf 0xff


14.9 sendmail
BSD Sendmail Version 8 has a True/False flag in its configuration
file which determines whether v8 passes any 8-bit data it encounters
unchanged (to match the behavior of other 8-bit transparent MTAs and
to meet the needs of non-ASCII users), or whether it strips all data
to 7 bits to strictly conform to SMTP. The source code for an 8 bit
clean sendmail is available in URL ftp://ftp.cs.berkeley.edu/ucb/sendmail.
A pre-compiled binary for DEC MIPS systems running Ultrix is available
as URL ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit/mips.sendmail.8bit.


14.10 tcsh
You need version 6.04 or higher, and your locale has to be set
properly (see section 3). Tcsh also needs to be compiled with the
national language support feature, see the config.h file in the tcsh
source directory. Tcsh is an extended csh and is available in URL
ftp://ftp.deshaw.com/pub/tcsh

If tcsh has been configured correctly, it will allow national
characters in environment variables, shell variables, file names,
etc., e.g.:
-------------------------------------------------------
set BenötigteDateien=/etc/rc
cat $BenötigteDateien > /dev/null
-------------------------------------------------------


14.11 vi
Support for 8 bit character sets depends on the OS: it works under
SunOS 4.1.*, but on OSF/1 vi gets confused about the current cursor
position in the presence of 8 bit characters. Some versions of vi
require an 8 bit locale to work with 8 bit characters.


All major replacements for vi seem to support 8 bit characters:

14.11.1 vile ('VI Like Emacs')
Vile (by Paul Fox) can be told that the usual range of 8th-bit
characters is printable with "set printing-low 160" and "set
printing-high 255". By either executing these commands in vile or by
placing them in ~/.exrc, vile will not use the usual octal or hex
expansion for these characters. vile is available from
ftp://id.wing.net/pub/pgf/vile.


By default, 8 bit characters are printed either in hex (the default)
or in octal ("set unprintable-as-octal"); they appear as "\xC7" or
"\307" on your screen.

vile was the first vi rewrite to provide multi-window/multi-buffer
operation. Since it was derived from micro-emacs, it retains fully
rebindable keys and a built-in macro language. The current version is
5.2, and it is pretty mature (5 years old). There is also an X-aware
version which makes full use of the mouse, with scrollbars, etc.
Initialization commands go in the ~/.vilerc file. vile does not
require any locale settings; its 8 bit support is fairly primitive
and is configured entirely through the settings described below,
which are taken from vile's Help file.

vile allows input, manipulation, and display of all 256 possible
byte-wide characters. (Double-wide characters are not supported.)

Output: by default, characters with the high bit set (decimal value
128 or greater) will display as hex (or octal; see
"unprintable-as-octal" below) sequences, e.g. \xA5. A range of
characters which should display as themselves (that is, characters
understood by the user's display terminal) may be given using the
"printing-low" and "printing-high" settings. Useful values for these
settings are 160 and 255, which correspond to the printable range of
the ISO Latin-1 character set.

Input: if the user's input device can generate all characters, and if
the terminal settings are such that these characters pass through
unmolested ("stty cs8 -parenb -istrip" works on an xterm; real serial
lines may take more convincing, at both ends), then vile will happily
incorporate them into the user's text, or act on them if they are
bound to functions. Users who have no need to enter 8-bit text may
want access to the meta-bound functions while in insert mode as well
as command mode. The mode "meta-insert-bindings" controls whether
functions bound to meta keys (characters with the high bit set) are
executed only in command mode, or in both command and insert modes.
In either case, if a character is _not_ bound to a function, then it
will be self-inserting when in insert mode. (To bind to a meta key in
the .vilerc file, one may specify it as itself, or in hex or octal,
or with the shorthand 'M-c', where c is the corresponding character
without the high bit set.)

These are the settable modes which affect 8-bit operation:

meta-insert-bindings (mib) Controls behavior of 8-bit characters
	during insert. Normally, key-bindings are only operational
	when in command mode: when in insert mode, all characters
	are self-inserting. If this mode is on, and a meta-character
	is typed which is bound to a function, then that function
	binding will be honored and executed from within insert
	mode. Any unbound meta-characters will remain self-inserting.
	(B)

printing-low The integer value representing the first of the
	printable set of "high bit" (i.e. 8-bit) characters.
	Defaults to 0. Users of ISO 8859-1 should set this to 160,
	the first printable character in the upper range of the
	character set. (U)

printing-high The integer value representing the last character of
	the printable set of "high bit" (i.e. 8-bit) characters.
	Defaults to 0. Set this to 255 for ISO 8859-1
	compatibility. (U)

unprintable-as-octal (uo) If an 8-bit character is non-printing, it
	will normally be displayed in hex. This setting will force
	octal display. Non-printing characters whose 8th bit is
	not set are always displayed in control character (e.g. '^C')
	notation. (B)
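
Putting this together, a minimal ~/.vilerc for ISO 8859-1 use could
contain:
-----------------
set printing-low 160
set printing-high 255
-----------------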

14.11.2 vim
vim was developed on an Amiga in Europe, and supports a mechanism
similar to vile. 'vim' supports input digraphs for entering 8-bit
chars, the output convention is similar to vile -- raw or nothing.

Details are unkonwn. (If you know more about vim,
please let me know. A request to comp.editors should yield additional
information.)

14.11.3 nvi
A recent vi rewrite which should also support 8 bit characters.
(Keith Bostic (bos...@cs.berkeley.edu) is the author and should know
more about nvi.)

15. Terminals
15.1 X11 Terminal Emulators
See section 4 on X11 for bug fixes for X11 clients.

15.1.1 xterm
If you are using X11 and xterm as your terminal emulator, you should
place the following line in ~/.Xdefaults (this seems to be required in
some releases of X11, not in all):
-------------------------
XTerm*EightBitInput: True
-------------------------

15.1.2 rxvt
rxvt is another terminal emulator used for X11, mostly under
Linux. Invoke rxvt with the 'rxvt -8' command line.


15.2 VT2xx, VT3xx
The character encoding used in VT2xx terminals is a preliminary
version of the ISO-8859-1 standard (DEC MCS), so some characters (the
more obscure ones) differ slightly. However, these terminals can be
used with ISO 8859-1 characters without problems.

The newer VT3xx terminals use the official ISO 8859-1 standard.

The international versions of the VT[23]xx terminals have a COMPOSE
key which can be used to enter accented characters, e.g.
<COMPOSE><e><'> will give an e with acute accent (é).


15.3 Various UNIX terminals
Some terminals support down-loadable fonts. If characters sent to
these terminals can be 8 bits wide, you can down-load your own ISO
character set. To see how this can be achieved, take a look at the
directory /pub/culture/russian/comp/cyril-term on nic.funet.fi.


15.4 MS-DOS PCs
MS-DOS PCs normally use a different encoding for accented characters,
so there are two options:

* you can use a terminal emulator which will translate between the
different encodings. If you use the PROCOMM PLUS, TELEMATE or
TELIX modem programs, you can down-load the translation tables
from URL ftp://oak.oakland.edu/pub/msdos/modem/xlate.zip. (You need
to install CP850 for this to work.)

* you can reconfigure your MS-DOS PC to use an ISO-8859-1 code page.
Either install IBM code page 819 (see section 19), or you can get
the free ISO 8859-X support files from the anonymous ftp archive
ftp://ftp.uni-erlangen.de/pub/doc/ISO/charsets, which contains data
on how to do this (and other ISO-related stuff). The README file
contains an index of the files you need.
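
As an illustration, switching a plain MS-DOS system to code page 850
(the usual prerequisite for the translation tables mentioned above)
is done with lines of the following kind; paths, the display adapter
type and the country code (049 selects German conventions) may differ
on your system, and installing code page 819 works analogously once
you have obtained a .CPI file containing it:
-------------------------------------------------------
(in CONFIG.SYS)
COUNTRY=049,850,C:\DOS\COUNTRY.SYS
DEVICE=C:\DOS\DISPLAY.SYS CON=(EGA,,1)

(in AUTOEXEC.BAT)
NLSFUNC C:\DOS\COUNTRY.SYS
MODE CON CODEPAGE PREPARE=((850) C:\DOS\EGA.CPI)
MODE CON CODEPAGE SELECT=850
-------------------------------------------------------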

Note that many terminal emulators for PCs strip the 8th bit when in
text transmission mode. If you are using such a program to dial up
a computer, you may have to configure your terminal program to
transmit all 8 bits.


16. Programming applications which support the use of ISO 8859-1
For information on how to write applications with support for
localization (to the ISO 8859-1 and other character representations)
check out URL ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit/ISO-programming.
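
As a very small taste of what is involved, the first step in a
localized ANSI C program is to select the locale configured in the
user's environment (instead of the default "C" locale) and to use the
<ctype.h> classification functions instead of hard-coded ASCII
ranges. A minimal sketch (standard C only, no vendor extensions
assumed):
-------------------------------------------------------
#include <stdio.h>
#include <ctype.h>
#include <locale.h>

int main(void)
{
    int c;
    long printable = 0;

    /* select the locale given by the LANG/LC_* environment variables */
    setlocale(LC_ALL, "");

    /* count printable characters on stdin; in an ISO 8859-1 locale,
       accented letters are classified as printable, too */
    while ((c = getchar()) != EOF)
        if (isprint((unsigned char)c))
            printable++;

    printf("%ld printable characters\n", printable);
    return 0;
}
-------------------------------------------------------
Without the setlocale() call, isprint() would classify only the ASCII
range 0x20-0x7e as printable on most systems.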

17. Other relevant i18n FAQs
This is a list of other FAQs on the net which might be of interest.
Topic Newsgroup(s) Comments
Nordic graphemes soc.culture.nordic interesting stuff about
handling nordic letters
accents sur Usenet soc.culture.french,... Accents on Usenet (French)
+ more
Programming for I18N comp.unix.questions,... see section 16.
International fonts ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit/ISO-fonts
Discusses international fonts
and where to find them
I18N on WWW http://www.vlsivie.tuwien.ac.at/mike/i18n.html
German-HowTo for Linux ftp://ftp.univie.ac.at/systems/linux/sunsite/docs/HOWTO/German-HOWTO

Using 8 bit characters ftp://ftp.ulg.ac.be/pub/docs/iso8859/* (1)

Much charactersets info ftp://kermit.columbia.edu/kermit/charsets/
http://www.columbia.edu/kermit/ (2)

(1) written to "convey" the problem to the ASCII programmer, hence
more theoretical background.
(2) Kermit is second to none (in time and quality) for character sets
support and deserves a pointer in this FAQ.

18. Operating Systems and ISO 8859-1
18.1 UNIX
Most Unix implementations use the ISO 8859-1 character set, or at
least have an option to use it. Some systems may also support other
encodings, e.g. Roman8 (HP/UX), DEC MCS (DEC Ultrix, see the section
on VMS), etc.


18.2 NeXTSTEP
NeXTSTEP uses a proprietary character set.


18.3 MS DOS
IBM code page 819 _is_ ISO 8859-1. Code Page 850 has the same
characters as ISO 8859-1, BUT the characters are in different
locations (i.e., you can translate 1-to-1, but you do have to
translate the characters.)


18.4 MS-Windows
Microsoft Windows uses an ISO 8859-1 compatible character set (Code
Page 1252), as delivered in the US, Europe (except Eastern Europe) and
Latin America. In Windows 3.1, Microsoft has added additional characters
in the 0x80-0x9F range.


18.5 DEC VMS
DEC VMS uses the DEC MCS character set, which is practically
equivalent to ISO 8859-1 (it is a former ISO 8859-1 draft standard).
The only characters which differ between DEC MCS and ISO 8859-1 are
the Icelandic characters (eth and thorn) at locations 0xD0, 0xF0, 0xDE
and 0xFE.


19. Table of ISO 8859-1 Characters
This section gives an overview of the ISO 8859-1 character set. The
ISO 8859-1 character set consists of the following four blocks:

00 1F CONTROL CHARACTERS
20 7E BASIC LATIN
80 9F EXTENDED CONTROL CHARACTERS
A0 FF LATIN-1 SUPPLEMENT

The control characters and basic latin blocks are similar to those
used in the US national variant of ISO 646 (US-ASCII), so they are not
listed here. Nor is the second block of control characters listed,
for which no functions have yet been defined.

+----+-----+---+------------------------------------------------------
|Hex | Dec |Chr| Description ISO/IEC 10646-1:1993(E)
+----+-----+---+------------------------------------------------------
| | | |
| A0 | 160 | | NO-BREAK SPACE
| A1 | 161 | ¡ | INVERTED EXCLAMATION MARK
| A2 | 162 | ¢ | CENT SIGN
| A3 | 163 | £ | POUND SIGN
| A4 | 164 | ¤ | CURRENCY SIGN
| A5 | 165 | ¥ | YEN SIGN
| A6 | 166 | ¦ | BROKEN BAR
| A7 | 167 | § | SECTION SIGN
| A8 | 168 | ¨ | DIAERESIS
| A9 | 169 | © | COPYRIGHT SIGN
| AA | 170 | ª | FEMININE ORDINAL INDICATOR
| AB | 171 | « | LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
| AC | 172 | ¬ | NOT SIGN
| AD | 173 | ­ | SOFT HYPHEN
| AE | 174 | ® | REGISTERED SIGN
| AF | 175 | ¯ | MACRON
| | | |
| B0 | 176 | ° | DEGREE SIGN
| B1 | 177 | ± | PLUS-MINUS SIGN
| B2 | 178 | ² | SUPERSCRIPT TWO
| B3 | 179 | ³ | SUPERSCRIPT THREE
| B4 | 180 | ´ | ACUTE ACCENT
| B5 | 181 | µ | MICRO SIGN
| B6 | 182 | ¶ | PILCROW SIGN
| B7 | 183 | · | MIDDLE DOT
| B8 | 184 | ¸ | CEDILLA
| B9 | 185 | ¹ | SUPERSCRIPT ONE
| BA | 186 | º | MASCULINE ORDINAL INDICATOR
| BB | 187 | » | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
| BC | 188 | ¼ | VULGAR FRACTION ONE QUARTER
| BD | 189 | ½ | VULGAR FRACTION ONE HALF
| BE | 190 | ¾ | VULGAR FRACTION THREE QUARTERS
| BF | 191 | ¿ | INVERTED QUESTION MARK
| | | |
| C0 | 192 | À | LATIN CAPITAL LETTER A WITH GRAVE ACCENT
| C1 | 193 | Á | LATIN CAPITAL LETTER A WITH ACUTE ACCENT
| C2 | 194 | Â | LATIN CAPITAL LETTER A WITH CIRCUMFLEX ACCENT
| C3 | 195 | Ã | LATIN CAPITAL LETTER A WITH TILDE
| C4 | 196 | Ä | LATIN CAPITAL LETTER A WITH DIAERESIS
| C5 | 197 | Å | LATIN CAPITAL LETTER A WITH RING ABOVE
| C6 | 198 | Æ | LATIN CAPITAL LIGATURE AE
| C7 | 199 | Ç | LATIN CAPITAL LETTER C WITH CEDILLA
| C8 | 200 | È | LATIN CAPITAL LETTER E WITH GRAVE ACCENT
| C9 | 201 | É | LATIN CAPITAL LETTER E WITH ACUTE ACCENT
| CA | 202 | Ê | LATIN CAPITAL LETTER E WITH CIRCUMFLEX ACCENT
| CB | 203 | Ë | LATIN CAPITAL LETTER E WITH DIAERESIS
| CC | 204 | Ì | LATIN CAPITAL LETTER I WITH GRAVE ACCENT
| CD | 205 | Í | LATIN CAPITAL LETTER I WITH ACUTE ACCENT
| CE | 206 | Î | LATIN CAPITAL LETTER I WITH CIRCUMFLEX ACCENT
| CF | 207 | Ï | LATIN CAPITAL LETTER I WITH DIAERESIS
| | | |
| D0 | 208 | Ð | LATIN CAPITAL LETTER ETH
| D1 | 209 | Ñ | LATIN CAPITAL LETTER N WITH TILDE
| D2 | 210 | Ò | LATIN CAPITAL LETTER O WITH GRAVE ACCENT
| D3 | 211 | Ó | LATIN CAPITAL LETTER O WITH ACUTE ACCENT
| D4 | 212 | Ô | LATIN CAPITAL LETTER O WITH CIRCUMFLEX ACCENT
| D5 | 213 | Õ | LATIN CAPITAL LETTER O WITH TILDE
| D6 | 214 | Ö | LATIN CAPITAL LETTER O WITH DIAERESIS
| D7 | 215 | × | MULTIPLICATION SIGN
| D8 | 216 | Ø | LATIN CAPITAL LETTER O WITH STROKE
| D9 | 217 | Ù | LATIN CAPITAL LETTER U WITH GRAVE ACCENT
| DA | 218 | Ú | LATIN CAPITAL LETTER U WITH ACUTE ACCENT
| DB | 219 | Û | LATIN CAPITAL LETTER U WITH CIRCUMFLEX ACCENT
| DC | 220 | Ü | LATIN CAPITAL LETTER U WITH DIAERESIS
| DD | 221 | Ý | LATIN CAPITAL LETTER Y WITH ACUTE ACCENT
| DE | 222 | Þ | LATIN CAPITAL LETTER THORN
| DF | 223 | ß | LATIN SMALL LETTER SHARP S
| | | |
| E0 | 224 | à | LATIN SMALL LETTER A WITH GRAVE ACCENT
| E1 | 225 | á | LATIN SMALL LETTER A WITH ACUTE ACCENT
| E2 | 226 | â | LATIN SMALL LETTER A WITH CIRCUMFLEX ACCENT
| E3 | 227 | ã | LATIN SMALL LETTER A WITH TILDE
| E4 | 228 | ä | LATIN SMALL LETTER A WITH DIAERESIS
| E5 | 229 | å | LATIN SMALL LETTER A WITH RING ABOVE
| E6 | 230 | æ | LATIN SMALL LIGATURE AE
| E7 | 231 | ç | LATIN SMALL LETTER C WITH CEDILLA
| E8 | 232 | è | LATIN SMALL LETTER E WITH GRAVE ACCENT
| E9 | 233 | é | LATIN SMALL LETTER E WITH ACUTE ACCENT
| EA | 234 | ê | LATIN SMALL LETTER E WITH CIRCUMFLEX ACCENT
| EB | 235 | ë | LATIN SMALL LETTER E WITH DIAERESIS
| EC | 236 | ì | LATIN SMALL LETTER I WITH GRAVE ACCENT
| ED | 237 | í | LATIN SMALL LETTER I WITH ACUTE ACCENT
| EE | 238 | î | LATIN SMALL LETTER I WITH CIRCUMFLEX ACCENT
| EF | 239 | ï | LATIN SMALL LETTER I WITH DIAERESIS
| | | |
| F0 | 240 | ð | LATIN SMALL LETTER ETH
| F1 | 241 | ñ | LATIN SMALL LETTER N WITH TILDE
| F2 | 242 | ò | LATIN SMALL LETTER O WITH GRAVE ACCENT
| F3 | 243 | ó | LATIN SMALL LETTER O WITH ACUTE ACCENT
| F4 | 244 | ô | LATIN SMALL LETTER O WITH CIRCUMFLEX ACCENT
| F5 | 245 | õ | LATIN SMALL LETTER O WITH TILDE
| F6 | 246 | ö | LATIN SMALL LETTER O WITH DIAERESIS
| F7 | 247 | ÷ | DIVISION SIGN
| F8 | 248 | ø | LATIN SMALL LETTER O WITH STROKE
| F9 | 249 | ù | LATIN SMALL LETTER U WITH GRAVE ACCENT
| FA | 250 | ú | LATIN SMALL LETTER U WITH ACUTE ACCENT
| FB | 251 | û | LATIN SMALL LETTER U WITH CIRCUMFLEX ACCENT
| FC | 252 | ü | LATIN SMALL LETTER U WITH DIAERESIS
| FD | 253 | ý | LATIN SMALL LETTER Y WITH ACUTE ACCENT
| FE | 254 | þ | LATIN SMALL LETTER THORN
| FF | 255 | ÿ | LATIN SMALL LETTER Y WITH DIAERESIS
+----+-----+---+------------------------------------------------------

Footnote: ISO 10646 calls Æ a `ligature', but this is a
letter in (at least some) Scandinavian languages. Thus, it
is not in the same, merely typographic `ligature' class as
`oe' ({\oe} in {\LaTeX} convention) which was not included
in the ISO 8859-1 standard.

***Tentative info***
Supposedly the Danish press, some months ago, reported that ISO has
changed the standard so from now on æ and Æ are classified as
letters.

If you can confirm or deny this, please let me know...
***Tentative info***

20. History
In April 1965, the ECMA (European Computer Manufacturer's Association)
standardized ECMA-6. This character set is also (and more commonly)
known under the names ISO 646, US-ASCII or DIN 66003.

However, this standard only contained the basic Latin alphabet, with
no provisions for national characters in use all across Europe. These
characters were later added by replacing several special characters
from the US-ASCII alphabet (such as {[|]}\ etc.). These variants were
local to each country and were called `national ISO 646 variants'.
Portability from one country to another was low, as each country had
its own national variant, and some of the replaced special characters
were still needed (e.g. for programming in C), which made this an
altogether unsatisfactory solution.

In 1981, IBM released the IBM PC with an 8 bit character set, code
page 437. The order of the characters added was somewhat confusing,
to say the least. However, in 1982 the first hardware using a more
satisfactory character set was released: the DEC VT220 and VT240
terminals with the DEC MCS (Multinational Character Set).

This character set was very similar to ISO 6937/2, which is
essentially equivalent to today's ISO 8859-1. In March 1985, ECMA
standardized ECMA-94, which later came to be known as ISO 8859-1
through 8859-4. However, ISO 8859-1 was officially standardized by ISO
only in 1987.

1987 also saw the release of MS-DOS 3.3 which used Code Page 850.
Code Page 850 contains all characters from ISO 8859-1, making a
loss-free conversion possible. Code Page 819 which was released later
goes one step further, as it is fully ISO 8859-1 compliant.

The ISO 8859-X standard was designed to allow as much interoperability
between character sets as possible. Thus, all ISO 8859-X character
sets are a superset of US-ASCII and all character sets will render
English text properly. Also, there is considerable overlap between
several character sets: a text written in German using the ISO 8859-1
character set can be correctly rendered in ISO 8859-2, the Eastern
European character set, where German is the primary foreign language
(-3, -4, -9, -10 supposedly also can display German text without
changes).

While ISO 8859-X was designed for considerable portability, texts are
still restricted mostly to their character set and portability to
other cultural areas is a problem. One solution is to use a
meta-protocol (such as -> MIME) which specifies the character set
which was used to write a text and which causes the correct character
set to be used in displaying text.

A different approach to overloading the character set, as done in the
ISO 8859-X standard (where the locations 0xa0 to 0xff are used to
encode national characters), is to use wider characters. This is the
approach employed in Unicode (which is an encoding of the Basic
Multilingual Plane (BMP) of ISO/IEC 10646). The downside to this
approach is that most of the software available today only accepts 8
bit wide characters (7 bit if you have bad luck :-( ), so the Unicode
approach is problematic. This 8 bit restriction permeates nearly all
code in use today, including system software (file systems, process
identifiers, etc.!). To ease this problem somewhat, a representation
which maps Unicode characters to a variable length 8 bit based
encoding has been introduced; this encoding is called UTF-8. More
information about Unicode can be obtained from URL
http://unicode.org.
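
As an illustration of how UTF-8 works: ASCII characters (0x00-0x7f)
pass through unchanged as single bytes, while each ISO 8859-1
accented character becomes a two byte sequence of the form
110xxxxx 10xxxxxx, which carries an 11 bit character code:
-----------------
a-umlaut (ä) = 0xE4 = 000 1110 0100 (as an 11 bit code)

split into 5 + 6 bits:  00011 100100
UTF-8 bytes:  110 00011  10 100100  =  0xC3 0xA4
-----------------
Thus pure ASCII files are already valid UTF-8, which explains much of
the encoding's appeal.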

21. Glossary: Acronyms, Names, etc.
i18n I<-- 18 letters -->n = Internationalization
e13n Europeanization
l10n Localization
ANSI American National Standards Institute, the US member of ISO
ASCII American Standard Code for Information Interchange
CP Code Page
CP850 Code Page 850, the most widely used MS DOS code page
CR Carriage Return
CTAN server Comprehensive TeX Archive Network, the world's largest
repository for TeX related material. It consists of three
sites mirroring each other: ftp.shsu.edu, ftp.tex.ac.uk,
ftp.dante.de. The current configuration, including known
mirrors, can be obtained by fingering cta...@ftp.shsu.edu
DEC Digital Equipment Corp.
DIN Deutsche Industrie Norm (German Industry Norm)
DOS Disk Operating System
EBCDIC Extended Binary Coded Decimal Interchange Code
---a proprietary IBM character set used on mainframes
ECMA European Computer Manufacturer's Association
emacs Editing Macros, a family of popular text editors
ESMTP Enhanced SMTP
Esperanto A synthetic, ``universal'' language developed by
Dr. Zamenhof in 1887.
FSF Free Software Foundation
FTP File Transfer Protocol
GNU GNU's not Unix, an FSF project
HP Hewlett Packard
HP/UX HP Unix
IBM International Business Machines Corp.
IEEE Institute of Electrical and Electronics Engineers
INRIA Institut National de Recherche en Informatique et en Automatique
IP Internet Protocol
ISO International Organization for Standardization
KOI8 Kod Obmena Informatsiei (8 bit)---a popular encoding for Cyrillic on UNIX workstations
\LaTeX{} A macro package for \TeX{}
LF Linefeed
MCS DEC's Multinational Character Set---the ISO 8859-1 draft standard
MIME Multipurpose Internet Mail Extensions
MS-DOS Microsoft's program loader
MTA mail transfer agent
MUA mail user agent
OS Operating System
OSF the Open Software Foundation
OSF/1 the Open Software Foundation's Unix, Revision 1
PGP Pretty Good Privacy, an encryption package
POSIX Portable Operating System Interface (an IEEE UNIX standard)
PS PostScript, Adobe's printer language
RFC Request for Comment, an Internet standard
sed stream editor, a UNIX file manipulation utility
SMTP Simple Mail Transfer Protocol
TCP Transmission Control Protocol
\TeX{} Donald Knuth's typesetting program
UDP User Datagram Protocol
URL a WWW Uniform Resource Locator
US-ASCII the US national variant of ISO 646, see ASCII
VMS Virtual Memory System---DEC's proprietary OS
W3 WWW
WWW World Wide Web
X11 X Window System

22. Comments
This FAQ is somewhat Sun-centered, though I have tried to include
other machine types. If you have figured out how to configure your
machine type, please let me (mi...@vlsivie.tuwien.ac.at) know so that I
can include it in future revisions of this FAQ.

23. Home location of this document
23.1 www
You can find this and other i18n documents under URL
http://www.vlsivie.tuwien.ac.at/mike/i18n.html.

23.2 ftp
The most recent version of this document is available via anonymous
ftp from ftp.vlsivie.tuwien.ac.at under the file name
/pub/8bit/FAQ-ISO-8859-1

-----------------

Copyright © 1994,1995,1996 Michael Gschwind (mi...@vlsivie.tuwien.ac.at)

This document may be copied for non-commercial purposes, provided this
copyright notice appears. Publication in any other form requires the
author's consent. (Distribution or publication bundled with a product
requires the author's consent, as does publication in any book,
journal or other work.)


Michael Gschwind, Institut f. Technische Informatik, TU Wien
snail: Treitlstraße 3-182-2 || A-1040 Wien || Austria
email: mi...@vlsivie.tuwien.ac.at PGP key available via www (or email)
www : URL:http://www.vlsivie.tuwien.ac.at/mike/mike.html
phone: +(43)(1)58801 8156 fax: +(43)(1)586 9697

From: Osmo Ronkanen
Date: Jun 3, 1996

In article <yzzafyn...@tiny.lysator.liu.se>,

Johan Olofsson <j...@lysator.liu.se> wrote:
>Osmo Ronkanen wrote:
>
> > I looked at that and I do not see the point. Why should I pay anything in
> > the form of size or speed so that I could get Cyrillic or Greek
> > characters. I do not know Greek or Russian and therefore I do not need
> > those characters. Should I have a need to write Greek or Russian names,
> > I'd use the Latin alphabet as one should do.
>
>As long as we are bound to 8-bit charactersets problems will arise each time
>we try to combine two languages needing different charactersets in the same
>text or in the same program.

Such needs are not typical, and one could use 16-bit characters for those
needs. Also, if the system is Unicode aware, one can define 8-bit sets as
one pleases. It takes just 512 bytes to define a character set.

That is what I see in the future. Systems are aware of Unicode and it is
some kind of basic character set, but actual files would be stored in
8-bit character sets if possible. It would be somewhat analogous to
storing graphics pictures with separate palettes. Of course the systems
should know several predefined character sets (like iso-8859-n). When
16 bits are needed they could naturally be used. In most cases 8 bits is
enough though.

>
>Do you remember how we recently couldn't produce a proper version of the
>Sapmi national anthem since we here use iso-8859-1 and not iso-8859-10.
>[ We == soc.culture.nordic ]

Do you think all the traffic in the Usenet should be doubled so that one
can send the national anthem? Or would a better solution be that
messages carried information about their character set? (I do not mean
that brain-dead MIME but a real system where 8-bit data is sent with
information about the character set.)

Osmo

From: Gary Capell
Date: Jun 4, 1996

Erik Naggum <er...@naggum.no> writes:

>| Conversion between different *sets* is very problematic.

|nonsense. all you need is an invertible mapping from a set to a namespace.


|this is _truly_ simple, and requires N tables for N sets. the stupid ways
|people have ordinarily done conversion is very problematic, since they tend
|to think that N*(N-1) is a good number of tables for N sets. it isn't.

I'm with you now! And why don't we assign a _number_ to every
name in that namespace, while we're at it? Then we could have
a compact representation of the universal character namespace.
And we could call this namespace "Uniname". I just can't
think of what we might call the set of numeric codes for
Uniname. Any ideas?
--
http://www.cs.su.oz.au/~gary/

From: Ross Ridge
Date: Jun 4, 1996

Erik Naggum <er...@naggum.no> wrote:
>my system currently uses 375M of disk space. I don't want to pay the cost
>of data compression on my text files (which is considerable, especially
>when it comes to searching or obtaining random access into them). I went
>through it right now, and I find that 211M of it is text, probably what you
>would call "raw, uncompressed text". it happens to be mostly ASCII, but I
>use ISO Latin 1 for my native tongue. I did the "strings /dev/mem | wc -c"
>thing, and of my 32M of memory, 9.5M were considered text by `strings'.

That's an incredibly naive way to determine the amount of text strings
stored in the memory of your computer. Unfortunately such naivety has
dominated your and Mr. Kuhn's discussion on the merits of Unicode.
I am quite frankly surprised to see such ignorance about the reality of
internationalization issues outside of English-speaking North America.

The reality is that no one cares about how much extra disk space or
memory Unicode or any other encoding might take up, and that no one
wants the hassle of managing files coded in a totally alien wide
character set like Unicode. Unicode has its uses, either as a wide
character set for programs and/or operating systems to use internally,
or as a code for the relatively rare case of exchanging information
between machines using multiple different and possibly unknown code
sets. One day it might see practical use as the truly native *narrow*
character set of a computer. Unicode, despite being designed by
"computer scientists" from a large number of organizations, has a
number of problems, but it's by far the best attempt so far to produce
a comprehensive character set, something that the computer industry
very much needs.

Mr. Kuhn's admonition that we all use Unicode is just as naive.
Users shouldn't have to care what character set their applications use,
nor should even most programmers. I maintain a large set of mostly text
processing utilities that can be compiled and used on everything from
systems that support only ASCII to ones that support a bizarre set of
EBCDIC code pages (including shift encoded ones). Whether any of them
support Unicode is a detail I'm not concerned with.

Ross Ridge

--
l/ // Ross Ridge -- The Great HTMU +1 519 883 4329
[oo][oo] rri...@csclub.uwaterloo.ca http://www.csclub.uwaterloo.ca/u/rridge/
-()-/()/
db //

From: Erik Naggum
Date: Jun 4, 1996

[Gary Capell]

| I'm with you now! And why don't we assign a _number_ to every name in
| that namespace, while we're at it? Then we could have a compact
| representation of the universal character namespace. And we could call
| this namespace "Uniname". I just can't think of what we might call the
| set of numeric codes for Uniname. Any ideas?

funny guy. the point with names is that you can describe a character set
relatively easily, can debug it, and can use the information in the names
to help you understand what a character means and is.

and, actually, I have suggested that ISO 10646 names be used (in fact, I
use them for just this purpose), if you had cared to pay attention. the
idea of naming characters uniquely is quite novel. it will naturally take
most people quite a while to realize it has unique benefits that can be
exploited. the ISO committee that produced the list does not understand
that the names can be useful outside of their standard, for instance.

as regards Unicode, it is but _another_ coded character set, with a whole
slew of _additional_ encodings to burden the planet. hardly novel. that
is, most programmers will know how to do a number-to-number mapping, and
most programmers aren't very good even at that (look at all the tables out
there, embedded in programs that attempt to do translation, and see how
many of them contain mistakes), but at least they know how to do it.

funny how numbering everything is supposed to be a panacea. some ancient
Greek philosophers got that idea some 4000 years ago. as I recall, it was
discarded then, too. but at least they had an idea that needed to be
tested and understood. today, we should have known better: a number for
everything is a machine necessity, otherwise not a very good idea.

sigh, why do I bother?

From: Erik Naggum
Date: Jun 4, 1996

[Ross Ridge]

| That's an incredibly naive way to determine the amount of text strings
| stored in the memory of your computer.

thank you for paying attention. Markus Kuhn showed me his number,
concluded some one-digit percentage of his memory, and used that to
support his (naïve) notions. I wanted to show him that his one datum
was one order of magnitude wrong and not quite as insignificant as he
claimed. did that work? not with people who are so upset about their
superior perception of other people's naïveté that they fail to see
anything else. *sigh*

| Unicode, despite being designed by "computer scientists" from a large
| number of organizations

warning: point of fact intruding into your prejudice collection: it wasn't.

From: Osmo Ronkanen
Date: Jun 5, 1996

In article <4ou24t$n...@staff.cs.su.oz.au>, Gary Capell <ga...@cs.su.oz.au> wrote:
>ronk...@cc.helsinki.fi (Osmo Ronkanen) writes:
>
>>Also the UTF is most effective on English text, on East European text
>>it is almost as effective as the share of accented characters is so
>>small. However, with languages like Russian and Greek it is less
>>effective. Do you think those people can afford the waste of resources?
>
>You don't have to _store_ your text as UTF-encoded Unicode. Feel
>free to _store_ and _use_ text in whatever encoding you like.

That is just what I have been arguing for: Unicode will not replace
existing character sets but will supplement them.

>It would just be nice if we had one universal character set to _exchange_
>text. Unicode seems to fit the bill. And if we're going to pick one
>transfer encoding for this, we should pick the one that minimizes the
>total resources for all electronic communications, and for now, that
>means optimizing for ASCII. It's not (meant to be) imperialism, just
>pragmatism.

Why should one have to pick one transfer character set? Not all transfer
needs are equal. While UTF-8 works very well in the US and relatively
well in Western Europe, in Greece or Russia it will double the amount
of data transferred. Those people already have their efficient
character sets. Why should they stop using them and begin using UTF-8?
UTF-8 is based on the philosophy that the ASCII character set (or the
Latin alphabet) is the norm and other characters are special characters
for which one can use extra byte(s). If one considers the communication
needs that, for example, Russians have, that assumption is simply
insane.

Instead, a system should know at least the existing ISO-8859-n
standards. In that way (and with UTF-8 or something like that) the
communication can be done most efficiently in all cases. Remember that
once a system knows Unicode, handling those character sets is child's
play. It takes only 5KB to store lookup tables for all ten sets, and
using them to convert from 8-bit to Unicode is easier than decoding
UTF-8.

>
>If you and your friend make private arrangements to exchange data using
>some other character set, well and good.


I am not talking about private arrangements; I am talking about
efficient ways to store and handle data in a language, that is,
national arrangements.

>But to put data in a form that
>_everyone_ should be able to read, I'd suggest UTF-8 encoded unicode.
>(Except, of course, at the moment almost _noone_ is ready to read
>this. ahem)

Yeah, UTF-8 is nice when one wants to slip a Russian letter inside
English text, but what is the need for that?

>
>>Yeah fonts, nice to mention that. When one switches to 16-bit sets
>>then the space needed for fonts does not increase just two fold, it
>>increases 256-fold. Even with MES it increases over 3-fold. I would
>>really love to waste my resources in storing Russian and Chinese
>>characters.
>

>You don't have to. Just because your application understands a large
>character set doesn't mean it has to have glyphs for every one of those
>characters. I use a Unicode editor (http://www.cs.su.oz.au/~gary/wily/)
>where I can specify different fonts for different ranges of the
>character set. I don't have (or need) fonts for the whole range (but it
>would be nice to have them).

Yeah, it is nice to have a large character set when one does not know
what those characters look like.

>
>>Are you sure that your workstation is typical? Are programmers typical
>>computer users? In some systems text is the major waste of resources.
>>Think about Usenet.
>

>Yes, lets think about Usenet. Should everyone on Usenet have a
>character set converter for every possible 8-bit encoding, or should
>they all be able to read one character set (Unicode).

Not every possible one, just the standard ones: ISO-8859-n and possibly
some industry standards as well. It is not that hard to make lookup tables.
The fact just is that the communication resources are not limitless.
Every time I post something I get a warning that it costs thousands of
dollars or something like that. Why should that be doubled especially
among those who are poorer than Western Europeans?

>Again, feel free
>to immediately convert that Unicode into your local 8-bit encoding as
>soon as you grab the text from Usenet (assuming all the characters in
>everything you look at fit into your 8-bit encoding), but let's have a
>_single_ standard for character _exchange_.
>

A single standard does not mean we are bound to a poor standard like
UTF-8. IMO using it is like brown-nosing Americans. The standard should
be such that it is the most effective for the material transmitted, like
using well defined 8-bit character sets and escapes to full Unicode.

>>I say that codes, line Unicode have a place in the future, but not as
>>the only character sets. 8-bit character sets will live on.
>

>Sure, if you like. Just keep them for _local_ use, and don't _expect_
>anyone else to know about your 8-bit character set.

Sorry, but people here are already communicating with ISO-8859-1;
don't tell us what we should do. For some, using non-ASCII
communication is an everyday need and not some philosophical theory.
It is not even a special need; it is just as natural as using ASCII.

>--
>http://www.cs.su.oz.au/~gary/
>

Osmo


From: Osmo Ronkanen
Date: Jun 5, 1996

In article <6A4$MSd...@khms.westfalen.de>,

Kai Henningsen <k...@khms.westfalen.de> wrote:
>ronk...@cc.helsinki.fi (Osmo Ronkanen) wrote on 29.05.96 in <4oh9rn$a...@kruuna.helsinki.fi>:
>
>> In article <4of365$1...@cortex.dialin.rrze.uni-erlangen.de>,
>> Markus Kuhn <msk...@cip.informatik.uni-erlangen.de> wrote:
>> >ronk...@cc.helsinki.fi (Osmo Ronkanen) writes:
>> >
>> >>I think a good think would be if
>> >>operating systems were aware of various character sets, like having an
>> >>attribute for each text file that tells the set.
>> >
>> >Sorry, I disagree. I strongly feel that the wonderful beauty and
>> >simplicity of having one *single* character set instead of complicated
>> >attributes and conversion tools by far overrules the minor increased
>> >memory usage of Unicode. Raw uncompressed non-ASCII text files
>> >consitute less than 5% of typical harddisks.
>>
>> Why non-ASCII? Are you saying that after the switch there would be both
>> ASCII and then Unicode?
>
>No. He's saying that an UTF-8 file that contains only characters also
>found in ASCII, is in fact the same as an ASCII file, or in other words,
>the encoding for ASCII (characters 0x00-0x7f) in UTF-8 is, in fact, 0x00-
>0x7f.

And I was thinking he suggested moving to full Unicode and not
something that is not even a character set but more like a compression
method.

>
>It's only when you include non-ASCII chars that you get differences.
>

Yeah and for some people those are not any special characters but the
Latin alphabets are special characters.

>> Hard disks are not the only resource there is. There are also memory,
>> serial ports etc. Either those resources are wasted as well or then all
>> programs become more complicated especially if the system cannot
>> translate between different character sets.
>
>I don't think pure text makes for a significant amount of memory usage.
>Even when people use lots of text, there's probably a lot of formatting
>info and management data structures and fonts and so on around at the same
>time.

Yeah, and if the transition to Unicode is made totally then all that
text doubles in size. Of course if all processed data is stored always
compressed (what UTF-8 really is) then things get difficult.

>
>As for serial ports, if and when you shovel lots of text across them, you
>will definitively want to use compression. After that, I don't think
>there's that much overhead left as long as you use UTF-8 - of course, raw
>Unicode or even UCS (ISO 10646) is worse. But then, that's one of the
>reasons we _have_ UTF-8.

Try writing Greek or Russian and tell them about how good UTF-8 is and
how much overhead there is. Or try writing Chinese and use five bytes
per character instead of two.

Or you could memorize the sentence "Non-ASCII characters are not
special characters."

Maybe the next step is that modem manufacturers incorporate a compression
that compresses those UTF-8 characters by converting them back to 8-bit
ones. :-)

>
>> If the transition do Unicode is made completely then all text no matter
>> where it is stored. Compression helps somewhat, but it does not remove
>> the extra resource need completely. Also I do not think I should have to
>> use compression just to get rid of something that was inserted in my
>> files without any good need.
>
>The "good need" is there all right.

I do not have much need to write Russian, Chinese or Greek. I do not
see why I should compromise the convenience that a fixed length 8-bit
character set brings to gain the ability to write those languages.

>
>> >Have you ever tried to implement a cross-application cut&paste
>> >facility (like that of xterm) that supports a character set switching
>> >system like ISO 2022 or your attributed files?
>>
>> What is the problem? The paste buffer could be in Unicode and the system
>> could provide the translations.
>
>This sounds like cutting off your finger to spite your hand. If you want
>to keep a byte-oriented character set with occasionally multiple byte
>sequences, why don't you use the simpler UTF-8?

Because it is not a byte oriented set. It uses 1-5 bytes per character.
It wastes space. It is far from simple.

Of course one can compress the paste buffer with UTF-8 if one wishes.
One can also use LZW.

Osmo


From: Scott Schwartz
Date: Jun 6, 1996
To: Osmo Ronkanen

ronk...@cc.helsinki.fi (Osmo Ronkanen) writes:
| Try writing Greek or Russian and tell them about how good UTF-8 is and
| how much overhead there is. Or try writing Chinese and use five bytes
| per character instead of two.

16 bit Unicode maps to at most three bytes of UTF. It takes two bytes
to represent Greek and Russian, three bytes for Chinese.

The convenience of having one tool that transparently works on any
character is a real win, because then people can write Chinese or
Greek or Russian, even simultaneously, and not have to change or
reconfigure any software. I've heard people suggest that you just
translate foreign character sets to the native one, as if this will be
easy or fun. My experience with MIME is that all you get is trouble;
layers of software, bugs, and confusion, when you could avoid the
whole issue instead. (About simultaneously: when you send email in
the language of your choice, the headers are still basically in
English, right?)

Ignoring foreign languages for a moment (since English rules the
universe) another advantage of unicode is that it gives you a rich set
of mathematical symbols in addition to alphanumerics. In Plan 9,
troff users have less need for eqn, since you can type arrows and
integral signs directly, and C programs can use greek letters for
variables. Usenet fans will find happy-face and sad-face characters,
obviating the need for the currently popular emoticon ":-)".
(With difficulty, I resist the urge to type them here.)


From: Gary Capell
Date: Jun 7, 1996

ronk...@cc.helsinki.fi (Osmo Ronkanen) writes:

>Why should one have to pick one transfer character set?

(Off the top of my head)

1. It makes _all_ the software which deals with text much simpler
(than software which has to deal with multiple character sets)
or more general (than software which ignores all but one character
set)

2. Particularly, it makes system-level software (editors, terms) easier
to use with international text.
tail file_with_greek_text
tail file_with_russian_text

With the UTF8 tools I use, this works fine. I don't
see how this could be made to work in the "collections of 8-bit
character sets" world.

3. It makes it much easier to work with text containing characters
from several different languages.

>Not all transfer
>needs are equal. While UTF-8 works very well in US and relatively well
>in Western Europe, in Greece or Russia it will double the amount of data
>transferred.

Only if they're not using any compression, in which case they can't
really claim to care about the efficiency of the character encoding.
--
http://www.cs.su.oz.au/~gary/

From: Peter Kerr
Date: Jun 8, 1996

Unicode is a wonderful idea: all possible human scripts in one set.
Well, quite a lot of them; it looks like yet again the character table
may have been made too small.

But for everyday use the consensus seems to be that people would prefer to
work with one 8 bit character set, just like a few dinosaurs seem
determined to continue with a 7-bit set.

What is urgently needed is one efficient, rational translation scheme
between the ISO 8859-n sub-groups, the so-called "IBM code pages", and
the other industry "standards", going via Unicode if necessary, which
will work cross-platform, multi-lingual, multi-cultural...

It is beginning to look like only another Noah's Flood can sweep away the
Tower of Babel we are currently building.

--
Peter Kerr bodger
School of Music chandler
University of Auckland neo-Luddite

From: Osmo Ronkanen
Date: Jun 8, 1996

In article <4p6shr$d...@staff.cs.su.oz.au>,

Gary Capell <ga...@cs.su.oz.au> wrote:
>ronk...@cc.helsinki.fi (Osmo Ronkanen) writes:
>
>>Why should one have to pick one transfer character set?
>
>(Off the top of my head)
>
>1. It makes _all_ the software which deals with text much simpler
> (than software which has to deal with multiple character sets)
> or more general (than software which ignores all but one character
> set)
>

I do not see how it makes the software any simpler. Remember that Unicode
is one table lookup away from 8-bit codes.

>2. Particularly, it makes system-level software (editors, terms) easier
> to use with international text.

^^^^^^^^^^^^^


> tail file_with_greek_text
> tail file_with_russian_text
>
> With the UTF8 tools I use, this works fine. I don't
> see how this could be made to work in the "collections of 8-bit
> character sets" world.

For some, Greek or Russian is not international text. It is the very
text that is used daily and that has to be processed and stored
efficiently.

>
>3. It makes it much easier to work with text containing characters
> from several different languages.
>

Yeah, those stupid files that have greetings in all languages.

>>Not all transfer
>>needs are equal. While UTF-8 works very well in US and relatively well
>>in Western Europe, in Greece or Russia it will double the amount of data
>>transferred.
>

>Only if they're not using any compression, in which case they can't
>really claim to care about the efficiency of the character encoding.

So they should not only waste space, but also processor resources for
compression.

>--
>http://www.cs.su.oz.au/~gary/

Osmo


From: Osmo Ronkanen
Date: Jun 8, 1996

In article <8grarsy...@galapagos.cse.psu.edu>,

Scott Schwartz <schw...@galapagos.cse.psu.edu> wrote:
>ronk...@cc.helsinki.fi (Osmo Ronkanen) writes:
>| Try writing Greek or Russian and tell them about how good UTF-8 is and
>| how much overhead there is. Or try writing Chinese and use five bytes
>| per character instead of two.
>
>16 bit unicode maps to at most three bytes of utf. It takes two bytes
>to represent greek and russian, three bytes for Chinese.

True, the documentation I looked at had examples for up to 31 bits, so
that confused me. Anyway, one needs 100% or 50% more bandwidth for
those languages.

>
>The convenience of having one tool that transparently works on any
>character is a real win, because then people can write Chinese or
>Greek or Russian, even simultaneously, and not have to change or
>reconfigure any software.

How many people have that kind of a need? Are there more of those than
those who need an efficient way of storing and transferring Russian or
Chinese text?

>I've heard people suggest that you just
>translate foreign character sets to the native one, as if this will be
>easy or fun. My experience with MIME is that all you get is trouble;
>layers of software, bugs, and confusion, when you could avoid the
>whole issue instead. (About simultaneously: when you send email in
>the language of your choice, the headers are still basically in
>English, right?)
>
>Ignoring foreign languages for a moment (since English rules the
>universe)

No, we cannot ignore them here. Also, letters like äöÄÖ are in no way
foreign here. That is the key issue: what is foreign to you is not
foreign to others.

>another advantage of unicode is that it gives you a rich set
>of mathematical symbols in addition to alphanumerics. In Plan 9,
>troff users have less need for eqn, since you can type arrows and
>integral signs directly, and C programs can use greek letters for
>variables. Usenet fans will find happy-face and sad-face characters,
>obviating the need for the currently popular emoticon ":-)".
>(With difficulty, I resist the urge to type them here.)
>

Osmo

From: Scott Schwartz
Date: Jun 8, 1996
To: Osmo Ronkanen

| >The convenience of having one tool that transparently works on any
| >character is a real win, because then people can write Chinese or
| >Greek or Russian, even simultaneously, and not have to change or
| >reconfigure any software.
|
| How many people have that kind of a need? Are there more those than
| those who need efficient way to storing and transferring Russian or
| Chinese text?

It's not a matter of need, but of means. If a small number of people
have the need, it is better for everyone that there be one good way to
do it than dozens of poor ways. Engineering our software to meet that
goal is the most efficient way to meet everyone's diverse needs.

| >I've heard people suggest that you just
| >translate foreign character sets to the native one, as if this will be
| >easy or fun. My experience with MIME is that all you get is trouble;
| >layers of software, bugs, and confusion, when you could avoid the
| >whole issue instead. (About simultaneously: when you send email in
| >the language of your choice, the headers are still basically in
| >English, right?)
| >
| >Ignoring foreign languages for a moment (since English rules the
| >universe)
|
| No, we cannot ignore them here. Also, letters like äöÄÖ are in no way
| foreign here. That is the key issue: what is foreign to you is not
| foreign to others.

I'm sorry for making a joke, but you fail to respond to my argument.
To reiterate, if you send email, you need ASCII (because RFC-822
requires it) in addition to your native character set.

This answers the question posed above: all Russian and Chinese and
Greek users of the Internet need at least two character sets.

Unicode meets this need directly. Other alternatives, like MIME, are
self-evidently less satisfactory.

You also fail to address the following, which provides a culture-
independent version of the argument:

Kai Henningsen

Jun 8, 1996

ronk...@cc.helsinki.fi (Osmo Ronkanen) wrote on 05.06.96 in <4p2cka$m...@kruuna.helsinki.fi>:

> In article <6A4$MSd...@khms.westfalen.de>,
> Kai Henningsen <k...@khms.westfalen.de> wrote:
> >ronk...@cc.helsinki.fi (Osmo Ronkanen) wrote on 29.05.96 in
> ><4oh9rn$a...@kruuna.helsinki.fi>:
> >> In article <4of365$1...@cortex.dialin.rrze.uni-erlangen.de>,
> >> Markus Kuhn <msk...@cip.informatik.uni-erlangen.de> wrote:
> >> >ronk...@cc.helsinki.fi (Osmo Ronkanen) writes:

> And I was thinking he suggested moving to full Unicode and not
> something that is not even a character set but more like a compression
> method.

UTF-8 is full Unicode as far as charsets are involved. In fact, it's more
- it's full ISO 10646.

> >It's only when you include non-ASCII chars that you get differences.

> Yeah and for some people those are not any special characters but the
> Latin alphabets are special characters.

Sorry, but I can't parse that sentence.

> >> Hard disks are not the only resource there is. There are also memory,
> >> serial ports etc. Either those resources are wasted as well, or else all
> >> programs become more complicated, especially if the system cannot
> >> translate between different character sets.
> >
> >I don't think pure text makes for a significant amount of memory usage.
> >Even when people use lots of text, there's probably a lot of formatting
> >info and management data structures and fonts and so on around at the same
> >time.
>
> Yeah, and if the transition to Unicode is made totally then all that
> text doubles in size. Of course if all processed data is always stored
> compressed (which is what UTF-8 really is) then things get difficult.

1. As I pointed out, doubling the text in memory isn't that interesting.
2. Reading UTF-8 into Unicode, and writing Unicode into UTF-8, is a rather
trivial exercise. About one or two screens of code, which you only need to
write once (and which is fairly straightforward, and there are several
example implementations available on the net).
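
As a rough illustration of how little code that is, here is a sketch of
both directions in C, limited to the 16-bit range and with all error
checking omitted (a real implementation would validate its input):

    /* Encode one UCS-2 value as UTF-8; returns bytes written (1-3). */
    int utf8_encode(unsigned c, unsigned char *out)
    {
        if (c < 0x80)  { out[0] = c; return 1; }
        if (c < 0x800) {
            out[0] = 0xC0 | (c >> 6);
            out[1] = 0x80 | (c & 0x3F);
            return 2;
        }
        out[0] = 0xE0 | (c >> 12);
        out[1] = 0x80 | ((c >> 6) & 0x3F);
        out[2] = 0x80 | (c & 0x3F);
        return 3;
    }

    /* Decode one UTF-8 sequence to UCS-2; returns bytes consumed (1-3). */
    int utf8_decode(const unsigned char *in, unsigned *c)
    {
        if (in[0] < 0x80) { *c = in[0]; return 1; }
        if (in[0] < 0xE0) { *c = ((in[0] & 0x1F) << 6) | (in[1] & 0x3F); return 2; }
        *c = ((in[0] & 0x0F) << 12) | ((in[1] & 0x3F) << 6) | (in[2] & 0x3F);
        return 3;
    }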

> >As for serial ports, if and when you shovel lots of text across them, you
> >will definitely want to use compression. After that, I don't think
> >there's that much overhead left as long as you use UTF-8 - of course, raw
> >Unicode or even UCS (ISO 10646) is worse. But then, that's one of the
> >reasons we _have_ UTF-8.
>

> Try writing Greek or Russian and tell them about how good UTF-8 is and
> how much overhead there is. Or try writing Chinese and use five bytes
> per character instead of two.

Well, you can always use two bytes per character if you prefer that.
Unicode certainly won't hinder you.

> Or you could memorize the sentence "Non-ASCII characters are not
> special characters."

You might take a look at my address, and at my name, and think again.
Maybe then you'll see why your sentence is somewhat silly.

> >> If the transition to Unicode is made completely then all text doubles, no
> >> matter where it is stored. Compression helps somewhat, but it does not remove
> >> the extra resource need completely. Also I do not think I should have to
> >> use compression just to get rid of something that was inserted in my
> >> files without any good need.
> >
> >The "good need" is there all right.
>
> I do not have much need to write Russian, Chinese or Greek. I do not see
> why I should compromise the convenience that a fixed-length 8-bit
> character set brings to gain the ability to write those languages.

I see a lot of "I" in that sentence. Maybe, if you look around, you will
notice that there are a lot more people in the world than you.

And you might think about, for example, to how many people you can sell
software that supports ASCII, vs. one that supports Latin-1, vs. one that
supports Unicode.

> >This sounds like cutting off your finger to spite your hand. If you want
> >to keep a byte-oriented character set with occasionally multiple byte
> >sequences, why don't you use the simpler UTF-8?
>
> Because it is not a byte-oriented set. It uses 1-5 bytes per character.
> It wastes space. It is far from simple.

Of course it's a byte-oriented character set. Every multibyte character
set is byte-oriented. Unicode or ISO 10646 are not; they are word-oriented
(with 16- or 31-bit words).

And UTF-8 is indeed very simple - simpler than most multibyte character
sets. For example, byte-oriented search algorithms work perfectly on UTF-8
text.
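
The property that makes this work: a continuation byte (0x80-0xBF) can
never be mistaken for a lead byte, so the encoding of one character
never appears at a shifted offset inside the encoding of another. A
small illustration, assuming the string literals below are UTF-8:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* "\xD0\xB4\xD0\xB0" is UTF-8 for the Cyrillic word "da". */
        const char *text   = "yes = \xD0\xB4\xD0\xB0, no = \xD0\xBD\xD0\xB5\xD1\x82";
        const char *needle = "\xD0\xB4\xD0\xB0";

        /* Plain byte-wise strstr finds it without decoding anything,
           and no false match is possible: a match can only line up
           on a lead byte. */
        if (strstr(text, needle))
            puts("found");
        return 0;
    }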

And you can't do what it does without using more than one byte per
character.

> Of course one can compress the paste buffer with UTF-8 if one wishes.

Though that doesn't seem to make much sense. Of course, it's still a lot
better than using ISO 2022.

Osmo Ronkanen

Jun 9, 1996

In article <p.kerr-0806...@news.auckland.ac.nz>,
Peter Kerr <p.k...@auckland.ac.nz> wrote:
>Unicode is a wonderful idea: all possible human scripts in one set.
>Well, quite a lot of them; it looks like, yet again, the character table may
>have been made too small.
>
>But for everyday use the consensus seems to be that people would prefer to
>work with one 8 bit character set, just like a few dinosaurs seem
>determined to continue with a 7-bit set.
>
>What is urgently needed is one efficient, rational translation scheme
>between ISO 8859-n sub-groups, and so-called "IBM code pages", and the
>other industry "standards", going via Unicode if necessary, which will
>work cross-platform, multi-lingual, multi-cultural...

You get relatively good lossy translation if you define each 8-bit character
set as a table of values in Unicode and create a fallback table to ASCII
for Unicode (or MES, actually). That way one cannot do tricks like
translating ø to ö when translating to CP437, though.
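
A sketch of such a scheme in C; the table entries and fallbacks below
are illustrative fragments, not complete code charts:

    /* Each 8-bit set gets a 256-entry table into Unicode, e.g. for
       CP437, cp437_to_ucs2[0x84] == 0x00E4 (a-diaeresis).  Going the
       other way, values missing from the target set fall back to an
       ASCII approximation - which is exactly what makes it lossy. */
    extern const unsigned short cp437_to_ucs2[256];   /* from the chart */

    unsigned char ucs2_to_latin1(unsigned short u)
    {
        if (u < 0x100)                 /* Latin-1 is Unicode's first page */
            return (unsigned char)u;
        switch (u) {
        case 0x0152: return 'O';       /* OE ligature -> 'O' (lossy)   */
        case 0x0153: return 'o';       /* oe ligature -> 'o' (lossy)   */
        case 0x2013:
        case 0x2014: return '-';       /* dashes -> hyphen             */
        default:     return '?';       /* no sensible fallback         */
        }
    }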

>
>It is beginning to look like only another Noah's Flood can sweep away the
>Tower of Babel we are currently building.
>

You should reread your Bible :-)

>--
>Peter Kerr bodger
>School of Music chandler
>University of Auckland neo-Luddite


Osmo

Osmo Ronkanen

Jun 10, 1996

In article <6ASZc...@khms.westfalen.de>,
Kai Henningsen <k...@khms.westfalen.de> wrote:
>ronk...@cc.helsinki.fi (Osmo Ronkanen) wrote on 05.06.96 in <4p2cka$m...@kruuna.helsinki.fi>:
>
>> In article <6A4$MSd...@khms.westfalen.de>,
>> Kai Henningsen <k...@khms.westfalen.de> wrote:
>> >ronk...@cc.helsinki.fi (Osmo Ronkanen) wrote on 29.05.96 in
>> ><4oh9rn$a...@kruuna.helsinki.fi>:
>> >> In article <4of365$1...@cortex.dialin.rrze.uni-erlangen.de>,
>> >> Markus Kuhn <msk...@cip.informatik.uni-erlangen.de> wrote:
>> >> >ronk...@cc.helsinki.fi (Osmo Ronkanen) writes:
>
>> And I was thinking he suggested moving to full Unicode and not
>> something that is not even a character set but more like a compression
>> method.
>
>UTF-8 is full Unicode as far as charsets are involved. In fact, it's more
>- it's full ISO 10646.

You knew what I meant.

>
>> >It's only when you include non-ASCII chars that you get differences.
>
>> Yeah and for some people those are not any special characters but the
>> Latin alphabets are special characters.
>
>Sorry, but I can't parse that sentence.

There are people to whom non-ASCII characters are the normal characters.
For them UTF-8 does not suit at all.

>
>> >> Hard disks are not the only resource there is. There are also memory,
>> >> serial ports etc. Either those resources are wasted as well, or else all
>> >> programs become more complicated, especially if the system cannot
>> >> translate between different character sets.
>> >
>> >I don't think pure text makes for a significant amount of memory usage.
>> >Even when people use lots of text, there's probably a lot of formatting
>> >info and management data structures and fonts and so on around at the same
>> >time.
>>
>> Yeah, and if the transition to Unicode is made totally then all that
>> text doubles in size. Of course if all processed data is always stored
>> compressed (which is what UTF-8 really is) then things get difficult.
>
>1. As I pointed out, doubling the text in memory isn't that interesting.

True, I am not interested in that, so I am not interested in Unicode.

>2. Reading UTF-8 into Unicode, and writing Unicode into UTF-8, is a rather
>trivial exercise. About one or two screens of code, which you only need to
>write once (and which is fairly straightforward, and there are several
>example implementations available on the net).
>

It would be as simple to translate between 8-bit character sets and
Unicode if one had the lookup tables.

The fact is that at some point the space is wasted unless one
constantly translates between the sets.

>> >As for serial ports, if and when you shovel lots of text across them, you
>> >will definitely want to use compression. After that, I don't think
>> >there's that much overhead left as long as you use UTF-8 - of course, raw
>> >Unicode or even UCS (ISO 10646) is worse. But then, that's one of the
>> >reasons we _have_ UTF-8.
>>
>> Try writing Greek or Russian and tell them about how good UTF-8 is and
>> how much overhead there is. Or try writing Chinese and use five bytes
>> per character instead of two.
>
>Well, you can always use two bytes per character if you prefer that.
>Unicode certainly won't hinder you.
>

Yes, but if one wants to write Russian at one byte per character just
as one writes English, one can't do that.

If it allows that, then you already have two different character sets.
For example data streams and files have to be marked whether they
contain UTF-8 or Unicode. If you have two, then why not have some more.


>> Or you could memorize the sentence "Non-ASCII characters are not
>> special characters."
>
>You might take a look at my address, and at my name, and think again.
>Maybe then you'll see why your sentence is somewhat silly.
>

So why do you support a system that is based on that philosophy?

>> >> If the transition to Unicode is made completely then all text doubles, no
>> >> matter where it is stored. Compression helps somewhat, but it does not remove
>> >> the extra resource need completely. Also I do not think I should have to
>> >> use compression just to get rid of something that was inserted in my
>> >> files without any good need.
>> >
>> >The "good need" is there all right.
>>
>> I do not have much need to write Russian, Chinese or Greek. I do not see
>> why I should compromise the convenience that a fixed-length 8-bit
>> character set brings to gain the ability to write those languages.
>
>I see a lot of "I" in that sentence. Maybe, if you look around, you will
>notice that there are a lot more people in the world than you.

There are also a lot of people besides you.

>
>And you might think about, for example, to how many people you can sell
>software that supports ASCII, vs. one that supports Latin-1, vs. one that
>supports Unicode.

Typically, software that supports one 8-bit set supports them all. (Some
special functions, like capitalizing text, are exceptions.)

>
>> >This sounds like cutting off your finger to spite your hand. If you want
>> >to keep a byte-oriented character set with occasionally multiple byte
>> >sequences, why don't you use the simpler UTF-8?
>>
>> Because it is not a byte-oriented set. It uses 1-5 bytes per character.
>> It wastes space. It is far from simple.
>
>Of course it's a byte-oriented character set. Every multibyte character
>set is byte-oriented. Unicode or ISO 10646 are not, they are word-oriented
>(with 16 or 31 bit words).
>
>And UTF-8 is indeed very simple - simpler than most multibyte character
>sets. For example, byte-oriented search algorithms work perfectly on UTF-8
>text.
>

Yeah, if one searches for an ASCII character.

>And you can't do what it does without using more than one byte per
>character.
>

One could for example use 8 bits and a single character for escape. The
8-bit set would be identified at the beginning of the stream.
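
A sketch of what such a stream reader might look like; the identifier
and escape values here are made up for illustration, not any existing
standard:

    #include <stdio.h>

    #define CHARSET_ESCAPE 0x1B     /* hypothetical: next byte selects a set */

    extern void process_char(int charset, int c);

    void read_stream(FILE *fp)
    {
        int set = getc(fp);         /* set identifier at start of stream */
        int c;

        while ((c = getc(fp)) != EOF) {
            if (c == CHARSET_ESCAPE) {
                set = getc(fp);     /* switch to another 8-bit set */
                continue;
            }
            process_char(set, c);   /* interpret byte in the current set */
        }
    }

Note that the escape makes the stream stateful: a reader cannot
interpret a byte in the middle of the stream without knowing which set
is currently selected.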

>> Of course one can compress the paste buffer with UTF-8 if one wishes.
>
>Though that doesn't seem to make much sense. Of course, it's still a lot
>better than using ISO 2022.
>
>Kai
>--
>Current reorgs: news.groups, news.admin.net-abuse.* (see nana.misc)
>Internet: k...@khms.westfalen.de
>Bang: major_backbone!khms.westfalen.de!kai
>http://www.westfalen.de/private/khms/

Osmo


Gianni Mariani

Jun 10, 1996

Much of the discussion here seems to be related to the following
questions:

1. Do we need to represent more than one character set
 simultaneously?

2. What is the cost benefit of supporting UTF-8 over a
 single encoding, and is it justified?

The answers to the above questions vary wildly depending on the
application, and it's very difficult to have a uniform, obvious
answer when you look at the problem at the microscopic level.

Kinds of applications where it is of great cost benefit are:

- Library archiving
- dictionaries
- instruction documents (remember what came with that watch
you bought in Hong Kong ?)
- translation services
- sharing of project data across languages
- address books
- .... add your favourites here ....

Kinds of applications where it does not serve much benefit:

- sending an email to your boss
- office projects that are not shared beyond your locality
- ???i'm blank???

So the cost-benefit question goes like this: how much does it cost
to enable applications in the former category with and
without Unicode (namely UCS-2, UCS-4, UTF-8 and perhaps ISO 2022)?

I conjecture that all major computing/information services
companies will implement or enable the former category of applications,
so the customer will pay for the costs of not enabling or
enabling - either way.

With the explosion of the WWW and documents in all languages
zipping around, I believe that the time has come when many
multilingual applications are necessary.

I also believe that widespread use of Unicode will reduce the
overall cost of maintenance and implementation of these systems.

The beneficiary is the consumer!

As to which variant is to be used? The ASCII-centric surely
can't complain about UTF-8. The Kanji-centric, who will use 3
bytes instead of the current 2, may have an argument against
UTF-8, but it's likely that it will be a reasonable compromise
in most cases.

As for ISO 2022, someone please present to me a standard that
I can implement as efficiently and that is as accepted as Unicode
across the entire globe, and then I will give it some consideration.
Right now it seems more like my kitchen sink after dinner
before washing up, and I hate washing up!

My point of view is that no computer company will have an option
not to answer YES to question 1. Consideration of question 2,
in my opinion, is that Unicode will greatly simplify the
problem of providing language support to all customers.

SGI has implemented some level of UTF-8 support, as many other
computer vendors have. The snowball effect is building.

Unicode or broke!

/Gianni

my .sig, my .sig, where is my .sig ?
Opinions mine etc.

Gary Capell

Jun 11, 1996

ronk...@cc.helsinki.fi (Osmo Ronkanen) writes:
>Yes, but if one wants to write Russian at one byte per character just
>as one writes English, one can't do that.

>If it allows that, then you already have two different character sets.
>For example data streams and files have to be marked whether they
>contain UTF-8 or Unicode. If you have two, then why not have some more.

They don't _have_ to be marked if simple conventions are followed.
I assume that all the text files I deal with are encoded in UTF-8.
This works fine.

>>And UTF-8 is indeed very simple - simpler than most multibyte character
>>sets. For example, byte-oriented search algorithms work perfectly on UTF-8
>>text.

>Yeah, if one searches for an ASCII character.

AND if one searches for text with an arbitrary combination of languages.
If you want to complain about UTF-8 and Unicode, at least complain
about real rather than imagined problems.

>One could for example use 8 bits and a single character for escape. The
>8-bit set would be identified at the beginning of the stream.

$ grep arbitrary_word arbitrary group of files with different languages

$ tail file_with_one_language

$ tail file_with_different_language

These commands all work fine with UTF-8. Try them with your 8-bit
encoding and 'character-set identifier at the beginning of the stream'
and enjoy the gobbledygook.
--
http://www.cs.su.oz.au/~gary/

Osmo Ronkanen

Jun 12, 1996

In article <4pis16$4...@staff.cs.su.oz.au>,
Gary Capell <ga...@cs.su.oz.au> wrote:
>ronk...@cc.helsinki.fi (Osmo Ronkanen) writes:
--------------------- Restoring unmarked deletion ----------------------

>>>Well, you can always use two bytes per character if you prefer that.
>>>Unicode certainly won't hinder you.
>>>
------------------------------------------------------------------------

>>Yes, but if one wants to write Russian at one byte per character just
>>as one writes English, one can't do that.
>
>>If it allows that, then you already have two different character sets.
>>For example data streams and files have to be marked whether they
>>contain UTF-8 or Unicode. If you have two, then why not have some more.
>
>They don't _have_ to be marked if simple conventions are followed.
>I assume that all the text files I deal with are encoded in UTF-8.
>This works fine.
>

See, if you had read my comment in context you would not have said
what you said.

>>>And UTF-8 is indeed very simple - simpler than most multibyte character
>>>sets. For example, byte-oriented search algorithms work perfectly on UTF-8
>>>text.
>
>>Yeah, if one searches for an ASCII character.
>

>AND if one searches for text with an arbitrary combination of languages.
>If you want to complain about UTF-8 and Unicode, at least complain
>about real rather than imagined problems.

I was thinking about searching for characters.

>
>>One could for example use 8 bits and a single character for escape. The
>>8-bit set would be identified at the beginning of the stream.
>

>$ grep arbitrary_word arbitrary group of files with different languages
>

So one needs to write a better grep. Or one could have one basic
character set per system, so that problem would not arise.

>$ tail file_with_one_language
>
>$ tail file_with_different_language
>
>These commands all work fine with UTF-8. Try them with your 8-bit
>encoding and 'character-set identifier at the beginning of the stream'
>and enjoy the gobbledygook.

Well, better to write some new utilities than to use two bytes per character.

>--
>http://www.cs.su.oz.au/~gary/


Osmo


Osmo Ronkanen

Jun 12, 1996

In article <31BD0B...@engr.sgi.com>,
Gianni Mariani <gia...@engr.sgi.com> wrote:
>Much of the discussion here seems to be related to the following
>questions:
>
>1. Do we need to represent more than one character set
> simultaneously?
>
>2. What is the cost benefit of supporting UTF-8 over a
> single encoding, and is it justified?
>
>The answers to the above questions vary wildly depending on the
>application, and it's very difficult to have a uniform, obvious
>answer when you look at the problem at the microscopic level.
>
>Kinds of applications where it is of great cost benefit are:
>
>- Library archiving
>- dictionaries
>- instruction documents (remember what came with that watch
> you bought in Hong Kong ?)
>- translation services
>- sharing of project data across languages
>- address books
>- .... add your favourites here ....

Those are relatively special needs. Also, there already are word
processors that can produce multi-language documents. As I understand it,
the question here is about using Unicode/UTF-8 as the base character set.

>
>Kinds of applications where it does not serve much benefit:
>
>- sending an email to your boss
>- office projects that are not shared beyond your locality
>- ???i'm blank???

Applications that use fixed fields or that index text. Writing in languages
like Turbo Pascal that support only 255-character strings.

>
>So the cost-benefit question goes like this: how much does it cost
>to enable applications in the former category with and
>without Unicode (namely UCS-2, UCS-4, UTF-8 and perhaps ISO 2022)?
>
>I conjecture that all major computing/information services
>companies will implement or enable the former category of applications,
>so the customer will pay for the costs of not enabling or
>enabling - either way.
>
>With the explosion of the WWW and documents in all languages
>zipping around, I believe that the time has come when many
>multilingual applications are necessary.
>
>I also believe that widespread use of Unicode will reduce the
>overall cost of maintenance and implementation of these systems.
>
>The beneficiary is the consumer!
>
>As to which variant is to be used? The ASCII-centric surely
>can't complain about UTF-8. The Kanji-centric, who will use 3
>bytes instead of the current 2, may have an argument against
>UTF-8, but it's likely that it will be a reasonable compromise
>in most cases.

Is it a reasonable compromise in their local communication? Who should
decide that?

You conveniently ignored languages like Greek and Russian in which one
needs two bytes per character instead of one. You did that even though I
had mentioned them several times. Why?

I do not like the idea that I lose the convenience of one byte per
character. While that would not much harm storing information
sequentially, processing it would become harder. Of course I could
translate to ISO-8859-1 after reading the file, but then what would be
the point of Unicode in the first place?

>
>As for ISO 2022, someone please present to me a standard that
>I can implement as efficiently and that is as accepted as Unicode
>across the entire globe, and then I will give it some consideration.
>Right now it seems more like my kitchen sink after dinner
>before washing up, and I hate washing up!
>
>My point of view is that no computer company will have an option
>not to answer YES to question 1. Consideration of question 2,
>in my opinion, is that Unicode will greatly simplify the
>problem of providing language support to all customers.
>

And here is the key philosophical difference. I do not view writing
characters like äöåÄÖÅ as language support any more than I view writing
characters like djaskldjask as language support.


>SGI has implemented some level of UTF-8 support, as many other
>computer vendors have. The snowball effect is building.
>
>Unicode or broke!
>
>/Gianni
>
>my .sig, my .sig, where is my .sig ?
>Opinions mine etc.


Osmo

Antoine Leca

Jun 13, 1996

Peter Kerr wrote:
>
> Unicode is a wonderful idea: all possible human scripts in one set.
> Well, quite a lot of them; it looks like, yet again, the character table may
> have been made too small.
>
> But for everyday use the consensus seems to be that people would prefer to
> work with one 8 bit character set, just like a few dinosaurs seem
> determined to continue with a 7-bit set.
>
> What is urgently needed is one efficient, rational translation scheme
> between ISO 8859-n sub-groups, and so-called "IBM code pages", and the
> other industry "standards", going via Unicode if necessary, which will
> work cross-platform, multi-lingual, multi-cultural...
> [cut]

I think you can take a look at the recode package included in the GNU project.
Look at ftp://ftp.gnu.ai.mit.edu/pub/gnu/recode-3.4.tar.gz

Hope that helps.

Gianni Mariani

Jun 13, 1996
to Osmo Ronkanen

Osmo Ronkanen wrote:
>
> In article <31BD0B...@engr.sgi.com>,
> Gianni Mariani <gia...@engr.sgi.com> wrote:

....


> >Kinds of applications where it is of great cost benefit are:
> >
> >- Library archiving
> >- dictionaries
> >- instruction documents (remember what came with that watch
> > you bought in Hong Kong ?)
> >- translation services
> >- sharing of project data across languages
> >- address books
> >- .... add your favourites here ....
>
> Those are relatively special needs. Also, there already are word
> processors that can produce multi-language documents. As I understand it,
> the question here is about using Unicode/UTF-8 as the base character set.

Exactly: SGI sells a huge number of machines to customers that
have very special needs, and so does that darn competition.

....


>
> Is it a reasonable compromise in their local communication? Who should
> decide that?

Exactly what do you mean by compromise?
I thought it would enhance it.

>
> You conveniently ignored languages like Greek and Russian in which one
> needs two bytes per character instead of one. You did that even though I
> had mentioned them several times. Why?

Unfortunately a very small portion of revenue comes from Greece and
Russia. While Russia is very important, the business economics
just don't work that way. If 20% of SGI revenue came from
Russian-speaking countries this would be the first thing on my
mind. Money talks. That doesn't mean to say that SGI does not care
about Russia; on the contrary, I have been to Moscow to talk with
various software developers and I hope we have improved our
Russian support, e.g. we have released KOI8 support for IRIX 6.2
and are looking at ways we can satisfy future Russian customer
requirements.

However, having your data size double on you is relatively easy
to fix: buy a bigger disk, get more memory, maybe a faster modem,
and get smarter with compression. In time you won't even feel it,
and you can take advantage of it when your general-purpose machine
is not shuffling text but doing video processing instead.

-or-

You could otherwise pay for more expensive operating software,
because the developers need to deal with the plethora of encodings,
and that's money down the proverbial drain. (One major CAD
developer, who shall remain unnamed, has 6 full-time staff to deal with
CAD-file conversion software. Unicode support to them is a major
incentive.)

You choose when you buy your systems, it's called market forces.

>
> I do not like the idea that I lose the convenience of one byte per
> character. While that would not much harm storing information
> sequentially, processing it would become harder.

You can still write your application to deal with just 8 bit
characters. Don't expect it to interoperate with the world
of applications that are fully globalized. You cut your
market potential by dealing with only 8 bits.

> .... Of course I could
> translate to ISO-8859-1 after reading the file, but then what would be
> the point of Unicode in the first place.

Exactly.

>
> >
> >As for ISO 2022, someone please present to me a standard that
> >I can implement as efficiently and that is as accepted as Unicode
> >across the entire globe, and then I will give it some consideration.
> >Right now it seems more like my kitchen sink after dinner
> >before washing up, and I hate washing up!
> >
> >My point of view is that no computer company will have an option
> >not to answer YES to question 1. Consideration of question 2,
> >in my opinion, is that Unicode will greatly simplify the
> >problem of providing language support to all customers.
> >
>
> And here is the key philosophical difference. I do not view writing
> characters like äöåÄÖÅ as language support any more than I view writing
> characters like djaskldjask as language support.

Excuse me, I missed the point.


--

_ ` _ ` Globalization R&D
/ \ / / \ /-- /-- /
/ // / / / / / / / Graphics is cool
\_/ \ \_ \/ /_/ /_/ o Internationalization c'est magnifique
/ /
\_/ (415) 933 4387 Opinions mine etc ...

Erland Sommarskog

Jun 15, 1996

[The subject line used to read: "ISO 8859-1 National Character Set FAQ"]

ronk...@cc.helsinki.fi (Osmo Ronkanen) wrote:


>>>One could for example use 8 bits and a single character for escape. The
>>>8-bit set would be identified at the beginning of the stream.
>>

>>$ grep arbitrary_word arbitrary group of files with different languages
>>
>
>So one needs to write a better grep. Or one could have one basic
>character set per system, so that problem would not arise.

But what if users on the same system want to use more than 256 characters
altogether? We know now that you don't want more, but the problem is
that what the world needs is not a character set that fits Osmo Ronkanen,
but one that targets a somewhat wider audience. I for one definitely have a
need for more than 256 characters. I don't care particularly about
Chinese or Thai characters, but I feel handicapped when having to use
special manoeuvres in order to use East European Latin characters or
Cyrillic or Greek characters. Then whether they are stored in true two-
byte form, or in some compressed form à la UTF-8, is something I care
less about. It's the function I'm after. Complaining about disk space
is only a remnant from those days when hardware really was expensive.

>>$ tail file_with_one_language
>>
>>$ tail file_with_different_language
>>
>>These commands all work fine with UTF-8. Try them with your 8-bit
>>encoding and 'character-set identifier at the beginning of the stream'
>>and enjoy the gobbledygook.
>
>Well, better to write some new utilities than to use two bytes per character.

But how does one write a good tail if the character set may change in the
file? Tail is built on the concept that in Unix you can quickly get
to the end of a file and then equally quickly read backwards from
there. If you insert state information about the character set in the file,
you're lost and will have to read from the beginning. Nice thing to
do on /usr/local/news/history...
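
The contrast is easy to make concrete. In UTF-8, any byte in the range
0x80-0xBF is a continuation byte, so from an arbitrary offset a few
backward steps reach a character boundary, and no state needs to be
carried from the start of the file. A rough sketch:

    #include <stdio.h>

    /* Back up from 'pos' to the nearest UTF-8 character boundary.
       This is O(1) work - at most a few bytes.  With an encoding
       that switches character sets via escapes, there is no way to
       know how to interpret the byte at 'pos' without scanning the
       whole file from offset 0. */
    long utf8_align(FILE *fp, long pos)
    {
        int c;

        for (;;) {
            fseek(fp, pos, SEEK_SET);
            c = getc(fp);
            if (c == EOF || (c & 0xC0) != 0x80 || pos == 0)
                return pos;
            pos--;
        }
    }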


--
Erland Sommarskog, Stockholm, som...@algonet.se
F=F6r =F6vrigt anser jag att QP b=F6r f=F6rst=F6ras.
B=65sid=65s, I think QP should b=65 d=65stroy=65d.

Osmo Ronkanen

Jun 16, 1996

In article <31C0FC...@engr.sgi.com>,
Gianni Mariani <gia...@engr.sgi.com> wrote:
>Osmo Ronkanen wrote:
>>
>> In article <31BD0B...@engr.sgi.com>,
>> Gianni Mariani <gia...@engr.sgi.com> wrote:
>
...

>....


>>
>> Is it a reasonable compromise in their local communication? Who should
>> decide that?
>
>Exactly what do you mean by compromise?
>I thought it would enhance it.
>

It would double their resource needs. Is that a good compromise?

(Next time when you ask me something, do not delete the context.)

>>
>> You conveniently ignored languages like Greek and Russian in which one
>> needs two bytes per character instead of one. You did that even though I
>> had mentioned them several times. Why?
>
>Unfortunately a very small portion of revenue comes from Greece and
>Russia. While Russia is very important, the business economics
>just don't work that way. If 20% of SGI revenue came from
>Russian-speaking countries this would be the first thing on my
>mind. Money talks. That doesn't mean to say that SGI does not care
>about Russia; on the contrary, I have been to Moscow to talk with
>various software developers and I hope we have improved our
>Russian support, e.g. we have released KOI8 support for IRIX 6.2
>and are looking at ways we can satisfy future Russian customer
>requirements.

I thought the issue was developing a good character set and not making
money for SGI.

>
>However, having your data size double on you is relatively easy
>to fix: buy a bigger disk, get more memory, maybe a faster modem,
>and get smarter with compression. In time you won't even feel it,
>and you can take advantage of it when your general-purpose machine
>is not shuffling text but doing video processing instead.

All those cost money. Also, if the bigger-disk philosophy worked, then we
would now have no problems with storage, as the average disk sizes in
micros have increased over 100-fold since the early eighties.

Think, for example, of Usenet. Would it be worth transmitting it in those
countries with two bytes per character instead of one? Do you think that
they would just swallow that? Why not make it so that ASCII characters
are transmitted as two bytes and Cyrillic as one. Would you like that?

>
>-or-
>
>You could otherwise pay for more expensive operating software,
>because the developers need to deal with the plethora of encodings,
>and that's money down the proverbial drain. (One major CAD
>developer, who shall remain unnamed, has 6 full-time staff to deal with
>CAD-file conversion software. Unicode support to them is a major
>incentive.)
>

Dealing with the character translations is not that hard. All it takes
is translation tables.

>You choose when you buy your systems, it's called market forces.
>

True and I'd never buy a system where I had to use two bytes per
character.

>>
>> I do not like the idea that I lose the convenience of one byte per
>> character. While that would not much harm storing information
>> sequentially, processing it would become harder.
>
>You can still write your application to deal with just 8 bit
>characters. Don't expect it to interoperate with the world
>of applications that are fully globalized. You cut your
>market potential by dealing with only 8 bits.

Nobody said that one should deal only with 8 bits. What I say is that
one should not deal only with Unicode/UTF-8. The whole discussion
started from someone who said that we should forget 8-bit codings.

>
>> .... Of course I could
>> translate to ISO-8859-1 after reading the file, but then what would be
>> the point of Unicode in the first place.
>
>Exactly.
>
>>
>> >
>> >As for ISO 2022, someone please present to me a standard that
>> >I can implement as efficiently and that is as accepted as Unicode
>> >across the entire globe, and then I will give it some consideration.
>> >Right now it seems more like my kitchen sink after dinner
>> >before washing up, and I hate washing up!
>> >
>> >My point of view is that no computer company will have an option
>> >not to answer YES to question 1. Consideration of question 2,
>> >in my opinion, is that Unicode will greatly simplify the
>> >problem of providing language support to all customers.
>> >
>>
>> And here is the key philosophical difference. I do not view writing
>> characters like äöåÄÖÅ as language support any more than I view writing
>> characters like djaskldjask as language support.
>
>Excuse me, I missed the point.
>

I am not surprised. To you those letters might represent some mysterious
language support. To me they are just letters, like ABCD.

Osmo

Gianni Mariani

Jun 16, 1996
to Osmo Ronkanen

Osmo Ronkanen wrote:
>
> In article <31C0FC...@engr.sgi.com>,
> Gianni Mariani <gia...@engr.sgi.com> wrote:
> >Osmo Ronkanen wrote:
> >>
> >> In article <31BD0B...@engr.sgi.com>,
> >> Gianni Mariani <gia...@engr.sgi.com> wrote:
> >
> ...
>
> It would double their resource needs. Is that a good compromise?
>
> (Next time when you ask me something, do not delete the context.)

Double? In the worst possible case I'd agree, but in general I
think you would see much less of an explosion.

>
> >>
> >> You conveniently ignored languages like Greek and Russian in which one
> >> needs two bytes per character instead of one. You did that even though I
> >> had mentioned them several times. Why?
> >
> >Unfortunately a very small portion of revenue comes from Greece and
> >Russia. While Russia is very important, the business economics
> >just don't work that way. If 20% of SGI revenue came from
> >Russian-speaking countries this would be the first thing on my
> >mind. Money talks. That doesn't mean to say that SGI does not care
> >about Russia; on the contrary, I have been to Moscow to talk with
> >various software developers and I hope we have improved our
> >Russian support, e.g. we have released KOI8 support for IRIX 6.2
> >and are looking at ways we can satisfy future Russian customer
> >requirements.
>
> I thought the issue was developing a good character set and not making
> money for SGI.

Unfortunately/fortunately, it's all about making money. Providing a
good character set enables computer companies to make money; if it didn't,
I wouldn't be posting. It's a circular argument and a philosophy that
goes far beyond UTF-8 and SGI; if you would like to engage in that,
then let's do that elsewhere.

>
> >
> >However, having your data size double on you is relatively easy
> >to fix: buy a bigger disk, get more memory, maybe a faster modem,
> >and get smarter with compression. In time you won't even feel it,
> >and you can take advantage of it when your general-purpose machine
> >is not shuffling text but doing video processing instead.
>
> All those cost money. Also, if the bigger-disk philosophy worked, then we
> would now have no problems with storage, as the average disk sizes in
> micros have increased over 100-fold since the early eighties.

Very little of what is stored on my hard disk is text. I have a
16GB volume we use for code development; a very small portion
of the volume is text.

>
> Think, for example, of Usenet. Would it be worth transmitting it in those
> countries with two bytes per character instead of one? Do you think that
> they would just swallow that? Why not make it so that ASCII characters
> are transmitted as two bytes and Cyrillic as one. Would you like that?

I could argue that you already waste huge bandwidth on these, because
this kind of data is highly compressible; if you care so much you
could compress it, and then you would find that UTF-8 vs. compressed
8-bit Cyrillic would not differ that greatly.

I could also argue that soon you will be transmitting more JPEG and
MPEG data, and so your text bandwidth will be close to insignificant
in comparison.

>
> >
> >-or-
> >
> >You could otherwise pay for more expensive operating software,
> >because the developers need to deal with the plethora of encodings,
> >and that's money down the proverbial drain. (One major CAD
> >developer, who shall remain unnamed, has 6 full-time staff to deal with
> >CAD-file conversion software. Unicode support to them is a major
> >incentive.)
> >
>
> Dealing with the character translations is not that hard. All it takes
> is translation tables.
>
> >You choose when you buy your systems, it's called market forces.
> >
>
> True and I'd never buy a system where I had to use two bytes per
> character.

Many systems are already enabled to deal with wide characters, and
hence you are likely already buying systems that do this.

Examples:
VRML, JAVA, Inventor

>
> >>
> >> I do not like the idea that I lose the convenience of one byte per
> >> character. While that would not much harm storing information
> >> sequentially, processing it would become harder.
> >
> >You can still write your application to deal with just 8 bit
> >characters. Don't expect it to interoperate with the world
> >of applications that are fully globalized. You cut your
> >market potential by dealing with only 8 bits.
>
> Nobody said that one should deal only with 8 bits. What I say is that
> one should not deal only with Unicode/UTF-8. The whole discussion
> started from someone who said that we should forget 8-bit codings.

For anyone wanting to develop a truly global software solution,
you are locked into dealing with multibyte encodings or a wide
fixed-length encoding. 8-bit encodings are exceptions, and it is
a very reasonable development practice to reduce the number of
exceptions. This reduces the complexity of the overall system
and hence makes it easier to test, providing more reliable and
cost-effective systems.

UTF-8 in particular will work with most current multibyte-enabled
UNIX applications without doing anything special. This makes
UTF-8 very attractive. The best software development is the kind
that achieves the goal without major modifications. Implementing
UTF-8 does that.
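
The reason is the standard C multibyte interface: a program that walks
text with mbtowc() never hard-codes the encoding, so the same code can
handle EUC or UTF-8 depending on the locale. A sketch - the locale name
is system-dependent and only an assumption here:

    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const char *s = "na\xC3\xAFve";   /* UTF-8: "naive" with i-diaeresis */
        wchar_t wc;
        int n;

        if (!setlocale(LC_CTYPE, "en_US.UTF-8"))   /* assumed locale name */
            return 1;

        /* mbtowc consumes 1 byte per ASCII letter and 2 bytes for the
           i-diaeresis; the application never parses UTF-8 itself. */
        while (*s && (n = mbtowc(&wc, s, MB_CUR_MAX)) > 0) {
            printf("%d byte(s) -> U+%04X\n", n, (unsigned)wc);
            s += n;
        }
        return 0;
    }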

>
> >
> >> .... Of course I could
> >> translate to ISO-8859-1 after reading the file, but then what would be
> >> the point of Unicode in the first place.
> >
> >Exactly.
> >
> >>
> >> >
> >> >As for ISO 2022, someone please present to me a standard that
> >> >I can implement as efficiently and that is as accepted as Unicode
> >> >across the entire globe, and then I will give it some consideration.
> >> >Right now it seems more like my kitchen sink after dinner
> >> >before washing up, and I hate washing up!
> >> >
> >> >My point of view is that no computer company will have an option
> >> >not to answer YES to question 1. Consideration of question 2,
> >> >in my opinion, is that Unicode will greatly simplify the
> >> >problem of providing language support to all customers.
> >> >
> >>
> >> And here is the key philosophical difference. I do not view writing
> >> characters like äöåÄÖÅ as language support any more than I view writing
> >> characters like djaskldjask as language support.
> >
> >Excuse me, I missed the point.
> >
>
> I am not surprised. To you those letters might represent some mysterious
> language support. To me they are just letters, like ABCD.

Someone reading this document in KOI8 wouldn't read

ä 0x00E4 # LATIN SMALL LETTER A WITH DIAERESIS
ö 0x00F6 # LATIN SMALL LETTER O WITH DIAERESIS
å 0x00E5 # LATIN SMALL LETTER A WITH RING ABOVE
Ä 0x00C4 # LATIN CAPITAL LETTER A WITH DIAERESIS
Ö 0x00D6 # LATIN CAPITAL LETTER O WITH DIAERESIS
Å 0x00C5 # LATIN CAPITAL LETTER A WITH RING ABOVE

But would get the stuff below.

ä 0x0414 # CYRILLIC CAPITAL LETTER DE
ö 0x0416 # CYRILLIC CAPITAL LETTER ZHE
å 0x0415 # CYRILLIC CAPITAL LETTER IE
Ä 0x0434 # CYRILLIC SMALL LETTER DE
Ö 0x0436 # CYRILLIC SMALL LETTER ZHE
Å 0x0435 # CYRILLIC SMALL LETTER IE

So again, I missed the point.

If you are trying to say that "djaskldjask" is special, it is
because this string is interpreted correctly by a very large
body of software available today, whereas "äöåÄÖÅ" is not.
Depending on the encoding you are assuming, it could be any
of the following:

Encoding UCS-2 characters for "äöåÄÖÅ"
big5 5648 66b0 79f8
DOS855 0421 0428 0442 2500 043e 253c
eucCN 6f29 8fe5 5cd9
eucJP 7cdc 7ddc 5d1a
eucKR 9698 f97a 9818
eucTW 7169 7766 6d6e
ISO8859-1 00e4 00f6 00e5 00c4 00d6 00c5
ISO8859-2 00e4 00f6 013a 00c4 00d6 0139
ISO8859-3 00e4 00f6 010b 00c4 00d6 010a
ISO8859-4 00e4 00f6 00e5 00c4 00d6 00c5
ISO8859-5 0444 0456 0445 0424 0436 0425
ISO8859-6 0644 0000 0645 0624 0636 0625
ISO8859-7 03b4 03c6 03b5 0394 03a6 0395
ISO8859-8 05d4 05e6 05d5 0000 0000 0000
ISO8859-9 00e4 00f6 00e5 00c4 00d6 00c5
KOI8 0414 0416 0415 0434 0436 0435
sjis 8515 8827 ff96 ff85
WIN1251 0434 0446 0435 0414 0426 0415


>
> Osmo

Gianni

Osmo Ronkanen

Jun 17, 1996

In article <4pvbs6$f...@epimetheus.algonet.se>,
Erland Sommarskog <som...@algonet.se> wrote:
>[The subject line used to read: "ISO 8859-1 National Character Set FAQ"]
>
>ronk...@cc.helsinki.fi (Osmo Ronkanen) wrote:
>>>>One could for example use 8 bits and a single character for escape. The
>>>>8-bit set would be identified at the beginning of the stream.
>>>
>>>$ grep arbitrary_word arbitrary group of files with different languages
>>>
>>
>>So one needs to write a better grep. Or one could have one basic
>>character set per system, so that problem would not arise.
>
>But what if users on the same system want to use more than 256 characters
>altogether?

Then they use appropriate tools, like Unicode, for that purpose. I am
not speaking against it per se. I am just speaking against the force-feeding
of it where it is not necessary.

>We know now that you don't want more, but the problem is
>that what the world needs is not a character set that fits Osmo Ronkanen,
>but one that targets a somewhat wider audience. I for one definitely have a
>need for more than 256 characters. I don't care particularly about
>Chinese or Thai characters, but I feel handicapped when having to use
>special manoeuvres in order to use East European Latin characters or
>Cyrillic or Greek characters. Then whether they are stored in true two-
>byte form, or in some compressed form à la UTF-8, is something I care
>less about. It's the function I'm after. Complaining about disk space
>is only a remnant from those days when hardware really was expensive.

Since when have there been storage and memory resources to spare?

>
>>>$ tail file_with_one_language
>>>
>>>$ tail file_with_different_language
>>>
>>>These commands all work fine with UTF-8. Try them with your 8-bit
>>>encoding and 'character-set identifier at the beginning of the stream'
>>>and enjoy the gobbledygook.
>>
>>Well, better to write some new utilities than to use two bytes per character.
>
>But how does one write a good tail if the character set may change in the
>file?

Who said the character set must change within the file?

>Tail is built on the concept that in Unix you can quickly get
>to the end of a file and then equally quickly read backwards from
>there. If you insert state information about the character set in the file,
>you're lost and will have to read from the beginning. Nice thing to
>do on /usr/local/news/history...
>
>
>--
>Erland Sommarskog, Stockholm, som...@algonet.se
>F=F6r =F6vrigt anser jag att QP b=F6r f=F6rst=F6ras.
>B=65sid=65s, I think QP should b=65 d=65stroy=65d.


Osmo


Peter Kerr

Jun 18, 1996

som...@algonet.se (Erland Sommarskog) wrote:
> But what if users on the same system want to use more than 256 characters
> altogether?

Count me in here.

> I feel handicapped when having to use
> special manoeuvres in order to use East European Latin characters or
> Cyrillic or Greek characters. Then whether they are stored in true two-
> byte form, or in some compressed form à la UTF-8, is something I care
> less about. It's the function I'm after. Complaining about disk space
> is only a remnant from those days when hardware really was expensive.

I counted up some 164 accented or special Roman characters for European
languages. Add in the basic 52, plus 10 numerals and 30
punctuation/currency/etc., and voila: 256 characters total. Just for Europe.
Remember Europe, people? That little country just to the north of Africa
which laid the foundations of Western civilization.

Dunno about those control characters they insist on using in the bottom 32
spaces...

--
Peter Kerr bodger
School of Music chandler

University of Auckland NZ neo-Luddite

Stephen Baynes

Jun 18, 1996

Erik Naggum (er...@naggum.no) wrote:
: [Gary Capell]

: | I'm with you now! And why don't we assign a _number_ to every name in
: | that namespace, while we're at it? Then we could have a compact
: | representation of the universal character namespace. And we could call
: | this namespace "Uniname". I just can't think of what we might call the
: | set of numeric codes for Uniname. Any ideas?

: funny guy. the point with names is that you can describe a character set
: relatively easily, can debug it, and can use the information in the names
: to help you understand what a character means and is.

: and, actually, I have suggested that ISO 10646 names be used (in fact, I
: use them for just this purpose), if you had cared to pay attention. the
: idea of naming characters uniquely is quite novel. it will naturally take
: most people quite a while to realize it has unique benefits that can be
: exploited. the ISO committee that produced the list does not understand
: that the names can be useful outside of their standard, for instance.

And what language and encoding do you use for the names? Unicode? For example,
I would expect Chinese characters to have Chinese names. Do the Greeks write
what English speakers write as 'alpha' as 'alpha', or do they write it using
Greek characters (perhaps <alpha><lambda><phi><alpha>?).

<snip>
: funny how numbering everything is supposed to be a panacea. some ancient
: Greek philosophers got that idea some 4000 years ago. as I recall, it was
: discarded then, too. but at least they had an idea that needed to be
: tested and understood. today, we should have known better: a number for
: everything is a machine necessity, otherwise not a very good idea.

At least one can have numbers that have the same semantic meanings across the
world, and even if one uses different representations (such as binary or
decimal) they are still 1-to-1 equivalent. (As far as I know, almost everyone
uses number systems built on the same or equivalent axioms. It is possible
there are still a few tribes in the more remote parts of the world who use
1,2,3,'many' numbering systems - but I doubt they have written languages.)

--
Stephen Baynes bay...@ukpsshp1.serigate.philips.nl
Philips Semiconductors Ltd
Southampton My views are my own.
United Kingdom
Are you using ISO8859-1? Do you see © as copyright, ÷ as division and ½ as 1/2?

Erik Naggum

Jun 18, 1996

[Stephen Baynes]

| And what language and encoding do you use for the names? Unicode?

for how long is such silliness entertaining to you?

the ISO 10646 names are strings consisting of the characters SPACE,
HYPHEN-MINUS, LATIN CAPITAL LETTER A through Z, and DIGIT 0 through 9 (the
latter only for ideographic characters). one would expect that
semi-intelligent programmers would ensure that they had or could get a
description of the characters in a form that they could process. for
interchange purposes, you will find that ASCII suffices.

| At least one can have numbers that have the same semantic meanings
| across the world

yes, precisely _no_ semantic meaning.

| (As far as I know almost everyone uses the number systems built on the
| same or equivalent axioms. It is possible there are still a few tribes
| in the more remote parts of the world who use 1,2,3,'many' numbering
| systems - but I doubt if they have written languages.)

thank you for your intelligent contribution.

--
IRC/EFnet: gamlerik

Osmo Ronkanen

Jun 19, 1996

In article <31C4F3...@engr.sgi.com>,
Gianni Mariani <gia...@engr.sgi.com> wrote:
>Osmo Ronkanen wrote:
>>
>> It would double their resource needs. Is that a good compromise?
>>
>> (Next time when you ask me something, do not delete the context.)
>
>Double? In the worst possible case I'd agree, but in general I
>think you would see much less of an explosion.
>

In Russian the text doubles (with the exception of spaces, punctuation,
and numbers).


>>
>> I thought the issue was developing a good character set and not making
>> money for SGI.
>
>Unfortunately/fortunately, it's all about making money. Providing a
>good character set enables computer companies to make money; if it didn't,
>I wouldn't be posting. It's a circular argument and a philosophy that
>goes far beyond UTF-8 and SGI; if you would like to engage in that,
>then let's do that elsewhere.

If money is everything to you, then we have such different views that
it is pointless to continue the debate.

...


Osmo

Gianni Mariani

Jun 19, 1996

Osmo Ronkanen wrote:
>
> In article <31C4F3...@engr.sgi.com>,
> Gianni Mariani <gia...@engr.sgi.com> wrote:
...
>
> If money is everything to you, then we have such different views that
> it is pointless to continue the debate.
>

Just to set the record straight.

Personally I don't think money is "everything".

My employer does not believe that money is "everything".

However, in the practice of business, SGI attempts to exercise the utmost
integrity and fiscal responsibility.

As I mentioned in my previous article, this is not the venue to debate
the money-is-"everything" subject; however, if anyone would like to engage
in such an argument I would be more than happy to describe my personal
views and put them under critical review in a more appropriate newsgroup.

Regards
Gianni

> ...
>
> Osmo

Unknown

Jun 19, 1996

I have been reading this thread with great interest. Please excuse me if this
is the incorrect group for this question. I am interested in learning more
about issues related to capitalization of text. I am particularly interested
in the effects (pro and con) of capturing all characters in names and/or
addresses as capital letters, especially as it relates to non-English
languages.

Please respond by e-mail to swainn@DaytonOH.ncr.com

==========Peter Kerr, 6/18/96==========

Count me in here.


==========Peter Kerr, 6/18/96==========

John B. Melby

Jun 20, 1996

>I have been reading this thread with great interest. Please excuse me if this
>is the incorrect group for this question. I am interested in learning more
>about issues related to capitalization of text.

Some frequently mentioned issues:

In French, letters tend to lose their accents when they are capitalized.

In German, there's a small ess-zett, but not a big ess-zett. Thus,
ess-zett becomes `SS' when capitalized.

In Turkish, the capitalized version of `i' is a dotted `I.' Likewise,
the small version of `I' is an undotted `i.'
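
The German case in particular shows why a char-to-char toupper() cannot
be the whole story: one letter becomes two, so case mapping has to be
string-to-string. A sketch with a hypothetical helper, assuming Latin-1
input (Turkish would additionally need a language parameter):

    #include <ctype.h>
    #include <stdio.h>

    /* Hypothetical German upcase: ess-zett (0xDF in Latin-1) expands
       to "SS"; everything else goes through toupper(). */
    void upcase_de(const char *in, char *out)
    {
        for (; *in; in++) {
            if ((unsigned char)*in == 0xDF) {
                *out++ = 'S';
                *out++ = 'S';
            } else {
                *out++ = toupper((unsigned char)*in);
            }
        }
        *out = '\0';
    }

    int main(void)
    {
        char buf[32];

        upcase_de("stra\xDF" "e", buf);   /* Latin-1 "strasse" with ess-zett */
        puts(buf);                        /* prints STRASSE */
        return 0;
    }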

Arthur Chance

Jun 20, 1996

[Note: I've set followups to not include soc.culture.nordic. This
doesn't seem to be a soc.culture thread to me.]

In article <p.kerr-1806...@news.auckland.ac.nz>
p.k...@auckland.ac.nz (Peter Kerr) writes:
> I counted up some 164 accented or special roman characters for European
> languages. Add in the basic 52, plus 10 numerals and 30
> punctuation/currency/etc and voila: 256 characters total. Just for Europe.

I have just received my copy of European Prestandard ENV1973:1996
"Information Technology - European Subsets of ISO/IEC 10646-1". It
defines "a minimum limited subset of ISO 10646-1 for use in Europe",
to quote section 1. This contains 926 characters, and is intended to
cover just about every European language (including several I've never
heard of). About 150 of the characters are math symbols, bullets, box
drawing characters, etc, and there are the usual ASCII punctuation
chars, but a good 700+ code points are characters, so Peter's estimate
is a little on the low side. I guess he wasn't thinking of Greek or
Cyrillic.

According to the foreword of the document it will be implemented as
national standards by all members of the EU plus Norway & Switzerland.
Reading the accompanying British Standards Institute's covering note,
I think that's supposed to happen by the end of this month.

> Remember Europe, people? That little country just to the north of Africa
> which laid the foundations of Western civilization.

Oh, I've seen it around. :-) That last line reminds me of Gandhi's
remark about Western civilization though.

--
"For they know they will sooner gain their end by appealing to men's pockets,
in which they have generally something of their own, than to their heads,
which contain for the most part little but borrowed or stolen property"
-- Samuel Butler, "Erewhon" (On political reformers)

Ketil Albertsen

Jun 20, 1996

[Arthur...@Smallworld.co.uk:]

>About 150 of the characters are math symbols, bullets, box
>drawing characters, etc, and there are the usual ASCII punctuation
>chars, but a good 700+ code points are characters, so Peter's estimate
>is a little on the low side. I guess he wasn't thinking of Greek or
>Cyrillic.

Even when limiting yourself to Latin characters, his estimate was on
the low side - the Teletex char set (sorry, I don't have the IS number
available) defines 312 characters, and they are all Latin-based.

>> Remember Europe, people? That little country just to the north of Africa
>> which laid the foundations of Western civilization.
>
>Oh, I've seen it around. :-) That last line reminds me of Gandhi's
>remark about Western civilization though.

Sorry for following this sidetrack, I know I shouldn't...

I assume you are thinking of the reporter asking: Mr. Gandhi, what do you
think of Western civilization? To which Gandhi replies: I think
that would be a good idea...

But is this story *for real*?? I always thought it was a mere joke.
From what Arthur says, it sounds as if Gandhi really gave that reply
once; is that true? (That makes it even better than a mere joke!)

ketil


Eric Brunner

unread,
Jun 20, 1996, 3:00:00 AM6/20/96
to

I hardly know where to begin with this, and I've snipped the two newsgroups
that have zippo to do with real work, one cultural and one long-ago glommed
onto by commercial posts and end-user how-to questions for apps-from-Mars.

There are two or more threads here in c.std.i which caught my eye; I don't
know why Osmo is baiting folks. It isn't as if we had malice or dumbness
motivating our collective i18n/l10n work over the past decade.

Osmo Ronkanen (ronk...@cc.helsinki.fi) wrote:
: In article <4pvbs6$f...@epimetheus.algonet.se>,
: Erland Sommarskog <som...@algonet.se> wrote:
: >[The subject line used to read: "ISO 8859-1 National Character Set FAQ"]
: >
: >So wrote ronk...@cc.helsinki.fi (Osmo Ronkanen):
: >>>>One could for example use 8 bits and a single character for escape. The
: >>>>8-bit set would be identified at the beginning of the stream.

The escape sequence mechanism again, formalized in its most popular form
(despite my attempts at ignorance since I wrote XPG 1 back in '85) in ISO
2022 and in DBCS schemes, and manifested in numerous proprietary schemes.

: >>>$ grep arbitrary_word arbitrary group of files with different languages
: >>
: >>So one needs to write a better grep. Or one could have one basic
: >>character set per system, so that problem would not arise.

No, actually one needs to write a better regular expression definitional
syntax and semantics, and realistically one needs to pick one of two
underlying representational models: stateful (e.g., EUC) and stateless
(e.g. fixed-width) encodings. As an ancillary task, one needs to define
collation ordering(s) for particular locales (to adopt the common X/Open
nomenclature) -- however this is distinct from the issue of what actually
constitutes a regular expression.

: >But if users on the same system want to use more than 256 characters
: >altogether?

: Then they use appropriate tools, like Unicode, for that purpose. I am
: not speaking against it per se. I am just speaking against force feeding
: of it where it is not necessary.

Not well informed, therefore not particularly helpful. In an 8859-x
consistent universe, e.g., 10646, multi-octet representations arise only
where used, hence no "force feeding". Please keep the attitude for some
other venue. The general problem is that of providing user-defined
characteristics (characters included) to processes and/or threads of
execution. There are several general classes of solutions, each with its
own design constraints and solution spaces.

: >We know now that you don't want more, but the problem is
: >that what the world needs is not a character set that fits Osmo Ronkanen,
: >but targets a somewhat wider audience. I for one definitely have a
: >need of more than 256 characters. I don't care particularly about
: >Chinese or Thai characters, but I feel handicapped when having to use
: >special manoeuvres in order to use East European Latin characters or
: >Cyrillic or Greek characters. Then whether they are stored in true two-
: >byte form, or in some compressed form à la UTF-8 is something I care
: >less about. It's the function I'm after. Complaining about disk space
: >is only a remnant from those days when hardware really was expensive.

Please see 8859-x, or the nearest 10646 corner store. The complained-of
problem in the para above is nonexistent until one leaves the space of the
8859-x group, e.g., "Chinese or Thai characters".

: Since when has there been storage and memory resources to spare?

Not a well-posed issue. Keep to the minimal system engineering cost
problem, or pose another well-characterizable engineering/cost metric
as a means test of an architecture and implementation.

: >
: >>>$ tail file_with_one_language
: >>>
: >>>$ tail file_with_different_language
: >>>
: >>>These commands all work fine with UTF-8. Try them with your 8-bit
: >>>encoding and 'character-set identifier at the beginning of the stream'
: >>>and enjoy the gobbledygook.

Implicit is a setlocale(3), so that the libc(3) encoding-specific APIs are
called (something elegantly hidden via shared libraries and run-time dlopen/
dlsym), as a common, portable means of multi-vendor treatment of the various
(language) encodings sensibly supported.
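
As a sketch of that pattern, using only standard C interfaces (nothing
vendor-specific assumed):

    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Count characters (not octets) in a multibyte string, in
       whatever encoding the invoking locale prescribes. */
    long count_chars(const char *s)
    {
        long n = 0;
        int len;

        mblen(NULL, 0);                  /* reset any shift state */
        while (*s != '\0') {
            len = mblen(s, MB_CUR_MAX);  /* octets in next character */
            if (len <= 0)
                return -1;               /* invalid sequence */
            s += len;
            n++;
        }
        return n;
    }

    int main(int argc, char **argv)
    {
        setlocale(LC_CTYPE, "");         /* adopt the caller's locale */
        if (argc > 1)
            printf("%ld characters\n", count_chars(argv[1]));
        return 0;
    }

The same binary counts correctly in an 8859-1 locale and in an EUC
locale; only the locale setting changes.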

: >>Well better write some new utilities than use two bytes / character.

Not a viable engineering choice. Dual-pathing makes "sense" only in a very
few places at the utility level of granularity, e.g., having sort(1) call
a single-octet _sort, or a _sortmb for multi-octet (and single- is a special
case of multi-), depending on the MB_CUR_MAX value of the locale from which
the sort(1) was invoked. The engineering investment in dual-octet encoding
support is very great, e.g., HP's HP15 support. The better means is to hide
this in the locale-specification mechanism and the libc primitives. IMEO.
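
A sketch of that dispatch; the _sort/_sortmb entry points here are
hypothetical, echoing the example above:

    #include <locale.h>
    #include <stdlib.h>

    /* Hypothetical internal entry points, for illustration only. */
    extern void _sort(const char *path);    /* single-octet fast path */
    extern void _sortmb(const char *path);  /* multi-octet general path */

    void sort_dispatch(const char *path)
    {
        setlocale(LC_ALL, "");       /* locale sort(1) was invoked from */
        if (MB_CUR_MAX == 1)
            _sort(path);             /* one octet == one character */
        else
            _sortmb(path);           /* multibyte-aware code path */
    }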

: >But how to write a good tail, if the character set may change in the
: >file?

If the encoding is stateless, no serious problem is presented. If stateful,
then there is.

: Who said the character set must change within the file?

Suppose you've a shell with a history mechanism. After being in locale X,
change to locale Y, where the X and Y encodings are distinct. Attempt to
exercise the history mechanism (read in locale Y a file written in locale X).
What is the means to correctly interpret the first "character"?

Tail, or more generally, multiple encodings within a single "file" (or octet
stream), is where there has been discussion of attribute tagging of files.

Solve the single encoding problem first (done, but obviously not understood
in this series of posts on UTF-8/8859-x/10646/UNICODE, in the XPG 4.2 set of
architectural mechanisms, and implementations I've worked on personally or
know enough about (Solaris, AIX, HPUX, ...)), then think about the multi-
lingual issue set. Don't forget the distributed i18n set of headaches <g>.

: >Tail is built on the concept that in Unix you can quickly get
: >to the end of a file and then equally quickly read backwards from
: >there. If you insert state information about character set in the file,
: >you're lost and will have to read from the beginning. Nice thing to
: >do on /usr/local/news/history...

Generally true for all stateful encoding schemes. Once you "get lost",
it is back to the beginning of the byte-stream (lseek(fd,0,SEEK_SET)),
and a stateful scan forward.

Note also that tail(1) and other utilities have the added complication
of having to know what the display width of characters is, as well as
just doing character counting... and characters != bytes.
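
For contrast, a stateless, self-synchronizing encoding lets a tail
resynchronize locally. A minimal sketch for UTF-8, relying only on the
fact that continuation octets match the bit pattern 10xxxxxx:

    #include <stdio.h>

    /* Step backwards from octet offset 'pos' until a character
       boundary is found, i.e. an octet that is not a 10xxxxxx
       continuation octet.  No rescan from the start of the file
       is needed - exactly what a stateful escape-sequence
       encoding cannot offer. */
    long utf8_char_start(FILE *f, long pos)
    {
        int c;

        while (pos > 0) {
            fseek(f, pos, SEEK_SET);
            c = getc(f);
            if (c == EOF)
                return -1;
            if ((c & 0xC0) != 0x80)   /* not a continuation octet */
                break;
            pos--;
        }
        return pos;
    }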

: Osmo

Some advice to Osmo. Shelve the attitude until you understand the issue set.
DBCS was a bad idea back in '85. Heck, I even got out of writing Unix stds
in the i18n arena then because of how hopeless that was.

Howdy old acquaintances! You know who you are <g>.

--
Kitakitamatsinohpowaw,
Eric Brunner

Eric Brunner

unread,
Jun 20, 1996, 3:00:00 AM6/20/96
to

John B. Melby (me...@yk.fujitsu.co.jp) wrote:
: >I have been reading this thread with great interest. Please excuse me if this
: >is the incorrect group for this question. I am interested in learning more
: >about issues related to capitalization of text.

: Some frequently mentioned issues:

: In French, letters tend to lose their accents when they are capitalized.

...

Localization issue, unless a composed vs decomposed character issue is
being discussed.

--
Kitakitamatsinohpowaw,
Eric Brunner

Stephen Baynes

unread,
Jun 21, 1996, 3:00:00 AM6/21/96
to

Ketil Albertsen (ketil.a...@idb.hist.no) wrote:
: [Arthur...@Smallworld.co.uk:]

: >About 150 of the characters are math symbols, bullets, box
: >drawing characters, etc, and there are the usual ASCII punctuation
: >chars, but a good 700+ code points are characters, so Peter's estimate
: >is a little on the low side. I guess he wasn't thinking of Greek or
: >Cyrillic.

: Even when limiting yourself to latin characters, his estimate was on
: the low side - the Teletex char set (sorry, I don't have the IS number
: available) defines 312 characters, and they are all latin-based.

Did you count in all the combinations of combining accent + base character
that can be done with packet 26 triplets? This allows you to generate a
lot of combinations that are not used. Most of the used ones are available
in one or other of the many character set combinations. Actually, about 300
sounds right for non-combining. You can potentially generate about 500
of the Latin characters that are in Unicode when using combining accents.
Teletext also has a few symbols that I don't think are in Unicode (such as
the Turkish Lira sign). It also has over a hundred picture drawing characters.

The latest (World System) teletext spec is
EACEM Technical Report 8 "Enhanced Teletext Specification" Draft 4,
February 1996.

This does not change the character sets much over earlier ones though
it does clarify how to select them.

Note there are separate specifications for Chinese and for Japanese
teletext. I can get references for them if anyone wants.

Peter Kerr

unread,
Jun 22, 1996, 3:00:00 AM6/22/96
to

ronk...@cc.helsinki.fi (Osmo Ronkanen) wrote:
> Who said the character set must change within the file?

Did I miss something, or wasn't that the point of this thread?
Bi- or multi-lingual European documents may exceed the 256 character
limit of an 8-bit character set.

UTF-8 is an off-the-shelf solution. Some of us hold out vain hopes
for something better...

Erland Sommarskog

unread,
Jun 22, 1996, 3:00:00 AM6/22/96
to

[soc.culture.nordic dropped.]

So wrote ronk...@cc.helsinki.fi (Osmo Ronkanen):

>Erland Sommarskog <som...@algonet.se> wrote:
>>But if users on the same system want to use more than 256 characters
>>altogether?
>
>Then they use appropriate tools, like Unicode, for that purpose. I am
>not speaking against it per se. I am just speaking against force feeding
>of it where it is not necessary.

But if your system wants to communicate with a system that wants to
use more than 256 characters?

>Since when has there been storage and memory resources to spare?

Since long ago. The cost of any extra hardware you may need in no
way matches the cost of dealing with various character sets.

>>But how to write a good tail, if the character set may change in the
>>file?
>

>Who said the character set must change within the file?

cat greek.txt hungarian.txt russian.txt finnish.txt > multiling.txt

When tailing multiling.txt, what do you want to see? Proper Finnish text,
or Finnish text where alpha and sigma have replaced ä and ö?


Erland Sommarskog, som...@algonet.se, Stockholm

Peter Kerr

unread,
Jun 23, 1996, 3:00:00 AM6/23/96
to

In article <ARTHUR.CHANCE...@holmium.Smallworld.co.uk>,
Arthur...@Smallworld.co.uk (Arthur Chance) wrote:

> In article <p.kerr-1806...@news.auckland.ac.nz>
> p.k...@auckland.ac.nz (Peter Kerr) writes:
> > I counted up some 164 accented or special roman characters for European
                      ^^^
> > languages. Add in the basic 52, plus 10 numerals and 30
> > punctuation/currency/etc and voila: 256 characters total. Just for Europe.
>
> (snip) so Peter's estimate
> is a little on the low side. I guess he wasn't thinking of Greek or
> Cyrillic.

I concede I may be a little short, given that my search was limited to
current "official" languages using a roman or modified roman typeface.
This ruled out the likes of Basque and Bohemian (unofficial) and
Georgian and Armenian (non-roman). Also, Hebrew and Arabic will conceivably
need to be defined for use in Europe.

There's a further thought: should a character table dumbly list all
the variants of context sensitive character forms, or should it
contain "intelligence" which directs a lookup from a sub-table?

My main interest is in seeing an end to the plurality, and consequent
translation confusion, of "IBM" code pages and ISO-8859-n.

Osmo Ronkanen

unread,
Jun 23, 1996, 3:00:00 AM6/23/96
to

In article <p.kerr-2206...@news.auckland.ac.nz>,
Peter Kerr <p.k...@auckland.ac.nz> wrote:

>ronk...@cc.helsinki.fi (Osmo Ronkanen) wrote:
>> Who said the character set must change within the file?
>
>Did I miss something, or wasn't that the point of this thread?
>Bi- or multi-lingual European documents may exceed the 256 character
>limit of an 8-bit character set.
>
>UTF-8 is an off-the-shelf solution. Some of us hold out vain hopes
>for something better...

I agree, it is the solution to that problem. However, that is a rare
problem. Multi-lingual documents are an exception, not the norm.

Osmo


Osmo Ronkanen

unread,
Jun 23, 1996, 3:00:00 AM6/23/96
to

In article <DtCvt...@ukpsshp1.serigate.philips.nl>,
Stephen Baynes <bay...@ukpsshp1.serigate.philips.nl> wrote:
>Ketil Albertsen (ketil.a...@idb.hist.no) wrote:
>: [Arthur...@Smallworld.co.uk:]
>
>: >About 150 of the characters are math symbols, bullets, box
>: >drawing characters, etc, and there are the usual ASCII punctuation
>: >chars, but a good 700+ code points are actual letters, so Peter's estimate
>: >is a little on the low side. I guess he wasn't thinking of Greek or
>: >Cyrillic.
>
>: Even when limiting yourself to latin characters, his estimate was on
>: the low side - the Teletex char set (sorry, I don't have the IS number
>: available) defines 312 characters, and they are all latin-based.
>
>Did you count in all the combinations of combining accent + base character
>that can be done with packet 26 triplets? This allows you to generate a
>lot of combinations that are not used. Most of the used ones are available
>in one or other of the many character set combinations. Actually, about 300
>sounds right for non-combining. You can potentially generate about 500
>of the Latin characters that are in Unicode when using combining accents.
>Teletext also has a few symbols that I don't think are in Unicode (such as
>the Turkish Lira sign). It also has over a hundred picture drawing characters.

The MES has the following numbers of characters of various types:

Latin:        301
Greek:        309
Cyrillic:      94
Box drawing:   40
Other:        182

(These sum to 926, the total Arthur quoted.)

Calculated by counting lines with respective keywords from a file
describing the set.

Osmo


Stephen Baynes

unread,
Jun 24, 1996, 3:00:00 AM6/24/96
to

Peter Kerr (p.k...@auckland.ac.nz) wrote:
: In article <ARTHUR.CHANCE...@holmium.Smallworld.co.uk>,

: There's a further thought: should a character table dumbly list all
: the variants of context sensitive character forms, or should it
: contain "intelligence" which directs a lookup from a sub-table?

The approach taken by Unicode is not to worry about variants. Unicode is a
character set, not a glyph set. The context information needed to choose a
glyph for a character is encoded in the font.

Note I have added comp.fonts to the groups, as this question is appropriate.

Arthur Chance

unread,
Jun 24, 1996, 3:00:00 AM6/24/96
to

In article <4qjot1$e...@kruuna.helsinki.fi> ronk...@cc.helsinki.fi
(Osmo Ronkanen) writes:
> The MES has following number of characters of various types:
<snip>

> Calculated by counting lines with respective keywords from a file
> describing the set.

Is this file publicly available? I've only got a paper copy of the
standard.

Damon Kelly

unread,
Jun 25, 1996, 3:00:00 AM6/25/96
to

Stephen Baynes wrote:
>
> Peter Kerr (p.k...@auckland.ac.nz) wrote:
> : In article <ARTHUR.CHANCE...@holmium.Smallworld.co.uk>,
>
> : There's a further thought: should a character table dumbly list all
> : the variants of context sensitive character forms, or should it
> : contain "intelligence" which directs a lookup from a sub-table?
>
> The approach taken by Unicode is not to worry about variants. Unicode is a
> character set, not a glyph set. The context information needed to choose a
> glyph for a character is encoded in the font.
>
> Note I have added comp.fonts to the groups, as this question is appropriate.
>

TrueType Open is supposed to address issues like this. It is an extension to
the TT spec that lets you (i.e. the font developer) add tables for things
such as context sensitivity for glyphs (e.g. ligatures, Arabic medial forms).

The Microsoft ftp site has a "TTOpen SDK" with some examples (in particular,
Arabic):

ftp://ftp.microsoft.com/developr/drg/truetype

and

http://www.microsoft.com/truetype/tools/tools.htm
http://www.microsoft.com/truetype/tt/tt.htm

Unfortunately, the application must be TTO-aware for it to work.
I don't know how effective it will be in the long run, but it's not a bad idea.

--
Damon Kelly
====================================================
Electronic Engineer |
Micromedical Industries Ltd |Ph +61 7 5594 6077
P.O. Box 224 |Fax +61 7 5594 0361
Labrador, QLD 4215 |
Australia |

Markus Kuhn

unread,
Jun 25, 1996, 3:00:00 AM6/25/96
to

Implementing a system covering all >30 000 characters and mechanisms
for displaying ideographic and right-to-left scripts available in
Unicode and ISO 10646 can be very expensive, although these
features of Unicode are not required for European products.

The European standard ENV 1973:1995 therefore specifies a Minimum
European Subset (MES) and an Extended European Subset (EES) of
ISO 10646-1. The MES has only 926 characters, which allows a very cheap
implementation but still covers all characters important to most
European users. The EES covers all characters used in European
scripts, as well as a comprehensive list of symbols used academically,
commercially, and scientifically in Europe.

For more information about MES and EES, please have a look at

<http://www.indigo.ie/egt/standards/mes.html>
<http://www.indigo.ie/egt/standards/ees.html>

Markus

--
Markus Kuhn, Computer Science student -- University of Erlangen,
Internet Mail: <msk...@cip.informatik.uni-erlangen.de> - Germany
WWW Home: <http://wwwcip.informatik.uni-erlangen.de/user/mskuhn>

Stuart Yeates

unread,
Jun 25, 1996, 3:00:00 AM6/25/96
to

Stephen Baynes (bay...@ukpsshp1.serigate.philips.nl) wrote:

: The latest (World System) teletext spec is
: EACEM Technical Report 8 "Enhanced Teletext Specification" Draft 4,
: February 1996.

Does anyone have ftp sites for this and similar documents?

: Are you using ISO8859-1? Do you see © as copyright, ÷ as division
: and ½ as 1/2?

Yes, in both my newsreader (tin) and my editor (xemacs).

stuart

--

Stephen Baynes

unread,
Jun 25, 1996, 3:00:00 AM6/25/96
to

Stuart Yeates (stu...@cosc.canterbury.ac.nz) wrote:
: Stephen Baynes (bay...@ukpsshp1.serigate.philips.nl) wrote:

: : The latest (World System) teletext spec is
: : EACEM Technical Report 8 "Enhanced Teletext Specification" Draft 4,
: : February 1996.

: Does anyone have ftp sites for this and similar documents ?

I have just talked to the local expert (who happens to be the person who
did most of the editorial work on the latest draft). He says that it should
be available from the publishing body, who did have it available on a server.
He does not have a URL; however, if you email secre...@etsi.fr (which is
the official contact for obtaining the specification) they can tell you how
to obtain it. He also told me that there is going to be another revision of
the spec, mainly to remove references to the original EBU spec.

--
Stephen Baynes bay...@ukpsshp1.serigate.philips.nl
Philips Semiconductors Ltd
Southampton My views are my own.
United Kingdom

Message has been deleted

Eric Brunner

unread,
Jun 27, 1996, 3:00:00 AM6/27/96
to
Gianni Mariani (gia...@engr.sgi.com) wrote:
: Eric Brunner wrote:
: ...
: > No, actually one needs to write a better regular expression definitional
: > syntax and semantics, and realistically one needs to pick one of two
: > underlying representational models: stateful (e.g., EUC)

: I could argue that EUC is stateless. It depends on the definition of stateless.

I can't believe I made such a typo! Shoot me Gianni, now before I make
other encoding mistakes.

: Unicode is NOT stateless. Remember the BOM ?

Thanks. How's the localedef?

: > ... and stateless
: > (e.g. fixed-width) encodings. As an ancillary task, one needs to define
: > collation ordering(s) for particular locales (to adopt the common X/Open
: > nomenclature) -- however this is distinct from the issue of what actually
: > constitutes a regular expression.

: ...
: >
: > --
: > Kitakitamatsinohpowaw,
: > Eric Brunner

: --

: _ ` _ ` Globalization R&D
: / \ / / \ /-- /-- /
: / // / / / / / / / Graphics is cool
: \_/ \ \_ \/ /_/ /_/ o Internationalization c'est magnifique
: / /
: \_/ (415) 933 4387 Opinions mine etc ...

--
Kitakitamatsinohpowaw,
Eric Brunner

Osmo Ronkanen

unread,
Jun 27, 1996, 3:00:00 AM6/27/96
to
In article <hpa.31cfb77f...@freya.yggdrasil.com>,
H. Peter Anvin <h...@zytor.com> wrote:
>Followup to: <4qjnqv$d...@kruuna.helsinki.fi>
>By author: ronk...@cc.helsinki.fi (Osmo Ronkanen)
>In newsgroup: comp.std.internat

>> >
>> >UTF-8 is an off-the-shelf solution. Some of us hold out vain hopes
>> >for something better...
>>
>> I agree, it is the solution to that problem. However, that is a rare
>> problem. Multi-lingual documents are an exception, not the norm.
>>
>
>On a system-wide basis, hardly so, unless you count the big country in
>the west. Most systems have to deal with at least two languages:
>English and the native; in some parts of the world, three or more
>languages are common.

Was that some kind of joke? Remember the context: it was character sets.
Every 8-bit character set can be used for English. Similarly, even in
countries where there are several official languages, one can generally
find an 8-bit set that is usable and that is already used there.

Osmo


Markus Kuhn

unread,
Jun 28, 1996, 3:00:00 AM6/28/96
to
Gianni Mariani (gia...@engr.sgi.com) wrote:

> Unicode is NOT stateless. Remember the BOM ?

Well, I would describe the BOM more as an optional ugly hack and not
as something that makes Unicode a stateful encoding. They have simply
included it because someone had this funny idea, and reserving a
character for this purpose (byte sex identification) costs nothing, so
there was absolutely no reason not to include it. This does *not* mean
that you have to *use* it or that using the BOM would even be good
engineering practice (which it is certainly not, IMHO!).
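
For what it is worth, honouring a BOM when one does appear costs only a
few lines. A minimal sketch for a seekable UCS-2 stream read from its
start:

    #include <stdio.h>

    /* Returns 1 for big-endian, 0 for little-endian.  With no BOM
       present, rewinds and assumes big-endian, the ISO 10646
       recommendation.  A sketch; real code would also handle a
       stream shorter than two octets. */
    int ucs2_big_endian(FILE *f)
    {
        int b0 = getc(f), b1 = getc(f);

        if (b0 == 0xFE && b1 == 0xFF)
            return 1;                 /* BOM, big-endian */
        if (b0 == 0xFF && b1 == 0xFE)
            return 0;                 /* byte-swapped BOM */
        fseek(f, 0L, SEEK_SET);       /* no BOM: start over */
        return 1;                     /* assume big-endian */
    }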

Jonathan Rosenne

unread,
Jun 29, 1996, 3:00:00 AM6/29/96
to
Unicode contains some "formatting codes", necessary under some
circumstances for the coding of Hebrew and Arabic, which introduce a
minimal amount of state. See for example LRE or RLO.
--
Jonathan Rosenne
JR Consulting
P O Box 33641, Tel Aviv, Israel
Phone: +972 50 246 522 Fax: +972 9 56 73 53
http://ourworld.compuserve.com/homepages/Jonathan_Rosenne/

Gianni Mariani

unread,
Jul 2, 1996, 3:00:00 AM7/2/96
to

Eric Brunner wrote:
> : I could argue that EUC is stateless. It depends on the definition of stateless.
>
> I can't believe I made such a typo! Shoot me Gianni, now before I make
> other encoding mistakes.
>
> : Unicode is NOT stateless. Remember the BOM ?
>
> Thanks. How's the localedef?

Works like a charm. You'll need the XPG4 patch for IRIX 6.2 to get the real
one.

Gianni Mariani

unread,
Jul 2, 1996, 3:00:00 AM7/2/96
to

Jonathan Rosenne wrote:
>
> Unicode contains some "formatting codes", necessary under some
> circumstances for the coding of Hebrew and Arabic, which introduce a
> minimal amount of state. See for example LRE or RLO.

*MINIMAL* state is state!

Let's get this straight: in terms of support, you need to deal with state
correctly, otherwise you will have bugs in your application - guaranteed.

Re the comments from Markus Kuhn:


> Well, I would describe the BOM more as an optional ugly hack and not
> as something that makes Unicode a stateful encoding. They have simply
> included it because someone had this funny idea, and reserving a
> character for this purpose (byte sex identification) costs nothing, so
> there was absolutely no reason not to include it. This does *not* mean
> that you have to *use* it or that using the BOM would even be good
> engineering practice (which it is certainly not, IMHO!).

Are you implying that if I were to IGNORE BOMs then I would have Unicode
compliance?

I must have missed something.

Markus Kuhn

unread,
Jul 2, 1996, 3:00:00 AM7/2/96
to

Gianni Mariani <gia...@engr.sgi.com> writes:

>Are you implying that if I were to IGNORE BOMs then I would have Unicode
>compliance?

I don't have a copy of the Unicode specification (I am still waiting
for 2.0 to be published. When?), but you would definitely be compliant
with ISO 10646 if you ignore the BOM. ISO 10646 implies that the byte
sex is known in advance as part of the protocol specification, and
recommends using big-endian format for the UCS-2 and UCS-4 encodings.
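
Reading the recommended big-endian form is independent of the host's own
byte order in any case; a minimal sketch:

    #include <stdio.h>

    /* Read one UCS-2 value from a big-endian stream, regardless of
       the byte order of the host itself.  Returns -1L at end of
       file. */
    long get_ucs2_be(FILE *f)
    {
        int hi = getc(f), lo;

        if (hi == EOF || (lo = getc(f)) == EOF)
            return -1L;
        return ((long)hi << 8) | lo;
    }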

Message has been deleted

Gianni Mariani

unread,
Jul 7, 1996, 3:00:00 AM7/7/96
to

H. Peter Anvin wrote:
> > Are you implying that if I were to IGNORE BOMs then I would have Unicode
> > compliance?
> >
> > I must have missed something.
>
> He is correct. Unicode deals with the interpretation of 16- or 31-bit
> integers ONLY. How those 16- or 31-bit integers are transmitted is
> technically beyond the scope of the standard. However, by popular
> demand a set of several standard 8-bit encoding schemes have been defined:
>
> UCS-2: Transmitting 2-byte sequences in bigendian order;
> UCS-4: Transmitting 4-byte sequences in bigendian order;
> UTF-8: Transmitting variable-length codes in a specific sequence.
>
> However, the utilization of these are not required to be conforming.
>
> As far as Unicode is concerned, U+FEFF is *exactly* the same thing as
> a zero-width no-break space, and U+FFFE simply does not exist. U+FFFE is a
> non-entity, as is U+FFFF. Hence, the reception of what appears to be
> U+FFFE has to be interpreted by a lower level protocol. For example,
> on a wire transmission of UCS-2 characters, reception of U+FFFE may
> indicate a synchronization failure, and the appropriate response is to
> drop the next received byte and continue (the sequence FE FF FE FF FE
> FF would then force synchronization.) When reading plain UCS-2 files
> from another computer (likely to be stored in native byte order)
> U+FFFE is likely to indicate the file was written on a computer with
> the opposite native byte order, and the file should be read
> byte-swapped. For a UTF-8 stream, U+FFFE (EF BF BE) may be used to
> end the transmission, or it may cause the receiving computer to start
> playing Nethack.
>

OK, I'm lost.

Say I'm reading a file and I receive a U+FFFE.

What should I do - ignore it?

If you say no, then I have a stateful encoding. I assume from the
above that you are saying UCS-2 and UCS-4 are stateful.

At that point you contradict yourself by saying "He is correct".

0 new messages