
The Nordic graphemes FAQ (the s.c.nordic FAQ)


SCN FAQ-robot

Oct 25, 1998
The article below belongs to the www-pages at the soc.culture.nordic FAQ
web-site. Regarding its accuracy, it must be stated that even _if_ the
text was quite up to date initially, this may have changed.

The www-version at http://www.lysator.liu.se/nordic/scn/faq18.html
may look slightly different due to sparsely added links and illustrations.
For newer browsers (able to handle tables) the page is also available at a
faster www-server: http://www2.lysator.liu.se/nordic/scn/faq18.html

The www-version has been html-ized, however not "web-ized" - the texts are
not edited to comply with web-readability findings. The pages are supposed
to get printed out by readers who find them interesting.

Feel free to propose changed wordings for paragraphs and sections in need
of that. Also relevant links are more than welcome, in particular links to
serious www-pages which are not supposed to change address too often. :->>

- - - - - - - - - - - - - - - - - - - - - - -


Subject: 1.8 What are Nordic graphemes?

(by Tor Slettnes)

Nordic graphemes can in this context be described as:

Graphical representations of the letters that exist in the various
Nordic (i.e. Icelandic, Norwegian, Danish, Swedish and Finnish)
alphabets, beyond those that exist in the English alphabet.

Each of the Nordic written languages uses some additional letters
compared to English. These are, in order of appearance in the
alphabets:
Letter:        Languages used:        Pronounced like:        Character:
________________________________________________________________________

a acute        is                     'ou' in "loud"          á
eth            is                     'th' in "there"         ð
e acute        is (dk, no, se, fi)    'ea' in "yeah"          é
i acute        is                     'e' in "he"             í
o acute        is                     'o' in "home"           ó
u acute        is                     'ou' in "you"           ú
y acute        is                     'e' in "he"             ý
thorn          is                     'th' in "thumb"         þ
ae             is                     'i' in "hi"             æ
               dk, no                 'a' in "bad"            æ
o-slash        dk, no                 'i' in "bird"           ø
a-ring         dk, no, se (fi)        'o' in "bored"          å
a diaeresis    se, fi                 'a' in "bad"            ä
o diaeresis    se, fi, is             'i' in "bird"           ö
u diaeresis    (se, fi, dk, no)       'ue' in French "rue"    ü

A set of parentheses around the country code indicates that the letter
is rarely used in the corresponding language, typically only for loan
words or names originating from another language. Other accents, such
as ^ (circumflex) and ` (grave), are now and then used in foreign
names and words in all Nordic languages.

In Denmark and Norway the alphabet is ordered:
a b c d e f g h i j k l m n o p q r s t u v w x y z æ ø å

For Finland and Sweden the order is:
a b c d e f g h i j k l m n o p q r s t u v w x y z å ä ö
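
As a rough illustration of what the two orderings mean in practice, the
Python sketch below sorts a few words by alphabet position. The names
DA_NO_ORDER, SV_FI_ORDER and sort_words are made up for this example, and
finer national collation rules are ignored.

```python
# Alphabet orders as listed above (lowercase only).
DA_NO_ORDER = "abcdefghijklmnopqrstuvwxyzæøå"
SV_FI_ORDER = "abcdefghijklmnopqrstuvwxyzåäö"


def sort_words(words, alphabet):
    """Sort words by the position of each letter in `alphabet`.

    Letters that are not in the alphabet sort after all known letters.
    """
    rank = {ch: i for i, ch in enumerate(alphabet)}
    unknown = len(alphabet)

    def key(word):
        return [rank.get(ch, unknown) for ch in word.lower()]

    return sorted(words, key=key)


danish = sort_words(["ål", "ære", "zebra", "øl"], DA_NO_ORDER)
swedish = sort_words(["öl", "zebra", "ärt", "ål"], SV_FI_ORDER)
print(danish)   # ['zebra', 'ære', 'øl', 'ål']
print(swedish)  # ['zebra', 'ål', 'ärt', 'öl']
```

Note that a-ring comes last in Denmark and Norway, but first among the
extra letters in Sweden and Finland.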

If your curiosity isn't satisfied by the pronunciation guide above,
there are more extensive comments in the various language sections of
this FAQ.

1.8.1 How are these represented in Usenet postings and E-mail?

The "mother" of all modern character sets for computers is the
original ASCII character set, now renamed to US-ASCII. (ASCII =
"American Standard Code for Information Interchange"). This is a 7-bit
set containing the characters needed to write American English without
accents or special letters, and little more. No "foreign" letters are
included.

Various standards exist for representing extra characters, some of
which are: Digraph, LaTeX, ISO-646, ISO-8859-1, and the IBM codepages
437, 850, and 865. All of these sets, except the IBM codepages, are
usually considered acceptable on soc.culture.nordic, e-mail, and the
internet in general.

Digraphs are two-character combinations used for simplicity, and are
often the most universally understood notation on soc.culture.nordic.
However, when using these with non-Nordic readers, one should be
careful to explain that these are digraphs, not two separate
characters. Also, some information may get lost by using digraphs,
since a filtering program will not be able to determine whether it is
really a digraph or two separate characters.
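
The ambiguity can be shown with a small Python sketch. The table DIGRAPHS
and the function expand_digraphs are hypothetical names for a naive
filter of the kind described above, not an existing tool.

```python
# Hypothetical digraph table for Danish/Norwegian (lowercase only).
DIGRAPHS = {"ae": "æ", "oe": "ø", "aa": "å"}


def expand_digraphs(text):
    """Naively replace every digraph with its Nordic letter."""
    for digraph, letter in DIGRAPHS.items():
        text = text.replace(digraph, letter)
    return text


# "blaabaer" is the digraph spelling of "blåbær" (blueberry).
print(expand_digraphs("blaabaer"))  # blåbær
# English "does" has "oe" as two separate letters, so the filter
# changes it by mistake.
print(expand_digraphs("does"))      # døs
```

The filter has no way of knowing that the second "oe" was never meant as
o-slash, which is exactly the information that gets lost.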

LaTeX notation comes from the typesetting program by the same name,
where a sequence starting with '\' may be substituted with a given
character. For instance, the a-ring is written as "\aa" or "{\aa}" in
LaTeX.

ISO-646 (really ISO-646-NO and ISO-646-SE) refers to 7-bit sets similar
to US-ASCII, but with national characters substituted in place of the
following characters: {, |, }, [, \, ]. This is the oldest of the
"true representation" standards mentioned here; it was used in e.g.
the Nordic versions of the CP/M operating system, prior to MS-DOS.
Today, it is mostly used in Sweden and Finland (although the ordering
of the letters, for the sake of compatibility with the
Danish/Norwegian/German equivalents, is not correct in these
languages).

ISO-8859-1, also called ISO Latin-1, is the first of several 8-bit
character sets described in the International Organization for
Standardization's document 8859
<http://czyborra.com/charsets/iso8859.html>. (ISO is the
maintainer of the meter, the kilogram, etcetera.) This set includes
all characters needed for all West European languages, except Sámi and
Esperanto. Latin-1 is a superset of US-ASCII, hence all ASCII
characters maintain their original position in this set. Rather than
trying to accommodate positioning in any specific language, the letters
in ISO-8859-1 are ordered according to the alphabetical position of
their US-ASCII lookalikes. Latin-1 is supported through modern
standards like MIME (RFC 1521).
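
The superset property can be checked with a few lines of Python, using
only the built-in "latin-1" and "ascii" codecs. This is a minimal sketch;
the variable names are made up for the example. It also shows that each
Latin-1 code equals the Unicode code point of the same character.

```python
# Each Latin-1 byte value equals the Unicode code point of the character.
for ch in "æøåäöÆØÅÄÖ":
    encoded = ch.encode("latin-1")
    assert len(encoded) == 1 and encoded[0] == ord(ch)

# Plain ASCII text is byte-for-byte identical when encoded as Latin-1.
ascii_text = "soc.culture.nordic"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("latin-1")

a_ring = "å".encode("latin-1")
print(a_ring, oct(a_ring[0]), same_bytes)  # b'\xe5' 0o345 True
```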

The IBM codepages 437, 850, 861 and 865 are used on Personal Computers
in "text" mode, and are also the default set in many MS-Windows ®
communication programs. Coming from Big Blue, they were created to
provide text-based PC programs with a means to create low-cost
graphics, and the addition of extra characters came as a nice side
effect. (Certain Nordic characters were not represented in the
original codepage 437, with the consequence that in Iceland, Denmark
and Norway, computers would occasionally be sold with cp 861 or 865 in
the hardware. Today, alternative codepages can be downloaded to the
video card via software.) The Danish/Norwegian character o-slash is
not represented in cp 437, and in 850/861/865 it is positioned at
the dangerous code 155 (9B hex) -- "Upper Escape". Certain terminal
types will interpret this code as the initial character of an escape
command, and may e.g. clear the screen depending on the next letter.
Further, these codepages are incompatible with the established 8-bit
standard Latin-1, and should be avoided.
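
The position of o-slash in these codepages can be checked with Python's
built-in codecs "cp437", "cp850", "cp861", "cp865" and "latin-1". This is
only a sketch; the variable names are made up for the example.

```python
O_SLASH = "ø"

# Codepage 437 has no o-slash at all, so encoding fails.
try:
    O_SLASH.encode("cp437")
    in_cp437 = True
except UnicodeEncodeError:
    in_cp437 = False

# In codepages 850, 861 and 865 it is stored as byte 155 (9B hex).
codes = {cp: O_SLASH.encode(cp)[0] for cp in ("cp850", "cp861", "cp865")}

# Latin-1 stores it as 248 instead, and reads byte 155 as a control code.
latin1_code = O_SLASH.encode("latin-1")[0]
misread = bytes([155]).decode("latin-1")

print(in_cp437, codes, latin1_code, repr(misread))
```

A Latin-1 reader that receives byte 155 thus gets a non-printing control
character rather than a letter.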

The various notations of the Nordic graphemes follow:
Letter       Digraph  LaTeX    ISO-646  ISO-8859-1  HTML                Octal  Char
______________________________________  ___________________________________________

a acute      A'       \'{A}    -        alt-0193    &#193; &Aacute;     \301   Á
             a'       \'{a}    -        alt-0225    &#225; &aacute;     \341   á
eth          TH                -        alt-0208    &#208; &ETH;        \320   Ð
             th                -        alt-0240    &#240; &eth;        \360   ð
e acute      E'       \'{E}    -        alt-0201    &#201; &Eacute;     \311   É
             e'       \'{e}    -        alt-0233    &#233; &eacute;     \351   é
i acute      I'       \'{I}    -        alt-0205    &#205; &Iacute;     \315   Í
             i'       \'{i}    -        alt-0237    &#237; &iacute;     \355   í
o acute      O'       \'{O}    -        alt-0211    &#211; &Oacute;     \323   Ó
             o'       \'{o}    -        alt-0243    &#243; &oacute;     \363   ó
u acute      U'       \'{U}    -        alt-0218    &#218; &Uacute;     \332   Ú
             u'       \'{u}    -        alt-0250    &#250; &uacute;     \372   ú
y acute      Y'       \'{Y}    -        alt-0221    &#221; &Yacute;     \335   Ý
             y'       \'{y}    -        alt-0253    &#253; &yacute;     \375   ý
thorn        TH                -        alt-0222    &#222; &THORN;      \336   Þ
             th                -        alt-0254    &#254; &thorn;      \376   þ

u diaeresis  U"       \"{U}    ^        alt-0220    &#220; &Uuml;       \334   Ü
             u"       \"{u}    ~        alt-0252    &#252; &uuml;       \374   ü
ae           AE       {\AE}    [        alt-0198    &#198; &AElig;      \306   Æ
             ae       {\ae}    {        alt-0230    &#230; &aelig;      \346   æ
o-slash      OE       {\O}     \        alt-0216    &#216; &Oslash;     \330   Ø
             oe       {\o}     |        alt-0248    &#248; &oslash;     \370   ø
a-ring       AA       {\AA}    ]        alt-0197    &#197; &Aring;      \305   Å
             aa       {\aa}    }        alt-0229    &#229; &aring;      \345   å
a diaeresis  A"       \"{A}    [        alt-0196    &#196; &Auml;       \304   Ä
             a"       \"{a}    {        alt-0228    &#228; &auml;       \344   ä
o diaeresis  O"       \"{O}    \        alt-0214    &#214; &Ouml;       \326   Ö
             o"       \"{o}    |        alt-0246    &#246; &ouml;       \366   ö

The ISO-646 charsets for Denmark/Norway
<http://www.kostis.net/charsets/iso646.no.html> [ iso-646-NO ] and
Finland/Sweden <http://www.kostis.net/charsets/iso646.se.html>
[ iso-646-SE ] are in practice obsolete, and there never existed one
for Icelandic, but you may run into older 7-bit text files using
them. Note that 'Ü' is not represented in iso-646-NO for
Denmark/Norway.
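
The ISO-8859-1 columns of the table can be derived mechanically from the
character itself. The Python sketch below does so using the standard
html.entities module; the function name notations is made up for this
example.

```python
from html.entities import codepoint2name


def notations(ch):
    """Return (decimal code, numeric HTML, named HTML, octal escape)
    for a single ISO-8859-1 character."""
    code = ch.encode("latin-1")[0]
    return (
        code,
        "&#%d;" % code,
        "&%s;" % codepoint2name[code],
        "\\%o" % code,
    )


print(notations("å"))  # (229, '&#229;', '&aring;', '\\345')
print(notations("Þ"))  # (222, '&#222;', '&THORN;', '\\336')
```

The doubled backslash in the printed output is only Python's way of
displaying a single backslash inside a string.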


1.8.2 Pros and cons of the different representations

If you have been a reader of this group for a while, you may have
noticed that discussion about characters and their representations
occasionally accounts for quite a bit of bandwidth. It often does not
take more than a question about the issue from a new reader, or
someone posting an article with an IBM character set, to get a new
thread going on the issue. Some want to keep 7-bit ISO-646 (be aware
that they may call it "true ASCII", although strictly speaking it is
not), since 7-bit codes will always get through with any setup; others
want ISO-Latin-1 since it is more universal; and yet others promote
digraphs as the greatest common denominator between the two.

Some pros and cons for each set:
Character set:       Advantages:                  Disadvantages:
__________________________________________________________________________

Digraphs             * Requires 7-bit only        * Ambiguous
                                                    ("oe" or "o-slash"?)
                                                  * Non-optimal compromise

LaTeX                * Non-ambiguous 7-bit        * Made for typesetting;
                       representation.              somewhat cryptic for
                                                    regular text.
                                                  * Non-optimal compromise

ISO-646-SE,          * Only 7-bit "true"          * Different standards
ISO-646-DK             representation.              for each language
<[\]{|}>             * No data loss even          * Getting harder to
                       with old hardware/           find font support
                       software/setup.              (Dying out).
                                                  * Shadows the brace,
                                                    square bracket, pipe,
                                                    and backslash chars.

ISO Latin 1          * Utilizes all 8 bits        * Requires 8-bit clean
(ISO-8859-1)           in a byte; yet avoids        connection; older
<ÐÞÆØÅÄÖðþæøåäö..>     dangerous codes.             systems may cause
                     * Universal for all            data loss.
                       Western European           * May require some
                       languages.                   setup.
                     * Supported by ISO and       * In case of stripping,
                       MIME; true subset of         becomes "FXEDVfxedv";
                       Unicode.                     difficult to read.

IBM CodePages        * Uses all 256 codes;        * Uses all 256 codes;
Macintosh set          more characters              incl. dangerous ones.
<Unacceptable>       * Often used in PC           * Incompatible with
                       environments such as         the "de-facto" 8-bit
                       BBS'es.                      standard ISO-8859-1

__________________________________________________________________________
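
The "FXEDVfxedv" effect mentioned for ISO Latin 1 is what happens when
a 7-bit link clears the 8th bit of every byte. It can be reproduced with
the short Python sketch below; the function name strip_high_bit is made
up for this example.

```python
def strip_high_bit(text):
    """Encode as Latin-1, clear bit 8 of every byte, decode as ASCII."""
    stripped = bytes(b & 0x7F for b in text.encode("latin-1"))
    return stripped.decode("ascii")


print(strip_high_bit("ÆØÅÄÖæøåäö"))  # FXEDVfxedv
print(strip_high_bit("blåbær"))      # blebfr
```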


1.8.3 How do I set up support for 7-bit ISO-646 representation?
({|}, [\])

The ISO-646 sets are still supported via various fonts and translation
filters. Possible measures to set up support for them are:
* For the "terminal" program shipped with Windows 3.x, simply select
"Denmark/Norway", "Sweden" or "Finland" from the Translations item
in the "Terminal Preferences" dialogue box.
* For MS-Kermit, use the command "set term character-set language",
where "language" is one of "Finnish", "Swedish", or "Norwegian".
* For other DOS and Windows communication programs, visit their local
translation tables and insert appropriate translations for '[',
'\', ']', '{', '|', '}'.
* For Unix based news readers, either find an ISO-646 font, or pipe
your newsreader through one of the following commands (provided
the font you use is ISO-8859-1):

Denmark/Norway: tr '\\]{|}' '\330\305\346\370\345'
Sweden/Finland: tr '\\]{|}' '\326\305\344\366\345'

For instance, in your .cshrc file, insert the following line:

alias rn "rn | tr '\\]{|}' '\330\305\346\370\345'"

The character '[' should not be translated, because it is used in ANSI
escape sequences.

Note that if you use this kind of translation, you will no longer see
any of the characters '\]{|}'; in most cases this outweighs the
benefits from seeing the national letters.
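
For readers without "tr" at hand, the same translation can be sketched
in Python. The tables ISO646_NO and ISO646_SE and the function
decode_iso646 are names made up for this example; unlike "tr", the
sketch works on text that has already been read into a string, not on
raw bytes. As above, '[' is left untranslated.

```python
# Translation tables matching the tr commands above; '[' is left alone.
ISO646_NO = str.maketrans("\\]{|}", "ØÅæøå")
ISO646_SE = str.maketrans("\\]{|}", "ÖÅäöå")


def decode_iso646(text, table):
    """Replace the ISO-646 national positions with the real letters."""
    return text.translate(table)


print(decode_iso646("r|dgr|d med fl|de", ISO646_NO))  # rødgrød med fløde
print(decode_iso646("sm|rg}sbord", ISO646_SE))        # smörgåsbord
```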

1.8.4 How do I set up support for 8-bit ISO-8859-1 representation?
(æøåäö, ÆØÅÄÖ)

The ISO-8859-1 (Latin 1) set is currently the most common character
representation standard on soc.culture.nordic, and is also quite
frequent in e.g. soc.culture.german, personal e-mail etc. However, on
many systems, the ability to view these characters is not provided as
"default", so you may need to configure some things on your own.
* If you are reading news through a modem, you need to make sure
that your modem connection is 8 data bits. (The most common
parameters are "8N1" - 8 data bits, no parity bits, and one stop
bit).
* For DOS text mode communication programs, you need an ISO->IBM
translation table. Tables for Telemate, Telix and Procomm Plus can
be found in the file "xlate.zip", available at various FTP sites.
* For MS Windows ® communication programs, select an ANSI or
ISO-Latin-1 font. For MS-Kermit, use "set term char latin". For
Procomm Plus for Windows, select vt220 or vt320 emulation. Be sure
that bit 8 is not stripped.
* For MS Windows ® you can also generate 8-bit characters globally
by choosing "US-International" keyboard layout via the
"International" dialogue box in the Control Panel. For instance,
'ä' (a diaeresis) is generated by pressing "a, i.e. double quote
followed by lowercase a.
A note to Windows programmers: Let the underlying keyboard
drivers, run-time libraries etc. take care of keyboard input.
Only be sure that the 8th bit is not stripped/masked away.
* If your newsreader is UNIX-based, insert the following command in
your .login or .profile file:

stty -istrip pass8

* If your modem connection is 7 bits (and cannot be changed to 8
bits), you can have ISO-Latin-1 characters translated to "[\]{|}"
before they are sent over the modem. Pipe your reader through the
"tr" command, similar to above:

tr '\306\330\305\304\326\346\370\345\344\366' '[\\][\\{|}{|'

* If you use the "emacs" editor, version 19.x, and have an
ISO-Latin-1 display font, insert the following line in your .emacs
file:

(standard-display-european t)

Also, if you have a keyboard with international characters that
you want to be able to use directly, or if you in another way are
able to generate 8-bit codes directly from your keyboard, insert
the following line:
(set-input-mode (car (current-input-mode))
(nth 1 (current-input-mode))
0)
Note that in cases where the Meta key is represented by setting
the 8th (high) bit (i.e. if you are not using X-windows), this
line will disable the Meta key, so you will subsequently have to
use "ESC x" to generate "M-x".
Otherwise, insert the following line:

(load-library "iso-insert")

A new keymap, 8859-1, has now been assigned to the key sequence
"C-x 8". You can assign this to another sequence, e.g. C-t, by
inserting:

(global-set-key "\C-t" 8859-1-map)

Some strokes from this map:
C-x 8 d gives ð (eth)
C-x 8 t gives þ (thorn)
C-x 8 a e gives æ (ae)
C-x 8 / o gives ø (o-slash)
C-x 8 a a gives å (a-ring)
C-x 8 " a gives ä (a diaeresis)
C-x 8 " o gives ö (o diaeresis)
C-x 8 ' a gives á (a acute)
C-x 8 ' i gives í (i acute)


1.8.5 References

For an index to other literature on internationalization, try:
<http://www.vlsivie.tuwien.ac.at/mike/i18n.html>

I am: Tor Slettnes.
___________________________________
_________________________________________________________________

- Is the text above really reliable?
- See the discussion in section 1.2.2!
_________________________________________________________________


© Copyright 1996-98 by Tor Slettnes.

You are free to quote this page as long as you mention the URL.
This page was last updated September 28th.
--
e-mail: j...@lysator.liu.se
s-mail: Majeldsvägen 8a, 587 31 LINKÖPING, Sweden
www: http://www.lysator.liu.se/~jmo/
