I have recently spent quite some time working out a proposal for two
Unicode/ISO 10646 subsets that are so small that I hope they will become
widely implemented in Europe and America. Both are specifically designed
to be suitable for systems where characters are represented in
low-resolution fixed-width fonts. This includes, for instance, your xterm
and Emacs windows under Unix (or, more generally, VT100 emulators and
source code editors), but also applications such as portable LCD devices
(pagers, mobile phones), where it makes sense to implement only a small
subset of Unicode and where no single 8-bit set can cover a reasonable
number of languages. These subsets are not really intended for
applications such as the publishing industry, where these display
restrictions do not exist and larger Unicode subsets or even full
implementations might be adequate.
The two subsets are:
- Very Simple European Character Set (VSECS)
345 characters, basically the superset of Latin 1-4,9,10,15 and CP1252
plus a very few ISO 6937 characters
    Rows  Positions (Cells)
    00    20-7E A0-FF
    01    00-13 16-2B 2E-31 34-3E 41-48 4A-4D 50-7E 92
    02    C6-C7 D8-DD
    20    13-15 18-1A 1C-1E 20-22 26 30 39-3A AC
    21    22 26 5B-5E 90-93
    26    6A
    FF    FD
- Simple European Character Set (SECS)
683 characters, covers in addition to VSECS also Cyrillic, Greek,
MS-DOS block graphics, and a moderate set of mathematical characters
that is likely to be used in academic email and source code comments.
    Rows  Positions (Cells)
    00    20-7E A0-FF
    01    00-13 16-2B 2E-31 34-3E 41-48 4A-4D 50-7E 92
    02    BC-BD C6-C7 D8-DD
    03    84-86 88-8A 8C 8E-A1 A3-CE D1 D5-D6 F1
    04    01-0C 0E-4F 51-5C 5E-5F 90-91
    20    13-15 17-1A 1C-1E 20-22 26 30 32-34 39-3A 70 7F-83 A7 AC
    21    02 15-16 1A 1D 22 24 26 5B-5E 90-95 A4-A7 D0-D5
    22    00-09 0B-0C 12-13 18-1A 1D-1E 24-2A 3C 43 45 48-49 58 5F-62 64-65
    22    6A-6B 82-8B 95 97 A4-A7 C2-C3 C5
    23    00 08-0B 10 15 20-21 29-2A
    25    00 02 0C 10 14 18 1C 24 2C 34 3C 50-6C 80 84 88 8C 90-93 A0 B2
    25    BA BC C4 CB
    26    10-12 3A-3C 40 42 6A-6B 6D-6F
    27    13 17
    FF    FD
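As a rough cross-check of the two counts above, the row/cell notation can be expanded mechanically. The following is only a sketch in Python, not the author's own tooling. The helper name parse_rows is made up for illustration, and it assumes that every line is a hexadecimal row number followed by single cells or inclusive cell ranges.

```python
def parse_rows(text):
    """Expand 'row cell cell-cell ...' lines into a set of code points.

    Assumption: the first field is the high byte (row), and every other
    field is either one cell or an inclusive range of cells, all in hex.
    """
    chars = set()
    for line in text.strip().splitlines():
        row, *cells = line.split()
        base = int(row, 16) << 8
        for cell in cells:
            first, _, last = cell.partition("-")
            lo = int(first, 16)
            hi = int(last, 16) if last else lo
            chars.update(base + c for c in range(lo, hi + 1))
    return chars


VSECS = parse_rows("""
00 20-7E A0-FF
01 00-13 16-2B 2E-31 34-3E 41-48 4A-4D 50-7E 92
02 C6-C7 D8-DD
20 13-15 18-1A 1C-1E 20-22 26 30 39-3A AC
21 22 26 5B-5E 90-93
26 6A
FF FD
""")

SECS = parse_rows("""
00 20-7E A0-FF
01 00-13 16-2B 2E-31 34-3E 41-48 4A-4D 50-7E 92
02 BC-BD C6-C7 D8-DD
03 84-86 88-8A 8C 8E-A1 A3-CE D1 D5-D6 F1
04 01-0C 0E-4F 51-5C 5E-5F 90-91
20 13-15 17-1A 1C-1E 20-22 26 30 32-34 39-3A 70 7F-83 A7 AC
21 02 15-16 1A 1D 22 24 26 5B-5E 90-95 A4-A7 D0-D5
22 00-09 0B-0C 12-13 18-1A 1D-1E 24-2A 3C 43 45 48-49 58 5F-62 64-65
22 6A-6B 82-8B 95 97 A4-A7 C2-C3 C5
23 00 08-0B 10 15 20-21 29-2A
25 00 02 0C 10 14 18 1C 24 2C 34 3C 50-6C 80 84 88 8C 90-93 A0 B2
25 BA BC C4 CB
26 10-12 3A-3C 40 42 6A-6B 6D-6F
27 13 17
FF FD
""")

print(len(VSECS), len(SECS))   # 345 683
print(VSECS <= SECS)           # True: every VSECS character is also in SECS
```

Sets are used so that a row split over two lines (as rows 22 and 25 are) simply merges, and so that the subset test is a single operator.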
VSECS is somewhat similar to ISO 6937 with some bugs fixed (e.g., the
Euro symbol is included, as are the directed quotation marks).
SECS is somewhat similar to Microsoft/Adobe WGL4. I think SECS is much
better than WGL4, because WGL4 contains many letters for which I could
not find out where they are used (for at least three I am sure they
never existed). SECS contains the following 91 characters that are not
part of WGL4:
    Rows  Positions (Cells)
    02    BC-BD
    03    D1 D5-D6 F1
    20    34 70 80-83
    21    02 15 1A 1D 24 A4-A7 D0-D5
    22    00-01 03-05 07-09 0B-0C 13 18 1D 24-28 2A 3C 43 45 49 58 5F 62
    22    6A-6B 82-8B 95 97 A4-A7 C2-C3 C5
    23    00 08-0B 15 29-2A
    26    10-12 6D-6F
    27    13 17
    FF    FD
Almost all of these are basic mathematical symbols that most
high school students should be familiar with. They are very useful to
have available in academic email discussions and source code comments.
It would be nice if the authors of WGL4 seriously considered extending
their Unicode subset by those few dozen elementary math symbols. Then
SECS would become a subset of WGL4. VSECS is already a subset of WGL4
except for U+FFFD.
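The relations claimed here can be partly checked from the lists in this message alone. The sketch below (Python, with a made-up helper parse_rows and the hypothetical name NOT_IN_WGL4 for the list of SECS characters missing from WGL4) does not contain the WGL4 repertoire itself, so it only tests what the three lists imply about each other.

```python
def parse_rows(text):
    """Expand hex 'row cell cell-cell ...' lines into a set of code points."""
    chars = set()
    for line in text.strip().splitlines():
        row, *cells = line.split()
        base = int(row, 16) << 8
        for cell in cells:
            first, _, last = cell.partition("-")
            lo = int(first, 16)
            hi = int(last, 16) if last else lo
            chars.update(base + c for c in range(lo, hi + 1))
    return chars


VSECS = parse_rows("""
00 20-7E A0-FF
01 00-13 16-2B 2E-31 34-3E 41-48 4A-4D 50-7E 92
02 C6-C7 D8-DD
20 13-15 18-1A 1C-1E 20-22 26 30 39-3A AC
21 22 26 5B-5E 90-93
26 6A
FF FD
""")

SECS = parse_rows("""
00 20-7E A0-FF
01 00-13 16-2B 2E-31 34-3E 41-48 4A-4D 50-7E 92
02 BC-BD C6-C7 D8-DD
03 84-86 88-8A 8C 8E-A1 A3-CE D1 D5-D6 F1
04 01-0C 0E-4F 51-5C 5E-5F 90-91
20 13-15 17-1A 1C-1E 20-22 26 30 32-34 39-3A 70 7F-83 A7 AC
21 02 15-16 1A 1D 22 24 26 5B-5E 90-95 A4-A7 D0-D5
22 00-09 0B-0C 12-13 18-1A 1D-1E 24-2A 3C 43 45 48-49 58 5F-62 64-65
22 6A-6B 82-8B 95 97 A4-A7 C2-C3 C5
23 00 08-0B 10 15 20-21 29-2A
25 00 02 0C 10 14 18 1C 24 2C 34 3C 50-6C 80 84 88 8C 90-93 A0 B2
25 BA BC C4 CB
26 10-12 3A-3C 40 42 6A-6B 6D-6F
27 13 17
FF FD
""")

# The SECS characters listed above as missing from WGL4.
NOT_IN_WGL4 = parse_rows("""
02 BC-BD
03 D1 D5-D6 F1
20 34 70 80-83
21 02 15 1A 1D 24 A4-A7 D0-D5
22 00-01 03-05 07-09 0B-0C 13 18 1D 24-28 2A 3C 43 45 49 58 5F 62
22 6A-6B 82-8B 95 97 A4-A7 C2-C3 C5
23 00 08-0B 15 29-2A
26 10-12 6D-6F
27 13 17
FF FD
""")

print(len(NOT_IN_WGL4))                        # 91
print(NOT_IN_WGL4 <= SECS)                     # True
print(sorted(VSECS & NOT_IN_WGL4) == [0xFFFD]) # True
```

The last line is the interesting one: the only VSECS character that appears in the "not in WGL4" list is U+FFFD, which is consistent with the statement that VSECS is a subset of WGL4 apart from that one character.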
The mathematical symbols of SECS will hopefully give US developers
who do not specialize in i18n issues some motivation to get
interested in 16-bit character sets, as they are more relevant to their
personal use than the accented characters of crazy Europeans.
My dream is that something like SECS soon becomes the common
minimum repertoire in Unix X11 fonts and printer fonts. VSECS is
intended as an intermediate step for applications where the size of the
character set is critical and only Latin script support is required.
I do not think SECS contains any useless symbol. I know for each letter
and symbol why it is in there and in which languages or fields it is
used. Just ask.
Much more information on the two sets is available from
http://www.cl.cam.ac.uk/~mgk25/ucs/vsecs.html
http://www.cl.cam.ac.uk/~mgk25/ucs/secs.html
Much better than just looking at these web pages is to download the
database (Perl needed) that generated them from
http://www.cl.cam.ac.uk/~mgk25/ucs/secs.tar.gz
Then you can play around with them and test the subset properties with
regard to other sets easily yourself.
If you want to see example glyphs on the HTML output of this script,
then you'll also need
http://www.cl.cam.ac.uk/~mgk25/ucs/glyphs.zip
The uniset Perl script allows you to comfortably build up your own
database of character collections, to merge and subtract them and to
generate Unicode subsets and study their relations with other subsets.
The mapping files from the Unicode Consortium can be used directly as
input.
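To give an idea of the kind of operation meant here, the following Python sketch reads a few lines in the "byte, code point, #name" layout used by the Unicode Consortium mapping files and then merges and subtracts collections as plain sets. It is not the uniset script and does not reproduce its interface; parse_mapping and the variable names are made up for illustration, and the sample lines are the first entries of a CP1252 table.

```python
def parse_mapping(text):
    """Collect the Unicode code points from 'byte  code point  #NAME' lines.

    Assumption: everything after '#' is a comment, and a line with no
    second field (an undefined byte) contributes nothing.
    """
    chars = set()
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2:
            chars.add(int(fields[1], 16))
    return chars


SAMPLE = """\
0x80 0x20AC #EURO SIGN
0x81 #UNDEFINED
0x82 0x201A #SINGLE LOW-9 QUOTATION MARK
0x83 0x0192 #LATIN SMALL LETTER F WITH HOOK
"""

cp_sample = parse_mapping(SAMPLE)
latin1_upper = set(range(0xA0, 0x100))   # the A0-FF half of row 00

merged = cp_sample | latin1_upper        # "merge" two collections
extra = cp_sample - latin1_upper         # "subtract" one from the other

print(sorted(hex(c) for c in cp_sample))
print(len(merged), len(extra))
```

Once every collection is a set of code points, questions such as "is A a subset of B" or "what does A have that B lacks" reduce to the `<=` and `-` operators.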
Please let me know what you think about SECS and VSECS and if this is
something you would like to see widely implemented.
Markus
--
Markus G. Kuhn, Security Group, Computer Lab, Cambridge University, UK
email: mkuhn at acm.org, home page: <http://www.cl.cam.ac.uk/~mgk25/>
>SECS is somewhat similar to Microsoft/Adobe WGL4. I think SECS is much
>better than WGL4, because WGL4 contains many letters for which I could
>not find out where they are used (for at least three I am sure they
>never existed).
Markus, I'm very interested in your proposal, but would like to know
for which WGL4 letters you could find no use. I have spent a lot of
time researching European (and non-European) orthographies, and may be
able to account for some of the lesser known letters (which is not to
say that I think WGL4 is perfect).
John Hudson, Type Director
Tiro Typeworks
Vancouver, BC
ti...@tiro.com
www.tiro.com
I'm very interested in hearing more about what the rationale to have
the following characters in WGL4 might be:
I don't know where the following ones come from:
0114 # LATIN CAPITAL LETTER E WITH BREVE
0115 # LATIN SMALL LETTER E WITH BREVE
012C # LATIN CAPITAL LETTER I WITH BREVE
012D # LATIN SMALL LETTER I WITH BREVE
014E # LATIN CAPITAL LETTER O WITH BREVE
014F # LATIN SMALL LETTER O WITH BREVE
01FA # LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE
01FB # LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE
01FC # LATIN CAPITAL LETTER AE WITH ACUTE
01FD # LATIN SMALL LETTER AE WITH ACUTE
01FE # LATIN CAPITAL LETTER O WITH STROKE AND ACUTE
01FF # LATIN SMALL LETTER O WITH STROKE AND ACUTE
02C9 # MODIFIER LETTER MACRON
02D6 # MODIFIER LETTER PLUS SIGN
0387 # GREEK ANO TELEIA
The long s might be from German Fraktur fonts which is unused
since ~1945. This letter has certainly no equivalent in modern
German roman/antiqua fonts and is certainly not needed to
write German:
017F # LATIN SMALL LETTER LONG S
I understand that the following ones were added by mistake to
ISO 6937:
0132 # LATIN CAPITAL LIGATURE IJ
0133 # LATIN SMALL LIGATURE IJ
013F # LATIN CAPITAL LETTER L WITH MIDDLE DOT
0140 # LATIN SMALL LETTER L WITH MIDDLE DOT
0149 # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
Usage of LIGATURE IJ is now deprecated in the Netherlands
and the other ones never existed in Catalan or Afrikaans
as originally assumed (source: NL gov manual by J.W. van
Wingen).
The following are claimed to be used in Welsh, but Welsh
native speakers who I asked claimed to have never seen them,
so I suspect they are historic characters that are not in
general use.
1E80 # LATIN CAPITAL LETTER W WITH GRAVE
1E81 # LATIN SMALL LETTER W WITH GRAVE
1E82 # LATIN CAPITAL LETTER W WITH ACUTE
1E83 # LATIN SMALL LETTER W WITH ACUTE
1E84 # LATIN CAPITAL LETTER W WITH DIAERESIS
1E85 # LATIN SMALL LETTER W WITH DIAERESIS
1EF2 # LATIN CAPITAL LETTER Y WITH GRAVE
1EF3 # LATIN SMALL LETTER Y WITH GRAVE
The purpose of the following characters is also
unclear to me:
201B # SINGLE HIGH-REVERSED-9 QUOTATION MARK
203C # DOUBLE EXCLAMATION MARK
203E # OVERLINE
All these are in WGL4 but (so far) not in SECS.
There are also some mysterious characters in the MES-2
proposal which I have not found anywhere else:
01B7 # LATIN CAPITAL LETTER EZH
01C4 # LATIN CAPITAL LETTER DZ WITH CARON
01C6 # LATIN SMALL LETTER DZ WITH CARON
01C7 # LATIN CAPITAL LETTER LJ
01C9 # LATIN SMALL LETTER LJ
01CA # LATIN CAPITAL LETTER NJ
01CC # LATIN SMALL LETTER NJ
01DE # LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON
01DF # LATIN SMALL LETTER A WITH DIAERESIS AND MACRON
01E4 # LATIN CAPITAL LETTER G WITH STROKE
01E5 # LATIN SMALL LETTER G WITH STROKE
01E6 # LATIN CAPITAL LETTER G WITH CARON
01E7 # LATIN SMALL LETTER G WITH CARON
01E8 # LATIN CAPITAL LETTER K WITH CARON
01E9 # LATIN SMALL LETTER K WITH CARON
01EE # LATIN CAPITAL LETTER EZH WITH CARON
01EF # LATIN SMALL LETTER EZH WITH CARON
01F1 # LATIN CAPITAL LETTER DZ
01F3 # LATIN SMALL LETTER DZ
01F4 # LATIN CAPITAL LETTER G WITH ACUTE
01F5 # LATIN SMALL LETTER G WITH ACUTE
027C # LATIN SMALL LETTER R WITH LONG LEG
0292 # LATIN SMALL LETTER EZH
0374 # GREEK NUMERAL SIGN
0375 # GREEK LOWER NUMERAL SIGN
037A # GREEK YPOGEGRAMMENI
037E # GREEK QUESTION MARK
Do you know a good reason why any of these characters should
go into a simple European character set?
My browser finally finished downloading Markus' SECS website, and I
have prepared the following comments on some of the WGL4 characters he
has excluded from SECS. I believe that some of these characters should
be included in the SECS, in accordance with Markus' criteria, and have
marked my comments on these characters with an asterisk.
I have not bothered to comment on the heavy linedraw characters, etc.,
and have confined my comments to letters and diacritics.
[I am also concerned that Markus' recommended mathematical set may be
too extensive. Is this really a _basic_ mathematical subset, or
something more?]
0114 LATIN CAPITAL LETTER E WITH BREVE
0115 LATIN SMALL LETTER E WITH BREVE
012C LATIN CAPITAL LETTER I WITH BREVE
012D LATIN SMALL LETTER I WITH BREVE
These characters are not required for the modern writing of any
European language. They are essential to much European prosody, and
are found in most Latin language textbooks and dictionaries. I believe
it would be sound to omit them from SECS if the basic, non-combining
IPA characters are also to be omitted. If the latter are included it
would make sense to include short and long vowel diacritics.
0132 LATIN CAPITAL LIGATURE IJ
0133 LATIN SMALL LIGATURE IJ
These, of course, are the Dutch digraph characters. There is no need
for them to be separately encoded, as Dutch writers commonly type /I/
followed by /J/. These characters can, I believe, be safely omitted
from SECS.
013F LATIN CAPITAL LETTER L WITH MIDDLE DOT
0140 LATIN SMALL LETTER L WITH MIDDLE DOT
These are composite rendering forms for the Catalan lateral
approximant. They are not strictly necessary in a character set which
includes an appropriately sized, positioned and spaced midpoint
character (U+00B7). I am a little concerned that in a monospaced font,
of the kind referred to in Markus' SECS criteria, reliance on the
midpoint character will produce gaping holes in the middle of many
Catalan words. I am undecided about the possible inclusion of these
characters.
0149 LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
This is a Hewlett Packard character, apparently used by them for
Afrikaans. I've never heard a clear explanation of its purpose, or its
inclusion in WGL4 or other character sets (other than the fact that HP
wanted it to be included). In any case, Afrikaans is beyond the scope
of SECS, so this character may be safely omitted.
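For what it is worth, the Unicode character database itself treats these five characters as compatibility characters. The small Python sketch below relies only on the standard unicodedata module (the helper name compat_parts is made up) and shows what each one breaks down into under NFKD normalization. This is a property of the Unicode data, not something taken from the WGL4 or SECS documents.

```python
import unicodedata


def compat_parts(ch):
    """Return the code points of the NFKD (compatibility) decomposition."""
    return [ord(p) for p in unicodedata.normalize("NFKD", ch)]


for ch in "\u0132\u0133\u013f\u0140\u0149":
    parts = " ".join(f"U+{cp:04X}" for cp in compat_parts(ch))
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {parts}")
```

That U+0140 decomposes into l followed by U+00B7 matches the remark above that the midpoint character can stand in for the precomposed form. It says nothing about how the result looks in a monospaced font, which is the actual concern.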
014E LATIN CAPITAL LETTER O WITH BREVE
014F LATIN SMALL LETTER O WITH BREVE
These characters are not required for the modern writing of any
European language. They are essential to much European prosody, and
are found in most Latin language textbooks and dictionaries. I believe
it would be sound to omit them from SECS if the basic, non-combining
IPA characters are also to be omitted. If the latter are included it
would make sense to include short and long vowel diacritics.
017F LATIN SMALL LETTER LONG S
Archaic. This may be safely omitted.
01A0 LATIN CAPITAL LETTER O WITH HORN
01A1 LATIN SMALL LETTER O WITH HORN
01AF LATIN CAPITAL LETTER U WITH HORN
01B0 LATIN SMALL LETTER U WITH HORN
Vietnamese. These characters may be safely omitted (although there are
sizeable Vietnamese speaking populations in parts of Europe, notably
in the Netherlands).
01FA LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE
01FB LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE
01FC LATIN CAPITAL LETTER AE WITH ACUTE
01FD LATIN SMALL LETTER AE WITH ACUTE
01FE LATIN CAPITAL LETTER O WITH STROKE AND ACUTE
01FF LATIN SMALL LETTER O WITH STROKE AND ACUTE
* These characters are used in Danish and their inclusion in both
Unicode and the WGL4 set was at the request of the Danish standards
organization. My understanding is that there is some debate over the
status of these characters in modern Danish. Some sources claim that
they are archaic, others that they are orthographically correct and
that to omit them is a mistake. I believe they should not be omitted
from SECS without further research.
02D6 MODIFIER LETTER PLUS SIGN
This may be safely omitted.
1E80 LATIN CAPITAL LETTER W WITH GRAVE
1E81 LATIN SMALL LETTER W WITH GRAVE
1E82 LATIN CAPITAL LETTER W WITH ACUTE
1E83 LATIN SMALL LETTER W WITH ACUTE
1E84 LATIN CAPITAL LETTER W WITH DIAERESIS
1E85 LATIN SMALL LETTER W WITH DIAERESIS
1EF2 LATIN CAPITAL LETTER Y WITH GRAVE
1EF3 LATIN SMALL LETTER Y WITH GRAVE
* All these characters are used in modern Welsh and should _not_ be
omitted from SECS. Their use is less common than the W and Y
circumflex diacritics, but all are essential to semantic distinction
and/or pronunciation. My source for this information is Andrew Hawke
(a...@pophost.aber.ac.uk), assistant editor of the University of Wales
dictionary of the Welsh language. I can provide a Welsh word list if
required.
>There are also some mysterious characters in the MES-2
>proposal which I have not found anywhere else:
>01B7 # LATIN CAPITAL LETTER EZH
Ezh (or Yogh) is found in Old and Middle English texts, and is a
letter in the orthographies of a number of African languages. The only
modern European language I associate it with is Skolt Saami (see
below). The number of speakers/writers of Skolt Saami is probably well
below the 10,000 minimum set in Markus' criteria.
>01C4 # LATIN CAPITAL LETTER DZ WITH CARON
>01C6 # LATIN SMALL LETTER DZ WITH CARON
>01C7 # LATIN CAPITAL LETTER LJ
>01C9 # LATIN SMALL LETTER LJ
>01CA # LATIN CAPITAL LETTER NJ
>01CC # LATIN SMALL LETTER NJ
These are digraphs which were separately encoded in ISO/IEC 10646 and
Unicode to facilitate compatible font mappings between Latin and
Cyrillic fonts for Serbo-Croatian. Language reform policies in the
former Yugoslav republic -- particularly in Croatia -- have greatly
reduced the need for such compatibility. I believe these digraph
characters may still be of use in Serbia, if transliteration to Latin
script is a requirement, but such specialised usage may fall beyond
the proposed scope of SECS.
>01DE # LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON
>01DF # LATIN SMALL LETTER A WITH DIAERESIS AND MACRON
Unicode 2.0 identifies these characters as Lappish. In the first
place, Lappish is generally considered a derogatory term; in the
second these characters do not appear in any of the Saami
orthographies I have collected. Note that I only have Latin
orthographies for five of the nine Saami languages.
>01E4 # LATIN CAPITAL LETTER G WITH STROKE
>01E5 # LATIN SMALL LETTER G WITH STROKE
These characters are used to write Skolt Saami in the Latin script
(Skolt Saami is also written by some using the Cyrillic script). The
number of speakers/writers of Skolt Saami is probably well below the
10,000 minimum set in Markus' criteria.
>01E6 # LATIN CAPITAL LETTER G WITH CARON
>01E7 # LATIN SMALL LETTER G WITH CARON
I can find no reference for these characters. Their use in Turkish is
incorrect and an unacceptable substitute for the G breve diacritics.
Unicode 2.0 indicates 'Lappish', but they do not occur in any of the
Saami orthographies I have on file.
>01E8 # LATIN CAPITAL LETTER K WITH CARON
>01E9 # LATIN SMALL LETTER K WITH CARON
>01EE # LATIN CAPITAL LETTER EZH WITH CARON
>01EF # LATIN SMALL LETTER EZH WITH CARON
These characters are used to write Skolt Saami in the Latin script
(Skolt Saami is also written by some using the Cyrillic script). The
number of speakers/writers of Skolt Saami is probably well below the
10,000 minimum set in Markus' criteria.
>01F1 # LATIN CAPITAL LETTER DZ
>01F3 # LATIN SMALL LETTER DZ
>01F4 # LATIN CAPITAL LETTER G WITH ACUTE
>01F5 # LATIN SMALL LETTER G WITH ACUTE
Your guess is as good as mine. I believe these can be safely omitted.
>027C # LATIN SMALL LETTER R WITH LONG LEG
I know of no usage of this character outside of phonetic transcription
(strident apico-alveolar trill). I'm not even sure that it remains
part of the official IPA standard set.
>0292 # LATIN SMALL LETTER EZH
See note above for uppercase Ezh/Yogh. Of course, if it is decided to
include a basic IPA subset, this character would become necessary.
>0374 # GREEK NUMERAL SIGN
>0375 # GREEK LOWER NUMERAL SIGN
I believe these to be archaic, and are only of use when Greek letters
are serving as numerals (as they did before the introduction of
'Arabic' numerals).
>037A # GREEK YPOGEGRAMMENI
This is the Greek subscript iota. It is not used in modern, monotonic
Greek, so may be safely omitted from SECS.
>037E # GREEK QUESTION MARK
I'm unable to confirm, at this time, whether this punctuation mark is
still in use or not. I suspect not, and most readers would be unlikely
to distinguish it from a semicolon.
(snip)
>The following are claimed to be used in Welsh, but Welsh
>native speakers who I asked claimed to have never seen them,
>so I suspect they are historic characters that are not in
>general use.
>
>1E80 # LATIN CAPITAL LETTER W WITH GRAVE
>1E81 # LATIN SMALL LETTER W WITH GRAVE
>1E82 # LATIN CAPITAL LETTER W WITH ACUTE
>1E83 # LATIN SMALL LETTER W WITH ACUTE
>1E84 # LATIN CAPITAL LETTER W WITH DIAERESIS
>1E85 # LATIN SMALL LETTER W WITH DIAERESIS
>1EF2 # LATIN CAPITAL LETTER Y WITH GRAVE
>1EF3 # LATIN SMALL LETTER Y WITH GRAVE
Actually, if one is doing a Welsh *pronunciation* guide these could be
potentially useful; I've also seen at least "w" and "y" acute in the past.
(The others, I'll admit, *are* weird--I'm not entirely sure where they'd
be used, save *maybe* in other languages in the same subfamily of Celtic
languages Cymru/Welsh is in [for example, Breton or Manx]. I'm rather
afraid I don't speak any Celtic tongue so I can't be for sure on this;
if memory serves, there is a Manx dictionary online, though. IF my
memory of that serves at ALL well, Manx doesn't use "w" as a vowel but
*does* use "y"; I know exactly nothing on Breton.)
*POSSIBLY* w-dieresis and y-dieresis occur in *some* transcription schemes
for Native American languages (if they occur in this, it'd likely be for
Northwest languages that have vowels and consonants that literally cannot
be expressed in any other way without resorting to the International
Phonetic Alphabet).
Y-dieresis and y-dieresis do occur in the standard character sets of most
English-language Postscript and Truetype fonts.
Offhand, as an aside--I expect some of the other oddish characters
(AE-grave, etc.) are also used mostly in pronunciation guides as well.
A-dieresis-grave, etc. *may* be used in Vietnamese, but I'm not sure.
>The purpose of the following characters is also
>unclear to me:
>
>201B # SINGLE HIGH-REVERSED-9 QUOTATION MARK
>203C # DOUBLE EXCLAMATION MARK
>203E # OVERLINE
>
>All these are in WGL4 but (so far) not in SECS.
Double-exclamation sounds more like a "typesetting character"; so does
"high reversed-9 quot mark" (maybe this is equivalent to leftquot?)
>There are also some mysterious characters in the MES-2
>proposal which I have not found anywhere else:
>
>01B7 # LATIN CAPITAL LETTER EZH
>01C4 # LATIN CAPITAL LETTER DZ WITH CARON
>01C6 # LATIN SMALL LETTER DZ WITH CARON
>01C7 # LATIN CAPITAL LETTER LJ
>01C9 # LATIN SMALL LETTER LJ
EZH I'm not sure on, but it *may* be used in some Turkic languages; DZ and
its variants, and LJ and its variants, occur in some Slavic languages and
also possibly in some Turkic languages (mostly those spoken in countries
that split off from the old USSR and are going back to Romanised
characters).
(In Cyrillic, separate letters *do* exist for each of these in regional
variants that were used before the USSR split up. This is probably why
they carry over.)
LJ/lj is roughly equivalent to slash-l in Polish, BTW.
>01CA # LATIN CAPITAL LETTER NJ
>01CC # LATIN SMALL LETTER NJ
Used in some Slavic languages, and occasionally in various African
languages. (In Slavic languages, indicates a palatalised-N (similar to
n-acute in some Slavic languages; the "j" essentially means the same as
the "soft mark" in Cyrillic); in the African languages where this is an
actual character, indicates exactly what it says--an "nj" sound (like "ng"
only one doesn't touch one's palate). :)
>01DE # LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON
>01DF # LATIN SMALL LETTER A WITH DIAERESIS AND MACRON
If memory serves, used in Vietnamese (in this case, the macron is a tone
character) and in some transcription schemes for Native American
languages.
>01E4 # LATIN CAPITAL LETTER G WITH STROKE
>01E5 # LATIN SMALL LETTER G WITH STROKE
I've only seen this offhand in *some* transcription schemes for Native
American languages [this indicates *roughly* the same as g-caron; see
below] but it may occur in Turkic languages that are converting to Roman
characters.
>01E6 # LATIN CAPITAL LETTER G WITH CARON
>01E7 # LATIN SMALL LETTER G WITH CARON
Commonly used in Turkish and some other Turkic languages to indicate a
"hard G" sound. Also occurs, for the same sound, in some Native American
language transcription schemes.
>01E8 # LATIN CAPITAL LETTER K WITH CARON
>01E9 # LATIN SMALL LETTER K WITH CARON
Less common, but does occur in some Turkic languages; indicates a "hard K"
sound (like hard G--you say it in the back of your throat). Occurs in
some transcription schemes for Native American languages as well.
(As a minor aside--you will find many, MANY standards for transcription
and, in some cases, transliteration of Native American languages. These
vary from fitting the closest Roman equivalent, to using diacritical marks
for consonants that are "sort" of close [many languages have literally two
to four different ways you can pronounce a consonant sound where we might
have one in English, for example] to using unused characters to represent
sounds ["x" for "sh" and "c" for "soft ch" are rather common] to resorting
to the IPA when there's no really good way to represent it via Roman
characters. Hence my notes on this. :)
>01EE # LATIN CAPITAL LETTER EZH WITH CARON
>01EF # LATIN SMALL LETTER EZH WITH CARON
Possibly used in some Slavic languages and Slavic transliteration schemes.
Possibly occurs in Turkic languages. (Again, an "ezh-caron" equivalent
does occur in several local variants of Cyrillic used for "minority
languages" in the USSR.)
>01F1 # LATIN CAPITAL LETTER DZ
>01F3 # LATIN SMALL LETTER DZ
Commonly used in Slavic and Turkic languages; occurs in some Native
American languages as well (most notably the Na-Dene family, which
includes Dine' [Navaho]).
>01F4 # LATIN CAPITAL LETTER G WITH ACUTE
>01F5 # LATIN SMALL LETTER G WITH ACUTE
Fairly unusual, but does occur in some Native American and Slavic (and
possibly Turkic as well, depending on the country's Romanisation scheme)
languages. Usually indicates a palatalised g sound in the few places
where I've seen it.
>027C # LATIN SMALL LETTER R WITH LONG LEG
Fairly unusual; used in some Native American languages as an R-variant.
This is borrowed from the IPA, offhand. This also, occasionally, occurs
in transcription schemes for some African languages.
Some Turkic languages may use it; not sure (at least I've not *seen* any)
however.
>Do you know a good reason why any of these characters should
>go into a simple European character set?
Some of them I'm sort of puzzled on m'self. Some (like Y-dieresis and
Y-acute-dieresis, for example) I can see as they are used in languages
with a known, large audience on Usenet (for instance, Vietnamese-language
or Cymru-language newsgroups).
Some of them, I will frankly admit (namely, *all* the Greek characters
noted and, possibly, some of the other *unusual* letters like longleg-r
and k-macron, etc.) puzzle me why they're included. (As far as I know,
longleg-r only exists in a few Native American transcription schemes and
in some African-language transcription schemes; unless there is a large
Usenet population of folks wishing to type in Salish, I'm not sure why it
should be there. [If it is in there, we should go ahead and add upside-
down K/k, upside-down T/t, cedilla-H, Latin-omega-acute-dieresis,
Latin-chi, etc. and all the other IPA characters you *have* to import from
the IPA to write some of the languages of that area. :) And, of course,
import Latin capital-schwa and Latin small-schwa for our friends in
Azerbaijan; hell, let's just import the entire IPA and be done with it :)
Ah well...I'm sure the author will be glad to explain, in any case. :)
-moo
who, incidentally, still wants to know when the author will write the
terminal patch that will allow a VT100 terminal hooked up to an IBM 3090
mainframe to actually *read* these strange and ferlie characters, or pay
for the unis still using these beasts for student Internet access to
upgrade to nice happy spanking new DEC Alphas and upgrade everyone's
computer to a Pentium whilst they're at it :)
No diaereses or macrons in Vietnamese.
One needs:
one of <a>, <i>, <u>, <e>, <o>, <y>
with possibility of a circumflex on <a>,
or a horn on <o>,
or a horn on <u>,
or a circumflex on <o>,
or a circumflex on <e>
plus nothing,
or acute accent,
or grave accent,
or "curl", (sorry, do not know technical name for this)
or tilde,
or dot underneath (is there a technical name for this?)
(Not all of the above combinations will exist.)
(Optionally, a <2> or <z>-like hybrid of "curl" and tilde
may occur in the handwriting of southern Vietnamese
speakers who do not distinguish the two tones
marked by those diacritics.)
Thomas Chan
tc...@cornell.edu
>The long s might be from German Fraktur fonts which is unused
>since ~1945. This letter has certainly no equivalent in modern
>German roman/antiqua fonts and is certainly not needed to
>write German:
>
>017F # LATIN SMALL LETTER LONG S
While I agree that this letter is not needed in a basic European
character set your reasoning is quite wrong.
The long s was actually used in *both* Fraktur and Antiqua (i.e.
non-Fraktur) typefaces for centuries, and is completely unrelated to
any "Germanness". You should see lots of long s in any older English
(French, Italian, ...) book. The only difference is that Antiqua (or
"Latin") typefaces eventually dropped the long s while Fraktur
typefaces kept it to this day.
As for Fraktur going out of fashion in Germany by 1945... well, the
connection between Nazis and Fraktur is a common misconception.
Actually, the Nazi government *discouraged* use of Fraktur in 1940
because Hitler thought it outdated and contrary to his plans to
"modernise" Germany according to Nazi ideology.
As for Fraktur being "unused" today... several new Fraktur typefaces
have been designed during the past few decades by German designers.
If you go to any newspaper stand you'll see plenty of Fraktur
headlines on newspapers of any nationality. Station and street signs
are also frequently set in Fraktur. But I agree that Fraktur
typefaces are only being used as decorative fonts these days, not as
text fonts, which is the important criterion for this discussion.
--
Chris Nahr (cn...@hal9000.net, replace hal9000 with ibm to e-mail me)
Please don't e-mail me if you post! PGP key at wwwkeys.ch.pgp.net
> 017F # LATIN SMALL LETTER LONG S
It is useful for quoting most old English and German literature.
-- William Ehrich
Correction to myself: There's also the possibility of
a breve on <a>.
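With the breve added, the inventory described in the Vietnamese posting can be enumerated mechanically. The Python sketch below builds every base vowel plus tone mark combination from combining characters and lets NFC normalization compose them. Two identifications are assumptions on my part rather than statements from the posting: the "curl" is taken to be COMBINING HOOK ABOVE (U+0309) and the "dot underneath" to be COMBINING DOT BELOW (U+0323). The names BASES, TONES and letters are made up.

```python
import unicodedata

# Base vowels: a, a-breve, a-circumflex, e, e-circumflex, i,
# o, o-circumflex, o-horn, u, u-horn, y (built from combining marks).
BASES = ["a", "a\u0306", "a\u0302", "e", "e\u0302", "i",
         "o", "o\u0302", "o\u031b", "u", "u\u031b", "y"]

# Tone marks: none, acute, grave, hook above (assumed to be the "curl"),
# tilde, dot below (assumed to be the "dot underneath").
TONES = ["", "\u0301", "\u0300", "\u0309", "\u0303", "\u0323"]

letters = [unicodedata.normalize("NFC", base + tone)
           for base in BASES for tone in TONES]

print(len(letters))                       # 72
print(all(len(x) == 1 for x in letters))  # True
print(" ".join(letters[:12]))
```

This only enumerates the cross product of 12 bases and 6 tone marks; it makes no claim about which of the resulting letters occur in actual words. In current Unicode each of the 72 lowercase results happens to normalize to a single precomposed code point.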
> As for Fraktur going out of fashion in Germany by 1945... well, the
> connection between Nazis and Fraktur is a common misconception.
> Actually, the Nazi government *discouraged* use of Fraktur in 1940
> because Hitler thought it outdated and contrary to his plans to
> "modernise" Germany according to Nazi ideology.
I don't know if he ever stated that. However, on the 23rd of January 1941,
an official order of the German Nazi Party (Anordnung 2/41; Ordnungsziffer
111) abolished Fraktur and Schwabacher from all printed items, saying:
...It is ordered that from now on only the normal type is to be used
for all printed documents. As normal type, the antiqua type is meant.
The so-called gothic type (Fraktur) is not a german type but goes back
to the schwabacher jew-letters. This type has been strongly used in
Germany because Jews owned the printing works already since typography
was introduced, and later on the newspapers...
I don't know who did the translation (possibly Yannis Haralambous) but
it was accompanied by a photostat of the order. See TUGboat vol. 12 no.
1, March 1991.
--Paul
If you travel on the Barcelona metro you will find that there is a station
which appears to be named Paral-lel, so thick is the middle dot,
and this is not the only instance I've seen.
>* These characters are used in Danish and their inclusion in both
>Unicode and the WGL4 set was at the request of the Danish standards
>organization. My understanding is that there is some debate over the
>status of these characters in modern Danish. Some sources claim that
>they are archaic, others that they are orthographically correct and
>that to omit them is a mistake. I believe they should not be omitted
>from SECS without further research.
Of course, as I come from a neighbouring country, I cannot be taken
for an authority, but this much I can say: I have never seen them.
Then again, most Russian I have read had accented vowels, and there
appear to be no accented Cyrillic letters in Markus's set. But as
you might have guessed, I never came much further than my beginner's
textbook...
--
Erland Sommarskog, Stockholm, som...@algonet.se
This could have been my two cents worth, but alas the Swedish
government has decided that I am not to have any cents.
Thanks for your comments, they have been very useful.
> [I am also concerned that Markus' recommended mathematical set may be
> too extensive. Is this really a _basic_ mathematical subset, or
> something more?]
There is an international standard ISO 31-11 that defines large parts
of the mathematical notation that is commonly used all over the world.
Most of the characters that I have included are from ISO 31-11. I
tried to cover this standard entirely as far as this is possible
in a fixed-width font.
The actual list of math characters that I have included is appended
below. It contains a few remarks about why I think each character
should be covered. Comments welcome.
It is quite a comprehensive set of symbols, so I certainly would not
argue that the math collection should become any larger. I admit that
there might be a few less common symbols in it that are mostly
of concern to computer scientists, but after all, these are computer
character sets and I can well imagine that most of these symbols
will be used in source code comments etc.
0x80 0x20AC #EURO SIGN
0x81 #UNDEFINED
0x82 0x201A #SINGLE LOW-9 QUOTATION MARK
0x83 0x0192 #LATIN SMALL LETTER F WITH HOOK
0x84 0x201E #DOUBLE LOW-9 QUOTATION MARK
0x85 0x2026 #HORIZONTAL ELLIPSIS
0x86 0x2020 #DAGGER
0x87 0x2021 #DOUBLE DAGGER
0x88 0x02C6 #MODIFIER LETTER CIRCUMFLEX ACCENT
0x89 0x2030 #PER MILLE SIGN
0x8A 0x0160 #LATIN CAPITAL LETTER S WITH CARON
0x8B 0x2039 #SINGLE LEFT-POINTING ANGLE QUOTATION MARK
0x8C 0x0152 #LATIN CAPITAL LIGATURE OE
0x8D #UNDEFINED
0x8E 0x017D #LATIN CAPITAL LETTER Z WITH CARON
0x8F #UNDEFINED
0x90 #UNDEFINED
0x91 0x2018 #LEFT SINGLE QUOTATION MARK
0x92 0x2019 #RIGHT SINGLE QUOTATION MARK
0x93 0x201C #LEFT DOUBLE QUOTATION MARK
0x94 0x201D #RIGHT DOUBLE QUOTATION MARK
0x95 0x2022 #BULLET
0x96 0x2013 #EN DASH
0x97 0x2014 #EM DASH
0x98 0x02DC #SMALL TILDE
0x99 0x2122 #TRADE MARK SIGN
0x9A 0x0161 #LATIN SMALL LETTER S WITH CARON
0x9B 0x203A #SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
0x9C 0x0153 #LATIN SMALL LIGATURE OE
0x9D #UNDEFINED
0x9E 0x017E #LATIN SMALL LETTER Z WITH CARON
0x9F 0x0178 #LATIN CAPITAL LETTER Y WITH DIAERESIS
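A table like the one above can be cross-checked against a codec. The following is only a sketch: it assumes that the "cp1252" codec bundled with Python matches this listing, and the helper name cp1252_c1_mappings is made up for illustration.

```python
# Sketch: list how Python's bundled "cp1252" codec decodes bytes 0x80-0x9F.
# Assumption: the codec agrees with the table posted in this thread.
import unicodedata


def cp1252_c1_mappings():
    """Return {byte: code point or None} for 0x80..0x9F under the cp1252 codec."""
    table = {}
    for byte in range(0x80, 0xA0):
        try:
            table[byte] = ord(bytes([byte]).decode("cp1252"))
        except UnicodeDecodeError:
            table[byte] = None  # position left undefined by the codec
    return table


if __name__ == "__main__":
    for byte, cp in cp1252_c1_mappings().items():
        if cp is None:
            print(f"0x{byte:02X}        #UNDEFINED")
        else:
            print(f"0x{byte:02X} 0x{cp:04X} #{unicodedata.name(chr(cp))}")
```

Running it prints one line per byte position in the same layout as the table, so the two can be compared by eye or with diff.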
Most of them make perfect sense and are useful extensions; however,
I have no idea what the purpose of the following three is:
0x83 0x0192 #LATIN SMALL LETTER F WITH HOOK
0x88 0x02C6 #MODIFIER LETTER CIRCUMFLEX ACCENT
0x98 0x02DC #SMALL TILDE
Any ideas?
Probably the Dutch Gulden symbol.
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
> In article <35D84110...@cl.cam.ac.uk> Markus Kuhn <Marku...@cl.cam.ac.uk> writes:
> > 0x83 0x0192 #LATIN SMALL LETTER F WITH HOOK
>
> Probably the Dutch Gulden symbol.
Certainly the Dutch Guilder/Gulden/Florin symbol.
--Paul
Shall we still include it in VSECS (same for the Peseta sign from
CP437)? If we include Peseta and Gulden, then we would also
have to include the Franc and Lira symbols. All these currency
symbols are expected to be superseded by the Euro symbol from
mid-2002 on and would only be of historical value.
> Dik T. Winter wrote:
> > > 0x83 0x0192 #LATIN SMALL LETTER F WITH HOOK
> >
> > Probably the Dutch Gulden symbol.
>
> Shall we still include it in VSECS (same for the Peseta sign from
> CP437)? If we include Peseta and Gulden, then we would also
> have to include the Franc and Lira symbols.
What puzzles me is why Unicode added a Lira symbol. The lira symbol
is essentially identical to the pound symbol. A couple of Unicode
fonts I've seen give the pound one cross-stroke and the lira two
cross-strokes, but that's a matter of aesthetics and font design.
Even if there is some inherent national preference one way or the other
between UK and Italian typography, typesetters in each country will tend to
use the same glyph for both symbols (or, more usually, use the symbol
for their national currency and a letter for the other one to avoid
confusion).
--Paul
>>The following are claimed to be used in Welsh, but Welsh
>>native speakers who I asked claimed to have never seen them,
>>so I suspect they are historic characters that are not in
>>general use.
>>1E80 # LATIN CAPITAL LETTER W WITH GRAVE
>>1E81 # LATIN SMALL LETTER W WITH GRAVE
>>1E82 # LATIN CAPITAL LETTER W WITH ACUTE
>>1E83 # LATIN SMALL LETTER W WITH ACUTE
>>1E84 # LATIN CAPITAL LETTER W WITH DIAERESIS
>>1E85 # LATIN SMALL LETTER W WITH DIAERESIS
>>1EF2 # LATIN CAPITAL LETTER Y WITH GRAVE
>>1EF3 # LATIN SMALL LETTER Y WITH GRAVE
For the record, as provided to me by Andrew Hawke, assistant editor of
the University of Wales Dictionary of the Welsh Language:
Modern usage of the diacritics in Welsh is as follows:
The circumflex is used solely to indicate that a vowel is long
in a context in which it would normally be expected to be
short, e.g.:
gwa^n (he pierces) vs. gwan (weak)
gwe^n (a smile) vs. gwen (white (fem.))
pi^n (pine (wood, tree)) vs. pi`n (a pin)
co^r (a choir) vs. cor (a dwarf)
bu^m (I was (perfect)) vs. bum (five (mutated))
tw^r (a tower) vs. twr (a group)
y^m (we are) vs. ym (in (before m))
The diaeresis is used to separate vowels, as in English:
prosa"ig (prosaic)
cre"wr (creator)
copi"o (to copy)
tro"edigaeth (conversion)
du"wch (blackness)
Rebacay"ddiaeth (lit. Rebaccaism)
cyw"res (concubine)
The acute accent is used to indicate unexpected stress (i.e.
not on the penultimate):
casa'u (to hate)
case't (cassette)
ricri'wt (a recruit)
paraso'l (a parasol)
rebu'wc (a rebuke)
caridy'ms (riff-raff)
gw'raidd (manly)
[this last is on the penult, but is to distinguish it
from the word gwraidd (root), which is monosyllabic]
The grave accent is used to indicate that a vowel is short in
a context in which it would normally be expected to be long:
pa`s (a pass, permit) vs. pas (a cough)
sie`d (a shed) vs. sie^d/sied (escheat)
sgi`l (a skill) vs. sgi^l/sgil (following)
no`d (a nod) vs. nod (a target, an aim)
cu`l (a hut) vs. cul (narrow)
mw`g (a mug) vs. mwg (smoke (n.))
py`g (dirty) vs. pyg (pitch, tar)
Generally speaking, diacritics in Welsh cannot reasonably be
omitted as they are used either to show unusual stress, or to
differentiate between pairs of otherwise identical words with
different pronunciations. As such they are equally necessary
in upper- and lower-case forms.
The commonest diacritic is the circumflex, followed by the
acute and diaeresis probably about equally. The grave is rare,
but as more and more words are borrowed from English, and new
compounds coined for technical terms, its use will
undoubtedly increase.
To give a very rough indication, according to the headwords in
our (unfinished) dictionary (which we estimate will contain
about 84,500 entries), the number of accented keywords
(extrapolated to the expected finished size of the dictionary)
will be roughly:
circumflex: 2,000
diaeresis: 880
acute: 500
grave: 160
Clearly it would be a mistake to omit these diacritics from any
character set intended to support the Welsh language.
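The examples above use an ASCII convention in which the accent follows the vowel (^ circumflex, " diaeresis, ' acute, ` grave). As a rough sketch of which Unicode characters this implies, the convention can be converted with combining marks and NFC normalization. The function name and the rule "only convert a mark that directly follows a Welsh vowel" are assumptions made for this illustration; a real apostrophe after a vowel would be misread as an acute.

```python
# Sketch: turn the ASCII accent notation used in this post into Unicode.
# Assumption: a mark counts as an accent only directly after a Welsh vowel.
import unicodedata

MARKS = {
    "^": "\u0302",  # COMBINING CIRCUMFLEX ACCENT
    '"': "\u0308",  # COMBINING DIAERESIS
    "'": "\u0301",  # COMBINING ACUTE ACCENT
    "`": "\u0300",  # COMBINING GRAVE ACCENT
}
VOWELS = set("aeiouwyAEIOUWY")  # w and y are vowels in Welsh


def welsh_ascii_to_unicode(text):
    """Replace vowel+mark pairs by accented letters (NFC-composed)."""
    out = []
    for ch in text:
        if ch in MARKS and out and out[-1] in VOWELS:
            out.append(MARKS[ch])
        else:
            out.append(ch)
    return unicodedata.normalize("NFC", "".join(out))


if __name__ == "__main__":
    for word in ["gwa^n", "tw^r", "y^m", 'cyw"res', "gw'raidd", "mw`g", "py`g"]:
        converted = welsh_ascii_to_unicode(word)
        codes = " ".join(f"U+{ord(c):04X}" for c in converted if ord(c) > 0x7F)
        print(f"{word:10} {converted:10} {codes}")
```

For words such as mw`g, gw'raidd and py`g the output lands on the W-with-grave, W-with-acute and Y-with-grave code points (1E81, 1E83, 1EF3) from the list quoted at the top of this post.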
> The Windows standard character set CP1252 extends ISO 8859-1
> by the following 27 characters:
>
[...]
>
> Most of them make perfect sense and are useful extensions; however,
> I have no idea what the purpose of the following three is:
>
> 0x83 0x0192 #LATIN SMALL LETTER F WITH HOOK
This is the guilder sign. Unicode, for whatever reason, doesn't
include an actual guilder/florin sign, but the small f with hook looks
right. This mapping is an approximation. Both the Windows and
Macintosh character sets include the character, so its omission from
Unicode was a surprise to me.
> 0x88 0x02C6 #MODIFIER LETTER CIRCUMFLEX ACCENT
> 0x98 0x02DC #SMALL TILDE
These are to distinguish between the character and the accent. The
circumflex (shift-6 on most US keyboards) is now used for the literal
character (for TeX superscript, regexp inversion...), and so a
distinct character is needed for the diacritic. Similarly, the tilde
is now used for home directories or approximation; a smaller tilde is
needed for using as a diacritic.
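To make the distinction concrete, here is a small sketch that prints the Unicode names of the ASCII characters, the two CP1252 additions, and the combining marks. The combining marks (U+0302, U+0303) are brought in only for comparison and are not part of CP1252; the helper names describe and compose are invented for this example.

```python
# Sketch: compare ASCII ^ and ~ with the spacing and combining variants.
import unicodedata

CANDIDATES = ["\u005E", "\u02C6", "\u0302", "\u007E", "\u02DC", "\u0303"]


def describe(ch):
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def compose(base, mark):
    """Try to merge base letter and mark into one precomposed character (NFC)."""
    return unicodedata.normalize("NFC", base + mark)


if __name__ == "__main__":
    for ch in CANDIDATES:
        print(describe(ch))
    # Only the combining marks fuse with a preceding letter;
    # the spacing characters from CP1252 stay separate.
    print(len(compose("o", "\u0302")), len(compose("o", "\u02C6")))
```

The last line shows that normalization treats only the combining form as a diacritic on the preceding letter; the spacing forms remain stand-alone characters.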
-Chris
--
<!NOTATION SGML.Geek PUBLIC "-//Anonymous//NOTATION SGML Geek//EN">
<!ENTITY crism PUBLIC "-//O'Reilly//NONSGML Christopher R. Maden//EN"
"<URL>http://www.oreilly.com/people/staff/crism/ <TEL>+1.617.499.7487
<USMAIL>90 Sherman Street, Cambridge, MA 02140 USA" NDATA SGML.Geek>
Include them. It is going to be much more painful to omit them,
IMNSHO. However, my understanding is that the Franc symbol isn't in
common use; in fact, I've had French people tell me "what Franc
symbol", pretty much what I'd tell anyone who'd ask me what the symbol
for a Swedish Crown is.
-hpa
--
PGP: 2047/2A960705 BA 03 D3 2C 14 A8 A8 BD 1E DF FE 69 EE 35 BD 74
See http://www.zytor.com/~hpa/ for web page and full PGP public key
I am Bahá'í -- ask me about it or see http://www.bahai.org/
"To love another person is to see the face of God." -- Les Misérables
*BZZZZT* Wrong answer! The British Pound (Sterling) symbol has
exactly one cross-stroke, the Italian Lira symbol has two. You
*never* see the other way around, and they are not interchangeable. It
is not like the one or two strokes on the dollar sign!
Can anyone tell me if the Turkish Lira symbol (a sort of TL monogram
similar to the TM (trademark) symbol), which exists in the Teletext
character set but not in Unicode, is another currency symbol that is
not used in practice (or just an invention of the Teletext standards
authority)?
--
Stephen Baynes CEng MBCS Stephen...@soton.sc.philips.com
Philips Semiconductors Ltd
Southampton SO15 0DJ +44 (01703) 316431
United Kingdom My views are my own.
Do you use ISO8859-1? Yes if you see © as copyright, ÷ as division and ½ as 1/2.
We write IJ and ij because the glyph is not available. If you look at TeX you
see that it is added because we (the Dutch) wanted and needed it.
<smily on>
We do not have much of a culture, so don't take away the little we have left.
<smily off>
The Graphical Gnome (r...@ktibv.nl)
Sr. Software Engineer IT Department
-----------------------------------------
The Unofficial Delphi Developers FAQ
http://www.gnomehome.demon.nl/uddf/index.htm
The fact that most old typewriting systems could not cope with this
glyph does not mean it is deprecated in the Netherlands. It's a Dutch glyph,
and we are mighty proud of it!
At school, I was taught to write a pound sign with two strokes
(Scotland, mid-70's). I don't write it that way now, for with age
comes laziness.
Peter, there are ways of disagreeing with people that are not so
inflammatory. Always think it possible that you might be mistaken.
--
Stewart C. Russell, Glasgow, Scotland - scr...@enterprise.net
"Hang on... This is the real thing... The truth, my friend,
and nothing but the truth" - Mervyn Peake
http://homepages.enterprise.net/scruss/
> Followup to: <evale...@sktb.demon.co.uk>
> By author: p...@sktb.demon.co.uk
> In newsgroup: comp.std.internat
> >
> > What puzzles me is why Unicode added a Lira symbol. The lira symbol
> > is essentially identical to the pound symbol. A couple of Unicode
> > fonts I've seen give the pound one cross-stroke and the lira two
> > cross-strokes, but that's a matter of aesthetics and font design.
> > Even if there is some inherent national preference one way or the other
> > between UK and Italian typography, typesetters in each country will tend to
> > use the same glyph for both symbols (or, more usually, use the symbol
> > for their national currency and a letter for the other one to avoid
> > confusion).
> >
>
> *BZZZZT* Wrong answer! The British Pound (Sterling) symbol has
> exactly one cross-stroke, the Italian Lira symbol has two.
*BZZZZZZT*. Totally wrong answer! I'm in the UK and have been for
the 40-odd years of my life. My father was a printer, as was my brother,
grandfather, three uncles and various cousins (in case you're interested, my
father was a laserjet). I'm old enough to remember when the two
cross-stroke form was the norm in the UK. In fact I'm old enough that I was
*taught* that the two-stroke form should be used.
> You *never* see the other way around,
I admit that the one-stroke form predominates in the UK these days. But
that is a matter of typographic style, not an absolute rule. Either is
acceptable.
> and they are not interchangeable.
<panto>Oh yes they are</panto>.
Take a look at Whitaker's Almanack in the foreign currency section. It
uses the one-stroke form for pound, punt and lira.
> It is not like the one or two strokes on the dollar sign!
Ah, but it is. There may be national preferences involved and these may
change over time, but one- or two-cross stroke forms are entirely
interchangeable in the UK. Dunno about Italy.
--Paul
>You can also write oe, ue and ae. Does this mean that the o-umlaut, u-umlaut
>and a-umlaut should be removed? The same applies to the German sharp s: you
>can write it as ss.
These examples are hardly parallel. Apart from spacing considerations,
the IJ glyph is identical in appearance to an I followed by a J.
Obviously the same cannot be said of the o-umlaut which, in any case,
is also required as an o-diaeresis for non-Germanic languages. The
German eszett cannot, in standard German, be replaced by /ss/, as
there exist words which are semantically distinguished by the use of
/ss/ or eszett.
That said, I'm perfectly happy to endorse inclusion of the IJ and ij
digraphs as characters in any font I make for Dutch clients, if they
want them. Most Dutch type designers I know (and I know a _lot_) seem
quite ambivalent about this digraph.
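For what it is worth, the Unicode character database seems to reflect the same asymmetry. The sketch below assumes that compatibility normalization (NFKC) is a fair proxy for "can be replaced by plain letters"; that is an assumption of this example and not a statement about Dutch or German orthography. The function name fold_compat is made up.

```python
# Sketch: which of the disputed characters does NFKC replace by plain letters?
import unicodedata


def fold_compat(text):
    """Apply Unicode compatibility normalization (NFKC)."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    samples = {
        "IJ digraph (U+0132)": "\u0132",
        "ij digraph (U+0133)": "\u0133",
        "o-umlaut (U+00F6)": "\u00f6",
        "eszett (U+00DF)": "\u00df",
    }
    for label, ch in samples.items():
        folded = fold_compat(ch)
        status = "unchanged" if folded == ch else f"becomes {folded!r}"
        print(f"{label:22} {status}")
```

Under this normalization the IJ and ij digraph characters turn into the two-letter sequences, while o-umlaut and eszett are left alone.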
John Hudson