Thanks for your help.
fulio pen
> I wanted to display in my web pages the letter ' i ' but without the
> dot on the top of the vertical bar. I looked for it on the unicode
> chart but didn't find it. If anyone knows that, please let me know.
U+0131
If you need it for Turkish or Azerbaijani, you will need the dotted
capital I as well: U+0130
--
Helmut Richter
and here it is <ı>
>
> If you need it for Turkish or Azerbaijani, you will need the dotted
> capital I as well: U+0130
and here it is <İ>
>
> --
> Helmut Richter
Confirmed:
http://rudhar.com/lingtics/uniclnks.htm
http://unicode.org/charts/PDF/U0100.pdf
--
Ruud Harmsen, http://rudhar.com
BTW it is easy to find out such things via Character Map (if you are
using Windows). Placing the cursor on the character and clicking
once displays its code as well as its keystroke combination on the
bottom line.
It also gives a brief description of the character in question.
Ditto if you're using Linux. I wonder how one would find the HTML version?
http://rudhar.com/sfreview/unigglen.htm
Or by text search in http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
Here, search for "dotless" instead of "undotted" would have helped.
--
Helmut Richter
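For readers who would rather script the name search than grep UnicodeData.txt, here is a minimal Python sketch. It assumes the standard `unicodedata` module, which carries the same character names; the helper `find_by_name` and its `limit` parameter are made up for this illustration.

```python
import unicodedata

def find_by_name(fragment, limit=0x3000):
    """Return (U+XXXX, name) pairs whose Unicode name contains `fragment`.

    `limit` is an arbitrary cut-off for this sketch, not a Unicode concept.
    """
    fragment = fragment.upper()
    hits = []
    for cp in range(limit):
        name = unicodedata.name(chr(cp), "")  # "" for unnamed code points
        if fragment in name:
            hits.append((f"U+{cp:04X}", name))
    return hits

for code, name in find_by_name("dotless"):
    print(code, name)
```

As with the text search, the result depends on guessing the word used in the official name ("dotless" rather than "undotted").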
Thanks. I had guessed it might be something like that, but my first test
was to look up "ā" which I knew was HTML "&#257;" and when that turned
out to be "U+0101" I thought my guess was wrong.
My new guess, which I will test later, is that "0101" is to be converted
to "256 + 1." "U+0131" would presumably mean "&#[256 + (3 x 16) + 1];" =
"&#305;" (writing which leaves me no good place to put a period unless I
add a parenthetical remark).
> My new guess, which I will test later, is that "0101" is to be converted to
> "256 + 1." "U+0131" would presumably mean "&#[256 + (3 x 16) + 1];" =
> "&#305;" (writing which leaves me no good place to put a period unless I add a
> parenthetical remark).
... or "&#x131;"
Yes, the notation U+.... always means the character whose Unicode number
is .... in hexadecimal.
--
Helmut Richter
>> http://rudhar.com/sfreview/unigglen.htm
>
>Thanks. I had guessed it might be something like that, but my first test
>was to look up "ā" which I knew was HTML "&#257;" and when that turned
>out to be "U+0101" I thought my guess was wrong.
That's decimal versus hexadecimal. &#257; is equivalent to &#x101;. Most
modern browsers support both.
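The hex-to-decimal arithmetic can be checked with a few lines of Python. This is only a sketch; the helper name `char_refs` is invented here, and `html.unescape` from the standard library stands in for what a browser does with a numeric character reference.

```python
import html

def char_refs(u_notation):
    """Turn 'U+0131' into (character, decimal reference, hex reference)."""
    cp = int(u_notation[2:], 16)  # the digits after "U+" are hexadecimal
    return chr(cp), f"&#{cp};", f"&#x{cp:X};"

for code in ("U+0101", "U+0130", "U+0131"):
    print(code, *char_refs(code))

# Decimal and hexadecimal references decode to the same character.
print(html.unescape("&#257;") == html.unescape("&#x101;"))
```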
>My new guess, which I will test later, is that "0101" is to be converted
>to "256 + 1." "U+0131" would presumably mean "&#[256 + (3 x 16) + 1];" =
>"&#305;" (writing which leaves me no good place to put a period unless I
>add a parenthetical remark).
--
Ruud Harmsen, http://rudhar.com
Just wondering ---- there are several forms of "i" that can be produced
using "alt-codes", such as ì, í, î, and ï but I'm unable to find an alt-code
for a dotless i or for a dotted capital I. Are there alt-codes for these?
P.S. I'm sure I'm revealing my ignorance with this question, but I'll ask it
anyway: In terms of being able to produce the desired character, of what use
is it to know the Unicode number? One can copy and paste from the Character
Map, but is there a way to just input the Unicode number to produce the
character?
Thanks.
>Just wondering ---- there are several forms of "i" that can be produced
>using "alt-codes", such as ì, í, î, and ï but I'm unable to find an alt-code
>for a dotless i or for a dotted capital I. Are there alt-codes for these?
No, because alt-codes only support ISO-8859-1 or the almost identical
Windows-1252, and the Turkish letters are not in those.
>P.S. I'm sure I'm revealing my ignorance with this question, but I'll ask it
>anyway: In terms of being able to produce the desired character, of what use
>is it to know the Unicode number? One can copy and paste from the Character
>Map, but is there a way to just input the Unicode number to produce the
>character?
In HTML there is.
you need a special Turkish font to get dotted capital I and dotless i
into the extended ASCII set.
If you use MS Word, type the four- (or occasionally five-) digit
Unicode code on the regular keyboard and then type Alt-X.
If you use MS Word, you can assign your own keyboard shortcut to any
character at all, through the Insert Symbol panel.
Any font with the Unicode range "Latin Extended-A" accommodates all
the roman-written languages of Eastern Europe (including Turkish), as
well as Maltese and Esperanto. Most text fonts, these days, include
that range.
Thanks! (thanks also to Ruud and Yusuf)
I meant that if you want to obtain the Turkish letters within hexadecimal
FF, you need special Turkish fonts. This is a holdover from pre-Unicode
days. Some websites still use them.
That's mostly application-specific - e.g. if your application/OS supports
ISO 14755, you can. In rxvt-unicode, you press Control-Shift
and type the code while holding the keys. In GTK 2.x, you
press Control-Shift-U and then the code; in vim you press Ctrl-v u followed
by the code; in yudit, type u+ followed by the code. Unfortunately, there
is AFAIK no general support for this in X11, nor in Qt.
(GTK 2 is the most mature, also regarding various transliterating input methods;
in comparison, X11 xkb is extremely rigid and inflexible.)
--
-----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
-----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
>> Just wondering ---- there are several forms of "i" that can be produced
>> using "alt-codes", such as ì, í, î, and ï but I'm unable to find an alt-code
>> for a dotless i or for a dotted capital I. Are there alt-codes for these?
>
>you need a special Turkish font to get dotted capital I and dotless i
>into the extended ASCII set.
Modern computers (with Vista, Ubuntu etc.) have these included in
their pre-installed fonts as standard.
This is unrelated to the Alt-code method, because that input method
is based on an 8-bit encoding, in which the Turkish characters do not fit.
(Or maybe they do: ISO-8859-15 vs. ISO-8859-1)
http://czyborra.com/charsets/iso8859.html (See Latin9: no Turkish
there).
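Which 8-bit encodings have room for the two letters can be tested directly. The sketch below assumes Python's bundled codecs; the helper `fits` is a name made up for the example. ISO-8859-9 and Windows-1254 are included as the dedicated Turkish 8-bit sets, which is an addition to the encodings named above.

```python
TURKISH_I = "\u0130\u0131"  # dotted capital I and dotless small i

def fits(text, encoding):
    """True if every character of `text` exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

for enc in ("iso8859_1", "cp1252", "iso8859_15", "iso8859_9", "cp1254"):
    print(enc, fits(TURKISH_I, enc))
```

Latin-1, Windows-1252 and Latin-9 should all report False here, while the two Turkish code pages report True.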
>If you use MS Word, type the four- (or occasionally five-) digit
>Unicode code on the regular keyboard and then type Alt-X.
>
>If you use MS Word, you can assign your own keyboard shortcut to any
>character at all, through the Insert Symbol panel.
Yes, that works! (Tested in Word 2007 under Vista).
Interesting, I didn't know that, thanks for the tip.
It works the other way, too. Put the cursor after (or select) the
character you want to check and type Alt-X, and it shows you the
Unicode code. (If you select more than one character, nothing happens.)
What are all those things? Ways for computer snobs to avoid using
Windows or Mac OS? Wouldn't some sort of standardization be sensible?
>
> What are all those things? Ways for computer snobs to avoid using
> Windows or Mac OS?
Just some applications I use that conform to ISO 14755 - incidentally, all
of them also work on both Windows and Mac OS X (but not classic Mac OS).
> Wouldn't some sort of standardization be sensible?
Yes, that's what ISO 14755 is meant for... unfortunately, it is
not really widespread (except for the GTK 2 toolkit, ergo for all the
applications using it in a standard way - works nicely on all of X11,
Windows and Mac OS X).
Sadly, the _existing_ ISO standard has been ignored by Xorg, Microsoft
and Apple alike. Duh.
I prefer the following web pages when I search for a character:
1) http://demo.icu-project.org/icu-bin/ubrowse
Enter a search string like 'letter i' (case insensitive) and you will find
http://demo.icu-project.org/icu-bin/ubrowse?s=LETTER+I&cs=012C
İ U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
ı U+0131 LATIN SMALL LETTER DOTLESS I
2) http://www.eki.ee/letter/
It provides different ways to search for a character. One of them is by
language, e.g. Turkish (or Azerbaijani) in this case:
http://www.eki.ee/letter/chardata.cgi?lang=tr+Turkish&script=latin
decimal: &#304;
UTF-8 (c4, b0) İ
LATIN CAPITAL LETTER I WITH DOT ABOVE
found in languages: az [Azerbaijani]; tr [Turkish]; tt [Tatar];
decimal: &#305;
UTF-8 (c4, b1) ı
LATIN SMALL LETTER DOTLESS I
found in languages: az [Azerbaijani]; tr [Turkish]; tt [Tatar];
> Thanks for your help.
HTH
Helmut Wollmersdorfer
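The decimal values and UTF-8 byte pairs listed above can be reproduced in Python. A small sketch; `utf8_hex` is a hypothetical helper name, and the output format only imitates the listing.

```python
def utf8_hex(ch):
    """Return the UTF-8 bytes of `ch` as comma-separated hex, e.g. 'c4, b0'."""
    return ", ".join(f"{b:02x}" for b in ch.encode("utf-8"))

for ch in "\u0130\u0131":
    cp = ord(ch)
    print(f"U+{cp:04X}  decimal: {cp}  UTF-8 ({utf8_hex(ch)})  {ch}")
```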
>> Just wondering ---- there are several forms of "i" that can be produced
>> using "alt-codes", such as ì, í, î, and ï but I'm unable to find an alt-code
>> for a dotless i or for a dotted capital I. Are there alt-codes for these?
> you need a special Turkish font to get dotted capital I and dotless i
> into the extended ASCII set.
No - you are mixing up 'character encoding', 'font' and 'keyboard'.
For entering these characters you need a keyboard supporting these
characters. These keyboards could also be 'virtual', i.e. mapping the
keys (or some of them) to these characters. In Windows you can download
and install 'language packs' - AFAIK free of charge - supporting
different key-maps.
Helmut Wollmersdorfer
No, people using other OSs than Windows or Mac - or other programs -
have serious reasons to do so. I use Linux _and_ Windows. And I use a
dozen different editors for different purposes.
> Wouldn't some sort of standardization be sensible?
Yes, but unfortunately IBM, Apple and Microsoft didn't agree on a common
standard for GUIs (Graphical User Interfaces) in the early days.
Helmut Wollmersdorfer
But they -- and probably hundreds of governments -- did agree on
Unicode, which is the topic at hand.
Why is it using some sort of two-letter code (which can identify no
more than 676 languages) rather than the three-letter ISO language
codes (based on those developed by Ethnologue over half a century)?
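The 676 figure is simply 26 x 26. As a quick check of how much room each code length offers, assuming codes drawn from the 26 unaccented Latin letters (`code_space` is a name invented for this sketch):

```python
import string

def code_space(length, alphabet=string.ascii_lowercase):
    """Number of distinct codes of the given length over the alphabet."""
    return len(alphabet) ** length

print(code_space(2))  # upper bound for two-letter codes
print(code_space(3))  # upper bound for three-letter codes
```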
You don't need to download anything. Windows (XP, Vista, and 7) comes
with dozens of keyboards for conveniently typing just about any
language using any of the Unicode ranges. (With XP you need to have
your distribution disk at hand to add them.)
The "language packs" contain proofing tools -- hyphenation routines,
dictionaries, thesauruses, etc. -- for a wide variety of languages.
(Two or three of these come with each internationalized version of
Office; you can buy more for maybe USD30 each.)
And, you can install operating languages so that Windows or Office can
talk to you in any of 30 or so languages.
> You don't need to download anything. Windows (XP, Vista, and 7) comes
> with dozens of keyboards for conveniently typing just about any
> language using any of the Unicode ranges. (With XP you need to have
> your distribution disk at hand to add them.)
Yes, but maybe not _all_ of them. In XP I had to download and install an
'Indic' package - this was the most convenient way to get high quality
fonts for Indic scripts.
Helmut Wollmersdorfer
>> 2)http://www.eki.ee/letter/
[...]
>> found in languages: az [Azerbaijani]; tr [Turkish]; tt [Tatar];
> Why is it using some sort of two-letter code (which can identify no
> more than 676 languages) rather than the three-letter ISO language
> codes (based on those developed by Ethnologue over half a century)?
The two-letter ISO language codes do not conflict with the three-letter
ones. The above site http://www.eki.ee/letter/ also uses three-letter
codes e.g. haw [Hawaiian] if the language does not have a two-letter code.
IMHO two-letter language codes are more common to the average reader. If
you ask me the two-letter codes I will know them for most of the
European and other important or popular languages, but I will need to
look up the three-letter codes e.g. for cs [Czech].
Helmut Wollmersdorfer
Sorry, from context I assumed you mean the (missing) standardization of
key-combinations across GUIs.
Helmut Wollmersdorfer
OK, character encoding. Via the language packs, you could get "Turkish
Windows" or "Turkish ISO", which map the extra Turkish letters into the
extended ASCII set (not Unicode). The two are different. In the
"Internet Browser" these are accessed through the "View"
option, section "Encoding". Some websites are still using them rather than
Unicode.
most Indic scripts were included in my XP package, except Sinhala.
>
> Helmut Wollmersdorfer
And of course, there are two sets of widely used (for some definition of
"widely") three-letter codes: ISO 639-2 bibliographic and terminological,
(I cannot even remember for my own language which code of the two is in
what standard), fortunately ISO 639-3 will obsolete the two in due time.
But unfortunately they did agree on _several_ representations of Unicode.
Fortunately, the world seems to converge on UTF-8 for data exchange,
which is a good thing™.
Each manufacturer providing key combinations for the tens of thousands
of Unicode characters would be silly. (For Chinese alone about a dozen
input methods ship with Windows.)
What _is_ extremely annoying is that when I moved from Mac to PC I
found that FrameMaker doesn't use the same key combinations for the
accented letters on the two platforms -- and then I found that
different apps in Windows don't have a standard set of key
combinations.
I don't know which number goes with what scheme, but about two years
ago LINGUIST List carried requests to specialists in every area to
comment on and improve the Ethnologue codes that were about to become
the ISO list, and that happened, and it is. There shouldn't be more
than one.
If you want "high quality," you might have to pay for them.
> most Indic scripts were included in my XP package, except Sinhala.
Sinhala isn't a language of India. (These are marketing decisions, not
scientific decisions.)
>What _is_ extremely annoying is that when I moved from Mac to PC I
>found that FrameMaker doesn't use the same key combinations for the
>accented letters on the two platforms --
Perhaps true, I don't know.
>and then I found that
>different apps in Windows don't have a standard set of key
>combinations.
Not true, like I told you before. US International (or any other
keyboard layout of your preference) works the same in all Windows
applications.
>Peter T. Daniels <gram...@verizon.net> wrote:
>>
>> But they -- and probably hundreds of governments -- did agree on
>> Unicode, which is the topic at hand.
>
>But unfortunately they did agree on _several_ representations of Unicode.
>Fortunately, the world seems to converge on UTF-8 for data exchange,
>which is a good thing™.
UTF-8 is efficient for western scripts, Arabic, Russian, Hebrew etc.,
but inefficient for Indian scripts, Chinese, Japanese etc. But who
cares about a few extra bytes these days?
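To put rough numbers on "a few bytes extra", the sketch below compares UTF-8 with UTF-16 for one short word per script. The sample words (each meaning "language") and the helper `sizes` are chosen for this illustration only; real texts mix in spaces, digits and markup, which shifts the ratio.

```python
samples = {
    "English": "language",
    "Russian": "\u044f\u0437\u044b\u043a",
    "Hindi": "\u092d\u093e\u0937\u093e",
    "Chinese": "\u8bed\u8a00",
}

def sizes(text):
    """Return (code points, UTF-8 bytes, UTF-16 bytes without BOM)."""
    return len(text), len(text.encode("utf-8")), len(text.encode("utf-16-le"))

for script, word in samples.items():
    print(script, word, sizes(word))
```

Cyrillic comes out even between the two encodings, while Devanagari and Chinese take three bytes per character in UTF-8 against two in UTF-16.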
>Why is it using some sort of two-letter code (which can identify no
>more than 676 languages) rather than the three-letter ISO language
>codes (based on those developed by Ethnologue over half a century)?
The ISO standard comes in 2-letter and 3-letter versions.
Exactly. And there is BOCU and gzip, if you do :-)
And the characters that happen not to be on that keyboard but do have
(different) key assignments in Mac vs. Windows FrameMaker?
Why would anyone use the "legacy" codes when they can't cover even 20%
of languages?
[ISO 639 language codes]
> I don't know which number goes with what scheme,
ISO 639-1 alpha 2: 185
ISO 639-2 alpha 3: ~450
ISO 639-3 alpha 3: ~7700
> but about two years
> ago LINGUIST List carried requests to specialists in every area to
> comment on and improve the Ethnologue codes that were about to become
> the ISO list, and that happened, and it is. There shouldn't be more
> than one.
Yes, the registration authority for ISO 639-3 is sil.org, for -1 and -2
still loc.gov (Library Of Congress).
> There shouldn't be more than one.
ACK. But there are myriads of applications and standards in the wild
still using ISO 639-1 and -2. It will need time (10-20 years) to switch
to the new standard.
Helmut Wollmersdorfer
> Why would anyone use the "legacy" codes when they can't cover even 20%
> of languages?
Do you use ISO 639 language codes yourself? If yes, how many different
language codes do you use? If yes, did you use them before 2007 (1st
edition of ISO 639-3)?
I myself use them systematically since ~2002 in corpus analysis, tagging
the results, covering ~300 languages at the moment. IMHO there are not
much more _written_ languages available.
The change to ISO 639-3 is on my task list, but it isn't that easy.
Apart from (minor) technical problems the change in taxonomy is an
unknown risk (in my case).
If the taxonomy of sil.org and/or Ethnologue is useful for linguists I
don't know, but expect so. What's your opinion and experience?
BTW: Do you use ISO 15924 'Codes for the representation of names of
scripts'?
http://www.unicode.org/iso15924/
Helmut Wollmersdorfer
> But unfortunately they did agree on _several_ representations of Unicode.
That's no problem. The few encodings for Unicode are well standardized
and nearly all environments can handle them.
Helmut Wollmersdorfer
Mon, 21 Dec 2009 20:53:18 +0000 (UTC):
garabik-ne...@kassiopeia.juls.savba.sk: in sci.lang:
>Exactly. And there is BOCU and gzip, if you do :-)
BOCU is new to me. Thanks for telling me. More info:
http://en.wikipedia.org/wiki/Binary_Ordered_Compression_for_Unicode
I only ever deal with maybe 7 languages and don't expect this to
increase during my lifetime. I know these few codes by heart: nl, en,
de, pt, es, fr, eo.
If I ever feel the need to discuss, say, a certain dialect of a San
language, I can look up the code and use that.
> and don't expect this to
> increase during my lifetime. I know these few codes by heart: nl, en,
> de, pt, es, fr, eo.
Counting doesn't add to the precision?
<scnr>
Joachim
No, and no.
> I myself use them systematically since ~2002 in corpus analysis, tagging
> the results, covering ~300 languages at the moment. IMHO there are not
> much more _written_ languages available.
Why would _written_ languages be a useful natural class?
> The change to ISO 639-3 is on my task list, but it isn't that easy.
> Apart from (minor) technical problems the change in taxonomy is an
> unknown risk (in my case).
>
> If the taxonomy of sil.org and/or Ethnologue is useful for linguists I
> don't know, but expect so. What's your opinion and experience?
They tend to be conservative in their classification (they didn't go
all Greenberg) and very liberal in enumerating languages vs. dialects.
> BTW: Do you use ISO 15924 'Codes for the representation of names of
> scripts'?http://www.unicode.org/iso15924/
Nope, never heard of it. What would I use it for?
The list accessed from that page appears to be a list of Unicode
ranges, which is far from equivalent to a list of scripts.
>> I myself use them systematically since ~2002 in corpus analysis, tagging
>> the results, covering ~300 languages at the moment. IMHO there are not
>> much more _written_ languages available.
> Why would _written_ languages be a useful natural class?
Natural? No, it's my decision that _I_ focus on written ones.
>> BTW: Do you use ISO 15924 'Codes for the representation of names of
>> scripts'?http://www.unicode.org/iso15924/
> Nope, never heard of it. What would I use it for?
Don't know. Standards are very useful and save time - for authors and
readers. And a topic like 'writing systems of languages' has many
related standards. But if you don't know them you cannot use them.
> The list accessed from that page appears to be a list of Unicode
> ranges,
No.
> which is far from equivalent to a list of scripts.
IMHO they did their lessons:
http://www.unicode.org/iso15924/standard/index.html#bibliography
Helmut Wollmersdorfer
Are you saying that's not what it appears to me to be?
> > which is far from equivalent to a list of scripts.
>
> IMHO they did their lessons:http://www.unicode.org/iso15924/standard/index.html#bibliography
That list confirms how amateurish the approach is -- most of those
works are either general linguistics encyclopedias or long-outdated
and/or popular surveys of writing systems.
I need do no more than point to the utter mess they made of the
definition of the term "abugida" -- which is the reason, it seems to
me, why "abjad" has been widely adopted and used but "abugida" has
not.
See also the battle that raged for a decade over the coding of
cuneiform. _No one_ is happy with the result, especially the
Assyriologists who are supposed to use the thing.
And what happened to Coptic.
One doesn't usually find oneself needing to indicate the language in
which a document written in an unwritten language is written.
Note that the two-letter codes don't even cover Cantonese.
>> I myself use them systematically since ~2002 in corpus analysis,
>> tagging the results, covering ~300 languages at the moment.
[...]
> Note that the two-letter codes don't even cover Cantonese.
'them' = ISO 639-1 _and_ ISO 639-2. How could someone tag 300 different
languages only with 154 different two-letter codes? SCNR;-)
Helmut Wollmersdorfer
>>> I myself use them systematically since ~2002 in corpus analysis, tagging
>>> the results, covering ~300 languages at the moment. IMHO there are not
>>> much more _written_ languages available.
>>
>> Why would _written_ languages be a useful natural class?
>
>One doesn't usually find oneself needing to indicate the language in
>which a document written in an unwritten language is written.
ROTFL!
>>>> BTW: Do you use ISO 15924 'Codes for the representation of names of
>>>> scripts'?http://www.unicode.org/iso15924/
>>> Nope, never heard of it. What would I use it for?
>> Don't know. Standards are very useful and save time - for authors and
>> readers. And a topic like 'writing systems of languages' has many
>> related standards. But if you don't know them you cannot use them.
>>> The list accessed from that page appears to be a list of Unicode
>>> ranges,
>> No.
> Are you saying that's not what it appears to me to be?
'Block' (= Unicode range) is a different property of a character than
'Script'. Not every character in the Block=Basic_Latin (alias ASCII) has
the property-value Script=Latin. And characters with Script=Latin are
spread over many Blocks. The mismatch of Block with Script is a common
mistake of Unicode beginners.
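A rough way to see the difference in Python. Note the assumption: the standard `unicodedata` module does not expose the Script property (that lives in the UCD file Scripts.txt), so the hypothetical helper `looks_latin` uses "the character name mentions LATIN" as a crude stand-in. It is good enough for these samples but is not the real property.

```python
import unicodedata

BASIC_LATIN = range(0x0000, 0x0080)  # a block is just a contiguous code range

def looks_latin(ch):
    """Crude proxy for Script=Latin: the character name contains 'LATIN'."""
    return "LATIN" in unicodedata.name(ch, "")

# Inside the Basic Latin block, only the letters look Latin.
latin_in_block = [chr(cp) for cp in BASIC_LATIN if looks_latin(chr(cp))]
print(len(BASIC_LATIN), len(latin_in_block))

# Latin letters that live in other blocks.
outside = "\u00e9\u0131\u1e02\uff21"
for ch in outside:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), ord(ch) in BASIC_LATIN)
```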
>>> which is far from equivalent to a list of scripts.
>> IMHO they did their lessons:http://www.unicode.org/iso15924/standard/index.html#bibliography
> That list confirms how amateurish the approach is -- most of those
> works are either general linguistics encyclopedias or long-outdated
> and/or popular surveys of writing systems.
Sounds good for me - based on stable and agreed knowledge.
> I need do no more than point to the utter mess they made of the
> definition of the term "abugida" -- which is the reason, it seems to
> me, why "abjad" has been widely adopted and used but "abugida" has
> not.
Has this an impact on the normative part? If yes, which one?
> See also the battle that raged for a decade over the coding of
> cuneiform. _No one_ is happy with the result, especially the
> Assyriologists who are supposed to use the thing.
> And what happened to Coptic.
That's Unicode and not script codes. Nearly everyone using Unicode
seriously dislikes some details. E.g. I have the problem that letters
with overlays cannot be decomposed by policy.
Helmut Wollmersdorfer
> One doesn't usually find oneself needing to indicate the language in
> which a document written in an unwritten language is written.
YMMD
Helmut Wollmersdorfer
You don't have a problem with either "outdated" (which should need no
elaboration) or "popular" (which at the very best means
'oversimplified' and more often means highly inaccurate)? I reviewed
a batch of them in Sino-Platonic Papers (which I just learned have
been put online: http://sino-platonic.org/complete/spp098_book_reviews.pdf )
and some others incidentally in the Gragg Festschrift:
http://oi.uchicago.edu/pdf/saoc60.pdf
> > I need do no more than point to the utter mess they made of the
> > definition of the term "abugida" -- which is the reason, it seems to
> > me, why "abjad" has been widely adopted and used but "abugida" has
> > not.
>
> Has this an impact on the normative part? If yes, which one?
Sorry, what do you mean by "normative part"? It was in their Glossary,
which evidently is widely consulted.
> > See also the battle that raged for a decade over the coding of
> > cuneiform. _No one_ is happy with the result, especially the
> > Assyriologists who are supposed to use the thing.
> > And what happened to Coptic.
>
> That's Unicode and not script codes. Nearly everyone using Unicode
> seriously dislikes some details. E.g. I have the problem that letters
> with overlays cannot be decomposed by policy.
I don't know what that means ...
Unless one is a linguist, to whom it usually doesn't matter whether a
language has a written form or not. Even today, most linguistic
analysis makes use of transcriptions, not of sound recordings.
Sinhala and everything else was automatically included when Vista was
installed on another computer of mine.
Yes, but you have to inform somehow the receiving end about the encoding,
and now we have to use protocol that has this possibility (i.e. mime,
content-type headers etc) - or even worse, implement handshake to agree on
common encoding.
If we could reasonably expect everything to be in (say) UTF-8, life would be
a lot simpler.
>>
>> One doesn't usually find oneself needing to indicate the language in
>> which a document written in an unwritten language is written.-
>
> Unless one is a linguist, to whom it usually doesn't matter whether a
> language has a written form or not. Even today, most linguistic
> analysis makes use of transcriptions, not of sound recordings.
...unless you are a corpus linguist, and your task is to create/analyse
(huge) corpora of written language... then, an unwritten language suddenly
becomes much less interesting (again, until you start working with spoken
corpora...)
>If we could reasonably expect everything to be in (say) UTF-8,
>life would be a lot simpler.
I think in newer types of HTML and XML, the default is indeed UTF8,
where earlier this was ISO8859-1. Details may vary, though.
[Unicode]
>>> I need do no more than point to the utter mess they made of the
>>> definition of the term "abugida" -- which is the reason, it seems to
>>> me, why "abjad" has been widely adopted and used but "abugida" has
>>> not.
>> Has this an impact on the normative part? If yes, which one?
> Sorry, what do you mean by "normative part"?
http://www.unicode.org/glossary/#N
| Normative. Required for conformance with the Unicode Standard.
> It was in their Glossary,
> which evidently is widely consulted.
AFAIK "abugida" is not used in the UCD (Unicode Character Database).
A better example would be "ideograph(ic)" which is used in identifiers
(=names) of blocks, characters and properties.
http://www.unicode.org/glossary/#I
----quote----
Ideograph. (1) Any symbol that primarily denotes an idea (or meaning) in
contrast to a sound (or pronunciation)—for example, a symbol showing a
telephone. (2) An English term commonly used to refer to Han characters,
equivalent to the borrowings hànzì, kanji, and hanja.
----end of quote----
Maybe 'CJK', 'Han', 'Sinograph' would be better names. But they chose
'(CJK) Ideograph' at some point in the ~20 year history of Unicode. It
cannot be changed now. But it doesn't matter, because it's just a name. A name
like 'krixikraxi' or 'blahblah' will also work.
>>> See also the battle that raged for a decade over the coding of
>>> cuneiform. _No one_ is happy with the result, especially the
>>> Assyriologists who are supposed to use the thing.
>>> And what happened to Coptic.
>> That's Unicode and not script codes. Nearly everyone using Unicode
>> seriously dislikes some details. E.g. I have the problem that letters
>> with overlays cannot be decomposed by policy.
> I don't know what that means ...
An overlay is a modification of a base character - like an accent mark
or diacritic modifies a base character. But an overlay is not placed
above, below or to the side of a character - it's overlayed.
Typical examples are Latin letters 'WITH STROKE'
http://demo.icu-project.org/icu-bin/ubrowse?s=WITH+STROKE
e.g.
ø U+00F8 LATIN SMALL LETTER O WITH STROKE
It's against the policy of Unicode to provide decomposition for overlays.
Helmut Wollmersdorfer
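The missing decomposition is visible from Python's standard `unicodedata` module. A small sketch comparing an accented letter, the Turkish dotted capital I, and the stroked o; the variable name `samples` is just for the example.

```python
import unicodedata

samples = "\u00e9\u0130\u00f8"  # e with acute, I with dot above, o with stroke

for ch in samples:
    decomp = unicodedata.decomposition(ch)      # "" when none is defined
    nfd = unicodedata.normalize("NFD", ch)      # canonical decomposition
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), repr(decomp), len(nfd))
```

The first two split into a base letter plus a combining mark; the stroked o stays a single code point under NFD.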
>> Unless one is a linguist, to whom it usually doesn't matter whether a
>> language has a written form or not. Even today, most linguistic
>> analysis makes use of transcriptions, not of sound recordings.
> ...unless you are a corpus linguist, and your task is to create/analyse
> (huge) corpora of written language... then, an unwritten language suddenly
> becomes much less interesting (again, until you start working with spoken
> corpora...)
Exactly.
Helmut Wollmersdorfer
>> That's no problem. The few encodings for Unicode are well standardized
>> and nearly all environments can handle them.
> Yes, but you have to inform somehow the receiving end about the encoding,
> and now we have to use protocol that has this possibility (i.e. mime,
> content-type headers etc) - or even worse, implement handshake to agree on
> common encoding.
The official Unicode encodings can be detected.
We had this problem at the specification of Perl 6 which 'is written in
Unicode'. Source files cannot use content-type headers or handshake, but
it was shown that no special tag or so is needed.
Helmut Wollmersdorfer
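One common way to detect the encoding without any header is to look for a byte order mark and otherwise test whether the bytes are valid UTF-8. The sketch below shows that generic heuristic in Python; it is not claimed to be what the Perl 6 specification does, and `guess_unicode_encoding` is a hypothetical name.

```python
import codecs

# Check the 4-byte marks first: the UTF-32-LE mark begins with the UTF-16-LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def guess_unicode_encoding(data):
    """Return a Python codec name for `data`, or None if no guess fits."""
    for bom, codec in BOMS:
        if data.startswith(bom):
            return codec  # these codecs consume the mark while decoding
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return None

raw = "\u0130stanbul".encode("utf-16")
codec = guess_unicode_encoding(raw)
print(codec, raw.decode(codec))
```

Text in UTF-16 or UTF-32 without a mark is not covered by this sketch and would need further heuristics.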
From a marketing point of view, that makes sense. First (XP) they did
the languages of the country with many customers. Then (Vista) they
did the language of a smaller nearby country with fewer customers.
University College Dublin or University of California - Davis
> A better example would be "ideograph(ic)" which is used in identifiers
> (=names) of blocks, characters and properties.
Which has to be one of the most egregiously stupid practices in the
entire field. Gelb's book -- the foundation of the modern study of
writing systems -- was published in NINETEEN-FIFTY-TWO.
> http://www.unicode.org/glossary/#I
> ----quote----
> Ideograph. (1) Any symbol that primarily denotes an idea (or meaning) in
> contrast to a sound (or pronunciation)—for example, a symbol showing a
> telephone. (2) An English term commonly used to refer to Han characters,
> equivalent to the borrowings hànzì, kanji, and hanja.
> ----end of quote----
>
> Maybe 'CJK', 'Han', 'Sinograph' would be better names. But they chose
> '(CJK) Ideograph' at some point in the ~20 year history of Unicode. It
> cannot be changed now. But it doesn't matter, because it's just a name. A name
> like 'krixikraxi' or 'blahblah' will also work.
It was idiotic to use a name as explicitly misleading as "ideograph"
for CJ characters.
> >>> See also the battle that raged for a decade over the coding of
> >>> cuneiform. _No one_ is happy with the result, especially the
> >>> Assyriologists who are supposed to use the thing.
> >>> And what happened to Coptic.
> >> That's Unicode and not script codes. Nearly everyone using Unicode
> >> seriously dislikes some details. E.g. I have the problem that letters
> >> with overlays cannot be decomposed by policy.
> > I don't know what that means ...
>
> An overlay is a modification of a base character - like an accent mark
> or diacritic modifies a base character. But an overlay is not placed
> above, below or to the side of a character - it's overlayed.
Do your characters work in three dimensions? Above, below, and beside
make sense in the two-dimensional world of characters on paper or
screen. "Overlay" doesn't. An overlay is a see-through sheet of
writing material (tracing paper; acetate) that goes on top of the base
drawing for adding _layers_ of information (such as the electrical or
plumbing diagram of the building).
> Typical examples are Latin letters 'WITH STROKE'
> http://demo.icu-project.org/icu-bin/ubrowse?s=WITH+STROKE
> e.g.
>
> ø U+00F8 LATIN SMALL LETTER O WITH STROKE
>
> It's against the policy of Unicode to provide decomposition for overlays.
Fortunately, there are ranges for "Combining Diacritics" that handle
accented letters that they didn't happen to think of when they were
drawing up the inventories of the various Latin (also a bad name)
ranges.
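For anyone who wants to check the two cases above against the Unicode Character Database, here is a minimal sketch in Python. It uses only the standard unicodedata module; the helper name show() is made up for illustration and is not part of any library.

```python
import unicodedata

def show(ch):
    """Return (character name, decomposition mapping as a list of U+ codes)."""
    decomp = unicodedata.decomposition(ch)   # empty string if there is none
    return unicodedata.name(ch), ["U+" + cp for cp in decomp.split()]

# An accented letter decomposes into base letter + combining diacritic ...
print(show("\u00e9"))   # LATIN SMALL LETTER E WITH ACUTE
# ... but a letter with an overlaid stroke has no decomposition.
print(show("\u00f8"))   # LATIN SMALL LETTER O WITH STROKE

# A combining diacritic also covers letters without a precomposed code point:
x_macron = "x\u0304"    # x followed by COMBINING MACRON
print(len(unicodedata.normalize("NFC", x_macron)))   # stays two code points
```

The last line is the point about the "Combining Diacritics" range: normalization cannot compose x + macron into one character, because no such character was ever encoded, yet the sequence is still perfectly valid text.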
Again if you are working with a spoken corpus, you first transcribe it.
>> Exactly. And there is BOCU and gzip, if you do :-)
>
> BOCU is new to me. Thanks for telling me. More info:
> http://en.wikipedia.org/wiki/Binary_Ordered_Compression_for_Unicode
Here is more, from the Unicode list:
http://www.unicode.org/mail-arch/unicode-ml/y2009-m12/0099.html
Hans
> I think in newer types of HTML and XML, the default is indeed UTF8,
> where earlier this was ISO8859-1. Details may vary, though.
For HTML on the web, the relevant specifications are not interworking.
IETF's HTTP 1.1 protocol defines the default character encoding for all
MIME types of type text/* to be ISO-8859-1. Yet browsers conforming to
the W3C's most recent HTML specification, the HTML 4.01 Recommendation,
must not assume any default.
Both these specifications are ten years old, and so they may not reflect
what browsers actually do nowadays. HTTP 1.1 is a Draft Standard anyway,
and there seems to be some work to revise it. HTML 5 looks as if it will
provide an algorithm to determine the character encoding, with a default
value that is either implementation- or user-specific.
--
John
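To make the HTTP 1.1 rule above concrete, here is a rough sketch in Python of "use the declared charset, else fall back to ISO-8859-1". The function name declared_charset() and the constant HTTP_DEFAULT are invented for this example. It only parses a Content-Type value and makes no claim about what any actual browser does.

```python
from email.message import Message

HTTP_DEFAULT = "iso-8859-1"   # default for text/* under HTTP 1.1, as described above

def declared_charset(content_type, default=HTTP_DEFAULT):
    """Return the charset parameter of a Content-Type value, or the default."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the value; failobj is used if absent
    return msg.get_content_charset(failobj=default)

print(declared_charset("text/html; charset=UTF-8"))
print(declared_charset("text/html"))
print(declared_charset("text/html", default=None))
```

Passing default=None corresponds to the HTML 4.01 position of assuming nothing. The caller would then have to look elsewhere, for instance at a meta declaration in the document itself.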
>It was idiotic to use a name as explicitly misleading as "ideograph"
>for CJ characters.
What does it matter? -- if they can be represented in Unicode? That's
what it's all about. What they are called does not make any
difference.
If someone gives me a bike and insists on calling it yrestrievotsh,
that is an idiotic name. But if the bike works and I can ride on it,
it is still useful, no matter its strange name. (Should I have used
the non-word "irregardless" here? Whatever!)
There's nothing wrong with "yrestrievotsh," especially if it has an
etymology in the language it was borrowed from that clarifies its
relationship to bikes.
"Ideograph," however, has "idea" in it, thus misleading people like
LSD for generations.
> "Ideograph," however, has "idea" in it, thus misleading people like
> LSD for generations.
I agree that other terms may be less misleading, but "ideograph" is not
wrong. (And I cannot see how LSD may understand the function of the script
wrong which he has used for his whole lifetime -- more probably he just
does not use the same terms you would have used.)
Quite generally, the meaning of compounds is not uniquely determined by
their components. A chocolate biscuit has another relationship to
chocolate than a dog biscuit to a dog or a ship biscuit to a ship. Hence,
the term "ideograph" does *not* unambiguously mean "graphic symbol
denoting an idea" but means no more than "graphical symbol having anything
to do with ideas". And this is not wrong: while the CJ symbols denote
words/morphemes and not ideas, most of them do it by a reference to an
idea of the meaning: a few directly referring to the meaning (mere
pictures like "man" or "tree"; symbolic pictures like "East" (the sun in a
tree) or "peace" (a woman having a roof above her head)), and nearly all
others by a signific, meaning-related portion of the character which
serves to disambiguate near-homophones. And the fact that the meaning of
the morpheme has played a rôle for the design of the character is what
indeed distinguishes this script from most others.
The word "ideograph" should only mislead those who are misled by "dog
biscuit" as well.
--
Helmut Richter
That's like saying Palin's "death panels" is a fine expression to use
in discussing health insurance reform, because the legislation deals
with end-of-life issues, and it sets up panels.
As Dwight Bolinger so memorably pointed out, language is a "loaded
weapon." Words have both meanings and connotations.
> Quite generally, the meaning of compounds is not uniquely determined by
> their components. A chocolate biscuit has another relationship to
> chocolate than a dog biscuit to a dog or a ship biscuit to a ship. Hence,
> the term "ideograph" does *not* unambiguously mean "graphic symbol
> denoting an idea" but means no more than "graphical symbol having anything
> to do with ideas". And this is not wrong: while the CJ symbols denote
> words/morphemes and not ideas, most of them do it by a reference to an
> idea of the meaning: a few directly referring to the meaning (mere
> pictures like "man" or "tree"; symbolic pictures like "East" (the sun in a
As Bill Boltz still finds it necessary to point out on every occasion,
if you didn't know the history of any modern simple character that
originated as a pictogram, you wouldn't know what it was a picture of.
> tree) or "peace" (a woman having a roof above her head)), and nearly all
That's one of the old wives' tales. It's semantic + phonetic just like
almost every other character in the Chinese inventory.
> others by a signific, meaning-related portion of the character which
> serves to disambiguate near-homophones. And the fact that the meaning of
> the morpheme has played a rôle for the design of the character is what
> indeed distinguishes this script from most others.
>
> The word "ideograph" should only mislead those who are misled by "dog
> biscuit" as well.
"Dog biscuits" are in the everyday experience of most persons in
Western civilization. Morphographic scripts are not.
'Wrong' is precisely what it is. And 'wrong' becomes 'stupid' when 1) it's
been pointed out countless times why it is wrong and 2) there is no need at
all to use the wrong term when 2a) there are perfectly good terms to use to
convey the meaning they intend and 2b) the term has its own, different,
applicability in this context.
> (And I cannot see how LSD may understand the function of the script
> wrong which he has used for his whole lifetime -- more probably he just
> does not use the same terms you would have used.)
LSD using a script doesn't mean LSD can analyze it correctly. Otherwise we'd
all be grammer geniouses, no? And you're crediting LSD with more honesty
than is in him.
For example, he has simply dropped out of the discussion when his sun/week
example backfired - it turned out that both words have synonyms, and each
synonym has its own, non-interchangeable characters, directly contradicting
the ideographic nature he was trying to give them (conveniently, he had
ignored previous questions on synonyms, probably thinking no one could
muster the sinological ability to look that up, but Wiktionary can do wonders).
> Quite generally, the meaning of compounds is not uniquely determined by
> their components. A chocolate biscuit has another relationship to
> chocolate than a dog biscuit to a dog or a ship biscuit to a ship. Hence,
> the term "ideograph" does *not* unambiguously mean "graphic symbol
> denoting an idea" but means no more than "graphical symbol having anything
> to do with ideas".
That's so vague as to be useless; from another perspective, there is no use
for such a referent.
> And this is not wrong: while the CJ symbols denote
> words/morphemes and not ideas, most of them do it by a reference to an
> idea of the meaning: a few directly referring to the meaning (mere
> pictures like "man" or "tree"; symbolic pictures like "East" (the sun in a
> tree) or "peace" (a woman having a roof above her head)), and nearly all
> others by a signific, meaning-related portion of the character which
> serves to disambiguate near-homophones. And the fact that the meaning of
> the morpheme has played a rôle for the design of the character is what
> indeed distinguishes this script from most others.
Not from most other historical scripts (hence excluding modern creations),
which can trace their origins to pictographs. Believe me, no one can make
any sense of the pictorial origin of Chinese characters without first
studying the language. Just as no one will guess that A is a bull's head
upside down. What you can claim is that Chinese characters are closely
associated with a morpheme, whereas Latin characters are closely associated
with a sound. But that's why Chinese characters are classified as logograms,
whereas Latin characters are classified as phonograms.
Why is it so hard for this to sink in?
Wikipedia explains it surprisingly well*:
'Logograms are commonly known also as "ideograms" or "hieroglyphics", which
can also be called "hieroglyphs". Strictly speaking, however, ideograms
represent ideas directly rather than words and morphemes, and none of the
logographic systems described here are truly ideographic.'
(Lest someone complain about the sloppy use of the hieroglyph- words,
that's the point: use of 'ideogram' here is just as sloppy and deserves to
be tainted by the association.)
(*) 'Well', of course, for those interested in learning. For those
interested in being obnoxious (a common sport hereabouts), something more
thorough is needed, not to convince them, which is impossible with people
impervious to reason, but to leave them without arguments. Which is hardly
an interesting occupation.
Well, that's just an old school teacher's tale.
It probably started as a picture of something like a bindle stick, which
would be shouldered by two porters walking one behind the other. The
"idea" may have been that of a stick going all the way through or along
from one point to another. Hard to see that "idea" in the "idea" of "east."
Bart Mathias
And why would "sun in a tree" say 'east' rather than 'west' anyway?
> Helmut Richter wrote (24-12-2009 13:08):
> > Quite generally, the meaning of compounds is not uniquely determined by
> > their components. A chocolate biscuit has another relationship to
> > chocolate than a dog biscuit to a dog or a ship biscuit to a ship. Hence,
> > the term "ideograph" does *not* unambiguously mean "graphic symbol
> > denoting an idea" but means no more than "graphical symbol having anything
> > to do with ideas".
>
> That's so vague as to be useless; from another perspective, there is no use
> for such a referent.
Yes, this is how words function; from a word, one can *never* conclude its
meaning. "Linguistics" could as well mean the medical discipline dealing
with diseases of the tongue, a "symbol" (from "throw together") could as
well mean a coincidence, and, to take up my previous example, a "dog
biscuit" could as well mean a biscuit flavoured with dog meat. What a word
*really* means is not given by the word itself but rather by what people
understand. Even "oxygen" (sour-maker) keeps its name although it is the
hydrogen ions that make things sour. It is understood, and this is enough.
If someone wants to know what oxygen is he should learn about chemistry;
knowing the origin of the word helps exactly nothing for understanding --
it leads nowhere and is thus not even misleading. So if people are misled
by the word "ideographic" because it would not mean what they imagine
without the faintest idea about what they are talking: so what, *every*
word is misleading if you do not know about what you are talking.
> Just as no one will guess that A is a bull's head upside down.
And no word containing an A has anything to do with a bull or its head, or
if so, by mere coincidence.
Other than that, a Chinese character containing the symbol for a man does
indeed often have to do with a person, or a symbol with the two crosses on
top with a plant or its product, and so on. So even though *all* scripts
once were pictures, and all scripts make use of phonetics, and the Chinese
characters denote morphemes (as distinct from ideas disregarding their
relation to morphemes), there *is* a difference: in some scripts a little
bit of the semantic content of what is written is in some way reflected in
the script and in others this is not at all the case.
> What you can claim is that Chinese characters are closely associated to
> a morpheme, whereas Latin characters are closely associated to a sound.
> But that's why Chinese characters are classified as logograms, whereas
> Latin characters are classified as phonograms.
Fine, where is the problem? We do exactly agree on how the scripts work,
we probably agree that "logogram" is a better word than "ideogram", at
least as long as we regard the morphemes as words. The only thing where we
disagree is that the word "ideogram" has by its mere composition a wrong
meaning. As I said above, *words* have no unambiguous inherent meanings
but at most very vague ones, and in this vague sense the word "ideogram"
could well be defended. At least much better than "oxygen" which no
chemist is ashamed to use.
--
Helmut Richter
> Fine, where is the problem? We do exactly agree on how the scripts work,
> we probably agree that "logogram" is a better word than "ideogram", at
> least as long as we regard the morphemes as words. The only thing where we
> disagree is that the word "ideogram" has by its mere composition a wrong
> meaning. As I said above, *words* have no unambiguous inherent meanings
> but at most very vague ones, and in this vague sense the word "ideogram"
> could well be defended. At least much better than "oxygen" which no
> chemist is ashamed to use.
Everyone knows what "idea" means. No one knows what "oxy" means.
And Europeans even know what "gram" means. A bit more than 28 of them make
one of your ounces.
Now let's talk about moronic oxen.
Joachim
So a tele-gram is a lightweight missile?
> Now let's talk about moronic oxen.
The ones that are content to go boustrophedon all day?
Oh, my problem isn't with the composition of the word. It might be 'zack'
for all I care. It's rather with the fact that 'ideogram' does have a useful
meaning (numerals, etc...) which is different from 'logogram'. Using
'ideogram' to refer to logograms 1) makes one confuse the two concepts, 2)
obscures the concept of 'ideogram', 3) makes one think of logograms as
having the properties of real ideograms.
> Oh, my problem isn't with the composition of the word. It might be 'zack' for
> all I care. It's rather with the fact that 'ideogram' does have a useful
> meaning (numerals, etc...) which is different from 'logogram'. Using
> 'ideogram' to refer to logograms 1) makes one confuse the two concepts, 2)
> obscures the concept of 'ideogram', 3) makes one think of logograms as having
> the properties of real ideograms.
I was not sufficiently aware that the term "ideogram" has a
well-established *other* use so that confusion may arise. I had focussed
on the question whether it is a bad term in itself, irrespective of the
potential confusion with other meanings.
--
Helmut Richter