I don't know if this is of much use for anyone, but I could imagine it's
better to use the correct unicode characters whenever possible --
especially if dealing with documents which might be read by a
text-to-speech converter later on (even though I doubt present-day TTS
converters are aware of all 1.2 million Unicode characters, or however
many there are in total).
[1] http://www.unicode.org/charts/
[2] http://www.unicode.org/charts/PDF/U2100.pdf
--
Alan Plum, WAD/WD, Mushroom Cloud Productions
http://www.mushroom-cloud.com/
> I had a look at the Unicode charts [1] and I was surprised to find
> out even our good old "degrees Celsius" has an own symbol: ℃ (hex
> 2103)
Yes, there is such a compatibility character. It is compatibility
equivalent to U+00B0 U+0043 (i.e., degree sign followed by letter C).
For compatibility characters in general, see some notes at
http://www.cs.tut.fi/~jkorpela/chars.html#compat
> I don't know how well typographically correct symbols like
> these are supported,
It has little to do with typographic correctness; rather, it's a matter
of compatibility, so that data containing that character in some non-
Unicode encoding can be encoded in Unicode without losing the distinction
between that character and the U+00B0 U+0043 pair, should someone wish to
retain that distinction.
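The compatibility relationship described above can be inspected with Python's standard unicodedata module. This is only a minimal sketch; the variable names are illustrative and not part of any standard.

```python
import unicodedata

celsius_sign = "\u2103"    # DEGREE CELSIUS (the compatibility character)
degree_c = "\u00b0\u0043"  # DEGREE SIGN followed by LATIN CAPITAL LETTER C

# For compatibility characters the decomposition field carries a <compat> tag.
decomposition = unicodedata.decomposition(celsius_sign)

# NFC keeps the distinction between the two spellings; NFKC folds it away.
nfc = unicodedata.normalize("NFC", celsius_sign)
nfkc = unicodedata.normalize("NFKC", celsius_sign)

print(decomposition)
print(nfc == celsius_sign, nfkc == degree_c)
```

Because NFC leaves U+2103 alone, legacy data can round-trip unchanged, while an application that does not care about the distinction can apply NFKC.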
> but I also found the symbol of "degrees
> Fahrenheit" ℉ (x2109), liter ℓ (x2113), Ohm Ω (x2126 - NOT
> Omega), Siemens ℧ (x2127), Kelvin K (x212A) and various others.
> Most of them can be found in Letterlike Symbols [2].
They are best forgotten, except when writing applications that may need
to process legacy data.
> I don't know if this is of much use for anyone, but I could imagine
> it's better to use the correct unicode characters whenever possible
Indeed. This means using U+00B0 U+0043 for degree Celsius, the Latin
capital letter L or the Latin small letter l (as you like) for liter, the
capital Greek letter omega for ohm, etc.
The Unicode standard says, for example:
"Unit Symbols. Several letterlike symbols are used to indicate units. In
most cases, however, such as for SI units (Système International), the
use of regular letters or other symbols is preferred. U+2113 SCRIPT
SMALL L is commonly used as a non-SI symbol for the liter. Official
SI usage prefers the regular lowercase letter l.
Three letterlike symbols have been given canonical equivalence to regular
letters: U+2126 OHM SIGN, U+212A KELVIN SIGN, and U+212B ANGSTROM SIGN.
In all three instances the regular letter should be used. In normal use,
it is better to represent degrees Celsius "°C" with a sequence of U+00B0
DEGREE SIGN + U+0043 LATIN CAPITAL LETTER C, rather than
U+2103 DEGREE CELSIUS. For searching, treat these two sequences as
identical."
http://www.unicode.org/versions/Unicode4.0.0/ch14.pdf#page=6
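The canonical equivalences and the searching advice in the quoted passage can be sketched in Python as follows. The code points are taken from the Unicode character database; search_key is a hypothetical helper name, and using NFKC as the folding step is an assumption about how one might implement "treat these two sequences as identical".

```python
import unicodedata

# Letterlike unit signs and the regular letters they are canonically
# equivalent to.
UNIT_SIGNS = {
    "\u2126": "\u03a9",  # OHM SIGN -> GREEK CAPITAL LETTER OMEGA
    "\u212a": "\u004b",  # KELVIN SIGN -> LATIN CAPITAL LETTER K
    "\u212b": "\u00c5",  # ANGSTROM SIGN -> LATIN CAPITAL LETTER A WITH RING ABOVE
}

# NFC replaces each sign with the regular letter.
nfc_results = {sign: unicodedata.normalize("NFC", sign) for sign in UNIT_SIGNS}


def search_key(text: str) -> str:
    """Hypothetical helper: fold compatibility characters before comparing."""
    return unicodedata.normalize("NFKC", text)


# DEGREE CELSIUS and DEGREE SIGN + C compare equal once folded.
same_for_search = search_key("25 \u2103") == search_key("25 \u00b0C")
print(nfc_results == UNIT_SIGNS, same_for_search)
```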
Unfortunately the standard has wrong information about the symbol for the
liter; the official position in the SI system is that both "l" and "L"
are allowed, with no expressed preference (though in the US, "L" is
preferred by national authorities).
> -- especially if dealing with documents which might be read by a
> text-to-speech converter later on
That's possible, but if such a converter can recognize a wide range of
characters, it could probably recognize simple patterns too, and read
them accordingly.
--
Yucca, http://www.cs.tut.fi/~jkorpela/
> Ashmodai <ashm...@mushroom-cloud.com> wrote:
>
>
>>I had a look at the Unicode charts [1] and I was surprised to find
>>out even our good old "degrees Celsius" has an own symbol: ℃ (hex
>>2103)
>
>
> Yes, there is such a compatibility character. It is compatibility
> equivalent to U+00B0 U+0043 (i.e., degree sign followed by letter C).
> For compatibility characters in general, see some notes at
> http://www.cs.tut.fi/~jkorpela/chars.html#compat
If it is preferred to use two characters instead of one, why is there a
special character for it?
>>I don't know how well typographically correct symbols like
>>these are supported,
>
>
> It has little to do with typographic correctness; rather, it's a matter
> of compatibility, so that data containing that character in some non-
> Unicode encoding can be encoded in Unicode without losing the distinction
> between that character and the U+00B0 U+0043 pair, should someone wish to
> retain that distinction.
Agreed. The point of Unicode is to provide a large variety of different
characters within one encoding tho, eg. Latin-1 characters and Cyrillic
characters in the same document, to my understanding.
>>but I also found the symbol of "degrees
>>Fahrenheit" ℉ (x2109), liter ℓ (x2113), Ohm Ω (x2126 - NOT
>>Omega), Siemens ℧ (x2127), Kelvin K (x212A) and various others.
>>Most of them can be found in Letterlike Symbols [2].
>
>
> They are best forgotten, except when writing applications that may need
> to process legacy data.
If you say so.
>>I don't know if this is of much use for anyone, but I could imagine
>>it's better to use the correct unicode characters whenever possible
>
>
> Indeed. This means using U+00B0 U+0043 for degree Celsius, the Latin
> capital letter L or the Latin small letter l (as you like) for liter, the
> capital Greek letter omega for ohm, etc.
>
> The Unicode standard says, for example:
>
> "Unit Symbols. Several letterlike symbols are used to indicate units. In
> most cases, however, such as for SI units (Système International), the
> use of regular letters or other symbols is preferred. U+2113 SCRIPT
> SMALL L is commonly used as a non-SI symbol for the liter. Official
> SI usage prefers the regular lowercase letter l.
Oops. Wasn't aware the SI prefers Latin-1 characters. I've seen the
SCRIPT SMALL L character on quite a lot of products, so I thought they
were more correct than the plain ones.
> Three letterlike symbols have been given canonical equivalence to *regular
> letters:* U+2126 OHM SIGN, U+212A KELVIN SIGN, and U+212B ANGSTROM SIGN.
> *In all three instances the regular letter should be used.* In normal use,
> it is better to represent degrees Celsius "°C" with a sequence of U+00B0
> DEGREE SIGN + U+0043 LATIN CAPITAL LETTER C, rather than
> U+2103 DEGREE CELSIUS. For searching, treat these two sequences as
> identical."
>
> http://www.unicode.org/versions/Unicode4.0.0/ch14.pdf#page=6
So for Kelvin I _should_ use U+212A instead of a plain K according to
the UC, but the SI prefers the plain K? Irritating.
> Unfortunately the standard has wrong information about the symbol for the
> liter; the official position in the SI system is that both "l" and "L"
> are allowed, with no expressed preference (though in the US, "L" is
> preferred by national authorities).
As far as I know lowercase l is preferred in Germany, but as in
handwriting a lowercase l usually is just a short vertical bar, it is
often written as an uppercase L in order to prevent confusion over whether
it is an actual character or just a slipped line or bracket.
On printouts in chemistry class I've seen "mL" with a capital L a lot as
well, but that may come from the international (i.e. American
influenced) BCPs of companies.
What I find even more irritating (although still legible) is that all
of a sudden the convention of writing g/mol has been replaced by
g*mol^-1. Is that just a local thing or an international standard?
What is the preferred way to mark up mathematics in ISO-8859-1/UTF-8
characters anyway? I don't think the (American, to my understanding) "x"
character for "times" and the colon-minus "division" character are the
international standard, are they?
>>-- especially if dealing with documents which might be read by a
>>text-to-speech converter later on
>
> That's possible, but if such a converter can recognize a wide range of
> characters, it could probably recognize simple patterns too, and read
> them accordingly.
Point taken.
I'm still working out proper typography on the net, which is a total
nightmare anyway. Forgive me for my misguided enthusiasm.
Old ideographic character sets from East Asia, for example
JIS X 0212, contain lots of characters for individual SI units.
A design goal of Unicode was to be round-trip compatible with
all these characters. This means, it must be possible to
convert JIS X 0212 to Unicode and back to JIS X 0212, without
any loss of information. As a result, Unicode now contains a lot
of nonsense characters that really nobody should be using.
The characters that you should use are those in Unicode Normalization
Form C. Unfortunately, not too many people have actually read
the Unicode standard, which is available from Addison Wesley and
is thicker than many telephone books. People know Unicode only from
simple-minded selection tables and often pick the completely wrong
characters, as these tables do not show the descriptive comments that
the standard provides for each character.
http://www.cl.cam.ac.uk/~mgk25/unicode.html
>Agreed. The point of Unicode is to provide a large variety of different
>characters within one encoding tho, eg. Latin-1 characters and Cyrillic
>characters in the same document, to my understanding.
It was also meant to replace any other existing encoding, hence the
round-trip requirement. During text entry, you are actually supposed
to use only a very small subset of Unicode.
>>>but I also found the symbol of "degrees
>>>Fahrenheit" ℉ (x2109), liter ℓ (x2113), Ohm Ω (x2126 - NOT
>>>Omega), Siemens ℧ (x2127), Kelvin K (x212A) and various others.
>>>Most of them can be found in Letterlike Symbols [2].
>>
>> They are best forgotten, except when writing applications that may need
>> to process legacy data.
I agree in principle, though I do personally like to use the ohm sign
and the micro sign, because where I use software to convert Unicode
to ASCII by transliteration, these can then be written out as
ohm and u or micro, as opposed to omega and mu. That is the reason
why I use the compatibility characters MICRO SIGN and
OHM SIGN in the misc.metric-system FAQ.
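As a rough illustration of that transliteration argument, here is a small Python sketch. The table ASCII_NAMES and the function to_ascii are hypothetical, and the ASCII spellings are assumptions chosen for the example, not the behaviour of any particular conversion tool.

```python
# Hypothetical transliteration table; the ASCII spellings are assumptions.
ASCII_NAMES = {
    "\u2126": "ohm",    # OHM SIGN
    "\u00b5": "u",      # MICRO SIGN
    "\u03a9": "Omega",  # GREEK CAPITAL LETTER OMEGA
    "\u03bc": "mu",     # GREEK SMALL LETTER MU
}


def to_ascii(text: str) -> str:
    """Replace each character that has a table entry; keep everything else."""
    return "".join(ASCII_NAMES.get(ch, ch) for ch in text)


print(to_ascii("47 k\u2126"), "|", to_ascii("47 k\u03a9"))
print(to_ascii("10 \u00b5F"), "|", to_ascii("10 \u03bcF"))
```

Note that this only works if the text is not normalized to NFC first, since NFC turns OHM SIGN into the Greek capital omega and the distinction is then lost.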
>Oops. Wasn't aware the SI prefers Latin-1 characters. I've seen the
>SCRIPT SMALL L character on quite a lot of products, so I thought they
>were more correct than the plain ones.
The SI brochure certainly does not use a script l, and I believe
NIST SP811 explicitly deprecates that notation. It might have been
an American idea. They have always been rather uncomfortable with the
lowercase l for litre, because unlike in Europe, in American
handwriting 1 and l can look pretty much identical. That, together with
the American habit of not adding a space between the number and the unit,
is an obvious cause for mistakes. In Europe (minus UK) the handwritten one
has an upstroke (and the 7 a cross stroke), and can therefore not be
confused with any l.
>So for Kelvin I _should_ use U+212A instead of a plain K according to
>the UC, but the SI prefers the plain K? Irritating.
No, you should use U+212A only for round-trip compatibility with
other character sets that introduced it as a distinct character.
The SI symbol for the kelvin is the Latin capital letter K.
>What I find even more irritating (although still legible) is that all
>out of sudden the convention of writing g/mol has been replaced by
>g*mol^-1. Is that just a local thing or an international standard?
I would consider the two perfectly equivalent and I don't know any
international standard (ISO 31, etc.) that expresses any particular
preference here.
>What is the preferred way to mark up mathematics in ISO-8859-1/UTF-8
>characters anyway? I don't think the (American, to my understanding) "x"
>character for "times"
The x times is the preferred multiplication character according
to ISO 31-0, when it appears next to any digit, as the centered dot
more commonly used in mathematics could there be confused too easily
in handwriting with the English decimal dot. Between variables, the
centered dot or no character at all are the recommended notations for
multiplication. I personally always use a centered dot in formulas before
an opening parenthesis, to distinguish multiplication from
function application.
> and the colon-minus "division" character are the
> international standard, are they?
I know this one only from calculator keyboards. It does not appear to
be used or recommended in scientific notation.
>I'm still working out proper typography on the net, which is a total
>nightmare anyway. Forgive me for my misguided enthusiasm.
UTF-8 gives us already much more capability to use mathematical
notation on the USENET, but it is still a long way away from the
capabilities of TeX. There were proposals from the STIX project
to add to Unicode a small set of control characters to fully
support the 2-D structure of mathematical notation (sub/super
scripts, grouping, arrays, growing delimiters, etc.) but I
haven't heard of any progress in this direction recently.
Markus
--
Markus Kuhn, Computer Laboratory, University of Cambridge
http://www.cl.cam.ac.uk/~mgk25/ || CB3 0FD, Great Britain
>Ashmodai <ashm...@mushroom-cloud.com> writes:
>>Oops. Wasn't aware the SI prefers Latin-1 characters. I've seen the
>>SCRIPT SMALL L character on quite a lot of products, so I thought they
>>were more correct than the plain ones.
>
>The SI brochure certainly does not use a script l, and I believe
>NIST SP811 explicitly deprecates that notation. It might have been
>an American idea. They have always been rather uncomfortable with the
>lowercase l for litre, because unlike in Europe, in American
>handwriting 1 and l can look pretty much identical. That, together with
>the American habit of not adding a space between the number and the unit,
>is an obvious cause for mistakes. . . .
American habit? You seem to be confused.
Americans don't have much problem with that, except in the special
case of butting a degree sign up to the letter for the adjective
identifying it and leaving a space between that combination and the
number. But that is as much a problem anywhere else as it is in
America, and I'm not sure about the problem of using degrees of
temperature without identifying the scale, which is a problem in
America.
OTOH, look at things like the style guide of the national newspaper of
the U.K., The Times.
http://www.timesonline.co.uk/section/0,,2941,00.html
This style guide does omit the space between the number and the
symbol, and I'll bet it is only one of many in the U.K. which does so.
The NPL site does briefly state on one of its pages that this space
should be there, and I think the British might be getting a little
better at this than they were in the past.
The Times style guide also has the opposite problem with degrees
Celsius and degrees Fahrenheit--it omits the degree sign. It also
endorses "degrees centigrade"--which the CGPM told us to stop using
way back in 1948.
Gene Nygaard
http://ourworld.compuserve.com/homepages/Gene_Nygaard/
The site states the German umlaut Ä is to be written (if written as a
precomposed character) as a diaeresis followed by A.
I'm pretty sure the alternate character for Ä is Ae, as originally the
diaeresis was a small lowercase e (19th century German proves that),
which is also what is taught in schools and common knowledge in Germany.
I think it is a more realistic assumption that the alternative spelling
uses an Ae rather than a diaeresis followed by an A, even though
there's no logical way to compose an Ä from that.
I also think having the small letter sharp s (or rather, sz ligature --
although the modern extension is a dual small letter s) transliterated
as a beta sign in CP437 is quite questionable.
[http://www.cl.cam.ac.uk/~mgk25/unicode.html]
>
> The site states the German umlaut Ä is to be written (if written as
> precomposed character) as a diaeresis followed by A.
The precomposed character Ä is the "A with diaeresis", as it appears
in this message. This is how it works in both unicode and ISO8859-15.
With unicode, you may place any mark (diaeresis, dot, accent, ring...)
above any other character, using two symbols: the base character
followed by a "combining mark". The software will then (hopefully)
display the mark above the preceding character.
This is a nice feature, but to remain compatible with old terminals
and to make conversion easier, the precomposed characters remained
in the unicode standard.
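The relationship between the two-symbol form and the precomposed character can be checked with Python's standard unicodedata module. A minimal sketch, with illustrative variable names:

```python
import unicodedata

decomposed = "A\u0308"   # LATIN CAPITAL LETTER A + COMBINING DIAERESIS
precomposed = "\u00c4"   # LATIN CAPITAL LETTER A WITH DIAERESIS

# NFC composes the pair into the single precomposed character;
# NFD splits the precomposed character back into base + combining mark.
composed = unicodedata.normalize("NFC", decomposed)
split = unicodedata.normalize("NFD", precomposed)

print(len(decomposed), len(composed))
print(composed == precomposed, split == decomposed)
```

The two spellings are different code point sequences, so a plain string comparison treats them as unequal unless both sides are normalized to the same form first.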
> I'm pretty sure the alternate character for Ä is Ae, as originally the
> diaeresis was a small lowercase e (19th century German proves that),
> which is also what is taught in schools and common knowledge in Germany.
Ae is used for Ä in ASCIIfication, and in any other environment, where
only the 26 basic latin characters are available.
> I also think having the small letter sharp s (or rather, sz ligature --
> although the modern extension is a dual small letter s) transliterated
> as a beta sign in CP437 is quite questionable.
Using the beta sign is a way to display unicode (or ISO8859-15) text
on a display with CP437 burned in its ROM. The software has to do the
mapping, just like mapping Ä (ISO8859-15) to the corresponding CP437
character, or mapping A-followed-by-combining-diaeresis to exactly the
same one. Alternatively, the unicode or ISO8859-15 sz ligature can
be displayed as double s in CP437, but that is a one-way conversion.
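The mapping described here can be tried out with Python's built-in cp437 codec, used as a stand-in for the display software (an assumption; real terminals may map differently). The helper encode_cp437 is a hypothetical name. Python's codec treats the CP437 byte 0xE1 as the sharp s, which is the slot whose glyph doubles as a beta.

```python
import unicodedata


def encode_cp437(text: str) -> bytes:
    """Compose combining sequences first, since CP437 has no combining marks."""
    return unicodedata.normalize("NFC", text).encode("cp437")


a_precomposed = encode_cp437("\u00c4")   # precomposed A with diaeresis
a_combining = encode_cp437("A\u0308")    # A followed by combining diaeresis

# Sharp s goes to the single byte 0xE1 and survives a round trip.
sharp_s = encode_cp437("Stra\u00dfe")
round_trip = sharp_s.decode("cp437")

# Spelling it as "ss" also encodes fine, but cannot be undone reliably.
double_s = "Stra\u00dfe".replace("\u00df", "ss").encode("cp437")

print(a_precomposed, a_combining, sharp_s, double_s)
```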
Btw, using the unicode SI symbols would probably aid text-to-speech
conversion, but I am just too lazy to use them...
HTH,
Klaus
> OTOH, look at things like the style guide of the national newspaper of
> the U.K., The Times.
> http://www.timesonline.co.uk/section/0,,2941,00.html
The Times is not the national newspaper of the U.K.
> This style guide does omit the space between the number and the
> symbol, and I'll bet it is only one of many in the U.K. which does so.
Newspapers are concerned with saving space.
>Gene Nygaard <gnyg...@nccray.com> writes:
>
>> OTOH, look at things like the style guide of the national newspaper of
>> the U.K., The Times.
>> http://www.timesonline.co.uk/section/0,,2941,00.html
>
>The Times is not the national newspaper of the U.K.
Fine, replace "the" with "a"; I was mostly concerned with
distinguishing it from the New York Times or Los Angeles Times.
>> This style guide does omit the space between the number and the
>> symbol, and I'll bet it is only one of many in the U.K. which does so.
>
>Newspapers are concerned with saving space.
It isn't just newspapers. It's a ubiquitous problem with people from
the U.K.
Google groups search: weight 15st 1610 hits
and very few of them by Americans, I'll bet.
Gene Nygaard
http://ourworld.compuserve.com/homepages/Gene_Nygaard/
>>> This style guide does omit the space between the number and the
>>> symbol, and I'll bet it is only one of many in the U.K. which does so.
>>
>>Newspapers are concerned with saving space.
>
> It isn't just newspapers. It's a ubiquitous problem with people from
> the U.K.
>
> Google groups search: weight 15st 1610 hits
>
> and very few of them by Americans, I'll bet.
Yeah, right.
Google groups search: weight 160lbs about 3000 hits
and very few of them by Brits, I'll bet.
But yes, it probably is fairly normal practice here in handwriting and
colloquial printing not to put spaces between figures and unit
abbreviations.
The major academic press style guides use a thin space, as far as I
remember. The MHRA guide shows a full space.
>Gene Nygaard <gnyg...@nccray.com> writes:
>
>>>> This style guide does omit the space between the number and the
>>>> symbol, and I'll bet it is only one of many in the U.K. which does so.
>>>
>>>Newspapers are concerned with saving space.
>>
>> It isn't just newspapers. It's a ubiquitous problem with people from
>> the U.K.
>>
>> Google groups search: weight 15st 1610 hits
>>
>> and very few of them by Americans, I'll bet.
>
>Yeah, right.
>
>Google groups search: weight 160lbs about 3000 hits
>
>and very few of them by Brits, I'll bet.
A lot more of them with pounds by Brits, than stones by Americans.
Also Canadians, Australians, New Zealanders, South Africans, and the
like.
But here's a more informative comparison
Google groups search: weight "15 st" 251 hits
That's only a little over 15% as many hits as for 15st run together.
Furthermore, while by looking at the earlier search, it looks to me
like more than 90% of the hits deal with stones as units of measure,
in this one I'd doubt that even half of the hits deal with stones.
Google groups search: weight "160 lbs" 5690 hits
See the difference? That's 190% of what you got for 160lbs run
together. (In this case, it's likely that all of them in both cases
deal with pounds as units of measure.)
>But yes, it probably is fairly normal practice here in handwriting and
>colloquial printing not to put spaces between figures and unit
>abbreviations.
>
>The major academic press style guides use a thin space, as far as I
>remember. The MHRA guide shows a full space.
Showing it that way isn't as effective as a rule saying it _should be_
that way. Most people aren't very good at picking up rules by
example, and the most such examples can show is that that way is
acceptable, not that some other way is unacceptable. Furthermore,
without an explicitly stated rule, those who want to complain about
someone else's usage have a much harder time making their point.
Gene Nygaard
http://ourworld.compuserve.com/homepages/Gene_Nygaard/
>>But yes, it probably is fairly normal practice here in handwriting and
>>colloquial printing not to put spaces between figures and unit
>>abbreviations.
>>
>>The major academic press style guides use a thin space, as far as I
>>remember. The MHRA guide shows a full space.
>
> Showing it that way isn't as effective as a rule saying it _should be_
> that way. Most people aren't very good at picking up rules by
> example, and the most such examples can show is that that way is
> acceptable, not that some other way is unacceptable. Furthermore,
> without an explicitly stated rule, those who want to complain about
> someone else's usage have a much harder time making their point.
So don't complain. (Fortunately, in English, there's nobody with
authority to make rules.)
In ASCII, where no thin space is available, I usually omit the
space. It simply reads better that way to me, perhaps because an
abbreviated unit of measurement standing alone, isolated by spaces,
looks unnatural as part of a sentence.
After all, you don't talk about 300 $, do you?
Of course not. Since the Australian Dollar, the Barbados Dollar,
and the Bermudan Dollar differ quite a bit in value, I prefer to add
the ISO currency code, as in 300 AUD, 300 BBD, and 300 BMD.
And for these, there seems no reason not to add them with a space
after the number, just like with SI units.
http://en.wikipedia.org/wiki/ISO_4217
http://www.xe.com/
Markus
>Julian Bradfield <j...@inf.ed.ac.uk> writes:
>>After all, you don't talk about 300 $, do you?
>
>Of course not. Since the Australian Dollar, the Barbados Dollar,
>and the Bermudan Dollar differ quite a bit in value, I prefer to add
>the ISO currency code, as in 300 AUD, 300 BBD, and 300 BMD.
Let's not forget, it was also a "peso sign" before it was a "dollar
sign," expanding your possibilities even more.
Gene Nygaard
http://ourworld.compuserve.com/homepages/Gene_Nygaard/
> Ashmodai wrote:
>>
>>I also think having the small letter sharp s (or rather, sz ligature --
>>although the modern extension is a dual small letter s) transliterated
>>as a beta sign in CP437 is quite questionable.
>
> Using the beta sign is a way to display unicode (or ISO8859-15) text
> on a display with CP437 burned in its ROM. The software has to do the
> mapping, just like mapping Ä (ISO8859-15) to the corresponding CP437
> character, or mapping A-followed-by-combining-diaeresis to exactly the
> same one. Alternatively, the unicode or ISO8859-15 sz ligature can
> be displayed as double s in CP437, but that is a one-way conversion.
I think it's a one-way transliteration anyway. Try and convert a text in
which all sz-ligatures are beta characters into UTF-8.
It'd be safer to transliterate sz-ligatures into two lowercase s
characters -- that won't break any words at least.
> After all, you don't talk about 300 $, do you?
I don't know about Gene, but I do talk about 300 $. Unfortunately, the
convention for writing 300 dollars in English is "$300". IMHO, it
makes much more sense if written language follows spoken language
closely. Any difference between the two just makes it more difficult.
I believe the cumbersome practice of writing "$300" arose a few
hundred years ago. The idea was to prevent people from adding additional
digits to a figure.
Another disadvantage is the difficulty of reading currency symbols in
an aligned list of figures. If the currency symbol is placed, separated
by a space, to the right of the figure, foreign currencies can be
sorted and spotted easily.
Regards,
Chris Kaese
It also helps when combining units such as:
x $/litre
y $/kg
z $/m
> Markus Kuhn scribbled something along the lines of:
> > http://www.cl.cam.ac.uk/~mgk25/unicode.html
> The site states the German umlaut Ä is to be written (if written as
> precomposed character) as a diaeresis followed by A.
Right.
> I'm pretty sure the alternate character for Ä is Ae, as originally the
> diaeresis was a small lowercase e (19th century German proves that),
> which is also what is taught in schools and common knowledge in Germany.
True in German, but NOT IN ANY OTHER LANGUAGE (such as Finnish or
Swedish) that uses Ä (or Ö or Ü, for that matter). Pardon the
uppercase, but it really tends to piss me off when people get this wrong
and turn Finnish words into pure gibberish.
> I think it is a more realistic assumption that the alternative spelling
> uses an Ae rather than a diaeresis followed by an A, even though
> there's no logical way to compose an Ä from that.
Not unless you can be absolutely certain that you will never encounter
non-German words.
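To make the point concrete, here is a sketch of an ASCII fallback that applies the German "Ae" convention only when the text is known to be German. The function asciify, its lang parameter, and the choice to simply drop the marks for other languages are all assumptions made for illustration; nothing in this thread prescribes a fallback for Finnish or Swedish.

```python
import unicodedata

# German-only convention: umlaut -> base vowel + e, sharp s -> ss.
GERMAN = {
    "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",
    "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",
    "\u00df": "ss",
}


def asciify(text: str, lang: str) -> str:
    """Hypothetical ASCII fallback that depends on the text's language."""
    if lang == "de":
        return "".join(GERMAN.get(ch, ch) for ch in text)
    # Assumed fallback for other languages: decompose and drop the marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(asciify("\u00c4rger", "de"))  # German word, German rule
print(asciify("\u00c4iti", "fi"))   # Finnish word, no "Ae" substitution
```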
--
Esa Peuha
student of mathematics at the University of Helsinki
http://www.helsinki.fi/~peuha/
> I've seen the
> SCRIPT SMALL L character on quite a lot of products, so I thought they
> were more correct than the plain ones.
Incidentally, I just noticed that the material by the NIST explicitly
says otherwise:
"The script letter [l], is not an approved symbol for the liter."
http://physics.nist.gov/Pubs/SP811/sec05.html
Note, by the way, that the document uses images to present the symbols of
minute and second as angle unit, i.e. the prime and the double prime,
so when it thus avoids using the Ascii apostrophe ' and the Ascii
quotation mark ", it clearly takes the position that the Ascii characters
are not the correct characters for the units. In practice, as we know, we
tend to use them, since the prime and the double prime are often not
available in fonts.
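For reference, the characters in question can be told apart by their Unicode names with Python's unicodedata module. The helper format_angle is a hypothetical name used only to show the prime characters in use.

```python
import unicodedata

# ASCII apostrophe and quotation mark versus the dedicated prime characters.
candidates = ["'", '"', "\u2032", "\u2033"]
names = [unicodedata.name(ch) for ch in candidates]


def format_angle(degrees: int, minutes: int, seconds: int) -> str:
    """Hypothetical helper: format an angle with DEGREE SIGN, PRIME, DOUBLE PRIME."""
    return f"{degrees}\u00b0{minutes}\u2032{seconds}\u2033"


print(names)
print(format_angle(12, 30, 15))
```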
--
Yucca, http://www.cs.tut.fi/~jkorpela/