Reviews for lisp implementations

Loh Yoon Chao, Peter

Apr 14, 1999

Hi,
Could someone recommend any good, independent
sites for the above? In particular, I'm looking
for product comparisons between Harlequin's and
Franz's implementations. I had tried the ALU site
(and the rest of the web) for a few hours without
any success. Thanks in advance.

Best regards,
Peter

Arthur Lemmens

Apr 15, 1999

"Loh Yoon Chao, Peter" wrote:

> In particular, I'm looking for product comparisons between
> Harlequin's and Franz's implementations. I had tried the ALU site
> (and the rest of the web) for a few hours without
> any success.

Apart from Usenet snippets, the only product comparison I'm aware of
is by David Lamkins (http://www.teleport.com/~dlamkins). Unfortunately,
it's probably too old to be of any use.

I've waited two days for people with more experience to shed some
light here. But, apparently, nobody is willing to burn his fingers
on a comparison between Harlequin and Franz. So here's my (very
personal and very subjective) impression, based on about 1000 hours
of working with Harlequin's Lispworks, 50 hours of experiments with
Franz' previous version (don't remember version number) for Windows
and about 5 hours of playing with Franz' current version. All of this
on Windows 95/98.

* Price
Franz is a lot more expensive than Harlequin (at least a few thousand
vs. less than one thousand dollars). Also, Franz wants royalties for
programs that you distribute; Harlequin doesn't (unless you use their
Enterprise Edition).
For personal use, both companies have a free version.

* Conformance to standards.
My impression is that both companies are pretty good at conforming
to the ANSI spec, but that Harlequin takes it a bit more seriously
than Franz. Harlequin seems to be much better at supporting Unicode
and other character sets. Franz still seems to think that 256 characters
is more than enough (just like Bill Gates thought that 640K is more
than anyone would ever need).

* Integration with underlying platform
My impression is that Franz puts more effort into this than Harlequin.
For Windows, Franz seems to support more platform-specific stuff
(e.g. multimedia extensions, tree views). Also, their development
environment has a more 'natural' feel.

* Performance
I haven't run any benchmarks, but Lispworks feels a bit more
sluggish (both in space and speed) than Allegro CL.

If money didn't matter, I would use Allegro for platform-dependent
stuff and Lispworks for everything else. Personally, I can't afford
Allegro and I've settled for Lispworks. I've never regretted buying
it.

I'll be happy to have my impressions corrected by people who know
better.

Arthur Lemmens

Erik Naggum

Apr 15, 1999

* Arthur Lemmens <lem...@simplex.nl>

| I've waited two days for people with more experience to shed some
| light here. But, apparently, nobody is willing to burn his fingers
| on a comparison between Harlequin and Franz.

that's because this is the kind of stuff lawsuits are made of. you need
a protective wrapper of serious legal quality to dive into this matter of
comparing products in general. not that I think Franz or Harlequin will
sue anyone, but most professionals are aware of the problems of comparing
products, and consequently avoid it, at least in public.

| So here's my (very personal and very subjective) impression, based on
| about 1000 hours of working with Harlequin's Lispworks, 50 hours of
| experiments with Franz' previous version (don't remember version number)
| for Windows and about 5 hours of playing with Franz' current version.
| All of this on Windows 95/98.

although extremely important to inform your readers of (thanks), this
makes your comparison "weak". (I wouldn't be able to provide a stronger
comparison, by the way.)

| * Price

price comparisons are more dangerous than any other comparisons.

| * Conformance to standards.

this comparison should be performed by someone very familiar with the
standard and its semantics, because impressions of non-conformance may
actually be within the bounds of conformance, and some non-conformances
may be insignificant and easily fixed if the vendor is alerted to them.

| My impression is that both companies are pretty good at conforming to the
| ANSI spec, but that Harlequin takes it a bit more seriously than Franz.

this is _very_ difficult to establish from watching the products, as it
refers to intentions and future, not to the past. it _is_ fair to say
that Harlequin's LispWorks conforms better to the specification in some
areas than Franz's Allegro CL does, but it has to be an area-by-area
comparison to be fair, and the severity of the non-conformance is also
important for a fair comparison. e.g., _my_ impression is that Allegro
CL has a weaker safe mode (not all errors signal errors as they should)
than one could hope for, but this is not an area where I need it, so it
may or may not matter to a particular programmer. (incidentally, I know
that Franz Inc _is_ taking conformance seriously and I'm working with
them to help us all get there.)

| Harlequin seems to be much better at supporting Unicode and other
| character sets.

although very valuable for a user, this is not about conformance to the
ANSI Common Lisp standard. it is therefore important to state what you
expect from a product.

| Franz still seems to think that 256 characters is more than enough (just
| like Bill Gates thought that 640K is more than anyone would ever need).

such parenthetical remarks, however, make your "comparison" nigh useless.

incidentally, Franz Inc has an "international" (= Japanese) version that
covers the need of most present non-Latin speakers. (I have had to do a
little home-brewing to get ISO 8859-1 working as I want it to in Allegro
CL, but I don't know whether LispWorks is any better.)

| * Integration with underlying platform

this is a valuable comment to a user.

| * Performance

comparisons here are fraught with danger and should be performed with
published code and all sorts of things. e.g., some property that makes
it feel "sluggish" could be extremely easy to fix, and other properties
can be very hard to change because they are pandemic to the design. I
think performance comparisons are _generally_ unfair, because after you
have decided on a product, you learn how to make it faster.

| I'll be happy to have my impressions corrected by people who know
| better.

I don't want to snap at you, but it's a _lot_ safer to talk to the person
requesting a comparison and let it be a personal exchange, rather than
post impressions and request correction; it usually requires a huge
effort to correct simple misimpressions. this is why comparisons often
produce a tremendous amount of noise on the newsgroups. also, most user
impressions are exceedingly hard to quantify, and a lot of factors come
into play.

incidentally, I haven't had the opportunity to compare Allegro CL with
much anything else. (I went from CMUCL 17f to Allegro CL 4.3 and it was
a world of difference, so I don't even consider CMUCL possible to compare
in the area I think matters the most: the development environment.) I
get the performance I need, and I get the support I need from Franz Inc
whenever I wonder about something or find a problem, and I see no reason
to go look for a competing product. now, this is more an accident of
history than anything else, so it does in no way preclude similar
experiences with Harlequin -- it just didn't happen to me. my guess is
that this is how most user impressions are formed: luck and good timing.

#:Erik
--
environmentalists are much too concerned with planet earth. their geocentric
attitude prevents them from seeing the greater picture -- lots of planets are
much worse off than earth is.

Fernando D. Mato Mira

Apr 15, 1999

Arthur Lemmens wrote:

> I've waited two days for people with more experience to shed some
> light here. But, apparently, nobody is willing to burn his fingers
> on a comparison between Harlequin and Franz. So here's my (very

OK. I've used both Allegro (up to 4.2) and Harlequin (up to 1995),
and I must say, FFI issues aside, that I would go with Allegro. I just
feel safer (some might argue that Harlequin's other businesses make it
safer, but for me it gives a feeling (just an impression, but that's
what marketing is all about) of lack of commitment).
Maybe more importantly, I never `got' the Harlequin way. I'm not a fan
of IDEs. The primitive (but then, you have those cool menus) Xemacs
interface of Allegro feels good. I've never used ACL for Windows but it
looks pretty neat, so maybe it's just that I don't like _this_
Harlequin IDE.

I'd really like a Genera for SGI to play around with. I have no use for it right
now, but I'd gladly pay my own $800 for the manuals (upgradeable to a
commercial license).

--
Fernando D. Mato Mira
Real-Time SW Eng & Networking
Advanced Systems Engineering Division
CSEM
Jaquet-Droz 1 email: matomira AT acm DOT org
CH-2007 Neuchatel tel: +41 (32) 720-5157
Switzerland FAX: +41 (32) 720-5720

www.csem.ch www.vrai.com ligwww.epfl.ch/matomira.html

Arthur Lemmens

Apr 15, 1999

Erik Naggum wrote:

> price comparisons are more dangerous than any other comparisons.

Would you care to explain?
I would think that price is just about the only thing you can
compare without the risk of giving "misimpressions".

> this comparison should be performed by someone very familiar with the
> standard and its semantics

> [...]

> but it has to be an area-by-area comparison to be fair, and the
> severity of the non-conformance is also important for a fair comparison.

I can't disagree with this, of course.
But I tried to make it clear that I was giving my personal impression
and not attempting to make a fair comparison. (I couldn't possibly find
the time for a fair comparison, but I didn't want to leave the original
question unanswered.)

> | Franz still seems to think that 256 characters is more than enough (just
> | like Bill Gates thought that 640K is more than anyone would ever need).
>
> such parenthetical remarks, however, make your "comparison" nigh useless.

Sorry, I shouldn't have said that.
This wasn't the right place to vent my frustration about the slow
acceptance of a decent international character set.

> (I have had to do a little home-brewing to get ISO 8859-1 working
> as I want it to in Allegro CL, but I don't know whether LispWorks
> is any better.)

I'm so glad that I only need to type
(code-char #x41A)
to actually get a Russian K that I've forgiven Lispworks for
returning NIL when I ask
(alpha-char-p *)

But it _does_ know something about Latin 1:

CL-USER 17 > (code-char #xF0)
#\ð

CL-USER 18 > (char-upcase *)
#\Ð

> it's a _lot_ safer to talk to the person requesting a comparison
> and let it be a personal exchange, rather than post impressions
> and request correction;

Thanks for the advice. I don't know if I will actually follow it,
though. Having a public discussion increases the chance that _I_ can
learn something as well. E.g., if I had sent my remarks privately,
I wouldn't have learnt from you that Franz has an international
version of Allegro CL.

> (I went from CMUCL 17f to Allegro CL 4.3 and it was a world of
> difference, so I don't even consider CMUCL possible to compare
> in the area I think matters the most: the development environment.)

Let's hope the CMUCL maintainers won't sue you for this remark ;-)

Arthur Lemmens

Erik Naggum

Apr 15, 1999

* Arthur Lemmens <lem...@simplex.nl>

| Would you care to explain? I would think that price is just about the
| only thing you can compare without the risk of giving "misimpressions".

there are all sorts of pricing policies around, depending on who you are
(student, commercial, educational), where you are (United States, Europe,
Asia), how much you want to buy (trial, student, professional, enterprise
edition), etc, etc. it's actually difficult to compare the price that
you would have to pay unless you're the person buying something and in
position to weigh alternatives. for instance, one might find that some
add-on product is not worth the price from one vendor and roll your own,
while from another vendor the price is acceptable. the result may be
that the former costs less from the vendor than the latter, but more
after life-cycle costs for the new code are accounted for, but not with
the initial prices, only. stuff like this is why large companies have
acquisitions departments who work like hell to get good package deals.

| But it _does_ know something about Latin 1:
|
| CL-USER 17 > (code-char #xF0)
| #\ð
|
| CL-USER 18 > (char-upcase *)
| #\Ð

good. Allegro CL does this correctly only with my personal fixes.
(which, incidentally, supports the entire ISO 8859 family, one by one,
once properly invoked.)

| E.g., if I had sent my remarks privately, I wouldn't have learnt from
| you that Franz has an international version of Allegro CL.

valid point. however, it would have been prudent to ask Franz Inc if you
tried to write a fair comparison.

Vassil Nikolov

Apr 15, 1999

Off-topic.

In article <37164EF3...@simplex.nl>,
Arthur Lemmens <lem...@simplex.nl> wrote:

(...)

> I'm so glad that I only need to type
> (code-char #x41A)
> to actually get a Russian K

Cyrillic K, please. There are a number of nations besides the
Russians, including Byelorussians, Macedonians, Serbs, Ukrainians,
as well as Bulgarians, who use (different versions of) this alphabet.

I am too lazy to look it up, but I believe the ISO 10646 name of
this character is CYRILLIC CAPITAL LETTER KA or something.

Historical Note:
The original version of the Cyrillic alphabet was developed in the
9th century on the basis of the Greek alphabet. Its name is a
tribute to St. Cyril, the Eastern Roman scholar and missionary who
captured the phonetics of the (then common) Slavonic language
into a writing system (using a different alphabet, now extinct)
and who was a translator of the Bible, and its development is
credited to St. Climent of Okhrid, one of St. Cyril's disciples.

By the way, I am not aware of another alphabet besides the
contemporary Russian version of Cyrillics where the number
of letters is a power of 2 (2^5).

(...)

> But it _does_ know something about Latin 1:
>
> CL-USER 17 > (code-char #xF0)
> #\ð
>
> CL-USER 18 > (char-upcase *)
> #\Ð

(...)

I'd rather have had the above as

(char-code (char-upcase (code-char #xF0)))

instead of, or in addition to, the above, which makes little
apparent sense on my Macintosh with a Cyrillic font selected.

--
Vassil Nikolov <vnik...@poboxes.com>

-----------== Posted via Deja News, The Discussion Network ==----------
http://www.dejanews.com/ Search, Read, Discuss, or Start Your Own

Valeriy E. Ushakov

Apr 16, 1999

Vassil Nikolov <vnik...@poboxes.com> wrote:

> Off-topic.

Yes.

> By the way, I am not aware of another alphabet besides the
> contemporary Russian version of Cyrillics where the number
> of letters is a power of 2 (2^5).

[mumbling the alphabet]... hmm, last time I checked - it was 33. :-)

You probably forgot cyrillic-letter-io, which is rarely used in
printed Russian (mostly in texts for children and foreigners and in
ambiguous cases) and is substituted with cyrillic-letter-ie. Still
it's part of the alphabet/orthography.

SY, Uwe
--
u...@ptc.spbu.ru | Zu Grunde kommen
http://www.ptc.spbu.ru/~uwe/ | Ist zu Grunde gehen

Arthur Lemmens

Apr 16, 1999

I wrote:
> I'm so glad that I only need to type
> (code-char #x41A)
> to actually get a Russian K

Vassil Nikolov replied:

> Cyrillic K, please. There are a number of nations besides the
> Russians, including Byelorussians, Macedonians, Serbs, Ukrainians,
> as well as Bulgarians, who use (different versions of) this alphabet.

Uhm, yes. Sorry. In my situation, (code-char #x41A) is usually a
Russian K. But next time I'll call it Cyrillic. I sometimes forget
I'm not the only Cyrillic speaking (;-) Lisp programmer on Usenet.

> I am too lazy to look it up, but I believe the ISO 10646 name of
> this character is CYRILLIC CAPITAL LETTER KA or something.

I looked it up. You're right.

> I'd rather have had the above as
>
> (char-code (char-upcase (code-char #xF0)))
>
> instead of, or in addition to, the above, which makes little
> apparent sense on my Macintosh with a Cyrillic font selected.

Before sending, I verified that the content-type header included
"charset=iso8859-1" to increase the probability of readers seeing
what I meant.

Arthur Lemmens

David Fox

Apr 16, 1999

Arthur Lemmens <lem...@simplex.nl> writes:

> > (I have had to do a little home-brewing to get ISO 8859-1 working
> > as I want it to in Allegro CL, but I don't know whether LispWorks
> > is any better.)

LispWorks uses ISO 8859-1 for files by default. Currently users need
to do some configuration to use other encodings for files.

The internal encoding of LispWorks 4.x is Unicode. It has just one
executable.

> I'm so glad that I only need to type
> (code-char #x41A)
> to actually get a Russian K that I've forgiven Lispworks for
> returning NIL when I ask
> (alpha-char-p *)
>
> But it _does_ know something about Latin 1:
>
> CL-USER 17 > (code-char #xF0)
> #\ð
>
> CL-USER 18 > (char-upcase *)
> #\Ð

Yes, in LispWorks we added the alphabetic property and case-pairs
(beyond those required by the ANSI standard) for Latin-1 only. I
should admit that this is rather half-baked, but allow me to
explain one of the technical problems...

Recall that BASE-STRINGs contain only BASE-CHARs. LispWorks also
provides a 16-bit string type (TEXT-STRING) which can contain all of
Unicode.

There is a particular difficulty (for LispWorks at least) with U+00FF
LATIN SMALL LETTER Y WITH DIAERESIS, which is a BASE-CHAR in LispWorks
yet its uppercase pair (as defined by Unicode) is an EXTENDED-CHAR. Thus
if we were to make these particular characters BOTH-CASE-P then
STRING-UPCASE etc. could not be relied upon to preserve string
types. That might be acceptable under the ANSI standard (though
potentially dangerous to users whose code used specialized accessors)
but the real killer was NSTRING-UPCASE, which has to modify the string
in place and so cannot store an EXTENDED-CHAR into a BASE-STRING.

I suppose we could have defined a larger set of alphabetic characters
without such problems, but we didn't. Sorry!

There was some attempt to define extended character case (and other)
functions in the JEIDA Common Lisp Guideline. I don't know if anyone
actually implemented that.

LispWorks users needing case converters beyond Latin-1 should exploit
the fact that the internal encoding is Unicode to write their own
functions using range checks.
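
The range-check approach suggested above could look roughly like the
following Common Lisp sketch. It assumes that CHAR-CODE and CODE-CHAR
work in Unicode code points (as the post says of LispWorks 4.x) and it
covers only the basic Cyrillic letters. The function names are made up
for illustration and are not part of LispWorks.

```lisp
;; Hypothetical helpers: upcase basic Cyrillic letters by range checks.
;; Assumes character codes are Unicode code points.
(defun cyrillic-upcase-code (code)
  "Return the uppercase code for CODE, or CODE itself if it has none here."
  (cond ((<= #x430 code #x44F) (- code #x20))  ; U+0430..U+044F -> U+0410..U+042F
        ((<= #x450 code #x45F) (- code #x50))  ; U+0450..U+045F -> U+0400..U+040F
        (t code)))

(defun cyrillic-char-upcase (char)
  "Character-level wrapper around CYRILLIC-UPCASE-CODE."
  (code-char (cyrillic-upcase-code (char-code char))))
```

A fuller converter would need more ranges and a matching downcase
function; this only shows the shape of the idea.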

--
Dave Fox Email: da...@harlequin.com
Harlequin Ltd, Barrington Hall, Tel: +44 1223 873879
Barrington, Cambridge CB2 5RG, England. Fax: +44 1223 873873
These opinions are not necessarily those of Harlequin.

Lars Marius Garshol

Apr 16, 1999

* David Fox

|
| There is a particular difficulty (for LispWorks at least) with
| U+00FF LATIN SMALL LETTER Y WITH DIAERESIS which is a BASE-CHAR in
| LispWorks yet its uppercase pair (as defined by Unicode) is an
| EXTENDED-CHAR.

This problem is solved in the recently-approved ISO 8859-15, so
providing that as an alternative to 8859-1 may make sense.

--Lars M.

Reini Urban

Apr 16, 1999

David Fox <da...@harlequin.co.uk> wrote:
>Yes, in LispWorks we added the alphabetic property and case-pairs
>(beyond those required by the ANSI standard) for Latin-1 only. I
>should admit that this is rather half-baked, ...

If you need the case pairs for some other codepages,
you can grab these. It's not the full set but you get the idea.
I haven't found them on the net anywhere so I had to work them out
on my own. I wrote it for AutoLISP; in CL you could perhaps use vectors
instead.

I am also posting this (instead of linking to it) to let you see how
weirdly some codepages were designed with respect to case predicates and
case conversions. I guess most OSes do it by precalculating the tables,
wasting a lot of bytes.

Note: the third element <tolower> of each triple is the numeric
difference from the uppercase to the lowercase char.
So (65 90 32) means that there are 26 uppercase chars from 65 to 90
with their lowercase counterparts 32 above (65+32 up to 90+32).

;;; Hardcoded charset capital letter ranges per codepage,
;;; kind of LC_CTYPE info. Format: list of: (<from> <to> <tolower>)
;;; Found the differences in toupper, tolower, isupper, islower
;;; by scanning the descriptive character names for upper and lower,
;;; unified the pairs into groups and came up with redefinitions
;;; of the upper/lower predicates and conversions.
(setq std:cp-cap-ascii '((65 90 32))) ; this is simple
;; there's a hole at 215; the last range stops at 222 because
;; 223 (sharp s) is not a capital letter
(setq std:cp-cap-iso8859-1 '((65 90 32)(192 214 32)(216 222 32)))
(setq std:cp-cap-iso8859-2 '((65 90 32)(192 214 32)(216 222 32)
                             (161 161 16)(163 163 16)
                             (165 166 16)(169 172 16)(174 175 16)
                             ))
(setq std:cp-cap-iso8859-3 '((65 90 32)(192 214 32)(216 222 32)
                             (161 161 16)(166 166 16)
                             (169 172 16)(175 175 16)
                             ; 0xAE, 0xBE seem to be missing
                             ))

;;; A really weird charset (by IBM), very old.
;;; Thankfully the system-provided strcase should handle this most
;;; of the time (by static table lookup).
(setq std:cp-cap-dos850 '((65 90 32)
(128 128 7)(142 142 -10)(143 143 -9)(144 144 -14)
(146 146 -1) (153 153 -5) (154 154 -25) (157 157 -2)
(165 165 -1) (181 181 -21) (182 182 -51) (183 183 -50)
(185 185 -5) (186 186 -7)
(188 188 29) (199 199 -1) (209 209 -1)
(210 212 -74)(214 214 -53) (215 215 -75) (216 216 -77)
(222 222 -81)
(224 224 -62) (226 226 -79) (227 227 -78) (229 229 -1)
(231 231 1)
(233 233 -70) (234 235 -84) (237 237 -1)))
(setq std:cp-cap-iso8859-4 '((65 90 32) (192 214 32)(216 222 32)
(152 152 71) (161 161 16)
(163 163 16) (165 166 16)
(169 172 16) (174 174 16) (189 189 2)
;; not tested
))
(setq std:cp-cap-koi-8r '((65 90 32)
(179 179 -16)(224 255 -32)))
;; the weirdest charset ever (by microsoft), ignoring cp866,
;; iso-8859-5 and koi8-r
(setq std:cp-cap-cp1251 '((65 90 32)(192 223 32)
(128 128 16)(129 129 2)
(138 138 16)(140 143 16)
(161 161 1)(163 163 25)(165 165 15)
(168 168 16)(170 170 16)(175 175 16)
(178 178 1)(189 189 1)
))
(setq std:cp-cap-dos866 '((65 90 32)
(128 144 32)(145 159 80)
(240 240 1)(242 242 1)
(244 244 1)(246 246 1)
))

;; Beware: Dynamic Autolisp code, just to get the idea.
;; you really should store the pairs in bitfield for the
;; predicates and vectors for the converters.
(defun STD-ISUPPER (_i)
  (if (stringp _i)
    (setq _i (ascii _i)))
  (apply 'or
         (mapcar
           (function (lambda (l)
                       (<= (car l) _i (cadr l))))
           std:actual-cp-cap)))

(defun STD-TOUPPER (i / cp x)
  (setq x (car (setq cp std:actual-cp-cap)))
  (while x
    (if (<= (+ (caddr x) (car x)) i (+ (caddr x) (cadr x)))
      (setq i (- i (caddr x))
            x nil)
      (setq cp (cdr cp) x (car cp))
    )
  )
  i
)

this is from http://xarch.tu-graz.ac.at/autocad/stdlib/STDLOCAL.LSP
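
Following the remark above that vectors would suit Common Lisp better,
here is a minimal sketch of how the (<from> <to> <tolower>) triples
could be precomputed into a 256-entry lookup vector and a bit vector.
All names are hypothetical. The Latin-1 ranges used here end at 222
because 223 is ß, which is not a capital letter.

```lisp
;; Capital-letter ranges for ISO 8859-1 as (from to delta-to-lowercase).
(defparameter *cap-iso8859-1* '((65 90 32) (192 214 32) (216 222 32)))

(defun make-upcase-table (ranges &optional (size 256))
  "Return a vector mapping every code below SIZE to its uppercase code."
  (let ((table (make-array size)))
    ;; Start from the identity mapping.
    (dotimes (code size)
      (setf (aref table code) code))
    ;; Point each lowercase code at its uppercase partner.
    (dolist (range ranges table)
      (destructuring-bind (lo hi delta) range
        (loop for upper from lo to hi
              for lower = (+ upper delta)
              when (< -1 lower size)
                do (setf (aref table lower) upper))))))

(defun make-upper-bits (ranges &optional (size 256))
  "Return a bit vector with a 1 at every uppercase code."
  (let ((bits (make-array size :element-type 'bit :initial-element 0)))
    (dolist (range ranges bits)
      (destructuring-bind (lo hi delta) range
        (declare (ignore delta))
        (loop for code from lo to hi
              do (setf (bit bits code) 1))))))
```

With the tables built once per codepage, an upcase is a single AREF on
the character code instead of a walk through the list of ranges.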
---
Reini Urban
http://xarch.tu-graz.ac.at/autocad/news/faq/autolisp.html

Vassil Nikolov

Apr 17, 1999

In article <7f6l5s$cnm$1...@news.ptc.spbu.ru>,
"Valeriy E. Ushakov" <u...@ptc.spbu.ru> wrote:
> Vassil Nikolov <vnik...@poboxes.com> wrote:
(...)

> > By the way, I am not aware of another alphabet besides the
> > contemporary Russian version of Cyrillics where the number
> > of letters is a power of 2 (2^5).
>
> [mumbling the alphabet]... hmm, last time I checked - it was 33. :-)
>
> You probably forgot cyrillic-letter-io, which is rarely used in
> printed Russian (mostly in texts for children and foreigners and in
> ambiguous cases) and is substituted with cyrillic-letter-ie. Still
> it's part of the alphabet/orthography.

No, I had not forgotten it, I forgot to write something like
`mainstream use,' and I apologise for that. (I believe I have
(almost) never seen this letter in a publication (and I have read a
_lot_ of Russian texts, still do from time to time) which was
not a children's book, a textbook, a dictionary, or some such.)
Specifically, in the context of the thread, I was thinking of the
32-character block that one sees in 8859-5 etc.

Of course, cyrillic letter io _is_ an integral part of the Russian
alphabet (Ukrainian? Byelorussian?), and I should have mentioned it.
By the way, with all those language reforms, the phrase `last time
I checked' is very appropriate...

--
Vassil Nikolov <vnik...@poboxes.com> www.poboxes.com/vnikolov
(You may want to cc your posting to me if I _have_ to see it.)
LEGEMANVALEMFVTVTVM (Ancient Roman programmers' adage.)

Vassil Nikolov

Apr 17, 1999

In article <3716EA3D...@simplex.nl>,

Arthur Lemmens <lem...@simplex.nl> wrote:
(...)

> In my situation, (code-char #x41A) is usually a
> Russian K. But next time I'll call it Cyrillic.

That sounds like an interesting situation. If it is _usually_
that, what is it _sometimes_? Does it ever happen to be a
Bulgarian K? And what would the difference be, for your purposes,
between a Russian K and a Bulgarian K? (I'd be hard pressed
to think of such a difference in terms of characters and their
codes.)

(Or do you sometimes use another 16-bit-per-character encoding
where #x41A is the code of some Chinese or Japanese ideogram?)

My point was that unless the context is appropriately specific,
the generic name (Cyrillic) should be used in preference to the
language-specific name (Russian). In the same way, outside of a
specific context, it is appropriate to say `Roman K' (or `Latin
K'), rather than `English K' (or `Italian K' etc.).

If only the world had simply stuck to the good old Phoenician
alphabet as it was...

(...)

> > I'd rather have had the above as
> >
> > (char-code (char-upcase (code-char #xF0)))
> >
> > instead of, or in addition to, the above, which makes little
> > apparent sense on my Macintosh with a Cyrillic font selected.
>
> Before sending, I verified that the content-type header included
> "charset=iso8859-1" to increase the probability of readers seeing
> what I meant.

I _did_ see what you meant---but not with my _eyes_ (with the mind's
eye, perhaps, if my mind has one---I have never seen it).

Well, I know I deserve to lose... Having struggled on too many
occasions with all those 4-5 different Cyrillic encodings that are in
_active_ use around myself (and that are mutually exclusive with the
Roman letters with diacritical marks, for happiness to be complete),
and with all those different EBCDIC-ASCII mappings, etc.^1, I have become
somewhat hypersensitive to not having the character code itself on such
occasions. I wish ``charset=...'' did work, always. In a perfect world,
maybe.
__________
^1 the law of perverse solutions (`every problem has one') is also
applicable here: there are character sets where the codes for
the Roman _and_ Cyrillic letters A, C, E, etc. (that have the
same glyphs) are the same... KOI-8 (and even DKOI) is a blessing
by comparison.

Vassil Nikolov

Apr 17, 1999

In article <3717b01c.47909860@judy>,
rur...@sbox.tu-graz.ac.at (Reini Urban) wrote:
(...)

> I post this also (instead of linking to it) to let you see how weird
> some codepages had been designed, considering case predicates and case
> conversions. I guess most OS do it by precalculating the tables, wasting
> a lot of bytes.

(...)

First of all, it was nice of you to post a useful piece of data.

Second, I would like to make a few points, not to criticise, but
to show there are different ways to look at this.

* The sets you identified as weird all contain Cyrillic characters
that by themselves look rather strange, even to one who knows the
Greek alphabet (which helps a little). Regarding the layout,
weirdness comes at least in part from the fact that only the
32 `mainstream' Cyrillic characters are in contiguous positions
(even with `well-behaved' sets like 8859-5). Since there are
other characters in addition to these 32, they had to be fit
elsewhere, while deciding which other characters (like left/right
single/double quotes) to keep and which to sacrifice.

(By the way, even limiting ourselves to ()[]{}<>, there isn't
a simple operation like toggling a bit to convert an `opener'
into a `closer,' so even 7-bit ASCII is not absolutely
regular (not that it could have been, I believe).)

* Keeping tables to support case conversions etc. does not take
up that much memory (especially now that memory is not as
expensive as it was a couple of decades ago), and improves speed
a lot; besides, with some sets like KOI-8 and effectively
Macintosh Cyrillics^1 as well, tables are a must in order to
do sorting even if we limit ourselves to the `mainstream'
characters (because (< (CHAR-CODE a) (CHAR-CODE b)) does not
produce alphabetical order).
__________
^1 uppercase: 80-9F, lowercase: E0-FE,DF
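
To illustrate the sorting point, here is a small Common Lisp sketch of
a collation table for the 32 `mainstream' lowercase letters in KOI8-R.
The code list assumes the usual KOI8-R layout (lowercase letters at
#xC0 to #xDF, arranged to mirror their Latin transliterations) and
should be checked against the charset definition before real use. The
names are hypothetical.

```lisp
;; KOI8-R codes of the 32 lowercase letters, listed in alphabetical order
;; (a, be, ve, ghe, de, ... , e, yu, ya). Assumed layout, see note above.
(defparameter *koi8r-lowercase-alphabetical*
  '(#xC1 #xC2 #xD7 #xC7 #xC4 #xC5 #xD6 #xDA
    #xC9 #xCA #xCB #xCC #xCD #xCE #xCF #xD0
    #xD2 #xD3 #xD4 #xD5 #xC6 #xC8 #xC3 #xDE
    #xDB #xDD #xDF #xD9 #xD8 #xDC #xC0 #xD1))

(defun make-rank-table (ordered-codes &optional (size 256))
  "Return a vector mapping each listed code to its alphabetical rank.
Codes that are not listed map to NIL."
  (let ((table (make-array size :initial-element nil)))
    (loop for code in ordered-codes
          for rank from 0
          do (setf (aref table code) rank))
    table))

(defun koi8r-code< (a b table)
  "True if code A sorts before code B; both must be listed in TABLE."
  (< (aref table a) (aref table b)))
```

For example, ve (#xD7) has a larger code than ghe (#xC7) but comes
before it in the alphabet, so only the table-based comparison orders
the two correctly.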

Third, if anyone needs assistance with making sense out of Cyrillic
characters and sets (in particular, Bulgarian, Russian, and Serbian
<silly remarks deleted>), I'd be happy to be of any help, just send
me a private e-mail.

Good luck with character sets,
Vassil.

Vassil Nikolov

Apr 17, 1999

In article <wk3e20i...@ifi.uio.no>,

It's good that it has been solved (well, I shouldn't say that
when I don't know how). I was never able to understand what
made them use M-DEL for a printable character in the first
place.

Erik Naggum

Apr 17, 1999

* Vassil Nikolov <vnik...@poboxes.com>

| It's good that it has been solved (well, I shouldn't say that when I
| don't know how). I was never able to understand what made them use M-DEL
| for a printable character in the first place.

ISO character sets come in 94-character and 96-character flavors, apart
from ISO 10646. the ISO 8859 family uses the ISO 4873 8-bit template,
with a 94-character set in the left half and a 96-character set in the
right half.

in the 94-character set, 2/0 is SPACE and 7/15 is DELETE, both of which
sort of dual as control and data characters. in the 96-character set,
2/0 and 7/15 are data characters.

if you have a 94-character set and only 7 bits worth of data, the last
bit is free to be used for other purposes, such as constant zero, parity,
an application flag, or constant one. most modern uses are constant zero
and an application flag. however, if you use an 8-bit character set, the
only chance you have at using an application flag is with 10/0 and 15/15,
in which case you'd probably want a non-breaking space and what IBM calls
EO (eight ones), used as an "end of whatever" signal. referring to 15/15
as "M-DEL" regardless of whether it is a character or EO betrays a
serious conceptual confusion about the usage of the code space.
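
The column/row notation used above maps to byte values as 16 times the
column plus the row. A tiny Common Lisp sketch (the function name is
made up for illustration):

```lisp
(defun position-code (column row)
  "Convert ISO column/row notation such as 7/15 to a byte value."
  (+ (* 16 column) row))

(position-code 2 0)    ; => 32,  SPACE in the left half
(position-code 7 15)   ; => 127, DELETE in the left half
(position-code 10 0)   ; => 160, first position of the right half
(position-code 15 15)  ; => 255, last position of the right half
```

15/15 is 7/15 with the eighth bit set, which is presumably where the
"M-DEL" nickname comes from.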

incidentally, there _is_ no upper-case version of ÿ, just as there is no
upper-case version of ß. pining for LATIN CAPITAL LETTER Y WITH DIAERESIS
is like pining for LATIN CAPITAL LETTER SHARP S -- a symptom of a strong
inability to deal with practical matters and to understand the sometimes
_very_ erratic history of writing systems.

not that Vassil or anyone here is particularly to blame for this, but the
history of the æ, oe (not in 8859-1 because some French moron told ECMA
it wasn't needed and shouldn't be there, and then we got × and ÷ stuck in
the middle of the O's, only to have the smart French guy who designed
this stuff return fully recuperated after some serious accident or other,
only the voting had completed, to demand a 8859 member with OE and oe --
which they got from ISO after a few years, but which nobody uses, not
even the French¹), and ÿ is one of diphthongs that merged over the course
of centuries and then assumed phonemes of their own. ae -> æ in Denmark
and Norway are almost the same as ä in Sweden, but different from ä in
Germany (and the decoration used to be different, too, until ECMA had
enough of it). the French oe has a long and arduous story I don't know
in detail, but it's not unlike ö in Germany.

now, ÿ is not a y with diaeresis at all. it has more in common with et
(&) and ad (@) than y, since it's "ij" written together. in Belgia and
the Netherlands, it is pronounced like the English long I. of course, as
time goes by, various stupid people will do all kinds of stupid things,
and in this case, we have the _reverse_ of what happened in France when
some genius² decided that capital letters should not have accents because
that was too hard to do with early typewriters and printers -- this has
since been reversed when computers learned how to handle French. so now
that we have these nifty computerized thingamajigs, let's just forget
that neither I nor J have dots on them, even though i and j do (despite
the linguist³ who decided that Turkish i and j should upcase to I and J
with dots, but I and J should downcase to i and j without dots, which I
think is at least part of the reason awful movies get Turkey awards), so
the nifty computers should produce a _really_ historically moronic letter
that nobody in their right mind would ever want to use.

so, the single cluon in danger of being annihilated by swarms of morons
upon contact is that just as ß is upcased to SS, ÿ is upcased to IJ.

[ this article was best viewed with an ISO 8859-1 capable font. ]

#:Erik
-------
¹ the moral of this story is either to keep the morons away from standards
bodies or not to have serious accidents if you're the only smart guy in
France.
² read: moron -- it wasn't the only smart guy in France alluded to above.
³ another moron; wouldn't surprise me if he was French.

Philip Lijnzaad

Apr 17, 1999, 3:00:00 AM4/17/99

to

[ interesting thread, this ]

On 17 Apr 1999 17:23:24 +0000,
"Erik" == Erik Naggum <er...@naggum.no> writes:

Erik> now, ÿ is not a y with diaeresis at all. it has more in common with et
Erik> (&) and ad (@) than y, since it's "ij" written together.

Being Dutch, I probably should have known or figured this out, but I didn't;
I always thought it was a Turkish letter. I don't know who invented the
graphical form of this letter (ÿ), but it probably wasn't a Dutchman. In
actual practice, "ij", although one letter (actually, a diphthong), is *always*
typed and typeset as an i followed by a j. As far as I'm concerned, I'd be
happy to cede this ascii value to more important purposes (capital sharp s?).
When upcased, both i and j have to be upcased (which is rare, but a good
example is 'IJsselmeer', the big watery hole in the middle of
Holland^H^H^H^H^H^H^H^HThe Netherlands). However, most dictionaries sort the
'ij' as two separate letters. Confusing, sort of.
Philip
--
To accurately forge this signature, use a lucidatypewriter-medium-12 font
-----------------------------------------------------------------------------
Philip Lijnzaad, lijn...@ebi.ac.uk | European Bioinformatics Institute
+44 (0)1223 49 4639 | Wellcome Trust Genome Campus, Hinxton
+44 (0)1223 49 4468 (fax) | Cambridgeshire CB10 1SD, GREAT BRITAIN
PGP fingerprint: E1 03 BF 80 94 61 B6 FC 50 3D 1F 64 40 75 FB 53

Lieven Marchand

Apr 17, 1999, 3:00:00 AM4/17/99

to

Erik Naggum <er...@naggum.no> writes:

> now, ÿ is not a y with diaeresis at all. it has more in common with et
> (&) and ad (@) than y, since it's "ij" written together. in Belgia and
> the Netherlands, it is pronounced like the English long I. of course, as
> time goes by, various stupid people will do all kinds of stupid things,

Except that in the Dutch speaking parts of Belgium and the
Netherlands, everybody writes it as ij. The confusion could have been
started because some morons (this time not even French) collated the
ij combination with the y, although modern dictionaries have stopped
this a long time ago. There is also some difference of opinion about how
to write an uppercase version of this. Some people use Ij, but most,
especially in handwriting, will use a variant of uppercase Y with
diaeresis.

BTW: if Gordon's Introduction to Old Norse is accurate and can be
extrapolated to the modern variant, it's rather pronounced as the ei
diphthong in 'bein'.

--
Lieven Marchand <m...@bewoner.dma.be>
If there are aliens, they play Go. -- Lasker

Vassil Nikolov

Apr 18, 1999, 3:00:00 AM4/18/99

to

In article <31333586...@naggum.no>,

Erik Naggum <er...@naggum.no> wrote:
> * Vassil Nikolov <vnik...@poboxes.com>
> | It's good that it has been solved (well, I shouldn't say that when I
> | don't know how). I was never able to understand what made them use M-DEL
> | for a printable character in the first place.

(...)

> however, if you use an 8-bit character set, the
> only chance you have at using an application flag is with 10/0 and 15/15,
> in which case you'd probably want a non-breaking space and what IBM calls
> EO (eight ones), used as an "end of whatever" signal. referring to 15/15
> as "M-DEL" regardless of whether it is a character or EO betrays a
> serious conceptual confusion about the usage of the code space.

I don't know if what it _betrays_ is true (don't have such introspective
capabilities), but what it _is_ is inappropriate use of technical
jargon. Sorry for that.

Correct me if I am wrong, but the above (quoted) paragraph does not
contradict a statement that using 15/15 for a printable character is
inappropriate. Or did I miss anything?

(...)

> _very_ erratic history of writing systems.

But very interesting, and from an information technology point
of view too. (Writing is an information technology in my book
as this phrase does not necessarily mean computer technology.)

It is hard to encode the barely encodable. (I.e. to transform
human speech into a sequence of signs.) I find it interesting
that the same language can be used for speaking and writing.

> not that Vassil or anyone here is particularly to blame for

[inadequacies in standardised character sets]

:-)

(This reminded me of some Russian who allegedly said, `Cyril and
Methodius did such a bad thing to us...' (meaning that otherwise
Russians would be using the Roman alphabet, like e.g. the Polish
or the Czech, and be saved from many headaches, perhaps).)
__________
For the Russian-speaking: `Kiril i Metodij nam takoe nadelali...';
St. Methodius was St. Cyril's brother and co-developer/co-translator.

(...)

> neither I nor J have dots on them, even though i and j do (despite
> the linguist³ who decided that Turkish i and j should upcase to I and J
> with dots, but I and J should downcase to i and j without dots, which I

I don't understand your point here. In the version of the Roman alphabet
as used in Turkey (and adopted by an Act of Parliament from 1928, by the
way), there are two I's: one has dots both in the small and capital case
(and is pronounced as the `i' in `fit') and the other has no dots either
in the small or capital case (and is pronounced as the `i' in `fir' but
short and without any `r' of course). Whether this is moronic is not
for me to say, but this is the way the Turkish alphabet is. (As to J
in that alphabet, it has a dot in the small case only.)
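
The two-I arrangement described above can be sketched in code. Python's built-in `str.upper()` and `str.lower()` are locale-independent, so they pair i with I the way English does. The helpers below, `turkish_upper` and `turkish_lower`, are hypothetical names for a minimal illustration of the Turkish pairing (dotted i with U+0130, dotless U+0131 with I). They are not a complete locale-aware casing routine.

```python
DOTTED_CAPITAL_I = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"    # LATIN SMALL LETTER DOTLESS I


def turkish_upper(text: str) -> str:
    """Upcase with the Turkish pairing: i -> dotted capital, dotless i -> I."""
    text = text.replace("i", DOTTED_CAPITAL_I).replace(DOTLESS_SMALL_I, "I")
    return text.upper()


def turkish_lower(text: str) -> str:
    """Downcase with the Turkish pairing: I -> dotless i, dotted capital -> i."""
    text = text.replace("I", DOTLESS_SMALL_I).replace(DOTTED_CAPITAL_I, "i")
    return text.lower()


print(turkish_upper("fit"))   # dotted capital I in the middle
print(turkish_lower("FIR"))   # dotless small i in the middle
print("fit".upper())          # the default, non-Turkish pairing
```

J needs no special handling in this sketch, since it has a dot in the small case only and the default mapping already covers that.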

(Turkish is a very rich language, having incorporated a lot from
Arabic and Persian; until Ataturk's reforms in the 1920's, Arabic
script (or some variety thereof) was used for writing. I do not
know Turkish (apart from a few words), but I have a dictionary and
I know a few facts about its history (of the language, not the
dictionary).)

> think is at least part of the reason awful movies get Turkey awards), so
> the nifty computers should produce a _really_ historically moronic letter
> that nobody in their right mind would ever want to use.

I.e. a small minority would never want to use it, and the majority will
just accept it as the latest and the greatest benefit coming from
computer technology.

> so, the single cluon in danger of being annihilated by swarms of morons
> upon contact is that just as ß is upcased to SS, ÿ is upcased to IJ.

I wondered (as an academic exercise) what should CHAR-UPCASE and
NSTRING-UPCASE do about LATIN SMALL LETTER Y WITH DIAERESIS (assuming
STRING-UPCASE is allowed to return a longer string which isn't
especially nice either). Signal an error? Or the implementation
would state that the character sets it uses do not include this
letter? (Making CHAR-UPCASE return two values, like #\I and #\J
in this case, appears more than perverse, though who knows.)
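
The tension in that question, between a one-character-in, one-character-out upcase and a full mapping that may lengthen the string, can be shown with a small sketch. It uses Python rather than Lisp, and `char_upcase` and `string_upcase_per_char` are hypothetical helpers that only imitate the per-character shape of the CL functions. Python's own `str.upper()` follows the Unicode full case mapping, in which ß becomes SS and U+00FF maps to U+0178, a character that ISO 8859-1 does not contain.

```python
def char_upcase(ch: str) -> str:
    """One character in, one character out; leave ch alone if no
    single-character uppercase exists."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    upper = ch.upper()
    return upper if len(upper) == 1 else ch


def string_upcase_per_char(text: str) -> str:
    """Upcase by applying char_upcase to each character in turn."""
    return "".join(char_upcase(ch) for ch in text)


word = "stra\u00dfe"                    # "strasse" written with sharp s
print(string_upcase_per_char(word))     # same length, sharp s left untouched
print(word.upper())                     # full mapping, one character longer
print(hex(ord(char_upcase("\u00ff"))))  # code point Unicode pairs with U+00FF
```

The per-character version never changes the length of the string, which is what a destructive operation such as NSTRING-UPCASE would need.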

> [ this article was best viewed with an ISO 8859-1 capable font. ]

I did use one this time, on a different machine.

(...)

Erik Naggum

Apr 18, 1999, 3:00:00 AM4/18/99

to

* Vassil Nikolov <vnik...@poboxes.com>

| Correct me if I am wrong, but the above (quoted) paragraph does not
| contradict a statement that using 15/15 for a printable character is
| inappropriate. Or did I miss anything?

yes. 10/0 and 15/15 are characters when the right-hand side of an 8-bit
character set (GR) is filled with a 96-character set. (the other 32 are
control characters (C1).) if you had filled it with a 94-character set,
it would have been inappropriate to use 15/15 at all.

the reason for this is that 10/0 and 15/15 are characters in their own
right and must be coded with 8 bits, but if you use a shifting coding
with only 7 bits and codes to swap between G0 and G1 (both now in GL)
with the codes SO and SI, then it's important that 2/0 and 7/15 remain
their usual semi-control characters even when G1 is invoked.
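
As a rough illustration of the layout just described, the sketch below splits the 8-bit code space into C0, GL, C1 and GR by column, then asks whether a byte in the right half is a usable graphic position for a 94-character or a 96-character set. It is a simplified reading of the post, not an implementation of ISO 2022, and the names `region` and `is_graphic_in_right_half` are invented for the example.

```python
def region(byte: int) -> str:
    """Name the area of the 8-bit code space a byte falls in."""
    if not 0 <= byte <= 255:
        raise ValueError("byte must be in 0..255")
    column = byte >> 4
    if column < 2:
        return "C0"   # columns 0-1: control characters
    if column < 8:
        return "GL"   # columns 2-7: left-hand graphic area
    if column < 10:
        return "C1"   # columns 8-9: the other 32 control characters
    return "GR"       # columns 10-15: right-hand graphic area


def is_graphic_in_right_half(byte: int, set_size: int) -> bool:
    """True if byte is a character position when GR holds a set of set_size."""
    if set_size not in (94, 96):
        raise ValueError("set_size must be 94 or 96")
    if region(byte) != "GR":
        return False
    if set_size == 96:
        return True                 # 10/0 through 15/15, all 96 positions
    return 0xA1 <= byte <= 0xFE     # a 94-set leaves 10/0 and 15/15 unused


print(is_graphic_in_right_half(0xFF, 96))
print(is_graphic_in_right_half(0xFF, 94))
```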

| I don't understand your point here.

seems I was mistaken about the up/downcasing of I with/without dots.
(shoot, gotta check and go back and fix those files for Emacs.)

| I wondered (as an academic exercise) what should CHAR-UPCASE and
| NSTRING-UPCASE do about LATIN SMALL LETTER Y WITH DIAERESIS (assuming
| STRING-UPCASE is allowed to return a longer string which isn't especially
| nice either). Signal an error? Or the implementation would state that
| the character sets it uses do not include this letter? (Making
| CHAR-UPCASE return two values, like #\I and #\J in this case, appears
| more than perverse, though who knows.)

I have come to think that people who use sick writing systems should pay
for their own mistakes so they will have reason to fix them. forcing
everybody else to pay for them only causes software not to be available.
e.g., the Spanish purportedly undid the silly sorting requirements of ll
(treated as a separate "letter" between k and l, I think it was) due to
the force of simplicity and logic of computers (or was it marketing :).
a German spelling reform (which people seem to hate rather strongly) does
away with the sharp s and spell it "ss" in lowercase, too. the Norwegian
and Danish sillitude of sorting "aa" as equivalent to "å" (a ring), and
the hysterical requirement that German spelled out with "ue" instead of
"ü" should be sorted as if it wasn't spelled out are examples of morons
who got into standards bodies. (now, the right way to do this is to
store a sort key and a print string, but since people don't use tools
easily extendible that way, forcing stupid people to do this causes a lot
of grief and problems when they try to print the sort key or vice versa.)
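
The "sort key plus print string" idea in that parenthesis can be sketched as follows. The `Entry` type, the sample names and the keys assigned to them are all hypothetical. The point is only that the order comes from one field and the displayed text from another.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    sort_key: str       # what the collation sees
    print_string: str   # what the reader sees


# Hypothetical catalogue: "Mueller" is filed as if it were spelled with
# u-umlaut, which this made-up key scheme reduces to a plain "u".
entries = [
    Entry("muzak", "Muzak"),
    Entry("muller", "Mueller"),
    Entry("mulder", "Mulder"),
]


def in_catalogue_order(items):
    """Return the print strings ordered by their stored sort keys."""
    return [e.print_string for e in sorted(items, key=lambda e: e.sort_key)]


print(in_catalogue_order(entries))
print(sorted(e.print_string for e in entries))  # naive order, for contrast
```

Printing a sort key or sorting on a print string is the mix-up the post warns about, and keeping the two fields separate makes that mistake easier to spot.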

anyway, let's just ignore the issue and ask them to spell it out as ij,
like the Dutch correctly do. (the ÿ is Belgian, _from_ Dutch ij.) (I'm
not sure upcasing "ij" to "IJ" is all that great an idea, although it is
obvious if you look at fonts designed in or for The Netherlands: they
sport "ij" and "IJ" ligatures, just as fonts designed for Norway have a
ligature for "fj" just like "fi", because of "fjord" and "fjell".)

anyway. 8 bits would have been enough if we had been using floating
diacritics and upcasing and downcasing would have needed to worry about
A-Z, only. ISO tried that, too, (ISO 6937) but computer people were not
able to appreciate it, because they were thinking fonts, not character
sets. sigh.

if there's reincarnation, I hope I won't remember any of this the next
time around.

Lars Marius Garshol

Apr 18, 1999, 3:00:00 AM4/18/99

to

* Erik Naggum
|
| now, ÿ is not a y with diaeresis at all. it has more in common with et
| (&) and ad (@) than y, since it's "ij" written together.

* Philip Lijnzaad
|
| [...] In actual practice, "ij", although one letter (actually, a
| diphthong), is *always* typed and typeset as an i followed by a j. As
| far as I'm concerned, I'd be happy to cede this ascii value to more
| important purposes (capital sharp s?). When upcased, both i and j
| have to be upcased [...]. However, most dictionaries sort the 'ij'
| as two separate letters. Confusing, sort of.

Most? From what I've heard (from Dutch sources, BTW) IJ is sorted as a
separate letter after Z. Can you elaborate on whether both happens or
whether I've been misinformed?

And if it's really sorted separately then I think it makes sense to
consider it a separate character, as Unicode more or less does
(although it calls it a ligature): U+0132 and U+0133.
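
For reference, the names and case pairing of the code points mentioned in this thread can be inspected with Python's standard `unicodedata` module. The `describe` helper is a made-up convenience, and the names printed are those of the Unicode version bundled with the Python in use.

```python
import unicodedata


def describe(ch: str) -> str:
    """Format a character as 'U+XXXX NAME'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for ch in ("\u0132", "\u0133", "\u00ff", "\u00e6"):
    print(describe(ch))

# The two IJ code points form an upper/lower case pair.
print("\u0133".upper() == "\u0132")
```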

--Lars M.

Erik Naggum

Apr 18, 1999, 3:00:00 AM4/18/99

to

* Lars Marius Garshol <lar...@ifi.uio.no>

| And if it's really sorted separately then I think it makes sense to
| consider it a separate character, as Unicode more or less does
| (although it calls it a ligature): U+0132 and U+0133.

this is getting a bit far afield, but collation order, characterness, and
glyphness are distinct properties of a writing system element. for one
thing, there is no _single_ correct collation order. character sets do
_not_ imply collation order. characterness of a writing system element
is a fairly fundamental concept and is strongly associated with meaning.
glyphness of a writing system element is strongly associated with looks.
finally, fonts are made up of instantiations of glyphs. e.g., a writing
system element may exhibit such different meanings that they deserve to be
separate characters, although this is very rare. in general, there is
also one glyph per character, although some have more (the German short
and long s, the open and baggy a, the open and broken vertical line), but
more frequent is a glyph for a sequence of characters (ligatures in Latin
scripts, but includes vowels in Indic scripts and Hebrew) or a character
in context (the connectives (single, initial, medial, final) in Arabic
scripts), etc. collation order is tightly coupled with character, but
for hysterical raisins many languages collate sequences of characters as
a single unit. to represent all of this correctly, you need a whole
bunch of tables. there are therefore glyph set standards that are very
separate from character set standards, and their mapping is non-trivial.
there are huge tables of correct collation orders for different scripts
and languages (French requires a five-level deep collation system in full
name and dictionary sorting), and conflation of representation makes up
most of it (e.g., no significance is attached to the ring in "Ångstrøm"
in an English dictionary, where it is sorted with Angst, but you'll find
it at the end of a Norwegian one because Å is a separate character).
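
The claim that one character set supports more than one collation order can be illustrated with two key functions over the same strings. This is a minimal sketch under stated assumptions: the "English" key simply drops combining marks after decomposition, and the "Norwegian" key ranks letters by a 29-letter alphabet with æ, ø and å at the end. Real collation tables are far larger, as the post says, and both function names are invented here.

```python
import unicodedata

# a-z followed by ae, o-slash and a-ring, written as escapes so the
# precomposed forms are unambiguous.
NORWEGIAN_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e6\u00f8\u00e5"
NORWEGIAN_RANK = {ch: i for i, ch in enumerate(NORWEGIAN_ALPHABET)}


def english_key(word: str) -> str:
    """Ignore diacritics: decompose, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def norwegian_key(word: str) -> list:
    """Rank each letter by its place in the Norwegian alphabet."""
    unknown = len(NORWEGIAN_ALPHABET)   # anything else sorts last
    return [NORWEGIAN_RANK.get(ch, unknown) for ch in word.lower()]


words = ["Zebra", "\u00c5ngstr\u00f8m", "Angst"]
print(sorted(words, key=english_key))    # the a-ring word lands next to Angst
print(sorted(words, key=norwegian_key))  # the a-ring word lands at the end
```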

Unicode is a hybrid of a character and a glyph set. the reason for this
is fairly obvious when you consider its major proponents: Xerox and
Microsoft. Xerox makes printers and wanted a simple standard for which
they could make huge fonts. Microsoft are just too damn stupid to get it
right or to respect any traditions. (Xerox didn't want it to replace the
first ISO 10646 draft, however, so they may be excused.) in typical "is
this a font or what?"-misunderstanding, æ was a ligature in Unicode, but
I complained about it, so ISO 10646-1 has amended it to be a letter, and
"ij" is a character, not a presentation form, which it should have been.

Juanma Barranquero

Apr 19, 1999, 3:00:00 AM4/19/99

to

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 18 Apr 1999 09:43:07 +0000, Erik Naggum <er...@naggum.no> wrote:

> e.g., the Spanish purportedly undid the silly sorting requirements
> of ll (treated as a separate "letter" between k and l, I think it
> was) due to the force of simplicity and logic of computers (or was
> it marketing :).

Between "l" and "m".

What is stupid, IMHO, is not the fact of having "ll" as a single
letter, but having it so, and the same with "ch" (between "c" and "d")
and then having "rr" as r+r and "qu" as q+u. The sound of most of
those characters is not related to their spelling ("ll" is not an l+l,
etc., and "q" is *never* used in isolation in Spanish, it is *always*
q+u, the only case in Spanish where "u" is mute) so in a coherent
world either "ch", "ll", "rr" and "qu" should each be treated as a
single entity, or none of them at all (perhaps the best solution).

Regarding the reform of the sorting requirement, the Spanish RAE
("Real Academia Española de la Lengua") did it, but I think some
Latin American academies objected and the issue was dropped. Not sure,
though.

/L/e/k/t/u

-----BEGIN PGP SIGNATURE-----
Version: PGPfreeware 6.0.2i

iQA/AwUBNxr0ev4C0a0jUw5YEQJRdQCfWI/MKMWEMIMt4a28s8WlrhWBlZwAn0Fp
+tn5lYZRhWnsoNfQMxuJ7fML
=n4bS
-----END PGP SIGNATURE-----

Arthur Lemmens

Apr 20, 1999, 3:00:00 AM4/20/99

to

* Erik Naggum
|
| now, ÿ is not a y with diaeresis at all. it has more in common with et
| (&) and ad (@) than y, since it's "ij" written together.

* Philip Lijnzaad
|
| [...] In actual practice, "ij", although one letter (actually, a
| diphthong), is *always* typed and typeset as an i followed by a j. As
| far as I'm concerned, I'd be happy to cede this ascii value to more
| important purposes (capital sharp s?). When upcased, both i and j
| have to be upcased [...]. However, most dictionaries sort the 'ij'
| as two separate letters. Confusing, sort of.

* Lars Marius Garshol

|
| Most? From what I've heard (from Dutch sources, BTW) IJ is sorted as a
| separate letter after Z.

Not that any of this has much to do with Lisp, but:

- U+00FF (LATIN SMALL LETTER Y DIAERESIS) is described in the Unicode
standard as being French, not Dutch. This probably explains why
Philip didn't recognize it as a Dutch letter. It also casts some
doubt on Erik's explanation that it's "ij" written together.
I suppose we have to wait for the French to tell us more about this
(I read some French from time to time, but I don't recall ever
having seen a ÿ.)

- The Unicode version of Dutch 'ij', which _is_ "ij" written together
and is probably what Erik had in mind, is U+0133. Its upper case
equivalent is U+0132.

- IJ is _never_ sorted as a separate letter after Z. Maybe, sometimes,
it has been sorted as Y (between X and Z). Modern dictionaries sort
it as I followed by J. So you have '("iets" "ijdel" "ijsje" "ik").

- When a Dutchman doesn't have a U+0133 handy (which is very likely),
he just uses #\i followed by #\j. As in "ijsje". If this needs
capitalizing, he'll use #\I followed by #\J. Capitalizing the
above list would result in '("Iets" "IJdel" "IJsje" "Ik").
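
The sorting and capitalization rules in the list above can be tried out with plain two-character "ij". The sketch is in Python rather than Lisp, and `capitalize_dutch` is a hypothetical helper that covers only the word-initial case described here.

```python
def capitalize_dutch(word: str) -> str:
    """Capitalize a word, upcasing both letters of a leading 'ij'."""
    if word[:2].lower() == "ij":
        return "IJ" + word[2:]
    return word[:1].upper() + word[1:]


words = ["ik", "ijsje", "iets", "ijdel"]

# Plain character-by-character sorting already gives the dictionary order.
ordered = sorted(words)
print(ordered)
print([capitalize_dutch(w) for w in ordered])
print("ijsje".capitalize())   # the generic rule upcases only the i
```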

* Lars Marius Garshol

|
| And if it's really sorted separately then I think it makes sense to
| consider it a separate character, as Unicode more or less does
| (although it calls it a ligature): U+0132 and U+0133.

For _capitalization_ it makes some sense to consider it a separate
character. But _sorting_ will be much more likely to go wrong
when you use a separate character.
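
A small sketch of how sorting goes wrong with the separate character: U+0133 has a higher code point than z, so a naive sort pushes such words to the end. One possible repair, assumed here purely for illustration, is to sort on a compatibility-normalized key, since NFKC expands U+0133 to the two letters i and j. NFKC also rewrites other compatibility characters, which may or may not be wanted.

```python
import unicodedata


def expand_ligature(word: str) -> str:
    """Sort key: compatibility-normalize, which turns U+0133 into 'ij'."""
    return unicodedata.normalize("NFKC", word)


# "ijsje" written with the single ligature character U+0133.
words = ["ik", "\u0133sje", "iets", "zon"]

print(sorted(words))                        # ligature word ends up after "zon"
print(sorted(words, key=expand_ligature))   # ligature word sorts as i + j
```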

Arthur Lemmens

Erik Naggum

Apr 20, 1999, 3:00:00 AM4/20/99

to

* Arthur Lemmens <lem...@simplex.nl>

| Not that any of this has much to do with Lisp, but:
|
| - U+00FF (LATIN SMALL LETTER Y DIAERESIS) is described in the Unicode
| standard as being French, not Dutch.

I said _from_ Dutch "ij". it's an _imported_ character. it is used in a
bunch of names in Belgia that historically had "ij" in their name.

| It also casts some doubt on Erik's explanation that it's "ij" written
| together.

it does? so the fact that æ is a Danish and Norwegian letter casts doubt
on its history of being imported from Latin as its a+e ligature, too?
appreciate that the history of writing systems is not a couple years old.

| - The Unicode version of Dutch 'ij', which _is_ "ij" written together
| and is probably what Erik had in mind, is U+0133.

I probably had in mind what I wrote. so do other people. please assume
this next time you feel an overpowering urge to tell people what they
think.

#:Erik

Casper H.S. Dik - Network Security Engineer

Apr 20, 1999, 3:00:00 AM4/20/99

to

[[ PLEASE DON'T SEND ME EMAIL COPIES OF POSTINGS ]]

Lars Marius Garshol <lar...@ifi.uio.no> writes:

>* Philip Lijnzaad
>|
>| [...] In actual practice, "ij", although one letter (actually, a
>| diphthong), is *always* typed and typeset as an i followed by a j. As
>| far as I'm concerned, I'd be happy to cede this ascii value to more
>| important purposes (capital sharp s?). When upcased, both i and j
>| have to be upcased [...]. However, most dictionaries sort the 'ij'
>| as two separate letters. Confusing, sort of.

>Most? From what I've heard (from Dutch sources, BTW) IJ is sorted as a
>separate letter after Z. Can you elaborate on whether both happens or
>whether I've been misinformed?

I think you've been misinformed; all Dutch dictionaries I've ever seen
as well as "Het Groene Boekje" (the official list of Dutch words) sort
"ij" as if it's an i followed by a j.

The only exception to this standard rule is the Dutch telephone
book; it sorts the ij as if it is a y.

>And if it's really sorted separately then I think it makes sense to
>consider it a separate character, as Unicode more or less does
>(although it calls it a ligature): U+0132 and U+0133.

I think the Dutch "ij" is a ligature, even though we learn differently
at school. As you say, both I and J are upcased together as
in "IJmuiden" but that holds true for AE ligatures as well.

Casper

--
Expressed in this posting are my opinions. They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.

Philip Lijnzaad

Apr 20, 1999, 3:00:00 AM4/20/99

to

On 20 Apr 1999 13:02:29 GMT,

"Casper" == Casper H S Dik <Caspe...@Holland.Sun.Com> writes:

>> Most? From what I've heard (from Dutch sources, BTW) IJ is sorted as a
>> separate letter after Z.

No, never.

Casper> all Dutch dictionaries I've ever seen as well as "Het Groene Boekje"
Casper> (the official list of Dutch words) sort "ij" as if it's an i
Casper> followed by a j.

yes, although I remember having used dictionaries in school that had IJ
between X and Z. It's apparently obsolete now, but:

Casper> The only exception to this standard rule is the Dutch telephone
Casper> book; it sorts the ij as if it is a y.

(didn't know that ... a bit strange and confusing, I'd say)

Casper> As you say, both I and J are upcased together as in "IJmuiden" but
Casper> that holds true for AE ligatures as well.

another point is abbreviations: I'm fairly sure that the Dutch 'ij' would be
abbreviated to 'IJ'. Making up an example: Vereniging ter bevordering van de
ijspret would be V.B.IJ, not V.B.I. The abbreviation issue must be correlated
with the capitalization issue, and I suspect it would be the same for
ligatures in other languages/scripts.

Marco Antoniotti

Apr 20, 1999, 3:00:00 AM4/20/99

to

#+:noise-ahead
Isn't Dutch a throat disease? :)

--
Marco Antoniotti ===========================================
PARADES, Via San Pantaleo 66, I-00186 Rome, ITALY
tel. +39 - 06 68 10 03 17, fax. +39 - 06 68 80 79 26
http://www.parades.rm.cnr.it/~marcoxa

Howard R. Stearns

Apr 20, 1999, 3:00:00 AM4/20/99

to

Vassil Nikolov wrote:
> ...

> I wondered (as an academic exercise) what should CHAR-UPCASE and
> NSTRING-UPCASE do about LATIN SMALL LETTER Y WITH DIAERESIS (assuming
> STRING-UPCASE is allowed to return a longer string which isn't
> especially nice either). Signal an error? Or the implementation
> would state that the character sets it uses do not include this
> letter? (Making CHAR-UPCASE return two values, like #\I and #\J
> in this case, appears more than perverse, though who knows.)

Careful. Recall that ANSI CL is an American standard and doesn't make
any attempt to accommodate other collating sequences.

String-upcase and friends are specifically required to work character by
character, without reference to any context:

"More precisely, each character of the result string is produced by
applying the function char-upcase to the corresponding character of
string."

I would have thought that ISO would have addressed this issue more
broadly in ISLisp, but it does not appear that they did. There is no
string-upcase at all, and string< and friends are specifically defined
to work character by character:

"Two strings string1 and string2 are in order (string<) if in the first
position in which they differ the character of string1 is char< the
corresponding character of string2, ..."

Given that Scheme is an ISO standard, apparently tries to do either the
right thing or nothing at all, and seems to try to not include useful
utilities which are "obvious" compositions or iterations of other
utilities, I would have expected that Scheme either wouldn't have string
operations at all or would have them do the contextually right thing.
After all, if you just want to map over a sequence with some char< or
such function, just do it. Of course, I'm wrong. Scheme also defines
string< to work character by character, but at least it meets my
expectations by failing to define string-upcase at all. In case I'm
misinterpreting, here's the definition for string<? and friends:

"These procedures are the lexicographic extensions to strings of the
corresponding orderings on characters. For example, string<? is the
lexicographic ordering on strings induced by the ordering char<? on
characters. If two strings differ in length but are the same up to the
length of the shorter string, the shorter string is considered to be
lexicographically less than the longer string."
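
The character-by-character definitions quoted above amount to the following comparison, sketched in Python. Ordering characters by code point is an assumption of this example, since the quoted definitions leave the character ordering to char< and char<?. The function name `string_lt` is made up.

```python
def string_lt(a: str, b: str) -> bool:
    """Lexicographic 'less than' built purely from a per-character order."""
    for x, y in zip(a, b):
        if x != y:
            # The first differing position decides the result.
            return ord(x) < ord(y)
    # Equal up to the shorter length: the shorter string is the lesser.
    return len(a) < len(b)


print(string_lt("abc", "abd"))
print(string_lt("ab", "abc"))
print(string_lt("Zebra", "apple"))   # capitals precede small letters by code
```

Nothing in such a definition can treat "ij", "ll" or "aa" as a unit, which is why the contextual rules discussed in this thread fall outside it.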

Lieven Marchand

Apr 20, 1999, 3:00:00 AM4/20/99

to

Erik Naggum <er...@naggum.no> writes:

> * Arthur Lemmens <lem...@simplex.nl>
> | Not that any of this has much to do with Lisp, but:
> |
> | - U+00FF (LATIN SMALL LETTER Y DIAERESIS) is described in the Unicode
> | standard as being French, not Dutch.
>
> I said _from_ Dutch "ij". it's an _imported_ character. it is used in a
> bunch of names in Belgia that historically had "ij" in their name.
>

Could you name two please? I live in Belgium (no need to form a
plural) and I've never seen it.

Lieven Marchand

Apr 20, 1999, 3:00:00 AM4/20/99

to

Marco Antoniotti <mar...@copernico.parades.rm.cnr.it> writes:

> #+:noise-ahead
> Isn't Dutch a throat disease? :)
>

#+:further-noise
No, we just have a fairly complete set of sounds. It helps to
recognize foreigners.

Schild en Vriend?

Erik Naggum

Apr 20, 1999, 3:00:00 AM4/20/99

to

* Lieven Marchand <m...@bewoner.dma.be>

| Could you name two please?

not off-hand. the rationale for ÿ that I have related here is that given
to ECMA in 1982-6 when formulating and to ISO in 1987 when adopting ISO
8859-1 through -4.

| I live in Belgium (no need to form a plural) and I've never seen it.
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^

_very_ good! never thought of it as a plural. we call it "Belgia"
in Norwegian. I'm sure it's an editing glitch. I hope it isn't receding
language skills. :)

I have actually seen it, though, which is why I remember the minutes from
the ISO work. it's been a while (11 years), and I regret I'm not able to
recall them in minute detail, anymore.

#:Erik

Breanndán Ó Nualláin

Apr 27, 1999, 3:00:00 AM4/27/99

to

>>>>> "Lieven" == Lieven Marchand <m...@bewoner.dma.be> writes:
>>>>> Erik Naggum <er...@naggum.no> writes:

Erik> I said _from_ Dutch "ij". it's an _imported_ character. it
Erik> is used in a bunch of names in Belgia that historically had
Erik> "ij" in their name.

Lieven> Could you name two please? I live in Belgium (no need to
Lieven> form a plural) and I've never seen it.

Kortrijk springs to mind. Maybe to yours too; is that why you asked
for two examples? :-)

I had to glance at a map to find a second one: Nijvel.

Would a Belgian spell these with "y trema"?