In Microsoft Word 2003, use Alt+X to toggle between the Unicode number
and the APL sign, using the font Arial Unicode MS.
In J
A=:2 3$5 9
Or
A=.2 3$5 9
This little example shows an ambiguity:
=: and =. both assign, like the left arrow, but =: assigns
globally while =. assigns locally.
In old APL there is no difference.
Maybe it should be displayed as
A 2190 2 3 2374 5 9
or
A (2190) 2 3 (2374) 5 9
Or
0041,2190,0032,0020,0033,2374,0035,0020,0039
Or
A {2190} 2 3 {2374} 5 9
It looks like Unicode is coming, and at http://www.vector.org.uk/forum/
you can see some examples of how it can be displayed on the web.
I often thought that Unicode would solve APL's problems.
That may not be the solution.
Obviously J has found a way to solve the character issue,
and J solved several other problems at the same time.
I think we should try to look at the things that may enhance the use of
any APL.
I love the APL characters and would love to see them more in use.
Here are the Unicode values for the APL2 #AV:
0021,0022,0023,0024,0025,0026,0027,0028,0029,002A,
002B,002C,002D,002E,002F,0030,0031,0032,0033,0034,0035,
0036,0037,0038,0039,003A,003B,003C,003D,003E,003F,0040,
0041,0042,0043,0044,0045,0046,0047,0048,0049,004A,004B,
004C,004D,004E,004F,0050,0051,0052,0053,0054,0055,0056,
0057,0058,0059,005A,005B,005C,005D,005E,005F,0060,0061,
0062,0063,0064,0065,0066,0067,0068,0069,006A,006B,006C,
006D,006E,006F,0070,0071,0072,0073,0074,0075,0076,0077,
0078,0079,007A,007B,007C,007D,007E,007F,00C7,00FC,00E9,
00E2,00E4,00E0,00E5,00E7,00EA,00EB,00E8,00EF,00EE,00EC,
00C4,00C5,2395,235E,2339,00F4,00F6,00F2,00FB,00F9,22A4,
00D6,00DC,00F8,00A3,22A5,20A7,2336,00E1,00ED,00F3,00FA,
00F1,00D1,00AA,00BA,00BF,2308,00AC,00BD,222A,00A1,2355,
234E,2591,2592,2593,2502,2524,235F,2206,2207,2192,2563,
2551,2557,255D,2190,230A,2510,2514,2534,252C,251C,2500,
253C,2191,2193,255A,2554,2569,2566,2560,2550,256C,2261,
2378,2377,2235,2337,2342,233B,22A2,22A3,22C4,2518,250C,
2588,2584,00A6,00CC,2580,237A,00DF,2282,2283,235D,2372,
2374,2371,233D,2296,25CB,2228,2373,2349,220A,2229,233F,
2340,2265,2264,2260,00D7,00F7,2359,2218,2375,236B,234B,
2352,00AF,00A8
By copying these numbers into Word and using Alt+X after each number,
you get the quad-AV from APL2.
Sending code in Unicode could be a way to share APL code:
Take these codes into Word and change them into APL with Alt+X.
Have an editor macro that converts between APL and Unicode.
Perhaps even upgrade old APL into J by translating the APL to Unicode:
have a J verb read the Unicode and produce J. Then, when there are
ambiguities, the program might propose alternatives, similar to what
you do with OCR.
--- Brian
Unicode is not 2 bytes; a character could take (if we are talking UTF-8)
from 1 to 4 bytes (it was originally defined up to 6, before the range
was restricted). The nice thing about UTF-8 is that 'normal' ASCII
characters are still 1 byte, ensuring backward compatibility. That is
its advantage over UTF-16 (which Microsoft uses). Check
e.g. http://en.wikipedia.org/wiki/UTF-8 and
http://en.wikipedia.org/wiki/UTF-16 for details.
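As a minimal illustration (a C++ sketch of my own, nothing vendor-specific),
the byte count per codepoint under the current 4-byte limit works out like
this:

#include <cstdio>

// Number of bytes UTF-8 needs for a given Unicode codepoint
// (RFC 3629 restricts UTF-8 to 4 bytes, i.e. up to U+10FFFF).
int utf8_length(unsigned cp) {
    if (cp < 0x80)    return 1;  // ASCII: unchanged, backward compatible
    if (cp < 0x800)   return 2;  // Latin-1 supplement, Greek, Cyrillic, ...
    if (cp < 0x10000) return 3;  // rest of the BMP, incl. the APL symbols
    return 4;                    // supplementary planes
}

int main() {
    std::printf("U+0041 -> %d byte(s)\n", utf8_length(0x41));   // 'A'
    std::printf("U+00E9 -> %d byte(s)\n", utf8_length(0xE9));   // e-acute
    std::printf("U+2374 -> %d byte(s)\n", utf8_length(0x2374)); // APL rho
    return 0;
}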
Dragan
--
Dragan Cvetkovic,
To be or not to be is true. G. Boole No it isn't. L. E. J. Brouwer
!!! Sender/From address is bogus. Use reply-to one !!!
APL2 (and, I believe, some other interpreters) has supported Unicode for
several years. The way we handle one-byte and four-byte characters is a
lot like how we handle numbers.
Conceptually a number is a number is a number. Whether the number is stored
internally as an 8 byte floating point number, or a four byte integer, or a
single bit, is generally not relevant to the APL application; the
interpreter takes care of any necessary coercions between internal types.
Likewise, a character is a character is a character. Whether the character
is stored internally using one or four bytes should be irrelevant to the APL
application; the interpreter does any necessary coercions between internal
types.
QuadAV is simply a shorthand way to refer to the particular subset of
Unicode characters that are of particular importance to APL programmers.
For efficiency, we store these particular characters in one byte where
possible.
David Liebtag
IBM APL Products and Services
UTF-8 would probably be a pain to work with because of the variable
character lengths. If one has a character vector S and refers to, say,
S[2100], do you have to scan through all of the first 2100 characters
to find it?
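To make the problem concrete, here is a rough C++ sketch of my own (not
from any real interpreter):

#include <string>
#include <cstddef>

// Return the byte offset of the n-th codepoint (0-based) in a UTF-8
// string by scanning from the start -- O(n), not O(1).
std::size_t utf8_offset(const std::string& s, std::size_t n) {
    std::size_t i = 0;
    while (n > 0 && i < s.size()) {
        ++i;   // step past the leading byte of the current character
        // continuation bytes look like 10xxxxxx; skip them too
        while (i < s.size() &&
               (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80)
            ++i;
        --n;
    }
    return i;   // byte index where codepoint n begins
}

With fixed-width 16-bit or 32-bit characters, by contrast, the offset is
a single multiplication.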
--- Brian
to ignore the case of letters or, say, have a-umlaut treated as
equivalent to 'a' and 'A'.
--- Brian
--
James L. Ryan -- Taliesinsoft
italic vs. upright is a matter of presentation, and not Unicode's concern
if the surrounding explanatory text uses a different font from the APL code,
then the APL font can have slanted (oblique, italicised) alphabetic
characters
if the surrounding explanatory text uses the same font as the APL code,
then a simple finite state machine can choose between upright and slanted
alphabetic characters
in latter days, my own documentation would also use upright alphabetics
within literals and comments, and slanted characters within the code --
it isn't difficult to do
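as a rough sketch of that state machine (mine, and simplified -- doubled
quotes inside literals, for one, would need an extra state):

#include <string>
#include <vector>

enum class Style { Upright, Slanted };

// walk one line of APL source and decide, per character, whether an
// alphabetic should be slanted (code) or upright (literal or comment)
// -- U+235D is the APL "lamp" comment symbol
std::vector<Style> style_line(const std::u32string& line) {
    std::vector<Style> out(line.size(), Style::Slanted);
    bool in_literal = false, in_comment = false;
    for (std::size_t i = 0; i < line.size(); ++i) {
        char32_t c = line[i];
        if (!in_comment && c == U'\'') in_literal = !in_literal;
        if (!in_literal && c == U'\x235D') in_comment = true;
        if (in_literal || in_comment) out[i] = Style::Upright;
    }
    return out;
}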
regards . . . /phil
I hope this isn't news to you, but if you want to sort, say, a list of names
into the alphabetic sequence standard for a language or a country, the
Unicode collating sequence is not normally adequate to the task --
dyadic upgrade is still required
dyadic upgrade needs to be enhanced, though, because (a) some languages
contain accented characters not in Unicode, and (b) Unicode's recommendation
is that *all* accented characters should be stored in decomposed form
yes
well, you don't, but the interpreter does
actually, because the interpreter doesn't *have* to scan through *all*
the preceding characters, it's more correct to say that polynomial array
addressing is no longer adequate
UTF-8 is a way to map multibyte character sets onto 8-bit streams --
it is not, strictly speaking, a character encoding -- it was originally
intended for communication channels, I believe, and not internal
representation -- you would not want your document stored internally
in UTF-8 form if you were using Chinese, for instance
I would suggest that the increased storage cost in moving from single byte
ASCII (or []AV) to 2-byte Unicode BMP is insignificant when compared with
(i) the bloat experienced when moving from plain text to HTML or Word, and
(ii) the falling cost of physical memory
Phil,
My original posting, the one you comment on above, was not well stated. What
I intended to say was that in my opinion the entirety of the APL glyphs
should be considered unique and should be assigned their own place in the
unicode table. I agree that the style applied, for example, Italic vs.
non-Italic for the letters, is external to their placement in unicode. As I
tried to emphasize in my original posting, an advantage of having the
entirety of the APL characters in their own plane (is that term valid in
unicode?) is the ease with which a presentation system could differentiate
between APL and non-APL glyphs, regardless of whether or not a particular
APL glyph has a look-alike similarity to a non-APL glyph. Another advantage
is that this could provide the APL community with the ability to have what
I'll dub a "unified APL atomic vector" in addition to their own, if they so
choose, unique atomic vectors. Such a unified atomic vector could easily be
added to current APL implementations and would provide a vehicle for code
interchange and communication.
Jim
--- Brian
Unicode is clearly coming on more strongly now, with XML as a
communication tool.
It would be nice if APL came alive again through Unicode.
I believe I understand the benefits you envision for having APL alphabetic
characters have their own Unicode codepoints. And although they are nice, I
have a different desire:
I would like APL interpreters to allow any Unicode alphabetic or numeric
character in object names. That way, people who use different languages
could use meaningful names written in their native languages and not be
limited to the character set chosen by the English-speaking APL implementers
years ago.
Regarding sorting, as someone pointed out recently, general purpose sorting
of Unicode strings is far more complicated than can be supported by the
grade primitive. In a perfect world, I think the APL implementers should
provide easy access to some of the Unicode string sorting algorithms that
exist outside their interpreters.
These are just my opinions, which do not necessarily agree with my
employer's.
David Liebtag
is that necessary? []AV varies by vendor, but Unicode codepoints do not,
so that a given character is uniquely identified by a 16-bit integer (or a
32-bit integer, or a pair of 16-bit integers) -- this is totally
transportable, so that all that's required, it seems to me, is that vendors
accept integer values on the left of a dyadic upgrade, and interpret these
values as Unicode characters
well, that's not _quite_ everything: implementors would need to take account
of the fact that, for instance, U+0065/U+0301 is the "preferred" spelling
for e-acute, and should be treated as equivalent in every way to U+00E9 --
but as a user, I'm happy to leave that task in their capable hands
That is why I think you would need, say, #UV[00E9] to display that
particular character, or its equivalent in decimal, #UV[233].
I went to look up 00E9 and discovered a new site while writing this:
http://isthisthingon.org/unicode/index.phtml
Latin Small Letter E With Acute
Shift-JIS (Hex): None
Unicode (Hex): 00E9
Unicode (HTML): é
HTML Entity: é
I like these:
Latin Capital Letter Thorn (icelandic)
Shift-JIS (Hex): None
Unicode (Hex): 00DE
Unicode (HTML): Þ
HTML Entity: Þ
Latin Small Letter Thorn (icelandic)
Shift-JIS (Hex): None
Unicode (Hex): 00FE
Unicode (HTML): þ
HTML Entity: þ
Latin Capital Letter Eth (icelandic)
Shift-JIS (Hex): None
Unicode (Hex): 00D0
Unicode (HTML): Ð
HTML Entity: Ð
Latin Small Letter Eth (icelandic)
Shift-JIS (Hex): None
Unicode (Hex): 00F0
Unicode (HTML): ð
HTML Entity: ð
Some of the Icelandic letters overlap APL chars in #AV, and it is
therefore hard to use APL and Icelandic together.
We have 32 chars in our alphabet - and that is just one case.
That makes 64 chars, for those who do not have a calculator handy.
It was especially a problem with the keyboards earlier:
way too expensive to get keyboards with both on them.
That is why I switched to J very early on.
I am glad now that I did, but I still have a dream of being able to use
both.
I am sure both J as well as other APL dialects will support #UV, and
then we can live happily ever after together on the same keyboard and
screen.
The problem isn't that general-purpose sorting of Unicode alphanumeric
strings is technically complex; it is that it is logical nonsense. The
"general-purpose collating sequence for alphabetics", which we call
"alphabetical order", is well-defined for any given alphabet: A to Z,
alpha to omega, alif to ya, whatever. With multiple alphabets, there is
no defined alphabetical order. We can create an appropriate default
collating sequence for any given alphabet; G goes after F and before H,
but does it go before or after gamma, or gimel, or ghayn? There is no
consensus answer to that, and we can do no better than to leave those
choices to the programmer.
Eric Landau, APL Solutions, Inc.
"Sacred cows make the tastiest hamburger." - Abbie Hoffman
had you thought of using a session manager which supports Unicode?
> That is why I think you would need say #UV[00E9] to display that
> particular character or its equivalent in dec #UV[233]
I think a look-up table is a bit too primitive for the things you're going
to want to do -- at a minimum, you'd want to be able to convert a
preformed composite (like e-acute) into its constituent parts and,
conversely, convert base+accent(s) into a composite (if such a composite
exists, of course) -- you'd also want to convert a 32-bit codepoint into
two 16-bit surrogates, and vice versa -- you would perhaps be better
served with system functions or library functions here
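to give the flavour of the first of those (a toy sketch of mine in C++ --
a real implementation would draw on the full Unicode decomposition data,
which runs to thousands of entries):

#include <map>
#include <vector>

// toy canonical-decomposition table: composite -> base + combining mark
static const std::map<char32_t, std::vector<char32_t>> kDecomp = {
    { U'\xE9', { U'e', U'\x301' } },  // e-acute -> e + combining acute
    { U'\xE5', { U'a', U'\x30A' } },  // a-ring  -> a + combining ring
};

// decompose a codepoint if a decomposition exists, else return it as-is;
// composition is the inverse lookup
std::vector<char32_t> decompose(char32_t c) {
    auto it = kDecomp.find(c);
    return it != kDecomp.end() ? it->second : std::vector<char32_t>{ c };
}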
> I went to a site to look up 00E9 and I found a new site I discovered
> writing this
http://isthisthingon.org/unicode/index.phtml
nice site, thanks for the link
I'm surprised you needed to look up U+00E9 -- from U+0000 to U+007F,
Unicode coincides with 7-bit ASCII -- from U+0000 to U+00FF, Unicode
coincides with Latin-1 (which includes, I think, everything you'd need for
modern Scandinavian languages, like Eth, Thorn and O-slash, uc & lc --
enough for Icelandic and Finnish, but not enough for Greenlandic or Saami)
<snip>
> Some of the Icelandic letters overlap APL chars in #AV and it is
> therefore hard to use APL and Icelandic together
more generally, it is hard to use APL with almost any other language than
English, if you confine yourself to an 8-bit system -- that is why a move
to Unicode is pretty much inevitable
>We have 32 chars in our alphabet
are you including accented forms as separate letters?
(the way the Danes see A-ring, for instance)
all the best . . . /phil
aábcðdeéfghiíjklmnoópqrstuúvwxyýzþæö
> aábcðdeéfghiíjklmnoópqrstuúvwxyýzþæö
thanks -- I shall file that . . . /phil
while I agree 100% with your conclusion, it isn't necessary to go to
multiple scripts to demonstrate its necessity
France, Germany and Denmark are adjacent countries, but within their
dictionaries, accented letters are sorted on three different, incompatible,
principles
Spanish is different from all three -- Spain recently adopted a new sort
order, although (I believe) Latin America retains the old one
that's five sort methods already, within Latin-1
--------------------
A, b, c, d, e, f, g;
eftir kemur h, í, k,
l, m, n, ó, einnig p,
ætla ég q þar standi hjá.
R, s, t, u, v eru þar næst,
x, ý, z, þ, æ, ö.
Allt stafrófið er svo læst
í erindi þessi lítil tvö.
------------- rough translation
A, b, c, d, e, f, g;
after that comes h, í, k,
l, m, n, ó, also p,
I gather q stands there by
R, s, t, u, v come next,
x, ý, z, þ, æ, ö.
The whole alphabet is so locked
in these two little verses.
> ------------- rough translation
oh, that's priceless -- and I meant to ask you if you'd given the alphabet
in dictionary sort order in my last msg -- thanks again . . . /phil
my understanding is that Unicode includes every APL character ever used,
plus every APL character ever proposed
my guess is that you want a _subset_ of Unicode which includes every APL
character ever used -- presumably an 8-bit subset? -- this sounds a lot
like a codepage -- I'd often wondered why APL implementors never got
together to define a standard APL codepage, but I can see that there might
be differences between those coming from an EBCDIC background and those
coming from an ASCII background
your suggestion doesn't _require_ a codepage, however -- all that's
required is a font equipped with glyphs and codepoints for all the
characters in the ISO standard -- there are rendering systems which allow
you to map a physical font file into specified areas of a (larger) virtual
font -- this would allow the alphabetic characters in the subset to be
slanted while the APL subset was mapped in, reverting to upright (or
whatever) when the APL subset is removed
is this what you wanted? or have I misunderstood your requirement?
OK, my knowledge of this is superficial.
I have heard all the arguments, and I may have seen all the arguments on
the issue.
Anyway, what I know and have had the interest in remembering is that, as
always, IPSA and APL2 decided to agree to disagree on this issue, as on
pretty much everything else.
SIGAPL/ISOAPL and a uniform direction has for similar reasons always
been a dead duck.
I have never understood the reasons why this has had to be this
way.
I know and have known all the top players, and I even tried at one
gathering at APL<some year long time ago - I guess it was ca 1985> to
get them all together to solve the issue, and over drinks at the banquet
all seemed to agree on the need to work together. Needless to say,
nothing happened.
As far as I remember and know then there are some APL characters in #AV
in IPSA not the same as #AV in APL2 even if it looks the same. That is
they do not use the same UNICODE even if they look exactly the same.
Let's call one symbol kalli; it is in different places in IPSA and
APL2 but for us users kalli looks exactly the same. #AVIPSA[x] looks
the same as #AVAPL2[y] and then in #UV you can not even use the same
code so #AVIPSA[x] is #UV[z] and #AVAPL2[y] is #UV[t] and they all
look like kalli.
As far as I know there are several characters in UNICODE from Japan,
Korea, China etc that in similar ways look exactly alike but do need
different codes. So the problem is not just in APL.
Obviously these disagreements have meant prolonged discussions on
getting UNICODE in general use.
I think the unfortunate difference in ways between IPSA and APL2 has
resulted in infighting between different APL factions that has made
APL a loser.
I am sure that if it were not for these differences APL would be much
more prosperous than it is.
Who bloody cares for these old differences? Are we not mature enough to
try to work together, and not always try to downgrade each other's
dialect?
I have used both dialects and I like both; firstly, I would like them
both to succeed, and I have accepted the fact they will never be the
same.
If we do not hang together we will all be hanged separately.
I don't understand your assertion here -- your "kalli" would have a unique
codepoint, so that any Unicode-compliant output (whether from IPSA or APL2)
would show kalli as #UV[z] -- regardless of how it may be represented
internally
(this assumes, of course, that UV is your shorthand for Unicode from
U+0000 to U+FFFF -- it would be quite ridiculous for implementors
to start re-ordering Unicode)
all I would ask is that:
(i) character arrays are capable of storing all Unicode characters;
(ii) characters can be converted to and from integer Unicode values;
(iii) when called upon to write to an external channel or medium, such as
the screen, printer, disk or modem, an APL implementation outputs
character arrays using the codepoints specified in the APL standard,
and my choice of UTF-8 or UCS-2
(of course, if that happened, the implementors would then be asked to prefix
UCS strings with U+FEFF, but I don't see why that can't be left to the user)
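requirement (iii), with the BOM thrown in, is less work than it may
sound -- a sketch, assuming 16-bit storage and UTF-16LE output (the
function name and details are mine):

#include <cstdio>
#include <vector>

// write a vector of 16-bit Unicode values to a file as UTF-16LE,
// prefixed with the U+FEFF byte-order mark
bool write_ucs2(const char* path, const std::vector<unsigned short>& text) {
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return false;
    std::vector<unsigned short> buf;
    buf.push_back(0xFEFF);                   // byte-order mark
    buf.insert(buf.end(), text.begin(), text.end());
    for (unsigned short u : buf) {
        std::fputc(u & 0xFF, f);             // low byte first (little-endian)
        std::fputc((u >> 8) & 0xFF, f);
    }
    return std::fclose(f) == 0;
}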
once we have strict Unicode-compliant output, []AV becomes an historical
curiosity, retained only for backwards compatibility
> As far as I know there are several characters in UNICODE from Japan,
> Korea, China etc that in similar ways look exactly alike but do need
> different codes. So the problem is not just in APL.
well, this is a touchy issue, because nationalist sensitivities are
involved -- many characters found in Traditional Chinese, Simplified
Chinese and Japanese Kanji were given the same codepoint, even though the
displayed forms of these characters may differ considerably -- a process
known as Han Unification (incidentally, this was insisted upon by the
Chinese government, not the Americans, as many people believe)
there are some characters used in Japanese which many scholars believe to be
of Korean origin -- while the Japanese acknowledge their Chinese origins,
some Japanese are less comfortable acknowledging any Korean borrowings --
I am told the issue generated considerable heat in committee discussions,
but the disputed characters were eventually given distinct codepoints
some Cyrillic characters were added at the behest of the Ukrainians as being
uniquely Ukrainian -- Russians will tell you these characters are simply
stylistic variants of existing characters, and are not necessary to resolve
ambiguities in plain text
Greek and Coptic were unified, but are now being separated (I'm not sure
I understand the logic underlying this change)
I hope I've been sufficiently diplomatic here, but the conclusion is, yes,
there may be some duplication in Unicode
but not in APL -- the APL standard specifies a unique Unicode value for
every APL character ever used, and every APL character ever proposed
> Obviously these disagreements have meant prolonged discussions on
> getting UNICODE in general use.
if that is the case, then maybe the discussions are based on a
misunderstanding somewhere -- personally, I don't see any difficulty
> I think the unfortunate difference in ways for IPSA and APL2 have
> resulted in infighting between different APL factions that have made
> APL a loser.
>
> I am sure that if it were not for these differences APL would be much
> more prosperous than what it is.
that may be true -- maybe you should extend your comments to all
vendors -- and maybe mention nested arrays (?)
> If we do not hang together we will all be hanged separately.
Good God! that's a bit extreme, isn't it? I didn't realise failure to
standardise properly was a capital offence
It seems to me that if a #UV variable could be implemented containing
the whole Unicode character set, or the subset supported by an
interpreter, it might be best to simply define #AV that way rather than
introducing a new variable for the same purpose. If that was not the
case (e.g. the character set would not fit in a workspace), it might be
better to use a translation function such as #UCS instead.
I can see implementing both 1-byte and 2-byte or 4-byte characters
and translating upward (widening) as needed, just as integer arrays
are automatically promoted to floating point as required. But one
difference is that there are standard operations, such as floor or
ceiling, that translate back from floating point to integer, whereas I
know of no standard APL operations that would narrow characters
from 4 bytes back to 1 byte each. So once a character array was
widened it would stay wide, which might pose space problems.
I suppose one might use #DR or some such to perform narrowing
explicitly, but that would not be very portable.
Anyway, this is an interesting problem.
--- Brian
APL code is input to the interpreter as text -- if that text were a stream
of Unicode values, then portability would be improved a little, as would
communication (between systems and between programmers)
the benefits of being able to output Unicode strings are rather too large to
be enumerated here
I have no idea what you mean by "would the basic code plane be enough" --
each of the codepoints defined within the APL standard is a 16-bit value,
lying within BMP, the Basic Multilingual Plane, which extends from U+0000
to U+FFFF
is that enough? for whom? if somebody is doing text processing in Old
Italic, then maybe it isn't, because Old Italic lies on Plane 1, and
therefore requires a 32-bit address
on the other hand, maybe it is -- 32-bit values can be represented as
two 16-bit values -- Unicode calls these "surrogates" -- so if the
interpreter can store 16-bit characters, that's all we need
> It seems to me that if a #UV variable could be implemented containing
> the whole unicode character set or subset supported by an interpreter,
> it might be best to simply define #AV that way rather than introducing
> a new variable for the same purpose. If that was not the case (e.g.
> the character set would not fit in a workspace), it might be better to
> use a translation function such as #UCS instead.
this bit has me completely mystified -- what on earth would you expect to
find stored in this #UV ?? and what purpose would #AV serve?
> I can see implementing both 1-byte and 2-byte or 4-byte characters
> and translating upward (widening) as needed, just as integer arrays
> are automatically promoted to floating point as required. But one
> difference is that there are standard operations, such as floor or
> ceiling, that translate back from floating point to integer, whereas I
> know of no standard APL operations that would narrow characters
> from 4 bytes back to 1 byte each. So once a character array was
> widened it would stay wide, which might pose space problems.
> I suppose one might use #DR or some such to perform narrowing
> explicitly, but that would not be very portable.
would it perhaps be simpler to store all characters as 16-bit integers,
using surrogates where a 32-bit codepoint needs to be represented? no
problems then with promotion and demotion -- wouldn't that be easier?
I'm sorry, but my replies look rather bad-tempered -- they're not meant
to, but I honestly cannot see where the communication gap is -- would it
help to imagine an APL without any character datatype at all? you wouldn't
lose any processing power whatever: just process (avoiding arithmetic, of
course) arrays of 32-bit (ergo no problems with surrogates) integers
when you need to output to an external device, you call a system routine to
convert these integers into a strictly Unicode-compliant stream (a much
simpler process to implement, incidentally, than the conversion and
formatting necessary for genuine "numeric" integers), which our system
routine then passes on to the aforementioned external device
if that works OK as a conceptual model, then we can refine it later
#AV is an implementation specific ordered character vector containing the
universe of all characters recognized when programming in a given APL
dialect.
#UV is a vector of the same length as #AV containing indices into Unicode,
each element containing the appropriate Unicode index for the corresponding
element of #AV.
This assumes that each and every glyph used in every APL would have a home in
Unicode. #UV would then provide a means of porting code from implementation
to implementation -- ignoring the differences that might exist in
interpretation and/or recognition of that code.
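To sketch the porting step in C++ (the function is hypothetical, and the
256-entry table would be filled in from each implementation's own #AV
layout):

#include <string>
#include <vector>

// given this implementation's 256-entry #UV table (the Unicode codepoint
// for each #AV position), convert legacy 8-bit text to Unicode values
std::vector<unsigned short>
av_to_unicode(const std::string& legacy, const unsigned short uv[256]) {
    std::vector<unsigned short> out;
    out.reserve(legacy.size());
    for (unsigned char b : legacy)
        out.push_back(uv[b]);    // one table lookup per character
    return out;
}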
This all brings to mind the inclusion of a "rosetta stone" in the workspace
exchange stuff of the late seventies.
that makes a lot of sense -- #UV has 256 integer elements, so there is no
problem with storage space -- and, unless some genius has extended their
character set with a non-standard "semi-colon slash in a circle" since the
standard was approved, there is no problem allocating the "correct" (i.e,
standard) codepoint to each character
this wouldn't take long to implement, and would (surely?) cause no
incompatibilities with existing systems
I note that you carefully specify "all characters recognized when
programming" -- this leaves open the wider issues of "all characters that
might appear in a literal" and David Liebtag's "all the characters that
might appear in a name" -- it also avoids questions like, "what is the
correct codepoint for FMK?"
good move -- go for the easy bits first
I don't know that people migrate between APLs that much, but they could
perhaps exchange utilities, by transmitting the Unicode string representing
the function definition (and, ideally, starting that string with U+FEFF, to
indicate byte-order, and ending with U+FFFF, to indicate end of file)
your suggestion looks like a simple, non-contentious, fully compatible first
step -- excellent stuff
You've already effectively got this, certainly in APLX and APL2 at
least. What is being suggested for #UV is equivalent to: #UCS #AV
I think that all the major vendors are already committed to
inter-operable Unicode data exchange. The only issue I can see is that
it is unfortunately not quite true that every APL symbol has an
unambiguous Unicode equivalent (see Adrian Smith's article in Vector
19.3, January 2003). For this reason, and also to maximise the
probability of being able to process text from other applications, in
APLX we accept as input from Unicode some alternative symbols. For
example, both 002a (ASCII Asterisk) and 22c6 (Star in 'Mathematical
operators') map to the APLX Star symbol.
Richard
#include <deque>
using std::deque;

// a rank-n character array: a shape vector plus the ravelled values
class CharArray {
    deque<long> shape;    // length of each dimension
    deque<char> values;   // the array's elements, in ravel order
};
In traditional APL interpreters we generally have 256
character values, all of which are included in #AV.
Now suppose that we want to support Unicode in
an APL interpreter. To support the full Unicode
character set would require using four bytes per
character, multiplying the size of character
arrays by four. This could cause "workspace full"
problems when large character arrays were used.
If we still want to have #AV include all of the
characters supported by the interpreter, then
#AV might become too large to fit in a
workspace, limiting its uses. (Could it still be
used as the left argument of grade up?) But
restricting #AV to a subset of the supported
characters would break code that assumes
otherwise, e.g. workspace transliteration
programs.
If, on the other hand, the use of characters
outside of the basic code plane is very rare,
then we don't need to support these characters
and can get away with two-byte characters,
which reduces the problem. A 64K-sized
#AV should be manageable.
If the character array is implemented as I
describe above, all of the characters take
up the same amount of space, and the
fundamental character type is something
like char or wchar_t. It would also be possible
to use a UTF-8 string instead of, say,
deque<wchar_t>, to hold the values of the
characters in the array, but this does not
seem feasible to me as I see no efficient
way to index a large UTF-8 string and array
indexing is a common operation in APL.
It has been suggested that it would be
advantageous to support 1-byte characters as
well as longer Unicode characters. That way,
an array that only used characters in the
basic 256-character set could be stored
more efficiently. If, say, a Unicode character
was concatenated to this array, the array
would automatically be promoted to the
larger Unicode array type before the
concatenation was performed, just as an
integer array would be promoted to floating
point if you appended a floating point value.
The difference is that APL programs have a
standard way of converting from floating point
back to integer form: just apply ceiling or floor.
There is no corresponding transformation for
characters. So if a character array was
promoted to the 4-byte character form and
then at some later point all of the characters
that required a 4-byte representation were
removed, the string would remain in 4-byte form
even though it could be stored in the more
compact form. There would be no standard way
for an APL program to convert it to the more
efficient 1-byte form. The best one could do
would be to use a nonstandard feature such as
#DR to perform this conversion.
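To make the idea concrete, the squeeze an interpreter could apply
silently might look like this (my own C++ sketch; the names are
invented):

#include <string>
#include <vector>

// demote a wide character array back to 1-byte storage if every value
// happens to fit -- the kind of internal squeeze no standard APL
// primitive exposes
bool try_narrow(const std::vector<char32_t>& wide, std::string& narrow) {
    for (char32_t c : wide)
        if (c > 0xFF) return false;   // a genuinely wide character remains
    narrow.clear();
    narrow.reserve(wide.size());
    for (char32_t c : wide)
        narrow.push_back(static_cast<char>(c));
    return true;
}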
Another reason to support 1-byte characters as
well as 4-byte ones would be to make it easier
to transfer character data to and from external
routines via #NA.
A place where UTF-8 might be useful would be
when writing a workspace out to disk, as that
would save disk space.
I hope this clarifies what I was wondering about.
--- Brian
Adrian Smith's article is seriously misleading
published in 2003, it gives the impression that the Unicode values
appropriate to the APL character set were (at that time) still undecided
that is not the case now, and was not the case in 2003 -- the final
standards meeting on this topic took place just before the Berlin APL
conference in 2000 -- the results of that meeting were presented to ISO
shortly thereafter, and accepted as the ISO standard later in that year
(more than 2 years before Adrian Smith's article)
I have seen nothing here, or in Vector, that acknowledges that fact, but
part of the story can be found at
http://www.math.uwaterloo.ca/~ljdickey/apl-rep/
so, in case I failed to make myself clear earlier, the APL standard
specifies unique Unicode values for every APL character in use, and
(so far as is known) every APL character ever proposed
I would like to draw attention to the following key words in the above:
"APL"
"standard"
"specifies"
"unique"
and "Unicode"
there is no requirement that input be strict -- you can accept stars and
asterisks of all sorts, sizes, shapes and hues and interpret them all as
"asterisk", if you so wish
tolerant input and strict output makes a lot of sense, but there is no
*requirement* that output be strict either -- most ISO standards are not
enforceable in any legal sense -- but if an implementation generates
non-strict output, they may have some unhappy users -- that's all
thanks for that, Brian -- the missing piece of the jigsaw might be this
stuff on "surrogates", which I've reformatted from Unicode.org's website
Q: What are surrogates?
A: Surrogates are code points from two special ranges of Unicode values,
reserved for use as the leading, and trailing values of paired code units in
UTF-16. Leading, also called high, surrogates are from #D800 to #DBFF,
and trailing, or low, surrogates are from #DC00 to #DFFF. They are called
surrogates, since they do not represent characters directly, but only as a
pair </plagiarism>
the point is that surrogates provide a mechanism for converting 32-bit
indices into two 16-bit indices, and vice versa -- this means that a
stream of 16-bit integers can represent all characters in the BMP, and all
the characters in the Supplementary Planes -- the cost is that we don't
get full 4 Gig addressing, but what the hell! we do get lots of space
that being so, an implementation only needs to allow 2 bytes per
character -- actually, make that "2 bytes per Unicode value" -- in a
character vector
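the arithmetic is simple enough to show (a sketch of mine; the official
formulae are in the Unicode standard):

#include <cassert>

// split a supplementary-plane codepoint (U+10000..U+10FFFF) into a
// UTF-16 surrogate pair, and reassemble it
void to_surrogates(unsigned cp, unsigned short& hi, unsigned short& lo) {
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    cp -= 0x10000;                    // 20 bits remain
    hi = 0xD800 + (cp >> 10);         // leading (high) surrogate
    lo = 0xDC00 + (cp & 0x3FF);       // trailing (low) surrogate
}

unsigned from_surrogates(unsigned short hi, unsigned short lo) {
    return 0x10000 + ((unsigned)(hi - 0xD800) << 10) + (lo - 0xDC00);
}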
and at this point, you no longer need []AV -- while the interpreter may
well convert the 16-bit characters of your code into 8-bit internal values,
the actual 8-bit internal value used for a given primitive is of no
particular interest to the user
I would agree with you on UTF-8 -- great for transmission, crap for
internal storage
sorting is a larger problem -- 16-bit integer left arguments to upgrade is
all we need syntactically, but there's more to it than that, which will have
to wait for another time
till then, all the best . . . /phil
> so, in case I failed to make myself clear earlier, the APL standard
> specifies unique Unicode values for every APL character in use, and
> (so far as is known) every APL character ever proposed
Ever proposed?
I'm curious as to whether the Burroughs APL\700 file system characters (some
of which are shown below) made it....
quad-left-arrow -- read
quad-right-arrow -- write
quad-up-arrow -- take
quad-down-arrow -- drop
quad-up-caret -- hold
quad-down-caret -- free
quad-circle -- rotate
quad-forward-slash -- compress
quad-backward-slash -- expand
The APL\700 file system (remember, we're talking early seventies now!) was
similar to the STSC/Sharp files in that a file was treated as a vector of
components, where a component was an APL scalar or array. Unlike the
STSC/Sharp files individual components were not assigned a fixed number, but
were accessed by position as are elements of a vector.
U+2347 quad-left-arrow -- read
U+2348 quad-right-arrow -- write
U+2350 quad-up-arrow -- take
U+2357 quad-down-arrow -- drop
U+2353 quad-up-caret -- hold
U+234c quad-down-caret -- free
U+233c quad-circle -- rotate
U+2341 quad-forward-slash -- compress
U+2342 quad-backward-slash -- expand
. . . and 10 others you didn't mention
>
> there is no requirement that input be strict -- you can accept stars and
> asterisks of all sorts, sizes, shapes and hues and interpret them all as
> "asterisk", if you so wish
Yes, but we still need to choose a single Unicode value on output.
There is only one asterisk in #AV (and it would be extremely confusing
if there were more). Do you want us to translate this to 22C6 (which
the Unicode standard defines as 'Star Operator APL')? If so,
exchanging Unicode text with the vast majority of non-APL applications
won't work properly; from the user's point of view, an asterisk is an
asterisk is an asterisk; there is a perfectly good one in the standard
ASCII character set, which in Unicode is at 002A, and which is
recognized by all applications. If you write some Unicode text to the
clipboard from APL, for example a formula which you are going to paste
into Excel, you certainly want 002A, not 22C6.
>
> tolerant input and strict output makes a lot of sense, but there is no
> *requirement* that output be strict either -- most ISO standards are not
> enforceable in any legal sense -- but if an implementation generates
> non-strict output, they may have some unhappy users -- that's all
>
What I am suggesting is that it would be preferable for everyone to
agree on a single output mapping - and this should most definitely
include placing all the ASCII-compatible APL characters (which include
minus, asterisk, and tilde) in the range 0000 to 007F. We would
certainly get unhappy users if these common characters were mapped to
unusual APL-specific or mathematical Unicode characters, which don't
appear in all fonts and are not recognized by most applications. We
would also get unhappy users if ordinary Unicode text (containing only
common, non-APL-specific characters apparently in #AV), which was
imported into APL and then exported back to Unicode, came out different
to the original.
The crucial point is that 'strict' output is much less important than
maximizing inter-operability between APL and non-APL applications.
If anyone is interested in the exact details, the mappings we currently
use in APLX are documented in an Appendix in the 'APLX Language
Reference Manual', available from
http://www.microapl.co.uk/apl/aplx_docs.html. I think I am right in
saying that this is compatible with the mappings used by IBM in APL2,
for those characters which are common to both interpreters.
Richard
agreed
> There is only one asterisk in #AV (and it would be extremely confusing
> if there were more). Do you want us to translate this to 22C6 (which
> the Unicode standard defines as 'Star Operator APL')? If so,
> exchanging Unicode text with the vast majority of non-APL applications
> won't work properly; from the user's point of view, an asterisk is an
> asterisk is an asterisk; there is a perfectly good one in the standard
> ASCII character set, which in Unicode is at 002A, and which is
> recognized by all applications. If you write some Unicode text to the
> clipboard from APL, for example a formula which you are going to paste
> into Excel, you certainly want 002A, not 22C6.
"an asterisk is an asterisk is an asterisk" -- that's kind of imprecise,
isn't it? do you mean a 5-point asterisk or a 6-point asterisk? is the
character written superscripted, subscripted or centred? (or maybe
written below a base character?)
if you are writing out text which you want another interpreter to recognise
as APL code, then output U+22c6, the APL star operator
if you are writing out text that is to be interpreted by Excel, then output
U+002a, the "asterisk" character (a rather imprecise description, I agree,
but that's because Unicode imported so much unchanged stuff from existing
standards)
if you can produce Unicode output, this is no big deal, surely? the
programmer already has to remember to use different symbols anyway
for Excel and APL
> What I am suggesting is that it would be preferable for everyone to
> agree on a single output mapping - and this should most definitely
> include placing all the ASCII-compatible APL characters (which include
> minus, asterisk, and tilde) in the range 0000 to 007F.
if you have a point of view on this, it's a pity you didn't speak up 5 years
ago
> We would
> certainly get unhappy users if these common characters were mapped to
> unusual APL-specific or mathematical Unicode characters, which don't
> appear in all fonts and are not recognized by most applications.
name a single character that appears in all fonts -- you are always going
to have difficulties in this area
and I'm not sure what you mean by "recognised" -- if the application can
read Unicoded plain text, then it can "recognise" U+22c6 just as easily as
it can recognise U+002a -- no problem
OTOH, if you mean that a compiler or interpreter needs to "recognise" this
output as acceptable code, then, fine, write the output using those
characters the compiler or interpreter will understand -- what's the
problem?
> We
> would also get unhappy users if ordinary Unicode text (containing only
> common, non-APL-specific characters apparently in #AV), which was
> imported into APL and then exported back to Unicode, came out different
> to the original.
you don't export anything to Unicode -- it's just a numbering convention
reading Unicoded text from disk, and writing it out again as Unicoded text,
should leave the text unchanged -- this is about *the* most fundamental
requirement for Unicode compatibility -- so, yes, you would have some
unhappy users if that text got changed in any way -- but then, nobody
has suggested changing it, have they?
this requirement applies to all transput, and may therefore be deemed to
include all literals
and it still holds true if the text is (supposed to be) an APL function
definition
however, reading in a function as Unicoded text, tokenising it, and then
converting the tokenised form to Unicoded text output, will, quite possibly,
produce output which differs from the input -- but so what? it already
does -- spacing is not always preserved, constants like 2.000 get changed
to 2.0, etc -- but that's not a problem, presumably?
> The crucial point is that 'strict' output is much less important than
> maximizing inter-operability between APL and non-APL applications.
"strict output" refers only to text output which is going to be interpreted
as APL code -- anything else is plain text (even if, as is the case with
Excel macros, that plain text will later be interpreted as code) -- so,
if I read your statement aright, you're saying that communication with other
APL systems is less important than communicating with non-APL systems --
fine -- that's a statement of your priorities
so, if you can produce Unicode output, then the codepoints to be used for
APL characters are not really of any great interest to you -- right?
> If anyone is interested in the exact details, the mappings we currently
> use in APLX are documented in an Appendix in the 'APLX Language
> Reference Manual', available from
> http://www.microapl.co.uk/apl/aplx_docs.html. I think I am right in
> saying that this is compatible with the mappings used by IBM in APL2,
> for those characters which are common to both interpreters.
oh, OK -- it seems they are of interest, albeit of low priority
well, like I said, the ISO standard is not enforceable in any real sense, so
if vendors can agree among themselves on a convention they are happy with,
then they are free to use that instead of the standard
will you attempt to change the ISO standard, or just maintain an informal
agreement amongst yourselves?
>
> if you are writing out text which you want another interpreter to recognise
> as APL code, then output U+22c6, the APL star operator
>
> if you are writing out text that is to be interpreted by Excel, then output
> U+002a, the "asterisk" character (a rather imprecise description, I agree,
> but that's because Unicode imported so much unchanged stuff from existing
> standards)
>
The APL interpreter cannot know what the user wants to do next with the
text. It just knows that he or she wants to copy it to the clipboard,
or write it to a file, or find its Unicode index using #UCS. So, as I
said, we need to choose a unique value. The alternative you seem to be
suggesting is that we should introduce some kind of 'code page',
whereby the mapping varies according to some user-defined context. I
thought the whole idea was to get away from that kind of stuff.
Of course, the more technically-experienced user can always write out
Unicode indexes directly in order to specify a particular Unicode
character - what I was talking about was the default mapping which
should be used by the various APL interpreters when they export text
(including 8-bit character strings from existing workspaces and
component files) in Unicode encoding.
>
> > We would
> > certainly get unhappy users if these common characters were mapped to
> > unusual APL-specific or mathematical Unicode characters, which don't
> > appear in all fonts and are not recognized by most applications.
>
> name a single character that appears in all fonts -- you are always going
> to have difficulties in this area
Well, for historical reasons the ASCII and indeed Latin-1 character
sets are a pretty good start. What is more, I believe that all APL
vendors include the full ASCII character set in #AV, so it would be
perverse not to map those to their corresponding Unicode positions. I
think users will understand that other characters are more likely to be
font-specific.
>
> > The crucial point is that 'strict' output is much less important than
> > maximizing inter-operability between APL and non-APL applications.
>
> "strict output" refers only to text output which is going to be interpreted
> as APL code -- anything else is plain text (even if, as is the case with
> Excel macros, that plain text will later be interpreted as code) -- so,
> if I read your statement aright, you're saying that communication with other
> APL systems is less important than communicating with non-APL systems --
> fine -- that's a statement of your priorities
No it's not; it's a statement of what I believe to be the priorities of
people who use APL interpreters. Do you disagree with this statement?
But in any case both requirements can easily be met, if APL vendors
agree, which I think they do.
>
> well, like I said, the ISO standard is not enforceable in any real sense, so
> if vendors can agree among themselves on a convention they are happy with,
> then they are free to use that instead of the standard
>
> will you attempt to change the ISO standard, or just maintain an informal
> agreement amongst yourselves?
I think the sensible approach is for the vendors to agree amongst
themselves on the practical interpretation of the ISO standard.
What does everyone else think?
Richard
I really cannot make any sense of this paragraph, unless you are describing
a system trying to struggle on with 8-bit characters
I am talking about an interpreter capable of storing strings of Unicode
values, and I make the simplifying assumption that the character U+22c6 will
be stored as the hexadecimal value 22c6 -- now whether that value is
stored in a 16-bit space, one half of a 32-bit space, or a variable number
of UTF-8-style bytes, I don't give a hoot, because it isn't material to
this discussion -- it might be a good idea to make the further simplifying
assumption that we can represent the BMP only
if you will permit, I'd like to rephrase Unicode's requirement about
transmitting strings unchanged as "don't mess with the user's literals,
unless told to" -- "being told to" encompasses structural operations like
catenation, rotation, indexing and selective assignment
and it should now be apparent that the APL interpreter doesn't need to know
a thing about what the user wants to do next -- when told to output
U+22c6, the interpreter sends (some representation of) the hexadecimal
value 22c6 to the chosen output device -- ditto, ditto for U+002a
> - what I was talking about was the default mapping which
> should be used by the various APL interpreters when they export text
> (including 8-bit character strings from existing workspaces and
> component files) in Unicode encoding.
there will be some problems with legacy code, etc, but I would have hoped an
interpreter would recognise an old workspace as one of its own, and know
what Unicode values correspond to the old []AV -- ditto, ditto component
files
> Well, for historical reasons the ASCII and indeed Latin-1 character
> sets are a pretty good start. What is more, I believe that all APL
> vendors include the full ASCII character set in #AV, so it would be
> perverse not to map those to their corresponding Unicode positions.
ASCII and Latin-1 map onto the range U+0000 to U+00ff -- you must
know that already?!? -- what is this about perversity?
> I think users will understand that other characters are more likely to be
> font-specific.
fonts? what has any of this to do with fonts?
other characters will be represented by the value of their Unicode indices
> > > The crucial point is that 'strict' output is much less important than
> > > maximizing inter-operability between APL and non-APL applications.
> >
> > "strict output" refers only to text output which is going to be interpreted
> > as APL code -- anything else is plain text (even if, as is the case with
> > Excel macros, that plain text will later be interpreted as code) -- so,
> > if I read your statement aright, you're saying that communication with other
> > APL systems is less important than communicating with non-APL systems --
> > fine -- that's a statement of your priorities
>
> No it's not; it's a statement of what I believe to be the priorities of
> people who use APL interpreters. Do you disagree with this statement?
well, yes -- I'm sorry, I should have pointed it out earlier, but there is
another point where strict output becomes important, and that is when the
user wants to see his code -- the tokenised form must then be converted to
text and sent to an appropriate display device
if the user chooses to capture that display in a character array, as in
A <- []CR 'FOO'
then strict output should be used here as well -- A is now plain text, and
if the user chooses to substitute U+002a (or anything else) for U+22c6,
then that is the user's prerogative, and the interpreter should do what it's
told: no more, no less
> I think the sensible approach is for the vendors to agree amongst
> themselves on the practical interpretation of the ISO standard.
>
> What does everyone else think?
since the standard defines a unique codepoint for each APL character, I'm
not sure that there _is_ much room for "interpretation" -- though you are,
of course, free to take it or leave it, in whole or in part
so much of this discussion seems to be at cross-purposes, that I seriously
wonder if we're talking about the same thing -- as I said, I am talking
about an interpreter capable of storing strings of Unicode values (ideally,
all Unicode values), and processing aforesaid strings
starting from that assumption, there are some difficult questions, like the
semantics of string search and sort operations, but that deserves a separate
thread
Precisely, that's exactly what we're talking about, as was clear from
the context and the previous exchanges about '#UV'. However, I
wouldn't use the phrase 'trying to struggle on'. In the real world we
have to recognize that virtually all existing APL code exists either in
8-bit character form (for example as character vectors in existing
workspaces and files), or in tokenized form produced by existing APL
interpreters which have a 256-element #AV. These two forms are of course
mixed up together, for example in APL functions which rely on executed
strings. In moving to Unicode, we need to address this issue.
You said earlier in this thread:
"that makes a lot of sense -- #UV has 256 integer elements, so there
is no
problem with storage space -- and, unless some genius has extended
their
character set with a non-standard "semi-colon slash in a circle" since
the
standard was approved, there is no problem allocating the "correct"
(i.e,
standard) codepoint to each character".
What I am pointing out is that there is indeed a problem in allocating
the "correct" (i.e, standard) codepoint for each character, because
unfortunately the standard allocated APL-specific encodings for at
least three characters which in most or all existing APL systems are
regarded as the same as ordinary ASCII characters. Furthermore, many
existing APL applications rely upon this.
This problem could be avoided if, either by a formal change to the
standard, or by an informal agreement amongst all concerned, we map
those characters to the ordinary ASCII-compatible positions in Unicode.
>
> so much of this discussion seems to be at cross-purposes, that I seriously
> wonder if we're talking about the same thing -- as I said, I am talking
> about an interpreter capable of storing strings of Unicode values (ideally,
> all Unicode values), and processing aforesaid strings
>
Unfortunately, even in an interpreter which stored all strings in 16-
or 32-bit Unicode encoding, and had no need whatsoever to be compatible
with previous APL workspaces, I think the problem would still show up.
For example, consider an APL system running on a PC with a standard US
keyboard. In ordinary non-APL applications, Shift-8 gives the ASCII
asterisk, Unicode 002a. In APL, with the 'Unified' keyboard, what
should Shift-8 produce? Presumably it should still produce Unicode
002a, because otherwise the keyboard would no longer be 'Unified'. But
if this is not the same as the APL 'Star' symbol, then we have to add a
new and separate key combination for the APL 'Star'. Would users
really want that? What is the advantage? And how would we explain to
new APL users - who are often already put off by the APL-specific
symbols - that the two very similar-looking characters are not
equivalent? And that the one that is easier to type and which is used
in all other applications is not the one which is used in APL
expressions? And that they have to remember to use the non-APL one if
they want to export text to an Excel formula, a Unix shell script, or
indeed any other application apart from APL?
Richard
When I was a newbie (in the 70s), I considered the special
characters to be one of the beautiful things about APL.
And when I first saw them, it was the obvious thing that
shouted out that something was going to be very different
from FORTRAN. I was attracted by the special characters.
Do new users today actually dislike the APL characters?
Is this different than most people, decades ago?
Have aesthetics or expectations changed?
Or do they merely dislike not having an APL keyboard?
How the hell do new users learn and remember where
the characters are located on today's keyboards?
Back in the 70s when I did lots of APL, the latest thing was
dot-matrix thermal printers and CRTs, which was nice because
you didn't have to change the typeball (or spinwheel).
The thing that I hated was the messed-up layout on one of
the spinwheel terminals (can't remember which brand it was
that sucked). Today there's no technical reason why we
can't have all the APL symbols we want; the only problem
is remembering where on the keyboard to press for them.
The Lisp computers that I used back in the 80s had extra
shift keys labeled "Top" and "Front", and "Mode Lock",
and had special glyphs printed on the front and top of
the keycaps. Today, we all have enough keys on standard
keyboards that we could use some of them for APL-mode shifting.
And we have editors that could be aware of what you are
typing (for example, automatically shifting your input
mode to ASCII when you begin typing a string literal).
Our editors also let us horizontally move the input cursor.
We could even have a separate key for doing overstrike.
Seems like maybe all we really need are better keycaps.
I also have some dim recollection of press-apply AP
stickers that you could put onto regular keyboards.
If I wanted to write "\i" I'd program in some new
(APL or not) language du jour, and I would hate it.
Way too hard to read. Yuck.
If I wanted to have to spell out "IOTA", I'd program in Lisp.
(Which is, in fact, what I do.) Better than APL in many ways,
and has nice syntax, but not quite the same feel - can't
lexically pack the operators together as tightly.
Only with the special characters can you compose non-precedential
operators in a way that's fast to read: easy to scan with the
eyes (each operator is instantly distinguished) and not too
verbose (brain doesn't need to read any words or punctuation).
Abstract and concise.
> How the hell do new users learn and remember where
> the characters are located on today's keyboards?
>
Well, clearly what we need is a keyboard with little LCDs
embedded in the keys so that the glyphs are reconfigurable :-)
On my iPaq I have a soft keyboard on the touchscreen
With Unicode I am not sure how you treat all these chars.
Charmap allows you to see several of them at a time, and at least find
the code for each.
Windows solves it nicely by using Alt+X after the char, or after the
number, or after U+xxxx.
A soft keyboard seems like a good idea.
Well, maybe, but I was actually hoping someone was going
to clue me in about where to obtain the sticky labels,
or something, without having to have custom keyboards
on all my various kinds of computers.
> Well, maybe, but I was actually hoping someone was going
> to clue me in about where to obtain the sticky labels,
> or something, without having to have custom keyboards
> on all my various kinds of computers.
And then there are those of us that just learn to touch-type APL and don't
need a visual clue such as keytop labelling. When I took typing in high
school the typewriters deliberately had blank keytops so that we were forced
to remember which characters went with which keys.
as if there weren't enough real issues to worry about, two delegates
insisted that the star in APL was not an asterisk -- it had
"traditionally" been a five-pointed star, while as asterisk was nearly
always six-pointed -- it seems the rest of the committee conceded the
point, just so they could return to more important business
(the same delegates opposed the inclusion of the dollar sign in the APL
character set, on the ground that it was a national, not an international,
symbol -- this might help you identify the miscreants, because I'm not
going to name them)
so, here we are 20 years later, the APL star is still a distinct symbol, and
now it has its own codepoint -- and we must learn to live with it
well, enough of that digression: I'm not quite sure where you stand,
exactly, so please correct me where I've misunderstood you
1) the need to be able to store Unicoded literals is accepted -- right?
2) the use of non-Latin characters in names can be set aside -- OK?
3) you have concerns on i/o
4) we haven't even started on problems of string matching, like:
what is the result when we compare U+00e9
        to the 2-vector U+0065, U+0301 ? (see the sketch just after this list)
5) sorting is a major problem, and it may be that neither of us
will live to see it solved
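as a concrete illustration of item (4) -- a minimal sketch in Python (a
language chosen purely for illustration): codepoint-by-codepoint comparison
says the two spellings differ, and only a normalisation pass makes them
compare equal

    import unicodedata
    composed   = '\u00e9'      # LATIN SMALL LETTER E WITH ACUTE
    decomposed = 'e\u0301'     # 'e' followed by COMBINING ACUTE ACCENT
    composed == decomposed                                 # False -- different codepoints
    unicodedata.normalize('NFC', decomposed) == composed   # True -- same canonical form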
item (3) can be split into its i-component and its o-component, and
the o-component can be further split into code and data, giving us
3i) input
3d) output of data
3c) output of code
starting with the easy one, item (3d): the output of data (literals and
formatted numbers - i.e., display code only, no binaries) should use
whatever codepoints the user has specified (in addition to character
constants (which we leave untouched), this may also include the ability
to specify the use of mid-dot as the decimal point, &c)
so we know what we have to output, in terms of character "values", and
presumably the user can specify the encoding (UCS, UTF, whatever)
I don't see any point of contention here -- have I missed anything?
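to make the split between character "values" and encodings concrete, a
small Python sketch (illustrative only) -- one character value, two
serialisations:

    iota = '\u2373'             # APL FUNCTIONAL SYMBOL IOTA
    iota.encode('utf-8')        # b'\xe2\x8d\xb3' -- three bytes
    iota.encode('utf-16-le')    # bytes 0x73 0x23 -- two bytes, same codepoint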
item (3c) covers the case where tokenised code needs converting to
character form, for display purposes -- I take it the need for
standardisation is accepted?
'plus' and 'plonk' and lots of other stuff are displayed using characters
from the 7-bit ASCII range, 'multiply' and 'divide' use characters from
Latin-1, 'notequals" and the weak inequalities come from Mathematical
Operators (U+2200 to U+22ff), while 'execute' and 'format' use
characters from Misc Tech (U+2300 to U+23ff)
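for anyone who wants to verify those block assignments, a quick probe in
Python (the names come from the standard library's unicodedata module):

    import unicodedata
    for ch in '\u00d7\u2260\u234e\u2355':
        print('U+%04X  %s' % (ord(ch), unicodedata.name(ch)))
    # U+00D7  MULTIPLICATION SIGN                  (Latin-1)
    # U+2260  NOT EQUAL TO                         (Mathematical Operators)
    # U+234E  APL FUNCTIONAL SYMBOL DOWN TACK JOT  (Misc Tech -- execute)
    # U+2355  APL FUNCTIONAL SYMBOL UP TACK JOT    (Misc Tech -- format)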
I take it there's no objection, in principle, to using characters from
Mathematical Operators and/or Misc Tech? because there seems to be
some sort of objection on your part to using U+22c6 to represent the
exponentiation operator (the "star") -- is it that we should use ASCII
asterisk because it's a more common character? or are unsolved problems
on the input side clouding the issue on the output side? or maybe none of
the above?
frankly, I couldn't give a monkey's whether APL's "star" is five-pointed or
six-pointed, but I wouldn't want to use the asterisk, because the asterisk
is usually a raised character, and the code looks a lot tidier if the
symbols representing primitive functions have a common centre line
(make that "primitive functions not resting on the base line") -- there
is an entirely acceptable symbol at U+2217, and if you felt strongly
enough, maybe you could campaign for a change in the standard . . .
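for reference, the three star-like candidates, checked the same way in
Python:

    import unicodedata
    [unicodedata.name(c) for c in '*\u2217\u22c6']
    # ['ASTERISK', 'ASTERISK OPERATOR', 'STAR OPERATOR']
    # U+002A is raised; U+2217 and U+22C6 sit on the centre line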
finally, the tricky, but separate, problem of input -- item (3i)
first, keyboard input: the interpreter's input routine presumably has some
way of knowing whether the user is inputting code, a character constant or
a comment -- so, when the user hits the "asterisk" key, if it's in a
character string or a comment, then your input routine will pass an asterisk
to the tokeniser -- if the user is entering code, your input routine
accepts the asterisk, but converts it to an "APL star", before passing it to
the tokeniser -- no need for a "new and separate key combination"
actually, that's only a conceptual model -- I'd probably change the
characters on-the-fly, within the tokeniser itself -- no problem then with
"execute" on character strings, or when converting legacy code, either
loads of detail elided here, as you are only too well aware, but that's the
beginnings of tolerant input -- no need for a keybutton for APL star --
no problem building an Excel macro, either
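to pin that conceptual model down, a toy sketch in Python -- not any
vendor's actual routine, and I've assumed the comment symbol is the lamp
at U+235D:

    APL_STAR, LAMP = '\u22c6', '\u235d'

    def tolerant(line):
        out, in_string = [], False
        for i, ch in enumerate(line):
            if ch == "'":                        # a quote toggles string mode
                in_string = not in_string
            elif not in_string and ch == LAMP:   # rest of the line is a comment
                return ''.join(out) + line[i:]
            elif not in_string and ch == '*':
                ch = APL_STAR                    # code asterisk -> APL star
            out.append(ch)
        return ''.join(out)

    tolerant("2 * A = '*'")   # "2 \u22c6 A = '*'" -- only the code star converts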
your keyboard interface will already provide some means of entering all the
characters used in APL programming -- how the user gets to feed in other
Unicode characters is a wider question, but not one we need to explore here
so, is there anything there you're not happy with? does a move to tolerant
input cover the three [or more] characters "which in most or all existing
APL systems are regarded as the same as ordinary ASCII characters"?
regards . . . /phil
<micr...@microapl.demon.co.uk> wrote in message
news:1116007226.0...@g49g2000cwa.googlegroups.com...
> <lots of stuff I've tried to reply to above>
>
> (the same delegates opposed the inclusion of the dollar sign in the APL
> character set, on the ground that it was a national, not an international,
> symbol -- this might help you identify the miscreants, because I'm not
> going to name them)
I'm intrigued. Who could possibly object to the currency symbol of
Tuvalu being included in the APL character set?
>
> well, enough of that digression: I'm not quite sure where you stand,
> exactly, so please correct me where I've misunderstood you
>
> 1) the need to be able to store Unicoded literals is accepted -- right?
> 2) the use of non-Latin characters in names can be set aside -- OK?
> 3) you have concerns on i/o
> 4) we haven't even started on problems of string matching, like:
> what is the result when we compare U+00e9
> to the 2-vector U+0065, U+0301 ?
> 5) sorting is a major problem, and it may be that neither of us
> will live to see it solved
>
> item (3) can be split into its i-component and its o-component, and
> the o-component can be further split into code and data, giving us
> 3i) input
> 3d) output of data
> 3c) output of code
Agreed. I would add, however, the additional item of conversion of
existing 8-bit APL code and data to Unicode.
Of course, items 4) and 5) are not specific to APL, and in any case
depend on why you want to do the comparison or sort - there's no single
right answer.
>
> starting with the easy one, item (3d): the output of data (literals and
> formatted numbers - i.e., display code only, no binaries) should use
> whatever codepoints the user has specified (in addition to character
> constants (which we leave untouched), this may also include the ability
> to specify the use of mid-dot as the decimal point, &c)
>
> so we know what we have to output, in terms of character "values", and
> presumably the user can specify the encoding (UCS, UTF, whatever)
>
> I don't see any point of contention here -- have I missed anything?
I agree, as long as it is pure Unicode. For strings, what they type in
(or import from somewhere else, as Unicode), is what they get. There's
no translation on input or output. If they want to output to a
non-Unicode format (ASCII, EBCDIC etc), then of course there has to be
a translation; many Unicode characters will not be representable, and
in some cases several Unicode characters may be mapped on to a single
8-bit character, as a convenience. If the users don't like the default
mapping, they can do their own.
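As a sketch of the kind of many-to-one default mapping I have in mind
(the table itself is invented for illustration; Python again):

    # collapse several Unicode characters onto single 8-bit characters
    to_8bit = str.maketrans({'\u22c6': '*',    # APL star            -> asterisk
                             '\u2217': '*',    # asterisk operator   -> asterisk
                             '\u00d7': 'x'})   # multiplication sign -> letter x
    '2 \u22c6 3 \u00d7 4'.translate(to_8bit)   # '2 * 3 x 4'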
Similarly, if we're importing from existing 8-bit APL text (including
string literals in functions), we have to choose a suitable
translation.
>
> item (3c) covers the case where tokenised code needs converting to
> character form, for display purposes -- I take it the need for
> standardisation is accepted?
I don't think it is just tokenised code. In some APLs (not ours, as it
happens), a function is kept in both tokenised form and the original
text form, so as to preserve the original formatting. Quad-CR outputs
the original text form. Presumably, in such a system, what they type
in would be what they get out, standard or no standard?
Also an APL expression, typed in to the session window (or an Edit
window), might be copied to the clipboard - see example below. No
tokenisation has necessarily taken place.
>
> frankly, I couldn't give a monkey's whether APL's "star" is five-pointed or
> six-pointed, but I wouldn't want to use the asterisk, because the asterisk
> is usually a raised character, and the code looks a lot tidier if the
> symbols representing primitive functions have a common centre line
That's a font issue - nothing to do with character mappings. The font
can be designed for clear and readable rendition of APL code, just as
other fonts are optimised for other specific purposes. Different people
will prefer different character styles - for example, some people like
slanted letters for APL, some don't.
> first, keyboard input: the interpreter's input routine presumably has some
> way of knowing whether the user is inputting code, a character constant or
> a comment -- so, when the user hits the "asterisk" key, if it's in a
> character string or a comment, then your input routine will pass an asterisk
> to the tokeniser -- if the user is entering code, your input routine
> accepts the asterisk, but converts it to an "APL star", before passing it to
> the tokeniser -- no need for a "new and separate key combination"
Consider typing the following into an APL function, closing the
function, and then re-opening it:
2 * A = '*'
(An artificial example, but to avoid confusion I wanted to choose only
characters displayable in ASCII).
The implication of what you are suggesting is that if you now highlight
this text (it having gone through tokenising/de-tokenising), and copy
it to the clipboard, the two asterisks would map to different Unicode
characters. But if the user just typed this text in an Edit or indeed
Session window, highlighted it, and copied it to the clipboard, they'd
map to the same Unicode character - as you would expect. And if they
edited the function line so that the expression now read:
"*" = "2 * A = '*'"
the result would be 0 0 0 0 0 0 0 0 0 1 0, despite the fact that the
two asterisks were entered using the same keystroke. [This assumes the
APL supports double-quotes, as in APL+Win and APLX - you could do the
same by doubling up the single quotes as in traditional APL.] And
there would still be a need for a new and separate key combination,
for example if the user wanted to do a search in a function for a
particular expression containing the APL star (despite the fact that
they didn't enter it using that special key combination...).
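The claimed result is easy to check mechanically - here in Python, with
U+22C6 standing in for the APL star that tokenising produced:

    line = "2 \u22c6 A = '*'"        # the code asterisk became U+22C6
    [int(c == '*') for c in line]    # [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]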
Admittedly the font would presumably be designed to make it clear that
the two asterisks were different, but I think you've convinced me that
this is madness, even without worrying about how to deal with existing
8-bit APL code!
Regards
Richard
in particular, you are unhappy that tolerant input may have confusing
or deleterious effects on the user's code (i) when using copy&paste
or copydown, and (ii) when converting existing 8-bit APL code
my suggestion was that tolerant input could be used to ease the move to
Unicode -- copydown was an issue I had overlooked, and it may be that
there are insurmountable objections to its use, but I'm not yet prepared to
concede the point
the alternative to tolerant input is strict input -- granted, there are
not that many Unicode editors, but some APLers are already using emacs or
gvim, so we know it is possible (if the interpreter can handle it)
with strict input, it will always be possible to convert existing code from
the old []AV to Unicode, but Unicode compliance requires that we don't
mess with literals, so there's really no question of how existing function
definitions should be displayed, regardless of whether they are stored as
text or in tokenised form -- the user may well see different glyphs being
used for certain primitive functions, but I doubt the change will be
traumatic
and then execution of literal strings means we're still going to require
some sort of conversion routine within the interpreter
if the executable string has been formed by the concatenation of (i) a
string from an old 8-bit ws and (ii) another string entered via a strict
input routine, there is a possibility that both U+002a and U+22c6 will have
been used to denote exponentiation, so we still need to devise a method for
conversion to strict notation
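the tolerant() sketch given earlier would serve as exactly that
conversion routine -- e.g.

    legacy = '2*3'          # exponentiation from an old 8-bit ws
    newer  = '2\u22c63'     # the same, entered via strict input
    tolerant(legacy + ', ' + newer)   # both stars now U+22C6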
my guess is that the three characters causing your qualms are hyphen-minus,
asterisk and tilde -- the problem is that these characters have ambiguous
semantics
those of us old enough to remember Cobol know that
A-B
is a hyphenated name, while
A - B
is an arithmetic operation
in this example, we distinguish the different meanings of the two uses of
the symbol by reference to context -- for a large character set, it is
much simpler to dispense with the context-sensitivities, and define separate
codepoints for the separate functions -- that way, ASCII text retains its
ambiguity (i.e., no information is lost, and (equally importantly) there is
no (possibly erroneous) increase in semantic content as a result of a move
to Unicode), but those who need to distinguish the two uses can do so,
without reference to context, by using the appropriate codepoint
(NOT by changing font)
if you are troubled by the prospect of explaining to a user that a centred
5-point star is not just visually different but also semantically different
from a raised 6-point star, you may be in for a difficult time -- besides
the differences between hyphen-minus, hyphen and minus, the m-dash, the
n-dash, 16 different spaces (some of which may have the same width), and
a graduated set of circles, you are going to
have to explain why
'A' = 'AA'
sometimes returns the result 0 0
it goes with the territory, I'm afraid
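for instance, if the two right-hand characters were a Greek capital alpha
and a Cyrillic capital a (a quick Python check):

    import unicodedata
    lookalikes = '\u0391\u0410'             # both render like 'A'
    [int(c == 'A') for c in lookalikes]     # [0, 0]
    [unicodedata.name(c) for c in lookalikes]
    # ['GREEK CAPITAL LETTER ALPHA', 'CYRILLIC CAPITAL LETTER A']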
so, on to your example -- no problems with its artificiality, by the way --
we're exploring boundary conditions
> Consider typing the following into an APL function, closing the
> function, and then re-opening it:
>
> 2 * A = '*'
>
> The implication of what you are suggesting is that if you now
> highlight this text (it having gone through tokenising/
> de-tokenising), and copy it to the clipboard, the two asterisks
> would map to different Unicode characters.
yes (but it might be wise to change the display as soon as the text
is converted from tolerant to strict)
> But if the user just typed this text in an Edit or indeed
> Session window, highlighted it, and copied it to the clipboard,
> they'd map to the same Unicode character - as you would expect.
yes
> And if they edited the function line so that the expression now
> read:
>
> "*" = "2 * A = '*'"
>
> the result would be 0 0 0 0 0 0 0 0 0 1 0, despite the fact that
> the two asterisks were entered using the same keystroke.
I would rephrase that to say "despite the fact that the same keystroke
had been used to enter the different star-like characters" -- but apart
from that, yes
the user would, in any case, spot the difference between the centred 5-point
star, and the raised 6-point star, and realise why -- there again, maybe
not -- in which case users can't be trusted with tolerant input
in that case, enforce strict input -- I don't rightly know which vendors
use which input methods, but does alt-P currently deliver an asterisk to the
interpreter? would it be possible to have this key combination deliver the
APL-star instead, leaving the asterisk wherever it is now (shift-8, on my
machine)?
in the expression
CHAR = "2 * A = '*'"
I would define CHAR to be U+22c6, if I were searching text for references
to exponentiation -- and if I defined CHAR to be U+002a, I would pick up
references to footnotes, emphasised text and potentially offensive
f***-letter words -- I see no madness here; I am happy to have two
distinct characters
so, there you go -- Unicode cannot be ignored forever, but you need to
decide whether to go along with the APL standard, or stay with current ASCII
characters in those three cases -- if you decide to stick with ASCII, you
need to persuade other vendors to go the same route, and/or get the standard
changed -- if you decide to go along with the current standard, you then
have to decide whether to offer users tolerant input -- have I missed
anything?
if I have failed in my attempt to dissuade you from reverting to ASCII for
these three characters, then I'm sorry -- you face some difficult
decisions
I'm sorry that last msg is such a mess
for some reason, the draft reply was displayed in TNR, and I couldn't
change it -- I checked the linebreaks before sending the draft, but
clearly, TNR being a more compact font, the linebreaks were not well
placed for Arial Unicode, which is the font I intended to use for
sending
I'm now seeing Arial, so let's see if I'm more successful with that cod
example:
'A' = 'ΑА'
well, I don't know what that looks like to you, but it looks better here
(although the draft was displayed in TNR, anything pasted in was
displayed in Arial -- all very odd)
confused . . . /phil
You wrote:
> The problem isn't that general-purpose sorting of Unicode alphanumeric
> strings is technically complex; it is that it is logical nonsense. The
> "general-purpose collating sequence for alphabetics", which we call
> "alphabetical order", is well-defined for any given alphabet: A to Z,
> alpha to omega, alif to ya, whatever. With multiple alphabets, there is
> no defined alphabetical order. We can create an appropriate default
> collating sequence for any given alphabet; G goes after F and before H,
> but does it go before or after gamma, or gimel, or ghaym?
I believe this is not correct. Some languages do not define an alphabetical
sort order. In particular, the Japanese and Chinese languages are not
sorted using a simple alphabetical order. Furthermore, even western
languages do not use a simple sort order when you consider case.
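A small illustration in Python of why there is no single right answer -
raw codepoint order is nobody's collation:

    sorted(['z', '\u00e4'])   # ['z', 'ä'] by codepoint, yet German
                              # dictionaries collate a-umlaut near 'a'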
David Liebtag
APL2 has a system function named QuadUCS.
If the right argument is a character scalar or vector, the result is the
Unicode codepoints of those characters.
If the right argument is an integer scalar or vector, the result is the
Unicode characters associated with those codepoints.
So, QuadUV is not needed.
QuadAV {match} QuadUCS QuadUCS QuadAV
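For readers without APL2 to hand, a rough Python analogue of that round
trip, with ord and chr playing the part of QuadUCS:

    text = '\u2373\u2374'                  # APL iota, APL rho
    cps  = [ord(c) for c in text]          # [9075, 9076]
    ''.join(chr(n) for n in cps) == text   # True -- the round trip is exact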
David Liebtag