In Microsoft Word 2003, use Alt+X to toggle between the Unicode number
and the APL sign, using the font Arial Unicode MS.
In J
A=:2 3$5 9
Or
A=.2 3$5 9
This little example shows an ambiguity:
=: and =. both assign, like the left arrow, but =: assigns
globally while =. assigns locally.
In old APL there is no difference.
Maybe it should be displayed as
A 2190 2 3 2374 5 9
or
A (2190) 2 3 (2374) 5 9
Or
0041,2190,0032,0020,0033,2374,0035,0020,0039
Or
A {2190} 2 3 {2374} 5 9
It looks like Unicode is coming, and at http://www.vector.org.uk/forum/
you can see some examples of how it can be displayed on the web.
I often thought that Unicode would solve APL's problems.
That may not be the solution.
Obviously J has found a way to solve the character issue,
and J solved several other problems at the same time.
I think we should try to look at the things that may enhance the use of
any APL.
I love the APL characters and would love to see them more in use.
Here are the Unicode values for the APL2 #AV:
0021,0022,0023,0024,0025,0026,0027,0028,0029,002A,
002B,002C,002D,002E,002F,0030,0031,0032,0033,0034,0035,
0036,0037,0038,0039,003A,003B,003C,003D,003E,003F,0040,
0041,0042,0043,0044,0045,0046,0047,0048,0049,004A,004B,
004C,004D,004E,004F,0050,0051,0052,0053,0054,0055,0056,
0057,0058,0059,005A,005B,005C,005D,005E,005F,0060,0061,
0062,0063,0064,0065,0066,0067,0068,0069,006A,006B,006C,
006D,006E,006F,0070,0071,0072,0073,0074,0075,0076,0077,
0078,0079,007A,007B,007C,007D,007E,007F,00C7,00FC,00E9,
00E2,00E4,00E0,00E5,00E7,00EA,00EB,00E8,00EF,00EE,00EC,
00C4,00C5,2395,235E,2339,00F4,00F6,00F2,00FB,00F9,22A4,
00D6,00DC,00F8,00A3,22A5,20A7,2336,00E1,00ED,00F3,00FA,
00F1,00D1,00AA,00BA,00BF,2308,00AC,00BD,222A,00A1,2355,
234E,2591,2592,2593,2502,2524,235F,2206,2207,2192,2563,
2551,2557,255D,2190,230A,2510,2514,2534,252C,251C,2500,
253C,2191,2193,255A,2554,2569,2566,2560,2550,256C,2261,
2378,2377,2235,2337,2342,233B,22A2,22A3,22C4,2518,250C,
2588,2584,00A6,00CC,2580,237A,00DF,2282,2283,235D,2372,
2374,2371,233D,2296,25CB,2228,2373,2349,220A,2229,233F,
2340,2265,2264,2260,00D7,00F7,2359,2218,2375,236B,234B,
2352,00AF,00A8
By copying these numbers into Word and using Alt+X after each number,
you get the quad-AV from APL2.
Sending code in Unicode could be a way to share APL code:
Take these codes into Word and change them into APL with Alt+X.
Have an editor macro that converts between APL and Unicode.
Perhaps even upgrade old APL into J by translating the APL to Unicode:
have a J verb read the Unicode and produce J. Then, when there are
ambiguities, the program might propose alternatives, similar to what
you do with OCR.
--- Brian
Unicode is not 2 bytes; a character could take (if we are talking UTF-8)
from 1 to 4 bytes (it was originally defined up to 6, before the range
was restricted). The nice thing about UTF-8 is that 'normal' ASCII
characters are still 1 byte, ensuring backward compatibility. That is
its advantage over UTF-16 (which Microsoft uses). Check
e.g. http://en.wikipedia.org/wiki/UTF-8 and
http://en.wikipedia.org/wiki/UTF-16 for details.
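As a minimal illustration (a C++ sketch of my own, nothing vendor-specific),
the byte count per codepoint under the current 4-byte limit works out like
this:

#include <cstdio>

// Number of bytes UTF-8 needs for a given Unicode codepoint
// (RFC 3629 restricts UTF-8 to 4 bytes, i.e. up to U+10FFFF).
int utf8_length(unsigned cp) {
    if (cp < 0x80)    return 1;  // ASCII: unchanged, backward compatible
    if (cp < 0x800)   return 2;  // Latin-1 supplement, Greek, Cyrillic, ...
    if (cp < 0x10000) return 3;  // rest of the BMP, incl. the APL symbols
    return 4;                    // supplementary planes
}

int main() {
    std::printf("U+0041 -> %d byte(s)\n", utf8_length(0x41));   // 'A'
    std::printf("U+00E9 -> %d byte(s)\n", utf8_length(0xE9));   // e-acute
    std::printf("U+2374 -> %d byte(s)\n", utf8_length(0x2374)); // APL rho
    return 0;
}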
Dragan
--
Dragan Cvetkovic,
To be or not to be is true. G. Boole No it isn't. L. E. J. Brouwer
!!! Sender/From address is bogus. Use reply-to one !!!
APL2 (and, I believe, some other interpreters) has supported Unicode for
several years. The way we handle one-byte and four-byte characters is a
lot like how we handle numbers.
Conceptually a number is a number is a number. Whether the number is stored
internally as an 8 byte floating point number, or a four byte integer, or a
single bit, is generally not relevant to the APL application; the
interpreter takes care of any necessary coercions between internal types.
Likewise, a character is a character is a character. Whether the character
is stored internally using one or four bytes should be irrelevant to the APL
application; the interpreter does any necessary coercions between internal
types.
QuadAV is simply a shorthand way to refer to the particular subset of
Unicode characters that are of particular importance to APL programmers.
For efficiency, we store these particular characters in one byte where
possible.
David Liebtag
IBM APL Products and Services
UTF-8 would probably be a pain to work with because of the variable
character lengths. If one has a character vector S and refers to, say,
S[2100], do you have to scan through all of the first 2100 characters
to find it?
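To make the problem concrete, here is a rough C++ sketch of my own (not
from any real interpreter):

#include <string>
#include <cstddef>

// Return the byte offset of the n-th codepoint (0-based) in a UTF-8
// string by scanning from the start -- O(n), not O(1).
std::size_t utf8_offset(const std::string& s, std::size_t n) {
    std::size_t i = 0;
    while (n > 0 && i < s.size()) {
        ++i;   // step past the leading byte of the current character
        // continuation bytes look like 10xxxxxx; skip them too
        while (i < s.size() &&
               (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80)
            ++i;
        --n;
    }
    return i;   // byte index where codepoint n begins
}

With fixed-width 16-bit or 32-bit characters, by contrast, the offset is
a single multiplication.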
--- Brian
to ignore the case of letters or, say, have a-umlaut treated as
equivalent to 'a' and 'A'.
--- Brian
--
James L. Ryan -- Taliesinsoft
italic vs. upright is a matter of presentation, and not Unicode's concern
if the surrounding explanatory text uses a different font from the APL code,
then the APL font can have slanted (oblique, italicised) alphabetic
characters
if the surrounding explanatory text uses the same font as the APL code,
then a simple finite state machine can choose between upright and slanted
alphabetic characters
in latter days, my own documentation would also use upright alphabetics
within literals and comments, and slanted characters within the code --
it isn't difficult to do
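as a rough sketch of that state machine (mine, and simplified -- doubled
quotes inside literals, for one, would need an extra state):

#include <string>
#include <vector>

enum class Style { Upright, Slanted };

// walk one line of APL source and decide, per character, whether an
// alphabetic should be slanted (code) or upright (literal or comment)
// -- U+235D is the APL "lamp" comment symbol
std::vector<Style> style_line(const std::u32string& line) {
    std::vector<Style> out(line.size(), Style::Slanted);
    bool in_literal = false, in_comment = false;
    for (std::size_t i = 0; i < line.size(); ++i) {
        char32_t c = line[i];
        if (!in_comment && c == U'\'') in_literal = !in_literal;
        if (!in_literal && c == U'\x235D') in_comment = true;
        if (in_literal || in_comment) out[i] = Style::Upright;
    }
    return out;
}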
regards . . . /phil
I hope this isn't news to you, but if you want to sort, say, a list of names
into the alphabetic sequence standard for a language or a country, the
Unicode collating sequence is not normally adequate to the task --
dyadic upgrade is still required
dyadic upgrade needs to be enhanced, though, because (a) some languages
contain accented characters not in Unicode, and (b) Unicode's recommendation
is that *all* accented characters should be stored in decomposed form
yes
well, you don't, but the interpreter does
actually, because the interpreter doesn't *have* to scan through *all*
the preceding characters, it's more correct to say that polynomial array
addressing is no longer adequate
UTF-8 is a way to map multibyte character sets onto 8-bit streams --
it is not, strictly speaking, a character encoding -- it was originally
intended for communication channels, I believe, and not internal
representation -- you would not want your document stored internally
in UTF-8 form if you were using Chinese, for instance
I would suggest that the increased storage cost in moving from single byte
ASCII (or []AV) to 2-byte Unicode BMP is insignificant when compared with
(i) the bloat experienced when moving from plain text to HTML or Word, and
(ii) the falling cost of physical memory
Phil,
My original posting, the one you comment on above, was not well stated. What
I intended to say was that in my opinion the entirety of the APL glyphs
should be considered unique and should be assigned their own place in the
unicode table. I agree that the style applied, for example, Italic vs.
non-Italic for the letters, is external to their placement in unicode. As I
tried to emphasize in my original posting, an advantage of having the
entirety of the APL characters in their own plane (is that term valid in
unicode?) is the ease with which a presentation system could differentiate
between APL and non-APL glyphs, regardless of whether or not a particular
APL glyph has a look-alike similarity to a non-APL glyph. Another advantage
is that this could provide the APL community with the ability to have what
I'll dub a "unified APL atomic vector" in addition to their own, if they so
choose, unique atomic vectors. Such a unified atomic vector could easily be
added to current APL implementations and would provide a vehicle for code
interchange and communication.
Jim
--- Brian
Unicode is clearly coming on more strongly now, with XML as a
communication tool.
It would be nice if APL came alive again through Unicode.
I believe I understand the benefits you envision for having APL alphabetic
characters have their own Unicode codepoints. And although they are nice, I
have a different desire:
I would like APL interpreters to allow any Unicode alphabetic or numeric
character in object names. That way, people who use different languages
could use meaningful names written in their native languages and not be
limited to the character set chosen by the English-speaking APL implementers
years ago.
Regarding sorting, as someone pointed out recently, general purpose sorting
of Unicode strings is far more complicated than can be supported by the
grade primitive. In a perfect world, I think the APL implementers should
provide easy access to some of the Unicode string sorting algorithms that
exist outside their interpreters.
These are just my opinions, which do not necessarily agree with my
employer's.
David Liebtag
is that necessary? []AV varies by vendor, but Unicode codepoints do not,
so that a given character is uniquely identified by a 16-bit integer (or a
32-bit integer, or a pair of 16-bit integers) -- this is totally
transportable, so that all that's required, it seems to me, is that vendors
accept integer values on the left of a dyadic upgrade, and interpret these
values as Unicode characters
well, that's not _quite_ everything: implementors would need to take account
of the fact that, for instance, U+0065/U+0301 is the "preferred" spelling
for e-acute, and should be treated as equivalent in every way to U+00E9 --
but as a user, I'm happy to leave that task in their capable hands
That is why I think you would need, say, #UV[00E9] to display that
particular character, or its equivalent in decimal, #UV[233].
I went to look up 00E9 and discovered a new site while writing this:
http://isthisthingon.org/unicode/index.phtml
Latin Small Letter E With Acute
Shift-JIS (Hex): None
Unicode (Hex): 00E9
Unicode (HTML): é
HTML Entity: é
I like these:
Latin Capital Letter Thorn (icelandic)
Shift-JIS (Hex): None
Unicode (Hex): 00DE
Unicode (HTML): Þ
HTML Entity: Þ
Latin Small Letter Thorn (icelandic)
Shift-JIS (Hex): None
Unicode (Hex): 00FE
Unicode (HTML): þ
HTML Entity: þ
Latin Capital Letter Eth (icelandic)
Shift-JIS (Hex): None
Unicode (Hex): 00D0
Unicode (HTML): Ð
HTML Entity: Ð
Latin Small Letter Eth (icelandic)
Shift-JIS (Hex): None
Unicode (Hex): 00F0
Unicode (HTML): ð
HTML Entity: ð
Some of the Icelandic letters overlap APL chars in #AV, and it is
therefore hard to use APL and Icelandic together.
We have 32 chars in our alphabet - and that is just one case.
That makes 64 chars, for those who do not have a calculator handy.
It was especially a problem with the keyboards earlier:
way too expensive to get keyboards with both on them.
That is why I switched to J very early on.
I am glad now that I did, but I still have a dream of being able to use
both.
I am sure both J as well as other APL dialects will support #UV, and
then we can live happily ever after together on the same keyboard and
screen.
The problem isn't that general-purpose sorting of Unicode alphanumeric
strings is technically complex; it is that it is logical nonsense. The
"general-purpose collating sequence for alphabetics", which we call
"alphabetical order", is well-defined for any given alphabet: A to Z,
alpha to omega, alif to ya, whatever. With multiple alphabets, there is
no defined alphabetical order. We can create an appropriate default
collating sequence for any given alphabet; G goes after F and before H,
but does it go before or after gamma, or gimel, or ghayn? There is no
consensus answer to that, and we can do no better than to leave those
choices to the programmer.
Eric Landau, APL Solutions, Inc.
"Sacred cows make the tastiest hamburger." - Abbie Hoffman
had you thought of using a session manager which supports Unicode?
> That is why I think you would need say #UV[00E9] to display that
> particular character or its equivalent in dec #UV[233]
I think a look-up table is a bit too primitive for the things you're going
to want to do -- at a minimum, you'd want to be able to convert a
preformed composite (like e-acute) into its constituent parts and,
conversely, convert base+accent(s) into a composite (if such a composite
exists, of course) -- you'd also want to convert a 32-bit codepoint into
two 16-bit surrogates, and vice versa -- you would perhaps be better
served with system functions or library functions here
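to give the flavour of the first of those (a toy sketch of mine in C++ --
a real implementation would draw on the full Unicode decomposition data,
which runs to thousands of entries):

#include <map>
#include <vector>

// toy canonical-decomposition table: composite -> base + combining mark
static const std::map<char32_t, std::vector<char32_t>> kDecomp = {
    { U'\xE9', { U'e', U'\x301' } },  // e-acute -> e + combining acute
    { U'\xE5', { U'a', U'\x30A' } },  // a-ring  -> a + combining ring
};

// decompose a codepoint if a decomposition exists, else return it as-is;
// composition is the inverse lookup
std::vector<char32_t> decompose(char32_t c) {
    auto it = kDecomp.find(c);
    return it != kDecomp.end() ? it->second : std::vector<char32_t>{ c };
}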
> I went to a site to look up 00E9 and I found a new site I discovered
> writing this
http://isthisthingon.org/unicode/index.phtml
nice site, thanks for the link
I'm surprised you needed to look up U+00E9 -- from U+0000 to U+007F,
Unicode coincides with 7-bit ASCII -- from U+0000 to U+00FF, Unicode
coincides with Latin-1 (which includes, I think, everything you'd need for
modern Scandinavian languages, like Eth, Thorn and O-slash, uc & lc --
enough for Icelandic and Finnish, but not enough for Greenlandic or Saami)
<snip>
> Some of the Icelandic letters overlap APL chars in #AV and it is
> therefore hard to use APL and Icelandic together
more generally, it is hard to use APL with almost any other language than
English, if you confine yourself to an 8-bit system -- that is why a move
to Unicode is pretty much inevitable
>We have 32 chars in our alphabet
are you including accented forms as separate letters?
(the way the Danes see A-ring, for instance)
all the best . . . /phil
aábcðdeéfghiíjklmnoópqrstuúvwxyýzþæö
> aábcðdeéfghiíjklmnoópqrstuúvwxyýzþæö
thanks -- I shall file that . . . /phil
while I agree 100% with your conclusion, it isn't necessary to go to
multiple scripts to demonstrate its necessity
France, Germany and Denmark are adjacent countries, but within their
dictionaries, accented letters are sorted on three different, incompatible,
principles
Spanish is different from all three -- Spain recently adopted a new sort
order, although (I believe) Latin America retains the old one
that's five sort methods already, within Latin-1
--------------------
A, b, c, d, e, f, g;
eftir kemur h, í, k,
l, m, n, ó, einnig p,
ætla ég q þar standi hjá.
R, s, t, u, v eru þar næst,
x, ý, z, þ, æ, ö.
Allt stafrófið er svo læst
í erindi þessi lítil tvö.
------------- rough translation
A, b, c, d, e, f, g;
after that comes h, í, k,
l, m, n, ó, also p,
I gather q stands there by
R, s, t, u, v come next,
x, ý, z, þ, æ, ö.
The whole alphabet is so locked
in these two little verses.
> ------------- rough translation
oh, that's priceless -- and I meant to ask you if you'd given the alphabet
in dictionary sort order in my last msg -- thanks again . . . /phil
my understanding is that Unicode includes every APL character ever used,
plus every APL character ever proposed
my guess is that you want a _subset_ of Unicode which includes every APL
character ever used -- presumably an 8-bit subset? -- this sounds a lot
like a codepage -- I'd often wondered why APL implementors never got
together to define a standard APL codepage, but I can see that there might
be differences between those coming from an EBCDIC background and those
coming from an ASCII background
your suggestion doesn't _require_ a codepage, however -- all that's
required is a font equipped with glyphs and codepoints for all the
characters in the ISO standard -- there are rendering systems which allow
you to map a physical font file into specified areas of a (larger) virtual
font -- this would allow the alphabetic characters in the subset to be
slanted while the APL subset was mapped in, reverting to upright (or
whatever) when the APL subset is removed
is this what you wanted? or have I misunderstood your requirement?
OK, my knowledge of this is superficial.
I have heard all the arguments, and I may have seen all the arguments on
the issue.
Anyway, what I know and have had the interest in remembering is that, as
always, IPSA and APL2 decided to agree to disagree on this issue, as on
pretty much everything else.
SIGAPL/ISOAPL and a uniform direction has for similar reasons always
been a dead duck.
I have never understood the reasons why this has had to be this
way.
I know and have known all the top players, and I even tried at one
gathering at APL<some year long time ago - I guess it was ca 1985> to
get them all together to solve the issue, and over drinks at the banquet
all seemed to agree on the need to work together. Needless to say,
nothing happened.
As far as I remember and know then there are some APL characters in #AV
in IPSA not the same as #AV in APL2 even if it looks the same. That is
they do not use the same UNICODE even if they look exactly the same.
Let's call one symbol kalli; it is in different places in IPSA and
APL2 but for us users kalli looks exactly the same. #AVIPSA[x] looks
the same as #AVAPL2[y] and then in #UV you can not even use the same
code so #AVIPSA[x] is #UV[z] and #AVAPL2[y] is #UV[t] and they all
look like kalli.
As far as I know there are several characters in UNICODE from Japan,
Korea, China etc that in similar ways look exactly alike but do need
different codes. So the problem is not just in APL.
Obviously these disagreements have meant prolonged discussions on
getting UNICODE in general use.
I think the unfortunate difference in ways between IPSA and APL2 has
resulted in infighting between different APL factions that has made
APL a loser.
I am sure that if it were not for these differences APL would be much
more prosperous than it is.
Who bloody cares for these old differences? Are we not mature enough to
try to work together, and not always try to downgrade each other's
dialect?
I have used both dialects and I like both; firstly, I would like them
both to succeed, and I have accepted the fact they will never be the
same.
If we do not hang together we will all be hanged separately.
I don't understand your assertion here -- your "kalli" would have a unique
codepoint, so that any Unicode-compliant output (whether from IPSA or APL2)
would show kalli as #UV[z] -- regardless of how it may be represented
internally
(this assumes, of course, that UV is your shorthand for Unicode from
U+0000 to U+FFFF -- it would be quite ridiculous for implementors
to start re-ordering Unicode)
all I would ask is that:
(i) character arrays are capable of storing all Unicode characters;
(ii) characters can be converted to and from integer Unicode values;
(iii) when called upon to write to an external channel or medium, such as
the screen, printer, disk or modem, an APL implementation outputs
character arrays using the codepoints specified in the APL standard,
and my choice of UTF-8 or UCS-2
(of course, if that happened, the implementors would then be asked to prefix
UCS strings with U+FEFF, but I don't see why that can't be left to the user)
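requirement (iii), with the BOM thrown in, is less work than it may
sound -- a sketch, assuming 16-bit storage and UTF-16LE output (the
function name and details are mine):

#include <cstdio>
#include <vector>

// write a vector of 16-bit Unicode values to a file as UTF-16LE,
// prefixed with the U+FEFF byte-order mark
bool write_ucs2(const char* path, const std::vector<unsigned short>& text) {
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return false;
    std::vector<unsigned short> buf;
    buf.push_back(0xFEFF);                   // byte-order mark
    buf.insert(buf.end(), text.begin(), text.end());
    for (unsigned short u : buf) {
        std::fputc(u & 0xFF, f);             // low byte first (little-endian)
        std::fputc((u >> 8) & 0xFF, f);
    }
    return std::fclose(f) == 0;
}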
once we have strict Unicode-compliant output, []AV becomes an historical
curiosity, retained only for backwards compatibility
> As far as I know there are several characters in UNICODE from Japan,
> Korea, China etc that in similar ways look exactly alike but do need
> different codes. So the problem is not just in APL.
well, this is a touchy issue, because nationalist sensitivities are
involved -- many characters found in Traditional Chinese, Simplified
Chinese and Japanese Kanji were given the same codepoint, even though the
displayed forms of these characters may differ considerably -- a process
known as Han Unification (incidentally, this was insisted upon by the
Chinese government, not the Americans, as many people believe)
there are some characters used in Japanese which many scholars believe to be
of Korean origin -- while the Japanese acknowledge their Chinese origins,
some Japanese are less comfortable acknowledging any Korean borrowings --
I am told the issue generated considerable heat in committee discussions,
but the disputed characters were eventually given distinct codepoints
some Cyrillic characters were added at the behest of the Ukrainians as being
uniquely Ukrainian -- Russians will tell you these characters are simply
stylistic variants of existing characters, and are not necessary to resolve
ambiguities in plain text
Greek and Coptic were unified, but are now being separated (I'm not sure
I understand the logic underlying this change)
I hope I've been sufficiently diplomatic here, but the conclusion is, yes,
there may be some duplication in Unicode
but not in APL -- the APL standard specifies a unique Unicode value for
every APL character ever used, and every APL character ever proposed
> Obviously these disagreements have meant prolonged discussions on
> getting UNICODE in general use.
if that is the case, then maybe the discussions are based on a
misunderstanding somewhere -- personally, I don't see any difficulty
> I think the unfortunate difference in ways for IPSA and APL2 have
> resulted in infighting between different APL factions that have made
> APL a loser.
>
> I am sure that if it were not for these differences APL would be much
> more prosperous than what it is.
that may be true -- maybe you should extend your comments to all
vendors -- and maybe mention nested arrays (?)
> If we do not hang together we will all be hanged separately.
Good God! that's a bit extreme, isn't it? I didn't realise failure to
standardise properly was a capital offence
It seems to me that if a #UV variable could be implemented containing
the whole Unicode character set, or the subset supported by an
interpreter, it might be best to simply define #AV that way rather than
introducing a new variable for the same purpose. If that was not the
case (e.g. the character set would not fit in a workspace), it might be
better to use a translation function such as #UCS instead.
I can see implementing both 1-byte and 2-byte or 4-byte characters
and translating upward (widening) as needed, just as integer arrays
are automatically promoted to floating point as required. But one
difference is that there are standard operations, such as floor or
ceiling, that translate back from floating point to integer, whereas I
know of no standard APL operations that would narrow characters
from 4 bytes back to 1 byte each. So once a character array was
widened it would stay wide, which might pose space problems.
I suppose one might use #DR or some such to perform narrowing
explicitly, but that would not be very portable.
Anyway, this is an interesting problem.
--- Brian
APL code is input to the interpreter as text -- if that text were a stream
of Unicode values, then portability would be improved a little, as would
communication (between systems and between programmers)
the benefits of being able to output Unicode strings are rather too large to
be enumerated here
I have no idea what you mean by "would the basic code plane be enough" --
each of the codepoints defined within the APL standard is a 16-bit value,
lying within BMP, the Basic Multilingual Plane, which extends from U+0000
to U+FFFF
is that enough? for whom? if somebody is doing text processing in Old
Italic, then maybe it isn't, because Old Italic lies on Plane 1, and
therefore requires a 32-bit address
on the other hand, maybe it is -- 32-bit values can be represented as
two 16-bit values -- Unicode calls these "surrogates" -- so if the
interpreter can store 16-bit characters, that's all we need
> It seems to me that if a #UV variable could be implemented containing
> the whole unicode character set or subset supported by an interpreter,
> it might be best to simply define #AV that way rather than introducing
> a new variable for the same purpose. If that was not the case (e.g.
> the character set would not fit in a workspace), it might be better to
> use a translation function such as #UCS instead.
this bit has me completely mystified -- what on earth would you expect to
find stored in this #UV ?? and what purpose would #AV serve?
> I can see implementing both 1-byte and 2-byte or 4-byte characters
> and translating upward (widening) as needed, just as integer arrays
> are automatically promoted to floating point as required. But one
> difference is that there are standard operations, such as floor or
> ceiling, that translate back from floating point to integer, whereas I
> know of no standard APL operations that would narrow characters
> from 4 bytes back to 1 byte each. So once a character array was
> widened it would stay wide, which might pose space problems.
> I suppose one might use #DR or some such to perform narrowing
> explicitly, but that would not be very portable.
would it perhaps be simpler to store all characters as 16-bit integers,
using surrogates where a 32-bit codepoint needs to be represented? no
problems then with promotion and demotion -- wouldn't that be easier?
I'm sorry, but my replies look rather bad-tempered -- they're not meant
to, but I honestly cannot see where the communication gap is -- would it
help to imagine an APL without any character datatype at all? you wouldn't
lose any processing power whatever: just process (avoiding arithmetic, of
course) arrays of 32-bit (ergo no problems with surrogates) integers
when you need to output to an external device, you call a system routine to
convert these integers into a strictly Unicode-compliant stream (a much
simpler process to implement, incidentally, than the conversion and
formatting necessary for genuine "numeric" integers), which our system
routine then passes on to the aforementioned external device
if that works OK as a conceptual model, then we can refine it later
#AV is an implementation specific ordered character vector containing the
universe of all characters recognized when programming in a given APL
dialect.
#UV is a vector of the same length as #AV containing indices into Unicode,
each element containing the appropriate Unicode index for the corresponding
element of #AV.
This assumes that each and every glyph used in every APL would have a home in
Unicode. #UV would then provide a means of porting code from implementation
to implementation -- ignoring the differences that might exist in
interpretation and/or recognition of that code.
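To sketch the porting step in C++ (the function is hypothetical, and the
256-entry table would be filled in from each implementation's own #AV
layout):

#include <string>
#include <vector>

// given this implementation's 256-entry #UV table (the Unicode codepoint
// for each #AV position), convert legacy 8-bit text to Unicode values
std::vector<unsigned short>
av_to_unicode(const std::string& legacy, const unsigned short uv[256]) {
    std::vector<unsigned short> out;
    out.reserve(legacy.size());
    for (unsigned char b : legacy)
        out.push_back(uv[b]);    // one table lookup per character
    return out;
}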
This all brings to mind the inclusion of a "rosetta stone" in the workspace
exchange stuff of the late seventies.
that makes a lot of sense -- #UV has 256 integer elements, so there is no
problem with storage space -- and, unless some genius has extended their
character set with a non-standard "semi-colon slash in a circle" since the
standard was approved, there is no problem allocating the "correct" (i.e,
standard) codepoint to each character
this wouldn't take long to implement, and would (surely?) cause no
incompatibilities with existing systems
I note that you carefully specify "all characters recognized when
programming" -- this leaves open the wider issues of "all characters that
might appear in a literal" and David Liebtag's "all the characters that
might appear in a name" -- it also avoids questions like, "what is the
correct codepoint for FMK?"
good move -- go for the easy bits first
I don't know that people migrate between APLs that much, but they could
perhaps exchange utilities, by transmitting the Unicode string representing
the function definition (and, ideally, starting that string with U+FEFF, to
indicate byte-order, and ending with U+FFFF, to indicate end of file)
your suggestion looks like a simple, non-contentious, fully compatible first
step -- excellent stuff
You've already effectively got this, certainly in APLX and APL2 at
least. What is being suggested for #UV is equivalent to: #UCS #AV
I think that all the major vendors are already committed to
inter-operable Unicode data exchange. The only issue I can see is that
it is unfortunately not quite true that every APL symbol has an
unambiguous Unicode equivalent (see Adrian Smith's article in Vector
19.3, January 2003). For this reason, and also to maximise the
probability of being able to process text from other applications, in
APLX we accept as input from Unicode some alternative symbols. For
example, both 002a (ASCII Asterisk) and 22c6 (Star in 'Mathematical
operators') map to the APLX Star symbol.
Richard
#include <deque>
using std::deque;

// a rank-n character array: a shape vector plus the ravelled values
class CharArray {
    deque<long> shape;    // length of each dimension
    deque<char> values;   // the array's elements, in ravel order
};
In traditional APL interpreters we generally have 256
character values, all of which are included in #AV.
Now suppose that we want to support Unicode in
an APL interpreter. To support the full Unicode
character set would require using four bytes per
character, multiplying the size of character
arrays by four. This could cause "workspace full"
problems when large character arrays were used.
If we still want to have #AV include all of the
characters supported by the interpreter, then
#AV might become too large to fit in a
workspace, limiting its uses. (Could it still be
used as the left argument of grade up?) But
restricting #AV to a subset of the supported
characters would break code that assumes
otherwise, e.g. workspace transliteration
programs.
If, on the other hand, the use of characters
outside of the basic code plane is very rare,
then we don't need to support these characters
and can get away with two-byte characters,
which reduces the problem. A 64K-sized
#AV should be manageable.
If the character array is implemented as I
describe above, all of the characters take
up the same amount of space, and the
fundamental character type is something
like char or wchar_t. It would also be possible
to use a UTF-8 string instead of, say,
deque<wchar_t>, to hold the values of the
characters in the array, but this does not
seem feasible to me as I see no efficient
way to index a large UTF-8 string and array
indexing is a common operation in APL.
It has been suggested that it would be
advantageous to support 1-byte characters as
well as longer Unicode characters. That way,
an array that only used characters in the
basic 256-character set could be stored
more efficiently. If, say, a Unicode character
was concatenated to this array, the array
would automatically be promoted to the
larger Unicode array type before the
concatenation was performed, just as an
integer array would be promoted to floating
point if you appended a floating point value.
The difference is that APL programs have a
standard way of converting from floating point
back to integer form: just apply ceiling or floor.
There is no corresponding transformation for
characters. So if a character array was
promoted to the 4-byte character form and
then at some later point all of the characters
that required a 4-byte representation were
removed, the string would remain in 4-byte form
even though it could be stored in the more
compact form. There would be no standard way
for an APL program to convert it to the more
efficient 1-byte form. The best one could do
would be to use a nonstandard feature such as
#DR to perform this conversion.
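To make the idea concrete, the squeeze an interpreter could apply
silently might look like this (my own C++ sketch; the names are
invented):

#include <string>
#include <vector>

// demote a wide character array back to 1-byte storage if every value
// happens to fit -- the kind of internal squeeze no standard APL
// primitive exposes
bool try_narrow(const std::vector<char32_t>& wide, std::string& narrow) {
    for (char32_t c : wide)
        if (c > 0xFF) return false;   // a genuinely wide character remains
    narrow.clear();
    narrow.reserve(wide.size());
    for (char32_t c : wide)
        narrow.push_back(static_cast<char>(c));
    return true;
}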
Another reason to support 1-byte characters as
well as 4-byte ones would be to make it easier
to transfer character data to and from external
routines via #NA.
A place where UTF-8 might be useful would be
when writing a workspace out to disk, as that
would save disk space.
I hope this clarifies what I was wondering about.
--- Brian
Adrian Smith's article is seriously misleading
published in 2003, it gives the impression that the Unicode values
appropriate to the APL character set were (at that time) still undecided
that is not the case now, and was not the case in 2003 -- the final
standards meeting on this topic took place just before the Berlin APL
conference in 2000 -- the results of that meeting were presented to ISO
shortly thereafter, and accepted as the ISO standard later in that year
(more than 2 years before Adrian Smith's article)
I have seen nothing here, or in Vector, that acknowledges that fact, but
part of the story can be found at
http://www.math.uwaterloo.ca/~ljdickey/apl-rep/
so, in case I failed to make myself clear earlier, the APL standard
specifies unique Unicode values for every APL character in use, and
(so far as is known) every APL character ever proposed
I would like to draw attention to the following key words in the above:
"APL"
"standard"
"specifies"
"unique"
and "Unicode"
there is no requirement that input be strict -- you can accept stars and
asterisks of all sorts, sizes, shapes and hues and interpret them all as
"asterisk", if you so wish
tolerant input and strict output makes a lot of sense, but there is no
*requirement* that output be strict either -- most ISO standards are not
enforceable in any legal sense -- but if an implementation generates
non-strict output, they may have some unhappy users -- that's all
thanks for that, Brian -- the missing piece of the jigsaw might be this
stuff on "surrogates", which I've reformatted from Unicode.org's website
Q: What are surrogates?
A: Surrogates are code points from two special ranges of Unicode values,
reserved for use as the leading, and trailing values of paired code units in
UTF-16. Leading, also called high, surrogates are from #D800 to #DBFF,
and trailing, or low, surrogates are from #DC00 to #DFFF. They are called
surrogates, since they do not represent characters directly, but only as a
pair </plagiarism>
the point is that surrogates provide a mechanism for converting 32-bit
indices into two 16-bit indices, and vice versa -- this means that a
stream of 16-bit integers can represent all characters in the BMP, and all
the characters in the Supplementary Planes -- the cost is that we don't
get full 4 Gig addressing, but what the hell! we do get lots of space
that being so, an implementation only needs to allow 2 bytes per
character -- actually, make that "2 bytes per Unicode value" -- in a
character vector
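the arithmetic is simple enough to show (a sketch of mine; the official
formulae are in the Unicode standard):

#include <cassert>

// split a supplementary-plane codepoint (U+10000..U+10FFFF) into a
// UTF-16 surrogate pair, and reassemble it
void to_surrogates(unsigned cp, unsigned short& hi, unsigned short& lo) {
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    cp -= 0x10000;                    // 20 bits remain
    hi = 0xD800 + (cp >> 10);         // leading (high) surrogate
    lo = 0xDC00 + (cp & 0x3FF);       // trailing (low) surrogate
}

unsigned from_surrogates(unsigned short hi, unsigned short lo) {
    return 0x10000 + ((unsigned)(hi - 0xD800) << 10) + (lo - 0xDC00);
}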
and at this point, you no longer need []AV -- while the interpreter may
well convert the 16-bit characters of your code into 8-bit internal values,
the actual 8-bit internal value used for a given primitive is of no
particular interest to the user
I would agree with you on UTF-8 -- great for transmission, crap for
internal storage
sorting is a larger problem -- 16-bit integer left arguments to upgrade is
all we need syntactically, but there's more to it than that, which will have
to wait for another time
till then, all the best . . . /phil
> so, in case I failed to make myself clear earlier, the APL standard
> specifies unique Unicode values for every APL character in use, and
> (so far as is known) every APL character ever proposed
Ever proposed?
I'm curious as to whether the Burroughs APL\700 file system characters (some
of which are shown below) made it....
quad-left-arrow -- read
quad-right-arrow -- write
quad-up-arrow -- take
quad-down-arrow -- drop
quad-up-caret -- hold
quad-down-caret -- free
quad-circle -- rotate
quad-forward-slash -- compress
quad-backward-slash -- expand
The APL\700 file system (remember, we're talking early seventies now!) was
similar to the STSC/Sharp files in that a file was treated as a vector of
components, where a component was an APL scalar or array. Unlike the
STSC/Sharp files individual components were not assigned a fixed number, but
were accessed by position as are elements of a vector.
U+2347 quad-left-arrow -- read
U+2348 quad-right-arrow -- write
U+2350 quad-up-arrow -- take
U+2357 quad-down-arrow -- drop
U+2353 quad-up-caret -- hold
U+234c quad-down-caret -- free
U+233c quad-circle -- rotate
U+2341 quad-forward-slash -- compress
U+2342 quad-backward-slash -- expand
. . . and 10 others you didn't mention
>
> there is no requirement that input be strict -- you can accept stars and
> asterisks of all sorts, sizes, shapes and hues and interpret them all as
> "asterisk", if you so wish
Yes, but we still need to choose a single Unicode value on output.
There is only one asterisk in #AV (and it would be extremely confusing
if there were more). Do you want us to translate this to 22C6 (which
the Unicode standard defines as 'Star Operator APL')? If so,
exchanging Unicode text with the vast majority of non-APL applications
won't work properly; from the user's point of view, an asterisk is an
asterisk is an asterisk; there is a perfectly good one in the standard
ASCII character set, which in Unicode is at 002A, and which is
recognized by all applications. If you write some Unicode text to the
clipboard from APL, for example a formula which you are going to paste
into Excel, you certainly want 002A, not 22C6.
>
> tolerant input and strict output makes a lot of sense, but there is no
> *requirement* that output be strict either -- most ISO standards are not
> enforceable in any legal sense -- but if an implementation generates
> non-strict output, they may have some unhappy users -- that's all
>
What I am suggesting is that it would be preferable for everyone to
agree on a single output mapping - and this should most definitely
include placing all the ASCII-compatible APL characters (which include
minus, asterisk, and tilde) in the range 0000 to 007F. We would
certainly get unhappy users if these common characters were mapped to
unusual APL-specific or mathematical Unicode characters, which don't
appear in all fonts and are not recognized by most applications. We
would also get unhappy users if ordinary Unicode text (containing only
common, non-APL-specific characters apparently in #AV), which was
imported into APL and then exported back to Unicode, came out different
to the original.
The crucial point is that 'strict' output is much less important than
maximizing inter-operability between APL and non-APL applications.
If anyone is interested in the exact details, the mappings we currently
use in APLX are documented in an Appendix in the 'APLX Language
Reference Manual', available from
http://www.microapl.co.uk/apl/aplx_docs.html. I think I am right in
saying that this is compatible with the mappings used by IBM in APL2,
for those characters which are common to both interpreters.
Richard
agreed
> There is only one asterisk in #AV (and it would be extremely confusing
> if there were more). Do you want us to translate this to 22C6 (which
> the Unicode standard defines as 'Star Operator APL')? If so,
> exchanging Unicode text with the vast majority of non-APL applications
> won't work properly; from the user's point of view, an asterisk is an
> asterisk is an asterisk; there is a perfectly good one in the standard
> ASCII character set, which in Unicode is at 002A, and which is
> recognized by all applications. If you write some Unicode text to the
> clipboard from APL, for example a formula which you are going to paste
> into Excel, you certainly want 002A, not 22C6.
"an asterisk is an asterisk is an asterisk" -- that's kind of imprecise,
isn't it? do you mean a 5-point asterisk or a 6-point asterisk? is the
character written superscripted, subscripted or centred? (or maybe
written below a base character?)
if you are writing out text which you want another interpreter to recognise
as APL code, then output U+22c6, the APL star operator
if you are writing out text that is to be interpreted by Excel, then output
U+002a, the "asterisk" character (a rather imprecise description, I agree,
but that's because Unicode imported so much unchanged stuff from existing
standards)
if you can produce Unicode output, this is no big deal, surely? the
programmer already has to remember to use different symbols anyway
for Excel and APL
> What I am suggesting is that it would be preferable for everyone to
> agree on a single output mapping - and this should most definitely
> include placing all the ASCII-compatible APL characters (which include
> minus, asterisk, and tilde) in the range 0000 to 007F.
if you have a point of view on this, it's a pity you didn't speak up 5 years
ago
> We would
> certainly get unhappy users if these common characters were mapped to
> unusual APL-specific or mathematical Unicode characters, which don't
> appear in all fonts and are not recognized by most applications.
name a single character that appears in all fonts -- you are always going
to have difficulties in this area
and I'm not sure what you mean by "recognised" -- if the application can
read Unicoded plain text, then it can "recognise" U+22c6 just as easily as
it can recognise U+002a -- no problem
OTOH, if you mean that a compiler or interpreter needs to "recognise" this
output as acceptable code, then, fine, write the output using those
characters the compiler or interpreter will understand -- what's the
problem?
> We
> would also get unhappy users if ordinary Unicode text (containing only
> common, non-APL-specific characters apparently in #AV), which was
> imported into APL and then exported back to Unicode, came out different
> to the original.
you don't export anything to Unicode -- it's just a numbering convention
reading Unicoded text from disk, and writing it out again as Unicoded text,
should leave the text unchanged -- this is about *the* most fundamental
requirement for Unicode compatibility -- so, yes, you would have some
unhappy users if that text got changed in any way -- but then, nobody
has suggested changing it, have they?
this requirement applies to all transput, and may therefore be deemed to
include all literals
and it still holds true if the text is (supposed to be) an APL function
definition
however, reading in a function as Unicoded text, tokenising it, and then
converting the tokenised form to Unicoded text output, will, quite possibly,
produce output which differs from the input -- but so what? it already
does -- spacing is not always preserved, constants like 2.000 get changed
to 2.0, etc -- but that's not a problem, presumably?
> The crucial point is that 'strict' output is much less important than
> maximizing inter-operability between APL and non-APL applications.
"strict output" refers only to text output which is going to be interpreted
as APL code -- anything else is plain text (even if, as is the case with
Excel macros, that plain text will later be interpreted as code) -- so,
if I read your statement aright, you're saying that communication with other
APL systems is less important than communicating with non-APL systems --
fine -- that's a statement of your priorities
so, if you can produce Unicode output, then the codepoints to be used for
APL characters are not really of any great interest to you -- right?
> If anyone is interested in the exact details, the mappings we currently
> use in APLX are documented in an Appendix in the 'APLX Language
> Reference Manual', available from
> http://www.microapl.co.uk/apl/aplx_docs.html. I think I am right in
> saying that this is compatible with the mappings used by IBM in APL2,
> for those characters which are common to both interpreters.
oh, OK -- it seems they are of interest, albeit of low priority
well, like I said, the ISO standard is not enforceable in any real sense, so
if vendors can agree among themselves on a convention they are happy with,
then they are free to use that instead of the standard
will you attempt to change the ISO standard, or just maintain an informal
agreement amongst yourselves?
>
> if you are writing out text which you want another interpreter to recognise
> as APL code, then output U+22c6, the APL star operator
>
> if you are writing out text that is to be interpreted by Excel, then output
> U+002a, the "asterisk" character (a rather imprecise description, I agree,
> but that's because Unicode imported so much unchanged stuff from existing
> standards)
>
The APL interpreter cannot know what the user wants to do next with the
text. It just knows that he or she wants to copy it to the clipboard,
or write it to a file, or find its Unicode index using #UCS. So, as I
said, we need to choose a unique value. The alternative you seem to be
suggesting is that we should introduce some kind of 'code page',
whereby the mapping varies according to some user-defined context. I
thought the whole idea was to get away from that kind of stuff.
Of course, the more technically-experienced user can always write out
Unicode indexes directly in order to specify a particular Unicode
character - what I was talking about was the default mapping which
should be used by the various APL interpreters when they export text
(including 8-bit character strings from existing workspaces and
component files) in Unicode encoding.
>
> > We would
> > certainly get unhappy users if these common characters were mapped to
> > unusual APL-specific or mathematical Unicode characters, which don't
> > appear in all fonts and are not recognized by most applications.
>
> name a single character that appears in all fonts -- you are always going
> to have difficulties in this area
Well, for historical reasons the ASCII and indeed Latin-1 character
sets are a pretty good start. What is more, I believe that all APL
vendors include the full ASCII character set in #AV, so it would be
perverse not to map those to their corresponding Unicode positions. I
think users will understand that other characters are more likely to be
font-specific.
>
> > The crucial point is that 'strict' output is much less important than
> > maximizing inter-operability between APL and non-APL applications.
>
> "strict output" refers only to text output which is going to be interpreted
> as APL code -- anything else is plain text (even if, as is the case with
> Excel macros, that plain text will later be interpreted as code) -- so,
> if I read your statement aright, you're saying that communication with other
> APL systems is less important than communicating with non-APL systems --
> fine -- that's a statement of your priorities
No it's not; it's a statement of what I believe to be the priorities of
people who use APL interpreters. Do you disagree with this statement?
But in any case both requirements can easily be met, if APL vendors
agree, which I think they do.
>
> well, like I said, the ISO standard is not enforceable in any real sense, so
> if vendors can agree among themselves on a convention they are happy with,
> then they are free to use that instead of the standard
>
> will you attempt to change the ISO standard, or just maintain an informal
> agreement amongst yourselves?
I think the sensible approach is for the vendors to agree amongst
themselves on the practical interpretation of the ISO standard.
What does everyone else think?
Richard
I really cannot make any sense of this paragraph, unless you are describing
a system trying to struggle on with 8-bit characters
I am talking about an interpreter capable of storing strings of Unicode
values, and I make the simplifying assumption that the character U+22c6 will
be stored as the hexadecimal value 22c6 -- now whether that value is
stored in a 16-bit space, one half of a 32-bit space, or a variable number
of UTF-8-style bytes, I don't give a hoot, because it isn't material to
this discussion -- it might be a good idea to make the further simplifying
assumption that we can represent the BMP only
if you will permit, I'd like to rephrase Unicode's requirement about
transmitting strings unchanged as "don't mess with the user's literals,
unless told to" -- "being told to" encompasses structural operations like
catenation, rotation, indexing and selective assignment
and it should now be apparent that the APL interpreter doesn't need to know
a thing about what the user wants to do next -- when told to output
U+22c6, the interpreter sends (some representation of) the hexadecimal
value 22c6 to the chosen output device -- ditto, ditto for U+002a
> - what I was talking about was the default mapping which
> should be used by the various APL interpreters when they export text
> (including 8-bit character strings from existing workspaces and
> component files) in Unicode encoding.
there will be some problems with legacy code, etc, but I would have hoped an
interpreter would recognise an old workspace as one of its own, and know
what Unicode values correspond to the old []AV -- ditto, ditto component
files
> Well, for historical reasons the ASCII and indeed Latin-1 character
> sets are a pretty good start. What is more, I believe that all APL
> vendors include the full ASCII character set in #AV, so it would be
> perverse not to map those to their corresponding Unicode positions.
ASCII and Latin-1 map onto the range U+0000 to U+00ff -- you must
know that already?!? -- what is this about perversity?
> I think users will understand that other characters are more likely to be
> font-specific.
fonts? what has any of this to do with fonts?
other characters will be represented by the value of their Unicode indices
> > > The crucial point is that 'strict' output is much less important than
> > > maximizing inter-operability between APL and non-APL applications.
> >
> > "strict output" refers only to text output which is going to be interpreted
> > as APL code -- anything else is plain text (even if, as is the case with
> > Excel macros, that plain text will later be interpreted as code) -- so,
> > if I read your statement aright, you're saying that communication with other
> > APL systems is less important than communicating with non-APL systems --
> > fine -- that's a statement of your priorities
>
> No it's not; it's a statement of what I believe to be the priorities of
> people who use APL interpreters. Do you disagree with this statement?
well, yes -- I'm sorry, I should have pointed it out earlier, but there is
another point where strict output becomes important, and that is when the
user wants to see his code -- the tokenised form must then be converted to
text and sent to an appropriate display device
if the user chooses to capture that display in a character array, as in
A <- []CR 'FOO'
then strict output should be used here as well -- A is now plain text, and
if the user chooses to substitute U+002a (or anything else) for U+22c6,
then that is the user's prerogative, and the interpreter should do what it's
told: no more, no less
> I think the sensible approach is for the vendors to agree amongst
> themselves on the practical interpretation of the ISO standard.
>
> What does everyone else think?
since the standard defines a unique codepoint for each APL character, I'm
not sure that there _is_ much room for "interpretation" -- though you are,
of course, free to take it or leave it, in whole or in part
so much of this discussion seems to be at cross-purposes, that I seriously
wonder if we're talking about the same thing -- as I said, I am talking
about an interpreter capable of storing strings of Unicode values (ideally,
all Unicode values), and processing aforesaid strings
starting from that assumption, there are some difficult questions, like the
semantics of string search and sort operations, but that deserves a separate
thread
Precisely, that's exactly what we're talking about, as was clear from
the context and the previous exchanges about '#UV'. However, I
wouldn't use the phrase 'trying to struggle on'. In the real world we
have to recognize that virtually all existing APL code exists either in
8-bit character form (for example as character vectors in existing
workspaces and files), or in tokenized form produced by existing APL
interpreters which have a 256-element #AV. These two forms are of course
mixed up together, for example in APL functions which rely on executed
strings. In moving to Unicode, we need to address this issue.
You said earlier in this thread:
"that makes a lot of sense -- #UV has 256 integer elements, so there
is no
problem with storage space -- and, unless some genius has extended
their
character set with a non-standard "semi-colon slash in a circle" since
the
standard was approved, there is no problem allocating the "correct"
(i.e,
standard) codepoint to each character".
What I am pointing out is that there is indeed a problem in allocating
the "correct" (i.e, standard) codepoint for each character, because
unfortunately the standard allocated APL-specific encodings for at
least three characters which in most or all existing APL systems are
regarded as the same as ordinary ASCII characters. Furthermore, many
existing APL applications rely upon this.
This problem could be avoided if, either by a formal change to the
standard, or by an informal agreement amongst all concerned, we map
those characters to the ordinary ASCII-compatible positions in Unicode.
>
> so much of this discussion seems to be at cross-purposes, that I seriously
> wonder if we're talking about the same thing -- as I said, I am talking
> about an interpreter capable of storing strings of Unicode values (ideally,
> all Unicode values), and processing aforesaid strings
>
Unfortunately, even in an interpreter which stored all strings in 16-
or 32-bit Unicode encoding, and had no need whatsoever to be compatible
with previous APL workspaces, I think the problem would still show up.
For example, consider an APL system running on a PC with a standard US
keyboard. In ordinary non-APL applications, Shift-8 gives the ASCII
asterisk, Unicode 002a. In APL, with the 'Unified' keyboard, what
should Shift-8 produce? Presumably it should still produce Unicode
002a, because otherwise the keyboard would no longer be 'Unified'. But
if this is not the same as the APL 'Star' symbol, then we have to add a
new and separate key combination for the APL 'Star'. Would users
really want that? What is the advantage? And how would we explain to
new APL users - who are often already put off by the APL-specific
symbols - that the two very similar-looking characters are not
equivalent? And that the one that is easier to type and which is used
in all other applications is not the one which is used in APL
expressions? And that they have to remember to use the non-APL one if
they want to export text to an Excel formula, a Unix shell script, or
indeed any other application apart from APL?
Richard
When I was a newbie (in the 70s), I considered the special
characters to be one of the beautiful things about APL.
And when I first saw them, it was the obvious thing that
shouted out that something was going to be very different
from FORTRAN. I was attracted by the special characters.
Do new users today actually dislike the APL characters?
Is this different than most people, decades ago?
Have aesthetics or expectations changed?
Or do they merely dislike not having an APL keyboard?
How the hell do new users learn and remember where
the characters are located on today's keyboards?
Back in the 70s when I did lots of APL, the latest thing was
dot-matrix thermal printers and CRTs, which was nice because
you didn't have to change the typeball (or spinwheel).
The thing that I hated was the messed-up layout on one of
the spinwheel terminals (can't remember which brand it was
that sucked). Today there's no technical reason why we
can't have all the APL symbols we want; the only problem
is remembering where on the keyboard to press for them.
The Lisp computers that I used back in the 80s had extra
shift keys labeled "Top" and "Front", and "Mode Lock",
and had special glyphs printed on the front and top of
the keycaps. Today, we all have enough keys on standard
keyboards that we could use some of them for APL-mode shifting.
And we have editors that could be aware of what you are
typing (for example, automatically shifting your input
mode to ASCII when you begin typing a string literal).
Our editors also let us horizontally move the input cursor.
We could even have a separate key for doing overstrike.
Seems like maybe all we really need are better keycaps.
I also have some dim recollection of press-apply AP
stickers that you could put onto regular keyboards.
If I wanted to write "\i" I'd program in some new
(APL or not) language du jour, and I would hate it.
Way too hard to read. Yuck.
If I wanted to have to spell out "IOTA", I'd program in Lisp.
(Which is, in fact, what I do.) Better than APL in many ways,
and has nice syntax, but not quite the same feel - can't
lexically pack the operators together as tightly.
Only with the special characters can you compose non-precedential
operators in a way that's fast to read: easy to scan with the
eyes (each operator is instantly distinguished) and not too
verbose (brain doesn't need to read any words or punctuation).
Abstract and concise.
> How the hell do new users learn and remember where
> the characters are located on today's keyboards?
>
Well, clearly what we need is a keyboard with little LCDs
embedded in the keys so that the glyphs are reconfigurable :-)
On my iPaq I have a soft keyboard on the touchscreen
With Unicode I am not sure how you treat all these chars.
Charmap allows you to see several of them at a time, and at least find
the code for each.
Windows solves it nicely by using Alt+X after the char, or after the
number, or after U+xxxx.
A soft keyboard seems like a good idea.
Well, maybe, but I was actually hoping someone was going
to clue me in about where to obtain the sticky labels,
or something, without having to have custom keyboards
on all my various kinds of computers.
> Well, maybe, but I was actually hoping someone was going
> to clue me in about where to obtain the sticky labels,
> or something, without having to have custom keyboards
> on all my various kinds of computers.
And then there are those of us that just learn to touch-type APL and don't
need a visual clue such as keytop labelling. When I took typing in high
school the typewriters deliberately had blank keytops so that we were forced
to remember which characters went with which keys.
as if there weren't enough real issues to worry about, two delegates
insisted that the star in APL was not an asterisk -- it had
"traditionally" been a five-pointed star, while as asterisk was nearly
always six-pointed -- it seems the rest of the committee conceded the
point, just so they could return to more important business
(the same delegates opposed the inclusion of the dollar sign in the APL
character set, on the ground that it was a national, not an international,
symbol -- this might help you identify the miscreants, because I'm not
going to name them)
so, here we are 20 years later, the APL star is still a distinct symbol, and
now it has its own codepoint -- and we must learn to live with it
well, enough of that digression: I'm not quite sure where you stand,
exactly, so please correct me where I've misunderstood you
1) the need to be able to store Unicoded literals is accepted -- right?
2) the use of non-Latin characters in names can be set aside -- OK?
3) you have concerns on i/o
4) we haven't even started on problems of string matching, like:
what is the result when we compare U+00e9
        to the 2-vector U+0065, U+0301 ? (see the sketch just after this list)
5) sorting is a major problem, and it may be that neither of us
will live to see it solved
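as a concrete illustration of item (4) -- a minimal sketch in Python (a
language chosen purely for illustration): codepoint-by-codepoint comparison
says the two spellings differ, and only a normalisation pass makes them
compare equal

    import unicodedata
    composed   = '\u00e9'      # LATIN SMALL LETTER E WITH ACUTE
    decomposed = 'e\u0301'     # 'e' followed by COMBINING ACUTE ACCENT
    composed == decomposed                                 # False -- different codepoints
    unicodedata.normalize('NFC', decomposed) == composed   # True -- same canonical form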
item (3) can be split into its i-component and its o-component, and
the o-component can be further split into code and data, giving us
3i) input
3d) output of data
3c) output of code
starting with the easy one, item (3d): the output of data (literals and
formatted numbers - i.e., display code only, no binaries) should use
whatever codepoints the user has specified (in addition to character
constants (which we leave untouched), this may also include the ability
to specify the use of mid-dot as the decimal point, &c)
so we know what we have to output, in terms of character "values", and
presumably the user can specify the encoding (UCS, UTF, whatever)
I don't see any point of contention here -- have I missed anything?
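to make the split between character "values" and encodings concrete, a
small Python sketch (illustrative only) -- one character value, two
serialisations:

    iota = '\u2373'             # APL FUNCTIONAL SYMBOL IOTA
    iota.encode('utf-8')        # b'\xe2\x8d\xb3' -- three bytes
    iota.encode('utf-16-le')    # bytes 0x73 0x23 -- two bytes, same codepoint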
item (3c) covers the case where tokenised code needs converting to
character form, for display purposes -- I take it the need for
standardisation is accepted?
'plus' and 'plonk' and lots of other stuff are displayed using characters
from the 7-bit ASCII range, 'multiply' and 'divide' use characters from
Latin-1, 'notequals" and the weak inequalities come from Mathematical
Operators (U+2200 to U+22ff), while 'execute' and 'format' use
characters from Misc Tech (U+2300 to U+23ff)
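for anyone who wants to verify those block assignments, a quick probe in
Python (the names come from the standard library's unicodedata module):

    import unicodedata
    for ch in '\u00d7\u2260\u234e\u2355':
        print('U+%04X  %s' % (ord(ch), unicodedata.name(ch)))
    # U+00D7  MULTIPLICATION SIGN                  (Latin-1)
    # U+2260  NOT EQUAL TO                         (Mathematical Operators)
    # U+234E  APL FUNCTIONAL SYMBOL DOWN TACK JOT  (Misc Tech -- execute)
    # U+2355  APL FUNCTIONAL SYMBOL UP TACK JOT    (Misc Tech -- format)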
I take it there's no objection, in principle, to using characters from
Mathematical Operators and/or Misc Tech? because there seems to be
some sort of objection on your part to using U+22c6 to represent the
exponentiation operator (the "star") -- is it that we should use ASCII
asterisk because it's a more common character? or are unsolved problems
on the input side clouding the issue on the output side? or maybe none of
the above?
frankly, I couldn't give a monkey's whether APL's "star" is five-pointed or
six-pointed, but I wouldn't want to use the asterisk, because the asterisk
is usually a raised character, and the code looks a lot tidier if the
symbols representing primitive functions have a common centre line
(make that "primitive functions not resting on the base line") -- there
is an entirely acceptable symbol at U+2217, and if you felt strongly
enough, maybe you could campaign for a change in the standard . . .
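for reference, the three star-like candidates, checked the same way in
Python:

    import unicodedata
    [unicodedata.name(c) for c in '*\u2217\u22c6']
    # ['ASTERISK', 'ASTERISK OPERATOR', 'STAR OPERATOR']
    # U+002A is raised; U+2217 and U+22C6 sit on the centre line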
finally, the tricky, but separate, problem of input -- item (3i)
first, keyboard input: the interpreter's input routine presumably has some
way of knowing whether the user is inputting code, a character constant or
a comment -- so, when the user hits the "asterisk" key, if it's in a
character string or a comment, then your input routine will pass an asterisk
to the tokeniser -- if the user is entering code, your input routine
accepts the asterisk, but converts it to an "APL star", before passing it to
the tokeniser -- no need for a "new and separate key combination"
actually, that's only a conceptual model -- I'd probably change the
characters on-the-fly, within the tokeniser itself -- no problem then with
"execute" on character strings, or when converting legacy code, either
loads of detail elided here, as you are only too well aware, but that's the
beginnings of tolerant input -- no need for a keybutton for APL star --
no problem building an Excel macro, either
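to pin that conceptual model down, a toy sketch in Python -- not any
vendor's actual routine, and I've assumed the comment symbol is the lamp
at U+235D:

    APL_STAR, LAMP = '\u22c6', '\u235d'

    def tolerant(line):
        out, in_string = [], False
        for i, ch in enumerate(line):
            if ch == "'":                        # a quote toggles string mode
                in_string = not in_string
            elif not in_string and ch == LAMP:   # rest of the line is a comment
                return ''.join(out) + line[i:]
            elif not in_string and ch == '*':
                ch = APL_STAR                    # code asterisk -> APL star
            out.append(ch)
        return ''.join(out)

    tolerant("2 * A = '*'")   # "2 \u22c6 A = '*'" -- only the code star converts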
your keyboard interface will already provide some means of entering all the
characters used in APL programming -- how the user gets to feed in other
Unicode characters is a wider question, but not one we need to explore here
so, is there anything there you're not happy with? does a move to tolerant
input cover the three [or more] characters "which in most or all existing
APL systems are regarded as the same as ordinary ASCII characters"?
regards . . . /phil
<micr...@microapl.demon.co.uk> wrote in message
news:1116007226.0...@g49g2000cwa.googlegroups.com...
> <lots of stuff I've tried to reply to above>
>
> (the same delegates opposed the inclusion of the dollar sign in the APL
> character set, on the ground that it was a national, not an international,
> symbol -- this might help you identify the miscreants, because I'm not
> going to name them)
I'm intrigued. Who could possibly object to the currency symbol of
Tuvalu being included in the APL character set?
>
> well, enough of that digression: I'm not quite sure where you stand,
> exactly, so please correct me where I've misunderstood you
>
> 1) the need to be able to store Unicoded literals is accepted -- right?
> 2) the use of non-Latin characters in names can be set aside -- OK?
> 3) you have concerns on i/o
> 4) we haven't even started on problems of string matching, like:
> what is the result when we compare U+00e9
> to the 2-vector U+0065, U+0301 ?
> 5) sorting is a major problem, and it may be that neither of us
> will live to see it solved
>
> item (3) can be split into its i-component and its o-component, and
> the o-component can be further split into code and data, giving us
> 3i) input
> 3d) output of data
> 3c) output of code
Agreed. I would add, however, the additional item of conversion of
existing 8-bit APL code and data to Unicode.
Of course, items 4) and 5) are not specific to APL, and in any case
depend on why you want to do the comparison or sort - there's no single
right answer.
>
> starting with the easy one, item (3d): the output of data (literals and
> formatted numbers - i.e., display code only, no binaries) should use
> whatever codepoints the user has specified (in addition to character
> constants (which we leave untouched), this may also include the ability
> to specify the use of mid-dot as the decimal point, &c)
>
> so we know what we have to output, in terms of character "values", and
> presumably the user can specify the encoding (UCS, UTF, whatever)
>
> I don't see any point of contention here -- have I missed anything?
I agree, as long as it is pure Unicode. For strings, what they type in
(or import from somewhere else, as Unicode), is what they get. There's
no translation on input or output. If they want to output to a
non-Unicode format (ASCII, EBCDIC etc), then of course there has to be
a translation; many Unicode characters will not be representable, and
in some cases several Unicode characters may be mapped on to a single
8-bit character, as a convenience. If the users don't like the default
mapping, they can do their own.
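As a sketch of the kind of many-to-one default mapping I have in mind
(the table itself is invented for illustration; Python again):

    # collapse several Unicode characters onto single 8-bit characters
    to_8bit = str.maketrans({'\u22c6': '*',    # APL star            -> asterisk
                             '\u2217': '*',    # asterisk operator   -> asterisk
                             '\u00d7': 'x'})   # multiplication sign -> letter x
    '2 \u22c6 3 \u00d7 4'.translate(to_8bit)   # '2 * 3 x 4'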
Similarly, if we're importing from existing 8-bit APL text (including
string literals in functions), we have to choose a suitable
translation.
>
> item (3c) covers the case where tokenised code needs converting to
> character form, for display purposes -- I take it the need for
> standardisation is accepted?
I don't think it is just tokenised code. In some APLs (not ours, as it
happens), a function is kept in both tokenised form and the original
text form, so as to preserve the original formatting. Quad-CR outputs
the original text form. Presumably, in such a system, what they type
in would be what they get out, standard or no standard?
Also an APL expression, typed in to the session window (or an Edit
window), might be copied to the clipboard - see example below. No
tokenisation has necessarily taken place.
>
> frankly, I couldn't give a monkey's whether APL's "star" is five-pointed or
> six-pointed, but I wouldn't want to use the asterisk, because the asterisk
> is usually a raised character, and the code looks a lot tidier if the
> symbols representing primitive functions have a common centre line
That's a font issue - nothing to do with character mappings. The font
can be designed for clear and readable rendition of APL code, just as
other fonts are optimised for other specific purposes. Different people
will prefer different character styles - for example, some people like
slanted letters for APL, some don't.
> first, keyboard input: the interpreter's input routine presumably has some
> way of knowing whether the user is inputting code, a character constant or
> a comment -- so, when the user hits the "asterisk" key, if it's in a
> character string or a comment, then your input routine will pass an asterisk
> to the tokeniser -- if the user is entering code, your input routine
> accepts the asterisk, but converts it to an "APL star", before passing it to
> the tokeniser -- no need for a "new and separate key combination"
Consider typing the following into an APL function, closing the
function, and then re-opening it:
2 * A = '*'
(An artificial example, but to avoid confusion I wanted to choose only
characters displayable in ASCII).
The implication of what you are suggesting is that if you now highlight
this text (it having gone through tokenising/de-tokenising), and copy
it to the clipboard, the two asterisks would map to different Unicode
characters. But if the user just typed this text in an Edit or indeed
Session window, highlighted it, and copied it to the clipboard, they'd
map to the same Unicode character - as you would expect. And if they
edited the function line so that the expression now read:
"*" = "2 * A = '*'"
the result would be 0 0 0 0 0 0 0 0 0 1 0, despite the fact that the
two asterisks were entered using the same keystroke. [This assumes the
APL supports double-quotes, as in APL+Win and APLX - you could do the
same by doubling up the single quotes as in traditional APL.] And
there would still be a need for a new and separate key combination,
for example if the user wanted to do a search in a function for a
particular expression containing the APL star (despite the fact that
they didn't enter it using that special key combination...).
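The claimed result is easy to check mechanically - here in Python, with
U+22C6 standing in for the APL star that tokenising produced:

    line = "2 \u22c6 A = '*'"        # the code asterisk became U+22C6
    [int(c == '*') for c in line]    # [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]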
Admittedly the font would presumably be designed to make it clear that
the two asterisks were different, but I think you've convinced me that
this is madness, even without worrying about how to deal with existing
8-bit APL code!
Regards
Richard
in particular, you are unhappy that tolerant input may have confusing
or deleterious effects on the user's code (i) when using copy&paste
or copydown, and (ii) when converting existing 8-bit APL code
my suggestion was that tolerant input could be used to ease the move to
Unicode -- copydown was an issue I had overlooked, and it may be that
there are insurmountable objections to its use, but I'm not yet prepared to
concede the point
the alternative to tolerant input is strict input -- granted, there are
not that many Unicode editors, but some APLers are already using emacs or
gvim, so we know it is possible (if the interpreter can handle it)
with strict input, it will always be possible to convert existing code from
the old []AV to Unicode, but Unicode compliance requires that we don't
mess with literals, so there's really no question of how existing function
definitions should be displayed, regardless of whether they are stored as
text or in tokenised form -- the user may well see different glyphs being
used for certain primitive functions, but I doubt the change will be
traumatic
and then execution of literal strings means we're still going to require
some sort of conversion routine within the interpreter
if the executable string has been formed by the concatenation of (i) a
string from an old 8-bit ws and (ii) another string entered via a strict
input routine, there is a possibility that both U+002a and U+22c6 will have
been used to denote exponentiation, so we still need to devise a method for
conversion to strict notation
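the tolerant() sketch given earlier would serve as exactly that
conversion routine -- e.g.

    legacy = '2*3'          # exponentiation from an old 8-bit ws
    newer  = '2\u22c63'     # the same, entered via strict input
    tolerant(legacy + ', ' + newer)   # both stars now U+22C6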
my guess is that the three characters causing your qualms are hyphen-minus,
asterisk and tilde -- the problem is that these characters have ambiguous
semantics
those of us old enough to remember Cobol know that
A-B
is a hyphenated name, while
A - B
is an arithmetic operation
in this example, we distinguish the different meanings of the two uses of
the symbol by reference to context -- for a large character set, it is
much simpler to dispense with the context-sensitivities, and define separate
codepoints for the separate functions -- that way, ASCII text retains its
ambiguity (i.e., no information is lost, and (equally importantly) there is
no (possibly erroneous) increase in semantic content as a result of a move
to Unicode), but those who need to distinguish the two uses can do so,
without reference to context, by using the appropriate codepoint
(NOT by changing font)
if you are troubled by the prospect of explaining to a user that a centred
5-point star is not just visually different but also semantically different
from a raised 6-point star, you may be in for a difficult time -- besides
the differences between hyphen-minus, hyphen and minus, the m-dash, the
n-dash, 16 different spaces (some of which may have the same width), and
a graduated set of circles, you are going to
have to explain why
'A' = 'AA'
sometimes returns the result 0 0
it goes with the territory, I'm afraid
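for instance, if the two right-hand characters were a Greek capital alpha
and a Cyrillic capital a (a quick Python check):

    import unicodedata
    lookalikes = '\u0391\u0410'             # both render like 'A'
    [int(c == 'A') for c in lookalikes]     # [0, 0]
    [unicodedata.name(c) for c in lookalikes]
    # ['GREEK CAPITAL LETTER ALPHA', 'CYRILLIC CAPITAL LETTER A']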
so, on to your example -- no problems with its artificiality, by the way --
we're exploring boundary conditions
> Consider typing the following into an APL function, closing the
> function, and then re-opening it:
>
> 2 * A = '*'
>
> The implication of what you are suggesting is that if you now
> highlight this text (it having gone through tokenising/
> de-tokenising), and copy it to the clipboard, the two asterisks
> would map to different Unicode characters.
yes (but it might be wise to change the display as soon as the text
is converted from tolerant to strict)
> But if the user just typed this text in an Edit or indeed
> Session window, highlighted it, and copied it to the clipboard,
> they'd map to the same Unicode character - as you would expect.
yes
> And if they edited the function line so that the expression now
> read:
>
> "*" = "2 * A = '*'"
>
> the result would be 0 0 0 0 0 0 0 0 0 1 0, despite the fact that
> the two asterisks were entered using the same keystroke.
I would rephrase that to say "despite the fact that the same keystroke
had been used to enter the different star-like characters" -- but apart
from that, yes
the user would, in any case, spot the difference between the centred 5-point
star, and the raised 6-point star, and realise why -- there again, maybe
not -- in which case users can't be trusted with tolerant input
in that case, enforce strict input -- I don't rightly know which vendors
use which input methods, but does alt-P currently deliver an asterisk to the
interpreter? would it be possible to have this key combination deliver the
APL-star instead, leaving the asterisk wherever it is now (shift-8, on my
machine)?
in the expression
CHAR = "2 * A = '*'"
I would define CHAR to be U+22c6, if I were searching text for references
to exponentiation -- and if I defined CHAR to be U+002a, I would pick up
references to footnotes, emphasised text and potentially offensive
f***-letter words -- I see no madness here; I am happy to have two
distinct characters
so, there you go -- Unicode cannot be ignored forever, but you need to
decide whether to go along with the APL standard, or stay with current ASCII
characters in those three cases -- if you decide to stick with ASCII, you
need to persuade other vendors to go the same route, and/or get the standard
changed -- if you decide to go along with the current standard, you then
have to decide whether to offer users tolerant input -- have I missed
anything?
if I have failed in my attempt to dissuade you from reverting to ASCII for
these three characters, then I'm sorry -- you face some difficult
decisions
I'm sorry that last msg is such a mess
for some reason, the draft reply was displayed in TNR, and I couldn't
change it -- I checked the linebreaks before sending the draft, but
clearly, TNR being a more compact font, the linebreaks were not well
placed for Arial Unicode, which is the font I intended to use for
sending
I'm now seeing Arial, so let's see if I'm more successful with that cod
example:
'A' = 'ΑА'
well, I don't know what that looks like to you, but it looks better here
(although the draft was displayed in TNR, anything pasted in was
displayed in Arial -- all very odd)
confused . . . /phil
You wrote:
> The problem isn't that general-purpose sorting of Unicode alphanumeric
> strings is technically complex; it is that it is logical nonsense. The
> "general-purpose collating sequence for alphabetics", which we call
> "alphabetical order", is well-defined for any given alphabet: A to Z,
> alpha to omega, alif to ya, whatever. With multiple alphabets, there is
> no defined alphabetical order. We can create an appropriate default
> collating sequence for any given alphabet; G goes after F and before H,
> but does it go before or after gamma, or gimel, or ghaym?
I believe this is not correct. Some languages do not define an alphabetical
sort order. In particular, the Japanese and Chinese languages are not
sorted using a simple alphabetical order. Furthermore, even western
languages do not use a simple sort order when you consider case.
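A small illustration in Python of why there is no single right answer -
raw codepoint order is nobody's collation:

    sorted(['z', '\u00e4'])   # ['z', 'ä'] by codepoint, yet German
                              # dictionaries collate a-umlaut near 'a'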
David Liebtag
APL2 has a system function named QuadUCS.
If the right argument is a character scalar or vector, the result is the
Unicode codepoints of those characters.
If the right argument is an integer scalar or vector, the result is the
Unicode characters associated with those codepoints.
So, QuadUV is not needed.
QuadAV {match} QuadUCS QuadUCS QuadAV
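For readers without APL2 to hand, a rough Python analogue of that round
trip, with ord and chr playing the part of QuadUCS:

    text = '\u2373\u2374'                  # APL iota, APL rho
    cps  = [ord(c) for c in text]          # [9075, 9076]
    ''.join(chr(n) for n in cps) == text   # True -- the round trip is exact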
David Liebtag