My experiments so far show that I can copy and paste code into XML
documents, but depending on the font used, the result may be missing
some or all APL characters. The best font so far seems to be Adrian
Smith's APL385.
What is the story with underscored letters in Unicode? Do they not
exist, on the grounds that the designers felt that underscoring is a
text attribute and does not warrant a separate glyph?
XML is always in Unicode. So you need to make sure that the APL
characters are encoded in Unicode.
>
> My experiments so far show that I can copy and paste code into XML
> documents, but depending on the font used, the result may be missing
> some or all APL characters. The best font so far seems to be Adrian
> Smith's APL385.
Any Unicode APL font should work fine to display the text. See the
list here:
http://aplwiki.com/AplCharacters
>
> What is the story with underscored letters in Unicode? Do they not
> exist, on the grounds that the designers felt that underscoring is a
> text attribute and does not warrant a separate glyph?
Yes, I believe that is right.
Richard Nabavi
MicroAPL Ltd
You will have to find some way to avoid "<", "&" & probably ">". The
most straightforward way is by replacing them with the character
entities "&lt;", "&amp;" & "&gt;" respectively but you then have to
ensure that the receiving software understands as well. Another way
that avoids translation is to embed all APL code in mark-up that has
special end markers. I can't remember the details now; there are
various alternatives but all have end markers that are deemed to be
very unlikely to occur randomly. Various combinations of square and
angle brackets come to mind. The trouble is they probably do occur
quite regularly as code fragments in APL.
Phil
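For anyone who wants to see the entity replacement spelled out, here is a minimal sketch in Python (not APL, purely for illustration). It relies on the standard library's xml.sax.saxutils.escape and unescape; the function names apl_to_xml_text and xml_text_to_apl and the sample line are made up for this example.

```python
from xml.sax.saxutils import escape, unescape


def apl_to_xml_text(code: str) -> str:
    """Replace &, < and > with character entities (hypothetical helper)."""
    # escape() handles "&" first, so already-produced entities are not re-escaped.
    return escape(code)


def xml_text_to_apl(text: str) -> str:
    """Reverse the replacement done by apl_to_xml_text (hypothetical helper)."""
    return unescape(text)


# Illustrative APL-like line containing all three troublesome characters.
sample = "r←(a<b)∧(c>d) ⍝ less & greater"
encoded = apl_to_xml_text(sample)
decoded = xml_text_to_apl(encoded)
```

The APL glyphs themselves pass through untouched; only the three markup characters change.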
Here you go.
cdata sections start with "<![CDATA[" and end with "]]>"
comments start with "<!--" and end with "-->"
Either can contain any unparsed characters except their own end
marker (strictly, a comment may not contain "--" anywhere).
Why, with the word "extensible" in the title, none of them ever thought
of self-defining section markers I don't know. Even HTTP includes the
multipart boundary.
Phil
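Since "]]>" can turn up in real APL (nested indexing followed by a comparison, say), one common workaround is to split the code across two adjacent CDATA sections wherever the end marker appears. A rough sketch in Python, assuming the standard xml.etree.ElementTree parser on the receiving side; wrap_cdata, the <code> element and the sample line are invented for the example.

```python
import xml.etree.ElementTree as ET


def wrap_cdata(code: str) -> str:
    """Wrap text in CDATA, splitting any embedded "]]>" (hypothetical helper)."""
    # "]]>" becomes "]]" + end-of-section + start-of-section + ">",
    # so the end marker never appears whole inside one section.
    return "<![CDATA[" + code.replace("]]>", "]]]]><![CDATA[>") + "]]>"


# Illustrative APL-like line that happens to contain the CDATA end marker.
apl = "z←x[y[i]]>0 ⍝ also a<b & c"
doc = "<code>" + wrap_cdata(apl) + "</code>"

# The parser joins the adjacent sections back into one text node.
parsed = ET.fromstring(doc).text
```

No entity translation is needed inside the sections, so "<" and "&" go in as they are.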
By the way, if you use the new ⎕XML system function to create and
parse the XML text, all these niggling little nasties will be taken
care of. The spec for ⎕XML was defined jointly between MicroAPL and
Dyalog based on some earlier work by Mark E. Johns, and should work
pretty much identically in the two implementations. In APLX, it's
included in the new Version 5, which has been released for Windows,
and is ready for beta for MacOS and Linux. (I'm not certain whether
the Dyalog version which implements it has been released yet.)
Will it do first-level automatic character entity translation of "<",
"&" & ">" in & out?
Yes, it handles all that dross.
Remember the " character as well, that has to be escaped as &quot;.
Vector articles now get marked up in XHTML and run through an XSLT
processor before hitting your browser. Apart from the need to escape
those 4 characters, it's straightforward cutnpaste and text editing.
Help yourself to any recent Vector article as a model. (The site
script will serve you articles embedded in the site's HTML dressing.
To see the pre-XSLT source of e.g.
http://www.vector.org.uk/?vol=24&no=2&art=burriesci
browse to
http://www.vector.org.uk/archive/v242/burriesci.htm )
UTF-8 is the default encoding for XML documents, and that's what you
want as your default too. If you're editing XML documents, use a
Unicode-savvy text editor. Notepad will work fine; I get good results
from (free) Notepad2. Set the editor's default encoding to UTF-8 with
no BOM (byte-order mark). Set its display font to a Unicode font with
good APL characters. Vector uses Adrian Smith's APL385 Unicode.
Watch out: use a text editor that makes the encoding clear. My
otherwise excellent industrial-strength UltraEdit 13 text editor tends
to resave UTF-8 documents with the BOM set – basically a spontaneous
change of encoding. The BOM triggers a warning from the W3.org
validator (STRONGLY recommended) and chokes an XSLT processor. Other
text editors may share this weakness. Notepad2 doesn't, and is good at
displaying and setting the encoding.
HTH
Stephen
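If an editor has already slipped a BOM into a UTF-8 file, it is easy enough to check for and remove before the file reaches a validator or XSLT processor. A small sketch in Python, assuming the file contents have been read as bytes; strip_utf8_bom and the sample document are made up for illustration.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF8):  # the three bytes EF BB BF
        return data[len(codecs.BOM_UTF8):]
    return data


# Illustrative document as an editor might save it, BOM first.
text = "<code>⍳10</code>"
with_bom = codecs.BOM_UTF8 + text.encode("utf-8")
clean = strip_utf8_bom(with_bom)

# Python's "utf-8-sig" codec does the same job while decoding.
decoded = with_bom.decode("utf-8-sig")
```

Writing the file back out with the plain "utf-8" codec never adds a BOM, which matches the "UTF-8 with no BOM" setting described above.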
If my understanding of Unicode encoding is correct, that sounds like a
straightforward bug in UltraEdit. Surely UTF-8 shouldn't have a byte-
order mark, for the very good reason that there's no byte order?
Perhaps it is converting the encoding to UTF-16, in which case it
would make a bit more sense.
Richard Nabavi
MicroAPL Ltd
Sometimes it depends on who introduced the bug. I believe that
Microsoft started to add BOMs to UTF-8 files. It does not make any
sense but anyway....
UltraEdit accepts reality by default but can be told to do
differently:
http://www.ultraedit.com/forums/viewtopic.php?t=34
Kai
> Sometimes it depends on who introduced the bug. I believe that
> Microsoft started to add BOMs to UTF-8 files. It does not make any
> sense but anyway....
I think it is a bit harsh to say this makes NO sense whatsoever. By
looking at the first few bytes of the file, you can use the encoding
of the BOM to determine whether the file is big- or little-endian
UCS-2 or UCS-4, or UTF-8. The table that you can see under the heading
"Byte Order Mark" in http://www.vector.org.uk/?vol=24&no=1&art=kromberg
is quite useful, I think... For example, if you want to load (APL)
source code from a file and you don't know how it was encoded...
Morten
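The kind of BOM sniffing described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the BOM constants from the standard codecs module, treats UCS-2 and UCS-4 as the utf-16 and utf-32 codecs, and falls back to UTF-8 when no BOM is found. The names sniff_bom and decode_source are invented for the example.

```python
import codecs

# The 4-byte marks are tested first: the UTF-32 little-endian BOM
# (FF FE 00 00) starts with the UTF-16 little-endian BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (codec name, BOM length), or (None, 0) if there is no BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_source(data: bytes, fallback: str = "utf-8") -> str:
    """Decode source text, using the BOM if present (fallback is an assumption)."""
    name, skip = sniff_bom(data)
    return data[skip:].decode(name or fallback)
```

One caveat: a little-endian 16-bit file whose first character after the BOM is U+0000 would be taken for 32-bit, which is rare enough in source code that this sketch ignores it.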
you are right, Morten
U+FEFF is nowadays known as a "zero width no break space", so there is
no contradiction in starting a UTF-8 stream with 0xEF 0xBB 0xBF
U+FEFF can be used at the start of UCS-2 and UCS-4 streams to indicate
the byte order, but also, to quote the standard, "data streams that
begin with the byte order mark are likely to contain Unicode
characters", which is a useful piece of information when handling an
incoming 8-bit stream
the possibility exists that a file in an 8-bit encoding might genuinely
begin with 0xEF 0xBB 0xBF, but this is extremely unlikely, and is
therefore usually ignored
regards to all . . . /phil