
APL in XML

Ibeam2000

Aug 24, 2009, 10:59:18 AM
Would anyone have any examples on ways to embed APL code in XML
documents? What font should be used? What does Unicode have to do
with it?

My experiments so far show that I can copy and paste code into XML
documents, but depending on the font used, the result may be missing
some or all APL characters. The best font so far seems to be Adrian
Smith's APL385.

What is the story with underscored letters in Unicode? Do they not
exist, on the grounds that the designers felt that underscoring is a
text attribute and does not warrant a separate glyph?

Richard Nabavi

Aug 24, 2009, 11:05:11 AM
On 24 Aug, 15:59, Ibeam2000 <ibeam2...@gmail.com> wrote:
> Would anyone have any examples on ways to embed APL code in XML
> documents?  What font should be used?  What does Unicode have to do
> with it?

XML is always in Unicode. So you need to make sure that the APL
characters are encoded in Unicode.
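
A minimal sketch of what "encoded in Unicode" means in practice, written in Python rather than APL. The sample expression and variable names are made up for illustration. It lists the code points of the APL glyphs, encodes them as UTF-8, and also shows the pure-ASCII fallback of numeric character references, which any XML parser should turn back into the same characters.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Hypothetical sample: any APL text held as a Unicode string will do.
apl = "x←⍳⍴y"

# Show the code point and official Unicode name of each non-ASCII glyph.
for ch in apl:
    if ord(ch) > 127:
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# The bytes as they would be stored in a UTF-8 encoded XML file.
utf8_bytes = apl.encode("utf-8")

# ASCII-only alternative: non-ASCII characters become &#NNNN; references.
ascii_refs = apl.encode("ascii", "xmlcharrefreplace").decode("ascii")

# Parsing the ASCII form gives back the original characters.
round_trip = ET.fromstring(f"<code>{ascii_refs}</code>").text
```

Whether the glyphs then display properly is a separate question that depends only on the font, not on the XML.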

>
> My experiments so far show that I can copy and paste code into XML
> documents, but depending on the font used, the result may be missing
> some or all APL characters.  The best font so far seems to be Adrian
> Smith's APL385.

Any Unicode APL font should work fine to display the text. See the
list here:

http://aplwiki.com/AplCharacters

>
> What is the story with underscored letters in Unicode?  Do they not
> exist, on the grounds that the designers felt that underscoring is a
> text attribute and does not warrant a separate glyph?

Yes, I believe that is right.
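
One possible way to represent underscoring as an attribute of a character rather than as a separate glyph is the combining character U+0332 (COMBINING LOW LINE). The Python sketch below is only an illustration of that idea; the helper name is made up, and whether a given APL font or interpreter renders or accepts the result is not something this thread establishes.

```python
import unicodedata

# U+0332 attaches an underline to the character that precedes it.
COMBINING_LOW_LINE = "\u0332"

def underscore(text: str) -> str:
    """Return text with U+0332 appended after every character."""
    return "".join(ch + COMBINING_LOW_LINE for ch in text)

underscored = underscore("ABC")

# Each visible letter is now two code points: the letter plus the combiner.
code_points = [f"U+{ord(ch):04X}" for ch in underscored]
mark_name = unicodedata.name(COMBINING_LOW_LINE)
```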

Richard Nabavi
MicroAPL Ltd

Phil Last

Aug 24, 2009, 3:48:06 PM
On Aug 24, 4:05 pm, Richard Nabavi <micro...@microapl.demon.co.uk>
wrote:

You will have to find some way to avoid "<", "&" and probably ">". The
most straightforward way is to replace them with the character
entities "&lt;", "&amp;" and "&gt;" respectively, but you then have to
ensure that the receiving software understands them as well. Another way
that avoids translation is to embed all APL code in markup that has
special end markers. I can't remember the details now; there are
various alternatives, but all have end markers that are deemed to be
very unlikely to occur randomly. Various combinations of square and
angle brackets come to mind. The trouble is that they probably do occur
quite regularly as code fragments in APL.
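
The entity replacement described above can be sketched in Python with the standard library's xml.sax.saxutils helpers. The APL line is an invented example; escape() handles "&", "<" and ">" and unescape() reverses it.

```python
from xml.sax.saxutils import escape, unescape

# Hypothetical APL line containing all three troublesome characters.
apl_line = "r←(a<b)∧(c>d) ⍝ compare & combine"

# Replace &, < and > with &amp;, &lt; and &gt; before embedding in XML.
escaped = escape(apl_line)

# The receiving side reverses the translation to recover the code.
restored = unescape(escaped)
```

Note that "&" has to be replaced first, otherwise the ampersands introduced by "&lt;" and "&gt;" would themselves be escaped; escape() takes care of that ordering.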

Phil

Phil Last

Aug 24, 2009, 4:19:36 PM

Here you go.

CDATA sections start with "<![CDATA[" and end with "]]>".
Comments start with "<!--" and end with "-->".

Either can contain any unparsed characters except its own end
marker; strictly, a comment may not contain "--" anywhere.
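
A sketch of the CDATA approach in Python, under the assumption that the APL text may itself contain "]]>" (nested indexing followed by a comparison will do it). The helper name to_cdata and the sample line are invented. The usual workaround is to end the section in the middle of the marker and immediately open a new one.

```python
import xml.etree.ElementTree as ET

def to_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting any embedded "]]>" across two sections."""
    # "]]>" becomes "]]" + end-of-section + start-of-section + ">".
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

# Hypothetical APL line that happens to contain the CDATA end marker.
apl_line = "z←a[b[i]]>0 & c<d"

document = "<code>" + to_cdata(apl_line) + "</code>"

# A conforming parser joins the adjacent sections back into one string.
parsed = ET.fromstring(document).text
```

No entity translation is needed inside the sections, so "&" and "<" pass through untouched.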

Why, with the word "extensible" in the title, none of them ever thought
of self-defining section markers I don't know. Even HTTP includes the
multipart boundary.

Phil

Richard Nabavi

Aug 25, 2009, 6:44:02 AM
Good points, Phil.

By the way, if you use the new ⎕XML system function to create and
parse the XML text, all these niggling little nasties will be taken
care of. The spec for ⎕XML was defined jointly between MicroAPL and
Dyalog based on some earlier work by Mark E. Johns, and should work
pretty much identically in the two implementations. In APLX, it's
included in the new Version 5, which has been released for Windows,
and is ready for beta for MacOS and Linux. (I'm not certain whether
the Dyalog version which implements it has been released yet.)

Phil Last

Aug 25, 2009, 7:37:29 AM
On Aug 25, 11:44 am, Richard Nabavi <micro...@microapl.demon.co.uk>
wrote:

Will it do first-level automatic character entity translation of "<",
"&" and ">", in and out?

Richard Nabavi

Aug 25, 2009, 8:45:12 AM
On 25 Aug, 12:37, Phil Last <phil.l...@ntlworld.com> wrote:
>
> Will it do first-level automatic character entity translation ... in & out?

Yes, it handles all that dross.

Stephen Taylor, editor@vector.org.uk

Aug 26, 2009, 3:54:57 AM
On Aug 25, 1:45 pm, Richard Nabavi <micro...@microapl.demon.co.uk>
wrote:

> On 25 Aug, 12:37, Phil Last <phil.l...@ntlworld.com> wrote:
>
> > Will it do first-level automatic character entity translation ... in & out?
>
> Yes, it handles all that dross.

Remember the " character as well; that has to be escaped as &quot;.
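
As a rough illustration of a library doing the escaping for you, here is a Python sketch using xml.etree.ElementTree (this is not ⎕XML, just an analogous tool; the element and attribute names are made up). Strictly, the XML spec only requires the double quote to be escaped inside attribute values delimited by double quotes, which is where ElementTree applies &quot;; escaping it everywhere is harmless.

```python
import xml.etree.ElementTree as ET

# Hypothetical element: APL code as text, a description as an attribute.
fn = ET.Element("function", {"title": 'say "hi" if a<b'})
fn.text = "r←(a<b)/'\"hi\" & bye'"

# Serialising escapes &, < and > in text, and additionally " in attributes.
xml_text = ET.tostring(fn, encoding="unicode")

# Parsing reverses every one of those translations.
back = ET.fromstring(xml_text)
```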

Vector articles now get marked up in XHTML and run through an XSLT
processor before hitting your browser. Apart from the need to escape
those 4 characters, it's straightforward cut-and-paste and text editing.
Help yourself to any recent Vector article as a model. The site
script will serve you articles embedded in the site's HTML dressing.
To see the pre-XSLT source of, e.g.,

http://www.vector.org.uk/?vol=24&no=2&art=burriesci

browse to

http://www.vector.org.uk/archive/v242/burriesci.htm

UTF-8 is the default encoding for XML documents, and that's what you
want as your default too. If you're editing XML documents, use a
Unicode-savvy text editor. Notepad will work fine; I get good results
from (free) Notepad2. Set the editor's default encoding to UTF-8 with
no BOM (byte-order mark). Set its display font to a Unicode font with
good APL characters. Vector uses Adrian Smith's APL385 Unicode.
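
If the XML is generated by a program rather than typed into an editor, the same "UTF-8, no BOM" setting applies. A small Python sketch, with an invented file name and sample content: the "utf-8" codec writes no BOM, whereas "utf-8-sig" would prepend one.

```python
import os
import tempfile

xml_text = '<?xml version="1.0" encoding="UTF-8"?>\n<code>x←⍳10</code>\n'

# Hypothetical output location in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "apl_no_bom.xml")

# encoding="utf-8" writes the characters as UTF-8 without a byte-order mark.
with open(path, "w", encoding="utf-8", newline="\n") as f:
    f.write(xml_text)

# Read the raw bytes back to confirm the file starts with the XML declaration.
with open(path, "rb") as f:
    first_bytes = f.read(5)

# For comparison: the "utf-8-sig" codec prepends the three BOM bytes.
with_bom_prefix = xml_text.encode("utf-8-sig")[:3]
```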

Watch out: use a text editor that makes the encoding clear. My
otherwise excellent industrial-strength UltraEdit 13 text editor tends
to resave UTF-8 documents with the BOM set – basically a spontaneous
change of encoding. The BOM triggers a warning from the W3.org
validator (STRONGLY recommended) and chokes an XSLT processor. Other
text editors may share this weakness. Notepad2 doesn't, and is good at
displaying and setting the encoding.
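
Where an editor has already added the BOM, it can be removed before the file reaches the validator or XSLT processor. A minimal Python sketch; the function name and sample bytes are made up.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte-order mark, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# Simulate a document that an editor resaved with the BOM set.
with_bom = codecs.BOM_UTF8 + "<code>⍴x</code>".encode("utf-8")
clean = strip_utf8_bom(with_bom)

# When only reading, the "utf-8-sig" codec skips the BOM if it is there.
text = with_bom.decode("utf-8-sig")
```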

HTH

Stephen

Richard Nabavi

Aug 26, 2009, 4:08:01 AM
On 26 Aug, 08:54, "Stephen Taylor, edi...@vector.org.uk"
<stephentaylorf...@googlemail.com> wrote:
> Watch out: use a text editor that makes the encoding clear. My
> otherwise excellent industrial-strength UltraEdit 13 text editor tends
> to resave UTF-8 documents with the BOM set – basically a spontaneous
> change of encoding.

If my understanding of Unicode encoding is correct, that sounds like a
straightforward bug in UltraEdit. Surely UTF-8 shouldn't have a byte-
order mark, for the very good reason that there's no byte order?
Perhaps it is converting the encoding to UTF-16, in which case it
would make a bit more sense.

Richard Nabavi
MicroAPL Ltd

kai

Sep 1, 2009, 6:04:44 AM
> If my understanding of Unicode encoding is correct, that sounds like a
> straightforward bug in UltraEdit.  Surely UTF-8 shouldn't have a byte-
> order mark, for the very good reason that there's no byte order?
> Perhaps it is converting the encoding to UTF-16, in which case it
> would make a bit more sense.

Sometimes it depends on who introduced the bug. I believe that
Microsoft started to add BOMs to UTF-8 files. It does not make any
sense but anyway....

UltraEdit accepts reality by default but can be told to do
differently:

http://www.ultraedit.com/forums/viewtopic.php?t=34

Kai

Morten Kromberg

Sep 5, 2009, 3:31:34 AM
On Sep 1, 12:04 pm, kai <kaithomas...@googlemail.com> wrote:

> Sometimes it depends on who introduced the bug. I believe that
> Microsoft started to add BOMs to UTF-8 files. It does not make any
> sense but anyway....

I think it is a bit harsh to say this makes NO sense whatsoever. By
looking at the first few bytes of the file, you can use the encoding
of the BOM to determine whether the file is big- or little-endian
UCS-2 or -4 or UTF-8. The table that you can see under the heading
"Byte Order Mark" in http://www.vector.org.uk/?vol=24&no=1&art=kromberg
is quite useful, I think... For example, if you want to load (APL)
source code from a file and you don't know how it was encoded...
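
The kind of lookup described here might be sketched as follows in Python. This is not the table from the Vector article, only an illustration using the BOM constants from the standard codecs module; the function names are invented, and Python's UTF-16/UTF-32 codecs stand in for UCS-2/UCS-4. The UTF-32 marks are checked first because the little-endian UTF-32 BOM begins with the same two bytes as the UTF-16 one.

```python
import codecs
from typing import Optional, Tuple

# Longest marks first: BOM_UTF32_LE starts with the bytes of BOM_UTF16_LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes) -> Tuple[Optional[str], int]:
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

def decode_source(data: bytes, fallback: str = "utf-8") -> str:
    """Decode source text, using the BOM if present and a fallback otherwise."""
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding or fallback)

# Example: the same APL line saved as little-endian UTF-16 with a BOM.
sample = codecs.BOM_UTF16_LE + "x←⍳3".encode("utf-16-le")
decoded = decode_source(sample)
```

The fallback of UTF-8 for files without a BOM is an assumption; a real loader would have to decide what to do with legacy 8-bit encodings.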

Morten

phil chastney

Sep 5, 2009, 1:46:21 PM

you are right, Morten

U+FEFF is nowadays known as a "zero width no break space", so there is
no contradiction in starting a UTF-8 stream with 0xEF 0xBB 0xBF

U+FEFF can be used at the start of UCS-2 and UCS-4 streams to indicate
the byte order, but also, to quote the standard, "data streams that
begin with the byte order mark are likely to contain Unicode
characters", which is a useful piece of information when handling an
incoming 8-bit stream

the possibility exists that a file in an 8-bit encoding might genuinely
begin with 0xEF 0xBB 0xBF, but this is extremely unlikely, and is
therefore usually ignored
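
a quick Python check of the byte values above, and of why a genuine 8-bit file is unlikely to start that way: read as Latin-1, the same three bytes are an improbable run of accented and punctuation characters. The variable names are just for illustration.

```python
# U+FEFF encoded as UTF-8 gives the three bytes quoted above.
bom_char = "\ufeff"
utf8_form = bom_char.encode("utf-8")

# The same character in the two 16-bit byte orders.
utf16_be_form = bom_char.encode("utf-16-be")
utf16_le_form = bom_char.encode("utf-16-le")

# What those three bytes would mean in an 8-bit Latin-1 file.
as_latin1 = utf8_form.decode("latin-1")
```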

regards to all . . . /phil
