Name for the set of characters legal in identifiers

Russell Wallace

Jan 13, 2004, 11:36:14 PM
A trivial little question, but one that's been bugging me: Is there a
name for that set of characters legal in Lisp identifiers? For most
languages this would be "alphanumeric" (perhaps with a footnote that _
is regarded as a letter in this context), but Lisp includes characters
like + and - that most languages regard as punctuation.

Thanks,

--
"Sore wa himitsu desu."
To reply by email, remove
the small snack from address.
http://www.esatclear.ie/~rwallace

rydis

Jan 13, 2004, 11:57:06 PM
wallacet...@eircom.net (Russell Wallace) writes:

> A trivial little question, but one that's been bugging me: Is there a
> name for that set of characters legal in Lisp identifiers? For most
> languages this would be "alphanumeric" (perhaps with a footnote that _
> is regarded as a letter in this context), but Lisp includes characters
> like + and - that most languages regard as punctuation.

I think "constituent character" is quite close, if not "it".

Regards,

'mr

--
[Emacs] is written in Lisp, which is the only computer language that is
beautiful. -- Neal Stephenson, _In the Beginning was the Command Line_

Wade Humeniuk

Jan 14, 2004, 12:01:11 AM
Russell Wallace wrote:
> A trivial little question, but one that's been bugging me: Is there a
> name for that set of characters legal in Lisp identifiers?

In CL that would be _all_.

Wade

Erik Naggum

Jan 14, 2004, 12:39:34 AM
* Russell Wallace

| A trivial little question, but one that's been bugging me: Is there
| a name for that set of characters legal in Lisp identifiers? For
| most languages this would be "alphanumeric" (perhaps with a footnote
| that _ is regarded as a letter in this context), but Lisp includes
| characters like + and - that most languages regard as punctuation.

The type STANDARD-CHAR covers the set of characters from which all
symbols in the standard packages are made. This simple fact may
give rise to the invalid assumption that there must be a particular
character set from which all symbols must be made.

However, the functions INTERN and MAKE-SYMBOL take a STRING as the
name of the symbol to be created, and there is no restriction on
this /string/ to be of type BASE-STRING. Likewise, the value of
SYMBOL-NAME is only specified to be of type STRING, with no mention
of the common observation that it may be a SIMPLE-STRING regardless
of whether the corresponding argument to INTERN or MAKE-SYMBOL was.

Since the symbols are normally created by the Common Lisp reader,
your question is therefore really which characters the reader is
able to build into a string that it will pass to INTERN. There is
no upper bound on this character set in the standard, but an actual
implementation will necessarily place restrictions on this set. In
the worst case, the Common Lisp reader does not understand which
character it has just read the encoding of, and may produce symbols
with garbage bytes that nevertheless reproduce the character in your
editor or other character display equipment.

Pessimistically, therefore, your question is whether you will find
any mention in the standard of any invalid characters in symbols,
but you find quite the opposite: After a single-escape character,
normally \, any following character will be a constituent character
in the symbol name being read, and between the multiple-escape
characters, normally |, all characters will be constituent. The
best you can hope for is thus that whatever reads the byte stream
that is your source file will reject unacceptable encodings. As
long as you use an encoded character set that includes the standard
characters, there is no restriction on what you can do, and if you
use an encoding that does not confuse standard characters and one of
your other characters even in the least capable decoders, you will
find that there is not even any useful restriction on the /length/
of Common Lisp symbol names.
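
A minimal sketch of the two escape mechanisms just described, assuming
the standard readtable (hence WITH-STANDARD-IO-SYNTAX). The helper
READ-NAME and the sample names are made up for illustration.

```lisp
(defun read-name (text)
  "Read one token from TEXT with the standard readtable; return its symbol name."
  (with-standard-io-syntax
    (symbol-name (read-from-string text))))

;; Between multiple-escape characters, every character is a constituent.
(read-name "|)(')(|")    ; => ")(')("

;; After a single escape, even a space joins the token.  The unescaped
;; letters are still upcased by the standard readtable.
(read-name "a\\ b")      ; => "A B"

;; MAKE-SYMBOL never involves the reader and takes the string as it is.
(symbol-name (make-symbol "no reader involved"))   ; => "no reader involved"
```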

Optimistically, however, the answer to your question is that the set
of characters that are legal in identifiers is the system class
CHARACTER, but you may not be able to produce all of them in any
given source file.

I am particularly fond of using the non-breaking space in symbol
names, just as I use it in filenames under operating systems that
believe that ordinary spaces are separators regardless of how much
effort one puts into convincing its various programs otherwise. I
know people who think there ought to be laws against this practice,
but sadly, the Common Lisp standard does not come to their aid.

--
Erik Naggum | Oslo, Norway Yes, I survived 2003.

Act from reason, and failure makes you rethink and study harder.
Act from faith, and failure makes you blame someone and push harder.

Duane Rettig

Jan 14, 2004, 1:37:58 AM

> Erik Naggum | Oslo, Norway Yes, I survived 2003.

Welcome back, Erik!

--
Duane Rettig du...@franz.com Franz Inc. http://www.franz.com/
555 12th St., Suite 1450 http://www.555citycenter.com/
Oakland, Ca. 94607 Phone: (510) 452-2000; Fax: (510) 452-0182

Russell Wallace

Jan 14, 2004, 2:46:55 AM
On 14 Jan 2004 05:39:34 +0000, Erik Naggum <er...@naggum.no> wrote:

> However, the functions INTERN and MAKE-SYMBOL take a STRING as the
> name of the symbol to be created, and there is no restriction on
> this /string/ to be of type BASE-STRING. Likewise, the value of
> SYMBOL-NAME is only specified to be of type STRING, with no mention
> of the common observation that it may be a SIMPLE-STRING regardless
> of whether the corresponding argument to INTERN or MAKE-SYMBOL was.

Welcome back, Erik!

Thanks for the explanation - okay, so basically any character _can_ be
part of a symbol... fair enough... my question is really about the
English terminology, though. That is, say you write...

(defun +-?-+ ...)

...that's fine, you can use the characters +, - and ? in a function
name, they're... "constituent characters", one poster said? Whereas if
you write...

(defun )(')( ...)

That won't work; (, ) and ' are "punctuation" (?) and normally
recognized by the reader as special characters. (I'm talking about the
normal case, not what you can persuade the reader, interner or
whatever to do if you try hard enough :)) So there's "whitespace",
"punctuation" and... what's the third category called? Not
"alphanumeric"... "constituent characters"?

Erik Naggum

Jan 14, 2004, 3:22:42 AM
* Russell Wallace

| Thanks for the explanation - okay, so basically any character _can_
| be part of a symbol... fair enough... my question is really about
| the English terminology, though.

The terminology is really pretty simple, but you have to look at it
from the right angle. In languages that require identifiers to be
made up of particular characters, there is obviously a name for the
character set, but in a language that goes out of its way to make it
possible to use absolutely any character you want, there are only
names for those characters that need special treatment to become
part of a symbol name because their "normal" function is not to.

| Whereas if you write...
|
| (defun )(')( ...)
|
| That won't work; (, ) and ' are "punctuation" (?) and normally
| recognized by the reader as special characters.

Well, they are known as "macro characters". The important thing is
that the set of macro characters is not defined by the language, but
by the readtable in effect when the Common Lisp reader processes
your source. There is a standard readtable, however, and one would
have to say "unescaped terminating macro characters in the standard
readtable" or another phrasing that tries to hide the obvious anal
retentiveness to really speak about the characters that will not be
part of a symbol name unless you have changed the rules. There is
nothing particularly special about any of these macro characters.
There are some restrictions on what the readtable can do and how the
reader collects characters into symbol names. If you really insist,
calling them "constituent characters" will help, but realize that
this property is a result of falling through every other test --
unless it is escaped, in which case it wins its constituency right
away. (There's an awful pun waiting to happen here, about Iowa, but
I'll ignore the temptation.)

| (I'm talking about the normal case, not what you can persuade the
| reader, interner or whatever to do if you try hard enough :))

While this may seem reasonable from the angle you chose to look at
this problem, it is the a priori reasonability of the position that
has produced your problem. It is in fact unreasonable to approach
Common Lisp from this angle. The problem does not exist. This

(defun |)(')(| ...)

is in fact fully valid Common Lisp code. You cannot define away the
solution to the problem and insist that you still have a problem in
need of an answer.

| So there's "whitespace", "punctuation" and... what's the third
| category called? Not "alphanumeric"... "constituent characters"?

I have to zoom out and ask you what you would do with the elusive
name for this category. If I guess correctly at your intentions, I
would perhaps have said that "any character can be part of a symbol
name, but most macro characters need to be escaped to prevent them
from having their macro function". (The important exception is #,
the only non-terminating macro character in the standard readtable,
meaning that #xF will be interpreted as a hexadecimal number, but F#x
is a three-character-long symbol name with a # in it.)
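
The difference between terminating and non-terminating macro characters
can be checked directly. This is only a sketch, assuming the standard
readtable; MACRO-CHARACTER-DEMO is a made-up name.

```lisp
(defun macro-character-demo ()
  "Return four observations about # and ( in the standard readtable."
  (with-standard-io-syntax
    (list
     ;; The second value of GET-MACRO-CHARACTER is true for a
     ;; non-terminating macro character and false for a terminating one.
     (nth-value 1 (get-macro-character #\#))
     (nth-value 1 (get-macro-character #\())
     ;; At the start of a token, # dispatches; here it reads a hex number.
     (read-from-string "#xF")
     ;; Inside a token, # is just another constituent.
     (symbol-name (read-from-string "F#x")))))

(macro-character-demo)   ; => (true NIL 15 "F#X"), where "true" is any non-NIL value
```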

Unless you have a simple need that can be resolved by a nice, vague
explanation that only informs your reader that Common Lisp is a lot
different from languages that require particular characters in the
names of identifiers/symbols, I think Chapter 23 in the standard, on
the Common Lisp Reader, would be a really good suggestion right now.

Yeah, I'm back all right, with undesirably high levels of precision,
scaring away frail newbies from day one. Maybe I'll go hibernate.

--
Erik Naggum | Oslo, Norway

Act from reason, and failure makes you rethink and study harder.

james anderson

Jan 14, 2004, 5:56:12 AM

Erik Naggum wrote:
>
> * Russell Wallace
> | Thanks for the explanation - okay, so basically any character _can_
> | be part of a symbol... fair enough... my question is really about
> | the English terminology, though.
>

> ...


>
> | So there's "whitespace", "punctuation" and... what's the third
> | category called? Not "alphanumeric"... "constituent characters"?
>
> I have to zoom out and ask you what you would do with the elusive
> name for this category. If I guess correctly at your intentions, I
> would perhaps have said that "any character can be part of a symbol
> name, but most macro characters need to be escaped to prevent them
> from having their macro function". (The important exception is #,
> the only non-terminating macro character in the standard readtable,
> meaning that #xF will be interpreted as a hexadecimal number, but F#x
> is a three-character-long symbol name with a # in it.)
>
> Unless you have a simple need that can be resolved by a nice, vague
> explanation that only informs your reader that Common Lisp is a lot
> different from languages that require particular characters in the
> names of identifiers/symbols, I think Chapter 23 in the standard, on
> the Common Lisp Reader, would be a really good suggestion right now.
>

i would have thought that a useful characterization would be "constituent
character in the current readtable, with the constituent traits 'alphabetic'
or 'alphadigit'", as that describes the set of characters which could be read,
without escaping, as part of a symbol name, by means of readtable adjustments
with set-syntax-from-char.

upon experimentation, however, i observe that

? (defun test-constituent-character (code)
    (handler-case
        (read-from-string (concatenate 'string "a" (string (code-char code)) "b"))
      (error (e) e)))
TEST-CONSTITUENT-CHARACTER
? (let ((*rt* (copy-readtable)))
    (dotimes (i char-code-limit)
      (set-syntax-from-char (code-char i) #\a *rt*))
    (let ((result nil)
          (*readtable* *rt*))
      (dotimes (i char-code-limit)
        (typecase (setf result (test-constituent-character i))
          (symbol )
          (t (format *trace-output* "~%~6,'0d (~c) : *** : ~a"
                     i (code-char i) result))))))

000058 (:) : *** : There is no package named "A" .
NIL
?

i would have expected the token parser to have signaled errors when reading
from strings which contained those characters for which 2.1.4.2 specifies the
constituent trait 'invalid'.

is this an implementation bug, or have i misunderstood 2.1.4.2?

...

Russell Wallace

Jan 14, 2004, 6:34:28 AM
On 14 Jan 2004 08:22:42 +0000, Erik Naggum <er...@naggum.no> wrote:

> Well, they are known as "macro characters". The important thing is
> that the set of macro characters is not defined by the language, but
> by the readtable in effect when the Common Lisp reader processes
> your source. There is a standard readtable, however, and one would
> have to say "unescaped terminating macro characters in the standard
> readtable" or another phrasing that tries to hide the obvious anal
> retentiveness to really speak about the characters that will not be
> part of a symbol name unless you have changed the rules.

Right, so another way of phrasing my question would be: is there a
shorter term for the noun phrase "unescaped..." above :)

> While this may seem reasonable from the angle you chose to look at
> this problem, it is the a priori reasonability of the position that
> has produced your problem. It is in fact unreasonable to approach
> Common Lisp from this angle. The problem does not exist.

You're right, of course, and if my objective was to understand Common
Lisp, I wouldn't give this issue any more thought - it isn't a problem
in that language.

> I have to zoom out and ask you what you would do with the elusive
> name for this category.

What I'm actually doing is designing a new language that's intended to
share Lisp's property of allowing characters like + and - in symbols
(though not the feature of also allowing things like brackets in
symbols if you ask nicely), and I found when thinking about the syntax
I was making heavy use of a concept I didn't have a name for, which
rather bugged me; Lisp is one of the very few languages which allow
non-alphanumeric characters in symbols, so I was wondering if it had a
name for the concept.

It seems the answer is that it doesn't have a name because it doesn't
particularly need the concept... hmm. I think I'll call them "ordinary
characters".

> Yeah, I'm back all right, with undesirably high levels of precision,
> scaring away frail newbies from day one. Maybe I'll go hibernate.

*grin* No, stick around. The newsgroup's more fun with you around.

Lars Brinkhoff

Jan 14, 2004, 7:45:54 AM
wallacet...@eircom.net (Russell Wallace) writes:
> I was making heavy use of a concept I didn't have a name for, which
> rather bugged me; Lisp is one of the very few languages which allow
> non-alphanumeric characters in symbols

So does Forth, so perhaps programmers using that language have a name
for it.

--
Lars Brinkhoff, Services for Unix, Linux, GCC, HTTP
Brinkhoff Consulting http://www.brinkhoff.se/

Erik Naggum

Jan 14, 2004, 12:47:50 PM
* james anderson

| upon experimentation, however, i observe that

Your experiment has only uncovered that it is impossible to override
the package marker status of colon. Other than that, you have only
clobbered the constituent traits of all characters, forcing them the
same as for #\a. It is unclear which hypotheses your experiment has
actually tested.

This goes to show that : must always be escaped if it is to be part
of a symbol name, however, further complicating the "name" for the
set of allowable characters in a symbol.

james anderson

Jan 14, 2004, 1:25:09 PM

Erik Naggum wrote:
>
> * james anderson
> | upon experimentation, however, i observe that
>
> Your experiment has only uncovered that it is impossible to override
> the package marker status of colon. Other than that, you have only
> clobbered the constituent traits of all characters, forcing them the
> same as for #\a. It is unclear which hypotheses your experiment has
> actually tested.
>

the hypothesis was that the constituent traits as set out in the table on
standard and semi-standard characters, which traits are not supposed to be
clobbered by set-syntax-from-char, would be useful to characterise the set of
characters which could be used in symbol names without explicit escaping.

> This goes to show that : must always be escaped if it is to be part
> of a symbol name, however, further complicating the "name" for the
> set of allowable characters in a symbol.

i would have expected the same status as that for #\: to apply to whitespace
characters and to rubout.

...

Russell Wallace

Jan 14, 2004, 1:52:33 PM
On 14 Jan 2004 13:45:54 +0100, Lars Brinkhoff <lars...@nocrew.org>
wrote:

>wallacet...@eircom.net (Russell Wallace) writes:
>> I was making heavy use of a concept I didn't have a name for, which
>> rather bugged me; Lisp is one of the very few languages which allow
>> non-alphanumeric characters in symbols
>
>So does Forth, so perhaps programmers using that language have a name
>for it.

So it does; good idea. I'll try asking there, thanks.

Joe Marshall

Jan 14, 2004, 2:03:40 PM
Erik Naggum <er...@naggum.no> writes:

[snip]

Welcome back!

Thomas F. Burdick

Jan 14, 2004, 2:37:18 PM
wallacet...@eircom.net (Russell Wallace) writes:

> What I'm actually doing is designing a new language that's intended to
> share Lisp's property of allowing characters like + and - in symbols
> (though not the feature of also allowing things like brackets in
> symbols if you ask nicely)

So you won't be having first-class symbols? I'd be pretty appalled if
I couldn't give make-symbol any arbitrary string.

--
/|_ .-----------------------.
,' .\ / | No to Imperialist war |
,--' _,' | Wage class war! |
/ / `-----------------------'
( -. |
| ) |
(`-. '--.)
`. )----'

Erik Naggum

Jan 14, 2004, 4:36:55 PM
* james anderson

| the hypothesis was that the constituent traits as set out in the
| table on standard and semi-standard characters, which traits are not
| supposed to be clobbered by set-syntax-from-char, would be useful to
| characterise the set of characters which could be used in symbol
| names without explicit escaping.

That does not appear to be an unreasonable hypothesis, but it was
not the hypothesis you tested. You tested whether a string of three
characters, varying the middle one, would be read as a symbol or
would signal an error. Any number of middle characters that cause a
termination of the reader algorithm will produce a symbol read from
the first character, a letter.

| i would have expected the same status as that for #\: to apply to
| whitespace characters and to rubout.

But (read-from-string "a b") will return a symbol, namely A, when
the constituent trait of the space is /invalid/. You did not test
the length or any other property of the symbol-name of the returned
symbol, only that it did not error. The secondary value returned
from READ-FROM-STRING should be educational.
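
For comparison, here is what the secondary value and the symbol name
look like in the unmodified standard readtable, where space is plain
whitespace. This is a sketch only, not a rerun of the experiment above;
READ-WITH-POSITION is a made-up helper.

```lisp
(defun read-with-position (text)
  "Read one object from TEXT with the standard readtable.
Return a list of the symbol name (or NIL for a non-symbol) and the
index of the first character that was not read."
  (with-standard-io-syntax
    (multiple-value-bind (object position) (read-from-string text)
      (list (and (symbolp object) (symbol-name object))
            position))))

;; The space ends the token and is consumed, so reading stops at index 2.
(read-with-position "a b")   ; => ("A" 2)

;; No terminator inside the token: all three characters are read.
(read-with-position "a+b")   ; => ("A+B" 3)
```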

Don Geddis

Jan 14, 2004, 2:12:12 PM
wallacet...@eircom.net (Russell Wallace) writes:
> What I'm actually doing is designing a new language that's intended to
> share Lisp's property of allowing characters like + and - in symbols
> (though not the feature of also allowing things like brackets in
> symbols if you ask nicely)

I think you're still missing the point. As Erik explained, _all_ characters
are valid in a Lisp symbol name.

You seem to be trying to find the set of characters that don't require
escaping in order to use them in symbol names. This is really a question about
the Lisp reader. Basically, things will get turned into symbols if they don't
parse as some other kind of thing.

I think you're mistaken to assume there is some subset of characters in CL that
does what you want. Otherwise, what do you think of this:

Lisp> (type-of '123)
FIXNUM
Lisp> (type-of '123d0)
DOUBLE-FLOAT
Lisp> (type-of 'd1230)
SYMBOL
Lisp> (type-of '123j0)
SYMBOL

If your concern is what you can type to the reader, to result in a symbol,
the answer is not simply a subset of characters. The syntax of those
characters matters a lot as well. Are numerals in your set? By themselves,
without escaping, the reader will turn them into numbers, not symbols.
How about the letter "d", along with some numerals? Depends where in the
sequence it appears.

All of the sequences above, if escaped, can be the names of symbols. If not
escaped, then whether they become symbols or not when passed through the
reader is _not_ a simple matter of character subsets; it's a matter of
fallthrough in a series of parse attempts.
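
The same fallthrough can be observed programmatically. A sketch,
assuming the standard readtable and *READ-BASE* 10; TOKEN-KIND is a
made-up helper. The token 123j0 is left out on purpose: it has the shape
the standard calls a "potential number", and what the reader does with
such a token is up to the implementation.

```lisp
(defun token-kind (text)
  "Read one token from TEXT with standard syntax and classify the result."
  (with-standard-io-syntax
    (let ((object (read-from-string text)))
      (typecase object
        (integer :integer)
        (float   :float)
        (symbol  :symbol)
        (t       :other)))))

;; Same characters, different outcomes depending on order and escaping.
(mapcar #'token-kind '("123" "123d0" "d1230" "|123|" "\\123"))
;; => (:INTEGER :FLOAT :SYMBOL :SYMBOL :SYMBOL)
```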

(And yes, I'm sure you can find a sufficiently small subset of characters, such
that any sequence from the subset will parse only as a symbol. But that set
is much _smaller_ than alphanumeric, whereas you were clearly looking for
a subset of characters larger than that, e.g. including punctuation.)

-- Don
_______________________________________________________________________________
Don Geddis http://don.geddis.org/ d...@geddis.org
Underachievement: The tallest blade of grass is the first to be cut by the
lawnmower. -- Despair.com

Russell Wallace

Jan 14, 2004, 5:18:47 PM
On 14 Jan 2004 11:37:18 -0800, t...@famine.OCF.Berkeley.EDU (Thomas F.
Burdick) wrote:

>wallacet...@eircom.net (Russell Wallace) writes:
>
>> What I'm actually doing is designing a new language that's intended to
>> share Lisp's property of allowing characters like + and - in symbols
>> (though not the feature of also allowing things like brackets in
>> symbols if you ask nicely)
>
>So you won't be having first-class symbols?

Right.

>I'd be pretty appalled if
>I couldn't give make-symbol any arbitrary string.

Well, in Common Lisp you'd probably be right. Arete (provisional name
for my new language) is designed differently - symbols are only used
for lexically scoped name-value mappings; strings do most of the other
things you use symbols for in Lisp. (For example, 'FOO is just
syntactic sugar for "FOO", it's not a symbol.)

Russell Wallace

Jan 14, 2004, 5:21:12 PM
On 14 Jan 2004 11:12:12 -0800, Don Geddis <d...@geddis.org> wrote:

>I think you're still missing the point. As Erik explained, _all_ characters
>are valid in a Lisp symbol name.

No, that's fine, I understand that - my question wasn't about Lisp,
but about English terminology. I gather from Erik's explanation that
the answer is "Lisp doesn't regard any such set as special enough to
merit a short name", though, so I'll just make up one myself,
something like "ordinary characters".

Marc Spitzer

Jan 14, 2004, 5:27:37 PM
Erik Naggum <er...@naggum.no> writes:

Glad to see you here again,

marc

Pascal Costanza

Jan 14, 2004, 5:49:02 PM

Russell Wallace wrote:

> What I'm actually doing is designing a new language that's intended to
> share Lisp's property of allowing characters like + and - in symbols
> (though not the feature of also allowing things like brackets in
> symbols if you ask nicely), and I found when thinking about the syntax
> I was making heavy use of a concept I didn't have a name for, which
> rather bugged me; Lisp is one of the very few languages which allow
> non-alphanumeric characters in symbols, so I was wondering if it had a
> name for the concept.

I don't know any language that has a name for this concept. Instead, you
will find grammars for most languages, in BNF notation or something
along these lines, that define what characters are accepted as part of
identifiers. Chapter 2.2 in the HyperSpec is pretty close to what other
languages do in this regard, for example.

When defining a new language, it's probably a good idea to define such a
grammar at a certain stage anyway, and try to convince yourself that
it's an LL(1) grammar. Minimizing the lookahead that's needed for
parsing a program source is likely to improve the programmer's
understanding of the language.

As a result you will get a single definitive point to refer to when
someone wants to know what characters are accepted. That's probably
better than inventing a term for this concept. Later on you can just use
terms like "identifier" or "symbol", and it's clear from the grammar
what is meant.

Further note that the idea to include characters like + and - in
identifiers is IMHO only a good idea in prefix and probably postfix
languages. In infix languages, it's very likely to be confusing when a+b
and a + b mean different things. (If your language is not an infix
language, then just forget this remark. ;)
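
The Lisp reader itself shows the effect: with + as a constituent, the
spacing decides how many tokens there are. A sketch, assuming the
standard readtable; READ-ALL is a made-up helper that only handles
symbols.

```lisp
(defun read-all (text)
  "Read every object in TEXT with the standard readtable.
Return the list of symbol names; TEXT must contain only symbols."
  (with-standard-io-syntax
    (with-input-from-string (stream text)
      ;; The stream itself serves as a unique end-of-file marker.
      (loop for object = (read stream nil stream)
            until (eq object stream)
            collect (symbol-name object)))))

(read-all "a+b")     ; => ("A+B")       one symbol
(read-all "a + b")   ; => ("A" "+" "B") three symbols
```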


Pascal

--
Tyler: "How's that working out for you?"
Jack: "Great."
Tyler: "Keep it up, then."

Russell Wallace

Jan 14, 2004, 5:59:15 PM
On Wed, 14 Jan 2004 23:49:02 +0100, Pascal Costanza <cost...@web.de>
wrote:

>When defining a new language, it's probably a good idea to define such a
>grammar at a certain stage anyway, and try to convince yourself that
>it's an LL(1) grammar. Minimizing the lookahead that's needed for
>parsing a program source is likely to improve the programmer's
>understanding of the language.

*nod-nod* I agree completely. I've the outline of a BNF grammar
sketched in my head, and I'm pretty sure it's LL(1). Simple grammar is
good ^.^

>Further note that the idea to include characters like + and - in
>identifiers is IMHO only a good idea in prefix and probably postfix
>languages. In infix languages, it's very likely to be confusing when a+b
>and a + b mean different things. (If your language is not an infix
>language, then just forget this remark. ;)

It is an infix language, and I agree that's a downside. I just think
it's very heavily outweighed by the ability to write multiword
identifiers with dashes instead of mixed case.

james anderson

Jan 14, 2004, 6:04:39 PM

Erik Naggum wrote:
>
> * james anderson
> | the hypothesis was that the constituent traits as set out in the
> | table on standard and semi-standard characters, which traits are not
> | supposed to be clobbered by set-syntax-from-char, would be useful to
> | characterise the set of characters which could be used in symbol
> | names without explicit escaping.
>
> That does not appear to be an unreasonable hypothesis, but it was
> not the hypothesis [the posted code] tested. [It tested] whether a
> string of three characters, varying the middle one, would be read as
> a symbol or would signal an error. Any number of middle characters
> that cause a termination of the reader algorithm will produce a
> symbol read from the first character, a letter.
>
> | i would have expected the same status as that for #\: to apply to
> | whitespace characters and to rubout.
>
> But (read-from-string "a b") will return a symbol, namely A, when
> the constituent trait of the space is /invalid/.

i had thought that circumstance was specified to signal an error. there was a
different version, which printed a bit too much to post, which noted and
printed everything - exactly because the result was a surprise, which neither
signalled an error, nor did it demonstrate the length-1-symbol-name behaviour.

> [The posted code] did not test
> the length or any other property of the symbol-name of the returned
> symbol, only that it did not error. The secondary value returned
> from READ-FROM-STRING should be educational.

it was always 3.

...

Erik Naggum

Jan 14, 2004, 11:00:22 PM
* Erik Naggum

> But (read-from-string "a b") will return a symbol, namely A, when
> the constituent trait of the space is /invalid/.

* james anderson

| i had thought that circumstance was specified to signal an error.

Hm. This appears to be unexplored territory. You deserve credit
for pointing to the map and the real world and urging me to take a
closer look at both.

We have the following situation: A character whose syntax type is
/constituent/ is used to set the syntax type of a character whose
previous syntax type was /whitespace/, but this means that the
constituent trait of that character remains /invalid/, which makes
the syntax type /invalid/. According to the specification, such a
character can never occur in the input except under the control of a
single escape character, so (read-from-string "a b") should indeed
signal an error, as per 2.1.4.3. (In case anyone else wonders, the
multiple escape mechanism already forces all characters to have the
alphabetic trait.)

I thought I caught an obvious oversight in your test, but it would
have been strong enough to test the hypothesis, were it not for the
sorry fact that none of the Common Lisp environments I have access
to signal an error when encountering invalid characters in the input
stream.

| it was always 3.

OK, then this is definitely surprising and in clear violation of the
standard. You're right that SET-SYNTAX-FROM-CHAR should not clobber
the constituent trait for any character, not just the package marker.

Where is that annoying conformance test guy who stresses the useless
corners and boundary conditions of the standard when you need him?

Christophe Rhodes

Jan 15, 2004, 3:28:30 AM
Erik Naggum <er...@naggum.no> writes:

> Where is that annoying conformance test guy who stresses the useless
> corners and boundary conditions of the standard when you need him?

Since he may not respond to that description, I'll just say that
Paul's tests are currently in progress up to chapter 21 (Streams), so
it shouldn't be too long before chapter 23 (Reader) is breached.

Christophe
--
http://www-jcsu.jesus.cam.ac.uk/~csr21/ +44 1223 510 299/+44 7729 383 757
(set-pprint-dispatch 'number (lambda (s o) (declare (special b)) (format s b)))
(defvar b "~&Just another Lisp hacker~%") (pprint #36rJesusCollegeCambridge)

Paul F. Dietz

Jan 15, 2004, 7:06:55 AM
Erik Naggum wrote:

> Where is that annoying conformance test guy who stresses the useless
> corners and boundary conditions of the standard when you need him?

I haven't tested the reader (much) yet, so I don't feel comfortable
offering an opinion on this at this time.

Paul
(who is trying to recover from attempting to test section 19)

Kent M Pitman

Jan 15, 2004, 8:55:46 AM
wallacet...@eircom.net (Russell Wallace) writes:

> A trivial little question, but one that's been bugging me: Is there a
> name for that set of characters legal in Lisp identifiers?

A character.

I think you don't mean what you wrote.

A Lisp identifier is a symbol, not a piece of text. Some code is
constructed entirely from programs and never even goes through the
text phase and has no such thing.

All characters are, in principle, allowed in a symbol. You have to
use \x or |xxx| escaping to get some in.

If your question is about symbols, rather than about identifiers,
that's a legit thing to ask, but is a completely different matter.
Not all symbols are identifiers, though.

> For most languages this would be "alphanumeric" (perhaps with a
> footnote that _ is regarded as a letter in this context), but Lisp
> includes characters like + and - that most languages regard as
> punctuation.

Most languages are parsed from text to program, with no intermediate
phase. In Lisp, text (if there was any) has been parsed prior to the
time that expressions start to become considered as programs. Lisp
programs are not made out of characters, they are made out of structured
(i.e., already extant and composed) objects (conses, symbols, numbers,
etc.).

Kent M Pitman

Jan 15, 2004, 9:10:09 AM
Erik Naggum <er...@naggum.no> writes:

(Hi, Erik! A pleasure to see you here.)

> * Russell Wallace
> | A trivial little question, but one that's been bugging me: Is there
> | a name for that set of characters legal in Lisp identifiers? For
> | most languages this would be "alphanumeric" (perhaps with a footnote
> | that _ is regarded as a letter in this context), but Lisp includes
> | characters like + and - that most languages regard as punctuation.
>
> The type STANDARD-CHAR covers the set of characters from which all
> symbols in the standard packages are made. This simple fact may
> give rise to the invalid assumption that there must be a particular
> character set from which all symbols must be made.

Although, in contrast, if you're trying to write code to share around,
it's a good conservative set. In the same sense as it's conservative
to write your programs in English.

I experimented with a multi-user, multi-lingual system (not Lisp-based)
for a while, and we eventually concluded that multilingualism is cool but
is best left to the interface. At the programming level, the ability to
have one name for one function, is important to being able to search for
and update callers. Making multiple names (for each language) for a
function both impedes search and makes programs look dumb. Making each
package impose its own language choice makes it hard to read programs and
sometimes raises argument order/naming issues. And so, in the end, if
you retreat to some language to program in as a common language, English
once again rears its ugly chauvinistic self as the obvious alternative.
And with it, the standard characters are a nice safe set to build out of,
since there's no real reason to invite portability problems when you're
already within striking distance of easy portability.

The more power you get, the more the burden is on you to use it
wisely. Just because you can do something doesn't mean you should...

> However, the functions INTERN and MAKE-SYMBOL take a STRING as the
> name of the symbol to be created, and there is no restriction on
> this /string/ to be of type BASE-STRING. Likewise, the value of
> SYMBOL-NAME is only specified to be of type STRING, with no mention
> of the common observation that it may be a SIMPLE-STRING regardless
> of whether the corresponding argument to INTERN or MAKE-SYMBOL was.

Yeah, I think this last is left to implementations. I don't think there
is any really good reason to require it to be a simple string. An
implementation might want to experiment with non-simple strings in ways
the designers didn't anticipate.

> I am particularly fond of using the non-breaking space in symbol
> names, just as I use it in filenames under operating systems that
> believe that ordinary spaces are separators regardless of how much
> effort one puts into convincing its various programs otherwise. I
> know people who think there ought to be laws against this practice,
> but sadly, the Common Lisp standard does not come to their aid.

Erik, I have missed your singular ability to make me mad and make me smile
at the same time. I wish I could decide whether I think this practice is
clever and forward thinking or just an irritating loophole. But either way,
the problem exists, and you're just highlighting it.

Don Geddis

Jan 15, 2004, 11:44:40 AM
wallacet...@eircom.net (Russell Wallace) writes:
> my question wasn't about Lisp, but about English terminology. I gather from
> Erik's explanation that the answer is "Lisp doesn't regard any such set as
> special enough to merit a short name", though, so I'll just make up one
> myself, something like "ordinary characters".

I think you're still making a conceptual error. You're all concerned about
the name for this concept, but the problem (in Lisp) is that the concept
itself doesn't exist.

There are CHARACTERs, which for example can be put together into STRINGs.
SYMBOLs have names which are STRINGs, composed of any CHARACTER at all.

There is _no_ (sub)set of CHARACTERs in Lisp which does what you want.
You're searching for the name of a concept, but the concept itself is not
well-formed. No wonder it doesn't have a name.

(In particular: whether the CL reader interprets a token as a symbol is a
result of a parsing algorithm, not a result of whether the constituent
characters are in your magic subset or not. If the parser can't interpret
the token as some other data type, then it becomes a symbol. You're imagining
the wrong algorithm for choosing to make a token into a symbol.)
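
A rough sketch of that algorithm seen from the outside, assuming standard reader syntax. TOKEN-KIND is a hypothetical helper, not anything from the standard: it hands a token to the reader and reports what kind of object came back. The same characters + and - end up in a number or in a symbol depending on the whole token.

```lisp
(defun token-kind (text)
  "Read TEXT with standard syntax and classify the resulting object."
  (let ((object (with-standard-io-syntax
                  ;; Disable #. so that reading cannot evaluate anything.
                  (let ((*read-eval* nil))
                    (read-from-string text)))))
    (typecase object
      (integer :integer)
      (ratio   :ratio)
      (float   :float)
      (symbol  :symbol)
      (t       :other))))

;; The sign characters are part of a number in some tokens
;; and part of a symbol name in others.
(token-kind "+5")   ; an integer
(token-kind "5+")   ; a symbol, because it does not parse as a number
(token-kind "+")    ; a symbol
(token-kind "1/2")  ; a ratio
(token-kind "1.5")  ; a float
```

Note that reading a token such as 5+ interns a symbol as a side effect.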

Thomas A. Russ

Jan 15, 2004, 5:16:22 PM
wallacet...@eircom.net (Russell Wallace) writes:
> (defun )(')( ...)

(defun |)(;)| ( ...)

> That won't work; (, ) and ' are "punctuation" (?) and normally
> recognized by the reader as special characters. (I'm talking about the
> normal case, not what you can persuade the reader, interner or
> whatever to do if you try hard enough :))

Of course, surrounding the symbol name with vertical bars might be
considered "trying hard enough" by some people.


--
Thomas A. Russ, USC/Information Sciences Institute

Kenny Tilton

Jan 15, 2004, 6:56:25 PM

Thomas A. Russ wrote:

> wallacet...@eircom.net (Russell Wallace) writes:
>
>> (defun )(')( ...)
>
>
> (defun |)(;)| ( ...)
>
>
>>That won't work; (, ) and ' are "punctuation" (?) and normally
>>recognized by the reader as special characters. (I'm talking about the
>>normal case, not what you can persuade the reader, interner or
>>whatever to do if you try hard enough :))
>
>
> Of course, surrounding the symbol name with vertical bars might be
> considered "trying hard enough" by some people.

That's how I might have felt until I found myself with the requirement
to parse some useful metadata out of an XML dtd. <yechh> Now it seems
like the most natural thing in the world to code:

(case tag-id
(|BeginString| ...)
(|MsgType| ...)
(|CheckSum| ...))
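
A minimal sketch of how such a dispatch might be fed from tag names that arrive as strings. It is a variation on the form above, not the project's code: it assumes keyword symbols so that the CASE keys do not depend on *PACKAGE*, and FIX-TAG-KIND and its return values are made up for illustration.

```lisp
(defun fix-tag-kind (tag-name)
  "Map a case-sensitive tag name (a string) to a hypothetical category."
  ;; FIND-SYMBOL compares names exactly, so "MsgType" finds :|MsgType|
  ;; (created when this CASE form was read), while an unknown name
  ;; yields NIL and falls through to the default clause.
  (case (find-symbol tag-name :keyword)
    (:|BeginString| :header)
    (:|MsgType|     :header)
    (:|CheckSum|    :trailer)
    (t              :unknown)))
```

Using FIND-SYMBOL rather than INTERN is a design choice in this sketch: unknown tag names from the input do not accumulate in the KEYWORD package.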

Speaking of which, this is related to a Lisp NYC project to create a toy
exchange with a FIX (Financial Info Exchange) protocol interface. The
original protocol was a flat "tag=value;"+ format. An XML version was
developed and a DTD along with it, which I used just to get the metadata.

Now we want to leave the land of XML behind and write out a nice sexpr
variant of the same metadata as /our/ spec (we do not have to worry
about matching the real Fix tit for tat since this is a toy exchange not
meant as a FIX client testbed).

Of course it is easy enough for me to come up with a sexpr format off
the top of my head, but I seem to recall someone (Erik? Tim? Other?)
saying they had done some work on a formal approach to an alternative to
XML/HTML/whatever.

True that? If so, I am all ears.

kt


--
http://tilton-technology.com

Why Lisp? http://alu.cliki.net/RtL%20Highlight%20Film

Your Project Here! http://alu.cliki.net/Industry%20Application

Russell Wallace

Jan 16, 2004, 8:35:17 AM
On 15 Jan 2004 14:16:22 -0800, t...@sevak.isi.edu (Thomas A. Russ)
wrote:

>Of course, surrounding the symbol name with vertical bars might be
>considered "trying hard enough" by some people.

The fact that it works proves it's enough, n'est-ce pas? ^.~

Barry Margolin

Jan 16, 2004, 1:33:24 PM
In article <4004f268....@news.eircom.net>,
wallacet...@eircom.net (Russell Wallace) wrote:

> (defun )(')( ...)
>
> That won't work; (, ) and ' are "punctuation" (?) and normally
> recognized by the reader as special characters. (I'm talking about the
> normal case, not what you can persuade the reader, interner or
> whatever to do if you try hard enough :)) So there's "whitespace",
> "punctuation" and... what's the third category called? Not
> "alphanumeric"... "constituent characters"?

Yes, that's the phrase used in the specification.

Note, however, that a token consisting only of constituent characters is
*not* necessarily going to be parsed as a symbol. Both numbers and
symbols are made up only of constituent characters (unless you make use
of radix prefixes like #o and #b). Thus, 123e.456 is a symbol, 123.456
and 123e456 are floats, and 123e is a symbol or integer depending on the
value of *READ-BASE*.
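
These cases can be checked at the REPL. The following is a small sketch assuming the standard readtable; READ-WITH-BASE is a hypothetical helper that only binds *READ-BASE* around the read. It uses 123e4 for the float case rather than 123e456, since an exponent that large is outside the range of the usual float formats.

```lisp
(defun read-with-base (text base)
  "Read one object from TEXT with *READ-BASE* bound to BASE."
  (let ((*read-base* base))
    (read-from-string text)))

;; No digits directly after the exponent marker, so not a float: a symbol.
(symbolp (read-with-base "123e.456" 10))

;; Ordinary float syntax, with and without an exponent.
(floatp (read-with-base "123.456" 10))
(floatp (read-with-base "123e4" 10))

;; In base 10 the letter E is not a digit, so 123e is a symbol.
(symbolp (read-with-base "123e" 10))

;; In base 16 the same token is all digits: #x123E, which is 4670.
(read-with-base "123e" 16)
```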

There are tables in the ANSI spec and CLTL that list all the standard
character types and constituent character attributes. The character
types are whitespace, terminating macro, non-terminating macro, single
escape, multiple escape, and constituent (the text also mentions
"illegal" characters, although no standard characters are of this type).

--
Barry Margolin, bar...@alum.mit.edu
Arlington, MA

Erik Naggum

Jan 19, 2004, 7:24:42 AM
* Kenny Tilton

| Of course it is easy enough for me to come up with a sexpr format off
| the top of my head, but I seem to recall someone (Erik? Tim? Other?)
| saying they had done some work on a formal approach to an alternative
| to XML/HTML/whatever.
|
| True that? If so, I am all ears.

Really? You are? Maybe I didn't survive 2003 and this is some Hell
where people have to do eternal penance, and now I get to do SGML all
over again.

Much processing of SGML-like data appears to be stream-like and will
therefore appear to be equivalent to an in-order traversal of a tree,
which can therefore be represented with cons cells while the traverser
maintains its own backward links elsewhere, but this is misleading.

The amount of work and memory required to maintain the proper backward
links and to make the right decisions is found in real applications to
balloon and to cause random hacks; the query languages reflect this
complexity. Ease of access to the parent element is crucial to the
decision-making process, so if one wants to use a simple list to keep
track of this, the most natural thing is to create a list of the
element type, the parent, and the contents, such that each element has
the form (type parent . contents), but this has the annoying property
that moving from a particular element to the next can only be done by
remembering the position of the current element in a list, just as one
cannot move to the next element in a list unless you keep the cons
cell around. However, the whole point of this exercise is to be able
to keep only one pointer around. So the contents of an element must
have the form (type parent contents . tail) if it has element contents
or simply a list of objects, or just the object if simple enough.

Example: <foo>123</foo> would thus be represented by (foo nil "123"),
<foo>123</foo><bar>456</bar> by (foo nil "123" bar nil "456"), and
<zot><foo>123</foo><bar>456</bar></zot> by #1=(zot nil (foo #1# "123"
bar #1# "456")).

Navigation inside this kind of structure is easy: When the contents in
CADDR is exhausted, the CDDDR is the next element, or if NIL, we have
exhausted the contents of the parent and move up to the CADR and look
for its next element, etc. All the important edges of the containers
that make up the *ML document are easily detectable, and the operations
that are usually found at the edges, normally tied to the element
type (or as modified by its parents), are easily computable. However,
using a list for this is cumbersome, so I cooked up the «quad». The
«quad» is devoid of any intrinsic meaning because it is intended to be
a general data structure, so I looked for the best meaningless names
for the slots/accessors, and decided on QAR, QBR, QCR, and QDR. The
quad points to the element type (like the operator in a sexpr) in the
QAR, the parent (or back) quad in the QBR, the contents of the element
in the QCR, and the usual pointer to the next quad in the QDR.
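
A minimal sketch of the quad as described, assuming a plain DEFSTRUCT rather than a native type. Only the accessor names QAR, QBR, QCR and QDR come from the description above; NEXT-QUAD, the PRINT-OBJECT method and the example variables are illustrative assumptions.

```lisp
;; Hypothetical DEFSTRUCT rendering of the quad.
(defstruct (quad (:conc-name nil))
  qar   ; element type, like the operator in a sexpr
  qbr   ; parent (back) quad
  qcr   ; contents: a chain of quads or a simple object
  qdr)  ; next quad

;; The structure is circular through QBR, so print only the element type.
(defmethod print-object ((q quad) stream)
  (print-unreadable-object (q stream :type t)
    (prin1 (qar q) stream)))

(defun next-quad (q)
  "Return the quad that follows Q once Q's contents are exhausted.
When Q has no QDR, climb through the parents until one has a QDR.
Returns NIL at the end of the document."
  (loop for current = q then (qbr current)
        while current
        when (qdr current) return (qdr current)))

;; <zot><foo>123</foo><bar>456</bar></zot>
(defparameter *zot* (make-quad :qar 'zot))
(defparameter *bar* (make-quad :qar 'bar :qbr *zot* :qcr "456"))
(defparameter *foo* (make-quad :qar 'foo :qbr *zot* :qcr "123" :qdr *bar*))
(setf (qcr *zot*) *foo*)
```

With this, (next-quad *foo*) is the bar quad, and (next-quad *bar*) climbs to zot, finds no QDR there either, and returns NIL. Only one pointer is held by the caller at any time.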

Since the intent with this model is to «load» SGML/XML/SALT documents
into memory, one important issue is how to represent long stretches of
character content or binary content. The quad can easily be used to
represent a (sequence of) entity fragments, with the source in QAR,
the start position in QBR, and the end position in QCR, thereby using
a minimum of memory for the contents. Since very large documents are
intended to be loaded into memory, this property is central to the
ability to search only selected elements for their contents -- most
searching processors today parse the entire entity structure and do
very little to maintain the parsed element structure.

Speaking of memory, one simple and efficient way to implement the quad
on systems that lack the ability to add native types without overhead,
is to use a two-dimensional array with a second dimension of 4 and let
quad pointers be integers, which is friendly to garbage collection and
is unambiguous when the quad is used in the way explained above.
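
A rough sketch of that array-backed variant, assuming a fixed-size store for brevity. The names *QUAD-STORE*, *QUAD-FILL*, NEW-QUAD-INDEX and QUAD-SLOT are made up for this example; a quad "pointer" is simply a row index.

```lisp
;; Row I holds QAR, QBR, QCR and QDR of quad number I in columns 0..3.
(defparameter *quad-store* (make-array '(8 4) :initial-element nil))
(defparameter *quad-fill* 0)

(defun new-quad-index (type parent contents next)
  "Fill the next free row and return its index as the quad pointer."
  (assert (< *quad-fill* (array-dimension *quad-store* 0)) ()
          "The quad store is full.")
  (let ((index *quad-fill*))
    (setf (aref *quad-store* index 0) type
          (aref *quad-store* index 1) parent
          (aref *quad-store* index 2) contents
          (aref *quad-store* index 3) next)
    (incf *quad-fill*)
    index))

(defun quad-slot (index slot)
  "Read column SLOT (0 = QAR ... 3 = QDR) of quad number INDEX."
  (aref *quad-store* index slot))

(defun (setf quad-slot) (value index slot)
  (setf (aref *quad-store* index slot) value))

;; <zot><foo>123</foo><bar>456</bar></zot>
(defparameter *zot-index* (new-quad-index 'zot nil nil nil))
(defparameter *foo-index* (new-quad-index 'foo *zot-index* "123" nil))
(defparameter *bar-index* (new-quad-index 'bar *zot-index* "456" nil))
(setf (quad-slot *zot-index* 2) *foo-index*   ; contents of zot
      (quad-slot *foo-index* 3) *bar-index*)  ; foo is followed by bar
```

A real implementation would presumably grow the store on demand; the fixed capacity of 8 rows here is only to keep the sketch short.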

Maybe I'll talk about SALT some other day.

Kenny Tilton

Jan 19, 2004, 1:05:54 PM

Erik Naggum wrote:

> * Kenny Tilton
> | Of course it is easy enough for me to come up with a sexpr format off
> | the top of my head, but I seem to recall someone (Erik? Tim? Other?)
> | saying they had done some work on a formal approach to an alternative
> | to XML/HTML/whatever.
> |
> | True that? If so, I am all ears.
>
> Really? You are? Maybe I didn't survive 2003 and this is some Hell
> where people have to do eternal penance, and now I get to do SGML all
> over again.

First, thx, <<quad>>s are nice. I was thinking about compiling some
XML-alternative syntax into internal Lisp structures (which is why I was
wondering why I even need someone else's proposal, I can just write the
internal structures out as READable forms).

I see <<quads>> are something that allow one to navigate the structure
itself, and that this is useful if one does not want to gobble up the
whole of the structure. I'll keep <<quad>>s in mind if I ever want a
random-access markup store.

kenny

Erik Naggum

Jan 19, 2004, 11:08:55 PM
* Kenny Tilton

| First, thx, <<quad>>s are nice.

Heh. My absence from news shows. Over here in Europe, «» are the
proper quotation marks, instead of the various versions of " that are
not in ISO 8859-1. The « and » are not integral to the name of the
type, it's just "quad".

| I was thinking about compiling some XML-alternative syntax into
| internal Lisp structures (which is why I was wondering why I even need
| someone else's proposal, I can just write the internal structures out
| as READable forms).

You always have to consider how much information you want to retain
from the parsing process. The sexpr contains just enough information
for its uses, but the only navigation you ever do with sexprs is to go
down the CAR or CDR.

| I see <<quads>> are something that allow one to navigate the structure
| itself, and that this is useful if one does not want to gobble up the
| whole of the structure.

Hm, I think it makes most sense when you do want to gobble up the
whole of the structure. The point about storing pointers to entity
fragments using quads, too, was that contents usually dwarfs the
markup in volume. When end-tags make up 25% of the volume of the
document, however, the start-tags make up another 25%, and when I
designed the quad and its various implementations in Common Lisp and
in special languages, the strong desire was to be able to load large
documents into memory.

| I'll keep <<quad>>s in mind if I ever want a random-access markup
| store.

That seems like you decided on their utility before trying them out,
while I have really tried to build a system as useful for XML-like
data as the cons cell is for Lisp-like data. Instead of inventing the
array and regarding cons cells as random access into the list, we just
use lists made up of cons cells because that affords the navigation we
need when processing them. Likewise, the quad affords the navigation
we need when processing XML-like structures. When I suggest that an
implementation that does not provide the ability to add native types
use a two-dimensional array, it is not because it makes random access
into the document possible but because it saves a lot of memory.

Kenny Tilton

Jan 20, 2004, 2:29:50 AM

Erik Naggum wrote:

> * Kenny Tilton
> | I was thinking about compiling some XML-alternative syntax into
> | internal Lisp structures (which is why I was wondering why I even need
> | someone else's proposal, I can just write the internal structures out
> | as READable forms).
>
> You always have to consider how much information you want to retain
> from the parsing process. The sexpr contains just enough information
> for its uses, but the only navigation you ever do with sexprs is to go
> down the CAR or CDR.

Maybe I should have said more about what I am doing. I wrote a poor
man's XML parser just so I could read a DTD just so I could get metadata
about the required structure of Financial Info Exchange (FIX) protocol
messages. The funny thing is we do not plan to support FIXML, but the
DTD for it looked like the best source of metadata about the original
"tag=value" format.

What we want to do is now leave the world of XML behind and just write
out the metadata in some nice Lisp-friendly way.

The DTD was nothing more than !ENTITYs, !ELEMENTs, and !ATTLISTs.
Anyway, I just created hashtables for entities and elements which I
converted to structs, and the element struct had a slot for attributes,
etc etc etc.

Now I want to write it all out readably so we can leave XML behind. As
it is, I had to fill in some gaps by adding to the DTD so the parse
could produce the Right Thing (this being more fun than the alternative
of hardcoded additions to the internal structures post-parse).

It's a little fuzzy, but one element would define a record and have a
content string that listed all the field elements. So at run time I
dynamically use bits of the same parser to read that string and
determine the fields (I sensed that I had to leave it to the last second
to support dynamic redefinition of elements, but perhaps this step could
also be <<pre-compiled>>) and then look up the fields to determine their
attributes in turn to assist with parsing of the field data.

> | I'll keep <<quad>>s in mind if I ever want a random-access markup
> | store.
>
> That seems like you decided on their utility before trying them out,

Maybe I just misunderstood. If quads just give me a link to the parent,
well, in the case of the DTD, all the entities, elements, and attributes
had the same parent, the XML dtd document. So I imagined an awful lot of
serial searching, repeated over and over again for the same message
type, and yes, I made a gut determination that I could use the names of
things as keys to a hash table and turn a record expansion into so many
keyed lookups.

Well, maybe I am all wet. If performance is my concern, I need only
memoize things like record expansions, something I should do anyway even
with the keyed lookups. Memoization will internally involve its own hash
tables, but at least they are hidden behind a functional interface,
which would be nice.
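
Something like the following is what I have in mind, as a minimal sketch only. MEMOIZE is a hypothetical helper, not from any library, and it assumes the argument lists can be compared with EQUAL.

```lisp
(defun memoize (function &key (test #'equal))
  "Return a function that caches the results of FUNCTION per argument list.
The hash table stays hidden inside the closure."
  (let ((cache (make-hash-table :test test)))
    (lambda (&rest arguments)
      (multiple-value-bind (value foundp) (gethash arguments cache)
        (if foundp
            value
            ;; First call with these arguments: compute and remember.
            (setf (gethash arguments cache)
                  (apply function arguments)))))))
```

A record expansion keyed on the message type name could then be wrapped once and called through FUNCALL, with the table never showing up in the calling code. Only the primary return value is cached in this version.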

> However, the whole point of this exercise is to be able
> to keep only one pointer around. So the contents of an element must
> have the form (type parent contents . tail) if it has element contents
> or simply a list of objects, or just the object if simple enough.
>
> Example: <foo>123</foo> would thus be represented by (foo nil "123"),
> <foo>123</foo><bar>456</bar> by (foo nil "123" bar nil "456"), and
> <zot><foo>123</foo><bar>456</bar></zot> by #1=(zot nil (foo #1# "123"
> bar #1# "456")).

Do we need each child to refer to its parent? Why not a format with the
parent first and then one or more children understood to share the same
parent?

#1=(nil zot (#1# foo "123" bar "456"))

?

kt

Björn Lindberg

Jan 20, 2004, 9:05:37 AM
Erik Naggum <er...@naggum.no> writes:

> * Kenny Tilton
> | First, thx, <<quad>>s are nice.
>
> Heh. My absence from news shows. Over here in Europe, «» are the
> proper quotation marks, instead of the various versions of " that are
> not in ISO 8859-1.

What do you mean? The proper quotation marks for Swedish are
"ninety-nine ninety-nine", while in eg the UK "sixty-six ninety-nine"
is used. "Gåsögon", the marks you used, can also be used in Swedish but
are then often used »like this». In Norway they are pointed outwards,
like you did, but in Denmark they are »pointed inwards« instead[1].

Or did you just mean to say that since ISO 8859-1 is lacking the
proper "-style quotation marks, it is better to use «»? Because I
believe the quote marks to be used are dictated by the language the
text is written in, so that English text should be written using
English quotation marks, Swedish text using Swedish quotation marks,
etc.

[1] (in Swedish)
http://susning.nu/Citat
http://susning.nu/G%e5s%f6gon


Björn

Marco Antoniotti

Jan 20, 2004, 2:16:44 PM

Kenny Tilton wrote:
>
>
> Erik Naggum wrote:
>
>> * Kenny Tilton
>> | I was thinking about compiling some XML-alternative syntax into
>> | internal Lisp structures (which is why I was wondering why I even need
>> | someone else's proposal, I can just write the internal structures out
>> | as READable forms).
>>
>> You always have to consider how much information you want to retain
>> from the parsing process. The sexpr contains just enough information
>> for its uses, but the only navigation you ever do with sexprs is to go
>> down the CAR or CDR.
>
>
> Maybe I should have said more about what I am doing. I wrote a poor
> man's XML parser just so I could read a DTD just so I could get metadata
> about the required structure of Financial Info Exchange (FIX) protocol
> messages. The funny thing is we do not plan to support FIXML, but the
> DTD for it looked like the best source of metadata about the original
> "tag=value" format.

What's wrong with CL-XML?


Cheers
--
Marco

Joe Marshall

Jan 20, 2004, 2:34:05 PM
Erik Naggum <er...@naggum.no> writes:

> Maybe I didn't survive 2003 and this is some Hell where people
> have to do eternal penance...

It's worse than that, this is comp.lang.lisp

Kenny Tilton

Jan 20, 2004, 2:36:31 PM

I couldn't understand the installation instructions.

And the doc said it pulled things into CLOS instances, and then I would
use XQuery/XPath/XCrap to get at the info. And I did not really want to
do XML (in which case getting all fancy like that might make sense), I
just wanted to suck some info out of a DTD.

In fact, someone on the team already said I screwed up, I should have
parsed an HTML file for the same info, which would be more accurate in
certain dark corners where the XML orientation of the DTD diminishes the
correspondence to the non-XML syntax.

And this is Lisp, I wrote my crappy hard-coded parser in less time than
it would have taken to figure out how to install cl-xml. And about 100
lines of code so no one on our team has to bother with cl-xml.

And now it's mine! All mine!!

:)

Edi Weitz

Jan 20, 2004, 3:36:14 PM
On Tue, 20 Jan 2004 19:36:31 GMT, Kenny Tilton <kti...@nyc.rr.com> wrote:

> Marco Antoniotti wrote:
>>
>> What's wrong with CL-XML?
>
> I couldn't understand the installation instructions.

So I'm not the only one... :)

But you know that there are some more "lightweight" solutions out
there? You could have said

(asdf-install:install :pxmlutils)

or

(asdf-install:install :xmls)

and - voilà! (At least I hope so...)

CLiki is your friend.

Edi.

james anderson

Jan 20, 2004, 6:20:48 PM

Edi Weitz wrote:
>
> On Tue, 20 Jan 2004 19:36:31 GMT, Kenny Tilton <kti...@nyc.rr.com> wrote:
>
> > Marco Antoniotti wrote:
> >>
> >> What's wrong with CL-XML?
> >
> > I couldn't understand the installation instructions.
>
> So I'm not the only one... :)

the mind boggles.

the distribution unpacks to a directory at the top level of which is a
collection of files with names in the form

load{+,-}cl-http{+,-}instanceNames.lisp

and a file

load.lisp

which is a symbolic link to

load-cl-http+instanceNames.lisp

that is to say, if one is using one of the supported lisp implementations and
one types

(load #p"<pathname to the load.lisp file>")<return>

at the repl prompt, one compiles and loads the parser in an environment
without cl-http and in a mode which implements names as instances (as opposed
to symbols).

which aspect of this prospective process does one find difficult to understand?

>
> But you know that there are some more "lightweight" solutions out
> there? You could have said
>
> (asdf-install:install :pxmlutils)
>
> or
>
> (asdf-install:install :xmls)
>
> and - voilà! (At least I hope so...)

should there be any cause for uncertainty as to how to proceed, the release
includes a directory

[tschichold:XML-0-949-20030409T2320-MACOS/tests/implementation]
janson% ls -l
total 0
drwxr-xr-x 3 janson admin 102 Apr 9 2003 acl-5-0-1
drwxr-xr-x 3 janson admin 102 Apr 9 2003 cmucl-18e+
drwxr-xr-x 3 janson admin 102 Apr 9 2003 lispworks-4-2
drwxr-xr-x 3 janson admin 102 Apr 9 2003 lispworks-4-3
drwxr-xr-x 3 janson admin 102 Apr 9 2003 mcl-5-0b
drwxr-xr-x 3 janson admin 102 Apr 9 2003 openmcl-0-13-3
[tschichold:XML-0-949-20030409T2320-MACOS/tests/implementation]
janson%

which contains transcripts of the load process and the results of the oasis
conformance tests in the respective implementations.

...

Kenny Tilton

Jan 20, 2004, 7:03:05 PM

Edi Weitz wrote:
> On Tue, 20 Jan 2004 19:36:31 GMT, Kenny Tilton <kti...@nyc.rr.com> wrote:
>
>
>>Marco Antoniotti wrote:
>>
>>>What's wrong with CL-XML?
>>
>>I couldn't understand the installation instructions.
>
>
> So I'm not the only one... :)
>
> But you know that there are some more "lightweight" solutions out
> there?

Aw, c'mon, I can write a hard-coded parser in my sleep. Besides, now I
can put XML on the resume. I'll just have to pretend I wrote it in a
real language.

Edi Weitz

Jan 20, 2004, 7:16:17 PM
On Wed, 21 Jan 2004 00:20:48 +0100, james anderson <james.a...@setf.de> wrote:

> the mind boggles.
>
> the distribution unpacks to a directory at the top level of which is
> a collection of files with names in the form
>
> load{+,-}cl-http{+,-}instanceNames.lisp
>
> and a file
>
> load.lisp
>
> which is a symbolic link to
>
> load-cl-http+instanceNames.lisp
>
> that is to say, if one is using one of the supported lisp
> implementations and one types
>
> (load #p"<pathname to the load.lisp file>")<return>
>
> at the repl prompt, one compiles and loads the parser in an
> environment without cl-http and in a mode which implements names as
> instances (as opposed to symbols).
>
> which aspect of this prospective process does one find difficult to
> understand?

What you describe here is rather easy to understand. I suggest you add
it to the webpage

<http://pws.prserv.net/James.Anderson/XML/documentation/howto/load.html>.

I'm not 100% sure but I think I remember that the last time I checked
I was supposed to install CL-HTTP before I could compile
CL-XML. That's a bit harder than just (LOAD "load.lisp").

Also, if I unpack the file XML-0-949-20030409.tgz on my machine the
"documentation" directory is empty except for three GIFs - no "README"
or "INSTALL" file. The file "load.lisp" is just one of five
"load*.lisp" files. The fact that it once was a symbolic link
obviously got lost in the tarball.

Edi.

james anderson

Jan 20, 2004, 8:03:33 PM

Edi Weitz wrote:
>
> On Wed, 21 Jan 2004 00:20:48 +0100, james anderson <james.a...@setf.de> wrote:
>
> > ...
> >
> > which aspect of this prospective process does one find difficult to
> > understand?
>
> What you describe here is rather easy to understand. I suggest you add
> it to the webpage
>
> <http://pws.prserv.net/James.Anderson/XML/documentation/howto/load.html>.
>

ok.

the irony of which is, that page was composed in response to an assertion, from
someone who had found the various load*.lisp files, that, once he had looked
at them, it was not clear how to load and use the xml-path library rather than
just the parser.

> I'm not 100% sure but I think I remember that the last time I checked
> I was supposed to install CL-HTTP before I could compile
> CL-XML. That's a bit harder than just (LOAD "load.lisp").

it would be helpful to hear what might have led you to that supposition. so
far as i can ascertain (i have code going back 4 years only) that has not been
the case for a long time. perhaps there's a note somewhere in the
documentation which is misleading?

>
> Also, if I unpack the file XML-0-949-20030409.tgz on my machine the
> "documentation" directory is empty except for three GIFs - no "README"
> or "INSTALL" file. The file "load.lisp" is just one of five
> "load*.lisp" files. The fact that it once was a symbolic link
> obviously got lost in the tarball.

hmm. it would appear that i have to be more selective as to which version of
tar i use in the future. thanks for the hint.

...

Edi Weitz

Jan 20, 2004, 8:11:49 PM
On Wed, 21 Jan 2004 02:03:33 +0100, james anderson <james.a...@setf.de> wrote:

> Edi Weitz wrote:
>>
>> I'm not 100% sure but I think I remember that the last time I
>> checked I was supposed to install CL-HTTP before I could compile
>> CL-XML. That's a bit harder than just (LOAD "load.lisp").
>
> it would be helpful to hear what might have led you to that
> supposition. so far as i can ascertain (i have code going back 4
> years only) that has not been the case for a long time. perhaps
> there's a note somewhere in the documentation which is misleading?

Hmm, it definitely wasn't four years ago - I hardly knew about CL at
that time. It must have been around 2001/2002 and I didn't really try
hard to get CL-XML installed. (I didn't need it, I was just browsing.)
I read the docs and said to myself "That's too much of a hassle. Let's
try again later." (Mind you, that was when I was still fighting with
things like MK:DEFSYSTEM and ASDF. I have learned a thing or two
since.)

If CL-HTTP wasn't required then maybe the preferred system definition
utility (or the one described in the docs) was from CL-HTTP? I can't
remember but I know that I left with the impression that I had to be
somewhat familiar with CL-HTTP (which I wasn't) in order to use
CL-XML. It's good to know that this isn't the case.

Cheers,
Edi.

Erik Naggum

Jan 20, 2004, 9:30:17 PM
* Björn Lindberg
| What do you mean?

That my habits have changed.

Erik Naggum

Jan 21, 2004, 2:04:58 AM
* Kenny Tilton

| Maybe I just misunderstood. If quads just give me a link to the
| parent, well, in the case of the DTD, all the entities, elements, and
| attributes had the same parent, the XML dtd document. So I imagined
| an awful lot of serial searching, repeated over and over again for the
| same message type, and yes, I made a gut determination that I could
| use the names of things as keys to a hash table and turn a record
| expansion into so many keyed lookups.

You assume way too much. I lack the information to unwind your many
assumptions, but you may have noticed that I wrote that QAR would
point to the element type, like the operator in the CAR of a sexpr.
This is obviously a symbol-like structure. For some reason, you have
read what I wrote to refer to the prolog of an SGML/XML document,
while I talked about the document instance. I have written elsewhere
that the very concept of a DTD was a huge mistake, so I really wish
you had asked me instead of running with your assumptions.

Just as Common Lisp is defined on objects in a tree structure, but
still manages to have clearly defined semantics, I had hoped it would
be rather obvious that I intend the same to hold true for the SGML
tree. Defining element types and processors on them is clearly part
of the whole approach, and just as Common Lisp systems do not search
source files linearly for definitions of operators, this part of the
language is not restricted to being represented with quads.

But I don't know where to begin to explain things to you so you don't
assume things without asking. It is very difficult to predict what
someone who guesses a lot will need to invalidate an assumption.

| Do we need each child to refer to its parent?

Yes.

| Why not a format with the parent first and then one or more children
| understood to share the same parent?

That would require more pointers to be kept around in a stack-like
structure when traversing the document, while an explicit design goal
of my approach is to move all this information into the tree.
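
A minimal sketch of the trade-off described above, written in Python only for illustration. It assumes a four-slot node in which QAR holds the element type and QBR the parent, as the posts say; using QCR for the first child and QDR for the next sibling is an assumption, as are the helper names `add_child` and `walk`. With the parent stored in every node, a document-order traversal needs no stack of ancestors:

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass(eq=False)
class Quad:
    qar: str                      # element type (the "operator")
    qbr: Optional["Quad"] = None  # parent
    qcr: Optional["Quad"] = None  # first child (assumed slot assignment)
    qdr: Optional["Quad"] = None  # next sibling (assumed slot assignment)


def add_child(parent: Quad, element_type: str) -> Quad:
    """Append a new child to parent, linking it after the last sibling."""
    child = Quad(element_type, qbr=parent)
    if parent.qcr is None:
        parent.qcr = child
    else:
        last = parent.qcr
        while last.qdr is not None:
            last = last.qdr
        last.qdr = child
    return child


def walk(root: Quad) -> Iterator[Quad]:
    """Yield nodes in document order using only the links in the tree."""
    node = root
    while True:
        yield node
        if node.qcr is not None:
            node = node.qcr        # descend to the first child
            continue
        # No children: climb through parents until a sibling exists.
        while node is not root and node.qdr is None:
            node = node.qbr
        if node is root:
            return
        node = node.qdr            # move on to the next sibling


doc = Quad("doc")
a = add_child(doc, "a")
add_child(a, "a1")
add_child(a, "a2")
add_child(doc, "b")
order = [q.qar for q in walk(doc)]
```

The only state `walk` keeps is the current node; everything else is read back out of the tree, which is the point of storing the parent in each node.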

a

Jan 21, 2004, 1:40:19 PM
But one does have to use an "experimental" version of CMUCL, if one uses
CMUCL. It's documented on the CL-XML website but I certainly overlooked it
the first time I tried to get CL-XML working. It's a CLOS issue, IIRC.

"Edi Weitz" <e...@agharta.de> wrote in message
news:m3ektum...@bird.agharta.de...
...

Raymond Toy

Jan 21, 2004, 2:35:57 PM
>>>>> "a" == a <a...@def.gh> writes:

a> But one does have to use an "experimental" version of CMUCL, if one uses
a> CMUCL. It's documented on the CL-XML website but I certainly overlooked it

What does "experimental" mean? AFAIK, experimental versions were from
years ago. There are, however, some monthly snapshots available[1], with
some other random CVS snapshots. And a release is coming Real Soon
Now too.


Ray


Footnotes:
[1] Sort of. cons.org is mostly down right now. But should be back
real soon now.

a

Jan 21, 2004, 4:13:03 PM
See http://cl-xml.org and under Availability | Releases follow the "A
separate {{document}}" link to the CL-XML Releases page. The cmucl entry is
marked "yes[1]" under the "wo/CL-HTTP w/ name symbols" column. Footnote [1]
says "the experimental CLOS/MOP is required. tests were done with the
{{i686-linux}} version." The i686-linux link points to
http://cvs2.cons.org/ftp-area/cmucl/experimental/pcl/cmucl-2003-03-28--17-20-37-i686-Linux.tar.gz,
if I spelled that correctly. CL-XML did not compile
with any other version of CMUCL that I had tried but it worked fine with
that one.

"Raymond Toy" <t...@rtp.ericsson.se> wrote in message
news:4nd69dc...@edgedsp4.rtp.ericsson.se...

Raymond Toy

Jan 21, 2004, 4:26:41 PM
>>>>> "a" == a <a...@def.gh> writes:

a> marked "yes[1]" under the "wo/CL-HTTP w/ name symbols" column. Footnote [1]
a> says "the experimental CLOS/MOP is required. tests were done with the

Ah, ok. That's Gerd's PCL stuff. Yeah, it was experimental, but it's
not anymore. It will go out in the next release, and has been
the default for quite a while.

Ray

james anderson

Jan 21, 2004, 7:13:46 PM

there were a number of things which the experimental pcl did better than the
then official release at the time i was porting, so i've been waiting for some
indication that it had been folded into a stable release before rechecking
compatibility and updating any documentation.

if anyone has built it with more recent cvs snapshots, please let me know, so
that i can update the notes. otherwise i'll just keep watching the releases.

...

Kenny Tilton

Jan 22, 2004, 3:09:29 AM

Erik Naggum wrote:
> * Kenny Tilton
> | Maybe I just misunderstood. If quads just give me a link to the
> | parent, well, in the case of the DTD, all the entities, elements, and
> | attributes had the same parent, the XML dtd document. So I imagined
> | an awful lot of serial searching, repeated over and over again for the
> | same message type, and yes, I made a gut determination that I could
> | use the names of things as keys to a hash table and turn a record
> | expansion into so many keyed lookups.
>
> You assume way too much. I lack the information to unwind your many
> assumptions, but you may have noticed that I wrote that QAR would
> point to the element type, like the operator in the CAR of a sexpr.
> This is obviously a symbol-like structure. For some reason, you have
> read what I wrote to refer to the prolog of an SGML/XML document,
> while I talked about the document instance.

No, I figured out you must be talking about doc instances. That is why I
confessed to parsing a DTD. Anyway, I think I follow. The only reason I
thought mad serial searching was involved was the parent pointer, but
hey, all my tree nodes know their parents so I certainly see the value
in that.

> I have written elsewhere
> that the very concept of a DTD was a huge mistake,

It seems the world agrees. The DTD is dead, long live the Schema.

> But I don't know where to begin to explain things to you so you don't
> assume things without asking.

This reminds me of my attempt to teach someone rollerblading, who
persisted in pitching himself headlong and often spinning thru the air
and onto the asphalt. After a few tries at talking the lad down I
realized it was just his learning style and let him be.

Anyway, as I said, I have always been a parent-aware node designer
myself, so it is fun seeing that elevated to the status of a car or cdr.

Not that it matters, but why isn't the parent the first slot? As for the
tail being dotted, bold stroke that. I've always felt bad writing code
to ask my parent who is next after me just to get to my next sibling.

kt

Erik Naggum

Jan 22, 2004, 5:51:11 AM
* Kenny Tilton

| The only reason I thought mad serial searching was involved was the
| parent pointer, but hey, all my tree nodes know their parents so I
| certainly see the value in that.

OK, but the point with my approach is to "load" a document into memory
and then work on and navigate around in the in-memory representation
instead of the edge-detection scheme that is used in the most popular
tools. That is, a DOM without any of the insanity.

| It seems the world agrees. The DTD is dead, long live the Schema.

I am not pleased with this development, either, FWIW.

| Not that it matters, but why isn't the parent the first slot?

Because the QAR is the "operator". In the case of an entity fragment,
the QAR is the source, and the meaning of the QBR is different, but if
the two-dimensional vector with indexes instead of pointers is used,
both a parent and a start position in a source would be a number.

| As for the tail being dotted, bold stroke that. I've always felt bad
| writing code to ask my parent who is next after me just to get to my
| next sibling.

Precisely. That is so wrong.
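
A rough sketch of the "two-dimensional vector with indexes instead of pointers" variant mentioned above, in Python for illustration only. The column order, the `NIL` marker, the interning of element type names, and the `QuadTable` class itself are all assumptions, not anything specified in the thread; the entity-fragment case, where QAR is a source and QBR a start position, is not modeled. Each row holds four integers, so a parent is just a row number, and the next sibling is read straight from the node's own row:

```python
NIL = -1                       # stands in for a null pointer
QAR, QBR, QCR, QDR = range(4)  # column numbers within a row


class QuadTable:
    def __init__(self):
        self.rows = []        # rows[i] is [qar, qbr, qcr, qdr], all integers
        self.types = []       # element type names, indexed by the QAR value
        self._type_ids = {}   # name -> index into self.types

    def intern_type(self, name):
        """Return the index for an element type name, adding it if new."""
        if name not in self._type_ids:
            self._type_ids[name] = len(self.types)
            self.types.append(name)
        return self._type_ids[name]

    def add(self, element_type, parent=NIL):
        """Append a row and link it in as the last child of parent."""
        index = len(self.rows)
        self.rows.append([self.intern_type(element_type), parent, NIL, NIL])
        if parent != NIL:
            sibling = self.rows[parent][QCR]
            if sibling == NIL:
                self.rows[parent][QCR] = index
            else:
                while self.rows[sibling][QDR] != NIL:
                    sibling = self.rows[sibling][QDR]
                self.rows[sibling][QDR] = index
        return index

    def element_type(self, index):
        return self.types[self.rows[index][QAR]]

    def next_sibling(self, index):
        # No need to ask the parent: the link is in the node's own row.
        return self.rows[index][QDR]


table = QuadTable()
doc = table.add("doc")
first = table.add("para", doc)
second = table.add("para", doc)
```

Here `table.next_sibling(first)` gives the row number of `second` without touching the parent's row at all.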
