A trivial little question, but one that's been bugging me: Is there a name for that set of characters legal in Lisp identifiers? For most languages this would be "alphanumeric" (perhaps with a footnote that _ is regarded as a letter in this context), but Lisp includes characters like + and - that most languages regard as punctuation.
wallacethinmi...@eircom.net (Russell Wallace) writes:
> A trivial little question, but one that's been bugging me: Is there a
> name for that set of characters legal in Lisp identifiers? For most
> languages this would be "alphanumeric" (perhaps with a footnote that _
> is regarded as a letter in this context), but Lisp includes characters
> like + and - that most languages regard as punctuation.
I think "constituent character" is quite close, if not "it".
Regards,
'mr
-- [Emacs] is written in Lisp, which is the only computer language that is beautiful. -- Neal Stephenson, _In the Beginning was the Command Line_
Russell Wallace wrote:
> A trivial little question, but one that's been bugging me: Is there a
> name for that set of characters legal in Lisp identifiers?
* Russell Wallace
| A trivial little question, but one that's been bugging me: Is there
| a name for that set of characters legal in Lisp identifiers? For
| most languages this would be "alphanumeric" (perhaps with a footnote
| that _ is regarded as a letter in this context), but Lisp includes
| characters like + and - that most languages regard as punctuation.
The type STANDARD-CHAR covers the set of characters from which all symbols in the standard packages are made. This simple fact may give rise to the invalid assumption that there must be a particular character set from which all symbols must be made.
However, the functions INTERN and MAKE-SYMBOL take a STRING as the name of the symbol to be created, and there is no restriction on this /string/ to be of type BASE-STRING. Likewise, the value of SYMBOL-NAME is only specified to be of type STRING, with no mention of the common observation that it may be a SIMPLE-STRING regardless of whether the corresponding argument to INTERN or MAKE-SYMBOL was.
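A minimal sketch of this point, assuming a conforming implementation; the variable names are made up for illustration. MAKE-SYMBOL and INTERN are handed a string directly, so the reader and its escaping rules never come into play:

```lisp
;; A name containing spaces, parentheses and quotes (hypothetical example).
(defparameter *odd-name* "two words, (parens) and 'quotes'")

;; MAKE-SYMBOL creates an uninterned symbol; INTERN creates or finds one
;; in the current package. Both accept the string as given.
(defparameter *uninterned* (make-symbol *odd-name*))
(defparameter *interned* (intern *odd-name*))

;; SYMBOL-NAME returns a string with exactly the characters supplied.
(print (symbol-name *uninterned*))
(print (string= (symbol-name *interned*) *odd-name*))
```

Printing either symbol with PRINT shows the name wrapped in | characters, which is the printer making sure the reader could read it back.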
Since the symbols are normally created by the Common Lisp reader, your question is therefore really which characters the reader is able to build into a string that it will pass to INTERN. There is no upper bound on this character set in the standard, but an actual implementation will necessarily place restrictions on this set. In the worst case, the Common Lisp reader does not understand which character it has just read the encoding of, and may produce symbols with garbage bytes that nevertheless reproduce the character in your editor or other character display equipment.
Pessimistically, therefore, your question is whether you will find any mention in the standard of any invalid characters in symbols, but you find quite the opposite: After a single-escape character, normally \, any following character will be a constituent character in the symbol name being read, and between the multiple-escape characters, normally |, all characters will be constituent. The best you can hope for is thus that whatever reads the byte stream that is your source file will reject unacceptable encodings. As long as you use an encoded character set that includes the standard characters, there is no restriction on what you can do, and if you use an encoding that does not confuse standard characters and one of your other characters even in the least capable decoders, you will find that there is not even any useful restriction on the /length/ of Common Lisp symbol names.
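The two escape mechanisms can be tried out as follows. This is only a sketch, assuming standard syntax (hence WITH-STANDARD-IO-SYNTAX); the variable names are illustrative:

```lisp
;; Single escape: the character after \ is taken as a constituent.
;; Inside a Lisp string the backslash itself has to be doubled.
(defparameter *single*
  (with-standard-io-syntax (read-from-string "foo\\(bar")))

;; Multiple escape: every character between the | characters is taken
;; as a constituent, and no case conversion is applied to it.
(defparameter *multiple*
  (with-standard-io-syntax (read-from-string "|)(')(|")))

(print (symbol-name *single*))
(print (symbol-name *multiple*))
```

The unescaped letters in the first token are upcased by the standard readtable, while the escaped parenthesis is kept as it is.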
Optimistically, however, the answer to your question is that the set of characters that are legal in identifiers is the standard-class CHARACTER, but you may not be able to produce all of them in any given source file.
I am particularly fond of using the non-breaking space in symbol names, just as I use it in filenames under operating systems that believe that ordinary spaces are separators regardless of how much effort one puts into convincing its various programs otherwise. I know people who think there ought to be laws against this practice, but sadly, the Common Lisp standard does not come to their aid.
-- Erik Naggum | Oslo, Norway Yes, I survived 2003.
Act from reason, and failure makes you rethink and study harder. Act from faith, and failure makes you blame someone and push harder.
On 14 Jan 2004 05:39:34 +0000, Erik Naggum <e...@naggum.no> wrote:
> However, the functions INTERN and MAKE-SYMBOL take a STRING as the
> name of the symbol to be created, and there is no restriction on
> this /string/ to be of type BASE-STRING. Likewise, the value of
> SYMBOL-NAME is only specified to be of type STRING, with no mention
> of the common observation that it may be a SIMPLE-STRING regardless
> of whether the corresponding argument to INTERN or MAKE-SYMBOL was.
Welcome back, Erik!
Thanks for the explanation - okay, so basically any character _can_ be part of a symbol... fair enough... my question is really about the English terminology, though. That is, say you write...
(defun +-?-+ ...)
...that's fine, you can use the characters +, - and ? in a function name, they're... "constituent characters", one poster said? Whereas if you write...
(defun )(')( ...)
That won't work; (, ) and ' are "punctuation" (?) and normally recognized by the reader as special characters. (I'm talking about the normal case, not what you can persuade the reader, interner or whatever to do if you try hard enough :)) So there's "whitespace", "punctuation" and... what's the third category called? Not "alphanumeric"... "constituent characters"?
* Russell Wallace
| Thanks for the explanation - okay, so basically any character _can_
| be part of a symbol... fair enough... my question is really about
| the English terminology, though.
The terminology is really pretty simple, but you have to look at it from the right angle. In languages that require identifiers to be made up of particular characters, there is obviously a name for the character set, but in a language that goes out of its way to make it possible to use absolutely any character you want, there are only names for those characters that need special treatment to become part of a symbol name because their "normal" function is not to.
| Whereas if you write...
|
| (defun )(')( ...)
|
| That won't work; (, ) and ' are "punctuation" (?) and normally
| recognized by the reader as special characters.
Well, they are known as "macro characters". The important thing is that the set of macro characters is not defined by the language, but by the readtable in effect when the Common Lisp reader processes your source. There is a standard readtable, however, and one would have to say "unescaped terminating macro characters in the standard readtable" or another phrasing that tries to hide the obvious anal retentiveness to really speak about the characters that will not be part of a symbol name unless you have changed the rules. There is nothing particularly special about any of these macro characters. There are some restrictions on what the readtable can do and how the reader collects characters into symbol names. If you really insist, calling them "constituent characters" will help, but realize that this property is a result of falling through every other test -- unless it is escaped, in which case it wins its constituency right away. (There's an awful pun waiting to happen here, about Iowa, but I'll ignore the temptation.)
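To see which characters the standard readtable treats as macro characters, one can ask it directly. A small sketch follows; the function name MACRO-CHARACTER-STATUS is invented for this example, while GET-MACRO-CHARACTER is the standard function and NIL designates the standard readtable:

```lisp
(defun macro-character-status (char)
  "Return :TERMINATING, :NON-TERMINATING, or NIL for CHAR
in the standard readtable."
  (multiple-value-bind (function non-terminating-p)
      (get-macro-character char nil)   ; NIL = the standard readtable
    (cond ((null function) nil)        ; not a macro character at all
          (non-terminating-p :non-terminating)
          (t :terminating))))

;; Report the status of a few characters from the discussion.
(dolist (char (list #\( #\) #\' #\# #\+ #\- #\?))
  (format t "~&~A  ~A~%" char (macro-character-status char)))
```

Passing a different readtable as the second argument to GET-MACRO-CHARACTER would give different answers, which is the point being made: the set is a property of the readtable, not of the language.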
| (I'm talking about the normal case, not what you can persuade the
| reader, interner or whatever to do if you try hard enough :))
While this may seem reasonable from the angle you chose to look at this problem, it is the a priori reasonability of the position that has produced your problem. It is in fact unreasonable to approach Common Lisp from this angle. The problem does not exist. This
(defun |)(')(| ...)
is in fact fully valid Common Lisp code. You cannot define away the solution to the problem and insist that you still have a problem in need of an answer.
| So there's "whitespace", "punctuation" and... what's the third
| category called? Not "alphanumeric"... "constituent characters"?
I have to zoom out and ask you what you would do with the elusive name for this category. If I guess correctly at your intentions, I would perhaps have said that "any character can be part of a symbol name, but most macro characters need to be escaped to prevent them from having their macro function". (The important exception is #, the only non-terminating macro character in the standard readtable, meaning that #xF will be interpreted as a hexadecimal number, but F#x is a three-character-long symbol name with a # in it.)
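The # example can be checked at the REPL. A short sketch, assuming standard syntax; READ-WITH-STANDARD-SYNTAX is a helper name made up here:

```lisp
(defun read-with-standard-syntax (string)
  "Read one object from STRING using the standard readtable and package."
  (with-standard-io-syntax (read-from-string string)))

;; At the start of a token, # acts as a dispatching macro character.
(print (read-with-standard-syntax "#xF"))

;; Inside a token it is non-terminating, so it simply joins the name.
(print (symbol-name (read-with-standard-syntax "F#x")))
```

The symbol's name comes out in upper case because the standard readtable upcases unescaped letters.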
Unless you have a simple need that can be resolved by a nice, vague explanation that only informs your reader that Common Lisp is a lot different from languages that require particular characters in the names of identifiers/symbols, I think Chapter 23 in the standard, on the Common Lisp Reader, would be a really good suggestion right now.
Yeah, I'm back all right, with undesirably high levels of precision, scaring away frail newbies from day one. Maybe I'll go hibernate.
-- Erik Naggum | Oslo, Norway
> * Russell Wallace
> | Thanks for the explanation - okay, so basically any character _can_
> | be part of a symbol... fair enough... my question is really about
> | the English terminology, though.
> ...
> | So there's "whitespace", "punctuation" and... what's the third
> | category called? Not "alphanumeric"... "constituent characters"?
> I have to zoom out and ask you what you would do with the elusive
> name for this category. If I guess correctly at your intentions, I
> would perhaps have said that "any character can be part of a symbol
> name, but most macro characters need to be escaped to prevent them
> from having their macro function". (The important exception is #,
> the only non-terminating macro character in the standard readtable,
> meaning that #xF will be interpreted as a hexadecimal number, but F#x
> is a three-character-long symbol name with a # in it.)
> Unless you have a simple need that can be resolved by a nice, vague
> explanation that only informs your reader that Common Lisp is a lot
> different from languages that require particular characters in the
> names of identifiers/symbols, I think Chapter 23 in the standard, on
> the Common Lisp Reader, would be a really good suggestion right now.
i would have thought that a useful characterization would be "constituent character in the current readtable, with the constituent traits 'alphabetic' or 'alphadigit'", as that describes the set of characters which could be read, without escaping, as part of a symbol name, by means of readtable adjustments with set-syntax-from-char.
000058 (:) : *** : There is no package named "A" . NIL ?
i would have expected the token parser to have signaled errors when reading from strings which contained those characters for which 2.1.4.2 specifies the constituent trait 'invalid'.
is this an implementation bug, or have i misunderstood 2.1.4.2?
On 14 Jan 2004 08:22:42 +0000, Erik Naggum <e...@naggum.no> wrote:
> Well, they are known as "macro characters". The important thing is
> that the set of macro characters is not defined by the language, but
> by the readtable in effect when the Common Lisp reader processes
> your source. There is a standard readtable, however, and one would
> have to say "unescaped terminating macro characters in the standard
> readtable" or another phrasing that tries to hide the obvious anal
> retentiveness to really speak about the characters that will not be
> part of a symbol name unless you have changed the rules.
Right, so another way of phrasing my question would be: is there a shorter term for the noun phrase "unescaped..." above :)
> While this may seem reasonable from the angle you chose to look at
> this problem, it is the a priori reasonability of the position that
> has produced your problem. It is in fact unreasonable to approach
> Common Lisp from this angle. The problem does not exist.
You're right, of course, and if my objective was to understand Common Lisp, I wouldn't give this issue any more thought - it isn't a problem in that language.
> I have to zoom out and ask you what you would do with the elusive
> name for this category.
What I'm actually doing is designing a new language that's intended to share Lisp's property of allowing characters like + and - in symbols (though not the feature of also allowing things like brackets in symbols if you ask nicely), and I found when thinking about the syntax I was making heavy use of a concept I didn't have a name for, which rather bugged me; Lisp is one of the very few languages which allow non-alphanumeric characters in symbols, so I was wondering if it had a name for the concept.
It seems the answer is that it doesn't have a name because it doesn't particularly need the concept... hmm. I think I'll call them "ordinary characters".
> Yeah, I'm back all right, with undesirably high levels of precision,
> scaring away frail newbies from day one. Maybe I'll go hibernate.
*grin* No, stick around. The newsgroup's more fun with you around.
wallacethinmi...@eircom.net (Russell Wallace) writes:
> I was making heavy use of a concept I didn't have a name for, which
> rather bugged me; Lisp is one of the very few languages which allow
> non-alphanumeric characters in symbols
So does Forth, so perhaps programmers using that language have a name for it.
-- Lars Brinkhoff, Services for Unix, Linux, GCC, HTTP Brinkhoff Consulting http://www.brinkhoff.se/
* james anderson
| upon experimentation, however, i observe that
Your experiment has only uncovered that it is impossible to override the package marker status of colon. Other than that, you have only clobbered the constituent traits of all characters, forcing them the same as for #\a. It is unclear which hypotheses your experiment has actually tested.
This goes to show that : must always be escaped if it is to be part of a symbol name, however, further complicating the "name" for the set of allowable characters in a symbol.
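For the colon in the standard readtable, the behaviour can be sketched like this. The helper names and the package name in the last form are invented for the example, and the last result assumes that no package of that name exists:

```lisp
(defun symbol-name-from (string)
  "Read STRING with standard syntax and return the resulting symbol's name."
  (symbol-name (with-standard-io-syntax (read-from-string string))))

(defun reads-cleanly-p (string)
  "True if STRING can be read without an error being signaled."
  (handler-case
      (progn (with-standard-io-syntax (read-from-string string)) t)
    (error () nil)))

;; Escaped, the colon is an ordinary part of the name.
(print (symbol-name-from "a\\:b"))   ; single escape
(print (symbol-name-from "|a:b|"))   ; multiple escape

;; Unescaped, it is a package marker, so the reader goes looking for
;; a package (assumed here not to exist).
(print (reads-cleanly-p "no-such-package-for-this-sketch:b"))
```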
-- Erik Naggum | Oslo, Norway
> * james anderson
> | upon experimentation, however, i observe that

> Your experiment has only uncovered that it is impossible to override
> the package marker status of colon. Other than that, you have only
> clobbered the constituent traits of all characters, forcing them the
> same as for #\a. It is unclear which hypotheses your experiment has
> actually tested.
the hypothesis was that the constituent traits as set out in the table on standard and semi-standard characters, which traits are not supposed to be clobbered by set-syntax-from-char, would be useful to characterise the set of characters which could be used in symbol names without explicit escaping.
> This goes to show that : must always be escaped if it is to be part
> of a symbol name, however, further complicating the "name" for the
> set of allowable characters in a symbol.
i would have expected the same status as that for #\: to apply to whitespace characters and to rubout.
On 14 Jan 2004 13:45:54 +0100, Lars Brinkhoff <lars.s...@nocrew.org> wrote:
>wallacethinmi...@eircom.net (Russell Wallace) writes:
>> I was making heavy use of a concept I didn't have a name for, which
>> rather bugged me; Lisp is one of the very few languages which allow
>> non-alphanumeric characters in symbols

>So does Forth, so perhaps programmers using that language have a name
>for it.
So it does; good idea. I'll try asking there, thanks.
wallacethinmi...@eircom.net (Russell Wallace) writes:
> What I'm actually doing is designing a new language that's intended to
> share Lisp's property of allowing characters like + and - in symbols
> (though not the feature of also allowing things like brackets in
> symbols if you ask nicely)
So you won't be having first-class symbols? I'd be pretty appalled if I couldn't give make-symbol any arbitrary string.
--
           /|_     .-----------------------.
         ,'  .\  / | No to Imperialist war |
     ,--'    _,'   | Wage class war!       |
    /       /      `-----------------------'
   (   -.  |
   |     ) |
  (`-.  '--.)
   `. )----'
* james anderson
| the hypothesis was that the constituent traits as set out in the
| table on standard and semi-standard characters, which traits are not
| supposed to be clobbered by set-syntax-from-char, would be useful to
| characterise the set of characters which could be used in symbol
| names without explicit escaping.
That does not appear to be an unreasonable hypothesis, but it was not the hypothesis you tested. You tested whether a string of three characters, varying the middle one, would be read as a symbol or would signal an error. Any number of middle characters that cause a termination of the reader algorithm will produce a symbol read from the first character, a letter.
| i would have expected the same status as that for #\: to apply to
| whitespace characters and to rubout.
But (read-from-string "a b") will return a symbol, namely A, when the constituent trait of the space is /invalid/. You did not test the length or any other property of the symbol-name of the returned symbol, only that it did not error. The secondary value returned from READ-FROM-STRING should be educational.
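For comparison, here is what the secondary value looks like in the plain standard readtable, where the space still has whitespace syntax (that is, without the SET-SYNTAX-FROM-CHAR change under discussion). READ-FIRST-TOKEN is a name made up for this sketch:

```lisp
(defun read-first-token (string)
  "Return the name of the first symbol read from STRING, plus the index
at which READ-FROM-STRING stopped."
  (multiple-value-bind (object position)
      (with-standard-io-syntax (read-from-string string))
    (values (symbol-name object) position)))

;; The reader stops after the first token; the position says how much
;; of the string it actually consumed.
(multiple-value-bind (name position) (read-first-token "a b")
  (format t "~&name ~S, stopped at index ~D of ~D~%"
          name position (length "a b")))
```

A one-character name and a position short of the end of the string together show that only the first token was read.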
-- Erik Naggum | Oslo, Norway
wallacethinmi...@eircom.net (Russell Wallace) writes:
> What I'm actually doing is designing a new language that's intended to
> share Lisp's property of allowing characters like + and - in symbols
> (though not the feature of also allowing things like brackets in
> symbols if you ask nicely)
I think you're still missing the point. As Erik explained, _all_ characters are valid in a Lisp symbol name.
You seem to be trying to find the set of characters that don't require escaping in order to use them in symbol names. This is really a question about the Lisp reader. Basically, things will get turned into symbols if they don't parse as some other kind of thing.
I think you're mistaken to assume there is some subset of characters in CL that does what you want. Otherwise, what do you think of this:
Lisp> (type-of '123)
FIXNUM
Lisp> (type-of '123d0)
DOUBLE-FLOAT
Lisp> (type-of 'd1230)
SYMBOL
Lisp> (type-of '123j0)
SYMBOL
If your concern is what you can type to the reader, to result in a symbol, the answer is not simply a subset of characters. The syntax of those characters matters a lot as well. Are numerals in your set? By themselves, without escaping, the reader will turn them into numbers, not symbols. How about the letter "d", along with some numerals? Depends where in the sequence it appears.
All of the sequences above, if escaped, can be the names of symbols. If not escaped, then whether they become symbols or not when passed through the reader is _not_ a simple matter of character subsets; it's a matter of fallthrough in a series of parse attempts.
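The fallthrough can be watched directly by reading the same characters with and without escapes. A rough sketch, assuming standard syntax; TOKEN-KIND is an invented helper, and the exact treatment of a token like 123j0 may vary between implementations, so it is only printed here:

```lisp
(defun token-kind (string)
  "Read one token from STRING with standard syntax and classify the result."
  (let ((object (with-standard-io-syntax (read-from-string string))))
    (typecase object
      (integer :integer)
      (float :float)
      (symbol :symbol)
      (t :other))))

;; Same characters, different outcomes depending on order and escaping.
(dolist (token '("123" "123d0" "d1230" "123j0" "\\123" "|123d0|"))
  (format t "~&~10A ~A~%" token (token-kind token)))
```

The last two tokens contain escapes, so they are read as symbols even though the unescaped spellings are numbers.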
(And yes, I'm sure you can find a sufficiently small subset of characters, such that any sequence from the subset will parse only as a symbol. But that set is much _smaller_ than alphanumeric, whereas you were clearly looking for a subset of characters larger than that, e.g. including punctuation.)
-- Don
_______________________________________________________________________________
Don Geddis http://don.geddis.org/ d...@geddis.org
Underachievement: The tallest blade of grass is the first to be cut by the
lawnmower. -- Despair.com
>> What I'm actually doing is designing a new language that's intended to
>> share Lisp's property of allowing characters like + and - in symbols
>> (though not the feature of also allowing things like brackets in
>> symbols if you ask nicely)
>So you won't be having first-class symbols?
Right.
>I'd be pretty appalled if
>I couldn't give make-symbol any arbitrary string.
Well, in Common Lisp you'd probably be right. Arete (provisional name for my new language) is designed differently - symbols are only used for lexically scoped name-value mappings; strings do most of the other things you use symbols for in Lisp. (For example, 'FOO is just syntactic sugar for "FOO", it's not a symbol.)
On 14 Jan 2004 11:12:12 -0800, Don Geddis <d...@geddis.org> wrote:
>I think you're still missing the point. As Erik explained, _all_ characters
>are valid in a Lisp symbol name.
No, that's fine, I understand that - my question wasn't about Lisp, but about English terminology. I gather from Erik's explanation that the answer is "Lisp doesn't regard any such set as special enough to merit a short name", though, so I'll just make up one myself, something like "ordinary characters".
Russell Wallace wrote:
> What I'm actually doing is designing a new language that's intended to
> share Lisp's property of allowing characters like + and - in symbols
> (though not the feature of also allowing things like brackets in
> symbols if you ask nicely), and I found when thinking about the syntax
> I was making heavy use of a concept I didn't have a name for, which
> rather bugged me; Lisp is one of the very few languages which allow
> non-alphanumeric characters in symbols, so I was wondering if it had a
> name for the concept.
I don't know any language that has a name for this concept. Instead, you will find grammars for most languages, in BNF notation or something along these lines, that define what characters are accepted as part of identifiers. Chapter 2.2 in the HyperSpec is pretty close to what other languages do in this regard, for example.
When defining a new language, it's probably a good idea to define such a grammar at a certain stage anyway, and try to convince yourself that it's an LL(1) grammar. Minimizing the lookahead that's needed for parsing a program source is likely to improve the programmer's understanding of the language.
As a result you will get a single definitive point to refer to when someone wants to know what characters are accepted. That's probably better than inventing a term for this concept. Later on you can just use terms like "identifier" or "symbol", and it's clear from the grammar what is meant.
Further note that the idea to include characters like + and - in identifiers is IMHO only a good idea in prefix and probably postfix languages. In infix languages, it's very likely to be confusing when a+b and a + b mean different things. (If your language is not an infix language, then just forget this remark. ;)
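To make the a+b versus a + b difference concrete, here is a toy sketch of a tokenizer for a hypothetical language in which + and - are identifier characters, so that only spaces separate tokens. It is an assumption made for illustration, not a description of any actual language from this thread:

```lisp
(defun split-on-spaces (string)
  "Split STRING into tokens at space characters, dropping empty tokens.
Under the assumed rule, + and - never end a token by themselves."
  (loop with start = 0
        for end = (position #\Space string :start start)
        for token = (subseq string start end)
        unless (string= token "") collect token
        while end
        do (setf start (1+ end))))

(print (split-on-spaces "a+b"))    ; one identifier
(print (split-on-spaces "a + b"))  ; identifier, operator, identifier
```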
Pascal
-- Tyler: "How's that working out for you?" Jack: "Great." Tyler: "Keep it up, then."
On Wed, 14 Jan 2004 23:49:02 +0100, Pascal Costanza <costa...@web.de> wrote:
>When defining a new language, it's probably a good idea to define such a
>grammar at a certain stage anyway, and try to convince yourself that
>it's an LL(1) grammar. Minimizing the lookahead that's needed for
>parsing a program source is likely to improve the programmer's
>understanding of the language.
*nod-nod* I agree completely. I've the outline of a BNF grammar sketched in my head, and I'm pretty sure it's LL(1). Simple grammar is good ^.^
>Further note that the idea to include characters like + and - in
>identifiers is IMHO only a good idea in prefix and probably postfix
>languages. In infix languages, it's very likely to be confusing when a+b
>and a + b mean different things. (If your language is not an infix
>language, then just forget this remark. ;)
It is an infix language, and I agree that's a downside. I just think it's very heavily outweighed by the ability to write multiword identifiers with dashes instead of mixed case.
> * james anderson
> | the hypothesis was that the constituent traits as set out in the
> | table on standard and semi-standard characters, which traits are not
> | supposed to be clobbered by set-syntax-from-char, would be useful to
> | characterise the set of characters which could be used in symbol
> | names without explicit escaping.

> That does not appear to be an unreasonable hypothesis, but it was
> not the hypothesis [the posted code] tested. [It tested] whether a string of three
> characters, varying the middle one, would be read as a symbol or
> would signal an error. Any number of middle characters that cause a
> termination of the reader algorithm will produce a symbol read from
> the first character, a letter.

> | i would have expected the same status as that for #\: to apply to
> | whitespace characters and to rubout.

> But (read-from-string "a b") will return a symbol, namely A, when
> the constituent trait of the space is /invalid/.
i had thought that circumstance was specified to signal an error. there was a different version, which printed a bit too much to post, which noted and printed everything - exactly because the result was a surprise, which neither signalled an error, nor did it demonstrate the length-1-symbol-name behaviour.
> [The posted code] did not test
> the length or any other property of the symbol-name of the returned
> symbol, only that it did not error. The secondary value returned
> from READ-FROM-STRING should be educational.
> But (read-from-string "a b") will return a symbol, namely A, when
> the constituent trait of the space is /invalid/.

* james anderson
| i had thought that circumstance was specified to signal an error.
Hm. This appears to be unexplored territory. You deserve credit for pointing to the map and the real world and urging me to take a closer look at both.
We have the following situation: A character whose syntax type is /constituent/ is used to set the syntax type of a character whose previous syntax type was /whitespace/, but this means that the constituent trait of that character remains /invalid/, which makes the syntax type /invalid/. According to the specification, such a character can never occur in the input except under the control of a single escape character, so (read-from-string "a b") should indeed signal an error, as per 2.1.4.3. (In case anyone else wonders, the multiple escape mechanism already forces all characters to have the alphabetic trait.)
I thought I caught an obvious oversight in your test, but it would have been strong enough to test the hypothesis, were it not for the sorry fact that none of the Common Lisp environments I have access to signal an error when encountering invalid characters in the input stream.
| it was always 3.
OK, then this is definitely surprising and in clear violation of the standard. You're right that SET-SYNTAX-FROM-CHAR should not clobber the constituent trait for any character, not just the package marker.
Where is that annoying conformance test guy who stresses the useless corners and boundary conditions of the standard when you need him?
-- Erik Naggum | Oslo, Norway
Erik Naggum <e...@naggum.no> writes:
> Where is that annoying conformance test guy who stresses the useless
> corners and boundary conditions of the standard when you need him?
Since he may not respond to that description, I'll just say that Paul's tests are currently in progress up to chapter 21 (Streams), so it shouldn't be too long before chapter 23 (Reader) is breached.
Christophe
--
http://www-jcsu.jesus.cam.ac.uk/~csr21/ +44 1223 510 299/+44 7729 383 757
(set-pprint-dispatch 'number (lambda (s o) (declare (special b)) (format s b)))
(defvar b "~&Just another Lisp hacker~%") (pprint #36rJesusCollegeCambridge)