symbol vs. string

Ladvánszky Károly

Oct 3, 2003, 5:38:22 AM

Please explain to me why the use of symbols (as in Lisp) is more powerful than
using strings for the same tasks? I understand it is faster to access
symbols than strings but that is certainly not the main factor. Let me
explain my question a bit. Symbol processing languages are often mentioned
as suitable tools for natural language processing. Now, why is it better to
use symbols to formulate a statement like (verb run) for instance than using
("verb" "run") which could be done in languages with no symbol data type.
Furthermore, whenever it is required to analyze a word (to find the vowels
etc.), one can use the whole arsenal of string handling functions while it
seems not as easy in the case of symbols.
I'm sure I don't see something very important.

Thanks for any help (and maybe a few simple code examples) on this,

Károly

Pascal Bourguignon

Oct 3, 2003, 9:45:13 AM

"Ladvánszky Károly" <a...@bb.cc> writes:

> Please explain to me why the use of symbols (as in Lisp) is more powerful than
> using strings for the same tasks? I understand it is faster to access
> symbols than strings but that is certainly not the main factor. Let me
> explain my question a bit. Symbol processing languages are often mentioned
> as suitable tools for natural language processing. Now, why is it better to
> use symbols to formulate a statement like (verb run) for instance than using
> ("verb" "run") which could be done in languages with no symbol data type.

Symbols come with a predefined data structure. They are naturally unique:

(eq 'foo 'foo) is T
while (eq "foo" "foo") is not ensured and generally NIL.

They're stored in a data structure generally optimized for access
time (a hash table, for example).

When you're parsing words, it may be useful to note that two tokens
are the same word.
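
A minimal sketch of the uniqueness point, assuming two tokens arrive as freshly allocated strings (the variable names are made up for illustration). INTERN maps equal names to a single symbol object in the current package, so the identity test EQ becomes meaningful:

```lisp
;; Two tokens read from some input; COPY-SEQ guarantees distinct string objects.
(defparameter *token-1* (copy-seq "FLY"))
(defparameter *token-2* (copy-seq "FLY"))

;; EQ tests object identity, so two fresh strings are never EQ.
(eq *token-1* *token-2*)                     ; => NIL
;; STRING= compares the strings character by character.
(string= *token-1* *token-2*)                ; => T

;; INTERN returns the one symbol with that name in the current package.
(eq (intern *token-1*) (intern *token-2*))   ; => T
```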

Symbols have slots:

- a value slot (symbol-value 'foo) [usually written: foo]

- a function slot (symbol-function 'foo) [usually written: (foo args...)]

- a name slot (symbol-name 'foo) [which returns the string "FOO"]

- a property-list slot (symbol-plist 'foo)
  where user programs can attach interesting facts about these symbols,
  for example:
(setf (get 'cat 'noun) t
      (get 'fly 'noun) t
      (get 'fly 'verb) t
      (get 'flies 'verb) t
      (get 'flies 'time) :present
      (get 'ate 'verb) t
      (get 'ate 'time) :past)

And later, you can write rules such as:

(if (and (get word1 'noun)
         (get word2 'verb)
         (get word3 'noun))
    (funcall (symbol-function word2) word2 word1 word3))

Of course, you can implement all this in your own data structures, but
you have it all done, optimized and debugged in the lisp system.
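
As a self-contained sketch of such a rule: the handler ATE, the TENSE indicator and the function INTERPRET below are hypothetical names chosen for illustration. It uses GET, the standard accessor for a symbol's property list, and stores the verb's handler in the verb symbol's function slot:

```lisp
;; Lexicon facts stored on symbol property lists.
(setf (get 'cat 'noun) t
      (get 'fish 'noun) t
      (get 'ate 'verb) t
      (get 'ate 'tense) :past)

;; Hypothetical handler living in the function slot of the verb symbol.
(defun ate (verb subject object)
  (list :action verb :tense (get verb 'tense)
        :agent subject :patient object))

(defun interpret (word1 word2 word3)
  "Apply the noun-verb-noun rule; return NIL when it does not match."
  (when (and (get word1 'noun)
             (get word2 'verb)
             (get word3 'noun))
    (funcall (symbol-function word2) word2 word1 word3)))

(interpret 'cat 'ate 'fish)
;; => (:ACTION ATE :TENSE :PAST :AGENT CAT :PATIENT FISH)
(interpret 'cat 'fish 'ate)   ; => NIL, FISH is not marked as a verb
```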

> Furthermore, whenever it is required to analyze a word (to find the vowels
> etc.), one can use the whole arsenal of string handling functions while it
> seems not as easy in the case of symbols.
> I'm sure I don't see something very important.

The point is to do SYMBOL processing, not phoneme or letter (string)
processing.

But in any case, most string functions in Common Lisp accept symbols as
well:

[58]> (string= 'foobar "FOOBAR")
T

or it's otherwise quite easy to get to the characters of a symbol:

[59]> (char (string 'foobar) 4)
#\A

and this should not be much slower either:

[69]> (time (dotimes (i 1000) (char (string 'foobar) 4)))
Real time: 0.002269 sec.
Run time: 0.0 sec.
Space: 716 Bytes
NIL
[73]> (time (dotimes (i 1000) (char "foobar" 4)))
Real time: 0.002038 sec.
Run time: 0.01 sec.
Space: 716 Bytes
NIL

> Thanks for any help (and maybe a few simple code examples) on this,

--
__Pascal_Bourguignon__
http://www.informatimago.com/
Do not adjust your mind, there is a fault in reality.

Marco Antoniotti

Oct 3, 2003, 2:03:25 PM

Equality via 'strcmp' or 'string=' is an O(n) operation.

Equality on symbols is always an O(1) operation.

Does this answer your question?

Note that this is the reason why Java has String.intern().
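
A rough sketch of how that plays out, assuming the words arrive as strings (WORDS->SYMBOLS is a made-up helper): the O(n) work on characters is paid once in INTERN, and every later comparison is a pointer test.

```lisp
(defun words->symbols (words)
  "Intern each word string once; this is where the characters get examined."
  (mapcar (lambda (w) (intern (string-upcase w))) words))

(defparameter *tokens*
  (words->symbols '("fruit" "flies" "like" "fruit")))

;; After interning, each comparison is a single identity test.
(count 'fruit *tokens* :test #'eq)   ; => 2
```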

Cheers
--
Marco

Johan Kullstam

Oct 3, 2003, 2:45:19 PM

Pascal Bourguignon <sp...@thalassa.informatimago.com> writes:

> "Ladvánszky Károly" <a...@bb.cc> writes:
>
> > Please explain to me why the use of symbols (as in Lisp) is more powerful than
> > using strings for the same tasks? I understand it is faster to access
> > symbols than strings but that is certainly not the main factor. Let me
> > explain my question a bit. Symbol processing languages are often mentioned
> > as suitable tools for natural language processing. Now, why is it better to
> > use symbols to formulate a statement like (verb run) for instance
> > than using

I think that usually you don't want just the string itself, but you
want to associate stuff with the string. This gives two main options:
1) use SYMBOLS, have the string (with case smashed) be the name, and
   use the property list to store your associations
2) use strings and a data structure indexed by them, such as
   association lists or hash tables.

For symbols, as far as I've seen, people generally use the property
list of the symbol. Other symbol slots like value or function are
possible, but since there is only one of each per symbol you run the
risk of conflict. You have at least two languages: 1) the language you
wish to parse and 2) Lisp. E.g., if you want to put something in
the function slot, what do you do with the symbol CAR?

You can make an EQUAL hash table and use the string as the key/index.
It seems viable on the surface, and I think the reasons it's not used are
1) speed - Lisp generally handles symbols pretty quickly
2) convenience - Lisp has a number of symbol handling features, e.g.,
   the Lisp symbol handling system (reader &c) smashes case and has some
   sort of symbol-name hashtable-like entity built in.
3) tradition - it's the way it's always been done
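
To make the two options concrete, here is a small sketch (the table name *LEXICON* and the keyword indicators are assumptions for illustration). Both store the same facts; the difference is whether the key is a string that must be hashed on every access or a symbol that already is the record.

```lisp
;; Option 2: strings as keys in an EQUAL hash table; each value is a plist.
(defparameter *lexicon* (make-hash-table :test #'equal))
(setf (getf (gethash "fly" *lexicon*) :noun) t
      (getf (gethash "fly" *lexicon*) :verb) t)
(getf (gethash "fly" *lexicon*) :verb)   ; => T

;; Option 1: the symbol itself carries the property list.
(setf (get 'fly 'noun) t
      (get 'fly 'verb) t)
(get 'fly 'verb)                          ; => T
```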

> > ("verb" "run") which could be done in languages with no symbol data type.
>
> Symbols come with a predefined data structure. They are naturally unique:
>
> (eq 'foo 'foo) is T
> while (eq "foo" "foo") is not ensured and generally NIL.

This seems hardly fair. Why would you compare strings with EQ?

> They're stored in a data structure generally optimized for access
> time (a hash table for example).

> When you're parsing words, it may be useful to note that two tokens
> are the same word.
>
>
> Symbols have slots:
>
> - a value slot (symbol-value 'foo) [usually written: foo]
>
> - a function slot (symbol-function 'foo) [usually written: (foo
> args...)]

As I mentioned above, there may be some annoyance in putting stuff in
the function slot of, e.g., CAR.

> - a name slot (symbol-name 'foo) [which returns the string "FOO"]
>
> - a property-list slot (symbol-plist 'foo)
>   where user programs can attach interesting facts about these symbols,
>   for example:
> (setf (get 'cat 'noun) t
>       (get 'fly 'noun) t
>       (get 'fly 'verb) t
>       (get 'flies 'verb) t
>       (get 'flies 'time) :present
>       (get 'ate 'verb) t
>       (get 'ate 'time) :past)

The property list seems to be the crux.

--
Johan KULLSTAM <kulls...@comcast.net> sysengr

Pascal Bourguignon

Oct 3, 2003, 4:54:38 PM

Johan Kullstam <kulls...@comcast.net> writes:
> > (eq 'foo 'foo) is T
> > while (eq "foo" "foo") is not ensured and generally NIL.
>
> This seems hardly fair. Why would you compare strings with EQ?

Fair to ask; I left it unsaid: eq is O(1), while equal, string=,
string-equal, etc. are O(n), with n=(length string) in the case of strings.

> As I mentioned above, there may be some annoyance in putting stuff in
> the function slot of, e.g., CAR.

Of course, but then you can put your symbols in a private package.
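
A minimal sketch of that idea, assuming a hypothetical package named MY-WORDS that uses no other package, so a word spelled CAR gets its own symbol with a free function slot:

```lisp
;; Hypothetical package; (:use) means nothing is inherited from COMMON-LISP.
(defpackage :my-words (:use))

;; MY-WORDS::CAR is a different symbol from CL:CAR.
(defparameter *word* (intern "CAR" :my-words))

(eq *word* 'car)      ; => NIL
(fboundp *word*)      ; => NIL, its function slot is still empty

;; Attach a made-up handler to the word without touching CL:CAR.
(setf (symbol-function *word*)
      (lambda (&rest args) (cons :vehicle args)))

(funcall *word* 1 2)  ; => (:VEHICLE 1 2)
(car '(1 2))          ; => 1, the standard function is untouched
```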

Barry Margolin

Oct 3, 2003, 6:22:36 PM

In article <9cbd760a7fa514ea...@news.meganetnews.com>,

While the other answers in this thread are very good, a big part of the
answer is simply historical.

Early Lisp dialects didn't have the plethora of built-in data types that
we're accustomed to. They generally had only symbols, conses, numbers
(perhaps only integers), and sometimes arrays. What most of them *didn't*
have were strings. Even Maclisp, one of the immediate predecessors to
Common Lisp, didn't have strings as a primitive type, although there was a
kludge involving hunks (an extension of conses that supported any power of
2 number of cells). They also didn't have hash tables, except internally
to support the obarray (the table used to keep track of interned symbols --
there were no named packages).

So if you wanted to process words, the alternatives were generally to write
lots of code to implement strings and hash tables, or to use symbols. The
property list of symbols was provided to support the latter style of
programming.

Nowadays, we have hash tables and strings, so it's about as easy to use
them for this type of application. As a result, symbols are mostly used
only for the identifiers in programs. However, some of the performance
issues that have been mentioned are still relevant. When you look up a
string in a hash table, the hash function may have to examine quite a few
bytes to perform well, and when it's searching the bucket it has to do a
string comparison with each candidate.

Of course, this also has to be done by INTERN when using symbols. The
difference is that this is typically done just once, when the user is
typing. If you program with strings and hash tables, you need to design it
so that it doesn't have to do repeated lookups to get comparable performance.
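
One way to sketch that design, under the assumption of a hypothetical ENTRY record and LOOKUP-WORD function standing in for symbols and INTERN: do the string lookup once while reading the input, then pass the record around and compare by identity.

```lisp
;; Hypothetical record playing the role of a symbol.
(defstruct entry name (props '()))

(defparameter *table* (make-hash-table :test #'equal))

(defun lookup-word (string)
  "Return the unique ENTRY for STRING, creating it on first use (like INTERN)."
  (or (gethash string *table*)
      (setf (gethash string *table*) (make-entry :name string))))

;; Hash and compare the characters once, while reading the input ...
(defparameter *sentence*
  (mapcar #'lookup-word '("the" "cat" "saw" "the" "dog")))

;; ... then the rest of the program compares entries by identity.
(eq (first *sentence*) (fourth *sentence*))   ; => T
(eq (first *sentence*) (second *sentence*))   ; => NIL
```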

--
Barry Margolin, barry.m...@level3.com
Level(3), Woburn, MA
*** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups.
Please DON'T copy followups to me -- I'll assume it wasn't posted to the group.

Kaz Kylheku

Oct 3, 2003, 7:23:56 PM

"Ladvánszky Károly" <a...@bb.cc> wrote in message news:<9cbd760a7fa514ea...@news.meganetnews.com>...

> Please explain to me why the use of symbols (as in Lisp) is more powerful than
> using strings for the same tasks?

That depends on how you use the strings. If you use strings as keys to
look up objects, then you are doing symbolic processing---a clumsy,
inefficient form of it where you are writing extra code perhaps, and
doing interning at run time.

If you use actual character manipulation as part of your
algorithm---for example, rewriting syntax at the character
representation level rather than parse tree representation level,
that's even worse, much worse.

> I understand it is faster to access
> symbols than strings but that is certainly not the main factor. Let me
> explain my question a bit. Symbol processing languages are often mentioned
> as suitable tools for natural language processing. Now, why is it better to
> use symbols to formulate a statement like (verb run) for instance than using
> ("verb" "run") which could be done in languages with no symbol data type.

Why it's better is that "verb" and "run" are just character containers
with no other properties. Symbols are useful not only because they
allow fast identity comparisons, but because you can associate them
with arbitrary properties! Symbols *denote* things through these
associations.

To build an association between "verb" or "run" and some other data,
you can't do that directly; you have to use some kind of hash table
which you can query using "verb" as a key to get to some object which
holds the associations.

Once you do that, you have re-implemented symbols from scratch, minus
the first-class support in the language.

> Furthermore, whenever it is required to analyze a word (to find the vowels
> etc.), one can use the whole arsenal of string handling functions while it
> seems not as easy in the case of symbols.

When you are analyzing words, you are recognizing that they have a
finer linguistic structure: they are not *atoms*! In this context,
therefore, it may be inappropriate to represent a word as a symbol; a
word should be represented as a non-atomic structure, such as a nested
list, which can express the idea that there are morphemes and
phonemes---or syllables if you are dealing purely with orthography.

You may still need the character-based analysis when the input to your
program is a character string representing a word. This is where you
apply whatever hacks are appropriate to decipher the structure; the
rest of the program can deal with a much more convenient data
structure from which the printed word can be recovered when it's
necessary to produce output understood in the outside world.
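
As a toy illustration of that split (the rule for a trailing "s" is deliberately naive, and ANALYZE-WORD and WORD-STRING are made-up names): character-level code runs once at the boundary, the rest of the program sees a list of symbols, and the printed word can be rebuilt on output.

```lisp
(defun analyze-word (string)
  "Split STRING into a list of morpheme symbols using a naive rule for a final S."
  (let ((n (length string)))
    (if (and (> n 1) (char-equal (char string (1- n)) #\s))
        (list (intern (string-upcase (subseq string 0 (1- n)))) '-s)
        (list (intern (string-upcase string))))))

(defun word-string (analysis)
  "Recover the printed form of a word from its list of morpheme symbols."
  (string-downcase
   (format nil "~{~A~}"
           (mapcar (lambda (m) (string-left-trim "-" (symbol-name m)))
                   analysis))))

(analyze-word "cats")     ; => (CAT -S)
(analyze-word "dog")      ; => (DOG)
(word-string '(cat -s))   ; => "cats"
```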

Robert Klemme

Oct 5, 2003, 5:24:41 PM

"Pascal Bourguignon" <sp...@thalassa.informatimago.com> wrote in
message news:871xtuj...@thalassa.informatimago.com...

> "Ladvánszky Károly" <a...@bb.cc> writes:
> - a property-list slot (symbol-plist 'foo)
>   where user programs can attach interesting facts about these symbols,
>   for example:
> (setf (get 'cat 'noun) t
>       (get 'fly 'noun) t
>       (get 'fly 'verb) t
>       (get 'flies 'verb) t
>       (get 'flies 'time) :present
>       (get 'ate 'verb) t
>       (get 'ate 'time) :past)
>
> And later, you can write rules such as:
>
> (if (and (get word1 'noun)
>          (get word2 'verb)
>          (get word3 'noun))
>     (funcall (symbol-function word2) word2 word1 word3))

But isn't this likely to cause problems? If it's the usual practice to
associate data with a symbol via its property list, clashes for common names
seem likely to me, i.e. one part of an application could use property 'color
with a boolean value to denote whether the symbol denotes a color, while
another part could use 'color with another symbol ('blue, 'red...) to mean
the color of the item.

Kind regards

robert

Matthew Danish

Oct 5, 2003, 5:41:58 PM

On Sun, Oct 05, 2003 at 11:24:41PM +0200, Robert Klemme wrote:
> But isn't this likely to cause problems? If it's the usual practice to
> associate data with a symbol via its property list, clashes for common names
> seem likely to me.

Packages were invented to solve precisely this problem. Though, these
days it seems that using symbol property lists is not so common,
although valid.
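
A small sketch of how packages keep the two uses apart, assuming hypothetical packages PALETTE and PRODUCE, each owning its own symbol named COLOR as the property indicator:

```lisp
;; Hypothetical package names; (:use) keeps them empty.
(defpackage :palette (:use))
(defpackage :produce (:use))

;; Two distinct indicator symbols that share the name COLOR.
(defparameter *is-a-color* (intern "COLOR" :palette))  ; PALETTE::COLOR
(defparameter *has-color*  (intern "COLOR" :produce))  ; PRODUCE::COLOR

(setf (get 'orange *is-a-color*) t)        ; one module: ORANGE denotes a color
(setf (get 'orange *has-color*) 'orange)   ; another module: an orange is orange

(get 'orange *is-a-color*)     ; => T
(get 'orange *has-color*)      ; => ORANGE
(eq *is-a-color* *has-color*)  ; => NIL, so the two properties never clash
```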

--
; Matthew Danish <mda...@andrew.cmu.edu>
; OpenPGP public key: C24B6010 on keyring.debian.org
; Signed or encrypted mail welcome.
; "There is no dark side of the moon really; matter of fact, it's all dark."

Thomas F. Burdick

Oct 5, 2003, 7:58:05 PM

Matthew Danish <mda...@andrew.cmu.edu> writes:

> On Sun, Oct 05, 2003 at 11:24:41PM +0200, Robert Klemme wrote:
> > But isn't this likely to cause problems? If it's the usual practice to
> > associate data with a symbol via its property list, clashes for common names
> > seem likely to me.
>
> Packages were invented to solve precisely this problem. Though, these
> days it seems that using symbol property lists is not so common,
> although valid.

It's true that it's not so common to use a symbol's property list
anymore, but I personally wonder if it shouldn't be used more. I see
semi-frequent grumblings from people who want both portability *and*
key-weak hash tables. That's exactly when I use symbol plists, if I
can -- once the symbol's gone, *poof* so's the reference.

--
/|_ .-----------------------.
,' .\ / | No to Imperialist war |
,--' _,' | Wage class war! |
/ / `-----------------------'
( -. |
| ) |
(`-. '--.)
`. )----'

Nils Goesche

Oct 6, 2003, 7:34:50 PM

t...@famine.OCF.Berkeley.EDU (Thomas F. Burdick) writes:

> Matthew Danish <mda...@andrew.cmu.edu> writes:
>
> > Packages were invented to solve precisely this problem.
> > Though, these days it seems that using symbol property lists
> > is not so common, although valid.
>
> It's true that it's not so common to use a symbol's property
> list anymore, but I personally wonder if it shouldn't be used
> more. I see semi-frequent grumblings from people who want both
> portability *and* key-weak hash tables. That's exactly when I
> use symbol plists, if I can -- once the symbol's gone, *poof*
> so's the reference.

Heh. Another situation where I use symbol plists is in FFI: When
the C header file says

#define FOO 42

I'll sometimes have a symbol constant FOO (usually not +FOO+,
don't ask me why) evaluating to itself, but with a secret C-VALUE
property of 42. That way, only the library code actually calling
the C function knows the numerical value and it prints nicely.
And I don't need no silly hash table.
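
Roughly, that trick might look like this (the helper C-VALUE is a hypothetical name, and no particular FFI is assumed):

```lisp
;; The constant evaluates to its own name, so it prints as FOO.
(defconstant foo 'foo)

;; The numeric value from the C header is kept on the property list.
(setf (get 'foo 'c-value) 42)

(defun c-value (constant)
  "Hypothetical helper, meant to be called only at the FFI boundary."
  (or (get constant 'c-value)
      (error "~S has no C-VALUE property" constant)))

(c-value foo)           ; => 42
(prin1-to-string foo)   ; => "FOO"
```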

Regards,
--
Nils Gösche
"Don't ask for whom the <CTRL-G> tolls."

PGP key ID #xD26EF2A0

Ladvánszky Károly

Oct 7, 2003, 4:03:18 AM

Thanks to everyone who sent an answer to my question about strings versus
symbols. All the answers have been very helpful. Special thanks to Kaz
Kylheku for his clear, comprehensive explanation.