Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

What's more efficient? Strings or symbols?

183 views
Skip to first unread message

proton

unread,
Sep 20, 2012, 9:13:52 AM9/20/12
to
Hi all,

I have a huge amount of text (more than 2 TB) that I want to process to create a database. Basically, I read this text from disk, I split it into words, do some processing, and store it in a hash-table or in a file.

My question is: what is the most efficient way to treat the words, as strings or as symbols? I am interested in possible issues due to the huge number of words (about 700K), will the string space run out faster than the symbol space, will it be GCed properly once a word is not used, etc...

I would appreciate any suggestions on this.

Thanks in advance.

Barry Margolin

unread,
Sep 20, 2012, 9:59:00 AM9/20/12
to
In article <7b44215a-7abd-4ac2...@googlegroups.com>,
Symbols are typically implemented as a structure something like:

(defstruct symbol
(name :type string)
(package :type package)
(plist :type list)
value
function)

So since the symbol is a structure + string, while a string is just a
string, the latter is obviously more space efficient.

Garbage collection works the same for all data type: if there are no
references to an object, it gets collected.

--
Barry Margolin, bar...@alum.mit.edu
Arlington, MA
*** PLEASE post questions in newsgroups, not directly to me ***

Sam Steingold

unread,
Sep 20, 2012, 10:59:33 AM9/20/12
to
> * proton <yrbfn...@tznvy.pbz> [2012-09-20 06:13:52 -0700]:
>
> I have a huge amount of text (more than 2 TB) that I want to process
> to create a database. Basically, I read this text from disk, I split
> it into words, do some processing, and store it in a hash-table or in
> a file.
>
> My question is: what is the most efficient way to treat the words, as
> strings or as symbols? I am interested in possible issues due to the
> huge number of words (about 700K), will the string space run out
> faster than the symbol space, will it be GCed properly once a word is
> not used, etc...

It all depends on what you do with the text.

1. Space: a symbol is more expensive than a string (see Barry's
message), but if you have many repeated words (as you certainly do, 700k
in 2T), it would be a huge saving to have just one symbol FOO instead of
a zillion strings "FOO". Strings are GCed as soon as you lose them,
while symbols have to be uninterned from their package first (thus I
suggest that you place your symbols into a separate package which you
can then summarily delete when not needed).

2. Strings are compared character-by-character (EQUAL) while
symbols are compared as pointers (EQ). This could be big.

3. You will be associating some information with each word, right? Use
symbols and put the information into the value slot; you will save huge
on access time compared with hash tables.

4. Reading a symbol is more expensive than reading a string because you
have to intern it. If you do a lot of i/o but little processing, symbols
are not for you.

In short, you have to decide for yourself and maybe experiment a little,
but for anything other than proof of concept you will probably need symbols.

--
Sam Steingold (http://sds.podval.org/) on Ubuntu 12.04 (precise) X 11.0.11103000
http://www.childpsy.net/ http://honestreporting.com
http://memri.org http://think-israel.org http://openvotingconsortium.org
Any supplier that makes enough to pay a full time lobbyist is overcharging.

Pascal J. Bourguignon

unread,
Sep 20, 2012, 11:04:35 AM9/20/12
to
proton <leosa...@gmail.com> writes:

> Hi all,
>
> I have a huge amount of text (more than 2 TB) that I want to process
> to create a database. Basically, I read this text from disk, I split
> it into words, do some processing, and store it in a hash-table or in
> a file.
>
> My question is: what is the most efficient way to treat the words, as
> strings or as symbols? I am interested in possible issues due to the
> huge number of words (about 700K), will the string space run out
> faster than the symbol space,

Only toy implementations distinguish memory spaces for different types.
An out of memory situation will apply to any type of object at the same
time.

Given the average length of words is usually 5 characters, assuming an
implementation supporting unicode therefore using 32-bit per character,
assuming one more 32-bit overhead, 700K words will require
(* 700 1024 6 4) 17,203,200 bytes of core. An insignificant amount of
memory.

Now, the problem you have is that you want to avoid just loading the 2
TB of text in core memory, but instead unify the words. That's what
symbols offer you, with the INTERN function. As mentionned by Barry,
there may be some overhead on symbols. 5 or 6 words if something like
defstruct is used, but perhaps less (implementation may implement
symbols differently, amortizing some of the common or unused slots).
Anyways, assuming an overhead for symbols of 6 words, that's (* 700 1024
6 4) = 17,203,200 more bytes (plus some more for the hash-table in the
package where those symbols are interned), again an insignificant amount.

So, using symbols to represent words is a perfectly good solution.
Symbols were designed for that, and are used a lot to represent words in
NLP software written in lisp. Other advantages are that you can easily
add attributes (using the symbol-plist of the symbol-function) to the
words to help parsing or understanding, or other NLP algorithms.


So already, we're discussing O(48n) = O(n) space. We could do O(24n) by
using and unifying strings, but that'd still just be O(n) space, using
something like:


(defvar *strings* (make-hash-table :test (function equal)))

(defun string-intern (string)
(or (gethash string *strings*)
(setf (gethash string *strings*) string)))


(loop while (word-available-p input)
collect (string-intern (parse-word input)))


But if you need to add attributes to those "words", then you'll have to
define a structure or a class yourself, or otherwise add the overhead
that's already provided by symbols.


But you can use tries to store words more efficiently: all the words
starting by #\a share this initial letter. You could store it only
once, and then store in the children only the rest of the words starting
by #\a. And so on. Therefore having on the order of O(nlogn) space
requirements for words. Well, if you're careful, but notice now that a
word is represented by m characters and m pointers (only some of the
characters and pointers are shared with other words).

write 5
written 7
---
12

writ-e 4+1
\ten +3
----
8

http://en.wikipedia.org/wiki/Trie
https://www.google.com/search?q=cl+tries


> will it be GCed properly once a word is
> not used, etc...

Yes, garbage collectors work correctly.

--
__Pascal Bourguignon__ http://www.informatimago.com/
A bad day in () is better than a good day in {}.

Barry Margolin

unread,
Sep 20, 2012, 2:22:30 PM9/20/12
to
In article <87ehlwz...@gnu.org>, Sam Steingold <s...@gnu.org> wrote:

> > * proton <yrbfn...@tznvy.pbz> [2012-09-20 06:13:52 -0700]:
> >
> > I have a huge amount of text (more than 2 TB) that I want to process
> > to create a database. Basically, I read this text from disk, I split
> > it into words, do some processing, and store it in a hash-table or in
> > a file.
> >
> > My question is: what is the most efficient way to treat the words, as
> > strings or as symbols? I am interested in possible issues due to the
> > huge number of words (about 700K), will the string space run out
> > faster than the symbol space, will it be GCed properly once a word is
> > not used, etc...
>
> It all depends on what you do with the text.
>
> 1. Space: a symbol is more expensive than a string (see Barry's
> message), but if you have many repeated words (as you certainly do, 700k
> in 2T), it would be a huge saving to have just one symbol FOO instead of
> a zillion strings "FOO". Strings are GCed as soon as you lose them,
> while symbols have to be uninterned from their package first (thus I
> suggest that you place your symbols into a separate package which you
> can then summarily delete when not needed).

He said that he was going to put the strings in a hash table, so there
will just be one of each equivalent string. Symbols in packages and
strings in a hash table are pretty much isomorphic.

>
> 2. Strings are compared character-by-character (EQUAL) while
> symbols are compared as pointers (EQ). This could be big.

But INTERN has to compare by character when it's looking the symbol up
in the hash table.

>
> 3. You will be associating some information with each word, right? Use
> symbols and put the information into the value slot; you will save huge
> on access time compared with hash tables.

How do you think INTERN finds the symbol? The package is mostly just a
hash table.

> 4. Reading a symbol is more expensive than reading a string because you
> have to intern it. If you do a lot of i/o but little processing, symbols
> are not for you.

But if he's going to look up the string in a hash table, that's
equivalent to interning it.

Barry Margolin

unread,
Sep 20, 2012, 2:25:07 PM9/20/12
to
In article <877groe...@kuiper.lan.informatimago.com>,
"Pascal J. Bourguignon" <p...@informatimago.com> wrote:

> Only toy implementations distinguish memory spaces for different types.
> An out of memory situation will apply to any type of object at the same
> time.

Lisp Machines had different spaces for different types (I think it
called them "regions"). But if a region filled up, it would normally
just allocate another one (there may have been region-creation options
that disabled this, I don't remember).

Sam Steingold

unread,
Sep 20, 2012, 3:16:01 PM9/20/12
to
> * Barry Margolin <one...@nyhz.zvg.rqh> [2012-09-20 14:22:30 -0400]:
>
> In article <87ehlwz...@gnu.org>, Sam Steingold <s...@gnu.org> wrote:
>
>> > * proton <yrbfn...@tznvy.pbz> [2012-09-20 06:13:52 -0700]:
>> >
>> > I have a huge amount of text (more than 2 TB) that I want to process
>> > to create a database. Basically, I read this text from disk, I split
>> > it into words, do some processing, and store it in a hash-table or in
>> > a file.
>> >
>> > My question is: what is the most efficient way to treat the words, as
>> > strings or as symbols? I am interested in possible issues due to the
>> > huge number of words (about 700K), will the string space run out
>> > faster than the symbol space, will it be GCed properly once a word is
>> > not used, etc...
>>
>> It all depends on what you do with the text.
>>
>> 1. Space: a symbol is more expensive than a string (see Barry's
>> message), but if you have many repeated words (as you certainly do, 700k
>> in 2T), it would be a huge saving to have just one symbol FOO instead of
>> a zillion strings "FOO". Strings are GCed as soon as you lose them,
>> while symbols have to be uninterned from their package first (thus I
>> suggest that you place your symbols into a separate package which you
>> can then summarily delete when not needed).
>
> He said that he was going to put the strings in a hash table, so there
> will just be one of each equivalent string. Symbols in packages and
> strings in a hash table are pretty much isomorphic.

Of course.

>> 2. Strings are compared character-by-character (EQUAL) while
>> symbols are compared as pointers (EQ). This could be big.
>
> But INTERN has to compare by character when it's looking the symbol up
> in the hash table.

Of course.

>> 3. You will be associating some information with each word, right? Use
>> symbols and put the information into the value slot; you will save huge
>> on access time compared with hash tables.
>
> How do you think INTERN finds the symbol? The package is mostly just
> a hash table.

Of course.

>> 4. Reading a symbol is more expensive than reading a string because you
>> have to intern it. If you do a lot of i/o but little processing, symbols
>> are not for you.
>
> But if he's going to look up the string in a hash table, that's
> equivalent to interning it.

Of course - but suppose he ignores the strings shorter than, say, 2
characters? to avoid auto-interning such strings by READ, he would have
to read them as strings and then decide whether to intern them.

My point was that symbols would be more _convenient_.

Technically, he would either use

string --[hash table]--> data structure

or

string --[package]--> symbol --> data structure in the symbol value slot

The second is, IMO, easier on the coding load: there would be less need
for a PRINT-OBJECT method :-)

Also, using symbols will prevent him from a common error of multiple
lookups (e.g., he will not be able to pass the string around so that
different functions will look it up in the hash table separately).

At any rate, I agree that the difference is more
convenience/aesthetic than anything else.

--
Sam Steingold (http://sds.podval.org/) on Ubuntu 12.04 (precise) X 11.0.11103000
http://www.childpsy.net/ http://americancensorship.org http://think-israel.org
http://memri.org http://thereligionofpeace.com http://ffii.org
Perl: all stupidities of UNIX in one.

Kaz Kylheku

unread,
Sep 20, 2012, 5:05:06 PM9/20/12
to
On 2012-09-20, proton <leosa...@gmail.com> wrote:
> Hi all,
>
> I have a huge amount of text (more than 2 TB) that I want to process to
> create a database. Basically, I read this text from disk, I split it into
> words, do some processing, and store it in a hash-table or in a file.

Symbols will give you some advantages, while retaining most of the advantages
of a string representation. Symbols have a name which is a string, and
can be treated as strings if need be. For instance the STRING= function accepts
symbols directly. Other functions that work on strings like SUBSEQ will
be less convenient; you have to use SYMBOL-NAME to pull out the string.

The value of symbols is that they can be interned into a package. (Symbols
usually are, except when programs use certain functions, or a special
read syntax, to make uninterned symbols).

Interning into a package ensures that if the same token/string is interned
two or more times into the same package, it produces the same symbol object.
So, for one thing, you do not need a hash table just to collapse identical
words to the same object: the package does that for you.

Secondly, once you have the symbol interned, and you want to attach further
properties to it with additional hash tables, it is much more efficient!
The reason is that because it is a symbol you can use an EQ hash table.
An EQ hashing function on symbols is much more efficient than an EQUAL string
hashing function on strings. Hashing strings requires character-by-character
processing, whereas symbols can be hashed by some simple manipulation of a
single word (a memory address). This can be as simple as taking some middle
bits from the address.

Comparing whether two values are the same word is also much more efficient if
they are symbols, because you just use EQ which compares the pointers. Two
values can be the same symbol if and only if they are just two copies of a
reference to the same object.

> My question is: what is the most efficient way to treat the words, as strings
> or as symbols? I am interested in possible issues due to the huge number of
> words (about 700K), will the string space run out faster than the symbol
> space, will it be GCed properly once a word is not used, etc...

Once a word is not needed, it will certainly not be GC'd if it is an interned
symbol. This is because the package maintains a reference to all of its
interned symbols, and you have a reference to the package. This means you
have to take an extra step.

If you want to use symbols to manipulate data in this way, you have to
become familiar with some functions the CL package system.

When reading data from files and turning it into symbols, it's best to do
the interning in a package which is instantiated for that purpose, rather
than polluting CL-USER or some other package that is used for your code.

When you no longer need a symbol, you can use the UNINTERN function to kick it
out of the package. If you then also drop all other references to the symbol,
then it becomes garbage.

(You have the same issue with strings anyway because you're putting them into
hash tables. A string also won't be GC'd until you remove it from all the
hashes, or drop your references to the hashes themselves.)

Pascal J. Bourguignon

unread,
Sep 20, 2012, 5:49:32 PM9/20/12
to
Kaz Kylheku <k...@kylheku.com> writes:

> When reading data from files and turning it into symbols, it's best to do
> the interning in a package which is instantiated for that purpose, rather
> than polluting CL-USER or some other package that is used for your code.
>
> When you no longer need a symbol, you can use the UNINTERN function to kick it
> out of the package. If you then also drop all other references to the symbol,
> then it becomes garbage.

Or just delete the package.


> (You have the same issue with strings anyway because you're putting them into
> hash tables. A string also won't be GC'd until you remove it from all the
> hashes, or drop your references to the hashes themselves.)

Or just drop the references to the hash-table.

Barry Margolin

unread,
Sep 21, 2012, 12:05:20 AM9/21/12
to
In article <871uhwe...@kuiper.lan.informatimago.com>,
"Pascal J. Bourguignon" <p...@informatimago.com> wrote:

> Kaz Kylheku <k...@kylheku.com> writes:
> > (You have the same issue with strings anyway because you're putting them
> > into
> > hash tables. A string also won't be GC'd until you remove it from all the
> > hashes, or drop your references to the hashes themselves.)
>
> Or just drop the references to the hash-table.

Or use a weak hash table.

Kaz Kylheku

unread,
Sep 21, 2012, 7:15:52 PM9/21/12
to
On 2012-09-20, Pascal J. Bourguignon <p...@informatimago.com> wrote:
> Kaz Kylheku <k...@kylheku.com> writes:
>> hashes, or drop your references to the hashes themselves.)
>
> Or just drop the references to the hash-table.

Redundant.

Pascal J. Bourguignon

unread,
Sep 21, 2012, 8:30:13 PM9/21/12
to
No. What you wrote was:

(maphash (lambda (k v) (setf (gethash k h) nil)) h)

What I wrote was:

(setf h nil)

Quite different things.

Kaz Kylheku

unread,
Sep 21, 2012, 8:49:32 PM9/21/12
to
On 2012-09-22, Pascal J. Bourguignon <p...@informatimago.com> wrote:
> Kaz Kylheku <k...@kylheku.com> writes:
>
>> On 2012-09-20, Pascal J. Bourguignon <p...@informatimago.com> wrote:
>>> Kaz Kylheku <k...@kylheku.com> writes:
>>>> hashes, or drop your references to the hashes themselves.)
>>>
>>> Or just drop the references to the hash-table.
>
> No. What you wrote was:
>
> (maphash (lambda (k v) (setf (gethash k h) nil)) h)

I wrote no such thing, literally or figuratively.

> What I wrote was:
>
> (setf h nil)
>
> Quite different things.

No, "hashes" are hash tables, of course.

(setf h nil) drops a hash.

What you're dropping there are references to the values stored in a hash table
(while neglecting to scrub the keys). The entries in a hash, their values or
their keys are never properly called "hashes".

Hashing function return values are also called hashes, but they make no
no appearance here. E.g. "this file holds password hashes".

Pascal J. Bourguignon

unread,
Sep 21, 2012, 8:55:38 PM9/21/12
to
Kaz Kylheku <k...@kylheku.com> writes:

> On 2012-09-22, Pascal J. Bourguignon <p...@informatimago.com> wrote:
>> Kaz Kylheku <k...@kylheku.com> writes:
>>
>>> On 2012-09-20, Pascal J. Bourguignon <p...@informatimago.com> wrote:
>>>> Kaz Kylheku <k...@kylheku.com> writes:
>>>>> hashes, or drop your references to the hashes themselves.)
>>>>
>>>> Or just drop the references to the hash-table.
>>
>> No. What you wrote was:
>>
>> (maphash (lambda (k v) (setf (gethash k h) nil)) h)
>
> I wrote no such thing, literally or figuratively.
>
>> What I wrote was:
>>
>> (setf h nil)
>>
>> Quite different things.
>
> No, "hashes" are hash tables, of course.

The hashes are not the hash tables, they're the hash codes, the values
returned by SXHASH.


> (setf h nil) drops a hash.
>
> What you're dropping there are references to the values stored in a hash table
> (while neglecting to scrub the keys). The entries in a hash, their values or
> their keys are never properly called "hashes".
>
> Hashing function return values are also called hashes, but they make no
> no appearance here. E.g. "this file holds password hashes".

They're released by (setf (gethash k h) nil).

Kaz Kylheku

unread,
Sep 21, 2012, 9:10:00 PM9/21/12
to
On 2012-09-22, Pascal J. Bourguignon <p...@informatimago.com> wrote:
> Kaz Kylheku <k...@kylheku.com> writes:
>
>> On 2012-09-22, Pascal J. Bourguignon <p...@informatimago.com> wrote:
>>> Kaz Kylheku <k...@kylheku.com> writes:
>>>
>>>> On 2012-09-20, Pascal J. Bourguignon <p...@informatimago.com> wrote:
>>>>> Kaz Kylheku <k...@kylheku.com> writes:
>>>>>> hashes, or drop your references to the hashes themselves.)
>>>>>
>>>>> Or just drop the references to the hash-table.
>>>
>>> No. What you wrote was:
>>>
>>> (maphash (lambda (k v) (setf (gethash k h) nil)) h)
>>
>> I wrote no such thing, literally or figuratively.
>>
>>> What I wrote was:
>>>
>>> (setf h nil)
>>>
>>> Quite different things.
>>
>> No, "hashes" are hash tables, of course.
>
> The hashes are not the hash tables, they're the hash codes, the values
> returned by SXHASH.

Again, you're repeating something I wrote. Well "pre-peating":

>> Hashing function return values are also called hashes

SXHASH has nothing to do with hash tables. For one thing, SXHASH has no
arguments to determine whether you're hashing for EQ, EQL, EQUAL or EQUALP.
It appears to be tied to EQUAL equality.

Yes, I agree that the return value of SXHASH is called "a hash".

A word can be used for two things. For instance "a condition" can be a
situation in a program, a logical state that you can test (what COND
is named after!), or a Lisp condition.

Both Perl and Ruby call hash tables "hashes" in their documentation,
incidentally.

>> Hashing function return values are also called hashes, but they make no
>> no appearance here. E.g. "this file holds password hashes".
>
> They're released by (setf (gethash k h) nil).

No such process of hash code release is documented anywhere, I'm afraid.

Some hash table algorithms store the hash values in the table entries
(such as the one I wrote for Kazlib more than ten years ago) and some do not.

In any case, what you're doing there is overwriting the /value/ with nil.
You are not removing the key from the hash table! That requires remhash.

A Lisp hash table distinguishes between nil being stored under a key, versus
the key not being in the hash table at all. The multiple value returning API
tells you that (as you know, and I know that you know).

And the hash code, if one is stored, belongs to the key, not to the value.
Why would you remove the hash code if the value is being replaced?

Pascal J. Bourguignon

unread,
Sep 21, 2012, 10:33:56 PM9/21/12
to
Kaz Kylheku <k...@kylheku.com> writes:

> Both Perl and Ruby call hash tables "hashes" in their documentation,
> incidentally.

That would seem to be the problem then, you're discussing lisp with perl
vocabulary…


>>> Hashing function return values are also called hashes, but they make no
>>> no appearance here. E.g. "this file holds password hashes".
>>
>> They're released by (setf (gethash k h) nil).
>
> No such process of hash code release is documented anywhere, I'm afraid.
>
> Some hash table algorithms store the hash values in the table entries
> (such as the one I wrote for Kazlib more than ten years ago) and some do not.
>
> In any case, what you're doing there is overwriting the /value/ with nil.
> You are not removing the key from the hash table! That requires remhash.
>
> A Lisp hash table distinguishes between nil being stored under a key, versus
> the key not being in the hash table at all. The multiple value returning API
> tells you that (as you know, and I know that you know).
>
> And the hash code, if one is stored, belongs to the key, not to the value.
> Why would you remove the hash code if the value is being replaced?

Right. I meant to use remhash all along. Sorry for that confusion.

Barry Margolin

unread,
Sep 22, 2012, 2:02:00 AM9/22/12
to
In article <874nmqb...@kuiper.lan.informatimago.com>,
"Pascal J. Bourguignon" <p...@informatimago.com> wrote:

> Kaz Kylheku <k...@kylheku.com> writes:
>
> > Both Perl and Ruby call hash tables "hashes" in their documentation,
> > incidentally.
>
> That would seem to be the problem then, you're discussing lisp with perl
> vocabulary…

I think "hash" as a shorthand for "hash table" is becoming a common
industry term, because of the popularity of these other languages.

Lisp doesn't have an object called a "hash". So if this term comes up,
it's reasonable to use the meaning from the general programming
community.

For many years I resisted "IP" as short for "IP address". Eventually I
gave in -- it's just too pervasive.

Chris Riesbeck

unread,
Sep 24, 2012, 2:00:45 PM9/24/12
to
On 9/22/2012 1:02 AM, Barry Margolin wrote:
>
> For many years I resisted "IP" as short for "IP address". Eventually I
> gave in -- it's just too pervasive.
>

So now you're willing to say IP freely?

http://www.hulu.com/watch/29828
0 new messages