Newsgroups: comp.lang.lisp
From: Sam Steingold <s...@gnu.org>
Date: Thu, 20 Sep 2012 15:16:01 -0400
Subject: Re: What's more efficient? Strings or symbols?

> * Barry Margolin <one...@nyhz.zvg.rqh> [2012-09-20 14:22:30 -0400]:

> In article <87ehlwzqne....@gnu.org>, Sam Steingold <s...@gnu.org> wrote:

>> > * proton <yrbfnen...@tznvy.pbz> [2012-09-20 06:13:52 -0700]:

>> > I have a huge amount of text (more than 2 TB) that I want to process
>> > to create a database. Basically, I read this text from disk, I split
>> > it into words, do some processing, and store it in a hash-table or in
>> > a file.

>> > My question is: what is the most efficient way to treat the words, as
>> > strings or as symbols? I am interested in possible issues due to the
>> > huge number of words (about 700K), will the string space run out
>> > faster than the symbol space, will it be GCed properly once a word is
>> > not used, etc...

>> It all depends on what you do with the text.

>> 1. Space: a symbol is more expensive than a string (see Barry's
>> message), but if you have many repeated words (as you certainly do, 700k
>> in 2T), it would be a huge saving to have just one symbol FOO instead of
>> a zillion strings "FOO".  Strings are GCed as soon as you lose them,
>> while symbols have to be uninterned from their package first (thus I
>> suggest that you place your symbols into a separate package which you
>> can then summarily delete when not needed).

> He said that he was going to put the strings in a hash table, so there
> will just be one of each equivalent string.  Symbols in packages and
> strings in a hash table are pretty much isomorphic.

Of course.
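
For concreteness, the separate-package suggestion might look like the
sketch below. The package name WORDS and the helper names are made up
for illustration; nothing here is specific to any implementation.

```lisp
;; Sketch only: the package name "WORDS" and both helpers are hypothetical.
(defvar *words*
  (or (find-package "WORDS")
      (make-package "WORDS" :use '()))   ; inherit nothing, not even CL
  "Package holding one symbol per distinct word.")

(defun word-symbol (string)
  "Return the unique symbol in *WORDS* whose name is STRING."
  (values (intern string *words*)))

(defun forget-words ()
  "Delete the package so its symbols can be collected once unreferenced."
  (delete-package *words*))
```

Creating the package with an empty use list means a word such as "NIL"
or "CAR" gets its own fresh symbol rather than resolving to the one in
COMMON-LISP.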

>> 2. Strings are compared character-by-character (EQUAL) while
>> symbols are compared as pointers (EQ). This could be big.

> But INTERN has to compare by character when it's looking the symbol up
> in the hash table.

Of course.
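
A small sketch of the distinction, assuming a scratch package named
EQ-DEMO (hypothetical): two separately allocated strings with the same
characters are EQUAL but not EQ, while interning either one yields the
same symbol, after INTERN has done the character comparison once.

```lisp
;; Sketch only: EQ-DEMO and COMPARE-DEMO are hypothetical names.
(defvar *demo*
  (or (find-package "EQ-DEMO")
      (make-package "EQ-DEMO" :use '())))

(defun compare-demo ()
  "Return a list of three comparison results for two fresh copies of \"foo\"."
  (let ((s1 (copy-seq "foo"))
        (s2 (copy-seq "foo")))
    (list (eq s1 s2)                                   ; distinct objects
          (equal s1 s2)                                ; same characters
          (eq (intern s1 *demo*) (intern s2 *demo*))))) ; one symbol

(compare-demo)   ; => (NIL T T)
```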

>> 3. You will be associating some information with each word, right?  Use
>> symbols and put the information into the value slot; you will save huge
>> on access time compared with hash tables.

> How do you think INTERN finds the symbol?  The package is mostly just
> a hash table.

Of course.

>> 4. Reading a symbol is more expensive than reading a string because you
>> have to intern it. If you do a lot of i/o but little processing, symbols
>> are not for you.

> But if he's going to look up the string in a hash table, that's
> equivalent to interning it.

Of course - but suppose he ignores the strings shorter than, say, 2
characters?  To avoid auto-interning such strings by READ, he would have
to read them as strings and then decide whether to intern them.
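
Something along these lines, as a rough sketch. The function name, the
package name FILTERED-WORDS and the MIN-LENGTH keyword are all
hypothetical; the threshold of 2 is just the example figure above.

```lisp
;; Sketch only: every name here is made up for illustration.
(defvar *filtered-words*
  (or (find-package "FILTERED-WORDS")
      (make-package "FILTERED-WORDS" :use '())))

(defun maybe-intern-word (string &key (package *filtered-words*)
                                      (min-length 2))
  "Intern STRING in PACKAGE unless it is shorter than MIN-LENGTH.
Return the symbol, or NIL when the word is skipped."
  (when (>= (length string) min-length)
    (values (intern string package))))
```

A skipped word never reaches INTERN, so it leaves no symbol behind in
the package.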

My point was that symbols would be more _convenient_.

Technically, he would either use

string --[hash table]--> data structure

or

string --[package]--> symbol --> data structure in the symbol value slot

The second is, IMO, easier on the coding load: there would be less need
for a PRINT-OBJECT method :-)
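
To make the two shapes concrete, here is a sketch that uses a plain
occurrence count as a stand-in for the "data structure" (that choice,
and all the names below, are my assumptions, not anything the original
poster specified):

```lisp
;; Sketch only: counts stand in for whatever per-word data is needed.

;; Design 1: string --[hash table]--> data structure
(defvar *counts* (make-hash-table :test #'equal))

(defun count-word/table (string)
  "Increment and return the count stored under STRING in *COUNTS*."
  (incf (gethash string *counts* 0)))

;; Design 2: string --[package]--> symbol --> value slot
(defvar *word-package*
  (or (find-package "WORD-COUNTS")
      (make-package "WORD-COUNTS" :use '())))

(defun count-word/symbol (string)
  "Increment and return the count kept in the value slot of STRING's symbol."
  (let ((symbol (intern string *word-package*)))
    (unless (boundp symbol)              ; freshly interned symbols are unbound
      (setf (symbol-value symbol) 0))
    (incf (symbol-value symbol))))
```

In the second design the symbol returned by INTERN can be handed to
other functions, which then reach the data through SYMBOL-VALUE without
another lookup by name.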

Also, using symbols will keep him from the common error of multiple
lookups (e.g., he will not be able to pass the string around so that
different functions look it up in the hash table separately).

At any rate, I agree that the difference is more a matter of
convenience and aesthetics than anything else.

--
Sam Steingold (http://sds.podval.org/) on Ubuntu 12.04 (precise) X 11.0.11103000
http://www.childpsy.net/ http://americancensorship.org http://think-israel.org
http://memri.org http://thereligionofpeace.com http://ffii.org
Perl: all stupidities of UNIX in one.


 