
lisp newbie question: how can I associate properties to list cells?


yang

Nov 27, 2008, 2:44:46 PM
for a symbol, 'a, I can do
(setf (get 'a 'state) 'finish)

so that 'a has a property state with value finish,

I want to know how to do a similar thing for a list: set a property and
value for each cell in the list.
E.g., given a list
(setq lst '(1 2 3 1))

(setf (get (car lst ) 'state) 'finish)
will report an error.

Properties are bound to symbol names rather than to list cells. How can I
specify properties for cells?
thanks

Tamas K Papp

Nov 27, 2008, 2:51:21 PM

As you said, symbols have properties. Values don't (in contrast to, say,
attributes in R). You will have to find a different mechanism for
storing that information. Depending on your needs, the following come to
mind:

- an object with 'value' and 'properties' slots, the latter an assoc list
  or a hash table (faster) -- a minimal sketch follows below
- a structure using the same logic
- a cons cell (ugly)
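
A minimal sketch of the first option, with made-up names (an alist keeps it
short; swap in a hash table if you need speed):

(defclass annotated-value ()
  ((value      :initarg :value :accessor value-of)
   (properties :initform '()   :accessor properties-of)))

(defun property (object key)
  (cdr (assoc key (properties-of object))))

(defun (setf property) (new-value object key)
  (let ((entry (assoc key (properties-of object))))
    (if entry
        (setf (cdr entry) new-value)
        (push (cons key new-value) (properties-of object)))
    new-value))

;; Usage:
;; (defvar *lst* (list (make-instance 'annotated-value :value 1)
;;                     (make-instance 'annotated-value :value 2)))
;; (setf (property (first *lst*) 'state) 'finish)
;; (property (first *lst*) 'state)  ; => FINISH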

HTH,

Tamas

John Thingstad

Nov 27, 2008, 5:24:29 PM


(defparameter *enum* '(:one 1 :two 2 :three 3))

(getf *enum* :one)
> 1
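
For completeness, a plist held in a variable can also be updated in place
with the standard (setf getf) and remf forms; a quick sketch (the list must
be a fresh, mutable one, hence LIST rather than a quoted literal):

(defparameter *enum* (list :one 1 :two 2 :three 3))

(setf (getf *enum* :four) 4)   ; add a new property
(getf *enum* :four)            ; => 4
(remf *enum* :one)             ; remove a property
(getf *enum* :one)             ; => NIL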


--------------
John Thingstad

Pascal J. Bourguignon

Nov 27, 2008, 5:32:34 PM
yang <yang...@gmail.com> writes:

You have to refine your specifications.

For example, with the list you gave, what should we get for:

(let ((list '(1 2 3 1)))
  (setf (get-property (first list) 'state) 'finish)
  (get-property (fourth list) 'state))

?

Once you've run the above expression, should the following expression
return anything useful?

(get-property '1 'state)


Or do you want to be able to do:

(let ((list '(1 2 3 1)))
  (setf (get-property (first list) 'state) 'finish)
  (setf (first list) 42)
  (values (get-property (first list) 'state)
          list))
--> FINISH ;
    (42 2 3 1)

?

--
__Pascal Bourguignon__ http://www.informatimago.com/

THIS IS A 100% MATTER PRODUCT: In the unlikely event that this
merchandise should contact antimatter in any form, a catastrophic
explosion will result.

Robert Maas, http://tinyurl.com/uh3t

Nov 28, 2008, 2:56:59 AM
> From: yang <yangre...@gmail.com>

> for a symbol, 'a, I can do
> (setf (get 'a 'state) 'finish)
> so that 'a has a property state with value finish,

Yes, because each symbol is an object (not in the OOP sense, more
like in the STRUCT sense) which has several slots: the print-name,
the value cell, the function cell, the package cell, and the
property-list. You can't change the print-name (or shouldn't
anyway), but any of the other slots can be changed at any time
(although you probably don't want to change the package cell unless
you *really* know what you are doing).
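
For reference, each of those slots has a standard accessor; a small sketch
(the REPL results shown are illustrative):

(defvar *example* 42)
(setf (get '*example* 'state) 'finish)

(symbol-name '*example*)     ; => "*EXAMPLE*"
(symbol-value '*example*)    ; => 42
(symbol-package '*example*)  ; => #<PACKAGE "COMMON-LISP-USER">
(symbol-plist '*example*)    ; => (STATE FINISH)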

> I want to know how to do similar things for lisp. set a property
> and value for each cell in the list
> eg: given a list
> (setq lst '(1 2 3 1))

When you say (each) "cell" (in the list), do you mean:
- The CONS cell that is used to build the linked list?
- The value of the element at that position in the list?
- The *position* of that CONS cell (and corresponding element) in the list?

It makes a difference!!

If the property is with the CONS cell, or with the position within
the list, then replacing any of the elements with new values
doesn't change the property. But if the property is associated with
the value itself, then changing the value means you're looking up a
different property, and if the same value occurs twice, the *same*
property will be associated with it regardless of how you reach it.

If the property is associated with the CONS cell itself, then if
you rearrange the CONS cells the properties will move with them.
But if the property is associated with the (numeric) position
within the list, then when you rearrange the CONS cells within the
list, or even discard them and replace them with new CONS cells from
elsewhere, the third position of the list will retain the same property
it had even though the third element is now something different.

Do you understand the distinction between the three kinds of
associations? If so, you need to decide which of the three
semantics you had in mind. Then here is how I would suggest
implementing it:

For matching value to property, use a hash table keyed on the value
(element). You need to decide whether it's to be matched per EQL
(the default) or some other equivalence relation (EQ and EQUAL are
standard). Given the element (value), GETHASH will give the
property associated with it.
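
A minimal sketch of this case, with made-up variable names (note that both
occurrences of the value 1 share the same property, which may or may not be
what you want):

(defvar *value-props* (make-hash-table))   ; :test defaults to EQL
(defvar *list-a* (list 1 2 3 1))

(setf (gethash (first *list-a*) *value-props*) 'finish)
(gethash (fourth *list-a*) *value-props*)  ; => FINISH, T -- same value 1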

For matching on the CONS cell, use an EQ hash table keyed on the
CONS cell itself, not the value under it. Given the (pointer to)
the CONS cell, GETHASH will give the property associated with it.
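
A minimal sketch of this case, again with made-up names (distinct cells with
the same value stay distinct):

(defvar *cell-props* (make-hash-table :test #'eq))
(defvar *list-b* (list 1 2 3 1))

(setf (gethash *list-b* *cell-props*) 'finish)   ; key is the first CONS cell
(gethash (nthcdr 3 *list-b*) *cell-props*)       ; => NIL, NIL -- different cell, same value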

For matching on the numeric position within the list, use a parallel
list that holds the properties. Given the index, NTH will give the
corresponding property. Given the CONS cell, a variation of
POSITION will give the index, or NIL if the cell isn't in the list
any longer. Given the element, POSITION will give the first
position with that value, and successive POSITION calls with :START will
give successive positions *also* having that same value, so you can
retrieve the properties for every place an element appears.
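
A minimal sketch of the parallel-list case, with made-up names:

(defvar *list-c*  (list 1 2 3 1))
(defvar *props-c* (make-list (length *list-c*)))  ; one property slot per position

(setf (nth 0 *props-c*) 'finish)
(nth 0 *props-c*)                                 ; => FINISH

;; Successive POSITION calls with :START visit every place the value 1 occurs:
(loop for pos = (position 1 *list-c*)
        then (position 1 *list-c* :start (1+ pos))
      while pos
      collect (nth pos *props-c*))                ; => (FINISH NIL)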

Here's my attempt at writing that variant of POSITION which finds
the index of a CONS cell in a list, in case you need it:

(defun cell-position (list targetCell)
  (loop for ix from 0
        for thisCell on list
        when (eq thisCell targetCell) return ix))

(setq foo (list 3 1 4 1 5 9 2 6 5 3 5)) ;Head of list, intentionally whole list
(setq baz (nthcdr 7 foo)) ;A cell within that list, intentionally just that cell
(cell-position foo baz) ;Position of that cell within that list
=> 7
(rplacd (nthcdr 2 foo) (nthcdr 5 foo))
(cell-position foo baz)
=> 5

(Nitpick on intentional data types: Although in Lisp, internally
the value of foo is just the CONS cell at the head of the linked
list, the intention of foo is the whole list, not just that first
cell. But the intention of baz is to point at just one cell within
foo, not to imply the entire tail from that point to the end of
the list.)

> how can I specify properties for cells ?

When you say "cells", do you actually mean the CONS cells
themselves as distinct intentional objects? Or one of the other two
meanings I listed above?

Note that if you never mutate the sequence of CONS cells within
that list, such as by calling SORT on it, then it doesn't matter
whether you hash on CONS cell identity or keep a parallel list indexed
by numeric position within the list, because the relation between the two
is unchanged. And if you have no duplicate values within the list,
and you never change any value in the list, then hashing the value
instead of the CONS cell is likewise equivalent. If all three are
permanently equivalent, then the parallel list is more efficient.
But if you want to establish properties for more than one such
list, and have a single mechanism to do them all instead of a
different mechanism for each such list, and allow the same value in
different lists or places within a single list to have different
properties, then EQ hashing of the CONS cell is probably best.

Note that in *any* of the three cases, if you want more than one
property for each item (CONS cell, or position, or element value),
then use a detached property list or an association list in place
of a single value, depending on whether you want replacement or
masking semantics respectively for the various named properties.

I'll guess that you want replacement semantics, same as with
property lists associated with symbols, so use a detached property
list as the hash value. Look at the manual or HyperSpec for the
functions getf and remf, analogous to get and remprop, when using
detached property lists located in places in lieu of ordinary
property lists attached to symbols. For example (using baz from
above):
(defvar $CONS-EQ-HASH-TABLE$ (make-hash-table :test #'EQ))
(setf (getf (gethash baz $CONS-EQ-HASH-TABLE$) :PROP1) "Heather") ;Set prop
(getf (gethash baz $CONS-EQ-HASH-TABLE$) :PROP1) ;Retrieve prop
(remf (gethash baz $CONS-EQ-HASH-TABLE$) :PROP1) ;Erase prop

Disclaimer for SETQ haters and DEFVAR nitpickers: I used SETQ of
foo and baz, because presumably they would be local variables
within some function. During interactive debugging you can ignore
the warning you get about them being special. But I used DEFVAR
with $CONS-EQ-HASH-TABLE$ because presumably that would be a
global that really should be proclaimed a special variable.

William James

Nov 28, 2008, 4:12:36 PM
On Nov 27, 4:24 pm, "John Thingstad" <jpth...@online.no> wrote:

> On Thu, 27 Nov 2008 20:44:46 +0100, yang <yangre...@gmail.com> wrote:
> > for a symbol, 'a, I can do
> >  (setf (get 'a 'state) 'finish)
>
> > so that 'a has a property state with value finish,
>
> > I want to know how to do similar things for lisp. set a property and
> > value for each cell in the list
> > eg:  given a list
> > (setq lst '(1 2 3 1))
>
> > (setf  (get (car lst ) 'state) 'finish)
> > will report an error.
>
> > properties are bound to symbol names other than list cells. how can I
> > specify properties for cells ?
> > thanks
>
> (defparameter *enum* '(:one 1 two 2 :three 3))
>
> (getf *enum* :one)
>
> > 1

Ruby:

enum = { :one, 1, :two, 2, :three, 3 }
==>{:one=>1, :two=>2, :three=>3}
enum[:one]
==>1

William James

Nov 28, 2008, 4:49:13 PM
John Thingstad wrote:

> On Thu, 27 Nov 2008 20:44:46 +0100, yang <yang...@gmail.com> wrote:
>
> > for a symbol, 'a, I can do
> > (setf (get 'a 'state) 'finish)
> >
> > so that 'a has a property state with value finish,
> >
> > I want to know how to do similar things for lisp. set a property and
> > value for each cell in the list
> > eg: given a list
> > (setq lst '(1 2 3 1))
> >
> > (setf (get (car lst ) 'state) 'finish)
> > will report an error.
> >
> > properties are bound to symbol names other than list cells. how can
> > I specify properties for cells ?
> > thanks
>
>

> (defparameter enum '(:one 1 two 2 :three 3))
>
> (getf enum :one)
> > 1

JavaScript:

js> enum = { one: 1, two: 2, three: 3 }
[object Object]
js> enum.one
1
js> enum['three']
3

André Thieme

Nov 28, 2008, 7:22:31 PM
William James wrote:

> John Thingstad wrote:
>
>> On Thu, 27 Nov 2008 20:44:46 +0100, yang <yang...@gmail.com> wrote:
>>
>>> for a symbol, 'a, I can do
>>> (setf (get 'a 'state) 'finish)
>>>
>>> so that 'a has a property state with value finish,
>>>
>>> I want to know how to do similar things for lisp. set a property and
>>> value for each cell in the list
>>> eg: given a list
>>> (setq lst '(1 2 3 1))
>>>
>>> (setf (get (car lst ) 'state) 'finish)
>>> will report an error.
>>>
>>> properties are bound to symbol names other than list cells. how can
>>> I specify properties for cells ?
>>> thanks
>>
>> (defparameter enum '(:one 1 two 2 :three 3))
>>
>> (getf enum :one)
>>> 1

> Ruby:


>
> enum = { :one, 1, :two, 2, :three, 3 }
> ==>{:one=>1, :two=>2, :three=>3}
> enum[:one]

> ==>1


>
> JavaScript:
>
> js> enum = { one: 1, two: 2, three: 3 }
> [object Object]
> js> enum.one
> 1
> js> enum['three']
> 3

In Clojure:
(def enum {:one 1 :two 2 :three 3})

And then
(:one enum) ==> 1

Or also:
(get enum :two) ==> 2


André
--
Lisp is not dead. It’s just the URL that has changed:
http://clojure.org/

Lars Rune Nøstdal

Nov 28, 2008, 9:55:08 PM

this is probably not a good idea for what you're (really?) after, but:


(defmacro mk-meta-slot (accessor-name &key (test '#'eq))
  `(let ((slot-data (make-hash-table :test ,test :weakness :key)))
     (defmethod ,accessor-name (object)
       (gethash object slot-data))

     (defmethod (setf ,accessor-name) (new-value object)
       (setf (gethash object slot-data) new-value))))



;; Anything stronger than EQ descends into the lists/conses and determines
;; equality based on content instead of the conses themselves ...
(mk-meta-slot properties-of :test #'eq)


(defun property-of (object property-name &key (test #'eq))
  (let ((result (assoc property-name (properties-of object) :test test)))
    (if result
        (values (cdr result) t)
        (values nil nil))))


(defun (setf property-of) (property-value object property-name &key (test #'eq))
  (let ((already-existing (assoc property-name (properties-of object) :test test)))
    (if already-existing
        (setf (cdr already-existing) property-value)
        (push (cons property-name property-value) (properties-of object)))))



SW> (let ((lst (list 1 2 3 1)))
      (setf (property-of (nthcdr 0 lst) 'cell-name) 'first
            (property-of (nthcdr 1 lst) 'cell-name) 'second
            (property-of (nthcdr 2 lst) 'cell-name) 'third
            (property-of (nthcdr 3 lst) 'cell-name) 'fourth)
      ;; This is either exactly what you wanted or it is the opposite of it;
      ;; both "ways" are possible, though.
      (assert (not (eq (property-of (nthcdr 0 lst) 'cell-name)
                       (property-of (nthcdr 3 lst) 'cell-name))))
      (setf (first lst) 0)
      (assert (eq (property-of (nthcdr 0 lst) 'cell-name) 'first)))
NIL

budden

Nov 29, 2008, 3:27:05 PM
Hi!
You cannot (in general) put conses into a hash table with :test #'eq.

http://www.lispworks.com/documentation/HyperSpec/Body/f_sxhash.htm

states:

> 3. The hash-code for an object is always the same within a single session provided that the object is not visibly modified with regard to the equivalence test equal.

That means that when we modify the car or cdr, a conforming implementation
may change the hash value of the cons.
In particular, in SBCL we get:
CL-USER> (defvar *v* (cons nil nil))
CL-USER> (sxhash *v*)
457802361
CL-USER> (setf (cdr *v*) 4)
4
CL-USER> (sxhash *v*)
213127439

For mapping conses with EQ, we could use only association lists, but
they're not efficient. Or we would have to prohibit modification of a
cons once a property has been put on it.

Why is that? Because conses are garbage-collected. They may
be moved during GC and so have no permanent address in memory,
as C objects do. This looks like an inherent source of inefficiency
in any garbage-collected language where objects may move.

Lars Rune Nøstdal

Nov 29, 2008, 3:53:24 PM
yeah, and defmethod inside some-other-form (or something like that)
isn't allowed either; this was pointed out to me in another post

..even if sxhash returns different values, it seems to work anyway(?):

> (let ((lst (cons nil nil)))
    (setf (property-of (nthcdr 0 lst) 'state) 'finished)
    (setf (cdr lst) 4)
    (property-of (nthcdr 0 lst) 'state))
FINISHED
T
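
Right -- a conforming EQ hash table keys on object identity rather than on a
content hash, so a plain standard-CL version of the same check (a minimal
sketch, no meta-slot machinery) behaves the same way:

(defvar *eq-table* (make-hash-table :test #'eq))
(defvar *cell* (cons nil nil))

(setf (gethash *cell* *eq-table*) 'finished)
(setf (cdr *cell*) 4)                ; mutate the key's contents
(gethash *cell* *eq-table*)          ; => FINISHED, T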

Pascal Costanza

Nov 29, 2008, 5:13:29 PM

sxhash and hash tables are not related. See further down in the
definition of sxhash:

"The hash codes returned by sxhash are not necessarily related to any
hashing strategy used by any other function in Common Lisp."

What's important is the :test argument to make-hash-table, nothing
else (unless you are in implementation-dependent land).

Furthermore, you can (!) have local defmethod forms. Although that's
exotic, it is required to work per ANSI CL.


Pascal

--
My website: http://p-cos.net
Common Lisp Document Repository: http://cdr.eurolisp.org
Closer to MOP & ContextL: http://common-lisp.net/project/closer/

Lars Rune Nøstdal

Nov 29, 2008, 6:29:05 PM
On Sat, 2008-11-29 at 23:13 +0100, Pascal Costanza wrote:

> Lars Rune Nøstdal wrote:
> > yeah, and defmethod inside some-other-form (or something like that)
> > isn't allowed either; this was pointed out to me in another post
> >
> > ..even if sxhash returns different values, it seems to work anyway(?):
> >
> > > (let ((lst (cons nil nil)))
> > (setf (property-of (nthcdr 0 lst) 'state) 'finished)
> > (setf (cdr lst) 4)
> > (property-of (nthcdr 0 lst) 'state))
> > FINISHED
> > T
>
> sxhash and hash tables are not related. See further down in the
> definition of sxhash:
>
> "The hash codes returned by sxhash are not necessarily related to any
> hashing strategy used by any other function in Common Lisp."
>
> What's important is the :test argument for make-hash-tables, nothing
> else (unless you are in implementation-dependent land).
>
> Furthermore, you can (!) have local defmethod forms. Although that's
> exotic, it is required to work per ANSI CL.
>
>
> Pascal


ah, great .. i can keep "misusing my computer" then :) .. .. instead of
the other way around

*alt-tabs back to emacs*

budden

Nov 30, 2008, 4:37:19 AM
I was wrong. Sorry.

Some time ago I was also interested in associating properties with
cons cells. I had read CLtL2, not the new standard, and I was sure
sxhash was used in hash tables. Well, it is fine that I was wrong.

But I still can't imagine a really efficient way to do it. We either
need to store the cons cell's identity in the cons cell itself (consumes
memory), or we need to do some rehashing if a cons cell has recently
moved during a GC (consumes time), or we should not move objects at
all (consumes memory).

Pascal Costanza

Nov 30, 2008, 11:14:06 AM

...or we should only worry if and when we find out that this is an
actual bottleneck in our program...

Robert Maas, http://tinyurl.com/uh3t

Nov 30, 2008, 1:22:35 PM
(EQ hash tables *must* work, even if GC moves objects around:)
> From: budden <budde...@gmail.com>
> ... I still can't imagine a really efficient way to do so. We

> either need to store cons cell identity in the cons cell itself
> (consumes memory), or we need to do some rehashing if a cons cell
> have recently moved with a GC (consumes a time), or we should not
> move objects at all (consumes a memory).

Note: I'm not angry at you. You're a newbie, so presumably you
often don't know what's important and what's not important, so
the following is intended to be educational, not accusatory.
(I only get angry at newbies who ask us to do their homework for them,
and your article is most surely not in that category.)
It's an implementation issue deep down inside the guts of some
version of CL, so you as an ordinary user absolutely never need to
worry about how it's actually done. It's the same as you can't
think of a really efficient way to implement multiple value
returns, or keyword parameters, or multiple lexical closures that
share some bindings, etc. It's not your job, as application
programmer, as user of some CL implementation, to know or care how
it's done. You just must know that any *conforming* CL
implementation must somehow make it work per spec, and you just
refuse to use any seriously non-conforming implementation, and
*efficiency* is a matter of trade secrets and competitive advantage
among the various confirming implementations, so if you find a
bottleneck where it's utterly slow in one company's implementation
you try another company's implementation if you can afford the
extra cost they charge for that extra efficiency better than you
can afford your wasted personal time waiting for the less-efficient
but cheaper implemention to finish common tasks.

As for how it *might* be implemented, here are some ideas I have:

Never move anything. Use the Mad Hatter's Tea Party algorithm: when
your current seating (pages of memory) is too dirty (storage so
fragmented that you can't allocate a large object you want), you move
to a new seat (allocate another page of virtual memory). This is
feasible because virtual address space is **IMMENSE** on modern
computers and swap space on hard disk is **IMMENSE** so as long as
your working set isn't larger than physical RAM you are OK, and
physical RAM is pretty large on modern computers so fragmented
memory isn't really as bad as it used to be on 1MB Macintosh Plus
or 256k MicroSoft DOS.

Never move CONS cells. This is feasible because CONS cells are
always the same size, so fragmentation among competing uses for
CONS cells isn't a problem. Only if you allocate lots of CONS
cells, then discard most of them, then try to allocate a large
contiguous array, will you need to use the Mad Hatter's Tea Party
algorithm. But that doesn't solve the problem of EQ hash tables
that use *other* things, such as arrays, as keys. So this
section is moot. Go back to the full-fledged Mad Hatter's Tea Party
algorithm, or forward to the next algorithm below.

Have an extra bit in each object, CONS cell or array etc.,
which tells whether any EQ hash table has ever used it as a key.
Once that bit is set, that object is permanently locked in place
for the duration of the Lisp session, or until the next super-GC,
which resets that bit everywhere and then sets it again for all
keys of all EQ hash tables. Any object not recently a key of an EQ
hash table is free to move. This is reasonable because only a few
objects are ever used as EQ keys in a typical program, and because
even if every object in your program is used as an EQ hash key,
the Mad Hatter's Tea Party algorithm will still save you.

Rehash all EQ hash tables whenever GC needs to unfragment memory.
This is reasonable as a space/time tradeoff on older machines that
have only a small address space, such as Macintoshes running in
24-bit address space mode to keep older software working,
where the Mad Hatter has only a small number of seats at the table and
you'll shortly get back to where you started and *really* need to
clean your place (unfragment memory). It's also reasonable if
fragmentation causes your working set to occasionally exceed
physical RAM and the GC algorithm has a way to notice that
thrashing has commenced and take remedial action to reduce the
working set by defragmentation. For example, the OS can do a
call-back into the Lisp environment whenever page swapping exceeds
some pre-set threshold, and the called Lisp function can simply set
a flag that is later recognized by CONS and other memory allocators
to trigger an *immediate* repacking GC followed by global
EQ-hash rehashing (or setting a flag in each EQ hash table to
invalidate it and force a re-hash the next time anybody tries to
search it) at that moment, even before memory is exhausted.

Don't GC in the first place!! This is feasible if the longest
possible lifetime of the Lisp environment is shorter than the least
possible time to run out of memory, such as one-transaction
Web-service applications (regular CGI without mod_lisp persistent
processes), or quickie Unix-style utilities, or even decently
long-lasting applications on a computer with an immense quantity of
physical RAM (such as a Lisp Machine, often run for weeks or months
with GC turned off before needing to re-boot the system and then
re-start the very last task that had started but didn't get to finish).
This is reasonable because it may be faster to re-boot the system and
re-do that last task than to run a GC.

So don't worry about it already!

Your idea of storing a CONS-cell identity in each CONS cell would
need to be extended to all movable objects, including arrays. It
would need to be done for *all* such objects, not just those that
are used as keys in EQ hash tables, because typically you build the
object first and *then* make it a key, and there's no adjacent
storage available to store the identity info at the time you make
it a key. Unfortunately the number of different objects that are
allocated is unbounded, even on a system with a finite address space,
because objects are repeatedly discarded and later replaced (in
RAM) by new objects, so the size of the identity info is unbounded
(the logarithm of the number of objects), so you'd need to use some
variable-length ID info, such as radix-1 or UTF-8, which would
probably be less efficient than most of the ideas I suggested
above. (In addition to such variable-length strings of data taking
more storage than the no-additional-space of the simple machine
address you already have to find the object in the first place,
hashing a fixed-length virtual machine address is a fixed sequence
of instructions, whereas hashing a variable-length string of data is
a WHILE loop, both for computing the bucket location and for
comparing two strings that hash to the same bucket to see if they
are identical. EQ hash tables *ought* to be more efficient than
EQUAL hash tables for this reason, hence should be used
instead wherever appropriate. EQL hash tables, like EQ hash
tables, should be fixed code with no loop, but with a dispatch to
handle numbers of various types and character objects differently
from other objects where EQ is called, so EQL hash tables should be
slightly slower than EQ hash tables, because if not all keys are
numbers or characters then all the GC problems regarding object
movement occur with EQL tables too.)

OT gripe: for most practical applications, EQUAL hash tables are
what you *really* want, so that, for example, in the most common case
the same string or list/tree/sorted-set of strings generated from
two different sources will match. IMO the default hash table type
should have been EQUAL instead of EQL, except that EQL is the
default for virtually every other function that allows various
comparison functions, such as ASSOC and MEMBER and POSITION, so I
guess we're stuck with EQL being the default for hash tables, sigh.
(But note that GET etc. use EQ, so there is one group of exceptions
to the EQL default behaviour. Why not a second exception?)
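
A small sketch to illustrate the gripe: EQUAL hash tables match structurally,
so equal-but-not-EQ keys hit the same entry, e.g. two distinct strings with
the same characters:

(defvar *equal-table* (make-hash-table :test #'equal))
(setf (gethash "answer" *equal-table*) 42)
(gethash (copy-seq "answer") *equal-table*)   ; => 42, T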

Kaz Kylheku

Nov 30, 2008, 6:49:34 PM

A common implementation strategy for conses is to put them into a dedicated
heap that contains only conses. Since the conses are all of the same size, that
heap doesn't fragment. Since it doesn't fragment, it doesn't have to be
compacted, and so conses can stay where they are.

You are right, though; this is a general problem. If you compact heaps, how do
you maintain object identity? It's not just heap compaction. There is also the
possibility of CLOS objects changing type dynamically; they can acquire and
lose slots. If an object changes type in such a way that it's now too large to
fit into its original space, it has to be relocated.

The possibilities you have outlined are not exhaustive. Your idea of putting an
ID into the object is one possibility. Another way is to have handles. Under
handles, an object's ID isn't an absolute address, but a displacement into an
array of pointers. The disadvantage is the double dereference to access an
object: fetch the real pointer from the table, then go to the object. This can
be dealt with by some aggressive optimization: a thread can fetch the real
addresses of objects and keep them cached in registers. (Some such caching will
necessarily happen even in unoptimized access sequences, and so the GC has to
deal with it anyway).

An advantage of the handle approach is that it provides reliable ID recycling.
Two different objects have a different position in the table and hence
different ID. When an object becomes garbage, its ID (position in the table)
can be easily reused for another object. If ID's are stored in
objects, you need some algorithm to generate new ID's which don't clash with
any existing ID's (e.g. a data structure holding dynamic, compacted set of
intervals over the integer domain).
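
A rough sketch of the handle idea in Lisp itself, with made-up names; a real
system does this inside the runtime, where the collector rewrites the table
when it moves an object, so this only illustrates the indirection and the ID
recycling:

(defvar *handle-table* (make-array 16 :adjustable t :fill-pointer 0))
(defvar *free-handles* '())                 ; recycled IDs of dead objects

(defun allocate-handle (object)
  (if *free-handles*
      (let ((h (pop *free-handles*)))       ; reuse a freed slot
        (setf (aref *handle-table* h) object)
        h)
      (vector-push-extend object *handle-table*)))

(defun handle-object (handle)
  (aref *handle-table* handle))             ; the extra ("double") dereference

(defun release-handle (handle)
  (setf (aref *handle-table* handle) nil)
  (push handle *free-handles*))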

Kaz Kylheku

Nov 30, 2008, 8:01:50 PM
On 2008-11-30, Robert Maas, http://tinyurl.com/uh3t

<seeWeb...@teh.intarweb.org> wrote:
> As for how it *might* be implemented, here are some ideas I have:
>
> Never move anything. Use the Mad Hatter's Tea Party algorithm: When
> your current seating (pages of memory) are too dirty (storage
> fragmented so you can't allocate a large object you want), you move
> to a new seat (allocate another page of virtual memory). This is
> feasible because virtual address space is **IMMENSE** on modern
> computers and swap space on hard disk is **IMMENSE** so as long as
> your working set isn't larger than physical RAM you are OK, and
> physical RAM is pretty large on modern computers so fragmented
> memory isn't really as bad as it used to be on 1MB Macintosh Plus
> or 256k MicroSoft DOS.

Note that virtual memory is not exactly free. It requires virtual address
translation, whose efficiency depends on locality of reference: keeping
everything clumped together.

You don't want to scatter all your objects to different virtual pages, so that
they all hash to different TLB entries. If you're working with a number of
pages that is much larger than the TLB size, you will be thrashing your TLB
cache.

This is made even worse by braindead TLB management, which may be direct
mapped, depending on architecture. Direct mapped means that a page is assigned
a fixed position in the TLB. If you are accessing two pages that map to the
same TLB entry, they bump each other's entry.

Combine a large VM footprint with poor TLB replacement and you have bad
performance.

budden

Dec 1, 2008, 9:59:03 AM
Hi!
Thanks to all for the verbose replies.
I disagree (regardless of anyone's opinion) that I shouldn't care
about efficiency from the very start. Lisp is very flexible and allows one
to fix some architectural errors afterwards, but some decisions have to be
made from the very beginning, and knowledge of the comparative efficiency
of the various mechanisms is important.

I made a test:

CL-USER> (require :iterate-keywords) ; http://sourceforge.net/projects/iteratekeywords/
CL-USER> (defparameter *count* 1000000)
CL-USER> (defparameter *v* (make-hash-table :test #'eq))
CL-USER> (progn (defparameter *lst* (iter (:for i from 1 to *count*)
                                          (:collect i)))
                nil)
CL-USER> (time (iter (:for i from 1 to *count*) (:for c on *lst*)
                     (setf (gethash c *v*) i)))
Evaluation took:
  1.049 seconds of real time
  33,632,184 bytes consed
CL-USER> (time (iter (:for i from 1 to *count*) (:for c on *lst*)
                     (when (gethash c *v*) (:count 1))))
Evaluation took:
  0.233 seconds of real time
  31,216 bytes consed

1000000

Acceptable.

Robert Maas, http://tinyurl.com/uh3t

Dec 2, 2008, 6:30:48 AM
> > As for how it *might* be implemented, here are some ideas I have:
> > Never move anything. Use the Mad Hatter's Tea Party algorithm: ...
> From: Kaz Kylheku <kkylh...@gmail.com>

> Note that virtual memory is not exactly free. It requires virtual
> address translation, whose efficiency depends on locality of
> reference: keeping everything clumped together.

Indeed. Still, there may be some mid-sized algorithms where the MHTP
algorithm will let you avoid needing GC while you use a lot
more VM than physical RAM. But as you pointed out, if your application
gets *really* huge, scattered with just a few bytes used on
each swapped page, to where swapping is very inefficient, you may
at that point really need to GC.

Thanks for the info, <SNL>Debbie Downer</SNL> ^H^H...^H^H Kaz Killjoy (just kidding)

Robert Maas, http://tinyurl.com/uh3t

Dec 2, 2008, 6:51:09 AM
> From: budden <budde...@gmail.com>

> I disagree (regardless of anyone's opinion) that I shouldn't care
> efficiency from the very start.

I take a middle ground. If you're writing a one-shot script that
won't be processing a lot of data, then just write it the quickest
way you can think of and don't worry about computer efficiency.
Your personal efficiency writing the code is more important. An
hour saved writing the code is more important than the five seconds
of extra CPU time your script consumed. Even if you run that script
a hundred times, and thereby waste a total of 500 CPU seconds, over
the course of a year, that's still probably fine.

If you're going to be processing a medium amount of data, like one
million records on today's fast computers, then worry about
order-of-magnitude considerations, and see if you can find a built-in
utility that is reasonable, for example:
- Guaranteed log(n) for finding a key in a set, and then returning
the value or deleting or inserting or changing
- Guaranteed n for iterating through a set
- Guaranteed n * log(n) for sorting a set, or for building a
structure that will allow log(n) operations listed above
This is a trade-off between computer efficiency and personal efficiency.

If your program is apparently too slow, and it's going to be used a
lot, and you really need to speed it up, *then* start getting more
serious about efficiency. Do timing tests to see where the
bottlenecks are, and work hard tuning *only* those parts.

budden

Dec 2, 2008, 11:52:54 AM
Hi!

> - Guaranteed log(n) for finding a key in a set, and then returning
>    the value or deleting or inserting or changing
> - Guaranteed n for iterating through a set
> - Guaranteed n * log(n) for sorting a set, or for building a
>    structure that will allow log(n) operations listed above
Hmm, we might also pay some cost that is a function of n on each GC, and it
depends on many factors which I don't know yet.

If my hash table is huge, but some parts of it are modified
frequently, would a generational GC help, or will it
be kept in the "youngest" generation and rehashed on each GC?
That might slow down the program significantly.
I'm interested in this not in theory, but in relation to existing
implementations (I have no means to build my own EQ hash table, as
sxhash can't hash cons cells efficiently and reliably).

Also, Lisp is sometimes advertised as a *fast* language. CL-PPCRE is
said to be faster than Perl on regexps (and it is), and
CMUCL is said to be faster than Fortran on numerical math (I didn't
try CMUCL, but I doubt that SBCL is that fast; e.g., a
fixnum in SBCL 1.0.22 is limited to #x1FFFFFFF, so it loses compared to
plain 32-bit C on 64-bit tasks; I don't know about Fortran).

I believed that Lisp is fast from the start, but now I'm not so sure
that is true. See, e.g., http://shootout.alioth.debian.org/
The tasks are too simple to show real performance, but SBCL (designed to
be fast) doesn't perform so well even on the regexp tasks.
It looks like a JIT produces faster code than Lisp's compiler. And even C,
where regexps are likely to be interpreted, is faster than CL-PPCRE.

budden

Dec 3, 2008, 6:13:49 PM
Hallo!
Hmm, it seems I asked a very interesting question about hash tables, but
got no replies.
Repeating myself: suppose I have a big hash table which is mostly
constant, but some parts of it are changed frequently (consider OLAP
as an example, where we have a huge client database which is mostly
constant). We have already found that
if objects are moved on GC (and as far as I know they _are_ moved in
most CL implementations), we pay some price on each rehash. In a
generational GC, will this price be proportional to:
- the total size of the table (then we get poor efficiency), or
- the intensity of read/write access between GCs (which is ok)?
If it is proportional to the total size of the table, I should avoid
hashing on conses, and maybe avoid EQ hashing or hash tables at all
and build tables using sxhash or hand-written type-specific
algorithms.
If it is proportional to the intensity of access, EQ hash tables are
scalable.
What can be said about this regarding the best existing CL
implementations?

Abdulaziz Ghuloum

Dec 3, 2008, 10:44:08 PM
I have written a paper[1] on implementing eq-hash-tables such that:

1. No need to store extra info (hash key) in any data structure
making it ideal for cons cells and other small data structures.

2. It is generation-friendly: *only* objects that actually *move*
during a collection are rehashed, and only when a lookup operation
fails. Large tables with old keys are not penalized by minor
collections.

Ikarus Scheme and Chez Scheme employ this strategy. I don't know
about CL implementations but last time I checked, what they did was
rehash the entire table when a lookup operation fails after a GC.

Aziz,,,

[1] Generation-Friendly Eq Hash Tables.
    Abdulaziz Ghuloum and R. Kent Dybvig.
    2007 Workshop on Scheme and Functional Programming,
    Freiburg, Germany.
    http://sfp2007.ift.ulaval.ca/procPaper3.pdf

[PS. I don't read comp.lang.lisp usually, so, please email me if
you have any questions]

budden

Dec 5, 2008, 2:11:24 AM
Hallo, Aziz!
If I understand it right, it looks like you have at least a pointer and a
cons cell in a guardian per key/value pair stored in a hash
table. So it is a memory vs. time tradeoff.
The advantage of your approach is that you do not have this cell for
every entity in an image.
But if there are many hash tables in an image, the same key might be
stored in many guardians, and we get unnecessary memory consumption. In
this case, storing the object's identity in the object itself (or keeping
an object list per generation) seems to be better.
>    http://sfp2007.ift.ulaval.ca/procPaper3.pdf
If an implementation had a generation number, we might store the
generation number of the youngest object in a bucket and rehash only
the buckets which store young data. It would be optimal to do that
only on hash lookup failure. But I see SBCL does not have such an id.

gc.lisp:
(declaim (type cons *gc-epoch*))
(defvar *gc-epoch* (cons nil nil))

Maybe we could add such a number (likely a bignum) just for hash
tables. But then accessing a hash table implies comparing two bignums,
and I'm not sure that is not slow.

Juho Snellman

Dec 5, 2008, 3:00:12 AM
Abdulaziz Ghuloum <aghu...@gmail.com> writes:

> I have written a paper[1] on implementing eq-hash-tables such that:
>
> 1. No need to store extra info (hash key) in any data structure
> making it ideal for cons cells and other small data structures.
>
> 2. It is generation-friendly: *only* objects that actually *move*
> during a collection are rehashed, and only when lookup operation
> fail. Large tables with old keys are not penalized by minor
> collections.
>
> Ikarus Scheme and Chez Scheme employ this strategy. I don't know
> about CL implementations but last time I checked, what they did was
> rehash the entire table when a lookup operation fails after a GC.

In CMUCL the GC eagerly rehashes only the keys that moved (but still
needs to do a full scan of the backing vector of the table to find
those objects).

SBCL used to do the same, but that code was scrapped a while ago,
since the extra synchronization costs from safely coordinating the GC
and the mutators were much larger than the gains from doing less
rehashing. Getting rid of the partial rehashes made it much easier to
optimize the rest of the code.

--
Juho Snellman

Abdulaziz Ghuloum

Dec 5, 2008, 3:11:42 AM
On Dec 5, 3:00 am, Juho Snellman <jsn...@iki.fi> wrote:

> SBCL used to do the same, but that code was scrapped a while ago,
> since the extra syncronization costs from safely coordinating the GC
> and the mutators was much larger than the gains from doing less
> rehashing.

I don't understand how these are the same. In our approach, there is
no synchronization whatsoever between the collector and the mutator
because the collector does not rehash (or reorganize, or mess up) the
table. It's the mutator that owns the hash table. The collector only
tells what keys have moved.

Aziz,,,

Abdulaziz Ghuloum

Dec 5, 2008, 3:25:17 AM
On Dec 5, 2:11 am, budden <budde...@gmail.com> wrote:
> Hallo, Aziz!
>   If I get it right, it looks like you have at least a pointer and a
> cons cell in a guardian per each key/value pair stored in a hash
> table. So it is a memory vs time tradeoff.

The intent of using the tconcs was not a performance vs. memory
tradeoff. It was to eliminate the need to synchronize the mutator
with the collector. The collector knows when keys move. It does not
eagerly rehash them because the mutator might be operating on the
table when the collector kicks in (e.g., it might be collecting the
list of keys/values in the table, or it might be enlarging/shrinking
the table). The tconc is used to only gather the list of keys that
actually moved: the collector adds them to the end, and the mutator
pops them from the front. This requires no synchronization between
the two. (you can refer to the original Dybvig paper on guardians
for details of how that's done).
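
For readers who haven't met the term: a "tconc" is essentially a queue
represented as a (head . tail) pair of pointers into one list, so the
producer appends in O(1) at the tail while the consumer pops from the head.
A minimal sketch (not Ikarus's actual code):

(defun make-tconc ()
  (cons nil nil))                          ; (head-of-list . last-cell)

(defun tconc-add (tc item)                 ; producer: append at the tail
  (let ((new (cons item nil)))
    (if (null (car tc))
        (setf (car tc) new (cdr tc) new)
        (setf (cddr tc) new (cdr tc) new))
    tc))

(defun tconc-pop (tc)                      ; consumer: take from the head
  (let ((head (car tc)))
    (when head
      (setf (car tc) (cdr head))
      (unless (car tc) (setf (cdr tc) nil))
      (car head))))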

>   The advantage of your approach is that you do not have this cell
> for every entity in an image.

Correct.

>   But if there are many hash tables in an image, the same key might be
> stored in many guardians and we get unnecessary memory consumption.

Also correct. However, you have to make a choice between adding an
extra field to all of your objects (thus doubling the size of pairs
and other small data structures) or consuming an extra pair for each
key that moves during GC. If you want to optimize for the common
use of Scheme or Common Lisp, you had better not double the size of
every pair in the system, and instead let hash tables pay for themselves.
After all, with chained hashing, you already require about 5 cells for
each key/value pair in a hash table, so using an occasional extra pair
may not be the worst thing.

Aziz,,,

Juho Snellman

Dec 5, 2008, 4:37:03 AM

The removed code that I was referring to was the code inherited from
CMUCL, which did eager partial rehashing. So the GC would actually be
mutating the hash table (in a rather clever way) to rehash the keys
that moved after scavenging the hash table.

I wasn't trying to imply that this was the same technique you're
using, just pointing out that some CL implementations have done
something more complicated than full rehashes on failing lookups.

--
Juho Snellman

budden

Dec 5, 2008, 4:34:19 AM
> It was to eliminate the need to synchronize the mutator with the collector.
Ok, that's fine.

> After all, with chained hashing, you do require about 5 cells for each
> key/value pair in a hash table, so, using an occasional extra pair
> may not be the worst thing.
Reasonable, I agree.
So I conclude that (in principle, at least) it is possible to use
large EQ hash tables without catastrophic time/memory penalties.
So I might, e.g., track the source locations of a large code base with a
high degree of precision. This way I can use cons cells for storing
the AST (which is the natural way to go in Lisp) and annotate them via
hash tables, instead of creating "specialized conses" for that. That is
rather important and fine.
Thanks for your comments.
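
A tiny sketch of that use case, with hypothetical names (an EQ table over the
AST's conses, as discussed above):

(defvar *source-locations* (make-hash-table :test #'eq))

(defun annotate (form location)
  (when (consp form)
    (setf (gethash form *source-locations*) location))
  form)

(defun source-location (form)
  (gethash form *source-locations*))

;; (defvar *ast* (annotate (list '+ 1 2) '(:line 10 :column 4)))
;; (source-location *ast*)  ; => (:LINE 10 :COLUMN 4)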

Abdulaziz Ghuloum

Dec 5, 2008, 4:41:32 AM
On Dec 5, 4:37 am, Juho Snellman <jsn...@iki.fi> wrote:

> I wasn't trying to imply that this was the same technique you're
> using, just pointing out that some CL implementations have done
> something more complicated than full rehashes on failing lookups.

Understood. Thanks for the clarification.

Aziz,,,

Abdulaziz Ghuloum

Dec 5, 2008, 5:29:03 AM
On Dec 5, 4:34 am, budden <budde...@gmail.com> wrote:

> So, I conclude that (in principle, at least) there is a possibility to
> store large eq hash tables without catastrophic time/memory penalties.
> So, I might e.g. track a source locations of large code base with a
> high degree of precision. This way I can use cons cells for storing
> AST (which is the natural way to go in lisp) and annotate them via
> hash tables, instead of creating "specialized conses" fot that. It is
> rather important and fine.

I don't know how big your sources are at any given time, but I tried
timing Ikarus on all Scheme source in my home's ikarus/scheme
directory
(77 files). I timed how long it took to read all files, then how long
it took to insert all pairs in a hash table (taking into account the
possibility of having cyclic and shared data structures). This is
what I have on my macbook:

running stats for (map read-file files):
    12 collections
    304 ms elapsed cpu time, including 58 ms collecting
    305 ms elapsed real time, including 59 ms collecting
    51311024 bytes allocated
running stats for (for-each hash sources):
    1 collection
    52 ms elapsed cpu time, including 9 ms collecting
    52 ms elapsed real time, including 9 ms collecting
    5268152 bytes allocated
77 files, 38937 lines, 198197 keys

It doesn't look catastrophic, but the size is not that big either.
(What's 40,000 lines or 200,000 pairs?)

Aziz,,,

Abdulaziz Ghuloum

Dec 5, 2008, 5:55:37 AM
Maybe one meaningful benchmark here is what happens if we
perform 100 collections while the hash table is in memory,
and how that changes if we probe the table before each
collection. (the probe is with a nonexistent key, which
causes rehashing of all keys that moved)


running stats for (do ((i 0 (+ i 1))) ((= i 100)) (collect)):
100 collections
148 ms elapsed cpu time, including 147 ms collecting

running stats for (do ((i 0 (+ i 1))) ((= i 100)) (hashtable-ref h (cons 1 2) #f) (collect)):
100 collections
177 ms elapsed cpu time, including 157 ms collecting

The total rehashing overhead here is 20~30 ms for 100 collections.
(depending on how you count:
total time - collection time = 20ms
with probe - without probe = 30ms)


We try the same experiment with 1000 collections now.

running stats for (do ((i 0 (+ i 1))) ((= i 1000)) (collect)):
1000 collections
408 ms elapsed cpu time, including 406 ms collecting

running stats for (do ((i 0 (+ i 1))) ((= i 1000)) (hashtable-ref h (cons 1 2) #f) (collect)):
1000 collections
487 ms elapsed cpu time, including 437 ms collecting

The total rehashing overhead is now 50~80 ms.


So, for increasing the number of collections by 10x, there is
about 2.5x increase in collection time, and about 2.5x increase
in rehashing time. (which is what "generation-friendly" means)

Enough of that for now.

Aziz,,,

budden

Dec 5, 2008, 6:01:23 AM
I think it is rather fast.