Sparse sets

Marcel Hendrix

unread,

Mar 26, 2012, 2:49:13 PM3/26/12

to

I have a hard time believing the quoted advantages of sparse sets myself,
but it is a nice exercise where one can drop in a small Forth twist.

I like the code a lot better than what can be found for other languages.

-marcel

-- ------------------------------------------------------------------
(*
* LANGUAGE : ANS Forth with extensions
* PROJECT : Forth Environments
* DESCRIPTION : Sparse Set operations
* CATEGORY : Exercise
* AUTHOR : Marcel Hendrix
* LAST CHANGE : March 25, 2012, Marcel Hendrix
*)

NEEDS -miscutil

REVISION -sparseset "--- Sparse Sets Version 0.01 ---"

PRIVATES

DOC
(*
( From http://programmingpraxis.com/ )

A clever data structure exists for storing sparse sets of integers
on the range 0 .. u-1. Performing initialization, lookup, and insertion
is O(1) while iteration is O(n), where n is the number of elements in
the set.

This data structure was studied in the 1993 article "An Efficient
Representation for Sparse Sets" by Preston Briggs and Linda Torczon,
in Exercise 1.9 (1.8 in the first edition) of Jon Bentley's book
"Programming Pearls," and in exercise 2.12 of the 1974 book "The Design
and Analysis of Computer Algorithms" by Al Aho, John Hopcroft
and Jeffrey Ullman.

The data structure considers a universe of integers from 0 to u-1;
depending on the circumstances the integers probably map to
something else, but we don't care about that. Any given set consists
of n items chosen from the universe; there are no duplicates. Note
that n < u, certainly, and likely n is much less than u, otherwise,
you would probably use a bit vector to represent the set. Note also
that we are optimizing for speed at the expense of space, as a bit
vector takes u bits but our data structure takes 2u integers.

Think about a bit vector. Setting a bit is a constant-time operation,
as is checking if a bit is set or unset. But initializing the bit
vector and iterating over its set elements each take time
proportional to the vector size. Our sparse set reduce the
iteration to time proportional to the size of the set (rather than
the size of the universe) and reduce the initialization time
to a constant.

n
|
v
D: 5 1 4 _ _ _ _ _ _ _
\| |
\ \
|\ \
| \ \
| \ \
| \ \
| \|
| \
| |\
| | |
S: _ 1 _ _ 2 0 _ _ _ _
ix 0 1 2 3 4 5 6 7 8 9

The sparse set is represented by two vectors that we will call dense
(abbreviated D) and sparse (abbreviated S). Initially n, the number of
elements in the set, is zero; the two vectors are uninitialized and
may contain anything. To add an element 0 = k < u to a set that does
not already contain k, we set D[n] to k, S[k] to n, and increase n
by 1, an operation that takes constant time. After this, the two
vectors point to each other, which gives a test of set membership
that also works in constant time: an element k is in the set if and
only if S[k] < n and D[S[k]] == k. Note that if k is not a member of
the set, the value of S[k] doesn't matter; either S[k] will be greater
than n or it will point to an element of D that doesn't point back
to it.

The diagram above shows a set with the elements 5, 1 and 4; the '_'
boxes may contain any value. To iterate over the elements of the set,
read D[0 .. n-1], which takes time O(n), and to clear the set make
n = 0, which takes time O(1); note in particular that clearing the
set doesn't require reinitialization. Other operations, including
size-of, delete, union, intersection, difference, and set-equality
are possible, and equally time-efficient compared to bit vectors,
but we won't discuss them here, since they are seldom used with
this representation of sets. A common use of these sparse sets is
with register allocation algorithms in compilers, which have a
fixed universe (the number of registers in the machine) and are
updated and cleared frequently during a single processing run.

Comments:
1) The INSERTed elements are visited in order of storage, something
impossible with a bit vector.
2) The elements to be stored must be unique. (Use LOOKUP before INSERT)
*)
ENDDOC

0 VALUE #universe PRIVATE
0 VALUE n PRIVATE
0 VALUE Dense PRIVATE
0 VALUE Sparse PRIVATE

: SPARSE-SET ( sz | "name" -- )
CREATE DUP , 2* CELLS ALLOT
DOES> @+ TO #universe TO Dense
Dense #universe CELLS + TO Sparse ;

#16 SPARSE-SET test-set test-set ( make it current )

-- Assume x is not already in the set.
: INSERT ( x -- )
DUP #universe >= ABORT" INSERT :: out of range"
DUP Dense n CELL[] !
Sparse []CELL n SWAP !
1 +TO n ;

: LOOKUP ( x -- bool )
Sparse OVER CELL[] @ DUP n U>= IF 2DROP FALSE EXIT ENDIF
Dense []CELL @ = ;

: ITERATE ( xt -- ) LOCAL xt n 0 ?DO Dense I CELL[] @ xt EXECUTE LOOP ;
: CLEARS ( -- ) CLEAR n ;

CLEARS
CR .( Insert {5,1,4} ) 5 INSERT 1 INSERT 4 INSERT
CR .( Lookup {5,1,4,15} => ) 5 LOOKUP . 1 LOOKUP . 4 LOOKUP . #15 LOOKUP .
CR .( Iterate with '.' => ) ' . ITERATE

:ABOUT CR ." Sparse Sets"
CR ." Try: ( n -- ) INSERT"
CR ." ( n -- bool ) LOOKUP "
CR ." ( xt -- ) ITERATE "
CR ." ( -- ) CLEARS " ;

.ABOUT -sparseset CR
DEPRIVE

(* End of Source *)

Anton Ertl

unread,

Mar 27, 2012, 10:43:53 AM3/27/12

to

m...@iae.nl (Marcel Hendrix) writes:
>I have a hard time believing the quoted advantages of sparse sets myself

The advantages are real. The problem is that this representation
requires a lot of memory, and actually uses it (so the OS has to clear
it at the start, and it has to be kept in main memory, and it will
affect cache misses), although the memory is only accessed sparsely.

> A common use of these sparse sets is
> with register allocation algorithms in compilers, which have a
> fixed universe (the number of registers in the machine) and are
> updated and cleared frequently during a single processing run.

Actually the universe in the register allocation application is the
number of pseudo-registers (aka live ranges) in the function. The
number of registers in the machine is small, so you don't need a
sparse representation for that.

Anyway, the only use of this representation in register allocation in
know of was the one described in the paper. Appel and George
represented the conclict graph (which was the main use for this sparse
set representation in the work of Briggs et al.) in a hash table: much
smaller memory, also average constant-time insertion and lookup, and
acceptable clearing time.

Still, I find this set representation instructive, in its advertised
properties, why they are important for some applications, and in the
not so obvious side effects, which make it less of a good idea than
one thinks at first.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2011: http://www.euroforth.org/ef11/

David Kuehling

unread,

Mar 28, 2012, 5:41:48 PM3/28/12

to

>>>>> "Marcel" == Marcel Hendrix <m...@iae.nl> writes:

> I have a hard time believing the quoted advantages of sparse sets
> myself, but it is a nice exercise where one can drop in a small Forth
> twist.

Hah, didn't know this data representation trick. Interesting read.

BTW the sparse set stuff is reminiscent of the Bloom Filter. Bloom
Filter sets have constant time element insertion and lookup, and
constant memory requirement in the order of set members, not universe
size. At the cost of allowing false positives (i.e. membership tests
sometimes returns true on an element *not* being in the set, however
when it returns false, you can be sure that an elment is not in the
set). Usually Bloom Filters are used as a cheap short-cut check for
element membership in cases where you expect a high miss rate.

https://en.wikipedia.org/wiki/Bloom_filter

cheers,

David
--
GnuPG public key: http://dvdkhlng.users.sourceforge.net/dk.gpg
Fingerprint: B17A DC95 D293 657B 4205 D016 7DEF 5323 C174 7D40