
Performance of dictionary structures


David Rush

Apr 22, 2003, 1:16:03 PM

In the same general vein as Aubrey Jaffer's recent call for empirical
measurements of programs, and inspired by one of the discussion
threads from the Vector SRFI about the desirability of association
vectors (as a space optimization of a-lists), I thought I'd offer
some measurements that I took as I did some very basic optimizations
to a document-indexing program.

In particular, this program builds up a term-frequency vector for a
collection of documents. My first coding pass used a bog-standard
a-list. The second version went to a self-organizing list, exploiting
the vocabulary similarity of the document collection to get better
performance without having to pull in non-R5RS/non-SRFI hash-table
code. The third version used a hash table coded from scratch on top of
SRFI-9. I've been meaning to do a b-tree-based version, just for
completeness, but haven't gotten the proverbial "round tuit" yet.
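
For the curious, the self-organizing version is essentially
move-to-front on an a-list. A minimal sketch of the idea (not the
actual indexing code; these names are made up for illustration):

(define (make-sol) (list '*sol*))      ; headed list, initially empty

(define (sol-lookup! dict key)
  ;; Return the (key . value) pair for KEY, or #f. On a hit, splice
  ;; the pair out and re-cons it onto the front, so frequent terms
  ;; migrate toward the head of the list.
  (let loop ((prev dict) (rest (cdr dict)))
    (cond ((null? rest) #f)
          ((equal? key (caar rest))
           (let ((hit (car rest)))
             (set-cdr! prev (cdr rest))             ; unlink
             (set-cdr! dict (cons hit (cdr dict)))  ; move to front
             hit))
          (else (loop rest (cdr rest))))))

(define (sol-insert! dict key value)
  (set-cdr! dict (cons (cons key value) (cdr dict))))

Counting a term is then one sol-lookup! followed by a set-cdr! on the
hit, or a sol-insert! of (term . 1) on a miss.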

So, indexing 119 old comp.lang.scheme postings on a lightly laden
2GHz Pentium-4 w/1GByte RAM (csh-style times: user, system, elapsed,
CPU%):

with a-lists:                 279.740u 0.190s 4:41.38 99.4%
with self-organizing a-lists:  82.110u 0.120s 1:23.01 99.0%
with hash-tables:              16.850u 0.070s 0:17.07 99.3%

comparing wall-clock time growth rates (in seconds):

n=number of articles
o=time for self-organizing a-list
h=time for hashed dictionary

n 10 20 30 40 50 60 70 80 90 100 110 120
o 4.72 7.97 14.57 19.98 24.54 32.34 41.58 49.49 56.04 65.46 76.75 84.06
h 3.20 4.23 6.02 7.37 8.27 9.65 11.32 12.39 13.31 14.53 16.23 17.09

So what does all of this tell us? I realize that without the code it
doesn't say a lot (if there's sufficient interest, I'll post it to the
web), but the self-organizing code is surprisingly linear in its
growth. Also, it is worth noting that the hash tables never rehashed
in this application: my initial allocation of 5003 buckets was enough
that the average bucket search length never grew beyond 4. So I guess
we can conclude that the amortized cost of hashed lookup is rather a
lot less than the amortized cost of self-organizing lookup. Or, more
precisely, that the Usenet text in c.l.s has insufficient locality
(we don't repeat ourselves every repeat fourth repeat word) to
out-perform the hash-table parameters.

This leaves open the interesting possibilities of twiddling the
hash-table params to see what the other dimensions of the performance
curve look like. But what it *really* means is that we could use
someone to step up and produce a really good dictionary SRFI that
allows multiple implementation models, much as Olin's Sorting SRFI
attempts to do.

Any takers?

david rush
--
(Unix) is not so much a product as it is a painstakingly compiled
oral history of the hacker subculture. It is our Gilgamesh epic.
	-- Neal Stephenson, _In the Beginning... Was the Command Line_

Joe Marshall

Apr 22, 2003, 1:27:10 PM

David Rush <dr...@aol.net> writes:

> So, indexing 119 old comp.lang.scheme postings on a lightly laden
> 2GHz Pentium-4 w/1GByte RAM
>
> with a-lists: 279.740u 0.190s 4:41.38 99.4%
> with self-organizing a-lists: 82.110u 0.120s 1:23.01 99.0%
> with hash-tables: 16.850u 0.070s 0:17.07 99.3%

You should take the log of the times for comparison:
(log 279.740) => 5.6   a-lists
(log  82.110) => 4.4   self-organizing a-lists
(log  16.850) => 2.8   hash-tables

This makes it easy to see that the relative performance increase in
going from self-organizing alists to hash-tables is larger than the
increase from alists to self-organizing alists.

Phil Bewig

Apr 22, 2003, 11:16:19 PM

David Rush <dr...@aol.net> wrote in message news:<b83tcj$fd6$1...@pixie.nscp.aoltw.net>...

> But what it *really* means is that we could use
> someone to step up and produce a really good dictionary SRFI that
> allows multiple implementation models, much as Olin's Sorting SRFI
> attempts to do.
>
> Any takers?

I *really* don't have time right now, but here is a description of
the abstract data type I wrote for a fully-applicative dictionary
based on treaps. Of course, this implies an ordering relationship
between keys, not just an equality relationship as with a-lists or
hash tables, so I guess the SRFI would require two different models.
Actually four different models, with applicative/imperative the
other dimension. This is starting to sound complicated.

Phil

;;; DICT -- maintain key/value dictionary of ordered keys
;;; CREATE-DICT lt? -- create a new dict honoring the lt? less-than predicate
;;; dict 'LOOKUP key -- (key . value) pair if key in dict, #f otherwise
;;; dict 'INSERT key value -- newly-allocated dict containing key and value
;;; dict 'DELETE key -- newly-allocated dict not containing key
;;; dict 'UPDATE key value proc -- store (proc key oldvalue) or value in dict
;;; dict 'SPLIT key -- return two dicts, less-than key, greater-than key
;;; dict 'JOIN dict2 -- merge dict2 into dict -- assume disjoint keys
;;; dict 'MAP proc -- replace all values in dict with (proc key oldvalue)
;;; dict 'FOLD proc base -- apply proc pairwise to base and items, in order
;;; dict 'FOLD-REV proc base -- apply proc pairwise to base and items, reverse
;;; dict 'FOR-EACH proc -- apply proc elementwise to each item in key order
;;; dict 'FOR-EACH-REV proc -- apply proc elementwise to each item in reverse
;;; dict 'TO-LIST -- newly-allocated list of (key . value) pairs in key order
;;; dict 'TO-LIST-REV -- newly-allocated list of (key . value) pairs reversed
;;; dict 'FROM-LIST list -- insert each (key . value) pair in list into dict
;;; dict 'CLEAR -- newly-allocated dict with no data using same lt? predicate
;;; dict 'EMPTY? -- #t if dict contains no (key . value) pairs, #f otherwise
;;; dict 'SIZE -- number of (key . value) pairs in dict
;;; dict 'MIN -- the (key . value) pair with minimum key, or #f if empty
;;; dict 'MAX -- the (key . value) pair with maximum key, or #f if empty
;;; dict 'DELETE-MIN -- newly-allocated dict not containing minimum element
;;; dict 'DELETE-MAX -- newly-allocated dict not containing maximum element
;;; DICT? obj -- #t if obj is a dict, #f otherwise
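
A hypothetical transcript of driving that interface, assuming the
dict is a procedure dispatching on a message symbol (lower-cased
here; R5RS symbols are case-insensitive) -- invented for
illustration, not taken from the actual treap code:

(define d0 (create-dict <))          ; empty, keys ordered by <
(define d1 (d0 'insert 1 "one"))     ; applicative: d0 is unchanged
(define d2 (d1 'insert 2 "two"))
(d2 'lookup 1)                       ; => (1 . "one")
(d2 'lookup 9)                       ; => #f
(d2 'size)                           ; => 2
((d2 'delete 1) 'to-list)            ; => ((2 . "two"))
(d0 'empty?)                         ; => #t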


Scott G. Miller

Apr 22, 2003, 11:28:32 PM

David Rush wrote:

>
> This leaves open the interesting possibilities of twiddling the
> hash-table params to see what the other dimensions of the performance
> curve look like. But what it *really* means is that we could use
> someone to step up and produce a really good dictionary SRFI that
> allows multiple implementation models, much as Olin's Sorting SRFI
> attempts to do.
>
> Any takers?
>

An SRFI for Scheme collections will be submitted for draft status in
the next two days. It does not specifically define a dictionary, but
rather a semantic framework in which one would cozily live.

Scott

be...@sonic.net

Apr 23, 2003, 11:59:29 AM

David Rush wrote:


> This leaves open the interesting possibilities of twiddling the
> hash-table params to see what the other dimensions of the performance
> curve look like. But what it *really* means is that we could use
> someone to step up and produce a really good dictionary SRFI that
> allows multiple implementation models, much as Olin's Sorting SRFI
> attempts to do.
>
> Any takers?

Hmmm. I have a hash table library I'm debugging now, but it
may be too specialized to propose as a SRFI.

What's tricky or specialized about it is that it handles growth and
shrinking (and the rehashing into larger or smaller hash tables)
incrementally, in order to avoid big performance hits on a few calls.
This makes it less space-efficient, but allows a near-absolute
guarantee of constant-time insertions, deletions, and lookups, without
caveats like "unless it has to rebuild its table." There's still a
small performance hit on calls when it allocates a new (bigger or
smaller) table, especially as tables get large, but it's nothing like
the performance hit of rehashing all the keys at once.

It uses open hashing (each "bucket" is the head of a list of
key/value pairs), so it also has consing space overhead and stores its
keys. And its constants are chosen so that while growth or shrinking
is actually taking place, an average of 3 keys are rehashed from the
old table into the new one for every 4 insert/delete calls.
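
Not Bear's code, but the general shape of incremental rehashing is
easy to sketch. Everything below -- the names, the one-bucket-per-call
migration rate, the assumed HASH procedure mapping a key to a
nonnegative integer -- is invented for illustration:

;; T is a three-slot vector: an OLD bucket vector being drained (or
;; #f), the NEW bucket vector, and the next OLD index to migrate.
;; Buckets are a-lists.

(define (make-itable m) (vector #f (make-vector m '()) 0))

(define (bucket-of vec key)
  (modulo (hash key) (vector-length vec)))

(define (itable-step! t)
  ;; Migrate one OLD bucket into NEW; called from the mutators so the
  ;; cost of a rehash is spread over many operations.
  (let ((old (vector-ref t 0)))
    (if old
        (let ((i (vector-ref t 2)))
          (if (< i (vector-length old))
              (begin
                (for-each
                 (lambda (kv)
                   (let* ((new (vector-ref t 1))
                          (b (bucket-of new (car kv))))
                     (vector-set! new b (cons kv (vector-ref new b)))))
                 (vector-ref old i))
                (vector-set! t 2 (+ i 1)))
              (vector-set! t 0 #f))))))   ; drain complete

(define (itable-lookup t key)
  ;; Probe NEW first, then OLD if a drain is still in progress.
  (or (assoc key (vector-ref (vector-ref t 1)
                             (bucket-of (vector-ref t 1) key)))
      (let ((old (vector-ref t 0)))
        (and old (assoc key (vector-ref old (bucket-of old key)))))))

(define (itable-insert! t key value)
  ;; Growth check elided: when the load factor passes a threshold,
  ;; install a bigger vector as NEW, demote the current NEW to OLD,
  ;; and reset the scan index. Duplicate keys are also not handled.
  (itable-step! t)
  (let* ((new (vector-ref t 1))
         (b (bucket-of new key)))
    (vector-set! new b (cons (cons key value) (vector-ref new b)))))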

Anyway, as I said, I optimized it for consistent time performance,
not for space efficiency. It uses between 4/3 and 16/3 vector cells
per key/value pair when growing in an unbroken series of insertions,
and between 10 and 47 vector cells per key/value pair when shrinking
in an unbroken series of deletions. When the size of the table
stabilizes and it's no longer moving entries from an old internal
table to a new one, it settles between 4/3 and 10 vector cells per
key/value pair.

Would this be the kind of hash table that people want?

Bear

Michael Burschik

Apr 24, 2003, 2:06:44 AM

be...@sonic.net wrote in message news:<3EA6B85C...@sonic.net>...

> David Rush wrote:
>
>
> > This leaves open the interesting possibilities of twiddling the
> > hash-table params to see what the other dimensions of the performance
> > curve look like. But what it *really* means is that we could use
> > someone to step up and produce a really good dictionary SRFI that
> > allows multiple implementation models, much as Olin's Sorting SRFI
> > attempts to do.
> >
> > Any takers?
>
> Hmmm. I have a hash table library I'm debugging now, but it
> may be too specialized to propose as a SRFI.
>
> [stuff deleted]

>
> Would this be the kind of hash table that people want?
>
> Bear

Implementation details are beside the point. An SRFI would specify
the interface. Your reference implementation can be as good or as bad
as you choose.

And the interface of a dictionary library should be pretty limited,
wouldn't you say? Please correct me if I err, but dictionary-create,
dictionary-get, dictionary-set!, dictionary-map, dictionary-for-each
and possibly dictionary-remove should be more or less all that needs
to be specified.
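
For what it's worth, that whole interface fits in a screenful when
backed by a plain a-list. A deliberately naive sketch (the names
follow the list above, with a ! added where mutation happens); the
point of the SRFI would be that the representation can change without
this interface changing:

(define (dictionary-create) (list '*dict*))   ; headed a-list

(define (dictionary-get dict key)
  (let ((p (assoc key (cdr dict))))
    (and p (cdr p))))

(define (dictionary-set! dict key value)
  (let ((p (assoc key (cdr dict))))
    (if p
        (set-cdr! p value)
        (set-cdr! dict (cons (cons key value) (cdr dict))))))

(define (dictionary-remove! dict key)
  (let loop ((prev dict) (rest (cdr dict)))
    (cond ((null? rest) #f)
          ((equal? key (caar rest)) (set-cdr! prev (cdr rest)) #t)
          (else (loop rest (cdr rest))))))

(define (dictionary-for-each proc dict)   ; PROC receives key and value
  (for-each (lambda (p) (proc (car p) (cdr p))) (cdr dict)))

(define (dictionary-map! proc dict)       ; replace each value in place
  (for-each (lambda (p) (set-cdr! p (proc (car p) (cdr p)))) (cdr dict)))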

Regards

Michael Burschik

David Rush

Apr 24, 2003, 7:14:11 AM

be...@sonic.net writes:
> David Rush wrote:
> > This leaves open the interesting possibilities of twiddling the
> > hash-table params to see what the other dimensions of the performance
> > curve look like. But what it *really* means is that we could use
> > someone to step up and produce a really good dictionary SRFI that
> > allows multiple implementation models, much as Olin's Sorting SRFI
> > attempts to do.

> Hmmm. I have a hash table library I'm debugging now, but it
> may be too specialized to propose as a SRFI.
>
> What's tricky or specialized about it is that it handles
> growth and shrinking (and the rehashing into larger or smaller
> hash tables) incrementally in order to avoid big performance
> hits on a few calls.

very cute.

> Would this be the kind of hash table that people want?

It is *a* kind of hash table that I'd like to have in my toolbox,
especially for interactive or near-real-time applications. Which is
precisely my point: a dictionary interface can be implemented in
multiple ways to satisfy different performance requirements. A
dictionary is *not* a structure that can be served by a
one-size-fits-all implementation. So the SRFI challenge is to provide
a standard interface which admits *multiple* implementations.

david rush
--
There is only one computer program, and we're all just writing
pieces of it.
-- Thant Tessman (on comp.lang.scheme)

David Rush

Apr 24, 2003, 7:20:32 AM

Michael....@gmx.de (Michael Burschik) writes:
> be...@sonic.net wrote in message news:<3EA6B85C...@sonic.net>...
> > David Rush wrote:
> > > we could use
> > > someone to step up and produce a really good dictionary SRFI that
> > > allows multiple implementation models, much as Olin's Sorting SRFI
> > > attempts to do.
> >
> > Hmmm. I have a hash table library I'm debugging now, but it
> > may be too specialized to propose as a SRFI.
> > Would this be the kind of hash table that people want?
>
> Implementation details are beside the point. An SRFI would specify the
> interface. Your reference implementation can be as good or as bad as
> you choose.

No, you miss the point. The SRFI needs to specify how a family of
implementations can live happily together because the performance
details are very important and very application-specific.

> And the interface of a dictionary library should be pretty limited,
> wouldn't you say? Please correct me if I err, but dictionary-create,
> dictionary-get, dictionary-set!, dictionary-map, dictionary-for-each
> and possibly dictionary-remove should be more or less all that needs
> to be specified.

A few more, surely: key-only, value-only, and full-association
enumerating combinators, plus fold and unfold at the least. I'm trying
hard to resist the temptation to slap something together right now
because I'm supposed to be working on an I/O SRFI ;)

david rush
--
A beer barf is usually very satisfying, whereas a Tequila puke tends
to stress the stomach more, and seeing the worm for the second time is
never much fun.
-- Rob Crittenden (on mcom.bad-attitude)

Scott G. Miller

Apr 24, 2003, 9:33:39 AM

Again: SRFI-44 (Collections) has been submitted; please wait until
the editors post it as a draft so that a hash-table SRFI can play
nice with it.

Scott

Michael Burschik

Apr 25, 2003, 2:08:16 AM

David Rush <dr...@aol.net> wrote in message news:<b88h9v$70u$5...@pixie.nscp.aoltw.net>...

> Michael....@gmx.de (Michael Burschik) writes:
> > Implementation details are beside the point. An SRFI would specify the
> > interface. Your reference implementation can be as good or as bad as
> > you choose.
>
> No, you miss the point. The SRFI needs to specify how a family of
> implementations can live happily together because the performance
> details are very important and very application-specific.

No, I am not missing the point. I am simply disagreeing with you. I
do not believe that the SRFI should be concerned with implementation
details. A dictionary SRFI would define a new data type (the
dictionary) and a few functions to operate on this data type. It
should not mandate how this data type is to be implemented, nor need
it provide the user with the power to do so.

Most users seem to be quite happy using lists without being able to
specify how those lists are implemented, although the representation
has a profound impact on performance, and I think the same should
apply to dictionaries.



> > And the interface of a dictionary library should be pretty limited,
> > wouldn't you say? Please correct me if I err, but dictionary-create,
> > dictionary-get, dictionary-set!, dictionary-map, dictionary-for-each
> > and possibly dictionary-remove should be more or less all that needs
> > to be specified.
>
> A few more, surely: key-only, value-only, and full-association
> enumerating combinators, plus fold and unfold at the least. I'm trying
> hard to resist the temptation to slap something together right now
> because I'm supposed to be working on an I/O SRFI ;)

Whatever. But the interface should be kept as small as possible.

Regards

Michael Burschik

Jonathan Bartlett

Apr 30, 2003, 1:02:14 AM

If anyone's interested, I've been playing with a purely functional
implementation of binary trees, which can be used for sets, leftist
heaps, and dictionaries. It assumes a total ordering of keys, and has
some convenience functions for creating string-based dictionaries and
other stuff.

If anyone wants to look at it or comment on it, I've made it available
at

http://www.eskimo.com/~johnnyb/tree.scm

Note that this is completely unbalanced. I plan on implementing
red-black trees later. I think someone has an AVL tree implementation
in one of the Scheme archives, but I can't remember who.
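
The core of such a tree is small. Something like the following (a
from-scratch sketch of the same idea, not the contents of tree.scm;
a node is (key value left right), the empty tree is '()):

(define (tree-lookup lt? tree key)
  ;; => the (key . value) pair for KEY, or #f
  (if (null? tree)
      #f
      (let ((k (car tree)))
        (cond ((lt? key k) (tree-lookup lt? (caddr tree) key))
              ((lt? k key) (tree-lookup lt? (cadddr tree) key))
              (else (cons k (cadr tree)))))))

(define (tree-insert lt? tree key value)
  ;; Purely functional: returns a new tree that shares every subtree
  ;; off the insertion path with the old one.
  (if (null? tree)
      (list key value '() '())
      (let ((k (car tree)) (v (cadr tree))
            (l (caddr tree)) (r (cadddr tree)))
        (cond ((lt? key k) (list k v (tree-insert lt? l key value) r))
              ((lt? k key) (list k v l (tree-insert lt? r key value)))
              (else (list key value l r))))))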

It almost seems to me that the abstraction of a dictionary needs to
be separated from the implementation of a dictionary. This way,
people can use multiple dictionary implementations, and only have to
vary the constructors. Something like:

((get-dictionary-function 'lookup my-dict) "hello")

returns the dictionary entry for "hello".

This way, the total ordering or non-ordering of keys only matters at
dictionary creation, when the user picks his implementation. As he
uses it, he can code against a single interface.
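
One way to cash that out: give each dictionary a vtable of operations
keyed by symbol, so that only construction knows the representation.
A sketch with invented names (ERROR is SRFI-23-style, not R5RS):

(define (make-dictionary vtable data) (cons vtable data))
(define (dictionary-data dict) (cdr dict))

(define (get-dictionary-function op dict)
  ;; Look OP up in the dictionary's own operation table and return a
  ;; procedure closed over the dictionary itself.
  (let ((entry (assq op (car dict))))
    (if entry
        (lambda args (apply (cdr entry) dict args))
        (error "unsupported dictionary operation" op))))

A tree-backed constructor would build its vtable from tree-lookup and
friends, an a-list-backed one from assoc, and client code like the
lookup above never changes.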

Scott G. Miller

Apr 30, 2003, 4:52:03 PM

Jonathan Bartlett wrote:
> It almost seems to me that the abstraction of a dictionary needs to
> be separated from the implementation of a dictionary. This way,
> people can use multiple dictionary implementations, and only have to
> vary the constructors. Something like:
>
> ((get-dictionary-function 'lookup my-dict) "hello")
>
> returns the dictionary entry for "hello".
>
> This way, the total ordering or non-ordering of keys only matters at
> dictionary creation, when the user picks his implementation. As he
> uses it, he can code against a single interface.

See SRFI-44, now in draft.

Scott

Jonathan Bartlett

May 1, 2003, 4:15:44 PM

>
> See SRFI-44, now in draft.
>
> Scott

This SRFI doesn't seem to support purely functional implementations of
the data structures. In fact, it seems that a lot of Schemers are
drifting away from even mostly-functional programming. I agree that
non-functional programs take less time to implement at first (pun
intended, jokingly), but following a mostly-functional approach yields
long-term benefits that dramatically outweigh that. Yet people who are
actively promoting and using Scheme seem to be forgetting their
lambda-calculus roots and are creating all of the libraries and glue in
a non-functional manner. Why is this? I'm really curious.

Scott G. Miller

May 1, 2003, 4:30:31 PM
to srf...@srfi.schemers.org

This is a matter of debate on the SRFI mailing list as well. While I
would very much like to change the SRFI to support functional
collections, the pure fact of the matter is that not all collections
can be purely functional. Having a collections interface to a database
result set, for example, is going to be non-functional by virtue of
having to access a database driver whose API is almost certainly
C-style in nature.

Many collection types could be implemented functionally. The list
collection in the SRFI is an excellent example. Some would be more
difficult. Without care, many of the more complicated data structures
could be very inefficient, requiring large amounts of copying to
produce the data structure resulting from an add or remove operation.

In short, it would be detrimental to the SRFI and its users to mandate
a purely functional interface, as it would preclude many collection
types and make many more inefficient.

Scott


Scott G. Miller

May 1, 2003, 4:42:50 PM

Scott G. Miller wrote:
> Jonathan Bartlett wrote:
>
>>> See SRFI-44, now in draft.
>>>
>>> Scott
>>
>> This SRFI doesn't seem to support purely functional implementations of
>> the data structures. In fact, it seems that a lot of Schemers are
>> drifting away from even mostly-functional programming. I agree that
>> non-functional programs take less time to implement at first (pun
>> intended, jokingly), but following a mostly-functional approach yields
>> long-term benefits that dramatically outweigh that. Yet people who are
>> actively promoting and using Scheme seem to be forgetting their
>> lambda-calculus roots and are creating all of the libraries and glue in
>> a non-functional manner. Why is this? I'm really curious.
>
>
> This is a matter of debate on the SRFI mailing list as well. While I
> would very much like to change the SRFI to support functional
> collections, the pure fact of the matter is that not all collections can
> be purely functional. Having a collections interface to a database
> result set for example is going to be non-functional by virtue of having
> to access a database driver whose API is almost certainly C-style in
> nature.
>
To be clear, I will consider an interface that supports functional
and non-functional collections. I'm against forcing functional
collections.

Scott

Sander Vesik

May 5, 2003, 7:16:26 PM

David Rush <dr...@aol.net> wrote:

[snip - simple lists for dictionaries are dead-dead-dead 8-) ]

>
> n 10 20 30 40 50 60 70 80 90 100 110 120
> o 4.72 7.97 14.57 19.98 24.54 32.34 41.58 49.49 56.04 65.46 76.75 84.06
> h 3.20 4.23 6.02 7.37 8.27 9.65 11.32 12.39 13.31 14.53 16.23 17.09
>
> So what does all of this tell us? I realize that without the code it
> doesn't say a lot (if there's sufficient interest, I'll post it to the
> web), but the self-organizing code is surprisingly linear in its
> growth. Also, it is worth noting that the hash tables never rehashed
> in this application: my initial allocation of 5003 buckets was enough
> that the average bucket search length never grew beyond 4. So I guess
> we can conclude that the amortized cost of hashed lookup is rather a
> lot less than the amortized cost of self-organizing lookup. Or, more
> precisely, that the Usenet text in c.l.s has insufficient locality
> (we don't repeat ourselves every repeat fourth repeat word) to
> out-perform the hash-table parameters.

Did you use chained or linear-probing hash tables? Linear-probing
hash tables on top of, say, a vector give you additional
cache-locality benefits.

>
> This leaves open the interesting possibilities of twiddling the
> hash-table params to see what the other dimensions of the performance
> curve look like. But what it *really* means is that we could use
> someone to step up and produce a really good dictionary SRFI that
> allows multiple implementation models, much as Olin's Sorting SRFI
> attempts to do.

Please do remember to allow for multiple hash functions per table, at
least internally, so that you can use advanced hash-table variants.

>
> Any takers?
>
> david rush

--
Sander

+++ Out of cheese error +++

David Rush

May 6, 2003, 8:59:31 AM

Sander Vesik <san...@haldjas.folklore.ee> writes:
> David Rush <dr...@aol.net> wrote:
> [snip - simple lists for dictionaries are dead-dead-dead 8-) ]
>
> > n 10 20 30 40 50 60 70 80 90 100 110 120
> > o 4.72 7.97 14.57 19.98 24.54 32.34 41.58 49.49 56.04 65.46 76.75 84.06
> > h 3.20 4.23 6.02 7.37 8.27 9.65 11.32 12.39 13.31 14.53 16.23 17.09
> >
> > So what does all of this tell us?
>
> Did you use chained or linear-probing hash tables? Linear-probing
> hash tables on top of, say, a vector give you additional
> cache-locality benefits.

Chained. That seems to be the consensus implementation in the Lisp
world (since the lists require no brain-cycles to implement, I
suppose). The Smalltalks I worked with in the '90s used linear
probing. The big downside of linear probing is that deletion requires
a rehash.

> > But what it *really* means is that we could use
> > someone to step up and produce a really good dictionary SRFI that
> > allows multiple implementation models, much as Olin's Sorting SRFI
> > attempts to do.
>
> Please do remember to allow for multiple hash functions per table at
> least internaly so that you can use advanced hash table variants.

I'm just going to have to look like a thick-o... I don't understand
this at all.

david rush
--
Property is theft.
-- What is Property? (Pierre-Joseph Proudhon)

Sander Vesik

May 7, 2003, 12:22:44 PM

David Rush <dr...@aol.net> wrote:
> Sander Vesik <san...@haldjas.folklore.ee> writes:
>> David Rush <dr...@aol.net> wrote:
>> [snip - simple lists for dictionaries are dead-dead-dead 8-) ]
>>
>> > n 10 20 30 40 50 60 70 80 90 100 110 120
>> > o 4.72 7.97 14.57 19.98 24.54 32.34 41.58 49.49 56.04 65.46 76.75 84.06
>> > h 3.20 4.23 6.02 7.37 8.27 9.65 11.32 12.39 13.31 14.53 16.23 17.09
>> >
>> > So what does all of this tell us?
>>
>> Did you use chained or linear-probing hash tables? Linear-probing
>> hash tables on top of, say, a vector give you additional
>> cache-locality benefits.
>
> Chained. That seems to be the consensus implementation in the Lisp
> world (since the lists require no brain-cycles to implement, I
> suppose). The Smalltalks I worked with in the '90s used linear
> probing. The big downside of linear probing is that deletion requires
> a rehash.

No, not if you have 'deleted entry' semantics. But yes, linear
probing can get messy in the end. IMHO packed chains are the best
middle ground.
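
That is, delete just drops a tombstone into the slot, and probe
sequences skip over tombstones instead of stopping. A sketch (names
invented; HASH assumed; note the table must always keep at least one
genuinely empty slot or the probe loop won't terminate):

;; A slot is #f (never used), the symbol DELETED (a tombstone), or a
;; (key . value) pair.

(define (probe-lookup slots key)
  (let ((m (vector-length slots)))
    (let loop ((i (modulo (hash key) m)))
      (let ((slot (vector-ref slots i)))
        (cond ((not slot) #f)                 ; truly empty: a miss
              ((and (pair? slot) (equal? (car slot) key)) slot)
              (else (loop (modulo (+ i 1) m))))))))  ; skips tombstones

(define (probe-delete! slots key)
  (let ((m (vector-length slots)))
    (let loop ((i (modulo (hash key) m)))
      (let ((slot (vector-ref slots i)))
        (cond ((not slot) #f)
              ((and (pair? slot) (equal? (car slot) key))
               (vector-set! slots i 'deleted)  ; tombstone, no rehash
               #t)
              (else (loop (modulo (+ i 1) m))))))))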

>
>> > But what it *really* means is that we could use
>> > someone to step up and produce a really good dictionary SRFI that
>> > allows multiple implementation models, much as Olin's Sorting SRFI
>> > attempts to do.
>>
>> Please do remember to allow for multiple hash functions per table at
>> least internaly so that you can use advanced hash table variants.
>
> I'm just going to have to look like a thick-o... I don't understand
> this at all.

I was referring to things like 'double hashing', '2-way chaining'
(OK, it's really n-way, but in most cases 2-way should be
sufficient), 'cuckoo hashing', and similar. Being able to do packed
chains for the hash tables would be really nice.

For an example of the n-way chaining, see

"Balanced Allocations" by Yossi Azar, Andrei Z. Broder,
Anna R. Karlin, Eli Upfal

http://www.cs.brown.edu/people/eli/papers/SICOMP29.pdf
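
The 2-way flavor is almost embarrassingly small to sketch. HASH1 and
HASH2 are assumed to be two independent hash functions; buckets are
a-lists:

(define (two-way-insert! vec key value)
  ;; Insert into whichever of the two candidate buckets is currently
  ;; shorter; per the paper above, this drops the longest chain from
  ;; roughly log n / log log n to log log n with high probability.
  (let* ((m  (vector-length vec))
         (i1 (modulo (hash1 key) m))
         (i2 (modulo (hash2 key) m))
         (i  (if (<= (length (vector-ref vec i1))
                     (length (vector-ref vec i2)))
                 i1
                 i2)))
    (vector-set! vec i (cons (cons key value) (vector-ref vec i)))))

(define (two-way-lookup vec key)
  ;; The pair could be in either candidate bucket, so check both.
  (let ((m (vector-length vec)))
    (or (assoc key (vector-ref vec (modulo (hash1 key) m)))
        (assoc key (vector-ref vec (modulo (hash2 key) m))))))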

Noel Welsh

May 9, 2003, 7:29:33 PM

David Rush <dr...@aol.net> wrote in message news:<b83tcj$fd6$1...@pixie.nscp.aoltw.net>...

> In the same general vein as Aubrey Jaffer's recent call for empirical
> measurements of programs ... I thought I'd offer
> some measurements that I took as I did some very basic optimizations
> to a document-indexing program.

To follow in the vein of empirical measurements: to draw statistically
significant conclusions from the timings, you need to measure the
variance as well. Then a simple t-test (assuming the distributions
appear normal) will suffice to say whether the observed difference in
performance is meaningful.

Noel

MJ Ray

May 9, 2003, 8:27:07 PM

Noel Welsh <noel...@yahoo.com> wrote:
> variance as well. Then a simple t-test (assuming the distributions
> appear normal) will suffice to say whether the observed difference in
> performance is meaningful.

Is that assumption safe, or should you instead be using non-parametric
tests?

MJR

David Rush

May 11, 2003, 6:16:35 PM

noel...@yahoo.com (Noel Welsh) writes:

> David Rush <dr...@aol.net> wrote in message news:<b83tcj$fd6$1...@pixie.nscp.aoltw.net>...
> > In the same general vein as Aubrey Jaffer's recent call for empirical
> > measurements of programs ... I thought I'd offer
> > some measurements that I took as I did some very basic optimizations
> > to a document-indexing program.
>
> To follow in the vein of empirical measurements, to draw statistically
> significant conclusions from the timing you need to measure the
> variance as well.

The variance of what? Would you prefer a mean/variance measure of
time for each of the dataset sizes? Or are you suggesting a
mean/variance measure over multiple different datasets at each size?
I'm afraid that the first would imply a measure of precision that the
test simply is not able to give. The second would be interesting as a
more definite statement of the growth characteristics of the
probabilistic data structures, but would also require a lot more time
to perform.

> Then a simple t-test (assuming the distributions
> appear normal) will suffice to say whether the observed difference in
> performance is meaningful.

I have seen very few examples in the IR literature where there was
enough dataset variability at a constant size (heck, this is a bloody
difficult thing to measure in itself) to even conclude that you had
normal distributions (w.r.t. the second variance measure above).
Perhaps my perception of performance measuring has been skewed by this
experience. It's also possible I just don't understand what you're
talking about. I've never formally studied statistics, and it's a
*big* hole in my knowledge these days.

On that note, does anyone know a good intro text (for stats) that will
lead me gently through to Bayes Rule and that kind of statistical
inference?

david rush
--
Scheme: Because everything is a function.
-- Anton van Straaten (the Scheme Marketing Dept from c.l.s)

Jens Axel Søgaard

May 11, 2003, 6:30:12 PM

David Rush wrote:
> On that note, does anyone know a good intro text (for stats) that will
> lead me gently through to Bayes Rule and that kind of statistical
> inference?

As one text in a probability course we had

"Introduction to Probability Theory"
by Hoel, Port and Stone

The sequel is called

"Introduction to Statistical Theory".

Borrow it before you decide whether to buy it or not.

--
Jens Axel Søgaard


Joe Marshall

May 13, 2003, 12:46:55 PM

David Rush <dr...@aol.net> writes:

> On that note, does anyone know a good intro text (for stats) that will
> lead me gently through to Bayes Rule and that kind of statistical
> inference?

Careful there... you'll end up being a Bayesian!

Take a look at http://bayes.wustl.edu/
