
Binary Tree vs. Hash


Xiangrong Fang

May 26, 2003, 2:10:34 AM
Hi ruby fans,

I have a question about algorithm efficiency.

I have used a hash in my Ruby program. It worked fine. However, I found
that the memory consumed by a hash is tremendous. While the hash file
stored on disk (using PStore) is merely 100-200KB, in memory it
consumes approximately 5-10 megabytes! I don't know if this is related to
the caching algorithm used by Ruby.

I tried to reduce this memory requirement and thought of using a binary
tree. I just did some tests in Delphi and it seems good: I loaded an English
dictionary of about 1MB (90,000+ lines) and it took only about 6-8MB of
memory, which seems much better than a hash.

Now my question is: since a hash is O(1) and a binary tree is O(log N),
does that mean a hash is inherently less storage-efficient than a tree? How
can I get the best of both?

Thanks a lot.

Shannon


Robert Klemme

May 26, 2003, 3:44:16 AM

"Xiangrong Fang" <xrf...@hotmail.com> schrieb im Newsbeitrag
news:2003052613572...@hotmail.com...

> Hi ruby fans,
>
> I have a question about algorithm efficiency.
>
> I have used hash in my ruby program. It worked fine. However, I found
> that the memory consumed by a hash is tremendous. While the hash file is
> stored on disk (using pstore) it is merely 100-200KB, but in memory, it
> consumes apporximately 5-10 Megabytes! I don't know if it is related to
> the cache algorithm used by ruby?

How did you measure this? Is there a way to know the amount of memory
used by an object graph?

> I tried to reduce this memory requirment and thought of using Binary
> Tree. Just did some test in Delphi, it seems good. I loaded an English
> dictionary of about 1M (90000+ lines) it just took about 6-8M of memory,
> seems much better than hash.

I don't think you can compare memory consumption across programming
languages like that.

> Now my question is, since hash is O(1) and BTree is O(LOG(N)), does it
> mean that hash is born to be less storage-efficient than BTree? How can
> I get the best of both BTree and hash?

It's a typical tradeoff: you can't have both speed AND
minimal memory consumption. If you *have* to keep memory consumption low,
I think there is a tree implementation in the RAA. Another option would
be to use an Array, keep it sorted and implement (or use) a binary
search, which gives fetch performance quite similar to a tree's. Insert
might or might not be slower.
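For illustration, the sorted-Array-plus-binary-search idea could look
roughly like this (just a sketch; the helper name and the words file are
made up):

  # Sketch: sort the word list once, then look words up with a binary search.
  def bsearch_include?(sorted, item)
    lo, hi = 0, sorted.size - 1
    while lo <= hi
      mid = (lo + hi) / 2
      return true if sorted[mid] == item
      sorted[mid] < item ? lo = mid + 1 : hi = mid - 1
    end
    false
  end

  words = File.readlines("words.txt").map { |w| w.chomp }.sort

  p bsearch_include?(words, "ruby")   # O(log n) instead of Array#include?'s O(n)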

Regards

robert

Simon Strandgaard

May 26, 2003, 5:32:13 AM
On Mon, 26 May 2003 10:44:16 +0200, Robert Klemme wrote:
> "Xiangrong Fang" <xrf...@hotmail.com> schrieb im Newsbeitrag
>> Now my question is, since hash is O(1) and BTree is O(LOG(N)), does it
>> mean that hash is born to be less storage-efficient than BTree? How can
>> I get the best of both BTree and hash?
>
> It's a typical tradeoff: you can't have both - space efficiency AND
> minimal memory consumption. If you *have* to keep memory consumption low
> I think there is a tree implementation in the RAA. Another option would
> be to use an Array, keep that sorted and implement (or use) a binary
> search, which gives quite similar fetch performance as a tree. Insert
> might or might not be slower.


How about 'skip-list's ???

http://c2.com/cgi/wiki?SkipList


--
Simon Strandgaard

Xiangrong Fang

May 26, 2003, 5:29:23 AM
Hi Robert,


On Mon, 26 May 2003 17:18:35 +0900
"Robert Klemme" <bob....@gmx.net> wrote:

> > I have used hash in my ruby program. It worked fine. However, I found
> > that the memory consumed by a hash is tremendous. While the hash file is
> > stored on disk (using pstore) it is merely 100-200KB, but in memory, it
> > consumes apporximately 5-10 Megabytes! I don't know if it is related to
> > the cache algorithm used by ruby?
>
> How did you measure this? Is there a way to know the amount of memory
> used of an object graph?

I have no good way to measure this. What I did was to watch the *total*
memory required by the program, with or without the hash. I don't know
if there is a way to do what you mentioned. Maybe some Ruby experts
know?

>
> It's a typical tradeoff: you can't have both - space efficiency AND
> minimal memory consumption. If you *have* to keep memory consumption low
> I think there is a tree implementation in the RAA. Another option would
> be to use an Array, keep that sorted and implement (or use) a binary
> search, which gives quite similar fetch performance as a tree. Insert
> might or might not be slower.

In my program lookup is very frequent, but update is rare. I tried using
Array.include?, but it seems to use a sequential search, not a binary one.
The performance is unbearable. Does anyone know whether Ruby's internal
Array.include? uses binary search or not?

Thanks!

Shannon

Michael Neumann

May 26, 2003, 5:54:24 AM
On Mon, May 26, 2003 at 06:29:23PM +0900, Xiangrong Fang wrote:
> > It's a typical tradeoff: you can't have both - space efficiency AND
> > minimal memory consumption. If you *have* to keep memory consumption low
> > I think there is a tree implementation in the RAA. Another option would
> > be to use an Array, keep that sorted and implement (or use) a binary
> > search, which gives quite similar fetch performance as a tree. Insert
> > might or might not be slower.
> In my program lookup is very freqent, but update is rare. I tried using
> Array.include? , but it seems is using seqential search, not binary. The
> performance is unbearable. Does anyone know if ruby's internal
> Array.include? is using binary search or not?

I am not sure what your program is doing, but if it's something similar to
a search engine, then you might consider using an inverted word list,
which need not be loaded into memory at all.
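As a rough illustration (a toy sketch only; a real inverted list would live
on disk, e.g. one entry per word in a dbm file):

  # Sketch: map each word to the list of document ids that contain it.
  index = Hash.new { |h, k| h[k] = [] }

  documents = { 1 => "red green blue", 2 => "green tea", 3 => "blue sky" }
  documents.each do |id, text|
    text.split.each { |word| index[word] << id }
  end

  p index["green"]   #=> [1, 2]
  p index["blue"]    #=> [1, 3]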

Regards,

Michael

--
ruby -r complex -e 'c,m,w,h=Complex(-0.75,0.136),50,150,100;puts "P6\n#{w} #{h}\n255";(0...h).each{|j|(0...w).each{|i|
n,z=0,Complex(.9*i/w,.9*j/h);while n<=m&&(z-c).abs<=2;z=z*z+c;n+=1 end;print [10+n*15,0,rand*99].pack("C*")}}'|display

Florian Frank

May 26, 2003, 6:00:22 AM
On 2003-05-26 18:29:23 +0900, Xiangrong Fang wrote:
> performance is unbearable. Does anyone know if ruby's internal
> Array.include? is using binary search or not?

No, it can't because arrays don't have to be sorted. "include?" is mixed
in by Enumerable which uses "each" to iterate over the elements.

--
4) "A TRUE Klingon Warrior does not comment his code!"
-- Top 12 things likely to be overheard if you had a Klingon Programmer

Brian Candler

May 26, 2003, 8:11:59 AM
On Mon, May 26, 2003 at 06:29:23PM +0900, Xiangrong Fang wrote:
> I have no good way to measure this. What I did is to watch the *total*
> memory required by the program, with or without the hash. I don't know
> if there is a way to do what you mentioned. May be some ruby experts
> know?

You are probably creating a lot of temporary garbage objects, so you're
measuring the size of hash + garbage pool.

You could try 'GC.start' to do garbage collection, but I don't know if Ruby
then un-mallocs the storage and returns it to the O/S.

> In my program lookup is very freqent, but update is rare. I tried using
> Array.include? , but it seems is using seqential search, not binary. The
> performance is unbearable. Does anyone know if ruby's internal
> Array.include? is using binary search or not?

It doesn't, because Array doesn't have any requirement to be sorted.

Regards,

Brian.

Mauricio Fernández

May 26, 2003, 9:32:20 AM
On Mon, May 26, 2003 at 09:11:59PM +0900, Brian Candler wrote:
> On Mon, May 26, 2003 at 06:29:23PM +0900, Xiangrong Fang wrote:
> > I have no good way to measure this. What I did is to watch the *total*
> > memory required by the program, with or without the hash. I don't know
> > if there is a way to do what you mentioned. May be some ruby experts
> > know?
>
> You are probably creating a lot of temporary garbage objects, so you're
> measuring the size of hash + garbage pool.
>
> You could try 'GC.start' to do garbage collection, but I don't know if Ruby
> then un-mallocs the storage and returns it to the O/S.

It does free() the space. Whether or not it's returned to the OS depends
on the C library's malloc.

Maybe you could add some printf statements in obj_free (gc.c) and
st_free_table (st.c) to see how much memory was released, doing

theHash = some_method(*args)
GC.start
theHash = nil
GC.start # and now take a look at the last output lines

--
_ _
| |__ __ _| |_ ___ _ __ ___ __ _ _ __
| '_ \ / _` | __/ __| '_ ` _ \ / _` | '_ \
| |_) | (_| | |_\__ \ | | | | | (_| | | | |
|_.__/ \__,_|\__|___/_| |_| |_|\__,_|_| |_|
Running Debian GNU/Linux Sid (unstable)
batsman dot geo at yahoo dot com

* JHM wonders what Joey did to earn "I'd just like to say, for the record,
that Joey rules."
-- Seen on #Debian

Dave Thomas

May 26, 2003, 9:40:00 AM
Xiangrong Fang wrote:
> Hi ruby fans,
>
> I have a question about algorithm efficiency.
>
> I have used hash in my ruby program. It worked fine. However, I found
> that the memory consumed by a hash is tremendous. While the hash file is
> stored on disk (using pstore) it is merely 100-200KB, but in memory, it
> consumes apporximately 5-10 Megabytes! I don't know if it is related to
> the cache algorithm used by ruby?
>
> I tried to reduce this memory requirment and thought of using Binary
> Tree. Just did some test in Delphi, it seems good. I loaded an English
> dictionary of about 1M (90000+ lines) it just took about 6-8M of memory,
> seems much better than hash.

If you're looking for memory efficiency and are simply doing lookups,
perhaps you should look at using a Bloom filter. You could pack a 90,000
word dictionary into 64KB. By coincidence my latest Kata is about them:

http://pragprog.com/pragdave/Practices/Kata/KataFive.rdoc,v
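For a rough idea of the technique, a sketch in Ruby (the MD5-based probes
and the sizes are arbitrary choices for illustration, not what the Kata
prescribes):

  require 'digest/md5'

  # Sketch of a Bloom filter: m bits, k hash probes per word.
  class BloomFilter
    def initialize(bits, hashes)
      @m, @k = bits, hashes
      @bitmap = Array.new(bits, false)   # a real one would pack this into a bit string
    end

    def indices(word)
      (0...@k).map { |i| Digest::MD5.hexdigest("#{i}:#{word}").hex % @m }
    end

    def add(word)
      indices(word).each { |i| @bitmap[i] = true }
    end

    def include?(word)                   # may give false positives, never false negatives
      indices(word).all? { |i| @bitmap[i] }
    end
  end

  bf = BloomFilter.new(8 * 64 * 1024, 4)      # 64KB of bits, 4 probes
  %w[ruby perl python].each { |w| bf.add(w) }
  p bf.include?("ruby")    #=> true
  p bf.include?("cobol")   #=> false (probably)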

Cheers


Dave


Xiangrong Fang

May 26, 2003, 9:58:08 AM
Hi,

On Mon, 26 May 2003 22:32:20 +0900
Mauricio Fernández <batsm...@yahoo.com> wrote:


> > You could try 'GC.start' to do garbage collection, but I don't know if Ruby
> > then un-mallocs the storage and returns it to the O/S.

I did try GC.start, but it had ABSOLUTELY no effect. If I put the
program in the background for a while (at least 30 min, for example), it
then shrank its memory usage automatically. I am using Windows 2000, and I
have fairly big memory (320MB). I wonder: if I have more memory, does the
Ruby GC get lazier? :)

Thanks.
Shannon


MikkelFJ

May 26, 2003, 12:28:28 PM

"Xiangrong Fang" <xrf...@hotmail.com> wrote in message
news:2003052613572...@hotmail.com...

> Hi ruby fans,
>
> I have a question about algorithm efficiency.

OK

> I have used hash in my ruby program. It worked fine. However, I found
> that the memory consumed by a hash is tremendous. While the hash file is
> stored on disk (using pstore) it is merely 100-200KB, but in memory, it
> consumes apporximately 5-10 Megabytes! I don't know if it is related to
> the cache algorithm used by ruby?

I'm assuming 32 bit architecture below.

I don't know how the hash works in Ruby, or how you dumped it to disk.
But if the on-disk format is non-indexed, so that it must be loaded into
memory before becoming useful, the disk copy can be an awful lot smaller.
A hash can be quite memory consuming. Each entry in the hash typically has:
hash key, reference to real key, reference to data, reference to next
collision.

This is typically 4 32-bit words plus the actual key.

For efficient allocation you typically have 25% extra entries, but this can
vary, so you actually use 5 x 32 bits in this case.
You also need a bucket table. A bucket table could be the array of entries
directly, but it is better to have a separate table. The bucket table takes
up one reference to the entry. The bucket table is preferably more empty
than full (which is why it is better not to use the entry table directly).
How much overcapacity you use in the bucket table is a design decision, but
for at least half empty you need two references.

So we are totalling 7 x 32 bits per stored entry, not counting any keys
(strings or whatever). Keys are typically short, even for strings, so one or
two 32-bit words probably.

A rough total estimate is therefore 8 32-bit words per stored entry in the
hash table.
If you have 2 32-bit words worth of data stored in a flat, unindexed file,
you use 4 times that memory in an in-memory hash.
This suggests that Ruby should use around 1-2MB if your on-disk
data structure uses 100-200KB. However, Ruby probably has a lot of extra
per-object overhead for keys.
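As a back-of-the-envelope check of that estimate (the entry count is just an
assumed example):

  entries         = 90_000   # assumed number of keys (the dictionary example)
  words_per_entry = 8        # rough per-entry estimate from above
  bytes_per_word  = 4        # 32-bit architecture

  p entries * words_per_entry * bytes_per_word   #=> 2880000, i.e. roughly 3MB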

>
> I tried to reduce this memory requirment and thought of using Binary
> Tree. Just did some test in Delphi, it seems good. I loaded an English
> dictionary of about 1M (90000+ lines) it just took about 6-8M of memory,
> seems much better than hash.

Binary trees are typically implemented as red-black trees. They have a
reference to data, references to the left and right child, and often also a
reference to the parent. And they have some bits to store the color. This
can be done by tagging a bit in one of the pointers, but typically a node
uses 5 32-bit words plus the keys. If we use 2 32-bit words for storing
keys, we also end up using 7 32-bit words per entry.
That is only marginally better than hash tables.

> Now my question is, since hash is O(1) and BTree is O(LOG(N)), does it
> mean that hash is born to be less storage-efficient than BTree? How can
> I get the best of both BTree and hash?

Don't worry too much about the log(N) in B-Trees. It is not log2(N)
but logM(N), where M is in the range of 10 to 1000. For most practical
purposes the B-Tree can be viewed as close to O(1). However, the constant
factor may be somewhat larger than for hash tables. It just means a B-Tree
scales well.
A binary tree does not scale nearly as well. Here you truly get log(N)
performance.

A binary tree and a B-Tree keep an order relation, a hash does not
- or rather:

Usually you do not get to keep insertion order in hash tables, but it is in
fact possible to do so without paying any significant time/space overhead.
I have written such a hashtable, but only implemented it for 32-bit keys
(which therefore can be stored in-place in the hash entry). This is not
sorting, but it is extra ordering information.

Searching binary trees is usually faster than most things for small (<100)
item counts.
This assumes the tree allocates tree nodes from a common pool so all nodes
are close to each other (for cache performance).
Arrays are faster for small (<10) item counts (linear search, not binary
search).

A hash table generally performs very well for medium to large amounts of
data.
One problem is the memory consumption, which hurts cache efficiency. Another
problem is the large bucket table with its random access pattern. This is
also bad for cache performance. However, despite these problems, a good hash
is so fast that it still manages to beat more clever data structures except
in the competition for size.

A B-Tree entry uses [reference to key, reference to data]. It also uses
additional memory for internal tree nodes (assuming all data is stored at
leaf entries).
Had the B-Tree been binary, it would use the same amount of memory for
internal nodes as for leaf storage, but the branching is higher, so the
memory consumption drops. I don't have any exact measures here, but the cost
of internal nodes is limited.
A B-Tree operates at about 2/3 of allocated memory, so you need 3 32-bit
words plus internal node overhead. A rough estimate is therefore 4 words per
entry. Then you need to add the overhead for any keys you store externally.
Thus we get to about 5 32-bit words per entry.

The best way to work with in-memory B-Trees is to ensure a large branch
factor that is still cache friendly.
Furthermore, we want to scan a B-Tree node linearly and not use binary
search, because binary search is too slow for anything below 100 entries. A
good choice for in-memory nodes appears to be around 7-15 entries, but that
depends on the processor and the data stored. If the key is external, it may
be relevant to use binary search to reduce the number of comparisons.
The B-Tree generally has very good cache performance, except it can be hurt
by external keys.

The Judy tree / trie is even more complex than the B-Tree, but is more
memory efficient than B-Trees and may be slightly more cache friendly.

The B-Tree and Judy tree both scale well to very large data structures,
whereas a hash table gets into trouble as soon as you start thrashing to
disk. A binary tree is just not as efficient, long before memory becomes a
problem.

The Judy trie does not have a problem with external keys. I have developed a
B-Trie, which is a stack of B-Trees each holding a part of the key. It uses
too much memory because there are many mostly empty B-Tree nodes, but this
can be fixed by optimizing B-Trees for very small collections.

I believe the B-Tree is easier to update than Judy trees and is better for
combined in-memory and on-disk data structures. However, for strictly
in-memory operation the Judy tree may be better.

Another data structure, the skip-list, has been very popular because it is
10 times easier to implement than an efficient in-memory B-Tree. It has been
claimed to be better than B-Trees.
This is not true. A skip-list jumps all over memory and has poor cache
performance. It suffers from pointer chasing. I have done some benchmarks,
and I've seen one other person do similar investigations reaching the same
conclusion.
This does not mean you shouldn't use skip-lists. They are still useful
data structures - they may be better than red-black trees, which are
typically used for map implementations.

B-Trees are very kind to efficient allocation of memory. Judy trees and skip
lists allocate all sorts of different memory sizes. A B-Tree only allocates
one node size (and possibly one smaller node size for the optimized special
case of very small collections).

Mikkel


MikkelFJ

May 26, 2003, 12:41:25 PM

"Dave Thomas" <da...@pragprog.com> wrote in message
news:3ED21933...@pragprog.com...
> Xiangrong Fang wrote:

> If you're looking for memory efficiency are are simply doing lookups,
> perhaps you should look at using a Bloom filter. You could pack a 90,000
> word dictionary into 64kb. By coincidence my latest Kata is about them:
>
> http://pragprog.com/pragdave/Practices/Kata/KataFive.rdoc,v

I also recently found an article on lossless storage for lookup only, i.e.
it returns yes or no to a match correctly.
The method is based on finite state machines.
It could be used with attached data at some additional complexity - it
requires you to track the search path and use bits from each branch to
generate an index for the data reference.
I guess this is a fancy way of creating perfect hashes.

For multi-word searching I also found an article on using vector spaces.
Each document is given a vector and your query is also given a vector.
You enumerate all known words. Each word becomes a position in the vector.
The value in the vector is the number of word occurrences.
You locate the document with the smallest distance from the query vector,
using some Euclidean or other measure of distance. This also handles
inexact queries and is supposedly memory efficient.
The query would now need to look up each word in some hashtable or similar
to locate the vector index.
The documents need not be stored in memory, only their vector representation
(which can be compressed).
A dumb implementation would now scan all document vectors against the query,
but I'm sure there is a more clever way.
Perhaps this is an exercise for a new kata?
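A toy version of that vector-space scoring could look like this (a sketch
with made-up documents; real systems weight the counts, e.g. tf-idf, instead
of using them raw):

  # Sketch: represent each document as a word-count vector and rank
  # documents by Euclidean distance to the query vector.
  def vectorize(text, vocabulary)
    counts = Hash.new(0)
    text.split.each { |w| counts[w] += 1 }
    vocabulary.map { |w| counts[w] }
  end

  def distance(a, b)
    Math.sqrt(a.zip(b).inject(0) { |sum, (x, y)| sum + (x - y) ** 2 })
  end

  docs = {
    "doc1" => "ruby hash table memory",
    "doc2" => "binary tree search memory",
  }
  vocabulary = docs.values.map { |t| t.split }.flatten.uniq.sort

  query = vectorize("binary search", vocabulary)
  best  = docs.min { |a, b|
    distance(vectorize(a[1], vocabulary), query) <=>
    distance(vectorize(b[1], vocabulary), query)
  }
  p best[0]   #=> "doc2"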


Mikkel


Mauricio Fernández

May 26, 2003, 12:51:15 PM
On Mon, May 26, 2003 at 10:40:00PM +0900, Dave Thomas wrote:
> >I have a question about algorithm efficiency.
> >
> >I have used hash in my ruby program. It worked fine. However, I found
> >that the memory consumed by a hash is tremendous. While the hash file is
> >stored on disk (using pstore) it is merely 100-200KB, but in memory, it
> >consumes apporximately 5-10 Megabytes! I don't know if it is related to
> >the cache algorithm used by ruby?

IIRC you were using that for a spam filter or something like that ?

> >I tried to reduce this memory requirment and thought of using Binary
> >Tree. Just did some test in Delphi, it seems good. I loaded an English
> >dictionary of about 1M (90000+ lines) it just took about 6-8M of memory,
> >seems much better than hash.
>
> If you're looking for memory efficiency are are simply doing lookups,
> perhaps you should look at using a Bloom filter. You could pack a 90,000
> word dictionary into 64kb. By coincidence my latest Kata is about them:
>
> http://pragprog.com/pragdave/Practices/Kata/KataFive.rdoc,v

Are you sure the given example (80000 words into 16KB) is right?
It just doesn't seem to work according to the reference given in your
page:

16*8*1024/80000
1.63840

=> max hashes k = 1. P(false positive) = .46!!

However with 90000 words and 64KB, P(false positive) = .061 (k=4) which
is much better :)
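For reference, those numbers follow from the usual Bloom filter formulas,
roughly:

  # m = bits in the filter, n = words stored
  # optimal number of hashes: k = (m/n) * ln 2
  # false positive rate:      p = (1 - e^(-k*n/m)) ** k
  def bloom_stats(m, n)
    k = ((m.to_f / n) * Math.log(2)).round
    k = 1 if k < 1
    p_false = (1 - Math.exp(-k * n.to_f / m)) ** k
    [k, p_false]
  end

  p bloom_stats(16 * 1024 * 8, 80_000)   #=> [1, ~0.46]
  p bloom_stats(64 * 1024 * 8, 90_000)   #=> [4, ~0.061]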

--
_ _
| |__ __ _| |_ ___ _ __ ___ __ _ _ __
| '_ \ / _` | __/ __| '_ ` _ \ / _` | '_ \
| |_) | (_| | |_\__ \ | | | | | (_| | | | |
|_.__/ \__,_|\__|___/_| |_| |_|\__,_|_| |_|
Running Debian GNU/Linux Sid (unstable)
batsman dot geo at yahoo dot com

Because I don't need to worry about finances I can ignore Microsoft
and take over the (computing) world from the grassroots.
-- Linus Torvalds

MikkelFJ

May 26, 2003, 1:04:21 PM

"MikkelFJ" <mikkelfj-...@bigfoot.com> wrote in message
news:3ed243b6$0$97260$edfa...@dread12.news.tele.dk...


> The documents need not be stored in memory, only their vector
> representation (which can be compressed).
> A dumb implementation would now scan all document vectors against the
> query, but I'm sure there is a more clever way.
> Perhaps this is an exercise for a new kata?

Which makes me think of multi-key indices.
Using space filling curves (a kind of fractal), you can map a vector into a
single linear value.
The linear value is the position on the line of the space filling curve.
The vector is the position in space where the curve has the given distance.
A Peano curve is one such fractal.

The linear index can be stored in B-Trees or similar efficient lookup
data structures.

The nice thing about space filling curves is that many curves exhibit a kind
of sensible distance measure.
That is, if the distance between two linear indices is small, then the
points are also reasonably close together in space.
This is not an exact science, but it does work.
Space filling curves have been used to index images where each key is some
aspect of the image, such as how much yellow etc.
Given a new image of the same person in the same shirt etc. you should have
a chance of retrieving an old, similar image.

If we take this idea and apply it to document storage, we can now easily
find the best document matches by looking up the documents stored with
linear indices within a predetermined distance of the query. The distance
chosen may take into account the nature of the space filling curve.

A problem is that it can be a bit slow to calculate the linear index.
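As a concrete (if simpler) example of such a curve, a Morton / Z-order index
just interleaves the coordinate bits; a sketch:

  # Sketch: interleave the bits of two 16-bit coordinates into one
  # 32-bit Z-order index.
  def morton(x, y)
    index = 0
    16.times do |i|
      index |= ((x >> i) & 1) << (2 * i)
      index |= ((y >> i) & 1) << (2 * i + 1)
    end
    index
  end

  p morton(3, 5)   #=> 39
  # Nearby points tend to get nearby indices, so the index can be stored
  # in a B-Tree and range-scanned for approximate neighbours.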

Do we have a new google? :-)


Mauricio Fernández

May 26, 2003, 1:19:23 PM
On Tue, May 27, 2003 at 02:00:18AM +0900, MikkelFJ wrote:
> For multi word searching I also found an aritcle on using vector spaces.
> Each document is given a vector and your query is also given a vector.
> You enumerate all known words. Each word becomes a position in the vector.
> The value in the vector is the number of word occurrences.
> You locate the document with the smallest distance from the query vector
> using some euclidian or other measure
> of distance. This also handles inexact queries and is supposedly memory
> efficient.
> The query would now need to look up each word in some hashtable or similar
> to locate the vector index.
> The documents need not be stored in-memory, only their vector representation
> (which can be compressed).
> A dump implementation would now scan all document vectors against the query,
> but I'm sure there is more clever way.
> Perhaps this is an excercise for a new kata?

In fact Dave sort of solved the Kata already :)
http://pragprog.com/pragdave/Tech/Blog/Searching.rdoc,v

--
_ _
| |__ __ _| |_ ___ _ __ ___ __ _ _ __
| '_ \ / _` | __/ __| '_ ` _ \ / _` | '_ \
| |_) | (_| | |_\__ \ | | | | | (_| | | | |
|_.__/ \__,_|\__|___/_| |_| |_|\__,_|_| |_|
Running Debian GNU/Linux Sid (unstable)
batsman dot geo at yahoo dot com

Not only Guinness - Linux is good for you, too.
-- Banzai on IRC

Dave Thomas

May 26, 2003, 1:21:40 PM
Mauricio Fernández wrote:

> Are you sure the given example (80000 words into 16KB) is right?
> It just doesn't seem to work according to the reference given in your
> page:
>
> 16*8*1024/80000
> 1.63840
>
> => max hashes k = 1. P(false positive) = .46!!

Hmm - you're right, but now I'm seriously worried. I've carried those
numbers around in my head for 20 years, and I was sure they were
correct. Now my problem is trying to find a way to convert the old 8"
floppies in RT-11 format so I can work out why there's an
inconsistency.

Thanks for pointing it out.

Cheers


Dave


Dave Thomas

May 26, 2003, 1:22:45 PM
MikkelFJ wrote:

> For multi word searching I also found an aritcle on using vector spaces.
> Each document is given a vector and your query is also given a vector.
> You enumerate all known words. Each word becomes a position in the vector.
> The value in the vector is the number of word occurrences.
> You locate the document with the smallest distance from the query vector
> using some euclidian or other measure
> of distance. This also handles inexact queries and is supposedly memory
> efficient.
> The query would now need to look up each word in some hashtable or similar
> to locate the vector index.
> The documents need not be stored in-memory, only their vector representation
> (which can be compressed).
> A dump implementation would now scan all document vectors against the query,
> but I'm sure there is more clever way.
> Perhaps this is an excercise for a new kata?

Or for the original one... :)

http://pragprog.com/pragdave/Practices/CodeKata


Cheers


Dave


MikkelFJ

May 26, 2003, 1:59:02 PM

"Dave Thomas" <da...@pragprog.com> wrote in message
news:3ED24D67...@pragprog.com...
> MikkelFJ wrote:

> > Perhaps this is an excercise for a new kata?
>
> Or for the original one... :)
>
> http://pragprog.com/pragdave/Practices/CodeKata

This is why you should hang out in comp.lang.ruby: never mind the language,
I've yet to find a place where references to interesting technology
spread faster.

I'd have to get back to the bit counting thing; I guess I must read the
article.
Are you aware that there are efficient boolean operations for counting bits?
They are anything but intuitive, but someone figured out the instruction
sequences and listed them on the internet somewhere that has slipped me
momentarily. (Something cryptic like (0xaeaeaeae and myword) or
0x121212121 ....)

Mikkel


Clifford Heath

May 26, 2003, 8:27:41 PM
Dave Thomas wrote:
> ...you should look at using a Bloom filter. You could pack a 90,000
> word dictionary into 64kb.

As long as you don't mind missing 1 in 64-ish spelling errors.

A Bloom filter reaches optimal performance when an equal
number of bits are set and clear (each probe into the bitmap
therefore yields maximum selectivity). 64KB is 524288 bits,
which means you must use around six hash values from each word
in your 90,000 word dictionary. That yields a false hit
rate of 1:64, which IMO is insufficient for a useful
spell checker.

I mentioned this not to criticise, but to show folk how to
choose the best number of hash probes for a Bloom filter of
a given size, and how to predict the false hit rate.

I learnt this stuff by experiment while building a spell-checker.
I used 16-18 probes for each word, yielding a false hit rate of
1 in 65000+. For that performance, your 90,000 word dictionary
needs a filter of 200Kb+, not 64Kb.

Clifford.

Xiangrong Fang

May 26, 2003, 9:32:08 PM
Hi Mikkel,

Thanks for the detailed explanations. I have some questions regarding
your explanation:

1. You mentioned B-Tree and binary tree; it seems that they are different?

2. I didn't use any hash algorithm of my own, I just used Ruby's hash,
and used Ruby's PStore to dump it to disk. The memory usage is an
estimation from monitoring the task list in win2k. Do you have any
comments about the efficiency of using Ruby's hash and PStore?

3. My application is related to using the N-gram method to cut Chinese text
into words (a Chinese bigram is a 4-byte string). Do you have any comments
on which algorithm is better? I am very interested in your comments
about skip-lists; can you expand more?

Thanks a lot!

Shannon

Dave Thomas

May 26, 2003, 11:16:04 PM
Clifford Heath wrote:
> Dave Thomas wrote:
>
>> ...you should look at using a Bloom filter. You could pack a 90,000
>> word dictionary into 64kb.
>
>
> As long as you don't mind missing 1 in 64-ish spelling errors.
>
> A Bloom filter reaches optimal performance when there an equal
> number of odd and even bits set (each probe into the bitmap
> therefore yields maximum selectivity) 64Kb is 524288 bits,
> which means you must use around six hash values from each word
> in your 90,000 word dictionary. That yields false hit
> performance of 1:64, which IMO is insufficient for a useful
> spell checker.

Actually, it's worse than that. According to
http://www.cs.wisc.edu/~cao/papers/summary-cache/node8.html, for an m/n
ratio of 6, the filter works best with 4 hashes, which gives an error
rate of about 1/18, which is clearly silly. Next time I'll actually use
the tables rather than try to wing it in my head :)

> I learnt this stuff by experiment while building a spell-checker.
> I used 16-18 probes for each word, yielding a false hit rate of
> 1 in 65000+. For that performance, your 90,000 word dictionary
> needs a filter of 200Kb+, not 64Kb.

Interestingly, the same tables suggest that for a 200k hash, m/n will be
roughly 18, so k optimizes out at 12 or 13 hashes, for an error rate of
0.000176, or 1/5681, which probably still isn't good enough for
real-world use. It looks like about 300k is probably a good starting
point. I'm off to try to generate some real statistics just to confirm all
this.


Cheers


Dave

Clifford Heath

May 27, 2003, 1:12:08 AM
Dave Thomas wrote:
>> A Bloom filter reaches optimal performance when there an equal
>> number of odd and even bits set (each probe into the bitmap
>> therefore yields maximum selectivity)
> Actually, it's worse that that.

You're right, I forgot to divide by two (for half the bits set).
The discovery that half the bits should be set seems obvious now,
but it was a major "aha!" moment for me in 1985...

> Interestingly, the same tables suggest that for a 200k hash, m/n will be
> roughly 18, so k optimizes out at 12 or 13 hashes

I only had a 45,000 word list though, so k was higher.
The 0.6185 heuristic assumes a perfectly random hash.
I tested many different hash algs to find the cheapest
that was close to random.

Clifford.

Brian Candler

May 27, 2003, 4:20:40 AM
On Tue, May 27, 2003 at 12:16:04PM +0900, Dave Thomas wrote:
> Interestingly, the same tables suggest that for a 200k hash, m/n will be
> roughly 18, so k optimizes out at 12 or 13 hashes, for an error rate of
> 0.000176, or 1/5681, which probably still isn't good enough for
> real-world use. It looks like about 300k is probably a good starting
> point.

That's not very good though; I mean, just encoding each letter using 5 bits,
I reckon a 90,000 word dictionary might take under 600K [*]. And then you
have none of that "dangerous messing around with improbabilities" (as I
think Douglas Adams wrote). Admittedly it wouldn't be quick to search though.

Regards,

Brian.

[*]
$ wc /usr/share/dict/words
235881 235881 2493066 /usr/share/dict/words

So I'm guessing a 90,000 word dictionary would have
90000/235881 * 2493066 bytes
Encoding each character and newline as a 5-bit pattern gives 5/8ths of that,
or 594516 bytes, or about 580K.

ahoward

May 27, 2003, 10:15:59 AM
On Mon, 26 May 2003, Brian Candler wrote:

> It doesn't, because Array doesn't have any requirement to be sorted.

it does not - but it certainly can be. i've often wondered why Array does not
carry an internal @sorted flag - which could be used as a hint for certain
operations - like Array#include?

-a
--
====================================
| Ara Howard
| NOAA Forecast Systems Laboratory
| Information and Technology Services
| Data Systems Group
| R/FST 325 Broadway
| Boulder, CO 80305-3328
| Email: ara.t....@fsl.noaa.gov
| Phone: 303-497-7238
| Fax: 303-497-7259
| ~ > ruby -e 'p % ^) .intern'
====================================

Robert Klemme

May 27, 2003, 11:00:10 AM

"ahoward" <aho...@fsl.noaa.gov> schrieb im Newsbeitrag
news:Pine.LNX.4.53.03...@eli.fsl.noaa.gov...

> On Mon, 26 May 2003, Brian Candler wrote:
>
> > It doesn't, because Array doesn't have any requirement to be sorted.
>
> it does not - but it certainly can be. i've often wondered why Array does not
> carry an internal @sorted flag - which could be used as a hint for certain
> operations - like Array#include?

IMHO that's overly complicating Array. An Array is simply a sequence of
items. Period.

To have a sorted array I'd have either

- a sub class of Array (like SortedArray) that overrides modifying
operations to ensure this invariant (or does it depending on the flag you
mentioned) or

- a SortedArray delegating all non-modifying calls to an array and
overriding modifying methods.

I think this is the more component-oriented approach.

Regards

robert

ahoward

May 27, 2003, 11:26:59 AM
On Tue, 27 May 2003, Robert Klemme wrote:

> IMHO that's overly complicating Array. An Array is simply a sequence of
> items. Period.

on paper - sure. but in reality we have:


[ ] new & * + -- << <=> == === [ ] [ ]= | assoc at clear
collect collect! compact compact! concat delete delete_at
delete_if each each_index empty? eql? fill first flatten
flatten! include? index indexes indices join last length map!
nitems pack pop push rassoc reject! replace reverse reverse!
reverse_each rindex shift size slice slice! sort sort! to_a
to_ary to_s uniq uniq! unshift

mixins
Enumerable:
collect, detect, each_with_index, entries,
find, find_all, grep, include?, map, max,
member?, min, reject, select, sort, to_a


*much* more than a sequence of items.

what would be the issue if Enumerable was optimized for containers, like
Array, which could support random access and binary searches? the same
methods would exist - they would simply be faster. admittedly this is less
'general' but optimization for such a heavily used container as the Array
makes sense IMHO, especially when it could be done in such a way to be
completely backward compatible while giving a speed increase to practically
every ruby program ever written...

-a

>
> To have a sorted array I'd have either
>
> - a sub class of Array (like SortedArray) that overrides modifying
> operations to ensure this invariant (or do it dependend on the flag you
> mentioned) or
>
> - a SortedArray delegating all non modifying calls to an array and
> overriding modifying methods.
>
> I think this is the more component oriented approach.
>
> Regards
>
> robert
>
>

-a

Robert Klemme

May 27, 2003, 12:38:06 PM

"ahoward" <aho...@fsl.noaa.gov> schrieb im Newsbeitrag
news:Pine.LNX.4.53.03...@eli.fsl.noaa.gov...
> On Tue, 27 May 2003, Robert Klemme wrote:
>
> > > IMHO that's overly complicating Array. An Array is simply a sequence of
> > > items. Period.
>
> on paper - sure. but in reality we have:
>
>
> [ ] new & * + -- << <=> == === [ ] [ ]= | assoc at clear
> collect collect! compact compact! concat delete delete_at
> delete_if each each_index empty? eql? fill first flatten
> flatten! include? index indexes indices join last length map!
> nitems pack pop push rassoc reject! replace reverse reverse!
> reverse_each rindex shift size slice slice! sort sort! to_a
> to_ary to_s uniq uniq! unshift
>
> mixins
> Enumerable:
> collect, detect, each_with_index, entries,
> find, find_all, grep, include?, map, max,
> member?, min, reject, select, sort, to_a
>
>
> *much* more than a sequence of items.

Well, yes. But all these actions don't introduce a new class invariant.
Note that the methods sort and sort! must be invoked explicitly. Under the
hood it remains a "sequence of things" and it is not a "sorted sequence of
things". This is a major difference, although it might not look that way
at first glance. If you consider how all modifying methods must change,
you might more easily see the impact of this difference. Having all methods
of this kind check a flag is commonly considered bad practice. I'd favour
an extra type, i.e. a sub class or a proxy that delegates.

Apart from that, since instances of a class are not comparable by default
(i.e. do not implement <=>), the SortedArray would either have to ensure
that only comparable elements go into the container or make up an artificial
comparison. These checks would have to be employed if the sorted flag
changes from false to true. Then what do you do? Kick out those that
don't match? If you make the flag read-only, i.e. only set during
construction, you are better off having a different type for this.

IMHO a flag like you suggested is not sufficient: maybe you want to apply
different sorting criteria to the same type. Java solved this elegantly
by defining an interface, Comparator, that performs the comparison
operation. You can then create a sorted collection either with the default
sorting (which uses Comparable, same as in Ruby) or with a comparator you
implement on your own. This is the most modular, flexible and clean
solution.

> what would be the issue if Enumerable was optimized for containers, like
> Array, which could support random access and binary searches? the same
> methods would exist - they would simply be faster. admittedly this is less
> 'general' but optimization for such a heavily used container as the Array
> makes sense IMHO, especially when it could be done in such a way to be
> completely backward compatible while giving a speed increase to practically
> every ruby program ever written...

I have several objections:

- The speed increase would not be there automatically; one would have
to explicitly set the sorting flag, so there's no automatic performance
increase.

- You can't use sorting by default because a lot of applications rely on
the data *not* being sorted behind the scenes, and

- there's a performance penalty to pay for sorting, which you clearly
don't want if you don't need sorted data.

- Your suggested optimization is the other way round: Enumerable is the
general implementation, requiring only an implementation of each. All
containers that can do include? and others better than the default
implementation override those methods with their own implementation.

Sorry if I'm being a bit massive (lots of text), but I think there are
serious reasons not to change Array to support automated internal sorting.
However, introducing such a type is surely not a bad idea. Maybe we
should simply adopt the pattern from Java. I think they did a good design
job on this.

Regards

robert

MikkelFJ

May 27, 2003, 1:02:38 PM

"Xiangrong Fang" <xrf...@hotmail.com> wrote in message
news:2003052709230...@hotmail.com...

> Hi Mikkel,
>
> Thanks for the detailed explanations. I have some qusestions regarding
> your explain:
>
> 1. You mentioned B-Tree and binary tree, seems that they are different?

In principle they are identical, but differ vastly in implementation.

A binary tree has two children; the left holds smaller values, the right
larger values.
A B-Tree has an array of children such that all values in the leftmost
subtree are smaller than all values in the next subtree, etc.
In effect a B-Tree is a compressed binary tree.
In both cases you have logarithmic search. But if you assume the time is
consumed by visiting new nodes and not by visiting values within a single
B-Tree node, you get a much faster operation. This assumption is true if you
read data from disk. There is a limit on B-Tree node sizes beyond which it
takes too long to visit each key within each node. For disk-based B-Trees
you usually have about 1000 keys (M=1000).
You can also use B-Trees in memory, but then the node size must reflect the
size of the CPU cache. When data is in cache, a very fast scan over the
array of keys is more efficient than visiting new nodes.
B-Trees are very complicated because a node can have between M/2 and M keys.
If you go above or below, you must reorganize the tree. This is efficient,
but the implementation has lots of special cases.

A B-Tree guarantees that you must visit at most K nodes.
K is a small, limited value. If K is 3 and you have 1000 keys per node
(M=1000), you have M^K = 1G possible keys.
That is about all you can address on a 32-bit system.
For in-memory operation (M=10), K is significantly larger,
but if you consider the realistic size of your dataset, it is still
small. The trick you can do to make in-memory performance better
is to organize your tree so allocation of nodes happens close to
related nodes. This gives you benefits in lower level caches and
virtual memory. Effectively you split a larger node (M=1000) into
several smaller (M=10) nodes. This gives you both good in-memory
and on-disk performance. In practice it is not that easy to manage,
but you can do something to make it work reasonably well.

Binary trees, on the other hand, do not have a small limit on the
number of nodes that you must visit. Therefore they truly have
logarithmic behavior. It's not too bad, it is just not the most
efficient way to deal with large datasets.

A binary tree requires you to visit a new node for each comparison.
Binary trees must be balanced, otherwise all data could end up in one
subtree (becoming just a linked list).
There are several ways to balance binary trees (e.g. red-black trees and
splay trees).
In any case it solves the problem.
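A quick illustration of why the base of the logarithm matters (the numbers
are just an example):

  n = 1_000_000   # example: one million keys

  p Math.log(n) / Math.log(2)      # about 20 levels for a binary tree
  p Math.log(n) / Math.log(1000)   # about 2 levels for a B-Tree with M=1000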


> 2. I didn't used any of my own hash algorithm, I just used Ruby's hash,
> and used Ruby's PStore to dump it to disk. The memory usage is an
> estimation from monitoring the task list in win2k. Do you have any
> comments about the efficiency of using Ruby's hash and pstore?

My best guess is that PStore would not save more than necessary, so you
should expect the in-memory version to take up more space, as I already
suggested, but I don't know what it does.
Your way of measuring is very crude, but probably fair, because this is how
the system is actually stressed no matter the theory.

> 3. My application is related to using N-GRAM method to cut Chinese text
> into words (a Chinese bigram is 4-byte string). Do you have any comments
> on which algorithm is better?

I'd like to hear more about this, in private mail if you prefer.

It appears that you are counting "word" or symbol frequencies and can do
this by indexing a pair of symbols as a single 32-bit word.
This is ideal for the hashtable I have implemented, as it does not need to
visit any external keys. However, it is not available in Ruby. I can send
you the C source.
I already suggested that this source be made available to Ruby, but I don't
know how well it fits with Ruby as is. It would be specialized for 32-bit
words, not general hashes.
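For what it's worth, packing a 4-byte bigram into a single integer key is
easy in Ruby as well (a sketch; the example bytes are made up):

  bigram = "\xd6\xd0\xce\xc4"          # an example 4-byte bigram
  key    = bigram.unpack("N").first    # one 32-bit unsigned integer

  counts = Hash.new(0)
  counts[key] += 1

  p key                                #=> 3604008644
  p [key].pack("N")                    # back to the original 4 bytes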

I have been working a lot with B-Trees and I think I would choose B-Trees
over hash tables because they scale better. B-Trees have a good trade-off.
They are easy to keep small and easy to make very, very large, and if they
are not the fastest solution, they always seem to be among the best
performers.

With B-Trees you can have your data on disk and do efficient lookups
without needing to load everything into memory.
It is possible to have on-disk hash tables, but in the end you need to do
some clever memory management that is going to make it look an awful lot
like B-Trees.

Also, because the key is 32-bit, the B-Tree is efficient because it does not
need to visit external keys. As soon as you visit external keys, the cache
benefit of B-Trees goes away. However, I do not recommend that you start
implementing B-Trees. They are difficult to implement and difficult to make
sure that they work.

You may be able to use the data structures in the Berkeley DB database. I
don't know how efficient it would be in your case (it's fast, but it is a
database after all).
Berkeley DB does handle hashes and B-Trees. It can operate in an
in-memory-only mode, and I think there is a Ruby interface.
Note that Berkeley DB is free only for non-commercial usage.

> I am very interested in your comments
> about SkipList, can you expand more?

I don't recall exactly why skip-lists are better than binary trees, but they
are in any case easy to implement (except for efficient node allocation).

I can send you an implementation in C if you like.

This is only a rough outline from memory:

A skip-list is a sorted linked list of all values.
Since it is slow to find a value this way, some nodes point both to the next
node and to one or more nodes further down the list.
In fact a skip-list node has an array of next pointers.
All nodes have one next pointer, half the nodes have at least two next
pointers.
Only one node has log(N) next pointers (the head of the list).
Two nodes have log(N)-1 next pointers, etc.
The search happens by looking at the longest-jumping link first. If that was
too far, you try the next shorter-jumping node.
Eventually you find a node whose key matches or is smaller.
You now repeat the process, jumping from that node. This goes on until you
have a match (or conclude failure).
This is a bit simplified, as there are a few more cases to consider, but
this is generally the idea.
In reality the skip-list does not have a perfect distribution of nodes. The
number of nodes with a certain number of next links may not be optimal, and
the location of these nodes may also not be optimal. The distribution
happens statistically, but works pretty well in practice.
A skip-list node thus has an array of pointers. This is not unlike B-Tree
nodes, except the concept is radically different. There is no complex
balancing or splitting of nodes, and the performance is decent. However, you
are jumping back and forth between a lot of nodes, so it is not friendly to
your CPU cache. Since nodes have vastly different sizes, it is difficult to
make a fast node allocation scheme (if you choose malloc / new, you have
lost the speed competition).
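A small Ruby sketch of the idea (illustrative only; a serious implementation
would be in C, as noted above):

  # Sketch of a skip-list: each node carries an array of "next" pointers,
  # with its level chosen by coin flips. Lookup starts at the highest level
  # and drops down whenever the next node would overshoot.
  class SkipList
    MAX_LEVEL = 16
    Node = Struct.new(:key, :forward)

    def initialize
      @head = Node.new(nil, Array.new(MAX_LEVEL))
    end

    def random_level
      level = 1
      level += 1 while level < MAX_LEVEL && rand(2) == 0
      level
    end

    def insert(key)
      update = Array.new(MAX_LEVEL)
      node = @head
      (MAX_LEVEL - 1).downto(0) do |i|
        node = node.forward[i] while node.forward[i] && node.forward[i].key < key
        update[i] = node
      end
      new_node = Node.new(key, Array.new(random_level))
      new_node.forward.each_index do |i|
        new_node.forward[i] = update[i].forward[i]
        update[i].forward[i] = new_node
      end
    end

    def include?(key)
      node = @head
      (MAX_LEVEL - 1).downto(0) do |i|
        node = node.forward[i] while node.forward[i] && node.forward[i].key < key
      end
      node = node.forward[0]
      node != nil && node.key == key
    end
  end

  list = SkipList.new
  [5, 3, 8, 1].each { |n| list.insert(n) }
  p list.include?(3)   #=> true
  p list.include?(4)   #=> false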


> Thanks a lot!

You are welcome.

Mikkel


MikkelFJ

May 27, 2003, 2:03:49 PM

"MikkelFJ" <mikkelfj-...@bigfoot.com> wrote in message
news:3ed255d7$0$97199$edfa...@dread12.news.tele.dk...

> > > Perhaps this is an excercise for a new kata?
> >
> > Or for the original one... :)
> >
> > http://pragprog.com/pragdave/Practices/CodeKata

...


> Are you aware that there are efficient boolean operations for counting
bits.
> They are anything but intuitive, but someone figured out the instruction
> sequences and listed them on the internet somewhere that has slipped me
> momentarily. (Something like cryptic like (0xaeaeaeae and myword) or
> 0x121212121 ....).

Here's some source for counting bits. I picked it from an ideal hash table.

// from http://www.jjj.de/bitwizardry/files/bitcount.h

static inline ulong bit_count(ulong x)
{
#if BITS_PER_LONG == 32
  x -= (x>>1) & 0x55555555;
  x = ((x>>2) & 0x33333333) + (x & 0x33333333);
  x = ((x>>4) + x) & 0x0f0f0f0f;
  x *= 0x01010101;
  return x>>24;
#endif

#if BITS_PER_LONG == 64
  x = ((x>>1) & 0x5555555555555555) + (x & 0x5555555555555555);  // 0-2 in 2 bits
  x = ((x>>2) & 0x3333333333333333) + (x & 0x3333333333333333);  // 0-4 in 4 bits
  x = ((x>>4) + x) & 0x0f0f0f0f0f0f0f0f;                         // 0-8 in 4 bits
  x += x>> 8;                                                    // 0-16 in 8 bits
  x += x>>16;                                                    // 0-32 in 8 bits
  x += x>>32;                                                    // 0-64 in 8 bits
  return x & 0xff;
#endif
}


Mikkel

Hal E. Fulton

May 27, 2003, 5:35:11 PM
----- Original Message -----
From: "Robert Klemme" <bob....@gmx.net>
Newsgroups: comp.lang.ruby
To: "ruby-talk ML" <ruby...@ruby-lang.org>
Sent: Tuesday, May 27, 2003 11:49 AM
Subject: Re: Binary Tree vs. Hash

[snip discussion of "sorted" flag for arrays]

I can see both sides of this. I think the
original idea has some merit.

As I see it, the @sorted flag would be set
only when a sort was done, and it would be
unset when any operation was done NOT
guaranteeing preservation of sorting.

This would make me want an "insert" operation
that would add an element and preserve the
sorting. (slice already preserves the state,
since it operates only on contiguous items.)

One problem I see is the potential ambiguity
of "sorted" -- we can apply any sorting
technique we like to an array, right? And not
all of these are conducive to a speedup unless
we do an internal Schwartzian transform.

An overall better approach might be a SortedArray
inheriting from Array -- it could even have its own
definitions of << and so on that would preserve
sorting. And the method of sorting would (could/
should/might?) be fixed on an instance basis.

Just my $0.02,
Hal


Dave Thomas

May 27, 2003, 5:59:58 PM
>>Are you aware that there are efficient boolean operations for counting

Um, yes - I reference and use them in the article...


Dave


Brian Candler

May 27, 2003, 6:03:06 PM
On Wed, May 28, 2003 at 01:49:53AM +0900, Robert Klemme wrote:
> Well, yes. But all these actions don't introduce a new class invariant.
> Note, that methods sort and sort! must be invoked explicitely. Under the
> hood it remains a "sequence of things" and it is not a "sorted sequence of
> things". This is a major difference although it might not look that way
> on first glance. If you consider, how all modifying methods must change
> you'll might easier see the impact of this difference. Having all methods
> of this kind check a flag is commonly considered bad practice. I'd favour
> an extra type, i.e. sub class or proxy that delegates.

And presumably if you *insert* into a SortedArray then it needs to put the
item in the correct place. So syntax such as

a[3] = "hello"

really doesn't make any sense. However, a << "hello" would put the entry in
the correct place.

(That's unless you maintain this flag saying "the array is currently sorted"
which is a bit of a nightmare IMO, for lots of reasons)

So a SortedArray would have considerably different semantics to a normal
Array anyway. Its implementation would probably be different too - perhaps a
heap or a b-tree.
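A minimal sketch of such a SortedArray (the class and its behaviour are only
an illustration of the idea, not an agreed design):

  class SortedArray < Array
    # Insert in sorted position instead of appending.
    def <<(item)
      index = 0
      index += 1 while index < size && self[index] < item
      insert(index, item)
      self
    end

    # Binary search instead of Enumerable's linear scan.
    def include?(item)
      lo, hi = 0, size - 1
      while lo <= hi
        mid = (lo + hi) / 2
        return true if self[mid] == item
        self[mid] < item ? lo = mid + 1 : hi = mid - 1
      end
      false
    end
  end

  a = SortedArray.new
  a << 5 << 2 << 9
  p a               #=> [2, 5, 9]
  p a.include?(5)   #=> true
  # ([]= and friends would still bypass the ordering, as discussed above.)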

Regards,

Brian.

MikkelFJ

May 27, 2003, 6:57:40 PM

"Dave Thomas" <da...@pragprog.com> wrote in message
news:3ED3DFE2...@pragprog.com...

> >>Are you aware that there are efficient boolean operations for counting
>
> Um, yes - I reference and use them in the article...

Damn ... must read more carefully ... I must have skipped way down or
something...??
That's what you get when you have the attention span of a kitten catching
flies.

BTW: I've found the article on finite state machines as tries - low memory
overhead, in particular, the fsm can be updated incrementally and it is fast
to do so:

Incremental Construction of Minimal Acyclic Finite-State Automata
http://acl.ldc.upenn.edu/J/J00/J00-1002.pdf

How to squeeze a lexicon
http://sun.iinf.polsl.gliwice.pl/~mciura/lexicon.pdf

http://www.google.com/search?sourceid=navclient&hl=da&ie=UTF-8&oe=UTF-8&q=Daciuk+minimal+tries

(Robert Feldt posted a link here at c.l.r. a year back)
http://lampwww.epfl.ch/papers/idealhashtrees.pdf

Jim Weirich

May 27, 2003, 8:27:24 PM
On Tue, 2003-05-27 at 17:35, Hal E. Fulton wrote:

> An overall better approach might be a SortedArray
> inheriting from Array -- it could even have its own
> definitions of << and so on that would preserve
> sorting. And the method of sorting would (could/
> should/might?) be fixed on an instance basis.

I'm not sure that a SortableArray is substitutable for a normal array.
When I use an array, I expect the following ...

a[5] = 10
assert_equal 10, a[5]

A sorted array might not even support the []= operation, or if it does,
it would have to move the inserted item to its correct position (and
violate the expectation above).

If you are considering sorted containers, I suggest a good look at
either the Smalltalk library, or the Eiffel library to see how they
approach the problem. Both are well designed.

--
-- Jim Weirich jwei...@one.net http://jimweirich.umlcoop.net
---------------------------------------------------------------------
"Beware of bugs in the above code; I have only proved it correct,
not tried it." -- Donald Knuth (in a memo to Peter van Emde Boas)

Mauricio Fernández

May 28, 2003, 1:39:29 AM
On Wed, May 28, 2003 at 01:49:53AM +0900, Robert Klemme wrote:
>
> "ahoward" <aho...@fsl.noaa.gov> schrieb im Newsbeitrag
> news:Pine.LNX.4.53.03...@eli.fsl.noaa.gov...
> > On Tue, 27 May 2003, Robert Klemme wrote:
> >
> > > IMHO that's overly complicating Array. An Array is simply a sequence
> of
> > > items. Period.
> >
> > on paper - sure. but in reality we have:
> >

We cannot optimize w/ the @sorted flag because Array is used for two
different purposes:
* poor man's set (sorting would help here for include? and the like)
  * plain old array (you don't want sorting as you expect to find an
    element at the index where you inserted it)

Unless we create 2 classes specifically for these 2 different semantics,
nothing can be done.

So if anything, the first step would be creating a Set class. Set
objects could be hinted on creation whether to use internally a
sorted array, a hash ( object => true ), etc, depending on the desired
complexity and space/time trade-off.

--
_ _
| |__ __ _| |_ ___ _ __ ___ __ _ _ __
| '_ \ / _` | __/ __| '_ ` _ \ / _` | '_ \
| |_) | (_| | |_\__ \ | | | | | (_| | | | |
|_.__/ \__,_|\__|___/_| |_| |_|\__,_|_| |_|
Running Debian GNU/Linux Sid (unstable)
batsman dot geo at yahoo dot com

* JHM wonders what Joey did to earn "I'd just like to say, for the record,
that Joey rules."
-- Seen on #Debian

Robert Klemme

May 28, 2003, 3:35:14 AM

"Hal E. Fulton" <hal...@hypermetrics.com> schrieb im Newsbeitrag
news:064901c32486$95db4a20$0300...@austin.rr.com...

> ----- Original Message -----
> From: "Robert Klemme" <bob....@gmx.net>
> Newsgroups: comp.lang.ruby
> To: "ruby-talk ML" <ruby...@ruby-lang.org>
> Sent: Tuesday, May 27, 2003 11:49 AM
> Subject: Re: Binary Tree vs. Hash
>
> [snip discussion of "sorted" flag for arrays]
>
> I can see both sides of this. I think the
> original idea has some merit.
>
> As I see it, the @sorted flag would be set
> only when a sort was done, and it would be
> unset when any operation was done NOT
> guaranteeing preservation of sorting.

I had a different understanding, but maybe I was wrong. Must reread the
thread...

> One problem I see is the potential ambiguity
> of "sorted" -- we can apply any sorting
> technique we like to an array, right? And not
> all of these are conducive to a speedup unless
> we do an internal Schwartzian transform.

Yeah, that's a major point, as I tried to point out. Imagine

foo=[]
foo << 5 << 2 << -1 << 7
p foo

foo.sort!{|a,b| a<=>b}
p foo
p foo.sorted?

foo.sort!{|a,b| b<=>a}
p foo
p foo.sorted?

printing
[5, 2, -1, 7]
[-1, 2, 5, 7]
true
[7, 5, 2, -1]
true

Anybody reading the sorted flag must know, at the same time, which ordering
was applied. So you have to record some of that information outside the
array, which IMHO is not a good idea.

> An overall better approach might be a SortedArray
> inheriting from Array -- it could even have its own
> definitions of << and so on that would preserve
> sorting. And the method of sorting would (could/
> should/might?) be fixed on an instance basis.

Exactly; apart from that, the problems with []= remain, as others have pointed out.

robert

Robert Klemme

May 28, 2003, 3:37:25 AM

"Mauricio Fernández" <batsm...@yahoo.com> schrieb im Newsbeitrag
news:20030528053...@student.ei.uni-stuttgart.de...

> We cannot optimize w/ the @sorted flag because Array is used for two
> different purposes:
> * poor man's set (sorting would help here for include? and the like)
> * plain old array (you don't want sorting as you expect to find one
> element at the index you inserted it)
>
> Unless we create 2 classes specifically for these 2 different semantics,
> nothing can be done.

Thanks for pointing this out with so few words!

> So if anything, the first step would be creating a Set class. Set
> objects could be hinted on creation whether to use internally a
> sorted array, a hash ( object => true ), etc, depending on the desired
> complexity and space/time trade-off.

Yes. Apart from that I'd prefer different classes over a flag at
construction time. It's more OO and you need several classes for the
implementation anyway.

Cheers

robert

Mauricio Fernández

unread,
May 28, 2003, 4:49:34 AM5/28/03
to

Um, now I begin to see the merits of the original idea:

batsman@tux-chan:/tmp$ expand -t2 ap.rb

module ClearSortedFlag
  # redefine each named method so that any call to it clears the @sorted flag
  def clear_flag_at(*syms)
    syms.each do |meth|
      module_eval <<-EOF
        def #{meth}(*args, &block)
          @sorted = false
          super(*args, &block)
        end
      EOF
    end
  end
end

class CleverArray < Array
  extend ClearSortedFlag

  def initialize(*args)
    @sorted = false
    super
  end

  def sort!(&block)
    @sblock = block if block
    result = super(&block)
    @sorted = true
    result
  end

  clear_flag_at("[]=".intern, :clear, :collect!, :compact!,
    :concat, :fill, :flatten!, :map!, :push, :replace, :reverse,
    :unshift)

  def include?(anObject)
    return super unless defined? @sorted and @sorted
    # perform binary search using @sblock unless nil
    # I am too lazy to implement it now :-)
    # anyway this had better be done in C
    super   # fall back to the plain linear scan until then
  end
end

Note we lose the "@sorted" feature when a new Array object is returned.
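
A quick usage sketch for the class above (relying on the linear-scan fallback
in include? for now):

ca = CleverArray.new
ca.push(5, 1, 3)
ca.sort!
p ca.include?(3)                        # => true
ca.push(7)                              # any mutation clears the flag again
p ca.instance_variable_get("@sorted")   # => false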

It is possible to redefine the methods in Array directly, using
alias_method instead of super. This should be essentially transparent
and straightforward, but I don't know if the speed increase would really
matter. When it really *does*, it is probably better to use a Set
class and give it hints on what can be done to increase speed/decrease
memory needs.
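
For illustration, a minimal sketch of that alias_method approach (hypothetical;
only push and sort! are patched, and no binary search is shown):

class Array
  alias_method :push_without_flag, :push
  alias_method :sort_without_flag!, :sort!

  def push(*args)
    @sorted = false    # any mutation invalidates the flag
    push_without_flag(*args)
  end

  def sort!(&block)
    result = sort_without_flag!(&block)
    @sorted = true     # a binary search could now be used by include?
    result
  end

  def sorted?
    defined?(@sorted) ? @sorted : false
  end
end

a = [3, 1, 2]
a.sort!
p a.sorted?   # => true
a.push(0)
p a.sorted?   # => false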

Perhaps we could add a separate core class to Ruby ==> Set ??
We could make it parameterizable to select between implementations as
different as sorted array, hash or Bloom filter...

--
_ _
| |__ __ _| |_ ___ _ __ ___ __ _ _ __
| '_ \ / _` | __/ __| '_ ` _ \ / _` | '_ \
| |_) | (_| | |_\__ \ | | | | | (_| | | | |
|_.__/ \__,_|\__|___/_| |_| |_|\__,_|_| |_|
Running Debian GNU/Linux Sid (unstable)
batsman dot geo at yahoo dot com

Less is more or less more
-- Y_Plentyn on #LinuxGER

Robert Klemme

unread,
May 28, 2003, 5:20:46 AM5/28/03
to

"Mauricio Fernández" <batsm...@yahoo.com> schrieb im Newsbeitrag
news:20030528084...@student.ei.uni-stuttgart.de...

[snip]

... because it's an Array and not a CleverArray.

> It is possible to redefine the methods in Array directly, using
> alias_method instead of super. This should be essentially transparent
> and straightforward, but I don't know if the speed increase would really
> matter. When it really *does*, it is probably better to use a Set
> class and give it hints on what can be done to increase speed/decrease
> memory needs.
>
> Perhaps we could add a separate core class to Ruby ==> Set ??

AFAIK this is part of 1.8

> We could make it parameterizable to select between implementations as
> different as sorted array, hash or Bloom filter...

I'd prefer separate classes instead of parametrization or a delegation
mechanism. Parametrization is less extensible and less modular.

Cheers

robert

Mark Wilson

unread,
May 28, 2003, 5:50:37 PM5/28/03
to

On Wednesday, May 28, 2003, at 01:39 AM, Mauricio Fernández wrote:

> [snip]

> So if anything, the first step would be creating a Set class. Set
> objects could be hinted on creation whether to use internally a
> sorted array, a hash ( object => true ), etc, depending on the desired
> complexity and space/time trade-off.
>

> [snip]

Ruby 1.8p2 has a Set class. There is a 'sort' instance method. Some
preliminary documentation can be found at:

http://www.rubygarden.org/ruby?ProgrammingRubyTwo/Set

There is also a new 'to_set' method that set.rb adds to Enumerable.
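
For instance (a small illustration of the new class; inspect output may vary
between versions):

require 'set'

s = [3, 1, 2, 2].to_set     # duplicates collapse into a single element
p s.include?(2)             # => true
p s.sort                    # => [1, 2, 3] -- 'sort' comes from Enumerable and returns an Array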

I am very happy that this has been added and hope to see relations
added as well.

Regards,

Mark


Mark Wilson

unread,
May 29, 2003, 12:01:45 AM5/29/03
to
The following is a proposed implementation of a cartesian product
method for the set class. Comments and suggestions would be welcome.

require 'set'
require 'tuple'

class Set
  def cproduct(set)
    set3 = Set.new
    self.each do |ae|
      set.each do |be|
        set3.add(Tuple.new(ae, be))
      end
    end
    set3
  end
end
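
A usage sketch, assuming the definition above and tuple.rb are loaded:

a = Set.new([1, 2])
b = Set.new(%w[x y])
a.cproduct(b).each { |t| p t }   # four Tuple pairs: (1,x) (1,y) (2,x) (2,y)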

tuple.rb is from the Seattle Ruby Brigade's Ruby Audit, which can be
downloaded here:

http://sourceforge.net/project/showfiles.php?group_id=50485

The Tuple class creates immutable arrays. Methods that would modify the
object in place raise an error for attempted modification of a frozen
object. A deepfreeze method is also added to Enumerable so that the
Tuple object can't be unfrozen.
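
For illustration, the same frozen-object behaviour can be seen with a plain
frozen Array (the exact exception class depends on the Ruby version):

pair = [1, 2].freeze
begin
  pair << 3
rescue TypeError, RuntimeError => e
  puts "modification refused: #{e.message}"
end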


Wojciech Kaczmarek

unread,
Jun 1, 2003, 7:24:20 PM6/1/03
to
Mon, 26 May 2003 18:54:24 +0900, Michael Neumann wrote:
[snip Array lookup problem]

> ruby -r complex -e 'c,m,w,h=Complex(-0.75,0.136),50,150,100;puts "P6\n#{w}
> #{h}\n255";(0...h).each{|j|(0...w).each{|i|
> n,z=0,Complex(.9*i/w,.9*j/h);while n<=m&&(z-c).abs<=2;z=z*z+c;n+=1 end;print
> [10+n*15,0,rand*99].pack("C*")}}'|display

Now that is a nice hack. And you've just reminded me about a bunch of
unwritten fractal-related small apps I planned to recode in Ruby. Thank you
:->

--
A moose once bit my sister

Robert Klemme

unread,
Jun 2, 2003, 4:03:28 AM6/2/03
to

"Mark Wilson" <mwil...@cox.net> schrieb im Newsbeitrag
news:40411F81-918A-11D7...@cox.net...

> The following is a proposed implementation of a cartesian product
> method for the set class. Comments and suggestions would be welcome.
>
> require 'set'
> require 'tuple'
>
> class Set
>   def cproduct(set)
>     set3 = Set.new
>     self.each do |ae|
>       set.each do |be|
>         set3.add(Tuple.new(ae, be))

I'd prefer another method name here, since Tuple.new suggests that a new
instance of class Tuple is created and not an Array (as your comment suggests).

>       end
>     end
>     set3
>   end
> end

Did you consider implementing a cartesian product for n Enumerables? This
could look like

module Enumerable
  def Enumerable.cproduct(*enums)
    enums.each { |e| raise "illegal type" unless e.kind_of? Enumerable }
    # impl here
  end
end

OTOH: these beasts are very likely to grow large and thus it might not be
useful to have an implementation for an arbitrary cartesian product at
hand. Maybe it is more reasonable to create an extra class for this that
creates the cproduct on the fly (i.e. while doing each).
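
A sketch of such an on-the-fly product over n Enumerables might look like this
(the class name and the tuple-as-Array representation are made up here):

class CartesianProduct
  include Enumerable

  def initialize(*enums)
    enums.each { |e| raise ArgumentError, "not an Enumerable" unless e.kind_of?(Enumerable) }
    @enums = enums
  end

  # yields one tuple (as an Array) at a time instead of materializing
  # the whole product in memory
  def each(&block)
    walk([], @enums, &block)
    self
  end

  private

  def walk(prefix, rest, &block)
    if rest.empty?
      block.call(prefix)
    else
      head = rest.first
      tail = rest[1..-1]
      head.each { |x| walk(prefix + [x], tail, &block) }
    end
  end
end

CartesianProduct.new([1, 2], %w[a b]).each { |t| p t }
# [1, "a"]  [1, "b"]  [2, "a"]  [2, "b"]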

> tuple.rb is from the Seattle Ruby Brigade's Ruby Audit, which can be
> downloaded here:
>
> http://sourceforge.net/project/showfiles.php?group_id=50485
>
> The Tuple class creates immutable arrays. Methods that would modify the
> object in place raise an error for attempted modification of a frozen
> object. A deepfreeze method is also added to Enumerable so that the
> Tuple object can't be unfrozen.

*shudder* :-))

Kind regards

robert

Martin DeMello

unread,
Jun 2, 2003, 4:29:23 AM6/2/03
to
Robert Klemme <bob....@gmx.net> wrote:
> hand. Maybe it is more reasonable to create an extra class for this that
> creates the cproduct on the fly (i.e. while doing each).

I think that one's already in EnumerableTools

martin

Michael Neumann

unread,
Jun 2, 2003, 10:02:07 AM6/2/03
to

You can find the expanded, more verbose source code here:

http://www.ntecs.de/blog/Tech/ComputerScience/JuliaSets.html


Regards,

Michael

Joel VanderWerf

unread,
Jun 2, 2003, 12:19:36 PM6/2/03
to

Simon Strandgaard

unread,
Jun 7, 2003, 7:56:50 AM6/7/03
to
On Thu, 29 May 2003 14:01:45 +0900, Mark Wilson wrote:

> The following is a proposed implementation of a cartesian product
> method for the set class. Comments and suggestions would be welcome.


I was playing around with Array#foldr and made this piece of code:

def cproduct(a, b)
  a.foldr([]) do |rx, x|
    rx + b.foldr([]) { |ry, y| ry + [[x, y]] }
  end
end

The Array#foldr implementation is located here:
http://www.rubygarden.org/ruby?ArrayFold
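
For readers without that extension, an equivalent using the built-in
Enumerable#inject (a stand-in; the wiki's foldr may differ in detail):

def cproduct(a, b)
  a.inject([]) do |rx, x|
    rx + b.inject([]) { |ry, y| ry + [[x, y]] }
  end
end

p cproduct([1, 2], %w[a b])
# => [[1, "a"], [1, "b"], [2, "a"], [2, "b"]]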


--
Simon Strandgaard
