
best dictionary algorithm?


luser- -droog

Jan 20, 2011, 2:28:15 AM
Dissatisfied with my cheap linear search technique, I'm
investigating more sophisticated techniques. I'm pretty
psyched about the Patricia Tree, a trimmed bitwise Trie.
Is there a consensus among implementations about the
dictionary mechanism?
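
Something like this is what I have in mind (just a rough sketch; all
the names here are made up):

#include <limits.h>
#include <stddef.h>

/* sketch of a Patricia/crit-bit node: internal nodes record the index
   of the bit where the two subtrees first differ, so there is only one
   node per differing bit instead of one per character */
typedef struct cbnode {
    unsigned bitpos;            /* which bit distinguishes the children */
    struct cbnode *child[2];    /* child[0]: bit clear, child[1]: bit set */
    const char *key;            /* set only at leaves */
} cbnode;

/* which way to descend for string s (length len) at an internal node */
static int cb_dir (const cbnode *n, const char *s, size_t len) {
    size_t byte = n->bitpos / CHAR_BIT;
    unsigned bit = (CHAR_BIT - 1) - (n->bitpos % CHAR_BIT);
    unsigned char c = byte < len ? (unsigned char)s[byte] : 0;
    return (c >> bit) & 1;
}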

Helmar

Jan 20, 2011, 11:47:48 AM

In Forth, hash tables are not uncommon.
How about one big hash table (outside the VM) that keeps track of all
dictionary keys, and for the actual dictionaries you use the memory
addresses returned from this table as keys? You can then store the
entries in sorted lists for the real dictionaries, or hash them
numerically. It also has the benefit that comparing two names is just
comparing two pointers.

For example:

... type /arraytype eq ...

is comparing two numeric values.
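
Roughly like this, maybe (only a sketch; sizes and names are invented,
not code from any real implementation):

#include <stdlib.h>
#include <string.h>

#define NAMETAB 1024                 /* table size is arbitrary here */
static char *nametab[NAMETAB];       /* one canonical copy per name */

static unsigned namehash (const char *s) {
    unsigned h = 5381;
    while (*s) h = h*33 + (unsigned char)*s++;
    return h;
}

/* return the one canonical pointer for s, adding it on first use */
const char *intern (const char *s) {
    unsigned i = namehash(s) % NAMETAB;
    while (nametab[i]) {
        if (strcmp(nametab[i], s) == 0) return nametab[i];
        i = (i + 1) % NAMETAB;       /* linear probe; table assumed not full */
    }
    nametab[i] = malloc(strlen(s) + 1);
    strcpy(nametab[i], s);
    return nametab[i];
}

/* so a check like 'type /arraytype eq' is just a pointer comparison */
int nameeq (const char *a, const char *b) { return a == b; }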

Regards,
-Helmar

luser- -droog

Jan 20, 2011, 10:15:38 PM

Perhaps the two association mechanisms should be dealt with
separately. There's the dictionary which associates objects
with other objects. And then there's the name mechanism
which associates strings with, presumably, integers.

According to PLRM 2ed, level 1 interpreters stored the name
data (whatever is necessary to reconstruct the string with 'cvs')
in VM (where it is subject to save/restore and therefore names
can trigger invalidrestore). Level 2 introduces global VM,
and removes the restriction on names from 'restore'. I can
only imagine that names are now in global VM.

For a hosted implementation, it appears that the names can
be put anywhere. Even a dynamic table outside VM entirely!
But you still want to be able to perform the mapping quickly
in either direction. And you don't want to use a lot of space
to do it in.
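
Something along these lines, maybe (only a sketch, with invented sizes
and names, just to show the two directions):

#include <stdlib.h>
#include <string.h>

#define MAXNAMES 4096                /* sizes arbitrary; overflow checks omitted */
static char *names[MAXNAMES];        /* name number -> string, for cvs */
static int   nnames;
static int   slots[2*MAXNAMES];      /* hash slot -> name number + 1, 0 = empty */

static unsigned shash (const char *s) {
    unsigned h = 0;
    while (*s) h = h*31 + (unsigned char)*s++;
    return h;
}

/* string -> name number, installing a new name on first sight */
int nametonumber (const char *s) {
    unsigned i = shash(s) % (2*MAXNAMES);
    while (slots[i]) {
        if (strcmp(names[slots[i]-1], s) == 0) return slots[i]-1;
        i = (i + 1) % (2*MAXNAMES);
    }
    names[nnames] = malloc(strlen(s) + 1);
    strcpy(names[nnames], s);
    slots[i] = ++nnames;             /* store number+1 so 0 can mean "empty" */
    return nnames - 1;
}

/* name number -> string */
const char *numbertoname (int n) { return names[n]; }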

So for names, I'm looking at Ternary Search Trees, and Patricia Trees.
Although a DAWG appears to be roughly the same thing.

But for dictionaries, themselves, the mapping is between objects
and objects. Though 99% of the time the keys are names, yet it still
needs the ability to use arbitrary objects. So I'm thinking I can
use a bitwise hash on the significant portions of the object
(as determined by type). 2-9 bytes.

In fact, this 'condensed' object is probably roughly what I need for
packed arrays, hmmm...

Sorted output from dictionaries isn't particularly important, is it?

BTW, are strings still implicitly converted to names when used as
keys? I remember something about that (IIRC, it was /Real World
Postscript/) but it was some time ago.

ThomasW

Jan 21, 2011, 2:13:06 AM
On 21 Jan., 04:15, luser- -droog <mijo...@yahoo.com> wrote:
>
> But for dictionaries, themselves, the mapping is between objects
> and objects. Though 99% of the time the keys are names, yet it still
> needs the ability to use arbitrary objects. So I'm thinking I can
> use a bitwise hash on the significant portions of the object
> (as determined by type). 2-9 bytes.
>

I've not dealt with this in depth and I don't know how you're
representing objects, so maybe I'm talking nonsense. Anyway, I'd
imagine you have a struct where, say, one byte stores the type of the
object and, say, four bytes store the value of the object (which would
be an integer/float/boolean etc. for simple types and a VM address for
composite objects). Then there would be a few more bits that would
flag the executable state and such that are not relevant for
dictionary lookup. Those five bytes would uniquely identify the key,
wouldn't they? So I think the hash generation can be agnostic of the
type and does not depend on anything from VM.
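
To make that concrete, a guess at such a layout (names and sizes are
just placeholders, not your actual struct):

#include <stdint.h>

/* guessed layout: a type byte, flag bits, and a 4-byte payload that is
   a value for simple types or a VM address for composite ones */
typedef struct {
    uint8_t type;         /* integertype, realtype, nametype, ... */
    uint8_t flags;        /* executable/access bits; ignored for key identity */
    union {
        int32_t  integer;
        float    real;
        int32_t  boolean;
        uint32_t vmaddr;  /* composite objects: address into VM */
    } u;
} objsketch;

/* the (type, payload) pair identifies the key, so a hash can mix just
   those five bytes and never look inside VM */
unsigned keyhash (objsketch o, unsigned nslots) {
    unsigned h = o.type;
    h = h*31 + o.u.vmaddr;    /* reinterpret the 4 payload bytes uniformly */
    return h % nslots;
}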


> BTW, are strings still implicitly converted to names when used as
> keys? I remember something about that (IIRC, it was /Real World
> Postscript/) but it was some time ago.

Yes, they are:

<<(foo)(bar)>>{}forall pstack % /foo (bar)

Otherwise the key lookup would be more complicated because it would
depend on VM data.

Thomas W.

luser- -droog

Jan 21, 2011, 2:34:03 AM
On Jan 21, 1:13 am, ThomasW <zupf...@googlemail.com> wrote:
> On 21 Jan., 04:15, luser- -droog <mijo...@yahoo.com> wrote:
>
>
>
> > But for dictionaries, themselves, the mapping is between objects
> > and objects. Though 99% of the time the keys are names, yet it still
> > needs the ability to use arbitrary objects. So I'm thinking I can
> > use a bitwise hash on the significant portions of the object
> > (as determined by type). 2-9 bytes.
>
> I've not dealt with this in depth and I don't know how you're
> representing objects, so maybe I'm talking nonsense.  Anyway, I'd
> imagine you have a struct where, say, one byte stores the type of the
> object and, say, four bytes store the value of the object (which would
> be an integer/float/boolean etc. for simple types and a VM address for
> composite objects).  

Yeah that's pretty much it. But I'm using doubles for the reals,
so the payload is up to 8 bytes. And even though I could pack the
type and flags into a byte, keeping the double aligned (on x86)
means the tag is 4 bytes. But for most types, a lot of this space
is meaningless; so I either have to keep it all zeroed or explicitly
ignore it for comparisons. So there seems to be a natural
symmetry for this notion of a packed-form or canonical object.

Symmetry means less work!
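
One cheap way to get that canonical form (sketch only; the tag values
and layout here are assumptions): zero the whole struct in the
constructors, so the bytes a given type doesn't use are always the
same, and byte-wise hashing or comparison over the full width is safe.

#include <string.h>
#include <stdint.h>

/* assumed layout: 4 bytes of tag+flags so the 8-byte payload stays aligned */
typedef struct {
    uint16_t tag;
    uint16_t flags;
    union {
        int64_t integer;
        double  real;
        struct { uint32_t addr; uint16_t off; uint16_t n; } comp;
    } u;
} cobj;

/* constructors zero the whole thing first, so unused bytes are always 0
   and the struct is its own canonical (packed) form */
cobj consint_sketch (int64_t i) {
    cobj o;
    memset(&o, 0, sizeof o);
    o.tag = 1;                /* integertype; the tag value is an assumption */
    o.u.integer = i;
    return o;
}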

>Then there would be a few more bits that would
> flag the executable state and such that are not relevant for
> dictionary lookup.  Those five bytes would uniquely identify the key,
> wouldn't they?  So I think the hash generation can be agnostic of the
> type and does not depend on anything from VM.

So is it ok to do linear search for dictionary lookup?
It just seems so naive!

Helmar

Jan 21, 2011, 4:58:40 AM
On 21 Jan., 04:15, luser- -droog <mijo...@yahoo.com> wrote:

Hi,

> But for dictionaries, themselves, the mapping is between objects
> and objects. Though 99% of the time the keys are names, yet it still
> needs the ability to use arbitrary objects. So I'm thinking I can
> use a bitwise hash on the significant portions of the object
> (as determined by type). 2-9 bytes.

The most significant part is its storage address in memory. That
should be the simplest.

> Sorted output from dictionaries isn't particularly important, is it?

You can use binary search if you sort the pointers (which are just
integers!) to the objects. This is usually pretty fast. Sorting the
integers should be fast enough too, since most accesses are key
lookups, if I understand it right.
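
For example, something like this (just a sketch; it assumes every key
has already been reduced to a unique pointer, and sizes are invented):

#include <string.h>
#include <stdint.h>

typedef struct {
    intptr_t keys[64];        /* key pointers, kept sorted numerically */
    void    *vals[64];
    int      len;             /* overflow checks omitted */
} sdict;

/* binary search: index of key, or -(insertion point)-1 if absent */
static int sfind (sdict *d, intptr_t k) {
    int lo = 0, hi = d->len - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo)/2;
        if (d->keys[mid] == k) return mid;
        if (d->keys[mid] < k) lo = mid + 1; else hi = mid - 1;
    }
    return -lo - 1;
}

void sput (sdict *d, intptr_t k, void *v) {
    int i = sfind(d, k);
    if (i >= 0) { d->vals[i] = v; return; }
    i = -i - 1;                           /* insertion point: shift to keep order */
    memmove(&d->keys[i+1], &d->keys[i], (d->len - i)*sizeof d->keys[0]);
    memmove(&d->vals[i+1], &d->vals[i], (d->len - i)*sizeof d->vals[0]);
    d->keys[i] = k;
    d->vals[i] = v;
    d->len++;
}

void *sget (sdict *d, intptr_t k) {
    int i = sfind(d, k);
    return i >= 0 ? d->vals[i] : 0;
}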

Regards,
-Helmar

luser- -droog

Jan 21, 2011, 3:38:13 PM
On Jan 20, 9:15 pm, luser- -droog <mijo...@yahoo.com> wrote:
> Perhaps the two association mechanisms should be dealt with
> separately. There's the dictionary which associates objects
> with other objects. And then there's the name mechanism
> which associates strings with, presumably, integers.

Here's my prototype for the name mechanism:
a Ternary Search Tree, lifted from the Bentley/Sedgewick
Dr. Dobb's article and lightly modified to use explicit
pointers.

#include <stdio.h>
#include <stdlib.h>

/* ternary search tree */
typedef unsigned char byte;

typedef struct node {
    byte val;
    struct node *lo, *eq, *hi;
} node;

node *root = NULL;
char *insertstr;   /* string being inserted; stored at its terminal node */

/* recursive form of the lookup */
int rsearch (node *p, char *s) {
    if (!p) return 0;
    if (*s < p->val) return rsearch(p->lo, s);
    if (*s > p->val) return rsearch(p->hi, s);
    if (*s == 0) return 1;
    return rsearch(p->eq, ++s);
}

/* iterative lookup: follow lo/hi on mismatch, eq on match */
int search (char *s) {
    node *p;
    p = root;
    while (p) {
        if (*s < p->val) {
            p = p->lo;
        } else if (*s == p->val) {
            if (*s++ == 0) return 1;
            p = p->eq;
        } else {
            p = p->hi;
        }
    }
    return 0;
}

node *insert (node *p, char *s) {
    if (!p) {
        p = malloc(sizeof *p);
        p->val = *s;
        p->lo = p->eq = p->hi = 0;
    }
    if (*s < p->val) {
        p->lo = insert(p->lo, s);
    } else if (*s == p->val) {
        if (*s)
            p->eq = insert(p->eq, ++s);
        else
            p->eq = (node *)insertstr;   /* at the NUL node, eq holds the string itself */
    } else {
        p->hi = insert(p->hi, s);
    }
    return p;
}

/* in-order traversal; a node with val == 0 carries the stored string in eq */
void traverse (node *p) {
    if (!p) return;
    traverse(p->lo);
    if (p->val)
        traverse(p->eq);
    else
        printf("%s\n", (char *)p->eq);
    traverse(p->hi);
}

int main () {
    insertstr = "foo";
    root = insert(root, insertstr);
    insertstr = "baloney";
    root = insert(root, insertstr);

    printf("the search for foo yields %s!\n", search("foo")? "true": "false");
    printf("the search for baloney yields %s!\n", search("baloney")? "true": "false");
    printf("the search for foloney yields %s!\n", search("foloney")? "true": "false");

    printf("known keys:\n");
    traverse(root);
    return 0;
}

luser- -droog

Jan 30, 2011, 3:03:19 AM

So could I do a quicksort for every insertion?
Then a binary search for lookup would be quite fast.
I'm reluctant to go with hashing because I'm still
holding off on the garbage collector and all level 2 stuff.
And with level 1, you tend to make your dictionaries
exactly the same size as the contents (since they can't
grow and you don't want to waste space). So for hashing
with linear chaining, the whole dictionary is one massive
collision if it's full; and it will be. So I can either
double all my allocation sizes so they max-out at half-full,
or look for something else. I kind of like the idea of
straight linear search while swapping each found entry
to the front. Self-prioritizing!
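
Something like this, maybe (sketch only; it assumes keys compare as
pointers and the sizes are arbitrary):

#include <stddef.h>

typedef struct {
    void *key[32];            /* keys compared as pointers */
    void *val[32];
    int   len;
} ldict;

void *lget (ldict *d, void *k) {
    int i;
    for (i = 0; i < d->len; i++) {
        if (d->key[i] == k) {
            /* self-prioritizing: swap the hit into slot 0 */
            void *tk = d->key[0], *tv = d->val[0];
            d->key[0] = d->key[i]; d->val[0] = d->val[i];
            d->key[i] = tk;        d->val[i] = tv;
            return d->val[0];
        }
    }
    return NULL;              /* not found */
}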

ThomasW

Jan 30, 2011, 10:44:00 AM
On 21 Jan., 08:13, ThomasW <zupf...@googlemail.com> wrote:
>
> I've not dealt with this in depth and I don't know how you're
> representing objects, so maybe I'm talking nonsense.  Anyway, I'd
> imagine you have a struct where, say, one byte stores the type of the
> object and, say, four bytes store the value of the object (which would
> be an integer/float/boolean etc. for simple types and a VM address for
> composite objects).  [...] Those five bytes would uniquely identify the key,
> wouldn't they?

I hadn't taken integer and floating-point objects into account. Since
they are regarded as the same key when they have the same value, the
raw bytes can't be used for lookup.
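
So the numeric keys would have to be normalized before hashing,
something like this sketch (type names and layout invented), so that 1
and 1.0 end up as the same key:

#include <math.h>
#include <string.h>
#include <limits.h>

/* the type names and layout are invented for the sketch */
enum { integertype_ = 1, realtype_ = 2 };
typedef struct { int tag; union { long integer; double real; } u; } numkey;

/* fold a real with an exactly representable integral value into the
   integer form, so 1 and 1.0 become the same dictionary key */
numkey canonical_key (numkey k) {
    if (k.tag == realtype_ && k.u.real == floor(k.u.real)
        && k.u.real >= INT_MIN && k.u.real <= INT_MAX) {
        numkey n;
        memset(&n, 0, sizeof n);
        n.tag = integertype_;
        n.u.integer = (long)k.u.real;
        return n;
    }
    return k;   /* out-of-range reals stay reals, which is why a huge
                   integer key could still miss after cvr */
}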


luser- -droog

Jan 31, 2011, 12:53:13 AM
On Jan 30, 2:03 am, luser- -droog <mijo...@yahoo.com> wrote:

> I'm reluctant to go with hashing because I'm still
> holding off on the garbage collector and all level 2 stuff.
> And with level 1, you tend to make your dictionaries
> exactly the same size as the contents (since they can't
> grow and you don't want to waste space). So for hashing
> with linear chaining, the whole dictionary is one massive
> collision if it's full; and it will be. So I can either
> double all my allocation sizes so they max-out at half-full,
> or look for something else. I kind of like the idea of
> straight linear search while swapping each found entry
> to the front. Self-prioritizing!

I may have just had the best idea yet: Dictionary pages!
I could resolve hash collisions by allocating a new "frame"
for the new entry. This should keep the hash-lookups fast,
allow dictionaries to grow, and solve the save-undef-restore
dilemma before it rears its head!

So I do a hash on the key (treated atomically, using pointers
as data), and if the hash indicates a slot that's occupied
and doesn't match, I allocate a new "page" of the dictionary
(not sure of the best size for these: 10? 20?) and/or
recurse to the next page.

The complication is whether to copy the whole dictionary
when there's a write to a lower save level, or just the page.
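
Roughly like this, I imagine (sketch only; the page size and names are
made up, and keys are treated as raw pointers):

#include <stdlib.h>
#include <stdint.h>

#define PAGESLOTS 10                 /* per-page size: a guess */

typedef struct pair { void *k, *v; } pair;
typedef struct dpage {
    pair e[PAGESLOTS];               /* k == NULL means the slot is empty */
    struct dpage *next;              /* overflow page, allocated on demand */
} dpage;

static unsigned pslot (void *k) {
    return (unsigned)(((uintptr_t)k >> 3) % PAGESLOTS);
}

/* return the pair holding key k, or the empty pair where it belongs;
   when this page's slot is taken by a different key, move on to the
   next page, allocating it if necessary (allocation failure ignored) */
pair *dpage_lookup (dpage *p, void *k) {
    unsigned i = pslot(k);
    for (;;) {
        if (p->e[i].k == k || p->e[i].k == NULL) return &p->e[i];
        if (!p->next) p->next = calloc(1, sizeof *p->next);
        p = p->next;
    }
}

void dpage_put (dpage *p, void *k, void *v) {
    pair *e = dpage_lookup(p, k);
    e->k = k;
    e->v = v;
}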

luser- -droog

Feb 3, 2011, 2:02:03 AM
I'm trying to start simple. So this is about as simple
as I can make it, based on my limited understanding of
hashing. The hash function probably needs some tweaking
if my crude test data is any indication. But I'll worry
about that after there's more functionality to exercise.

975(1)12:50 AM:mark2 0> cat dic.c
#include "core.h"

/* SUMCOMP(o) yields the entire "payload" of an object o in a big integer */
//#define SUMCOMP(o) ((long long)o.u.c.a<<32 + o.u.c.off<<16 + o.u.c.n)
#define SUMCOMP(o) (*(long long *)&o.u.i)

/* return 0 if equal, >0 if l > r, <0 if l < r */
int objcmp (state *st, object l, object r) {
    if (l.tag == r.tag) switch (l.tag) {
        case marktype:
        case nulltype: return 0;
        case booleantype: return l.u.b - r.u.b;
        case integertype: return l.u.i - r.u.i;
        case realtype: return l.u.r - r.u.r;
        case arraytype:
        case dicttype:
        default: return SUMCOMP(l) - SUMCOMP(r);
        case stringtype: return SUMCOMP(l) == SUMCOMP(r) ? 0 :
            ( l.u.c.n != r.u.c.n ?
                l.u.c.n - r.u.c.n :
                strncmp(STR(l), STR(r), l.u.c.n) );
    } else return l.tag - r.tag;
}

/* return size of the significant portion of the payload */
size odsize (object o) {
    switch(o.tag) {
        case marktype:
        case nulltype: return 0;
        case booleantype: return sizeof(boolean);
        case nametype:
        case integertype: return sizeof(integer);
        case realtype: return sizeof(real);
        default: return sizeof(composite);
    }
}

/* perform a Sum MOD n of the significant bytes of the payload */
integer crc (object k, integer n) {
    unsigned res;
    int i, m;
    res = k.tag;
    if (VERBOSE) printf("crc of %02x ", res);
    m = odsize(k);
    for (i=0; i < m; i++) {
        if (VERBOSE) printf("%02x ", (unsigned)((byte *)&k.u.b)[i]);
        res += ((byte *)&k.u.b)[i];
    }
    if (VERBOSE) printf(" MOD %ld\n", n);
    res %= n;
    return res;
}

/* Dicts are implicitly one entry larger than declared
   in order to simplify searching (terminate on null).
   So we allocate n+1 pairs here
   and call hash with n+1 in lookup later.
   For all other purposes, n is the maxlength. */
object consdict (state *st, size n) {
    object d = { .tag = dicttype };
    int i;
    d.u.c.a = vmalloc(st, 2*(n+1)*sizeof(object));
    d.u.c.n = n; /* maxlength */
    for (i=0; i < d.u.c.n+1; i++) {
        ARR(d)[2*i] = ARR(d)[2*i+1] = null;
    }
    HDR(d)->sz = 0; /* length */
    return d;
}

/* call function on each non-empty key-value pair in dict d */
void enumerate (state *st, object d, void (*fp)(object k, object v)) {
    int i;
    if (fp)
        for (i=0; i < d.u.c.n+1; i++)              /* include the extra slot */
            if (objcmp(st, ARR(d)[2*i], null)!=0)  /* test the key slot */
                fp(ARR(d)[2*i], ARR(d)[2*i+1]);
}

/* count dictionary entries */
/*
static size dcnt;
void d_count(object k, object v) { ++dcnt; }
*/
size d_length (state *st, object d) {
    /* dcnt = 0; enumerate(st, d, d_count); return dcnt; */
    return HDR(d)->sz;
}

size d_maxlength (state *st, object d) { return d.u.c.n; }

bool d_full (state *st, object d) {
    return HDR(d)->sz == d.u.c.n;
}

/* return address of key,value pair
   or address of first empty slot */
object *d_lookup (state *st, object d, object k) {
    integer i, hash;
    i = hash = crc(k, d.u.c.n+1);
    if (VERBOSE) printf("key hashed to %ld\n", hash);
    while (objcmp(st, ARR(d)[2*i], null) != 0) {
        if (objcmp(st, ARR(d)[2*i], k) == 0) break;
        i = (i-1 >= 0)? i-1: d.u.c.n;
        //--i; (void)(i<0? i=d.u.c.n: 0xF00);
        //++i; i %= d.u.c.n+1;
        if (i == hash) fatal("dict too full! loop in lookup.\n");
    }
    if (VERBOSE) printf("lookup terminated on index %ld\n", i);
    return &ARR(d)[2*i];
}

object d_get (state *st, object d, object k) {
    object *e;
    e = d_lookup(st, d, k);
    return e[1];
}

void d_put (state *st, object d, object k, object v) {
    object *e;
    e = d_lookup(st, d, k);
    if (objcmp(st, e[0], null)==0) {
        if (d_full(st, d)) error(dictfull);
        e[0] = k;
        HDR(d)->sz++;
    }
    e[1] = v;
}

void d_undef (state *st, object d, object k) {
    object *e;
    e = d_lookup(st, d, k);
    if (objcmp(st, e[0], null)!=0) {
        e[0] = e[1] = null;
        HDR(d)->sz--;
    }
}

and the testing section from main:

if (1) { /* test dictionary */
    object d;
    object v;
    d = consdict(st, 10);
    d_put(st, d, consname(st, "foo"), consint(42));
    d_put(st, d, consname(st, "bar"), consreal(42e12));
    d_put(st, d, consname(st, "banana"), consstring(st, STR_CNT("no bananas here")));
    d_put(st, d, consint(2), consint(5));
    d_put(st, d, consint(3), consint(6));
    d_put(st, d, consbool(true), consint(7));
    v = d_get(st, d, consname(st, "foo")); o_dump(st, v);
    v = d_get(st, d, consname(st, "bar")); o_dump(st, v);
    v = d_get(st, d, consname(st, "banana")); o_dump(st, v);
    v = d_get(st, d, consint(2)); o_dump(st, v);
    v = d_get(st, d, consint(3)); o_dump(st, v);
    v = d_get(st, d, consbool(true)); o_dump(st, v);
    v = d_get(st, d, consbool(false)); o_dump(st, v);
}


And the rather cluttered output:
Everything's clustered at the beginning.
But the behavior appears correct!

installing new name pop as 1
resizing name table
installing new name exch as 2
installing new name dup as 3
installing new name copy as 4
installing new name index as 5
installing new name clear as 6
installing new name count as 7
installing new name cleartomark as 8
installing new name counttomark as 9
installing new name add as 10
installing new name idiv as 11
installing new name mod as 12
installing new name sub as 13
installing new name abs as 14
installing new name array as 15
installing new name length as 16
installing new name get as 17
installing new name put as 18
installing new name string as 19
vmalloc requested 264
vmalloc padded request to 264
vmalloc increased size to 288 for header
vmalloc returning 12(0) for 288
vmalloc increased vmsz to 288
installing new name foo as 20
crc of 03 14 00 00 00 MOD 11
key hashed to 1
lookup terminated on index 1
installing new name bar as 21
crc of 03 15 00 00 00 MOD 11
key hashed to 2
lookup terminated on index 2
vmalloc requested 15
vmalloc padded request to 24
vmalloc increased size to 48 for header
vmalloc returning 300(288) for 48
vmalloc increased vmsz to 336
installing new name banana as 22
crc of 03 16 00 00 00 MOD 11
key hashed to 3
lookup terminated on index 3
crc of 01 02 00 00 00 MOD 11
key hashed to 3
lookup terminated on index 0
crc of 01 03 00 00 00 MOD 11
key hashed to 4
lookup terminated on index 4
crc of 00 01 MOD 11
key hashed to 1
lookup terminated on index 10
crc of 03 14 00 00 00 MOD 11
key hashed to 1
lookup terminated on index 1
<integer 0:X--- 42>
crc of 03 15 00 00 00 MOD 11
key hashed to 2
lookup terminated on index 2
<real 0:X--- 42000000000000.000000>
crc of 03 16 00 00 00 MOD 11
key hashed to 3
lookup terminated on index 3
<string 0:X--- no bananas here>
crc of 01 02 00 00 00 MOD 11
key hashed to 3
lookup terminated on index 0
<integer 0:X--- 5>
crc of 01 03 00 00 00 MOD 11
key hashed to 4
lookup terminated on index 4
<integer 0:X--- 6>
crc of 00 01 MOD 11
key hashed to 1
lookup terminated on index 10
<integer 0:X--- 7>
crc of 00 00 MOD 11
key hashed to 0
lookup terminated on index 9
<null 0:X--- ~>

luser- -droog

Feb 3, 2011, 3:32:12 AM

That's not true, is it?
That was one of Goswell's assumptions, and it seems reasonable
to hold distinct types as distinct keys.

Let's see. What does ghostscript do?
GPL Ghostscript 8.62 (2008-02-29)
Copyright (C) 2008 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
GS>1 42 def
GS>1.0 load
GS<1>=
42
GS>

Well, I'll be damned! Look at that!
That's got to be some evil voodoo.

luser- -droog

Feb 3, 2011, 3:46:14 AM

But the bad news is it chokes on an integer that
exceeds the precision of a real.

GS>16#7ffffff0 42 def
GS>16#7ffffff0 cvr load
Error: /undefined in --execute--
Operand stack:
2.14748e+09
Execution stack:
%interp_exit .runexec2 --nostringval-- --nostringval-- --
nostringval-- 2 %stopped_push --nostringval-- --
nostringval-- %loop_continue 1785 2 3 %oparray_pop --
nostringval-- --nostringval-- false 1
%stopped_push .runexec2 --nostringval-- --nostringval-- --
nostringval-- 2 %stopped_push --nostringval--
Dictionary stack:
--dict:1152/1684(ro)(G)-- --dict:0/20(G)-- --dict:94/200(L)--
Current allocation mode is local
Last OS error: 2
Current file position is 21

But using the integer still works:

GS>16#7ffffff0 load
GS<1>=
42

But I also get a rangecheck converting back and forth
so the undefined above isn't so terrible.

GS>16#7ffffff0 cvr cvi
Error: /rangecheck in --execute--
Operand stack:
2.14748e+09
Execution stack:
%interp_exit .runexec2 --nostringval-- --nostringval-- --
nostringval-- 2 %stopped_push --nostringval-- --
nostringval-- %loop_continue 1785 2 3 %oparray_pop --
nostringval-- --nostringval-- false 1
%stopped_push .runexec2 --nostringval-- --nostringval-- --
nostringval-- 2 %stopped_push --nostringval--
Dictionary stack:
--dict:1152/1684(ro)(G)-- --dict:0/20(G)-- --dict:94/200(L)--
Current allocation mode is local
Current file position is 20
GS<1>

ken

Feb 3, 2011, 4:57:38 AM
In article <da49c190-2511-4cd5-9137-da7f14...@s11g2000yqh.googlegroups.com>, mij...@yahoo.com says...

> That's not true is it?
> That was one of Goswell's assumptions and it seems reasonable
> to hold distinct types as distinct keys.
>
> Let's see. What does ghostscript do?
> GPL Ghostscript 8.62 (2008-02-29)
> Copyright (C) 2008 Artifex Software, Inc. All rights reserved.
> This software comes with NO WARRANTY: see the file PUBLIC for details.
> GS>1 42 def
> GS>1.0 load
> GS<1>=
> 42
> GS>
>
> Well, I'll be damned! Look at that!
> That's got to be some evil voodoo.

Adobe Acrobat Distiller does exactly the same. Not conclusive but
indicative that this is intended behaviour.


Ken

ken

Feb 3, 2011, 5:01:29 AM
In article <b3570de8-babe-4b91-b19b-8d9a5c...@r16g2000yql.googlegroups.com>, mij...@yahoo.com says...

> But the bad news is it chokes on an integer that
> exceeds the precision of a real.
>
> GS>16#7ffffff0 42 def
> GS>16#7ffffff0 cvr load
> Error: /undefined in --execute--
> Operand stack:
> 2.14748e+09

And the Distiller result:

%%[ Error: undefined; OffendingCommand: load ]%%

Stack:
2.14748e+09


> But using the integer still works:
>
> GS>16#7ffffff0 load
> GS<1>=
> 42

Again, Acrobat Distiller has the same result.


Ken
