In Forth, hash tables are not uncommon.
How about one big hash table (outside VM) that keeps track of all
dictionary keys? For real dictionaries you then use the memory addresses
returned from this table as keys. You can store the entries in
sorted lists for real dictionaries or make a numeric hash. It also has
the benefit that comparing two names is just comparing two pointers.
For example:
... type /arraytype eq ...
compares two numeric values.
Regards,
-Helmar
Perhaps the two association mechanisms should be dealt with
separately. There's the dictionary which associates objects
with other objects. And then there's the name mechanism
which associates strings with, presumably, integers.
According to PLRM 2ed, level 1 interpreters stored the name
data (whatever is necessary to reconstruct the string with 'cvs')
in VM (where it is subject to save/restore and therefore names
can trigger invalidrestore). Level 2 introduces global VM,
and removes the restriction on names from 'restore'. I can
only imagine that names are now in global VM.
For a hosted implementation, it appears that the names can
be put anywhere. Even a dynamic table outside VM entirely!
But you still want to be able to perform the mapping quickly
in either direction. And you don't want to use a lot of space
to do it in.
So for names, I'm looking at Ternary Search Trees, and Patricia Trees.
Although a DAWG appears to be roughly the same thing.
But for dictionaries themselves, the mapping is between objects
and objects. Though 99% of the time the keys are names, it still
needs the ability to use arbitrary objects. So I'm thinking I can
use a bitwise hash on the significant portions of the object
(as determined by type): 2-9 bytes.
In fact, this 'condensed' object is probably roughly what I need for
packed arrays, hmmm...
Sorted output from dictionaries isn't particularly important, is it?
BTW, are strings still implicitly converted to names when used as
keys? I remember something about that (IIRC, it was /Real World
Postscript/) but it was some time ago.
I've not dealt with this in depth and I don't know how you're
representing objects, so maybe I'm talking nonsense. Anyway, I'd
imagine you have a struct where, say, one byte stores the type of the
object and, say, four bytes store the value of the object (which would
be an integer/float/boolean etc. for simple types and a VM address for
composite objects). Then there would be a few more bits that would
flag the executable state and such that are not relevant for
dictionary lookup. Those five bytes would uniquely identify the key,
wouldn't they? So I think the hash generation can be agnostic of the
type and does not depend on anything from VM.
> BTW, are strings still implicitly converted to names when used as
> keys? I remember something about that (IIRC, it was /Real World
> Postscript/) but it was some time ago.
Yes, they are:
<<(foo)(bar)>>{}forall pstack % /foo (bar)
Otherwise the key lookup would be more complicated because it would
depend on VM data.
Thomas W.
Yeah that's pretty much it. But I'm using doubles for the reals,
so the payload is up to 8 bytes. And even though I could pack the
type and flags into a byte, keeping the double aligned (on x86)
means the tag is 4 bytes. But for most types, a lot of this space
is meaningless; so I either have to keep it all zeroed or explicitly
ignore it for comparisons. So there seems to be a natural
symmetry for this notion of a packed-form or canonical object.
Symmetry means less work!
>Then there would be a few more bits that would
> flag the executable state and such that are not relevant for
> dictionary lookup. Those five bytes would uniquely identify the key,
> wouldn't they? So I think the hash generation can be agnostic of the
> type and does not depend on anything from VM.
So is it ok to do linear search for dictionary lookup?
It just seems so naive!
Hi,
> But for dictionaries, themselves, the mapping is between objects
> and objects. Though 99% of the time the keys are names, yet it still
> needs the ability to use arbitrary objects. So I'm thinking I can
> use a bitwise hash on the significant portions of the object
> (as determined by type). 2-9 bytes.
the most significant portion is its storage address in memory. That
should be the simplest.
> Sorted output from dictionaries isn't particularly important, is it?
You can use binary search if you sort the pointers (plain
integers!) to the objects.
This is usually pretty fast. Sorting the integers should be fast
enough too, since most accesses are key lookups, if I understand it
right.
Regards,
-Helmar
Here's my prototype for the name mechanism.
Ternary Search Tree. Lifted from the Bentley/Sedgewick
Dr. Dobb's article. Lightly modified to use explicit
pointers.
#include <stdio.h>
#include <stdlib.h>

/* ternary search tree */
typedef unsigned char byte;

typedef struct node {
    byte val;
    struct node *lo, *eq, *hi;
} node;

node *root = NULL;
char *insertstr;

/* recursive search, kept for reference */
int rsearch (node *p, char *s) {
    if (!p) return 0;
    if (*s < p->val) return rsearch(p->lo, s);
    if (*s > p->val) return rsearch(p->hi, s);
    if (*s == 0) return 1;
    return rsearch(p->eq, ++s);
}

/* iterative search */
int search (char *s) {
    node *p;
    p = root;
    while (p) {
        if (*s < p->val) {
            p = p->lo;
        } else if (*s == p->val) {
            if (*s++ == 0) return 1;
            p = p->eq;
        } else {
            p = p->hi;
        }
    }
    return 0;
}

node *insert (node *p, char *s) {
    if (!p) {
        p = malloc(sizeof *p);
        p->val = *s;
        p->lo = p->eq = p->hi = 0;
    }
    if (*s < p->val) {
        p->lo = insert(p->lo, s);
    } else if (*s == p->val) {
        if (*s)
            p->eq = insert(p->eq, ++s);
        else
            p->eq = (node *)insertstr; /* at the terminator, stash the whole string */
    } else {
        p->hi = insert(p->hi, s);
    }
    return p;
}

/* in-order traversal; terminator nodes hold the string pointer in eq */
void traverse (node *p) {
    if (!p) return;
    traverse(p->lo);
    if (p->val)
        traverse(p->eq);
    else
        printf("%s\n", (char *)p->eq);
    traverse(p->hi);
}

int main () {
    insertstr = "foo";
    root = insert(root, insertstr);
    insertstr = "baloney";
    root = insert(root, insertstr);
    printf("the search for foo yields %s!\n", search("foo")? "true": "false");
    printf("the search for baloney yields %s!\n", search("baloney")? "true": "false");
    printf("the search for foloney yields %s!\n", search("foloney")? "true": "false");
    printf("known keys:\n");
    traverse(root);
    return 0;
}
So could I do a quicksort for every insertion?
Then a binary search for lookup would be quite fast.
I'm reluctant to go with hashing because I'm still
holding off on the garbage collector and all level 2 stuff.
And with level 1, you tend to make your dictionaries
exactly the same size as the contents (since they can't
grow and you don't want to waste space). So for hashing
with linear chaining, the whole dictionary is one massive
collision if it's full; and it will be. So I can either
double all my allocation sizes so they max-out at half-full,
or look for something else. I kind of like the idea of
straight linear search while swapping each found entry
to the front. Self-prioritizing!
I've not taken into account integer and floating point objects. As
they are regarded as the same key if they have the same value, the raw
bytes can't be used for lookup.
> I'm reluctant to go with hashing because I'm still
> holding off on the garbage collector and all level 2 stuff.
> And with level 1, you tend to make your dictionaries
> exactly the same size as the contents (since they can't
> grow and you don't want to waste space). So for hashing
> with linear chaining, the whole dictionary is one massive
> collision if it's full; and it will be. So I can either
> double all my allocation sizes so they max-out at half-full,
> or look for something else. I kind of like the idea of
> straight linear search while swapping each found entry
> to the front. Self-prioritizing!
I may have just had the best idea yet: Dictionary pages!
I could resolve hash collisions by allocating a new "frame"
for the new entry. This should keep the hash-lookups fast,
allow dictionaries to grow, and solve the save-undef-restore
dilemma before it rears its head!
So I do a hash on the key (treated atomically, using pointers
as data), and if the hash indicates a slot that's occupied
and doesn't match, I allocate a new "page" of the dictionary
(not sure of the best size for these: 10? 20?) and/or
recurse to the next page.
The complication is whether to copy the whole dictionary
when there's a write to a lower save level, or just the page.
975(1)12:50 AM:mark2 0> cat dic.c
#include "core.h"
/* SUMCOMP(o) yields the entire "payload" of an object o in a big integer */
//#define SUMCOMP(o) ((long long)o.u.c.a<<32 + o.u.c.off<<16 + o.u.c.n)
#define SUMCOMP(o) (*(long long *)&o.u.i)
/* return 0 if equal, >0 if l > r, <0 if l < r */
int objcmp (state *st, object l, object r) {
    if (l.tag == r.tag) switch (l.tag) {
    case marktype:
    case nulltype:    return 0;
    case booleantype: return l.u.b - r.u.b;
    case integertype: return l.u.i < r.u.i ? -1 : l.u.i > r.u.i; /* avoid truncating the difference */
    case realtype:    return l.u.r < r.u.r ? -1 : l.u.r > r.u.r; /* a real difference truncated to int can lie */
    case arraytype:
    case dicttype:
    default:          return SUMCOMP(l) < SUMCOMP(r) ? -1 : SUMCOMP(l) > SUMCOMP(r);
    case stringtype:  return SUMCOMP(l) == SUMCOMP(r) ?
                          0 :
                          ( l.u.c.n != r.u.c.n ?
                              l.u.c.n - r.u.c.n :
                              strncmp(STR(l), STR(r), l.u.c.n) ) ;
    } else return l.tag - r.tag;
}
/* return size of the significant portion of the payload */
size odsize (object o) {
    switch (o.tag) {
    case marktype:
    case nulltype:    return 0;
    case booleantype: return sizeof(boolean);
    case nametype:
    case integertype: return sizeof(integer);
    case realtype:    return sizeof(real);
    default:          return sizeof(composite);
    }
}

/* perform a Sum MOD n of the significant bytes of the payload */
integer crc (object k, integer n) {
    unsigned res;
    int i, m;
    res = k.tag;
    if (VERBOSE) printf("crc of %02x ", res);
    m = odsize(k);
    for (i = 0; i < m; i++) {
        if (VERBOSE) printf("%02x ", (unsigned)((byte *)&k.u.b)[i]);
        res += ((byte *)&k.u.b)[i];
    }
    if (VERBOSE) printf(" MOD %ld\n", n);
    res %= n;
    return res;
}
/* Dicts are implicitly one entry larger than declared
   in order to simplify searching (terminate on null).
   So we allocate n+1 pairs here
   and call hash with n+1 in lookup later.
   For all other purposes, n is the maxlength. */
object consdict (state *st, size n) {
    object d = { .tag = dicttype };
    int i;
    d.u.c.a = vmalloc(st, 2*(n+1)*sizeof(object));
    d.u.c.n = n; /* maxlength */
    for (i = 0; i < d.u.c.n+1; i++) {
        ARR(d)[2*i] = ARR(d)[2*i+1] = null;
    }
    HDR(d)->sz = 0; /* length */
    return d;
}
/* call function on each non-empty key-value pair in dict d */
void enumerate (state *st, object d, void (*fp)(object k, object v)) {
    int i;
    if (fp)
        for (i = 0; i < d.u.c.n+1; i++)             /* n+1 slots, see consdict */
            if (objcmp(st, ARR(d)[2*i], null) != 0) /* test the key, not slot i */
                fp(ARR(d)[2*i], ARR(d)[2*i+1]);
}
/* count dictionary entries */
/*
static size dcnt;
void d_count(object k, object v) { ++dcnt; }
*/
size d_length (state *st, object d) {
    /* dcnt = 0; enumerate(st, d, d_count); return dcnt; */
    return HDR(d)->sz;
}

size d_maxlength (state *st, object d) { return d.u.c.n; }

bool d_full (state *st, object d) {
    return HDR(d)->sz == d.u.c.n;
}
/* return address of key,value pair
   or address of first empty slot */
object *d_lookup (state *st, object d, object k) {
    integer i, hash;
    i = hash = crc(k, d.u.c.n+1);
    if (VERBOSE) printf("key hashed to %ld\n", hash);
    while (objcmp(st, ARR(d)[2*i], null) != 0) {
        if (objcmp(st, ARR(d)[2*i], k) == 0) break;
        i = (i-1 >= 0)? i-1: d.u.c.n;   /* probe downward, wrapping around */
        //--i; (void)(i<0? i=d.u.c.n: 0xF00);
        //++i; i %= d.u.c.n+1;
        if (i == hash) fatal("dict too full! loop in lookup.\n");
    }
    if (VERBOSE) printf("lookup terminated on index %ld\n", i);
    return &ARR(d)[2*i];
}

object d_get (state *st, object d, object k) {
    object *e;
    e = d_lookup(st, d, k);
    return e[1];
}

void d_put (state *st, object d, object k, object v) {
    object *e;
    e = d_lookup(st, d, k);
    if (objcmp(st, e[0], null) == 0) {  /* new key */
        if (d_full(st, d)) error(dictfull);
        e[0] = k;
        HDR(d)->sz++;
    }
    e[1] = v;
}

void d_undef (state *st, object d, object k) {
    object *e;
    e = d_lookup(st, d, k);
    if (objcmp(st, e[0], null) != 0) {
        e[0] = e[1] = null;
        HDR(d)->sz--;
    }
}
and the testing section from main:
if (1) { /* test dictionary */
    object d;
    object v;
    d = consdict(st, 10);
    d_put(st, d, consname(st, "foo"), consint(42));
    d_put(st, d, consname(st, "bar"), consreal(42e12));
    d_put(st, d, consname(st, "banana"), consstring(st, STR_CNT("no bananas here")));
    d_put(st, d, consint(2), consint(5));
    d_put(st, d, consint(3), consint(6));
    d_put(st, d, consbool(true), consint(7));
    v = d_get(st, d, consname(st, "foo")); o_dump(st, v);
    v = d_get(st, d, consname(st, "bar")); o_dump(st, v);
    v = d_get(st, d, consname(st, "banana")); o_dump(st, v);
    v = d_get(st, d, consint(2)); o_dump(st, v);
    v = d_get(st, d, consint(3)); o_dump(st, v);
    v = d_get(st, d, consbool(true)); o_dump(st, v);
    v = d_get(st, d, consbool(false)); o_dump(st, v);
}
And the rather cluttered output:
Everything's clustered at the beginning.
But the behavior appears correct!
installing new name pop as 1
resizing name table
installing new name exch as 2
installing new name dup as 3
installing new name copy as 4
installing new name index as 5
installing new name clear as 6
installing new name count as 7
installing new name cleartomark as 8
installing new name counttomark as 9
installing new name add as 10
installing new name idiv as 11
installing new name mod as 12
installing new name sub as 13
installing new name abs as 14
installing new name array as 15
installing new name length as 16
installing new name get as 17
installing new name put as 18
installing new name string as 19
vmalloc requested 264
vmalloc padded request to 264
vmalloc increased size to 288 for header
vmalloc returning 12(0) for 288
vmalloc increased vmsz to 288
installing new name foo as 20
crc of 03 14 00 00 00 MOD 11
key hashed to 1
lookup terminated on index 1
installing new name bar as 21
crc of 03 15 00 00 00 MOD 11
key hashed to 2
lookup terminated on index 2
vmalloc requested 15
vmalloc padded request to 24
vmalloc increased size to 48 for header
vmalloc returning 300(288) for 48
vmalloc increased vmsz to 336
installing new name banana as 22
crc of 03 16 00 00 00 MOD 11
key hashed to 3
lookup terminated on index 3
crc of 01 02 00 00 00 MOD 11
key hashed to 3
lookup terminated on index 0
crc of 01 03 00 00 00 MOD 11
key hashed to 4
lookup terminated on index 4
crc of 00 01 MOD 11
key hashed to 1
lookup terminated on index 10
crc of 03 14 00 00 00 MOD 11
key hashed to 1
lookup terminated on index 1
<integer 0:X--- 42>
crc of 03 15 00 00 00 MOD 11
key hashed to 2
lookup terminated on index 2
<real 0:X--- 42000000000000.000000>
crc of 03 16 00 00 00 MOD 11
key hashed to 3
lookup terminated on index 3
<string 0:X--- no bananas here>
crc of 01 02 00 00 00 MOD 11
key hashed to 3
lookup terminated on index 0
<integer 0:X--- 5>
crc of 01 03 00 00 00 MOD 11
key hashed to 4
lookup terminated on index 4
<integer 0:X--- 6>
crc of 00 01 MOD 11
key hashed to 1
lookup terminated on index 10
<integer 0:X--- 7>
crc of 00 00 MOD 11
key hashed to 0
lookup terminated on index 9
<null 0:X--- ~>
That's not true, is it?
That was one of Goswell's assumptions, and it seems reasonable
to hold distinct types as distinct keys.
Let's see. What does ghostscript do?
GPL Ghostscript 8.62 (2008-02-29)
Copyright (C) 2008 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
GS>1 42 def
GS>1.0 load
GS<1>=
42
GS>
Well, I'll be damned! Look at that!
That's got to be some evil voodoo.
But the bad news is it chokes on an integer that
exceeds the precision of a real.
GS>16#7ffffff0 42 def
GS>16#7ffffff0 cvr load
Error: /undefined in --execute--
Operand stack:
2.14748e+09
Execution stack:
%interp_exit .runexec2 --nostringval-- --nostringval-- --nostringval--
2 %stopped_push --nostringval-- --nostringval-- %loop_continue 1785
2 3 %oparray_pop --nostringval-- --nostringval-- false 1 %stopped_push
.runexec2 --nostringval-- --nostringval-- --nostringval-- 2
%stopped_push --nostringval--
Dictionary stack:
--dict:1152/1684(ro)(G)-- --dict:0/20(G)-- --dict:94/200(L)--
Current allocation mode is local
Last OS error: 2
Current file position is 21
But using the integer still works:
GS>16#7ffffff0 load
GS<1>=
42
But I also get a rangecheck converting back and forth
so the undefined above isn't so terrible.
GS>16#7ffffff0 cvr cvi
Error: /rangecheck in --execute--
Operand stack:
2.14748e+09
Execution stack:
%interp_exit .runexec2 --nostringval-- --nostringval-- --nostringval--
2 %stopped_push --nostringval-- --nostringval-- %loop_continue 1785
2 3 %oparray_pop --nostringval-- --nostringval-- false 1 %stopped_push
.runexec2 --nostringval-- --nostringval-- --nostringval-- 2
%stopped_push --nostringval--
Dictionary stack:
--dict:1152/1684(ro)(G)-- --dict:0/20(G)-- --dict:94/200(L)--
Current allocation mode is local
Current file position is 20
GS<1>
> That's not true is it?
> That was one of Goswell's assumptions and it seems reasonable
> to hold distinct types as distinct keys.
>
> Let's see. What does ghostscript do?
> GPL Ghostscript 8.62 (2008-02-29)
> Copyright (C) 2008 Artifex Software, Inc. All rights reserved.
> This software comes with NO WARRANTY: see the file PUBLIC for details.
> GS>1 42 def
> GS>1.0 load
> GS<1>=
> 42
> GS>
>
> Well, I'll be damned! Look at that!
> That's got to be some evil voodoo.
Adobe Acrobat Distiller does exactly the same. Not conclusive but
indicative that this is intended behaviour.
Ken
And the Distiller result:
%%[ Error: undefined; OffendingCommand: load ]%%
Stack:
2.14748e+09
> But using the integer still works:
>
> GS>16#7ffffff0 load
> GS<1>=
> 42
Again, Acrobat Distiller has the same result.
Ken