Identity Hash


John Clapperton

Feb 7, 2026, 9:08:06 AM
to VAST Community Forum
Object>>identityHash calls Object>>basicHash which is an equality (#=) hash, not an identity (#==) hash, and which is therefore slow to compute for long Strings, and does not distinguish between different Strings which happen to be equal. For example:
| s1 s2 is |
s1 := 'Hello World'.
s2 := 'Hello World'.
is := IdentitySet new.
is add: s1; add: s2.
is  "inspect: IdentitySet('Hello World' )"

Wayne Johnston

Feb 7, 2026, 11:40:27 AM
to VAST Community Forum
Change that one line to
   s2 := 'Hello ', 'World'.
or
   s2 := s1 copy.
then you get an IdentitySet with two elements.  So I think your original code is demonstrating some optimization (I suspect an expert will use the proper word).  That is, inspect this:
   s1 == s2

John Clapperton

Feb 7, 2026, 8:54:07 PM
to VAST Community Forum
Well spotted! So then I thought this was being handled as a hash collision, but (as below) Object>>basicHash is actually an identity hash, contrary to its comment, while EsString>>hash is an equality hash:

Object>>basicHash
"Answer a SmallInteger that represents the receiver. Objects that are equivalent (=) must
 answer the same value for the message #basicHash.

 Do not override #basicHash in any subclass!"

<primitive: VMprObjectBasicHash>
^ 0

| s1 s2 |
s1 := String new: 1000000.
s1 atAllPut: $z; at: 1 put: $a; at: 1000000 put: $q.
s2 := s1 copy.
s1 basicHash.  "25613"
s2 basicHash.  "22578"
s1 hash.  "25672"
s2 hash.  "25672"

And as IdentityDictionary uses #basicHash, I'm still no wiser to where all the time is going with long Strings as the key in this (from EsbSampler):
...
| | | |IdentityDictionary>>#at:put: (20.4%,20.4%) [1014 hits][41879(ms)]
| | | | |LookupTable>>#validateKey: (0.0%,0.0%) [3 hits][65(ms)]
(with nothing deeper)
...

Marcus Wagner

Feb 9, 2026, 8:45:34 AM
to VAST Community Forum
I hope to clarify and contribute. The terms "identity hash" and "equality hash" caused me some irritation, as did some aspects of the examples mentioned before (the mathematical aspects of a set).
 
First: a hash is a hash; there is no "identity" or "equality" hash.

To begin with, the distinction between identity (==) and equality (=): in words, whether objects are the same (==) or have the same value(s) (=).

Technically, identity is implemented by a pointer comparison (it is the same object if it has the same location) - no hash involved. It relies on the fact that no duplicates of an object exist.
Equality is implemented by comparing the (many) values stored in the object, that is, by comparing all of the elements stored in a location. Duplicates of such objects may exist. Note: conceptually, no hash is involved here either!

Thus identity always performs better than equality, as comparing the address (location) is in general faster (with some exceptions, see at the end) than comparing many values, the elements of the object.

Now what about hashes: hashing is used to speed up this comparison, but it may fail:

If the hashes of two objects are different, then the objects are certainly different - but if two objects have the same hash, they STILL MAY BE DIFFERENT.

Now it is left to decide whether such a hash collision causes an error when it occurs, can silently be ignored, or requires a costly recheck to detect and avoid the collision.

So the framework offers you IdentitySets and IdentityDictionaries to avoid collision costs, but at the expense of obeying the law of forbidden duplication.
If you do not care about the hash collision cost, you can safely use normal Sets, LookupTables, and Dictionaries, based on equality, which in turn is based on hashes.

The price is that if you want to profit from faster identity checks, you have to make certain that you are not duplicating objects and thereby breaking their identity.
It costs quite a lot of work to guarantee object identities, in particular for e.g. persistent objects. In the technical world, UUIDs were invented to support this (also a specialized form of hash code).

So e.g. some elementary objects (numbers, characters, and atoms) have to be treated specially to guarantee this logic.
E.g. 3 copy or $A copy makes no sense and does not provide a second object with the same value.
Here it is faster to compare the values and not to allocate a number or character object at all.

Remember: this is also the most important difference between symbols (#) and atoms (##).

And finally, to the difference between basicHash and hash: hash methods are built upon basicHash methods.
As I said, hashing imposes logical uncertainty due to hash collisions.
You can gain performance if you design the hash computation, e.g. you benefit if you can ignore elements of an object which are irrelevant.
So Smalltalk offers you the opportunity to implement your own specific hash and = methods, as long as you follow the implied rules (otherwise, you are responsible for the implications or failures).
basicHash is then (as the name says) the most basic implementation, guaranteeing system-wide functionality; it forms the (basic) fallback if you do not provide your own hash.
It would be counterproductive to implement other basicHashes, as you can already do this in the hash methods.

And a remark on the speed investigations: in general it is conceptually senseless to use String keys in IdentitySets and IdentityDictionaries (as Strings can have duplicates).
It's like division by 0.

Kind regards
M

Marcus Wagner

Feb 13, 2026, 11:32:51 AM
to VAST Community Forum
I had to correct and enhance my first clarification, e.g. the pointer comparison: obviously such a pointer comparison won't work in a dynamic GC environment.

In Smalltalk the concept of hash codes has been chosen to implement object identity comparison:  
1) To test the identity of an object, send it ==.
2) Any object has a unique hash code (within this image).
3) The hash code of an object is its representation as (a small) integer value.
4) The hash code of an object can be obtained by sending basicHash to it.
The benefit is in general that this == identity test can be realized by comparing the hash codes; this (small) integer comparison always performs the same, independent of the location, size, or complexity of the objects to be checked.
(Flexibility concerning the location allows for dynamic memory handling and garbage collection.)
Implicit is a maximum number of objects in an image; it is limited to the largest small integer, 32767 (see basicHash:).
So if you instantiate more than 32767 objects, identity checks will collide.
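The consequence of that small hash space can be illustrated outside Smalltalk. Here is a Python sketch (my own toy model, not VAST code; only the 2**15 value range is taken from this discussion): by the pigeonhole principle, more than 32768 objects guarantee collisions, and the birthday effect makes them common long before that.

```python
import random

HASH_SPACE = 2 ** 15  # 32768 possible hash values, per the discussion above

def count_collisions(n_objects, seed=42):
    """Assign each of n_objects a random 15-bit hash and count how many
    land on a value that some earlier object already received."""
    rng = random.Random(seed)
    seen = set()
    collisions = 0
    for _ in range(n_objects):
        h = rng.randrange(HASH_SPACE)
        if h in seen:
            collisions += 1
        seen.add(h)
    return collisions

for n in (1000, 10000, 100000):
    print(n, count_collisions(n))
```

With 100000 objects, at least 100000 - 32768 = 67232 collisions are unavoidable, no matter how well the hash values are distributed.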

How to initially assign this unique hash code when instantiating an object?
Here an example demonstrating (also copying objects)
 | org copy |
 org := Object new.
 copy := org copy.
 Array with: org basicHash with: copy basicHash
or
'abc' copy ~~ 'abc'  "true: the copy is a different object"
 
Hash code computation is left to the virtual machine and is unpublished, but conceptually counting object instantiations will do; its implementation is realized in <primitive: VMprObjectBasicHash>.
This basic hash can be changed for non-immediate and writeable objects via <primitive: VMprObjectBasicHashColon>. (Objects can be made readOnly, so that their hash code is protected as well.)
For certain objects this is optimized:
° For immediate objects, instead of their hash code, their value is directly compared (e.g. Boolean, Character or Integer allInstances are empty)
° For character and integer objects their value directly serves as the basic unique hash code.
° The basic hash of true is 30, of false is 46 and of nil is 14.
° For floats their value is compared, their basic hash code is 0, but they have instances.
String constants are cached in some contexts, as shown in (Array with: 'abc' basicHash with: 'abc' basicHash) inspect, but 'abc' basicHash is ~= 'abc' copy basicHash, as expected: a copy is a different object having its own hash.
Scaled decimals and fractions are not optimized, as they are composite, not primitive, objects.
Inspect e.g. (Array with: (Fraction numerator: 2 denominator: 3) basicHash with: (Fraction numerator: 2 denominator: 3) basicHash).
Some technical objects are optimized:
° Pool associations: their basic hash code stems from the key, making different pool associations with the same key identical (see the string constant example above, which causes the observed caching).
° A CompiledMethod's basic hash stems from its selector.
° A wrapper's basic hash stems from its wrapped object.
And some objects use the basic hash code space for flags (e.g. Graphemes).
Note: all of this causes several basic hash collisions, violating 2) from above.
So that requires special == treatment of nil, true, false and Floats (immediate objects in general), as
$. basicHash == false basicHash
and
5.6 basicHash == 4.5 basicHash
both deliver true, but the receivers are obviously not identical.
 
This solution for the identity complex has been applied, relaxed, to equivalency: two (not identical) objects are considered equivalent if they contain the same values.
5) To test the equivalence of an object, send it =.
6) This equivalency test is supported by sending hash to it. The fundamental realization of hash is basicHash (as I said before, a hash is a hash).
Again the benefit is that this = equivalence test can be realized by comparing hash codes; this integer comparison always performs the same, see above.
To leave room for individual relaxations of identity, hash implementations in subclasses are changed:
Floats, Fractions, and ScaledDecimals are equivalent if their integer parts (truncated towards zero) are the same, e.g. 5.1 hash = 5.2 hash. (However 5.1 ~= 5.2, as expected.)
Associations are equivalent if their keys are equivalent; file names are equivalent, on case-insensitive file systems, if their lower-case names are the same.
Note potential hash collisions concerning Strings (use of random values), Points, Rectangles, LargeIntegers, Unicodes and some technical support objects. Inspect e.g. (Array with: (1@2) hash with: (2@1) hash).
This equivalence implementation allows for open, context-sensitive, application-specific implementations.
See e.g. Code Assist and VA Content Assist implementations.

Finally, about the IdentitySets and IdentityDictionaries (inspecting their differences): they use identity checks instead of equivalence.
They explicitly use basicHash instead of hash (in the set) and == instead of = (key lookup in the dictionary).
This avoids hash collisions like (1 @ 2) and (2 @ 1), different points in the plane having the same hash value (posing a problem in a coordinate lookup table).
Apart from these specific implementation differences they perform the same (see the observed benchmark results).
Hence any performance difference will follow from the different content interpretation, i.e. the difference between identity and equivalence of the elements or keys.
Note that IdentityDictionary is a specialized LookupTable, not a specialized Dictionary (which is based on Associations).
Finally, there exist many other specialized variations, to name a few: AbtHighCapacityLookupTable, AbtWeakKeyLookupTable, FixedSizeLookupTable, StringLookupTable, or EpIdentityDictionary (an example of treating large object collections with performance problems).
That means that code, image handling, and packaging traditionally had special aspects to be treated similarly to the Identity variations.
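The (1 @ 2) / (2 @ 1) situation can be mimicked in any language with hashed collections. A Python sketch (a hypothetical Point class whose hash deliberately ignores coordinate order, standing in for the symmetric Point hash mentioned above; not VAST code):

```python
class Point:
    """Toy point whose hash ignores coordinate order, so (1, 2) and
    (2, 1) collide -- analogous to the Point hash discussed above."""
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __hash__(self):
        return hash(self.x) ^ hash(self.y)  # symmetric: order is lost
    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

p, q = Point(1, 2), Point(2, 1)
assert hash(p) == hash(q)   # same bucket: a hash collision
assert p != q               # but the equality check still separates them
assert len({p, q}) == 2     # so the set keeps both, at the cost of a recheck
```

An identity-based collection would sidestep this collision only because p and q happen to be distinct objects; it would also treat two equal points as different entries, which is exactly the trade-off described above.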
Kind regards
M

John Clapperton

Feb 13, 2026, 7:01:51 PM
to VAST Community Forum
Thanks for the clarification. My original thought, to explain the long time being spent, equally, in IdentityDictionary>>at:ifAbsent: and IdentityDictionary>>at:put: (in the ifAbsent: block) was that the comment in
Object>>basicHash
"Answer a SmallInteger that represents the receiver. Objects that are equivalent (=) must
 answer the same value for the message #basicHash.
suggested that it was doing an equivalency inspection (of a long String) which was therefore taking a long time. But not so. So I'm still looking for an explanation.

The reason for having a long String as the key in an IdentityDictionary is that when serializing (into a String) an unknown arbitrary structure which may include circular references, it's necessary to keep track of which components have been visited, so that none gets done more than once; and an arbitrary structure may contain long Strings (of other serialized objects in this case). As Strings don't point to other objects, there is probably a workaround by handling Strings specially, but it's curious at least.

(from EsbSampler):
...
|VOUnLoader>>#unloadIndexFor: (41.5%,0.0%) [20 hits][85133(ms)]
|IdentityDictionary>>#at:ifAbsent: (41.4%,20.9%) [33 hits][84850(ms)]
| |[] in VOUnLoader>>#unloadIndexFor: (20.4%,0.0%) [1014 hits][41899(ms)]

| | |IdentityDictionary>>#at:put: (20.4%,20.4%) [1014 hits][41879(ms)]
| | | |LookupTable>>#validateKey: (0.0%,0.0%) [3 hits][65(ms)]
|SmallInteger>>#voPrintString (0.0%,0.0%) [4 hits][84(ms)]
|SequenceableCollection>>#copyFrom:to: (0.0%,0.0%) [1 hit][21(ms)]
...

Marcus Wagner

Feb 14, 2026, 10:22:40 AM
to VAST Community Forum
Hi, John,
I guess we went in the wrong direction.

Let me know more about the number of entries.
As elegant as dictionaries and lookup tables are, they are vulnerable. Consider e.g. the construction of

| d |
d := LookupTable new.
0 to: 99999 do: [:i | | n |
    n := 'S', i printString.
    d at: n put: n hash].
d inspect

will perform badly, for several reasons.

First, while building, the table has to rehash itself whenever it has to expand (as I remember, it then doubles its capacity).
For 100000 elements it will expand about log2 of 100000, i.e. roughly 16 times, so already-added elements have to contribute to the hash computation many times.
To avoid expansions while building, the LookupTable must be allocated with
d := LookupTable new: 100000.

Second, hash codes must collide (remember, small integers) because the number of strings is larger than the range of possible hash values.
To be resistant to (always possible) collisions, the lookup e.g. uses the hash code as a split point into the elements.
A test at the split point checks whether the element there is a hit [(elements at: hash) == searched]. If not, it starts two sequential searches from there: first one upwards and then one downwards.
If you run the example above (patience) and inspect its contents, you see how the hashes are distributed, and you can imagine that it will take a long time to find an element if the first test was not a hit.

It might be that some simple adaptations can help to overcome this small-integer aspect, if such effects are the cause of the observed performance.

But I also want to offer other ideas: what about an attempt to topologically sort the elements?
Though also vulnerable to cyclic dependencies, the sort will initially process and remove elements as long as they are not stuck in a cycle.
Given such an element, being start and end of a cycle, a problem-specific handler has to deal with it, to process and remove it.
After that, another iteration (a topological sort of the rest) has to be repeated, until all elements are handled (exactly once).
As a side effect, if you carefully design the key to be topologically sorted, i.e. declare the desired ordering, you can control what is serialized in which order.
As I understood it, this is currently represented by a string.
But I do not know enough about the details of your problem, so this is my guesswork.
Kind regards
M    

John Clapperton

Feb 17, 2026, 11:00:44 AM
to VAST Community Forum
The biggest SmallInteger seems to be
(((2 power: 29) * 2) - 1)  "1073741823, a SmallInteger"
((2 power: 29) * 2)  "1073741824, a LargeInteger"

And it is presumably also the largest basicHash and number of objects in the image.
1073741822 basicHash  "1073741822"
1073741823 basicHash  "1073741823"
1073741824 basicHash  "0"

John Clapperton

Feb 17, 2026, 11:00:46 AM
to VAST Community Forum
Hash collisions maybe.

John Clapperton

Feb 17, 2026, 11:01:02 AM
to VAST Community Forum
Yes, the first thing I tried was initializing a bigger dictionary, which helped a bit, but not much, and caused more garbage collection; most times an initially small dictionary is enough.

Topological change is not an option, as this is the serialised format of VOSS databases, some of which are 25 years old, mission critical, and for which everything must be backwards compatible.

Transactions now typically each contain 2k to 10k object clusters, each of which is serialised, pending writing to the end of the toSpace file (the architecture is modelled on Smalltalk-80). Before that write comes the construction of the transaction log entry: a collection of those strings, each wrapped in a cluster with its new 21-byte object-table descriptor string, the whole of which is serialised into a string for the whole transaction. That string is appended to the transaction log file before the serialised object clusters are each appended to the toSpace file and their descriptor records are updated to point to their new locations.

This all works fine for small transactions, but with 10k+ clusters per transaction (the largest of which has been 1.1 million), IdentityDictionary>>at:put: is taking orders of magnitude longer, and I don't know why.

Marcus Wagner

Feb 18, 2026, 11:26:13 AM
to VAST Community Forum
Hi John,
given the additional information, I surmise:
The builtin hash (answered by basicHash if not overridden) is stored along with the object (in its header, in the flags), except for immutables.
The hash represents this object and is limited to 15 bits, so it will not be larger than 32767 (as hashes are unsigned).
Any group of objects of a size similar to or larger than 32767 will naturally suffer from hash collisions.
I deduce this from source code published in e.g. 14.1.0x86>samples>callin>include>espriv.h and its define
#define EsPrivateHash(o)    ((EsFlags(o) >> 16) & 0x7FFF)
This matches your observation concerning the performance (e.g. of 10k+ clusters).
I will not comment on the fact that hashes elsewhere are typed as SmallIntegers, or that immutables can (and naturally do) use larger hash codes, as those classes are not involved in this case.
IdentityDictionary is built on this hash. For me this is absolutely clear; this combination behaves predictably.

This problem and its implied restrictions are obviously not new. They must have popped up ages ago, so I looked for solutions applied in the past.
First, a relatively new, common situation: dictionaries of UnicodeStrings -> StringDictionary in UnicodeSupport.
Here the elements (UnicodeScalars, UnicodeStrings) provide utf8Hash instead of hash, delivering SmallIntegers (31-bit unsigned), which are used in StringDictionary.
-> Consider using this StringDictionary instead of IdentityDictionary, as ordinary Strings conform to UnicodeStrings.
Second, a relatively old but rare situation: EswCachedIdentityDictionary in Swapper, a situation which seems to be similar to your problem. It makes use of basicHashOf: aKey range: anInteger, an extended hash to support object swapping.
This variation of the dictionary uses linear growth when expanding (instead of the exponential growth elsewhere), a side effect which might also become a problem in your context.

So my advice is to replace IdentityDictionary with something else, either your own new special design, or one of the two examples, StringDictionary or EswCachedIdentityDictionary, as these have been built to deal with the implied hash problem.
And finally: the given implementation roots directly in the original blue book, Smalltalk-80 of 1983.
It reminds me of the 16-bit implementations of that time, which left traces in the language definition.
Staying conformant with this up to today may seem strange in a 64-bit world, but that is the benefit of a stable standard.
Kind regards
M

Henry Johansen

Feb 18, 2026, 9:48:05 PM
to va-sma...@googlegroups.com
With a dictionary size > 32K objects, the slowdown you are experiencing is caused by the identityHashes of the objects being clustered together.
Meaning, not only will there start being collisions, but once every slot with index < 32K is filled,
*any* insertion will have to scan from its initial hash index all the way to the end in order to find a free slot.
This effect only gets worse the more items you add to the dictionary.

The main idea to work around this, is to space out the different unique identityHashes.
While there will still be collisions, the values stored for any two given indexes are less likely to merge into a single cluster.

There are two good examples of how this can be done in the image, but no premade solution for IdentityDictionaries:

1) The different AbtHighCapacity* classes do this by using a hash function on hash values with a limited range.
This means the different hash values are not evenly spaced - but have a lower chance of triggering a pathological case where the expanded values end up clustering modulo the collection size.
The equivalent "the simplest thing that could possibly work" would be to subclass IdentityDictionary, and change all hashIndex calculations to be:
hashIndex := aKey basicHash * 2048 \\ (elementsSize := keys size) + 1.  

2) EpLargeIdentitySet does this by using a 2-level collection - effectively preventing the clustering of a large number of distinct hashes.
However, the level1 size is static after it has been created, and it can be slow to create if the level1 size is large.

I would not use the EswCachedIdentityDictionary primitive - it is mostly an optimization combining fetching the object identityHash, and doing modulus + one operation into a single primitive, it has the same collision / clustering behavior as the normal identityHash.
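The clustering described above, and the effect of the spreading trick from option 1, can be reproduced with a scaled-down Python model (my own toy open-addressing table with plain linear probing, not the actual collection code; I use 9-bit hashes, a 2048-slot table, and a spread factor of 4 so it runs quickly, in place of the 15-bit basicHash and the 2048 multiplier):

```python
def total_probes(hashes, capacity, spread=1):
    """Insert every hash into an open-addressing table with linear
    probing; return the total number of slot inspections needed."""
    slots = [None] * capacity
    probes = 0
    for h in hashes:
        i = (h * spread) % capacity
        while slots[i] is not None:   # occupied: step to the next slot
            probes += 1
            i = (i + 1) % capacity
        slots[i] = h
        probes += 1                   # the final, successful inspection
    return probes

# Scaled-down model: 9-bit hashes (512 values) stand in for the 15-bit
# basicHash, with a 2048-slot table and 1200 keys, so hash values repeat.
hashes = [i % 512 for i in range(1200)]
clustered = total_probes(hashes, 2048)             # one huge cluster forms
spread_out = total_probes(hashes, 2048, spread=4)  # analogous to * 2048 \\ size
print(clustered, spread_out)
```

In this model the spread insertion needs only a couple of probes per key, while the unspread one degenerates into scanning one big cluster: hundreds of probes per key once the distinct hash values are exhausted.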

Cheers,
Henry

VAST Consultant
Senior Software Engineer
hjoh...@instantiations.com
instantiations.com



Hans-Martin Mosner

Feb 19, 2026, 10:49:45 AM
to VAST Community Forum
The simple answer is that the compiler finds two equal string literals and emits only one in the compiled method, which is referenced in both assignments.
So s1 and s2 denote the same (identical) object.
Nothing wrong with identityHash.

John Clapperton

Feb 25, 2026, 10:03:03 AM
to VAST Community Forum
Thanks all for your help. But Henry, I don't fully understand your concern about EswCachedIdentityDictionary, as its #basicHashOf:range: does return different (i.e. identity) hashes, greater than the 15-bit limit of 32767, for two equal long strings:
| eswDict s1 s2  |
eswDict := nil.

s1 := String new: 1000000.
s1 atAllPut: $z; at: 1 put: $a; at: 1000000 put: $q.
s2 := s1 copy.
eswDict := EswCachedIdentityDictionary new: 1000000.
eswDict basicHashOf: s1 range: 1000000.  "875257"
eswDict basicHashOf: s2 range: 1000000.  "712386"

Marcus Wagner

Feb 27, 2026, 10:34:24 AM
to VAST Community Forum
Again, another attempt to clarify.

First, in Smalltalk the hash is a special instance variable, part of the header of ANY object, with the exception of immediate objects (which stand for themselves).
It is accessed by the basicHash and basicHash: primitives and holds an unsigned 15-bit integer.
This hash, being an integer, represents the object; its value characterizes the object, not its content!

The pair of hash and class forms a quick not-equal identity check (due to hash collisions, two different objects may have the same hash).

This hash is set as part of Object new and ought to be kept fixed for the lifetime of the object. It characterizes the object's identity.
So the == operation is partially covered by a quick integer comparison.

Several other constructs are based upon this hash: e.g. the hash serves to select bins/partitions where many objects are kept for some purpose, so as to reduce the search space (bin).

If the object's content is changed, that should normally not change the hash (as the identity is kept fixed, to keep any navigation based on it stable).
Sometimes, when a hash has to be changed (for other purposes), any construct based on the initial hash usually has to reorganize; this is usually named rehashing.

The point is: once a hash value has been assigned at the object creation point, its value is fixed.
Any basicHash is therefore a getter (and as fast as any other instance variable access) to a precomputed value.
The hash code is e.g. invariant to the location of the object (e.g. it does not change when the object is moved by the GC).
In other computer languages such identity checks are realized as pointer comparisons (a pointer again seen as an integer object identity), as long as the objects are not moved in memory. Many procedural languages provide a heap to allocate objects, which are not moved later.
In Smalltalk, such a pointer comparison would fail (except after sending makeFixed - but makeFixed is another topic, important in parameter communication with the outer world in inward and outward calls), because the location of an object is not fixed; it is moved within the image while the GC is working.

So this hash performs well. It is calculated only once, in Object class>>basicNew.

The example of Hans-Martin is not strange compiler wizardry; this is a general design to cope with literals in compiled methods.
As any constant in method code has to be provided at runtime, constants are held as literals (see CompiledMethod allLiterals).
To keep method code free of redundancy, any literal is kept only once per method (no duplicates).
So even if the same string, class, symbol, pool association, .... appears many times in the source code of a method, only one instance of it is kept in the literals.
As the method keeps only one instance, it also holds only one hash; asking the same object, it will answer the same value, its hash.

John, a word on your example: a copy of an object provides another object, hence it has a new hash code, initialized when it is allocated, independent of its content. Hence the extension provided by basicHashOf:range: differs because the basicHash in it is already different (as the copy is another object than the original). It does not matter how long the strings are or what they contain: they (s1, s2) are different objects!
The basicHashOf:range: construct is a specific reaction to avoid (basic) hash collisions in certain hash distributions.

Then to the so-called immediate objects:
Immediate objects are always singletons; they cannot be allocated, modified, or copied.
They just exist; examples are $A, nil, true, false, 5, or 5.2e4. For some such objects, the class method allInstances fails (being useless).
This opens a wide field of optimization, including their hash (see my comment about special hash values for them).
A hash value can simply be defined here - it does not even need an initial computation.

However, a string, even a string constant like 'abab', is not immediate: it is an ordinary object like any other (actually an array of Characters).
So I turn to the readOnly property of an object: this is the only difference between String new: x and '.....' of length x.
You can send 'abbaba' markReadOnly: false and then modify its content (it is an array of characters with the write lock removed).
You can also investigate 'abbaba' basicHash, change the read-only state, modify the string, and finally look again at its basicHash: it has to stay the same.
markReadOnly is independent of the object's identity, and the string's hash follows the general rule of any object.

Thus any use of IdentitySets and IdentityDictionaries suffers from the same restriction: it makes use of basicHash as an unsigned 15-bit integer.
No matter which objects are kept, when the number of objects nears 2**15 their hashes tend to collide (independent of any content).
As I stated initially, the hash is a quick not-equal identity check; the consequence is that you have to dive deeper if the hashes are equal.
So, as Henry stated before: if hashes collide, performance will fall dramatically because of the unavoidable collision handling.

So it is important for hash-based solutions (like Sets and Dictionaries) to study the hash value distribution if they do not perform.
You have to design the hash to avoid hash collisions.
And you have to learn from the Object hash example: it is computed once and reused later, providing an identity mark as a hidden instance variable.
In certain environments a UUID serves exactly this purpose, being computed once for an instance and then kept stable (however, UUIDs are meant to be universally unique, not only in the current image).
It would be counterproductive in most cases to compute the hash more than once, i.e. other than when the object is allocated.

To conclude: to decide whether to use IdentitySets/Dictionaries or ordinary Sets/LookupTables/Dictionaries, you have to define how an object can be found: by its object identity, or by comparing its content (equality). A content comparison is in general expensive. Alternatively, objects have to be provided with some identity mark, which in Smalltalk (traditionally since the 80ies, see the blue book) is approached by a hash code.
And this in turn raises questions about the involved hash distributions.

Also think about framework design concerning the existence of wanted or unwanted copies. Follow the design decisions concerning immediate objects to avoid copies and thus avoid complex hash calculations. See ##Atoms for how copies are avoided using a central registry.

And finally: if copies of objects cannot be avoided, do not use IdentitySets/Dictionaries.

From this standpoint, IdentitySets/Dictionaries can ONLY be useful when object copies are impossible, either physically (as with immediate objects, automatically guaranteed singletons) or by design (like Atoms or certain persistent object frameworks).
This closes the loop to the start of this conversation, which began with an IdentitySet and two different objects.

John Clapperton

Feb 27, 2026, 11:40:51 AM
to VAST Community Forum

Marcus, the purpose of my example was to demonstrate that EswCachedIdentityDictionary>>basicHashOf:range: returns a hash which is

1) different for the different objects and thus not spending time on a long test of equality, and

2) (given a big range), is bigger than 32767, therefore reducing the likely occurrence of hash collisions in a large EswCachedIdentityDictionary as compared with a similar sized IdentityDictionary.

EswCachedIdentityDictionary therefore seems to meet the requirements; so I'd like to know more about Henry's concern which seems to suggest that it would be no better than IdentityDictionary, where he says:

<<I would not use the EswCachedIdentityDictionary primitive - it is mostly an optimization combining fetching the object identityHash, and doing modulus + one operation into a single primitive, it has the same collision / clustering behavior as the normal identityHash.>>


Henry Johansen

Feb 27, 2026, 12:13:31 PM
to va-sma...@googlegroups.com
You are right - being lazy, I only tested with immediate objects, for which it behaves as I described:

eswDict := EswCachedIdentityDictionary new.
num := 42.
num2 := 43.
num basicHash.   "42"
num2 basicHash.  "43"

eswDict basicHashOf: num range: 100000.    "86"
eswDict basicHashOf: num range: 1000000.   "86"
eswDict basicHashOf: num2 range: 100000.   "88"
eswDict basicHashOf: num range: 1000000.   "86"

a := $a.
b := $b.
a basicHash.  "97"
b basicHash.  "98"

eswDict basicHashOf: a range: 100000.    "196"
eswDict basicHashOf: a range: 1000000.   "196"
eswDict basicHashOf: a range: 100000.    "196"
eswDict basicHashOf: b range: 1000000.   "198"


If all of your keys are Strings, another approach could be to subclass the StringDictionary Marcus mentioned,
overriding the equality checks to use == instead of =, while keeping the improved utf8Hash method to calculate indexes.
This works since
a == b also implies a = b, and by extension, a hash = b hash.
So identical strings will still have the same hashIndex - you'll have to scan through duplicates at the same hashIndex to find a == match,
but should mostly end up avoiding clustering.
However, while utf8Hash is much faster than the legacy #hash, it is not constant time like the identityHash.
So for large keys, and small dictionary sizes, there will be an overhead - but it scales much better to a large number of keys.
It will also be slower if your strings contain non-ascii characters, in which case they will have to be converted to UnicodeStrings before the hash is computed.

So there are trade-offs, and the right answer for which implementation to pick depends on your expected key distribution.
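The scheme described above - bucket by an equality hash, but match entries by identity - can be sketched generically. The Python analogue below is a hypothetical illustration only, not the VAST implementation: the class name `IdentityKeyedDict` is invented, Python's `hash()` stands in for utf8Hash, and `is` stands in for Smalltalk's ==. Because identity implies equality implies equal hashes, identical keys always reach the same bucket; within a bucket, only an identity match counts.

```python
class IdentityKeyedDict:
    """Sketch: buckets chosen by equality hash, entries matched by identity."""

    def __init__(self, n_buckets=64):
        self._buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # a is b  =>  a == b  =>  hash(a) == hash(b),
        # so identical keys always reach the same bucket.
        return self._buckets[hash(key) % len(self._buckets)]

    def __setitem__(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k is key:                  # identity check, not equality
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def __getitem__(self, key):
        for k, v in self._bucket(key):
            if k is key:
                return v
        raise KeyError(key)

s1 = "".join(["long", " ", "key"])
s2 = "".join(["long", " ", "key"])   # equal content, distinct object

d = IdentityKeyedDict()
d[s1] = "first"
d[s2] = "second"
print(d[s1], d[s2])   # first second
```

Equal copies stay separate entries, yet lookups still pay only one bucket scan rather than clustering into a few identity-hash slots.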

s1 := String new: 1000000.
s1 atAllPut: $z; at: 1 put: $a; at: 1000000 put: $q.
s2 := s1 copyFrom: 1 to: 100000.
s3 := s2 copyFrom: 1 to: 10000.
s4 := s3 copyFrom: 1 to: 1000.

s5 := s1 copy at: 1 put: $b; yourself.
s1 identityHash 25802
s5 identityHash 28945
"Much better utf8Hash range, and distribution when strings are similar"
s1 hash 25672
s5 hash 25815
s1 utf8Hash 444841898
s5 utf8Hash 842055959
"much better performance than legacy #hash - but not constant time like identityHash"
#(10 100 1000 10000) collect: [:loopCount | Time millisecondsToRun: [1 to: loopCount do: [:i | s1 identityHash ]]] (0 0 0 0) (0 0 0 0)
#(10 100 1000 10000) collect: [:loopCount | Time millisecondsToRun: [1 to: loopCount do: [:i | s1 hash ]]]  (16 78 813 7734) (0 93 782 7718)
#(10 100 1000 10000) collect: [:loopCount | Time millisecondsToRun: [1 to: loopCount do: [:i | s1 utf8Hash ]]]  (0 15 110 953) (0 0 109 1047)

#(10 100 1000 10000) collect: [:loopCount | Time millisecondsToRun: [1 to: loopCount do: [:i | s2 identityHash ]]] (0 0 0 0) (0 0 0 0)
#(10 100 1000 10000) collect: [:loopCount | Time millisecondsToRun: [1 to: loopCount do: [:i | s2 hash ]]] (0 0 78 797)  (0 16 78 813)
#(10 100 1000 10000) collect: [:loopCount | Time millisecondsToRun: [1 to: loopCount do: [:i | s2 utf8Hash ]]] (0 0 0 125)  (0 0 16 109)

#(10 100 1000 10000) collect: [:loopCount | Time millisecondsToRun: [1 to: loopCount do: [:i | s3 identityHash ]]]   (0 0 0 0) (0 0 0 0)
#(10 100 1000 10000) collect: [:loopCount | Time millisecondsToRun: [1 to: loopCount do: [:i | s3 hash ]]] (0 0 16 78)  (0 0 0 109)
#(10 100 1000 10000) collect: [:loopCount | Time millisecondsToRun: [1 to: loopCount do: [:i | s3 utf8Hash ]]]  (0 0 0 15) (0 0 0 16)


Henry Johansen

VAST Consultant

Senior Software Engineer

Henry Johansen

unread,
Feb 27, 2026, 12:22:52 PM
to va-sma...@googlegroups.com
Copy pasta aside - the point was to show that for immediates, the return value of the primitive does not change if the range increases :)
Corrected:
eswDict basicHashOf: num range: 100000.  86
eswDict basicHashOf: num range: 1000000. 86
eswDict basicHashOf: num2 range: 100000.  88
eswDict basicHashOf: num2 range: 1000000.  88


eswDict basicHashOf: a range: 100000. 196
eswDict basicHashOf: a range: 1000000.  196
eswDict basicHashOf: b range: 100000.  198

eswDict basicHashOf: b range: 1000000. 198  


John Clapperton

unread,
Feb 27, 2026, 1:57:42 PM
to VAST Community Forum

In the problematic case of interest (serialising the transaction-log entry), none of the keys will be immediates, and the exact number of elementary objects is known: one sixth of them are long strings, the others instances of known classes. So there's no problem in using a linearly-growing EswCachedIdentityDictionary (it won't need to grow), plus an IdentityDictionary for serialising the smaller arbitrary clusters (which create the above long strings, of which there will routinely be 10k+ and maybe up to a million in a transaction). So EswCachedIdentityDictionary would seem to meet the requirements. Thank you.
