Identity Hash

John Clapperton

Feb 7, 2026, 9:08:06 AM
to VAST Community Forum
Object>>identityHash calls Object>>basicHash which is an equality (#=) hash, not an identity (#==) hash, and which is therefore slow to compute for long Strings, and does not distinguish between different Strings which happen to be equal. For example:
| s1 s2 is |
s1 := 'Hello World'.
s2 := 'Hello World'.
is := IdentitySet new.
is add: s1; add: s2.
is  "IdentitySet('Hello World') - only one element"

Wayne Johnston

Feb 7, 2026, 11:40:27 AM
to VAST Community Forum
Change that one line to
   s2 := 'Hello ', 'World'.
or
   s2 := s1 copy.
then you get an IdentitySet with two elements.  So I think your original code is demonstrating some optimization (I suspect an expert will use the proper word).  That is, inspect this:
   s1 == s2
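
For completeness, a quick check of the copied-string variant (a workspace sketch):

| s1 s2 is |
s1 := 'Hello World'.
s2 := s1 copy.
is := IdentitySet new.
is add: s1; add: s2.
is size  "2 - distinct objects with equal values"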

John Clapperton

Feb 7, 2026, 8:54:07 PM
to VAST Community Forum
Well spotted! So then I thought this was being handled as a hash collision, but (as below) Object>>basicHash is actually an identity hash, contrary to its comment, while EsString>>hash is an equality hash:

Object>>basicHash
"Answer a SmallInteger that represents the receiver. Objects that are equivalent (=) must
 answer the same value for the message #basicHash.

 Do not override #basicHash in any subclass!"

<primitive: VMprObjectBasicHash>
^ 0

| s1 s2 |
s1 := String new: 1000000.
s1 atAllPut: $z; at: 1 put: $a; at: 1000000 put: $q.
s2 := s1 copy.
s1 basicHash.  "25613"
s2 basicHash.  "22578"
s1 hash.  "25672"
s2 hash.  "25672"
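
A quick way to see where that cost lives, assuming #hash scans the characters while #basicHash only reads the header (a sketch; absolute numbers will vary):

| s |
s := String new: 1000000.
s atAllPut: $z.
Time millisecondsToRun: [1000 timesRepeat: [s basicHash]].  "small - a header read"
Time millisecondsToRun: [1000 timesRepeat: [s hash]].  "grows with the string length"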

And as IdentityDictionary uses #basicHash, I'm still no wiser about where all the time is going with long Strings as the keys in this (from EsbSampler):
...
| | | |IdentityDictionary>>#at:put: (20.4%,20.4%) [1014 hits][41879(ms)]
| | | | |LookupTable>>#validateKey: (0.0%,0.0%) [3 hits][65(ms)]
(with nothing deeper)
...

Marcus Wagner

Feb 9, 2026, 8:45:34 AM
to VAST Community Forum
I hope to clarify and contribute; the terms "identity hash" and "equality hash" caused my irritation, as did some aspects of the examples mentioned before (the mathematical aspects of a set).
 
First: a hash is a hash, there is no "identity" or "equality" hash.

To begin with, the distinction between identity (==) and equality (=): in words, whether objects are the same (==) or have the same value(s) (=).

Technically, identity is implemented by a pointer comparison (it is the same object if it has the same location) - no hash involved. It relies on the fact that no duplicates of an object exist.
Equality is implemented by comparing the (possibly many) values stored in the object, that means comparing all of its elements. Duplicates of such objects may exist. Note: conceptually, no hash is involved here either!

Thus identity always performs better than equality, as comparing the address (location) is in general faster (with some exceptions, see the end) than comparing many values, the elements of the object.

Now, what about hashes? Hashing is used to speed up this comparison, but it may fail:

if the hashes of two objects are different, the objects are certainly different - but if two objects have the same hash, they STILL MAY BE DIFFERENT.

It is then left to decide whether such a hash collision causes an error when applied, can silently be ignored, or requires a costly recheck to detect and avoid the collision.

So the framework offers you IdentitySets and IdentityDictionaries to avoid collision costs, but at the expense of obeying the law of forbidden duplication.
If you do not care about the hash collision cost, you can safely use normal Sets, LookupTables and Dictionaries, which are based on equality, which in turn is supported by hashes.

The price: if you want to profit from faster identity checks, you have to make certain that you are not duplicating objects, thereby breaking their identity.
It costs quite a lot of work to guarantee object identities, in particular for persistent objects. In the technical world, UUIDs were invented to support this (also a specialised form of hash code).

So some elementary objects (numbers, characters and atoms) have to be treated specially to preserve this logic.
E.g. 3 copy or $A copy makes no sense and does not produce a second object with the same value.
Here it is faster to compare the values directly and not to allocate a number or character object at all.
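
A quick check (values are compared directly; immediates are never duplicated - the last line contrasts a real heap object):

3 copy == 3  "true"
$A copy == $A  "true"
'abc' copy == 'abc'  "false - a String is a heap object"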

Remember: this is also the most important difference between symbols (#) and atoms (##).

And finally, to the difference between basicHash and hash: hash methods are built upon basicHash methods.
As I said, hashes impose logical uncertainty due to hash collisions.
You can gain performance if you design the hash computation yourself; e.g. you benefit if you can ignore elements of an object which are irrelevant.
So Smalltalk offers you the opportunity to implement your own specific hash and = methods, as long as you follow the implied rules (otherwise, you are responsible for the implications or failures).
basicHash is then (as the name says) the most basic implementation guaranteeing system-wide functionality; it forms the (basic) fallback if you do not provide your own hash.
It would be counterproductive to implement other basicHashes, as you can already do this in the hash methods.
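
A minimal sketch of such a pair of methods - the class, its variables and accessors are hypothetical, not from the image. The one rule to follow: if a = b, then a hash must equal b hash. Here an irrelevant cache variable is deliberately ignored by both methods:

Object subclass: #PartNumber
    instanceVariableNames: 'vendor code cachedPrintString'
    classVariableNames: ''
    poolDictionaries: ''

PartNumber >> = other
    "Equivalent if the other is a PartNumber with the same vendor and code."
    self class == other class ifFalse: [^false].
    ^vendor = other vendor and: [code = other code]

PartNumber >> hash
    "Consistent with #=: combine only the values that #= compares."
    ^vendor hash bitXor: code hash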

And a remark on the speed investigations: in general it is conceptually senseless to use String keys in IdentitySets and IdentityDictionaries (as Strings can have duplicates).
It's like division by 0.

Kind regards
M

Marcus Wagner

Feb 13, 2026, 11:32:51 AM
to VAST Community Forum
I had to correct and enhance my first clarification, e.g. concerning the pointer comparison. Obviously such a pointer comparison won't work in a dynamic GC environment.

In Smalltalk the concept of hash codes has been chosen to implement object identity comparison:
1) To test the identity of an object, send it ==.
2) Any object has a unique hash code (within this image).
3) The hash code of an object is its representation as a (small) integer value.
4) The hash code of an object can be obtained by sending basicHash to it.
The benefit is that this == identity test can in general be realized by comparing the hash codes; this (small) integer comparison always performs the same, independent of the location, size or complexity of the objects being checked.
(Flexibility concerning the location allows for dynamic memory handling and garbage collection.)
Implicit is a maximum number of distinct hash codes in an image; it is limited to 32767 (see #basicHash:).
So once you instantiate more than 32767 objects, hash codes will collide.
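
A quick check of that limit (a sketch; exact counts vary per image):

| hashes |
hashes := Set new.
100000 timesRepeat: [hashes add: Object new basicHash].
hashes size  "at most 32768 - collisions among 100000 objects are guaranteed"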

How is this unique hash code initially assigned when an object is instantiated?
Here is an example demonstrating this (also for copied objects):
 | org copy |
 org := Object new.
 copy := org copy.
 Array with: org basicHash with: copy basicHash
or
'abc' copy ~~ 'abc'
 
Hash code computation is left to the virtual machine and is unpublished, but conceptually, counting object instantiations will do; the implementation is realized in <primitive: VMprObjectBasicHash>.
This basic hash can be changed for non-immediates and writeable objects via <primitive: VMprObjectBasicHashColon>. (Objects can be made read-only, so that their hash code is protected as well.)
For certain objects this is optimized:
° For immediate objects, instead of their hash code, their value is compared directly (e.g. Boolean, Character or Integer allInstances are empty).
° For character and integer objects, their value directly serves as the basic unique hash code.
° The basic hash of true is 30, of false is 46 and of nil is 14.
° For floats their value is compared; their basic hash code is 0, but they do have instances.
String constants are cached in some contexts, as shown by (Array with: 'abc' basicHash with: 'abc' basicHash) inspect, but 'abc' basicHash ~= 'abc' copy basicHash; as expected, a copy is a different object with its own hash.
Scaled decimals and fractions are not optimized, as they are composite, not primitive, objects.
Inspect e.g. (Array with: (Fraction numerator: 2 denominator: 3) basicHash with: (Fraction numerator: 2 denominator: 3) basicHash).
Some technical objects are optimized:
° Pool associations: their basic hash code stems from the key, making different pool associations with the same key identical (see the string constant example above, causing the observed caching).
° A CompiledMethod's basic hash stems from its selector.
° A wrapper's basic hash stems from its wrapped object.
Note: all of this causes several basic hash collisions, violating 2) above.
That requires special == treatment of nil, true, false and Floats (immediate objects in general), as
$. basicHash == false basicHash or
5.6 basicHash == 4.5 basicHash
both deliver true, although the receivers are obviously not identical.
 
This solution for identity has been applied and relaxed for equivalence: two (not identical) objects are considered equivalent if they contain the same values.
5) To test the equivalence of an object, send it =.
6) This equivalence test is supported by sending hash to it. The fundamental realization of hash is basicHash (as I said before, a hash is a hash).
Again the benefit is that this = equivalence test can be supported by comparing hash codes; this integer comparison always performs the same, see above.
To leave room for individual relaxations of identity, hash implementations are changed in subclasses:
Floats, Fractions and ScaledDecimals hash to their integer part (truncated towards zero), e.g. 5.1 hash = 5.2 hash (however 5.1 ~= 5.2, as expected).
Associations are equivalent if their key is equivalent; file names on case-insensitive file systems are equivalent if their lower-case names are the same.
Note potential hash collisions concerning Strings (use of random values), Points, Rectangles, LargeIntegers, Unicodes and some technical support objects. Inspect e.g. (Array with: (1@2) hash with: (2@1) hash).
This equivalence implementation allows for open, context-sensitive, application-specific implementations.
See e.g. the Code Assist and VA Content Assist implementations.

Finally, about IdentitySets and IdentityDictionaries (inspecting their differences): they use identity checks instead of equivalence.
They explicitly use basicHash instead of hash (set) and == instead of = (key lookup in the dictionary).
This avoids hash collisions like (1 @ 2) and (2 @ 1), different points in the plane that have the same hash value (posing a problem in a coordinate lookup table).
Apart from these specific implementation differences they perform the same (see observed bench run results).
Hence any performance difference will follow from the different interpretation of the contents, i.e. identity vs. equivalence of the elements or keys.
Note that IdentityDictionary is a specialized LookupTable, not a specialized Dictionary (which is based on Associations).
Finally, there exist many other specialized variations, to name a few: AbtHighCapacityLookupTable, AbtWeakKeyLookupTable, FixedSizeLookupTable, StringLookupTable or EpIdentityDictionary (an example of treating large object collections with performance problems in mind).
That means code, image handling and packaging have traditionally had special aspects to be treated similarly to the Identity variations.
Kind regards
M

John Clapperton

Feb 13, 2026, 7:01:51 PM
to VAST Community Forum
Thanks for the clarification. My original thought to explain the long time being spent, equally, in IdentityDictionary>>at:ifAbsent: and IdentityDictionary>>at:put: (in the ifAbsent: block) was that the comment in
Object>>basicHash
"Answer a SmallInteger that represents the receiver. Objects that are equivalent (=) must
 answer the same value for the message #basicHash.
suggested that it was doing an equivalence inspection (of a long String), which would therefore take a long time. But not so. So I'm still looking for an explanation.

The reason for having a long String as the key in an IdentityDictionary is that when serializing (into a String) an unknown arbitrary structure which may include circular references, it's necessary to keep track of which components have been visited, so that none gets done more than once; and an arbitrary structure may contain long Strings (of other serialized objects, in this case). As Strings don't point to other objects, there is probably a workaround in handling Strings specially, but it's curious at least.
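
To illustrate the special-casing idea, a minimal sketch of the visited-tracking walk - all selectors here (emitString:, emitObject:, emitBackReference:) are hypothetical, not VOSS's actual code, and only named instance variables are walked. Since Strings have no outbound references, they are written directly and never enter the identity table:

visit: anObject
    "Serialize each component exactly once; emit a back reference on revisits."
    anObject isString ifTrue: [^self emitString: anObject].
    (visited includesKey: anObject)
        ifTrue: [^self emitBackReference: (visited at: anObject)].
    visited at: anObject put: visited size + 1.
    self emitObject: anObject.
    1 to: anObject class instSize do: [:i |
        self visit: (anObject instVarAt: i)]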

(from EsbSampler):
...
|VOUnLoader>>#unloadIndexFor: (41.5%,0.0%) [20 hits][85133(ms)]
|IdentityDictionary>>#at:ifAbsent: (41.4%,20.9%) [33 hits][84850(ms)]
| |[] in VOUnLoader>>#unloadIndexFor: (20.4%,0.0%) [1014 hits][41899(ms)]

| | |IdentityDictionary>>#at:put: (20.4%,20.4%) [1014 hits][41879(ms)]
| | | |LookupTable>>#validateKey: (0.0%,0.0%) [3 hits][65(ms)]
|SmallInteger>>#voPrintString (0.0%,0.0%) [4 hits][84(ms)]
|SequenceableCollection>>#copyFrom:to: (0.0%,0.0%) [1 hit][21(ms)]
...

Marcus Wagner

Feb 14, 2026, 10:22:40 AM
to VAST Community Forum
Hi, John,
I guess we went in the wrong direction.

Let me know more about the number of entities.
As elegant as dictionaries and lookup tables are, they are vulnerable. Consider e.g. this construction:

| d |
d := LookupTable new.
0 to: 99999 do: [:i | | n |
    n := 'S', i printString.
    d at: n put: n hash].
d inspect

will perform badly, for several reasons.

First, while being built, the table has to rehash itself whenever it has to expand (as I remember, it then doubles its capacity).
For 100000 elements it will expand roughly (100000 log: 2) ceiling = 17 times, so already-added elements have to contribute to the hash computation many times.
To avoid expansions while building, the LookupTable must be allocated with
d := LookupTable new: 100000.

Second, hash codes must collide (remember: small integers), because the number of strings is larger than the range of possible hash values.
To be resistant to (always possible) collisions, the lookup uses the hash code as a split point into the elements.
A test at the split point checks whether the element there is a hit [(elements at: hash) == searched]. If not, two sequential searches start from there: first upwards, then downwards.
If you run the example above (patience) and inspect its contents, you will see how the hashes are distributed, and you can imagine that it takes a long time to find an element if the first test was not a hit.
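
The collision pressure can also be counted directly - if String>>#hash stays in a small range, as the measurements earlier in this thread suggest, the distinct hash values are far fewer than the keys:

| hashes |
hashes := Set new.
0 to: 99999 do: [:i | hashes add: ('S', i printString) hash].
hashes size  "far fewer than 100000 distinct values"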

It might be that some simple adaptations can help to overcome this small-integer aspect, if such effects are the cause of the observed performance.

But I also want to offer other ideas: what about an attempt to topologically sort the elements?
While also vulnerable to cyclic dependencies, the sort will initially process and remove elements as long as they are not stuck in a cycle.
Given an element that is both start and end of a cycle, a problem-specific handler has to deal with it, process it and remove it.
After that, another iteration (a topological sort of the rest) is repeated until all elements are handled (exactly once).
As a side effect, if you carefully design the key to be topologically sorted, i.e. declare the desired ordering, you can control what is serialized in which order.
As I understood it, this ordering is currently represented by a string.
But I do not know enough about the details of your problem, so this is guesswork.
Kind regards
M    

John Clapperton

Feb 17, 2026, 11:00:44 AM
to VAST Community Forum
The biggest SmallInteger seems to be
(((2 power: 29) * 2) - 1)  "1073741823, class SmallInteger"
((2 power: 29) * 2)  "1073741824, class LargeInteger"

And it is presumably also the largest possible basicHash and number of objects in the image.
1073741822 basicHash  "1073741822"
1073741823 basicHash  "1073741823"
1073741824 basicHash  "0"

John Clapperton

Feb 17, 2026, 11:00:46 AM
to VAST Community Forum
Hash collisions, maybe.

John Clapperton

Feb 17, 2026, 11:01:02 AM
to VAST Community Forum
Yes, the first thing I tried was initializing a bigger dictionary, which helped a bit but not much, and caused more garbage collection; most times an initially small dictionary is enough.

Topological change is not an option, as this is the serialised format of VOSS databases, some of which are 25 years old and mission critical, and for which everything must be backwards compatible.

Transactions now typically contain 2k to 10k object clusters each. Each cluster is serialised, pending writing to the end of the toSpace file (the architecture is modelled on Smalltalk-80). Before that writing comes the construction of the transaction log entry: a collection of those strings, each wrapped in a cluster with its new 21-byte object-table descriptor string. The whole is then serialised into a string for the entire transaction, to be appended to the transaction log file before the serialised object clusters are each appended to the toSpace file and their descriptor records updated to point to their new locations.

This all works fine for small transactions, but with 10k+ clusters per transaction (the largest of which has been 1.1 million), IdentityDictionary>>at:put: is taking orders of magnitude longer, and I don't know why.

Marcus Wagner

Feb 18, 2026, 11:26:13 AM
to VAST Community Forum
Hi John,
given the additional information, I surmise:
The built-in hash (answered by basicHash if not overridden) is stored along with the object (in its header, in the flags), except for immutables.
This hash represents the object and is limited to 15 bits, so it will not be larger than 32767 (hashes are unsigned).
Any group of objects of a size similar to or larger than 32767 will naturally suffer from hash collisions.
I deduce this from source code published in e.g. 14.1.0x86>samples>callin>include>espriv.h and its define
#define EsPrivateHash(o)    ((EsFlags(o) >> 16) & 0x7FFF)
This matches your observation concerning the performance (e.g. of 10k+ clusters).
I will not comment on the facts that hashes elsewhere are typed as SmallIntegers or that immutables can have, and naturally use, larger hash codes, as those classes are not involved in this case.
IdentityDictionary is built on this hash. For me this is absolutely clear; this combination behaves predictably.

This problem and its implied restrictions are obviously not new. It must have popped up ages ago, so I looked for solutions applied in the past.
First, a relatively new, common situation: dictionaries of UnicodeStrings -> StringDictionary in UnicodeSupport.
Here the elements (UnicodeScalars, UnicodeStrings) provide utf8Hash instead of hash, delivering SmallIntegers (31 bit unsigned), which are used in StringDictionary.
-> consider using this StringDictionary instead of IdentityDictionary, as ordinary Strings conform to UnicodeStrings.
Second, a relatively old but rare situation: EswCachedIdentityDictionary in Swapper, a situation which seems similar to your problem. It makes use of basicHashOf: aKey range: anInteger, an extended hash to support object swapping.
This variation of the dictionary uses linear growth when expanding (instead of the exponential growth elsewhere), a side effect which might also become a problem in your context.

So my advice is to replace IdentityDictionary with something else: either your own new special design, or one of the two examples, StringDictionary or EswCachedIdentityDictionary, as these have been built to deal with the implied hash problem.
And finally: the given implementation roots directly in the original Blue Book (Smalltalk-80, 1983).
It reminds me of the 16-bit implementations of that time, which left traces in the language definition.
Staying conformant with this to the present day may seem strange in a 64-bit world, but that is the benefit of a stable standard.
Kind regards
M

Henry Johansen

Feb 18, 2026, 9:48:05 PM
to va-sma...@googlegroups.com
With a dictionary size > 32K objects, the slowdown you are experiencing is caused by the identityHashes of the objects being clustered together.
Meaning, not only will collisions start to occur, but once every slot with index < 32K is filled,
*any* insertion will have to scan from its initial hash index all the way to the end in order to find a free slot.
This effect only gets worse the more items you add to the dictionary.

The main idea to work around this is to space out the different unique identityHashes.
While there will still be collisions, the values stored for any two given indexes are less likely to merge into a single cluster.

There are two good examples of how this can be done in the image, but no premade solution for IdentityDictionaries:

1) The different AbtHighCapacity* classes do this by using a hash function on hash values with a limited range.
This means the different hash values are not evenly spaced, but there is a lower chance of triggering a pathological case where the expanded values end up clustering modulo the collection size.
The equivalent "simplest thing that could possibly work" would be to subclass IdentityDictionary and change all hashIndex calculations to be:
hashIndex := aKey basicHash * 2048 \\ (elementsSize := keys size) + 1.
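
A hedged sketch of that subclass - the single hook method #hashIndexFor: is hypothetical; in the real image you would locate each IdentityDictionary/LookupTable method that computes an index from basicHash and apply the same change there:

IdentityDictionary subclass: #SpacedIdentityDictionary
    instanceVariableNames: ''
    classVariableNames: ''
    poolDictionaries: ''

SpacedIdentityDictionary >> hashIndexFor: aKey
    "Spread the 15-bit basicHash across the whole table so consecutive
     hashes land far apart instead of forming one long probe cluster."
    ^aKey basicHash * 2048 \\ keys size + 1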

2) EpLargeIdentitySet does this by using a 2-level collection - effectively preventing the clustering of a large number of distinct hashes.
However, the level1 size is static after it has been created, and it can be slow to create if the level1 size is large.

I would not use the EswCachedIdentityDictionary primitive - it is mostly an optimization combining fetching the object identityHash and doing the modulus-plus-one operation into a single primitive; it has the same collision / clustering behavior as the normal identityHash.

Cheers,
Henry

VAST Consultant

Senior Software Engineer




 hjoh...@instantiations.com
 instantiations.com




Hans-Martin Mosner

Feb 19, 2026, 10:49:45 AM
to VAST Community Forum
The simple answer is that the compiler finds two equal string literals and emits only one in the compiled method, which is referenced in both assignments.
So s1 and s2 denote the same (identical) object.
Nothing wrong with identityHash.
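
A quick confirmation, assuming both literals are compiled in a single method or doit so the compiler can pool them:

| s1 s2 |
s1 := 'Hello World'.
s2 := 'Hello World'.
Array with: s1 == s2 with: s1 basicHash == s2 basicHash  "(true true)"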