XHashTable performance

9 views
Skip to first unread message

Waldek Hebisch

unread,
Dec 17, 2024, 5:59:06 PM12/17/24
to fricas...@googlegroups.com
I looked at profile info from a run of api2.spad and it seems that
there is some inefficiency in XHashTable. Namely,
profile shows that 'localSearch' takes 20% of run time (that alone
is reasonable) and that 6% of run time is spent in
'positiveRemainder' (that is about 30% of 'localSearch' is due to
'positiveRemainder'). 'positiveRemainder' is more expensive than
necessary, because it can handle Integer. However, AFAICS
the numbers involved are nonnegative SingleInteger-s and
'rem' from SingleInteger should work fine.

--
Waldek Hebisch

Ralf Hemmecke

unread,
Dec 19, 2024, 8:14:06 AM12/19/24
to fricas-devel
On 12/17/24 23:59, Waldek Hebisch wrote:
> 'positiveRemainder' (that is about 30% of 'localSearch' is due to
> 'positiveRemainder'). 'positiveRemainder' is more expensive than
> necessary, because it can handle Integer. However, AFAICS the
> numbers involved are nonnegative SingleInteger-s and 'rem' from
> SingleInteger should work fine.

Looking at the code I wonder a bit why I used both Integer and
SingleInteger. The reason must have been that PrimitiveArray(S) uses
#: % -> NonNegativeInteger and elt: (%, Integer) -> S, but we have
hash: % -> SingleInteger. So we must anyway have a conversion between
Integer and SingleInteger, although we could for efficiency reasons let
XHashTable use an adhoc implementation of PrimitiveArray (which is not
seen outside of XHashTable) that uses SingleInteger for indexing. OK,
but that was not your question.

The positiveness is important to get a proper index into the array.
And no, the numbers involved are not necessarily only non-negative.

1) The value function from HashState claims to return a SingleInteger
(see http://fricas.github.io/api/HashState.html) and not a
"non-negative SingleInteger".

2) In XHashTable, one can create a new table via the function
table: (Key -> SingleInteger) -> %.

While (1) can easily be resolved by adding some text to the
specification of "value" saying that the returned integer will be
non-negative (because according to our current implementation it is, in
fact, the case), I am not so sure about (2). Requiring the user to
provide only hash functions that return non-negative SingleInteger, is
possible, but I fear that this might be overlooked and then leading to
indexing errors while using XHashTable.

I would rather keep a safety meassure inside XHashTable. And probably
that was the reason why I chose positiveRemainder over rem.

Thinking about that actual reason for XHashTable, it makes, of course,
sense to care about efficiency. However, I started this domain, because
there was nothing that would allow me my own hashing function. So

table: (Key -> SingleInteger) -> %

is an important function.

Discussion is open of what compromise between speed and safety is to be
taken here.

Ralf

Waldek Hebisch

unread,
Dec 19, 2024, 8:33:31 AM12/19/24
to 'Ralf Hemmecke' via FriCAS - computer algebra system
You can do what 'value' in 'HashState' is doing, that is drop
sign bit using bitwise and. Bitwise and is much cheaper than
'positiveRemainder'.

>
> Thinking about that actual reason for XHashTable, it makes, of course, sense
> to care about efficiency. However, I started this domain, because there was
> nothing that would allow me my own hashing function. So
>
> table: (Key -> SingleInteger) -> %
>
> is an important function.
>
> Discussion is open of what compromise between speed and safety is to be
> taken here.

Actually, if you want your own hash function you should also want
your own "equality" function. Namely, hash and "equality" naturally
make a pair. And for several domains hash matched with mathematical
equality is hard or impossible to define, while hash and "equality"
for representation may be reasonably easy and useful.

--
Waldek Hebisch

Ralf Hemmecke

unread,
Dec 19, 2024, 9:57:44 AM12/19/24
to fricas...@googlegroups.com
On 12/19/24 14:33, Waldek Hebisch wrote:
> You can do what 'value' in 'HashState' is doing, that is drop
> sign bit using bitwise and. Bitwise and is much cheaper than
> 'positiveRemainder'.

See attached proposal.

The only problem I have with

h1: Z := ((HASHSTATEMAKEFIXNUM(h k)$Lisp) pretend I)::Z

is that it introduces another place in the algebra code that uses a call
to Lisp. If that would be what you suggest, then I'd rather prefer to
require the user to provide a non-negative hashing function. The latter
then would not add a needless chopping of bits for the default hash
function value coming from HashState.

"rem" is, in fact, not properly specified. What tells me that the result
of 5 rem 4 is 1 and not -3?
The docstring of "rem" for Integer is

rem: (%, %) -> %
x rem y is the same as divide(x, y).remainder. See divide.

divide: (%, %) -> Record(quotient: %, remainder: %)
divide(x, y) divides x by y producing a record containing a quotient
and remainder, where the remainder is smaller (see sizeLess?) than
the divisor y.

And we have ...

%%% (9) -> for i in [5,-5] repeat for j in [4, -4] repeat print(i rem j)
1
1
- 1
- 1

And sizeLess?(+/-1, +/-4) is true for all 4 combinations of the sign. So
it is, in fact, unclear wheter rem gives a positive or negative value.

So that speaks against the usage of 'rem' (even though, of course, our
implementation always returns a positive value for positive input).
I'd like to make the specification of 'rem' more precise in the same patch.

BTW, can you say whet

convert(x : %) : Integer == x pretend Integer
qconvert(x : Integer) : % == x pretend %

are actually doing internally? Integer and SingleInteger surely have not
the same representation. Does "pretend" do any conversion in FriCAS.
My belief up to this point was that "pretend" is just like a type cast
in C, i.e. no change in the underlying bit representation.

> Actually, if you want your own hash function you should also want
> your own "equality" function. Namely, hash and "equality" naturally
> make a pair. And for several domains hash matched with mathematical
> equality is hard or impossible to define, while hash and "equality"
> for representation may be reasonably easy and useful.

Sure, but XHashTable maybe interesting for people implementing their own
datastructures. It would be their responsibility to use XHashTable
accordingly.

Ralf
0001-use-rem-instead-of-positiveRemainder.patch

Waldek Hebisch

unread,
Dec 19, 2024, 11:35:28 AM12/19/24
to 'Ralf Hemmecke' via FriCAS - computer algebra system
On Thu, Dec 19, 2024 at 03:57:41PM +0100, 'Ralf Hemmecke' via FriCAS - computer algebra system wrote:
> On 12/19/24 14:33, Waldek Hebisch wrote:
> > You can do what 'value' in 'HashState' is doing, that is drop
> > sign bit using bitwise and. Bitwise and is much cheaper than
> > 'positiveRemainder'.
>
> See attached proposal.

Looks OK.

> The only problem I have with
>
> h1: Z := ((HASHSTATEMAKEFIXNUM(h k)$Lisp) pretend I)::Z
>
> is that it introduces another place in the algebra code that uses a call to
> Lisp. If that would be what you suggest, then I'd rather prefer to require
> the user to provide a non-negative hashing function. The latter then would
> not add a needless chopping of bits for the default hash function value
> coming from HashState.

I was thinking rather about

h(k) And max()$Singleinteger

which is the same as current HASHSTATEMAKEFIXNUM but written in Spad
(AFAICS HASHSTATEMAKEFIXNUM in Lisp makes some sense, as it must
be matched with other Lisp code). But also see below.

> "rem" is, in fact, not properly specified. What tells me that the result of
> 5 rem 4 is 1 and not -3?
> The docstring of "rem" for Integer is
>
> rem: (%, %) -> %
> x rem y is the same as divide(x, y).remainder. See divide.
>
> divide: (%, %) -> Record(quotient: %, remainder: %)
> divide(x, y) divides x by y producing a record containing a quotient
> and remainder, where the remainder is smaller (see sizeLess?) than
> the divisor y.
>
> And we have ...
>
> %%% (9) -> for i in [5,-5] repeat for j in [4, -4] repeat print(i rem j)
> 1
> 1
> - 1
> - 1
>
> And sizeLess?(+/-1, +/-4) is true for all 4 combinations of the sign. So it
> is, in fact, unclear wheter rem gives a positive or negative value.

'rem' for positive arguments gives nonnegative value, for other we
get what Lisp or maybe hardware gives us.

> So that speaks against the usage of 'rem' (even though, of course, our
> implementation always returns a positive value for positive input).
> I'd like to make the specification of 'rem' more precise in the same patch.

If you mean changing docstring, you probably guessed that I have
doubts about this. Our docstrings are _not_ specification. I
normally try to give enough information in docstrings so that
user with general knowledge can "identify" given operation. That
is when thare are several variants/conventions in use I am trying to
specify which is used in FriCAS. But only hostile/insane
implementation would give you negative remaider from positive
integer arguments.

> BTW, can you say whet
>
> convert(x : %) : Integer == x pretend Integer
> qconvert(x : Integer) : % == x pretend %
>
> are actually doing internally?

Nothing. Or if you prefer you can say that they make a hole
in type system.

> Integer and SingleInteger surely have not the
> same representation.

From FriCAS point of view representations are the same. They
differ because SingleInteger has limited range. At Lisp
level, knowing that range is limited allows different representation.
Lisp calls integers in implementation-defined range "fixnums".
Lisp integer operations need to check representation, and act
depending on it. This adds extra code which normally is considerd
too large to add inline, so they are usually done by function calls.
Knowing that integers are in range of SingleInteger (that is Lisp
"fixnum") allows Lisp compiler to generate reasonably efficient
machine code.

Concerning representation at Lisp level, there are various
possibilities. Some Lisp implementations (for examples old
versions of GCL) used essentially the same representation
for all integers. Normally general representation requires
dynamic allocation of memory, Lisp-s using "the same
representation" usually had table of preallocated "small"
integers and instead of allocating new integer if possible
they returned result from the table. In Lisp arrays indices
are "fixnums", which means that "fixnums" need reasonable
range, much more than possible with preallocated table of
"small" integers. So implementations like this must dynamically
allocate some fixnums. Conseqently, arithmetic on fixnums tends
to be quite slow with such choice. More typical method uses
"immediate" integers. Namely, normal Lisp data is accessed via
pointers (machine addresses). But not all pointers are valid and
otherwise invalid pointers can be used for different purposes.
For example, in sbcl low 2 (for 32-bit machine) or 3 (for 64-bit
machine) bits of address of Lisp data are always 0. IIRC, to
simplify arithmetic sbcl reserves numbers with lowest bit 0 as
"fixnum"-s. Other Lisp data is represented by address with
artitmetically added tag. This makes "fixnum" arithmetic
faster at cost of slowing down memory acceses (which must
subtract tag before actual access). Recent GCL uses a
different scheme: there is a magic size. Anyting smaller
than this magic size is treated as address. Otherwise
magic correction is subtracted from the "address" and result of
subtraction is treated as an integer.

In short, Lisp dynamic typing allows users to treat integers
as a single set. For limited range (called "fixnum"-s) Lisp
may have tricks which make operations much faster than general
case. Users can tell Lisp than numbers are "fixnum"-s
(in our case this is done by Spad compiler). Lying to Lisp
can lead to runtime errors, wrong results or crashes.

Due to your question about HASHSTATEMAKEFIXNUM and representation
I realised that there is potential trouble with GCL: in GCL
max()$SingleInteger is not a power of 2 minus 1. That is bitwise
"and" that we are using may turn some lower order bits to 0,
decreasing quality of hash function. So we probably need special
version of HASHSTATEMAKEFIXNUM for GCL.

> Does "pretend" do any conversion in FriCAS.
> My belief up to this point was that "pretend" is just like a type cast in C,
> i.e. no change in the underlying bit representation.

Cast in C may change bit representation. "pretend" have no effect
at runtime. At compile time beside making hole in type system,
"pretend" has similar effect to "@", that is it affects overload
resolution, favouring specified type.


> From d58bb59b814fe5243135e7b0b2e7f086445332b1 Mon Sep 17 00:00:00 2001
> From: Ralf Hemmecke <ra...@hemmecke.org>
> Date: Thu, 19 Dec 2024 15:54:55 +0100
> Subject: use rem instead of positiveRemainder
>
> ---
> src/algebra/xhash.spad | 19 +++++--------------
> 1 file changed, 5 insertions(+), 14 deletions(-)
>
> diff --git a/src/algebra/xhash.spad b/src/algebra/xhash.spad
> index 6c101ea6..1ab3b690 100644
> --- a/src/algebra/xhash.spad
> +++ b/src/algebra/xhash.spad
> @@ -1,16 +1,7 @@
> )abbrev domain XHASHTBL XHashTable
> )if LiterateDoc
> \documentclass{article}
> -\usepackage{url}
> -\usepackage{color}
> -\definecolor{bgcode}{rgb}{1,1,0.7}
> -
> -\usepackage{listings}
> -\lstdefinelanguage{spad}{basicstyle=\footnotesize\ttfamily,%
> - numbers=left,firstnumber=last,backgroundcolor=\color{bgcode}}
> -\parindent0pt
> -\parskip\medskipamount
> -
> +\usepackage{literatedoc}
> \begin{document}
> \title{A hash table implementation via double hashing}
> \author{Ralf Hemmecke}
> @@ -45,7 +36,7 @@ respectively. \texttt{VACANT} and \texttt{DELETED} should not be part
> of the key space.
>
> Let's assume that the type of the marker values is \texttt{Marker}.
> -Naively, and type correctly, we would have to use as representation
> +Naively, and type correctly, we would have to use a representation
> like the following.
> \begin{verbatim}
> Union(Marker, Record(key: Key, entry: Entry))
> @@ -341,9 +332,9 @@ stored and thus the loops eventually terminate.
> mk := getMKey(a, p)
>
> n: Z := numOfBuckets a
> - h1: Z := (h k)::Z
> - p: Z := positiveRemainder(h1, n) -- position in array
> - h2: Z := 1 + positiveRemainder(h1, n-2)
> + h1: Z := ((HASHSTATEMAKEFIXNUM(h k)$Lisp) pretend I)::Z -- h1>=0
> + p: Z := h1 rem n -- position in array (p>=0)
> + h2: Z := 1 + (h1 rem (n-2)) -- h2>=0
> mk: MKey := getMKey(a, p)
> deletedPosition?: Boolean := false
> while not vacant? mk repeat
> --
> 2.43.0
>


--
Waldek Hebisch

Ralf Hemmecke

unread,
Dec 19, 2024, 5:25:18 PM12/19/24
to fricas...@googlegroups.com
Hi Waldek,

> h(k) And max()$Singleinteger

> Due to your question about HASHSTATEMAKEFIXNUM and representation I
> realised that there is potential trouble with GCL: in GCL
> max()$SingleInteger is not a power of 2 minus 1. That is bitwise
> "and" that we are using may turn some lower order bits to 0,
> decreasing quality of hash function. So we probably need special
> version of HASHSTATEMAKEFIXNUM for GCL.

With this GCL danger in mind, I tend to rather require the user to
already provide a hash function that does not yield negative
SingleInteger values. Modified patch attached.

> If you mean changing docstring, you probably guessed that I have
> doubts about this. Our docstrings are _not_ specification.

Yes. Unfortunately.

> I normally try to give enough information in docstrings so that user
> with general knowledge can "identify" given operation.

I understand that you'd like to have a simple docstring that is easy and
quick to understand. What I do not understand is that you seem to be
much against additional text that clarify any doubts a user might have.

I, for example, would not have been able to guess from the docstring of
rem how the function behaves if negative arguments are involved. And I
still don't know whether (if I now use my experiments made with SCBL) I
get the same results if I run on a different LISP. Clearly, my wish is
that Code written and documented in SPAD does behave the same no matter
on which LISP FriCAS runs.


Thanks for you explanation about "pretend".

That would show a difference between Aldor and FriCAS.
Aldor obviously employs "immediate integers". The attached
program gives output:

%>aldor -Fx -laldor int.as
%>./int
1
1

0
1

3
5
9
17

The commented line in int.as would lead to a segmentation fault.

That is certainly different from FriCAS.

%%% (1) -> (1$SingleInteger) pretend Integer

(1) 1
Type: Integer
%%% (2) -> (1$Integer) pretend SingleInteger

(2) 1
Type: SingleInteger

OK, no problem, because there is the dangerous "pretend", but one should
be aware of that when dealing with Aldor and FriCAS at the same time.

Ralf
int.as
0001-use-rem-instead-of-positiveRemainder.patch

Waldek Hebisch

unread,
Dec 19, 2024, 8:59:06 PM12/19/24
to 'Ralf Hemmecke' via FriCAS - computer algebra system
On Thu, Dec 19, 2024 at 11:25:14PM +0100, 'Ralf Hemmecke' via FriCAS - computer algebra system wrote:
> Hi Waldek,
>
> > h(k) And max()$Singleinteger
>
> > Due to your question about HASHSTATEMAKEFIXNUM and representation I
> > realised that there is potential trouble with GCL: in GCL
> > max()$SingleInteger is not a power of 2 minus 1. That is bitwise "and"
> > that we are using may turn some lower order bits to 0, decreasing
> > quality of hash function. So we probably need special version of
> > HASHSTATEMAKEFIXNUM for GCL.
>
> With this GCL danger in mind, I tend to rather require the user to
> already provide a hash function that does not yield negative
> SingleInteger values. Modified patch attached.

OK, good.

> > If you mean changing docstring, you probably guessed that I have doubts
> > about this. Our docstrings are _not_ specification.
>
> Yes. Unfortunately.
>
> > I normally try to give enough information in docstrings so that user
> > with general knowledge can "identify" given operation.
>
> I understand that you'd like to have a simple docstring that is easy and
> quick to understand. What I do not understand is that you seem to be much
> against additional text that clarify any doubts a user might have.

First, I feel that docstrings are bad place to put extra explanations.
"The same" issue typically is shared by multiple functions, attaching
explanation to a single function makes it harder to find. Attaching
it to multiple functions makes documentation bulky and harder to
maintain. Also, repeating the same thing in many places puts extra
load on users, when reading a copy they do not know if this is
a copy of thing that they already read, or there is some twist
making it different.

Second, mixing documentation with code for me makes working with
code harder: when working on code I want to see code without any
distraction. Short docstring may help with coding, as they
frequently contain crucial non-obvious information like argument
order (or more generaly role of arguments). Longer would turn
into distraction.

Third, adding to documentation piece of info that person X
tried to find, but was absent does not lead to quality
documentation. Namely, next person most likely will search
something different so we will get random collection of
facts, which will get bulky well before it attain resonable
coverage (queries are likely to follow Zips law, which
means that significant fraction of them will be unique).
Rather, there is need to select fundamental info including
things that users need to know, but do not ask (say because
they do not know that there is question to ask). We need
to explain principles allowing to infer answers which are
not given explictely. And we need to arrange docs so that
information is findable.

> I, for example, would not have been able to guess from the docstring of rem
> how the function behaves if negative arguments are involved. And I still
> don't know whether (if I now use my experiments made with SCBL) I get the
> same results if I run on a different LISP. Clearly, my wish is that Code
> written and documented in SPAD does behave the same no matter on which LISP
> FriCAS runs.

Well, we could probably add somewhere a paragraph about arithmentic.
People tend to assume that they know arithmentic, but divison with
remainder is frequently problematic. Mathematically, division
with remainder is associated with modulo arithmetic and for this
nonnegative remainder and possibly symmetric remainder are "good" ones.
Unfortunately, hardware usually provides "wrong" remainder
and IIRC that is codified by Fortran and Lisp. For me it
normally does not matter what exact rules are. What matters is
that they are wrong. More precisely, as long as we divide
nonnegative number d by positve one n, we get nonnegative remainder
r, that is 0 <= r < n. When d or n are negative, we normally
get wrong result, which needs correction.

When we move to Euclidean domain R, there is an easy well-known
theorem: if there is unique remainder, then R is isomorphic to
a ring of univariate polynomials over a field. If there are
two possibilities for remainder, than R is isomorphic to
integers. So, in general, there are more than two possibilites
for remainder and I do not know any general condtions to
choose a good one.

In a bit more general spirit, we have an ideal and want
"canonical" choice of representative for residue classes.
For multivarite polynomials over a field Groebner bases
and reduction procedure give such choice (depending on
order). Theortically Groebner bases can be extended to
much more general rings, but there are practical difficulties.

Anyway, in general there are troubles with remainder, and
unfortunatly, in case of integers, where there are no
mathematical troubles there is practical trouble caused
by hardware behaviour.

> Thanks for you explanation about "pretend".
>
> That would show a difference between Aldor and FriCAS.
> Aldor obviously employs "immediate integers". The attached
> program gives output:
>
> %>aldor -Fx -laldor int.as
> %>./int
> 1
> 1
>
> 0
> 1
>
> 3
> 5
> 9
> 17
>
> The commented line in int.as would lead to a segmentation fault.
>
> That is certainly different from FriCAS.
>
> %%% (1) -> (1$SingleInteger) pretend Integer
>
> (1) 1
> Type: Integer
> %%% (2) -> (1$Integer) pretend SingleInteger
>
> (2) 1
> Type: SingleInteger
>
> OK, no problem, because there is the dangerous "pretend", but one should
> be aware of that when dealing with Aldor and FriCAS at the same time.

That is one reason to to have 'qconvert' and using 'convert' in
the name: for Aldor it would need to do change of representation.

I am not sure why Aldor designers made their choice. One possible
rationalle is that Lisp choice plays batter with dynamic typing.
Aldor designers could assume that static typing makes convertion
less problematic (one needs to do something to change type).
Personally, I do not have string opinion. Aldor choice means
that there is some efficiency gain because small integers
directly use hardware instructions and "arbitrary" integers
do not need to check for case of small integer. OTOH, when
procedure needs to handle arbitrary integers, but mainly
deals with small ones, Lisp choice means that small integers
will be more efficient.

--
Waldek Hebisch
Reply all
Reply to author
Forward
0 new messages