hash

George Mpouras

unread,

May 12, 2012, 6:41:00 PM5/12/12

to

The clever equal $a ~~ @array is a lot of slower than exists $hash{$a}
Also hashing algorythm is much slower than binary search, so wouldn't be
better if we store key,values to btree structures instead of hashes ?

Jürgen Exner

unread,

May 12, 2012, 6:58:28 PM5/12/12

to

"George Mpouras" <nospam.gravit...@hotmail.com.nospam> wrote:
[...]

>Also hashing algorythm is much slower than binary search,

???

hash access is typically O(1) while binary search is O(log n). Why do
you think hashing is slower than binary search?

>so wouldn't be
>better if we store key,values to btree structures instead of hashes ?

Is this a question about the internal implementation of Perl hashes deep
down in the guts of the abstract Perl machine?

jue

Keith Thompson

unread,

May 12, 2012, 10:07:25 PM5/12/12

to

Jürgen Exner <jurg...@hotmail.com> writes:
> "George Mpouras" <nospam.gravit...@hotmail.com.nospam> wrote:
> [...]
>>Also hashing algorythm is much slower than binary search,
>
> ???
>
> hash access is typically O(1) while binary search is O(log n). Why do
> you think hashing is slower than binary search?

Hashing itself is not really an algorithm; it's a data structure.

Some particular thing you do with hashing may well be slower than binary
search, but without knowing what what the OP is doing it's difficult to
speculate.

[...]

--
Keith Thompson (The_Other_Keith) ks...@mib.org <http://www.ghoti.net/~kst>
Will write code for food.
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"

Jürgen Exner

unread,

May 12, 2012, 11:10:57 PM5/12/12

to

Keith Thompson <ks...@mib.org> wrote:
>Jürgen Exner <jurg...@hotmail.com> writes:
>> "George Mpouras" <nospam.gravit...@hotmail.com.nospam> wrote:
>> [...]
>>>Also hashing algorythm is much slower than binary search,
>>
>> ???
>>
>> hash access is typically O(1) while binary search is O(log n). Why do
>> you think hashing is slower than binary search?
>
>Hashing itself is not really an algorithm; it's a data structure.

True

>Some particular thing you do with hashing may well be slower than binary
>search,

Absolutely. That's why I said "typically". Of course, if you got a
degenerated hash then access might be as bad as a linear search.

>but without knowing what what the OP is doing it's difficult to
>speculate.

ACK

jue

George Mpouras

unread,

May 13, 2012, 6:48:06 AM5/13/12

to

>> hash access is typically O(1) while binary search is O(log n). Why do
>> you think hashing is slower than binary search?
>
>Hashing itself is not really an algorithm; it's a data structure.

True

hash is a data structure and there is a an internal logic of how to retrieve
values. If there is a big number of keys then hash may have conflict keys;
these conflicts are stored as lists where its retrieval is slow (n).
Retrieving btree values is much faster O(log n). There is no particular
problem, I have a conversation with my colleague discussing this and maybe
there is a need for a module to do it easy.

Peter J. Holzer

unread,

May 13, 2012, 6:49:17 AM5/13/12

to

On 2012-05-13 03:10, Jürgen Exner <jurg...@hotmail.com> wrote:
> Keith Thompson <ks...@mib.org> wrote:
>>Jürgen Exner <jurg...@hotmail.com> writes:
>>> "George Mpouras" <nospam.gravit...@hotmail.com.nospam> wrote:
>>> [...]
>>>>Also hashing algorythm is much slower than binary search,
>>>
>>> ???
>>>
>>> hash access is typically O(1) while binary search is O(log n). Why do
>>> you think hashing is slower than binary search?

Big-O notation only tells you how an algorithm scales with increasing n.
It doesn't tell you how it performs for any particular n (especially not
for a small n).

>>Hashing itself is not really an algorithm; it's a data structure.
>
> True

<nitpick>
The data structure is called a "hash table". There is also the "hash
function". Both are frequently abbreviated as "hash". "Hashing" may
refer to computing a hash function or to filling a hash table. In
both cases it is an algorithm.
</nitpick>

>>Some particular thing you do with hashing may well be slower than binary
>>search,
>
> Absolutely. That's why I said "typically". Of course, if you got a
> degenerated hash then access might be as bad as a linear search.

A degenerated hash is not necessary.

A successful lookup of an element in a hash consists of the following
steps:

1) compute the hash of the lookup key (cost: c1 * length(lookup_key))
2) compute the location of element in the hash (cost: negligible)
3) compare the lookup key to the element key
(cost c2 * length(lookup_key) if successfull, almost zero if
unsuccessfull, because two strings with the same hash collision are
unlikely to have a common prefix).
4) If unsuccessfull, compute the location of the next element (this may
be a simple pointer lookup or involve the computation of a new hash)
and continue at 3.

If the hash is properly constructed, step 4 is very rare, so the access
time is

c1 * length(lookup_key) + c2 * length(lookup_key)

For a binary tree, you have to descend d levels, and do a string
comparison at every level, so the cost is

c2 * length(common_prefix(lookup_key, node1_key)) +
c2 * length(common_prefix(lookup_key, node2_key)) +
...
c2 * length(lookup_key)

(the common prefix is likely to get longer as we descend down the tree)

Note that the last term is the same, so we can eliminate it.

It follows that the hash access is faster than the binary tree[1] access
if the computation of the hash is faster than the (d-1) partial string
compares. This depends on the length of lookup_key, the distribution of
the keys in the dictionary, the hash algorithm, the depth of the tree,
how much of your data is already in the CPU cache, etc.

In short, it is impossible to say that a hash is faster than a binary
tree or vice versa. You have to benchmark particular implementations for
particular data.

There are some general observations, though:

The tree gets slower with increasing depth while the hash is quite
insensitive to the number of elements, so the hash has an advantage
for a large number of elements.

For a hash lookup you have to compute a hash of the entire key up front,
while for the tree you only have to compare the common prefix (+1 byte),
so the tree has an advantage if the keys are long (and don't have a
common prefix).

The hash is likely to be more cache friendly because you need to
look only at two contiguous memory locations while in the tree you need
to look at (d+1) memory locations.

>>but without knowing what what the OP is doing it's difficult to
>>speculate.
>
> ACK

Also ACK.

hp

[1] George also wrote "btree", which is not a binary tree. For a btree
it gets even more complicated, but btrees are normally used for disk
data structures, not in memory, although some in-memory data structures
(e.g. Judy arrays) use similar techniques to exploit caching.

--
_ | Peter J. Holzer | Deprecating human carelessness and
|_|_) | Sysadmin WSR | ignorance has no successful track record.
| | | h...@hjp.at |
__/ | http://www.hjp.at/ | -- Bill Code on as...@irtf.org

Jürgen Exner

unread,

May 13, 2012, 8:01:09 AM5/13/12

to

"George Mpouras" <nospam.gravit...@hotmail.com.nospam> wrote:
>
>
>>> hash access is typically O(1) while binary search is O(log n). Why do
>>> you think hashing is slower than binary search?
>>
>>Hashing itself is not really an algorithm; it's a data structure.
>
>True
>
>hash is a data structure and there is a an internal logic of how to retrieve
>values. If there is a big number of keys then hash may have conflict keys;

That has nothing to do with the number keys but only with how the keys
are distributed across the buckets. Degenerated hashes where all the
keys point into the same bucket can happen with any number of keys. But
the probability is very small, therefore on average hashes are
significantly faster than trees.

>these conflicts are stored as lists where its retrieval is slow (n).

If they are stored as simple lists, then yes. But there are other ways
how buckets can be organized. If Perl uses a simple list or a more
complex structure I do not know.
If Perl uses simple lists then I presume that the problem of hash key
conflicts is not very relevant.

>Retrieving btree values is much faster O(log n).

Unfortunately for trees O(log n) it is for all cases, including best
case and typical case, while hashes are O(1) for both, best as well as
typical case.

jue

Peter J. Holzer

unread,

May 13, 2012, 9:30:00 AM5/13/12

to

On 2012-05-13 12:01, Jürgen Exner <jurg...@hotmail.com> wrote:

> "George Mpouras" <nospam.gravit...@hotmail.com.nospam> wrote:
>>>> hash access is typically O(1) while binary search is O(log n). Why do
>>>> you think hashing is slower than binary search?
>>>
>>>Hashing itself is not really an algorithm; it's a data structure.
>>
>>True
>>
>>hash is a data structure and there is a an internal logic of how to retrieve
>>values. If there is a big number of keys then hash may have conflict keys;
>
> That has nothing to do with the number keys but only with how the keys
> are distributed across the buckets. Degenerated hashes where all the
> keys point into the same bucket can happen with any number of keys.

That's the worst case.

> But the probability is very small, therefore on average hashes are
> significantly faster than trees.

The average case depends on the load factor.

>>these conflicts are stored as lists where its retrieval is slow (n).
>
> If they are stored as simple lists, then yes. But there are other ways
> how buckets can be organized. If Perl uses a simple list or a more
> complex structure I do not know. If Perl uses simple lists then I
> presume that the problem of hash key conflicts is not very relevant.

Perl uses simple lists - see PerlGuts Illustrated[1]. This is not a
problem. Since perl keeps the load factor (the ratio between number of
elements and slots) below 1 and it can change the seed of the hash
function if a hash degenerates, the lists are always very short -
certainly shorter than the depth of a binary tree with the same number
of elements.

hp

[1] http://cpansearch.perl.org/src/RURBAN/illguts-0.42/index-14.html

Rainer Weikusat

unread,

May 13, 2012, 2:00:21 PM5/13/12

to

Keith Thompson <ks...@mib.org> writes:
> Jürgen Exner <jurg...@hotmail.com> writes:
>> "George Mpouras" <nospam.gravit...@hotmail.com.nospam> wrote:
>> [...]
>>>Also hashing algorythm is much slower than binary search,
>>
>> ???
>>
>> hash access is typically O(1) while binary search is O(log n). Why do
>> you think hashing is slower than binary search?
>
> Hashing itself is not really an algorithm; it's a data structure.

Hashing is an algorithm and not a data structure: Usually, it refers
to 'calculate a "hash value"' (relatively small integer) from some
(significantly) larger 'input data value'. Usually, this hash value is
then used as index into a vector of pointers to locate 'a list' on
which some kind of data item associated with this 'input data value'
(key) should either exist or needs to be put on.

Rainer Weikusat

unread,

May 13, 2012, 2:11:39 PM5/13/12

to

"George Mpouras" <nospam.gravit...@hotmail.com.nospam>
writes:

[...]

> hash is a data structure and there is a an internal logic of how to
> retrieve values. If there is a big number of keys then hash may have
> conflict keys; these conflicts are stored as lists where its retrieval
> is slow (n). Retrieving btree values is much faster O(log n).

This doesn't exactly make sense: Let's assume I'm storing 1024
key-value pairs in a hash table and what I end up with are 128 lists of
length 8. Ignoring the overhead for calculating the hash values, this
means that I can locate each of these 1024 pairs with doing at most 8
key comparisons. If the same 1024 key-value pairs where organized as
some kind of balanced binary search tree, that would be log2(1024) =
10 key comparisons. Since '128 pointers' isn't exactly a lot of data
nowadays, it should be possible to double the size of the table and
thus end up with 256 list of length 4 and so on.

NB: This is a theoretical consideration and 'in practice' things
aren't that simple. But it should be sufficient to show that 'searcing
on a linked list is slow while ...' doesn't hold water.

Ben Morrow

unread,

May 13, 2012, 3:00:39 PM5/13/12

to

Quoth Rainer Weikusat <rwei...@mssgmbh.com>:

> Keith Thompson <ks...@mib.org> writes:
> > Jürgen Exner <jurg...@hotmail.com> writes:
> >> "George Mpouras" <nospam.gravit...@hotmail.com.nospam> wrote:
> >> [...]
> >>>Also hashing algorythm is much slower than binary search,
> >>

> >> hash access is typically O(1) while binary search is O(log n). Why do
> >> you think hashing is slower than binary search?
> >
> > Hashing itself is not really an algorithm; it's a data structure.
>
> Hashing is an algorithm and not a data structure: Usually, it refers
> to 'calculate a "hash value"' (relatively small integer) from some
> (significantly) larger 'input data value'. Usually, this hash value is
> then used as index into a vector of pointers to locate 'a list' on
> which some kind of data item associated with this 'input data value'
> (key) should either exist or needs to be put on.

I don't know about 'usually'. One of the more common uses of hash
algorithms is in message digests and digital signatures and such.

I think perhaps Keith meant to say 'A hash *table* is not an algorithm,
it's a data structure', which is entirely true.

Ben

Rainer Weikusat

unread,

May 13, 2012, 4:49:38 PM5/13/12

to

Ben Morrow <b...@morrow.me.uk> writes:
> Quoth Rainer Weikusat <rwei...@mssgmbh.com>:
>> Keith Thompson <ks...@mib.org> writes:
>> > Jürgen Exner <jurg...@hotmail.com> writes:
>> >> "George Mpouras" <nospam.gravit...@hotmail.com.nospam> wrote:
>> >> [...]
>> >>>Also hashing algorythm is much slower than binary search,
>> >>
>> >> hash access is typically O(1) while binary search is O(log n). Why do
>> >> you think hashing is slower than binary search?
>> >
>> > Hashing itself is not really an algorithm; it's a data structure.
>>
>> Hashing is an algorithm and not a data structure: Usually, it refers
>> to 'calculate a "hash value"' (relatively small integer) from some
>> (significantly) larger 'input data value'. Usually, this hash value is
>> then used as index into a vector of pointers to locate 'a list' on
>> which some kind of data item associated with this 'input data value'
>> (key) should either exist or needs to be put on.
>
> I don't know about 'usually'. One of the more common uses of hash
> algorithms is in message digests and digital signatures and such.

The common use of hashing in perl is the implementation of so-called
hashes (if you think that hashes in perl are sufficiently uncommon
that referring to them as 'commmon' seems unwarranted, please feel
free to argue for another definition of 'commmonly used'). In this
case, it reduces a sequence of bytes to an unsigned 32-bit value
which is used as 'base value' for an index into a table of
lists.

So-called 'hash functions' have other uses than in lookup
tables (such as for generating message digests which can then be
'signed' by encrypting them with the private key of a public/private
key pair for public key cryptography to produce a so-called 'digital
signature or to calcluate hased message authentication codes [HMAC])
but in Perl, such uses are relatively rare, and in any case, this is
besides the point in a discussion about 'fast/slow lookups'.

> I think perhaps Keith meant to say 'A hash *table* is not an
> algorithm, it's a data structure', which is entirely true.

The term is commonly used but this is really just as sloppy as
referring to 'the usual case' as 'the usual case': A 'hash table' is
inherently in now way different from any other 'table' (for this
definition of table), just the way it is being used differs.

Martijn Lievaart

unread,

May 14, 2012, 2:47:29 AM5/14/12

to

On Sun, 13 May 2012 19:11:39 +0100, Rainer Weikusat wrote:

> "George Mpouras" <nospam.gravit...@hotmail.com.nospam>
> writes:
>
> [...]
>
>> hash is a data structure and there is a an internal logic of how to
>> retrieve values. If there is a big number of keys then hash may have
>> conflict keys; these conflicts are stored as lists where its retrieval
>> is slow (n). Retrieving btree values is much faster O(log n).
>
> This doesn't exactly make sense: Let's assume I'm storing 1024 key-value
> pairs in a hash table and what I end up with are 128 lists of length 8.
> Ignoring the overhead for calculating the hash values, this means that I
> can locate each of these 1024 pairs with doing at most 8 key
> comparisons.

If you end up with so many lists of that length, something is seriously
wrong with your hash table. It's either to small, or you have a very bad
hashing function.

> If the same 1024 key-value pairs where organized as some
> kind of balanced binary search tree, that would be log2(1024) = 10 key
> comparisons. Since '128 pointers' isn't exactly a lot of data nowadays,
> it should be possible to double the size of the table and thus end up
> with 256 list of length 4 and so on.

George was talking about a btree, not a binary tree. Those are fairly
different beasts.

HTH,
M4

Peter J. Holzer

unread,

May 14, 2012, 4:00:55 AM5/14/12

to

On 2012-05-14 06:47, Martijn Lievaart <m...@rtij.nl.invlalid> wrote:
> On Sun, 13 May 2012 19:11:39 +0100, Rainer Weikusat wrote:
>> "George Mpouras" <nospam.gravit...@hotmail.com.nospam>
>> writes:
>>> hash is a data structure and there is a an internal logic of how to
>>> retrieve values. If there is a big number of keys then hash may have
>>> conflict keys; these conflicts are stored as lists where its retrieval
>>> is slow (n). Retrieving btree values is much faster O(log n).
>>
>> This doesn't exactly make sense: Let's assume I'm storing 1024 key-value
>> pairs in a hash table and what I end up with are 128 lists of length 8.
>> Ignoring the overhead for calculating the hash values, this means that I
>> can locate each of these 1024 pairs with doing at most 8 key
>> comparisons.
>
> If you end up with so many lists of that length, something is seriously
> wrong with your hash table. It's either to small, or you have a very bad
> hashing function.

The perl hash implementation always keeps the load factor <= 1, so a
hash table with 1024 elements has at least 1024 slots. Of course it's
possible that 896 of these slots are empty and 128 slots have 8 elements
each, but as you say that would mean a very bad hash function or a
deliberate attack. To guard against the latter possibility, perl >= 5.8
(IIRC) monitors the chain length and changes the seed of the hash
function if a chain grows too long.

But I think Rainer was trying to come up with a worst-case scenario
where a hash is still faster (requires fewer comparisons) than a binary
tree, as the next paragraph shows:

>> If the same 1024 key-value pairs where organized as some
>> kind of balanced binary search tree, that would be log2(1024) = 10 key
>> comparisons. Since '128 pointers' isn't exactly a lot of data nowadays,
>> it should be possible to double the size of the table and thus end up
>> with 256 list of length 4 and so on.
>
> George was talking about a btree, not a binary tree. Those are fairly
> different beasts.

Yes, I noted that too in an earlier posting, but btrees are normally
used for on-disk data structures and they don't implement a binary
search, so it seems likely that George really meant binary trees, not
btrees.

hp

Martijn Lievaart

unread,

May 14, 2012, 4:45:15 AM5/14/12

to

On Mon, 14 May 2012 10:00:55 +0200, Peter J. Holzer wrote:

>> George was talking about a btree, not a binary tree. Those are fairly
>> different beasts.
>
> Yes, I noted that too in an earlier posting, but btrees are normally
> used for on-disk data structures and they don't implement a binary

Btrees are used all over the place, not just on-disk. OTOH, in memory you
are much more likely to actually use a red-black tree or similar, but the
interface and performance characteristics are so similar to a btree that
they are often called btrees as well.

> search, so it seems likely that George really meant binary trees, not
> btrees.

Possible, even probable.

M4

Rainer Weikusat

unread,

May 14, 2012, 8:45:54 AM5/14/12

to

Martijn Lievaart <m...@rtij.nl.invlalid> writes:
> On Sun, 13 May 2012 19:11:39 +0100, Rainer Weikusat wrote:
>> "George Mpouras" <nospam.gravit...@hotmail.com.nospam>
>> writes:
>>
>> [...]
>>
>>> hash is a data structure and there is a an internal logic of how to
>>> retrieve values. If there is a big number of keys then hash may have
>>> conflict keys; these conflicts are stored as lists where its retrieval
>>> is slow (n). Retrieving btree values is much faster O(log n).
>>
>> This doesn't exactly make sense: Let's assume I'm storing 1024 key-value
>> pairs in a hash table and what I end up with are 128 lists of length 8.
>> Ignoring the overhead for calculating the hash values, this means that I
>> can locate each of these 1024 pairs with doing at most 8 key
>> comparisons.
>
> If you end up with so many lists of that length, something is seriously
> wrong with your hash table. It's either to small, or you have a very bad
> hashing function.

This was an example supposed to illustrate that 'searching in linked
list is slow while ...' doesn't make sense.

Apart from that: If I have a fixed size table (128 slots in this
example) and my key/value pairs are distributed evenly onto these 128
slots (yielding 128 lists of length 8), my hash function quite
obviously did the best job it is theoretically capable of doing.

>> If the same 1024 key-value pairs where organized as some
>> kind of balanced binary search tree, that would be log2(1024) = 10 key
>> comparisons. Since '128 pointers' isn't exactly a lot of data nowadays,
>> it should be possible to double the size of the table and thus end up
>> with 256 list of length 4 and so on.
>
> George was talking about a btree, not a binary tree. Those are fairly
> different beasts.

And I was writing about a data structure with the performance
characteristics he mentioned. Even if he was really referring to 'a
B-tree' and not just using btree as abbreviations for 'binary tree'
(which I very strongly suspect), that doesn't matter for this example.

Rainer Weikusat

unread,

May 14, 2012, 8:49:48 AM5/14/12

to

Martijn Lievaart <m...@rtij.nl.invlalid> writes:
> On Mon, 14 May 2012 10:00:55 +0200, Peter J. Holzer wrote:
>>> George was talking about a btree, not a binary tree. Those are fairly
>>> different beasts.
>>
>> Yes, I noted that too in an earlier posting, but btrees are normally
>> used for on-disk data structures and they don't implement a binary
>
> Btrees are used all over the place, not just on-disk. OTOH, in memory you
> are much more likely to actually use a red-black tree or similar, but the
> interface and performance characteristics are so similar to a btree that
> they are often called btrees as well.

A 'red-black tree' is a balanced, binary search tree with the
'red-black' referring to a specific balancing algorithm.

Martijn Lievaart

unread,

May 14, 2012, 4:45:45 PM5/14/12

to

I actually ment skip lists (brain fart), but the statement is still not
incorrect.

M4

Rainer Weikusat

unread,

May 14, 2012, 5:49:11 PM5/14/12

to

Since a B-tree is a more general tree structure than a binary search
tree, every binary search tree is also a B-tree, just not vice
versa. While that's not consistent with a statement you made in other
subthread, it is surely 'not incorrect'. But this "data structures I
heard of!"-bingo is IMO pretty pointless.

Martijn Lievaart

unread,

May 15, 2012, 2:42:13 AM5/15/12

to

Well, that was not my intention. Mind responding to the point?

M4

Rainer Weikusat

unread,

May 15, 2012, 7:11:40 AM5/15/12

to

Which point?

Martijn Lievaart

unread,

May 15, 2012, 6:06:01 PM5/15/12

to

On Tue, 15 May 2012 12:11:40 +0100, Rainer Weikusat wrote:

> Which point?

Never mind. I don't feel like explaining everything.

M4

Rainer Weikusat

unread,

May 16, 2012, 5:58:32 AM5/16/12

to

And I 'feel' that you neither wrote anything even remotely related to
the original lookup question nor something which could be considered
'a point' at all, just a bunch of rambling statements about various
data structures. This may be wrong but if you can't be bothered with
expressing yourself clearly in face of a misunderstanding, whatever
you meant to express doesn't matter.