Mixed-up Threads (?)

croy

unread,

Jan 12, 2013, 12:52:42 PM1/12/13

to

Every now and then, I come across a message thread where the
subject seems to change in mid thread, and replies that
follow are from some other thread.

I finally have found an instance of this, and, at the same
time, have time to report it.

In this group, on 9/2/2010, Terry Pinnell posted a message
titled: "Default Persona" (Message-ID:
<h4cv769k3gic046vd...@4ax.com>).

I see two replies that are from the next day, and on-topic
for the OP. One of those replies is from Ralph Fox. There
is one on-topic reply to Ralph's post, and then two more
replies to Ralph's post that are titled: "Creating X-Folder
header". The first of these is dated 1/11/2013 (Message-ID:
<nmm0f8lrmvir6nbe3...@4ax.com>). The second
is from Jim Higgins dated 1/12/2013 (Message-ID:
<8023f8t8oafsvenq7...@4ax.com>).

Anyone else seeing this, and/or know the scoop? I *do* have
some rather huge message stores...

--
croy

Nick Spalding

unread,

Jan 12, 2013, 1:04:40 PM1/12/13

to

croy wrote, in <ht73f81kpetjr52io...@4ax.com>
on Sat, 12 Jan 2013 09:52:42 -0800:

This is the well-known Hash Collision Bug. Agent associates messages
with a thread by means of hash number any one of which can be generated
by more than one thread. It is most likely to happen if you have a lot
of messages in a group. More bits are needed in that hash number.
--
Nick Spalding
IE8, Vista Home Premium SP2, 32 bit, Intel Viiv dual core
E6300 (1.86Ghz, 1066MHz FSB), 2GB RAM, 320GB NTFS HD,
Video Nvidia GeForce 7900GS LCD 1024x768x75Hz

Ralph Fox

unread,

Jan 12, 2013, 1:46:13 PM1/12/13

to

In addition to Nick's reply, see this 2001 message for a table
of probabilities vs the number of messages in the folder.
https://groups.google.com/group/alt.usenet.offline-reader.forte-agent/msg/1a9fde6539a0f8e8?hl=en-GB

As stated in the message, the table assumes all hash values
are equally likely. In fact message-IDs will use just over
50% of the range of hash values, which means that the same
probability will occur at approximately 71% of the number of
messages given in the table.

--
Kind regards
Ralph

croy

unread,

Jan 12, 2013, 1:53:56 PM1/12/13

to

Ah. Thanks Nick.

--
croy

croy

unread,

Jan 12, 2013, 1:55:35 PM1/12/13

to

On Sat, 12 Jan 2013 18:04:40 +0000, Nick Spalding
<spal...@iol.ie> wrote:

I'm still using v5. Are later versions any better in this
regard?

--
croy

Jaimie Vandenbergh

unread,

Jan 12, 2013, 2:02:06 PM1/12/13

to

[Default] On Sat, 12 Jan 2013 10:55:35 -0800, croy

<ha...@spam.invalid.net> wrote:

>On Sat, 12 Jan 2013 18:04:40 +0000, Nick Spalding
><spal...@iol.ie> wrote:
>
>>croy wrote, in <ht73f81kpetjr52io...@4ax.com>
>> on Sat, 12 Jan 2013 09:52:42 -0800:
>>
>>> Every now and then, I come across a message thread where the
>>> subject seems to change in mid thread, and replies that
>>> follow are from some other thread.
>>

>>This is the well-known Hash Collision Bug. Agent associates messages
>>with a thread by means of hash number any one of which can be generated
>>by more than one thread. It is most likely to happen if you have a lot
>>of messages in a group. More bits are needed in that hash number.
>
>I'm still using v5. Are later versions any better in this
>regard?

Unfortunately not, so far. It'll take some pretty major renovations of
Agent's database format to allow Forte to increase the size of the
hash and thereby reduce the likelihood of collisions to almost-zero -
at least, that's my (well-)educated guess. I don't think we've heard
either way from Forte themselves.

And now with APN delivering years of text group retention, we need it
more than before.

Cheers - Jaimie
--
Ford carried on counting quietly. This is about the most aggressive thing
you can do to a computer, the equivalent of going up to a human being and
saying "Blood... blood... blood... blood..." -- Douglas Adams

Jaimie Vandenbergh

unread,

Jan 12, 2013, 2:04:00 PM1/12/13

to

[Default] On Sun, 13 Jan 2013 07:46:13 +1300, Ralph Fox

<-rf-nz-@-.invalid> wrote:

>As stated in the message, the table assumes all hash values
>are equally likely. In fact message-IDs will use just over
>50% of the range of hash values,

What do you mean by that, Ralph? Signed int and negative aren't used?

Cheers - Jaimie
--
'The most merciful thing in the world, I think, is the inability of the human
mind to correlate all its contents' - H.P.Lovecraft, "The Call of Cthulhu"

Ralph Fox

unread,

Jan 12, 2013, 5:13:32 PM1/12/13

to

On Sat, 12 Jan 2013 19:04:00 +0000, Jaimie Vandenbergh wrote:

> [Default] On Sun, 13 Jan 2013 07:46:13 +1300, Ralph Fox
> <-rf-nz-@-.invalid> wrote:
>
> >As stated in the message, the table assumes all hash values
> >are equally likely. In fact message-IDs will use just over
> >50% of the range of hash values,
>
> What do you mean by that, Ralph? Signed int and negative aren't used?

No, it is not a contiguous range. For hash calculations like this,
the missing values are not going to be a contiguous range.

The table assumes there are 67,108,785 possible MID hash values in
the range -33554392..+33554392. In fact no more than 33,554,432
values in this range are possible MID hashes.

Agent's hash calculation for message-IDs includes the surrounding <>.
So the last character in a MID hash calculation is always going to
be '>', 0x3E.

If you look at the very last iteration of the hash calculation
h = ((h << 7) + key[j]) mod 0x01ffffd9
which takes the last character '>', 0x3E, and uses 32-bit arithmetic,
you will get this...

j.1 previous h up to 67,108,785 possible values

j.2 (h << 7) This is a 32-bit number where the least significant
7 bits are always zero. There are up to
2^(32-7) = 2^25 = 33,554,432 possible values.

j.3 (h << 7) + 0x3E The last character key[j] is always '>' 0x3E.
This is a 32-bit number where the least significant
7 bits are always 0x3E. There are up to
2^(32-7) = 2^25 = 33,554,432 possible values.

j.4 ((h << 7) + 0x3E) mod 0x01ffffd9

If there are 33,554,432 possible values for ((h << 7) + 0x3E)
then there can be no more than 33,554,432 possible values
for ((h << 7) + 0x3E) mod 0x01ffffd9

--
Kind regards
Ralph

Jaimie Vandenbergh

unread,

Jan 13, 2013, 5:41:29 AM1/13/13

to

[Default] On Sun, 13 Jan 2013 11:13:32 +1300, Ralph Fox

<-rf-nz-@-.invalid> wrote:

>On Sat, 12 Jan 2013 19:04:00 +0000, Jaimie Vandenbergh wrote:
>
>> [Default] On Sun, 13 Jan 2013 07:46:13 +1300, Ralph Fox
>> <-rf-nz-@-.invalid> wrote:
>>
>> >As stated in the message, the table assumes all hash values
>> >are equally likely. In fact message-IDs will use just over
>> >50% of the range of hash values,
>>
>> What do you mean by that, Ralph? Signed int and negative aren't used?
>
>No, it is not a contiguous range. For hash calculations like this,
>the missing values are not going to be a contiguous range.
>
>The table assumes there are 67,108,785 possible MID hash values in
>the range -33554392..+33554392. In fact no more than 33,554,432
>values in this range are possible MID hashes.
>
>Agent's hash calculation for message-IDs includes the surrounding <>.

Ah... careless but understanable. Thanks.

Cheers - Jaimie
--
"Wow! Virtual memory! Now I can have a REALLY big RAM disk!"

Message has been deleted

Ralph Fox

unread,

Jan 14, 2013, 5:55:02 AM1/14/13

to

On Sun, 13 Jan 2013 18:27:48 +0000, Marc Wilson wrote:

> In alt.usenet.offline-reader.forte-agent, (Ralph Fox) wrote in
> <vbk3f8ds46apuutsl...@4ax.com>::

>
> >
> >Agent's hash calculation for message-IDs includes the surrounding <>.
> >So the last character in a MID hash calculation is always going to
> >be '>', 0x3E.
>

> So, that's a pretty dumb design, isn't it?

No. Including the '>' was not dumb; see reply to Jamie for more details.

I would also suggest not bandying about words like "dumb".

A little learning is a dang'rous thing;
Drink deep, or taste not the Pierian spring:
There shallow draughts intoxicate the brain,
And drinking largely sobers us again.

--
Kind regards
Ralph

Ralph Fox

unread,

Jan 14, 2013, 5:55:02 AM1/14/13

to

No. Including the surrounding <> was _not_ what was careless.

There is a class of hash functions using the iteration
h = ((h << N) + key[j]) mod P
which produce very good results when these conditions are met:
(1) key[j] is in the range 0 to (2^N)-1
(2) P is an odd prime number
(3) The mod uses _unsigned_ arithmetic
(4) The shift (h << N) does not overflow - i.e. does not lose
any bits of information. If there are X values for h then
there must be X different values for (h << N).

The MID hash function used by Agent was clearly intended to meet
these specific conditions, but Fort� made one little mistake.

If the hash function had met these conditions, then including
fixed chars like '>' on the end will have no effect at all on the
chance of a hash collision. This would simply result in a 1:1
permutation of the set of hashcodes.

As is, Agent's hashcode fails to meet the conditions. As a
result the hash function has some poor characteristics. Even if
the <> were left out, the possible hash values would not all be
equally likely. If they are not equally likely then the chance
of a hash collision goes up.

My previous post was only to demonstrate that not all alleged
67,108,785 MID hash values are possible. It did not cover whether
or not the hash values are all equally likely.

--
Kind regards
Ralph

Jaimie Vandenbergh

unread,

Jan 14, 2013, 2:56:06 PM1/14/13

to

[Default] On Mon, 14 Jan 2013 23:55:02 +1300, Ralph Fox

<-rf-nz-@-.invalid> wrote:

>On Sun, 13 Jan 2013 10:41:29 +0000, Jaimie Vandenbergh wrote:
>
>> [Default] On Sun, 13 Jan 2013 11:13:32 +1300, Ralph Fox
>> <-rf-nz-@-.invalid> wrote:
>>
>> >Agent's hash calculation for message-IDs includes the surrounding <>.
>>
>> Ah... careless but understanable. Thanks.
>
>
>No. Including the surrounding <> was _not_ what was careless.

Thanks for that - I've never looked into hash functions before, hadn't
realised they were so complex and susceptible to pitfalls. I've not
worked out what the flaw is yet... but I know that the character set
available for MId's is limited, does the hashing algorithm not take
that into account?

Cheers - Jaimie
--
...there should be a feature added to the RAID 0 standard stating that
if anyone selects RAID 0 as an option, they must type in, "I know what
I am doing and that it is wrong," before they can proceed.
- Archangel Mychael, ArsTechnica comments