hash function over IP address

Mark

unread,

Apr 4, 2012, 4:37:50 PM4/4/12

to

Hello

I'm trying to find a way create a hash value over multiple IP addresses (I
keep a list of IP addresses and its number is variable. The simplest method
I've found so far is (obviously not the most effective and collision free):

unsigned long hash_ipaddr(struct in_addr *addr)
{
unsigned long res;

if (addr == NULL)
return 0UL;

res = (((addr->s_addr >> 24) & 0xff) * 256) + \
(((addr->s_addr >> 16) & 0xff) * 256) + \
(((addr->s_addr >> 8) & 0xff)* 256) + \
(addr->s_addr & 0xff);

return res;
}

And then sum up hashes for every IP and store it.

Can somebody suggest some better approach for my task?
Thanks.

Mark

Barry Margolin

unread,

Apr 4, 2012, 4:49:05 PM4/4/12

to

In article <jliber$ocn$1...@speranza.aioe.org>,

How about s_addr % modulus, where modulus is a large (> 2^16) prime
number?

--
Barry Margolin, bar...@alum.mit.edu
Arlington, MA
*** PLEASE post questions in newsgroups, not directly to me ***

Message has been deleted

Mark

unread,

Apr 4, 2012, 5:21:20 PM4/4/12

to

"Barry Margolin" <bar...@alum.mit.edu> wrote in message
news:barmar-B6E941....@news.eternal-september.org...
[skip]

>> And then sum up hashes for every IP and store it.
>>
>> Can somebody suggest some better approach for my task?
>> Thanks.
>
> How about s_addr % modulus, where modulus is a large (> 2^16) prime
> number?
>

Thanks for suggestion. Is it common approach to sum up the produced hashes?

Mark

Ben Pfaff

unread,

Apr 4, 2012, 5:40:32 PM4/4/12

to

"Mark" <mark_cruz...@hotmail.com> writes:

> I'm trying to find a way create a hash value over multiple IP addresses (I
> keep a list of IP addresses and its number is variable. The simplest method
> I've found so far is (obviously not the most effective and collision free):

Why not use one of many available high-quality hash functions?
Three that come to mind without even going to a search engine:
Jenkins lookup3, murmurhash, FNV.
--
Ben Pfaff
http://benpfaff.org

Daniel Pitts

unread,

Apr 4, 2012, 5:47:30 PM4/4/12

to

If order matters, what I've seen is basically the following:

for each item in list:
hash = hash * someSmallPrime + hashOf(item)

Rainer Weikusat

unread,

Apr 4, 2012, 5:52:46 PM4/4/12

to

The simple answer is these hash functions are designed to be used for
large, variable length string keys NOT for small, fixed-size
quantities which can be compared with a single machine instruction.

http://burtleburtle.net/bob/hash/integer.html
http://www.cris.com/~Ttwang/tech/inthash.htm

Ben Pfaff

unread,

Apr 4, 2012, 6:35:47 PM4/4/12

to

He has a list of multiple IP addresses. A hash function for
variable length keys may well be appropriate.
--
Peter Seebach on managing engineers:
"It's like herding cats, only most of the engineers are already
sick of laser pointers."

Rick Jones

unread,

Apr 4, 2012, 7:18:27 PM4/4/12

to

In comp.unix.programmer Mark <mark_cruz...@hotmail.com> wrote:
> I'm trying to find a way create a hash value over multiple IP
> addresses (I keep a list of IP addresses and its number is
> variable. The simplest method I've found so far is (obviously not
> the most effective and collision free):

> unsigned long hash_ipaddr(struct in_addr *addr)

No IPv6?

rick jones
--
Process shall set you free from the need for rational thought.
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...

William Ahern

unread,

Apr 4, 2012, 7:57:37 PM4/4/12

to

Mark <mark_cruz...@hotmail.com> wrote:
> Hello

> I'm trying to find a way create a hash value over multiple IP addresses (I
> keep a list of IP addresses and its number is variable. The simplest
> method I've found so far is (obviously not the most effective and
> collision free):

http://cmph.sourceforge.net/

Also, because IP addresses are fixed size, perhaps the most practical
algorithm to use is binary search. Far easier to implement than perfect
hashing, with no storage overhead--a simple linear array (or two, for v4 and
v6) of addresses.

Eric Sosman

unread,

Apr 4, 2012, 9:05:56 PM4/4/12

to

On 4/4/2012 4:37 PM, Mark wrote:
> Hello
>
> I'm trying to find a way create a hash value over multiple IP addresses (I
> keep a list of IP addresses and its number is variable. The simplest method
> I've found so far is (obviously not the most effective and collision free):
>
> unsigned long hash_ipaddr(struct in_addr *addr)
> {
> unsigned long res;
>
> if (addr == NULL)
> return 0UL;
>

> res = (((addr->s_addr>> 24)& 0xff) * 256) + \
> (((addr->s_addr>> 16)& 0xff) * 256) + \
> (((addr->s_addr>> 8)& 0xff)* 256) + \
> (addr->s_addr& 0xff);

(Idle question: Why the backslashes?)

> return res;

Briefly, this calculation takes four chunks of the address
as numbers between 0..255 and adds them to produce a sum 0..1020.
So: You start with 32 bits of information and squash them down to
a smidgen less than 10 bits -- with a moderate central tendencey,
too. Ponder, young one: Wise this is?

> }
>
> And then sum up hashes for every IP and store it.

Not sure what you mean by this. A given IP produces one and
only one hash value with this scheme (or any other sane scheme),
so what do you mean by "sum up the hashes?"

> Can somebody suggest some better approach for my task?

You haven't explained what your task is, so I'll have to guess.
If the objective is to use a hash table keyed on IP addresses, there
are two likely possibilities:

- The number of buckets in the table is a prime number (or at
the very least an odd number): Just take `*addr % N', and the
outcome is very likely to be better, especially if N > 1020.
The remainder after division by a prime is likely to break up
simple patterns.

- The number of buckets is a power of two. Now `*addr % N' would
just throw away the high-order bits -- which might not be Bad,
but could easily be Less Than Good if there are patterns in the
input keys. One possibility is to use `f(*addr) % N', where f()
is a suitable "scrambling" function: For example, implement a
two-line linear congruential pseudo-random number generator, seed
it with `*addr', and take its output as f().

For further ideas, see TAOCP volume III.

--
Eric Sosman
eso...@ieee-dot-org.invalid

Barry Margolin

unread,

Apr 4, 2012, 10:52:33 PM4/4/12

to

In article <jlir5m$dd8$1...@dont-email.me>,

Eric Sosman <eso...@ieee-dot-org.invalid> wrote:

> On 4/4/2012 4:37 PM, Mark wrote:
> > Hello
> >
> > I'm trying to find a way create a hash value over multiple IP addresses (I
> > keep a list of IP addresses and its number is variable. The simplest method
> > I've found so far is (obviously not the most effective and collision free):
> >
> > unsigned long hash_ipaddr(struct in_addr *addr)
> > {
> > unsigned long res;
> >
> > if (addr == NULL)
> > return 0UL;
> >
> > res = (((addr->s_addr>> 24)& 0xff) * 256) + \
> > (((addr->s_addr>> 16)& 0xff) * 256) + \
> > (((addr->s_addr>> 8)& 0xff)* 256) + \
> > (addr->s_addr& 0xff);
>
> (Idle question: Why the backslashes?)
>
> > return res;
>
> Briefly, this calculation takes four chunks of the address
> as numbers between 0..255 and adds them to produce a sum 0..1020.
> So: You start with 32 bits of information and squash them down to
> a smidgen less than 10 bits -- with a moderate central tendencey,
> too. Ponder, young one: Wise this is?

That's not what I see. It looks to me like he's adding the three
high-order bytes together, shifting that 8 bits to the left, and then
adding in the low-order byte. So he's shrinking a 32-bit value down to
an 18-bit value. Another way to write his expression is:

(((addr->s_addr >> 16) && 0xff00) +
((addr->s_addr >> 8) && 0xff00) +
(addr->s_addr && 0xffff))

Hash functions are expected to squash a large value into a much smaller
one -- that's the whole point, isn't it? Although it's hard understand
why one would bother when the hash is more than half the size of the
original.

>
> > }
> >
> > And then sum up hashes for every IP and store it.
>
> Not sure what you mean by this. A given IP produces one and
> only one hash value with this scheme (or any other sane scheme),
> so what do you mean by "sum up the hashes?"

I think what he REALLY wants is a hash of a list of IPs. So he's
hashing each IP, then combining all the hashes together by adding them.

A better solution might be to concatenate all the IPs, then use a
standard hash function on that. If you don't want the result to depend
on the order of the IPs, sort the list before concatenating them.

BGB

unread,

Apr 4, 2012, 11:21:25 PM4/4/12

to

On 4/4/2012 1:37 PM, Mark wrote:
> Hello
>
> I'm trying to find a way create a hash value over multiple IP addresses (I
> keep a list of IP addresses and its number is variable. The simplest method
> I've found so far is (obviously not the most effective and collision free):
>
> unsigned long hash_ipaddr(struct in_addr *addr)
> {
> unsigned long res;
>
> if (addr == NULL)
> return 0UL;
>

> res = (((addr->s_addr>> 24)& 0xff) * 256) + \
> (((addr->s_addr>> 16)& 0xff) * 256) + \
> (((addr->s_addr>> 8)& 0xff)* 256) + \

> (addr->s_addr& 0xff);

>
> return res;
> }
>
> And then sum up hashes for every IP and store it.
>
> Can somebody suggest some better approach for my task?
> Thanks.
>

maybe:
((addr->s_addr*65521)>>16);

then the hashed IP address is reduced to around 16 bits, and maybe these
can be added and then masked off to the desired number of bits?...

this maybe isn't very good either, oh well.

Udit Gangwani

unread,

Apr 5, 2012, 3:26:19 AM4/5/12

to

Hashing is preferable if you have large number of keys and you know the number of keys before hand. In that case hashing gives you the best results. Moreover, Hashing is more common data structure for the purpose of indexing non-integral keys.

> I'm trying to find a way create a hash value over multiple IP addresses (I
> keep a list of IP addresses and its number is variable. The simplest method
> I've found so far is (obviously not the most effective and collision free):

In your case the number of IPAddresses are variable and your keys are integral.
So you can use these points for your benefit to use a more better approach.

1st Approach :
If the number of IPAddress will not exceed a very large value( < 100000), then you can use AVL Trees/Red Black Trees. If you are using C++, you can use std::map<> data structure directly for this purpose.
These will give you in worst case the order of O(15-17) for Insertion, Deletion and Searching. Plus it will save you a lot of time in creating your own hash data structure and making sure it is giving you best performance.

2nd Approach :
If you know the range of you IP addresses, lets say between 192.168.0.0 to 192.168.10.256. Then you can directly allocate an array of size of the number of IP Addresses in this range. And you can perform the constant searching by subtracting the minimum IP Address from your queried IP Address to get the index into the array.

But if the above two approaches doesnot suit your needs then you can always use some libraries which provide hash datastructures already implemented in them.

For c++ check out this link :
http://attractivechaos.wordpress.com/2008/08/28/comparison-of-hash-table-libraries/

Rainer Weikusat

unread,

Apr 5, 2012, 7:08:58 AM4/5/12

to

b...@cs.stanford.edu (Ben Pfaff) writes:
> Rainer Weikusat <rwei...@mssgmbh.com> writes:
>
>> Ben Pfaff <b...@cs.stanford.edu> writes:
>>> "Mark" <mark_cruz...@hotmail.com> writes:
>>>
>>>> I'm trying to find a way create a hash value over multiple IP addresses (I
>>>> keep a list of IP addresses and its number is variable. The simplest method
>>>> I've found so far is (obviously not the most effective and collision free):
>>>
>>> Why not use one of many available high-quality hash functions?
>>> Three that come to mind without even going to a search engine:
>>> Jenkins lookup3, murmurhash, FNV.
>>
>> The simple answer is these hash functions are designed to be used for
>> large, variable length string keys NOT for small, fixed-size
>> quantities which can be compared with a single machine instruction.
>>
>> http://burtleburtle.net/bob/hash/integer.html
>> http://www.cris.com/~Ttwang/tech/inthash.htm
>
> He has a list of multiple IP addresses. A hash function for
> variable length keys may well be appropriate.

Calculating the hash value must be cheaper than the full comparisons
which will be needed for a linear search. And performing extensive
byte-shuffling on machine words is not a good strategy to achieve
that.

Rainer Weikusat

unread,

Apr 5, 2012, 7:15:40 AM4/5/12

to

"Mark" <mark_cruz...@hotmail.com> writes:
> unsigned long hash_ipaddr(struct in_addr *addr)
> {
> unsigned long res;
>
> if (addr == NULL)
> return 0UL;
>
> res = (((addr->s_addr >> 24) & 0xff) * 256) + \
> (((addr->s_addr >> 16) & 0xff) * 256) + \
> (((addr->s_addr >> 8) & 0xff)* 256) + \
> (addr->s_addr & 0xff);
>
> return res;
> }

Unrelated question (Barry Margolin also noticed that but I'd like to
try expressing it somewhat clearer): This function adds the three
highest octets together in the high octet of a 16-bit integer and
then adds the lowest octet of the original address to the lowest octet
of this 16-bit integer whose value was zero before this addition. This
seems rather odd. Is this intentional?

Barry Margolin

unread,

Apr 5, 2012, 7:52:09 AM4/5/12

to

In article <87fwcio...@sapphire.mobileactivedefense.com>,

He apparently wants each address in a /24 to generate a different hash.
This seems reasonable if he expects his application to have to deal with
many clients from the same subnet, and avoid clashes among them.

Mark

unread,

Apr 5, 2012, 8:42:00 AM4/5/12

to

"Eric Sosman" <eso...@ieee-dot-org.invalid> wrote in message
news:jlir5m$dd8$1...@dont-email.me...

[skip]

>> Can somebody suggest some better approach for my task?
>
> You haven't explained what your task is, so I'll have to guess.
> If the objective is to use a hash table keyed on IP addresses, there
> are two likely possibilities:

First of all, thanks for comments. What I'm trying to solve is to find a way
to search in AVL tree used in IP routing application. Initially a tree's
node looked simple:

struct nhlfe_entry
{
u_int32_t nhlfe_ix;
u_char opcode;
...
struct nhlfe_key key; /* nexthop key */
}

/* defines a search key. */
struct nhlfe_key
{
struct in_addr nh_addr;
u_int32_t oif_ix;
u_int32_t out_label;
}

So the search was based on 'struct nhlfe_key'. Now, what I'm trying to add
support for multiple next hops, that is I add a linked list in nhlfe_entry:
struct nhlfe_entry{ u_int32_t nhlfe_ix; u_char opcode; ... struct list
*nhkey_list;}where 'struct list' is struct listnode that embeds 'void *data'
pointer to caller'sprivate data, and this is 'struct nhlfe_key'.So I'm
trying to find the way to generate key based on multiple elements in the
list to store/search nodes in the tree. One of options I was considering
hash function, but it now looks unfeasible to me; the other is to always
have a key based on the very first element of the list (but whenever list
changes, re-insert node in the tree); and the third is to have two different
containers - one is AVL tree where key is IP address and the second is list
or array of 'nhlfe_entry', and each tree node should have some reference to
elements from the 2nd container.. So not sure which one to proceed with.

Mark

Eric Sosman

unread,

Apr 5, 2012, 8:46:17 AM4/5/12

to

On 4/4/2012 10:52 PM, Barry Margolin wrote:
> In article<jlir5m$dd8$1...@dont-email.me>,
> Eric Sosman<eso...@ieee-dot-org.invalid> wrote:
>
>> On 4/4/2012 4:37 PM, Mark wrote:
>>> Hello
>>>
>>> I'm trying to find a way create a hash value over multiple IP addresses (I
>>> keep a list of IP addresses and its number is variable. The simplest method
>>> I've found so far is (obviously not the most effective and collision free):
>>>
>>> unsigned long hash_ipaddr(struct in_addr *addr)
>>> {
>>> unsigned long res;
>>>
>>> if (addr == NULL)
>>> return 0UL;
>>>
>>> res = (((addr->s_addr>> 24)& 0xff) * 256) + \
>>> (((addr->s_addr>> 16)& 0xff) * 256) + \
>>> (((addr->s_addr>> 8)& 0xff)* 256) + \
>>> (addr->s_addr& 0xff);
>>
>> (Idle question: Why the backslashes?)
>>
>>> return res;
>>
>> Briefly, this calculation takes four chunks of the address
>> as numbers between 0..255 and adds them to produce a sum 0..1020.
>> So: You start with 32 bits of information and squash them down to
>> a smidgen less than 10 bits -- with a moderate central tendencey,
>> too. Ponder, young one: Wise this is?
>
> That's not what I see.

Nor I, now that my eyes are open. Thanks for the catch. (Sigh.)

> It looks to me like he's adding the three
> high-order bytes together, shifting that 8 bits to the left, and then
> adding in the low-order byte. So he's shrinking a 32-bit value down to
> an 18-bit value. Another way to write his expression is:
>
> (((addr->s_addr>> 16)&& 0xff00) +
> ((addr->s_addr>> 8)&& 0xff00) +

> (addr->s_addr&& 0xffff))

>
> Hash functions are expected to squash a large value into a much smaller
> one -- that's the whole point, isn't it? Although it's hard understand
> why one would bother when the hash is more than half the size of the
> original.

Yeah. He's still got the central-tendency problem with three
of the bytes, too. Even if he's got a 2**18-bucket table and will
use this hash directly as the bucket index, I doubt he'll get a
good distribution: I'd expect more-than-normal collisions in the
buckets near 97920, with more-than-normal empty buckets in the "tails."

>>> And then sum up hashes for every IP and store it.
>>
>> Not sure what you mean by this. A given IP produces one and
>> only one hash value with this scheme (or any other sane scheme),
>> so what do you mean by "sum up the hashes?"
>
> I think what he REALLY wants is a hash of a list of IPs. So he's
> hashing each IP, then combining all the hashes together by adding them.
>
> A better solution might be to concatenate all the IPs, then use a
> standard hash function on that. If you don't want the result to depend
> on the order of the IPs, sort the list before concatenating them.

Or hash each individually and combine them with a commutative
operation like + or ^. That's not usually a great idea because it
gives (A,B) and (B,A) the same hash -- but if he *wants* them to
have the same hash, well, ...

--
Eric Sosman
eso...@ieee-dot-org.invalid

Rainer Weikusat

unread,

Apr 5, 2012, 10:04:12 AM4/5/12

to

"Mark" <mark_cruz...@hotmail.com> writes:
> First of all, thanks for comments. What I'm trying to solve is to find a way
> to search in AVL tree used in IP routing application. Initially a tree's
> node looked simple:
>
> struct nhlfe_entry
> {
> u_int32_t nhlfe_ix;
> u_char opcode;
> ...
> struct nhlfe_key key; /* nexthop key */
> }
>
> /* defines a search key. */
> struct nhlfe_key
> {
> struct in_addr nh_addr;
> u_int32_t oif_ix;
> u_int32_t out_label;
> }
>
> So the search was based on 'struct nhlfe_key'. Now, what I'm trying to add
> support for multiple next hops, that is I add a linked list in nhlfe_entry:
> struct nhlfe_entry{ u_int32_t nhlfe_ix; u_char opcode; ... struct list
> *nhkey_list;}where 'struct list' is struct listnode that embeds 'void *data'
> pointer to caller'sprivate data, and this is 'struct nhlfe_key'.So I'm
> trying to find the way to generate key based on multiple elements in the
> list to store/search nodes in the tree.

The simplest solution would be to concatenate all 'next hops' in
order to form a key. A better idea would be to build a so-called radix
search trie based on keys like this. The best online description of
that I found quickly is:

http://www.cs.hut.fi/Opinnot/T-106.253/k2002/english/extra/digital_search_tree.html

At least the ancient version of this book

http://www.informit.com/store/product.aspx?isbn=0201314525

I happen to own contains a good description. In contrast to this, the
Wikipedia article completely useless, except as an excellent example
how language can be used to obscure phenomenons instead of explaining
them.

Ben Pfaff

unread,

Apr 5, 2012, 10:42:50 AM4/5/12

to

You suggested http://burtleburtle.net/bob/hash/integer.html.
Take a look at the first suggested hash function for an integer
there. GCC 4.4 with -O3 compiles this into 12 instructions. And
then you have to figure out a way to combine multiple hashes for
subsequent integers.

Now look at, say, http://burtleburtle.net/bob/c/lookup3.c. It
takes 12*3 instructions (according to the comments) to hash three
32-bit integers, which, per-integer, is the same rate, and you
don't have to invent a way to combine them.
--
"While the Melissa license is a bit unclear, Melissa aggressively
encourages free distribution of its source code."
--Kevin Dalley <ke...@seti.org>

Mark

unread,

Apr 5, 2012, 10:43:25 AM4/5/12

to

"Rainer Weikusat" <rwei...@mssgmbh.com> wrote in message
news:87bon65...@sapphire.mobileactivedefense.com...

>> So the search was based on 'struct nhlfe_key'. Now, what I'm trying to
>> add
>> support for multiple next hops, that is I add a linked list in
>> nhlfe_entry:
>> struct nhlfe_entry{ u_int32_t nhlfe_ix; u_char opcode; ... struct
>> list
>> *nhkey_list;}where 'struct list' is struct listnode that embeds 'void
>> *data'
>> pointer to caller'sprivate data, and this is 'struct nhlfe_key'.So I'm
>> trying to find the way to generate key based on multiple elements in the
>> list to store/search nodes in the tree.
>
> The simplest solution would be to concatenate all 'next hops' in
> order to form a key.

Number of next hops is variable, thus anyway it would require to delete the
node and re-insert it back with a new key generated, right? (that's not an
issue though). How would you suggest to concatenate 32-bit long unsigned
integers, in what form -- convert to strings, concatenate and use such
string as a key?

> A better idea would be to build a so-called radix
> search trie based on keys like this.

Unfortunately this isn't feasible at the moment, as it will require lots of
chnages all over the code and likely break the existing functionality.

Mark

Rainer Weikusat

unread,

Apr 5, 2012, 11:29:57 AM4/5/12

to

b...@cs.stanford.edu (Ben Pfaff) writes:
> Rainer Weikusat <rwei...@mssgmbh.com> writes:

>> b...@cs.stanford.edu (Ben Pfaff) writes:

[...]

>>>>> Why not use one of many available high-quality hash functions?
>>>>> Three that come to mind without even going to a search engine:
>>>>> Jenkins lookup3, murmurhash, FNV.
>>>>
>>>> The simple answer is these hash functions are designed to be used for
>>>> large, variable length string keys NOT for small, fixed-size
>>>> quantities which can be compared with a single machine instruction.
>>>>
>>>> http://burtleburtle.net/bob/hash/integer.html
>>>> http://www.cris.com/~Ttwang/tech/inthash.htm
>>>
>>> He has a list of multiple IP addresses. A hash function for
>>> variable length keys may well be appropriate.
>>
>> Calculating the hash value must be cheaper than the full comparisons
>> which will be needed for a linear search. And performing extensive
>> byte-shuffling on machine words is not a good strategy to achieve
>> that.
>
> You suggested http://burtleburtle.net/bob/hash/integer.html.

That's a web page with some background explanations by an 'authority' you
were (indirectly) refering to and some examples of 'integer hash
functions' supposed to produce hash values with qualities Bob Jenkins
considered to be desirable. I wouldn't want to use them to hash IP
addresses, much as I also wouldn't want to use one of the 'monster
hashes' for string lookup.

> Now look at, say, http://burtleburtle.net/bob/c/lookup3.c. It
> takes 12*3 instructions (according to the comments) to hash three
> 32-bit integers, which, per-integer, is the same rate, and you
> don't have to invent a way to combine them.

According to the table in

http://burtleburtle.net/bob/hash/doobs.html

hashing 3 32-bit integers _byte by byte_ with lookup three is assumed
to take 80 instructions.

Daniel Pitts

unread,

Apr 5, 2012, 1:05:21 PM4/5/12

to

Really? How do you hash a string then? The point of a hash function is
often (not always) used to determine a bucket to look into. It needn't
be faster than a comparison, and actually is often slower.
Cryptographically strong hashes are significantly slower than that
usually.

Often performance of the program is not dominated by the speed of the
hash-function, but the number of collisions created by it. Having a
hash-function that is 10 times slower than a comparison, but creates an
even spread will generally outperform the "fastest" hash. If you want a
fast hash, "return 0;"

> And performing extensive
> byte-shuffling on machine words is not a good strategy to achieve
> that.

True, but what else are you going to do?

The original posters problem, as I understand it, is that he has several
IP addresses that need to be hashed together. I would assume that means
he has a table that is a map from a collection of IP addresses to some
object.

First, an IPV4 address is basically 32 bits. The entropy is low on the
high-order bits and higher on the low order, which is fine for typical
hash use-cases.

So, he has basically an integer bag (The OP hasn't specified whether the
order of IPs matter, nor whether they may have duplicates)

If order doesn't matter, then the easiest correct approach is simply to
sum up the integer value of each IP (not each octant, but the whole IP).

If order does matter, then you might want Σ(IP[i] * smallPrime^i).
(where ^ is the exponent operator, not xor). This can be implemented
as: int hash=0; foreach(IP in theList) hash = hash * smallPrime + IP;

Duplicates will be handled correctly for both cases, but the hashes will
only be equal if the same duplicates exist.

Either way, the actual speed of the hash function only matters if it
because a bottleneck to the performance of the program. The requires
real-world measurement and profiling, and not a bunch of supposition
from wiseguys on a newsgroup :-).

Rainer Weikusat

unread,

Apr 5, 2012, 1:12:16 PM4/5/12

to

Daniel Pitts <newsgrou...@virtualinfinity.net> writes:
> On 4/5/12 4:08 AM, Rainer Weikusat wrote:
>> b...@cs.stanford.edu (Ben Pfaff) writes:
>>> Rainer Weikusat<rwei...@mssgmbh.com> writes:
>>>
>>>> Ben Pfaff<b...@cs.stanford.edu> writes:
>>>>> "Mark"<mark_cruz...@hotmail.com> writes:
>>>>>
>>>>>> I'm trying to find a way create a hash value over multiple IP addresses (I
>>>>>> keep a list of IP addresses and its number is variable. The simplest method
>>>>>> I've found so far is (obviously not the most effective and collision free):
>>>>>
>>>>> Why not use one of many available high-quality hash functions?
>>>>> Three that come to mind without even going to a search engine:
>>>>> Jenkins lookup3, murmurhash, FNV.
>>>>
>>>> The simple answer is these hash functions are designed to be used for
>>>> large, variable length string keys NOT for small, fixed-size
>>>> quantities which can be compared with a single machine instruction.
>>>>
>>>> http://burtleburtle.net/bob/hash/integer.html
>>>> http://www.cris.com/~Ttwang/tech/inthash.htm
>>>
>>> He has a list of multiple IP addresses. A hash function for
>>> variable length keys may well be appropriate.
>>
>> Calculating the hash value must be cheaper than the full comparisons
>> which will be needed for a linear search.
> Really? How do you hash a string then?

Really. If calculating the hash value is more expensive than the
comparisons it is supposed to avoid, it obviously buys you nothing.

Daniel Pitts

unread,

Apr 5, 2012, 1:23:04 PM4/5/12

to

A single hash is supposed to save you multiple comparisons, not a single
comparison, so it can be many times more expensive than a single
comparison and still improve performance. A hash that is 10 times
slower than a comparison, but allows you to find a bucket with only 3
out of 3000 entries saves you an average cost of 1487 comparisons.

Rainer Weikusat

unread,

Apr 5, 2012, 1:34:01 PM4/5/12

to

An the letter 's' following a word, as in comparisonS, is supposed to
indicate a plural.

Daniel Pitts

unread,

Apr 5, 2012, 2:11:13 PM4/5/12

to

Yes, I single hash operation is used so that you can avoid multiple
(meaning plural) comparisons. I said exactly what I meant, and meant
exactly what I said. My point still stands. A hash can be slower than a
comparison, and still save you a significant amount of time.

Rainer Weikusat

unread,

Apr 5, 2012, 2:34:49 PM4/5/12

to

And so did I, superbrain. Please pester someone else in future.

Rainer Weikusat

unread,

Apr 5, 2012, 2:38:40 PM4/5/12

to

Rainer Weikusat <rwei...@mssgmbh.com> writes:

[...]

>>>>> Really. If calculating the hash value is more expensive than the
>>>>> comparisons it is supposed to avoid, it obviously buys you nothing.
>>>> A single hash is supposed to save you multiple comparisons, not a
>>>> single comparison,
>>>
>>> An the letter 's' following a word, as in comparisonS, is supposed to
>>> indicate a plural.
>> Yes, I single hash operation is used so that you can avoid multiple
>> (meaning plural) comparisons. I said exactly what I meant, and meant
>> exactly what I said. My point still stands. A hash can be slower than
>> a comparison, and still save you a significant amount of time.
>
> And so did I, superbrain. Please pester someone else in future.

And just in case there are more 'Blitzmerkers' of this kind around
here: I was referring to the complete set of comparisons that would
need to be done for a search without hashing all the time.

Scott Lurndal

unread,

Apr 5, 2012, 2:45:18 PM4/5/12

to

This seems a bit of a non sequitur - crytpographic hashes aren't used
for bucket selection.

I'd be interested in more data about the original posters problem:

- How many entries will his "AVL tree" contain over time?
- What type of system does the software run on (embedded 1Ghz
processor or core I7 at 3.5Ghz?)
- What is the frequency (over time) of the various search, insert
and delete operations?

>
>Often performance of the program is not dominated by the speed of the
>hash-function, but the number of collisions created by it. Having a
>hash-function that is 10 times slower than a comparison, but creates an
>even spread will generally outperform the "fastest" hash. If you want a
>fast hash, "return 0;"

I think many people would be surprised that simple O(N) algorithms would
be of suitable performance for many problems where more complicated data
structures are designed (e.g. a linear table lookup instead of an AVL or R/B tree
or hash buckets).

Of course, if the OP is building a core router, then a hardware CAM may
be the most appropriate solution.

scott

Mark

unread,

Apr 5, 2012, 3:02:41 PM4/5/12

to

"Scott Lurndal" <sc...@slp53.sl.home> wrote in message
news:2Llfr.24766$ID1....@news.usenetserver.com...

> I'd be interested in more data about the original posters problem:
>
> - How many entries will his "AVL tree" contain over time?

I gave some more details in this thread in response to Eric Sosman's
comments. Performance isn't a top priority for this project.

> - What type of system does the software run on (embedded 1Ghz
> processor or core I7 at 3.5Ghz?)
> - What is the frequency (over time) of the various search, insert
> and delete operations?

Mark

Daniel Pitts

unread,

Apr 5, 2012, 3:18:30 PM4/5/12

to

Pointing out a flaw in your post is not pestering you. The fact that
you haven't chosen to respond to the actual point I made shows you are
in fact the one who pesters.

Hash's are slower than comparisons, as a matter of course.

Good day to you sir.

Eric Sosman

unread,

Apr 5, 2012, 10:24:11 PM4/5/12

to

On 4/5/2012 10:43 AM, Mark wrote:
> "Rainer Weikusat"<rwei...@mssgmbh.com> wrote in message
> news:87bon65...@sapphire.mobileactivedefense.com...
>>> So the search was based on 'struct nhlfe_key'. Now, what I'm trying to
>>> add
>>> support for multiple next hops, that is I add a linked list in
>>> nhlfe_entry:
>>> struct nhlfe_entry{ u_int32_t nhlfe_ix; u_char opcode; ... struct
>>> list
>>> *nhkey_list;}where 'struct list' is struct listnode that embeds 'void
>>> *data'
>>> pointer to caller'sprivate data, and this is 'struct nhlfe_key'.So I'm
>>> trying to find the way to generate key based on multiple elements in the
>>> list to store/search nodes in the tree.
>>
>> The simplest solution would be to concatenate all 'next hops' in
>> order to form a key.
>
> Number of next hops is variable, thus anyway it would require to delete the
> node and re-insert it back with a new key generated, right? (that's not an
> issue though). How would you suggest to concatenate 32-bit long unsigned
> integers, in what form -- convert to strings, concatenate and use such
> string as a key?

What are you trying to *do* with this key? Yes, we know, you're
using it to order the nodes in an AVL tree -- but that's like the guy
saying "I'm stacking bricks on top of other bricks" instead of saying
"I'm building a cathedral." Tell us what you want to do with this AVL
tree once you have it: Especially, tell us what you'll have in your
hand when you knock on its door and ask "Have you got one of THESE?"

Are you holding a single IP address, and looking for all the
multi-address routes that begin there?

Are you holding a single IP address, and looking for all the
routes that include it anywhere?

Are you holding an entire sequence of IP addresses, and looking
for that exact route?

Are you holding a sequence of addresses, looking for all routes
that begin with that sequence?

... or what? These are all "search problems," but they're of
rather different character and require rather different responses.

--
Eric Sosman
eso...@ieee-dot-org.invalid

Rainer Weikusat

unread,

Apr 6, 2012, 12:13:48 PM4/6/12

to

"Mark" <mark_cruz...@hotmail.com> writes:
> "Rainer Weikusat" <rwei...@mssgmbh.com> wrote in message
> news:87bon65...@sapphire.mobileactivedefense.com...
>>> So the search was based on 'struct nhlfe_key'. Now, what I'm trying to
>>> add
>>> support for multiple next hops, that is I add a linked list in
>>> nhlfe_entry:
>>> struct nhlfe_entry{ u_int32_t nhlfe_ix; u_char opcode; ... struct
>>> list
>>> *nhkey_list;}where 'struct list' is struct listnode that embeds 'void
>>> *data'
>>> pointer to caller'sprivate data, and this is 'struct nhlfe_key'.So I'm
>>> trying to find the way to generate key based on multiple elements in the
>>> list to store/search nodes in the tree.
>>
>> The simplest solution would be to concatenate all 'next hops' in
>> order to form a key.
>
> Number of next hops is variable, thus anyway it would require to delete the
> node and re-insert it back with a new key generated, right? (that's not an
> issue though). How would you suggest to concatenate 32-bit long unsigned
> integers, in what form -- convert to strings, concatenate and use such
> string as a key?

I would consolidate each pair into a 64-bit integer, using 0 in place
of a possibly-missing 'second 32-bit number' and then do 'string
comparions' on 64-bit integer strings of some length. Whether I would
rather store an explicit length together with such a string or use a
(64-bit) null terminator would depend on the actual circumstances.

Rainer Weikusat

unread,

Apr 6, 2012, 12:21:42 PM4/6/12

to

Rainer Weikusat <rwei...@mssgmbh.com> writes:
> "Mark" <mark_cruz...@hotmail.com> writes:

[...]

>> How would you suggest to concatenate 32-bit long unsigned
>> integers, in what form -- convert to strings, concatenate and use such
>> string as a key?
>
> I would consolidate each pair into a 64-bit integer, using 0 in place
> of a possibly-missing 'second 32-bit number' and then do 'string
> comparions' on 64-bit integer strings of some length.

Clarification: This is not supposed to refer to the C str*-library
functions but to equivalent algorithms working with 64-bit integers
instead of chars.

Mark

unread,

Apr 9, 2012, 10:04:49 AM4/9/12

to

"Rainer Weikusat" <rwei...@mssgmbh.com> wrote in message

news:87ehs0g...@sapphire.mobileactivedefense.com...

>> Number of next hops is variable, thus anyway it would require to delete
>> the
>> node and re-insert it back with a new key generated, right? (that's not
>> an
>> issue though). How would you suggest to concatenate 32-bit long unsigned
>> integers, in what form -- convert to strings, concatenate and use such
>> string as a key?
>
> I would consolidate each pair into a 64-bit integer, using 0 in place
> of a possibly-missing 'second 32-bit number' and then do 'string
> comparions' on 64-bit integer strings of some length. Whether I would
> rather store an explicit length together with such a string or use a
> (64-bit) null terminator would depend on the actual circumstances.

Thanks, this looks reasonable, however I don't understand what do you mean
by "each pair"? For instance, given three next hops:

10.10.10.1
20.20.20.1
30.30.30.1

this would yield in two 64-bit values:

10.10.10.1_20.20.20.1
30.30.30.1_0.0.0.0

Is this what you mean?

Mark

Rainer Weikusat

unread,

Apr 9, 2012, 10:54:28 AM4/9/12

to

Using another notataion: The three IPv4 addresses

0x0a0a0a01
0x14141401
0x1e1e1e01

would yield the two 64-bit values

0x0a0a0a0114141401
0x1e1e1e0100000000