I'm trying to find a way create a hash value over multiple IP addresses (I keep a list of IP addresses and its number is variable. The simplest method I've found so far is (obviously not the most effective and collision free):
unsigned long hash_ipaddr(struct in_addr *addr)
{
unsigned long res;
> I'm trying to find a way create a hash value over multiple IP addresses (I > keep a list of IP addresses and its number is variable. The simplest method > I've found so far is (obviously not the most effective and collision free):
> unsigned long hash_ipaddr(struct in_addr *addr)
> {
> unsigned long res;
"Mark" <mark_cruzNOTFORS...@hotmail.com> writes:
> I'm trying to find a way create a hash value over multiple IP addresses (I > keep a list of IP addresses and its number is variable. The simplest method > I've found so far is (obviously not the most effective and collision free):
Why not use one of many available high-quality hash functions?
Three that come to mind without even going to a search engine:
Jenkins lookup3, murmurhash, FNV.
-- Ben Pfaff http://benpfaff.org
Ben Pfaff <b...@cs.stanford.edu> writes:
> "Mark" <mark_cruzNOTFORS...@hotmail.com> writes:
>> I'm trying to find a way create a hash value over multiple IP addresses (I >> keep a list of IP addresses and its number is variable. The simplest method >> I've found so far is (obviously not the most effective and collision free):
> Why not use one of many available high-quality hash functions?
> Three that come to mind without even going to a search engine:
> Jenkins lookup3, murmurhash, FNV.
The simple answer is these hash functions are designed to be used for
large, variable length string keys NOT for small, fixed-size
quantities which can be compared with a single machine instruction.
>>> I'm trying to find a way create a hash value over multiple IP addresses (I >>> keep a list of IP addresses and its number is variable. The simplest method >>> I've found so far is (obviously not the most effective and collision free):
>> Why not use one of many available high-quality hash functions?
>> Three that come to mind without even going to a search engine:
>> Jenkins lookup3, murmurhash, FNV.
> The simple answer is these hash functions are designed to be used for
> large, variable length string keys NOT for small, fixed-size
> quantities which can be compared with a single machine instruction.
He has a list of multiple IP addresses. A hash function for
variable length keys may well be appropriate.
-- Peter Seebach on managing engineers:
"It's like herding cats, only most of the engineers are already
sick of laser pointers."
In comp.unix.programmer Mark <mark_cruzNOTFORS...@hotmail.com> wrote:
> I'm trying to find a way create a hash value over multiple IP
> addresses (I keep a list of IP addresses and its number is
> variable. The simplest method I've found so far is (obviously not
> the most effective and collision free):
> unsigned long hash_ipaddr(struct in_addr *addr)
No IPv6?
rick jones
-- Process shall set you free from the need for rational thought. these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
Mark <mark_cruzNOTFORS...@hotmail.com> wrote:
> Hello
> I'm trying to find a way create a hash value over multiple IP addresses (I
> keep a list of IP addresses and its number is variable. The simplest
> method I've found so far is (obviously not the most effective and
> collision free):
Also, because IP addresses are fixed size, perhaps the most practical
algorithm to use is binary search. Far easier to implement than perfect
hashing, with no storage overhead--a simple linear array (or two, for v4 and
v6) of addresses.
> I'm trying to find a way create a hash value over multiple IP addresses (I
> keep a list of IP addresses and its number is variable. The simplest method
> I've found so far is (obviously not the most effective and collision free):
> unsigned long hash_ipaddr(struct in_addr *addr)
> {
> unsigned long res;
Briefly, this calculation takes four chunks of the address
as numbers between 0..255 and adds them to produce a sum 0..1020.
So: You start with 32 bits of information and squash them down to
a smidgen less than 10 bits -- with a moderate central tendencey,
too. Ponder, young one: Wise this is?
> }
> And then sum up hashes for every IP and store it.
Not sure what you mean by this. A given IP produces one and
only one hash value with this scheme (or any other sane scheme),
so what do you mean by "sum up the hashes?"
> Can somebody suggest some better approach for my task?
You haven't explained what your task is, so I'll have to guess.
If the objective is to use a hash table keyed on IP addresses, there
are two likely possibilities:
- The number of buckets in the table is a prime number (or at
the very least an odd number): Just take `*addr % N', and the
outcome is very likely to be better, especially if N > 1020.
The remainder after division by a prime is likely to break up
simple patterns.
- The number of buckets is a power of two. Now `*addr % N' would
just throw away the high-order bits -- which might not be Bad,
but could easily be Less Than Good if there are patterns in the
input keys. One possibility is to use `f(*addr) % N', where f()
is a suitable "scrambling" function: For example, implement a
two-line linear congruential pseudo-random number generator, seed
it with `*addr', and take its output as f().
> > I'm trying to find a way create a hash value over multiple IP addresses (I
> > keep a list of IP addresses and its number is variable. The simplest method
> > I've found so far is (obviously not the most effective and collision free):
> > unsigned long hash_ipaddr(struct in_addr *addr)
> > {
> > unsigned long res;
> Briefly, this calculation takes four chunks of the address
> as numbers between 0..255 and adds them to produce a sum 0..1020.
> So: You start with 32 bits of information and squash them down to
> a smidgen less than 10 bits -- with a moderate central tendencey,
> too. Ponder, young one: Wise this is?
That's not what I see. It looks to me like he's adding the three high-order bytes together, shifting that 8 bits to the left, and then adding in the low-order byte. So he's shrinking a 32-bit value down to an 18-bit value. Another way to write his expression is:
Hash functions are expected to squash a large value into a much smaller one -- that's the whole point, isn't it? Although it's hard understand why one would bother when the hash is more than half the size of the original.
> > }
> > And then sum up hashes for every IP and store it.
> Not sure what you mean by this. A given IP produces one and
> only one hash value with this scheme (or any other sane scheme),
> so what do you mean by "sum up the hashes?"
I think what he REALLY wants is a hash of a list of IPs. So he's hashing each IP, then combining all the hashes together by adding them.
A better solution might be to concatenate all the IPs, then use a standard hash function on that. If you don't want the result to depend on the order of the IPs, sort the list before concatenating them.
-- Barry Margolin, bar...@alum.mit.edu
Arlington, MA
*** PLEASE post questions in newsgroups, not directly to me ***
> I'm trying to find a way create a hash value over multiple IP addresses (I
> keep a list of IP addresses and its number is variable. The simplest method
> I've found so far is (obviously not the most effective and collision free):
> unsigned long hash_ipaddr(struct in_addr *addr)
> {
> unsigned long res;
On Thursday, 5 April 2012 02:07:50 UTC+5:30, Mark wrote:
> Hello
> I'm trying to find a way create a hash value over multiple IP addresses (I > keep a list of IP addresses and its number is variable. The simplest method > I've found so far is (obviously not the most effective and collision free):
> unsigned long hash_ipaddr(struct in_addr *addr)
> {
> unsigned long res;
> And then sum up hashes for every IP and store it.
> Can somebody suggest some better approach for my task?
> Thanks.
> Mark
Hashing is preferable if you have large number of keys and you know the number of keys before hand. In that case hashing gives you the best results. Moreover, Hashing is more common data structure for the purpose of indexing non-integral keys.
> I'm trying to find a way create a hash value over multiple IP addresses (I > keep a list of IP addresses and its number is variable. The simplest method > I've found so far is (obviously not the most effective and collision free):
In your case the number of IPAddresses are variable and your keys are integral. So you can use these points for your benefit to use a more better approach.
1st Approach : If the number of IPAddress will not exceed a very large value( < 100000), then you can use AVL Trees/Red Black Trees. If you are using C++, you can use std::map<> data structure directly for this purpose.
These will give you in worst case the order of O(15-17) for Insertion, Deletion and Searching. Plus it will save you a lot of time in creating your own hash data structure and making sure it is giving you best performance.
2nd Approach :
If you know the range of you IP addresses, lets say between 192.168.0.0 to 192.168.10.256. Then you can directly allocate an array of size of the number of IP Addresses in this range. And you can perform the constant searching by subtracting the minimum IP Address from your queried IP Address to get the index into the array.
But if the above two approaches doesnot suit your needs then you can always use some libraries which provide hash datastructures already implemented in them.
>> Ben Pfaff <b...@cs.stanford.edu> writes:
>>> "Mark" <mark_cruzNOTFORS...@hotmail.com> writes:
>>>> I'm trying to find a way create a hash value over multiple IP addresses (I >>>> keep a list of IP addresses and its number is variable. The simplest method >>>> I've found so far is (obviously not the most effective and collision free):
>>> Why not use one of many available high-quality hash functions?
>>> Three that come to mind without even going to a search engine:
>>> Jenkins lookup3, murmurhash, FNV.
>> The simple answer is these hash functions are designed to be used for
>> large, variable length string keys NOT for small, fixed-size
>> quantities which can be compared with a single machine instruction.
> He has a list of multiple IP addresses. A hash function for
> variable length keys may well be appropriate.
Calculating the hash value must be cheaper than the full comparisons
which will be needed for a linear search. And performing extensive
byte-shuffling on machine words is not a good strategy to achieve
that.
Unrelated question (Barry Margolin also noticed that but I'd like to
try expressing it somewhat clearer): This function adds the three
highest octets together in the high octet of a 16-bit integer and
then adds the lowest octet of the original address to the lowest octet
of this 16-bit integer whose value was zero before this addition. This
seems rather odd. Is this intentional?
> Unrelated question (Barry Margolin also noticed that but I'd like to
> try expressing it somewhat clearer): This function adds the three
> highest octets together in the high octet of a 16-bit integer and
> then adds the lowest octet of the original address to the lowest octet
> of this 16-bit integer whose value was zero before this addition. This
> seems rather odd. Is this intentional?
He apparently wants each address in a /24 to generate a different hash.
This seems reasonable if he expects his application to have to deal with many clients from the same subnet, and avoid clashes among them.
-- Barry Margolin, bar...@alum.mit.edu
Arlington, MA
*** PLEASE post questions in newsgroups, not directly to me ***
>> Can somebody suggest some better approach for my task?
> You haven't explained what your task is, so I'll have to guess.
> If the objective is to use a hash table keyed on IP addresses, there
> are two likely possibilities:
First of all, thanks for comments. What I'm trying to solve is to find a way to search in AVL tree used in IP routing application. Initially a tree's node looked simple:
So the search was based on 'struct nhlfe_key'. Now, what I'm trying to add support for multiple next hops, that is I add a linked list in nhlfe_entry:
struct nhlfe_entry{ u_int32_t nhlfe_ix; u_char opcode; ... struct list *nhkey_list;}where 'struct list' is struct listnode that embeds 'void *data' pointer to caller'sprivate data, and this is 'struct nhlfe_key'.So I'm trying to find the way to generate key based on multiple elements in the list to store/search nodes in the tree. One of options I was considering hash function, but it now looks unfeasible to me; the other is to always have a key based on the very first element of the list (but whenever list changes, re-insert node in the tree); and the third is to have two different containers - one is AVL tree where key is IP address and the second is list or array of 'nhlfe_entry', and each tree node should have some reference to elements from the 2nd container.. So not sure which one to proceed with.
> In article<jlir5m$dd...@dont-email.me>,
> Eric Sosman<esos...@ieee-dot-org.invalid> wrote:
>> On 4/4/2012 4:37 PM, Mark wrote:
>>> Hello
>>> I'm trying to find a way create a hash value over multiple IP addresses (I
>>> keep a list of IP addresses and its number is variable. The simplest method
>>> I've found so far is (obviously not the most effective and collision free):
>>> unsigned long hash_ipaddr(struct in_addr *addr)
>>> {
>>> unsigned long res;
>> Briefly, this calculation takes four chunks of the address
>> as numbers between 0..255 and adds them to produce a sum 0..1020.
>> So: You start with 32 bits of information and squash them down to
>> a smidgen less than 10 bits -- with a moderate central tendencey,
>> too. Ponder, young one: Wise this is?
> That's not what I see.
Nor I, now that my eyes are open. Thanks for the catch. (Sigh.)
> It looks to me like he's adding the three
> high-order bytes together, shifting that 8 bits to the left, and then
> adding in the low-order byte. So he's shrinking a 32-bit value down to
> an 18-bit value. Another way to write his expression is:
> Hash functions are expected to squash a large value into a much smaller
> one -- that's the whole point, isn't it? Although it's hard understand
> why one would bother when the hash is more than half the size of the
> original.
Yeah. He's still got the central-tendency problem with three
of the bytes, too. Even if he's got a 2**18-bucket table and will
use this hash directly as the bucket index, I doubt he'll get a
good distribution: I'd expect more-than-normal collisions in the
buckets near 97920, with more-than-normal empty buckets in the "tails."
>>> And then sum up hashes for every IP and store it.
>> Not sure what you mean by this. A given IP produces one and
>> only one hash value with this scheme (or any other sane scheme),
>> so what do you mean by "sum up the hashes?"
> I think what he REALLY wants is a hash of a list of IPs. So he's
> hashing each IP, then combining all the hashes together by adding them.
> A better solution might be to concatenate all the IPs, then use a
> standard hash function on that. If you don't want the result to depend
> on the order of the IPs, sort the list before concatenating them.
Or hash each individually and combine them with a commutative
operation like + or ^. That's not usually a great idea because it
gives (A,B) and (B,A) the same hash -- but if he *wants* them to
have the same hash, well, ...
"Mark" <mark_cruzNOTFORS...@hotmail.com> writes:
> First of all, thanks for comments. What I'm trying to solve is to find a way > to search in AVL tree used in IP routing application. Initially a tree's > node looked simple:
> So the search was based on 'struct nhlfe_key'. Now, what I'm trying to add > support for multiple next hops, that is I add a linked list in nhlfe_entry:
> struct nhlfe_entry{ u_int32_t nhlfe_ix; u_char opcode; ... struct list > *nhkey_list;}where 'struct list' is struct listnode that embeds 'void *data' > pointer to caller'sprivate data, and this is 'struct nhlfe_key'.So I'm > trying to find the way to generate key based on multiple elements in the > list to store/search nodes in the tree.
The simplest solution would be to concatenate all 'next hops' in
order to form a key. A better idea would be to build a so-called radix
search trie based on keys like this. The best online description of
that I found quickly is:
I happen to own contains a good description. In contrast to this, the
Wikipedia article completely useless, except as an excellent example
how language can be used to obscure phenomenons instead of explaining
them.
>>> Ben Pfaff <b...@cs.stanford.edu> writes:
>>>> "Mark" <mark_cruzNOTFORS...@hotmail.com> writes:
>>>>> I'm trying to find a way create a hash value over multiple IP addresses (I >>>>> keep a list of IP addresses and its number is variable. The simplest method >>>>> I've found so far is (obviously not the most effective and collision free):
>>>> Why not use one of many available high-quality hash functions?
>>>> Three that come to mind without even going to a search engine:
>>>> Jenkins lookup3, murmurhash, FNV.
>>> The simple answer is these hash functions are designed to be used for
>>> large, variable length string keys NOT for small, fixed-size
>>> quantities which can be compared with a single machine instruction.
>> He has a list of multiple IP addresses. A hash function for
>> variable length keys may well be appropriate.
> Calculating the hash value must be cheaper than the full comparisons
> which will be needed for a linear search. And performing extensive
> byte-shuffling on machine words is not a good strategy to achieve
> that.
You suggested http://burtleburtle.net/bob/hash/integer.html.
Take a look at the first suggested hash function for an integer
there. GCC 4.4 with -O3 compiles this into 12 instructions. And
then you have to figure out a way to combine multiple hashes for
subsequent integers.
Now look at, say, http://burtleburtle.net/bob/c/lookup3.c. It
takes 12*3 instructions (according to the comments) to hash three
32-bit integers, which, per-integer, is the same rate, and you
don't have to invent a way to combine them.
-- "While the Melissa license is a bit unclear, Melissa aggressively
encourages free distribution of its source code."
--Kevin Dalley <ke...@seti.org>
>> So the search was based on 'struct nhlfe_key'. Now, what I'm trying to >> add
>> support for multiple next hops, that is I add a linked list in >> nhlfe_entry:
>> struct nhlfe_entry{ u_int32_t nhlfe_ix; u_char opcode; ... struct >> list
>> *nhkey_list;}where 'struct list' is struct listnode that embeds 'void >> *data'
>> pointer to caller'sprivate data, and this is 'struct nhlfe_key'.So I'm
>> trying to find the way to generate key based on multiple elements in the
>> list to store/search nodes in the tree.
> The simplest solution would be to concatenate all 'next hops' in
> order to form a key.
Number of next hops is variable, thus anyway it would require to delete the node and re-insert it back with a new key generated, right? (that's not an issue though). How would you suggest to concatenate 32-bit long unsigned integers, in what form -- convert to strings, concatenate and use such string as a key?
> A better idea would be to build a so-called radix
> search trie based on keys like this.
Unfortunately this isn't feasible at the moment, as it will require lots of chnages all over the code and likely break the existing functionality.
>>>>> Why not use one of many available high-quality hash functions?
>>>>> Three that come to mind without even going to a search engine:
>>>>> Jenkins lookup3, murmurhash, FNV.
>>>> The simple answer is these hash functions are designed to be used for
>>>> large, variable length string keys NOT for small, fixed-size
>>>> quantities which can be compared with a single machine instruction.
>>> He has a list of multiple IP addresses. A hash function for
>>> variable length keys may well be appropriate.
>> Calculating the hash value must be cheaper than the full comparisons
>> which will be needed for a linear search. And performing extensive
>> byte-shuffling on machine words is not a good strategy to achieve
>> that.
That's a web page with some background explanations by an 'authority' you
were (indirectly) refering to and some examples of 'integer hash
functions' supposed to produce hash values with qualities Bob Jenkins
considered to be desirable. I wouldn't want to use them to hash IP
addresses, much as I also wouldn't want to use one of the 'monster
hashes' for string lookup.
> Now look at, say, http://burtleburtle.net/bob/c/lookup3.c. It
> takes 12*3 instructions (according to the comments) to hash three
> 32-bit integers, which, per-integer, is the same rate, and you
> don't have to invent a way to combine them.
>>> Ben Pfaff<b...@cs.stanford.edu> writes:
>>>> "Mark"<mark_cruzNOTFORS...@hotmail.com> writes:
>>>>> I'm trying to find a way create a hash value over multiple IP addresses (I
>>>>> keep a list of IP addresses and its number is variable. The simplest method
>>>>> I've found so far is (obviously not the most effective and collision free):
>>>> Why not use one of many available high-quality hash functions?
>>>> Three that come to mind without even going to a search engine:
>>>> Jenkins lookup3, murmurhash, FNV.
>>> The simple answer is these hash functions are designed to be used for
>>> large, variable length string keys NOT for small, fixed-size
>>> quantities which can be compared with a single machine instruction.
>> He has a list of multiple IP addresses. A hash function for
>> variable length keys may well be appropriate.
> Calculating the hash value must be cheaper than the full comparisons
> which will be needed for a linear search.
Really? How do you hash a string then? The point of a hash function is often (not always) used to determine a bucket to look into. It needn't be faster than a comparison, and actually is often slower. Cryptographically strong hashes are significantly slower than that usually.
Often performance of the program is not dominated by the speed of the hash-function, but the number of collisions created by it. Having a hash-function that is 10 times slower than a comparison, but creates an even spread will generally outperform the "fastest" hash. If you want a fast hash, "return 0;"
> And performing extensive
> byte-shuffling on machine words is not a good strategy to achieve
> that.
True, but what else are you going to do?
The original posters problem, as I understand it, is that he has several IP addresses that need to be hashed together. I would assume that means he has a table that is a map from a collection of IP addresses to some object.
First, an IPV4 address is basically 32 bits. The entropy is low on the high-order bits and higher on the low order, which is fine for typical hash use-cases.
So, he has basically an integer bag (The OP hasn't specified whether the order of IPs matter, nor whether they may have duplicates)
If order doesn't matter, then the easiest correct approach is simply to sum up the integer value of each IP (not each octant, but the whole IP).
If order does matter, then you might want Σ(IP[i] * smallPrime^i). (where ^ is the exponent operator, not xor). This can be implemented as: int hash=0; foreach(IP in theList) hash = hash * smallPrime + IP;
Duplicates will be handled correctly for both cases, but the hashes will only be equal if the same duplicates exist.
Either way, the actual speed of the hash function only matters if it because a bottleneck to the performance of the program. The requires real-world measurement and profiling, and not a bunch of supposition from wiseguys on a newsgroup :-).
>>>> Ben Pfaff<b...@cs.stanford.edu> writes:
>>>>> "Mark"<mark_cruzNOTFORS...@hotmail.com> writes:
>>>>>> I'm trying to find a way create a hash value over multiple IP addresses (I
>>>>>> keep a list of IP addresses and its number is variable. The simplest method
>>>>>> I've found so far is (obviously not the most effective and collision free):
>>>>> Why not use one of many available high-quality hash functions?
>>>>> Three that come to mind without even going to a search engine:
>>>>> Jenkins lookup3, murmurhash, FNV.
>>>> The simple answer is these hash functions are designed to be used for
>>>> large, variable length string keys NOT for small, fixed-size
>>>> quantities which can be compared with a single machine instruction.
>>> He has a list of multiple IP addresses. A hash function for
>>> variable length keys may well be appropriate.
>> Calculating the hash value must be cheaper than the full comparisons
>> which will be needed for a linear search.
> Really? How do you hash a string then?
Really. If calculating the hash value is more expensive than the
comparisons it is supposed to avoid, it obviously buys you nothing.
>>>>> Ben Pfaff<b...@cs.stanford.edu> writes:
>>>>>> "Mark"<mark_cruzNOTFORS...@hotmail.com> writes:
>>>>>>> I'm trying to find a way create a hash value over multiple IP addresses (I
>>>>>>> keep a list of IP addresses and its number is variable. The simplest method
>>>>>>> I've found so far is (obviously not the most effective and collision free):
>>>>>> Why not use one of many available high-quality hash functions?
>>>>>> Three that come to mind without even going to a search engine:
>>>>>> Jenkins lookup3, murmurhash, FNV.
>>>>> The simple answer is these hash functions are designed to be used for
>>>>> large, variable length string keys NOT for small, fixed-size
>>>>> quantities which can be compared with a single machine instruction.
>>>> He has a list of multiple IP addresses. A hash function for
>>>> variable length keys may well be appropriate.
>>> Calculating the hash value must be cheaper than the full comparisons
>>> which will be needed for a linear search.
>> Really? How do you hash a string then?
> Really. If calculating the hash value is more expensive than the
> comparisons it is supposed to avoid, it obviously buys you nothing.
A single hash is supposed to save you multiple comparisons, not a single comparison, so it can be many times more expensive than a single comparison and still improve performance. A hash that is 10 times slower than a comparison, but allows you to find a bucket with only 3 out of 3000 entries saves you an average cost of 1487 comparisons.