In the book, "Programming in C++" by Stephen Dewhurst and Kathy Stark (1989, Prentice Hall, page 14), I find a short hashing subroutine ("hash" as in cryptographic hash, not an associative array) by the famous Peter Weinberger.
The subroutine accepts a string, and returns a numerical value. One possible application of such a subroutine would be for storing (and retrieving) of a vast number of files (many thousands or millions) - the files could be stored in an arbitrary number of directories/ filesystems (ie, "hash buckets") and retrieved based on the hash value of their filename.
I wish to convert this C++ subroutine to a Perl subroutine. But my knowledge of C++ is limited to what I learned in a class in college - I have never coded in C++ and I am severely deficient in this language.
Based on my limited C++ skills, I have made an effort to do this conversion, but it does not produce the expected results. I was hoping someone here who was more familiar with C++ could point out the error of my ways.
Here is subroutine from the book, based on Peter Weinberger's code:
int hashpjw( char *s ) { const prime = 211; unsigned hash = 0, g;
David Filmer <use...@davidfilmer.com> wrote: > In the book, "Programming in C++" by Stephen Dewhurst and Kathy Stark > (1989, Prentice Hall, page 14), I find a short hashing subroutine > ("hash" as in cryptographic hash, not an associative array) by the > famous Peter Weinberger. > The subroutine accepts a string, and returns a numerical value. One > possible application of such a subroutine would be for storing (and > retrieving) of a vast number of files (many thousands or millions) - > the files could be stored in an arbitrary number of directories/ > filesystems (ie, "hash buckets") and retrieved based on the hash value > of their filename.
Mmmm, but you also have to take into account possible "hash collisions"...
> I wish to convert this C++ subroutine to a Perl subroutine. But my > knowledge of C++ is limited to what I learned in a class in college - > I have never coded in C++ and I am severely deficient in this > language. > Based on my limited C++ skills, I have made an effort to do this > conversion, but it does not produce the expected results. I was > hoping someone here who was more familiar with C++ could point out the > error of my ways. > Here is subroutine from the book, based on Peter Weinberger's code: > int > hashpjw( char *s ) { > const prime = 211; > unsigned hash = 0, g; > for( char *p = s; *p, p++ ) { > hash = ( hash << 4 ) + *p; > // assumes 32 bit int size > if( g = hash & 0xf0000000 ) { > hash ^= g >> 24; > hsh ^= g; > } > } > return hash % prime; > }
Your 'prime' is awfully small, with that you never get more than 211 different hash values...
> Here is my <lame>attempt</lame> to convert it to Perl: > #!/usr/bin/perl > use strict;
use warnings;
;-)
> print hashpjw( 'bar', 123 );
Why do you pass it two arguments if the C/C++ version only takes a single one, the string of which you want to compute the hash value?
sub hashpjw { my ( $s, $prime, $hash ) = ( shift, 211, 0 );
for ( split //, $s ) { $hash = ( $hash << 4 ) + ord $_; if ( my $g = $hash & 0xf0000000 ) { $hash ^= $g >> 24; $hash ^= $g; } }
return $hash % $prime;
}
One warning: the C/C++ version is rather likely meant to be used with 8-bit characters. The Perl version as it is will happily accept e.g. Unicode characters and then the result will differ from the C/C++ version! Regards, Jens -- \ Jens Thoms Toerring ___ j...@toerring.de \__________________________ http://toerring.de
>>>>> "DF" == David Filmer <use...@DavidFilmer.com> writes:
hi dave!
DF> In the book, "Programming in C++" by Stephen Dewhurst and Kathy Stark DF> (1989, Prentice Hall, page 14), I find a short hashing subroutine DF> ("hash" as in cryptographic hash, not an associative array) by the DF> famous Peter Weinberger.
well, perl hashes are called that because they use a hash function to allow O(1) (or close to it) access!
DF> The subroutine accepts a string, and returns a numerical value. One DF> possible application of such a subroutine would be for storing (and DF> retrieving) of a vast number of files (many thousands or millions) - DF> the files could be stored in an arbitrary number of directories/ DF> filesystems (ie, "hash buckets") and retrieved based on the hash value DF> of their filename.
which is also how perl hashes work!
DF> int DF> hashpjw( char *s ) { DF> const prime = 211; DF> unsigned hash = 0, g;
the c++ code declares only a single char pointer arg which is similar to a perl string. and as i pointed out above, you pass in two args.
DF> my $prime = 211; DF> my ( $hash, $g, $p ); DF> for my $char( $p = $s, $p, $p++ ) {
that redeclares $char so the string arg you pass in is never used. also this loop just scans the string to be hashed. there are many easy idioms for this. this s about the simplest but may be slow on very long strings:
foreach my $char ( split //, $s ) {
this works better on long strings:
while( my ($char) = $s =~ /(.)/sg ) {
DF> $hash = ( $hash << 4 ) + $p;
you never set $p to anything. it needs to be the next char. so change $p to $char. *p is accessing the next char in the c++ string.
well, you never scan the input string which is never properly passed in. fix those bugs and see what happens. also test this out with a single char string so you can debug all the steps more easily.
On 24 Feb 2009 09:11:20 GMT, j...@toerring.de (Jens Thoms Toerring) wrote:
>David Filmer <use...@davidfilmer.com> wrote: >> In the book, "Programming in C++" by Stephen Dewhurst and Kathy Stark >> (1989, Prentice Hall, page 14), I find a short hashing subroutine
On Mon, 23 Feb 2009 22:36:22 -0800 (PST), David Filmer <use...@DavidFilmer.com> wrote: >Greetings.
>In the book, "Programming in C++" by Stephen Dewhurst and Kathy Stark >(1989, Prentice Hall, page 14), I find a short hashing subroutine >("hash" as in cryptographic hash, not an associative array) by the >famous Peter Weinberger.
>The subroutine accepts a string, and returns a numerical value. One >possible application of such a subroutine would be for storing (and >retrieving) of a vast number of files (many thousands or millions) - >the files could be stored in an arbitrary number of directories/ >filesystems (ie, "hash buckets") and retrieved based on the hash value >of their filename.
>I wish to convert this C++ subroutine to a Perl subroutine. But my >knowledge of C++ is limited to what I learned in a class in college - >I have never coded in C++ and I am severely deficient in this >language.
>Based on my limited C++ skills, I have made an effort to do this >conversion, but it does not produce the expected results. I was >hoping someone here who was more familiar with C++ could point out the >error of my ways.
>Here is subroutine from the book, based on Peter Weinberger's code:
On 24 Feb 2009 09:11:20 GMT j...@toerring.de (Jens Thoms Toerring) wrote:
JTT> for ( split //, $s ) { JTT> $hash = ( $hash << 4 ) + ord $_; ... JTT> The Perl version as it is will happily accept e.g. Unicode JTT> characters and then the result will differ from the C/C++ version!
An easy way to fix that is to replace the lines with
Ted Zlatanov <t...@lifelogs.com> wrote: > On 24 Feb 2009 09:11:20 GMT j...@toerring.de (Jens Thoms Toerring) wrote: > JTT> for ( split //, $s ) { > JTT> $hash = ( $hash << 4 ) + ord $_; > ... > JTT> The Perl version as it is will happily accept e.g. Unicode > JTT> characters and then the result will differ from the C/C++ version! > An easy way to fix that is to replace the lines with > for ( unpack "C*", $s) { > $hash = ( $hash << 4 ) + $_; > which will probably also be faster, a nice property for a hashing > function (unpack over split, and no ord() call).
Is a char guaranteed to be unsigned in C++? I'm only sure it isn't in C, so the behaviour of the original function (which could be C or C++) wouldn't be well-defined at least in C for anything than 7-bit (ASCII) chars as the input.
Using unpack() is definitely more elegant (and more in the spirit of the original function should there be Unicode chars in the string). All one would need to know is if "C*" or "c*" would have to be used with unpack().