In the book, "Programming in C++" by Stephen Dewhurst and Kathy Stark
(1989, Prentice Hall, page 14), I find a short hashing subroutine
("hash" as in cryptographic hash, not an associative array) by the
famous Peter Weinberger.
The subroutine accepts a string, and returns a numerical value. One
possible application of such a subroutine would be for storing (and
retrieving) of a vast number of files (many thousands or millions) -
the files could be stored in an arbitrary number of directories/
filesystems (ie, "hash buckets") and retrieved based on the hash value
of their filename.
I wish to convert this C++ subroutine to a Perl subroutine. But my
knowledge of C++ is limited to what I learned in a class in college -
I have never coded in C++ and I am severely deficient in this
language.
Based on my limited C++ skills, I have made an effort to do this
conversion, but it does not produce the expected results. I was
hoping someone here who was more familiar with C++ could point out the
error of my ways.
Here is subroutine from the book, based on Peter Weinberger's code:
int
hashpjw( char *s ) {
const prime = 211;
unsigned hash = 0, g;
for( char *p = s; *p, p++ ) {
hash = ( hash << 4 ) + *p;
// assumes 32 bit int size
if( g = hash & 0xf0000000 ) {
hash ^= g >> 24;
hsh ^= g;
}
}
return hash % prime;
}
Here is my <lame>attempt</lame> to convert it to Perl:
#!/usr/bin/perl
use strict;
print hashpjw( 'bar', 123 );
sub hashpjw {
my( $char, $s ) = @_;
my $prime = 211;
my ( $hash, $g, $p );
for my $char( $p = $s, $p, $p++ ) {
$hash = ( $hash << 4 ) + $p;
if( $g = $hash & 0xf0000000 ) {
$hash = $hash ^ ($g >> 24);
$hash = $hash ^ $g;
}
}
return $hash % $prime;
}
__END__
Any assistance is greatly appreciated!
> The subroutine accepts a string, and returns a numerical value. One
> possible application of such a subroutine would be for storing (and
> retrieving) of a vast number of files (many thousands or millions) -
> the files could be stored in an arbitrary number of directories/
> filesystems (ie, "hash buckets") and retrieved based on the hash value
> of their filename.
Mmmm, but you also have to take into account possible "hash
collisions"...
> I wish to convert this C++ subroutine to a Perl subroutine. But my
> knowledge of C++ is limited to what I learned in a class in college -
> I have never coded in C++ and I am severely deficient in this
> language.
> Based on my limited C++ skills, I have made an effort to do this
> conversion, but it does not produce the expected results. I was
> hoping someone here who was more familiar with C++ could point out the
> error of my ways.
> Here is subroutine from the book, based on Peter Weinberger's code:
> int
> hashpjw( char *s ) {
> const prime = 211;
> unsigned hash = 0, g;
> for( char *p = s; *p, p++ ) {
> hash = ( hash << 4 ) + *p;
> // assumes 32 bit int size
> if( g = hash & 0xf0000000 ) {
> hash ^= g >> 24;
> hsh ^= g;
> }
> }
> return hash % prime;
> }
Your 'prime' is awfully small, with that you never get more
than 211 different hash values...
> Here is my <lame>attempt</lame> to convert it to Perl:
> #!/usr/bin/perl
> use strict;
use warnings;
;-)
> print hashpjw( 'bar', 123 );
Why do you pass it two arguments if the C/C++ version only
takes a single one, the string of which you want to compute
the hash value?
> sub hashpjw {
> my( $char, $s ) = @_;
> my $prime = 211;
> my ( $hash, $g, $p );
> for my $char( $p = $s, $p, $p++ ) {
> $hash = ( $hash << 4 ) + $p;
> if( $g = $hash & 0xf0000000 ) {
> $hash = $hash ^ ($g >> 24);
> $hash = $hash ^ $g;
> }
> }
> return $hash % $prime;
> }
Try this instead (and pass it just the string)
sub hashpjw {
my ( $s, $prime, $hash ) = ( shift, 211, 0 );
for ( split //, $s ) {
$hash = ( $hash << 4 ) + ord $_;
if ( my $g = $hash & 0xf0000000 ) {
$hash ^= $g >> 24;
$hash ^= $g;
}
}
return $hash % $prime;
}
One warning: the C/C++ version is rather likely meant to be used
with 8-bit characters. The Perl version as it is will happily
accept e.g. Unicode characters and then the result will differ
from the C/C++ version!
Regards, Jens
--
\ Jens Thoms Toerring ___ j...@toerring.de
\__________________________ http://toerring.de
> #!/usr/bin/perl
> use strict;
>
> print hashpjw( 'bar', 123 );
Call hashpjw with two args, a string 'bar' and a number 123
>
> sub hashpjw {
> my( $char, $s ) = @_;
Get two args: $char = 'bar' and $s = 123
> for my $char( $p = $s, $p, $p++ ) {
Here you define a new $char hiding the one that has 'bar'.
Also this loop is almost certainly not doing what you want, unless
what you want is very strange:
perl -e 'for $c ($p=123, $p, $p++) { print $c,"\n" }'
124
124
123
> Any assistance is greatly appreciated!
Start by treating perl as a different language than C. First off,
there are no pointers, so trying to do pointer arithmetic is, well,
pointless.
hi dave!
DF> In the book, "Programming in C++" by Stephen Dewhurst and Kathy Stark
DF> (1989, Prentice Hall, page 14), I find a short hashing subroutine
DF> ("hash" as in cryptographic hash, not an associative array) by the
DF> famous Peter Weinberger.
well, perl hashes are called that because they use a hash function to
allow O(1) (or close to it) access!
DF> The subroutine accepts a string, and returns a numerical value. One
DF> possible application of such a subroutine would be for storing (and
DF> retrieving) of a vast number of files (many thousands or millions) -
DF> the files could be stored in an arbitrary number of directories/
DF> filesystems (ie, "hash buckets") and retrieved based on the hash value
DF> of their filename.
which is also how perl hashes work!
DF> int
DF> hashpjw( char *s ) {
DF> const prime = 211;
DF> unsigned hash = 0, g;
DF> for( char *p = s; *p, p++ ) {
DF> hash = ( hash << 4 ) + *p;
DF> // assumes 32 bit int size
DF> if( g = hash & 0xf0000000 ) {
DF> hash ^= g >> 24;
DF> hsh ^= g;
DF> }
DF> }
DF> return hash % prime;
DF> }
DF> print hashpjw( 'bar', 123 );
you pass a string to $char but a number to $s
DF> sub hashpjw {
DF> my( $char, $s ) = @_;
the c++ code declares only a single char pointer arg which is similar to
a perl string. and as i pointed out above, you pass in two args.
DF> my $prime = 211;
DF> my ( $hash, $g, $p );
DF> for my $char( $p = $s, $p, $p++ ) {
that redeclares $char so the string arg you pass in is never used. also
this loop just scans the string to be hashed. there are many easy idioms
for this. this s about the simplest but may be slow on very long
strings:
foreach my $char ( split //, $s ) {
this works better on long strings:
while( my ($char) = $s =~ /(.)/sg ) {
DF> $hash = ( $hash << 4 ) + $p;
you never set $p to anything. it needs to be the next char. so change $p
to $char. *p is accessing the next char in the c++ string.
DF> if( $g = $hash & 0xf0000000 ) {
DF> $hash = $hash ^ ($g >> 24);
DF> $hash = $hash ^ $g;
DF> }
that looks ok but i can't tell without testing.
DF> }
DF> return $hash % $prime;
same here.
well, you never scan the input string which is never properly passed
in. fix those bugs and see what happens. also test this out with a
single char string so you can debug all the steps more easily.
uri
--
Uri Guttman ------ u...@stemsystems.com -------- http://www.sysarch.com --
----- Perl Code Review , Architecture, Development, Training, Support ------
--------- Free Perl Training --- http://perlhunter.com/college.html ---------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------
>David Filmer <use...@davidfilmer.com> wrote:
>> In the book, "Programming in C++" by Stephen Dewhurst and Kathy Stark
>> (1989, Prentice Hall, page 14), I find a short hashing subroutine
[snip]
>Try this instead (and pass it just the string)
>
>sub hashpjw {
> my ( $s, $prime, $hash ) = ( shift, 211, 0 );
>
> for ( split //, $s ) {
> $hash = ( $hash << 4 ) + ord $_;
> if ( my $g = $hash & 0xf0000000 ) {
> $hash ^= $g >> 24;
> $hash ^= $g;
> }
> }
>
> return $hash % $prime;
>}
>
The oputput for this is the same as the C function, for
'bar' its 168. However the C function as typed in won't compile
without some fixes.
If there were ever a need for speed, the last two $hash asignments
can be combined into a single asignment.
$hash = ($hash ^ $g >> 24) ^ $g;
-sln
[snip]
Jen has a nice Perl solution.
But, the C code won't compile under C++, probably some typo's.
C++ does not support default-int
There is no default type for const.
The for loop has a ',' where it should have a ';'.
Fixed up and speedier version:
int
hashpjw( char *s ) {
const int prime = 211;
unsigned hash = 0, g;
for( char *p = s; *p; p++ ) {
hash = ( hash << 4 ) + *p;
// assumes 32 bit int size
if( g = hash & 0xf0000000 )
hash = (hash ^ g >> 24) ^ g;
}
return hash % prime;
}
-sln
JTT> for ( split //, $s ) {
JTT> $hash = ( $hash << 4 ) + ord $_;
...
JTT> The Perl version as it is will happily accept e.g. Unicode
JTT> characters and then the result will differ from the C/C++ version!
An easy way to fix that is to replace the lines with
for ( unpack "C*", $s) {
$hash = ( $hash << 4 ) + $_;
which will probably also be faster, a nice property for a hashing
function (unpack over split, and no ord() call).
Ted
Is a char guaranteed to be unsigned in C++? I'm only sure it
isn't in C, so the behaviour of the original function (which
could be C or C++) wouldn't be well-defined at least in C for
anything than 7-bit (ASCII) chars as the input.
Using unpack() is definitely more elegant (and more in the
spirit of the original function should there be Unicode chars
in the string). All one would need to know is if "C*" or "c*"
would have to be used with unpack().