Trying to convert small C++ subroutine (by Peter Weinberger) to Perl

David Filmer

unread,

Feb 24, 2009, 1:36:22 AM2/24/09

to

Greetings.

In the book, "Programming in C++" by Stephen Dewhurst and Kathy Stark
(1989, Prentice Hall, page 14), I find a short hashing subroutine
("hash" as in cryptographic hash, not an associative array) by the
famous Peter Weinberger.

The subroutine accepts a string, and returns a numerical value. One
possible application of such a subroutine would be for storing (and
retrieving) of a vast number of files (many thousands or millions) -
the files could be stored in an arbitrary number of directories/
filesystems (ie, "hash buckets") and retrieved based on the hash value
of their filename.

I wish to convert this C++ subroutine to a Perl subroutine. But my
knowledge of C++ is limited to what I learned in a class in college -
I have never coded in C++ and I am severely deficient in this
language.

Based on my limited C++ skills, I have made an effort to do this
conversion, but it does not produce the expected results. I was
hoping someone here who was more familiar with C++ could point out the
error of my ways.

Here is subroutine from the book, based on Peter Weinberger's code:

int
hashpjw( char *s ) {
const prime = 211;
unsigned hash = 0, g;

for( char *p = s; *p, p++ ) {
hash = ( hash << 4 ) + *p;
// assumes 32 bit int size
if( g = hash & 0xf0000000 ) {
hash ^= g >> 24;
hsh ^= g;
}
}
return hash % prime;
}

Here is my <lame>attempt</lame> to convert it to Perl:

#!/usr/bin/perl
use strict;

print hashpjw( 'bar', 123 );

sub hashpjw {
my( $char, $s ) = @_;
my $prime = 211;
my ( $hash, $g, $p );
for my $char( $p = $s, $p, $p++ ) {
$hash = ( $hash << 4 ) + $p;
if( $g = $hash & 0xf0000000 ) {
$hash = $hash ^ ($g >> 24);
$hash = $hash ^ $g;
}
}
return $hash % $prime;
}
__END__

Any assistance is greatly appreciated!

Jens Thoms Toerring

unread,

Feb 24, 2009, 4:11:20 AM2/24/09

to

David Filmer <use...@davidfilmer.com> wrote:
> In the book, "Programming in C++" by Stephen Dewhurst and Kathy Stark
> (1989, Prentice Hall, page 14), I find a short hashing subroutine
> ("hash" as in cryptographic hash, not an associative array) by the
> famous Peter Weinberger.

> The subroutine accepts a string, and returns a numerical value. One
> possible application of such a subroutine would be for storing (and
> retrieving) of a vast number of files (many thousands or millions) -
> the files could be stored in an arbitrary number of directories/
> filesystems (ie, "hash buckets") and retrieved based on the hash value
> of their filename.

Mmmm, but you also have to take into account possible "hash
collisions"...

> I wish to convert this C++ subroutine to a Perl subroutine. But my
> knowledge of C++ is limited to what I learned in a class in college -
> I have never coded in C++ and I am severely deficient in this
> language.

> Based on my limited C++ skills, I have made an effort to do this
> conversion, but it does not produce the expected results. I was
> hoping someone here who was more familiar with C++ could point out the
> error of my ways.

> Here is subroutine from the book, based on Peter Weinberger's code:

> int
> hashpjw( char *s ) {
> const prime = 211;
> unsigned hash = 0, g;

> for( char *p = s; *p, p++ ) {
> hash = ( hash << 4 ) + *p;
> // assumes 32 bit int size
> if( g = hash & 0xf0000000 ) {
> hash ^= g >> 24;
> hsh ^= g;
> }
> }
> return hash % prime;
> }

Your 'prime' is awfully small, with that you never get more
than 211 different hash values...

> Here is my <lame>attempt</lame> to convert it to Perl:

> #!/usr/bin/perl
> use strict;

use warnings;

;-)

> print hashpjw( 'bar', 123 );

Why do you pass it two arguments if the C/C++ version only
takes a single one, the string of which you want to compute
the hash value?

> sub hashpjw {
> my( $char, $s ) = @_;
> my $prime = 211;
> my ( $hash, $g, $p );
> for my $char( $p = $s, $p, $p++ ) {
> $hash = ( $hash << 4 ) + $p;
> if( $g = $hash & 0xf0000000 ) {
> $hash = $hash ^ ($g >> 24);
> $hash = $hash ^ $g;
> }
> }
> return $hash % $prime;
> }

Try this instead (and pass it just the string)

sub hashpjw {
my ( $s, $prime, $hash ) = ( shift, 211, 0 );

for ( split //, $s ) {
$hash = ( $hash << 4 ) + ord $_;
if ( my $g = $hash & 0xf0000000 ) {
$hash ^= $g >> 24;
$hash ^= $g;
}
}

return $hash % $prime;
}

One warning: the C/C++ version is rather likely meant to be used
with 8-bit characters. The Perl version as it is will happily
accept e.g. Unicode characters and then the result will differ
from the C/C++ version!
Regards, Jens
--
\ Jens Thoms Toerring ___ j...@toerring.de
\__________________________ http://toerring.de

smallpond

unread,

Feb 24, 2009, 12:18:51 PM2/24/09

to

On Feb 24, 1:36 am, David Filmer <use...@DavidFilmer.com> wrote:

> #!/usr/bin/perl
> use strict;
>
> print hashpjw( 'bar', 123 );

Call hashpjw with two args, a string 'bar' and a number 123

>
> sub hashpjw {
> my( $char, $s ) = @_;

Get two args: $char = 'bar' and $s = 123

> for my $char( $p = $s, $p, $p++ ) {

Here you define a new $char hiding the one that has 'bar'.

Also this loop is almost certainly not doing what you want, unless
what you want is very strange:

perl -e 'for $c ($p=123, $p, $p++) { print $c,"\n" }'
124
124
123

> Any assistance is greatly appreciated!

Start by treating perl as a different language than C. First off,
there are no pointers, so trying to do pointer arithmetic is, well,
pointless.

Uri Guttman

unread,

Feb 24, 2009, 12:30:01 PM2/24/09

to

>>>>> "DF" == David Filmer <use...@DavidFilmer.com> writes:

hi dave!

DF> In the book, "Programming in C++" by Stephen Dewhurst and Kathy Stark
DF> (1989, Prentice Hall, page 14), I find a short hashing subroutine
DF> ("hash" as in cryptographic hash, not an associative array) by the
DF> famous Peter Weinberger.

well, perl hashes are called that because they use a hash function to
allow O(1) (or close to it) access!

DF> The subroutine accepts a string, and returns a numerical value. One
DF> possible application of such a subroutine would be for storing (and
DF> retrieving) of a vast number of files (many thousands or millions) -
DF> the files could be stored in an arbitrary number of directories/
DF> filesystems (ie, "hash buckets") and retrieved based on the hash value
DF> of their filename.

which is also how perl hashes work!

DF> int
DF> hashpjw( char *s ) {
DF> const prime = 211;
DF> unsigned hash = 0, g;

DF> for( char *p = s; *p, p++ ) {
DF> hash = ( hash << 4 ) + *p;
DF> // assumes 32 bit int size
DF> if( g = hash & 0xf0000000 ) {
DF> hash ^= g >> 24;
DF> hsh ^= g;
DF> }
DF> }
DF> return hash % prime;
DF> }

DF> print hashpjw( 'bar', 123 );

you pass a string to $char but a number to $s

DF> sub hashpjw {
DF> my( $char, $s ) = @_;

the c++ code declares only a single char pointer arg which is similar to
a perl string. and as i pointed out above, you pass in two args.

DF> my $prime = 211;
DF> my ( $hash, $g, $p );
DF> for my $char( $p = $s, $p, $p++ ) {

that redeclares $char so the string arg you pass in is never used. also
this loop just scans the string to be hashed. there are many easy idioms
for this. this s about the simplest but may be slow on very long
strings:

foreach my $char ( split //, $s ) {

this works better on long strings:

while( my ($char) = $s =~ /(.)/sg ) {

DF> $hash = ( $hash << 4 ) + $p;

you never set $p to anything. it needs to be the next char. so change $p
to $char. *p is accessing the next char in the c++ string.

DF> if( $g = $hash & 0xf0000000 ) {
DF> $hash = $hash ^ ($g >> 24);
DF> $hash = $hash ^ $g;
DF> }

that looks ok but i can't tell without testing.

DF> }
DF> return $hash % $prime;

same here.

well, you never scan the input string which is never properly passed
in. fix those bugs and see what happens. also test this out with a
single char string so you can debug all the steps more easily.

uri

--
Uri Guttman ------ u...@stemsystems.com -------- http://www.sysarch.com --
----- Perl Code Review , Architecture, Development, Training, Support ------
--------- Free Perl Training --- http://perlhunter.com/college.html ---------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------

s...@netherlands.com

unread,

Feb 24, 2009, 1:24:46 PM2/24/09

to

On 24 Feb 2009 09:11:20 GMT, j...@toerring.de (Jens Thoms Toerring) wrote:

>David Filmer <use...@davidfilmer.com> wrote:
>> In the book, "Programming in C++" by Stephen Dewhurst and Kathy Stark
>> (1989, Prentice Hall, page 14), I find a short hashing subroutine

[snip]

>Try this instead (and pass it just the string)
>
>sub hashpjw {
> my ( $s, $prime, $hash ) = ( shift, 211, 0 );
>
> for ( split //, $s ) {
> $hash = ( $hash << 4 ) + ord $_;
> if ( my $g = $hash & 0xf0000000 ) {
> $hash ^= $g >> 24;
> $hash ^= $g;
> }
> }
>
> return $hash % $prime;
>}
>

The oputput for this is the same as the C function, for
'bar' its 168. However the C function as typed in won't compile
without some fixes.

If there were ever a need for speed, the last two $hash asignments
can be combined into a single asignment.

$hash = ($hash ^ $g >> 24) ^ $g;

-sln

s...@netherlands.com

unread,

Feb 24, 2009, 1:32:30 PM2/24/09

to

[snip]

Jen has a nice Perl solution.
But, the C code won't compile under C++, probably some typo's.

C++ does not support default-int
There is no default type for const.
The for loop has a ',' where it should have a ';'.

Fixed up and speedier version:

int
hashpjw( char *s ) {

const int prime = 211;

unsigned hash = 0, g;

for( char *p = s; *p; p++ ) {

hash = ( hash << 4 ) + *p;
// assumes 32 bit int size
if( g = hash & 0xf0000000 )

hash = (hash ^ g >> 24) ^ g;
}
return hash % prime;
}

-sln

Ted Zlatanov

unread,

Feb 25, 2009, 4:16:16 PM2/25/09

to

On 24 Feb 2009 09:11:20 GMT j...@toerring.de (Jens Thoms Toerring) wrote:

JTT> for ( split //, $s ) {
JTT> $hash = ( $hash << 4 ) + ord $_;
...
JTT> The Perl version as it is will happily accept e.g. Unicode
JTT> characters and then the result will differ from the C/C++ version!

An easy way to fix that is to replace the lines with

for ( unpack "C*", $s) {
$hash = ( $hash << 4 ) + $_;

which will probably also be faster, a nice property for a hashing
function (unpack over split, and no ord() call).

Ted

Jens Thoms Toerring

unread,

Feb 25, 2009, 4:52:37 PM2/25/09

to

Is a char guaranteed to be unsigned in C++? I'm only sure it
isn't in C, so the behaviour of the original function (which
could be C or C++) wouldn't be well-defined at least in C for
anything than 7-bit (ASCII) chars as the input.

Using unpack() is definitely more elegant (and more in the
spirit of the original function should there be Unicode chars
in the string). All one would need to know is if "C*" or "c*"
would have to be used with unpack().

pww19...@gmail.com

unread,

Apr 21, 2015, 9:28:23 PM4/21/15

to

在 2009年2月24日星期二 UTC+8下午2:36:22，David Filmer写道：

Title: The core of the core of the big data solutions -- Map
Author: pengwenwei
Email: wenwei19710430
Language: c++
Platform: Windows, linux
Technology: Perfect hash algorithm
Level: Advanced
Description: Map algorithm with high performance
Section MFC c++ map stl
SubSection c++ algorithm
License: (GPLv3)

Download demo project - 1070 Kb
Download source - 1070 Kb

Introduction:
For the c++ program, map is used everywhere.And bottleneck of program performance is often the performance of map.Especially in the case of large data,and the business association closely and unable to realize the data distribution and parallel processing condition.So the performance of map becomes the key technology.

In the work experience with telecommunications industry and the information security industry, I was dealing with the big bottom data,especially the most complex information security industry data,all can’t do without map.

For example, IP table, MAC table, telephone number list, domain name resolution table, ID number table query, the Trojan horse virus characteristic code of cloud killing etc..

The map of STL library using binary chop, its has the worst performance.Google Hash map has the optimal performance and memory at present, but it has repeated collision probability.Now the big data rarely use a collision probability map,especially relating to fees, can’t be wrong.

Now I put my algorithms out here,there are three kinds of map,after the build is Hash map.We can test the comparison,my algorithm has the zero probability of collision,but its performance is also better than the hash algorithm, even its ordinary performance has no much difference with Google.

My algorithm is perfect hash algorithm,its key index and the principle of compression algorithm is out of the ordinary,the most important is a completely different structure,so the key index compression is fundamentally different.The most direct benefit for program is that for the original map need ten servers for solutions but now I only need one server.
Declare: the code can not be used for commercial purposes, if for commercial applications,you can contact me with QQ 75293192.
Download:
https://sourceforge.net/projects/pwwhashmap/files

Applications:
First,modern warfare can’t be without the mass of information query, if the query of enemy target information slows down a second, it could lead to the delaying fighter, leading to failure of the entire war. Information retrieval is inseparable from the map, if military products use pwwhashMap instead of the traditional map,you must be the winner.

Scond,the performance of the router determines the surfing speed, just replace open source router code map for pwwHashMap, its speed can increase ten times.
There are many tables to query and set in the router DHCP ptotocol,such as IP,Mac ,and all these are completed by map.But until now,all map are using STL liabrary,its performance is very low,and using the Hash map has error probability,so it can only use multi router packet dispersion treatment.If using pwwHashMap, you can save at least ten sets of equipment.

Third,Hadoop is recognized as the big data solutions at present,and its most fundamental thing is super heavy use of the map,instead of SQL and table.Hadoop assumes the huge amounts of data so that the data is completely unable to move, people must carry on the data analysis in the local.But as long as the open source Hadoop code of the map changes into pwwHashMap, the performance will increase hundredfold without any problems.

Background to this article that may be useful such as an introduction to the basic ideas presented:
http://blog.csdn.net/chixinmuzi/article/details/1727195