FAQ 4.41 How can I remove duplicate elements from a list or array?

PerlFAQ Server

unread,

Apr 18, 2006, 3:03:02 PM4/18/06

to

This is an excerpt from the latest version perlfaq4.pod, which
comes with the standard Perl distribution. These postings aim to
reduce the number of repeated questions as well as allow the community
to review and update the answers. The latest version of the complete
perlfaq is at http://faq.perl.org .

--------------------------------------------------------------------

4.41: How can I remove duplicate elements from a list or array?

(contributed by brian d foy)

Use a hash. When you think the words "unique" or "duplicated", think
"hash keys".

If you don't care about the order of the elements, you could just create
the hash then extract the keys. It's not important how you create that
hash: just that you use "keys" to get the unique elements.

my %hash = map { $_, 1 } @array;
# or a hash slice: @hash{ @array } = ();
# or a foreach: $hash{$_} = 1 foreach ( @array );

my @unique = keys %hash;

You can also go through each element and skip the ones you've seen
before. Use a hash to keep track. The first time the loop sees an
element, that element has no key in %Seen. The "next" statement creates
the key and immediately uses its value, which is "undef", so the loop
continues to the "push" and increments the value for that key. The next
time the loop sees that same element, its key exists in the hash *and*
the value for that key is true (since it's not 0 or undef), so the next
skips that iteration and the loop goes to the next element.

my @unique = ();
my %seen = ();

foreach my $elem ( @array )
{
next if $seen{ $elem }++;
push @unique, $elem;
}

You can write this more briefly using a grep, which does the same thing.

my %seen = ();
my @unique = grep { ! $seen{ $_ }++ } @array;

--------------------------------------------------------------------

The perlfaq-workers, a group of volunteers, maintain the perlfaq. They
are not necessarily experts in every domain where Perl might show up,
so please include as much information as possible and relevant in any
corrections. The perlfaq-workers also don't have access to every
operating system or platform, so please include relevant details for
corrections to examples that do not work on particular platforms.
Working code is greatly appreciated.

If you'd like to help maintain the perlfaq, see the details in
perlfaq.pod.
*** Posted via a free Usenet account from http://www.teranews.com ***

robic0

unread,

Apr 18, 2006, 5:18:20 PM4/18/06

to

This may be more efficient:

$lastelement = undef;
foreach $element (sort @array)
{
if ($lastelement != $element) {
push @newarray, $element;
$lastelement = $element;
}
}

Paul Lalli

unread,

Apr 18, 2006, 6:16:55 PM4/18/06

to

robic0 wrote:
> On Tue, 18 Apr 2006 12:03:02 -0700, PerlFAQ Server <br...@stonehenge.com> wrote:
>
> >4.41: How can I remove duplicate elements from a list or array?

> > my %seen = ();
> > my @unique = grep { ! $seen{ $_ }++ } @array;

> This may be more efficient:
>
> $lastelement = undef;
> foreach $element (sort @array)
> {
> if ($lastelement != $element) {
> push @newarray, $element;
> $lastelement = $element;
> }
> }

"may be"? Why guess? Why not actually find out?

#!/usr/bin/perl
use strict;
use warnings;
no warnings 'uninitialized';
use Benchmark qw/cmpthese/;

my @unsorted = map { int rand(100) } 1..1000;

cmpthese( -10,
{
'sort' => sub {
my $lastelement = undef;
my @newarray;
foreach my $element (sort @unsorted) {

if ($lastelement != $element) {
push @newarray, $element;
$lastelement = $element;
}
}

},

'hash' => sub {
my %seen = ();
my @unique = grep { ! $seen{ $_ }++ } @unsorted;
}
}
);
__END__

Benchmark: running hash, sort, each for at least 10 CPU seconds...
hash: 10 wallclock secs (10.51 usr + 0.00 sys = 10.51 CPU) @
358.71/s (n=3770)
sort: 10 wallclock secs (10.53 usr + 0.00 sys = 10.53 CPU) @
197.53/s (n=2080)
Rate sort hash
sort 198/s -- -45%
hash 359/s 82% --

Doesn't look like your method is more efficient...

Paul Lalli

robic0

unread,

Apr 18, 2006, 8:40:15 PM4/18/06

to

this looks like my code

> },
>
> 'hash' => sub {
> my %seen = ();
> my @unique = grep { ! $seen{ $_ }++ } @unsorted;
> }

this doesen't look like FAQ code

Lukas Mai

unread,

Apr 18, 2006, 8:57:32 PM4/18/06

to

robic0 schrob:
[perlfaq fullquote snipped]

>
> This may be more efficient:
>
> $lastelement = undef;
> foreach $element (sort @array)
> {
> if ($lastelement != $element) {

use warnings; use strict;
Using != with undef is a bad idea, even more so as you're apparently
dealing with numbers here: 0 == undef.

But wait, if @array contains numbers, why did you use a lexicographical
sort above?

robic0

unread,

Apr 18, 2006, 8:58:40 PM4/18/06

to

On 18 Apr 2006 15:16:55 -0700, "Paul Lalli" <mri...@gmail.com> wrote:

Don't know bench too well, is the hash doing twice the op/sec as
the sort? Dunno. Looking back in my minds eye, just don't see a minimal
other than what I posted. A sort must be done once, the array transversed,
a new array created. Is it just me or have you just invented some magic
Einstien would be proud of??

robic0

unread,

Apr 18, 2006, 9:02:31 PM4/18/06

to

On Wed, 19 Apr 2006 02:57:32 +0200, "Lukas Mai" <rwxr...@gmx.de> wrote:

>robic0 schrob:
>[perlfaq fullquote snipped]
>>
>> This may be more efficient:
>>
>> $lastelement = undef;
>> foreach $element (sort @array)
>> {
>> if ($lastelement != $element) {
>
>use warnings; use strict;
>Using != with undef is a bad idea, even more so as you're apparently
>dealing with numbers here: 0 == undef.

yes, sorry u are correct.

>
>But wait, if @array contains numbers, why did you use a lexicographical
>sort above?

Its either all numeric or all chareacters. The 'mix' goes to characters...

Paul Lalli

unread,

Apr 19, 2006, 7:29:38 AM4/19/06

to

robic0 wrote:
> On 18 Apr 2006 15:16:55 -0700, "Paul Lalli" <mri...@gmail.com> wrote:
> > 'hash' => sub {
> > my %seen = ();
> > my @unique = grep { ! $seen{ $_ }++ } @unsorted;
> > }
> this doesen't look like FAQ code

Yes it does. I quoted the FAQ code in my post, which you then
conveniently snipped. Look at the bottom two lines of the FAQ. They
are, once again:

> > > my %seen = ();
> > > my @unique = grep { ! $seen{ $_ }++ } @array;

Paul Lalli

unread,

Apr 19, 2006, 7:32:37 AM4/19/06

to

robic0 wrote:
> On 18 Apr 2006 15:16:55 -0700, "Paul Lalli" <mri...@gmail.com> wrote:
> >Benchmark: running hash, sort, each for at least 10 CPU seconds...
> > hash: 10 wallclock secs (10.51 usr + 0.00 sys = 10.51 CPU) @
> >358.71/s (n=3770)
> > sort: 10 wallclock secs (10.53 usr + 0.00 sys = 10.53 CPU) @
> >197.53/s (n=2080)
> > Rate sort hash
> >sort 198/s -- -45%
> >hash 359/s 82% --
>

> Don't know bench too well, is the hash doing twice the op/sec as
> the sort?

Yes, exactly correct. the sort() method is able to run its block 198
times in one second, whereas the hash method is able to run its block
359 times in one second.

> Dunno. Looking back in my minds eye, just don't see a minimal
> other than what I posted.

Your mind's eye needs glasses then, because both the FAQ and my post
just gave it to you.

> A sort must be done once, the array transversed,
> a new array created.

Which is exactly the problem. Sorting is not cheap. Sorting takes a
long time, whereas a linear traversal of a single array combined with
hash lookups are very quick. That's the point of a hash.

> Is it just me or have you just invented some magic
> Einstien would be proud of??

No, I've simply used the correct feature for the given problem, as
recommended by the FAQ.

Paul Lalli

robic0

unread,

Apr 20, 2006, 12:38:03 PM4/20/06

to

On 18 Apr 2006 15:16:55 -0700, "Paul Lalli" <mri...@gmail.com> wrote:

Looks like not in this case.

#!/usr/bin/perl
use strict;
use warnings;

use Benchmark qw/cmpthese/;

my @unsorted = map { "int rand(100)" } 1..1000;

cmpthese( -10,
{
'sort' => sub {

my $lastelement = "-1";

my @newarray;
foreach my $element (sort @unsorted) {

if ($lastelement ne $element) {

push @newarray, $element;
$lastelement = $element;
}
}
},

'hash' => sub {
my %seen = ();
my @unique = grep { ! $seen{ $_ }++ } @unsorted;
}
}
);
__END__

Rate hash sort
hash 3520/s -- -22%
sort 4541/s 29% --

If the mapping is correct and the grep wasn't tailored to numbers,
there may be a performance hit with large strings as keys.
Otherwise I would have said that keys are grown and kept sorted using
binary search, which of course is not involved in sorting.

Uri Guttman

unread,

Apr 20, 2006, 1:04:41 PM4/20/06

to

>>>>> "r" == robic0 <robic0> writes:

r> my @unsorted = map { "int rand(100)" } 1..1000;

check out the results of that line. it ain't doing what you think it is
doing. it also makes your code do much less work than the hash code so
your benchmark is royally screwed up. now be nice about this and look
carefully at that line (print the output to see the bug) and admit your
example is faulty.

uri

--
Uri Guttman ------ u...@stemsystems.com -------- http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org

Paul Lalli

unread,

Apr 20, 2006, 1:08:03 PM4/20/06

to

robic0 wrote:
> On 18 Apr 2006 15:16:55 -0700, "Paul Lalli" <mri...@gmail.com> wrote:
>
> >my @unsorted = map { int rand(100) } 1..1000;
> my @unsorted = map { "int rand(100)" } 1..1000;

Do you not see a difference between what got put in my array and what
got put in yours?

Paul Lalli

robic0

unread,

Apr 20, 2006, 1:12:48 PM4/20/06

to

On second look, the mapping produces the same string in a 1000 element array..
I would imagine the numbers represent a boundry condition anomoly in the grep.
So I'm going to check with real large strings and see if it has an affect.
For now, I'm going on the premise that hash keys are grown and lookup is accomplished
with a binary search. And by that assuming the statistical comparisons are roughly
%60-75 as much as a straight bubble sort plus the old/new and asignment comparison
overhead. A memmove is involved if the key array is kept sorted while it is grown
(using a binary search), a C primitive not available in Perl. An additional
binary search is required for the lookup, test for seen, although this could be
combined in one operation in this case.

robic0

unread,

Apr 20, 2006, 1:16:12 PM4/20/06

to

On Thu, 20 Apr 2006 13:04:41 -0400, Uri Guttman <u...@stemsystems.com> wrote:

>>>>>> "r" == robic0 <robic0> writes:
>
> r> my @unsorted = map { "int rand(100)" } 1..1000;
>
>check out the results of that line. it ain't doing what you think it is
>doing. it also makes your code do much less work than the hash code so
>your benchmark is royally screwed up. now be nice about this and look
>carefully at that line (print the output to see the bug) and admit your
>example is faulty.
>
>uri

I picked it up and was typing a followup right after I posted. Its a boundry condition,
still it breaks the benchmark.

robic0

unread,

Apr 20, 2006, 1:17:56 PM4/20/06

to

I saw it 1 second after I posted it. There's a follow up. Still it breaks
the bench's.

robic0

unread,

Apr 20, 2006, 2:08:43 PM4/20/06

to

Ok, just need some clarification here.
I can't quite make out this syntax.
I am asuming that $seen{ $_ }++ means exists so the '!'
returns true/false.
It might be in this context (!$seen{ $_ })++, dunno, its a moot point.

The fact is that a hash key lookup is both searched for and if not exists,
then added to the array. The syntax is irrelavent!
I will have to stand by my analogy that binary search is used in both the searching
for the key and the existence simutaneously.
This is the only way it could outpace a simple bubble sort.

However, this begs the question of why there is no direct memmove primitive in Perl?
The user should be given the option. Memmove is the core to growing/shrinking arrays at mid-point,
not just at the beginning and end with the shift/unshift and push/pop.

I've spent my wadd, Johnny would be proud.....

Juha Laiho

unread,

Apr 20, 2006, 2:39:23 PM4/20/06

to

robic0 said:
(referring to code

my @unique = grep { ! $seen{ $_ }++ } @array;

in the FAQ 4.41 entry)

>Ok, just need some clarification here.
>I can't quite make out this syntax.
>I am asuming that $seen{ $_ }++ means exists so the '!'
>returns true/false.

Yep, quite a few things happen "at once" here.
- the "seen" hash is being searched for the key $_
- if the key wasn't found, a new key-value pair will be created,
the value defaulting to "undef"
- the value for the key (either the existing value, or the newly
created value) is returned from the expression
- the stored value (but not the returned!) is incremented by one,
thus, if the value was newly created, it will now become "1"
- the ! operator will make a logic negation, thus if the hash
operation returned "undef" (for a key that didn't previously exist
in the hash), the grep will see the result as logic "true"; all
later accesses with the same key will return false (as ! applied
to any nonzero number will return "false")

... so, "unique" array will be assigned only unique values from
the "array" array -- and in the exact order in which they appear in
the "array".

>The fact is that a hash key lookup is both searched for and if not exists,
>then added to the array. The syntax is irrelavent!

The syntax is relevant to that extent, that it returns different result
for the first access with a given key then for any subsequent access
with the same key.

>I will have to stand by my analogy that binary search is used in both
>the searching for the key and the existence simutaneously.

It's a hash lookup, which has potential to be faster than binary search
for the common case.

>However, this begs the question of why there is no direct memmove
>primitive in Perl?

Why should there be? Perl is on a different abstraction level (much of
the time, at least). Memory addresses are not a typical abstraction
in perl. The closest to memmove is that array (and hash) slices are
assignable. But as you saw, the memmove is irrelevant for the hash
case -- and behind the scenes, building the hash could very well use
the C memmove. But on the Perl level you just don't need to care.
Trust that if you use hash in the correct place for your code, it
will perform fast enough.

>The user should be given the option. Memmove is the core to
>growing/shrinking arrays at mid-point,
>not just at the beginning and end with the shift/unshift and push/pop.

As said, array and hash slices are assignable.
--
Wolf a.k.a. Juha Laiho Espoo, Finland
(GC 3.0) GIT d- s+: a C++ ULSH++++$ P++@ L+++ E- W+$@ N++ !K w !O !M V
PS(+) PE Y+ PGP(+) t- 5 !X R !tv b+ !DI D G e+ h---- r+++ y++++
"...cancel my subscription to the resurrection!" (Jim Morrison)

Paul Lalli

unread,

Apr 20, 2006, 2:44:48 PM4/20/06

to

robic0 wrote:
> On 20 Apr 2006 10:08:03 -0700, "Paul Lalli" <mri...@gmail.com> wrote:
>
> >robic0 wrote:
> >> On 18 Apr 2006 15:16:55 -0700, "Paul Lalli" <mri...@gmail.com> wrote:
> >>
> >> >my @unsorted = map { int rand(100) } 1..1000;
> >> my @unsorted = map { "int rand(100)" } 1..1000;
> >
> >Do you not see a difference between what got put in my array and what
> >got put in yours?

> I saw it 1 second after I posted it. There's a follow up. Still it breaks
> the bench's.

Of course it doesn't. You've pretty clearly forgotten the original
point of this. The FAQ states "How can I remove duplicate elements
from a list or array?". If you want to go ahead and create a case in
which all 1000 elements are the same damned thing, making the final 999
of them "duplicates", then I have an even faster solution for you:

my @unsorted = map { "the same string" } 1..1000;
my @array = $unsorted[0];

Paul Lalli

unread,

Apr 20, 2006, 2:50:27 PM4/20/06

to

robic0 wrote:
> I will have to stand by my analogy that binary search is used in both the searching
> for the key and the existence simutaneously.

Where in the hell did you get that idea from? A hash is, by
definition, not ordered. So how could there possibly be a binary
search involved? It's a hash. "searches" and insertions are done via
a hashing function.

> This is the only way it could outpace a simple bubble sort.

It's time for you to go read an introductory computer science book.
You seem to be under the very bizarre impression that "bubble sort" is
the only kind of sorting. "Bubble sort" is not used in any code other
than to introduce students to the concept of sorting. It is the worst
form of sorting possible, next to chaos sort[1] anyway

You need to read up on what hashes actually are. When you read that
introductory CompSci book like I just suggested, pay close attention to
the section on "hash functions".

Paul Lalli

[1] That is, randomly choose two numbers from 0 to $#array. Swap those
positions. Check if the array is now ordered. If so, done. If not,
go back to step 1.

robic0

unread,

Apr 20, 2006, 3:09:32 PM4/20/06

to

Oh, sounds like 'quick sort'. Yes I know 'buble sort' is one of the slowest.
There's only about 7 popular sorting methods. After looking at the code
in 1987, the thrill didn't sink in.

I assume the hash is sorted when its created, or is a 'quick sort' used in
lookup as well? If not, the problem then is in the lookup.
If the insertion (whatever mechanics that involves) overlays the test for
prior existence, then your method beats mine. Otherwise it shouldn't.
Is a independent lookup not a binary search, or, if keys are not sorted,
does a 'quick sort' act as a lookup?

robic0

unread,

Apr 20, 2006, 3:11:27 PM4/20/06

to

Or apparently a change in sort method is need on my side.

Paul Lalli

unread,

Apr 20, 2006, 3:13:20 PM4/20/06

to

robic0 wrote:
> I assume the hash is sorted when its created, or is a 'quick sort' used in
> lookup as well?

You are *really* not listening here. It's a HASH. There IS NO SORT
involved. NONE. No bubble sort, no quick sort, no merge sort. NO
SORTS. It's a HASHING FUNCTION.

Go. Read. Learn.

Paul Lalli

robic0

unread,

Apr 20, 2006, 3:15:36 PM4/20/06

to

Ok u win. But its not systematic..

robic0

unread,

Apr 20, 2006, 3:24:18 PM4/20/06

to

Ok, for christs sakes, I'll find out. Guess I've been avoiding it
long enough.
By default for along time now, I've know the primitives applicaplable accross
all lang's. Beware though, you send me on a learning exercise and I find out
something I've know for 20 years, I'm coming back here and gonna beat the
be-jesus out of you!

Is there something simple about hashes you would like to tell me, to correct
my grey area?

Paul Lalli

unread,

Apr 20, 2006, 3:29:43 PM4/20/06

to

robic0 wrote:
> On 20 Apr 2006 12:13:20 -0700, "Paul Lalli" <mri...@gmail.com> wrote:
>
> >robic0 wrote:
> >> I assume the hash is sorted when its created, or is a 'quick sort' used in
> >> lookup as well?
> >
> >You are *really* not listening here. It's a HASH. There IS NO SORT
> >involved. NONE. No bubble sort, no quick sort, no merge sort. NO
> >SORTS. It's a HASHING FUNCTION.
> >
> >Go. Read. Learn.

> Ok, for christs sakes, I'll find out. Guess I've been avoiding it
> long enough.
> By default for along time now, I've know the primitives applicaplable accross
> all lang's. Beware though, you send me on a learning exercise and I find out
> something I've know for 20 years

You continuously refer to "sorting" and "searching" as pertains to
hashes. I assure you, you haven't known what you should about Hash
Tables for 20 years.

> , I'm coming back here and gonna beat the
> be-jesus out of you!

Bring it on.

> Is there something simple about hashes you would like to tell me, to correct
> my grey area?

No, I'm done trying to help you. You have shown a remarkable lack of
desire to actually *learn*. More than enough of my time has been
wasted on this thread. If you find the answers on your own,
congratulations. But I'm not going to hand-hold you any more.

Paul Lalli

robic0

unread,

Apr 20, 2006, 4:25:00 PM4/20/06

to

Yea, whatever... I mean your right of course!

>
>>The fact is that a hash key lookup is both searched for and if not exists,
>>then added to the array. The syntax is irrelavent!
>
>The syntax is relevant to that extent, that it returns different result
>for the first access with a given key then for any subsequent access
>with the same key.
>

Yea, right... I mean true of course!

>>I will have to stand by my analogy that binary search is used in both
>>the searching for the key and the existence simutaneously.
>
>It's a hash lookup, which has potential to be faster than binary search
>for the common case.

Ok, now this... According to someone else who I won't name (Paul Lalli),
claims the hash keys are never sorted, nor are the keys indexed, either
seperately, or as a result of a concatination re-nameing scheme.
So I would imagine *this* is the heart of the whole thing right here.
That other person seems to imply (as do you), that in the process of
creating an array of hash key's and subsequent lookup, no sorting is done,
none! By magic, therefore, a non-linear lookup is involved and something
that does not have the bad word 'sort' attached (even though sort/lookup
can be done simultaneous in certain circumstances).

I'll state this one time: "You can run from sort, but you can't hide".
The funny thing about sort is, the same technique surely can be used
anywhere, in any code, in any language, in identical form.

All the benchmark disparity boils down to 'possible' different sorting
techniques used in "sort" and "hash" creation/lookup. The question of
if sorting is happening on both sides is absolutely *not* in doubt!!!
The internal "behind the scenes" of expansion/contraction of HASH
is that its close to the os primatives of memove, malloc and realloc.

So, while it appears on the surface that hashing produces better results,
there is no underlying power, in principle, that indeed, the same
concept would appear to have equal weight on both sides.
The fault is that Perl hash is heavily wieghted with the underlying
os core primatives. The abstraction is identical, both compare based on the
boundries of comparisons in the sort, if they are identical.

>
>>However, this begs the question of why there is no direct memmove
>>primitive in Perl?
>
>Why should there be? Perl is on a different abstraction level (much of
>the time, at least). Memory addresses are not a typical abstraction
>in perl. The closest to memmove is that array (and hash) slices are
>assignable. But as you saw, the memmove is irrelevant for the hash
>case -- and behind the scenes, building the hash could very well use
>the C memmove. But on the Perl level you just don't need to care.
>Trust that if you use hash in the correct place for your code, it
>will perform fast enough.
>

Its hard to get my head around having hash (with its C memory primitives)
as the closest I can come to unique in Perl.

>>The user should be given the option. Memmove is the core to
>>growing/shrinking arrays at mid-point,
>>not just at the beginning and end with the shift/unshift and push/pop.
>
>As said, array and hash slices are assignable.

Yes, this is good. Alot of effort seems to have been put in Perls hash.
If I had a clear path to Perl intrinsic's that is performance designed
directly to C memory primatives then I quite possibly could find a thousand
uses for it as I did in C primitaves knowing syntax would not get in the way.
Don't you think?
Dunno......

Anno Siegel

unread,

Apr 20, 2006, 4:48:10 PM4/20/06

to

Paul Lalli <mri...@gmail.com> wrote in comp.lang.perl.misc:

> robic0 wrote:
> > On Tue, 18 Apr 2006 12:03:02 -0700, PerlFAQ Server
> <br...@stonehenge.com> wrote:
> >
> > >4.41: How can I remove duplicate elements from a list or array?
> > > my %seen = ();
> > > my @unique = grep { ! $seen{ $_ }++ } @array;

[...]

If you want fast, compare

fasthash => sub {
my %seen;
@seen{ @unsorted} = ();
my @unique = keys %seen;
},

It's more than twice the speed of "hash" on my machine. The grep() method
in "hash" returns the unique elements in their original order. If it matters
at all, it usually matters a lot, and the grep method is used.

Anno
--
If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers.

DJ Stunks

unread,

Apr 20, 2006, 4:48:40 PM4/20/06

to

PerlFAQ Server wrote:
> --------------------------------------------------------------------

>
> 4.41: How can I remove duplicate elements from a list or array?

I suggest this FAQ be amended to include L::MU's uniq.

That is all.

-jp

A. Sinan Unur

unread,

Apr 20, 2006, 5:04:25 PM4/20/06

to

> robic0 wrote:
...

>> Is there something simple about hashes you would like to tell me, to
>> correct my grey area?

Your problems are not related to any grey areas. Your problem, simply, is
the nonexistence of grey matter in the appropriate places.

http://en.wikipedia.org/wiki/Grey_matter

Sinan

--
A. Sinan Unur <1u...@llenroc.ude.invalid>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html

robic0

unread,

Apr 20, 2006, 5:21:56 PM4/20/06

to

On Thu, 20 Apr 2006 21:04:25 GMT, "A. Sinan Unur" <1u...@llenroc.ude.invalid> wrote:

>> robic0 wrote:
>...
>>> Is there something simple about hashes you would like to tell me, to
>>> correct my grey area?
>
>Your problems are not related to any grey areas. Your problem, simply, is
>the nonexistence of grey matter in the appropriate places.
>
>http://en.wikipedia.org/wiki/Grey_matter
>
>Sinan

Very funny...

Uri Guttman

unread,

Apr 20, 2006, 5:34:57 PM4/20/06

to

>>>>> "r" == robic0 <robic0> writes:

r> On Thu, 20 Apr 2006 13:04:41 -0400, Uri Guttman <u...@stemsystems.com> wrote:
>>>>>>> "r" == robic0 <robic0> writes:
>>
r> my @unsorted = map { "int rand(100)" } 1..1000;
>>
>> check out the results of that line. it ain't doing what you think it is
>> doing. it also makes your code do much less work than the hash code so
>> your benchmark is royally screwed up. now be nice about this and look
>> carefully at that line (print the output to see the bug) and admit your
>> example is faulty.
>>
>> uri

r> I picked it up and was typing a followup right after I posted. Its
r> a boundry condition, still it breaks the benchmark.

boundary condition?? it is a bug in your code. do you even see the bug
in that line?

robic0

unread,

Apr 20, 2006, 5:39:42 PM4/20/06

to

On Thu, 20 Apr 2006 17:34:57 -0400, Uri Guttman <u...@stemsystems.com> wrote:

>>>>>> "r" == robic0 <robic0> writes:
>
> r> On Thu, 20 Apr 2006 13:04:41 -0400, Uri Guttman <u...@stemsystems.com> wrote:
> >>>>>>> "r" == robic0 <robic0> writes:
> >>
> r> my @unsorted = map { "int rand(100)" } 1..1000;
> >>
> >> check out the results of that line. it ain't doing what you think it is
> >> doing. it also makes your code do much less work than the hash code so
> >> your benchmark is royally screwed up. now be nice about this and look
> >> carefully at that line (print the output to see the bug) and admit your
> >> example is faulty.
> >>
> >> uri
>
> r> I picked it up and was typing a followup right after I posted. Its
> r> a boundry condition, still it breaks the benchmark.
>
>boundary condition?? it is a bug in your code. do you even see the bug
>in that line?
>
>uri

Nope, it compiles and runs just fine.....

int rand(100) int rand(100) int rand(100) int rand(100) int rand(100) int rand(100)
int rand(100) int rand(100) int rand(100) int rand(100) int rand(100) int rand(100)
int rand(100) int rand(100) int rand(100) int rand(100) int rand(100) int rand(100)
int rand(100) int rand(100) int rand(100) int rand(100) int rand(100) int rand(100)

Uri Guttman

unread,

Apr 20, 2006, 6:18:45 PM4/20/06

to

>>>>> "r" == robic0 <robic0> writes:

>> >>>>>>> "r" == robic0 <robic0> writes:
>> >>
r> my @unsorted = map { "int rand(100)" } 1..1000;

r> Nope, it compiles and runs just fine.....

r> int rand(100) int rand(100) int rand(100) int rand(100) int
r> rand(100) int rand(100) int rand(100) int rand(100) int rand(100)
r> int rand(100) int rand(100) int rand(100) int rand(100) int
r> rand(100) int rand(100) int rand(100) int rand(100) int rand(100)
r> int rand(100) int rand(100) int rand(100) int rand(100) int
r> rand(100) int rand(100)

and you call that working? compiling and running does not mean
working. that data is useless to test duplicate removals as it is all
duplicate data. it is especially useless to benchmark it since that
degenerate data set works better with your poor algorithm that the other
one.

brian d foy

unread,

Apr 20, 2006, 5:56:56 PM4/20/06

to

In article <1145566120....@z34g2000cwc.googlegroups.com>, DJ
Stunks <DJSt...@gmail.com> wrote:

patched and submitted. Thanks, :)
*** Posted via a free Usenet account from http://www.teranews.com ***

DJ Stunks

unread,

Apr 20, 2006, 8:04:47 PM4/20/06

to

brian d foy wrote:
> In article <1145566120....@z34g2000cwc.googlegroups.com>, DJ
> Stunks <DJSt...@gmail.com> wrote:
>
> > PerlFAQ Server wrote:
> > > --------------------------------------------------------------------
> > >
> > > 4.41: How can I remove duplicate elements from a list or array?
> >
> > I suggest this FAQ be amended to include L::MU's uniq.
>
> patched and submitted. Thanks, :)

I'm a contributor, baby!

(sheds leech costume)

Sweet!

hahahaha,
-jp

robic0

unread,

Apr 23, 2006, 4:54:16 PM4/23/06

to

On Thu, 20 Apr 2006 18:18:45 -0400, Uri Guttman <u...@stemsystems.com> wrote:

>>>>>> "r" == robic0 <robic0> writes:
>
> >> >>>>>>> "r" == robic0 <robic0> writes:
> >> >>
> r> my @unsorted = map { "int rand(100)" } 1..1000;
>
> r> Nope, it compiles and runs just fine.....
>
> r> int rand(100) int rand(100) int rand(100) int rand(100) int
> r> rand(100) int rand(100) int rand(100) int rand(100) int rand(100)
> r> int rand(100) int rand(100) int rand(100) int rand(100) int
> r> rand(100) int rand(100) int rand(100) int rand(100) int rand(100)
> r> int rand(100) int rand(100) int rand(100) int rand(100) int
> r> rand(100) int rand(100)
>
>and you call that working? compiling and running does not mean
>working. that data is useless to test duplicate removals as it is all
>duplicate data. it is especially useless to benchmark it since that
>degenerate data set works better with your poor algorithm that the other
>one.
>
>uri

So let me get this strait. A faq hash method used to identify and remove duplicates doesn't work
as well as my sort method on duplicates? Did I not provide duplicates? Does the hash method fall
off in performance the more duplicates?

Brad Baxter

unread,

Apr 24, 2006, 9:54:17 AM4/24/06

to

He did.

http://en.wikipedia.org/wiki/Hash_function

--
Brad

David Squire

unread,

Apr 24, 2006, 10:10:23 AM4/24/06

to

Brad Baxter wrote:
> robic0 wrote:

[snip]

>> Is there something simple about hashes you would like to tell me, to correct
>> my grey area?
>
> He did.
>
> http://en.wikipedia.org/wiki/Hash_function
>

Even better, http://en.wikipedia.org/wiki/Hash_table, since the article
cited above focuses more on cryptographic hash functions.

DS

Juha Laiho

unread,

Apr 29, 2006, 3:54:41 AM4/29/06

to

robic0 said:
>Ok, now this... According to someone else who I won't name (Paul Lalli),
>claims the hash keys are never sorted, nor are the keys indexed, either
>seperately, or as a result of a concatination re-nameing scheme.
>So I would imagine *this* is the heart of the whole thing right here.
>That other person seems to imply (as do you), that in the process of
>creating an array of hash key's and subsequent lookup, no sorting is done,
>none! By magic, therefore, a non-linear lookup is involved and something
>that does not have the bad word 'sort' attached (even though sort/lookup
>can be done simultaneous in certain circumstances).

Apologies for the delayed reply.

I'll try to explain the hash thing here, and why it doesn't sort;
this is just a generic explanation of hash lookup mechanism, and
not referring to the perl impelmentation -- but you should get the
overall picture.

When a key is stored in a hash, a "hash value" is computed from the key.
This hash value then determines the "storage bucket" for the key. There
can be several different keys that provide a single hash value.
So, a rough categorization of the keys according to some mathematical
formula is done. The contents of the bucket are then stored in some
structure; most often a list. The list search time is linear, but
the hash mechanism should provide enough buckets to keep the lists
(mostly) short.

When a key is being searched for from a hash, the same "hash value"
computation is done to find the correct bucket; the bucket contents
(list) are then being searched for an exact match for the key.

A very simplistic hash value computation could be to just pick the
first letter from the key to be indexed; you'd get buckets by each
letter. In reality, the value computation is more complex (but still
relatively simple, to keep it fast). But as you see, with this kind
of mechanism, there is no sorting of the actual data - and even the
hash values cannot be used to sort the data, as "alpha" and "zacharias"
could easily provide the same hash value (for some formula). So,
hashing helps to keep things searchable - but not in any specific
order.

Ordering things will make the collection searchable as well, but
at a cost higher than that needed for hashing. So, if searchability
is the only requirement, why pay extra for keeping the order?

>So, while it appears on the surface that hashing produces better results,
>there is no underlying power, in principle, that indeed, the same
>concept would appear to have equal weight on both sides.
>The fault is that Perl hash is heavily wieghted with the underlying
>os core primatives. The abstraction is identical, both compare based on the
>boundries of comparisons in the sort, if they are identical.

Nope; you'd do well to read at least
http://en.wikipedia.org/wiki/Associative_array#Efficient_representations
and perhaps also skim
http://en.wikipedia.org/wiki/Hash_table
to get better understanding of the hash table mechanism.

robic0

unread,

Apr 29, 2006, 9:35:10 PM4/29/06

to

Ok, just read this (your late post, heh). Since my last post (I guess to yours),
someone posted a link to hash's on wikepedia that I spent 3 minutes on before
I ran into code that would extend my want to be there time.
http://en.wikipedia.org/wiki/Hash_table

I've known about md5 for along time but had no idea it was a representation of a
cyrptic hash. Whatever a hash is, I know for sure, but in its primative state
before md5 (as hash aproaches uniqueness), the algol's result in as you describe
above. I have a very weird background loaded with math, engineering, liberal arts, etc..
Numerical methods (itterations), Gauss-Jordon, Runge-Kutta, and on and on. Mostly from the old days
of Univac 1100 and Ibm-370. Fortran, Nastran, Finite Element. Vectors/Matrices.
All computer driven. Doesn't matter. Then I have this other background, this machine background
from doing hundreds of thousands of lines of assembly on micro-mini's.

I have certainly heard the term 'buckets' before. I discount all those kind of terms.
I can certainly tell you that I am enforcing my own standards and ignore the newage
re-terminology of something that was first thought about probably a hundred years ago.
I intensly studied methodology when I was young. With my original Mech Engineering
background before computers, I learned formal analysis that doesen't exist to that extent
in computer science. Seems someone is always renaming the old known methods and re-inventing
the wheel. I'ts actually funny. I see new terms, then look up the definition and see its
been renamed as a whole new idea! See how I can get pissed off?

While I was skimming that page you mentioned, I ran into several things. I was only clicking
links for 20 min's. I did not want to (and don't want to) drill down to multiple core principles
unless I have to. The opening page suggested a key/value pair was a record with an index magically
generated returned from a hashing function, with the key as input, as you mentioned.
That index was like from 0..300 (doesn't matter). The output of the hash function was a alpha-numeric
mix. Forget about the collisions which was covered as some kind of hacker invader countermeasures,
you can't get from (alpha-numeric) key to distinct array index with a hash function, its not possible!!!
Its a psuedo index at best, an index to an index. I didn't read all the algo's. I may.

The simple bucket theory you mentioned surely falls appart in the statistical spread.
Thats all for now. Be aware though that I doubt anybody here knows how performance hash/lookup works.
Btw, they have curves that include binary searches. Seems binary searches is the achillies heel, it wins
sometimes (within the buckets?). We'll see. I will read some more. I won't take 'just take it and shut up',
if you write a language based on what the so called experts say, there is trouble. They also mention
database indexes being produced by hash function. Hehe, bring it on....

robic0

unread,

May 1, 2006, 1:30:37 AM5/1/06

to

Btw, don't hang your hat on 'buckets'. The phrase is long since worn out.
If you search the over 2,000 posts I have on this forum you will find several of your 'bucket'
definitions posted. Funny thing is, to me its all the same sort logic. It confuses the issue
when acronyms are used. There's nothing new under the sun especially when you've seen as many
sun's as me. They rate my iq off the charts, not as a direct point, but as a unknown linear
progression unto infinity. Never seen before. It means that I start off slow, minimimal knowledge
then each pass know more that extends off the chart, to infinity. I don't waste that gift from God.
I need a clear understanding before I waste my time. There's too much out there and I won't give it
away. My iq is off the charts last time I checked.

Marc Bissonnette

unread,

May 1, 2006, 1:42:00 AM5/1/06

to

robic0 denied reality and substituted his own by excreting
news:qm6b52dm9si09i09v...@4ax.com:

> On Sat, 29 Apr 2006 18:35:10 -0700, robic0 wrote:

[snip]

> Btw, don't hang your hat on 'buckets'. The phrase is long since worn
> out. If you search the over 2,000 posts I have on this forum you will
> find several of your 'bucket' definitions posted. Funny thing is, to
> me its all the same sort logic. It confuses the issue when acronyms
> are used. There's nothing new under the sun especially when you've
> seen as many sun's as me. They rate my iq off the charts, not as a
> direct point, but as a unknown linear progression unto infinity. Never
> seen before. It means that I start off slow, minimimal knowledge then
> each pass know more that extends off the chart, to infinity. I don't
> waste that gift from God. I need a clear understanding before I waste
> my time. There's too much out there and I won't give it away. My iq is
> off the charts last time I checked.

... So is your ego.

(Did you spend a lot of time on the "Kick me!" sign, or is that a magic
marker and bristle-board jobbie ?)

--
Marc Bissonnette
Looking for a new ISP? http://www.canadianisp.com
Largest ISP comparison site across Canada.

robic0

unread,

May 1, 2006, 2:23:21 AM5/1/06

to

I guess I have to fight off folks I don't know now. Hey, thanks regulars
thanks alot.

Hey dude????

How bout I fuck with your mind some?

Let me know dude, I really want to mother fuck up your goddamend asshole mind!

Matt Garrish

unread,

May 1, 2006, 6:55:58 AM5/1/06

to

<robic0> wrote in message news:qm6b52dm9si09i09v...@4ax.com...

I thought your 'c'ing of 0 was a direct correlation to your iq. I don't
stand corrected...

Matt

Marc Bissonnette

unread,

May 1, 2006, 9:46:47 AM5/1/06

to

robic0 denied reality and substituted his own by excreting

news:k6ab52psroea3gnei...@4ax.com:

You'll have to pardon me if I'm not exactly quaking in fear.

You had a perfect opportunity to show *why* you think your module is
functional and not worth the negative comments in here and, rather than
acting like a professional and showing that your creation is a good piece
of code, you jumped in like a potty-mouthed school child with rank
amateur insults.

My question about your "Kick me" sign stands.