
How to remove duplicate entries in a sorted array


alavna

Jun 24, 2006, 12:34:41 PM
This is not a question but a solution that I guess can be found by
searching Google newsgroups.
Sleep-deprived, I could not come up with a simple solution at all. I
searched Google Groups and found many answers, but amazingly none of
them worked.
Waking up after 10 hours of sleep, I could immediately write a short
working algo. (I needed it to work on an array of >133,000,000 elements.)
//Arr is the large sorted array to be filtered.
//newar is the array containing the elements after removing duplicates.
//after declaring the variables
//Arr's first index is zero.
g := length(arr);
setlength(newar, 1);
newar[0] := arr[0];
count := 1;
for a := 1 to g - 1 do
begin
  if arr[a] <> arr[a-1] then
  begin
    setlength(newar, count + 1);
    newar[count] := arr[a];
    count := count + 1;
  end;
end;
//now newar has what you want
//now newar has what you want

John Herbster

Jun 24, 2006, 12:49:28 PM

"alavna" <omer....@gmail.com> wrote
> This

[How to remove duplicate entries in a sorted array]

> is not a question but a solution ...
> ... I needed it to work on an array of
> ...
> setlength(newar,count+1);
> ...

Alavna,
Do you really want to do the SetLength a large number
of times? That could be very slow. (I did not look at the
rest of the algorithm.)
Regards, JohnH

A Programmer

Jun 24, 2006, 12:48:34 PM
On Sat, 24 Jun 2006 19:34:41 +0300, alavna <omer....@gmail.com>
wrote:

>Waking up after 10 hours of sleep I immediately could wrote a short
>working algo(I needed it to work on an array of >133.000.000 elements:

Removing duplicates from a sorted array: basically the best thing
you're going to be able to do is scan sequentially through the array
and identify the duplicates as you go; they will all be next to one
another since the array is sorted.

Something like this would be my first attempt (untested):

const
  arraybound = 133000000;
type
  arraytype = whatever your array is;
  arrayunit = whatever each unit is in the array;
var
  oldarray, newarray: arraytype;
  oldcount, newcount: longint;
  currentvalue: arrayunit;

oldcount := 1;
newcount := 1;
while (oldcount < arraybound) do
begin
  currentvalue := oldarray[oldcount];
  newarray[newcount] := currentvalue;
  inc(newcount);
  repeat { to always increment oldcount at least once }
    inc(oldcount);
  until (oldarray[oldcount] <> currentvalue)
     or (oldcount > arraybound);
end;

alavna

Jun 24, 2006, 1:31:50 PM
You are right. But in my case I knew the number of unique values was
less than 1000 in that ~133,000,000-element array, and the result was
259 unique numbers, which means SetLength was executed only 259 times.
The other problem is that I don't know the length of the new array.
Anyway, as I said, I am using it to test some things, not for end users.
thanks

Mike Williams (TeamB)

Jun 24, 2006, 1:20:07 PM
A Programmer wrote:

> until (oldarray[oldcount] <> currentvalue)
> or (oldcount > arraybound);

You might want to reverse those two conditions: as written,
oldarray[oldcount] is read before the bounds test, so the loop can
index one past the end of the array. <g>

--
-Mike (TeamB)

Mike Williams (TeamB)

Jun 24, 2006, 2:12:15 PM
alavna wrote:

> you are right. But in my case I knew the number of unique values was
> less than 1000 in that ~133,000,000-element array. And the result was
> 259 unique numbers, which means SetLength was executed only 259 times.
> The other problem is that I don't know the length of the new array.
> Anyway, as I said, I am using it to test some things, not for end users.

What is the range of the unique numbers?

--
-Mike (TeamB)

Message has been deleted
Message has been deleted

John Herbster

Jun 24, 2006, 6:14:46 PM

> ...in my case I knew unique values were less than 1000
> in that ~133000000-elements array. And the result was
> 259 unique numbers ...

Then why not initially allocate 1000 for the result array,
and then maybe trim it when you know the final length?

I am almost afraid to ask why the input array is sorted,
but I will anyway. Was it sorted just for finding duplicates?
If so, there is a shorter way.

--JohnH

Dr John Stockton

Jun 24, 2006, 4:09:11 PM
JRS: In article <449d...@newsgroups.borland.com>, dated Sat, 24 Jun
2006 11:49:28 remote, seen in
news:borland.public.delphi.language.delphi.general, John Herbster
<herb-sci1_AT_sbcglobal.net@?.?> posted :

Clearly one should normally set the length to the maximum possibly
needed, and contract it once at the end.

If the array is very large and has only a few duplicates, it may be
better to copy the wanted ones (after the first duplicate) into the
original array, and at the end to shorten it: that saves getting more
array space and may reduce copying. That's like squeezing (or
squashing, or whatever the word was) a DEC RT-11 floppy, where files do
not fragment but free space may need to be consolidated (at the end).


In some applications, it may be better to replace the extra entries with
"null" entries, omit those when processing, and maybe consolidate
occasionally.

--
© John Stockton, Surrey, UK. ?@merlyn.demon.co.uk Delphi 3 Turnpike 4 ©
<URL:http://www.merlyn.demon.co.uk/> TP/BP/Delphi/&c., FAQqy topics & links;
<URL:http://www.bancoems.com/CompLangPascalDelphiMisc-MiniFAQ.htm> clpdmFAQ;
<URL:http://www.borland.com/newsgroups/guide.html> news:borland.* Guidelines

alavna

Jun 24, 2006, 8:12:25 PM
Rudy Velthuis [TeamB] wrote:

> At 19:31:50, 24.06.2006, alavna wrote:
>
>> you are right. But in my case I knew uniques values were less than 1000
>> in that ~133000000-elements array.
>
> Untested, but you can do it in place:
>
> readIndex := 0;
> writeIndex := 0;
> while readIndex < Length(A) do
> begin
>   // skip duplicates, but stop at the end of the array
>   while (readIndex < Length(A) - 1) and (A[readIndex] = A[readIndex + 1]) do
>     Inc(readIndex);
>   A[writeIndex] := A[readIndex];
>   Inc(readIndex);
>   Inc(writeIndex);
> end;
> SetLength(A, writeIndex);
>
>
Thanks Rudy

alavna

Jun 24, 2006, 8:16:11 PM
Rudy Velthuis [TeamB] wrote:
> Why not just
>
> SetLength(newArr, g);
>
> once and then
>
> SetLength(newArr, count);
>
> at the end of the loop?
>
> Or why actually use a new array? Just have a read and a write pointer (or
> index). The write pointer (or index) will always be less than or equal to
> the read pointer (or index).
>
:) Thanks Rudy, but I really still have a lot to learn. I have yet to
read about pointers. The solutions I use may look very primitive and
unoptimized; I hope this will change in the near future.

alavna

Jun 24, 2006, 8:11:19 PM
between 1 and 133,000,000

alavna

Jun 24, 2006, 8:10:12 PM
Yes, I actually sorted it to find the duplicates. What is the shorter
way? Thanks in advance.

Glynn Owen

Jun 24, 2006, 11:04:42 PM
alavna wrote:

I don't think Rudy is talking about pointers, except in the metaphoric
sense that the index of an array is some kind of pointer. Rudy is
suggesting you use two indexes into the original array. Say I and J -

Function ZapDupes(var MyArray: TSomeIntegerArray): integer;
Var
  I, J, K, Val: integer;
begin
  I := 0; // Assumes array is 0-based
  J := 0;
  K := length(MyArray);
  Val := -1; // an impossible value
  While I < K do
    // Compare new value to previous value
    If Val = MyArray[I]
      then Inc(I) // No change
      else begin
        Val := MyArray[I]; // Update value
        MyArray[J] := Val; // Move new value into J index
        Inc(I);
        Inc(J);
      end;
  Setlength(MyArray, J);
  Result := J; // Return length of modified array
end;

The J index is only incremented when the current value differs from
the previous value, so J <= I always. The I index is always
incremented. At the finish, the J index is the count of
non-duplicate entries, and all the non-duplicate entries have been
moved to the front of the array, so you can just set the length of
MyArray accordingly. I used K so that length(MyArray) is calculated
only once.

This does assume that MyArray is sorted. I believe a better solution to
the entire problem might be for the sorting process to not insert
duplicates at all.
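
Glynn's closing suggestion, that the inserting process itself refuse
duplicates, might be sketched like this (untested; TIntArray and
InsertUnique are names I've made up for illustration):

procedure InsertUnique(var A: TIntArray; Value: Integer);
var
  Lo, Hi, Mid, i: Integer;
begin
  Lo := 0;
  Hi := Length(A);
  while Lo < Hi do // binary search for the first element >= Value
  begin
    Mid := (Lo + Hi) div 2;
    if A[Mid] < Value then
      Lo := Mid + 1
    else
      Hi := Mid;
  end;
  if (Lo < Length(A)) and (A[Lo] = Value) then
    Exit; // already present: skip the duplicate
  SetLength(A, Length(A) + 1);
  for i := Length(A) - 1 downto Lo + 1 do
    A[i] := A[i - 1]; // shift the tail up one slot
  A[Lo] := Value;
end;

The per-element shifting makes this quadratic in the worst case, so it
only pays off when, as here, the number of unique values is tiny
compared with the input.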

HTH, Glynn

--
"It is high time that the ideal of success should be replaced by the
ideal of service." (Albert Einstein).

Message has been deleted

alavna

Jun 25, 2006, 7:57:31 AM
Rudy Velthuis [TeamB] wrote:

> At 02:16:11, 25.06.2006, alavna wrote:
>
>> to read about for me. Solutions I use myself may look very primitive
>> and unoptimized and hope this will change in the near future.
>
> Like I wrote, instead of a pointer, you can use the index. I posted a
> piece of code in one of my other posts.
>
I tried yours; it works. Thanks.

John Herbster

Jun 25, 2006, 9:09:44 AM

"alavna" <omer....@gmail.com> wrote

> > I am almost afraid to ask why the input array is sorted,
> > but I will anyway. Was it sorted just for finding duplicates?
> > If so, there is a shorter way.

> Yes I actually sorted it to find the duplicates. What is the
> shorter way?

If you know the number of unique values is very likely to be less
than 1024, you could use a hash table of size 2048 and code
a hash function to convert the inputs into a 0..2047 key. Of
course, you need the usual mechanisms to handle collisions.
As you scan the inputs, any value not already in the table
gets added.

A second method is to build your own binary search tree as
you scan the inputs, but this requires a little more unique
coding and may not be faster.
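
A minimal sketch of the hash-table method in Delphi-style Pascal
(untested; the power-of-two table size, the and-mask hash, and linear
probing are my own choices, and it relies on the input values being
>= 1 so that 0 can mark an empty slot):

const
  TableSize = 2048; // power of two, comfortably above the expected 1000 uniques
var
  Table: array[0..TableSize - 1] of Integer; // 0 marks an empty slot
  i, h, v, count: Integer;
begin
  FillChar(Table, SizeOf(Table), 0);
  count := 0;
  for i := 0 to Length(Arr) - 1 do
  begin
    v := Arr[i];
    h := v and (TableSize - 1); // cheap hash: keep the low 11 bits
    while (Table[h] <> 0) and (Table[h] <> v) do
      h := (h + 1) and (TableSize - 1); // linear probing on collision
    if Table[h] = 0 then
    begin
      Table[h] := v; // first sighting of v: record it
      Inc(count);
    end;
  end;
end;

One pass over the 133,000,000 elements and no sort needed; count ends
up holding the number of unique values.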

Regards, JohnH


Dr John Stockton

Jun 25, 2006, 7:54:20 AM
JRS: In article <449db8bc$1...@newsgroups.borland.com>, dated Sat, 24 Jun
2006 17:14:46 remote, seen in
news:borland.public.delphi.language.delphi.general, John Herbster
<herb-sci1_AT_sbcglobal.net@?.?> posted :
>
>I am almost afraid to ask why the input array is sorted,
>but I will anyway. Was it sorted just for finding duplicates?
>If so, there is a shorter way.

What would that be?
