[How to remove duplicate entries in a sorted array]
> is not a question but a solution ...
> ... I needed it to work on an array of
> ...
> setlength(newar,count+1);
> ...
Alavna,
Do you really want to do the SetLength a large number
of times? That could be very slow. (I did not look at the
rest of the algorithm.)
Regards, JohnH
>Waking up after 10 hours of sleep I immediately could write a short
>working algo (I needed it to work on an array of >133,000,000 elements):
Removing duplicates in a sorted array. Basically the best thing
you're going to be able to do is just sequentially search through the
array. Then identify your duplicates and go from there - they will
all be next to one another if the array is sorted.
Something like this would be my first attempt (untested):
const
  arraybound = 133000000;
type
  arraytype = { whatever your array is };
  arrayunit = { whatever each unit is in the array };
var
  oldarray, newarray: arraytype;
  oldcount, newcount: longint;
  currentvalue: arrayunit;

oldcount := 1;
newcount := 1;
while (oldcount <= arraybound) do
begin
  currentvalue := oldarray[oldcount];
  newarray[newcount] := currentvalue;
  inc(newcount);
  repeat { to always increment oldcount at least once }
    inc(oldcount);
  until (oldarray[oldcount] <> currentvalue)
     or (oldcount > arraybound);
end;
> until (oldarray[oldcount] <> currentvalue)
> or (oldcount > arraybound);
You might want to reverse those two conditions. <g>
--
-Mike (TeamB)
> you are right. But in my case I knew unique values were less than
> 1000 in that ~133,000,000-element array. And the result was 259 unique
> numbers, which means setlength was executed 259 times only. The other
> problem is that I don't know the length of the new array. Anyway, as I
> said, I am using it to test some things and not for end users.
What is the range of the unique numbers?
--
-Mike (TeamB)
Then why not initially allocate 1000 for the result array,
and then maybe trim it when you know the final length?
I am almost afraid to ask why the input array is sorted,
but I will anyway. Was it sorted just for finding duplicates?
If so, there is a shorter way.
--JohnH
Clearly one should normally set the length to the maximum that could
possibly be needed, and contract it once at the end.
If the array is very large and has only a few duplicates, it may be
better to copy the wanted ones (after the first duplicate) into the
original array, and at the end to shorten it : that saves getting more
array space and may reduce copying. That's like squeezing (or
squashing, or whatever the word was) a DEC RT-11 floppy, where files do
not fragment but free space may need to be consolidated (at the end).
In some applications, it may be better to replace the extra entries with
"null" entries, omit those when processing, and maybe consolidate
occasionally.
--
© John Stockton, Surrey, UK. ?@merlyn.demon.co.uk Delphi 3 Turnpike 4 ©
<URL:http://www.merlyn.demon.co.uk/> TP/BP/Delphi/&c., FAQqy topics & links;
<URL:http://www.bancoems.com/CompLangPascalDelphiMisc-MiniFAQ.htm> clpdmFAQ;
<URL:http://www.borland.com/newsgroups/guide.html> news:borland.* Guidelines
I don't think Rudy is talking about pointers, except in the metaphoric
sense that the index of an array is some kind of pointer. Rudy is
suggesting you use two indexes into the original array. Say I and J -
Function ZapDupes(var MyArray: TSomeIntegerArray): integer;
Var
  I, J, K, Val: integer;
begin
  I := 0; // Assumes array is 0-based
  J := 0;
  K := length(MyArray);
  Val := -1; // an impossible value
  While I < K do
    // Compare new value to previous value
    If Val = MyArray[I]
      then Inc(I) // No change
      else begin
        Val := MyArray[I];  // Update value
        MyArray[J] := Val;  // Move new value into J index
        Inc(I);
        Inc(J);
      end;
  Setlength(MyArray, J);
  Result := J; // Return length of modified array
end;
The J index is only incremented if the previous value is different from
the current value, so that J <= I always. The I index is always
incremented. At the finish, the J index will be the count of
non-duplicate entries, and all the non-duplicate entries will have been
moved to the front of the array, so you can just set the length of
MyArray accordingly. I used K so that length(MyArray) is calculated
only once.
This does assume that MyArray is sorted. I believe a better solution to
the entire problem might be for the sorting process to not insert
duplicates at all.
HTH, Glynn
--
"It is high time that the ideal of success should be replaced by the
ideal of service." (Albert Einstein).
> > I am almost afraid to ask why the input array is sorted,
> > but I will anyway. Was it sorted just for finding duplicates?
> > If so, there is a shorter way.
> Yes I actually sorted it to find the duplicates. What is the
> shorter way?
If you know the number of unique values is very likely to be less
than 1024, you could use a hash table of size 2048 and code
a hash function to convert the inputs into a 0..2047 key. Of
course you have to have the usual mechanisms to handle
collisions. As you scan the inputs and find them not in the
table, you add them.
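A minimal sketch of that idea, assuming the inputs are longints. The
table size, the hash function (just the low bits of the value), and the
linear-probing collision handling are illustrative choices, not from
JohnH's post:

```pascal
const
  TableSize = 2048; { power of two, comfortably above 1000 uniques }
var
  Table: array[0..TableSize - 1] of longint;
  Used: array[0..TableSize - 1] of boolean; { clear to false before use }

{ Returns true if Value was not yet in the table (i.e. it is new). }
function AddIfNew(Value: longint): boolean;
var
  Slot: integer;
begin
  Slot := Value and (TableSize - 1); { cheap hash: low 11 bits }
  { linear probing: step forward until we find Value or a free slot }
  while Used[Slot] and (Table[Slot] <> Value) do
    Slot := (Slot + 1) and (TableSize - 1);
  if Used[Slot] then
    AddIfNew := false { already seen }
  else
  begin
    Used[Slot] := true; { claim the free slot }
    Table[Slot] := Value;
    AddIfNew := true;
  end;
end;
```

You would scan the ~133,000,000 inputs once, unsorted, and append a
value to the result array whenever AddIfNew returns true. Note the
probe loop only terminates while the table has free slots, so the
guess of "well under 1024 uniques" matters.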
A second method is to build your own binary search tree as
you scan the inputs, but this requires a little more unique
coding and may not be faster.
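A rough sketch of that second method, again assuming longint inputs;
the node layout and recursive insert are one possible shape of the
"unique coding" JohnH mentions:

```pascal
type
  PNode = ^TNode;
  TNode = record
    Value: longint;
    Left, Right: PNode;
  end;

{ Inserts Value into the tree rooted at Root; duplicates are ignored. }
procedure InsertUnique(var Root: PNode; Value: longint);
begin
  if Root = nil then
  begin
    New(Root); { first sighting of Value: make a leaf for it }
    Root^.Value := Value;
    Root^.Left := nil;
    Root^.Right := nil;
  end
  else if Value < Root^.Value then
    InsertUnique(Root^.Left, Value)
  else if Value > Root^.Value then
    InsertUnique(Root^.Right, Value);
  { equal: already present, do nothing }
end;
```

One caveat that supports the "may not be faster" warning: feeding this
tree already-sorted input degenerates it into a linked list, so it only
pays off on the unsorted data, before any sort.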
Regards, JohnH
What would that be?