
How to check a StringList for duplicate entries???


M.Wasbauer

Jun 1, 2002, 5:37:49 PM
Hello everybody,
I have a TStringList which I fill with a list of 200,000 entries.
So there are 200,000 lines in this list and I need a fast way to
run through this StringList to check for duplicate entries.
If a duplicate is found, it must be copied into a memo and it must
be deleted from the stringlist, leaving every line unique.

Any ideas?

Kind regards,

Maurice


John Herbster

Jun 1, 2002, 6:22:29 PM
"M.Wasbauer" <m.was...@nedpro.nl> wrote

> I have a TStringList which I fill with a list of 200,000
> entries. I need a fast way to ... check for [and remove]
> duplicate entries. ... Any ideas?

Maurice, It's a classic problem with a classic solution.
(I think I first read about it in a Hardy Boys book, in which
they were solving a crime by working with collections of
license plate numbers.) Just sort and search from tail
of list to front, deleting duplicates. Have fun. JohnH
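[Editor's note: the classic scheme John describes — sort first so duplicates become adjacent, then make one linear pass — can be sketched in Python for readers who don't write Delphi. This is an illustration of the idea, not code from the thread; `dedup_sorted` is a hypothetical helper name.]

```python
def dedup_sorted(items):
    """Sort-then-scan: after sorting, duplicates sit next to each other."""
    unique, dupes = [], []
    for s in sorted(items):
        if unique and unique[-1] == s:
            dupes.append(s)   # same as the previous entry: a duplicate
        else:
            unique.append(s)
    return unique, dupes
```

Sorting costs O(n log n) and the scan is O(n), which is why this beats checking each new string against all earlier ones.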


Brian Slack

Jun 1, 2002, 7:07:31 PM
Why not disallow duplicates and do the Add in a try..except block?

--

Brian Slack
http://www.depicus.com
"Wake On Lan" and "Remote Shutdown" Software

"...never was I so scared as when I saw a Formula 1 car controlled by Visual
Basic..."
"John Herbster" <herb...@swbell.net> wrote in message
news:3cf94925$1_1@dnews...

ozbear

Jun 1, 2002, 7:32:39 PM
Set the Duplicates property to dupError,
the Sorted property to True,
and the CaseSensitive property to true or false
depending on whether "cat" is supposed to be
treated as a duplicate for "Cat". Then your
code boils down to...

try
  yourstringlist.Add(somenewstring);
except
  on EStringListError do begin
    yourmemo.Lines.Add(somenewstring + ' is a duplicate');
  end;
end;


If you know that your TStringList is going to be that
large, it would probably be a good idea to set the
Capacity property as well, to avoid frequent reallocations.

Oz

John Herbster

Jun 1, 2002, 7:37:20 PM
"Brian Slack" <newsg...@depicus.com> wrote

> Why not not allow duplicates and add in a try except block

Brian, Exceptions are expensive in terms of CPU cycles
and maybe memory fragmentation. However, you could
write your own insertion sort and avoid exceptions. But
until I knew more about the need, like the expected
percentage of duplicates, I would go with the
search-and-delete scheme. Rgds, JohnH


John Herbster

Jun 1, 2002, 8:51:14 PM
"ozbear" <ozb...@bigpond.com> wrote

> Set the Duplicates property to dupError

Oz, I had forgotten about the Duplicates property.
Why don't you just set it to dupIgnore and then
do your Add()s? I don't think that Maurice wanted
to print messages about duplicates, and if he did,
there are faster ways than using exceptions to
detect a failure to add. Regards, JohnH
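[Editor's note: the dupIgnore trick — let the container silently drop duplicates and compare its size before and after each Add — can be sketched with a Python set. A hypothetical helper for illustration, not Delphi code from the thread.]

```python
def add_detecting_dupe(seen, s):
    """dupIgnore analogue: returns True if s was new, False if a duplicate."""
    before = len(seen)
    seen.add(s)               # a set silently ignores duplicates
    return len(seen) > before
```

Usage: call it for every incoming string; a False return is the exception-free duplicate signal.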


Lord Crc

Jun 1, 2002, 10:43:54 PM
On Sat, 01 Jun 2002 23:32:39 GMT, ozb...@bigpond.com (ozbear) wrote:

>Set the Duplicates property to dupError,
>the Sorted property to True,
>and the CaseSensitive property to true or false
>depending on whether "cat" is supposed to be
>treated as a duplicate for "Cat". Then your
>code boils down to...
>
>try
> yourstringlist.Add(somenewstring);
>except
> on EStringListError do begin
> yourmemo.Lines.Add(somenewstring + ' is a duplicate');
>end;

In my limited tests, this took about 50 seconds (32 if I just muted
the exception), while with the same stringlist source, my code took
about 3 seconds.

src.Sort;

i := 0;
reported := false;
lines := src.Count;
while i < lines do
begin
  s := src[i];
  dupe := (i > 0) and (so = s);
  if not dupe then
  begin
    reported := false;
    tmp.Add(s);
    so := s;
  end
  else
  begin
    if not reported then
    begin
      outp.Add(s);
      reported := true;
    end;
  end;
  inc(i);
end;

(src is the input, tmp is the duplicate-free list, and outp is the
stringlist that holds the duplicate values (I assumed Maurice only
wanted each string reported once); assign outp to the memo when the
loop is done)

- Asbjørn

Dustin Campbell

Jun 2, 2002, 10:31:55 AM
Here's the fastest I can find. Of course it uses more memory and there
is some sorting (which is required to use TStringList's
duplicate-checking feature). It takes about 40 seconds on my machine.

var
  slItemList: TStringList;    // your list with 200,000 entries
  memDups: TMemo;             // your memo for duplicates
  slTempCleaned: TStringList; // temporary stringlist for cleaned items
  slTempDups: TStringList;    // temporary stringlist to store duplicates
  I: Integer;                 // iterator

// First, create the temporary string lists.
slTempDups := TStringList.Create;
slTempDups.Duplicates := dupIgnore;
slTempDups.Sorted := True;

slTempCleaned := TStringList.Create;
slTempCleaned.Duplicates := dupError;
slTempCleaned.Sorted := True;
try
  slTempDups.BeginUpdate;
  slTempCleaned.BeginUpdate;

  for I := 0 to Pred(slItemList.Count) do
  begin
    try
      slTempCleaned.Add(slItemList[I]);
    except
      on EStringListError do slTempDups.Add(slItemList[I]);
    end;
  end;

  slTempCleaned.EndUpdate;
  slTempDups.EndUpdate;

  slItemList.Assign(slTempCleaned);
  memDups.Lines.Assign(slTempDups);
finally
  slTempDups.Free;
  slTempCleaned.Free;
end;

Regards,
Dustin Campbell
Microsys Technologies, Inc.

On Sat, 1 Jun 2002 23:37:49 +0200, "M.Wasbauer" <m.was...@nedpro.nl>
wrote:

John Herbster

Jun 2, 2002, 12:25:00 PM
"John Herbster" <herb...@swbell.net> wrote
> ... Just sort and search from tail of list to front,
> deleting duplicates.

Only one big problem: deleting from the D5 and probably
other Delphi TStringLists can be *very* slow. The time
required appears to grow at about the 2.15 power of the
number of items in the list when deleting about 22% of
the items.

A solution is to create a new TStringList and Add
the items that are *not* to be deleted. This is very fast
compared to the sorting. The TStringList sort is
pretty fast (about 2000 CPU cycles per item when
doing about 200,000 items, each a string like
"234152"). --JohnH
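[Editor's note: the cost difference John measured can be seen in miniature with a Python sketch; both hypothetical functions below assume sorted input. Deleting in place from a dynamic array shifts the tail on every delete, while copying the keepers is one linear pass.]

```python
def dedup_in_place(items):
    # Each del shifts every later element left, so heavy deletion
    # degrades toward O(n^2) -- the slowdown reported here for
    # TStringList.Delete.
    i = 1
    while i < len(items):
        if items[i] == items[i - 1]:
            del items[i]
        else:
            i += 1
    return items

def dedup_by_copy(items):
    # Appending the keepers to a fresh list is a single O(n) pass.
    kept = []
    for s in items:
        if not kept or kept[-1] != s:
            kept.append(s)
    return kept
```

Both produce the same result; only the running time differs as the list grows.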


John Scalco

Jun 2, 2002, 8:49:23 PM
You can use a TStringList - or better yet a THashedStringList - I think
it's new in Delphi 6. TMemIniFile uses it internally, so my guess is that
they just exposed something which has been there a while.

You might try using Names[] and Values[]; these work like hash pairs.

You can then say Add('First Item=1'); // where 1 means it's the first time
this item has been added to the list.
Then use IndexOfName('First Item') to see if it already exists - I imagine
this uses hashing techniques, so it would probably be faster than the
standard TStringList.

You can iterate through the list using Names[i] to get what's on the left of
the = sign, and Values[] gives you what's on the right.

Hope this helps,
sincerely,
John

"M.Wasbauer" <m.was...@nedpro.nl> wrote in message
news:3cf93ee2_2@dnews...
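[Editor's note: the hashed-lookup idea maps directly onto a hash table; here is a Python sketch with a set standing in for THashedStringList's internal hash. `dedup_hashed` is a hypothetical helper, not part of any Delphi API.]

```python
def dedup_hashed(items):
    """Hash membership test: no sort needed, original order preserved."""
    seen = set()
    unique, dupes = [], []
    for s in items:
        if s in seen:
            dupes.append(s)
        else:
            seen.add(s)
            unique.append(s)
    return unique, dupes
```

Expected O(1) lookup per string, versus the O(log n) binary search a sorted TStringList does, and unlike the sort-based schemes it keeps the input order.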

ozbear

Jun 2, 2002, 9:11:59 PM

Hmmm...he said he wanted the duplicates added to a memo.

There are many solutions to this problem, and the "best"
one will be predicated on how the original input data
is obtained. In some cases presorting a list and copying
the unique items to another list might be preferable
as in LordCRC's solution.

I supplied a general solution where all the original
string data might not be available (e.g., it is arriving
a string at a time from a socket or is being entered by
a user from a form), so presorting isn't an option.

The general idea is to not add them in the first place rather
than go through all the machinery of adding only to delete
them later if they are a duplicate.

If the number of -expected- duplicates is small in relation to
the 200,000 input strings, it might be better to add the
duplicates to an auxiliary TStringList and just do a
Memo.Lines.Assign to get them to the memo.

Oz

John Herbster

Jun 2, 2002, 10:11:05 PM
"ozbear" <ozb...@bigpond.com> wrote
> ... he said he wanted the duplicates added to a memo.

On rereading the problem, I see he did!

Did you see my note about deleting from the D5 and
probably other Delphi TStringLists being *very* slow?


The time required appears to grow at about the 2.15
power of the number of items in the list when deleting
about 22% of the items.

I thought that deleting would just leave empty space at
the end of the list. But there is more to it than that.
Regards, JohnH


Chris Willig

Jun 3, 2002, 5:44:03 PM
Not using exceptions would cut your time down considerably...

var
  cnt: Integer;

slTempCleaned.Sorted := True; // Duplicates is only honored when Sorted
slTempCleaned.Duplicates := dupIgnore;

cnt := 0;

for I := 0 to Pred(slItemList.Count) do
begin
  slTempCleaned.Add(slItemList[I]);

  if cnt = slTempCleaned.Count then
    slTempDups.Add(slItemList[I])  // Count did not grow: a duplicate
  else
    cnt := slTempCleaned.Count;
end;
