
Looking for a "Uniqueness Search"


Christopher Browne

Dec 27, 2002, 4:53:53 PM
I'm looking for a text tool that more or less corresponds to the
pipelines:

cat foo | sort | uniq | wc -l
cat foo | wc -l

If the counts differ, then evidently there's something /not/ unique in
the file.

What it amounts to is looking at whether or not the data presented
represents a unique key across the data file.

What actually happens is that "foo" is the result of a "cut" command,
a la "cut -f 1,7,9 some_data_file".
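
Spelled out, the whole check currently amounts to something like this
(the file name and field list are just the example above):

  cut -f 1,7,9 some_data_file > /tmp/keys.$$
  total=`wc -l < /tmp/keys.$$`
  distinct=`sort /tmp/keys.$$ | uniq | wc -l`
  if [ "$total" -ne "$distinct" ]; then
      echo "non-unique keys found"
  fi
  rm -f /tmp/keys.$$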

I can easily script up such a comparison in Perl, using a hash table,
that can abort as soon as a non-unique value is found, roughly via:

my %HASH;
my $count = 0;
while (my $line = <>) {
    if (exists $HASH{$line}) {
        # current line number, and line number of the first occurrence
        print "$count $HASH{$line}\n";
        exit 1;
    } else {
        $HASH{$line} = $count++;
    }
}

Unfortunately, the interpretive overhead is pretty heavy, such that
this is a lot slower than "cat foo | sort | uniq | wc -l", even though
it is able to abort as soon as it hits a violation of the uniqueness
requirement.

I can't think of any Unix text command that does a uniqueness test
without starting with a sort; has anyone built such? If not, then I
may wind up writing up something in C using dbm. But there's no sense
in replicating a wheel that may already exist.

(Note that this /is/ relevant to "databasing" as the whole purpose of
the exercise is to validate that attempts to create unique indexes in
a PostgreSQL DBMS will succeed without actually initiating the "create
index.")
--
(reverse (concatenate 'string "moc.enworbbc@" "enworbbc"))
http://www.ntlug.org/~cbbrowne/rdbms.html
"prepBut nI vrbLike adjHungarian! qWhat's artThe adjBig nProblem?"
-- Alec Flett

Patrick TJ McPhee

Dec 27, 2002, 8:29:24 PM
In article <auii5h$7kti0$1...@ID-125932.news.dfncis.de>,
Christopher Browne <cbbr...@acm.org> wrote:

% I'm looking for a text tool that more or less corresponds to the
% pipelines:
%
% cat foo | sort | uniq | wc -l
% cat foo | wc -l

Still a pipeline (sorting the file only once), but how about
sort foo | sort -uc
?

This will set $? to 0 if foo has unique lines, and 1 otherwise.
I expect sort's -k option can be used to set the key up the
way you want it.
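
For instance (untested, and assuming the default blank/tab field
separation; otherwise add the appropriate -t), restricting the
comparison to, say, fields 1, 7 and 9 from your cut example:

  sort -k1,1 -k7,7 -k9,9 some_data_file |
      sort -c -u -k1,1 -k7,7 -k9,9
  echo "unique if exit status is 0: $?"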

% I can't think of any Unix text command that does a uniqueness test
% without starting with a sort; has anyone built such? If not, then I
% may wind up writing up something in C using dbm. But there's no sense
% in replicating a wheel that may already exist.

Can you hold the keys in memory? It would likely be faster to use hsearch()
and friends if so.

--

Patrick TJ McPhee
East York Canada
pt...@interlog.com

Chris F.A. Johnson

Dec 27, 2002, 8:47:19 PM
On Fri, 27 Dec 2002 at 21:53 GMT, Christopher Browne wrote:
> I'm looking for a text tool that more or less corresponds to the
> pipelines:
>
> cat foo | sort | uniq | wc -l
> cat foo | wc -l
>
> If the counts differ, then evidently there's something /not/ unique in
> the files.
>
> What it amounts to is looking at whether or not the data presented
> represents a unique key across the data file.
[snip]

>
> I can't think of any Unix text command that does a uniqueness test
> without starting with a sort; has anyone built such? If not, then I
> may wind up writing up something in C using dbm. But there's no sense
> in replicating a wheel that may already exist.

awk 'x[$0]++ {print "Duplicate found"}'
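
Or, to stop at the first duplicate and signal it through the exit
status, so a driving script can bail out early:

  awk 'x[$0]++ { print "Duplicate found"; exit 1 }'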

--
Chris F.A. Johnson http://cfaj.freeshell.org
===================================================================
My code (if any) in this post is copyright 2002, Chris F.A. Johnson
and may be copied under the terms of the GNU General Public License

Christopher Browne

Dec 27, 2002, 9:55:23 PM
Quoth pt...@interlog.com (Patrick TJ McPhee):

> In article <auii5h$7kti0$1...@ID-125932.news.dfncis.de>,
> Christopher Browne <cbbr...@acm.org> wrote:
>
> % I'm looking for a text tool that more or less corresponds to the
> % pipelines:
> %
> % cat foo | sort | uniq | wc -l
> % cat foo | wc -l
>
> Still a pipeline (sorting the file only once), but how about
> sort foo | sort -uc
> ?
>
> This will set $? to 0 if foo has unique lines, and 1 otherwise.
> I expect sort's -k option can be used to set the key up the
> way you want it.

I mostly hate the -k option; between the variations of "old options"
and "new options," I would usually rather pipe things through cut,
first.

The -uc option had not occurred to me; it's still going to sort, I
presume, before indicating non-uniqueness, right? That is a bit of a
win over the current approach, but only by a small margin, it seems.
I want the /big/ margin :-).

>% I can't think of any Unix text command that does a uniqueness test
>% without starting with a sort; has anyone built such? If not, then I
>% may wind up writing up something in C using dbm. But there's no sense
>% in replicating a wheel that may already exist.
>
> Can you hold the keys in memory? It would likely be faster to use
> hsearch() and friends if so.

Ah, so hsearch() is preferable these days? I need portability across
Sun + Linux, so stuff like GDBM is somewhat unappealing.
hsearch() sure seems suitable, and I'm planning to have a couple GB of
RAM around, so "in memory" is certainly an excellent assumption.
hsearch() is both in Solaris 8.x and Linux GLIBC, so that seems quite
a legitimate answer.

Gracias...
--
(reverse (concatenate 'string "gro.gultn@" "enworbbc"))
http://cbbrowne.com/info/spiritual.html
43% of all statistics are worthless.

Paul G. Brown

Dec 27, 2002, 11:11:06 PM
Christopher Browne <cbbr...@acm.org> wrote in message news:<auii5h$7kti0$1...@ID-125932.news.dfncis.de>...

> I'm looking for a text tool that more or less corresponds to the
> pipelines:

[ snip ]

> (Note that this /is/ relevant to "databasing" as the whole purpose of
> the exercise is to validate that attempts to create unique indexes in
> a PostgreSQL DBMS will succeed without actually initiating the "create
> index.")

Err. . .

SELECT T.Data, COUNT(*)
FROM Table T
GROUP BY T.Data HAVING COUNT(*) > 1;

This will tell you not only that the column is OK for a unique key, but
also what value(s) are repeated.

Use the DBMS, Luke!

KR

Pb

Christopher Browne

Dec 28, 2002, 12:26:50 AM

That's going to involve a table scan, /before/ creating the index, and
will thus not be terribly efficient.

I want to analyze possible problems /before/ the indices even get into
place, and before I get into 3 hours of processing to create a
boat-load of indices.


--
(reverse (concatenate 'string "moc.enworbbc@" "enworbbc"))

http://cbbrowne.com/info/spiritual.html
CBS News report on Fort Worth tornado damage:
"Eight major downtown buildings were severely damaged and 1,000 homes
were damaged, with 95 uninhabitable. Gov. George W. Bush declared
Tarrant County a disaster area. Federal Emergency Management Agency
workers are expected to arrive sometime next week after required
paperwork is completed."

Patrick TJ McPhee

Dec 28, 2002, 1:04:19 PM
In article <auj3qr$7kp0l$2...@ID-125932.news.dfncis.de>,
Christopher Browne <cbbr...@acm.org> wrote:

% The -uc option had not occurred to me; it's still going to sort, I
% presume, before indicating non-uniqueness, right? That is a bit of a

Well, the first sort in the pipeline will, but -uc ought to bail out
as soon as either a duplicate line or a line out of order comes up.
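
A quick way to convince yourself (untested):

  yes duplicate | head -100000 | sort -u -c
  echo $?

sort should complain at the second line and exit non-zero rather than
reading the whole stream.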

Paul G. Brown

Dec 28, 2002, 1:32:06 PM
Christopher Browne <cbbr...@acm.org> wrote in message news:<aujcmp$7op3d$1...@ID-125932.news.dfncis.de>...

> The world rejoiced as paul_geof...@yahoo.com (Paul G. Brown) wrote:
> > Christopher Browne <cbbr...@acm.org> wrote in message news:<auii5h$7kti0$1...@ID-125932.news.dfncis.de>...
> >> I'm looking for a text tool that more or less corresponds to the
> >> pipelines:

[ snip ]

> > Err. . .
> >
> > SELECT T.Data, COUNT(*)
> > FROM Table T
> > GROUP BY T.Data HAVING COUNT(*) > 1;


[ snip ]

> That's going to involve a table scan, /before/ creating the index, and
> will thus be not terribly efficient.
>
> I want to analyze possible problems /before/ the indices even get into
> place, and before I get into 3 hours of processing to create a
> boat-load of indices.

And your proposal is to unload the table into a text file (table scan) and
then to scan the text file repeatedly? I find it hard to believe that your
pipes approach will be any faster - and I *guarantee* that the SQL approach
will take less time than writing an entire 'C' program. (Postgres ought to
use a hash to handle the GROUP BY, in any case).

Ralph Corderoy

Dec 28, 2002, 3:18:57 PM
Hi,

> > I'm looking for a text tool that more or less corresponds to the
> > pipelines:
> >
> > cat foo | sort | uniq | wc -l
> > cat foo | wc -l
> >
> > If the counts differ, then evidently there's something /not/ unique
> > in the files.
>

> awk 'x[$0]++ {print "Duplicate found"}'

Or, as the original poster suggested, Perl:

perl -ne '$h{$_}++ and exit 1'
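
Fed from the cut output it would look something like this (field list
as in the original post):

  cut -f 1,7,9 some_data_file | perl -ne '$h{$_}++ and exit 1' &&
      echo "keys are unique" || echo "duplicate key found"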

If interpretive overhead is the problem, as the OP complained, then
keeping the source idiomatic often helps the speed of the interpreted
code.

Besides, awk traditionally does all the work of splitting into $1, $2,
etc., even if the fields aren't required.

Cheers,


Ralph.

John Slimick

Dec 30, 2002, 1:24:56 PM

>> >> I'm looking for a text tool that more or less corresponds to the
>> >> pipelines:

If each line of the file has more than a trivial amount of
information, why not use a "better" hashing technique and generate a
file of the hash values of each line. Then sort the hash values into
filea, and remove the duplicated values to produce fileb. This sort
should be a good deal cheaper than the sort of the entire lines.

If filea and fileb are the same size, then there were no duplicate
lines. If they differ, then either the Perl or the sort on the
original lines is justified.

By "better" hash methods I am thinking of CRC-32 or SHA-1.

john slimick
sli...@pitt.edu

Christopher Browne

Dec 30, 2002, 3:54:42 PM
sli...@jcs.upb.pitt.edu (John Slimick) wrote:
>>> >> I'm looking for a text tool that more or less corresponds to the
>>> >> pipelines:
>
> If each line of the file has more than a trivial amount of
> information, why not use a "better" hashing technique and generate a
> file of the hash values of each line. Then sort the hash values into
> filea, and remove the duplicated values to produce fileb. This sort
> should be a good deal cheaper than the sort of the entire lines.

I'm fine with running the data through "cut" so that all that remains
in each line is the portion that is /supposed/ to be unique.

That addresses the issue of "sorting something that's bigger than it
needs to be;" by cutting out just the portions that are relevant,
nothing unnecessary is included.

> If filea and fileb are the same size, then there were no duplicate
> lines. If they differ, then either the Perl or the sort on the
> original lines is justified.
>
> By "better" hash methods I am thinking of CRC-32 or SHA-1.

The first attempt, using POSIX hsearch(), turned out to be a poor
choice. Apparently something about hsearch() is inefficient when the
number of records gets beyond a few hundred.

The next choice, which seems to be turning out better, is to use one of
the variations on DJB's cdb.

What I ought to be able to do is thus:

/* cdbmp is a struct cdb_make *, presumably already set up with
   cdb_make_start(); declarations added here for completeness. */
char line[200];
int lineno = 0;

while (fgets(line, sizeof line, stdin) != NULL) {
    size_t llen = strlen(line);
    lineno++;
    int rc = cdb_make_put(cdbmp, line, llen, &lineno, sizeof(lineno),
                          CDB_PUT_ADD);
    if (rc != 0) {
        printf("rc = %d - %s repeated on line %d\n", rc, line, lineno);
        exit(1);
    }
}
exit(0);

In theory, I might even be able to throw this at /dev/null, so that
data merely gets thrown in the trash...
--
(reverse (concatenate 'string "moc.enworbbc@" "sirhc"))
http://cbbrowne.com/info/linuxxian.html
"In my opinion MS is a lot better at making money than it is at making
good operating systems." -- Linus Torvalds

Ralph Corderoy

Dec 30, 2002, 7:01:34 PM
Hi Christopher,

> The next choice, that seems to be turning out better, is to use one of
> the variations on DJB's cdb.

Could you give us some outputs of time(1) using the various methods
together with what you'd like to achieve? I'd be interested to see the
difference between your original perl script and the one-liner I
proposed in response to the awk one-liner.
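
Something along these lines would do for gathering the numbers (the
file name and field list are placeholders; the keys are extracted once
so the cut itself isn't being timed):

  cut -f 1,7,9 some_data_file > keys

  time sh -c 'sort keys | uniq | wc -l; wc -l < keys'
  time sh -c 'sort keys | sort -u -c'
  time awk 'x[$0]++ { exit 1 }' keys
  time perl -ne '$h{$_}++ and exit 1' keys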

Cheers,


Ralph.
