Need help ** removing duplicate rows **

Randal Schwartz

Oct 30, 1990, 7:36:27 PM

In article <1990Oct30.2...@agate.berkeley.edu>, c60b-3ac@web (Eric Thompson) writes:
| I have a few very long files that contain rows of ASCII data. Each row
| looks something like this (not the actual data here):
|
| a:A:b:c:d:e:f:g:h:i:j:k:l:m
| a:B:b:c:d:e:f:g:h:i:j:k:l:m
| a:C:b:c:d:e:f:g:h:i:j:k:l:m
| a:D:b:c:d:e:f:g:h:i:j:k:l:m
| b:A:n:o:p:q:s:t:u:v:w:x:y:z
| c:A:x:a:x:b:x:c:d:a:m:l:v:x
| d:A:m:l:k:j:i:h:g:f:e:d:c:b
| d:B:m:l:k:j:i:h:g:f:e:d:c:b
| d:C:m:l:k:j:i:h:g:f:e:d:c:b
|
| It's the second column that's important. If there are multiple rows that
| are exactly the same except for the second column, I want to GET RID of them.
| If the row is unique (for example, the ones starting with "b" and "c" above)
| then it should stay. Sounds like what I need is a way to filter out rows
| that are duplicate except in the second column.

A one-liner in Perl:

perl -ne '($a,$b,$c) = split(":",$_,3); print unless $seen{$a,$c}++;'

Fast enough?

print "Just another Perl hacker,"
--
/=Randal L. Schwartz, Stonehenge Consulting Services (503)777-0095 ==========\
| on contract to Intel's iWarp project, Beaverton, Oregon, USA, Sol III |
| mer...@iwarp.intel.com ...!any-MX-mailer-like-uunet!iwarp.intel.com!merlyn |
\=Cute Quote: "Intel put the 'backward' in 'backward compatible'..."=========/
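A minimal sketch of the same first-occurrence filter, restated in awk rather than Perl so it can be tried with only POSIX tools. The helper name `dedup_seen` is made up for illustration and does not come from the thread. Like the Perl one-liner, it remembers every distinct key it has seen, so memory grows with the number of distinct rows.

```shell
#!/bin/sh
# dedup_seen (hypothetical name): print a row only the first time its
# key is seen, where the key is the row with the second field removed.
dedup_seen() {
    awk '{
        key = $0 ""              # force a string copy of the row
        sub(/:[^:]*/, "", key)   # drop ":field2" so column 2 is ignored
        if (!(key in seen)) {    # first row with this key
            seen[key] = 1
            print
        }
    }' "$@"
}

# Sample rows from the original question.
printf '%s\n' \
    'a:A:b:c:d:e:f:g:h:i:j:k:l:m' \
    'a:B:b:c:d:e:f:g:h:i:j:k:l:m' \
    'a:C:b:c:d:e:f:g:h:i:j:k:l:m' \
    'a:D:b:c:d:e:f:g:h:i:j:k:l:m' \
    'b:A:n:o:p:q:s:t:u:v:w:x:y:z' \
    'c:A:x:a:x:b:x:c:d:a:m:l:v:x' \
    'd:A:m:l:k:j:i:h:g:f:e:d:c:b' \
    'd:B:m:l:k:j:i:h:g:f:e:d:c:b' \
    'd:C:m:l:k:j:i:h:g:f:e:d:c:b' | dedup_seen
# prints the a:A, b:A, c:A and d:A rows, in that order
```

Because the lookup table is keyed on the whole row minus one field, this works on unsorted input, at the cost of holding one key per distinct row in memory.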

Larry Wall

Oct 30, 1990, 8:26:06 PM

In article <1990Oct31....@iwarp.intel.com> mer...@iwarp.intel.com (Randal Schwartz) writes:

: In article <1990Oct30.2...@agate.berkeley.edu>, c60b-3ac@web (Eric Thompson) writes:
: | I have a few very long files that contain rows of ASCII data. Each row
: | looks something like this (not the actual data here):
: |
: | a:A:b:c:d:e:f:g:h:i:j:k:l:m
: | a:B:b:c:d:e:f:g:h:i:j:k:l:m
: | a:C:b:c:d:e:f:g:h:i:j:k:l:m
: | a:D:b:c:d:e:f:g:h:i:j:k:l:m
: | b:A:n:o:p:q:s:t:u:v:w:x:y:z
: | c:A:x:a:x:b:x:c:d:a:m:l:v:x
: | d:A:m:l:k:j:i:h:g:f:e:d:c:b
: | d:B:m:l:k:j:i:h:g:f:e:d:c:b
: | d:C:m:l:k:j:i:h:g:f:e:d:c:b
: |
: | It's the second column that's important. If there are multiple rows that
: | are exactly the same except for the second column, I want to GET RID of them.
: | If the row is unique (for example, the ones starting with "b" and "c" above)
: | then it should stay. Sounds like what I need is a way to filter out rows
: | that are duplicate except in the second column.
:
: A one-liner in Perl:
:
: perl -ne '($a,$b,$c) = split(":",$_,3); print unless $seen{$a,$c}++;'
:
: Fast enough?

Maybe, but he said they were very long files, and that may mean more than
you'd want to store in an associative array, even with virtual memory.
Presuming the files are sorted reasonably, you can get away with this:

perl -ne '($this = $_) =~ s/:[^:]*//; print if $this ne $that; $that = $this'

Of course, someone will post a solution using cut and uniq, which will be
fine if you don't mind losing the second field. Or swapping the first
two fields around. I'll leave the awk and sed solutions to someone else.

Larry
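A small sketch of the adjacent-comparison idea, restated in awk so it runs with only POSIX tools. It shows why the "sorted reasonably" condition matters. The helper name `dedup_adjacent` and the three-field sample rows are illustrative assumptions, not part of the thread.

```shell
#!/bin/sh
# dedup_adjacent (hypothetical name): compare each row only with the
# row immediately before it, ignoring the second field. Needs constant
# memory, but rows that match must already sit next to each other.
dedup_adjacent() {
    awk '{
        key = $0 ""              # force a string copy of the row
        sub(/:[^:]*/, "", key)   # drop ":field2"
        if (NR == 1 || key != prev) print
        prev = key
    }' "$@"
}

# Matching rows are adjacent: the second "a" row is dropped.
printf '%s\n' 'a:A:x' 'a:B:x' 'b:A:y' | dedup_adjacent
# prints a:A:x and b:A:y

# Matching rows are separated: nothing is dropped.
printf '%s\n' 'a:A:x' 'b:A:y' 'a:B:x' | dedup_adjacent
# prints all three rows
```

Only one previous key is kept, so the filter handles arbitrarily long files, provided rows that differ only in the second field arrive together.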

Dan Bernstein

Oct 31, 1990, 12:18:32 AM

In article <10...@jpl-devvax.JPL.NASA.GOV> lw...@jpl-devvax.JPL.NASA.GOV (Larry Wall) writes:
> In article <1990Oct31....@iwarp.intel.com> mer...@iwarp.intel.com (Randal Schwartz) writes:
> : In article <1990Oct30.2...@agate.berkeley.edu>, c60b-3ac@web (Eric Thompson) writes:

[ if multiple (consecutive?) rows of colon-separated columns ]
[ differ only in the second column, scrap 'em ]

> : perl -ne '($a,$b,$c) = split(":",$_,3); print unless $seen{$a,$c}++;'
> : Fast enough?

[ as happens with every Perl program posted to the net, Larry points ]
[ out how inefficient this can be: ]

> Maybe, but he said they were very long files, and that may mean more than
> you'd want to store in an associative array, even with virtual memory.
> Presuming the files are sorted reasonably, you can get away with this:
> perl -ne '($this = $_) =~ s/:[^:]*//; print if $this ne $that; $that = $this'

That does look like what Eric was asking for, but what if the file is
not sorted? Is there a fast Perl solution?

> Of course, someone will post a solution using cut and uniq, which will be
> fine if you don't mind losing the second field. Or swapping the first
> two fields around.

cut? uniq? Why? There's already a tool perfectly matched to the job:

sort -u -t: +0 -1 +2

sort already knows how to work in limited memory. If the input is
already sorted,

sort -m -u -t: +0 -1 +2

should do the trick. Both of these solutions are easy to figure out, easy
to type, very fast even on long files, and quite portable.

> I'll leave the awk and sed solutions to someone else.

Yes, I seem to always be defending the classic tools against this
onslaught of Perl code that nobody but you can ever optimize.

---Dan
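For readers on a current system: the `+0 -1 +2` key notation is the old zero-based form, and many present-day sort implementations reject or deprecate it in favor of `-k`. A hedged sketch of what should be the equivalent command follows. The wrapper name `dedup_sort` and the `LC_ALL=C` setting are assumptions added here for reproducible ordering, not part of the original post.

```shell
#!/bin/sh
# dedup_sort (hypothetical name): sort on field 1, then on field 3
# through end of line, and keep one row per distinct key (-u).
# +0 -1  corresponds to  -k1,1  (first field only)
# +2     corresponds to  -k3    (third field to end of line)
dedup_sort() {
    LC_ALL=C sort -u -t: -k1,1 -k3 "$@"
}

# Unsorted input; one row per (field 1, fields 3..end) key survives.
printf '%s\n' 'd:B:m' 'a:A:x' 'b:A:y' 'a:B:x' 'd:A:m' | dedup_sort
```

Which row of each group survives is not something to rely on: POSIX only promises that all but one line per set of equal keys is suppressed. For input that is already sorted on these keys, adding `-m` as in the post skips the sorting work.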

Larry Wall

Oct 31, 1990, 12:29:08 PM

In article <28220:Oct3105:18:32...@kramden.acf.nyu.edu> brn...@kramden.acf.nyu.edu (Dan Bernstein) writes:
: > I'll leave the awk and sed solutions to someone else.
:
: Yes, I seem to always be defending the classic tools against this
: onslaught of Perl code that nobody but you can ever optimize.

You're obviously too defensive. :-)

I often post non-Perl solutions if I think they're appropriate. And I
freely admit that I overlooked sort -u, which is, as you say, perfect for
the job. The sort I grew up with didn't have -u, so I never seem to think
of it. Dratted fossilized neurons...

And what's so amazing about me being better than other people with Perl?
I bet you're better than me with auth. You push auth a little, I push
Perl a little, and the world becomes a better place. If you consistently
take an antagonistic approach, however, people are going to start
thinking you're from New York. :-)

Love,
Larry

Gary Weimer

Nov 7, 1990, 3:56:44 PM

In article <10...@jpl-devvax.JPL.NASA.GOV> lw...@jpl-devvax.JPL.NASA.GOV (Larry Wall) writes:

>In article <1990Oct31....@iwarp.intel.com> mer...@iwarp.intel.com (Randal Schwartz) writes:
>: In article <1990Oct30.2...@agate.berkeley.edu>, c60b-3ac@web (Eric Thompson) writes:

>: | Sounds like what I need is a way to filter out rows
>: | that are duplicate except in the second column.
>:
>: A one-liner in Perl:
>:
>: perl -ne '($a,$b,$c) = split(":",$_,3); print unless $seen{$a,$c}++;'
>:
>: Fast enough?
>
>Maybe, but he said they were very long files, and that may mean more than
>you'd want to store in an associative array, even with virtual memory.
>Presuming the files are sorted reasonably, you can get away with this:
>
>perl -ne '($this = $_) =~ s/:[^:]*//; print if $this ne $that; $that = $this'
>
>Of course, someone will post a solution using cut and uniq, which will be
>fine if you don't mind losing the second field. Or swapping the first
>two fields around. I'll leave the awk and sed solutions to someone else.

Who needs sed?

awk -F: '{cur=$1$3$4$5$6$7$8$9$10$11$12$13$14;
          if(cur!=prev){prev=cur;print $0}}' InFile > OutFile

NOTE: the awk program is split across two lines to fit in 80 columns; the
line break inside the quotes is harmless, so it runs as typed.
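One reading of the original question is that every row belonging to a duplicated group should go, leaving only rows like the "b" and "c" ones. The solutions posted in this thread all keep one row per group instead. A hedged two-pass sketch of the stricter reading follows. The helper name `keep_unique_only` is hypothetical, and it assumes the data is in a regular file that can be read twice (not a pipe).

```shell
#!/bin/sh
# keep_unique_only FILE (hypothetical name): print only rows whose key
# (the row minus its second field) occurs exactly once in FILE.
# Pass 1 (NR == FNR) counts keys; pass 2 prints rows with a count of 1.
keep_unique_only() {
    awk '{
        key = $0 ""              # force a string copy of the row
        sub(/:[^:]*/, "", key)   # drop ":field2"
        if (NR == FNR) { count[key]++; next }
        if (count[key] == 1) print
    }' "$1" "$1"
}
```

Called as `keep_unique_only InFile > OutFile`, this stores one counter per distinct key, so it has the same memory cost as the associative-array approach, and it does not need sorted input.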