deleting duplicates

Sivaram Neelakantan

Jun 28, 2016, 11:59:24 PM

I have data that looks like this

c1 c2
------
1 4
1 2
1 5
2 1
4 1

The pairs (1,4) and (4,1) are considered duplicates, as are (1,2) and
(2,1). How do I delete one row of each duplicate pair? It doesn't
matter which row is deleted.

sivaram
--

Janis Papanagnou

Jun 29, 2016, 12:38:48 AM

One possibility with awk...

awk 'NR<=2 || !(($1,$2) in a); { a[$1,$2] ; a[$2,$1] }' your_data_file
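
For readers less used to awk's pattern-action shorthand, the same logic can be spelled out as below. This is only a sketch: it assumes the two-column, whitespace-separated layout from the original post, and the wrapper name `dedupe_pairs` and the array name `seen` are made up for illustration.

```shell
# dedupe_pairs is a hypothetical wrapper around the one-liner.
dedupe_pairs() {
    awk '
        # Print the two header lines, plus any line whose pair is new.
        NR <= 2 || !(($1, $2) in seen) { print }
        # Record the pair in both orders so a mirrored row matches later.
        { seen[$1, $2]; seen[$2, $1] }
    ' "$@"
}

# Sample data from the original post, fed on standard input.
printf '%s\n' 'c1 c2' '------' '1 4' '1 2' '1 5' '2 1' '4 1' | dedupe_pairs
```

On the sample data this should keep the header plus the rows 1 4, 1 2 and 1 5, and drop 2 1 and 4 1.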

Janis

Ben Bacarisse

Jun 29, 2016, 6:36:06 AM

Do you really need the NR<=2 test?

Another slight variation that simplifies it a bit would be

awk '!(($1,$2) in a) { print; a[$1,$2]; a[$2,$1] }'

--
Ben.

Janis Papanagnou

Jun 29, 2016, 7:25:14 AM

On 29.06.2016 12:36, Ben Bacarisse wrote:
> Janis Papanagnou <janis_pa...@hotmail.com> writes:
>
>> One possibility with awk...
>>
>> awk 'NR<=2 || !(($1,$2) in a); { a[$1,$2] ; a[$2,$1] }' your_data_file
>
> Do you really need the NR<=2 test?

Its purpose is to make it obvious that the first two lines should always
be passed through unchanged. Technically speaking, you don't need it,
because the second condition is also true for those lines, but stating the
condition explicitly is more robust against later changes and smells less
like a hack. I prefer to keep the conditions for syntactically different
header lines semantically separate in the awk code.
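
One way to take that separation a step further is to give the header its own rule, so that header lines are neither tested nor recorded. This is a variation rather than the code posted earlier, sketched under the assumption that the header is always exactly two lines; the name `dedupe_body` is made up.

```shell
# dedupe_body is a hypothetical name; the header is assumed to be 2 lines.
dedupe_body() {
    awk '
        # Header rule: pass the first two lines through untouched.
        NR <= 2 { print; next }
        # Data rule: print a pair only if neither order was seen before.
        !(($1, $2) in seen) { print; seen[$1, $2]; seen[$2, $1] }
    ' "$@"
}

printf '%s\n' 'c1 c2' '------' '1 4' '1 2' '1 5' '2 1' '4 1' | dedupe_body
```

If the header later grows or shrinks, only the first rule needs to change.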

Janis

Ed Morton

Jun 29, 2016, 8:54:10 AM

$ awk '!seen[$1>$2? $1 FS $2 : $2 FS $1]++' file
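
The one-liner builds a single order-independent key per row by always putting the larger field first, so 1 4 and 4 1 both map to the key "4 1", and only the first row for each key is printed. Spelled out as a sketch (the wrapper name `dedupe_canonical` and the variable `key` are made up; the logic is meant to mirror the one-liner):

```shell
# dedupe_canonical is a hypothetical wrapper around the one-liner.
dedupe_canonical() {
    awk '
        {
            # Build an order-independent key: larger field first.
            key = ($1 > $2) ? $1 FS $2 : $2 FS $1
            # Print only the first row seen for each key.
            if (!seen[key]++) print
        }
    ' "$@"
}

printf '%s\n' 'c1 c2' '------' '1 4' '1 2' '1 5' '2 1' '4 1' | dedupe_canonical
```

Compared with recording both orders, this stores one array entry per pair instead of two.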

Kaz Kylheku

Jun 29, 2016, 9:48:49 AM

On 2016-06-29, Janis Papanagnou <janis_pa...@hotmail.com> wrote:
> On 29.06.2016 12:36, Ben Bacarisse wrote:
>> Janis Papanagnou <janis_pa...@hotmail.com> writes:
>>
>>> One possibility with awk...
>>>
>>> awk 'NR<=2 || !(($1,$2) in a); { a[$1,$2] ; a[$2,$1] }' your_data_file
>>
>> Do you really need the NR<=2 test?
>
> Its purpose is to make it obvious that the first two lines should always
> be passed through unchanged. Technically speaking, you don't need it,
> because the second condition is also true for those lines, but stating the
> condition explicitly is more robust against later changes and smells less
> like a hack. I prefer to keep the conditions for syntactically different
> header lines semantically separate in the awk code.

You might need that test for correctness, because the requirement for
the following input is quite possibly that the header does not establish
a de-duplicating entry that removes the "c2 c1" datum.

c1 c2
------
1 4
1 2

c2 c1
2 1
4 1

The "c2 c1" may have to be output.
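
To make the concern concrete: any variant that records every line also records the first header "c1 c2" in both orders, so a later "c2 c1" line is silently dropped. Note that in the quoted one-liner the recording block runs on every line, header included, so the NR<=2 test on its own would not keep that line either. One possible guard, offered only as a sketch, is to de-duplicate just the rows whose two fields are both unsigned integers and pass everything else through. The integer test and the name `dedupe_numeric` are assumptions, not something stated in the thread.

```shell
# dedupe_numeric is a hypothetical name; "data row" is assumed to mean
# a row with two unsigned integer fields.
dedupe_numeric() {
    awk '
        # Rows that are not two unsigned integers (headers, rules, blanks)
        # are printed as-is and never recorded.
        $1 !~ /^[0-9]+$/ || $2 !~ /^[0-9]+$/ { print; next }
        # Data rows: keep the first occurrence of a pair in either order.
        !(($1, $2) in seen) { print; seen[$1, $2]; seen[$2, $1] }
    ' "$@"
}

printf '%s\n' 'c1 c2' '------' '1 4' '1 2' '' 'c2 c1' '2 1' '4 1' | dedupe_numeric
```

This still de-duplicates across the whole file, so 2 1 and 4 1 are dropped while the blank line and "c2 c1" survive. Whether each section should instead start with a clean slate is a requirement the thread leaves open.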

Sivaram Neelakantan

Jun 29, 2016, 10:45:33 AM

On Wed, Jun 29 2016, Janis Papanagnou wrote:

[snipped 16 lines]

> One possibility with awk...
>
> awk 'NR<=2 || !(($1,$2) in a); { a[$1,$2] ; a[$2,$1] }' your_data_file
>
>

[snipped 8 lines]

Thank you, that worked exactly as I wanted.

sivaram
--

Sivaram Neelakantan

Jun 29, 2016, 10:46:34 AM

On Wed, Jun 29 2016, Ben Bacarisse wrote:

[snipped 23 lines]

>
> Another slight variation that simplifies it a bit would be
>
> awk '!(($1,$2) in a) { print; a[$1,$2]; a[$2,$1] }'

This works too. Thanks.

sivaram
--

Sivaram Neelakantan

Jul 2, 2016, 2:20:10 PM

On Wed, Jun 29 2016, Janis Papanagnou wrote:

I also found that rewriting each row as (min(C1,C2), max(C1,C2)) turns
the duplicates into identical rows:

1,2
1,2
1,4
1,4

which I can then sort and de-duplicate. This seems to work for strings
too in most languages.
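
A sketch of that normalize-then-sort idea as a pipeline, assuming the two header lines can be skipped; the comma-separated output mirrors the listing above, and the names `normalize_pairs`, `lo` and `hi` are made up. Unlike the awk-only answers, this keeps neither the original row order nor the original orientation of each pair.

```shell
# normalize_pairs is a hypothetical name; the first two lines are
# assumed to be a header and are skipped.
normalize_pairs() {
    awk '
        NR > 2 {
            # Put the smaller field first so mirrored rows become identical.
            lo = ($1 <= $2) ? $1 : $2
            hi = ($1 <= $2) ? $2 : $1
            print lo "," hi
        }
    ' "$@" | sort -u    # sort -u collapses the now-identical rows
}

printf '%s\n' 'c1 c2' '------' '1 4' '1 2' '1 5' '2 1' '4 1' | normalize_pairs
```

On the sample data this should leave three rows: 1,2 and 1,4 and 1,5.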

sivaram
--

Ed Morton

Jul 3, 2016, 9:29:15 PM

Did you LOOK at the solution I posted on 6/29:

awk '!seen[$1>$2 ? $1 FS $2 : $2 FS $1]++' file

That's exactly what it does.

Ed.

Sivaram Neelakantan

Jul 3, 2016, 11:59:33 PM

On Sun, Jul 03 2016, Ed Morton wrote:

[snipped 23 lines]

>> I also found that rewriting each row as (min(C1,C2), max(C1,C2)) turns
>> the duplicates into identical rows:
>>
>> 1,2
>> 1,2
>> 1,4
>> 1,4
>>
>> which I can then sort and de-duplicate. This seems to work for strings
>> too in most languages.
>
> Did you LOOK at the solution I posted on 6/29:
>
> awk '!seen[$1>$2 ? $1 FS $2 : $2 FS $1]++' file
>
> That's exactly what it does.
>
> Ed.

My bad. I didn't check/try out the code that you posted, Ed. I apologise for
that mistake of mine.

sivaram
--