awk script to separate "duplicated" records

Harry

Aug 3, 2017, 4:58:19 PM
Hi All,

I need some help with a shell script, preferably using GNU awk.

I have an input text file Batch_1.txt containing 10K records,
where each record has the following format.

-- this is one record --
header1|h1_field1|h1_field2|...[CR]
header2|h2_field1|h2_field2|...[CR]
header3|h3_field1|phn_record_1|h3_field3|...[CR]
header4|h4_field1|h4_field2|...[CR][LF]
-- this is one record --

About 95% of the records in Batch_1.txt have unique phn
(Prov. Health Number) values, while about 5% of the records
contain duplicated phn values.

N.B.
Duplicated phn records may not be adjacent to each other.
A duplicated phn may have 2 duplicates, 3 duplicates, or more.

The goal is to generate several output files:
- Batch_1A.txt contains all records with unique phn values from Batch_1.txt
- Batch_B1.txt contains the 1st occurrences of those phn-duplicated records
- Batch_B2.txt contains the 2nd occurrences of those phn-duplicated records
- Batch_B3.txt, etc.

Example of input file (Batch_1.txt), one record per line:

.... phn_111 ...
.... phn_222 ... 1st occurrence
.... phn_333 ...
.... phn_222 ... 2nd occurrence
.... phn_444 ...

Example of output file (Batch_1A), one record per line:

.... phn_111 ...
.... phn_333 ...
.... phn_444 ...

Example of output file (Batch_B1), one record per line:

.... phn_222 ... 1st occurrence

Example of output file (Batch_B2), one record per line:

.... phn_222 ... 2nd occurrence

Also, I want to keep the original record sequences intact
in the output files, i.e. in the same relative order as the
records appear in the input file.

Any help appreciated.

TIA

Harry

Aug 3, 2017, 5:10:10 PM
P.S.
Each record has the same number of fields and the same segment headers.
PHN is always on field #16 from the start of a record, and
"|" is the field separator,
[CR] is the segment separator,
[LF] is the record separator.
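
In other words (a rough sketch of what I mean; it assumes every [LF]-terminated
line really is one complete record, with the [CR]s left embedded inside fields),
splitting only on "|" should print each record's PHN:

# counting only "|"-separated fields from the start of a record puts the PHN in $16
gawk 'BEGIN { FS = "|" } { print FNR, $16 }' Batch_1.txt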

Harry

Aug 3, 2017, 5:22:01 PM
Currently I use the following script to generate the batch_B1/B2 output,
but it has at least two problems:
1. batch_B3/B4/... etc. have to be added manually.
2. the script invokes numerous grep and sed calls.

So, I am hoping for a better solution.

$cat yy.sh
batch_file=batch_1.txt
rm -f batch_B1.txt
rm -f batch_B2.txt
for phn in `awk 'BEGIN{RS="[\r\n]";FS="|"} /PID\|/ {phn=$3;c[phn]++} END {for (v in c) if (c[v] > 1) printf("%s %d\n", v, c[v])}' < $batch_file|awk -F^ '{printf "%s\n",$1}'`
do
    grep $phn $batch_file | sed -n 1p >> batch_B1.txt
    grep $phn $batch_file | sed -n 2p >> batch_B2.txt
done

Ben Bacarisse

Aug 3, 2017, 8:13:12 PM
Harry <harryoo...@hotmail.com> writes:
<snip>
> I have an input text file Batch_1.txt containing 10K records,
> where each record has the following format.
>
> -- this is one record --
> header1|h1_field1|h1_field2|...[CR]
> header2|h2_field1|h2_field2|...[CR]
> header3|h3_field1|phn_record_1|h3_field3|...[CR]
> header4|h4_field1|h4_field2|...[CR][LF]
> -- this is one record --

This suggests you should use RS="\r\n".

What's not entirely clear is whether the extra \r characters can be
ignored or whether they behave like extra field separators. Here's a
record where they can probably be ignored:

f1|f2|\rf3|f4|f5\r\n

and here's an example where they are simply an alternate separator:

f1|f2\rf3|f4|f5\r\n

The latter seems more likely so I'll assume that FS="[|\r]".
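
A quick way to check that assumption from a shell (printf is just a convenient
way to get a literal \r into the test line):

printf 'f1|f2\rf3|f4|f5\n' | awk 'BEGIN { FS = "[|\r]" } { print NF, $3 }'

If \r acts as another separator this prints "5 f3"; if it were ignored there
would be only 4 fields, with $2 holding "f2\rf3".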

<snip>
> The goal is to generate several output files:
> - Batch_1A.txt contains all records with unique phn values from Batch_1.txt
> - Batch_B1.txt contains the 1st occurrences of those phn-duplicated records
> - Batch_B2.txt contains the 2nd occurrences of those phn-duplicated records
> - Batch_B3.txt, etc.

This, or something very like it, might work for you. I've assumed $4 is
the PHN but that will change of course.

BEGIN {
    RS = "\r\n"
    FS = "[|\r]"
    fname = "Batch_1.txt"

    while (getline < fname)
        ++count[$4]
    close(fname)

    while (getline < fname)
        print > "Batch_" (count[$4] == 1 ? "1A" : "B" ++seen[$4]) ".txt"
}

<snip>
--
Ben.

Harry

Aug 4, 2017, 12:32:46 PM
On Thursday, August 3, 2017 at 5:13:12 PM UTC-7, Ben Bacarisse wrote:

> This, or something very like it, might work for you. I've assumed $4 is
> the PHN but that will change of course.
>
> BEGIN {
>     RS = "\r\n"
>     FS = "[|\r]"
>     fname = "Batch_1.txt"
>
>     while (getline < fname)
>         ++count[$4]
>     close(fname)
>
>     while (getline < fname)
>         print > "Batch_" (count[$4] == 1 ? "1A" : "B" ++seen[$4]) ".txt"
> }
>
> <snip>
> --
> Ben.

The phn is on $18 when both "|" and [CR] are used as field separators.

So, the following works for me, at lightning speed compared with
my previous inefficient solution.

$ cat xx.awk
BEGIN {
    RS = "\n"
    FS = "[|\r]"
    fname = "Batch_1.txt"

    while (getline < fname)
        ++count[$18]
    close(fname)

    while (getline < fname)
        print > "Batch_" (count[$18] == 1 ? "1A" : "B" ++seen[$18]) ".txt"
}

The only little issue is that the relative record sequences are not
preserved.
Any further suggestions on keeping the relative record sequences?

E.g. phn_66666 has two duplicates in the original file, and so does phn_99999.
They should show up in Batch_B1.txt and Batch_B2.txt in the same relative
order.

Thank you Ben.

Ed Morton

Aug 4, 2017, 12:55:00 PM
I don't have time right now to read/understand the original question, but as
best I can tell the above script can be written more awk-ishly and robustly,
without getlines (the above will go into an infinite loop on getline failures), as:

BEGIN {
    RS = "\r\n"
    FS = "[|\r]"
    ARGV[1] = ARGV[2] = "Batch_1.txt"
    ARGC = 3
}
NR==FNR { ++count[$4]; next }
{ print > ("Batch_" (count[$4] == 1 ? "1A" : "B" ++seen[$4]) ".txt") }

I also added parens to the print statement, as an expression on the right side of
print > without them is undefined behavior per POSIX. It will only run on gawk
anyway though due to the multi-char RS, so that wasn't strictly necessary in this
case, as gawk is one awk variant that would have worked without the extra parens.

Regards,

Ed.

Ed Morton

Aug 4, 2017, 1:09:51 PM
The above simply prints the input in the same order it is read so what do you
mean by that?

>
> E.g. phn_66666 has two duplicates in the original file, and so does phn_99999.
> They should show up in Batch_B1.txt and Batch_B2.txt in the same relative
> order.
>
> Thank you Ben.
>

Post some concise, testable sample input and expected output that demonstrate
your problem so we can help debug it.

Ed.

Harry

Aug 4, 2017, 2:00:02 PM
On Friday, August 4, 2017 at 10:09:51 AM UTC-7, Ed Morton wrote:

> The above simply prints the input in the same order it is read so what do you
> mean by that?

I think I might have been wrong when I said the sequences were not preserved.

When I compared Batch_B1.txt and Batch_B2.txt, each with 100+
entries, their phn sequences did not match each other.
That may be because the phn sequences of the duplicates in Batch_1.txt
are interlaced among themselves.

Sorry, I may have come to the wrong conclusion when I replied to Ben's post.
With 20K records in my original input, I still need someone to verify
the solution.

Sorry about the confusion.

Ben Bacarisse

Aug 4, 2017, 6:41:37 PM
Ed Morton <morto...@gmail.com> writes:

> On 8/3/2017 7:13 PM, Ben Bacarisse wrote:
<snip>
>> This, or something very like it, might work for you. I've assumed $4 is
>> the PHN but that will change of course.
>>
>> BEGIN {
>>     RS = "\r\n"
>>     FS = "[|\r]"
>>     fname = "Batch_1.txt"
>>
>>     while (getline < fname)
>>         ++count[$4]
>>     close(fname)
>>
>>     while (getline < fname)
>>         print > "Batch_" (count[$4] == 1 ? "1A" : "B" ++seen[$4]) ".txt"
>> }
>>
>> <snip>
>
> I don't have time right now to read/understand the original question,
> but as best I can tell the above script can be written more awk-ishly and
> robustly, without getlines (the above will go into an infinite loop on
> getline failures), as:
>
> BEGIN {
>     RS = "\r\n"
>     FS = "[|\r]"
>     ARGV[1] = ARGV[2] = "Batch_1.txt"
>     ARGC = 3
> }
> NR==FNR { ++count[$4]; next }
> { print > ("Batch_" (count[$4] == 1 ? "1A" : "B" ++seen[$4]) ".txt") }

I considered the "double arg" idea but I am not a huge fan of
it. It's a bit tricksy for my taste, unless something else in the
problem makes it the simpler alternative.

But what makes it more AWK-ish? Both seem to be rather out on a limb
compared to most AWK programs!

> I also added parens to the print statement, as an expression on the
> right side of print > without them is undefined behavior per POSIX. It
> will only run on gawk anyway though due to the multi-char RS, so that
> wasn't strictly necessary in this case, as gawk is one awk variant that
> would have worked without the extra parens.

As it happens, gawk was preferred by the OP.

--
Ben.

Ed Morton

Aug 5, 2017, 12:53:50 AM
I wouldn't naturally do it that way either; I'd just specify the file name twice
on the command line, but I wanted to provide a script that did precisely the same
as yours, just without the getline loops.

>
> But what makes it more AWK-ish? Both seem to be rather out on a limb
> compared to most AWK programs!

What I wrote is pretty mundane if you pass in a file name twice instead of
hard-coding it.

Awk has an implicit while-read loop, so it's more awk-ish to let it just do what
it does by default rather than introducing two hand-written while-read loops and
bypassing the normal awk behavior. That in turn stops you from using awk's basic
`condition { action }` syntax, and so overall makes your program lengthier, more
fragile, and generally more C-like than awk-like.

>
>> I also added parens to the print statement, as an expression on the
>> right side of print > without them is undefined behavior per POSIX. It
>> will only run on gawk anyway though due to the multi-char RS, so that
>> wasn't strictly necessary in this case, as gawk is one awk variant that
>> would have worked without the extra parens.
>
> As it happens, gawk was preferred by the OP.

Right, parenthesizing the expression is just a good habit to get into so you
don't get bit when you happen to not be using gawk.

Ed.

Ben Bacarisse

Aug 5, 2017, 8:03:56 AM
I was not really commenting on how the double args were provided. But
requiring that the same argument be repeated in the AWK command line is,
if anything, slightly worse. I'd want a comment to explain the usage,
or I'd put the whole thing in a shell script that provides the double
argument. I think doing it in BEGIN as you do is better.

In posh code, I might duplicate ARGV[1] after checking that ARGC is 2.
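
Something along these lines is what I have in mind (just a sketch; it assumes
exactly one file name is supplied on the command line):

BEGIN {
    if (ARGC == 2) {             # exactly one file operand was given
        ARGV[ARGC] = ARGV[1]     # queue the same file up a second time
        ARGC++
    }
}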

> Awk has an implicit while-read loop, so it's more awk-ish to let it just do
> what it does by default rather than introducing two hand-written while-read
> loops and bypassing the normal awk behavior. That in turn stops you from using
> awk's basic `condition { action }` syntax, and so overall makes your program
> lengthier, more fragile, and generally more C-like than awk-like.

AWK's inherent behaviour is not really what's needed in this case, but
if avoiding it is not AWKish, so be it (I am no authority on what's
idiomatic AWK).

<snip>
--
Ben.

Ed Morton

Aug 5, 2017, 10:44:37 AM
No need to check ARGC. In general, if I were calling awk from the command line
I'd do:

awk 'BEGIN{ARGV[ARGC]=ARGV[ARGC-1]; ARGC++} NR==FNR{..; next} {..}' file

and from inside a shell script where I have the file name in a variable:

awk 'NR==FNR{..; next} {..}' "$file" "$file"

and only if I for some reason needed a hard-coded file name in an awk script
would I use this (what I posted earlier was just a simplified version of this to
keep it as clear as possible):

awk 'BEGIN{ARGV[ARGC+1]=ARGV[ARGC]="file"; ARGC+=2} NR==FNR{..; next} {..}' file

I see/write code like that frequently, so it seemed perfectly clear to me, but
thinking about it now I can understand that it's non-obvious the first time you
come across it and worth a comment!

>
>> Awk has an implicit while-read loop, so it's more awk-ish to let it just do
>> what it does by default rather than introducing two hand-written while-read
>> loops and bypassing the normal awk behavior. That in turn stops you from
>> using awk's basic `condition { action }` syntax, and so overall makes your
>> program lengthier, more fragile, and generally more C-like than awk-like.
>
> AWK's inherent behaviour is not really what's needed in this case, but
> if avoiding it is not AWKish, so be it (I am no authority on what's
> idiomatic AWK).

It's extremely common to have to parse the same file twice: often to identify
lines/blocks on the first pass and then do something with them on the second, or
to count something(s) in the first pass and use those counts in the second pass,
as we did in this example. While you can write your own code to bypass it if
you choose, awk's inherent behavior is perfectly suited for doing so.
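
Putting the pieces of this thread together, that two-pass idiom in its plain
command-line form might look like this (a sketch: the FS choice and the $18 PHN
position are taken from the earlier posts, and RS is left at the default "\n"
as in Harry's working version):

gawk 'BEGIN { FS = "[|\r]" }
      NR == FNR { ++count[$18]; next }          # 1st pass: count each PHN
      { print > ("Batch_" (count[$18] == 1 ? "1A" : "B" ++seen[$18]) ".txt") }
     ' Batch_1.txt Batch_1.txt

Because the second pass reads Batch_1.txt in its original order, the relative
record order is preserved within each output file.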

Ed.
>
> <snip>
>

Ed Morton

Aug 5, 2017, 11:03:24 AM
Actually, I'd only do the above if I were writing a script to call using a
shebang within a shell script that takes a file name argument, or an awk script
to invoke with awk -f script file. Otherwise, realistically, if I'm writing a
one-liner to call from the command line I'd just do:

awk 'NR==FNR{..; next} {..}' file file

>
> and from inside a shell script where I have the file name in a variable:
>
> awk 'NR==FNR{..; next} {..}' "$file" "$file"
>
> and only if I for some reason needed a hard-coded file name in an awk script
> would I use this (what I posted earlier was just a simplified version of this to
> keep it as clear as possible):
>
> awk 'BEGIN{ARGV[ARGC+1]=ARGV[ARGC]="file"; ARGC+=2} NR==FNR{..; next} {..}' file

The above should have been minus the file arg at the end:

awk 'BEGIN{ARGV[ARGC+1]=ARGV[ARGC]="file"; ARGC+=2} NR==FNR{..; next} {..}'

>