Number lines of file2 based on the contents of file1.

Hongyi Zhao

Dec 17, 2016, 7:40:05 AM

Hi all,

I have two files, file1 and file2, with the following contents:

file1:

line_number_1 line_1
line_number_2 line_2
...
line_number_n line_n

file2:

line_1
line_2
..
line_m

Now I want to number each line of file2 according to the following
rules:

[1] Line numbers start at 1 and increase by one.
[2] If a line also appears in file1, it is skipped and gets no number.
[3] If a line number already appears in file1, that number is not used.

For this purpose, I wrote the following code:

awk '
BEGIN { ind = 1 }
!x {
    appeared_ind[$1]
    appeared_line[$2]
    next
}

! ($0 in appeared_line) {
    while (ind in appeared_ind) ind ++
    print ind, $0
    ind ++
}
' file1 x=1 file2
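
To make the three rules concrete, the script above can be tried on a small made-up input. The contents of file1 and file2 below are assumptions for illustration only, not data from the original post; the awk program is the one shown above.

```shell
#!/bin/sh
# Hypothetical sample data, invented for this illustration.
tmp=$(mktemp -d)
printf '%s\n' '2 banana' '5 cherry' > "$tmp/file1"
printf '%s\n' apple banana date elder fig grape > "$tmp/file2"

# x is unset while file1 is read and set to 1 before file2 is read.
out=$(awk '
BEGIN { ind = 1 }
!x {
    appeared_ind[$1]     # remember numbers already used in file1
    appeared_line[$2]    # remember lines already present in file1
    next
}
! ($0 in appeared_line) {
    while (ind in appeared_ind) ind ++   # skip used numbers
    print ind, $0
    ind ++
}
' "$tmp/file1" x=1 "$tmp/file2")

printf '%s\n' "$out"
rm -rf "$tmp"
```

With this input, banana is dropped because it appears in file1, and the numbers 2 and 5 are skipped because file1 already uses them, so the remaining lines are numbered 1, 3, 4, 6 and 7.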

This seems to do the job, but the code does not feel very elegant. Could
you please give me some hints on polishing it?

Regards
--
.: Hongyi Zhao [ hongyi.zhao AT gmail.com ] Free as in Freedom :.

Manuel Collado

Dec 17, 2016, 1:09:02 PM

1. Let 'ind' start at 0 and keep it as the last used line number. No
need for a BEGIN rule.

2. Forget about the 'x' control variable. Use the usual NR==FNR idiom to
detect records from the first file. And invoke awk just as
awk '....' file1 file2

3. Use 'do ... while' instead of 'while ...' to search for the next
unused line number:
do ind++ while (ind in appeared_ind)
This way the last ind++ is unnecessary.

HTH.

Hongyi Zhao

Dec 17, 2016, 6:47:29 PM

On Sat, 17 Dec 2016 19:08:52 +0100, Manuel Collado wrote:

> 2. Forget about the 'x' control variable. Use the usual NR==FNR idiom to
> detect records from the first file. And invoke awk just as
> awk '....' file1 file2

I use the `x' control variable to handle the case where file1 is
empty; the `NR==FNR' method cannot handle that case.
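
The empty-file1 case can be checked directly. The sketch below is a minimal illustration, assuming a made-up file2 with two lines. It compares the NR==FNR test with `FILENAME == ARGV[1]`, a commonly used alternative that nobody in this thread proposed and that assumes file1 is the first command-line argument.

```shell
#!/bin/sh
# Hypothetical sample data: an empty file1 and a two-line file2.
tmp=$(mktemp -d)
: > "$tmp/file1"
printf '%s\n' apple banana > "$tmp/file2"

# NR==FNR: no record is read from the empty file1, so NR equals FNR
# for every record of file2 and the first rule swallows them all.
out_nr=$(awk '
NR == FNR { appeared_ind[$1]; appeared_line[$2]; next }
! ($0 in appeared_line) {
    do { ind ++ } while (ind in appeared_ind)
    print ind, $0
}
' "$tmp/file1" "$tmp/file2")

# FILENAME == ARGV[1]: true only while the first file operand is read,
# so records of file2 reach the second rule even if file1 is empty.
out_fn=$(awk '
FILENAME == ARGV[1] { appeared_ind[$1]; appeared_line[$2]; next }
! ($0 in appeared_line) {
    do { ind ++ } while (ind in appeared_ind)
    print ind, $0
}
' "$tmp/file1" "$tmp/file2")

printf 'NR==FNR output: [%s]\n' "$out_nr"
printf 'FILENAME output:\n%s\n' "$out_fn"
rm -rf "$tmp"
```

The `x=1` assignment between the file operands handles the empty file equally well; the FILENAME test is only one way to get the same effect without an extra variable.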

Hongyi Zhao

Dec 17, 2016, 7:09:21 PM

On Sat, 17 Dec 2016 19:08:52 +0100, Manuel Collado wrote:

> 3. Use 'do ... while' instead of 'while ...' to search for the next
> unused line number:
> do ind++ while (ind in appeared_ind)
> This way the last ind++ is unnecessary.

What is the corresponding for-loop version of the above code? I
tried a for loop but have not succeeded yet.
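
One possible for-loop form, offered as a sketch and not as the only answer: put the first increment in the init clause, the membership test in the condition, and the further increments in the step clause, leaving the loop body empty. The sample files are assumptions for illustration, and `ind` is taken to be the last used number, starting unset (0), as suggested earlier in the thread.

```shell
#!/bin/sh
# Hypothetical sample data, invented for this illustration.
tmp=$(mktemp -d)
printf '%s\n' '2 banana' '5 cherry' > "$tmp/file1"
printf '%s\n' apple banana date elder fig grape > "$tmp/file2"

out=$(awk '
!x {
    appeared_ind[$1]
    appeared_line[$2]
    next
}
! ($0 in appeared_line) {
    # Same effect as: do { ind ++ } while (ind in appeared_ind)
    for (ind ++; (ind in appeared_ind); ind ++)
        ;   # empty body: all the work happens in the loop header
    print ind, $0
}
' "$tmp/file1" x=1 "$tmp/file2")

printf '%s\n' "$out"
rm -rf "$tmp"
```

The parentheses around the `in` test are optional here; they are kept only to make it obvious that this is a three-clause loop and not the `for (key in array)` form.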

Hongyi Zhao

Dec 17, 2016, 7:42:25 PM

On Sat, 17 Dec 2016 19:08:52 +0100, Manuel Collado wrote:

> do ind++ while (ind in appeared_ind)

I tried and it seems that this must be written as follows:

do { ind ++ } while (ind in appeared_ind)

Janis Papanagnou

Dec 17, 2016, 7:53:21 PM

On 18.12.2016 01:42, Hongyi Zhao wrote:
> On Sat, 17 Dec 2016 19:08:52 +0100, Manuel Collado wrote:
>
>> do ind++ while (ind in appeared_ind)
>
> I tried and it seems that this must be written as follows:
>
> do { ind ++ } while (ind in appeared_ind)

or...

do ind++; while (ind in appeared_ind)

or...

do ind++
while (ind in appeared_ind)

Janis

Hongyi Zhao

Dec 17, 2016, 8:01:41 PM

On Sun, 18 Dec 2016 01:53:20 +0100, Janis Papanagnou wrote:

> or...
>
> do ind++; while (ind in appeared_ind)
>
> or...
>
> do ind++
> while (ind in appeared_ind)

Thanks for your notes.

Regards

Hongyi Zhao

Dec 17, 2016, 9:16:36 PM

On Sun, 18 Dec 2016 01:53:20 +0100, Janis Papanagnou wrote:

>> do { ind ++ } while (ind in appeared_ind)
>
> or...
>
> do ind++; while (ind in appeared_ind)
>
> or...
>
> do ind++
> while (ind in appeared_ind)

For readability, I usually put a space between the variable and the
`++' operator:

var ++

Regards

Hongyi Zhao

Dec 18, 2016, 2:51:09 AM

On Sat, 17 Dec 2016 19:08:52 +0100, Manuel Collado wrote:

> 1. Let 'ind' start at 0 and keep it as the last used line number. No
> need for a BEGIN rule.
>
> 2. Forget about the 'x' control variable. Use the usual NR==FNR idiom to
> detect records from the first file. And invoke awk just as
> awk '....' file1 file2
>
> 3. Use 'do ... while' instead of 'while ...' to search for the next
> unused line number:
> do ind++ while (ind in appeared_ind)
> This way the last ind++ is unnecessary.

Based on the above notes, I rewrote my code as follows:

awk '

!x {
    appeared_ind[$1]
    appeared_line[$2]
    next
}

! ($0 in appeared_line) {
    do { ind ++ } while (ind in appeared_ind)
    print ind, $0
}
' file1 x=1 file2

This does the job and is more concise. I also tried a for-loop based
method, as follows:

awk '
!x {
    appeared[$2] = $1
    next
}

! ($0 in appeared) {
    ind ++
    for (i in appeared) {
        if (ind == appeared[i]) { ind ++; continue }
        else
            break
    }
    print ind, $0
}
' file1 x=1 file2

However, the second method gives wrong results. What is the bug in my
code?
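
One likely explanation, sketched under assumptions: `for (i in appeared)` visits each element once, in an order awk does not specify. The `break` stops the scan at the first element that does not match, and even without it a single pass can miss a number that was visited before `ind` reached it. The helper functions below are hypothetical; they imitate one traversal with a fixed visiting order so that the effect is reproducible.

```shell
#!/bin/sh
out=$(awk '
# Hypothetical helpers: each makes one pass over vals[1..n] in index
# order, standing in for a single for-in traversal.
function pass_with_break(vals, n, ind,    i) {
    for (i = 1; i <= n; i++) {
        if (ind == vals[i]) ind++
        else break            # gives up at the first mismatch
    }
    return ind
}
function pass_no_break(vals, n, ind,    i) {
    for (i = 1; i <= n; i++)
        if (ind == vals[i]) ind++
    return ind
}
BEGIN {
    # Numbers 2 and 3 are taken; the candidate number starts at 2.
    desc[1] = 3; desc[2] = 2      # visited as 3, then 2
    asc[1]  = 2; asc[2]  = 3      # visited as 2, then 3
    print "break, descending:", pass_with_break(desc, 2, 2)
    print "no break, descending:", pass_no_break(desc, 2, 2)
    print "no break, ascending:", pass_no_break(asc, 2, 2)
}')

printf '%s\n' "$out"
```

Only the ascending pass without `break` ends on the free number 4; the other two stop at 2 or 3, which are both taken. A direct lookup such as `ind in appeared_ind` does not depend on traversal order at all.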

Regards

Hongyi Zhao

Dec 18, 2016, 9:54:18 AM

On Sun, 18 Dec 2016 07:51:09 +0000, Hongyi Zhao wrote:

> awk '
> !x {
> appeared[$2] = $1
> next
> }
>
> ! ($0 in appeared) {
> ind ++
> for (i in appeared) {
> if (ind == appeared[i]) { ind ++; continue }
> else
> break
> }
> print ind, $0
> }
> ' file1 x=1 file2

After some more thought, I found that the following does the trick:

awk '
# gawk only: make for-in visit elements in ascending numeric value order
BEGIN { PROCINFO["sorted_in"] = "@val_num_asc" }

!x {
    appeared[$2] = $1
    next
}

! ($0 in appeared) {
    ind ++
    for (i in appeared)
        if (ind == appeared[i])
            ind ++

    print ind, $0
}
' file1 x=1 file2

Still, the above code does not seem to be the most elegant solution. If
it can be polished further, please give me some hints.
Thanks in advance.

Regards