Re-ordering lines (records).

John Fitzsimons

Nov 5, 2009, 8:47:37 PM

Hi,

Windows Gawk newbie here.

Suppose one has data in the following style:

Line one :
Line two :
Line three :
Line four :
Line five :

Line one :
Line four :
Line two :
Line five :

Line one :
Line three :
Line four :
Line two :
Line five :
Line six :

In other words, some lines are out of sequence, others are missing, and
some blocks include additional lines.

How could I re-arrange the above so that the five records are in the
correct order in each block, with lines not named one, ..., five
ignored? Also, in each block, any missing line should be replaced by
e.g. text saying "missing", or maybe a zero.

The data blocks could be larger, e.g. 20 items. I assume that an
array could cope with this, but I don't know the correct syntax. I
found examples on the web that talked about re-ordering fields,
but none about re-ordering more than a couple of lines.

Can anyone here help please ?

Regards, John.

nag

Nov 5, 2009, 11:36:55 PM

On Nov 6, 6:47 am, John Fitzsimons <DELETEucwubq...@sneakemail.com>
wrote:

Can you give some sample data?

Janis Papanagnou

Nov 6, 2009, 12:34:24 AM

Any solution depends on the concrete data. I fear your sample data is not
described precisely enough, but you may try this awk code (which works
with your test data) on your actual data and modify it as appropriate:

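BEGIN {
    RS="" ; FS="\n"    # paragraph mode: one block per record, one line per field
    n=split("one|two|three|four|five",keys,"|")
}
{
    # walk the keys in the wanted order and test each against the whole block
    for (i=1; i<=n; i++)
    {
        line="Line "keys[i]" :"
        if ($0 ~ line)
            print line
        else
            print "missing"
    }
    print ""
}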

What you probably have to change for your real data is printing the actual
line contents instead of the key; there are various ways to do that, and
the most appropriate way depends on your actual data.

One way could be to iterate over the lines in each block, like (pseudo code)

{ for (i=1; i<=NF; i++) if ($i in keys) print index_of(keys) }

with appropriate access (construct an index array) to the keys index with
some user defined index_of() function, which may be defined as a plain array
in the BEGIN block by

for (i=1; i<=n; i++)
ikeys[keys[i]]=i

Another way is to memorize the lines for each key in a block, to access the
line contents more easily.

Or nest two loops (effectively an O(n^2) selection sort on the blocks): one
loop iterating over the keys from 1 to n, the other over the lines in the
block from 1 to NF, comparing whether the current line matches the key element.
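
The nested-loop variant could be sketched roughly as follows. This is only an illustration under assumptions: the wrapper name reorder is made up, the "Line <name> :" layout is taken from the sample data in this thread, and each line is assumed to start with that text so a plain prefix test is enough.

```shell
# Sketch only: reorder() is a hypothetical wrapper name, and the
# "Line <name> :" layout is assumed from the sample data in this thread.
reorder() {
    awk '
    BEGIN {
        RS = ""; FS = "\n"                      # one block per record, one line per field
        n = split("one|two|three|four|five", keys, "|")
    }
    {
        for (k = 1; k <= n; k++) {              # outer loop: keys in the wanted order
            found = 0
            prefix = "Line " keys[k] " :"
            for (i = 1; i <= NF; i++) {         # inner loop: lines of this block
                if (index($i, prefix) == 1) {   # the line starts with the key text
                    print $i                    # print the actual line contents
                    found = 1
                    break
                }
            }
            if (!found)
                print "missing"
        }
        print ""                                # blank line between output blocks
    }' "$@"
}
```

It would be called as "reorder file", or with the data on standard input. Lines whose name is not in the key list are never matched by the inner loop, so they are dropped.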

But before I continue with code, guessing what you need, please be more
precise about your data.

Janis


Ed Morton

Nov 6, 2009, 9:11:39 AM

Try this:
$ cat file

Line one :
Line two :
Line three :
Line four :
Line five :

Line one :
Line four :
Line two :
Line five :

Line one :
Line three :
Line four :
Line two :
Line five :
Line six :

$
$ cat tst.awk
BEGIN{
    names="one two three four five"
    numNbrs = split(names,nbr2name," ")
    for (i=1;i<=numNbrs;i++) {
        name2nbr[nbr2name[i]]=i
    }
    RS=""; FS="\n"
}
{
    for (i=1;i<=NF;i++) {
        split($i,flds,/[[:space:]]+/)
        name = flds[2]
        if (flds[1]" "name" "flds[3] ~ /Line [[:alpha:]]+ :/) {
            if (name in name2nbr) {
                nbr = name2nbr[name]
                if (nbr in rec) {
                    print "duplicates",rec[nbr],$i
                }
                rec[nbr] = $i
            }
        }
    }
    for (nbr=1;nbr<=numNbrs;nbr++) {
        print (nbr in rec ? rec[nbr] : "missing " nbr2name[nbr])
        delete rec[nbr]
    }
    print ""
}
$
$ awk -f tst.awk file

Line one :
Line two :
Line three :
Line four :
Line five :

Line one :
Line two :
missing three
Line four :
Line five :

Line one :
Line two :
Line three :
Line four :
Line five :

Regards,

Ed.

Anton Treuenfels

Nov 7, 2009, 7:03:05 PM

"John Fitzsimons" <DELETEu...@sneakemail.com> wrote in message
news:jqs6f5p0nrvbrjscl...@4ax.com...

Here's an attempt based on the observation that a blank line separates each
block ("blank" meaning zero or more whitespace chars). It re-orders the
lines of each block and ignores any that are not named "one", "two", etc. It
also incorporates Ed's check for duplicate fields. You can build up "names"
in the BEGIN section to whatever length you need by successive concatenation
before splitting it.

BEGIN {

    names = "one two three four five" # etc.
    fieldCnt = split( names, fieldNames, " " )
}

$0 !~ /^[ \t]*$/ {

    if ( $2 in fieldNames ) {
        if ( !($2 in foundField) )
            foundField[ $2 ] = $0
        else
            print "Error: duplicate field at line " FNR " of " FILENAME
    }

    next
}

length( foundField ) {

    for ( i = 1; i <= fieldCnt; i++ ) {
        name = fieldName[ i ]
        if ( name in foundField )
            print foundField[ name ]
        else
            print "Line " name ": missing"
    }

    print "" # new blank line

    delete( foundField )
}

- Anton Treuenfels

John Fitzsimons

Nov 7, 2009, 10:08:36 PM

On Sat, 7 Nov 2009 18:03:05 -0600, "Anton Treuenfels"
<teamt...@yahoo.com> wrote:

>"John Fitzsimons" <DELETEu...@sneakemail.com> wrote in message
>news:jqs6f5p0nrvbrjscl...@4ax.com...

Hi Anton,

< snip >

Thank you, but that resulted in...

gawk.exe -f anton.awk suntest.txt >ressun.txt

gawk: anton.awk:19: (FILENAME=suntest.txt FNR=6) fatal: attempt to use
array `foundField' in a scalar context

Regards, John.

John Fitzsimons

Nov 7, 2009, 10:08:36 PM

On Fri, 06 Nov 2009 06:34:24 +0100, Janis Papanagnou
<janis_pa...@hotmail.com> wrote:

>John Fitzsimons wrote:

>> Hi,

>> Windows Gawk newbie here.

>> Suppose one has data in the following style..

>> Line one :
>> Line two :
>> Line three :

< snip >

>Any solution depends on the concrete data

< snip >

>What you probably have to change for your real data is printing the actual
>line contents instead of the key; there are various ways to do that,

< snip >

Yes, I wanted the line contents. I will think about your suggestions.
Thank you.

Regards, John.

John Fitzsimons

Nov 7, 2009, 10:08:36 PM

On Fri, 06 Nov 2009 08:11:39 -0600, Ed Morton <morto...@gmail.com>
wrote:

>John Fitzsimons wrote:

< snip >

Hi Ed,

>Try this:

< snip >

>BEGIN{
> names="one two three four five"
> numNbrs = split(names,nbr2name," ")
> for (i=1;i<=numNbrs;i++) {
> name2nbr[nbr2name[i]]=i
> }
> RS=""; FS="\n"
>}
>{
> for (i=1;i<=NF;i++) {
> split($i,flds,/[[:space:]]+/)
> name = flds[2]
> if (flds[1]" "name" "flds[3] ~ /Line [[:alpha:]]+ :/) {
> if (name in name2nbr) {
> nbr = name2nbr[name]
> if (nbr in rec) {
> print "duplicates",rec[nbr],$i
> }
> rec[nbr] = $i
> }
> }
> }
> for (nbr=1;nbr<=numNbrs;nbr++) {
> print (nbr in rec ? rec[nbr] : "missing " nbr2name[nbr])
> delete rec[nbr]
> }
> print ""
>}

Excellent ! Exactly what I wanted. You even made it easy for me to
increase/decrease the total number of lines. Very well done.

Thank you. Very much appreciated. :-)

Regards, John.

Hermann Peifer

Nov 8, 2009, 5:06:02 AM

What does gawk --version say?

`length(array)' is a gawk extension that was added in 2005. From the ChangeLog:

> Sun Jun 26 16:37:59 2005 Arnold D. Robbins <arn...@skeeve.com>
> * builtin.c (do_length): Allow array argument to length().
> Returns number of elements in array.

Hermann

John Fitzsimons

Nov 8, 2009, 10:20:53 PM

On Sun, 08 Nov 2009 11:06:02 +0100, Hermann Peifer <pei...@gmx.eu>
wrote:

>John Fitzsimons wrote:

< snip >

>> gawk.exe -f anton.awk suntest.txt >ressun.txt

>> gawk: anton.awk:19: (FILENAME=suntest.txt FNR=6) fatal: attempt to use
>> array `foundField' in a scalar context

>What does gawk --version say?

gawk --version
GNU Awk 3.1.4
Copyright (C) 1989, 1991-2003 Free Software Foundation.

Running in DOS on a win 98SE system.

Regards, John.

Anton Treuenfels

Nov 9, 2009, 12:42:49 AM

>>"John Fitzsimons" <DELETEu...@sneakemail.com> wrote in message
>>news:jqs6f5p0nrvbrjscl...@4ax.com...

> Thank you, but that resulted in...

>
> gawk.exe -f anton.awk suntest.txt >ressun.txt
>
> gawk: anton.awk:19: (FILENAME=suntest.txt FNR=6) fatal: attempt to use
> array `foundField' in a scalar context

Ah, one of the many advantages of TAWK is that I can use length() like that.
Apparently so can later versions of GAWK than you seem to have.

Anyway, it's not essential. It's only there as a flag to guard against
multiple blank lines, printing the previous record at the first blank line
and NOT printing "fieldCnt" lines of "missing" at each successive blank
line.

If you can guarantee there is only one blank line separating each record, you
can safely drop any test for that. If you can't, you may want to redundantly
set a flag in the first block every time you add an item to "foundField",
use it to check whether to execute the second block, and clear it in the
second block.
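
A rough sketch of that flag-based variant, avoiding length() on an array, might look like this. Several things here are assumptions rather than part of the script as posted: the names reorder_flag, emit, haveData and isName are made up, the isName lookup is added because "in" tests array indices rather than the values split() stores, and the END rule is added so that a final block not followed by a blank line is still printed.

```shell
# Sketch only: reorder_flag(), emit(), haveData and isName are
# illustrative names, not taken from the script as posted.
reorder_flag() {
    awk '
    function emit(   i, name) {
        for (i = 1; i <= fieldCnt; i++) {
            name = fieldNames[i]
            print ((name in foundField) ? foundField[name] : "Line " name ": missing")
        }
        print ""                    # blank line between output blocks
        split("", foundField)       # portable way to empty an array
        haveData = 0                # clear the flag
    }
    BEGIN {
        fieldCnt = split("one two three four five", fieldNames, " ")
        for (i = 1; i <= fieldCnt; i++)
            isName[fieldNames[i]] = 1   # names as indices, so "in" can test them
    }
    $0 !~ /^[ \t]*$/ {              # non-blank line: remember it if the name is known
        if ($2 in isName) {
            if (!($2 in foundField)) {
                foundField[$2] = $0
                haveData = 1        # set the flag whenever something is stored
            } else
                print "Error: duplicate field at line " FNR
        }
        next
    }
    haveData { emit() }             # blank line: print only if a block was collected
    END { if (haveData) emit() }    # last block may lack a trailing blank line
    ' "$@"
}
```

With this, runs of several blank lines print nothing extra, because the flag is already clear after the first one.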

- Anton Treuenfels

Hermann Peifer

Nov 9, 2009, 2:55:15 AM

The gawk man page says about `length(array)'
> Starting with version 3.1.5, as a non-standard extension,
> length() returns the number of elements in the array.

So you have to use a newer gawk, which you might want to do anyway.
Release 3.1.4 is more than 5 years old (from August 2004).
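
If upgrading is not an option, the number of elements can be counted with a plain loop, which should work in any awk. A minimal sketch, where count_demo and count are hypothetical names:

```shell
# Sketch only: count_demo() and count() are made-up names for illustration.
count_demo() {
    awk '
    # Count array elements without relying on length(array).
    function count(arr,   k, n) {
        n = 0
        for (k in arr)
            n++
        return n
    }
    BEGIN {
        split("one two three", a, " ")
        print count(a)      # prints 3
        split("", a)        # empty the array
        print count(a)      # prints 0
    }'
}
```

A pattern such as "count(foundField) { ... }" could then stand in for "length( foundField ) { ... }" on older versions.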

In addition, in order to make anton.awk work, you have to make the
following changes, if I am not mistaken:

# if ( $2 in fieldNames ) {
if ( index(names, $2) ) {

and

# name = fieldName[ i ]
name = fieldNames[ i ]
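
The reason behind the first change is that "in" tests array indices, and split() stores the names as values under the indices 1..n. The small demo below illustrates this. It also shows that index() does a substring test, so a name like "on" would be accepted as well, which a lookup array keyed by name avoids. The names membership_demo, report and isName are made up for the illustration.

```shell
# Sketch only: membership_demo(), report() and isName are hypothetical names.
membership_demo() {
    awk '
    function report(label, ok) {
        print label ": " (ok ? "yes" : "no")
    }
    BEGIN {
        names = "one two three four five"
        n = split(names, fieldNames, " ")
        for (i = 1; i <= n; i++)
            isName[fieldNames[i]] = 1       # names become indices here

        report("one in fieldNames", ("one" in fieldNames))  # no: indices are 1..5
        report("1 in fieldNames", (1 in fieldNames))        # yes
        report("one in isName", ("one" in isName))          # yes
        report("index finds on", (index(names, "on") > 0))  # yes: substring of "one"
        report("on in isName", ("on" in isName))            # no: exact match only
    }'
}
```

For the sample data in this thread both approaches give the same result; the difference only matters if the real names can be substrings of one another.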

Hermann

Anton Treuenfels

Nov 10, 2009, 8:13:35 PM

"Hermann Peifer" <pei...@gmx.eu> wrote in message news:hd8ht4$580>

> In addition, in order to make anton.awk work, you have to do the following
> changes, if I am not mistaken:
>
> # if ( $2 in fieldNames ) {
> if ( index(names, $2) ) {
>
> and
>
> # name = fieldName[ i ]
> name = fieldNames[ i ]

Ach! Not only are you right (and that first correction is an interesting
approach), I made the same class of mistake in another program I was playing
with last night...

- Anton Treuenfels