
awk multiple line sort


Piotrzot

Sep 24, 2015, 5:33:58 AM
Hello everybody!
I'm new to this group and quite a rookie with awk, so please go easy on me.
I tried to study the problem first, then googled for a long time, but I was not able to find the right answer.

I've got a text file laid out like this:

YEAR - 1972
ABSTRACT - tchatchatchatcha
AUTHOR - Gilles Deleuze
AUTHOR - Felix Guattari
TITLE - Anti-Oedipus
PUBLPLACE - Paris

YEAR - 1781
ABSTRACT - tchatchatchatcha
AUTHOR - Immanuel Kant
TITLE - Critique of the pure reason
PUBLPLACE - Konigsberg

And I'd like to sort it (it's a bibliographic database) to obtain something like this:

YEAR - 1781
TITLE - Critique of the pure reason
AUTHOR - Immanuel Kant
PUBLPLACE - Konigsberg
ABSTRACT - tchatchatchatcha

YEAR - 1972
TITLE - Anti-Oedipus
AUTHOR - Gilles Deleuze
AUTHOR - Felix Guattari
PUBLPLACE - Paris
ABSTRACT - tchatchatchatcha

Records are sorted by YEAR;
fields are sorted in an arbitrary (not alphabetical) order;
NB: the total number of fields per record can vary, because of multiple entries under "AUTHOR", for example.

I hope everything is clear, and thank you all for your support.
P

Ed Morton

Sep 24, 2015, 8:06:04 AM
Do you have GNU awk for builtin sort functions? If not, this might be your best bet:

$ awk -v RS= -v OFS=';' -v FS='\n' '{$1=$1}1' file |
sort |
awk -v FS=';' -v OFS='\n' -v ORS='\n\n' '{$1=$1}1'

Janis Papanagnou

Sep 24, 2015, 8:17:48 AM
Are you working in a Unix environment? Have you GNU awk available?
This is a non-trivial task in standard awk. Have you tried anything
yourself or are you expecting someone to provide a complete solution?

Janis

Piotrzot

Sep 24, 2015, 8:39:11 AM
Hi,
I work in a GNU-Linux environment. gawk is available. But I wasn't able to understand how to use the sorting functions.

Usually, I try to solve problems by myself, surfing the net. But it's frustrating sometimes.

Ideally, I would prefer the "right" link to a tutorial/discussion.

Anyway, thanks to you both
Piotr

Ed Morton

Sep 24, 2015, 8:54:47 AM
You will be able to solve it yourself if you buy and read/refer to the book
Effective Awk Programming, 4th Edition, by Arnold Robbins.

Ed.

Piotrzot

Sep 24, 2015, 9:05:05 AM
Well Ed,

thanks for this 602-page link!

Janis Papanagnou

Sep 24, 2015, 10:13:22 AM
On 24.09.2015 14:39, Piotrzot wrote:
> Hi,
> I work in a GNU-Linux environment. gawk is available.

That's fine. In case your Linux distribution is too old (version 3
is still delivered), download the latest gawk version, 4.1.3.

> But I wasn't able to understand how to use the sorting functions.

I'm still unsure about your programming skills and awk knowledge.

>
> Usually, I try to solve problems by myself, surfing the net. But it's frustrating sometimes.

I would solve the issue in two steps.

The first is to sort the blocks; awk supports a method to process
such blank-line-separated blocks (multi-line records) by setting
RS="". Then sort the records; either consult the gawk manual on
defining your own comparison function to compare the year values, or
use a common trick: (temporarily) prepend the year (i.e. the value
from your YEAR line) as the first element of the block, then sort,
and remove that element after the sorting.
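
A minimal sketch of the "prepend the year, sort, remove it again" trick described above, using only standard awk and sort. The file names biblio.txt and sorted_by_year.txt, and the choice of awk's SUBSEP character as a temporary line joiner, are assumptions made for this illustration; it also assumes that no field contains that control character.

```shell
# Sample input in the format from the question (records separated by a blank line).
cat > biblio.txt <<'EOF'
YEAR - 1972
ABSTRACT - tchatchatchatcha
AUTHOR - Gilles Deleuze
AUTHOR - Felix Guattari
TITLE - Anti-Oedipus
PUBLPLACE - Paris

YEAR - 1781
ABSTRACT - tchatchatchatcha
AUTHOR - Immanuel Kant
TITLE - Critique of the pure reason
PUBLPLACE - Konigsberg
EOF

# 1) decorate:   one output line per record: "<year><TAB><lines joined by SUBSEP>"
# 2) sort:       numerically on the year column
# 3) undecorate: drop the year column and restore the newlines
awk -v RS= -v FS='\n' '
{
    year = 0
    for (i = 1; i <= NF; i++)
        if ($i ~ /^YEAR - /) { year = $i; sub(/^YEAR - /, "", year) }
    line = $1
    for (i = 2; i <= NF; i++) line = line SUBSEP $i
    print year "\t" line
}' biblio.txt |
sort -k1,1n |
awk '
{
    rec = substr($0, index($0, "\t") + 1)   # drop the temporary sort key
    gsub(SUBSEP, "\n", rec)                 # turn the joiner back into newlines
    printf "%s%s\n", (NR > 1 ? "\n" : ""), rec
}' > sorted_by_year.txt
```

This only reorders whole records; the lines inside each record stay in their input order.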

The second step would be to arrange the lines in each block in the
desired order. In case you have a known set of keywords you can
collect the entries in respective variables (concatenating them,
because AUTHOR can occur more than once), and when you encounter a
blank line (or the end of the file; use awk's END clause) you can
print out the variables in the desired order and clear them for the
next block.
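
The second step could be sketched roughly as follows in standard awk. The keyword order is taken from the desired output in the question; the function name emit, the array names, and the file names biblio.txt and reordered.txt are hypothetical, chosen for this example only.

```shell
# Sample input in the format from the question.
cat > biblio.txt <<'EOF'
YEAR - 1972
ABSTRACT - tchatchatchatcha
AUTHOR - Gilles Deleuze
AUTHOR - Felix Guattari
TITLE - Anti-Oedipus
PUBLPLACE - Paris

YEAR - 1781
ABSTRACT - tchatchatchatcha
AUTHOR - Immanuel Kant
TITLE - Critique of the pure reason
PUBLPLACE - Konigsberg
EOF

awk '
BEGIN {
    # Desired field order, as in the expected output of the question.
    n = split("YEAR TITLE AUTHOR PUBLPLACE ABSTRACT", order, " ")
}
# Print the collected block in the desired order, then clear it.
function emit(    i, k) {
    if (!seen) return
    if (printed) print ""                 # blank line between records
    for (i = 1; i <= n; i++)
        if (order[i] in val) { printf "%s", val[order[i]]; delete val[order[i]] }
    for (k in val) printf "%s", val[k]    # unlisted keywords, unspecified order
    split("", val)                        # portable way to empty an array
    seen = 0
    printed = 1
}
/^[ \t]*$/ { emit(); next }               # a blank line ends the block
{
    val[$1] = val[$1] $0 "\n"             # appending keeps repeated AUTHOR lines
    seen = 1
}
END { emit() }                            # the last block may lack a blank line
' biblio.txt > reordered.txt
```

Run on its own, this leaves the records in input order; piping its output into a record sort gives the complete result.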

>
> Ideally, I would prefer the "right" link to a tutorial/discussion.

The GNU awk manual (Arnold Robbins' book) will provide you with
all the details. You will also find the sort functions described there.
In case what I wrote above makes no sense to you, it's worth starting
from the beginning. Otherwise you're also welcome to come back with
concrete questions.

Janis

Kees Nuyt

Sep 24, 2015, 12:15:47 PM
I would compose a sort key for every line, where the sort key
consists of 1) year 2) subkey, then pipe the keyed lines through
sort and cut the key off again.

Something like:

prep.awk:
#####
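{
    key2 = 99                        # default: blank lines sort last in their record
}
/^YEAR/ {
    key1 = $0
    sub(/^[^[:digit:]]+/, "", key1)  # keep only the year
    key2 = 0
}
/^TITLE/     { key2 = 1 }
/^AUTHOR/    { key2 = 2 }
/^PUBLPLACE/ { key2 = 3 }
/^ABSTRACT/  { key2 = 4 }
## more entries here
{
    # 6-digit key + space = 7 characters, which "cut -c 8-" removes again.
    # The blank line after a record carries that record's year, so the
    # input should end with a blank line to keep records separated.
    printf "%04d%02d %s\n", 0+key1, key2, $0
}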
#####

Process (-s keeps lines with equal keys, e.g. several AUTHOR lines, in input order):
gawk -f prep.awk inputfile | sort -s -k1,1n | cut -c 8- >outputfile

(untested)
I'm sure there are other ways, but anyway, I hope this helps.


--
Regards
Kees Nuyt

Luuk

Sep 24, 2015, 3:20:57 PM
gawk -F \- '/YEAR/{ x=$2 }
{ a[x][$1]=$2 }
END{ for (i in a) {
for (j in a[i]){ print i,j,a[i][j];} } }' inputfile


1972
1972 YEAR 1972
1972 TITLE Anti
1972 AUTHOR Felix Guattari
1972 PUBLPLACE Paris
1972 ABSTRACT tchatchatchatcha
1781
1781 YEAR 1781
1781 TITLE Critique of the pure reason
1781 AUTHOR Immanuel Kant
1781 PUBLPLACE Konigsberg
1781 ABSTRACT tchatchatchatcha


Of course you can delete the 'i' from 'print i,j,a[i][j]'.......


Janis Papanagnou

Sep 24, 2015, 7:23:25 PM
You missed the double AUTHOR, and your for-loop does not sort (it's just
by coincidence in this test-case with two entries).

Janis
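
One possible single-program variant that addresses both points (repeated AUTHOR lines and the missing sort) is sketched below. It sticks to standard awk, so the years are ordered with a small hand-written insertion sort; with gawk, PROCINFO["sorted_in"] or asorti() could replace that loop. The array names and the file names biblio.txt and sorted.txt are assumptions for this example. It splits each line at the first " - ", so a hyphen inside a value such as "Anti-Oedipus" is left alone.

```shell
# Sample input in the format from the question.
cat > biblio.txt <<'EOF'
YEAR - 1972
ABSTRACT - tchatchatchatcha
AUTHOR - Gilles Deleuze
AUTHOR - Felix Guattari
TITLE - Anti-Oedipus
PUBLPLACE - Paris

YEAR - 1781
ABSTRACT - tchatchatchatcha
AUTHOR - Immanuel Kant
TITLE - Critique of the pure reason
PUBLPLACE - Konigsberg
EOF

awk '
BEGIN {
    RS = ""; FS = "\n"                     # one blank-line-separated block per record
    n = split("YEAR TITLE AUTHOR PUBLPLACE ABSTRACT", order, " ")
}
{
    for (i = 1; i <= NF; i++) {
        key = $i
        sub(/ - .*/, "", key)              # keyword before the first " - "
        if (key == "YEAR") year[NR] = substr($i, length(key) + 4) + 0
        rec[NR, key] = rec[NR, key] $i "\n"   # appending keeps repeated AUTHOR lines
    }
    idx[NR] = NR
}
END {
    # Insertion sort of the record numbers by year; equal years keep input order.
    for (i = 2; i <= NR; i++) {
        r = idx[i]
        for (j = i - 1; j >= 1 && year[idx[j]] > year[r]; j--)
            idx[j + 1] = idx[j]
        idx[j + 1] = r
    }
    for (i = 1; i <= NR; i++) {
        if (i > 1) print ""                # blank line between records
        for (k = 1; k <= n; k++)
            if ((idx[i], order[k]) in rec) printf "%s", rec[idx[i], order[k]]
    }
}' biblio.txt > sorted.txt
```

Keying the records by record number rather than by year avoids merging two publications from the same year. Keywords that are not in the order list are dropped in this sketch.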


Luuk

Sep 26, 2015, 7:29:12 AM
aarg, the sorting thing....
i never sort things in awk

i'm sorry for missing the double author .... ;)