In article <kcvopr$944$1...@news.xmission.com>,
> >> >sort -r <inputfile | rev | uniq -f1 | rev | sort >outputfile
> >>
> >> In what alternate universe is this "simpler"?
> >
> >In the universe where you don't have to learn an entirely new
> >programming language to do it.
>
> I suppose. But I don't know "sort" - every time I try to figure it out, my
> head hurts.
>
> "rev" I suppose I could guess what it does - but it seems weird to me.
>
> "uniq" I'd have to look up.
>
> So, you want *me* to learn *3* new languages???
'sort', 'uniq' and 'rev' are simple line-oriented text manipulation
utilities, not entire programming languages. Each of them is easily
figured out by reading its man page.
But I'll explain my reasoning anyway. Perhaps that'll clarify things for
other readers as well. First I'll explain briefly what each program
does, then I'll illustrate my thought process.
'sort' sorts lines of text in ascending order; the '-r' option gets it
to sort them in reverse/descending order. The '-n' option makes it sort
the input as numbers instead of as text (so that '2' is sorted before
'10' and not after).
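To make the difference concrete, here is a small sketch with made-up
numbers (the printf is only there to supply three lines of sample
input; the variable names are mine):

```shell
# Text sort compares character by character, so "10" lands before "2".
text_sorted=$(printf '10\n2\n33\n' | sort)

# Numeric sort compares the values instead.
num_sorted=$(printf '10\n2\n33\n' | sort -n)

# -r flips the order; combined with -n it gives numeric descending.
num_desc=$(printf '10\n2\n33\n' | sort -n -r)

printf 'text:\n%s\nnumeric:\n%s\nnumeric descending:\n%s\n' \
    "$text_sorted" "$num_sorted" "$num_desc"
```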
'rev' (short for "reverse") is a very simple program that outputs each
line of input backwards. For example, it writes 'abcdefg' as 'gfedcba'.
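A quick sketch of that, using one made-up line and one line in the
"ID date" format discussed in this thread (this assumes 'rev' is
installed, which it normally is on Linux and the BSDs):

```shell
# Each line is reversed character by character; line order is untouched.
reversed=$(printf 'abcdefg\n3917 1995-02-08\n' | rev)
printf '%s\n' "$reversed"

# Running the result through rev a second time restores the original.
restored=$(printf '%s\n' "$reversed" | rev)
printf '%s\n' "$restored"
```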
'uniq' (short for "unique") takes lines of text and omits any adjacent
repeated lines in it, outputting only the first of each set of identical
lines. As it detects only adjacent repeated lines and not repeated lines
that are apart, you have to feed it sorted input to eliminate all
repeated lines. (Hence the well-known pipe 'sort | uniq'.)
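The "adjacent only" behaviour is easy to check with a few made-up
lines. This is just an illustration; the words and variable names are
arbitrary:

```shell
# Unsorted input: "apple" appears three times, but only the last two
# occurrences are adjacent, so uniq alone leaves a second "apple" behind.
adjacent_only=$(printf 'apple\npear\napple\napple\n' | uniq)

# Sorting first brings all identical lines together, so none survive twice.
fully_deduped=$(printf 'apple\npear\napple\napple\n' | sort | uniq)

printf 'uniq alone:\n%s\nsort | uniq:\n%s\n' "$adjacent_only" "$fully_deduped"
```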
The -f option gets 'uniq' to ignore a given number of fields in each
line, starting from the beginning (where 'field' is defined as any
sequence of non-blank characters separated by any number of blank
characters). So with 'uniq -f1', two lines still count as identical even
if their first fields differ; with 'uniq -f2', the first two fields may
differ; etc.
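Here is a minimal sketch of '-f1' on made-up two-field lines (the
letters and numbers mean nothing, they are only there to show which
field gets compared):

```shell
# With -f1 the first field (x, y, z) is skipped when comparing, so the
# first two lines count as duplicates because both end in "100".
# uniq keeps the first line of each run of duplicates.
skipped_first=$(printf 'x 100\ny 100\nz 200\n' | uniq -f1)
printf '%s\n' "$skipped_first"

# Without -f1 all three lines differ, so nothing is dropped.
plain=$(printf 'x 100\ny 100\nz 200\n' | uniq)
printf '%s\n' "$plain"
```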
Now, here is how I figured out how to string these simple programs
together to get the desired effect.
STEP 1:
The OP wants to keep only the latest date for each ID, so to begin our
processing, it is convenient to sort the input in reverse order using
'sort -r'. The 'sort' program sorts entire lines at a time, so the input
is now reverse-sorted by ID and then by date:
$ sort -r <input
3917 1995-02-08
3915 1997-02-08
3915 1996-01-07
3915 1995-02-07
2464 1998-07-14
2464 1996-11-06
2066 1997-06-18
2066 1996-09-23
For each ID, we now have most recent date first. This is already
significant progress towards the OP's goal.
(A sidenote: a bit later, it dawned on me that no one said all IDs have
the same number of digits, so sorting as numbers is safer: 'sort -r -n'.
This avoids the '10 comes before 2' effect. It doesn't otherwise affect
things much, as IDs are grouped together regardless.)
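To illustrate the sidenote, here is a sketch with made-up IDs of
different lengths (this is not the OP's data). It assumes that 'sort'
falls back to comparing whole lines when the numeric values are equal,
which is what keeps the dates in descending order within each ID; GNU
sort behaves this way.

```shell
# Hypothetical input: IDs 10 and 2, where ID 2 has two dates.
sample=$(printf '10 2001-01-01\n2 2000-01-01\n2 2002-02-02\n')

# Text sort, descending: "2 ..." sorts above "10 ..." because '2' > '1'.
text_desc=$(printf '%s\n' "$sample" | sort -r)

# Numeric sort, descending: 10 now comes before 2, and for the equal IDs
# the whole-line fallback still puts the newer date first.
num_desc=$(printf '%s\n' "$sample" | sort -r -n)

printf 'sort -r:\n%s\nsort -r -n:\n%s\n' "$text_desc" "$num_desc"
```

Either way the lines for one ID stay together with the newest date
first, which is all the later 'uniq' step needs.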
STEP 2:
As you can see in the output above, our problem of "keep only the latest
date" has now been reduced to "keep only the first of each repeated ID".
This clearly calls for 'uniq'.
We want to get 'uniq' to ignore the second field (date) in its
determination of what constitutes a repeated line. Unfortunately, a
quick read of its man page taught me that it can only skip fields
starting from the beginning of each line. That is useless to our purpose.
So we need some way to make the date the first field rather than the
second. Here is where the "rev" program comes to the rescue. It quite
simply writes the lines backwards:
$ sort -r <input | rev
80-20-5991 7193
80-20-7991 5193
70-10-6991 5193
70-20-5991 5193
41-70-8991 4642
60-11-6991 4642
81-60-7991 6602
32-90-6991 6602
STEP 3:
Now we can ignore the date by telling "uniq" to ignore the first field,
keeping only the first of each ID listed:
$ sort -r <input | rev | uniq -f1
80-20-5991 7193
80-20-7991 5193
41-70-8991 4642
81-60-7991 6602
Hey presto, our basic problem is already solved! Only the latest date is
kept for each ID. Unfortunately it's still written backwards.
STEP 4:
Backwards twice is forwards again:
$ sort -r <input | rev | uniq -f1 | rev
3917 1995-02-08
3915 1997-02-08
2464 1998-07-14
2066 1997-06-18
Better. But the output is still sorted by ID in descending order, and
the OP desired output in ascending order.
STEP 5:
A final simple "sort" invocation makes the output perfectly compliant
with the OP's specification:
$ sort -r <input | rev | uniq -f1 | rev | sort
2066 1997-06-18
2464 1998-07-14
3915 1997-02-08
3917 1995-02-08
And all IDs lived happily ever after. The end.
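For anyone who wants to try the whole thing without creating a file,
here is a self-contained sketch. The eight lines are the ones shown in
the steps above, but their original order is not known to me, so the
shuffled order below is an assumption; 'latest_per_id' is just a name I
made up for the pipeline.

```shell
# Hypothetical wrapper: reads "ID DATE" lines on stdin and writes the
# latest date per ID on stdout, sorted by ID in ascending order.
latest_per_id() {
    sort -r | rev | uniq -f1 | rev | sort
}

# Sample data from the steps above, in an arbitrary (assumed) order.
input='2066 1996-09-23
3915 1995-02-07
2464 1998-07-14
3915 1997-02-08
3917 1995-02-08
2464 1996-11-06
2066 1997-06-18
3915 1996-01-07'

result=$(printf '%s\n' "$input" | latest_per_id)
printf '%s\n' "$result"
```

Note that this is the plain 'sort -r' version, so it relies on all IDs
having the same number of digits, as discussed in the sidenote.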
OK, so this message got long. I still claim this is simpler than using
'awk'. I can explain this in a Usenet article. To explain the awk
programming language, you need a book.
The trick with pipe constructs is that you need to think in terms of
breaking down a complex problem into several simple problems. Each
simple problem is solved using a simple program, each using the output
of the previous one as its input.
This way of thinking is very much fundamental to the traditional "Unix
way". I'm new to this group, but I thought people in a shell programming
group would be at ease with it.
- Martijn