Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Deduplicate and keep latest date only - sort? awk?

74 views
Skip to first unread message

John Larson

unread,
Jan 6, 2013, 2:53:46 PM1/6/13
to
Dear All,

In a dataset with an ID and a date on each line I have one or more dates per ID. I would like to keep only the most recent date. I can sort it easily but have trouble keeping only the latest date.

Sample input:
2066 1996-09-23
2066 1997-06-18
2464 1996-11-06
2464 1998-07-14
3915 1995-02-07
3915 1996-01-07
3915 1997-02-08
3917 1995-02-08

Desired output:
2066 1997-06-18
2464 1998-07-14
3915 1997-02-08
3917 1995-02-08

Any help is greatly appreciated!

-- John

Radoulov, Dimitre

unread,
Jan 6, 2013, 3:03:41 PM1/6/13
to
Assuming the input file is already ordered by id and date:

awk 'NR > 1 && id != $1 {
print rec
}
{ id = $1; rec = $0 }
END { print rec }' infile

Otherwise:



Regards
Dimitre

Dave Gibson

unread,
Jan 6, 2013, 5:40:34 PM1/6/13
to
Assuming that later dates appear later in the input:

awk '{ a[$1] = $0 } END { for (k in a) print a[k] }' input > output

The output may require sorting.

woc...@writeme.com

unread,
Jan 7, 2013, 12:50:35 PM1/7/13
to

Thanks Dimitre and Dave - both solutions work perfectly.

-- John

Luuk

unread,
Jan 7, 2013, 1:52:28 PM1/7/13
to
awk '{ if ($2>a[$1]) { a[$1]=$2;}}END{ for (i in a) { print i,a;} }' infile

Martijn Dekker

unread,
Jan 13, 2013, 5:44:41 PM1/13/13
to
In article <f1d0891c-7016-4835...@googlegroups.com>,
John Larson <woc...@writeme.com> wrote:

> In a dataset with an ID and a date on each line I have one or more dates per
> ID. I would like to keep only the most recent date. I can sort it easily but
> have trouble keeping only the latest date.
>
> Sample input:
> 2066 1996-09-23
> 2066 1997-06-18
> 2464 1996-11-06
> 2464 1998-07-14
> 3915 1995-02-07
> 3915 1996-01-07
> 3915 1997-02-08
> 3917 1995-02-08
>
> Desired output:
> 2066 1997-06-18
> 2464 1998-07-14
> 3915 1997-02-08
> 3917 1995-02-08

A simpler way without "awk" that does not require the input to be sorted:

sort -r <inputfile | rev | uniq -f1 | rev | sort >outputfile

- Martijn

Martijn Dekker

unread,
Jan 13, 2013, 6:11:14 PM1/13/13
to
In article <martijn-929A27...@news.individual.net>,
Martijn Dekker <mar...@inlv.demon.nl> wrote:

> In article <f1d0891c-7016-4835...@googlegroups.com>,
> John Larson <woc...@writeme.com> wrote:
[...]
> > Sample input:
> > 2066 1996-09-23
> > 2066 1997-06-18
> > 2464 1996-11-06
> > 2464 1998-07-14
> > 3915 1995-02-07
> > 3915 1996-01-07
> > 3915 1997-02-08
> > 3917 1995-02-08
> >
> > Desired output:
> > 2066 1997-06-18
> > 2464 1998-07-14
> > 3915 1997-02-08
> > 3917 1995-02-08
>
> A simpler way without "awk" that does not require the input to be sorted:
>
> sort -r <inputfile | rev | uniq -f1 | rev | sort >outputfile

On second thought, if the ID field can vary in length,
using numerical sort ("sort -n") is necessary:

sort -r -n <inputfile | rev | uniq -f1 | rev | sort -n >outputfile

- M.

Kenny McCormack

unread,
Jan 13, 2013, 7:08:23 PM1/13/13
to
In article <martijn-929A27...@news.individual.net>,
Martijn Dekker <mar...@inlv.demon.nl> wrote:
...
>A simpler way without "awk" that does not require the input to be sorted:
>
>sort -r <inputfile | rev | uniq -f1 | rev | sort >outputfile
>
>- Martijn

In what alternate universe is this "simpler"?

--
They say compassion is a virtue, but I don't have the time!

- David Byrne -

Martijn Dekker

unread,
Jan 13, 2013, 8:53:46 PM1/13/13
to
In article <kcvi9n$283$1...@news.xmission.com>,
gaz...@shell.xmission.com (Kenny McCormack) wrote:

> In article <martijn-929A27...@news.individual.net>,
> Martijn Dekker <mar...@inlv.demon.nl> wrote:
> ...
> >A simpler way without "awk" that does not require the input to be sorted:
> >
> >sort -r <inputfile | rev | uniq -f1 | rev | sort >outputfile
>
> In what alternate universe is this "simpler"?

In the universe where you don't have to learn an entirely new
programming language to do it.

- M.

Kenny McCormack

unread,
Jan 13, 2013, 8:59:23 PM1/13/13
to
In article <martijn-C17D74...@news.individual.net>,
Martijn Dekker <mar...@inlv.demon.nl> wrote:
>In article <kcvi9n$283$1...@news.xmission.com>,
> gaz...@shell.xmission.com (Kenny McCormack) wrote:
>
>> In article <martijn-929A27...@news.individual.net>,
>> Martijn Dekker <mar...@inlv.demon.nl> wrote:
>> ...
>> >A simpler way without "awk" that does not require the input to be sorted:
>> >
>> >sort -r <inputfile | rev | uniq -f1 | rev | sort >outputfile
>>
>> In what alternate universe is this "simpler"?
>
>In the universe where you don't have to learn an entirely new
>programming language to do it.
>
>- M.

I suppose. But I don't know "sort" - every time I try to figure it out, my
head hurts.

"rev" I suppose I could guess what it does - but it seems weird to me.

"uniq" I'd have to look up.

So, you want *me* to learn *3* new languages???

--
A liberal, a moderate, and a conservative walk into a bar...

Bartender says, "Hi, Mitt!"

Martijn Dekker

unread,
Jan 13, 2013, 10:45:55 PM1/13/13
to
In article <kcvopr$944$1...@news.xmission.com>,
gaz...@shell.xmission.com (Kenny McCormack) wrote:

> In article <martijn-C17D74...@news.individual.net>,
> Martijn Dekker <mar...@inlv.demon.nl> wrote:
> >In article <kcvi9n$283$1...@news.xmission.com>,
> > gaz...@shell.xmission.com (Kenny McCormack) wrote:
> >
> >> In article <martijn-929A27...@news.individual.net>,
> >> Martijn Dekker <mar...@inlv.demon.nl> wrote:
> >> ...
> >> >sort -r <inputfile | rev | uniq -f1 | rev | sort >outputfile
> >>
> >> In what alternate universe is this "simpler"?
> >
> >In the universe where you don't have to learn an entirely new
> >programming language to do it.
>
> I suppose. But I don't know "sort" - every time I try to figure it out, my
> head hurts.
>
> "rev" I suppose I could guess what it does - but it seems weird to me.
>
> "uniq" I'd have to look up.
>
> So, you want *me* to learn *3* new languages???

'sort', 'uniq' and 'rev' are simple line-oriented text manipulation
utilities, not entire programming languages. Each of them are easily
figured out by reading their man pages.

But I'll explain my reasoning anyway. Perhaps that'll clarify things for
other readers as well. First I'll explain briefly what each program
does, then I'll illustrate my thought process.

'sort' sorts lines of text in ascending order; the '-r' option gets it
to sort them in reverse/descending order. The '-n' option makes it sort
the input as numbers instead of as text (so that '2' is sorted before
'10' and not after).

'rev' (sort for "reverse") is a very simple program that outputs each
line of input backwards. For example, it writes 'abcdefg' as 'gfedcba'.

'uniq' (short for "unique") takes lines of text and omits any adjacent
repeated lines in it, outputting only the first of each set of identical
lines. As it detects only adjacent repeated lines and not repeated lines
that are apart, you have to feed it sorted input to eliminate all
repeated lines. (Hence the well-known pipe 'sort | uniq'.)

The -f option gets 'uniq' to ignore a given number of fields in each
line, starting from the beginning (where 'field' is defined as any
sequence of non-blank characters separated by any number of blank
characters). So with 'uniq -f1', a line is still considered identical if
the first field is different, with 'uniq -f2' the first two fields may
be different, etc.

Now, here is how I figured out how to string these simple programs
together to get the desired effect.

STEP 1:

The OP wants to keep only the latest date for each ID, so to begin our
processing, it is convenient to sort the input in reverse order using
'sort -r'. The 'sort' program sorts entire lines at a time, so the input
is now reverse-sorted by ID and then by date:

$ sort -r <input
3917 1995-02-08
3915 1997-02-08
3915 1996-01-07
3915 1995-02-07
2464 1998-07-14
2464 1996-11-06
2066 1997-06-18
2066 1996-09-23

For each ID, we now have most recent date first. This is already
significant progress towards the OP's goal.

(A sidenote: a bit later, it dawned on me that no one said all IDs have
the same number of digits, so sorting as numbers is safer: 'sort -r -n'.
This avoids the '10 comes before 2' effect. It doesn't otherwise affect
things much, as IDs are grouped together regardless.)

STEP 2:

As you can see in the output above, our problem of "keep only the latest
date" has now been reduced to "keep only the first of each repeated ID".
This clearly calls for 'uniq'.

We want to get 'uniq' to ignore the second field (date) in its
determination of what constitutes a repeated line. Unfortunately, a
quick read of its man page taught me that it can only skip fields
starting from the beginning of each line. That is useless to our purpose.

So we need some way to be able to skip not the second, but the first
field. Here is where the "rev" program comes to the rescue. It quite
simply writes the lines backwards:

$ sort -r <input | rev
80-20-5991 7193
80-20-7991 5193
70-10-6991 5193
70-20-5991 5193
41-70-8991 4642
60-11-6991 4642
81-60-7991 6602
32-90-6991 6602

STEP 3:

Now we can ignore the date by telling "uniq" to ignore the first field,
keeping only the first of each ID listed;

$ sort -r <input | rev | uniq -f1
80-20-5991 7193
80-20-7991 5193
41-70-8991 4642
81-60-7991 6602

Hey presto, our basic problem is already solved! Only the latest date is
kept for each ID. Unfortunately it's still written backwards.

STEP 4:

Backwards twice is forwards again:

$ sort -r <input | rev | uniq -f1 | rev
3917 1995-02-08
3915 1997-02-08
2464 1998-07-14
2066 1997-06-18

Better. But the output is still sorted by ID in descending order, and
the OP desired output in ascending order.

STEP 5:

A final simple "sort" invocation makes the output perfectly compliant
with the OP's specification:

$ sort -r <input | rev | uniq -f1 | rev | sort
2066 1997-06-18
2464 1998-07-14
3915 1997-02-08
3917 1995-02-08

And all IDs lived happily ever after. The end.

OK, so this message got long. I still claim this is simpler than using
'awk'. I can explain this in a Usenet article. To explain the awk
programming language, you need a book.

The trick with pipe constructs is that you need to think in terms of
breaking down a complex problem into several simple problems. Each
simple problem is solved using a simple program, each using the output
of the previous one as its input.

This way of thinking is very much fundamental to the traditional "Unix
way". I'm new to this group, but I thought people in a shell programming
group would be at ease with it.

- Martijn

Ed Morton

unread,
Jan 13, 2013, 11:02:19 PM1/13/13
to
On 1/13/2013 9:45 PM, Martijn Dekker wrote:
<snip>
> OK, so this message got long. I still claim this is simpler than using
> 'awk'. I can explain this in a Usenet article. To explain the awk
> programming language, you need a book.

You don't need to describe the entire awk programming language to describe the
awk solution to this problem, just like you don't need to describe the entire
UNIX toolset to describe the piped-UNIX-tools solution to this problem.

Dave Gibson posted this:

awk '{ a[$1] = $0 } END { for (k in a) print a[k] }'

which can be described as: "Store the contents of each line in an array, a,
indexed by the first field in each line so every subsequent line with the same
key field overwrites the contents of the previous entry. After all lines have
been read print the array contents."

Couldn't be much simpler.

Ed.

Chris F.A. Johnson

unread,
Jan 13, 2013, 10:58:22 PM1/13/13
to
Most of us are.

--
Chris F.A. Johnson, author <http://shell.cfajohnson.com/>
===================================================================
Shell Scripting Recipes: A Problem-Solution Approach (2005, Apress)
Pro Bash Programming: Scripting the GNU/Linux Shell (2009, Apress)

Martijn Dekker

unread,
Jan 13, 2013, 11:55:29 PM1/13/13
to
In article <kd000a$u31$1...@dont-email.me>,
Ed Morton <morto...@gmail.com> wrote:

> You don't need to describe the entire awk programming language to describe
> the
> awk solution to this problem, just like you don't need to describe the entire
> UNIX toolset to describe the piped-UNIX-tools solution to this problem.

That is true if you only need to understand an example given by others,
but I don't see it holding up if you have to write the code yourself.
Dave Gibson's example is simple only to those who already know the
basics of awk syntax and array manipulation. For anyone familiar with
the POSIX shell but not awk (this is, after all, comp.unix.shell and not
comp.lang.awk), it takes much more effort to learn even basic awk syntax
and concepts than to look up a few appropriate commands to pipe together.

- M.

Martijn Dekker

unread,
Jan 14, 2013, 1:38:00 AM1/14/13
to
In article <kd000a$u31$1...@dont-email.me>,
Ed Morton <morto...@gmail.com> wrote:

> Dave Gibson posted this:
>
> awk '{ a[$1] = $0 } END { for (k in a) print a[k] }'
>
> which can be described as: "Store the contents of each line in an array, a,
> indexed by the first field in each line so every subsequent line with the
> same
> key field overwrites the contents of the previous entry. After all lines have
> been read print the array contents."

I don't understand the awk language well enough to verify that the code
matches your description, but assuming your description is accurate, it
does not meet the OP's specification. He was looking for code that
preserves the latest date, not the last-appearing date.

So to fix that (assuming awk does not have a built-in sort command), we
would have to do:

sort -n | awk '{ a[$1] = $0 } END { for (k in a) print a[k] }'

...and there we are, piping commands again.

Dave Gibson originally noted that his awk code does not sort the output,
but according to your description that is irrelevant, because it only
works correctly with sorted input, which would automatically produce
sorted output.

- M.

Ben Bacarisse

unread,
Jan 14, 2013, 7:56:48 AM1/14/13
to
Actually no. If the output is to be sorted you'd have to add another
sort to the pipeline because "for (k in a) ..." does not access the
elements of 'a' in any particular order (but that can be changed in some
awks as shown below).

BTW, the initial sort can be omitted by making the assignment
conditional:

awk 'a[$1] < $0 { a[$1] = $0 } END { for (k in a) print a[k] }'

If your awk has the asort function the final sort can also be done
internally:

awk 'a[$1]<$0 { a[$1]=$0 } END { asort(a); while (a[++i]) print a[i] }'

Some awks (GNU awk for example) can now control the order in which array
access (and sorting) occurs, but it's not portable and rather clumsy for
a short script like this:

awk 'BEGIN { PROCINFO["sorted_in"] = "@ind_num_asc" } \
a[$1] < $0 { a[$1] = $0 } END { for (k in a) print a[k] }'

And another BTW... I agree with you than pipelines (even complex ones)
and be conceptually simpler than awk programs. There are two reasons
for this in my option. One, you get to know commands like sort and uniq
in other contexts so you are often familiar with them already. Two, the
pipeline idea is beautifully orthogonal to the commands that get joined,
so pipelines are conceptually simple.

--
Ben.

Janis Papanagnou

unread,
Jan 14, 2013, 8:00:18 AM1/14/13
to
I interpreted Dave's assumed precondition - "Assuming that later dates
appear later in the input" - as precondition WRT all common key subsets.
And the OP's data seem to exactly have this (weaker!) property; dates
for the same key seem to be ordered, but ignoring the key the dates are
not ordered. In case this partially sorting is not a pure coincidence,
you don't need the sort(1) command or anything else.

Dave's second remark WRT necessity to sort was targeted to the fact that
awk arrays are hash-indexed arrays, so the for-loop will not guarantee
ordering in awk.

To sum up; the awk program most likely works as expected for the task.

Janis

>
> - M.
>

Ed Morton

unread,
Jan 14, 2013, 8:05:50 AM1/14/13
to
On 1/14/2013 6:56 AM, Ben Bacarisse wrote:
<snip>
> And another BTW... I agree with you than pipelines (even complex ones)
> and be conceptually simpler than awk programs. There are two reasons
> for this in my option. One, you get to know commands like sort and uniq
> in other contexts so you are often familiar with them already. Two, the
> pipeline idea is beautifully orthogonal to the commands that get joined,
> so pipelines are conceptually simple.

I don't thing anyone would disagree with the CONCEPT, it's just in this
particular instance this pipeline:

sort -r -n <inputfile | rev | uniq -f1 | rev | sort -n >outputfile

is so convoluted compared to this awk script:

awk '{a[$1] = $0} END{for (k in a) print a[k]}' input > output

that claiming that that pipeline is simpler than that awk script is highly
debatable.

You can add a "| sort" to the end of the awk script if necessary (which it
doesn't seem to be given the OPs posted input file) and the result is still much
simpler than the suggested pipeline.

Ed.

Kenny McCormack

unread,
Jan 14, 2013, 8:43:20 AM1/14/13
to
In article <kd0vre$ilg$1...@dont-email.me>,
Ed Morton <morto...@gmail.com> wrote:
...
>I don't thing anyone would disagree with the CONCEPT, it's just in this
>particular instance this pipeline:
>
> sort -r -n <inputfile | rev | uniq -f1 | rev | sort -n >outputfile
>
>is so convoluted compared to this awk script:
>
> awk '{a[$1] = $0} END{for (k in a) print a[k]}' input > output
>
>that claiming that that pipeline is simpler than that awk script is highly
>debatable.
>
>You can add a "| sort" to the end of the awk script if necessary (which it
>doesn't seem to be given the OPs posted input file) and the result is
>still much
>simpler than the suggested pipeline.

Agreed - on all counts. I should add that the real problem is that people
often end up writing pipelines like:

cat ... | grep ... | sed ... | cut ... | grep ... | awk ... | sed ... | sort ...

where most of it (and, of course, the ubiquitous UUOC) could be replaced
with a single invocation of AWK. Two other comments:

1) The discussion (like many before and in the future) seems to turn on
whether AWK is a "shell tool" (like grep, cut, sed, etc) or a programming
language (like perl, ruby, python, etc). The fact is that AWK sits squarely
in the middle between these two camps. I've often said that AWK is just the
right combination of "low level" (i.e., programming language) and "high
level" (i.e., shell tool) - and that's why I like it. It is just the right
combination of each.

Now, obviously, it is not quite as easy-to-use as the rest of the shell
tools, nor as powerful as the rest of the programming languages, but, again,
it is the right happy medium.

2) Although I was being just a bit facetious about the rest of the commands
used in Martjyn's (Hope I spelled that right) pipeline, I really am serious
about "sort". I've always found sort's command syntax bizarre and
non-intuitive, so I avoid it whenever possible. And since all of the AWKs
that I use do have built-in sorting functionality (as Ben alluded to), I
never have to use "sort". Note also that one of the arguments often raised
in sort's defense is that it is efficient. I don't doubt that, but then
again, I never use datasets large enough for this to matter.

--
People who say they'll vote for someone else because Obama couldn't solve
all of Bush's messes are like people complaining that he couldn't cure cancer,
so they'll go and vote for cancer.

Ben Bacarisse

unread,
Jan 14, 2013, 9:37:52 AM1/14/13
to
Ed Morton <morto...@gmail.com> writes:

> On 1/14/2013 6:56 AM, Ben Bacarisse wrote:
> <snip>
>> And another BTW... I agree with you than pipelines (even complex ones)
>> and be conceptually simpler than awk programs. There are two reasons
>> for this in my option. One, you get to know commands like sort and uniq
>> in other contexts so you are often familiar with them already. Two, the
>> pipeline idea is beautifully orthogonal to the commands that get joined,
>> so pipelines are conceptually simple.
>
> I don't thing anyone would disagree with the CONCEPT, it's just in
> this particular instance this pipeline:
>
> sort -r -n <inputfile | rev | uniq -f1 | rev | sort -n >outputfile
>
> is so convoluted compared to this awk script:
>
> awk '{a[$1] = $0} END{for (k in a) print a[k]}' input > output
>
> that claiming that that pipeline is simpler than that awk script is
> highly debatable.

But they don't do the same thing. That script is, roughly, the
equivalent of

tac | rev | uniq -f1 | rev

which I think is very clear and neat (but, sadly, tac is not in POSIX).
Using only POSIX commands we should compare

awk '{a[$1] = $0} END{for (k in a) print a[k]}'

with

sort -r | rev | uniq -f1 | rev

I don't think that's convoluted and it's doing more than the awk command
since it does not rely anymore on the input order.

I would argue that it's simpler. I know awk well, so it's not that I am
at all mystified by what's going on, it's just that one can mentally
decompose it more easily. Bracketing any command with rev | ... | rev
switches the meaning field numbers. Using tac (or sort -r) makes uniq
keep the last, rather than the first, unique input line.

> You can add a "| sort" to the end of the awk script if necessary

Yup. Then we'd be comparing

awk '{a[$1] = $0} END{for (k in a) print a[k]}' | sort

with

sort -r | rev | uniq -f1 | rev | sort

but, again, the second is doing more that the first in that it does not
rely on the input order. But look at the beautiful symmetry here! All
of these variants:

sort -r | rev | uniq -f1 | rev | sort
sort -r | uniq -f1 | sort
rev | uniq -f1 | rev
uniq -f1

do related things. The awk equivalents are all simple but they are not
related to each other by any such pattern. Maybe just the way my mind
works...

> (which it doesn't seem to be given the OPs posted input file) and the
> result is still much simpler than the suggested pipeline.

I don't think it's related to the input order. You'd add a "| sort" to
the end if the output needs to be sorted. The awk for loop destroys the
input order which it relies on.

--
Ben.

Martijn Dekker

unread,
Jan 14, 2013, 9:37:53 AM1/14/13
to
In article
<0.f6e059b03cb0f1bc29c8.2013...@bsb.me.uk>,
Ben Bacarisse <ben.u...@bsb.me.uk> wrote:

> Martijn Dekker <mar...@inlv.demon.nl> writes:
[...]
> > Dave Gibson originally noted that his awk code does not sort the output,
> > but according to your description that is irrelevant, because it only
> > works correctly with sorted input, which would automatically produce
> > sorted output.
>
> Actually no. If the output is to be sorted you'd have to add another
> sort to the pipeline because "for (k in a) ..." does not access the
> elements of 'a' in any particular order (but that can be changed in some
> awks as shown below).

Heh. Wow. A "for" loop that does not necessarily process the data in
order of input, that's a new one on me. And the solution within awk is
implementation-dependent? Yet people claim an awk solution is simple!?

I rest my case.

This certainly demonstrates the pitfalls of copying and pasting code in
a programming language you don't thoroughly understand. I believe it's
called cargo cult programming:
http://en.wikipedia.org/wiki/Cargo_cult_programming

[...]
> And another BTW... I agree with you than pipelines (even complex ones)
> and be conceptually simpler than awk programs. There are two reasons
> for this in my option. One, you get to know commands like sort and uniq
> in other contexts so you are often familiar with them already. Two, the
> pipeline idea is beautifully orthogonal to the commands that get joined,
> so pipelines are conceptually simple.

Well put.

This is also illustrated by the fact that my pipeline has now been
demonstrated to be more robust than the original awk program: the
pipeline deals correctly with unsorted data and produces sorted data.

You did manage to make the awk solution robust (I'm taking your word for
it), but that required abandoning any pretence of simplicity.

Ah well, I said I rest my case, so I should shut up now.

- Martijn

Ed Morton

unread,
Jan 14, 2013, 9:59:50 AM1/14/13
to
On 1/14/2013 8:37 AM, Ben Bacarisse wrote:
> Ed Morton <morto...@gmail.com> writes:
>
>> On 1/14/2013 6:56 AM, Ben Bacarisse wrote:
>> <snip>
>>> And another BTW... I agree with you than pipelines (even complex ones)
>>> and be conceptually simpler than awk programs. There are two reasons
>>> for this in my option. One, you get to know commands like sort and uniq
>>> in other contexts so you are often familiar with them already. Two, the
>>> pipeline idea is beautifully orthogonal to the commands that get joined,
>>> so pipelines are conceptually simple.
>>
>> I don't thing anyone would disagree with the CONCEPT, it's just in
>> this particular instance this pipeline:
>>
>> sort -r -n <inputfile | rev | uniq -f1 | rev | sort -n >outputfile
>>
>> is so convoluted compared to this awk script:
>>
>> awk '{a[$1] = $0} END{for (k in a) print a[k]}' input > output
>>
>> that claiming that that pipeline is simpler than that awk script is
>> highly debatable.
>
> But they don't do the same thing. That script is, roughly, the
> equivalent of
>
> tac | rev | uniq -f1 | rev

But we're not discussing whether or not you can write a different pipeline to
solve the problem, we're discussing the statement Martijn made that THIS pipeline:

sort -r -n <inputfile | rev | uniq -f1 | rev | sort -n >outputfile

is simpler than THIS awk command:

awk '{a[$1] = $0} END{for (k in a) print a[k]}' input > output

and some of us disagree with that statement. I'm actually amazed that anyone
agrees with it.

Ed.

Kenny McCormack

unread,
Jan 14, 2013, 10:13:28 AM1/14/13
to
In article <kd16h5$r30$1...@dont-email.me>,
Ed Morton <morto...@gmail.com> wrote:
...
>> But they don't do the same thing. That script is, roughly, the
>> equivalent of
>>
>> tac | rev | uniq -f1 | rev
>
>But we're not discussing whether or not you can write a different pipeline to
>solve the problem, we're discussing the statement Martijn made that THIS
>pipeline:

Agreed - totally, on all counts.

The thing is, the word "simple" is indeed quite a complex word.

First of call, we're all really just using the word "simple" as a proxy for
"best" - which is to say, what we're really arguing about is which solution
is the "best" - which is to say, which of us is the studliest programmer.

Second, "simple" has at least two entirely distinct meanings:

1) Least keystrokes

2) Most "readable" (whatever that means - i.e., it is in the eye of the
beholder)

Lots of people argue that UUOC is a Good Thing, because it makes their
pipelines more "readable". Many of us think this argument ridiculous.

It is clear that if you understand AWK, then the AWK solutions are much
clearer and readable (i.e., "simple"). And if you don't understand AWK,
that's a defect in your education that needs to be corrected (stat!).

--
Is God willing to prevent evil, but not able? Then he is not omnipotent.
Is he able, but not willing? Then he is malevolent.
Is he both able and willing? Then whence cometh evil?
Is he neither able nor willing? Then why call him God?
~ Epicurus

Ed Morton

unread,
Jan 14, 2013, 10:20:21 AM1/14/13
to
On 1/14/2013 8:37 AM, Martijn Dekker wrote:
> In article
> <0.f6e059b03cb0f1bc29c8.2013...@bsb.me.uk>,
> Ben Bacarisse <ben.u...@bsb.me.uk> wrote:
>
>> Martijn Dekker <mar...@inlv.demon.nl> writes:
> [...]
>>> Dave Gibson originally noted that his awk code does not sort the output,
>>> but according to your description that is irrelevant, because it only
>>> works correctly with sorted input, which would automatically produce
>>> sorted output.
>>
>> Actually no. If the output is to be sorted you'd have to add another
>> sort to the pipeline because "for (k in a) ..." does not access the
>> elements of 'a' in any particular order (but that can be changed in some
>> awks as shown below).
>
> Heh. Wow. A "for" loop that does not necessarily process the data in
> order of input, that's a new one on me.

Then I assume associative arrays are new to you. Given an associative array that
I populate as follows:

a["y"] = 3
a["z"] = 7
a["x"] = 5

you feel the correct order of outputting it's contents is the order in which the
data was entered so y:3, z:7, x:5. Why? Why not the order of the indices x, y,
z? Why not the order of the elements 3, 5, 7? Why not some other order?

The point is there is no "right" way to print an associative array so the order
it's printed by default doesn't matter - you simply code the order you want it
printed for any given application if that matters to you for that application
and script by script you'll have different requirements.

And the solution within awk is
> implementation-dependent? Yet people claim an awk solution is simple!?

Yes, it is very simple. You have an array, you print the contents of the array.
If you want it in any particular order you write the code to print it in that
particular order.

> I rest my case.
>
> This certainly demonstrates the pitfalls of copying and pasting code in
> a programming language you don't thoroughly understand. I believe it's
> called cargo cult programming:
> http://en.wikipedia.org/wiki/Cargo_cult_programming

Huh?

> [...]
>> And another BTW... I agree with you than pipelines (even complex ones)
>> and be conceptually simpler than awk programs. There are two reasons
>> for this in my option. One, you get to know commands like sort and uniq
>> in other contexts so you are often familiar with them already. Two, the
>> pipeline idea is beautifully orthogonal to the commands that get joined,
>> so pipelines are conceptually simple.
>
> Well put.
>
> This is also illustrated by the fact that my pipeline has now been
> demonstrated to be more robust than the original awk program: the
> pipeline deals correctly with unsorted data and produces sorted data.

That would make a difference if it was necessary but since it doesn't appear to
be something the OP is looking for its just unnecessary additional complexity.

> You did manage to make the awk solution robust (I'm taking your word for
> it), but that required abandoning any pretence of simplicity.

By adding a "| sort", if necessary, the posted script would not become more
complicated than the alternative multi-command pipeline under discussion.

Ed.

Janis Papanagnou

unread,
Jan 14, 2013, 10:22:33 AM1/14/13
to
Am 14.01.2013 15:37, schrieb Martijn Dekker:
> In article
> <0.f6e059b03cb0f1bc29c8.2013...@bsb.me.uk>,
> Ben Bacarisse <ben.u...@bsb.me.uk> wrote:
>
>> Martijn Dekker <mar...@inlv.demon.nl> writes:
> [...]
>>> Dave Gibson originally noted that his awk code does not sort the output,
>>> but according to your description that is irrelevant, because it only
>>> works correctly with sorted input, which would automatically produce
>>> sorted output.
>>
>> Actually no. If the output is to be sorted you'd have to add another
>> sort to the pipeline because "for (k in a) ..." does not access the
>> elements of 'a' in any particular order (but that can be changed in some
>> awks as shown below).
>
> Heh. Wow. A "for" loop that does not necessarily process the data in
> order of input, that's a new one on me.

One must observe that the above for-loop is not a loop over the input,
rather over an associative array; a map data structure.

> And the solution within awk is implementation-dependent?

Not quite; any solution that uses hashed-index arrays to store data
may observe implementation dependencies.

> Yet people claim an awk solution is simple!?

Actually, one point why awk is considered a simple tool is exactly
because it deliberately abstained from support of many complex data
structures, and relying on few but powerful concepts.

Certainly and nonetheless you can add elements in a countet way into
a hash-indexed array; in awk, say, like that: a[++c] = b

>
> I rest my case.
>
> This certainly demonstrates the pitfalls of copying and pasting code in
> a programming language you don't thoroughly understand. I believe it's
> called cargo cult programming:
> http://en.wikipedia.org/wiki/Cargo_cult_programming
>
> [...]
>> And another BTW... I agree with you than pipelines (even complex ones)
>> and be conceptually simpler than awk programs. There are two reasons
>> for this in my option. One, you get to know commands like sort and uniq
>> in other contexts so you are often familiar with them already. Two, the
>> pipeline idea is beautifully orthogonal to the commands that get joined,
>> so pipelines are conceptually simple.
>
> Well put.
>
> This is also illustrated by the fact that my pipeline has now been
> demonstrated to be more robust than the original awk program: the
> pipeline deals correctly with unsorted data and produces sorted data.

No. The pipeline is only more robust with a specific interpretation of
the requirements. In other interpretations, say where original sorting
would destroy the sub-structure that should be kept intact, it may be
worse, and the awk solution more robust.

Also robustness involves yet more issues; e.g. if you have large data
sets you may not be allowed to do O(n log n) sorting of the whole
input data before the processing, while single pass solutions may be
appropriate. Also space considerations can be a robustness factor; in
some reply tac(1) had been proposed, but that has immense demands on
memory if you process large data sets. Finally process ressources will
affect robustness; if you have many pipelined processes, say in a loop,
you may notice huge performance effects. Algorithmic robustness and
robustness against extensions is another issue; once you've experienced
that you'd need in one of the final processes information from one of
the first processes you need an often clumsy (and often error proce)
"side-hannel" to exchange such information.

Awk has the interesting property that it's well balanced WRT complexity
and functionality. Solutions are quite easily extensible, and not only
in cases where an alternative pipelined solution might solve the issue
as well.

Janis

Dave Gibson

unread,
Jan 14, 2013, 10:47:11 AM1/14/13
to
Ben Bacarisse <ben.u...@bsb.me.uk> wrote:

> But they don't do the same thing. That script is, roughly, the
> equivalent of
>
> tac | rev | uniq -f1 | rev
>
> which I think is very clear and neat (but, sadly, tac is not in POSIX).

I think rev is non-POSIX too.

Dave Gibson

unread,
Jan 14, 2013, 11:30:29 AM1/14/13
to
Janis Papanagnou <janis_pa...@hotmail.com> wrote:
> Am 14.01.2013 07:38, schrieb Martijn Dekker:
>> In article <kd000a$u31$1...@dont-email.me>,
>> Ed Morton <morto...@gmail.com> wrote:
>>
>>> Dave Gibson posted this:
>>>
>>> awk '{ a[$1] = $0 } END { for (k in a) print a[k] }'
>>>
>>> which can be described as: "Store the contents of each line in
>>> an array, a, indexed by the first field in each line so every
>>> subsequent line with the same key field overwrites the contents of
>>> the previous entry. After all lines have been read print the array
>>> contents."
>>
>> I don't understand the awk language well enough to verify that the code
>> matches your description, but assuming your description is accurate, it
>> does not meet the OP's specification. He was looking for code that
>> preserves the latest date, not the last-appearing date.

>> Dave Gibson originally noted that his awk code does not sort the output,
>> but according to your description that is irrelevant, because it only
>> works correctly with sorted input, which would automatically produce
>> sorted output.
>
> I interpreted Dave's assumed precondition - "Assuming that later dates
> appear later in the input" - as precondition WRT all common key subsets.
> And the OP's data seem to exactly have this (weaker!) property; dates
> for the same key seem to be ordered, but ignoring the key the dates are
> not ordered. In case this partially sorting is not a pure coincidence,
> you don't need the sort(1) command or anything else.

Yes, the sample appeared to represent a date-stamped logfile.

>
> Dave's second remark WRT necessity to sort was targeted to the fact that
> awk arrays are hash-indexed arrays, so the for-loop will not guarantee
> ordering in awk.

That too.

Dave Gibson

unread,
Jan 14, 2013, 11:31:45 AM1/14/13
to
Look at the last word on the Subject: line.

Ed Morton

unread,
Jan 14, 2013, 11:58:47 AM1/14/13
to
I just realized - no-one HAS agreed with it. Various people have said that the
concept of piped commands makes sense (agreed) and that other pipelines might be
simpler (agreed) and that the posted pipeline works on non-sorted input and
produces sorted output and so might be considered more robust in some contexts
(agreed) but now I think about it I don't actually recall seeing anyone other
than Martijn arguing that the pipeline he posted:

sort -r -n <inputfile | rev | uniq -f1 | rev | sort -n >outputfile

is simpler than the awk command Dave posted:

awk '{a[$1] = $0} END{for (k in a) print a[k]}' input > output

so I'm not entirely sure where we went off track!

Regards,

Ed.
>


Posted using www.webuse.net

Martijn Dekker

unread,
Jan 14, 2013, 1:14:19 PM1/14/13
to
In article
<0.25bf0f1f556c04d78aad.2013...@bsb.me.uk>,
Ben Bacarisse <ben.u...@bsb.me.uk> wrote:

> But they don't do the same thing. That script is, roughly, the
> equivalent of
>
> tac | rev | uniq -f1 | rev
>
> which I think is very clear and neat (but, sadly, tac is not in POSIX).

Heh. Yes, GNU 'tac' seems like a handy program to have: "Write each FILE
to standard output, last line first. With no FILE, or when FILE is -,
read standard input." So it's "cat" written backwards and working
backwards. The GNU people are clever like that.

Anyway, my interest is in portable POSIX shell programming. This being
the shell programming group, surely we can re-implement it:

tac() {
nl "$@" | sort -r -n | cut -f 2-
}

Step 1: Number the lines of the input (either standard input or the
files specified as arguments).

Step 2: Sort the lines by number in reverse order.

Step 3: Get rid of the line numbers.

Discussion:

- Unless specified otherwise, the 'nl' program separates line numbers
and lines with a tab, which happens to be the default field separator
used by "cut", so getting rid of the line numbers again is simple: "cut
-f 2-" cuts away the first field, preserving the second field and up
("2-").

- Even though the 'nl' program is specified by POSIX, some systems (at
least my old 2004 Mac OS X 10.3.9 system which I still use for
portability testing) don't have it. However, that system does have the
"n" option to "cat" which does the same. "cat -n" is not POSIX but it's
possibly more ubiquitous.
So, if we assume that either one or the other is everywhere, a shell
test inserted in a script should make any subsequent code using simple
optionless "nl" invocations, including our 'tac' implementation, more
portable:
test "$(echo blah | nl 2>/dev/null)" \
= "$(printf '%6d\t%s\n' 1 blah)" \
|| nl() { cat -n "$@"; }

- Unfortunately, if multiple files are specified to "cat -n", the Linux
version continues counting with each file whereas the BSD version resets
the counter for each file. Portable implementations should "cat"
multiple files together first before feeding them as standard input

- As always, read the man pages. I wish I had done so first -- see below.

_________________________________________________________________________

Pipelines are so wonderfully simple. Unfortunately I failed to take my
own advice and wonder what commands to combine in pipelines first, so I
programmed and almost posted the shell function below that stores the
input in the positional parameters and reverse-outputs them using a
'while' loop. It does illustrate a number of shell concepts for those
who may be interested, so I'll leave it here anyway even though it's
much slower and uses much more memory.

tac() (
IFS='
'
set -- $(cat "$@")
linenr=$#
while test $linenr -ge 1; do
eval "printf '%s\\n' \${$linenr}"
linenr=$(( $linenr - 1 ))
done
)

Discussion:

- Specifying the function as tac() ( ... ) instead of tac() { ... }
causes it to be executed in a subshell, so we don't need to worry about
changing or leaking variables.

- We want to process lines of text. The IFS assignment gets the shell to
use only newlines as Internal Field Separators for further processing.

- The "IFS='
'" assignment looks ugly. Unfortunately replacing it with something
like IFS=$(printf '\n') doesn't work because $(command substitution)
removes all trailing newlines in the output.
Bash, ksh and zsh support referring to a newline as $'\n' -- so they
can use IFS=$'\n' -- but that is not POSIX and not portable.
If anyone knows an elegant portable solution, please let me know.

- "set -- $(cat "$@")" sets the positional parameters to the contents of
the file: $1 to the first line, $2 to the second, etc.

- The "while" loop starts from the number of positional parameters $#
(i.e., the number of lines) and counts backwards.

- We need to refer to the positional parameter ${X} where X is the value
referred to by the $line variable. Directly doing ${$line} is not
supported, hence the need for 'eval'. The double quotes in the argument
necessitate escaping the backslash in \n and the '$' that we don't want
interpreted.

- Any options are passed right on to "cat" so you can use all the same
ones, such as "tac -n" to number lines backwards.

- The function processes all input in memory, so it's not ideal for
large input.

- Martijn
Message has been deleted

Dave Gibson

unread,
Jan 14, 2013, 1:55:04 PM1/14/13
to
Martijn Dekker <mar...@inlv.demon.nl> wrote:

> Anyway, my interest is in portable POSIX shell programming. This being
> the shell programming group, surely we can re-implement it:
>
> tac() {
> nl "$@" | sort -r -n | cut -f 2-
> }

nl needs '-ba' to number empty lines.

>
> Step 1: Number the lines of the input (either standard input or the
> files specified as arguments).
>
> Step 2: Sort the lines by number in reverse order.
>
> Step 3: Get rid of the line numbers.

Just for reference:

tac() {
awk '{ a[++n] = $0 } END { while (n > 0) print a[n--] }' "$@"
}

[...]

> - The "IFS='
> '" assignment looks ugly. Unfortunately replacing it with something
> like IFS=$(printf '\n') doesn't work because $(command substitution)
> removes all trailing newlines in the output.
> Bash, ksh and zsh support referring to a newline as $'\n' -- so they
> can use IFS=$'\n' -- but that is not POSIX and not portable.
> If anyone knows an elegant portable solution, please let me know.

The usual method is to capture newline <anything> then strip <anything>.

IFS=$(printf '\nX\n')
IFS=${IFS%X}

Janis Papanagnou

unread,
Jan 14, 2013, 2:22:04 PM1/14/13
to
On 14.01.2013 19:14, Martijn Dekker wrote:
> In article
> <0.25bf0f1f556c04d78aad.2013...@bsb.me.uk>,
> Ben Bacarisse <ben.u...@bsb.me.uk> wrote:
>
>> But they don't do the same thing. That script is, roughly, the
>> equivalent of
>>
>> tac | rev | uniq -f1 | rev
>>
>> which I think is very clear and neat (but, sadly, tac is not in POSIX).
>
> Heh. Yes, GNU 'tac' seems like a handy program to have: "Write each FILE
> to standard output, last line first. With no FILE, or when FILE is -,
> read standard input." So it's "cat" written backwards and working
> backwards. The GNU people are clever like that.
>
> Anyway, my interest is in portable POSIX shell programming. This being
> the shell programming group, surely we can re-implement it:
>
> tac() {
> nl "$@" | sort -r -n | cut -f 2-
> }
>
> Step 1: Number the lines of the input (either standard input or the
> files specified as arguments).
>
> Step 2: Sort the lines by number in reverse order.
>
> Step 3: Get rid of the line numbers.
>
> Discussion:

$ wc input
10000000 20000000 160000000 input


$ time awk '{a[$1] = $0} END{for (k in a) print a[k]}' input | sort > output-1
real 0m3.763s
user 0m3.640s
sys 0m0.120s


$ time tac <input | rev | uniq -f1 | rev | sort > output-2
real 0m51.674s
user 0m54.080s
sys 0m3.530s


time sort -r <input | rev | uniq -f1 | rev | sort > output-3
real 0m54.094s
user 0m57.570s
sys 0m0.700s


$ mytac() {
nl | sort -r -n | cut -f 2-
}

$ time mytac <input | rev | uniq -f1 | rev | sort > output-4
real 1m5.651s
user 1m10.370s
sys 0m4.650s


Note:
output-1 and output-3 produce identical results
output-2 and output-4 produce identical results
output-1 and output-2 differ

And I don't even start to try the pure shell loop; too inefficient.

BTW: Does this even work for non-toy data sizes: set -- $(cat "$@")

Janis

Ed Morton

unread,
Jan 14, 2013, 2:56:04 PM1/14/13
to
Dave Gibson <dave.gma...@googlemail.com.invalid> wrote:

> Martijn Dekker <mar...@inlv.demon.nl> wrote:
>
> > Anyway, my interest is in portable POSIX shell programming. This being
> > the shell programming group, surely we can re-implement it:
> >
> > tac() {
> > nl "$@" | sort -r -n | cut -f 2-
> > }

Nice! I wonder if adding -k1 or whatever the right sort option is to just make
it look at the first field would make it more efficient?

> nl needs '-ba' to number empty lines.

I'd use "cat -n" instead of "nl -ba". I've no idea if there's a functional or
performance difference, it's just the most obvious thing IMHO.

Ed.

Posted using www.webuse.net

Martijn Dekker

unread,
Jan 14, 2013, 3:05:09 PM1/14/13
to
In article <8t0es9x...@perseus.wenlock-data.co.uk>,
dave.gma...@googlemail.com.invalid (Dave Gibson) wrote:

> Martijn Dekker <mar...@inlv.demon.nl> wrote:
>
> > Anyway, my interest is in portable POSIX shell programming. This being
> > the shell programming group, surely we can re-implement it:
> >
> > tac() {
> > nl "$@" | sort -r -n | cut -f 2-
> > }
>
> nl needs '-ba' to number empty lines.

Good point. Don't know how I managed to miss that. :/

[...]
> The usual method is to capture newline <anything> then strip <anything>.
>
> IFS=$(printf '\nX\n')
> IFS=${IFS%X}

Yeah, that works. Thanks.

- M.

Ed Morton

unread,
Jan 14, 2013, 3:07:00 PM1/14/13
to
awk is as much an appropriate shell command as sed, grep, etc. If you have to do
any kind of analysis of text files, awk should be the first tool that comes to
mind as that's what it was created to do, it's not something to be considered as
somehow un-shell-like and therefore avoided. Anyone doing shell scripting who
doesn't know basic awk syntax should simply learn basic awk syntax. It's not
hard at all and you will save yourself many hours of hacking together
workarounds and the associated maintenance headaches afterwards. That doesn't
mean you can't write chains of piped commands with or without awk in the mix,
just don't dismiss awk in favor of combinations of other UNIX tools.

Ed.

Posted using www.webuse.net

Jon LaBadie

unread,
Jan 14, 2013, 4:18:09 PM1/14/13
to
On 01/14/2013 01:14 PM, Martijn Dekker wrote:
> In article
> <0.25bf0f1f556c04d78aad.2013...@bsb.me.uk>,
> Ben Bacarisse <ben.u...@bsb.me.uk> wrote:
>
>> But they don't do the same thing. That script is, roughly, the
>> equivalent of
>>
>> tac | rev | uniq -f1 | rev
>>
>> which I think is very clear and neat (but, sadly, tac is not in POSIX).
>
> Heh. Yes, GNU 'tac' seems like a handy program to have: "Write each FILE
> to standard output, last line first. With no FILE, or when FILE is -,
> read standard input." So it's "cat" written backwards and working
> backwards. The GNU people are clever like that.
>
> Anyway, my interest is in portable POSIX shell programming. This being
> the shell programming group, surely we can re-implement it:
>
> tac() {
> nl "$@" | sort -r -n | cut -f 2-
> }
>

Either use 'nl -ba "$@"' or 'cat -n "@"'.

By default nl does not number blank lines.

Jon


Martijn Dekker

unread,
Jan 14, 2013, 4:26:06 PM1/14/13
to
In article <201301141...@webuse.net>,
"Ed Morton" <morto...@gmail.com> wrote:

> Nice! I wonder if adding -k1 or whatever the right sort option is to just make
> it look at the first field would make it more efficient?

If you care enough, you could gather lots of text together, for example:

find /usr -type f -name '*.txt' -exec cat {} + >big.txt

and time the sorting of it with and without -k1:

time cat -n big.txt | sort -r -n -k1 >/dev/null
time cat -n big.txt | sort -r -n >/dev/null

I doubt it'll make much difference. I could even see the -k1 version
being slower, as 'sort' may have to actually do more work to only
consider part of a line and disregard the rest.

What is known to really speed up internationalization-aware
implementations of 'sort' (i.e. all modern ones), if you use a non-ASCII
locale, is to use ASCII sorting only by exporting LC_COLLATE=C in the
environment. This might be worth adding to the tac() function. But then
it should be executed in a subshell so the original environment is not
lost, or it should be saved and then restored in some other way.

Also, come to think of it, since both 'nl -ba' and 'cat -n' produce
6-character line number fields until the number of lines exceeds a
million minus one, the -n option is not strictly necessary if your input
is known to have less than a million lines, and I could see simple ASCII
sorting being a little faster than numerical sorting. However, omitting
-n might break it if non-ASCII character sets are used. I have no idea
what character code the Chinese use for a blank, for example. Setting
LC_COLLATE=C in the environment of both "nl -ba"/"cat -n" and "sort"
should make it safe enough.

All of this is mere speculation, because I can't be bothered to do the
tests. :-P

> > nl needs '-ba' to number empty lines.
>
> I'd use "cat -n" instead of "nl -ba". I've no idea if there's a functional or
> performance difference, it's just the most obvious thing IMHO.

It's not POSIX, but it seems to be widespread.

- M.

Martijn Dekker

unread,
Jan 14, 2013, 5:43:50 PM1/14/13
to
In article <201301142...@webuse.net>,
"Ed Morton" <morto...@gmail.com> wrote:

> awk is as much an appropriate shell command as sed, grep, etc. If you
> have to do any kind of analysis of text files, awk should be the
> first tool that comes to mind as that's what it was created to do,
> it's not something to be considered as somehow un-shell-like and
> therefore avoided.

OK, but then shouldn't the same be true of other languages like perl,
python, and even JavaScript? All of them can be used from shell commands
and perl was once meant to replace awk, if I recall correctly.

> Anyone doing shell scripting who doesn't know basic awk syntax should
> simply learn basic awk syntax. It's not hard at all and you will save
> yourself many hours of hacking together workarounds and the
> associated maintenance headaches afterwards. That doesn't mean you
> can't write chains of piped commands with or without awk in the mix,
> just don't dismiss awk in favor of combinations of other UNIX tools.

Fair enough. In fact I have been thinking for a long time I should gain
a better understanding of awk. But so far I can't be bothered because I
haven't felt the lack of it and because the idea of mixing two different
programming paradigms in the same program just rubs me the wrong way.

Anyway, I've ranted on more than enough now in this thread. Perhaps
it'll finally get me to take up some awk.

- M.

Ben Bacarisse

unread,
Jan 14, 2013, 10:55:11 PM1/14/13
to
Janis Papanagnou <janis_pa...@hotmail.com> writes:
<snip>
> Also robustness involves yet more issues; e.g. if you have large data
> sets you may not be allowed to do O(n log n) sorting of the whole
> input data before the processing, while single pass solutions may be
> appropriate. Also space considerations can be a robustness factor; in
> some reply tac(1) had been proposed, but that has immense demands on
> memory if you process large data sets.

That would surprise me (but I suppose it's possible). GNU's tac does
not have that behaviour. For example, tac of a 140470 line 5MB file
allocates no more than about 20K bytes.

If the input is suitable (i.e. no sorting required)

tac | rev | uniq -f1 | rev | tac

might have the best memory usage of all the solutions because no part of
the pipeline needs to store much data at all. There's the process
overhead of course, but awk is not small compared to tac and rev.

--
Ben.

Ben Bacarisse

unread,
Jan 14, 2013, 11:02:46 PM1/14/13
to
Ed Morton <morto...@gmail.com> writes:
<snip>
> Yes, it is very simple. You have an array, you print the contents of
> the array. If you want it in any particular order you write the code
> to print it in that particular order.

How you do that in awk? I know it can be done (you could roll your own
sort function for example) but maybe there's a better way for this
specific use case. (It's simple in gawk, of course.)

<snip>
--
Ben.

Ben Bacarisse

unread,
Jan 14, 2013, 11:05:45 PM1/14/13
to
Yes, you're right. At least that does not make the solution any less
portable. :-)

--
Ben.

Ben Bacarisse

unread,
Jan 14, 2013, 11:09:43 PM1/14/13
to
Yes, I went on to discuss these two in the part you snipped (unless your
point is about the -n flag which I dripped because none of the awk
solutions (except my gawk one) use numerical sorting).

> and some of us disagree with that statement. I'm actually amazed that
> anyone agrees with it.

The world would be very dull without surprises.

--
Ben.

Ben Bacarisse

unread,
Jan 14, 2013, 11:36:29 PM1/14/13
to
Janis Papanagnou <janis_pa...@hotmail.com> writes:
<snip>
> $ wc input
> 10000000 20000000 160000000 input
>
>
> $ time awk '{a[$1] = $0} END{for (k in a) print a[k]}' input | sort > output-1
> real 0m3.763s
> user 0m3.640s
> sys 0m0.120s
>
>
> $ time tac <input | rev | uniq -f1 | rev | sort > output-2
> real 0m51.674s
> user 0m54.080s
> sys 0m3.530s

Interesting. I get:

$ wc input
10000000 20000000 160000000 input
$ time awk '{a[$1] = $0} END{for (k in a) print a[k]}' input | sort > output-1

real 0m3.585s
user 0m3.484s
sys 0m0.052s
$ time tac <input | rev | uniq -f1 | rev | sort > output-2

real 0m3.759s
user 0m4.492s
sys 0m0.48

<snip>
--
Ben.

Ben Bacarisse

unread,
Jan 14, 2013, 11:49:00 PM1/14/13
to
Ben Bacarisse <ben.u...@bsb.me.uk> writes:

> Janis Papanagnou <janis_pa...@hotmail.com> writes:
> <snip>
>> Also robustness involves yet more issues; e.g. if you have large data
>> sets you may not be allowed to do O(n log n) sorting of the whole
>> input data before the processing, while single pass solutions may be
>> appropriate. Also space considerations can be a robustness factor; in
>> some reply tac(1) had been proposed, but that has immense demands on
>> memory if you process large data sets.
>
> That would surprise me (but I suppose it's possible). GNU's tac does
> not have that behaviour. For example, tac of a 140470 line 5MB file
> allocates no more than about 20K bytes.

I should have stopped here!

> If the input is suitable (i.e. no sorting required)
>
> tac | rev | uniq -f1 | rev | tac
>
> might have the best memory usage of all the solutions because no part of
> the pipeline needs to store much data at all.

That's not true because the last pipe must hold all the data before
producing any output.

--
Ben.

Ed Morton

unread,
Jan 15, 2013, 12:34:47 AM1/15/13
to
On 1/14/2013 4:43 PM, Martijn Dekker wrote:
> In article <201301142...@webuse.net>,
> "Ed Morton" <morto...@gmail.com> wrote:
>
>> awk is as much an appropriate shell command as sed, grep, etc. If you
>> have to do any kind of analysis of text files, awk should be the
>> first tool that comes to mind as that's what it was created to do,
>> it's not something to be considered as somehow un-shell-like and
>> therefore avoided.
>
> OK, but then shouldn't the same be true of other languages like perl,
> python, and even JavaScript?

No, none of those come as part of a standard UNIX installation, unlike sed,
grep, awk, sort, etc.

All of them can be used from shell commands
> and perl was once meant to replace awk, if I recall correctly.

That wouldn't make sense to me as perl sounds like it's more similar to a shell
in it's capabilities than to a text processing tool but I don't know the history
of perl and I've never used it.

>> Anyone doing shell scripting who doesn't know basic awk syntax should
>> simply learn basic awk syntax. It's not hard at all and you will save
>> yourself many hours of hacking together workarounds and the
>> associated maintenance headaches afterwards. That doesn't mean you
>> can't write chains of piped commands with or without awk in the mix,
>> just don't dismiss awk in favor of combinations of other UNIX tools.
>
> Fair enough. In fact I have been thinking for a long time I should gain
> a better understanding of awk. But so far I can't be bothered because I
> haven't felt the lack of it and because the idea of mixing two different
> programming paradigms in the same program just rubs me the wrong way.

Not sure what 2 paradigms you're thinking of there.

> Anyway, I've ranted on more than enough now in this thread. Perhaps
> it'll finally get me to take up some awk.

Feel free to ask questions....

Ed.

Chris F.A. Johnson

unread,
Jan 15, 2013, 12:52:59 AM1/15/13
to
wc < "$file"
1817964 1824936 143423184

time awk '{a[$1] = $0} END{for (k in a) print a[k]}' "$file" | sort > /dev/null

real 0m1.669s
user 0m1.555s
sys 0m0.087s

time tac "$file" | rev | uniq -f1 | rev | sort > /dev/null

real 0m4.853s
user 0m5.262s
sys 0m0.675s


--
Chris F.A. Johnson, author <http://shell.cfajohnson.com/>
===================================================================
Shell Scripting Recipes: A Problem-Solution Approach (2005, Apress)
Pro Bash Programming: Scripting the GNU/Linux Shell (2009, Apress)

Janis Papanagnou

unread,
Jan 15, 2013, 1:47:07 AM1/15/13
to
On 14.01.2013 23:43, Martijn Dekker wrote:
>
>> Anyone doing shell scripting who doesn't know basic awk syntax should
>> simply learn basic awk syntax. It's not hard at all and you will save
>> yourself many hours of hacking together workarounds and the
>> associated maintenance headaches afterwards. That doesn't mean you
>> can't write chains of piped commands with or without awk in the mix,
>> just don't dismiss awk in favor of combinations of other UNIX tools.
>
> Fair enough. In fact I have been thinking for a long time I should gain
> a better understanding of awk. But so far I can't be bothered because I
> haven't felt the lack of it and because the idea of mixing two different
> programming paradigms in the same program just rubs me the wrong way.

Awk is a filter as many of the other Unix tools, and it would be used
the same way in pipe constructs.

>
> Anyway, I've ranted on more than enough now in this thread. Perhaps
> it'll finally get me to take up some awk.

I'm sure it would pay.

Janis

>
> - M.
>

Ed Morton

unread,
Jan 15, 2013, 1:59:55 AM1/15/13
to
Here's how to print an array in the order it's populated:

$ cat file
y 3
z 7
x 5
$ awk '{idx[NR]=$1; val[NR]=$2} END{
for (i=1;i<=NR;i++) {
print idx[i] " -> " val[i]
}
}' file
y -> 3
z -> 7
x -> 5

Beyond that, I wrote some sorting functions in awk one time to demonstrate some
other things, they're at http://awk.info/?doc/sorting.html if you're interested.

Ed.

Janis Papanagnou

unread,
Jan 15, 2013, 2:08:27 AM1/15/13
to
Hmm.. - I suspect something... - retrying those commands with LC_ALL=C LANG=C
I get for the respective 'real' numbers above

0m3.77s
0m14.48s
0m14.71s
0m22.05s (ksh) 0m29.962s (bash)

A bit better, but still a factor 4-8 worse than the awk solution.

So locale setting degradations must also be taken into account!


Another issue can be (and probably is) the test data; what properties do have
your data? - I tried to be as close as possible to the OP's data with its
partial key-subset sorting property. In case you're interested to retry with
the same test data that I used, here's the program to generate the data...

# adjust parameter as necessary
awk -v n=${1:-10} '
BEGIN {
for (i=1; i<=n; i++) {
r = 1000 + int (rand() * 9000)
if (r in last) {
l = nxt(last[r])
} else {
l = "1000-01-01"
}
last[r] = l
print r, l
}
}
function nxt(d) {
split(d,a,"-")
if (a[3]<=30) return sprintf("%04d-%02d-%02d", a[1], a[2], 1+a[3])
if (a[2]<=11) return sprintf("%04d-%02d-%02d", a[1], 1+a[2], 1)
return sprintf("%04d-%02d-%02d", 1+a[1], 1, 1)
}
' > input

In case you try with those data, I'd be interested in your results.

Janis

Janis Papanagnou

unread,
Jan 15, 2013, 2:17:28 AM1/15/13
to
On 15.01.2013 05:49, Ben Bacarisse wrote:
> Ben Bacarisse <ben.u...@bsb.me.uk> writes:
>
>> Janis Papanagnou <janis_pa...@hotmail.com> writes:
>> <snip>
>>> Also robustness involves yet more issues; e.g. if you have large data
>>> sets you may not be allowed to do O(n log n) sorting of the whole
>>> input data before the processing, while single pass solutions may be
>>> appropriate. Also space considerations can be a robustness factor; in
>>> some reply tac(1) had been proposed, but that has immense demands on
>>> memory if you process large data sets.
>>
>> That would surprise me (but I suppose it's possible). GNU's tac does
>> not have that behaviour. For example, tac of a 140470 line 5MB file
>> allocates no more than about 20K bytes.

It is possible and maybe even likely that tac has a more sophisticated
implementation, say by seek'ing from the begin and from the end of the
file and swap (while taking the NLs into account) the respective areas.
So if implemented that way tac may not only have no huge memory demands,
but could also be maximal memory efficient. I seem to recall that (GNU)
tail(1) would as well inspect files efficiently from the rear. At least
that would explain the good memory performance that you observed.

Janis

>
> I should have stopped here!
> [...]


Chris F.A. Johnson

unread,
Jan 15, 2013, 3:18:41 AM1/15/13
to
Taking the top 100,000 lines created by your script:

wc < "$file"
1000000 2000000 16000000

time tac "$file" | rev | uniq -f1 | rev | sort > ~/txt3

real 0m2.085s
user 0m2.433s
sys 0m0.229s

time awk '{a[$1] = $0} END{for (k in a) print a[k]}' "$file" | sort > ~/txt4

real 0m0.716s
user 0m0.676s
sys 0m0.020s


However, the files are very different:

$ head ~/txt[34]
==> /home/chris/txt3 <==
1000 1000-01-01
1000 1000-01-02
1000 1000-01-03
1000 1000-01-04
1000 1000-01-05
1000 1000-01-06
1000 1000-01-07
1000 1000-01-08
1000 1000-01-09
1000 1000-01-10

==> /home/chris/txt4 <==
1000 1000-04-08
1001 1000-05-08
1002 1000-04-20
1003 1000-04-21
1004 1000-04-21
1005 1000-04-23
1006 1000-05-06
1007 1000-04-14
1008 1000-04-23
1009 1000-05-10

Janis Papanagnou

unread,
Jan 15, 2013, 5:31:33 AM1/15/13
to
>> [...]
>>
>> In case you try with those data, I'd be interested in your results.
>
> Taking the top 100,000 lines created by your script:
>
> wc < "$file"
> 1000000 2000000 16000000
>
> time tac "$file" | rev | uniq -f1 | rev | sort > ~/txt3
>
> real 0m2.085s
> user 0m2.433s
> sys 0m0.229s
>
> time awk '{a[$1] = $0} END{for (k in a) print a[k]}' "$file" | sort > ~/txt4
>
> real 0m0.716s
> user 0m0.676s
> sys 0m0.020s
>
>
> However, the files are very different:

Not surprising; mind where I upthread noted:
:>> Note:
:>> output-1 and output-3 produce identical results
:>> output-2 and output-4 produce identical results
:>> output-1 and output-2 differ
There are different requirements assumed with the tac-versions.

Ben already said that upthread: "But they don't do the same thing."
and he suggested taking the sort -r version for comparison.
(The timing issues are independent, though, since the magnitude and
missing robustness (WRT performance penalty depending on size of
data, locales) seem to be inherent to the suggested multi-command
pipeline solutions.)

Janis

Ben Bacarisse

unread,
Jan 15, 2013, 8:57:23 AM1/15/13
to
Ed Morton <morto...@gmail.com> writes:

> On 1/14/2013 10:02 PM, Ben Bacarisse wrote:
>> Ed Morton <morto...@gmail.com> writes:
>> <snip>
>>> Yes, it is very simple. You have an array, you print the contents of
>>> the array. If you want it in any particular order you write the code
>>> to print it in that particular order.
>>
>> How you do that in awk? I know it can be done (you could roll your own
>> sort function for example) but maybe there's a better way for this
>> specific use case. (It's simple in gawk, of course.)
>
> Here's how to print an array in the order it's populated:
>
> $ cat file
> y 3
> z 7
> x 5
> $ awk '{idx[NR]=$1; val[NR]=$2} END{
> for (i=1;i<=NR;i++) {
> print idx[i] " -> " val[i]
> }
> }' file
> y -> 3
> z -> 7
> x -> 5

This prints by line number which might not correspond to the population
ordering. I thought you might know a more general trick.

I may have misunderstood you. When you said "it is very simple" you did
not mean that the code you have to write is very simple, just that the
idea is very simple: it's up to you to come up with a method to get the
order you want.

I'm sure the authors intended "| sort" to be the preferred solution in
many cases.

<snip>
--
Ben.

Ben Bacarisse

unread,
Jan 15, 2013, 9:46:43 AM1/15/13
to
It's to do with the data. The trouble is that there have been so many
different variation of the requirements (how sorted is the input, does
the output need to be sorted, etc) that it's not easy to compare like
with like. I must have misread what you wrote about the outputs
differing because I constructed an input that would make the "tac" and
the "awk" commands produce the same output -- basically the well-sorted
input that the OP originally posted.

With your input, uniq keeps finding apparently uniq IDs because they are
not grouped together, The output is then 1000 times bigger and the
final sort is much more costly.

If I had the time, I'd test solutions for various requirements: sorted
output, output preserving the input order, arbitrary output order, and
various input patterns: jumbled up, ordered by date but with un-grouped
IDs, grouped and date ordered, etc. Too much work!

--
Ben.

Janis Papanagnou

unread,
Jan 15, 2013, 10:29:08 AM1/15/13
to
Am 15.01.2013 15:46, schrieb Ben Bacarisse:
> Janis Papanagnou <janis_pa...@hotmail.com> writes:
>
>> On 15.01.2013 05:36, Ben Bacarisse wrote:
>>> Janis Papanagnou <janis_pa...@hotmail.com> writes:
>>> <snip>
[...]
>>
>> A bit better, but still a factor 4-8 worse than the awk solution.
>>
>> So locale setting degradations must also be taken into account!
>>
>>
>> Another issue can be (and probably is) the test data; what properties do have
>> your data? - I tried to be as close as possible to the OP's data with its
>> partial key-subset sorting property. In case you're interested to retry with
>> the same test data that I used, here's the program to generate the data...
[...]
>>
>> In case you try with those data, I'd be interested in your results.
>
> It's to do with the data. The trouble is that there have been so many
> different variation of the requirements (how sorted is the input, does
> the output need to be sorted, etc) that it's not easy to compare like
> with like. I must have misread what you wrote about the outputs
> differing because I constructed an input that would make the "tac" and
> the "awk" commands produce the same output -- basically the well-sorted
> input that the OP originally posted.
>
> With your input, uniq keeps finding apparently uniq IDs because they are
> not grouped together, The output is then 1000 times bigger and the
> final sort is much more costly.

Please note for comparison that purposes I've added a sort to the awk
version as well, so the other sorts shouldn't add to the observed
performance penalty.

Janis

Janis Papanagnou

unread,
Jan 15, 2013, 1:07:03 PM1/15/13
to
On 15.01.2013 14:57, Ben Bacarisse wrote:
> Ed Morton <morto...@gmail.com> writes:
>
>> On 1/14/2013 10:02 PM, Ben Bacarisse wrote:
>>> Ed Morton <morto...@gmail.com> writes:
>>> <snip>
>>>> Yes, it is very simple. You have an array, you print the contents of
>>>> the array. If you want it in any particular order you write the code
>>>> to print it in that particular order.
>>>
>>> How you do that in awk? I know it can be done (you could roll your own
>>> sort function for example) but maybe there's a better way for this
>>> specific use case. (It's simple in gawk, of course.)
>>
>> Here's how to print an array in the order it's populated:
>>
>> [...]
>
> This prints by line number which might not correspond to the population
> ordering.

Exactly. But I think it's noteworthy that variations of requirements may
be more easily met with awk than with pipe constructs in the more general
cases. It may not be as simple as solving many other requirements - those
fabulous one-liners -, but at least it's [comparably "easy"] possible to
solve, and mostly even without loss of performance. (This doesn't prevent
me from using pipe constructs, process substitution, etc. etc. where it's
appropriate, of course.)

Janis

Ed Morton

unread,
Jan 15, 2013, 1:20:44 PM1/15/13
to
No, and I just meant in the order THAT ARRAY is populated (as opposed to
whatever order you'd get from the "in" operator), not the order in which arrays
are populated in general.

> I may have misunderstood you. When you said "it is very simple" you did
> not mean that the code you have to write is very simple, just that the
> idea is very simple: it's up to you to come up with a method to get the
> order you want.

Right.

> I'm sure the authors intended "| sort" to be the preferred solution in
> many cases.

Yes, I use it a lot, even with gawk as I sometimes just get confused enough by
the way asort() and asorti() work wrt numerical vs alphabetic sorting that I
find it easier to pipe to sort than figure them out.

Ed.

Jon LaBadie

unread,
Jan 15, 2013, 1:25:17 PM1/15/13
to
On 01/14/2013 04:26 PM, Martijn Dekker wrote:
> In article <201301141...@webuse.net>,
> "Ed Morton" <morto...@gmail.com> wrote:
>
>> Nice! I wonder if adding -k1 or whatever the right sort option is to just make
>> it look at the first field would make it more efficient?
>
> If you care enough, you could gather lots of text together, for example:
>
> find /usr -type f -name '*.txt' -exec cat {} + >big.txt
>
> and time the sorting of it with and without -k1:
>
> time cat -n big.txt | sort -r -n -k1 >/dev/null
> time cat -n big.txt | sort -r -n >/dev/null
>
> I doubt it'll make much difference. I could even see the -k1 version
> being slower, as 'sort' may have to actually do more work to only
> consider part of a line and disregard the rest.
>

With "-k1 you have told sort that the first key starts with the first field.
But you haven't specified where sort should stop. As written both command
lines sort on the entire line.

-k, --key=POS1[,POS2]
start a key at POS1 (origin 1), end it at POS2
(default end of line)


Ben Bacarisse

unread,
Jan 15, 2013, 11:11:07 PM1/15/13
to
Yes, but the difference comes from the amount of data being sorted. If
the input is arranged to that the awk and tac version produce the same
output, both seem to take much the same time.

<snip>
--
Ben.

Janis Papanagnou

unread,
Jan 16, 2013, 2:56:04 AM1/16/13
to
On 16.01.2013 05:11, Ben Bacarisse wrote:
>> [...]
>> Please note for comparison that purposes I've added a sort to the awk
>> version as well, so the other sorts shouldn't add to the observed
>> performance penalty.
>
> Yes, but the difference comes from the amount of data being sorted. [...]

Yes, that's right.

Janis

Marc Girod

unread,
Jan 16, 2013, 10:30:33 AM1/16/13
to
On Jan 14, 6:17 pm, houghi <hou...@houghi.org.invalid> wrote:

> This +1

Yes, i.e. +2?

> For me the piped command is more readable

Yes again.

I am a Perl user, which means that I have not used awk
as much for some time anymore, and I missed the point
that the array keys were not sorted (as for the keys of
a Perl hash).

This results in the fact that even if I was confident I knew
what the awk version did, in fact I was wrong.

Marc
0 new messages