Maintaining tabular data

Kenny McCormack

Aug 23, 2015, 12:04:23 PM

One of the historical weaknesses of AWK has been that of maintaining the
appearance of tabular data - such as the output of the "ps" command.
Generally, people don't even bother to try to continue to use the built-in
AWK capabilities once they see how AWK can mess things up, so they end up
using various workarounds, such as using printf() to reformat everything.

But let's assume that we want to make it work with the built-in facilities.

Assume that we have a (Unix/Linux) command line something like:

$ ps ... | gawk 'BEGIN {OFS="\t"} {$3=$4=$5=$6="";$9="something";print}'

The intent is to delete several of the fields and to change the value of
another field, and then to print it out with tabs (i.e., a single tab)
between each of the printed fields. Unfortunately, what we get is output
with runs of multiple tabs because there really is no way to actually
*delete* a field; you can only set it to empty. So, my question is:
Shouldn't this work? Shouldn't there be a way to make it work (other than
the obvious workarounds)?
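A minimal sketch of the symptom, assuming a made-up five-field line in place of real ps output (the sample data and the `tr` step that makes tabs visible as "-" are illustrative additions, not part of the original command):

```shell
# Blank two fields of a whitespace-separated record, printing with OFS="\t".
# The emptied fields remain in the record, so their separators are still printed.
out=$(printf 'a b c d e\n' |
    awk 'BEGIN { OFS = "\t" } { $2 = $3 = ""; print }' |
    tr '\t' '-')    # show each tab as "-"
printf '%s\n' "$out"
# prints: a---d-e
```

The three consecutive "-" characters are the run of tabs left behind by the two emptied fields.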

Notes:
1) Among the usual workarounds are doing a $0=$0 and, if that fails,
doing $0=$0"". I've tried both of these to no avail.
2) What I ended up doing, but which I'm not happy with (hence the
motivation for this post to the Usenet) is changing the print to:

print gensub(/\t+/,"\t","G")

This last meets the intent, but I still think it is more work than you
should have to do.

--
Watching ConservaLoons playing with statistics and facts is like watching a
newborn play with a computer. Endlessly amusing, but totally unproductive.

Kaz Kylheku

Aug 23, 2015, 12:48:46 PM

On 2015-08-23, Kenny McCormack <gaz...@shell.xmission.com> wrote:
> One of the historical weaknesses of AWK has been that of maintaining the
> appearance of tabular data - such as the output of the "ps" command.
> Generally, people don't even bother to try to continue to use the built-in
> AWK capabilities once they see how AWK can mess things up, so they end up
> using various workarounds, such as using printf() to reformat everything.
>
> But let's assume that we want to make it work with the built-in facilities.
>
> Assume that we have a (Unix/Linux) command line something like:
>
> $ ps ... | gawk 'BEGIN {OFS="\t"} {$3=$4=$5=$6="";$9="something";print}'
>
> The intent is to delete several of the fields and to change the value of
> another field, and then to print it out with tabs (i.e., a single tab)
> between each of the printed fields. Unfortunately, what we get is output
> with runs of multiple tabs because there really is no way to actually
> *delete* a field; you can only set it to empty. So, my question is:
> Shouldn't this work? Shouldn't there be a way to make it work (other than
> the obvious workarounds) ?

$ cat delfield.awk

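function delete_fields(low, high,    i, j)   # i, j: local variables
{
    # Translation of
    #   for (i = low, j = high + 1; j <= NF; i++, j++)
    # into BASIC:

    i = low
    j = high + 1

    # Shift the fields after the deleted range down over it.
    while (j <= NF) {
        $i = $j
        i++; j++ # saving grace: at least we have ++
    }

    # Truncate the record to drop the leftover trailing fields.
    NF = i - 1
}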

# tests
NR == 1 { delete_fields(2, 4); print }
NR == 2 { delete_fields(1, 2); print }
NR == 3 { delete_fields(1, 5); print }
NR == 4 { delete_fields(2, 4); print }
NR == 5 { delete_fields(3, 3); print }
NR == 6 { delete_fields(5, 5); print }
NR == 7 { delete_fields(1, 1); print }

Run tests:

$ yes 1 2 3 4 5 | head -7 | awk -f delfield.awk -
1 5
3 4 5

1 5
1 2 4 5
1 2 3 4
2 3 4 5
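For what it's worth, here is a sketch of how the same shifting idea might be applied to the command line from the original post. The nine-field sample line is made up, the `tr` step only makes tabs visible, and it assumes an awk that rebuilds the record when NF is assigned (gawk and mawk do):

```shell
# Set $9, then delete fields 3..6 by shifting the later fields down
# and truncating NF, so a single OFS separates the surviving fields.
out=$(printf 'f1 f2 f3 f4 f5 f6 f7 f8 f9\n' |
    awk '
    function delete_fields(low, high,    i, j) {
        i = low
        for (j = high + 1; j <= NF; j++) {
            $i = $j
            i++
        }
        NF = i - 1
    }
    BEGIN { OFS = "\t" }
    { $9 = "something"; delete_fields(3, 6); print }' |
    tr '\t' '-')    # show each tab as "-"
printf '%s\n' "$out"
# prints: f1-f2-f7-f8-something
```

Note that after the deletion the modified field is $5, not $9, so any assignment by position has to happen before the fields are shifted (or use the new index).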

Ed Morton

Aug 23, 2015, 2:22:05 PM

I wouldn't use the second workaround you mention above (the gensub), as it'll
delete blank fields that were present in the original input in addition to
those you are making blank. The first one ($0=$0) would only work with the
default FS, or if your FS were "\t+" instead of "\t", and then it would have
the same problem as the second one.
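A small sketch of that pitfall, using portable gsub() in place of gensub() and a made-up record whose second field is already empty in the input:

```shell
# Field 2 is empty in the input; field 4 is the one we blank on purpose.
# Squeezing runs of tabs removes both, so the record ends up with
# three fields instead of the intended four.
out=$(printf 'a\t\tc\td\te\n' |
    awk 'BEGIN { FS = OFS = "\t" } { $4 = ""; gsub(/\t+/, "\t"); print }' |
    tr '\t' '-')    # show each tab as "-"
printf '%s\n' "$out"
# prints: a-c-e
```

The intended result would have been "a--c-e", with the original empty second field preserved.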

It's trivial to do this given a one-char FS where you can use RE intervals to
count "field+separator"s:

$ printf 'a\tb\tc\td\te\n'
a b c d e

$ printf 'a\tb\tc\td\te\n' | awk '{$0=gensub(/([^\t]+\t)([^\t]+\t){2}/,"\\1",1)}1'
a d e

It's just when you get to multi-char FSs that it gets trickier since you can't
just do [^FS]+FS to identify a field+separator. Then you need to start thinking
a bit...

Ed.

Janis Papanagnou

Aug 23, 2015, 5:15:11 PM

On 23.08.2015 18:04, Kenny McCormack wrote:
> One of the historical weaknesses of AWK has been that of maintaining the
> appearance of tabular data - such as the output of the "ps" command.
> Generally, people don't even bother to try to continue to use the built-in
> AWK capabilities once they see how AWK can mess things up, so they end up
> using various workarounds, such as using printf() to reformat everything.
>
> But let's assume that we want to make it work with the built-in facilities.
>
> Assume that we have a (Unix/Linux) command line something like:
>
> $ ps ... | gawk 'BEGIN {OFS="\t"} {$3=$4=$5=$6="";$9="something";print}'
>
> The intent is to delete several of the fields and to change the value of
> another field, and then to print it out with tabs (i.e., a single tab)
> between each of the printed fields. Unfortunately, what we get is output
> with runs of multiple tabs because there really is no way to actually
> *delete* a field; you can only set it to empty. So, my question is:
> Shouldn't this work? Shouldn't there be a way to make it work (other than
> the obvious workarounds) ?

It would be very useful if awk simply supported that.

To me it seems it would need to remove the field along with the following
or preceding field separator (which one is inconsistent at the beginning and
end of the record). In the general case, though, I can't think of a way to
correctly remove a field; for example (assuming standard FS):

AAA Hello 12345
BBB Hi 42
CCC H 1

how would one remove the second field without spoiling the format? (And
it may become much worse if blanks and TABs are mixed.)
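To illustrate how the format gets spoiled, here is a sketch assuming the sample rows were originally padded with spaces into aligned columns (the exact padding below is a guess, and `tr` is only there to show blanks as "#"):

```shell
# With the default FS, assigning to any field rebuilds the record using a
# single OFS between fields, so the original column padding is lost and the
# emptied field leaves two adjacent separators behind.
out=$(printf 'AAA   Hello  12345\nBBB   Hi     42\n' |
    awk '{ $2 = ""; print }' |
    tr ' ' '#')    # show each blank as "#"
printf '%s\n' "$out"
# prints:
# AAA##12345
# BBB##42
```

The third column no longer lines up with whatever header or neighbouring rows still carry the original spacing.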

For practical purposes - depending on the system in use - I'd probably
just use some external means in many cases, like

ps | awk '...' | column -t

(but note that this is also not bulletproof, say in the case of ps -f as
the source of input). The underlying problem is the inhomogeneous data
source; with a clean (well-defined) parsable syntax you could do more,
but most sources (including [historic] output of Unix or DOS commands)
are not satisfactory in that respect.

All that said, to answer your question: I don't think there could
be a [simple] way to solve the general task.

Janis

>
> [...]

Ed Morton

Aug 23, 2015, 6:45:39 PM

I believe this will work in GNU awk for any FS value:

$ printf 'a\tb\tc\td\te\n' |
    awk -F'\t' '{split($0,f,FS,s); f[2]=s[2]=f[3]=s[3]="";
        r=s[0]; for (i=1;i<=NF;i++) r=r f[i] s[i]; print r}'
a d e

Regards,

Ed.

Kenny McCormack

Aug 24, 2015, 6:09:07 AM

In article <mrdd4t$cjg$1...@news.m-online.net>,
Janis Papanagnou <janis_pa...@hotmail.com> wrote:
...

>For practical purposes - depending on the used system - I'd probably use
>in many cases just some external means, like
>
> ps | awk '...' | column -t
>

Thanks for the pointer to the "column" command. I've never heard of that
before. It solves the problem nicely.

FWIW, I couldn't make much sense out of "man column" - and I'm not sure
what it does that is useful when invoked without the "-t" option - but with
the "-t" option, it is magic!

--

First of all, I do not appreciate your playing stupid here at all.

- Thomas 'PointedEars' Lahn -

Ed Morton

Aug 24, 2015, 10:04:45 AM

On 8/24/2015 5:09 AM, Kenny McCormack wrote:
> In article <mrdd4t$cjg$1...@news.m-online.net>,
> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
> ...
>> For practical purposes - depending on the used system - I'd probably use
>> in many cases just some external means, like
>>
>> ps | awk '...' | column -t
>>
>
> Thanks for the pointer to the "column" command. I've never heard of that
> before. It solves the problem nicely.
>
> FWIW, I couldn't make much sense out of "man column" - and I'm not sure
> what it does that is useful when invoked without the "-t" option - but with
> the "-t" option, it is magic!
>

column is useful for producing tabular-looking output when all fields are
populated, but it doesn't solve this problem, where some fields are empty and
there could be blank chars within a field and/or some original fields are blank.

For example, here's all that column can do using various combinations of its
input/output field separator options, with a tr added to highlight blank
chars (#) vs tabs (-) in the output:

$ printf 'a\tb\tc\tX Y\te\n' |
awk -F'\t' -v OFS='\t' '{$2=$3=""}1' |
column -t | tr ' \t' '#-'
a##X##Y##e

$ printf 'a\tb\tc\tX Y\te\n' |
awk -F'\t' -v OFS='\t' '{$2=$3=""}1' |
column -s$'\t' -t | tr ' \t' '#-'
a######X#Y##e

$ printf 'a\tb\tc\tX Y\te\n' |
awk -F'\t' -v OFS='\t' '{$2=$3=""}1' |
column -o$'\t' -t | tr ' \t' '#-'
a-X-Y-e

compared to what you really want using GNU awk for the 4th arg to split():

$ printf 'a\tb\tc\tX Y\te\n' |
    awk -F'\t' '{split($0,f,FS,s); f[2]=s[2]=f[3]=s[3]="";
        r=s[0]; for (i=1;i<=NF;i++) r=r f[i] s[i]; print r}' |
    tr ' \t' '#-'
a-X#Y-e

Regards,

Ed.