Sed1liners in Awk?

Dave Aldridge

Sep 29, 2008, 12:04:50 AM

Hello.

I was wondering if anyone here knew of a link
where I could find the sed1liners done in awk.
Or if not, if any of you experts would be willing
to create such a document.

http://sed.sourceforge.net/sed1line.txt

This would be very helpful for those of us who
have seen the limitations of sed and want to
upgrade to awk, so to speak.

Thanks,

David

Grant

Sep 29, 2008, 12:13:13 AM

Nice idea, but since I've been using awk, I forgot the sed stuff,
I'll use awk first these days...

Grant.
--
http://bugsplatter.id.au/

pgas

Sep 29, 2008, 2:17:07 AM

Dave Aldridge <nospa...@nowhere.invalid> wrote:
> I was wondering if anyone here knew of a link
> where I could find the sed1liners done in awk.
> Or if not, if any of you experts would be willing
> to create such a document.

The first hit searching for awk oneliners is:

http://student.northpark.edu/pemente/awk/awk1line.txt

As it is written by the same person who wrote the sed one-liners,
maybe you'll find it interesting.

IMHO it's not that interesting in awk because
unlike with sed the simple tasks are straightforward in awk
once you learn the basics.

--
pgas @ SDF Public Access UNIX System - http://sdf.lonestar.org

Dave Aldridge

Sep 29, 2008, 3:01:15 AM

pgas wrote:
> Dave Aldridge <nospa...@nowhere.invalid> wrote:
>> I was wondering if anyone here knew of a link
>> where I could find the sed1liners done in awk.
>> Or if not, if any of you experts would be willing
>> to create such a document.
>
> The first hit searching for awk oneliners is:
>
> http://student.northpark.edu/pemente/awk/awk1line.txt
>
> As it is written by the same person who wrote the sed one-liners,
> maybe you'll find it interesting.

Jeeeeesh! It never occurred to me to try awk1liners. I was using
search strings like "sed1liners awk". Nice work and thanks.

>
> IMHO it's not that interesting in awk because
> unlike with sed the simple tasks are straightforward in awk
> once you learn the basics.

I'm getting the oreilly book, and you are probably right, but
the oneliners will be a good thing to work on in the meantime.

David

---------------------------------------------------------------

For Grant, who said:

"Nice idea, but since I've been using awk, I forgot the sed stuff,
I'll use awk first these days..."

Seems like that would make you the perfect person to do the job,
but fortunately for both of us someone already did.

David

Ed Morton

Sep 29, 2008, 9:14:30 AM

On 9/29/2008 2:01 AM, Dave Aldridge wrote:
> pgas wrote:
>
>>Dave Aldridge <nospa...@nowhere.invalid> wrote:
>>
>>>I was wondering if anyone here knew of a link
>>>where I could find the sed1liners done in awk.
>>>Or if not, if any of you experts would be willing
>>>to create such a document.
>>
>>The first hit searching for awk oneliners is:
>>
>>http://student.northpark.edu/pemente/awk/awk1line.txt
>>
>>As it is written by the same person who wrote the sed one-liners,
>>maybe you'll find it interesting.
>
>
> Jeeeeesh! It never occurred to me to try awk1liners. I was using
> search strings like "sed1liners awk". Nice work and thanks.
>
>
>>IMHO it's not that interesting in awk because
>>unlike with sed the simple tasks are straightforward in awk
>>once you learn the basics.
>
>
> I'm getting the oreilly book

O'Reilly publishes several books, including more than one on awk, so hopefully
you're talking about Effective Awk Programming, Third Edition By Arnold Robbins
(http://www.oreilly.com/catalog/awkprog3/).

> and you are probably right, but
> the oneliners will be a good thing to work on in the meantime.

Below are some more awk examples in case it helps. IIRC one of my "selecting a
range of records" examples is wrong - which one is left as an exercise :-).

Ed.

1. Delete all fields up to field N, preserving input formatting.
gawk --re-interval 'sub(/^[[:space:]]*([^[:space:]]*[[:space:]]*){1}/,"")'
gawk --posix '...'
The number within the "{...}" is the number of initial fields to delete.
Note that "gensub()" is not available with "--posix" but it is available
with "--re-interval" so if you need to use an interval expression (e.g.
{1,} or {8} or {2,4}) with gensub() then you must use --re-interval
rather than --posix so --re-interval is generally the preferred method.
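Where interval expressions are not available at all, the same effect can be sketched by applying a one-field sub() N times. This is an assumed alternative rather than the one-liner above: the helper name drop_fields and the variable n are made up for illustration, and [ \t] is used in place of [[:space:]] so that it also runs on older awks.

```shell
# Hypothetical helper: delete the first n fields of each input line while
# leaving the spacing between the remaining fields untouched.
drop_fields() {
    awk -v n="$1" '{
        # Strip one leading field (and the blanks around it) per iteration.
        for (i = 0; i < n; i++)
            sub(/^[ \t]*[^ \t]+[ \t]*/, "")
        print
    }'
}

# Two fields removed; the double space between "c" and "d" survives.
printf '  a   b    c  d\n' | drop_fields 2
```

Passing 0 leaves the line unchanged, which makes the helper easy to drive from a script variable.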

2. Extract user id and file name from ls -l.
gawk --re-interval 'sub(/([^[:blank:]]{1,}[[:blank:]]{1,}){8}/,"")'
gawk --posix '...'

3. Extract the string that matches a RE.
awk -v re="a|b" '
function extract(str,regexp)
{ RMATCH = (match(str,regexp) ? substr(str,RSTART,RLENGTH) : "")
return RSTART
}
extract($0,re) { print RMATCH }
'

4. Use an RS that must be at the start of a line.
gawk 'BEGIN{rs="SeparatorText";RS="(^|\n)"rs}NR==1{next}
{printf"----\n%s%s\n",rs,$0}'

5. Substituting matched REs in *sub().
PS1> echo "abcbd" | gawk 'sub(/b/,"|&|")'
a|b|cbd
PS1> echo "abcbd" | gawk 'gsub(/b/,"|&|")'
a|b|c|b|d
PS1> echo "abcbd" | gawk '$0=gensub(/b/,"|&|","")'
a|b|cbd
PS1> echo "abcbd" | gawk '$0=gensub(/b/,"|&|","g")'
a|b|c|b|d
PS1> echo "abcbd" | gawk '$0=gensub(/(b)/,"|\\1|","")'
a|b|cbd
PS1> echo "abcbd" | gawk '$0=gensub(/(b)/,"|\\1|","g")'
a|b|c|b|d
PS1> echo "abcbd" | gawk '$0=gensub(/(b)(c)/,"|\\2\\1|","g")'
a|cb|bd

6. Writing changes back to the original file.
gawk '
function printout(_str) { _out[++_nr] = _str }
function flushout( _i) { close(FILENAME);
for (_i=1; _i<=_nr;_i++)
print _out[_i] > FILENAME
}
{ printout( NR " " $0 ) }
END { flushout() }'

7. Transposing rows to selected columns and sorting by key.

Given the following input file:
Number of executions = 437
Number of compilations = 1
Worst preparation time (ms) = 1
Best preparation time (ms) = 1
Rows deleted = 0

Number of executions = 1
Number of compilations = 1
Worst preparation time (ms) = 4
Best preparation time (ms) = 4
Rows deleted = 0

Number of executions = 29
Number of compilations = 1
Worst preparation time (ms) = 1
Best preparation time (ms) = 1
Rows deleted = 0

To transpose certain rows into columns and sort by one of the
columns, like the following which is sorted by "Number of executions":

Number of executions   Number of compilations   Rows deleted
437                    1                        0
29                     1                        0
29                     1                        0

This will do it all in gawk:

gawk -vRS="" -F"\n" 'BEGIN{ fields = "1 2 5"; key = "1"
numflds = split(fields,flds," ")
}
{
for (i=1; i<=NF;i++) {
split($i,f,"=")
# Get rid of all spaces from the end of the title text
sub(/[[:blank:]]*$/,"",f[1])
title[i]=f[1]
# Get rid of all spaces from the value field
value[i]=f[2]+0
# Determine the width for this column based on the width
# of the title text plus 3 for spacing. Left-justify (%-).
fmt[i]="%-"(length(title[i])+3)"s"
}
# We will want to sort on the key column so we need to create a
# string at the start of each line to sort on later. Take the key
# columns value and pad it with zeros up to 20 chars followed by
# a space to separate it from the first real column. Conversion of
# "7" to "0007" and "17" to "0017" is necessary because asort()
# is alphabetical not numerical so all numeric fields must be the
# same width to compare alphabetically.
lines[NR] = sprintf("%020s ",value[key])

# Now add the real columns, formatted as determined earlier.
for (i=1; i<=numflds; i++) {
lines[NR] = lines[NR] sprintf(fmt[flds[i]], value[flds[i]])
}
}
END {
# Print the title line
for (i=1; i<=numflds; i++) {
printf fmt[flds[i]], title[flds[i]]
}
print ""
# Sort the lines alphabetically, i.e. by the value of the key column
# added above to the front of each line.
asort(lines)
# Print each line
for (i=1; i<=NR; i++) {
# strip out the first numeric value, the key value added above
sub("[[:digit:]]* ","",lines[i])
print lines[i]
}
}'

Setting fields and key at the beginning obviously dictates which fields are
printed and which key to sort on. The only assumption about field sizes is
that the key field's values won't be more than 20 characters.

8. Performing arithmetic on and replacing part of an RE.
Given a tag in a file that is myval="45", to multiply the "45" by 2 and replace it
with the result (90) in the same tag, i.e. myval="90":

gawk -vRS="myval=\"[0-9]+\"" '{ORS=RT;gsub("[^0-9]","",RT);sub("[0-9]+",2*RT,ORS)}1'

9. Recursively in-lining "include" files.
This script will not only expand all the lines that say "include subfile", but
by writing the result to a tmp file, resetting ARGV[1] (the highest level input
file) and not resetting ARGV[2] (the tmp file), it then lets awk do any normal
record parsing on the result of the expansion since that's now stored in the
tmp file. If you don't need that, just do the "print" to stdout and remove any
other references to a tmp file or ARGV[2].

awk 'function read(file) {
while ( (getline < file) > 0) {
if ($1 == "include") {
read($2)
} else {
print > ARGV[2]
}
}
close(file)
}
BEGIN{
read(ARGV[1])
ARGV[1]=""
close(ARGV[2])
}1' a.txt tmp

The result of running the above given these 3 files in the current directory:

a.txt           b.txt           c.txt
-----           -----           -----
1               3               5
2               4               6
include b.txt   include c.txt
9               7
10              8

would be to print the numbers 1 through 10 and save them in a file named "tmp".
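The stdout-only variant mentioned above (no tmp file, no ARGV[2]) might look roughly like this. It is a sketch under assumptions: the helper name expand_includes is invented, and it reads with "getline line < file" into a local variable instead of relying on $1 and $2, so the main record state is never touched.

```shell
# Hypothetical helper: print a file to stdout, recursively replacing each
# "include <file>" line with the contents of <file>.
expand_includes() {
    awk 'function read(file,    line, parts) {
        # Read into a local variable so $0, NF and NR are left alone.
        while ((getline line < file) > 0) {
            if (split(line, parts, " ") >= 2 && parts[1] == "include")
                read(parts[2])      # descend into the included file
            else
                print line
        }
        close(file)
    }
    BEGIN { read(ARGV[1]) }' "$1"
}

# Demo in a throwaway directory, using the three files from the example.
(
    cd "$(mktemp -d)" || exit 1
    printf '1\n2\ninclude b.txt\n9\n10\n' > a.txt
    printf '3\n4\ninclude c.txt\n7\n8\n'  > b.txt
    printf '5\n6\n'                        > c.txt
    expand_includes a.txt               # prints 1 through 10, one per line
)
```

Because the program consists of a BEGIN rule only, awk never opens a.txt itself; everything goes through read().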

10. Convert the first letter of every word to upper case.
awk 'BEGIN{RS="[[:space:]]";FS=OFS=""}{$1=toupper($1);ORS=RT}1'

11. Convert a string to an array.
To convert a string to an array indexed by each word's position in the string:

awk 'BEGIN{str="abc def";c=split(str,arr);for (i=1;i<=c;i++) print arr[i]}'

To convert a string to an array indexed by each word:

awk 'BEGIN{str="abc def";c=split(str,tmp);for (i=1;i<=c;i++)arr[tmp[i]]++;delete tmp;
for (w in arr) print w}'

or:

awk 'function str2arr(s,a,i,_c,_j){_c=split(s,i);for (_j=1;_j<=_c;_j++)a[i[_j]]++;return _c}
BEGIN{str="abc def";c=str2arr(str,arr,idx);for (i=1;i<=c;i++)print idx[i],arr[idx[i]]}'

where arr is an array of numbers indexed by strings and idx is an array of
strings indexed by numbers for ease of accessing arr[]'s indices in the
original order.

12. Remove text between nested delimiters.
Given an input file like:

2005-04-26 DEBUG [junk][junkjunkjunk] The message says...
2005-04-26 DEBUG [morejunk]meaningful [junkagain]more meaning to life...
2005-04-26 DEBUG a[junk[junk[junk]junk]]b The message says...

this script:

awk -vFS="" 'function rmvJunk() {
for (i+=1; i <= NF; i++)
if ($i == "[") rmvJunk(); else if ($i == "]") return
}
{ for (i=1; i <= NF; i++)
if ($i == "[") rmvJunk(); else printf "%s",$i
print ""
}'

will remove everything between the nested pairs of "[...]"s to produce:

2005-04-26 DEBUG The message says...
2005-04-26 DEBUG meaningful more meaning to life...
2005-04-26 DEBUG ab The message says...

so will this much simpler one:

awk '{while (sub(/\[[^][]*]/," "));}1'

though it only works in gawk and /usr/xpg4/bin/awk, not nawk or old awk.
The RE it's using looks for:

\[ = a single "["
[^][] = a character that's neither "]" nor "["
* = repeated
] = until a single "]"

so it finds the innermost "[...]" every iteration through the loop.

13. Join lines that end in backslashes.
gawk 'BEGIN{RS="\\\\\n|\n"}{ORS=RT~/\\/?"":"\n"}1'

14. Expand a file of references.
The point of the exercise is to avoid mixing the use of getline with the normal
awk processing so we don't have to worry about which variables getline is
resetting, etc. and every line of every file (reference file plus referred
files) goes through the normal gawk condition->action work loop, represented
by one "printf" line in the below example. In other words, to process every
file as if it had been specified on the command line (though I'm choosing
to print a warning for an empty file instead of just skipping it).

A simple solution is:

gawk 'NR==FNR{ARGV[ARGC++]=$0;next}1' refs

If you REALLY feel a burning desire to test for the file existing and being
readable within this awk script instead of ensuring your input file is sane
before calling it, then you can either compile the "stat()" library into
gawk or use getline just as a test when populating ARGV[]. e.g.:

gawk 'NR==FNR{if((getline tmp<$0)>=0)ARGV[ARGC++]=$0; else printf "error: %s\n",$0;next}1' refs

Finally, here's a robust solution to handle the worst case where the reference file
contains the names of files that are empty, files that don't exist and files
that exist but can't be opened and you want the warning message for each of
those conditions to be printed out in the order in which the file was
referenced (as would occur with the in-line getline solution) rather than
before the main file processing begins (as would occur with my simpler
solution above). If I have this in a file named expandRefs.awk:

function analyzeFiles(file, _ret) {
_ret = (getline tmp < file)
close(file)
if (_ret > 0) ARGV[ARGC++] = file
else badFiles[ARGC] = badFiles[ARGC] SUBSEP _ret SUBSEP file
}
function prtBadFiles( argind ) {
c = split(badFiles[argind],bfA,SUBSEP)
for (i=2; i < c; i+=2) {
if (bfA[i] == "0") { printf "Warning: %s is empty\n",bfA[i+1] }
else { printf "Error: %s cannot be opened\n",bfA[i+1] }
}
delete badFiles[argind]
}
NR == FNR { analyzeFiles($0) }
ARGIND in badFiles { prtBadFiles(ARGIND) }
{printf "File %s NR %d FNR %d: %s\n",FILENAME, NR, FNR, $0}
END { prtBadFiles(ARGIND+1) }

I have the following files:

$ cat refs
file1
file2
file3
file4
file5
file6
$ cat file1
abc
def
ghi
$ cat file2
klm
nop
qrs
$ cat file3
cat: file3: Permission denied
$ cat file4
$ cat file5
tuv
wxy
zzz
$ cat file6
cat: file6: No such file or directory

i.e. the reference file includes 6 files, of which file3 is unreadable, file4 is
empty, and file6 doesn't exist. If I run the above script I get:

$ gawk -f expandRefs.awk refs
File refs NR 1 FNR 1: file1
File refs NR 2 FNR 2: file2
File refs NR 3 FNR 3: file3
File refs NR 4 FNR 4: file4
File refs NR 5 FNR 5: file5
File refs NR 6 FNR 6: file6
File file1 NR 7 FNR 1: abc
File file1 NR 8 FNR 2: def
File file1 NR 9 FNR 3: ghi
File file2 NR 10 FNR 1: klm
File file2 NR 11 FNR 2: nop
File file2 NR 12 FNR 3: qrs
Error: file3 cannot be opened
Warning: file4 is empty
File file5 NR 13 FNR 1: tuv
File file5 NR 14 FNR 2: wxy
File file5 NR 15 FNR 3: zzz
Error: file6 cannot be opened

15. Converting and diffing 2 date/time values.
This will print the number of seconds between 2 date/time values given in some
non-standard format:

function cvttime(t, a) {
split(t,a,"[/:]")
match("JanFebMarAprMayJunJulAugSepOctNovDec",a[2])
a[2] = sprintf("%02d",(RSTART+2)/3)
return( mktime(a[3]" "a[2]" "a[1]" "a[4]" "a[5]" "a[6]) )
}
BEGIN{
t1="01/Dec/2005:00:04:42"
t2="01/Dec/2005:17:14:12"
print cvttime(t2) - cvttime(t1)
}

16. Buffering lines to emulate GNU grep -B/-A.
Here's how to emulate -A and -B (print some numbers of lines Before and After
a given search pattern is found). It's a trivial tweak to make it work for the
simpler -N:

BEGIN{ bufSize = B + 1 }
{ saveBuf($0) }
found { found--; print }
$0 ~ pat { found = A; prtBuf() }

Save it in a file called, say "grep.awk" then run it as:

awk -v pat="foo" -v B=2 -v A=3 -f grep.awk file

to search for "foo" plus the 2 lines before and 3 lines after it.
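The script above calls saveBuf() and prtBuf() without showing them. One way they could be written is as a small ring buffer; the definitions below are assumptions, not the original ones, and the shell wrapper ctx_grep is an invented name. The pattern rule is also moved ahead of the "found" rule and given a "next", so that a line is never printed twice when matches overlap.

```shell
# Hypothetical completion of grep.awk, wrapped in a shell function:
#   ctx_grep PATTERN BEFORE AFTER
ctx_grep() {
    awk -v pat="$1" -v B="$2" -v A="$3" '
    # Ring buffer holding the current line plus up to B earlier lines.
    function saveBuf(line) { buf[NR % bufSize] = line }
    function prtBuf(    i, slot) {
        # Print whatever is still buffered, oldest first, then forget it.
        for (i = NR - bufSize + 1; i <= NR; i++) {
            slot = i % bufSize
            if (i >= 1 && (slot in buf)) { print buf[slot]; delete buf[slot] }
        }
    }
    BEGIN    { bufSize = B + 1 }
             { saveBuf($0) }
    $0 ~ pat { found = A; prtBuf(); next }   # match: flush context + this line
    found    { found--; prtBuf() }           # trailing context line
    '
}

# Prints l3, l4, foo, l6, l7, l8 (one per line).
printf '%s\n' l1 l2 l3 l4 foo l6 l7 l8 l9 l10 | ctx_grep foo 2 3
```

Deleting entries as they are printed is what keeps adjacent matches from repeating shared context lines.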

17. Getline.
See http://tinyurl.com/yn9ka9.

18. Selecting a range of records.
The following idioms describe how to select a range of records given
a specific pattern to match:

a) Print all records from some pattern:

awk '/pattern/{f=1}f' file

b) Print all records after some pattern:

awk 'f;/pattern/{f=1}' file

c) Print the Nth record after some pattern:

awk 'c&&!--c;/pattern/{c=N}' file

d) Print every record except the Nth record after some pattern:

awk 'c&&!--c{next}/pattern/{c=N}' file

e) Print the N records after some pattern:

awk 'c&&c--;/pattern/{c=N}' file

f) Print every record except the N records after some pattern:

awk 'c&&c--{next}/pattern/{c=N}' file

I changed the variable name from "f" for "found" to "c" for "count" where
appropriate as that's more expressive of what the variable actually IS.

19. Removing text between 2 tags.
POSIX: a 2-pass approach to turn all the searched-for patterns into a single
char (control-B in this case for no particular reason) first and then use that
as the RS (since an RS that's an RE is gawk-only):

awk '{$1=$1}1' FS='(begin|end)' OFS=^B file | awk 'NR%2' RS=^B ORS=

where the opening and closing tags are "begin" and "end" respectively.

gawk equivalent: directly uses an RE for the RS:

gawk -v RS='(begin|end)' -v ORS= 'NR%2'

pk

Sep 29, 2008, 9:43:50 AM

On Monday 29 September 2008 15:14, Ed Morton wrote:

> Below are some more awk examples in case it helps. IIRC one of my
> "selecting a range of records" examples is wrong - which one is left as an
> exercise :-).

Actually, I've spotted two of them.

> d) Print every record except the Nth record after some pattern:
>
> awk 'c&&!--c{next}/pattern/{c=N}' file

That does not print anything.

awk 'c&&!--c{next}/pattern/{c=N}1' file

> f) Print every record except the N records after some pattern:
>
> awk 'c&&c--{next}/pattern/{c=N}' file

Same as before.

awk 'c&&c--{next}/pattern/{c=N}1' file
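A quick check of the two corrected idioms on a small sample. The wrapper names sample, skip_nth and skip_n are invented for the demonstration, and N is supplied with -v here (an assumption; in the idioms it is a placeholder).

```shell
# Six-line sample: one matching line followed by three others.
sample() { printf '%s\n' a pattern x1 x2 x3 b; }

# d) every record except the Nth record after the match
skip_nth() { awk -v N="$1" 'c&&!--c{next}/pattern/{c=N}1'; }

# f) every record except the N records after the match
skip_n() { awk -v N="$1" 'c&&c--{next}/pattern/{c=N}1'; }

sample | skip_nth 2   # prints a, pattern, x1, x3, b (x2 is skipped)
sample | skip_n 2     # prints a, pattern, x3, b (x1 and x2 are skipped)
```

Without the trailing 1 neither program has a rule that prints, which is why the original versions produce no output.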

Ed Morton

Sep 29, 2008, 10:01:17 AM

Ah yes. Thanks!

Ed.

pk

Sep 29, 2008, 10:34:53 AM

Just FYI, here are other minor glitches I found.

On Monday 29 September 2008 15:14, Ed Morton wrote:

> 1. Delete all fields up to field N, preserving input formatting.
> gawk --re-interval 'sub(/^[[:space:]]*([^[:space:]]*[[:space:]]*){1}/,"")'
> gawk --posix '...'
> The number within the "{...}" is the number of initial fields to delete.
> Note that "gensub()" is not available with "--posix" but it is available
> with "--re-interval" so if you need to use an interval expression (e.g.
> {1,} or {8} or {2,4}) with gensub() then you must use --re-interval
> rather than --posix so --re-interval is generally the preferred method.
>
> 2. Extract user id and file name from ls -l.
> gawk --re-interval 'sub(/([^[:blank:]]{1,}[[:blank:]]{1,}){8}/,"")'
> gawk --posix '...'

That removes everything except the file name. ITYM

gawk --re-interval '{u=$3;sub(/([^[:blank:]]{1,}[[:blank:]]{1,}){8}/,"");print u,$0}'

Note that the above command will behave wildly differently depending on your
locale. Example:

# locale en_GB.utf8

$ ls -l
total 88
-rw-r--r-- 1 pk users 63277 2008-08-28 17:37 f
-rw-r--r-- 1 pk users 44 2008-08-28 12:55 file1
-rw-r--r-- 1 pk users 581 2008-08-28 13:03 file2
-rw-r--r-- 1 pk users 126 2008-08-30 17:47 file4
-rw-r--r-- 1 pk users 295 2008-09-23 11:21 f.sed
-rw-r--r-- 1 pk users 244 2008-09-23 11:45 s.sed

$ ls -l | \
awk --re-interval 'sub(/([^[:blank:]]{1,}[[:blank:]]{1,}){3}/,"")'
# doesn't print anything!

$ ls -l | \
LC_ALL=C awk --re-interval 'sub(/([^[:blank:]]{1,}[[:blank:]]{1,}){3}/,"")'
users 63277 2008-08-28 17:37 f
users 44 2008-08-28 12:55 file1
users 581 2008-08-28 13:03 file2
users 126 2008-08-30 17:47 file4
users 295 2008-09-23 11:21 f.sed
users 244 2008-09-23 11:45 s.sed

Also, again depending on your locale, you might have to remove either 7 or 8
fields (due to different date formatting in ls -l output). So I suggest
adding LC_ALL=C before generating ls -l output, to (hopefully) get an
output where exactly 8 fields have to be removed.

> 3. Extract the string that matches a RE.
> awk -v re="a|b" '
> function extract(str,regexp)
> { RMATCH = (match(str,regexp) ? substr(str,RSTART,RLENGTH) : "")
> return RSTART
> }
> extract($0,re) { print RMATCH }
> '

imho it's better to use single quotes for the regexp here, especially if it
contains shell metacharacters.

$ a=9
$ awk -v re="$a" 'BEGIN{print re}'
9
$ awk -v re='$a' 'BEGIN{print re}'
$a

(ok silly example, but you get the idea)

> 10. Convert the first letter of every word to upper case.
> awk 'BEGIN{RS="[[:space:]]";FS=OFS=""}{$1=toupper($1);ORS=RT}1'

This needs GNU awk.

> 12. Remove text between nested delimiters.
> Given an input file like:
>
> 2005-04-26 DEBUG [junk][junkjunkjunk] The message says...
> 2005-04-26 DEBUG [morejunk]meaningful [junkagain]more meaning to life...
> 2005-04-26 DEBUG a[junk[junk[junk]junk]]b The message says...
>
> this script:
>
> awk -vFS="" 'function rmvJunk() {
> for (i+=1; i <= NF; i++)
> if ($i == "[") rmvJunk(); else if ($i == "]") return
> }
> { for (i=1; i <= NF; i++)
> if ($i == "[") rmvJunk(); else printf "%s",$i
> print ""
> }'

Needs gawk.

Dave Aldridge

Sep 29, 2008, 5:47:40 PM

Ed wrote:

> David wrote:
>> I'm getting the oreilly book
>
> O'Reilly publishes several books, including more than one on awk, so hopefully
> you're talking about Effective Awk Programming, Third Edition By Arnold Robbins
> (http://www.oreilly.com/catalog/awkprog3/).

I wasn't, but I can change my mind.

"This print book is out of stock, with no immediate plans to reprint."

Will look on amazon.

http://www.amazon.com/gp/offer-listing/0596000707/ref=sr_1_olp_1?ie=UTF8&s=books&qid=1222724642&sr=1-1

>> and you are probably right, but
>> the oneliners will be a good thing to work on in the meantime.
>
> Below are some more awk examples in case it helps. IIRC one of my "selecting a
> range of records" examples is wrong - which one is left as an exercise :-).
>
> Ed.
>
>

[delete]

That's a real treasure trove, Ed. I thank you very much.

David

Peteris Krumins

Sep 30, 2008, 2:15:36 AM

Hi.

I just wrote the first part of an article which
explains every single one of awk one liners
from Eric's compilation:

http://www.catonmat.net/blog/awk-one-liners-explained-part-one/

PS. Eric sent me an email and said that there
is a much more recent Awk one-liner collection
available (dated April 2008) at:
http://www.pement.org/awk/awk1line.txt

Sincerely,
Peteris

Janis Papanagnou

Sep 30, 2008, 7:09:53 AM

Peteris Krumins wrote:
[snip]

> I just wrote the first part of an article which
> explains every single one of awk one liners
> from Eric's compilation:
>
> http://www.catonmat.net/blog/awk-one-liners-explained-part-one/
>
> ps. Eric sent me an email and said that there
> was a much more recent Awk one-liner collection
> available (dates April 2008) at:
> http://www.pement.org/awk/awk1line.txt

Careful, though! A quick glance at the examples showed some problematic
code...

# print section of file from regular expression to end of file
awk '/regex/,0'
awk '/regex/,EOF'

The latter form is bad style since the program's behaviour will change
as soon as variable EOF gets defined in the context.

# remove duplicate, consecutive lines (emulates "uniq")
awk 'a !~ $0; {a=$0}'

This doesn't do what it says; it also removes subsequent lines that
contain a substring of the previous line. (The fix is trivial, though.)
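To make the problem concrete, here is the regex-based version next to one possible fix that compares strings instead. The fix shown is an assumption about what "trivial" means here; the helper name uniq_awk and the variable prev are invented.

```shell
# Regex version: "ab" is dropped because the previous line "abc" matches
# it as a pattern, so only the two "abc" lines are printed.
printf '%s\n' abc ab ab abc | awk 'a !~ $0; {a=$0}'

# String comparison instead. NR == 1 keeps a leading empty line, and
# ($0 "") forces a string (not numeric) comparison, so "1" and "1.0"
# count as different lines.
uniq_awk() { awk 'NR == 1 || ($0 "") != prev; { prev = $0 }'; }

printf '%s\n' abc ab ab abc | uniq_awk   # prints abc, ab, abc
```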

And for those who try to avoid getline

# print the line immediately after a regex, but not the line
# containing the regex
awk '/regex/{getline;print}'

can be replaced by

awk 'f{print;f=0}/regex/{f=1}'

or more cryptic

awk '!--f;/regex/{f=1}'

and

# if a line ends with a backslash, append the next line to it (fails if
# there are multiple lines ending with backslash...)
awk '/\\$/ {sub(/\\$/,""); getline t; print $0 t; next}; 1' file*

can be replaced by

awk 'sub(/\\$/,""){printf("%s",$0);next};1'

which also handles multiple consecutive lines with backslashes.
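A small run showing the multi-line case, with the one-liner wrapped in a shell function (the name join_continued is made up for the example):

```shell
# Join every line that ends in a backslash with the line that follows it.
join_continued() { awk 'sub(/\\$/,""){printf("%s",$0);next};1'; }

# Input is four lines: "a\", "b\", "c", "d".
# Output is two lines: "abc" and "d".
printf 'a\\\nb\\\nc\nd\n' | join_continued
```

The sub() call doubles as the test: it returns 1 only when a trailing backslash was actually removed, so lines without one fall through to the plain print.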

(Disclaimer: I hope not, but there may be errors in my posted ad-hoc code.)

Janis

>
>
> Sincerely,
> Peteris
>

pk

Sep 30, 2008, 8:03:13 AM

On Tuesday 30 September 2008 08:15, Peteris Krumins wrote:

> I just wrote the first part of an article which
> explains every single one of awk one liners
> from Eric's compilation:
>
> http://www.catonmat.net/blog/awk-one-liners-explained-part-one/
>
> ps. Eric sent me an email and said that there
> was a much more recent Awk one-liner collection
> available (dates April 2008) at:
> http://www.pement.org/awk/awk1line.txt

This one

# double space a file which already has blank lines in it. Output file
# should contain no more than one blank line between lines of text.
# NOTE: On Unix systems, DOS lines which have only CRLF (\r\n) are
# often treated as non-blank, and thus 'NF' alone will return TRUE.
awk 'NF{print $0 "\n"}'

actually inserts a blank line at the end of the text (after the last line).
A way to avoid that is

awk 'NF{print n $0;n="\n"}'
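Counting output lines makes the difference visible. The function names below are invented for the comparison; the awk programs are the two quoted above.

```shell
# Original: a blank line follows every non-blank line, including the last.
double_space_old() { awk 'NF{print $0 "\n"}'; }

# Variant: the separator is printed before every line except the first.
double_space_new() { awk 'NF{print n $0;n="\n"}'; }

# The input has two text lines separated by two blank lines.
printf 'a\n\n\nb\n' | double_space_old | awk 'END{print NR}'   # 4
printf 'a\n\n\nb\n' | double_space_new | awk 'END{print NR}'   # 3
```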

--------
# IN UNIX ENVIRONMENT: convert DOS newlines (CR/LF) to Unix format
awk '{sub(/\r$/,"")};1' # assumes EACH line ends with Ctrl-M

That will work even if not all lines end with \r\n. If some line doesn't,
sub() will just fail, but the line will be printed nonetheless.

Ed Morton

Sep 30, 2008, 9:14:39 AM

On 9/29/2008 9:34 AM, pk wrote:
> Just FYI, here are other minor glitches I found.
>
> On Monday 29 September 2008 15:14, Ed Morton wrote:

>>2. Extract user id and file name from ls -l.
>>gawk --re-interval 'sub(/([^[:blank:]]{1,}[[:blank:]]{1,}){8}/,"")'
>>gawk --posix '...'
>
>
> That removes everything except the file name. ITYM
>
> gawk --re-interval '{u=$3;sub(/([^[:blank:]]{1,}[[:blank:]]{1,}){8}/,"");print u,$0}'

Yes, or I meant to just select the file name. It's been a while....

>
> Note that the above command will behave wildly differently depending on your
> locale.

Yeah, I know, but I don't care. IMHO, people should just use LC_ALL=C and if
they don't then they need to get used to changing scripts. It also needs
modification if you're on cygwin and have a 2-word login name.

<snip>

>>3. Extract the string that matches a RE.
>>awk -v re="a|b" '
>>function extract(str,regexp)
>>{ RMATCH = (match(str,regexp) ? substr(str,RSTART,RLENGTH) : "")
>> return RSTART
>>}
>>extract($0,re) { print RMATCH }
>>'
>
>
> imho it's better to use single quotes for the regexp here, especially if it
> contains shell metacharacters.

Fair enough.

>>10. Convert the first letter of every word to upper case.
>>awk 'BEGIN{RS="[[:space:]]";FS=OFS=""}{$1=toupper($1);ORS=RT}1'
>
>
> This needs GNU awk.

True. I should stick a g at the front for clarity.

>
>>12. Remove text between nested delimiters.
>>Given an input file like:
>>
>>2005-04-26 DEBUG [junk][junkjunkjunk] The message says...
>>2005-04-26 DEBUG [morejunk]meaningful [junkagain]more meaning to life...
>>2005-04-26 DEBUG a[junk[junk[junk]junk]]b The message says...
>>
>>this script:
>>
>>awk -vFS="" 'function rmvJunk() {
>> for (i+=1; i <= NF; i++)
>> if ($i == "[") rmvJunk(); else if ($i == "]") return
>>}
>>{ for (i=1; i <= NF; i++)
>> if ($i == "[") rmvJunk(); else printf "%s",$i
>> print ""
>>}'
>
>
> Needs gawk.

Only because I didn't put a space between -v and FS="". I think other (but not
all) awks will do field splitting on characters when FS is empty. I'll change it
to -F'' or -v FS= and check on whether or not it's gawk-specific.

Thanks,

Ed.

Dave Aldridge

Sep 30, 2008, 4:04:21 PM

Peteris wrote:
> On Sep 29, 7:04 am, Dave Aldridge <nospampl...@nowhere.invalid> wrote:
>> Hello.
>>
>> I was wondering if anyone here knew of a link
>> where I could find the sed1liners done in awk.
>> Or if not, if any of you experts would be willing
>> to create such a document.
>>
>> http://sed.sourceforge.net/sed1line.txt
>>
>> This would be very helpful for those of us who
>> have seen the limitations of sed and want to
>> upgrade to awk, so to speak.
>>
>> Thanks,
>>
>> David
>
> Hi.
>
> I just wrote the first part of an article which
> explains every single one of awk one liners
> from Eric's compilation:
>
> http://www.catonmat.net/blog/awk-one-liners-explained-part-one/

Very nice. Looks like a useful site, all around. I'll keep tuned
in for the following parts. Your work should vastly increase
my understanding of awk.

>
> ps. Eric sent me an email and said that there
> was a much more recent Awk one-liner collection
> available (dates April 2008) at:
> http://www.pement.org/awk/awk1line.txt

Thanks a lot, Peteris.

And thanks to pk and Janis for your input, too.

Back to lurking and playing with the problems that
are posted here. Those that are comprehensible to me,
that is. :-)

David

Dave Aldridge

Sep 30, 2008, 9:11:01 PM

Dave Aldridge <nospa...@nowhere.invalid> wrote:
>

[delete]

> Thanks a lot, Peteris.
>
> And thanks to pk and Janis for your input, too.

And Ed!

Ed Morton

Oct 2, 2008, 8:10:24 AM

On 9/30/2008 6:09 AM, Janis Papanagnou wrote:
<snip>

> And for those who try to avoid getline
>
> # print the line immediately after a regex, but not the line
> # containing the regex
> awk '/regex/{getline;print}'
>
> can be replaced by
>
> awk 'f{print;f=0}/regex/{f=1}'
>
> or more cryptic
>
> awk '!--f;/regex/{f=1}'

Is that always safe? I'd have thought you'd need

awk 'f&&!--f;/regex/{f=1}'

otherwise f will be decremented for every record and I expect at some huge
number will wrap around and become zero again producing an undesirable print.

Ed.

Kenny McCormack

Oct 2, 2008, 8:46:31 AM

In article <48E4BA30...@lsupcaemnt.com>,
Ed Morton <mor...@lsupcaemnt.com> wrote:
...

>> awk 'f{print;f=0}/regex/{f=1}'
>>
>> or more cryptic
>>
>> awk '!--f;/regex/{f=1}'
>
>Is that always safe? I'd have thought you'd need
>
> awk 'f&&!--f;/regex/{f=1}'
>
>otherwise f will be decremented for every record and I expect at some huge
>number will wrap around and become zero again producing an undesirable print.

Actually, no. In classical AWK, numbers are always floating point (this
is written in the almighty standards documents), so integer wraparound
should not occur. Observe:

% gawk 'BEGIN {x=2^31;printf("%30d\n",x);printf("%30d\n",++x)}'
2147483648
2147483649
%

Note that I said "classical AWK" above. TAWK does things a little
differently. Sadly, they don't seem to have implemented a pure integer
type, the way I would have liked them to. In TAWK, a number can be an
integer, but the largest possible integer is 2^31-1. After that, it
silently converts it to floating point. Although the details are a
little involved for a newsgroup post, the above program in TAWK will
just print 2147483647 over and over...

So, here also, I could not demonstrate integer wraparound.

Anyway, few of us will ever in our lifetimes process a file with more
than 2.1 billion records.

Janis Papanagnou

Oct 2, 2008, 8:54:38 AM

Frankly, I never processed text files with more than 2^9 lines[*] with awk,
so for my typical awk applications I am far from any "wrap around" limit.
But it's actually more of a "saturation" limit, I'd say, since calculation
will then be done in float arithmetic and the decrement of 1 won't change
the value any more

-2147483646
-2147483647
-2147483648
-2,14748e+09
-2,14748e+09
-2,14748e+09

So, AFAICT, the reduced expression I gave should be safe, but please CMIIW.

Beyond that, you're completely right. If there's a way - and especially if
as simple as your addition - to make it more robust, also just in case of
doubts, that's preferable.

Janis

[*] 2^31 is the bound of failure for my version of GNU awk, other versions
of awk may have lower limits, I suppose. Does anybody know for all awk's
which is the smallest supported integer range?

>
> Ed.
>

Janis Papanagnou

Oct 2, 2008, 8:57:42 AM

Janis Papanagnou wrote:
>
> Frankly, I never processed text files with more than 2^9 lines[*] with awk,

That should have been 10^9 lines.

> [*] 2^31 [...]

Janis

Peteris Krumins

Oct 9, 2008, 3:56:44 AM

On Sep 29, 7:04 am, Dave Aldridge <nospampl...@nowhere.invalid> wrote:

Hi again,

Just wanted to let you know that I have also started explaining the
sed1line.txt one-liners:
http://www.catonmat.net/blog/sed-one-liners-explained-part-one/

Peteris

William James

Oct 10, 2008, 12:40:58 PM

On Sep 29, 8:14 am, Ed Morton <mor...@lsupcaemnt.com> wrote:

> 1. Delete all fields up to field N, preserving input formatting.
>
> gawk --re-interval 'sub(/^[[:space:]]*([^[:space:]]*[[:space:]]*){1}/,"")'

ruby -pne "sub(/^\s*(\S+\s*){1}/,'')"

> 3. Extract the string that matches a RE.
> awk -v re="a|b" '
> function extract(str,regexp)
> { RMATCH = (match(str,regexp) ? substr(str,RSTART,RLENGTH) : "")
> return RSTART
> }
> extract($0,re) { print RMATCH }
> '

Ruby:

"foo,barring none hoo fee"[ /,.* hoo/ ]
==>",barring none hoo"
"foo,barring none hoo fee"[ /,(.*) hoo/, 1 ]
==>"barring none"

> 6. Writing changes back to the original file.
> gawk '
> function printout(_str) { _out[++_nr] = _str }
> function flushout( _i) { close(FILENAME);
> for (_i=1; _i<=_nr;_i++)
> print _out[_i] > FILENAME
> }
> { printout( NR " " $0 ) }
> END { flushout() }'

ruby -i.bak -ne 'BEGIN{$,=" "}; print $., $_' data1

> 7. Transposing rows to selected columns and sorting by key.
>
> Given the following input file:
> Number of executions = 437
> Number of compilations = 1
> Worst preparation time (ms) = 1
> Best preparation time (ms) = 1
> Rows deleted = 0
>
> Number of executions = 1
> Number of compilations = 1
> Worst preparation time (ms) = 4
> Best preparation time (ms) = 4
> Rows deleted = 0
>
> Number of executions = 29
> Number of compilations = 1
> Worst preparation time (ms) = 1
> Best preparation time (ms) = 1
> Rows deleted = 0
>
> To tranpose certain rows into columns and sort by one of the
> column, like the following which is sorted by "Number of executions":
>
> Number of executions Number of compilations Rows deleted
> 437 1 0
> 29 1 0
> 29 1 0

Wrong. Change one of those 29's to 1.

=end

FIELDS = 0, 1, 4
KEY = 0
records = gets(nil).split( /\n\s*\n/ ).
map{|s| s.split( /\s*\n\s*/ ).values_at( *FIELDS ) }
header = records[0].map{|s| s[ /(.*)=/, 1 ].strip }
records.each{|fields| fields.map!{|s| s[ /=(.*)/, 1 ].strip } }
widths = header.zip( records.transpose ).map{|a|
a.flatten.map{|s| s.size}.max }
fmt = widths.map{|n| "%-#{ n }s" }.join( " " )
records = records.sort_by{|a| a[ KEY ].to_i }
( [ header ] + records ).each{|a|
puts fmt % a
}

--- output ---

Number of executions Number of compilations Rows deleted
1                    1                      0
29                   1                      0
437                  1                      0

> 10. Convert the first letter of every word to upper case.
> awk 'BEGIN{RS="[[:space:]]";FS=OFS=""}{$1=toupper($1);ORS=RT}1'

ruby -ne "print split(/(\s)/).map{|s| s.capitalize}"

> To convert a string to an array indexed by each word:
>
> awk 'BEGIN{str="abc def";c=split(str,tmp);for (i=1;i<=c;i++)arr[tmp[i]]++;delete
> tmp;
> for (w in arr) print w}'

Hash[ * "abc def".split.map{|x| [ x, "" ] }.flatten ]

Dave Aldridge

Oct 10, 2008, 3:41:50 PM

I'm sure this is a worthy project, but not for someone like myself
who has stated that he wants to leave sed behind for awk, and not
very relevant on an awk newsgroup.

Sed gives me a headache, beyond the simple stuff. Awk, at least
is like a normal programming language. What I know about bash
is useful when approaching awk.

Please forgive the minor criticism. And don't try to involve me
in a debate about sed vs. awk. I'm not interested and won't
respond.

I wish you'd spend your time completing the awk1liners instead.

David

~
~
~

Edward Rosten

Oct 23, 2008, 8:22:44 AM

On Oct 2, 1:54 pm, Janis Papanagnou <janis_papanag...@hotmail.com>
wrote:

> [*] 2^31 is the bound of failure for my version of GNU awk, other versions
> of awk may have lower limits, I suppose. Does anybody know for all awk's
> which is the smallest supported integer range?

Really? My version uses doubles, so:

$ gawk 'BEGIN {x=2^52;printf("%30d\n",x);printf("%30d\n",++x)}'
4503599627370496
4503599627370497
$ gawk 'BEGIN {x=2^53;printf("%30d\n",x);printf("%30d\n",++x)}'
9007199254740992
9007199254740992

This is consistent with the size of the mantissa in IEEE 754 doubles.
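The same limit can be checked without printf, which also shows why a counter that is only ever decremented (as in '!--f') gets stuck rather than wrapping back to zero. This sketch assumes the awk in use stores numbers as IEEE 754 doubles, as described above; the function name double_limits is invented.

```shell
# Assumes awk numbers are IEEE 754 doubles (53-bit mantissa).
double_limits() {
    awk 'BEGIN {
        # Adding 1 is still exact at 2^52 but is rounded away at 2^53.
        print ((2^52 + 1 == 2^52) ? "2^52: saturated" : "2^52: exact")
        print ((2^53 + 1 == 2^53) ? "2^53: saturated" : "2^53: exact")
        # A decrement at -2^53 leaves the value unchanged: no wraparound.
        f = -(2^53); f--
        print ((f == -(2^53)) ? "decrement stuck" : "decrement moved")
    }'
}

double_limits
# Under the stated assumption this prints:
#   2^52: exact
#   2^53: saturated
#   decrement stuck
```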

-Ed