On Friday, November 4, 2016 at 2:05:44 AM UTC-4, Ed Morton wrote:
> When wc -l reads a line from a file it increments a count. When awk reads a line
> from a file it splits that line into regexp-separated fields, increments the NR
> variable, populates NF, $0, and $1 through $NF, and compares each line to any
> given conditions. So, why would you think awk's performance for printing the
> number of lines in a file should be comparable to wc's?
Actually, this is not quite true. I can speak only for gawk, since I haven't studied the internals of other versions, but gawk parses input records lazily. It splits a record into fields only when you access some $i or NF: accessing $n parses the record only up to field n, while accessing NF requires splitting the whole record. So if you never access NF or any field other than $0, gawk never needs to split the input record into fields at all. You can see this clearly with some simple time commands. For example:
bash-4.3$ time gawk 'END {print NR}' /usr/share/dict/words
479828
real 0m0.058s
user 0m0.057s
sys 0m0.001s
bash-4.3$ time gawk '{s += NF} END {print NR}' /usr/share/dict/words
479828
real 0m0.091s
user 0m0.091s
sys 0m0.001s
bash-4.3$ time gawk '{s += $NF} END {print NR}' /usr/share/dict/words
479828
real 0m0.116s
user 0m0.115s
sys 0m0.001s
The last time is a bit slower than it should be, since the currently released gawk does some unnecessary copying when a $i field is accessed. We are hoping to fix that in the next major release. In the development version, it's already better:
bash-4.3$ time ./gawk '{s += $NF} END {print NR}' /usr/share/dict/words
479828
real 0m0.098s
user 0m0.096s
sys 0m0.002s
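To make the lazy splitting idea concrete, here is a minimal Python sketch. It is only a toy model, not gawk's actual code: the LazyRecord class, its method names, and the whitespace-only field splitting are all assumptions made for illustration.

```python
class LazyRecord:
    """Toy model of a record whose fields are split only on demand.

    Hypothetical illustration; this is not how gawk is written.
    """

    def __init__(self, text):
        self.text = text        # plays the role of $0
        self._fields = []       # fields split off so far
        self._pos = 0           # scan position within text
        self._done = False      # True once the whole record has been split

    def _parse_next(self):
        """Split off one more whitespace-separated field, if any remain."""
        n = len(self.text)
        i = self._pos
        while i < n and self.text[i].isspace():
            i += 1              # skip the separator before the field
        if i >= n:
            self._pos = n
            self._done = True   # nothing left to split
            return
        j = i
        while j < n and not self.text[j].isspace():
            j += 1              # scan to the end of the field
        self._fields.append(self.text[i:j])
        self._pos = j

    @property
    def parsed(self):
        """How many fields have actually been split so far."""
        return len(self._fields)

    def field(self, n):
        """Return $n, splitting the record only as far as field n."""
        if n == 0:
            return self.text    # $0 needs no splitting at all
        while len(self._fields) < n and not self._done:
            self._parse_next()
        return self._fields[n - 1] if n <= len(self._fields) else ""

    @property
    def nf(self):
        """Return NF; this forces the whole record to be split."""
        while not self._done:
            self._parse_next()
        return len(self._fields)


rec = LazyRecord("alpha beta gamma delta")
print(rec.field(0), rec.parsed)   # whole record, no fields split yet
print(rec.field(2), rec.parsed)   # only two fields split
print(rec.nf, rec.parsed)         # asking for NF splits everything
```

In this model, a program like 'END {print NR}' would never call field() with n > 0 or touch nf, so no splitting work would be done, which is the effect the timings above suggest.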
Please keep in mind that "wc -l" has a very simple job. It simply needs to count the number of times that "\n" appears in the file. It can read the file character-by-character and increment a counter when "\n" appears. It has no need to load each line into a record buffer or count the words. Note that without "-l", when wc also has to count words, it is much slower:
bash-4.3$ time wc -l /usr/share/dict/words
479828 /usr/share/dict/words
real 0m0.008s
user 0m0.003s
sys 0m0.001s
bash-4.3$ time wc /usr/share/dict/words
479828 479828 4953680 /usr/share/dict/words
real 0m0.067s
user 0m0.066s
sys 0m0.001s
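The difference between the two jobs can be sketched in Python. This is an illustration under assumptions, not how any real wc is implemented: the function names, the chunk size, and the ASCII-only notion of whitespace are all made up for the example.

```python
import io

ASCII_SPACE = b" \t\n\r\v\f"  # assumption: only ASCII whitespace separates words


def count_lines(stream, chunk_size=65536):
    """Count newline bytes only, which is all that line counting requires."""
    lines = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        lines += chunk.count(b"\n")   # no per-line buffer, no field splitting
    return lines


def count_lines_words_bytes(stream, chunk_size=65536):
    """Count lines, words, and bytes; words need per-byte state tracking."""
    lines = words = nbytes = 0
    in_word = False                   # carried across chunk boundaries
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        nbytes += len(chunk)
        lines += chunk.count(b"\n")
        for byte in chunk:            # every byte must be classified
            if byte in ASCII_SPACE:
                in_word = False
            elif not in_word:
                in_word = True
                words += 1
    return lines, words, nbytes


data = b"apple\nbanana split\ncherry"
print(count_lines(io.BytesIO(data)))              # 2
print(count_lines_words_bytes(io.BytesIO(data)))  # (2, 4, 25)
```

Note that the final "cherry" has no trailing newline, so it adds a word but not a line, which matches counting "\n" occurrences rather than records.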
Regards,
Andy