
Using awk to count the total number of lines in a huge file is less efficient than wc.


Hongyi Zhao

Nov 3, 2016, 6:08:21 AM
Hi all,

See the following:

$ time awk 'END{print NR }' up-down-log
511544

real 0m0.116s
user 0m0.112s
sys 0m0.000s
$ time wc -l < up-down-log
511544

real 0m0.022s
user 0m0.016s
sys 0m0.004s

Is it possible to do this with awk as efficiently as wc?

Regards
--
.: Hongyi Zhao [ hongyi.zhao AT gmail.com ] Free as in Freedom :.

Luuk

Nov 3, 2016, 12:59:38 PM
On 03-11-16 11:08, Hongyi Zhao wrote:
> Hi all,
>
> See the following:
>
> $ time awk 'END{print NR }' up-down-log
> 511544
>
> real 0m0.116s
> user 0m0.112s
> sys 0m0.000s
> $ time wc -l < up-down-log
> 511544
>
> real 0m0.022s
> user 0m0.016s
> sys 0m0.004s
>
> Is it possible to do this with awk as efficiently as wc?
>
> Regards
>

Yes,
luuk@opensuse:~/tmp> time awk 'BEGIN{ system("wc -l big")}'
511544 big

real 0m0.021s
user 0m0.012s
sys 0m0.004s
luuk@opensuse:~/tmp> time wc -l big
511544 big

real 0m0.015s
user 0m0.008s
sys 0m0.004s
luuk@opensuse:~/tmp> time awk 'END{ print NR }' big
511544

real 0m0.105s
user 0m0.100s
sys 0m0.004s
luuk@opensuse:~/tmp>

;)
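For what it's worth, the same trick can capture wc's count in an awk variable instead of just echoing it, using awk's standard command-pipe getline (the file name below is invented for the demo):

```shell
# Build a tiny demo file, then let awk read the output of "wc -l" via a pipe.
printf 'a\nb\nc\n' > /tmp/demo.txt
awk 'BEGIN {
    cmd = "wc -l < /tmp/demo.txt"
    cmd | getline n      # n receives the line count printed by wc
    close(cmd)
    print n + 0          # n+0 strips any leading padding some wc builds emit
}'
```

This prints 3, and of course it is still wc doing the counting, not awk.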

Hongyi Zhao

Nov 3, 2016, 9:45:22 PM
On Thu, 03 Nov 2016 17:59:37 +0100, Luuk wrote:

> luuk@opensuse:~/tmp> time awk 'BEGIN{ system("wc -l big")}'
> 511544 big
>
> real 0m0.021s
> user 0m0.012s
> sys 0m0.004s

Haha ...

Ed Morton

Nov 4, 2016, 2:05:44 AM
On 11/3/2016 5:08 AM, Hongyi Zhao wrote:
> Hi all,
>
> See the following:
>
> $ time awk 'END{print NR }' up-down-log
> 511544
>
> real 0m0.116s
> user 0m0.112s
> sys 0m0.000s
> $ time wc -l < up-down-log
> 511544
>
> real 0m0.022s
> user 0m0.016s
> sys 0m0.004s
>
> Is it possible to do this with awk as efficiently as wc?
>
> Regards
>

When wc -l reads a line from a file it increments a count. When awk reads a line
from a file it splits that line into regexp-separated fields, increments the NR
variable, populates NF, $0, and $1 through $NF, and compares each line to any
given conditions. So, why would you think awk's performance for printing the
number of lines in a file should be comparable to wc's?

Ed.
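One way to see the gap Ed describes without a half-million-line log is a quick sketch on a generated file (the file name and size here are arbitrary, not from the thread):

```shell
# Throwaway 100k-line test file; name and size chosen arbitrarily for the demo.
seq 100000 > /tmp/nr-demo.txt

# awk assembles a record for every line before END fires ...
time awk 'END{print NR}' /tmp/nr-demo.txt

# ... while wc -l only has to count newline bytes.
time wc -l < /tmp/nr-demo.txt
```

Both commands print 100000; the timings, like those in the thread, will show wc -l well ahead.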

Hongyi Zhao

Nov 4, 2016, 10:44:04 AM
On Fri, 04 Nov 2016 01:05:35 -0500, Ed Morton wrote:

> When wc -l reads a line from a file it increments a count. When awk
> reads a line from a file it splits that line into regexp-separated
> fields, increments the NR variable, populates NF, $0, and $1 through
> $NF, and compares each line to any given conditions. So, why would you
> think awks performance for printing the number of lines in a file should
> be comparable to wcs?

I had overlooked all that per-line work done by awk; thanks for the notes.

Regards
>
> Ed.

Andrew Schorr

Nov 5, 2016, 10:54:31 AM
On Friday, November 4, 2016 at 2:05:44 AM UTC-4, Ed Morton wrote:
> When wc -l reads a line from a file it increments a count. When awk reads a line
> from a file it splits that line into regexp-separated fields, increments the NR
> variable, populates NF, $0, and $1 through $NF, and compares each line to any
> given conditions. So, why would you think awks performance for printing the
> number of lines in a file should be comparable to wcs?

Actually, this is not quite true. I can speak only for gawk, since I haven't studied the internals of other versions, but gawk parses input records lazily. That means it splits the record into fields up to field i only when you access $i or NF. So if you never access any field other than $0, it doesn't need to split the input record into fields at all. You can see this clearly with some simple time commands. For example:

bash-4.3$ time gawk 'END {print NR}' /usr/share/dict/words
479828

real 0m0.058s
user 0m0.057s
sys 0m0.001s
bash-4.3$ time gawk '{s += NF} END {print NR}' /usr/share/dict/words
479828

real 0m0.091s
user 0m0.091s
sys 0m0.001s
bash-4.3$ time gawk '{s += $NF} END {print NR}' /usr/share/dict/words
479828

real 0m0.116s
user 0m0.115s
sys 0m0.001s

The last time is a bit slower than it should be, since currently released gawk does some unnecessary copying when a $i field is accessed. We are hoping to fix that in the next major release. In the development version, it's better:

bash-4.3$ time ./gawk '{s += $NF} END {print NR}' /usr/share/dict/words
479828

real 0m0.098s
user 0m0.096s
sys 0m0.002s

Please keep in mind that "wc -l" has a very simple job. It simply needs to count the number of times that "\n" appears in the file. It can read the file character-by-character and increment a counter when \n appears. It has no need to load each line into a record buffer or count the words. Note that without "-l", wc is much slower:

bash-4.3$ time wc -l /usr/share/dict/words
479828 /usr/share/dict/words

real 0m0.008s
user 0m0.003s
sys 0m0.001s
bash-4.3$ time wc /usr/share/dict/words
479828 479828 4953680 /usr/share/dict/words

real 0m0.067s
user 0m0.066s
sys 0m0.001s

Regards,
Andy
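As a rough model of the "very simple job" Andy describes, here is a pure-shell sketch that counts lines the same way wc -l counts "\n" bytes (the function name and sample file are invented for the illustration; the real wc is of course a tight loop in C over a read buffer, not a shell loop):

```shell
# A pure-shell model of the newline-counting loop: each successful read
# consumes exactly one '\n', so the counter matches wc -l on terminated lines.
count_newlines() {
    n=0
    while IFS= read -r _line; do
        n=$((n + 1))
    done < "$1"
    printf '%s\n' "$n"
}

printf 'a\nb\nc\n' > /tmp/wc-demo.txt
count_newlines /tmp/wc-demo.txt   # prints 3, agreeing with wc -l
```

No record buffer, no field splitting, no word counting: that is the whole reason wc -l is hard to beat.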

Marc de Bourget

Nov 12, 2016, 4:44:27 PM
On Saturday, November 5, 2016 at 3:54:31 PM UTC+1, Andrew Schorr wrote:
> On Friday, November 4, 2016 at 2:05:44 AM UTC-4, Ed Morton wrote:
> > When wc -l reads a line from a file it increments a count. When awk reads a line
> > from a file it splits that line into regexp-separated fields, increments the NR
> > variable, populates NF, $0, and $1 through $NF, and compares each line to any
> > given conditions. So, why would you think awk's performance for printing the
> > number of lines in a file should be comparable to wc's?
>
> Actually, this is not quite true. I can speak only for gawk, since I haven't studied the internals of other versions, but gawk parses the input records on a lazy basis. That means that it parses the input record up to field n only when you access $i or NF. So if you never access any of the $i fields other than $0, it doesn't need to split the input record into fields.

The same is true of TAWK; from the manual, page 70:
"Fields are implemented efficiently: TAWK does not actually do
the work of finding the fields in the record until you use them."