Ben Hoyt <ben...@gmail.com> writes:
>Thanks for the tips, Anton and Gerry. EXECUTE-PARSING-FILE is interesting. And you reminded me about >BODY (presumably a bit faster than execute here).
It's faster, but probably does not make a significant difference in
this case.
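For onlookers: EXECUTEing a CREATEd word pushes its body address at
run-time, whereas >BODY computes that address directly from the xt.
A minimal sketch (COUNTER is a made-up name):

create counter 1 ,            \ a CREATEd word holding a count in its body
' counter execute 1 swap +!   \ executing the word pushes the body address
' counter >body 1 swap +!     \ >BODY yields the same address without a call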
>My test input file is
>https://github.com/benhoyt/countwords/blob/master/kjvbible.txt, but I'm
>concatenating it to itself x10 to get the actual file I'm testing (so
>it's 43MB instead of 4MB). Like so:
>
>cat kjvbible.txt kjvbible.txt kjvbible.txt kjvbible.txt kjvbible.txt kjvbible.txt kjvbible.txt kjvbible.txt kjvbible.txt kjvbible.txt >kjvbible_x10.txt
Thanks. With that input, I see the following on a Skylake (Core
i5-6600K), invoking it with:

for i in \
    true \
    "python3 count-unique.py" \
    "$HOME/gforth/gforth-fast count-unique-hoyt.fs" \
    "$HOME/gforth/gforth-fast count-unique.fs|tr [:upper:] [:lower:]|sort -nrk2" \
    "$HOME/gforth/gforth-fast count-unique.fs"
do
    LC_NUMERIC=en_US.utf8 perf stat -B -e cycles:u -e instructions:u \
        sh -c "$i" <count-unique.in >/dev/null
done
        cycles     instructions
       300,856          256,229  true  # measure "sh -c" overhead
 8,444,928,819   21,491,024,474  python3 count-unique.py
 8,020,113,329   19,790,232,716  $HOME/gforth/gforth-fast count-unique-hoyt.fs
 5,129,364,575   12,035,209,874  .../gforth-fast count-unique.fs|tr ...|sort ...
 4,954,627,149   11,637,863,826  .../gforth-fast count-unique.fs
count-unique.py is from <57548ac6-9f34-480d...@googlegroups.com>.
count-unique-hoyt.fs is from
<https://gist.githubusercontent.com/benhoyt/770a6ce13e307b8cd2e7f1e7c4731c70/raw/3954359b59de674fa6af81bd7a7a7dc4a9336f34/countwords.fs>.
count-unique.fs is from <2021Mar1...@mips.complang.tuwien.ac.at>.
So Python 3.4.2 (as delivered with Debian 8) is slower than your
program running on a development gforth-fast (built with
BUILD-FROM-SCRATCH on Debian 8). My program is quite a bit faster
still, but I don't really know why. The last two lines show that most
of the time of my program is spent in the Forth part, not in the
tr|sort pipeline.
Let's dissect your program:
        cycles    instructions
 7,309,340,807  18,098,847,497  hoyt without show-words
 2,186,413,395   5,530,766,295  hoyt without count-word show-words
 7,297,503,704  18,031,073,597  hoyt without show-words, with my count-word
 1,360,613,453   3,006,082,187  ertl without count-word show-words
The second result is surprising, because my program uses a very
similar count-word and calls it just as often (I checked that, just to
be sure) with the same arguments. So it is surprising that the
difference from disabling count-word is more than the whole run-time
of my program. And indeed, using my count-word instead (third result)
provides little speedup. So I looked at what my program costs without
count-word (fourth result): it is only about 3.6Gc faster without it,
whereas your program is >5Gc faster without. I have no explanation
for why the same count-word calls cost more in your program than in
mine; and it is not just a microarchitectural issue (cache misses,
branch mispredictions): you also see the difference in the number of
executed instructions (8.6G spent on count-word in my program, 12.5G
in yours).
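In case anyone wants to reproduce the "without" variants: one way (a
sketch; any equivalent stub works) is simply to replace the definition
in question with a no-op, along the lines of:

: count-word ( c-addr u -- ) 2drop ;  \ stub: discard the word, count nothing
: show-words ( -- ) ;                 \ stub, assuming SHOW-WORDS has stack effect ( -- )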
I also found that both programs can reduce the cycle count by roughly
250Mc by inserting

15 :noname to hashbits hashdouble ; execute

at the start. This resizes the hash table to 64K buckets right away
rather than letting it double its way up to 32K over time; with
roughly 32K distinct words in the input, the larger table keeps the
bucket chains short.
I then tried to fine-tune COUNT-WORD a little more:
: count-word ( c-addr u -- )
    2dup counts find-name-in dup if
        ( name>interpret ) >body 1 swap +! 2drop
    else
        drop nextname create 1 ,
        \ 1 num-uniques +!
    then ;
        cycles    instructions
 4,686,394,575  11,569,382,417  old COUNT-WORD
 4,348,734,738  10,431,623,700  use FIND-NAME-IN instead of SEARCH-WORDLIST
 4,383,716,571  11,018,557,131  use NEXTNAME instead of EXECUTE-PARSING
 4,009,596,095   9,880,712,988  use both
It's interesting that these changes had such big effects:
SEARCH-WORDLIST is a relatively thin wrapper around FIND-NAME-IN, and
the ELSE part runs only 32,000 times out of 8M calls to count-word.
Applying the same changes to your program, together with the
hash-table improvement, results in:

        cycles    instructions
 6,993,463,336  17,962,483,836  hoyt with 64K buckets and faster COUNT-WORD
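For those who have not used these Gforth words: NEXTNAME supplies the
name for the next defining word directly from a string, whereas
EXECUTE-PARSING sets up a temporary input source from which the
defining word then parses the name. A small sketch (FOO is a made-up
name; either line defines FOO holding a 1):

s" foo" nextname create 1 ,           \ name taken straight from the string
s" foo" ' create execute-parsing 1 ,  \ CREATE parses the name from a temporary input source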
Let's see if we can improve the parsing (for my program first):

: process-string1 ( c-addr u -- )
    get-current >r counts set-current
    begin
        2dup (parse-white) dup while ( c-addr u w-addr w-u )
        2dup count-word
        + -rot + over -  \ advance the string past the word just counted
    repeat
    2drop 2drop r> set-current ;

stdin slurp-fid process-string1 show-words
(PARSE-WHITE) is an internal factor of PARSE-NAME and WORD.
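Judging from its use above, its stack effect is
( c-addr1 u1 -- c-addr2 u2 ): skip leading white space in the string
c-addr1 u1 and return the next white-space-delimited word, with u2=0
when the string is exhausted. So presumably:

s" hello world" (parse-white) type  \ should print "hello"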
Strangely, when I do that, the total cycle and instruction counts go
up, but the cycles and instructions spent on parsing alone (measured
without COUNT-WORD) go down:

        cycles    instructions
 4,658,226,020  11,935,361,026  with PROCESS-STRING1
   617,477,329   1,063,092,830  with PROCESS-STRING1, without COUNT-WORD
Using PROCESS-STRING1 as PROCESS-LINE in your program reduces the
instruction count slightly, but increases the cycle count slightly.
It seems that SCAN-WORD works nicely as it is; there is no need to
descend into (PARSE-WHITE).
In any case, the fastest version is twice as fast as Python 3.4.2.