Ben Hoyt <ben...@gmail.com> writes:
>Thanks for the tips, Anton and Gerry. EXECUTE-PARSING-FILE is interesting. And you reminded me about >BODY (presumably a bit faster than execute here).
It's faster, but probably does not make a significant difference in
this case.
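For onlookers: EXECUTEing a CREATEd word pushes its body address at
run-time, whereas >BODY computes that address directly from the xt.
A minimal sketch (COUNTER is a made-up name):

create counter 1 ,            \ a CREATEd word holding a count in its body
' counter execute 1 swap +!   \ executing the word pushes the body address
' counter >body 1 swap +!     \ >BODY yields the same address without a call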
>My test input file is
>https://github.com/benhoyt/countwords/blob/master/kjvbible.txt, but I'm
>concatenating it to itself x10 to get the actual file I'm testing (so
>it's 43MB instead of 4MB). Like so:
>
>cat kjvbible.txt kjvbible.txt kjvbible.txt kjvbible.txt kjvbible.txt kjvbible.txt kjvbible.txt kjvbible.txt kjvbible.txt kjvbible.txt >kjvbible_x10.txt
Thanks. With that input, I see the following on a Skylake (Core
i5-6600K), invoking it with:

for i in \
    true \
    "python3 count-unique.py" \
    "$HOME/gforth/gforth-fast count-unique-hoyt.fs" \
    "$HOME/gforth/gforth-fast count-unique.fs|tr [:upper:] [:lower:]|sort -nrk2" \
    "$HOME/gforth/gforth-fast count-unique.fs"
do
    LC_NUMERIC=en_US.utf8 perf stat -B -e cycles:u -e instructions:u \
        sh -c "$i" <count-unique.in >/dev/null
done
        cycles     instructions
       300,856          256,229  true  # measure "sh -c" overhead
 8,444,928,819   21,491,024,474  python3 count-unique.py
 8,020,113,329   19,790,232,716  $HOME/gforth/gforth-fast count-unique-hoyt.fs
 5,129,364,575   12,035,209,874  .../gforth-fast count-unique.fs|tr ...|sort ...
 4,954,627,149   11,637,863,826  .../gforth-fast count-unique.fs
count-unique.py is from <57548ac6-9f34-480d...@googlegroups.com>.
count-unique-hoyt.fs is from
<https://gist.githubusercontent.com/benhoyt/770a6ce13e307b8cd2e7f1e7c4731c70/raw/3954359b59de674fa6af81bd7a7a7dc4a9336f34/countwords.fs>.
count-unique.fs is from <2021Mar1...@mips.complang.tuwien.ac.at>.
So Python 3.4.2 (as delivered with Debian 8) is slower than your
program running on a development gforth-fast (built with
BUILD-FROM-SCRATCH on Debian 8). My program is quite a bit faster
still, but I don't really know why. The last two lines show that most
of the time of my program is spent in the Forth part, not in the
tr|sort pipeline.
Let's dissect your program:
        cycles    instructions
 7,309,340,807  18,098,847,497  hoyt without show-words
 2,186,413,395   5,530,766,295  hoyt without count-word show-words
 7,297,503,704  18,031,073,597  hoyt without show-words, with my count-word
 1,360,613,453   3,006,082,187  ertl without count-word show-words
The second result is surprising, because my program uses a very
similar count-word and calls it just as often (I checked that, just to
be sure) with the same arguments. So it is surprising that the
difference from disabling count-word is more than the whole run-time
of my program. And indeed, using my count-word instead (third result)
provides little speedup. So I looked at what my program costs without
count-word (fourth result): it is only about 3.6Gc faster without it,
whereas your program is >5Gc faster without. I have no explanation
for why the same count-word calls cost more in your program than in
mine; and it is not just a microarchitectural issue (cache misses,
branch mispredictions): you also see the difference in the number of
executed instructions (8.6G spent on count-word in my program, 12.5G
in yours).
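In case anyone wants to reproduce the "without" variants: one way (a
sketch; any equivalent stub works) is simply to replace the definition
in question with a no-op, along the lines of:

: count-word ( c-addr u -- ) 2drop ;  \ stub: discard the word, count nothing
: show-words ( -- ) ;                 \ stub, assuming SHOW-WORDS has stack effect ( -- )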
I also found that both programs can reduce the cycle count by roughly
250Mc by inserting

15 :noname to hashbits hashdouble ; execute

at the start. This resizes the hash table to 64K buckets right away
rather than letting it double its way up to 32K over time; with
roughly 32K distinct words in the input, the larger table keeps the
bucket chains short.
I then tried to fine-tune COUNT-WORD a little more:
: count-word ( c-addr u -- )
    2dup counts find-name-in dup if
        ( name>interpret ) >body 1 swap +! 2drop
    else
        drop nextname create 1 ,
        \ 1 num-uniques +!
    then ;
        cycles    instructions
 4,686,394,575  11,569,382,417  old COUNT-WORD
 4,348,734,738  10,431,623,700  use FIND-NAME-IN instead of SEARCH-WORDLIST
 4,383,716,571  11,018,557,131  use NEXTNAME instead of EXECUTE-PARSING
 4,009,596,095   9,880,712,988  use both
It's interesting that these changes had such big effects:
SEARCH-WORDLIST is a relatively thin wrapper around FIND-NAME-IN, and
the ELSE part runs only 32,000 times out of 8M calls to count-word.
Applying the same changes to your program, together with the
hash-table improvement, results in:

        cycles    instructions
 6,993,463,336  17,962,483,836  hoyt with 64K buckets and faster COUNT-WORD
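For those who have not used these Gforth words: NEXTNAME supplies the
name for the next defining word directly from a string, whereas
EXECUTE-PARSING sets up a temporary input source from which the
defining word then parses the name. A small sketch (FOO is a made-up
name; either line defines FOO holding a 1):

s" foo" nextname create 1 ,           \ name taken straight from the string
s" foo" ' create execute-parsing 1 ,  \ CREATE parses the name from a temporary input source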
Let's see if we can improve the parsing (for my program first):

: process-string1 ( c-addr u -- )
    get-current >r counts set-current
    begin
        2dup (parse-white) dup while ( c-addr u w-addr w-u )
        2dup count-word
        + -rot + over -  \ advance the string past the word just counted
    repeat
    2drop 2drop r> set-current ;

stdin slurp-fid process-string1 show-words
(PARSE-WHITE) is an internal factor of PARSE-NAME and WORD.
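Judging from its use above, its stack effect is
( c-addr1 u1 -- c-addr2 u2 ): skip leading white space in the string
c-addr1 u1 and return the next white-space-delimited word, with u2=0
when the string is exhausted. So presumably:

s" hello world" (parse-white) type  \ should print "hello"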
Strangely, when I do that, the total cycle and instruction counts go
up, but the cycles and instructions spent on parsing alone (measured
without COUNT-WORD) go down:

        cycles    instructions
 4,658,226,020  11,935,361,026  with PROCESS-STRING1
   617,477,329   1,063,092,830  with PROCESS-STRING1, without COUNT-WORD
Using PROCESS-STRING1 as PROCESS-LINE in your program reduces the
instruction count slightly, but increases the cycle count slightly.
It seems that SCAN-WORD works nicely as it is; there is no need to
descend into (PARSE-WHITE).
In any case, the fastest version is twice as fast as Python 3.4.2.