Counting frequencies of unique words


Ben Hoyt

Mar 11, 2021, 10:01:36 PM
I'm writing a comparison of a simple program (print the counts of unique words in a text file, most frequent first) in various languages. Forth was my first programming language, and for old times' sake I thought I'd include a Forth version in my article. I haven't written Forth in years and years, so I'd love some code review / suggestions here.

I want to keep it "mostly standard" but I'm fine using a couple of Gforth-specific features if needed.

Most other language implementations read standard input line at a time, split into whitespace-separated words, insert the words into a hash table (or increment the count / hash value if it already exists). Then when printing out, convert to an array to sort by count descending. For reference, in Python that might look like this (though it's even easier if you use collections.Counter):

---------------
import sys

counts = {}
for line in sys.stdin:
    words = line.lower().split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1

ordered = sorted(counts.items(), key=lambda wc: wc[1], reverse=True)
for word, count in ordered:
    print(word, count)
---------------
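The collections.Counter variant mentioned above would be a few lines shorter; a rough sketch (note this one slurps all of stdin at once rather than reading line at a time):

```python
import collections
import sys

def count_words(text):
    # Count whitespace-separated words, lowercased
    return collections.Counter(text.lower().split())

def main():
    # most_common() already sorts by count, descending
    for word, count in count_words(sys.stdin.read()).most_common():
        print(word, count)

if __name__ == "__main__":
    main()
```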

So for this input:

The foozle bar the
bar bar bar bar

It would print out:

bar 5
the 2
foozle 1

Here's my attempt at a Forth version:

---------------
wordlist constant counts  \ Hash table of words to count variable
create uniques 65536 cells allot  \ Array of unique words
variable uoffset  0 uoffset !

\ Allocate space for new string and copy bytes, return new string.
: copy-string ( addr u -- addr' u )
  dup >r dup allocate throw
  dup >r swap move r> r> ;

\ Convert character to lowercase.
: to-lower ( C -- c )
  dup [char] A [ char Z 1+ ] literal within if
    32 +
  then ;

\ Convert string to lowercase in place.
: lower-in-place ( addr u -- )
  over + swap do
    i c@ to-lower i c!
  loop ;

\ Count given word in hash table.
: count-word ( addr u -- )
  2dup counts search-wordlist if
    \ Increment existing word
    execute 1 swap +!
    2drop
  else
    \ Insert new (copied) word and add to uniques array
    copy-string
    2dup lower-in-place
    2dup uniques uoffset @ cells + 2!
    2 uoffset +!
    ['] create execute-parsing 1 ,
  then ;

\ Process a line by splitting into words.
: process-line ( -- )
  begin
    parse-name dup
  while
    count-word
  repeat
  2drop ;

\ Show "word count" line for each word (unsorted).
: show-words ( -- )
  uoffset @ 0= if
    exit
  then
  uoffset @ 0 do
    uniques i cells + 2@
    2dup type
    counts search-wordlist 0= if
      type ." not found!" cr exit \ Shouldn't happen
    then
    execute @ space . cr
  2 +loop ;

: main ( -- )
  counts set-current
  begin
    refill
  while
    process-line
  repeat
  show-words
  bye ;

main
The foozle bar the
bar bar bar bar
---------------

Some comments / questions:

1) I don't know the best way to read the input. Here I'm cheating, by appending the input to the source file! (After "main".) I really want it to read STDIN. I know you can use STDIN READ-LINE, but then how would I parse that into words with the Forth word parsing functions like PARSE-NAME? (Which seems much better and more efficient than rolling our own parsing.) Is there a way to READ-LINE and then set it as the parsing input?
2) I'm using a WORDLIST as a hash table, which also (for this use case) has the nice property of being case-insensitive. I'm still converting to lowercase so the output is consistent with the other language versions. It's a bit annoying that Forth doesn't have a built-in to-lower function, not even for characters.
3) I notice there's a TRAVERSE-WORDLIST in the latest standard (and latest Gforth) -- this is nice, but it's not in my version of Gforth on Ubuntu. So I have to keep a separate append-only UNIQUES array of the "addr u" pairs and UOFFSET to index into it.
4) There are no built-in sorting functions in Gforth (or standard Forth) by the looks, so I'm currently not sorting. Any pointers to a good merge sort implementation or similar (with a key/comparison function)?
5) Any other comments or feedback here?
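To illustrate question 4, here's the kind of key-based merge sort I'm after, sketched in Python rather than Forth (the merge_sort helper and its key convention are just illustrative):

```python
def merge_sort(items, key=lambda x: x):
    """Stable merge sort taking a key function, like sorted()."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = merge_sort(items[:mid], key)
    right = merge_sort(items[mid:], key)
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        # Take from the right half only when strictly smaller, so
        # equal keys keep their original order (stability)
        if key(right[j]) < key(left[i]):
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])
            i += 1
    return merged + left[i:] + right[j:]
```

Sorting (word, count) pairs by count descending would then be merge_sort(pairs, key=lambda wc: -wc[1]).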

Thanks a lot!

-Ben

dxforth

Mar 11, 2021, 11:41:01 PM
On 12/03/2021 14:01, Ben Hoyt wrote:
>
> 1) I don't know the best way to read the input. Here I'm cheating, by appending the
> input to the source file! (After "main".) I really want it to read STDIN. I know you
> can use STDIN READ-LINE, but then how would I parse that into words with the Forth
> word parsing functions like PARSE-NAME? (Which seems much better and more efficient
> than rolling our own parsing.) Is there a way to READ-LINE and then set it as the
> parsing input?

DIY parser isn't too hard and will be 'more standard' than other methods.

\ Most forths have SKIP SCAN

\ Split string at character leaving first on top
: SPLIT ( a u c -- a2 u2 a3 u3 )
  >r 2dup r> scan 2swap 2 pick - ;

\ Parsing1: ARGC /ARG !ARG ARGV
create ARGC ( -- a ) 3 cells allot

\ Reset parser to beginning
: /ARG ( -- ) 0 argc ! ;

\ Assign string for parsing
: !ARG ( a u -- ) argc cell+ 2! /arg ;  here 0 !arg

\ Parse next blank delimited argument
: ARGV ( -- a u -1 | 0 )
  1 argc +!  argc cell+ 2@  0 0
  argc @ 0 ?do
    2drop bl skip bl split
  loop 2nip
  dup if -1 exit then and ;
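For comparison, the same SKIP/SCAN-style parse loop sketched in Python (the function names here are illustrative, not standard words):

```python
def skip(s, i, c=' '):
    # Like SKIP: advance past leading delimiter characters
    while i < len(s) and s[i] == c:
        i += 1
    return i

def scan(s, i, c=' '):
    # Like SCAN: advance to the next delimiter character
    while i < len(s) and s[i] != c:
        i += 1
    return i

def argv(s):
    # Repeated BL SKIP / BL SCAN over the whole string
    args, i = [], 0
    while True:
        i = skip(s, i)
        if i == len(s):
            return args
        j = scan(s, i)
        args.append(s[i:j])
        i = j
```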

> 2) ... It's a bit annoying that Forth doesn't have a built-in to-lower function, not
> even for characters.

Or 'upper'. The powers-that-be suggest rolling your own on that one too.

none albert

Mar 12, 2021, 12:25:03 AM
In article <57548ac6-9f34-480d...@googlegroups.com>,
<SNIP> (code)
>
>Some comments / questions:
>
>1) I don't know the best way to read the input. Here I'm cheating, by appending the input to the source file! (After "main".) I really want it to read STDIN. I know you can
>use STDIN READ-LINE, but then how would I parse that into words with the Forth word parsing functions like PARSE-NAME? (Which seems much better and more efficient than
>rolling our own parsing.) Is there a way to READ-LINE and then set it as the parsing input?

At least in ciforth this is standard idiom:
"AAP" GET-FILE ( "slurp")
BEGIN OVER WHILE ^J $/ process-line REPEAT 2DROP
process-line finds addr,len on the stack and is supposed to consume it.
Aguilar also has acknowledged the need for this kind of processing.

>2) I'm using a WORDLIST as a hash table, which also (for this use case) has the nice property of being case-insensitive. I'm still converting to lowercase so the output is
>consistent with the other language versions. It's a bit annoying that Forth doesn't have a built-in to-lower function, not even for characters.

Your Forth probably has one. Problem is, Forth has no libraries. So as soon
as this word makes it in the standard (an uphill battle) it becomes one of those
kitchen sinks that drags a fat Forth down. I tried to promote my WHAT mechanism
to put those words in a source library but it didn't catch on in 20 years so
it probably never will.

>3) I notice there's a TRAVERSE-WORDLIST in the latest standard (and latest Gforth) -- this is nice, but it's not in my version of Gforth on Ubuntu. So I have to keep a
>separate append-only UNIQUES array of the "addr u" pairs and UOFFSET to index into it.
>4) There are no built-in sorting functions in Gforth (or standard Forth) by the looks, so I'm currently not sorting. Any pointers to a good merge sort implementation or
>similar (with a key/comparison function)?
>5) Any other comments or feedback here?

I see that you are programming in Forth. Good.

>
>Thanks a lot!
>
>-Ben

Groetjes Albert
--
"in our communism country Viet Nam, people are forced to be
alive and in the western country like US, people are free to
die from Covid 19 lol" duc ha
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

Ron AARON

Mar 12, 2021, 2:06:19 AM
On 12/03/2021 5:01, Ben Hoyt wrote:
> I'm writing a comparison of a simple program (print the counts of unique words in a text file, most frequent first) in various languages. Forth was my first programming language, and for old times' sake I thought I'd include a Forth version in my article. I haven't written Forth in years and years, so I'd love some code review / suggestions here.
>
> I want to keep it "mostly standard" but I'm fine using a couple of Gforth-specific features if needed.
>
> Most other language implementations read standard input line at a time, split into whitespace-separated words, insert the words into a hash table (or increment the count / hash value if it already exists). Then when printing out, convert to an array to sort by count descending. For reference, in Python that might look like this (though it's even easier if you use collections.Counter):
>

Non-standard: 8th version ( file name on command-line )

\ Count unique words in a text file

\ Increment the count of the word in the map.
: m:++ \ m s -- m
  tuck 0 m:@? n:1+
  rot swap m:! ;

\ Get the first argument as a file-name, read and split into words:
{} 0 args f:slurp >s /\s+/ s:/

\ For each word in the list, insert into the map and count:
' m:++ a:each! drop

\ Convert the map into a sorted heap:
( 1 a:_@ swap 1 a:_@ n:cmp neg ) h:new swap ( 2 a:close h:push ) m:each drop

\ print the heap
h:len ( h:pop a:open swap . space . cr ) swap times drop
bye

Ben Hoyt

Mar 12, 2021, 2:39:26 AM
Thanks for the help, folks. Updated version: https://gist.github.com/benhoyt/770a6ce13e307b8cd2e7f1e7c4731c70

(Merge sort taken from Rosetta Code.)

I decided to just build the latest version of Gforth so I could have TRAVERSE-WORDLIST.

dxforth, thanks a lot for the pointers to SKIP and SCAN -- very helpful.

It's actually quite fast -- not much slower than the Python solution! (Python takes ~2.2 seconds for my large 43MB input file, this Gforth solution takes 3.7s).

-Ben

Gerry Jackson

Mar 12, 2021, 3:38:50 AM
On 12/03/2021 03:01, Ben Hoyt wrote:
> I'm writing a comparison of a simple program (print the counts of unique words in a text file, most frequent first) in various languages. Forth was my first programming language, and for old times' sake I thought I'd include a Forth version in my article. I haven't written Forth in years and years, so I'd love some code review / suggestions here.
>
> I want to keep it "mostly standard" but I'm fine using a couple of Gforth-specific features if needed.
>
> Most other language implementations read standard input line at a time, split into whitespace-separated words, insert the words into a hash table (or increment the count / hash value if it already exists). Then when printing out, convert to an array to sort by count descending. For reference, in Python that might look like this (though it's even easier if you use collections.Counter):
>
...

>
> 1) I don't know the best way to read the input. Here I'm cheating, by appending the input to the source file! (After "main".) I really want it to read STDIN. I know you can use STDIN READ-LINE, but then how would I parse that into words with the Forth word parsing functions like PARSE-NAME? (Which seems much better and more efficient than rolling our own parsing.) Is there a way to READ-LINE and then set it as the parsing input?

You already use EXECUTE-PARSING so you should be able to use that to
parse the string returned by STDIN READ-LINE. Something like (an
untested idea):
stdin read-line ' process-line execute-parsing

Another point is that if you save the value of >IN before PARSE-NAME, then if
SEARCH-WORDLIST fails to find the word you can restore the value of >IN
and use CREATE directly instead of EXECUTE-PARSING it. You already allot
a cell to count the number of occurrences, so DOES> can be used to
increment the count, e.g. in outline:

... >in @ >r parse-name ( search the wordlist )
( not found ) if
  r> >in ! create 1 ,
  ( copy the string and compile its address )
  does> 1 +!
else ( execute the word ) r> drop then

There are other details, like not being able to use DOES> in a control
structure, so that would have to be factored out.

--
Gerry

Anton Ertl

Mar 12, 2021, 4:57:14 AM
Ben Hoyt <ben...@gmail.com> writes:
>I haven't written Forth in years and years,

Welcome back.
> uoffset @ 0= if
>   exit
> then
> uoffset @ 0 do
>   uniques i cells + 2@
>   2dup type
>   counts search-wordlist 0= if
>     type ." not found!" cr exit \ Shouldn't happen
>   then
>   execute @ space . cr
> 2 +loop ;
>
>: main ( -- )
>  counts set-current
>  begin
>    refill
>  while
>    process-line
>  repeat
>  show-words
>  bye ;
>
>main
>The foozle bar the
>bar bar bar bar
>---------------
>
>Some comments / questions:
>
>1) I don't know the best way to read the input. Here I'm cheating, by appending the input to the source file! (After "main".) I really want it to read STDIN. I know you can use STDIN READ-LINE, but then how would I parse that into words with the Forth word parsing functions like PARSE-NAME? (Which seems much better and more efficient than rolling our own parsing.) Is there a way to READ-LINE and then set it as the parsing input?

|execute-parsing ( ... addr u xt -- ... ) gforth
| Make addr u the current input source, execute xt '( ... -- ... )',
|then restore the previous input source.

A definition of this word in Standard Forth is provided in
<https://theforth.net/package/compat/current-view/execute-parsing.fs>

You are already using this word for creating a name, so using it for
turning a line into a Forth input source would result in testing the
reentrancy of EXECUTE-PARSING (there are occasional posts that doubt
the need for reentrancy).

There is also

|execute-parsing-file ( i*x fileid xt -- j*x ) gforth
| Make fileid the current input source, execute xt '( i*x -- j*x )',
|then restore the previous input source.

which will save you the READ-LINE part.

Or, alternatively, use

slurp-fid ( fid -- addr u ) gforth
ADDR U is the content of the file FID

and then EXECUTE-PARSING on the result.

>2) I'm using a WORDLIST as a hash table, which also (for this use case) has the nice property of being case-insensitive. I'm still converting to lowercase so the output is consistent with the other language versions. It's a bit annoying that Forth doesn't have a built-in to-lower function, not even for characters.
>3) I notice there's a TRAVERSE-WORDLIST in the latest standard (and latest Gforth) -- this is nice, but it's not in my version of Gforth on Ubuntu. So I have to keep a separate append-only UNIQUES array of the "addr u" pairs and UOFFSET to index into it.
>4) There are no built-in sorting functions in Gforth (or standard Forth) by the looks, so I'm currently not sorting. Any pointers to a good merge sort implementation or similar (with a key/comparison function)?

You can find quicksort for integers (in several variations) in
<http://www.complang.tuwien.ac.at/forth/programs/sort.fs>

But I would just pipe the unsorted Forth output into "sort -nrk 2"

>5) Any other comments or feedback here?

Here's my take:

----------------
\ Usage: gforth count-unique.fs <count-unique.in |tr '[:upper:]' '[:lower:]'|sort -nrk2

wordlist constant counts

: count-word ( c-addr u -- )
  2dup counts search-wordlist dup if
    drop >body 1 swap +! 2drop
  else
    drop ['] create execute-parsing 1 ,
  then ;

: process-string ( -- )
  get-current counts set-current
  begin
    parse-name dup while
      count-word
  repeat
  2drop set-current ;

: show-words ( -- )
  [: dup name>string type space name>interpret >body @ 0 u.r cr true ;]
  counts traverse-wordlist ;

stdin slurp-fid ' process-string execute-parsing show-words
bye
---------------

For the most part, pretty close to your version, except that I left
the lower-casing and the sorting to other tools, and that I processed
the input as a whole instead of line by line, and I placed the
SET-CURRENT differently. I also tried to avoid the SET-CURRENT, but
currently the factoring of Gforth is not very amenable to that.

I note that the next standard contains almost everything used here,
except EXECUTE-PARSING, STDIN, and SLURP-FID.

Forth-2012 does not contain [: ;]; Forth-94 does not contain
PARSE-NAME, TRAVERSE-WORDLIST, NAME>STRING, or NAME>INTERPRET.

It's interesting how many features beyond Forth-94 can be found in
such a short program. Anybody who claims that they do not provide a
benefit is invited to show a Forth-94 program that does the same.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2020: https://euro.theforth.net/2020

Anton Ertl

Mar 12, 2021, 5:25:05 AM
Slower than Python does not sound fast to me:-)

Is the benchmark input publically available?

Brian Fox

Mar 12, 2021, 8:57:03 AM
On 2021-03-12 5:21 AM, Anton Ertl wrote:
> Slower than Python does not sound fast to me:-)
>
> Is the benchmark input publically available?
>
> - anton
>

Thank you for saying what we all were thinking. :)

Paul Rubin

Mar 12, 2021, 12:53:22 PM
Ben Hoyt <ben...@gmail.com> writes:
> It's actually quite fast -- not much slower than the Python solution!
> (Python takes ~2.2 seconds for my large 43MB input file, this Gforth
> solution takes 3.7s).

Try gforth-fast ?

Paul Rubin

Mar 12, 2021, 12:56:05 PM
Brian Fox <bria...@rogers.com> writes:
> On 2021-03-12 5:21 AM, Anton Ertl wrote:
>> Slower than Python does not sound fast to me:-)
> Thank you for saying what we all were thinking. :)

It is pretty impressive, since the main work in the Python program would
have been done in e.g. Python's dictionary objects, which are
implemented in C and are carefully tuned. Most of the important Python
datatypes and libraries are written that way. The Python language
interpreter itself is quite slow, but Python programmers get into the
habit of organizing their programs so that the heavy lifting is done by
the libraries and builtins. The interpreter itself is just a scripting
wrapper around the libraries, more or less.

Ben Hoyt

Mar 12, 2021, 2:22:00 PM
Thanks for the tips, Anton and Gerry. EXECUTE-PARSING-FILE is interesting. And you reminded me about >BODY (presumably a bit faster than execute here).

One of my constraints is the I'm not allowed to read the whole file into memory (up to 64KB chunks is okay), so SLURP-FID is out.

Yeah, I know the Python bytecode interpreter is not super-fast, but most of the C functions it calls have been very well optimized, so as long as your hot loop is not in Python, it can be quite fast.

My test input file is https://github.com/benhoyt/countwords/blob/master/kjvbible.txt, but I'm concatenating it to itself x10 to get the actual file I'm testing (so it's 43MB instead of 4MB). Like so:

cat kjvbible.txt kjvbible.txt kjvbible.txt kjvbible.txt kjvbible.txt kjvbible.txt kjvbible.txt kjvbible.txt kjvbible.txt kjvbible.txt >kjvbible_x10.txt

-Ben

Ben Hoyt

Mar 12, 2021, 3:37:34 PM
Side note (and possible bug report?): I can't seem to get Gforth's EXECUTE-PARSING-FILE to work. (This is a freshly-built gforth from https://github.com/forthy42/gforth). Am I holding it wrong? Here's my test code:

---------------
: process-input ( -- )
  begin
    parse-name dup
  while
    type cr
  repeat
  2drop ;

stdin ' process-input execute-parsing-file
\ s" t.txt" r/o open-file throw ' process-input execute-parsing-file
bye
---------------

And when I run it as follows:

gforth execute-parsing-file.fs <t.txt

It prints no output (the first PARSE-NAME call returns a length of zero). It doesn't change if I switch to the commented-out line to open a real file instead of stdin (I just wanted to check). I would have expected it to print the words in t.txt one per line. The contents of t.txt is:

The foozle bar the
bar bar bar bar

In any case, I'm falling back to using READ-LINE and then EXECUTE-PARSING on each line. Though incidentally, using SKIP vs SCAN manually is almost twice as fast as calling EXECUTE-PARSING on each line. I guess there's a fair amount of setup involved in each call to the latter?

-Ben

Paul Rubin

Mar 12, 2021, 5:33:18 PM
Ben Hoyt <ben...@gmail.com> writes:
> begin
> parse-name dup

That looks like you want 2dup since parse-name leaves addr u on the
stack.

Ben Hoyt

Mar 12, 2021, 5:35:27 PM
> > begin
> > parse-name dup
>
> That looks like you want 2dup since parse-name leaves addr u on the
> stack.

Yes, but the WHILE only tests and consumes the "u".

-Ben

dxforth

Mar 12, 2021, 6:21:46 PM
On 12/03/2021 18:39, Ben Hoyt wrote:
>
> thanks a lot for the pointers to SKIP and SCAN -- very helpful.
>

Agreed. For this application there was nothing to be gained keeping
track of the remaining string (as occurs with PARSE-NAME or ARG).
A BL SKIP and BL SPLIT is all the complexity that was needed.

dxforth

Mar 12, 2021, 6:47:11 PM
You're suggesting Python programmers are better in tune with what's
efficient than Forthers? Could be. Forthers go to extraordinary
lengths to demonstrate the slowness of their interpreter :)

Anton Ertl

Mar 13, 2021, 12:02:52 PM
Ben Hoyt <ben...@gmail.com> writes:
>Thanks for the tips, Anton and Gerry. EXECUTE-PARSING-FILE is interesting. And you reminded me about >BODY (presumably a bit faster than execute here).

It's faster, but probably does not make a significant difference in
this case.

>My test input file is https://github.com/benhoyt/countwords/blob/master/kjvbible.txt, but I'm concatenating it to itself x10 to get the actual file I'm testing (so it's 43MB instead of 4MB). Like so:
>
>cat kjvbible.txt kjvbible.txt kjvbible.txt kjvbible.txt kjvbible.txt kjvbible.txt kjvbible.txt kjvbible.txt kjvbible.txt kjvbible.txt >kjvbible_x10.txt

Thanks. With that, I see on a Skylake (Core i5-6600K), invoking it
with:

for i in \
  true \
  "python3 count-unique.py" \
  "$HOME/gforth/gforth-fast count-unique-hoyt.fs" \
  "$HOME/gforth/gforth-fast count-unique.fs|tr [:upper:] [:lower:]|sort -nrk2" \
  "$HOME/gforth/gforth-fast count-unique.fs"
do LC_NUMERIC=en_US.utf8 perf stat -B -e cycles:u -e instructions:u sh -c "$i" <count-unique.in >/dev/null
done

cycles instructions
300,856 256,229 true # measure "sh -c" overhead
8,444,928,819 21,491,024,474 python3 count-unique.py
8,020,113,329 19,790,232,716 $HOME/gforth/gforth-fast count-unique-hoyt.fs
5,129,364,575 12,035,209,874 .../gforth-fast count-unique.fs|tr ...|sort ...
4,954,627,149 11,637,863,826 .../gforth-fast count-unique.fs

count-unique.py is from
<57548ac6-9f34-480d...@googlegroups.com>.
count-unique-hoyt.fs is from
<https://gist.githubusercontent.com/benhoyt/770a6ce13e307b8cd2e7f1e7c4731c70/raw/3954359b59de674fa6af81bd7a7a7dc4a9336f34/countwords.fs>.
count-unique.fs is from <2021Mar1...@mips.complang.tuwien.ac.at>.

So Python 3.4.2 (as delivered with Debian 8) is slower than your
program when run on development gforth-fast (as built with
BUILD-FROM-SCRATCH on Debian 8). My program is quite a bit faster,
but I don't really know why. The last line shows that most of the
time of my program is spent in the Forth part.

Let's dissect your program:

cycles instructions
7,309,340,807 18,098,847,497 hoyt without show-words
2,186,413,395 5,530,766,295 hoyt without count-word show-words
7,297,503,704 18,031,073,597 hoyt without show-words, with my count-word
1,360,613,453 3,006,082,187 ertl without count-word show-words

The second result is surprising, because my program uses a very
similar count-word and calls it as often (I checked that, just to be
sure) with the same arguments. So it's surprising that the difference
from disabling count-word is more than the whole run-time of my
program. And indeed, using my count-words instead (third result)
provides little speedup. So I looked at what my program costs without
count-word (fourth result), and it's only about 3.6Gc faster without,
whereas your program is >5Gc faster without. I have no explanation
for why the same count-words calls cost more in your program than
mine; and it's not just a microarchitectural issue (cache misses,
branch mispredictions), you also see this difference in the number of
executed instructions (8.6G spent on count-word in my program, 12.5G
in your program).

I also found that both programs can reduce the cycle count by roughly
250Mc by inserting

15 :noname to hashbits hashdouble ; execute

at the start. This resizes the hash table to 64K buckets at the start
rather than doubling to 32K over time.

I then tried to fine-tune COUNT-WORD a little more:

: count-word ( c-addr u -- )
2dup counts find-name-in dup if
( name>interpret ) >body 1 swap +! 2drop
else
drop nextname create 1 ,
\ 1 num-uniques +!
then ;

cycles instructions
4,686,394,575 11,569,382,417 old COUNT-WORD
4,348,734,738 10,431,623,700 use FIND-NAME-IN instead of SEARCH-WORDLIST
4,383,716,571 11,018,557,131 use NEXTNAME instead of EXECUTE-PARSING
4,009,596,095 9,880,712,988 use both

It's interesting that these had such big effects, because
SEARCH-WORDLIST is a relative thin wrapper around FIND-NAME-IN, and
the ELSE part is called only 32000 times out of 8M calls to
count-word. Applying the same to your program with the hash table
improvement results in:

cycles instructions
6,993,463,336 17,962,483,836 hoyt with 64K buckets and faster COUNT-WORD

Let's see if we can improve the parsing (for my program at first):

: process-string1 ( c-addr u -- )
  get-current >r counts set-current
  begin
    2dup (parse-white) dup while
      2dup count-word + -rot + over -
  repeat
  2drop 2drop r> set-current ;

stdin slurp-fid process-string1 show-words

(PARSE-WHITE) is an internal factor of PARSE-NAME and WORD.

Strangely, when I do that, the number of cycles and number of
instructions goes up, but the parsing cycles and instructions go down.

cycles instructions
4,658,226,020 11,935,361,026 with PROCESS-STRING1
617,477,329 1,063,092,830 with PROCESS-STRING1 without COUNT-WORD

Using PROCESS-STRING1 as PROCESS-LINE in your program reduces the
instructions slightly, but increases the cycles slightly. It seems
that SCAN-WORD works nicely, no need for descending into
(PARSE-WHITE).

In any case, the fastest version is twice as fast as Python 3.4.2.

Anton Ertl

Mar 13, 2021, 12:25:14 PM
Ben Hoyt <ben...@gmail.com> writes:
>Side note (and possible bug report?): I can't seem to get Gforth's EXECUTE-=
>PARSING-FILE to work. (This is a freshly-built gforth from https://github.c=
>om/forthy42/gforth). Am I holding it wrong?

Maybe. Or maybe EXECUTE-PARSING-FILE handles this badly. I have now tried
it, by starting with the code from

<57548ac6-9f34-480d...@googlegroups.com>

and replaced MAIN with:

: process-file ( -- )
  begin
    refill
  while
    process-line
  repeat ;

: main ( -- )
  counts set-current
  stdin ['] process-file execute-parsing-file
  show-words
  bye ;

And it seems to work, including counting the first line. So what's
probably unintuitive about the current EXECUTE-PARSING-FILE is that it
needs a REFILL at the start.

Interestingly, nobody has complained about EXECUTE-PARSING-FILE
before, and I don't remember ever using it. It is used inside Gforth
in one place, as a factor of INCLUDE-FILE. Overall, one can say that
it is almost unused.

Interestingly, when EXECUTE-PARSING was discussed at a Forth-200x
meeting (IIRC over a decade ago), Stephen Pelc claimed that it is not
general enough, that people would soon want not only
EXECUTE-PARSING-FILE, but also EXECUTE-PARSING-FLASH and many more
such words, and that EXECUTE-PARSING is therefore insufficient, and we
should go for something more general. I then sketched a more general
mechanism, but that did not find positive resonance (not with me,
either).

Anyway, as a result we are here, over a decade later, without
standardized EXECUTE-PARSING, but at least we can evaluate the
predictions from that meeting: While I use EXECUTE-PARSING frequently
to work around the limitations of parsing words, I don't remember ever
using EXECUTE-PARSING-FILE, and the predicted other variants (e.g.,
EXECUTE-PARSING-FLASH) have not appeared. Not in VFX, and not in
Gforth.

Ben Hoyt

Mar 13, 2021, 7:00:10 PM
Thanks a lot, Anton, for the analysis and optimization efforts -- I don't feel so bad now that even you "have no explanation" for the larger discrepancy. :-)

> In any case, the fastest version is twice as fast as Python 3.4.2.

Would you mind sending a link to your final "fastest version"? I had trouble keeping track of all the modifications by the end of your reply.

Thanks,
Ben.

Anton Ertl

Mar 14, 2021, 1:11:54 PM
Ben Hoyt <ben...@gmail.com> writes:
>Would you mind sending a link to your final "fastest version"?

http://www.complang.tuwien.ac.at/anton/count-unique2.fs

Marcel Hendrix

Mar 14, 2021, 3:21:31 PM
On Friday, March 12, 2021 at 10:57:14 AM UTC+1, Anton Ertl wrote:
[..]
> |execute-parsing ( ... addr u xt -- ... ) gforth
> | Make addr u the current input source, execute xt '( ... -- ... )',
> |then restore the previous input source.

It appears I've lost some understanding along the way. The combination
of parse-name, execute-parsing and execute-parsing-file does not
do what I expect. The way in which non-space characters
are handled is unfamiliar, or maybe its the definition of
the "space" character.

The reference definition of parse-name says
PARSE-NAME

CORE EXT

( "<spaces>name<space>" -- c-addr u )
Skip leading space delimiters. Parse name delimited by a space.

c-addr is the address of the selected string within the input buffer and u is its
length in characters. If the parse area is empty or contains only white space,
the resulting string has length zero.

Implementation:
: isspace? ( c -- f )
  BL 1+ U< ;

: isnotspace? ( c -- f )
  isspace? 0= ;

: xt-skip ( addr1 n1 xt -- addr2 n2 )
  \ skip all characters satisfying xt ( c -- f )
  >R
  BEGIN
    DUP
  WHILE
    OVER C@ R@ EXECUTE
  WHILE
    1 /STRING
  REPEAT THEN
  R> DROP ;
...

This does not simply skip spaces ("Skip leading space delimiters."): it skips
every character whose code is not greater than BL. It will therefore also skip
<cr>, <lf>, and <tab> characters, and as such it will work without problems
with an execute-parsing-file that does not do *any* processing of the input file.

I guess the trick is in 3.4.1.1 which allows control characters to be equivalent
to a space, when space is the delimiter? Thus there is a difference between
"Skip leading space delimiters" and "Skip leading spaces."
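In other words, the reference behaviour amounts to something like this Python rendition (an illustration only, not the actual Gforth source):

```python
BL = 0x20  # the space character

def isspace(c):
    # Reference ISSPACE?: BL 1+ U<  i.e. any char code <= BL,
    # so CR, LF and TAB all count as "space" delimiters
    return ord(c) <= BL

def parse_name(s, i=0):
    # Skip leading "space" delimiters, then parse up to the next one;
    # returns the parsed word and the new parse position
    while i < len(s) and isspace(s[i]):
        i += 1
    start = i
    while i < len(s) and not isspace(s[i]):
        i += 1
    return s[start:i], i
```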

> Here's my take:
>
> ----------------
> \ Usage: gforth count-unique.fs <count-unique.in |tr '[:upper:]' '[:lower:]'|sort -nrk2
>
> wordlist constant counts
>
> : count-word ( c-addr u -- )
> 2dup counts search-wordlist dup if
> drop >body 1 swap +! 2drop
> else
> drop ['] create execute-parsing 1 ,
> then ;
>
> : process-string ( -- )
> get-current counts set-current
> begin
> parse-name dup while
> count-word
> repeat
> 2drop set-current ;
>
> : show-words ( -- )
> [: dup name>string type space name>interpret >body @ 0 u.r cr true ;]
> counts traverse-wordlist ;
>
> stdin slurp-fid ' process-string execute-parsing show-words
> bye
> ---------------

There could be a problem with Forths that have unified code and
data spaces, especially with a 43MByte input file.

-marcel

Ben Hoyt

Mar 14, 2021, 3:38:00 PM
Thanks Anton for sending that version.

> There could be a problem with Forths that have unified code and
> data spaces, especially with a 43MByte input file.

Yeah, it's against the constraints I've set for myself to read the entire input. I'm going to kind of merge my version with Anton's but read in something like 64KB chunks.

-Ben

Anton Ertl

Mar 14, 2021, 6:27:22 PM
Marcel Hendrix <m...@iae.nl> writes:
>On Friday, March 12, 2021 at 10:57:14 AM UTC+1, Anton Ertl wrote:
>[..]
>> |execute-parsing ( ... addr u xt -- ... ) gforth
>> | Make addr u the current input source, execute xt '( ... -- ... )',
>> |then restore the previous input source.
>
>It appears I've lost some understanding along the way. The combination
>of parse-name, execute-parsing and execute-parsing-file does not
>do what I expect. The way in which non-space characters
>are handled is unfamiliar, or maybe its the definition of
>the "space" character.
>
>The reference definition of parse-name says
>PARSE-NAME
[...]
>This does not simply skip spaces ("Skip leading space delimiters."), it skips
>everything which is not a "BL". It will therefore skip <cr>, <lf>, and <tab> characters,
>and as such it will work without problems with an execute-parsing-file that
>does not do *any* processing of the input file.
>
>I guess the trick is in 3.4.1.1 which allows control characters to be equivalent
>to a space, when space is the delimiter? Thus there is a difference between
>"Skip leading space delimiters" and "Skip leading spaces."

Also,

|11.3.5 Parsing
|
|When parsing from a text file using a space delimiter, control
|characters shall be treated the same as the space character.

EXECUTE-PARSING-FILE provides a line-oriented interface; you have to
REFILL at the end of each line (and also before the first line).

However, my version uses EXECUTE-PARSING (not EXECUTE-PARSING-FILE) on
a string containing newlines, and relies on PARSE-NAME treating
newlines as spaces.

>> stdin slurp-fid ' process-string execute-parsing show-words
>> bye
>> ---------------
>
>There could be a problem with Forths that have unified code and
>data spaces, especially with a 43MByte input file.

I don't see why. SLURP-FID puts the contents of the file in ALLOCATEd
memory, but even if it put them into the dictionary, the biggest
problem that I would expect was that the usual configuration does not
provide 43MB in the dictionary.

Marcel Hendrix
Mar 15, 2021, 5:17:15 AM

On Sunday, March 14, 2021 at 11:27:22 PM UTC+1, Anton Ertl wrote:
> Marcel Hendrix <m...@iae.nl> writes:
> >On Friday, March 12, 2021 at 10:57:14 AM UTC+1, Anton Ertl wrote:
[..]
> However, my version uses EXECUTE-PARSING (not EXECUTE-PARSING-FILE) on
> a string containing newlines, and relies on PARSE-NAME treating
> newlines as spaces.

Yes, that attracted me too. REFILL and having to separate the input
into lines is really annoying. This EXECUTE-PARSING (currently my
$PROCESS) can benefit from this.

>>There could be a problem with Forths that have unified code and
>>data spaces, especially with a 43MByte input file.

> I don't see why. SLURP-FID puts the contents of the file in ALLOCATEd
> memory, but even if it put them into the dictionary, the biggest
> problem that I would expect was that the usual configuration does not
> provide 43MB in the dictionary.

I currently have a very sloppy SLURP-FILE that leaves free only a
few kbytes above HERE, and huge code/data ALLOTs will clobber the
read data. I will fix that by SLURPing top-down instead of bottom-up.

-marcel

Ben Hoyt
Mar 15, 2021, 5:33:34 AM

For what it's worth, here is the final comparison article: https://benhoyt.com/writings/count-words/

Thanks for your help with the Forth version (especially Anton's efforts).

-Ben

P Falth
Mar 15, 2021, 9:57:01 AM

I have MAP-FILE ( addr len fam -- addr' len' ior)
That memory maps a file using the OS calls.
There is also UNMAP-FILE, FLUSHMAP-FILE and MAP-FID

For parsing there is GET-LINE ( addr len -- addr" len" addr' len' )
and GET-WORD ( addr len -- addr" len" addr' len' )
Both parse a word/line and return it on top of the stack, with the remaining input underneath.

INCLUDE uses for example MAP-FILE to map the source R/O
and then REFILL uses GET-LINE to get the source line.
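A hedged sketch of what a MAP-FILE-plus-GET-WORD pipeline amounts to, in
Python via the OS mmap call; the function name here is my own invention for
illustration, not part of Peter's system:

```python
import mmap

def words_of_mapped_file(path):
    # MAP-FILE analogue: map the file read-only via the OS, so the
    # contents are demand-paged instead of copied into the dictionary.
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        # GET-WORD analogue: peel off whitespace-separated words.
        return m[:].split()
```

Since the mapping is demand-paged, walking even a very large file only costs
a few resident pages at a time.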

BR
Peter

Anton Ertl
Mar 15, 2021, 12:49:59 PM

Ben Hoyt <ben...@gmail.com> writes:
>For what it's worth, here is the final comparison article: https://benhoyt.com/writings/count-words/

You showed off the profiling capabilities of other languages.
Development Gforth has a coverage tool that gives you execution
counts, used like:

~/gforth/gforth ~/gforth/coverage.fs -e ": bye .coverage bye ;" optimized.fs <count-unique.in

This also shows why I usually prefer not to have the BYE in the
program file, and instead write it on the invocation.

Anton Ertl
Mar 15, 2021, 12:51:32 PM

P Falth <peter....@gmail.com> writes:
>I have MAP-FILE ( addr len fam -- addr' len' ior)
>That memory maps a file using the OS calls.
>There is also UNMAP-FILE, FLUSHMAP-FILE and MAP-FID

What do you do if the file cannot be mmaped?

P Falth
Mar 15, 2021, 1:54:09 PM

On Monday, 15 March 2021 at 17:51:32 UTC+1, Anton Ertl wrote:
> P Falth <peter....@gmail.com> writes:
> >I have MAP-FILE ( addr len fam -- addr' len' ior)
> >That memory maps a file using the OS calls.
> >There is also UNMAP-FILE, FLUSHMAP-FILE and MAP-FID
> What do you do if the file cannot be mmaped?
> - anton

That is signaled by the ior that usually is consumed by throw.
In all cases I have seen so far it is due to a misspelled filename/path.
I have 128GB in my PC so have not yet had an out of memory error
due to a large file.
Peter

Brian Fox
Mar 15, 2021, 2:52:08 PM

Nice write up Ben. Thanks

This table is making Forth show its age, perhaps.

Language   Simple  Optimized  Notes
grep         0.04       0.04  grep reference; optimized sets LC_ALL=C
wc -w        0.29       0.20  wc reference; optimized sets LC_ALL=C
C            0.97       0.23
Go           1.14       0.38
Rust A       1.41       0.35  by Andrew Gallant
Rust B       1.48       0.28  also by Andrew: bonus and custom hash
C++          1.75       0.98  “optimized” isn’t very optimized
Python       2.07       1.27
C#           3.43             original by John Taylor
AWK          3.52       1.13  optimized uses mawk
Forth        4.21       1.44
Shell       14.67       1.86  optimized does LC_ALL=C sort -S 2G

Although considering that C# is newish, I guess Forth
fared pretty well.

C was given some lib help so there's that but GCC is a
monster is it not? Wow.

Being under Python does feel humbling...

I am thinking that Forth is more like Assembler in that it
requires lots of Forthy knowledge to get really high
performance. It seems to be a challenge even for good coders.

I am sure that marcel could make something faster with iForth
but I suspect it would not be much like Ben's code when it
was finished.

Interesting to be sure.


Anton Ertl
Mar 15, 2021, 3:01:24 PM

P Falth <peter....@gmail.com> writes:
>On Monday, 15 March 2021 at 17:51:32 UTC+1, Anton Ertl wrote:
>> What do you do if the file cannot be mmaped?
>> - anton
>
>That is signaled by the ior that usually is consumed by throw.
>In all cases I have seen so far it is due to a misspelled filename/path.
>I have 128GB in my PC so have not yet had an out of memory error
>due to a large file.

I am not worried about running out of RAM; after all, the file is
demand-paged in, so if memory is tight, a few pages of physical memory
is enough to walk through a large file (one page for the data, and
some more for the page tables).

What I am thinking about is files that cannot be mmaped, e.g.,
terminal input or pipes.

- anton

NN
Mar 15, 2021, 5:58:17 PM

I would disagree with the comment that "hash tables are present via wordlist".
The problem is that wordlists "look like" hash tables, but they are not the same.

Having seen your solution, I have been pondering the question of how many here
would have written the solution using wordlists as a way of solving this. It was
not the first thought that came to me, given that you indicated you wanted the
values sorted later.



P Falth
Mar 15, 2021, 7:22:55 PM

On Monday, 15 March 2021 at 20:01:24 UTC+1, Anton Ertl wrote:
> P Falth <peter....@gmail.com> writes:
> >On Monday, 15 March 2021 at 17:51:32 UTC+1, Anton Ertl wrote:
> >> What do you do if the file cannot be mmaped?
> >> - anton
> >
> >That is signaled by the ior that usually is consumed by throw.
> >In all cases I have seen so far it is due to a misspelled filename/path.
> >I have 128GB in my PC so have not yet had an out of memory error
> >due to a large file.
> I am not worried about running out of RAM; after all, the file is
> demand-paged in, so if memory is tight, a few pages of physical memory
> is enough to walk through a large file (one page for the data, and
> some more for the page tables).
>
> What I am thinking about is files that cannot be mmaped, e.g.,
> terminal input or pipes.

They will fail MAP-FILE and also all INCLUDE variants that are built on it.
I see no problem in that.

If I want to read from a pipe, that needs to be redirected to stdin.
It will work on Linux but not well on Windows yet.

The new Windows Terminal is a great step forward and behaves
almost like a Linux terminal.

Peter

Anton Ertl
Mar 16, 2021, 3:26:04 AM

NN <novembe...@gmail.com> writes:
>On Monday, 15 March 2021 at 09:33:34 UTC, Ben Hoyt wrote:
>> For what it's worth, here is the final comparison article: https://benhoyt.com/writings/count-words/
>>
>> Thanks for your help with the Forth version (especially Anton's efforts).
>>
>> -Ben
>
>I would disagree with the comment that "hash tables are present via wordlist".
>The problem is that wordlists " look like " hashtables but not they are not the same.

In Gforth wordlists are implemented with a hash table (by default).
Given that he used Gforth, his assessment is definitely correct.

>Having seen your solution, I have been pondering over the question, how many here
>would have written the solution using wordlists as a way to solving this.

I would.

>It was
>not the first though that came to me given that you indicated you wanted the values
>sorted later.

He wanted to sort it by count. He used a hash table with the words as
keys. Plus, the counts are only known for sure after having processed
the input. So I don't see how the sorting would have helped much.
Ok, you could have used an array sorted by count, with linear search
from the most frequent one, but I expect that this would have been a
lot slower.

Anton Ertl
Mar 16, 2021, 3:52:39 AM

P Falth <peter....@gmail.com> writes:
>On Monday, 15 March 2021 at 20:01:24 UTC+1, Anton Ertl wrote:
>> What I am thinking about is files that cannot be mmaped, e.g.,
>> terminal input or pipes.
>
>They will fail MAP-FILE and also all INCLUDE variants that are built on it.
>I see no problem in that.

As long as the user knows this and is prepared for it, it's fine.

>If I want to read from a pipe, that needs to be redirected to stdin.

In the case at hand the input came from stdin.

In addition, Unix also has named pipes. For the first 3 decades of my
Unix usage, I have not used them, but lately, I have picked up the
shell syntax "<(command)", and it creates a named pipe internally.
E.g., if I do

ls -l <(cat count-unique.fs)

it shows

lr-x------ 1 anton users 64 Mar 16 08:47 /dev/fd/63 -> pipe:[191951]

Marcel Hendrix
Mar 16, 2021, 4:04:30 AM

It is not clear to me what the objective of the paper is, and thus
I wouldn't know how to let Forth come out on top. Is it runtime speed,
ease of writing it (using built-in facilities), clarity of the finished
program, ease of debugging, size of the code?

If this were a real-world problem that *I* had, Forth would
end quite a bit higher, but it wouldn't necessarily show up in the
source code as such (the kernel would be optimized differently).

One very interesting thing that the paper shows is that grep is
incredibly fast for this task. The reason is known, and that
algorithm/data structure should be adopted in Forth if it is
useful for the type of problems we want to solve.

I doubt that grep (or any of the other benchmarked programming tools)
is very convenient or easy to use when one wants to do specific types
of simulation that benefit from run-time generated code.

The paper goes into considerable detail w.r.t. tracing and debugging
facilities, but it is not actually using these very much. I am quite fond
of Forth in this respect. There is not much a mere programmer can do
with a one-line program, apart from going to stackexchange and
asking around (standard behavior of my Python colleagues).

-marcel

Anton Ertl
Mar 16, 2021, 4:22:10 AM

Marcel Hendrix <m...@iae.nl> writes:
>One very interesting thing that the paper shows is that grep is
>incredibly fast for this task.

grep does not solve this task. He included the timing only to have an
idea how much the I/O overhead is, and it shows that little of the
time is spent in I/O.

Anton Ertl
Mar 16, 2021, 4:34:12 AM

Brian Fox <bria...@rogers.com> writes:
>On 2021-03-15 5:33 AM, Ben Hoyt wrote:
>> For what it's worth, here is the final comparison article: https://benhoyt.com/writings/count-words/
>>
>> Thanks for your help with the Forth version (especially Anton's efforts).
>>
>> -Ben
>>
>Nice write up Ben. Thanks
>
>This table is making Forth show it's age perhaps.

I don't think that age has anything to do with it. It's more a
function of the implementation choices in Gforth, mainly in the
dictionary lookup.

>Language Simple Optimized Notes
>grep 0.04 0.04 grep reference; optimized sets LC_ALL=C
>wc -w 0.29 0.20 wc reference; optimized sets LC_ALL=C
>C 0.97 0.23
>Go 1.14 0.38
>Rust A 1.41 0.35 by Andrew Gallant
>Rust B 1.48 0.28 also by Andrew: bonus and custom hash
>C++ 1.75 0.98 “optimized” isn’t very optimized
>Python 2.07 1.27
>C# 3.43 original by John Taylor
>AWK 3.52 1.13 optimized uses mawk
>Forth 4.21 1.44
>Shell 14.67 1.86 optimized does LC_ALL=C sort -S 2G
...
>I am thinking that Forth is more like Assembler in that it
>requires lots of Forthy knowledge to get really high
>performance.

It generally requires lots of knowledge to get really high
performance. Unfortunately this is an area that is rife with
superstition.

Paul Rubin
Mar 16, 2021, 5:13:42 AM

an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
> Plus, the counts are only known for sure after having processed
> the input. So I don't see how the sorting would have helped much.

A sorting-based solution could be done with very little memory, though
it would be slow. A lot of exercises in Knuth vol 3 are like that. You
have a 1960's small computer with some kilobytes of memory and some
megabytes of disks or magtapes, and you have to do this frequency
counting problem. Basically you make a bunch of in-memory sort/merge
passes of small chunks of the input, and you end up with a list of words
and counts, sorted alphabetically by word. Then you sort that again,
this time by count.
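The shape of that exercise, sketched in Python on an in-memory list (the real
1960s version would do the first pass as an external sort/merge over
tape-sized chunks rather than an in-memory sorted()):

```python
from itertools import groupby

def count_by_sorting(words):
    # Pass 1: sort so that equal words become adjacent
    # (externally, this would be chunked sort/merge passes).
    ordered = sorted(words)
    # Pass 2: collapse each run of equal words into (word, count),
    # yielding a list sorted alphabetically by word.
    pairs = [(w, sum(1 for _ in run)) for w, run in groupby(ordered)]
    # Pass 3: sort again, this time by descending count.
    pairs.sort(key=lambda wc: wc[1], reverse=True)
    return pairs

print(count_by_sorting("the foozle bar the bar bar bar bar".split()))
# [('bar', 5), ('the', 2), ('foozle', 1)]
```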

dxforth
Mar 16, 2021, 11:51:16 PM

The question seems to have been 'how can I do this in the easiest way using gforth'.
Had it been 'why should I do this in forth when there are easier alternatives' the
result may have been different. I use forth for things nobody else wants to write,
and other languages/forth implementations bore me. Perhaps other forthers too.

none albert
Mar 18, 2021, 12:01:13 PM

In article <2021Mar1...@mips.complang.tuwien.ac.at>,
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
<SNIP>
>
>Anyway, as a result we are here, over a decade later, without
>standardized EXECUTE-PARSING, but at least we can evaluate the
>predictions from that meeting: While I use EXECUTE-PARSING frequently
>to work around the limitations of parsing words, I don't remember ever
>using EXECUTE-PARSING-FILE, and the predicted other variants (e.g.,
>EXECUTE-PARSING-FLASH) have not appeared. Not in VFX, and not in
>Gforth.

I added EXECUTE-PARSING at that time, and did not bother with the
other ones. It is easy enough.
: EXECUTE-PARSING ROT ROT SAVE SET-SRC CATCH RESTORE THROW ;
It sits in a screen together with PARSE-NAME and SAVE-INPUT
awaiting external sources that use them.
[ In the kernel I see no use for them. ]
You are right that prepending this with GET-FILE or SLURP-FILE is easy
enough not to bother with extra words.

>
>- anton
--
"in our communism country Viet Nam, people are forced to be
alive and in the western country like US, people are free to
die from Covid 19 lol" duc ha
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

none albert
Mar 18, 2021, 3:25:03 PM

In article <57548ac6-9f34-480d...@googlegroups.com>,
Ben Hoyt <ben...@gmail.com> wrote:
>I'm writing a comparison of a simple program (print the counts of unique
>words in a text file, most frequent first) in various languages. Forth
>was my first programming language, and for old times' sake I thought I'd
>include a Forth version in my article. I haven't written Forth in years
>and years, so I'd love some code review / suggestions here.
>
>I want to keep it "mostly standard" but I'm fine using a couple of
>Gforth-specific features if needed.
>
>Most other language implementations read standard input line at a time,
>split into whitespace-separated words, insert the words into a hash
>table (or increment the count / hash value if it already exists). Then
>when printing out, convert to an array to sort by count descending. For
>reference, in Python that might look like this (though it's even easier
>if you use collections.Counter):
>
<SNIP> code

As long as we're using a wordlist as list of words here is a solution
that doesn't really work for the problem but presents an interesting
technique. It works with prefixes, and the wordlist internal structure
is a linked list, not a hash.
The preparation is
lina64 -g 64 lina64+
This creates a Forth with 64 Mbyte on top of the 64 Mbyte present,
enough for some King James's.
SORT-VOC is in the library, it sorts a wordlist using a merge sort.
\ ------------------------------------------------------
: autocounter CREATE 1 , DOES> 1 SWAP +! ;
NAMESPACE wordsdb
wordsdb DEFINITIONS
: CATCH-ALL autocounter ; PREFIX
' CATCH-ALL PREVIOUS DEFINITIONS CONSTANT DEA-OF-CATCH-ALL
\ Make CATCH-ALL an empty prefix, such that it matches all
0 ( # chars) DEA-OF-CATCH-ALL >NFA @ !
: doit wordsdb DEFINITIONS 1 ARG[] INCLUDED PREVIOUS 'wordsdb SORT-VOC show ;
\ ------------------------------------------------------

This is intended to be a turnkey. `doit has the wordsdb database
on top. The file in argument 1 is included adding new words
of the type autocounter to wordsdb.
If the word is found it is executed, adding one to the count.
The interesting thing happens if it comes to the word CATCH-ALL.
Actually catch-all's name has become "", and because it is a prefix it
matches everything, so it executes autocounter, that adds the
word that was missing to the wordlist.

Some tests
REGRESS autocounter aap 'aap >BODY @ S: 1
REGRESS aap aap aap 'aap >BODY @ S: 4

Showing the words is a bit of a hassle, because we must exclude
the catch-all word. The words are found via their link field,
using `` >LFA @ ''.

: (show) DUP DEA-OF-CATCH-ALL <> IF
DUP >NFA @ $@ TYPE SPACE >BODY ? CR
ELSE DROP THEN ;
: show 'wordsdb >WID >LFA @ \ This gets the first word of the linked list.
BEGIN DUP (show) >LFA @ DUP 0= UNTIL DROP ;

: doit wordsdb DEFINITIONS 1 ARG[] INCLUDED PREVIOUS
'wordsdb SORT-VOC show ;

The result after
lina64 -c count-words.frt
count-words count-words.frt
looks like
\ ------------------------------------
wordsdb 4
that 1
test 2
such 1
show 3
prefix, 1
pietje 1
matches 1
it 1
empty 1
doit 1
chars) 1
autocounter 5
an 1
all 1
aapje 1
aap 4
...
...
(show) 2
( 1
'wordsdb 2
'aap 2
' 1
$@ 1
$= 1
# 1
! 1
\ -------------------------
That is in reverse order because of the SORT-VOC I had lying around.

>Thanks a lot!
Thanks for an inspirational question.
>
>-Ben

Groetjes Albert

none albert
Mar 19, 2021, 8:55:16 AM

In article <6053a652$0$27924$e4fe...@news.xs4all.nl>,
A slightly improved version. Instead of scanning a string one can
use the interpreter (we have one! it is called QUIT) directly.
At end of input BYE is executed, it has to be revectored, otherwise
there is no output.
Also don't forget to go bye in the redefined code (!).

Another improvement:
have catchall in a wordlist of its own. Now it is no longer
an exception inside the wordsdb.

WANT SORT-VOC
: autocounter CREATE 1 , DOES> 1 SWAP +! ;

NAMESPACE catchallwid
NAMESPACE wordsdb
catchallwid DEFINITIONS
: CATCH-ALL autocounter ; PREFIX
\ Make CATCH-ALL an empty prefix, such that it matches all
' CATCH-ALL PREVIOUS DEFINITIONS CONSTANT DEA-OF-CATCH-ALL
0 ( # chars) DEA-OF-CATCH-ALL >NFA @ !

: (show) DUP >NFA @ $@ TYPE SPACE >BODY ? CR ;
: show 'wordsdb >WID >LFA @
BEGIN DUP (show) >LFA @ DUP 0= UNTIL DROP ;

'BYE ALIAS NEW-BYE
: on-exit 'wordsdb SORT-VOC show NEW-BYE ;

: doit 'on-exit 'BYE 3 CELLS MOVE \ revector BYE
catchallwid wordsdb DEFINITIONS QUIT ;

The benefit is that the redirection buffering of the Forth
(probably fast as hell) will take care of files of arbitrary length.
No problem with, say, 666 King James Bibles.

QUIT can be replaced by: "x" INCLUDE BYE

Note: I see that the ordering was supposed to be according to
frequency. That is a matter of changing the execution token
passed inside SORT-VOC.
Now:
\ For XT1 and XT2 return XT1 and XT2 plus "xt IS lower".
: GET-NAME >NFA @ $@ ; \ Aux. For EL, return NAME.
: NAMES< SWAP >R GET-NAME R> GET-NAME COMPARE 0 < ;
New:
: CONTENT< SWAP >R >BODY @ R> >BODY @ < ;
Replace 'NAMES< by 'CONTENT<

>
>>Thanks a lot!
>Thanks for an inspirational question.

This and the gforth example serve to illustrate that knowing your
interpreter can trump standard code. Of course in real life
standardization trumps it all, as C has proven to us.

none albert
Mar 19, 2021, 10:46:41 AM

In article <57548ac6-9f34-480d...@googlegroups.com>,
Ben Hoyt <ben...@gmail.com> wrote:
>I'm writing a comparison of a simple program (print the counts of unique
>words in a text file, most frequent first) in various languages. Forth
>was my first programming language, and for old times' sake I thought I'd
>include a Forth version in my article. I haven't written Forth in years
>and years, so I'd love some code review / suggestions here.
>
>I want to keep it "mostly standard" but I'm fine using a couple of
>Gforth-specific features if needed.
>
>Most other language implementations read standard input line at a time,
>split into whitespace-separated words, insert the words into a hash
>table (or increment the count / hash value if it already exists). Then
>when printing out, convert to an array to sort by count descending. For
>reference, in Python that might look like this (though it's even easier
>if you use collections.Counter):

<SNIP>


>5) Any other comments or feedback here?

What would google do?

God?) 1
God? 62
God;) 2
God; 128
God:for 1
God: 204
God. 485
God-ward: 1
God-ward, 1
God-ward 1
God,) 2
God, 964
God's? 1
God's: 1
God's. 6
God's 18
God! 9
God 2229
GOD? 2
GOD;) 1
GOD; 119
GOD: 10
GOD. 62
GOD, 44
GOD! 8
GOD 53

Looking at the output of my 68 minute run, I suggest that
it is better to consider punctuation like . , : ; as white space
(which could in my Forth be done by revectoring ?BLANK ).

And what about:
1:9 65
1:8 65
1:7 65
I wouldn't count those as god's words.
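In the reference Python version, that suggestion comes down to translating
the offending characters to spaces before splitting. Which characters count
as punctuation is a judgment call, so this is only a sketch:

```python
def split_words(text):
    # Treat punctuation and digits as whitespace, so "God." and "God,"
    # collapse into "god", and verse labels like "1:7" vanish entirely.
    junk = ".,:;!?()'-0123456789"   # judgment call: which chars to drop
    table = str.maketrans(junk, " " * len(junk))
    return text.lower().translate(table).split()

print(split_words("God; GOD: God. 1:7 God-ward,"))
# ['god', 'god', 'god', 'god', 'ward']
```

Note the trade-off: dropping the apostrophe also splits "God's" into two
words, so the exact character set depends on what you want counted.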

>
>Thanks a lot!

Anton Ertl
Mar 19, 2021, 2:37:38 PM

Ben Hoyt <ben...@gmail.com> writes:
>For what it's worth, here is the final comparison article: https://benhoyt.com/writings/count-words/

I took Ben Hoyt's optimized version, and wanted to see how different
Forth systems fare. So I first looked at which words in his code are
not in Forth-2012: Gforth's ans-report.fs reports:

|from non-ANS :
| hashbits hashdouble find-name-in nextname rdrop cell <= 0>= stdin
| execute-parsing -rot

I eliminated the Gforth-specific hash stuff. I went back from
FIND-NAME-IN to SEARCH-WORDLIST, and from NEXT-NAME to
EXECUTE-PARSING. I defined STDIN to be the opened file containing the
benchmark input (lxf and SwiftForth contain a STDIN, though). For
EXECUTE-PARSING, I used the implementation from the compat library
where necessary.

You can find the result at
<http://www.complang.tuwien.ac.at/anton/count-unique-portable.fs>

Then I tried it on a few systems:

iForth: The version I have does not implement NAME>INTERPRET and
TRAVERSE-WORDLIST, so I gave up on that.

lxf: This worked out of the box (it has EXECUTE-PARSING)

SwiftForth: I could get it to run by first loading
compat/execute-parsing.fs, but the output is not what it should be.
In particular, it shows counts that are not divisible by 10, and the
output has 155800 lines.

VFX: Similarly, I included compat/execute-parsing.fs, and
similarly, I got output other than expected (and apparently the same
as for SwiftForth).

I then tried using the reference implementation of PARSE-NAME
<http://forth200x.org/reference-implementations/parse-name.fs> with
SwiftForth and VFX, and it fixed the problem. The problem seems to be
that PARSE-NAME in SwiftForth (and, I guess, VFX) only scans for
blanks and skips blanks.

The command lines I used were:

~/gforth/gforth-fast count-unique-portable.fs -e "main bye"
lxf "include count-unique-portable.fs main bye"
sf "include forth200x-code/extensions/parse-name.fs include /home/anton/gforth/compat/execute-parsing.fs include count-unique-portable.fs main bye"
vfxlin "include forth200x-code/extensions/parse-name.fs include /home/anton/gforth/compat/execute-parsing.fs include count-unique-portable.fs main bye"

Performance (on Skylake: Core i5-6600K):

Measured with

LC_NUMERIC=en_US.utf8 perf stat -B -e cycles:u -e instructions:u <command> >/dev/null

cycles:u instructions:u
5,283,768,248 12,192,102,197 gforth-fast
25,415,445,850 14,875,738,337 lxf
323,601,779,148 71,447,700,709 SwiftForth
10,253,285,512 12,480,803,729 SwiftForth 1931 #strands !
8,924,046,495 12,943,588,904 SwiftForth 3119 #strands !
8,633,549,116 13,651,490,963 SwiftForth 4447 #strands !
8,869,032,555 14,764,023,064 SwiftForth 6211 #strands !
10,536,796,771 16,885,231,117 SwiftForth 9377 #strands !
16,499,932,432 22,816,689,082 SwiftForth 18181 #strands !
39,140,121,585 54,881,274,417 SwiftForth 65537 #strands !
156,625,611,921 68,036,722,876 VFX

So there are big speed differences. My guess is that, unlike Gforth,
some other Forth systems have not implemented wordlists with a view
towards using them as general lookup tables.

SwiftForth implements a wordlist as a hash table with separate
chaining, with a default of 31 buckets (aka #STRANDS, i.e., >1000
words per bucket). One can set the number of buckets in a
non-portable way, and I played around with that. My expectation was
that the more buckets I use, the faster SwiftForth will run (with
little gain after the average number of words per bucket is <1), but
for some reason, using a large number of buckets costs a lot of
instructions on SwiftForth, and I played around a bit to find a number
of buckets that gives a low number of cycles; it's still worse than
Gforth.

One other interesting aspect is the big differences in instructions
per cycle (IPC). lxf uses many much more instructions, but 5 times
more cycles than Gforth. SwiftForth (default) has a particularly low
IPC, about 10 times lower than gforth-fast. that this comes from
spending much of the time following the separate chains, which
probably results in L2 cache misses (perf stat reports 6,198,244,154
LLC-loads (last level cache), compared to 11,204,304 for gforth-fast).

P Falth
Mar 20, 2021, 4:10:38 AM

lxf has a default of 499 buckets. Each wordlist has its own hashtable.
The number of buckets can be set with the value #threads before a
wordlist is created. It was made to save space for small wordlists.
I just tried it with a large number 6211. Creating words work but
WORDS fail, but that may be a problem with WORDS.

Peter

Anton Ertl
Mar 20, 2021, 5:33:14 AM

P Falth <peter....@gmail.com> writes:
>On Friday, 19 March 2021 at 19:37:38 UTC+1, Anton Ertl wrote:
>> SwiftForth implements a wordlist asa hash table with separate
>> chaining, with a default of 31 buckets (aka #STRANDS, i.e., >1000
>> words per bucket). One can set the number of buckets in a
>> non-portable way, and I played around with that. My expectation was
>> that the more buckets I use, the faster SwiftForth will run (with
>> little gain after the average number of words per bucket is <1), but
>> for some reason, using a large number of buckets costs a lot of
>> instructions on SwiftForth, and I played around a bit to find a number
>> of buckets that gives a low number of cycles; it's still worse than
>> Gforth.

It seems that the reason for this behaviour is that SwiftForth's
TRAVERSE-WORDLIST produces the nts in the reverse order of definition,
and recreates that order by, at every step, selecting the nt with the
highest address among each bucket. So TRAVERSE-WORDLIST has O(m*n)
cost, where m is the number of buckets and n is the number of xt
executions.

TRAVERSE-WORDLIST was specified in a way that allows producing the nts
of one bucket in reverse insertion order for that bucket, and then the
next bucket etc., i.e., O(n) cost. But I guess that Forth, Inc. wants
wants to preserve the classic behaviour of WORDS, and implemented
TRAVERSE-WORDLIST accordingly.

Gforth also preserves the classic behaviour of WORDS and also produces
the words in reverse order of definition. It does this by having a
linked list representation which is the canonical representation of
the wordlist. It also has a hash table in addition. The linked list
is used for enumerating the words (with TRAVERSE-WORDLIST) and for
(re)building the hash table. The hash table is used for lookup.
Insert (word definition) inserts in both the linked list and the hash
table. One nice feature of this approach is that we can throw away
and rebuild the hash table any time when maintaining it is too
complicated, e.g., when executing a marker, or when doubling the size
of the hash table.
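The two-structure arrangement described above (canonical insertion-order list
plus a disposable hash table for lookup) can be sketched in a few lines of
Python; the names here are mine, not Gforth's:

```python
class Wordlist:
    def __init__(self):
        self.defs = []    # canonical: (name, body) in definition order
        self.index = {}   # hash table: name -> position of latest definition

    def define(self, name, body):
        # A definition is inserted into both structures.
        self.index[name] = len(self.defs)
        self.defs.append((name, body))

    def lookup(self, name):
        # Lookup uses only the hash table; the latest definition wins.
        pos = self.index.get(name)
        return None if pos is None else self.defs[pos][1]

    def traverse(self):
        # TRAVERSE-WORDLIST order: reverse order of definition, O(n).
        return [name for name, _ in reversed(self.defs)]

    def rebuild(self):
        # The hash table is disposable: throw it away and rebuild it from
        # the canonical list, e.g. after a MARKER truncates self.defs.
        self.index = {name: pos for pos, (name, _) in enumerate(self.defs)}

w = Wordlist()
for body, name in enumerate(["dup", "swap", "dup"]):
    w.define(name, body)
print(w.traverse())     # ['dup', 'swap', 'dup']
print(w.lookup("dup"))  # 2, the later definition shadows the earlier one
```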

>lxf has a default of 499 buckets. Each wordlist has its own hashtable.
>The number of buckets can be set with the value #threads before a
>wordlist is created. It was made to save space for small wordlists.

Adding some results for lxf with varying buckets gives:

cycles:u instructions:u
5,283,768,248 12,192,102,197 gforth-fast
25,415,445,850 14,875,738,337 lxf
6,817,364,445 7,803,365,742 lxf 18181 to #threads
6,571,979,412 7,728,933,434 lxf 65537 to #threads
323,601,779,148 71,447,700,709 SwiftForth
10,253,285,512 12,480,803,729 SwiftForth 1931 #strands !
8,924,046,495 12,943,588,904 SwiftForth 3119 #strands !
8,633,549,116 13,651,490,963 SwiftForth 4447 #strands !
8,869,032,555 14,764,023,064 SwiftForth 6211 #strands !
10,536,796,771 16,885,231,117 SwiftForth 9377 #strands !
16,499,932,432 22,816,689,082 SwiftForth 18181 #strands !
39,140,121,585 54,881,274,417 SwiftForth 65537 #strands !
156,625,611,921 68,036,722,876 VFX

lxf behaves with increasing number of threads as I expected. I would
have hoped for a lower cycles result, though.

- anton
--

NN
Mar 20, 2021, 6:30:49 AM

There was nothing in Ben's original post to say he was interested
in speed, so I don't know why the thread is so focused on cycles
and performance.

What it shows is that this "simple" problem is difficult to solve in forth.

Anton Ertl
Mar 20, 2021, 8:06:25 AM

an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>I then tried using the reference implementation of PARSE-NAME
><http://forth200x.org/reference-implementations/parse-name.fs> with
>SwiftForth and VFX, and it fixed the problem. The problem seems to be
>that PARSE-NAME in SwiftForth (and, I guess, VFX) only scans for
>blanks and skips blanks.

This poses the question of how these systems deal with files
containing tabs. I did some testing of these systems, and what I found is:

SwiftForth converts each tab in an INCLUDEd file to a space before
presenting it to the text interpreter (but obviously not for EVALUATE,
which is used by compat/execute-parsing.fs). So files with tab can be
text-interpreted by SwiftForth, although it makes me wonder how ."
containing a tab works.

VFX does not convert the file on reading, but its PARSE-NAME seems to
use both BL (32) and tab (9) as delimiters, though not the line feed (10).
Given that all line feeds are skipped during the linewise processing of
an INCLUDEd file, PARSE-NAME never sees a line feed in that context.
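The practical difference is easy to demonstrate outside Forth. Here is a small Python sketch (Python being the language of Ben's reference version) contrasting blanks-only splitting with any-whitespace splitting; the blanks-only variant mirrors what a PARSE-NAME that scans only for BL would do:

```python
line = "foo\tbar baz"

# Splitting only on the blank character: the tab stays glued to "foo",
# analogous to a PARSE-NAME that recognizes only BL as a delimiter.
blank_only = [w for w in line.split(" ") if w]
print(blank_only)   # ['foo\tbar', 'baz']

# Splitting on any whitespace (blank, tab, line feed), as the Forth200x
# reference implementation of PARSE-NAME effectively does:
any_white = line.split()
print(any_white)    # ['foo', 'bar', 'baz']
```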

Anton Ertl
Mar 20, 2021, 8:16:12 AM
NN <novembe...@gmail.com> writes:
>There was nothing in Ben's original post to say he was interested
>in speed.

Well, he compared Gforth speed with Python speed. But the final blog
post that he linked to was much more focused on performance. And of
course the unfavourable result for Forth made me wonder if we can get
better results for Forth.

So I looked if one of the native-code Forth systems could outperform
Gforth on a more portable variant of this problem; this gave some
insights into the portability of this program, and into performance of
wordlist implementations on various Forth systems.

Next I will look at whether I can get better speed out of Gforth.

>What it shows is that this "simple" problem is difficult to solve in forth.

I don't think so. Splitting the input into "words" is a very good
match to Forth's parsing, and Forth's wordlists are a very good match
for the counting part. What Forth does not have built-in is the
lower-casing (pretty easy) and the sorting (more effort), but I leave
that to shell tools when I need it.
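For what it's worth, the shell-tool division of labour described above might look like the pipeline below (a sketch using standard Unix tools; here the whole job is done in the shell, but the same `sort`-based tail end works on the "word count" lines a Forth program emits):

```shell
# Lower-case, split blanks/tabs into one word per line,
# then count duplicates and sort by count, descending.
printf 'The foozle bar the\nbar bar bar bar\n' \
  | tr 'A-Z' 'a-z' \
  | tr -s ' \t' '\n' \
  | sort | uniq -c | sort -rn \
  | awk '{print $2, $1}'
```

On Ben's example input this prints "bar 5", "the 2", "foozle 1", matching the Python reference output.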

none albert
Mar 20, 2021, 8:29:27 AM
In article <2021Mar2...@mips.complang.tuwien.ac.at>,
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>NN <novembe...@gmail.com> writes:
>>There was nothing in Ben's original post to say he was interested
>>in speed.
>
>Well, he compared Gforth speed with Python speed. But the final blog
>post that he linked too was much more focused on performance. And of
>course the unfavourable result for Forth made me wonder if we can get
>better results for Forth.
>
>So I looked if one of the native-code Forth systems could outperform
>Gforth on a more portable variant of this problem; this gave some
>insights into the portability of this program, and into performance of
>wordlist implementations on various Forth systems.
>
>Next I will look at whether I can get better speed out of Gforth.
>
>>What it shows is that this "simple" problem is difficult to solve in forth.
>
>I don't think so. Splitting the input into "words" is a very good
>match to Forth's parsing, and Forth's wordlists are a very good match
>for the counting part. What Forth does not have built-in is the
>lower-casing (pretty easy) and the sorting (more effort), but I leave
>that to shell tools when I need it.

On top of that, I think that in the end theological considerations
will come in. So a dedicated program that uses hashlists as a tool
is what we'll end up with.

I think the Bible is case sensitive. There is a difference between
"the god of the Canaanites" and "Then God spoke to Abraham".
And I already drew attention to punctuation, as in God God. God! God: God; .
I think they should all land in the same bucket.

Anyway this is not a difficult problem in principle. The difficulties
are down to the usual suspect, unclear specification by the customer.


>
>- anton

NN
Mar 20, 2021, 8:45:29 AM
Ben's own example converts everything to lowercase, so the only thing
left is handling of punctuation. But I agree with your solution in an earlier
post of treating punctuation as space.

In fact, if you assume words are just composed of alpha chars, then the only
exceptions I can think of are hyphenated words.
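The punctuation-as-space rule with a carve-out for hyphens could be sketched like this in Python (the regex and the sample text here are mine, not from the thread):

```python
import re

text = "God, God! God: God; self-evident God."
# Everything that is not a letter or a hyphen acts as a separator,
# so all the punctuated forms of "God" land in one bucket while
# "self-evident" survives as a single word.
words = [w.lower() for w in re.split(r"[^A-Za-z-]+", text) if w]
print(words)   # ['god', 'god', 'god', 'god', 'self-evident', 'god']
```

A lone run of dashes would still slip through as a "word", so a real version would want some post-filtering; this just illustrates the bucketing.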

NN
Mar 20, 2021, 8:54:18 AM
On Saturday, 20 March 2021 at 12:16:12 UTC, Anton Ertl wrote:
> NN <novembe...@gmail.com> writes:
> >There was nothing in Ben's original post to say he was interested
> >in speed.
> Well, he compared Gforth speed with Python speed. But the final blog
> post that he linked too was much more focused on performance. And of
> course the unfavourable result for Forth made me wonder if we can get
> better results for Forth.
>

His final blog post did.

> So I looked if one of the native-code Forth systems could outperform
> Gforth on a more portable variant of this problem; this gave some
> insights into the portability of this program, and into performance of
> wordlist implementations on various Forth systems.
>
> Next I will look at whether I can get better speed out of Gforth.
> >What it shows is that this "simple" problem is difficult to solve in forth.
> I don't think so. Splitting the input into "words" is a very good
> match to Forth's parsing, and Forth's wordlists are a very good match
> for the counting part. What Forth does not have built-in is the
> lower-casing (pretty easy) and the sorting (more effort), but I leave
> that to shell tools when I need it.

I agree with most of what you have said except for the "wordlist" part,
which is why I was interested to see how many Forthers would have tried
to solve it that way. However, it seems that only three managed it: Ben,
yourself and Albert.

I think I will give this a try and see how far I get this weekend. Unfortunately
I don't have any punch card subroutines to carry in my back pocket, so I am
just going to have to wheel out my own :(

Coos Haak
Mar 20, 2021, 10:06:31 AM
On 20 Mar 2021 12:28:44 GMT, none wrote:

> I think the bible is case sensitive. There is a difference between
> "the god of the kaannites" and "Then God spoke to Abraham".

The New Testament was written in Greek and indeed has upper and
lower case letters.
The Old Testament was written in Hebrew and Aramaic, and
like Arabic (Koran) those scripts have no capitalisation.

groet, Coos

Anton Ertl
Mar 20, 2021, 11:41:40 AM
Ben Hoyt <ben...@gmail.com> writes:
>For what it's worth, here is the final comparison article: https://benhoyt.com/writings/count-words/

I started with your "optimize.fs" and made another attempt at
peeling off layers of (Gforth) abstractions in order to speed it up
(but making the result Gforth-specific in the process). This time I
was successful and you can find the result at
<http://www.complang.tuwien.ac.at/anton/count-unique-gforth.fs>

Before I go into what I did let's do a speed comparison:

cycles:u instructions:u
4,274,918,212 10,483,947,413 count-unique-hoyt-opt.fs
2,049,405,097 4,022,462,373 count-unique-gforth.fs

The command lines are:

~/gforth/gforth-fast count-unique-hoyt-opt.fs <count-unique.in >/dev/null
~/gforth/gforth-fast count-unique-gforth.fs -e "main bye" >/dev/null

So I see a speedup by a factor of 2.09. Assuming you see the same
speedup on your machine, this would give an optimized time of 0.7s,
quite a bit slower than C, C++, Go, and Rust, but faster than C#,
Python, Lua, JavaScript, and awk. Interestingly, I see no Java entry
on your page, so it seems that Java has followed Cobol to the place
where strictly professional languages linger that nobody programs in
for the joy or the challenge.

What I did is to look at what Gforth does at the primitive level when
it does FIND-NAME-IN and PARSE-NAME, and call these primitives without
all these layers in-between:

So I replaced

counts find-name-in dup if

with

2dup hashbits (hashkey2) [ counts wordlist-extend @ ]L xor
cells [ hashtable ]L + @ (hashlfind) ?dup-if

and

: process-string ( -- )
begin
parse-name dup
while
count-word
repeat
2drop ;

with

: process-string1 ( c-addr u -- )
begin
2dup (parse-white) dup while
2dup count-word + -rot + over -
repeat
2drop 2drop ;

The first replacement brings the cycle count down to 2.7Gc, the second
to 2Gc.

So where is the remaining time spent? In the more complicated
primitives (hashkey2) (hashlfind) (parse-white) or in the rest of the
Forth code?

perf record and perf report give the following distribution of cycles:

Overhead Command Shared Object Symbol
24.33% gforth-fast gforth-fast [.] hashlfind
12.12% gforth-fast gforth-fast [.] parse_white
9.02% gforth-fast libc-2.19.so [.] __memmove_ssse3_back
7.57% gforth-fast perf-7033.map [.] 0x00007f0698c926f4
6.35% gforth-fast gforth-fast [.] memcasecmp
6.30% gforth-fast gforth-fast [.] hashkey2a
5.28% gforth-fast gforth-fast [.] gforth_engine
1.02% gforth-fast perf-7033.map [.] 0x00007f0698c92764
(and a lot more with <1%).

So about 31% of the time is spent in (hashlfind) (it calls
memcasecmp), 6% in (hashkey2), and 12% in (parse-white), for a total
of 49%. It's unclear to me where the memmove() calls are coming from.
The perf-7033.map "shared object" is the dynamically generated Gforth
code, and most of the rest of the time is spent there.

So here we see a speedup by a factor of 2 from using undocumented
Gforth-internal words that might change with the next release.
Ideally the abstraction layer between these primitives and the
documented and standardized words would be much thinner. I expect to
see some progress in that direction with native-code compilation.

NN
Mar 20, 2021, 2:43:07 PM
My computer is a slow AMD E450.
This took a lot longer than 3.7s, but I expect it will be faster for smaller files.

For those who decide to try it, remember to take the --> markers out.

( -------------------------------------------------------------------------------- )
[defined] testing1 [if] testing1 [then] marker testing1

: eol? ( a u -- a u f ) dup 0= ;
: noteol? ( a u -- a u f ) dup 0> ;
: rch ( a u -- a u c ) over c@ ;
: nch ( a u -- a1 u1 ) 1 /string ;
: pch ( a u -- a1 u1 ) -1 /string ;

-->

: alpha ( c -- f )
dup >r 'A' 'Z' 1+ within if r> drop true exit then
r> 'a' 'z' 1+ within ;

: isword ( a u -- a1 u1 )
begin noteol? while rch alpha while nch repeat then ;

: isnotword ( a u -- a1 u1 )
begin noteol? while rch alpha 0= while nch repeat then ;

-->

: extract-word ( a1 u1 a2 u2 -- a3 u3 a2 u2 )
2dup 2>r nip - 2r> ;

: rd-word ( a u -- i*x n )
0 -rot begin noteol? while isnotword
2dup isword extract-word 2>r rot 1+ 2r> repeat 2drop ;

-->

: ucase ( c -- c )
dup 'a' 'z' 1+ within if $20 xor then ;

: word>upperc ( a u -- a u )
2dup bounds do i c@ ucase i c! loop ;

-->

: cn ( x y -- p ) 2 cells allocate throw dup >r 2! r> ;
: cn@ ( p -- x y ) 2@ ;
: hd ( p -- x ) cn@ drop ;
: tl ( p -- y ) cn@ nip ;
: !hd ( x p -- ) dup >r cn@ nip r> 2! ;
: !tl ( y p -- ) dup >r cn@ drop swap r> 2! ;

-->

: cp$ ( a u -- b ) dup 1+ allocate throw dup >r place r> ;
: null ( a -- f ) 0= ;
: eq ( x y -- f ) = ;
: eqstr ( x y -- f ) >r count r> count compare 0= ;
: caar ( a -- a1 ) hd hd ;
: assoc ( k al -- k ^x T | k F )
begin dup null if drop FALSE true
else 2dup caar eqstr if hd TRUE true
else tl false then then until ;
: pr1 ( a -- a1 ) dup tl swap hd
dup tl swap hd cr count type space . ;
: pr ( a -- )
begin dup null 0= while pr1 repeat drop ;

-->

variable alist

: check? ( i*x n -- )
0 ?do
eol? if 2drop else
word>upperc cp$ alist @ assoc if
dup tl 1+ swap !tl free throw
else 1 cn alist @ cn alist ! then
then loop ;

-->

variable fname$ 252 allot
s" kjvbible.txt" fname$ place

variable fid1
create tmp11 256 chars allot

: rdfile ( -- )
fname$ count r/o open-file throw fid1 !
begin tmp11 1+ 256 3 - fid1 @ read-line throw while
tmp11 c!
tmp11 count rd-word check?
repeat
fid1 @ close-file throw ;


rdfile alist @ pr

( -------------------------------------------------------------------------------- )


I didn't think I would get this far, and I am going to enjoy the moment.

The sort routine will take a while to write, so I left it at this,
which is about the same point Ben left his at in his original post.

I am sure it can be improved, but I am going to leave that for another time.
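For readers following along: CHECK? above keeps the counts in a cons-cell association list rather than in a wordlist. The same data structure transliterated into Python (a sketch for comparison, not a translation of the exact code) looks like this:

```python
def bump(alist, word):
    """Walk the association list; increment the matching cell in place,
    or cons a fresh [word, 1] cell onto the front - the ASSOC / CN
    split in the Forth version above."""
    node = alist
    while node is not None:
        pair, node = node          # each node is [pair, rest]
        if pair[0] == word:
            pair[1] += 1           # found: bump the count in place
            return alist
    return [[word, 1], alist]      # not found: cons onto the front

alist = None
for w in "the foozle bar the bar bar bar bar".lower().split():
    alist = bump(alist, w)

# Walk the list for display, like PR does.
node = alist
while node is not None:
    (word, n), node = node
    print(word, n)                 # counts: bar 5, the 2, foozle 1
```

Lookup is O(length of the list) per word, which is why this took NN so much longer than the wordlist (hash table) versions on a large file.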



none albert
Mar 21, 2021, 8:25:02 AM
In article <wa2t879qklrn.16...@40tude.net>,
The earliest manuscripts of the New Testament were
WRITTENINALLCAPSWITHOUTPUNCTUATIONNOTEVENSPACESITWAS\
ANINTERESTINGPUZZLETOFINDOUTWHERETHENEXTSENTENCESTARTSHENCE\
THEFIRSTTHINGTHEOLOGIANSDIDWASENTERINGNUMBERSTOIDENTIFY\
CHAPTERSANDVERSES

The frequency count problem would be much more interesting with
those manuscripts; more of an AI project.
Also, they did not have spelling checkers at the time, because
there was not even an agreed-upon spelling.
The KJV authors did not go back to original texts like modern
scholars would do. Nothing before the Septuagint, which is all
Greek, would be studied.

(I have been wasting a lot of time lately with YouTube films
about the subject, by i.a. Prof. Bart Ehrman.)

>
>groet, Coos

Groetjes Albert

Brian Fox
Mar 21, 2021, 11:01:47 AM
I don't think it's a waste of time. It's archaeology in some fashion.
Bart Ehrman is great for the newer texts. For some interesting
commentary on the time period just before and leading up to it,
Lawrence Schiffman has some lectures on YouTube on the Dead Sea Scrolls
and what they tell us about that time regarding the different Jewish
groups, including that upstart group that followed the most famous rabbi
of that time.
:)
His lecture at the Vatican was a good one.

Apologies if I sent you down another rabbit hole.

We now return to our regularly scheduled program.



NN
Mar 21, 2021, 11:24:01 AM