word count

Hugh Aguilar

Jan 1, 2022, 2:38:13 PM
Recently code for a word-count program was posted in Python
here on comp.lang.forth.
https://groups.google.com/g/comp.lang.forth/c/jw6yU_yI15E
This was said:
On Thursday, December 30, 2021 at 1:12:23 AM UTC-7, dxforth wrote:
> On 30/12/2021 14:59, Hugh Aguilar wrote:
> > On Tuesday, December 28, 2021 at 8:07:01 PM UTC-7, dxforth wrote:
> >> On 29/12/2021 01:44, Andy Valencia wrote:
> >> > ...
> >> > I could write this in Forth, but I don't want to. It would be tedious
> >> > and take a LOT more time than these four lines of Python.
> >
> > Andy Valencia doesn't want to do this program in Forth because he doesn't know
> > Forth very well and hence even simple programs seem tedious and difficult to him.
> > A program such as this is trivial given my novice-package which is ANS-Forth.
> > ASSOCIATION.4TH can be used to collect all of the distinct words. Each node's key
> > would be the string and it would have one datum tacked on which would be the count.
> > STRING-STACK.4TH can be used to parse out the words from the text.
> > STRING-STACK.4TH has pretty good pattern-matching and sub-string extraction
> > capability, so it could use a much more sophisticated definition of what a word is
> > than just a blank-delimited string --- no word-count program has been so crude as to
> > use blank-delimited strings since the early 1970s --- that is kindergarten-level programming.
> >
> >> "We choose to go to the Moon [...] and do the other things, not because
> >> they are easy, but because they are hard; because that goal will serve
> >> to organize and measure the best of our energies and skills"
> >>
> >> The de-skilling of people that has occurred over the last 40 years because
> >> subsequent govts chose the easy way, has made us all dumb and vulnerable.
> >
> > DXforth says this as if he had skill --- but he doesn't --- he is all talk and no programming
> > beyond kindergarten-level programming such as his END macro.
> >
> END - easy to remember, easy to use and no shortage of ELSE to eliminate.
> It doesn't get better than that. It'll do me as an epitaph.

I was patient with DXforth (the idiot previously known as HAA) and his
kindergarten-level Forth programming. I even included his much-hyped
END macro in the novice-package:
-----------------------------------------------------------------------------------------------
\ END was HAA's idea:
\ https://groups.google.com/forum/#!topic/comp.lang.forth/MdGWDMEbKIA
macro: end ( -- )
exit then ;

macro: ?exit ( -- ) \ like END but even less useful
if exit then ;
-----------------------------------------------------------------------------------------------

Of course, DXforth doesn't have MACRO: because that kind of programming
is way beyond his skill-level, so he would have to write END like this:
-----------------------------------------------------------------------------------------------
: end ( -- ) postpone exit postpone then ; immediate
-----------------------------------------------------------------------------------------------

I am not patient with DXforth anymore because he attacked me by
using the term "disambiguifier" for some idiotic nonsense that he was
blathering about. He was mocking me and my disambiguifiers (that are
required for MACRO: to work) --- this insult will not be forgiven.

Anyway, getting back to the subject of a word-count program,
this is trivial for me to write given the novice package. I post this challenge
now though, that the Forth experts of comp.lang.forth should attempt this.

The first part of the challenge is to parse distinct "words" out of the
text stream. This is the hard part. Counting distinct words is trivial
given ASSOCIATION.4TH or something similar.
Here is my definition of a "word" (I just thought this up off the top
of my head; the definition may need some revision).
Punctuation is defined as one of: . , : ; ! ? ( )
Punctuation delimits words, as does whitespace.
The ' is an apostrophe. The 's is removed and only the prefix is used.
Any word with an @ in it is assumed to be an email address and is
left as is (not broken apart on the dot character).
Numbers are left as is (not broken apart on the dot character).
I'll worry about the hyphen later --- what I have above is enough for now.
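
A minimal sketch of the rules above in Python (the language of the program that started this thread). It is not the novice-package approach, only one reading of the definition, with several assumptions: the names `words` and `count_words` are hypothetical, a "number" is taken to be an optionally signed run of digits with at most one dot, punctuation at the edges of a whitespace-delimited chunk is stripped before testing for emails and numbers, and no case folding is done because the definition does not ask for it.

```python
import re
from collections import Counter

PUNCTUATION = ".,:;!?()"
# Assumption: a number is an optionally signed digit run with at most one dot.
NUMBER = re.compile(r"[+-]?\d+(\.\d+)?")
SPLIT = re.compile(r"[.,:;!?()]+")

def words(text):
    """Yield words from text under the rules listed above."""
    for chunk in text.split():                 # whitespace delimits words
        chunk = chunk.strip(PUNCTUATION)       # drop punctuation at the edges
        if not chunk:
            continue
        if "@" in chunk or NUMBER.fullmatch(chunk):
            yield chunk                        # emails and numbers are left as is
            continue
        for piece in SPLIT.split(chunk):       # punctuation delimits words
            if piece.endswith("'s"):
                piece = piece[:-2]             # keep only the prefix before 's
            if piece:
                yield piece

def count_words(text):
    """Map each distinct word to the number of times it occurs."""
    return Counter(words(text))
```

Calling `count_words` on a string returns a `Counter`, so the number of distinct words is its `len()` and per-word totals come from ordinary dictionary lookup.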

Testing chars for punctuation can be done by DXforth using his END
macro. He would have a colon word with a series of IF ... END statements
testing for each character. This is not how I will do it though
(I have my <SWITCH ... FAST-SWITCH> construct).
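
To illustrate the two styles of punctuation test contrasted here, a small Python sketch follows. It makes no claim about how the switch construct mentioned above is implemented; the function names are hypothetical, and the table version assumes character codes are non-negative integers.

```python
def is_punct_chain(code: int) -> bool:
    """Test one candidate at a time, leaving early on the first hit."""
    if code == ord("."):
        return True
    if code == ord(","):
        return True
    if code == ord(":"):
        return True
    if code == ord(";"):
        return True
    if code == ord("!"):
        return True
    if code == ord("?"):
        return True
    if code == ord("("):
        return True
    if code == ord(")"):
        return True
    return False

# Table-driven alternative: build a 256-entry table once, then each test
# is a bounds check plus a single indexed lookup.
PUNCT_TABLE = [False] * 256
for ch in ".,:;!?()":
    PUNCT_TABLE[ord(ch)] = True

def is_punct_table(code: int) -> bool:
    """Look the character code up in the precomputed table."""
    return 0 <= code < 256 and PUNCT_TABLE[code]
```

Both functions give the same answers; the chain costs up to eight comparisons per character, while the table costs the same for every character.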

So, comp.lang.forth experts --- prove that you haven't been de-skilled!

dxforth

Jan 1, 2022, 9:03:01 PM
On 2/01/2022 06:38, Hugh Aguilar wrote:
>
> I am not patient with DXforth anymore because he attacked me by
> using the term "disambiguifier" for some idiotic nonsense that he was
> blathering about. He was mocking me and my disambiguifiers (that are
> required for MACRO: to work) --- this insult will not be forgiven.
>
> Anyway, getting back to the subject of a word-count program,
> this is trivial for me to write given the novice package. I post this challenge
> now though, that the Forth experts of comp.lang.forth should attempt this.

I'd expect a Forth Grand Master to be writing applications and compilers.
Issuing duels and peddling novice packs brings them down to my level :)

NN

Jan 2, 2022, 9:16:34 AM
What do you intend to use for the text stream ?

/usr/share/dict/words ?

Hugh Aguilar

Jan 2, 2022, 9:03:04 PM
Any English-language text file should be fine, as that would have punctuation in it.
I don't care.

I would like to see anybody on comp.lang.forth write any Forth --- a "Hello World" program
would be a step up from this endless debating about recognizers or other nonsense.
A word-count program is a super-easy challenge. I remember telling Gavino and
Elizabeth Rather to write some Forth code, but neither of them ever did.
Now we have DXforth endlessly whining about how any kind of language standard
or code-library puts his awesome creativity in a box, although he can't write even the
most simple programs --- he just wants to drag everybody down to his level, with his
endless stream of snarky comments and put-downs.
Stephen Pelc says that my disambiguifiers don't work, although he doesn't have either
SYNONYM or an early-binding MACRO: which both depend upon the disambiguifiers.
Stephen Pelc says that anybody can write a better string-stack than what I have,
although he has nothing to show, and this has been about 30 years now.

A lack of programming is why nobody considers Forth "programmers" to be programmers.
Write some Forth code! Anything!

dxforth

Jan 2, 2022, 11:28:50 PM
On 3/01/2022 13:03, Hugh Aguilar wrote:
>
> Now we have DXforth endlessly whining about how any kind of language standard
> or code-library puts his awesome creativity in a box, although he can't write even the
> most simple programs --- he just wants to drag everybody down to his level, with his
> endless stream of snarky comments and put-downs.

So post your word count program without condition and full source so it can be
tested. Let folks decide for themselves if that's a system they can use. Let
them do better if they can.





Stephen Pelc

Jan 3, 2022, 8:27:11 AM
On 3 Jan 2022 at 03:03:03 CET, Mr Angry, "Hugh Aguilar"
<hughag...@gmail.com> wrote:

> Stephen Pelc says that my disambiguifiers don't work, although he doesn't have either
> SYNONYM or an early-binding MACRO: which both depend upon the disambiguifiers.
> Stephen Pelc says that anybody can write a better string-stack than what I have,
> although he has nothing to show, and this has been about 30 years now.

Lies, lies, lies.

Stephen
--
Stephen Pelc, ste...@vfxforth.com
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, +44 (0)78 0390 3612, +34 649 662 974
http://www.mpeforth.com - free VFX Forth downloads

Hugh Aguilar

Jan 3, 2022, 7:23:34 PM
On Monday, January 3, 2022 at 6:27:11 AM UTC-7, Stephen Pelc wrote:
> On 3 Jan 2022 at 03:03:03 CET, Mr Angry, "Hugh Aguilar"
> <hughag...@gmail.com> wrote:
>
> > Stephen Pelc says that my disambiguifiers don't work, although he doesn't have either
> > SYNONYM or an early-binding MACRO: which both depend upon the disambiguifiers.
> > Stephen Pelc says that anybody can write a better string-stack than what I have,
> > although he has nothing to show, and this has been about 30 years now.
> Lies, lies, lies.
>
> Stephen

I have already shown how the liar Stephen Pelc lied about my disambiguifiers:
https://groups.google.com/g/comp.lang.forth/c/T-yYkpVwYew
Stephen Pelc is a failure at ANS-Forth programming. He doesn't understand
how to fix FIND and tick etc. in ANS-Forth so they aren't ambiguous.
If he doesn't know how FIND works, then what does he know???
If he is unable to program in ANS-Forth at the most basic level, then why
is he qualified to be the chair-person of Forth-200x???

Here the liar Stephen Pelc says that some anonymous African wrote a better
string-stack than mine 30 years ago, but he doesn't have any source-code.
Most likely that was just a warmed over version of Wil Baden's crap code.
Nobody other than myself has ever done this with COW (copy-on-write).
All other string stacks are slow because they move the entire strings around
during stack-juggling of the string-stack, and they have severe restrictions
on how big the strings are and how many strings are supported.
Also, I have a lot of pattern-matching and substring-extraction code that
all of those other string-stack implementations lack --- they are useless.
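
For readers unfamiliar with the term, here is a rough Python sketch of what copy-on-write means for a string stack. It is not STRING-STACK.4TH and makes no claim about how that code works; the class and method names are hypothetical, and the `copies` counter exists only to make the behavior visible.

```python
class Shared:
    """A string buffer plus a count of the stack slots that reference it."""
    def __init__(self, data):
        self.data = bytearray(data)   # private copy of the given bytes
        self.refs = 1

class StringStack:
    def __init__(self):
        self.slots = []
        self.copies = 0               # buffer copies made, for illustration only

    def push(self, data: bytes):
        self.slots.append(Shared(data))

    def dup(self):
        buf = self.slots[-1]
        buf.refs += 1                 # share the buffer; no bytes are copied
        self.slots.append(buf)

    def swap(self):
        # Stack juggling moves references, never the string contents.
        self.slots[-1], self.slots[-2] = self.slots[-2], self.slots[-1]

    def drop(self):
        self.slots.pop().refs -= 1

    def append(self, data: bytes):
        """Extend the top string, copying it first only if it is shared."""
        buf = self.slots[-1]
        if buf.refs > 1:              # copy-on-write: someone else sees this buffer
            buf.refs -= 1
            buf = Shared(buf.data)
            self.slots[-1] = buf
            self.copies += 1
        buf.data += data

    def top(self) -> bytes:
        return bytes(self.slots[-1].data)
```

In this sketch `dup` and `swap` cost the same regardless of string length, and a copy is made only when a shared string is about to be modified.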

On Sunday, March 29, 2020 at 5:27:53 PM UTC-7, hughag...@gmail.com wrote:
> On Tuesday, June 25, 2019 at 3:35:43 PM UTC-7, Stephen Pelc wrote:
> > On Tue, 25 Jun 2019 06:39:51 -0700 (PDT), hughag...@gmail.com
> > wrote:
> >
> > >I don't know what they have in Africa, but I doubt that it is
> > >any better than America, Europe, Russia, etc..
> > >I stand by what I said.
> > >There is nothing comparable in quality on any continent.
> >
> > You are just demonstrating your ignorance and your failure to
> > accept that anyone else can do better than you.
> >
> > Stephen
>
> Stephen Pelc doesn't have any working code, and I don't think he is
> capable of implementing a string-stack because he doesn't know
> what COW (copy-on-write) is.
> He has done nothing himself, but he says that anyone can do better than me.

Ron AARON

Jan 6, 2022, 12:35:41 AM
On 01/01/2022 21:38, Hugh Aguilar wrote:

> Anyway, getting back to the subject of a word-count program,
> this is trivial for me to write given the novice package. I post this challenge
> now though, that the Forth experts of comp.lang.forth should attempt this.
>
> The first part of the challenge is to parse distinct "words" out of the
> text stream. This is the hard part. Counting distinct words is trivial
> given ASSOCIATION.4TH or something similar.
> Here is my definition of a "word" (I just thought this up off the top
> of my head; the definition may need some revision).
> Punctuation is defined as one of: . , : ; ! ? ( )
> Punctuation delimits words, as does whitespace.
> The ' is an apostrophe. The 's is removed and only the prefix is used.
> Any word with an @ in it is assumed to be an email address and is
> left as is (not broken apart on the dot character).
> Numbers are left as is (not broken apart on the dot character).
> I'll worry about the hyphen later --- what I have above is enough for now.
>
> Testing chars for punctuation can be done by DXforth using his END
> macro. He would have a colon word with a series of IF ... END statements
> testing for each character. This is not how I will do it though
> (I have my <SWITCH ... FAST-SWITCH> construct).
>
> So, comp.lang.forth experts --- prove that you haven't been de-skilled!

Here's a version in 8th which fulfills all your requirements, AFAICT:

\ Answer to Hugh's "word count challenge". Give the name of the file to
\ 'count' on the command-line.

: remove-empties
( /^\s*$/ r:match not nip ) a:filter ;

: isnumber? \ s -- s T
/^[+-]?[0-9]+\.?[0-9]*/ r:match swap
r:str nip swap ;

: isemail? \ s -- s T
"@" s:search null? not nip ;

\ read the requested file into a big string:
0 args f:slurp >s

\ split the string by whitespace
/\s+/ s:/

\ remove empty or blank lines
remove-empties

\ collapse the array to a dictionary, split out punctuation except
\ numbers and emails, and make "Cat" and "cat" count as one word
: process-array \ m a -- m
  \ iterate each item in the array
  (
    s:lc
    isnumber? not if
      isemail? not if
        \ not a special case, so remove punctuation and split again
        /[.,:;!?()]+/ " " s:replace! /\s+/ s:/
        remove-empties
        \ if more than one word, process the array again
        a:len 1 n:> if recurse ;then
        \ single word, remove 's at end of word
        a:open /'s$/ "" s:replace
      then
    then
    \ stick this word in the map
    true m:!
  ) a:each! drop ;

m:new swap process-array
\ count how many words there are
m:len . cr bye

Ron AARON

Jan 6, 2022, 12:43:43 AM
Correction: "recurse" should be "process-array".

dxforth

Jan 6, 2022, 1:23:45 AM
On 6/01/2022 16:35, Ron AARON wrote:
> On 01/01/2022 21:38, Hugh Aguilar wrote:
>
>> Anyway, getting back to the subject of a word-count program,
>> this is trivial for me to write given the novice package. I post this challenge
>> now though, that the Forth experts of comp.lang.forth should attempt this.
>>
>> The first part of the challenge is to parse distinct "words" out of the
>> text stream. This is the hard part. Counting distinct words is trivial
>> given ASSOCIATION.4TH or something similar.
>> Here is my definition of a "word" (I just thought this up off the top
>> of my head; the definition may need some revision).
>> Punctuation is defined as one of: . , : ; ! ? ( )
>> Punctuation delimits words, as does whitespace.
>> The ' is an apostrophe. The 's is removed and only the prefix is used.
>> Any word with an @ in it is assumed to be an email address and is
>> left as is (not broken apart on the dot character).
>> Numbers are left as is (not broken apart on the dot character).
>> I'll worry about the hyphen later --- what I have above is enough for now.
>>
>> Testing chars for punctuation can be done by DXforth using his END
>> macro. He would have a colon word with a series of IF ... END statements
>> testing for each character. This is not how I will do it though
>> (I have my <SWITCH ... FAST-SWITCH> construct).
>>
>> So, comp.lang.forth experts --- prove that you haven't been de-skilled!
>
> Here's a version in 8th which fulfills all your requirements, AFAICT:
> ...

He forgot to mention the snakes

https://youtu.be/ClwIj3x24Q4?t=18

But you already knew that before accepting his challenge.

Ron AARON

Jan 6, 2022, 1:29:04 AM
I did, but I've grown immune...

Jali Heinonen

Jan 6, 2022, 2:23:08 AM
Here is my word count program in 8th. It takes a somewhat low-level approach using a scanner, but as an added benefit it gives you the line and column numbers of bad input. It could easily be modified to count words, emails and numbers separately. Emails and numbers could be handled properly using a state machine. Testing for alphabetic characters is a bit ugly, as it requires a special-chars array for non-ASCII input (you may need to extend it, as I added just enough to handle my dictionary).
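
The idea of tracking line and column while scanning, so that bad input can be reported by position, can be sketched in a few lines of Python. This is only an illustration of the technique, not a translation of the 8th code in this post; the names `scan`, `default_legal` and `first_illegal` and the particular set of legal characters are assumptions.

```python
def scan(text):
    """Yield (line, column, char) for every character, both 1-based."""
    line, column = 1, 0
    for ch in text:
        column += 1
        yield line, column, ch
        if ch == "\n":                # the next character starts a new line
            line, column = line + 1, 0

def default_legal(ch):
    """Assumed rule: letters, digits, whitespace and a few symbols are legal."""
    return ch.isalnum() or ch.isspace() or ch in ".,:;!?()'@"

def first_illegal(text, legal=default_legal):
    """Return (line, column, char) of the first illegal character, or None."""
    for line, column, ch in scan(text):
        if not legal(ch):
            return line, column, ch
    return None
```

A caller can format the returned tuple into a message such as "Line:2 Column:5, Illegal character: #".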


needs file/getc

ns?

ns: scanner


-1 constant EOF
10 constant LF
32 constant SPACE

0 constant .SYM
1 constant ,SYM
2 constant :SYM
3 constant ;SYM
4 constant !SYM
5 constant ?SYM
6 constant (SYM
7 constant )SYM
8 constant WORDSYM

[ 243, 228, 246, 229, 196, 214, 197, 252, 225,
226, 241, 231, 232, 233, 234, 237, 244, 251 ] constant special-chars

: >char \ n -- char
"" swap s:+ ;

: new \ file -- scanner
m:new
"file" rot m:!
"column" 0 m:!
"line" 1 m:!
"char" SPACE m:!
"word" "" m:! ;

: line@ \ scanner -- scanner line
"line" m:@ ;

: column@ \ scanner -- scanner column
"column" m:@ ;

: char@ \ scanner -- scanner char
"char" m:@ ;

: word@ \ scanner -- scanner string
"word" m:@ ;

: file@ \ scanner -- scanner file
"file" m:@ ;

: accept \ n a -- n T
( over n:= ) a:map
' or false a:reduce ;

: digit? \ n -- bool
'0 '9 n:between ;

: alpha? \ n -- bool
dup >r 'a 'z n:between r@ 'A 'Z n:between or
r> special-chars accept nip or ;

: read-char \ scanner -- scanner
char@
LF n:= if
line@ n:1+ "line" m:_!
"column" 0 m:!
then

"file" m:@
f:getc swap -rot "char" m:_!
swap
f:eof? nip if
"char" EOF m:!
then

char@ EOF n:= not if
column@
n:1+ "column" m:_!
then ;

[ 39, 64 ] constant accept-word
[ 39, 46 ] constant accept-email

' accept-word deferred: accept-table

: read-word \ scanner -- scanner
"word" "" m:!
repeat
char@ swap word@ rot s:+ "word" m:_!
read-char

char@ dup >r alpha? r@ accept-table accept nip or not r> EOF n:= or if
' accept-word w:is accept-table
break
else
char@ '@ n:= if
' accept-email w:is accept-table
then
then
again
word@ s:len 2 n:> if
-2 s:/ 1 a:@ "'s" s:= if
0 a:@ nip "word" m:_!
else
drop
then
else
drop
then ;

: read-number \ scanner -- scanner
"word" "" m:!
repeat
char@ swap word@ rot s:+ "word" m:_!
read-char
char@ dup >r digit? r@ '. n:= or not r> EOF n:= or if
break
then
again ;

[ ( char@ EOF n:= ), ( EOF ),
( char@ '. n:= ), ( read-char .SYM ),
( char@ ', n:= ), ( read-char ,SYM ),
( char@ ': n:= ), ( read-char :SYM ),
( char@ '; n:= ), ( read-char ;SYM ),
( char@ '! n:= ), ( read-char !SYM ),
( char@ '? n:= ), ( read-char ?SYM ),
( char@ 40 n:= ), ( read-char 6 ),
( char@ 41 n:= ), ( read-char 7 ),
( char@ digit? ), ( read-number WORDSYM ),
( char@ alpha? ), ( read-word WORDSYM ),
( char@ >char swap column@ swap line@ nip "Line:%d Column:%d, Illegal character: %s" s:strfmt throw )
] var, symbols

: get-token \ scanner -- scanner token
\ skip white space
repeat
char@ EOF n:= not swap char@ SPACE n:> not rot and if
read-char
else
break
then
again

symbols @ a:when ;


ns


a:new constant words

[ ' noop ,
' noop ,
' noop ,
' noop ,
' noop ,
' noop ,
' noop ,
' noop ,
( scanner:word@ words swap a:push drop )
] var, possible-tokens


: app:main
0 args "/usr/share/dict/words" ?: f:open-ro scanner:new

repeat
scanner:get-token
dup scanner:EOF n:= if
2drop
break
else
possible-tokens @ case
then
again

\ word count
words ' noop a:group m:len nip . cr
bye ;

Jali Heinonen

Jan 6, 2022, 10:14:44 AM
Seems like the source code formatting was lost... I made my word count program more modular and added a validator for numbers, so it now properly parses numbers and supports E notation.

Source available here: https://www.dropbox.com/s/ettg5bra03m4y9w/wc.zip?dl=0

Robert L.

Feb 7, 2022, 11:07:16 AM
On 1/1/2022, Hugh Aguilar wrote:

> The first part of the challenge is to parse distinct "words" out of the
> text stream. This is the hard part. Counting distinct words is trivial
> given ASSOCIATION.4TH or something similar.
> Here is my definition of a "word" (I just thought this up off the top
> of my head; the definition may need some revision).
> Punctuation is defined as one of: . , : ; ! ? ( )
> Punctuation delimits words, as does whitespace.
> The ' is an apostrophe. The 's is removed and only the prefix is used.
> Any word with an @ in it is assumed to be an email address and is
> left as is (not broken apart on the dot character).
> Numbers are left as is (not broken apart on the dot character).
> I'll worry about the hyphen later --- what I have above is enough for now.

( SP-Forth )

\ Hash tables.
REQUIRE new-hash ~pinka/lib/hash-table.f
\ OFF and ON
REQUIRE OFF lib/ext/onoff.f
\ .R
REQUIRE .R lib/include/ansi.f
\ Ignore case.
REQUIRE CASE-INS lib/ext/caseins.f

8192 new-hash value word-table

variable in-word
variable allow-dot
variable dot-count
variable char-buf
create word-space 4096 allot
variable word-idx
0 value f-handle

: incr-count ( )
word-space word-idx @ 2dup
word-table HASH@N
( If entry not found, default to 0. )
0= if 0 then
1 + -rot
word-table HASH!N ;

( Remove "'s" from end of word. )
: trim-word
word-idx @ 2 >
if
word-space word-idx @ dup 2 - /string
s" 's" compare 0=
if
-2 word-idx +!
then
then ;

: dot? ( char -- bool ) [char] . = ;

: char-in-string? ( char adr len -- bool )
rot char-buf c!
char-buf 1 search
nip nip ;

: punctuation? ( char -- bool )
s" .,:;!?()" char-in-string? ;

: numeral? ( char -- bool )
s" 0123456789" char-in-string? ;

: consider-punctuation ( char -- bool )
dot? allow-dot @ dot-count @ 1 < and and
if
allow-dot off
1 dot-count +!
-1
else
0
then ;

: word-char? ( char -- bool )
dup 33 <
if drop 0
else
dup punctuation?
if
consider-punctuation
else
drop -1
then
then ;

: read-char char-buf 1 f-handle read-file throw 0= throw
char-buf c@ ;

: start-new-word in-word on 0 word-idx ! 0 dot-count ! ;

: append-char ( char -- )
word-space word-idx @ + c!
1 word-idx +! ;

: process-word-char ( char -- )
in-word @ 0= if start-new-word then
dup [char] @ = over numeral? or
if allow-dot on then
append-char ;

: process-word
word-idx @
if
trim-word
\ word-space word-idx @ type cr
incr-count
0 word-idx !
then ;

: process-non-word-char ( char -- )
drop
in-word @
if
process-word
in-word off allow-dot off
then ;

: parse-file
begin
read-char dup
word-char?
if
process-word-char
else
process-non-word-char
then
again ;

s" foo.txt" r/o open-file throw to f-handle
' parse-file catch drop
f-handle close-file throw
process-word

:noname 5 .r space type cr ; word-table all-hash
word-table del-hash


If the file contains:

foo; b...@zip.org.bar.
foo! 9.725.bar.
(that's not foo's fault)

then the output is:

1 not
2 bar
1 b...@zip.org
1 fault
1 that
1 9.725
3 foo

--
archive.org/details/nolies

Robert L.

Feb 12, 2022, 11:05:40 AM
Hugh Aguilar wrote:

> The first part of the challenge is to parse distinct "words" out of the
> text stream. This is the hard part. Counting distinct words is trivial
> given ASSOCIATION.4TH or something similar.
> Here is my definition of a "word" (I just thought this up off the top
> of my head; the definition may need some revision).
> Punctuation is defined as one of: . , : ; ! ? ( )
> Punctuation delimits words, as does whitespace.
> The ' is an apostrophe. The 's is removed and only the prefix is used.
> Any word with an @ in it is assumed to be an email address and is
> left as is (not broken apart on the dot character).
> Numbers are left as is (not broken apart on the dot character).
> I'll worry about the hyphen later --- what I have above is enough for now.
