
RfD Recognizer, 3rd version


Matthias Trute

Dec 1, 2015, 2:32:26 PM
Based on the feedback, there are again some
changes.

First: I think I've identified the building blocks
that haven't (really) changed over time.

The design of a recognizer as a combination of
a parsing word and 3 data-handling methods
associated with the data is one of them.
Similarly, the idea of grouping recognizers into
stacks is now "carved in stone".

What has changed here over time? The POSTPONE
action turned into one that literally follows
the existing spec. It compiles all data necessary to
append the compilation action into the dictionary for
later use. Part of this is generic, part is data
dependent. The words that deal with the recognizer
stacks now take a stack-identifier as an additional
parameter. For system words, a common anchor
FORTH-RECOGNIZER is introduced as a VALUE.
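To make these building blocks concrete, here is a sketch (not part of the RfD) of a minimal single-cell number recognizer; the names R:NUM and REC:NUM, the use of :NONAME for the three actions, and the exact postpone idiom are my own assumptions:

```forth
\ Sketch only: a number recognizer built from the v3 words.
\ R:NUM and REC:NUM are invented names; no sign or base prefix handling.
:NONAME ( n -- n ) ;                    \ interpret: leave the number
:NONAME ( n -- ) POSTPONE LITERAL ;     \ compile: append n as a literal
:NONAME ( n -- ) POSTPONE LITERAL ['] LITERAL COMPILE, ;
                                        \ postpone: literal now, compiler later
RECOGNIZER: R:NUM                       \ the information token for numbers

: REC:NUM ( addr len -- n R:NUM | R:FAIL )
  0. 2SWAP >NUMBER NIP                  \ try to convert the whole string
  IF 2DROP R:FAIL ELSE D>S R:NUM THEN ;
```

Placed on a recognizer stack (e.g. via SET-RECOGNIZERS), REC:NUM would then take part in text interpretation.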

Second: I removed all use cases. I got many
complaints about changing too much (interpreter)
or too little (search order). In fact most
remarks were about such use cases. It is great to
have a tool with such a wide range of possible
uses, but they are not really part of this game.
With the current spec I want to achieve two goals:
Keep the early adopters happy (enough) and invite
all others to at least give recognizers a try
without fearing that their systems will get
conquered.

Third: The rationale section covers the design
decisions and why they were taken, including
alternatives. It's more or less an excerpt
of 5 years of work and experience. Some of the
use cases went into it for inspiration.

The RFD documents are available via
http://www.forth200x.org/ More information
including some sources is at
http://amforth.sf.net/Recognizers.html

Thanks to all who gave feedback and more. I highly
appreciate your work; if I forgot someone, please
contact me.

--------- Core part of the RFD v3 -----------

XY.6 Glossary

XY.6.1 Recognizer Words

DO-RECOGNIZER ( addr len stack-id -- i*x R:TABLE | R:FAIL )
RECOGNIZER
Apply the string at "addr/len" to the elements of the
recognizer stack identified by stack-id. Terminate the
iteration when a recognizer returns an information
token different from R:FAIL. If the stack is exhausted
without such a token, return R:FAIL.

"i*x" is the result of the parsing word. It represents
the data from the string. It may be located elsewhere
than on the data stack. In that case the stack diagram
should be read accordingly.

FORTH-RECOGNIZER ( -- stack-id ) RECOGNIZER
A system VALUE with a recognizer stack id.

It is a VALUE that can be changed using TO to assign a
new recognizer stack id. The change has immediate effect.

The recognizer stack from this stack-id shall be used in
all system-level words such as EVALUATE, LOAD, etc.

GET-RECOGNIZERS ( stack-id -- rec-n .. rec-1 n ) RECOGNIZER
Return the execution tokens rec-1 .. rec-n of the
parsing words in the recognizer stack identified by
stack-id. rec-1 identifies the recognizer that is called
first and rec-n the one that is called last.

The recognizer stack is left unchanged.

R>COMP ( R:TABLE -- XT-COMPILE ) RECOGNIZER
Return the execution token for the compilation action
from the recognizer information token.

R>INT ( R:TABLE -- XT-INTERPRET ) RECOGNIZER
Return the execution token for the interpretation action
from the recognizer information token.

R>POST ( R:TABLE -- XT-POSTPONE ) RECOGNIZER
Return the execution token for the postpone action from
the recognizer information token.

R:FAIL ( -- R:FAIL ) RECOGNIZER
An information token with two uses: first, it delivers
the information that a specific recognizer could not
deal with the string passed to it. Second, it is a
predefined information token whose elements are used
when no recognizer on the recognizer stack could
handle the passed string. These methods provide the
system error actions.

The actual numeric value is system dependent.

RECOGNIZER ( size -- stack-id ) RECOGNIZER
Create a new recognizer stack with size elements.

RECOGNIZER: ( XT-INTERPRET XT-COMPILE XT-POSTPONE
"<spaces>name" -- ) RECOGNIZER
Skip leading space delimiters. Parse name delimited by a
space. Create a recognizer information token "name" with
the three execution tokens.

The words for XT-INTERPRET, XT-COMPILE and XT-POSTPONE
are called with the parsed data that the associated
parsing word of the recognizer returned. The information
token itself is consumed by the caller.

Each of the words XT-INTERPRET, XT-COMPILE and
XT-POSTPONE has the stack effect ( ... i*x -- j*y ). The
words to compile and postpone the data shall consume the
data "i*x". If the data "i*x" resides in other locations
(e.g. floating-point numbers), these words shall use
that data accordingly.

SET-RECOGNIZERS ( rec-n .. rec-1 n stack-id -- ) RECOGNIZER
Set the recognizer stack identified by stack-id to the
recognizers identified by the execution tokens of their
parsing words rec-n .. rec-1. rec-1 will be the parsing
word of the recognizer that is called first, rec-n will
be the last one.

If the size of the existing recognizer stack is too
small to hold all new elements, an ambiguous condition
exists.
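Taken together, these words allow a sketch like the following; MY-RECS is a made-up name, and REC:FIND / REC:NUM stand for a dictionary-search and a number recognizer whose definitions are not shown:

```forth
\ Sketch only: build a private recognizer stack and install it.
2 RECOGNIZER CONSTANT MY-RECS                   \ room for two entries
' REC:NUM ' REC:FIND 2 MY-RECS SET-RECOGNIZERS  \ REC:FIND is tried first
MY-RECS TO FORTH-RECOGNIZER                     \ takes effect immediately
```

Since FORTH-RECOGNIZER is a VALUE, the old stack-id can be saved beforehand and restored with TO to undo the change.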

--------------

Bernd Paysan

Dec 1, 2015, 8:40:23 PM
Matthias Trute wrote:
> First: I think I've identified the building blocks
> that haven't (really) changed over the time.

One open question here is, since the stack-id (or array-id, as there's
actually no push/pop for individual entries) is not bound to recognizers,
whether we should rename it to reflect the general usefulness, and add a
generic iterator that can be used to traverse these arrays/stacks.

Pending discussions we already had, but didn't come to a final conclusion:

SET/GET/MAP-STACK: Too generic, and it's actually not really a stack without
push/pop operations

SET/GET-RECOGNIZER: Too specific, as it is a useful factor beyond
recognizers (the search order could also use that approach, and many
systems have other similar things, e.g. a search path for files).

SET/GET/MAP-CONFIG: Hm, maybe. Those stacks are typically some sort of
configuration.

Other suggestions? In the end, I would like to split up this proposal into
two parts, one dealing with the actual recognizers, and one dealing with the
generic ordered set of cells.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
net2o ID: kQusJzA;7*?t=uy@X}1GWr!+0qqp_Cn176t4(dQ*
http://bernd-paysan.de/

JennyB

Dec 2, 2015, 5:29:57 AM
On Wednesday, 2 December 2015 01:40:23 UTC, Bernd Paysan wrote:

> One open question here is, since the stack-id (or array-id, as there's
> actually no push/pop for individual entries) is not bound to recognizers, if
> we rename that, to reflect the general usefulness, and add an generic
> iterator, that can be used to traverse these arrays/stacks.
>
> Pending discussions we already had, but didn't come to a final conclusion:
>
> SET/GET/MAP-STACK: Too generic, and it's actually not really a stack without
> push/pop operations
>

Actually, it's a deque. We can push or pop at the front or the back.

: >FRONT ( x stack-id -- )  \ x becomes the entry tried first
  DUP >R SWAP >R get-stack
  R> SWAP 1+ R> set-stack ;

: >BACK ( x stack-id -- )   \ x becomes the entry tried last
  DUP >R get-stack 1+ R> set-stack ;

and we can also save entire stacks with N>R

All of which only makes sense with stacks of limited size, but isn't that good enough?


Bernd Paysan

Dec 2, 2015, 12:42:58 PM
Of course it's good enough. Maybe we should call it DEQUE then, and have
the >FRONT and >BACK operators as part of the DEQUE wordset (which can be
such a library as Stephen always wants; there's nothing in it which can't
be implemented with standard words, except if you want these deques to
reside in EEPROM or such).

ruvim...@gmail.com

Sep 24, 2016, 9:44:12 PM
On Tuesday, December 1, 2015 at 10:32:26 PM UTC+3, Matthias Trute wrote:
>
> The design of a recognizer as a combination of
> a parsing word and 3 data handling methods
> associated with the data is one of them.

The proposal v3 says:
REC:TABLE ( addr len -- i*x R:TABLE | R:FAIL )
The parsing word [...] may change >IN however.

This raises the following questions.

1. May this word change '>IN' in case of failure
(i.e., when it returns R:FAIL)?

2. If 'yes', where should '>IN' be restored?
If 'no', this should be mentioned explicitly.

3. If this mechanism is bound to SOURCE and '>IN', why not add the
current token to the environment too? For example, as
SOURCE-TOKEN ( -- addr len )

4. If this mechanism allows defining multi-word recognizers, why not
allow multi-line recognizers? For example, multi-line string literals
are sometimes useful.


Kind regards,
Ruvim

Paul Rubin

Sep 24, 2016, 10:06:29 PM
ruvim...@gmail.com writes:
> The parsing word [...] may change >IN however.

Oh yucch, I didn't notice that before. It sounds ugly.

Matthias Trute

Sep 25, 2016, 4:23:37 AM
Hi,

> It raises the following questions.
>
> 1. May this word change '>IN' in case of fail
> (i.e., when it returns R:FAIL)?
>
> 2. If 'yes', where '>IN' should be restored?
> If 'no', it should be mentioned explicitly.

The answer: >IN is only allowed to be changed if
the parser returns success. For R:FAIL or any other
exception, >IN has to be left unchanged, obviously.
I'll add it in the next version; thanks for pointing
it out.

> 3. If this mechanism is bound to SOURCE and '>IN', why not add the
> current token to the environment too? For example, as
> SOURCE-TOKEN ( -- addr len )
>
> 4. If this mechanism allows defining multi-word recognizers, why not
> allow multi-line recognizers? For example, multi-line string literals
> are sometimes useful.

I already discuss that in the chapter "Multiword Parsing" (page 11 in
the PDF). The connection with SOURCE is only for "sentences"; what is
SOURCE-TOKEN supposed to achieve?

Matthias

Albert van der Horst

Sep 25, 2016, 6:04:23 AM
In article <d2403b69-b05a-4f23...@googlegroups.com>,
(Please trim your lines)

ciforth goes in this direction and allows multi-line strings.
This seems to be irreconcilable with >IN, and ciforth had to abandon
it for a more abstract PP (parse pointer).


Groetjes Albert
--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

Albert van der Horst

Sep 25, 2016, 6:09:22 AM
In article <878tuga...@jester.gateway.pace.com>,
Then you don't understand the very principle of prefixes.
Prefixes are the simplest form of recognizer.
For 789 the prefix 7 is recognized and executed. The parse pointer
sits past the 7. 7 is a number recognizer: it reads the remainder
of the number, with the effect that the parse pointer (mostly >IN,
if it exists in the Forth) is incremented and a number is left on the
stack or compiled.
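In terms of the proposal's words, a prefix like "0x" can also be written as an ordinary recognizer; a rough sketch (R:NUM is an assumed information token for numbers, and the recognizer-stack plumbing is not shown):

```forth
\ Sketch only: recognize "0x..." as a hexadecimal number.
\ R:NUM is an invented information token; error handling is minimal.
: REC:HEX ( addr len -- n R:NUM | R:FAIL )
  DUP 3 < IF 2DROP R:FAIL EXIT THEN              \ need "0x" plus digits
  OVER 2 S" 0x" COMPARE IF 2DROP R:FAIL EXIT THEN
  2 /STRING                                      \ skip the prefix
  BASE @ >R HEX
  0. 2SWAP >NUMBER NIP R> BASE !                 \ convert, restore BASE
  IF 2DROP R:FAIL ELSE D>S R:NUM THEN ;
```

Unlike a true prefix, this word sees the whole blank-delimited token at once, so no parse-pointer manipulation is needed for the single-word case.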

Anton Ertl

Sep 25, 2016, 11:22:19 AM
Matthias Trute <matthia...@gmail.com> writes:
>Hi,
>
>> It raises the following questions.
>>
>> 1. May this word change '>IN' in case of fail
>> (i.e., when it returns R:FAIL)?
>>
>> 2. If 'yes', where '>IN' should be restored?
>> If 'no', it should be mentioned explicitly.
>
>The answer: >IN is only allowed to be changed if
>the parser returns success. For R:FAIL or any other
>exceptions the >IN has to be left unchanged, obviously.

Yes, that was my first reaction, too, and in normal usage the
recognizer will restore >IN on non-recognition if it has changed >IN
by then (note that the only example of a >IN-changing recognizer I am
aware of, the "string" recognizer, already decides whether it succeeds
or fails before changing >IN).
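That save/restore discipline might look like this in a recognizer; try-more-input and R:TBL are placeholders for a multi-word parse attempt and an information token, not proposed words:

```forth
\ Sketch only: a multi-word recognizer that restores >IN itself on failure.
\ try-more-input ( addr len -- i*x true | false ) and R:TBL are placeholders.
: REC:EXAMPLE ( addr len -- i*x R:TBL | R:FAIL )
  >IN @ >R                  \ snapshot the parse pointer
  try-more-input IF         \ may have advanced >IN while scanning SOURCE
    R> DROP R:TBL           \ success: keep the new >IN
  ELSE
    R> >IN !  R:FAIL        \ failure: put >IN back ourselves
  THEN ;
```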

However, the questions made me think of whether there might be a
recognizer that actually wants to change >IN in case of failure, and I
can imagine such things. It would be a recognizer that works in
tandem with another recognizer (or several), doing some preparatory
work that includes changing >IN, and then reports failure, so that the
other recognizer gets its turn. It might be more straightforward to
have the first recognizer just call the second one, but if the second
one is on the recognizer stack anyway, the programmer might choose the
failing approach.

Long story short: changing >IN on failure is something that hardly
anybody will ever use.

>I'll add it in the next version, thanks for pointing
>to it.

I think the point he was getting at is whether the recognizer or the
system is responsible for restoring >IN on failure. It's the
recognizer('s programmer).

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2016: http://www.euroforth.org/ef16/

ruvim...@gmail.com

Sep 26, 2016, 10:48:31 PM
On Sunday, September 25, 2016 at 11:23:37 AM UTC+3, Matthias Trute wrote:

> > 1. May this word change '>IN' in case of fail
> > (i.e., when it returns R:FAIL)?
>
> The answer: >IN is only allowed to be changed if
> the parser returns success.

So, in conjunction with access to SOURCE,
the proposed Recognizer mechanism v3 has the following
characteristics (in the general case).

1. A recognizer can't be safely used on an arbitrary string
independently of the SOURCE content, since some recognizers
can give different results depending on the content of SOURCE and '>IN'.

2. A recognizer has a side effect, since it can change '>IN', which is
part of the Forth interpreter state. In the general case we don't know
whether a recognizer has such a side effect or not, so we should treat
it as having one.

3. A recognizer ("parsing word") is a non-idempotent method.
We can't use it on the same argument twice,
since it can succeed the first time and fail the second time.


Conclusion

This Recognizer mechanism cannot be used just to resolve a name,
the way '>NUMBER' and 'FIND' (or their derivatives) are used.

Actually, some recognizers don't use SOURCE, and they can be used
to resolve names. But the intention to support such a usage pattern
is a bad idea as long as this pattern doesn't work for every recognizer.




Side note

Personally, I believe that it is very important to have some
general mechanism of semantic analysis. Unfortunately, the
proposed Recognizer mechanism v3 can't play this role.
Note that without access to SOURCE it would be a pure semantic analyzer.


Resolving names into xts (according to the search order) and converting
numbers into binary form (possibly using the BASE state) are clearly
semantic analysis (a kind of interpretation in some sense). But parsing
single-word or multi-word sentences is the responsibility of the parser only.

When you place a parser into the semantic analyzer, you lose some
features of both.


So, the main trouble in the proposed design is that some
elements of a parser are placed at the semantic analysis level,
and that is the wrong solution.


The classical Forth parser parses word by word only. It can, however,
be extended by "parsing words", words that parse the input stream
by themselves (with well-known advantages and disadvantages).

This solution works well for VARIABLE, colon definitions, etc.
It even works for string literals, via the 'S"' word.

But what if we want to add support for string literals
of the form "abc xyz" (i.e., without a designated parsing word)?
I.e., what should the mechanism be that allows us to add such a feature?
The right solution should lie at the parser level.


Some possible solutions (ideas).

1. A separate stack of parsers. Each parser should return
a lexeme ( addr len ) on success.
To add support for the mentioned string literals, we would
add a parser that parses a sequence of characters delimited
by quote marks, and a recognizer for string literals that
matches lexemes of the form '"*"'.


2. Use only the Recognizer mechanism (without access to SOURCE),
but in the following way.
Add a recognizer for words of the form '"*' that returns
R:TABLE with a parsing word for string literals.

[: parse-string-part2 :] \ interpret
[: parse-string-part2 SLIT, :] \ compile
[: parse-string-part2 SLIT, POSTPONE SLIT, :] \ postpone

where parse-string-part2 has the stack effect
( addr1 len1 -- addr1 len2 )
and should analyze SOURCE.

Kind regards,
Ruvim

trans

Sep 26, 2016, 11:34:01 PM
Could someone explain how recognizers work in layman's terms?

Paul Rubin

Sep 27, 2016, 12:39:46 AM
trans <tran...@gmail.com> writes:
> Could someone explain how recognizers work in laymen terms?

After parsing a word, pass it to a series of user-supplied functions
(plus some default ones) until one of them decides that it knows how to
handle the word. This contrasts with the traditional system of having
special hair built into the text interpreter to handle things like
numeric literals, char constants in 'x' notation, etc.
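That dispatch can be sketched as a plain loop; recs, #recs and the lexeme buffer below are invented internals, not the proposal's API:

```forth
\ Sketch only: try each recognizer in turn until one accepts the string.
\ recs / #recs model the recognizer stack's storage; R:FAIL as in the RfD.
CREATE recs 8 CELLS ALLOT   VARIABLE #recs
2VARIABLE lexeme

: RECOGNIZE ( addr len -- i*x token | R:FAIL )
  lexeme 2!
  #recs @ 0 ?DO
    lexeme 2@  recs I CELLS + @ EXECUTE  \ each: ( addr len -- i*x tok | R:FAIL )
    DUP R:FAIL <> IF UNLOOP EXIT THEN
    DROP                                 \ discard R:FAIL and try the next one
  LOOP
  R:FAIL ;
```

The string is kept in a variable because the i*x results of a successful recognizer would otherwise bury a saved copy on the data stack.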

tabcom...@gmail.com

Sep 27, 2016, 1:45:20 AM
Thanks, that helps. So it still goes word by word. But isn't that going
to be rather slow? Every word has to run through 1 to N functions to find
a match, usually N. And then what does it do when it finds a match? Does
it put words on the stack? E.g. does `"foo` => `s" foo`? Or I guess
actually it just processes it as if it were `"s foo`?

Paul Rubin

Sep 27, 2016, 1:55:13 AM
tabcom...@gmail.com writes:
> Thanks, that helps. So it still goes word by word. But isn't that
> going to be rather slow? Every word has to run thru 1 to N functions
> to find a match, usually N.

Usually 1. The first thing is to look up the word in the dictionary. If
that fails, see if it's an integer literal, etc. That's what the text
interpreter does now.

> And the what does it do when it finds a match?

It runs a user-supplied function. You could read the spec. Disclosure:
I've looked at the spec but not really studied it yet, so I don't
understand all of it.

See: http://amforth.sourceforge.net/Recognizers.html

Andrew Haley

Sep 27, 2016, 4:38:42 AM
Paul Rubin <no.e...@nospam.invalid> wrote:
> tabcom...@gmail.com writes:
>> Thanks, that helps. So it still goes word by word. But isn't that
>> going to be rather slow? Every word has to run thru 1 to N functions
>> to find a match, usually N.
>
> Usually 1. The first thing is look up the word in the dictionary. If
> not, see if it's an integer literal, etc. That's what the text
> interpreter does now.

I think the proposal is to allow recognizers to be run before
dictionary search. The dictionary search itself can be a recognizer,
and can appear anywhere in the recognizer stack.

Andrew.

Anton Ertl

Sep 27, 2016, 8:12:30 AM
ruvim...@gmail.com writes:
>This Recognizers mechanism cannot be used just to resolve a name,
>like '>NUMBER' and 'FIND' (or their derivatives).

Recognizers are intended to be components of INTERPRET. If you
want >NUMBER or FIND, or any other factor of a recognizer, use them.

>Actually, some recognizers don't use SOURCE and they can be used
>to resolve names.

You mean, if the author of a date-recognizer failed to provide the
date conversion as a usable factor, you can still do

s" 2016-09-27" rec:date r>int execute

and you get the converted date. Yes, that works for most recognizers,
but as you note, there is no guarantee that it works for every
recognizer. But this is not the primary use of recognizers, so I
don't consider it a problem.

>Side note
>
>Personally I believe that it is very important to have some
>general mechanism of semantic analysis. Unfortunately, the
>proposed Recognizer mechanism v3 can't play this role.
>Note that without access to SOURCE it would be pure semantic analyzer.
>
>
>Resolving names into xt (according to search order) and converting
>numbers into binary form (with possible use of BASE state) are clear
>semantic analysis (kind of interpretation in some sense). But parsing
>single-words or multi-words sentences is responsibility of the parser only.

The term "semantic analysis" I know is from compiler construction,
where the front end of a compiler is divided into phases: lexical
analysis (scanning, using regular expressions), syntax analysis
(parsing, using context free grammars), and semantic analysis
(everything else, e.g., name lookup and static type checking).

Forth text-interpretation does not fit this pattern. On the one hand,
it uses a very simple scanner, but on the other hand every word can be
a parsing word that can do anything at all. The only way to analyse a
Forth program is to text-interpret it.

In this context the proposed recognizers may lead to replacement of
some parsing words with recognizers; e.g., instead of "TO bla" you
might write "->bla", and the recognizer for "->bla" actually has the
properties you ask for, and a tool that relies on the separation into
lexical and semantic analysis will have fewer cases where it fails.
But because the user still has the option to write parsing words, such
a tool can (and probably will) fail quite a lot. Better let your tool
run as a hook in the Forth text interpreter.

>Classical Forth parser parses word by word only. Although, it can
>be extended by "parsing words" — the words that parses input stream
>by themselves (with well known advantages and disadvantages).
>
>This solution works well for VARIABLE, colon-definition, etc.
>It works even for string literals via 'S"' word.

If the user writes

s" bla bla"

what does your semantic analysis do? What if the user does something
similar with a user-defined parsing word?

>But what if we want to add support of string literals
>in form "abc xyz" (i.e., without designated parsing word)?
>I.e., what should be the mechanism that allows us to add such feature?
>The right solution should lay in the parser level.
>
>
>Some possible solutions (ideas).
>
>1. A separate stack of the parsers. Each parser should return
>a lexeme ( addr len ) on success.
>To add support for the mentioned string literals, we should
>add a parser that parses sequence of characters that are delimited
>by quote marks, and a recognizer for string literal that
>matches lexemes in form '"*"'.

So your suggestion is to divide the text interpreter into a more
powerful parsing part than now and a semantic analysis part (that does
no parsing), and divide all the recognizers in that way. What would be
the advantage of your scheme? And what would your text interpreter
and recognizer interfaces look like?

In contrast, in the current proposal the text interpreter parses a
white-space-delimited string, and lets the recognizer recognize it,
and some recognizers (most notably the string recognizer) then do
additional parsing. For string parsing, it's not pretty, but simpler
than having a hard boundary between parsing and recognizing, and good
enough for the current uses.

>2. Use only Recognizer mechanism (without access to SOURCE),
>but in the following way.
>Add recognizer for words in form '"*' that returns
>R:TABLE with a parsing word for string literal.
>
>[: parse-string-part2 :] \ interpret
>[: parse-string-part2 SLIT, :] \ compile
>[: parse-string-part2 SLIT, POSTPONE SLIT, :] \ postpone
>
>where parse-string-part2 has stack effect
> ( addr1 len1 -- addr1 len2 )
>and should analyze SOURCE

What would be the advantage over the currently-proposed scheme?

Concerning separating the parsing from the "semantic analysis", this
seems to be just as non-separated as the current proposal.

Albert van der Horst

Sep 27, 2016, 9:06:27 AM
In article <7cf7cab7-4593-48e7...@googlegroups.com>,
<tabcom...@gmail.com> wrote:
>On Tuesday, September 27, 2016 at 12:39:46 AM UTC-4, Paul Rubin wrote:
>> trans <tran...@gmail.com> writes:
>> > Could someone explain how recognizers work in laymen terms?
>>
>> After parsing a word, pass it to a series of user-supplied functions
>> (plus some default ones) until one of them decides that it knows how to
>> handle the word. This contrasts with the traditional system of having
>> special hair built into the text interpreter to handle things like
>> numeric literals, char constants in 'x' notation, etc.
>
>Thanks, that helps. So it still goes word by word. But isn't that going
>to be rather slow? Every word has to run thru 1 to N functions to find a
>match, usually N. And the what does it do when it finds a match? Does it
>put words on the stack? e.g. does `"foo` => `s" foo`? Or I guess
>actually it just processes it as if it were `"s foo`?

Thank you for asking.
If you're at all concerned about the speed of compilation (I'm not),
you should use a hash table for the dictionary.
It works like this: read a word and derive a random number from it.
It is unlikely that there is another dictionary word with the same
number. So if you have a mechanism to find the dictionary entry based on
that random number, you have approximately an O(1) process, meaning
that all words are found equally fast. This is called hashing.
You must have a provision for different words getting the same
number (collisions), but that is about it. There is no looping over
anything, so this can be quite fast.

Enter prefixes. If you implement recognizers with prefixes the way
I do, the next thing after a failed search is to try a string consisting
of the first letter of the word. That is of course about as fast as the
first search. A basic ciforth system has only one-character recognizers.
The library contains one two-character prefix ("0x").
In that case a search for two-letter prefixes has to be added.

So concerning the speed of recognizers:
1. I don't care.
2. If paid for, I will show you a damned fast dictionary lookup.

A simple implementation of prefixes has the advantage that a
user can put them in a wordlist and put that wordlist in the
search order or not. This advantage is lost with hashing, but it is
not part of the recognizers as standardised anyway.
An example may help. Suppose I don't load the floating-point
wordset: fp numbers are not recognized. Now I load the fp wordset
into its own wordlist, but I don't put that wordlist
in the search order; fp numbers are still not recognized.
Now I put it in the search order: fp numbers are recognized.

Anton Ertl

Sep 27, 2016, 9:13:01 AM
Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>Paul Rubin <no.e...@nospam.invalid> wrote:
>> tabcom...@gmail.com writes:
>>> Thanks, that helps. So it still goes word by word. But isn't that
>>> going to be rather slow? Every word has to run thru 1 to N functions
>>> to find a match, usually N.
>>
>> Usually 1. The first thing is look up the word in the dictionary. If
>> not, see if it's an integer literal, etc. That's what the text
>> interpreter does now.
>
>I think the proposal is to allow recognizers to be run before
>dictionary search.

What is allowed does not mean that it usually happens. The usual case
is that the recognizer for dictionary search is first, numbers second,
floats third, etc. If you use the usual recognizers and the usual
order, you will run as many recognizers as you run now.

If you add a "->value" recognizer at the end, and have a lot of
occurrences of "->value" in your code, you will see more recognizers
being tried, on average, than now. I don't expect that to be a
performance issue, but if you really need to get the last drop of
performance out of text interpretation, you can reorder
non-conflicting recognizers (e.g., "->value" does not conflict with
the number recognizer for BASEs in the standard range) into the order
that produces the best performance (based on how often the recognizers
match and how fast they fail).

Bernd Paysan

Sep 27, 2016, 9:23:20 AM
Am Tue, 27 Sep 2016 03:38:37 -0500 schrieb Andrew Haley:
> I think the proposal is to allow recognizers to be run before dictionary
> search. The dictionary search itself can be a recognizer,
> and can appear anywhere in the recognizer stack.

Yes, of course users can change the default recognizer stack to fit any
of their needs, and if they add slow recognizers on the top of the stack,
the whole search can become pretty slow.

Comparing gforth-0.7.3 vs. current git Gforth on loading cross.fs however
shows a measurable difference in favor of gforth-current (32ms vs. 36ms
user time). Something else became better... and in fact, we have tuned
our hash function.

If you think that the simple loop through the recognizer stack is too
expensive, it's certainly possible to compile a tuned search function on
every SET-RECOGNIZERS (possibly giving it a unique name derived from a
hash over the xts, and using that name if it is already available), and
that should eliminate the dynamic overhead. But we have that sort of
overhead already for the loop through the search order. In fact, treating
the search order as a recognizer stack and using a similar precompilation
scheme on SET-ORDER could tune that, too.

trans

Sep 27, 2016, 10:14:40 AM
> For string parsing, it's not pretty, but simpler
> than having a hard boundary between parsing and recognizing, and good
> enough for the current uses.

Hi Anton. First you deserve thanks for all the work you've done on this. So thank you.

Could you explain how strings are handled?

ruvim...@gmail.com

Sep 27, 2016, 9:41:07 PM
On Tuesday, September 27, 2016 at 3:12:30 PM UTC+3, Anton Ertl wrote:

> Recognizers are intended to be components of INTERPRET.


Here is a quotation from the proposal v3:
"data parsing [...] is called from the interpreter
and analyses strings (usually sub-strings of SOURCE)"

This "usually" means that it can also be used for other strings
that are not sub-strings of SOURCE.

Yet another quotation:
"addr/len" is a string, if provided by the Forth text
interpreter a sub-string inside SOURCE.

This "if provided" means that its usage is not restricted
to the Forth text interpreter only.


So, if Recognizers are intended to be components of INTERPRET only,
this should be specified in a more concrete way, with consideration
of all the variants, without just "usually" or "if provided".

Moreover, in that case it would be simpler to keep the lexeme to
analyze in a variable instead of passing it as an argument ( addr len ).

Also there are many other questions for this case.



On the other hand, it is better if Recognizers were an
independent mechanism that is useful by itself,
and INTERPRET just used it.

The general idea of component reuse is also expressed
in the following quotation:
"The common tools to actually implement both recognizer
and search order word sets may be useful for themselves"


> You mean, if the author of a date-recognizer failed to provide the
> date conversion as a usable factor, you can still do
>
> s" 2016-09-27" rec:date r>int execute
>
> and you get the converted date.

Yes. Perhaps it would be better if the specification determined
an API that forces the author to provide
such a usable factor ;)



> The only way to analyse a Forth program is to text-interpret it.
[...]
> Better let your tool run as a hook in the Forth text interpreter.

I didn't mean these concerns.




> If the user writes
>
> s" bla bla"
>
> what does your semantic analysis do?

It is not mine; every Forth system does semantic analysis,
but in contrast to other compilers, it does it in an
incremental manner, word by word (lexeme by lexeme).

In the given example a classical Forth text interpreter does:
1. parsing: it gets the next lexeme, namely 's"' here;
2. semantic analysis: it resolves the lexeme 's"' into an xt
(and an immediate flag, if any);
3. applying: it executes the xt.

The further parsing of the string literal
and the corresponding changes in the input stream
are outside the scope of the Forth text interpreter.

Regarding point 2 above -- you can say that
it is not semantic analysis. But this would be
a debate just about terminology.
That action can also be called interpretation,
resolving, recognizing, or by some other term.



[...]
> So your suggestion is to divide the text interpreter into a more
> powerful parsing part than now and a semantic analysis part (that does
> no parsing),

Yes (as possible solution).

> and divide all the recognizers in that way.

No.
Only a few recognizers, those for multi-word lexemes,
need to be divided. Moreover, this more powerful
parsing part is needed for multi-word lexemes only.

By the way, are there any examples besides string literals?
Coupling to SOURCE for string literals only is overkill.
String literals can be implemented as one single custom case,
without any reuse mechanism at all.
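For illustration of that point: a plain parsing word already handles a
string literal without any recognizer machinery. A minimal sketch of the
interpretation semantics only (the result is transient, pointing into the
input buffer; the standard S" also has compilation semantics):

    : S"  ( "ccc<quote>" -- addr len )
      [CHAR] " PARSE ;   \ parse up to the closing quote in SOURCE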


> What would be the advantage of your scheme?

More elegant design with separation of concerns,
better modularity.

> And what would your text interpreter
> and recognizer interfaces look like?

Just for example:

: INTERPRET ( i*x -- j*x )
  begin  parse-lexeme?     ( a u true | false )
  while  recognize-lexeme? ( x true | false )
  while  apply-token
  repeat -13 throw then
;



> In contrast, in the current proposal the text interpreter parses a
> white-space-delimited string, and lets the recognizer recognize it,
> and some recognizers (most notably the string recognizer) then do
> additional parsing. For string parsing, it's not pretty, but simpler
> than having a hard boundary between parsing and recognizing, and good
> enough for the current uses.

It sounds reasonable. Nevertheless, I would prefer a solution
without parsing inside recognizers ;)


> >2. Use only Recognizer mechanism (without access to SOURCE),
> >but in the following way.
[...]
> What would be the advantage over the currently-proposed scheme?

Recognizers without the side-effects.
Less coupling in the specification (no need to mention SOURCE at all).


--
Ruvim

trans

unread,
Sep 27, 2016, 10:10:03 PM9/27/16
to
In fact I realized today that a string recognizer that could handle spaces (e.g. `"foo bar"`) would *have* to come before the dictionary recognizer, otherwise things could turn into a hot mess, e.g. if someone defined a word `"foo`.

JennyB

unread,
Sep 28, 2016, 3:07:03 AM9/28/16
to
Perhaps there should be a way of associating a recognizer with the CURRENT wordlist, by analogy with TO FORTH-RECOGNIZER, so that it is active when the wordlist is in the search order. That would produce the same effect as defining recognizers with prefixes: INTERPRET would check each wordlist and its recognizer before moving on to the next. You could attach a recognizer to an empty wordlist, and so have it wherever you needed it in the search order.

Albert van der Horst

unread,
Sep 28, 2016, 3:54:21 AM9/28/16
to
In article <bb4789a1-e1c7-49cd...@googlegroups.com>,
That is not true. Look into the implementation of strings in ciforth,
i.e. the source of the word `` " ''
The regular rules apply, it is later in the search order than
"foo so "foo is found first.
But it is essential that the prefix is allowed to parse for itself,
i.e. increment the parse pointer >IN (or whatever has a similar
function)

If string recognizer means a "string recognizer in the sense of the
20xx standard" you may be right though. That might be an argument
against it.

ruvim...@gmail.com

unread,
Sep 28, 2016, 5:39:16 AM9/28/16
to
On Wednesday, September 28, 2016 at 10:54:21 AM UTC+3, Albert van der Horst wrote:

> >
> >In fact I realized today that a string recognizer that could handle
> >spaces (e.g. `"foo bar"`, would *have* to come before the dictionary
> >recognizer, otherwise things could turn into a hot mess., e.g. if
> >someone defined a word `"foo`.
> >
>
> That is not true. Look into the implementation of strings in ciforth,
> i.e. the source of the word `` " ''
> The regular rules apply, it is later in the search order than
> "foo so "foo is found first.

Both cases have a shadowing issue.

Either `"*` words or `"foo *"` strings will be shadowed.

So with these string literals, either names that start with a
quote mark (`"`) or strings that start with existing words
named `"*` will be inaccessible.

Perhaps names that start with a quote mark (i.e. of the form `"*`)
should become reserved. Or such names should not be used when a
recognizer for such string literals is in scope.

--
Ruvim

Albert van der Horst

unread,
Sep 28, 2016, 7:14:02 AM9/28/16
to
In article <13bd0549-26fa-49ab...@googlegroups.com>,
<ruvim...@gmail.com> wrote:
>On Wednesday, September 28, 2016 at 10:54:21 AM UTC+3, Albert van der Horst wrote:
>
>> >
>> >In fact I realized today that a string recognizer that could handle
>> >spaces (e.g. `"foo bar"`, would *have* to come before the dictionary
>> >recognizer, otherwise things could turn into a hot mess., e.g. if
>> >someone defined a word `"foo`.
>> >
>>
>> That is not true. Look into the implementation of strings in ciforth,
>> i.e. the source of the word `` " ''
>> The regular rules apply, it is later in the search order than
>> "foo so "foo is found first.
>
>The both cases have an issue of shadowing.
>
>Either `"*` words or `"foo *" strings will be shadowed.
>
>So with this strings literals either names that start with
>quote mark (`"`) or strings that start with existent words
>having name `"*` will be inaccessible.

If you have a prefix " and a normal word "foo that takes precedence
then of course:
"foo aap noot"
is not recognized as a string of three words.
Instead "foo is found and executed/compiled.

That is what the poor computer was ordered to do.

("Do what I mean, not what I say!")

>
>Perhaps the names that start with quote mark (i.e. in form `"*`)
>should become reserved. Or such names should not be used when
>recognizer for such string literals is in the scope.

That would destroy the notion that " is not special, but
just another prefix in the life of a bicycle repair man.

>
>--
>Ruvim

Anton Ertl

unread,
Sep 28, 2016, 7:47:14 AM9/28/16
to
trans <tran...@gmail.com> writes:
>In fact I realized today that a string recognizer that could handle spaces (e.g. `"foo bar"`, would *have* to come before the dictionary recognizer, otherwise things could turn into a hot mess., e.g. if someone defined a word `"foo`.

That's the usual conflict in Forth. Somebody can define a word `10`
that will shadow the integer `10`, but nobody does. A more serious
problem is words like `add` when the BASE is HEX; that has classically
been worked around by writing the number as `0add`, and in modern days
as `$add`; and of course you don't define words that look like that.

Similarly, if you define a word in a world with a string recognizer,
you will not name it `"foo`. And if you design a new recognizer, you
will avoid using a pattern that conflicts with existing words.

Anton Ertl

unread,
Sep 28, 2016, 12:41:19 PM9/28/16
to
ruvim...@gmail.com writes:
>On Tuesday, September 27, 2016 at 3:12:30 PM UTC+3, Anton Ertl wrote:
>
>> Recognizers are intended to be components of INTERPRET.
>
>
>There is a citation from the proposal v3:
> "data parsing [...] is called from the interpreter
> and analyses strings (usually sub-strings of SOURCE)"
>
>This "usually" means that it also can be used for other strings,
>that are not sub-strings of SOURCE.

Yes, an example is

s" 2016-09-27" rec:date r>int execute

Of course, if a recognizer uses additional stuff from the input
stream, this becomes less practical.

>Yet another citation:
> "addr/len" is a string, if provided by the Forth text
> interpreter a sub-string inside SOURCE.
>
>This "if provided" means that its usage is not restricted
>by the Forth text interpreter only.

Yes.

>So, if Recognizers intended to be component of INTERPRET only,
>it should be specified in more concrete way, with consideration
>all the variants, without just "usually", or "if provided".

The text interpreter is certainly the main use of recognizers. If you
find other uses for some recognizers, great. Does that mean that all
recognizers have to be equally usable for these other uses? No. Does
that mean that no recognizer should be used for other things? No.

>On the other hands, it is better, if Recognizers would be
>independent mechanism that is useful for itself,
>and INTERPRET just uses it.

What would be the interface for that? What would be the benefits and
the costs?

>> You mean, if the author of a date-recognizer failed to provide the
>> date conversion as a usable factor, you can still do
>>
>> s" 2016-09-27" rec:date r>int execute
>>
>> and you get the converted date.
>
>Yes. Perhaps it is better if the specification would
>determine API that will force the author to provide
>such usable factor ;)

How would you force that in the API?

>> If the user writes
>>
>> s" bla bla"
>>
>> what does your semantic analysis do?
>
>It is not mine -- every Forth system does semantic analysis,
>but in contrast to other compilers, it does it in
>incremental manner, word by word (lexeme by lexeme).
>
>In the given example a classical Forth text interpreter does
>1. parsing: it gets the next lexeme, namely 's"' here;
>2. semantic analysis: it resolves the lexeme 's"' into xt
>(and flag of immediate if any);
>3. applying: it executes the xt.
>
>The further parsing of the string literal
>and corresponding changes in the input stream
>are out of scope of the Forth text interpreter.

Unfortunately, the users take a different view. For them,

s" bla bla"

or

' drop

is a unit (not always, but much of the time), and, e.g., they want to
be able to cut and paste it between interpretation and compilation,
and are often confused when "' drop" does not work as expected when
being compiled. For S" STATE-smartness is used in many systems, which
causes problems elsewhere. So taking the view that the parsing done
by parsing words is out-of-scope does not really cut it.

>[...]
>> So your suggestion is to divide the text interpreter into a more
>> powerful parsing part than now and a semantic analysis part (that does
>> no parsing),
>
>Yes (as possible solution).
>
>> and divide all the recognizers in that way.
>
>No.
>Only a few recognizers for multi-word lexemes
>need to be divided. Moreover, this more powerful
>parsing part is need to parse multi-word lexemes only.

What would be the interface for that?

>By the way, is there some examples except string literals?

Not among the recognizers that I expect to be commonly used. However,
for some more exotic use take a look at
<http://theforth.net/package/recognizers/current-view/literacy.4th>.

>Coupling to SOURCE for string literals only is overkill.
>String literals can be implemented as one single custom case
>without any reusing mechanism at all.

It's not clear to me what you mean by that.

Anyway, originally I envisioned that the text interpreter parses a
space-delimited string (e.g., with PARSE-NAME) and passes it to the
recognizer stack, and each recognizer only works on that string.
Somebody else then came up with the idea of having a string recognizer
that parses more by itself, and this looked ugly to me at first.

However, over time I got used to the idea. But anyway, let's consider
some alternatives:

1) Recognizers do all the parsing themselves; so most recognizers
start by saving >IN, performing PARSE-NAME, and on failure restoring
>IN. That makes the recognizer code longer, makes the text
interpreter slightly slower, and makes other uses of recognizers
harder. The benefit is that the string recognizer does not stick out,
because now all recognizers parse. Not a benefit worth the cost IMO.

2) You have a division into parsing and the rest, and most recognizers
specify PARSE-NAME as their parser, and the string recognizer provides
a special string parser. In implementation terms, this would again do
the saving and restoring of >IN and the PARSE-NAME on every
recognizer, again costing a bit of performance, but at least the
recognizers are shorter than for option 1), but most are still longer
than in the proposal. Again, the benefit does not seem worth the
cost.
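To make that cost concrete, a float recognizer under alternative 1) might
be sketched like this (hypothetical names REC:FLOAT and R:FLOAT, following
the RfD loosely; the >IN save/restore and the PARSE-NAME are the extra code
every such recognizer would carry):

    : REC:FLOAT ( -- R:FLOAT | 0 ) ( F: -- f | )
      >IN @ >R  PARSE-NAME >FLOAT          \ parse a word, try to convert it
      IF R> DROP R:FLOAT                   \ success: discard saved >IN
      ELSE R> >IN ! 0 THEN ;               \ failure: restore >IN, report fail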

>> What would be the advantage of your scheme?
>
>More elegant design with separation of concerns,
>better modularity.

Ok, and what is the practical advantage? Can you give an example?

>> And what would your text interpreter
>> and recognizer interfaces look like?
>
>Just for example:
>
>: INTERPRET ( i*x -- j*x )
> begin
> parse-lexeme? ( a u true | false ) while
> recognize-lexeme? ( x true | false ) while
> apply-token
> repeat -13 throw then
>;

And now for multiple recognizers?

Anton Ertl

unread,
Sep 28, 2016, 12:58:51 PM9/28/16
to
trans <tran...@gmail.com> writes:
>Could you explain how strings are handled?

Say, you have a string

"bla bla"

The text interpreter uses something like PARSE-NAME, and passes the
resulting string '"bla' to the recognizers; at some point, it calls
the string recognizer with this string, and the string recognizer sees
the '"' at the beginning and decides that it does not fail. There is
no '"' at the end of '"bla', so the string recognizer decides it needs
to parse more of the input, until it finds the terminating '"'.
That's the parsing part of the string recognizer.
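A minimal sketch of that behaviour (hypothetical names REC:STRING and
R:STRING; single-line strings only, and it assumes the closing quote is not
already inside the first word and that addr/len is a substring of SOURCE):

    : REC:STRING ( addr len -- addr' len' R:STRING | 0 )
      OVER C@ [CHAR] " <> IF 2DROP 0 EXIT THEN  \ no leading quote: fail
      DROP CHAR+                    \ addr': just past the opening quote
      [CHAR] " PARSE                \ parse further input up to the closing "
      + OVER -                      \ len': from addr' to the end of the parse
      R:STRING ;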

Matthias Trute

unread,
Sep 28, 2016, 3:54:01 PM9/28/16
to
Am Mittwoch, 28. September 2016 03:41:07 UTC+2 schrieb ruvim...@gmail.com:
> On Tuesday, September 27, 2016 at 3:12:30 PM UTC+3, Anton Ertl wrote:
>
> > Recognizers are intended to be components of INTERPRET.
>
>
> There is a citation from the proposal v3:
> "data parsing [...] is called from the interpreter
> and analyses strings (usually sub-strings of SOURCE)"
>
> This "usually" means that it also can be used for other strings,
> that are not sub-strings of SOURCE.

I think it's too early for nitpicking. The designated
environment for recognizers is the text interpreter.
Here I make assumptions similar to those of PARSE or WORD etc. With
that in mind, multi-word parsing becomes possible with as
little as allowing >IN to be changed. No more, no less.

>
> Yet another citation:
> "addr/len" is a string, if provided by the Forth text
> interpreter a sub-string inside SOURCE.
>
> This "if provided" means that its usage is not restricted
> by the Forth text interpreter only.

Right. But not all recognizers are usable *outside* the
interpreter, namely the ones that change >IN. Maybe
EXECUTE-PARSING will help here.

> So, if Recognizers intended to be component of INTERPRET only,
> it should be specified in more concrete way, with consideration
> all the variants, without just "usually", or "if provided".

Please forgive me.

> On the other hands, it is better, if Recognizers would be
> independent mechanism that is useful for itself,
> and INTERPRET just uses it.

As mentioned, this is not the primary goal. You can easily
factor these words and create a simple wrapper to turn a
parsing word into a recognizer. >FLOAT is a good example,
>NUMBER is not (see RFD).

>
> The common idea of component reusing is also utilized
> in the following citation:
> "The common tools to actually implement both recognizer
> and search order word sets may be useful for themselves"

What's your problem with that? It's a trivial fact, IMHO.

> In the given example a classical Forth text interpreter does
> 1. parsing: it gets the next lexeme, namely 's"' here;
> 2. semantic analysis: it resolves the lexeme 's"' into xt
> (and flag of immediate if any);
> 3. applying: it executes the xt.
>
> The further parsing of the string literal
> and corresponding changes in the input stream
> are out of scope of the Forth text interpreter.

Hmm. Look at the spec for the text interpreter.
There are not only XT's mentioned.

>
> No.
> Only a few recognizers for multi-word lexemes
> need to be divided. Moreover, this more powerful
> parsing part is need to parse multi-word lexemes only.

I keep the text interpreter as much as possible. That
means that the text interpreter is still responsible for
splitting the input (from SOURCE, managed with >IN) into
white space delimited words. I do not want to re-invent
Forth as a whole.

> By the way, is there some examples except string literals?

I've started at theforth.net with some examples, at least a
simple hh:mm:ss recognizer is there. More examples are in
the gforth git repository at savannah and (of course) in
the amforth sources at sourceforge.

> > What would be the advantage over the currently-proposed scheme?
>
> Recognizers without the side-effects.
> Less coupling in the specification (no need to mention SOURCE at all).

That's discussed in the RFD. Page 12, section "Multiword parsing".
Together with an alternative solution and why it is not used.

Matthias

Andrew Haley

unread,
Sep 29, 2016, 11:17:58 AM9/29/16
to
Paul Rubin <no.e...@nospam.invalid> wrote:
> ruvim...@gmail.com writes:
>> The parsing word [...] may change >IN however.
>
> Oh yucch, I didn't notice that before. It sounds ugly.

It is one of the most difficult parts of the proposal.

A better design might pass the recognizer the entire parse area and
its length and allow a recognizer to return the number of characters
actually consumed rather than a flag. That would be similar to the
way that >NUMBER works. The problem with this approach is that useful
words such as WORD and PARSE act on the current input source and have
a side effect on it.

Some helper words would need to be added so that this wasn't unduly
difficult. For example, a word (I'll call it EXTRACT for lack of
imagination) could be added which takes ( addr n c ) and returns a
substring ( addr' n' ).

Consider the current proposal

: REC:FLOAT ( addr len -- R:FLOAT | 0 ) ( F: -- f | )
   >FLOAT IF R:FLOAT ELSE 0 THEN ;

This is very clean and simple.

It would become something like

: REC:FLOAT ( addr len -- n ) ( F: -- f | )
   BL EXTRACT DUP >R >FLOAT IF R> ELSE R> DROP 0 THEN ;

which is not quite as simple.

But a string parser might even be as simple as

: REC:STRING ( addr len -- addr n n )
   CHAR " EXTRACT <copy the string somewhere> DUP ;

[Untested, probably has errors, you get the idea.]
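One plausible sketch of that hypothetical EXTRACT, returning the prefix of
the string up to (and excluding) the first occurrence of the delimiter, or
the whole string if the delimiter is absent:

    : EXTRACT ( addr n char -- addr n' )
      >R OVER SWAP                  \ addr addr2 n2: walk a copy of the string
      BEGIN  DUP WHILE              \ characters left?
        OVER C@ R@ <> WHILE         \ not the delimiter yet?
        1 /STRING                   \ advance
      REPEAT THEN
      DROP OVER -                   \ n' = addr2 - addr
      R> DROP ;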

Also, the current proposal allows a recognizer to call REFILL, which I
don't like at all. I suppose there could be some justification made
for multi-line strings, but that's rather a stretch. A simple text
recognizer shouldn't be having such effects. IMO, YMMV, etc...

Andrew.

Matthias Trute

unread,
Sep 29, 2016, 1:29:51 PM9/29/16
to

> Also, the current proposal allows a recognizer to call REFILL, which I
> don't like at all.

What?

----------
mt@A:~/RFD$ grep -i2 refill Recognizer-rfc-C.rst
Another aspect with multiword recognizers is that it is possible that
the closing syntactic element of the multi-word sentence is not within
the current input string One or more ``REFILL`` may be necessary to get
it. Since that may be troublesome in the long run, the closing element
shall be in the same input string as the opening one.
-----------

Ok, REFILL is not explicitly forbidden, nor is any other word. I don't
want to maintain a whitelist of allowed words at all. Forthers would
ignore it anyway.

Somewhere else I write that a recognizer must not modify the string
it gets (a substring of SOURCE). That IMHO makes REFILL at least
troublesome; it usually changes what's stored in the SOURCE buffer.

Matthias

Albert van der Horst

unread,
Sep 29, 2016, 6:54:30 PM9/29/16
to
In article <Oe-dnZEZsNQ4rXDK...@supernews.com>,
I go in a totally different direction. None of my recognizers get
strings passed to them; instead all recognizers work directly
with (my equivalent of) >IN.

The essence of a recognizer is that it parses in its own way.
So PARSE-FLOAT advances >IN and leaves a floating point number on
the fp stack. Interestingly, it parses backwards for the mantissa
and forwards for the exponent.

>
>Andrew.

Andrew Haley

unread,
Sep 30, 2016, 3:56:11 AM9/30/16
to
Matthias Trute <matthia...@gmail.com> wrote:
>
>> Also, the current proposal allows a recognizer to call REFILL, which I
>> don't like at all.
>
> What?
>
> ----------
> mt@A:~/RFD$ grep -i2 refill Recognizer-rfc-C.rst
> Another aspect with multiword recognizers is that it is possible that
> the closing syntactic element of the multi-word sentence is not within
> the current input string One or more ``REFILL`` may be necessary to get
> it. Since that may be troublesome in the long run, the closing element
> shall be in the same input string as the opening one.
> -----------
>
> Ok, REFILL is not explicitly forbidden, as any other word too. I don't
> want to maintain a whitelist of allowed words at all. Forther's would
> ignore it anyways.

My bad. It wasn't completely clear to me. On re-reading, I get it
now.

Thanks,

Andrew.

Andrew Haley

unread,
Sep 30, 2016, 3:58:39 AM9/30/16
to
Albert van der Horst <alb...@cherry.spenarnc.xs4all.nl> wrote:
> I go in a totally different direction. None of my recognizers get
> strings passed to them, instead all recognizer work directly
> with (my equivalent of) >IN.
>
> The essence of a recognizer is that it parses in its own way.
> So PARSE-FLOAT advances >IN and leaves a floating point number on
> the fp stack.

So how does that work? Do you use a different WORD (or PARSE) and
then rewind >IN, or what?

Andrew.

Albert van der Horst

unread,
Sep 30, 2016, 5:41:35 AM9/30/16
to
In article <5tCdnXzfw5SwhnPK...@supernews.com>,
None of the above. The responsibility for parsing is distributed
and modular.

This is based on my infamous "2 line addition to do denotations".
Parsing
"we gaan naar rome"
the interpreter gets the first word
"we
It is not in the FORTH dictionary as such, but it is found in
the ONLY dictionary as " . This is a modified FIND that can
find it because " is marked a PREFIX. It is also IMMEDIATE.
The interpreter does what it always does: execute " , even
in compile state.
(Implementation detail: a programmer should understand that
the modified FIND compares the string `` "we '' and the name
`` " '' over the minimum length of those strings.)
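One plausible reading of that comparison, sketched as a helper (hypothetical
name PREFIX-MATCH?; addr1/len1 is the parsed word, addr2/len2 the dictionary
name marked as a prefix, so the match is over the name's length):

    : PREFIX-MATCH? ( addr1 len1 addr2 len2 -- flag )
      DUP 3 PICK > IF 2DROP 2DROP FALSE EXIT THEN  \ name longer than word: no
      ROT DROP TUCK COMPARE 0= ;    \ compare over the name's length only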

" takes it from there. It is responsible for leaving the parse
pointer (SRC + >IN or whatever) at the right place.
It allocates a string in the dictionary and leaves/compiles
an address/length pair on the stack (sc : a string constant).
That is the Forth modular compiler in action.

I don't do any parsing up front on behalf of " to pass a string to it.

The most consistent example of this is in "yourforth",
because it is a non-ISO experiment.
ciforth uses it and remains (by and large) ISO.

[The first time this explanation was published in this newsgroup
was 2003. We're getting there. :-( ]

Anton Ertl

unread,
Sep 30, 2016, 8:52:00 AM9/30/16
to
JennyB <jenny...@googlemail.com> writes:
>Perhaps there should be a way of associating a recognizer with the CURRENT
>wordlist, by analogy with TO FORTH-RECOGNIZER, so that it is active when the
>wordlist is in the search order. That would produce the same effect as
>defining recognizers with prefixes: INTERPRET would check each wordlist and
>its recognizer before moving on to the next. You could attach a recognizer
>to an empty wordlist, and so have it wherever you needed it in the search
>order.

What would be the use case of such a scheme?

Among the use cases for recognizers that I know of, such an
association is not needed:

1) For typical literals, I don't see a particular need to associate
them with wordlists. You might want to have a literal for some data
that is only dealt with in a particular wordlist, then the association
would save you from having to remove both the wordlist from the search
order and the recognizer from the recognizer stack. But that is not a
common pattern for using recognizers yet, so maybe we should wait if
it establishes itself before thinking about adding such a feature.

2) For stuff like dot-parsers ("a.b") and TO-replacement ("->v"), the
lookup of a and v happens in the search order anyway, so no need to
associate these recognizers with wordlists.

Albert van der Horst

unread,
Oct 2, 2016, 6:22:11 AM10/2/16
to
In article <Oe-dnZEZsNQ4rXDK...@supernews.com>,
Andrew Haley <andr...@littlepinkcloud.invalid> wrote:
<SNIP>
>
>Also, the current proposal allows a recognizer to call REFILL, which I
>don't like at all. I suppose there could be some justification made
>for multi-line strings, but that's rather a stretch. A simple text
>recognizer shouldn't be having such effects. IMO, YMMV, etc...

I don't have a problem with this at all.

Look at it this way. The Forth interpreter parses a word
DO-SOMETHING , it finds the execution token and executes it.
At this point the interpreter takes no responsibility that
DO-SOMETHING does something stupid like 0 CHAR A !
that leads to a crash. That is Forth's modularity.
The interpreter doesn't try to control what's done with the
stack or whatever.

Now recognizers (if restricted: prefixes) make the parsing part modular.
The Forth interpreter finds a word that is a recognizer,
then executes it. The recognizer takes responsibility for parsing.
If it fouls up the parse area, that's Forth modularity.
The interpreter doesn't try to control what's done with the
parse area.

IMO there should be no fundamental difference between recognizers
and normal words.

>
>Andrew.

Matthias Trute

unread,
Oct 2, 2016, 6:55:20 AM10/2/16
to
> Look at it this way. The Forth interpreter parses a word
> DO-SOMETHING , it finds the execution token and executes it.

With recognizers it is not the text interpreter that searches
for DO-SOMETHING.

> At this point the interpreter takes no responsibility that
> DO-SOMETHING does something stupid like 0 CHAR A !
> that leads to a crash. That is Forth's modularity.
> The interpreter doesn't try to control what's done with the
> stack or whatever.
>
> Now recognizers (if restricted: prefices ) make the parsing part modular.
> The Forth interpretor finds a word that is a recognizer,
> then executes it.

The interpreter isolates a word "DO-SOMETHING" in SOURCE and calls
the recognizer stack (do-recognizer) with it as input. The recognizer
stack delivers back data from the string (e.g. the XT if the word is
found, the number if the string looks like a number) together with
actions that the text interpreter has to perform to handle that data
depending on STATE (execute it, compile it, or leave it on the data
stack). The text interpreter sees only words, regardless of what
they mean, and blindly does what the recognizer stack tells it to
do.
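In code, that data flow might be sketched as follows (the names
DO-RECOGNIZER, R>INT and R>COMP follow the RfD loosely; error handling
for a failed recognition is omitted):

    : INTERPRET ( i*x -- j*x )
      BEGIN  PARSE-NAME DUP WHILE          \ isolate the next word in SOURCE
        FORTH-RECOGNIZER DO-RECOGNIZER     \ -- i*x r:table
        STATE @ IF R>COMP ELSE R>INT THEN  \ pick the STATE-dependent action
        EXECUTE                            \ ...and do it blindly
      REPEAT 2DROP ;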

> If it fouls up the parse area, that's Forth modularity.
> The interpreter doesn't try to control what's done with the
> parse area.

That is left unchanged.

Ruvim

unread,
Oct 4, 2016, 10:09:26 AM10/4/16
to
On 2016-09-27 12:54, Anton Ertl wrote:

> And what would your text interpreter
> and recognizer interfaces look like?

Actually I'm still preparing the detailed answer with a reference
implementation for illustration.

But one more comment for the following concern.

> In contrast, in the current proposal the text interpreter parses a
> white-space-delimited string, and lets the recognizer recognize it,
> and some recognizers (most notably the string recognizer) then do
> additional parsing. For string parsing, it's not pretty, but simpler
> than having a hard boundary between parsing and recognizing, and good
> enough for the current uses.
>
>> 2. Use only Recognizer mechanism (without access to SOURCE),
>> but in the following way.
>> Add recognizer for words in form '"*' that returns
>> R:TABLE with a parsing word for string literal.
>>
>> [: parse-string-part2 :] \ interpret
>> [: parse-string-part2 SLIT, :] \ compile
>> [: parse-string-part2 SLIT, POSTPONE SLIT, :] \ postpone
>>
>> where parse-string-part2 has stack effect
>> ( addr1 len1 -- addr1 len2 )
>> and should analyze SOURCE
>
> What would be the advantage over the currently-proposed scheme?

Note that this approach, where additional parsing can occur only on
execution (an explicit call of the found token), is nearer to the classic
interpretation scheme, where an immediate word can do additional parsing
only when it is executed.

--
Ruvim

