Character-based Lexer as Recognizer

trans

unread,

Sep 28, 2016, 9:54:19 AM9/28/16

to

Has anyone considered using an extensible character-based lexer for a recognizer?

I imagine it could be quite simple, given how Forth works in comparison to other languages. I imagine it working more-or-less along the following lines.

Recognizers are broken down into a hierarchy of "sub-recognizers" (for lack of a better term), each of these are simply triggered by two possible state conditions 1) the current character matching a character in a set and/or 2) a counter. Each sub-recognizer passes execution along to a sub-recognizers below it based on it's condition until a completion is reached.

I think an example will explain better. Lets do an ISO date, e.g. "2016-09-28".

The first "node" in our recognizer matches on any digit 0 to 9, and has itself as a sub-recognizer as long as it's counter is less than 4. When the counter reaches four and the next character is a "-", then it passes off to a sub-recognizer which does the same thing except this time it loops until the counter is 2 and the character is "-", then execution is passed to that sub-recognizer where each character is processed until the counter is 2 and the next character is a space. At that point we have our date.

The nice thing about this hierarchal approach is that it is very efficient b/c we can reuse sub-recognizers that recognizers have in common. So for instance dates and numbers can share the first sub-recognizer, passing off to different sub-recognizers based on different conditions (dates after a four count of repeat digits and numbers when a "." or " " appear.)

I hope I explained that well enough. I find it easier to demonstrate with a diagram but that's hard to draw here. When I have more time (next week) maybe I can draw it in an image and post it.

Jan Coombs

unread,

Oct 2, 2016, 8:03:30 AM10/2/16

to

On Wed, 28 Sep 2016 06:54:07 -0700 (PDT)
trans <tran...@gmail.com> wrote:

> Has anyone considered using an extensible character-based
> lexer for a recognizer?

Yes, occasionally, and just last week...

The ANS core wordset uses ~50 initial chars, so maybe a
character-based lexer would lead to a smaller and faster
dictionary if a table-based method is used.

Email me if you'd like my analysis and table sketch.

Jan Coombs
--

trans

unread,

Oct 4, 2016, 10:06:53 PM10/4/16

to

I would indeed. But I am using Google Groups and it obscures email addresses in the email header, so I can't get your full address. So go ahead and email me at tran...@gmail.com.

Albert van der Horst

unread,

Oct 5, 2016, 5:54:13 AM10/5/16

to

In article <20161002130329.483f838d@HP-6550b>,

Jan Coombs <jenfhaom...@murmic.plus.com> wrote:
>On Wed, 28 Sep 2016 06:54:07 -0700 (PDT)
>trans <tran...@gmail.com> wrote:
>
>> Has anyone considered using an extensible character-based
>> lexer for a recognizer?
>
>Yes, occasionally, and just last week...
>
>The ANS core wordset uses ~50 initial chars, so maybe a
>character-based lexer would lead to a smaller and faster
>dictionary if a table-based method is used.

If you're only concerned about the ANS core set, it is
easy enough to find a perfect hash. And then you want
to add definitions ...

>
>Email me if you'd like my analysis and table sketch.
>
>Jan Coombs
>--
>

Groetjes Albert
--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst