A very productive day using FORTH

hel...@gmail.com

Mar 19, 2007, 2:00:35 PM

Hi,

today I had a success with Forth in an area where one would not
usually expect it.
The problem was to analyze a big database of linguistic data, represented
in XML and custom SGML-like formats. The database contains
texts, translations and grammatical analysis data, plus some metadata.
The current data is perfectly representable as a tree. But the analysis
to be done would not be tree-like, so we needed a way to
represent this. A big part of the problem was that it's a research
project where you do not even know exactly how the data would best be
represented.
My idea last week was to give Forth a try: basically, to assume each
word in the texts has a life of its own and the texts are "Forth programs".
Today we tried it. We first put together some example texts to examine.
Problem number one was to find a nice representation; that was
done easily. The next problem was what a "word" has to do by default. We
found this too and were able to produce our first statistics. Step by
step we enhanced the thing by prepending to the texts the questions we
had. Some things were a little more complicated, but most questions
were programmed in seconds.
Best of all, I explained to the linguist step by step what I did
and we talked a lot about Forth in general - now she wants to learn
Forth so she can handle at least simple questions on her own.

The conclusion is that Forth is very usable for linguistic research tasks.
The problems we had turned out to be easy to implement. Even
non-programmers seem to be able to follow the logical steps if you explain
them well - Forth's simplicity is very helpful while you explain
what you are doing.
The basic framework we used was about 20 lines of Forth code. In about
5 hours of intensive work we implemented things that would have taken
months of work if they were done in the usual way the institution
offers for such requests.
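The original 20-line framework is not shown in the post. What follows is only a minimal sketch of the general idea (each token of a text is looked up, executed if it has been given a definition, and otherwise handed to a default action). All names (`text-words`, `default-action`, `text:`, `#plain`, `#wr`) are hypothetical, and the code assumes a Forth-2012 system such as Gforth that provides PARSE-NAME, DEFER and the search-order words.

```forth
\ Sketch only: every name here is illustrative, not the original framework.
wordlist constant text-words      \ tokens that have special behaviour
variable #plain                   \ tokens handled by the default action
variable #wr                      \ occurrences of the token "wr"

defer default-action ( c-addr u -- )
: count-plain ( c-addr u -- )  2drop  1 #plain +! ;
' count-plain is default-action

: run-token ( c-addr u -- )
  2dup text-words search-wordlist       \ c-addr u 0 | c-addr u xt flag
  if  nip nip execute  else  default-action  then ;

: text: ( "token ..." -- )              \ run the rest of the line as a text
  begin  parse-name dup  while  run-token  repeat  2drop ;

\ A "question" is just a definition placed in text-words.
get-current  text-words set-current
: wr ( -- )  1 #wr +! ;
set-current

0 #plain !  0 #wr !
text: jnk wr wr.w m aHA aA
#plain @ .  #wr @ .                     \ plain tokens, then hits for "wr"
```

With the sample line above, five tokens fall through to the default action and one matches `wr`. Whether dictionary lookup distinguishes upper and lower case is implementation-defined in Forth, which matters for transliterations where `H` and `h` are different sounds.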

Has anyone else used FORTH for linguistic research tasks?

-Helmar

Andreas Kochenburger

Mar 19, 2007, 2:17:04 PM

hel...@gmail.com wrote:
> Has anyone else used FORTH for linguistic research tasks?

That's difficult to answer, you didn't give much info along with the
question.

Perhaps you should search the web for Mentifex and look out for
MindForth. But I don't know if that fits.

--

Andreas

hel...@gmail.com

Mar 19, 2007, 2:24:50 PM

Thanks, looks interesting.
What we do is analyze texts for stylistics. We don't care about
semantics at the moment. Our problem is that there is not enough
base data to do anything semantics-related. But we can search for
anything that is a repetition or phonetically related. We have some
small hope that we can use grammar data and lemmata to analyze the
style, but it's currently not clear whether the data is good enough
for the questions.

-Helmar

> --
>
> Andreas

Paul E. Bennett

Mar 19, 2007, 2:27:29 PM

hel...@gmail.com wrote:

A nice story of a project accomplished successfully in a short timespan. We
shall have to save this one in the archives for sure. Thank you.
--
********************************************************************
Paul E. Bennett ....................<email://p...@amleth.demon.co.uk>
Forth based HIDECS Consultancy .....<http://www.amleth.demon.co.uk/>
Mob: +44 (0)7811-639972
Tel: +44 (0)1235-811095
Going Forth Safely ..... EBA. www.electric-boat-association.org.uk..
********************************************************************

John Passaniti

Mar 19, 2007, 5:47:23 PM

Andreas Kochenburger wrote:
> Perhaps you should search the web for Mentifex and look out for
> MindForth. But I don't know if that fits.

I assume you are kidding. Arthur does get some minor credit for
actually producing some code, putting him far ahead of the ultra-useless
werty. But his code is bad, doesn't do anything near what he claims,
produces bizarre and nonsensical output, and is only documented by
incomprehensible ASCII diagrams and ponderous text.

Arthur and werty do share in common the desire to put out a meme and
hope that people brighter than they are able to apply insight and skill
to produce something useful. But in both cases, if anything useful ever
comes from it, it's going to be a secondary effect.

The Beez'

Mar 20, 2007, 5:10:35 AM

On 19 Mar, 19:00, hel...@gmail.com wrote:

> Problem was to analyze a big database of linguistic data, represented

> in XML and custom SGML-like representations. <snip>

> Has anyone else used FORTH for linguistic research tasks?

Not so much for linguistic research. However, I did use Forth to
analyse an XML file. I treated the tags as words (and at times as
terminators). This approach was not only much faster than using
ordinary languages, but also very easy to maintain. For those
interested, I have added the significant parts here. The file was
generated by Cisco Works.

Hans Bezemer

[needs lib/throw.4th]
\ structure to store data
struct
1 field Dirty \ flag: dirty buffer
17 field IPServer \ IP address of server
33 field AltServer \ Hostname of server
33 field AltSwitch \ Hostname of switch
17 field Poort \ Port of server
33 field PoortNaam \ Name of port
33 field Duplex \ Duplex description
17 field Mbit \ Speed in Mbit
33 field V-LAN \ Applicable VLAN
33 field MAC \ MAC address of server
end-struct /Patch

/Patch string Patch

today weekday value thisday \ calculate weekday

char < constant "<" \ ASCII code character "<"
char > constant ">" \ ASCII code character ">"
char / constant "/" \ ASCII code character "/"
char . constant "." \ ASCII code character "."
char , constant "," \ ASCII code character ","
char ' constant "'" \ ASCII code character "'"

250 constant /GO \ maximum number of statements/GO

: is-bl bl = ; \ true if a space
: is-slash "/" = ; \ true if a slash
: is-dot "." = ; \ true if a dot
: is-no-digit is-digit 0= ; \ true if not a digit
: <tag "<" parse ; ( -- a n)
: tag> ">" parse ; ( -- a n)
: <tag> <tag 2drop tag> ; ( -- a n)
: chop 1 /string ; ( a n -- a+1 n-1)
: .' "'" emit ; ( --)
: ., "," emit space ; ( --)
: >> count .' type .' ., ; ( a n --)
: GO cr ." GO" cr ; ( --)
: ?GO dup /GO = if /GO - GO else cr 1+ then ;
: clean s>lower -leading -trailing ; ( a1 n1 -- a2 n2)
: >cell clean number ; ( a n1 -- n2)
: Numfield> is is-type split >cell ; ( a1 n1 xt -- a2 n2 n3)
: DateField> NumField> -rot chop ; ( a1 n1 xt -- n3 a2 n2)
: clear Patch /Patch 0 fill ; ( --)
: tag>? tag> compare if E.USER throw then ;
: field! <tag clean rot place tag>? ; ( a1 n1 a2 --)
: IP> ['] is-dot NumField> <# "." hold # # # #> Patch -> IPServer ;
( a1 n1 -- a2 n2 a3 n3 a4)
: /UTData ( --)
Patch -> AltServer count dup 0= if 2drop Patch -> PoortNaam count then
dup >r Patch -> AltServer place r> \ use PortName if Hostname empty

0<> Patch -> Dirty c@ and if \ if we have a dirty buffer, write sql
." INSERT INTO Patch ( IPServer, AltServer, AltSwitch, Poort, Duplex, Mbit, VLAN, MAC, WeekDay ) VALUES ( "
Patch -> IPServer >>
Patch -> AltServer >>
Patch -> AltSwitch >>
Patch -> Poort >>
Patch -> Duplex >>
Patch -> Mbit count type .,
Patch -> V-LAN >>
Patch -> MAC >>
thisday . [char] ) emit ?GO
then clear \ signal buffer is clean
;
\ check if data is recent enough
: LastSeen ( --)
today jday <tag \ get julian date today
['] is-slash DateField> \ get year
['] is-slash DateField> \ get month
['] is-bl DateField> 2drop swap rot \ get day
jday - 30 > if clear then \ less than a month old?
s" /LastSeen" tag>? \ check closing tag
;
\ convert the IP address
: IPAddress ( --)
<tag IP> place chop IP> +place chop IP> +place chop
>cell <# # # # #> Patch -> IPServer +place s" /IPAddress" tag>?
;
\ convert the hostname
: HostName ( --)
['] is-dot is is-type \ check for a dot
<tag split clean Patch -> AltServer place 2drop
s" /HostName" tag>? \ check closing tag
;

: Port ( --)
['] is-no-digit is is-type \ setup for -SPLIT
<tag clean -split Patch -> Poort place
number <# # # #> Patch -> Poort +place
s" /Port" tag>? \ check closing tag
;
\ convert the following fields
: UTData true Patch -> Dirty c! ; ( --)
: PortSpeed <tag clean 1- Patch -> Mbit place s" /PortSpeed" tag>? ;
: MACAddress s" /MACAddress" Patch -> MAC field! ;
: DeviceName s" /DeviceName" Patch -> AltSwitch field! ;
: VLAN s" /VLAN" Patch -> V-LAN field! ;
: PortName s" /PortName" Patch -> PoortNaam field! ;
: PortDuplex s" /PortDuplex" Patch -> Duplex field! ;

Andreas Kochenburger

Mar 20, 2007, 5:55:55 AM

"John Passaniti" <nn...@JapanIsShinto.com> wrote in message
news:LxDLh.5110$B25....@news01.roc.ny...

> Andreas Kochenburger wrote:
>> Perhaps you should search the web for Mentifex and look out for
>> MindForth. But I don't know if that fits.
>
> I assume you are kidding. Arthur does get some minor credit for actually
> producing some code, putting him far ahead of the ultra-useless werty.
> But his code is bad, doesn't do anything near what he claims, produces
> bizarre and nonsensical output, and is only documented by incomprehensible
> ASCII diagrams and ponderous text.

Not kidding, but I agree it was a shot in the blue.

However Helmar's post made me curious and I googled a bit about linguistic
analysis software. It seems that the problem category is called corpus
analysis. It involves pattern matching in data trees, and I wonder how that
could be done with Forth.

Perhaps another wild shot into the blue ;-)

Andreas

Leo Wong

Mar 20, 2007, 8:59:40 AM

On Mar 20, 5:10 am, "The Beez'" <hans...@bigfoot.com> wrote:

> I did use Forth to
> analyse an XML file. I treated the tags as words (and at times as
> terminator). This approach was not only much faster than using
> ordinary languages, but also very easy to maintain.

I treat xml files the same way:

\ cv.f for Jenny Brien Leo Wong 30 March 02003 fyj +
\ Give chapter and verse of any phrase in the King James bible
\ Uses ot.xml and nt.xml from:
\ http://www.ibiblio.org/xml/examples/religion/

include from.f
from jenx.f

S" ot.xml" string ot
S" nt.xml" string nt

CREATE CurrentBook 84 CHARS ALLOT
CREATE CurrentChapter 84 CHARS ALLOT
CREATE CurrentVerse 1024 CHARS ALLOT
CREATE Phrase 256 CHARS ALLOT

: .Verse ( n -- )
CurrentBook COUNT CR CR TYPE SPACE
CurrentChapter COUNT <word TYPE 2DROP .c : 0 U.R
cBuff @+ CR TYPE ;

' -cBuff many: aliases p chtitle bktshort ; DROP

: /chtitle ( n -- 0 )
cBuff @+ CurrentChapter place DROP 0 ;

: /bktshort ( -- )
cBuff @+ CurrentBook place ;

: v ( n -- n+1 )
-cBuff 1+ ;

: /v ( n -- n )
cBuff @+ CurrentVerse $place
CurrentVerse @+ 2DUP supper Phrase COUNT sscan NIP
IF DUP .Verse THEN ;

: cv ( -- ) \ cv <phrase>
0 PARSE 2DUP supper Phrase place
ot INCLUDED nt INCLUDED ;

Leo Wong
http://www.murphywong.net
http://barzuncentennial.murphywong.net/

hel...@gmail.com

Mar 20, 2007, 3:30:59 PM

On Mar 20, 9:55 am, "Andreas Kochenburger" <a...@nospam.org> wrote:
> "John Passaniti" <n...@JapanIsShinto.com> wrote in message news:LxDLh.5110$B25....@news01.roc.ny...

Hi ;)

A good shot, as it turns out. We also do work on a corpus grammar.
The research on style may have a lot of implications for other
areas of research.

Today felt less productive. I've rewritten most of the things we
worked out yesterday into something nicer looking and more usable.
Yesterday's things were more like a scratch pad, and I was able to
factor out some things.
We figured out that we need to change some things in the representation.
As of today the first batch of "real" data is accessible.
The other half of the day (when we could have been more
productive...) we spent figuring out some problems with existing
software and data. The lemmata are not organized in a way that is useful
for our needs, so this will mean a lot of work next time. The problem is
that we do have a list of lemmata and every word is annotated with its
lemma, but the annotation is too exact: you have lemmata A, B, C, D, but
B, C, D are only variations of A, and what we would need is for B,
C, D to count as A. The "relational" database used is not structured
well enough that you can simply find out that a C is an A... There
is some magic inside the lemma list that we need to have cleared up
by its programmers/administrators. But the administrator is
currently ill and the programmer is not easy to reach at the moment.
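The folding of variant lemmata onto a base lemma can itself be expressed as Forth definitions. This is a hedged sketch only: it assumes the variant-to-base pairs are already known (which is exactly what the database does not yet provide), and `lemma-variants`, `variant` and `lemma` are hypothetical names, not part of the project.

```forth
\ Sketch only: maps variant lemmata (B, C, D) onto a base lemma (A).
wordlist constant lemma-variants

: variant ( "variant" "base" -- )      \ e.g.  variant B A
  get-current >r  lemma-variants set-current
  create  r> set-current               \ header goes into lemma-variants
  parse-name  dup c,                   \ store base lemma as counted string
  here swap dup allot  move            \ (so at most 255 characters)
  does> ( -- c-addr u )  count ;

: lemma ( c-addr u -- c-addr' u' )     \ normalise a token to its base lemma
  2dup lemma-variants search-wordlist
  if  nip nip execute  then ;          \ unknown tokens pass through unchanged

variant B A   variant C A   variant D A
s" C" lemma type  space                \ base form of C
s" X" lemma type                       \ X has no entry and stays X
```

Keeping the variants in their own wordlist means they never collide with ordinary Forth words or with the words used in the texts.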

BTW, I don't see any problems with pattern matching in FORTH, and in
any case the questions are not really tree-related at the
moment.

BTW 2: Here is a nice example source snippet from today:
---------------------
...

strophe
vers ---| jbhA.tj Hr.tj r =j | /vers
vers ---| n-n.tt jnk js wbd.w | /vers
vers ---| Hr.j-jb nr.ww | /vers
vers ---| jw =j m zxn.w hAb r nr =n | /vers
/strophe

strophe
vers ---| jbhA.tj Hr.tj r =j | /vers
vers ---| jnk wr wr.w m aHA aA | /vers
vers ---| m Hsb amjA.wt stS | /vers
vers ---| r-gs jmj.w aHA =sn | /vers
/strophe

...
---------------------

This is plainly translated from the tree-oriented SGML used in the
project. We removed some extras from the transcription that are not
needed for our purposes.

The things that start with "=" are very regular and can be predefined
- they can be ignored for some questions (a point we found to be
useful today). The complete text base we could use would in the end
probably be some tens of thousands of lines similar to the above.
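To illustrate how a snippet in this notation can be run as a Forth program, here is a minimal sketch that only counts strophes, verses, ordinary words and "=" words. It is not the project's code: the counters and helper names (`#strophes`, `#verses`, `#words`, `#eq-words`, `layout?`, `=word?`) are assumptions, and each verse is assumed to fit on one line, as in the sample. It relies on PARSE-NAME and COMPARE from Forth-2012.

```forth
\ Sketch only: counts the units of a text written in the notation above.
variable #strophes   variable #verses
variable #words      variable #eq-words   \ tokens starting with "="

: reset-counts ( -- )
  0 #strophes !  0 #verses !  0 #words !  0 #eq-words ! ;

: layout? ( c-addr u -- flag )         \ true for the tokens ---| and |
  2dup s" ---|" compare 0= >r  s" |" compare 0=  r> or ;

: =word? ( c-addr u -- flag )          \ true if the token starts with "="
  if  c@ [char] = =  else  drop false  then ;

: count-token ( c-addr u -- )
  2dup layout? if  2drop exit  then
  =word? if  1 #eq-words +!  else  1 #words +!  then ;

: strophe  ( -- )  1 #strophes +! ;
: /strophe ( -- ) ;

: vers ( "token ... /vers" -- )        \ consume one verse on this line
  1 #verses +!
  begin  parse-name dup  while         \ length 0 means end of line
    2dup s" /vers" compare 0= if  2drop exit  then
    count-token
  repeat  2drop ;

reset-counts
strophe
vers ---| jbhA.tj Hr.tj r =j | /vers
vers ---| n-n.tt jnk js wbd.w | /vers
/strophe
#strophes @ .  #verses @ .  #words @ .  #eq-words @ .
```

For the two verses shown this reports one strophe, two verses, seven ordinary words and one "=" word. A different question only needs a different `count-token`, which is the sense in which the text stays fixed while the definitions in front of it change.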

-Helmar

hel...@gmail.com

Mar 20, 2007, 3:49:06 PM

Hi,

> Not as much as for linguistic research. However, I did use Forth to
> analyse an XML file.

the raw skeleton I use to analyze XML is:
> http://maschenwerk.de/repository/?p=4p;a=blob;f=xml4p
This is obviously not all you need, but it is a good starting point. I've
used it before to convert complex XML files to PDF (using "pdf4p",
which you'll find at the same place).
The questions we ask of the text are very different from parsing XML ;)

-Helmar