script to search webpages in Forth


Krishna Myneni

Aug 15, 2002, 10:59:09 PM
If you have lynx (a text based web browser) and
the famous grep (pattern matching utility) on
your system, attached is a useful script written
in Forth to automatically search and print lines
containing the search term from a list of specified
web pages. The example searches news webpages
for headlines on a specified topic, e.g.

search-headlines india

will scan the web pages www.cnn.com, www.cbsnews.com,
etc. (you can add your own) and display all lines
containing "india" (not case sensitive). Neat, huh?
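(The attached script itself isn't reproduced here, but its core is presumably a pipeline along the lines of `lynx -dump <url> | grep -i <term>`. The grep step — a case-insensitive line filter — can be sketched in Python; the helper name below is hypothetical, not from the script.)

```python
def search_lines(page_text: str, term: str) -> list:
    """Return the lines of page_text containing term, ignoring case --
    the same behavior as `grep -i term` on a lynx -dump of the page.
    Hypothetical helper, not taken from the attached headlines.4th."""
    needle = term.lower()
    return [line for line in page_text.splitlines()
            if needle in line.lower()]
```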

Krishna Myneni

========================================================

headlines.4th

h-peter recktenwald

Aug 16, 2002, 6:33:01 AM
On Thu, 15 Aug 2002 21:59:09 -0500
Krishna Myneni <krishn...@compuserve.com> wrote:

[..]
> \ Use a definition of SHELL appropriate to your system
>
> : shell \ c-addr u -- n | execute a shell command in kForth
> strpck system ;
[..]


\ lib4th, f8
: shell -compile sh ; immediate

hi-level defns of "sh" and "sh|" for piped output, in "lib4th" archive.

best,
hp

--
>>> pse, reply to : clf -at- lxhp -dot- in-berlin -dot- de <<<
Linux,Assembly,Forth: http://www.lxhp.in-berlin.de/index-lx.shtml en/de

Krishna Myneni

Aug 16, 2002, 5:21:03 PM
The script doesn't seem to work in Windows. One
problem is that the shell command returns instantly
without waiting for the process to complete. But
there seem to be other complications in Windows
as well. Ughh! The script works fine under Linux.

Krishna Myneni

Marcel Hendrix

Aug 18, 2002, 11:33:31 AM
(#29366) Krishna Myneni <krishn...@compuserve.com> wrote Re: script to search webpages in Forth

> If you have lynx (a text based web browser) and
> the famous grep (pattern matching utility) on
> your system, attached is a useful script written
> in Forth to automatically search and print lines
> containing the search term from a list of specified
> web pages. The example searches news webpages
> for headlines on a specified topic, e.g.

Here is a useful Forth utility that uses the w3m netbrowser.
( I'd like to thank Marc Olschok <sa7...@l1-hrz.uni-duisburg.de>
for drawing my attention to this useful program ).
Communication with this browser is through a pipe.

The idea is to check if any of the URLs in the "saved-urls.dat"
file have changed since last time we checked. This is done
by computing the 32-bit CRC of these pages.

You can check URLs in parallel with browsing the 'net in the
"normal" way. This saves lots of time while still keeping
you informed :-)
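The comparison loop Marcel describes can be sketched in Python (an illustration of the idea only, not a translation of the iForth code below; `zlib.crc32` stands in for the CRC word, and `fetch` is a hypothetical stand-in for GET-URL):

```python
import zlib

def update_crcs(saved, fetch, on_change):
    """For each URL in the saved table, recompute the page's 32-bit CRC
    and call on_change(url) when it differs from the stored value.
    Returns the refreshed CRC table. fetch(url) -> bytes is a stand-in
    for GET-URL; saved maps url -> previous CRC."""
    new = {}
    for url, old_crc in saved.items():
        crc = zlib.crc32(fetch(url)) & 0xFFFFFFFF
        new[url] = crc
        if crc != old_crc:
            on_change(url)
    return new
```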

There is an infinite timeout when a page is off-line or not
accessible. I don't see an easy fix for that. (It may be
possible to change ACCUMULATE to have a time-out.)

-marcel

-- -----------------------------------------------------------------------
(*
* LANGUAGE : ANS Forth
* PROJECT : Forth Environments
* DESCRIPTION : check list of URLs for new data
* CATEGORY : Tool
* AUTHOR : Marcel Hendrix
* LAST CHANGE : August 18, 2002, Marcel Hendrix
*)

NEEDS -miscutil
NEEDS -pipes
NEEDS -assemble

REVISION -scooter "─── Scooter - URL fetch Version 1.00 ───"

PRIVATES

DOC
(*
WHAT
----
This program uses the w3m netbrowser. Communication is through a pipe.
The idea is to check if any of the URLs in the "saved-urls.dat" file have
changed since last time we checked. This is done by computing the 32-bit
CRC of these pages.
You can check URLs in parallel with browsing the 'net in the "normal" way.

BUGS
----
There is an infinite timeout when a page is off-line or not accessible.
I don't see an easy fix for that. (It may be possible to change
ACCUMULATE to have a time-out.)

EXAMPLE OUTPUT
--------------
FORTH> update-crcs
0 stable :: http://www.tim-mann.org/chess.html/index.html
1 stable :: http://home.iae.nl/users/mhx/index.html
2 stable :: http://www.bagley.org/~doug/shootout/index.html
3 stable :: http://www.cs.utk.edu/~ghenry/distrib/archive.htm
4 stable :: http://www98.phys.virginia.edu/classes/551.jvn.fall01/index.html
5 stable :: http://www.azillionmonkeys.com/qed/asm.html
6 stable :: http://www.cs.bell-labs.com/cm/cs/pearls/index.html
7 stable :: http://algoart.com/index.html
8 stable :: http://www-cs-faculty.stanford.edu/~knuth/index.html
9 stable :: http://home.hccnet.nl/a.w.m.van.der.horst/index.html
10 stable :: http://pweb.de.uu.net/schwalm.hb/index.html
11 stable :: http://www.jwdt.com/~paysan/bigforth.html
12 stable :: http://www.sunweb.ch/custom/epprecht/index.html
13 stable :: http://www.colorforth.com/index.html
14 stable :: http://www.fig-uk.org/codeindex/index.html
15 stable :: http://www.fig-uk.org/index.html
16 stable :: http://www.geocities.com/forthlinks/index.html
17 stable :: http://home.earthlink.net/~neilbawd/index.html
18 stable :: http://www.quartus.net/discus/index.html
19 stable :: http://www.forth.org.ru/index.html
20 stable :: http://dec.bournemouth.ac.uk/forth/index.html ok
*)
ENDDOC

-- Tools -----------------------------------------------------------------

-- I forgot how to do this in high-level
CODE crc32 ( n1 char -- n2 )
rpush,
ebx pop,
edx -> [esp] xchg, \ pop crc to edx
8 b# -> ecx mov, \ loop count
@@1:
edx shr, \ shift crc
bh rcr,
bl ror, \ shift character
ebx -> eax mov, \ save character
bh -> bl xor,
@@2 offset SHORT jns, \ skip if equal
$EDB88320 d# -> edx xor, \ crc-32 polynomial 1 04C1 1DB7
@@2:
eax -> ebx mov, \ restore character
@@1 loop, \ next bit
edx -> [esp] xchg, \ crc to tos
rpop, ebx jmp,
END-CODE

\ calculate crc-32 of string
: crc-32 ( c-addr u -- n2 )
-1 -ROT BOUNDS ?DO I C@ crc32 LOOP INVERT ;

: ACCUMULATE ( c-addr umax handle -- c-addr u )
0 3 PICK LOCALS| start sz handle umax addr |
BEGIN umax DUP
WHILE addr SWAP handle READ-FILE ?FILE DUP
WHILE DUP +TO sz
addr umax ROT /STRING TO umax TO addr
REPEATED
DROP start sz ;P


-- http access -----------------------------------------------------------

: GET-URL ( c-addr1 u1 -- c-addr2 u2 )
S" c:/w3m/w3m.exe -dump " 2SWAP $+ R/O OPEN-PIPE ?FILE >S
PAD #100000 S ACCUMULATE ( -- c-addr u )
S> CLOSE-PIPE ?FILE ;

: .URL ( c-addr u -- )
GET-URL CR TYPE CR ;

: URL->CRC ( c-addr1 u1 -- u2 )
GET-URL crc-32 ;

-- A small URL database --------------------------------------------------

0 VALUE rec#
CREATE URL-DBASE HERE #100 CELLS DUP ALLOT ERASE

: cleanup ( u -- )
DROP
URL-DBASE rec# 0 ?DO @+ FREE ?ALLOCATE LOOP DROP
CLEAR rec#
URL-DBASE #100 CELLS ERASE ;P

: ADD-URL ( c-addr u -- )
rec# 100 U>= ABORT" URL-DBASE full"
DUP 2 CELLS + ALLOCATE ?ALLOCATE ( -- c-addr u addr )
CELL+ CELLPACK CELL- URL-DBASE rec# CELL[] !
1 +TO rec# ;

' cleanup IS-FORGET ADD-URL

: ADD-CRC URL-DBASE []CELL @ ! ; ( u rec# -- )
: ADD-DATA rec# >S ADD-URL S> ADD-CRC ; ( crc c-addr u -- )
: URL@ URL-DBASE []CELL @ CELL+ @+ ; ( ix -- c-addr u )
: CRC@ URL-DBASE []CELL @ @ ; ( ix -- u )
: job URL@ .URL ; ( ix -- )

: RECORD->FILE ( crc c-addr u1 handle -- )
>S ROT (.) S~ S" ~ $+
2SWAP $+
S~ " ADD-DATA~ $+
S> WRITE-LINE ?FILE ;P

: SAVE-URLS ( -- )
S" saved-urls.dat" W/O BIN CREATE-FILE ?FILE LOCAL handle
URL-DBASE rec# 0 ?DO @+ @+ SWAP @+ handle RECORD->FILE
LOOP DROP
handle CLOSE-FILE ?FILE ;

: RESTORE-URLS S" saved-urls.dat" INCLUDED ; ( -- )

RESTORE-URLS

: .URLS ( -- )
URL-DBASE
rec# 0 ?DO @+ @+ SWAP @+ CR I 3 .R ." :: " TYPE
C/L #20 - HTAB ." -- CRC = " H.
LOOP
DROP ;

-- Vector this to what you want done when a page changes -----------------

DEFER DO-SOMETHING ( index -- ) ' job IS DO-SOMETHING

: UPDATE-CRCs ( -- )
URL-DBASE
rec# 0 ?DO
@+ @+ >S @+ URL->CRC DUP >S I ADD-CRC
S> S> <> IF CR I 3 .R ." :: URL contents have changed"
I DO-SOMETHING
ELSE CR I 3 .R ." stable :: " I URL@ TYPE
ENDIF
LOOP
DROP
SAVE-URLS ;

:ABOUT CR .~ Try: S" http://some_url.html" GET-URL ( c-addr u ) -- get URL content in a string~
CR .~ S" http://some_url.html" .URL -- print page content~
CR .~ S" http://some_url.html" URL->CRC ( -- u) -- compute 32-bit CRC of page content~ ;

.ABOUT -scooter CR
DEPRIVE

(* End of Source *)
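As for the "I forgot how to do this in high-level" comment above CODE crc32: the same bit-by-bit algorithm (reflected polynomial $EDB88320, initial value all-ones, final invert) is short in high level. A Python sketch, which agrees with `zlib.crc32`:

```python
import zlib

def crc32_bitwise(data: bytes) -> int:
    """Bit-by-bit CRC-32, mirroring the CODE word in the listing:
    reflected polynomial 0xEDB88320, initial value -1, final invert."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xEDB88320
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF
```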

Chris Jakeman

Aug 19, 2002, 1:09:14 PM
On Sun, 18 Aug 2002 15:33:31 GMT, m...@iaehv.iae.nl (Marcel Hendrix)
wrote:

>Here is a useful Forth utility that uses the w3m netbrowser.

>The idea is to check if any of the URLs in the "saved-urls.dat"
>file have changed since last time we checked. This is done
>by computing the 32-bit CRC of these pages.

I've been using the free service at
http://www.changedetection.com/monitor.html
for this but it's not as flexible as doing it yourself.

Instead, I've been meaning to program this for a while (more strictly,
to learn enough to try to do this). You and Krishna have saved me a
job. Many thanks.


Bye for now ____/ / __ / / / / /
/ / / _/ / / / /
Chris Jakeman __/ / / __ / / /_/
/ / / / / / / \
[To reply, please __/ __/ ____/ ___/ __/ _\
unspam my address]
Forth Interest Group United Kingdom
Voice +44 (0)1733 753489 chapter at http://www.fig-uk.org

Marcel Hendrix

Aug 19, 2002, 1:29:28 PM

Chris wrote:

> I've been using the free service at
> http://www.changedetection.com/monitor.html for this but it's not as
> flexible as doing it yourself.

I used one too, but they stopped the service and wanted money after a while.
Very inflexible, and I suspect them of just wanting to build one's "interest
profile" so it can be sold to the spam society with a bonus.

> Instead, I've been meaning to program this for a while (more strictly, to
> learn enough to try to do this). You and Krishna have saved me a job.

I had to patch a single byte in w3m (it wants its options directory to start
with a ".", which is too much to ask of Win32). Also, I use pipes, which are
not so easy to do under Windows. Hopefully at least gForth will have got
that right.
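For comparison, the pipe part — what GET-URL does with OPEN-PIPE / ACCUMULATE / CLOSE-PIPE — is a few lines via Python's subprocess module. A sketch, assuming `w3m -dump` is on the PATH for real use (the demo substitutes `echo` for w3m so it runs without a browser, on POSIX systems):

```python
import subprocess

def get_url(url: str, browser: str = "w3m") -> str:
    """Dump a page as plain text by reading a child process through a
    pipe, analogous to GET-URL in the listing above.
    Assumes `browser -dump <url>` writes the page text to stdout."""
    result = subprocess.run([browser, "-dump", url],
                            capture_output=True, text=True, check=True)
    return result.stdout
```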


Please let me know if you need help translating.


-marcel
