The best way to do this is probably to use the Lynx browser, which
displays pages as text anyway, and redirect its output to a file.
The command line
lynx -dump URL > myfile.txt
will do what you want. If you don't want the URLs listed at the end,
you can add "-nolist" after "-dump".
<URL:http://lynx.browser.org/>
--
"These are, as I began, cumbersome ways / to kill a man. Simpler, direct,
and much more neat / is to see that he is living somewhere in the middle /
of the twentieth century, and leave him there." -- Edwin Brock
http://www.ifi.uio.no/~larsga/ http://birk105.studby.uio.no/
>In message <3466765b....@news.ican.net>, Jean Bigras <t...@ican.net>
>wrote:
>|If you are on a Windows platform or a Mac, the lynx solution might not
>|be the best...
>
>Not sure about the Mac but there *is* a Windows 95 version of
>Lynx. It's labeled as a development version but it appeared pretty
>stable when I tried it out.
MacLynx is available from http://www.lirmm.fr/~gutkneco/maclynx/
Andrew McCormick
opinions expressed in this post are not necessarily my own
>Right now I am working on a web page which is basically all in HTML
>format with a couple of .gifs. What I would like to know is if there is
>a way to have a text version of the page which will automatically change
>the HTML to text, without going in, copying the file, and making the
>modifications by hand.
>
> thanks for your help
If you are on a Windows platform or a Mac, the lynx solution might not
be the best...
One thing you can do is:
1) view the page with Netscape or MSIE.
2) press Ctrl+A to select all the text, then Ctrl+C to copy it.
3) paste it into your favorite text editor...
The other is to get a copy of HomeSite 2.5
http://www.allaire.com/
which has a "strip all HTML tags" feature.
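If you don't have HomeSite, a crude sed one-liner can approximate that
tag-stripping feature. A rough sketch only: it breaks on tags split across
lines and leaves character entities like &amp; untouched.

```shell
# Crude approximation of a "strip all HTML tags" feature using sed:
# delete anything between a "<" and the next ">".
echo '<p>Hello, <b>world</b>!</p>' | sed 's/<[^>]*>//g'
```

For real pages you are still better off with lynx -dump, which also
handles entities and line wrapping.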
JLB
Not sure about the Mac but there *is* a Windows 95 version of
Lynx. It's labeled as a development version but it appeared pretty
stable when I tried it out.
|One thing you can do is:
|1) view the page with Netscape or MSIE.
|2) press Ctrl+A to select all the text, then Ctrl+C to copy it.
|3) paste it into your favorite text editor...
And how do you do that in an automated fashion? You can't. The text
version quickly becomes out of sync with the page itself.
--
Shawn K. Quinn - skq...@brokersys.com - visit my home page at
http://www.brokersys.com/~skquinn/ and visit a bunch of bogus e-mail addresses
at http://www.brokersys.com/~skquinn/spamsucks.html (latter to foil robots)
If you're running Apache as your Web server, you can get automatic
text versions of HTML pages quite easily with a hack I thought up
a while ago.
Just put this:
ErrorDocument 404 /cgi-bin/404error
in your httpd's conf files, then include something like this in the
"404error" CGI script:
#!/usr/local/bin/perl
#
# 404error: a cool 404 error handler
#
# Gerald Oskoboiny, 30 Jan 1997

$htdocs   = "/www/htdocs";
$logfile  = "/usr/log/404_error_log";
$html2txt = "/usr/local/bin/lynx -cfg=/usr/local/lib/lynx.cfg -validate -dump";

# REDIRECT_URL holds the URL that triggered the 404; split it into
# its extension and its path without the extension.
$extension = $ENV{REDIRECT_URL}; $extension =~ s/.*\.//g;
$basename  = $ENV{REDIRECT_URL}; $basename  =~ s/\.[^\.]*$//g;
$basename  =~ s|^/||g;

#####
# Check if they were looking for a ".txt" file; if so, generate one for
# them by running the corresponding ".html" file through lynx.
if ( ( $extension eq "txt" ) && ( -f "$htdocs/${basename}.html" ) ) {
    print "Content-Type: text/plain\n\n";
    open( HTML2TXT, "$html2txt http://www.hwg.org/${basename}.html |" ) ||
        die "couldn't run $html2txt with http://www.hwg.org/${basename}.html! $!";
    while (<HTML2TXT>) {
        print;
    }
    close( HTML2TXT ) || die "couldn't close $html2txt! $!";
    exit;
}

#####
# do other stuff here (log the 404, print a friendly error page, ...)
Et voilà! Instant .txt versions of all your HTML pages.
For example:
http://www.hwg.org/resources/html/validation.html (HTML)
http://www.hwg.org/resources/html/validation.txt (plain text)
http://www.hwg.org/index.html
http://www.hwg.org/index.txt
This isn't especially efficient, but it gets decent results with extremely
little effort.
Better would be to make it an Apache module triggered by a .txt Handler
that caches the automatically-generated plain text versions somewhere
after they're generated.
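As a rough illustration of that caching idea (a sketch only: the
"convert" function is a stand-in for the lynx -dump pipeline above, and
the docroot is a temporary directory made up for the demo):

```shell
#!/bin/sh
# Sketch: keep a generated .txt beside each .html and rebuild it only
# when the .html is newer than the cached copy.
# "convert" stands in for the real "lynx -dump" converter.
convert() { sed 's/<[^>]*>//g' "$1"; }

htdocs=$(mktemp -d)                           # hypothetical docroot
printf '<html><body>hello</body></html>\n' > "$htdocs/index.html"

page="$htdocs/index.html"
cache="${page%.html}.txt"

# Regenerate only if there is no cached copy, or the page is newer.
if [ ! -f "$cache" ] || [ "$page" -nt "$cache" ]; then
    convert "$page" > "$cache"
fi
cat "$cache"
```

A proper Apache module would hang this off a .txt handler instead of the
404 trick, but the staleness check is the same.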
Gerald
--
Gerald Oskoboiny
<ge...@impressive.net>
I keep my HTML pages in sync automatically by converting them from text
using a program, and using that program to automatically split and
cross-reference the files where appropriate. I use it to maintain about
300 FAQ pages. It takes about 5 minutes to regenerate the lot from 2 text
files.
more info at http://www.scot.demon.co.uk/q-html.html
--
Craig Cockburn ("coburn"), Dùn Èideann, Alba. (Edinburgh, Scotland)
http://www.scot.demon.co.uk/ E-mail: cr...@scot.demon.co.uk
Sgrìobh thugam 'sa Ghàidhlig ma 'se do thoil e. (Write to me in Gaelic, please.)