Quite often, when I need to read a list of web pages, I download the
HTML sources and save them in a single file such as a.html.
If they are mostly text, I open the HTML in a web browser, select all,
copy it into an editor, and save it.
I want to make this process shorter.
How can I extract the text from HTML source?
I'm sure there are many parsers for it.
What is the most convenient one?
Thanks.
Sam
Take a look at Michael Neumann's WWW::Mechanize:
http://www.ntecs.de/blog/Blog/WWW-Mechanize.rdoc
http://rubyforge.org/frs/?group_id=427&release_id=2014
Or install the gem
James
--
http://www.ruby-doc.org
http://www.rubyxml.com
http://catapult.rubyforge.com
http://orbjson.rubyforge.com
http://ooo4r.rubyforge.com
http://www.jamesbritt.com
You don't need ruby for this:
$ apt-cache show w3m
Package: w3m
[snip]
Description: WWW browsable pager with excellent tables/frames support
w3m is a text-based World Wide Web browser with IPv6 support.
It features excellent support for tables and frames. It can be used
as a standalone file pager, too.
.
* You can follow links and/or view images in HTML.
* Internet message preview mode, you can browse HTML mail.
* You can follow links in plain text if it includes URL forms.
* With w3m-img, you can view image inline.
.
For more information,
see http://sourceforge.net/projects/w3m
$ w3m -dump http://ruby.brian-schroeder.de/quiz/mazes/ | head
A ruby a day!
Ruby Quiz Solutions (Amazing Mazes)
Amazing Mazes
For a full description see: (Amazing Mazes on Ruby Quiz Homepage)[http://
www.rubyquiz.com/quiz31.html]
Another graph algorithm. Create a maze that is fully connected and has only one
$
regards,
Brian
--
http://ruby.brian-schroeder.de/
multilingual _non rails_ ruby based vocabulary trainer:
http://www.vocabulaire.org/ | http://www.gloser.org/ | http://www.vokabeln.net/
Thanks, James.
That looks cool.
However, it doesn't seem to have a function to extract text from HTML.
(Or did I miss it?)
What I want is...
<table><tr><td>TEST</td></tr></table> => TEST
Is there a module that does this?
Regards,
Sam
Oh, thanks.
I just realized that even lynx can do that.
Regards,
Sam
#--------------------------------------------------------------------
# Strip HTML Tags from Line
#--------------------------------------------------------------------
def striphtml(line)
  line.gsub(/\n/, ' ').gsub(/<.*?>/, '')
end
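A quick check of striphtml against the example markup from earlier in the thread (the method is repeated here so the snippet runs on its own):

```ruby
# Strip HTML tags from a single line of text.
def striphtml(line)
  line.gsub(/\n/, ' ').gsub(/<.*?>/, '')
end

puts striphtml("<table><tr><td>TEST</td></tr></table>")
# => TEST
```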
No, it is a library for the (fairly) easy creation of HTML munging code.
Some coding is required, but it allows complete control (so you get just
the text of interest).
James
Save As ... [text file].txt
- Removes all tags.
(Verified with Opera, Firefox & IE6, so I guess most browsers do this)
( e.g. test page: http://www.qurl.net/ )
daz
Sam
Thank you for sharing the code.
However, this code works only on one line at a time, right?
When I tested it on a full page of HTML by looping line by line, the
result was not what I expected.
Probably I need a DOM parser... :-(
Sam
You may find my HTMLTokenizer library convenient for this. To do what you
need, all you'd do is keep calling "tokenizer.getText()"
http://rubyforge.org/projects/htmltokenizer/
Ben
WWW::Mechanize sits atop such a process, but makes it easier to define
what to do for selected elements and such.
Just sayin' ...
James
I'd recommend using
line.gsub(/\n/, ' ').gsub(/<[^>]+>/, '')
instead of
> line.gsub(/\n/, ' ').gsub(/<.*?>/, '')
> end
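This also points at the line-by-line problem mentioned above: a tag that is split across a line break never matches the pattern on either line. A minimal sketch (the sample markup here is made up) is to join the whole input into one string first, then strip tags:

```ruby
# A tag split across two lines, which defeats line-by-line stripping:
html = "<table><tr\nclass='x'><td>TEST</td></tr></table>"

# Join the lines first, then remove complete tags in one pass.
text = html.gsub(/\n/, ' ').gsub(/<[^>]+>/, '')
puts text
# => TEST
```

For a saved file like a.html, reading it with File.read rather than looping over each line gives the same effect.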
Julius
I guess you'd run it through an XML parser, like Expat, which is everywhere
these days. Even Bash and Gawk have an interface to it.
--
William Park <openge...@yahoo.ca>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html