Getting the text from a webpage (not the source)

Guillaume Dargaud

Sep 17, 2012, 5:25:34 AM

Hello all,
I would like to script the equivalent of doing Ctrl-C on a webpage in a
browser, and then Ctrl-V in a text editor.
In other words I would like the text from a webpage, after all the html+css
and possibly javascript rendering. The idea is to get the text like a person
sees it, no "display:none" shenanigans.

I don't think it's a job for wget which only gets the source.
I was thinking of some option in links/lynx but I don't think those
interpret css.

Suggestions for the command line, without veering off-topic into Firefox
plugins?

Thanks
--
Guillaume Dargaud
http://www.gdargaud.net/

Bill Marcum

Sep 17, 2012, 1:48:54 PM

On 09/17/2012 05:25 AM, Guillaume Dargaud wrote:
> Hello all,
> I would like to script the equivalent of doing Ctrl-C on a webpage in a
> browser, and then Ctrl-V in a text editor.
> In other words I would like the text from a webpage, after all the html+css
> and possibly javascript rendering. The idea is to get the text like a person
> sees it, no "display:none" shenanigans.
>
> I don't think it's a job for wget which only gets the source.
> I was thinking of some option in links/lynx but I don't think those
> interpret css.
>

If you wget the source of a web page and then view that file in a
browser, it should be the same. Or you could do a screenshot, but that
would get an image of the page.

Kaz Kylheku

Sep 17, 2012, 2:05:12 PM

On 2012-09-17, Bill Marcum <bi...@nowhere.invalid> wrote:
> On 09/17/2012 05:25 AM, Guillaume Dargaud wrote:
>> Hello all,
>> I would like to script the equivalent of doing Ctrl-C on a webpage in a
>> browser, and then Ctrl-V in a text editor.
>> In other words I would like the text from a webpage, after all the html+css
>> and possibly javascript rendering. The idea is to get the text like a person
>> sees it, no "display:none" shenanigans.
>>
>> I don't think it's a job for wget which only gets the source.
>> I was thinking of some option in links/lynx but I don't think those
>> interpret css.
>>
> If you wget the source of a web page and then view that file in a
> browser, it should be the same.

What if the document that is rendered on the screen has some contents which are
computed by Javascript?

Wget doesn't contain a Javascript interpreter.

Guillaume is right.

Though, not sure how you can solve this easily with Unix shell tools.

You need a web scraping engine that processes Javascript.

Chris F.A. Johnson

Sep 17, 2012, 2:22:13 PM

On 2012-09-17, Kaz Kylheku wrote:
> On 2012-09-17, Bill Marcum <bi...@nowhere.invalid> wrote:
>> On 09/17/2012 05:25 AM, Guillaume Dargaud wrote:
>>> Hello all,
>>> I would like to script the equivalent of doing Ctrl-C on a webpage in a
>>> browser, and then Ctrl-V in a text editor.
>>> In other words I would like the text from a webpage, after all the html+css
>>> and possibly javascript rendering. The idea is to get the text like a person
>>> sees it, no "display:none" shenanigans.
>>>
>>> I don't think it's a job for wget which only gets the source.
>>> I was thinking of some option in links/lynx but I don't think those
>>> interpret css.
>>>
>> If you wget the source of a web page and then view that file in a
>> browser, it should be the same.
>
> What if the document that is rendered on the screen has some contents which are
> computed by Javascript?
>
> Wget doesn't contain a Javascript interpreter.

No, but the browser does. If you use the -r option with wget, it will
download all the needed files.

--
Chris F.A. Johnson, author <http://shell.cfajohnson.com/>
===================================================================
Shell Scripting Recipes: A Problem-Solution Approach (2005, Apress)
Pro Bash Programming: Scripting the GNU/Linux Shell (2009, Apress)

Sivaram Neelakantan

Sep 18, 2012, 10:48:01 AM

On Mon, Sep 17 2012, Chris F.A. Johnson wrote:

[snipped 14 lines]

>>> If you wget the source of a web page and then view that file in a
>>> browser, it should be the same.
>>
>> What if the document that is rendered on the screen has some
>> contents which are computed by Javascript?
>>
>> Wget doesn't contain a Javascript interpreter.
>
> No, but the browser does. If you use the -r option with wget, it
> will download all the needed files.

Well, I have a slightly different issue in that I'm trying to take
screenshots of a frame/minipage/table? with scrollbars within the
rendered html page.

I don't own the web app to make changes to it. There is a standard
page with a long table of say 100 countries with 10 columns of data
rendered in the same page with its own horizontal and vertical
scrollbars. If I try to take screenshots of the page, only the
visible portion within the inner scrollbar portion gets picked up.
How do I get the entire table out as a screenshot? The webpage itself
is a fantastic mishmash of css, javascript, html and possibly hamsters.

lynx, w3m, firefox addons, windows screengrab tools....nyet, it's just
not picking up more than what's visible.

sivaram

Guillaume Dargaud

Sep 20, 2012, 4:57:15 AM

>> Wget doesn't contain a Javascript interpreter.
>
> No, but the browser does. If you use the -r option with wget, it will
> download all the needed files.
>

Well, I don't need a recursive download as much as a way to ignore
"display:none" in CSS (I'm interested in a specific set of pages, not the
entire web). So command line seems out. I guess the only 'perfect' solution
would be some screen grab in the browser that then feeds it into an OCR
engine !!! Maybe there's a Firefox plugin for that...

Ben Bacarisse

Sep 20, 2012, 6:10:41 AM

Anything with display:none will not show up in the screen grab, unless
you also take other measures.

Does your system have html2text? How close does

wget -O - <url> | html2text

come to what you want?
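
For comparison, the pipeline above can be approximated with Python's standard library alone. This is only a sketch: the names NaiveTextExtractor, html_to_text and fetch_text are made up for illustration, UTF-8 is assumed for the page, and the output is plain lines rather than html2text's formatting. Like html2text, it pays no attention to CSS or JavaScript, so text hidden with display:none still comes out.

```python
import re
from html.parser import HTMLParser
from urllib.request import urlopen

# Tags that start a new line in the output, and tags whose content is dropped.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "li", "br", "tr"}
IGNORED_TAGS = {"script", "style", "title"}


class NaiveTextExtractor(HTMLParser):
    """Collect text nodes while ignoring CSS and JavaScript entirely."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._ignored_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED_TAGS:
            self._ignored_depth += 1
        elif tag in BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in IGNORED_TAGS:
            self._ignored_depth = max(0, self._ignored_depth - 1)
        elif tag in BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_data(self, data):
        if self._ignored_depth == 0:
            # Collapse source line breaks and runs of spaces, as a browser would.
            self._chunks.append(re.sub(r"\s+", " ", data))

    def text(self):
        lines = (line.strip() for line in "".join(self._chunks).split("\n"))
        return "\n".join(line for line in lines if line)


def html_to_text(markup):
    parser = NaiveTextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()


def fetch_text(url):
    # Rough equivalent of "wget -O - <url> | html2text"; UTF-8 is an assumption.
    with urlopen(url) as response:
        return html_to_text(response.read().decode("utf-8", errors="replace"))
```

Calling fetch_text on a URL needs network access; html_to_text can be tried on any HTML string.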

--
Ben.

Ivan Shmakov

Sep 20, 2012, 7:14:44 AM

>>>>> Ben Bacarisse <ben.u...@bsb.me.uk> writes:
>>>>> Guillaume Dargaud <use_the_co...@www.gdargaud.net> writes:

[Cross-posting to news:comp.infosystems.www.misc.]

[...]

>> Well, I don't need a recursive download as much as a way to ignore
>> "display:none" in CSS (I'm interested in a specific set of pages,
>> not the entire web).

[...]

> Anything with display:none will not show up in the screen grab,
> unless you also take other measures.

> Does your system have html2text? How close does

> wget -O - <url> | html2text

> come to what you want?

Neither html2text nor $ lynx -dump honors CSS (or JavaScript,
for that matter), which is what (AIUI) the OP needs.

--
FSF associate member #7257

Ben Bacarisse

Sep 20, 2012, 7:41:09 AM

Possibly, yes. That's why I said "how close does it come", but I was
responding to a message that seemed to be more specific. It mentioned only
ignoring display:none, which html2text does of course. What other CSS
should be honoured seems to be up in the air.

I get the feeling the requirements are not set in stone and certainly
have not yet been fully stated. For example, the fact that CSS can
generate text has not yet come up, nor has the fact that CSS's
re-ordering of the text can make a significant difference to how usable
the result is.

--
Ben.

Ivan Shmakov

Sep 20, 2012, 8:29:54 AM

>>>>> Ben Bacarisse <ben.u...@bsb.me.uk> writes:
>>>>> Ivan Shmakov <onei...@gmail.com> writes:
>>>>> Ben Bacarisse <ben.u...@bsb.me.uk> writes:

[...]

>>> Does your system have html2text? How close does

>>> wget -O - <url> | html2text

>>> come to what you want?

>> Neither html2text nor $ lynx -dump honors CSS (or JavaScript, for
>> that matter), which is what (AIUI) the OP needs.

> Possibly, yes. That's why I said "how close does it come", but I was
> responding to a message that seemed to be more specific. It mentioned
> only ignoring display:none which html2text does of course. What
> other CSS should be honoured seems to be up in the air.

A quick web search [1] reveals that there're in fact two (at the
least) versions of html2text, and while I haven't checked for
that specifically, I'm pretty sure that the version currently in
Debian [2] (which I was referring to) doesn't honor CSS.

I don't know anything about the other version, though.

[1] http://duckduckgo.com/?q=html2text
[2] http://packages.debian.org/sid/html2text

> I get the feeling the requirements are not set in stone and certainly
> have not yet been fully stated. For example, the fact that CSS can
> generate text has not yet come up, nor has the fact the CSS's
> re-ordering of the text can make a significant difference to how
> usable the result is.

Yes.

Ben Bacarisse

Sep 20, 2012, 9:23:49 AM

Ivan Shmakov <onei...@gmail.com> writes:

>>>>>> Ben Bacarisse <ben.u...@bsb.me.uk> writes:
>>>>>> Ivan Shmakov <onei...@gmail.com> writes:
>>>>>> Ben Bacarisse <ben.u...@bsb.me.uk> writes:
>
> [...]
>
> >>> Does your system have html2text? How close does
>
> >>> wget -O - <url> | html2text
>
> >>> come to what you want?
>
> >> Neither html2text nor $ lynx -dump honors CSS (or JavaScript, for
> >> that matter), which is what (AIUI) the OP needs.
>
> > Possibly, yes. That's why I said "how close does it come", but I was
> > responding to a message that seemed to be more specific. It mentioned
> > only ignoring display:none which html2text does of course. What
> > other CSS should be honoured seems to be up in the air.
>
> A quick web search [1] reveals that there're in fact two (at the
> least) versions of html2text, and while I haven't checked for
> that specifically, I'm pretty sure that the version currently in
> Debian [2] (which I was referring to) doesn't honor CSS.
>
> I don't know anything about the other version, though.

Yes, I knew about the two versions but since my reply was just a punt I
didn't think to mention it. I should have, just in case the OP finds it
suitable.

The main differences seem to be that the Debian version can do recoding
but can't fetch the page via HTTP (which means that it can't do the
recoding based on the HTTP response!).

<snip>
--
Ben.

Manuel Collado

Sep 20, 2012, 9:24:46 AM

On 20/09/2012 10:57, Guillaume Dargaud wrote:

Instead of OCR from a screen capture, it should be possible to "print"
the page to a PDF file by using a PDF printer driver, and then use a
pdf2text utility to extract the text contents.

Just an idea. Don't know if this is usable in practice.
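
A rough sketch of how the PDF route could be scripted. It swaps the "PDF printer driver" for the wkhtmltopdf command-line renderer and uses pdftotext for the extraction step; both tools are assumptions on my part (they have to be installed separately), and the function names and the page.pdf default are made up for illustration. I have not checked how faithfully the rendered PDF matches a desktop browser.

```python
import subprocess


def pdf_route_commands(url, pdf_path="page.pdf"):
    """Return the two command lines: render the URL to PDF, then dump its text."""
    render = ["wkhtmltopdf", url, pdf_path]
    # "-" as the output file makes pdftotext write to standard output.
    extract = ["pdftotext", "-layout", pdf_path, "-"]
    return render, extract


def page_text_via_pdf(url, pdf_path="page.pdf"):
    """Run both steps and return the extracted text (needs both tools on PATH)."""
    render, extract = pdf_route_commands(url, pdf_path)
    subprocess.run(render, check=True)
    result = subprocess.run(extract, check=True, capture_output=True, text=True)
    return result.stdout
```

Keeping the command construction separate from the execution makes it easy to print the commands and try them by hand first.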

--
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado

Ben Bacarisse

Sep 20, 2012, 9:41:02 AM

Manuel Collado <m.co...@domain.invalid> writes:

> On 20/09/2012 10:57, Guillaume Dargaud wrote:
>>>> Wget doesn't contain a Javascript interpreter.
>>>
>>> No, but the browser does. If you use the -r option with wget, it will
>>> download all the needed files.
>>>
>>
>> Well, I don't need a recursive download as much as a way to ignore
>> "display:none" in CSS (I'm interested in a specific set of pages, not the
>> entire web). So command line seems out. I guess the only 'perfect' solution
>> would be some screen grab in the browser that then feeds it into an OCR
>> engine !!! Maybe there's a Firefox plugin for that...

Given that we're reduced to guessing, here's another idea: use X
automation to select all the text in the browser window and paste it
into a file. Much simpler than OCR. But...

> Instead of OCR from a screen capture, it should be possible to "print"
> the page to a PDF file by using a PDF printer driver, and then use a
> pdf2text utility to extract the text contents.

Another reasonable idea, but that does not address the display:none
issue. It's reasonable to ignore it for the moment since the same
problem would arise from the OP's own suggestion for a "perfect"
method.

--
Ben.

Ivan Shmakov

Sep 20, 2012, 10:42:03 AM

>>>>> Ben Bacarisse <ben.u...@bsb.me.uk> writes:
>>>>> Ivan Shmakov <onei...@gmail.com> writes:
>>>>> Ben Bacarisse <ben.u...@bsb.me.uk> writes:

>>>>> Ivan Shmakov <onei...@gmail.com> writes:

[...]

>>>> Neither html2text nor $ lynx -dump honors CSS (or JavaScript, for
>>>> that matter), which is what (AIUI) the OP needs.

>>> Possibly, yes. That's why I said "how close does it come", but I
>>> was responding to a message that seemed to be more specific. It
>>> mentioned only ignoring display:none which html2text does of
>>> course. What other CSS should be honoured seems to be up in the
>>> air.

>> A quick web search [1] reveals that there're in fact two (at the
>> least) versions of html2text, and while I haven't checked for that
>> specifically, I'm pretty sure that the version currently in Debian
>> [2] (which I was referring to) doesn't honor CSS.

>> I don't know anything about the other version, though.

> Yes, I knew about the two versions but since my reply was just a punt
> I didn't think to mention it. I should have, just in case the OP
> finds it suitable.

> The main differences seem to be that the Debian version can do
> recoding but can't fetch the page via HTTP

The main difference between the two versions I've been referring
to is that one of them is written in Python, and the other in
C++.

The version of html2text in Debian (written in C++) doesn't seem
to honor CSS (or process &# symbol references, BTW.) E. g.:

$ html2text < 1348151128.xhtml
****** CSS ‘display:none’ example ******
This text should be visible, and this one shouldn't.
$

Neither does Lynx:

$ lynx -dump -- 1348151128.xhtml
CSS `display:none' example

This text should be visible, and this one shouldn't.
$

> (which means that it can't do the recoding based on the HTTP
> response!).

Then it shouldn't be expected to take any externally referenced
CSS into account, either.

The document is as follows (it's correctly rendered by
Iceweasel, and passes checks at http://validator.w3.org/.)

$ cat < 1348151128.xhtml
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml"
xml:lang="en">
<head>
<title>CSS ‘display:none’ example</title>
<style type="text/css">.invis { display: none; }</style>
</head>

<body>
<h1>CSS ‘display:none’ example</h1>

<p>This text should be
visible<span class="invis"
>, and this one shouldn't</span>.</p>
</body>
</html>
$
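
For what it's worth, a minimal sketch of what an extractor would have to do to pass this particular test. It is nowhere near a CSS engine: it only recognises class rules of the form ".name { display: none }" in embedded <style> blocks plus inline style attributes, and it knows nothing about external stylesheets, other selectors, the cascade, or JavaScript. All names (hidden_classes, VisibleTextExtractor, visible_text) are made up for illustration.

```python
import re
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "li", "br", "tr"}
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}   # never have an end tag
NON_RENDERED = {"script", "style", "title"}

# Simplification: only rules whose selector ends in a single class name.
HIDDEN_CLASS_RULE = re.compile(
    r"\.([\w-]+)\s*\{[^}]*\bdisplay\s*:\s*none\b[^}]*\}", re.I)
INLINE_HIDDEN = re.compile(r"\bdisplay\s*:\s*none\b", re.I)


def hidden_classes(markup):
    """Class names that an embedded <style> block sets to display:none."""
    names = set()
    for css in re.findall(r"<style\b[^>]*>(.*?)</style>", markup, re.I | re.S):
        names.update(HIDDEN_CLASS_RULE.findall(css))
    return names


class VisibleTextExtractor(HTMLParser):
    def __init__(self, hidden):
        super().__init__(convert_charrefs=True)
        self._hidden_classes = hidden
        self._stack = []    # (tag, hidden) for every open element
        self._chunks = []

    def _is_hidden(self, tag, attrs):
        if self._stack and self._stack[-1][1]:
            return True     # a hidden ancestor hides all descendants
        if tag in NON_RENDERED:
            return True
        attributes = dict(attrs)
        classes = set((attributes.get("class") or "").split())
        if classes & self._hidden_classes:
            return True
        return bool(INLINE_HIDDEN.search(attributes.get("style") or ""))

    def handle_starttag(self, tag, attrs):
        hidden = self._is_hidden(tag, attrs)
        if tag in BLOCK_TAGS and not hidden:
            self._chunks.append("\n")
        if tag not in VOID_TAGS:
            self._stack.append((tag, hidden))

    def handle_endtag(self, tag):
        if not any(open_tag == tag for open_tag, _ in self._stack):
            return          # stray end tag, ignore it
        while True:
            open_tag, hidden = self._stack.pop()
            if open_tag == tag:
                break
        if tag in BLOCK_TAGS and not hidden:
            self._chunks.append("\n")

    def handle_data(self, data):
        if not (self._stack and self._stack[-1][1]):
            self._chunks.append(re.sub(r"\s+", " ", data))

    def text(self):
        lines = (line.strip() for line in "".join(self._chunks).split("\n"))
        return "\n".join(line for line in lines if line)


def visible_text(markup):
    parser = VisibleTextExtractor(hidden_classes(markup))
    parser.feed(markup)
    parser.close()
    return parser.text()
```

Fed the document above, visible_text should return the heading followed by "This text should be visible." without the hidden span. Anything beyond such toy cases really calls for a real rendering engine.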

Eli the Bearded

Sep 20, 2012, 5:58:18 PM

In comp.infosystems.www.misc, Ivan Shmakov <onei...@gmail.com> wrote:
> $ html2text < 1348151128.xhtml
> ****** CSS ‘display:none’ example ******
> This text should be visible, and this one shouldn't.
> $
>

> $ lynx -dump -- 1348151128.xhtml
> CSS `display:none' example
>
> This text should be visible, and this one shouldn't.
> $
>

> The document is as follows (it's correctly rendered by
> Iceweasel, and passes checks at http://validator.w3.org/.)
>
> $ cat < 1348151128.xhtml
> <!DOCTYPE html>
> <html xmlns="http://www.w3.org/1999/xhtml"
> xml:lang="en">
> <head>
> <title>CSS ‘display:none’ example</title>
> <style type="text/css">.invis { display: none; }</style>
> </head>
>
> <body>
> <h1>CSS ‘display:none’ example</h1>
>
> <p>This text should be
> visible<span class="invis"
> >, and this one shouldn't</span>.</p>
> </body>
> </html>
> $

Tricky! I tried with links (full screen text mode browser), elinks (full
screen text mode browser), w3m (full screen text mode browser), and
edbrowse (ed style, line by line text mode browser), too. All failed
that test:

$ links http://localhost/invis.html

CSS `display:none' example
This text should be visible, and this one shouldn't.

$ elinks http://localhost/invis.html

CSS `display:none' example

This text should be visible, and this one shouldn't.

$ w3m http://localhost/invis.html

CSS ‘display:none’ example

This text should be visible, and this one shouldn't.

$ edbrowse http://localhost/invis.html
no ssl certificate file specified; secure connections cannot be verified
417
85
1,$p

CSS ‘display:none’ example

This text should be visible, and this one shouldn't.

q
$

The last browser was selected because it, alone among text
browsers, has some support for javascript text changes:

$ cat javascript.html

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml"
xml:lang="en">
<head>
<title>Javascript ‘innerHTML’ example</title>
<script type="text/javascript">
function changeText(){
document.getElementById('change_me').innerHTML =
', and so should this';
}
</script>
</head>

<body onLoad="changeText();" >
<h1>Javascript ‘innerHTML’ example</h1>

<p>This text should be
visible<span id="change_me"
>, and this one shouldn't</span>.</p>
</body>
</html>

$ edbrowse http://localhost/javascript.html
no ssl certificate file specified; secure connections cannot be verified
552
191
Javascript ‘innerHTML’ example

This text should be visible, and this one shouldn't.

------------------------------------------------------------------------------
, and so should this
q
$

Elijah
------
enjoys experimenting with text mode browsers

William Ahern

Sep 20, 2012, 6:33:19 PM

PhantomJS is a headless WebKit with JavaScript API. It has fast and
native support for various web standards: DOM handling, CSS
selector, JSON, Canvas, and SVG.

-- http://phantomjs.org/

Last time I tried it, it needed Xlib and Xvfb, which was a non-starter for
me. But AFAIK it's been much improved since then and the "only" requirement
is Qt and WebKit.

Basically, you would use JavaScript to process the rendered page as you
like and spit out the results, or even dump to a PDF or PNG or what have
you.
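
A rough sketch of that idea, driven from Python so it can be called from a shell script. The embedded PhantomJS script loads the page, lets WebKit render it, and prints document.body.innerText, which as far as I know follows the rendered layout and so leaves out display:none text while including text set by JavaScript. It assumes a phantomjs binary on PATH; the names PHANTOM_SCRIPT, phantom_command and rendered_text are made up for illustration, and pages that keep changing after the load event would need extra waiting that is not shown here.

```python
import os
import subprocess
import tempfile

# PhantomJS-side script (JavaScript), kept as a string so the sketch is one file.
PHANTOM_SCRIPT = """\
var page = require('webpage').create();
var system = require('system');
page.open(system.args[1], function (status) {
    if (status !== 'success') {
        console.log('could not load ' + system.args[1]);
        phantom.exit(1);
    } else {
        console.log(page.evaluate(function () {
            return document.body.innerText;
        }));
        phantom.exit(0);
    }
});
"""


def phantom_command(script_path, url, phantomjs="phantomjs"):
    """Command line that runs the script against one URL."""
    return [phantomjs, script_path, url]


def rendered_text(url, phantomjs="phantomjs"):
    """Write the script to a temporary file, run it, and return what it printed."""
    with tempfile.NamedTemporaryFile("w", suffix=".js", delete=False) as script:
        script.write(PHANTOM_SCRIPT)
    try:
        result = subprocess.run(phantom_command(script.name, url, phantomjs),
                                check=True, capture_output=True, text=True)
        return result.stdout
    finally:
        os.unlink(script.name)
```

For example, rendered_text("http://localhost/invis.html") would be the call to try against the test pages posted earlier in the thread.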

Allodoxaphobia

Sep 20, 2012, 6:59:42 PM

On Tue, 18 Sep 2012 20:18:01 +0530, Sivaram Neelakantan wrote:
> On Mon, Sep 17 2012, Chris F.A. Johnson wrote:
>
>
> [snipped 14 lines]
>
>>>> If you wget the source of a web page and then view that file in a
>>>> browser, it should be the same.
>>>
>>> What if the document that is rendered on the screen has some
>>> contents which are computed by Javascript?
>>>
>>> Wget doesn't contain a Javascript interpreter.
>>
>> No, but the browser does. If you use the -r option with wget, it
>> will download all the needed files.
>
> Well, I have a slightly different issue in that I'm trying to take
> screenshots of a frame/minipage/table? with scrollbars within the
> rendered html page.