
Re: Getting the text from a webpage (not the source)

Ivan Shmakov

Sep 20, 2012, 7:14:44 AM
>>>>> Ben Bacarisse <ben.u...@bsb.me.uk> writes:
>>>>> Guillaume Dargaud <use_the_co...@www.gdargaud.net> writes:

[Cross-posting to news:comp.infosystems.www.misc.]

[...]

>> Well, I don't need a recursive download as much as a way to ignore
>> "display:none" in CSS (I'm interested in a specific set of pages,
>> not the entire web).

[...]

> Anything with display:none will not show up in the screen grab,
> unless you also take other measures.

> Does your system have html2text? How close does

> wget -O - <url> | html2text

> come to what you want?

Neither html2text nor $ lynx -dump honors CSS (or JavaScript,
for that matter), which is what (AIUI) the OP needs.

--
FSF associate member #7257

Ben Bacarisse

Sep 20, 2012, 7:41:09 AM
Possibly, yes. That's why I said "how close does it come", but I was
responding to a message that seemed to be more specific. It mentioned only
ignoring display:none, which html2text does, of course. What other CSS
should be honoured seems to be up in the air.

I get the feeling the requirements are not set in stone and certainly
have not yet been fully stated. For example, the fact that CSS can
generate text has not yet come up, nor has the fact that CSS's
re-ordering of the text can make a significant difference to how usable
the result is.

--
Ben.

Ivan Shmakov

Sep 20, 2012, 8:29:54 AM
>>>>> Ben Bacarisse <ben.u...@bsb.me.uk> writes:
>>>>> Ivan Shmakov <onei...@gmail.com> writes:
>>>>> Ben Bacarisse <ben.u...@bsb.me.uk> writes:

[...]

>>> Does your system have html2text? How close does

>>> wget -O - <url> | html2text

>>> come to what you want?

>> Neither html2text nor $ lynx -dump honors CSS (or JavaScript, for
>> that matter), which is what (AIUI) the OP needs.

> Possibly, yes. That's why I said "how close does it come", but I was
> responding to a message that seemed to be more specific. It mentioned
> only ignoring display:none which html2text does of course. What
> other CSS should be honoured seems to be up in the air.

A quick web search [1] reveals that there are in fact at least two
versions of html2text, and while I haven't checked for that
specifically, I'm pretty sure that the version currently in
Debian [2] (which I was referring to) doesn't honor CSS.

I don't know anything about the other version, though.

[1] http://duckduckgo.com/?q=html2text
[2] http://packages.debian.org/sid/html2text

> I get the feeling the requirements are not set in stone and certainly
> have not yet been fully stated. For example, the fact that CSS can
> generate text has not yet come up, nor has the fact that CSS's
> re-ordering of the text can make a significant difference to how
> usable the result is.

Yes.

Ben Bacarisse

Sep 20, 2012, 9:23:49 AM
Ivan Shmakov <onei...@gmail.com> writes:

>>>>>> Ben Bacarisse <ben.u...@bsb.me.uk> writes:
>>>>>> Ivan Shmakov <onei...@gmail.com> writes:
>>>>>> Ben Bacarisse <ben.u...@bsb.me.uk> writes:
>
> [...]
>
> >>> Does your system have html2text? How close does
>
> >>> wget -O - <url> | html2text
>
> >>> come to what you want?
>
> >> Neither html2text nor $ lynx -dump honors CSS (or JavaScript, for
> >> that matter), which is what (AIUI) the OP needs.
>
> > Possibly, yes. That's why I said "how close does it come", but I was
> > responding to a message that seemed to be more specific. It mentioned
> > only ignoring display:none which html2text does of course. What
> > other CSS should be honoured seems to be up in the air.
>
> A quick web search [1] reveals that there're in fact two (at the
> least) versions of html2text, and while I haven't checked for
> that specifically, I'm pretty sure that the version currently in
> Debian [2] (which I was referring to) doesn't honor CSS.
>
> I don't know anything about the other version, though.

Yes, I knew about the two versions but since my reply was just a punt I
didn't think to mention it. I should have, just in case the OP finds it
suitable.

The main differences seem to be that the Debian version can do recoding
but can't fetch the page via HTTP (which means that it can't do the
recoding based on the HTTP response!).
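
To make the point about the HTTP response concrete, here is a minimal Python sketch of recoding driven by a Content-Type header value. It is not how either html2text works; the function names, the UTF-8 target, and the fallback charset are all assumptions made for illustration.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Pull the charset parameter out of a Content-Type header value.

    The default is an assumption of this sketch, not something taken
    from either html2text implementation.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def recode(body, content_type, target="utf-8"):
    """Decode the raw bytes using the header's charset, then re-encode."""
    source = charset_from_content_type(content_type)
    return body.decode(source, errors="replace").encode(target)


if __name__ == "__main__":
    raw = "caf\u00e9".encode("latin-1")
    print(recode(raw, "text/html; charset=ISO-8859-1"))
```

A tool that only ever reads the document from standard input never sees that header, so the most it can do is look inside the document itself or take the encoding as an option.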

<snip>
--
Ben.

Ivan Shmakov

Sep 20, 2012, 10:42:03 AM
>>>>> Ben Bacarisse <ben.u...@bsb.me.uk> writes:
>>>>> Ivan Shmakov <onei...@gmail.com> writes:
>>>>> Ben Bacarisse <ben.u...@bsb.me.uk> writes:
>>>>> Ivan Shmakov <onei...@gmail.com> writes:

[...]

>>>> Neither html2text nor $ lynx -dump honors CSS (or JavaScript, for
>>>> that matter), which is what (AIUI) the OP needs.

>>> Possibly, yes. That's why I said "how close does it come", but I
>>> was responding to a message that seemed to be more specific. It
>>> mentioned only ignoring display:none which html2text does of
>>> course. What other CSS should be honoured seems to be up in the
>>> air.

>> A quick web search [1] reveals that there're in fact two (at the
>> least) versions of html2text, and while I haven't checked for that
>> specifically, I'm pretty sure that the version currently in Debian
>> [2] (which I was referring to) doesn't honor CSS.

>> I don't know anything about the other version, though.

> Yes, I knew about the two versions but since my reply was just a punt
> I didn't think to mention it. I should have, just in case the OP
> finds it suitable.

> The main differences seem to be that the Debian version can do
> recoding but can't fetch the page via HTTP

The main difference between the two versions I've been referring
to is that one of them is written in Python, and the other in
C++.

The version of html2text in Debian (written in C++) doesn't seem
to honor CSS (or process &#...; numeric character references, BTW). E.g.:

$ html2text < 1348151128.xhtml
****** CSS &#x2018;display:none&#x2019; example ******
This text should be visible, and this one shouldn't.
$

Neither does Lynx:

$ lynx -dump -- 1348151128.xhtml
CSS `display:none' example

This text should be visible, and this one shouldn't.
$

> (which means that it can't do the recoding based on the HTTP
> response!).

Then it shouldn't be expected to take any external CSS
referenced into account, either.

The document is as follows (it's correctly rendered by
Iceweasel, and passes checks at http://validator.w3.org/.)

$ cat < 1348151128.xhtml
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml"
xml:lang="en">
<head>
<title>CSS &#x2018;display:none&#x2019; example</title>
<style type="text/css">.invis { display: none; }</style>
</head>

<body>
<h1>CSS &#x2018;display:none&#x2019; example</h1>

<p>This text should be
visible<span class="invis"
>, and this one shouldn't</span>.</p>
</body>
</html>
$
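
For illustration only, here is a minimal Python sketch of a filter that would pass this particular test. It is not what any of the tools above do. It assumes well-nested markup and understands nothing but simple class rules of the form `.name { display: none }` found in inline <style> elements; external stylesheets, other selectors, and the cascade are ignored. All names in it (hidden_classes, visible_text, and so on) are made up for the example.

```python
import re
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
# Elements whose contents are never part of the rendered body text.
SKIP_TAGS = {"head", "script", "style"}


def hidden_classes(css):
    """Return class names from rules like `.name { ... display: none ... }`."""
    return set(re.findall(
        r"\.([\w-]+)\s*\{[^}]*display\s*:\s*none[^}]*\}", css))


class StyleCollector(HTMLParser):
    """First pass: gather the contents of every inline <style> element."""

    def __init__(self):
        super().__init__()
        self.in_style = False
        self.css = []

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self.in_style = True

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False

    def handle_data(self, data):
        if self.in_style:
            self.css.append(data)


class VisibleText(HTMLParser):
    """Second pass: keep text outside skipped and hidden subtrees."""

    def __init__(self, hidden):
        super().__init__(convert_charrefs=True)
        self.hidden = hidden
        self.skip_depth = 0  # > 0 while inside a skipped or hidden subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = set((dict(attrs).get("class") or "").split())
        if self.skip_depth or tag in SKIP_TAGS or self.hidden & classes:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if self.skip_depth and tag not in VOID:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def visible_text(html):
    """Return the whitespace-normalized text that is not display:none."""
    collector = StyleCollector()
    collector.feed(html)
    collector.close()
    extractor = VisibleText(hidden_classes("".join(collector.css)))
    extractor.feed(html)
    extractor.close()
    return " ".join("".join(extractor.chunks).split())


SAMPLE = """<html><head><title>t</title>
<style type="text/css">.invis { display: none; }</style></head>
<body><p>This text should be
visible<span class="invis"
>, and this one shouldn't</span>.</p></body></html>"""

if __name__ == "__main__":
    print(visible_text(SAMPLE))  # This text should be visible.
```

Anything beyond this one case (generated content, re-ordering, scripts) would need a real rendering engine rather than a parser with a regular expression bolted on.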

Eli the Bearded

Sep 20, 2012, 5:58:18 PM
In comp.infosystems.www.misc, Ivan Shmakov <onei...@gmail.com> wrote:
> $ html2text < 1348151128.xhtml
> ****** CSS &#x2018;display:none&#x2019; example ******
> This text should be visible, and this one shouldn't.
> $
>
> $ lynx -dump -- 1348151128.xhtml
> CSS `display:none' example
>
> This text should be visible, and this one shouldn't.
> $
>
> The document is as follows (it's correctly rendered by
> Iceweasel, and passes checks at http://validator.w3.org/.)
>
> $ cat < 1348151128.xhtml
> <!DOCTYPE html>
> <html xmlns="http://www.w3.org/1999/xhtml"
> xml:lang="en">
> <head>
> <title>CSS &#x2018;display:none&#x2019; example</title>
> <style type="text/css">.invis { display: none; }</style>
> </head>
>
> <body>
> <h1>CSS &#x2018;display:none&#x2019; example</h1>
>
> <p>This text should be
> visible<span class="invis"
> >, and this one shouldn't</span>.</p>
> </body>
> </html>
> $

Tricky! I tried links, elinks, and w3m (all full-screen text-mode
browsers) and edbrowse (an ed-style, line-by-line text-mode browser),
too. All failed that test:

$ links http://localhost/invis.html
CSS `display:none' example
This text should be visible, and this one shouldn't.

$ elinks http://localhost/invis.html
CSS `display:none' example

This text should be visible, and this one shouldn't.

$ w3m http://localhost/invis.html
CSS ‘display:none’ example

This text should be visible, and this one shouldn't.

$ edbrowse http://localhost/invis.html
no ssl certificate file specified; secure connections cannot be verified
417
85
1,$p
CSS ‘display:none’ example

This text should be visible, and this one shouldn't.
q
$

I included the last browser because it, alone among the text
browsers, has some support for JavaScript text changes:

$ cat javascript.html
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml"
xml:lang="en">
<head>
<title>Javascript &#x2018;innerHTML&#x2019; example</title>
<script type="text/javascript">
function changeText(){
document.getElementById('change_me').innerHTML =
', and so should this';
}
</script>
</head>

<body onLoad="changeText();" >
<h1>Javascript &#x2018;innerHTML&#x2019; example</h1>

<p>This text should be
visible<span id="change_me"
>, and this one shouldn't</span>.</p>
</body>
</html>
$ edbrowse http://localhost/javascript.html
no ssl certificate file specified; secure connections cannot be verified
552
191
Javascript ‘innerHTML’ example

This text should be visible, and this one shouldn't.
------------------------------------------------------------------------------
, and so should this
q
$

Elijah
------
enjoys experimenting with text mode browsers