Message from discussion Getting the text from a webpage (not the source)
Kaz Kylheku  
Newsgroups: comp.unix.shell
From: Kaz Kylheku <k...@kylheku.com>
Date: Mon, 17 Sep 2012 18:05:12 +0000 (UTC)
Local: Mon, Sep 17 2012 2:05 pm
Subject: Re: Getting the text from a webpage (not the source)
On 2012-09-17, Bill Marcum <b...@nowhere.invalid> wrote:

> On 09/17/2012 05:25 AM, Guillaume Dargaud wrote:
>> Hello all,
>> I would like to script the equivalent of doing Ctrl-C on a webpage in a
>> browser, and then Ctrl-V in a text editor.
>> In other words I would like the text from a webpage, after all the html+css
>> and possibly javascript rendering. The idea is to get the text like a person
>> sees it, no "display:none" shenanigans.

>> I don't think it's a job for wget which only gets the source.
>> I was thinking of some option in links/lynx but I don't think those
>> interpret css.

> If you wget the source of a web page and then view that file in a
> browser, it should be the same.

What if the document that is rendered on the screen has some content that is
computed by JavaScript?

Wget doesn't contain a JavaScript interpreter.
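
To make that concrete, here is a minimal Python sketch (standard library only)
of what a source-only fetch can recover. The sample page and the names
`PAGE`, `StaticText` and `static_text` are made up for illustration. A browser
would show "Total: 42" under the heading, but nothing here executes the
script, so that line never appears.

```python
from html.parser import HTMLParser

# Hypothetical page: the paragraph is empty in the source and is
# filled in by the script only when a browser runs it.
PAGE = """<html><body>
<h1>Report</h1>
<p id="total"></p>
<script>
document.getElementById("total").textContent = "Total: " + (40 + 2);
</script>
</body></html>"""


class StaticText(HTMLParser):
    """Collect text nodes, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style elements.
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def static_text(html):
    """Return the text found in the markup itself, one chunk per line."""
    parser = StaticText()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


print(static_text(PAGE))  # prints: Report
```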

Guillaume is right.

Though I'm not sure how you can solve this easily with Unix shell tools.

You need a web scraping engine that processes JavaScript.
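
One way to get such an engine from a script is to drive a headless browser.
The sketch below assumes a Chromium-based browser is installed and reachable
as `chromium` (the binary name varies between systems); its `--headless` and
`--dump-dom` options load the page, run its scripts, and print the resulting
DOM. The function names are hypothetical. Note that the output is still
markup, so it needs a text-extraction pass afterwards, and it does not by
itself drop elements hidden with `display:none`.

```python
import subprocess


def dump_dom_command(url, browser="chromium"):
    """Build the command line for a headless DOM dump.

    Assumption: `browser` names a Chromium-based binary on PATH.
    """
    return [browser, "--headless", "--dump-dom", url]


def rendered_dom(url, browser="chromium", timeout=60):
    """Return the page's DOM as HTML after the browser has run its scripts."""
    result = subprocess.run(
        dump_dom_command(url, browser),
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode output as str
        timeout=timeout,      # give up if the page never finishes loading
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

A call such as `rendered_dom("https://example.com/")` would then return the
serialized DOM as a string, ready to be passed to whatever text extractor
you prefer.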


 