From: Ben Bacarisse
Newsgroups: comp.unix.shell,comp.infosystems.www.misc
Subject: Re: Getting the text from a webpage (not the source)
References: <20120917104616.663@kylheku.com> <0.59371f0429c33221710a.20120920111041BST.878vc56m3i.fsf@bsb.me.uk> <86ipb93pzv.fsf@gray.siamics.net> <0.1c093b9924ec3142b9ae.20120920124109BST.87lig46hwq.fsf@bsb.me.uk> <8662785131.fsf@gray.siamics.net>
Date: Thu, 20 Sep 2012 14:23:49 +0100
Message-ID: <0.e38cfdc080eadff24b4a.20120920142349BST.87fw6c6d5m.fsf@bsb.me.uk>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (gnu/linux)
Organization: Zen Internet

Ivan Shmakov writes:
>>>>>> Ben Bacarisse writes:
>>>>>> Ivan Shmakov writes:
>>>>>> Ben Bacarisse writes:
>
> [...]
>
> >>> Does your system have html2text?  How close does
>
> >>> wget -O - | html2text
>
> >>> come to what you want?
>
> >> Neither html2text nor $ lynx -dump honors CSS (or JavaScript, for
> >> that matter), which is what (AIUI) the OP needs.
>
> > Possibly, yes.  That's why I said "how close does it come", but I was
> > responding to a message that seemed to be more specific.  It mentioned
> > only ignoring display:none which html2text does of course.  What
> > other CSS should be honoured seems to be up in the air.
>
> A quick web search [1] reveals that there're in fact two (at the
> least) versions of html2text, and while I haven't checked for
> that specifically, I'm pretty sure that the version currently in
> Debian [2] (which I was referring to) doesn't honor CSS.
>
> I don't know anything about the other version, though.

Yes, I knew about the two versions but since my reply was just a punt I
didn't think to mention it.  I should have, just in case the OP finds it
suitable.  The main differences seem to be that the Debian version can
do recoding but can't fetch the page via HTTP (which means that it can't
do the recoding based on the HTTP response!).

-- 
Ben.
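
To illustrate the last point about recoding based on the HTTP response,
here is a minimal sketch.  It is not taken from either html2text.  The
function names charset_of and recode_by_header are hypothetical, and the
ISO-8859-1 fallback is an assumption (it was the HTTP/1.1 default for
text/* under RFC 2616).  The sketch only shows that the charset has to
come from the Content-Type response header, which a tool that never sees
the HTTP response cannot know.

```shell
#!/bin/sh
# Hypothetical helper: print the charset parameter of a Content-Type
# header value in lower case, or nothing if there is no such parameter.
charset_of() {
    printf '%s\n' "$1" |
        tr '[:upper:]' '[:lower:]' |
        sed -n 's/.*charset *= *"\{0,1\}\([^";[:space:]]*\).*/\1/p'
}

# Hypothetical helper: recode the page on stdin to UTF-8, using the
# charset named in the Content-Type value passed as $1.
# Assumption: fall back to ISO-8859-1 when the header names no charset.
recode_by_header() {
    cs=$(charset_of "$1")
    iconv -f "${cs:-iso-8859-1}" -t UTF-8
}
```

The Content-Type value would have to be captured by whatever fetches the
page (wget prints the server's headers when given -S, for example) and
the recoded output could then be piped into html2text.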