How can I get a plain text directory with wget?

Todd

Feb 5, 2012, 10:18:08 PM

Hi All,

Not having much luck with the man page.

I use the following piece of code in a bash script to
give me a directory listing:

wget --quiet http://$FtpSite$FtpDir -O - > $Tmp 2>&1

The value pumped into $Tmp is a bunch of HTML code.
It is annoying to have to sift through it.

Is there a way to get the directory listing in plain text
without all the HTML garble-de-gook? I would like something
like what I would see with "ls".

Many thanks,
-T

Sam

Feb 5, 2012, 10:31:08 PM

You have to contact the administrator of the HTTP server you're trying to
download your directory listing from, and ask for the details of their web
server's configuration. wget sends a request for a document from the remote
web server. Aside from specifying the document's URL, wget has no control
over the contents of the data that it receives. Whatever the remote server
chooses to respond with, that's what wget will get. If the remote server
returns an HTML document, that's what wget will give you.

There's a small possibility that by overriding the HTTP 1.1
Accept: header in the request, using the --header option to wget, the remote
web server will accept the request to return text/plain content, rather than
text/html. You can try that, but if that doesn't work, there's nothing that
wget can do. Your only option would be to take the returned HTML document and
convert it to plain text yourself, using elinks or something similar.
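
The Accept: header idea above could be tried along these lines. This is only
a sketch: fetch_listing and the WGET override are names made up for
illustration, and many servers will ignore the header and send HTML anyway.

```shell
# fetch_listing: request a URL while asking the server for text/plain.
# The server is free to ignore the Accept header and return HTML.
# WGET may name another command (for example "echo") to preview the call.
fetch_listing() {
    "${WGET:-wget}" --quiet --header='Accept: text/plain' -O - "$1"
}
```

In the original script this would be called as
fetch_listing "http://$FtpSite$FtpDir" > "$Tmp", and the result checked to
see whether a given mirror honoured the request.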

Michael Black

Feb 5, 2012, 10:27:10 PM

On Sun, 5 Feb 2012, Todd wrote:

> Hi All,
>
> Not having much luck with the man page.
>
> I use the following piece of code in a bash script to
> give me a directory listing:
>
> wget --quiet http://$FtpSite$FtpDir -O - > $Tmp 2>&1
>
> The value pumped into $Tmp in a bunch of HTML code.
> It is annoying to have to sift through it.
>

Is it because you are actually getting an html page? Wouldn't the URL
start with ftp: to get an ftp directory?

wget is great for automated work, but it's not clear where in this you
feel it needs to be automated, not when there are other tools that may be
simpler if all you need is the directory.

> Is there a way to get the directory listing in plain text
> without all the HTML garble-de-gook? I would like some
> thing like I would see with "ls".
>

If you're getting html, other than fixing that in the first place, name
the resulting file something.html and then view it with a browser.

Michael

> Many thanks,
> -T
>

Todd

Feb 5, 2012, 10:41:02 PM

On 02/05/2012 07:27 PM, Michael Black wrote:
> On Sun, 5 Feb 2012, Todd wrote:
>
>> Hi All,
>>
>> Not having much luck with the man page.
>>
>> I use the following piece of code in a bash script to
>> give me a directory listing:
>>
>> wget --quiet http://$FtpSite$FtpDir -O - > $Tmp 2>&1
>>
>> The value pumped into $Tmp in a bunch of HTML code.
>> It is annoying to have to sift through it.
>>
> Is it because you are actually getting an html page?

Yes. "bunch of HTML code."

> Wouldn't the URL
> start with ftp: to get an ftp directory?

I did not give you the contents of the variable. Usually
it is for an ftp site, but very often I feed it an http site.

releases.mozilla.org is particularly a pain-in-the-ass about
not always supporting ftp calls. I complain to them, they fix
it, and a couple of weeks later they are back to their evil ways.
So I call them with http and it always works.

The code is found in a bash function. Lots of other stuff
going on around it. The entire script is several hundreds
of lines long.

> wget is great for automated work, but it's not clear where in this you
> feel it needs to be automated, not when there are other tools that may
> be simpler if all you need is the directory.

Okay, you got me curious. Which other ones?

>
>> Is there a way to get the directory listing in plain text
>> without all the HTML garble-de-gook? I would like some
>> thing like I would see with "ls".
>>
> If you're getting html, other than fixing that in the first place, name
> the resulting file something.html and then view it with a browser.

Not too practical when this is an automated script.

Thank you for the tips,
-T

Todd

Feb 5, 2012, 10:44:39 PM

On 02/05/2012 07:31 PM, Sam wrote:
> There's a small possibility that by specifying overriding the HTTP 1.1
> Accept: header in the request, using the --header option to wget, the
> remote web server will accept the request to return text/plain content,
> rather than text/html. You can try that, but if that doesn't work,
> there's nothing that wget can do.

I am calling several mirror servers, so one may work and the other not.

> Your only option would be to take the
> return HTML document and convert to plain text yourself, using elinks,
> or something similar.

elinks sounds like a text based web browser. Can it be used
from a script to convert html to text?

Is there some other utility to convert html to text that
can be used from a script?

Many thanks,
-T

Bit Twister

Feb 5, 2012, 11:10:26 PM

On Sun, 05 Feb 2012 19:18:08 -0800, Todd wrote:
> Hi All,
>
> Not having much luck with the man page.
>
> I use the following piece of code in a bash script to
> give me a directory listing:
>
> wget --quiet http://$FtpSite$FtpDir -O - > $Tmp 2>&1
>
> The value pumped into $Tmp in a bunch of HTML code.
> It is annoying to have to sift through it.
>
> Is there a way to get the directory listing in plain text
> without all the HTML garble-de-gook?

Without having a url to test with, I'll suggest something like
html2text -nobs -style pretty -width 132 $FtpSite$FtpDir > $Tmp

Todd

Feb 6, 2012, 2:03:18 AM

Hi Bit,

Love it! Thank you! I did not even know html2text existed.
"apropos html" did not catch it as it was not installed
(is now).

A few bumps in the road:

$ html2text -nobs -style pretty -width 132 \
http://releases.mozilla.org/pub/mozilla.org/firefox/releases/

HTTP/1.1 505 HTTP Version Not Supported Connection: close
Date: Mon, 06 Feb 2012 06:26:41 GMT Server: Cherokee/1.0.1
(UNIX) Content-Length: 314 Content-Type: text/html Cache-
Control: no-cache Pragma: no-cache

505 HTTP Version Not Supported
----------------------------------------------------------
Cherokee web server 1.0.1 (UNIX), Port 80

So back to wget and pipe it to html2text:

wget --quiet \
http://releases.mozilla.org/pub/mozilla.org/firefox/releases/ \
-O - | html2text -nobs -style pretty -width 132 | grep -i DIR

[DIR] Parent_Directory -
[DIR] latest 31-Jan-2012 21:58 link
[DIR] latest-3.6 01-Feb-2012 05:38 link
[DIR] latest-10.0 31-Jan-2012 21:58 link
[DIR] 2.0.0.20 18-Dec-2008 09:26 -
[DIR] 3.0.19-real-real 16-Mar-2010 02:52 -
[DIR] 3.6.24 03-Nov-2011 17:55 -
[DIR] 3.6.25 13-Dec-2011 14:04 -
[DIR] 3.6.26 29-Jan-2012 10:54 -
[DIR] 8.0.1 21-Nov-2011 06:50 -
[DIR] 9.0.1 21-Dec-2011 15:57 -
[DIR] 10.0 31-Jan-2012 17:57 -

And repeating (gets me a different mirror):
10.0/ 2012-Jan-31 16:57:40 - Directory
2.0.0.20/ 2008-Dec-18 08:26:59 - Directory
3.0.19-real-real/ 2010-Mar-16 01:52:50 - Directory
3.6.24/ 2011-Nov-03 16:55:01 - Directory
3.6.25/ 2011-Dec-13 13:04:11 - Directory
3.6.26/ 2012-Jan-29 09:54:13 - Directory
8.0.1/ 2011-Nov-21 05:50:19 - Directory
9.0.1/ 2011-Dec-21 14:57:25 - Directory
latest/ 2012-Jan-31 16:57:40 - Directory
latest-10.0/ 2012-Jan-31 16:57:40 - Directory
latest-3.6/ 2012-Jan-29 09:54:13 - Directory

Oh Crap! A third run freezes (yet another mirror).

And the directory I want is either $1 or $2 depending
on the mirror. This would explain some of the problems
I have been having. So I am going to really have to
think about how to carve out the directory name.

AAAHHHHH!

Okay, promise not to laugh:

$ wget --quiet \
http://releases.mozilla.org/pub/mozilla.org/firefox/releases/ \
-O - | \
html2text -nobs -style pretty -width 132 | \
grep -i DIR | \
sed -e "s/\[DIR\]//" | \
awk '{print $1}' | \
sed -e "s/\///"

Did I hear you just laugh?!?!

Thank you for the help!
-T
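
Both mirror layouts shown above can be handled in one awk pass instead of
guessing whether the name is $1 or $2. A rough sketch, assuming the
html2text output looks like the two samples in this post. dir_names is a
hypothetical helper name, and skipping the Parent_Directory row is a choice
made here, not something the original pipeline does.

```shell
# dir_names: read html2text output on stdin, print one directory name per line.
# Assumes rows look like one of the two sample layouts in this post.
dir_names() {
    awk '
        # Bracketed layout: "[DIR] name date time ..."
        $1 == "[DIR]" {
            if ($2 != "Parent_Directory") print $2
            next
        }
        # Other mirror: "name/ date time - Directory"
        $NF == "Directory" {
            sub(/\/$/, "", $1)    # drop the trailing slash
            print $1
        }
    '
}

# Demo on two rows copied from the listings above.
printf '%s\n' \
    '[DIR] latest-10.0 31-Jan-2012 21:58 link' \
    '3.6.26/ 2012-Jan-29 09:54:13 - Directory' | dir_names
```

In the script, dir_names would take the place of the grep/sed/awk/sed tail:
wget ... -O - | html2text -nobs -style pretty -width 132 | dir_names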

Anonymous Remailer (austria)

Feb 6, 2012, 6:15:06 AM

Todd <T...@invalid.invalid> [T]:
T> I use the following piece of code in a bash script to give me a
T> directory listing:
T> wget --quiet http://$FtpSite$FtpDir -O - > $Tmp 2>&1
T> Is there a way to get the directory listing in plain text without
T> all the HTML garble-de-gook?

Have you ruled out FTP? Some ftp clients can be scripted.
Even
wget --quiet ftp://$FtpSite$FtpDir -O - > $Tmp 2>&1
might work.

If FTP is not an option, use awk/sed/perl to get rid of those
HTML tags. The following might be a good starting point

#!/bin/sed -f
#strip HTML code
s/<[^>]*>//g

The previous code assumes that tags don't extend across line
boundaries. If that's not true, you should replace <EOL> characters
(usually '\n') with, say, spaces:

<$Tmp tr '\n' ' ' | <sed_script>
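
Put together, the two steps above (join the lines, then delete the tags)
might look like the sketch below. strip_tags is a made-up name; the sed
expression is the one from the script above. Note that the whole page ends
up on one long line, so the result would still need splitting afterwards.

```shell
# strip_tags: read HTML on stdin, write the text between the tags to stdout.
strip_tags() {
    # Join all lines first so a tag split across lines is seen whole,
    # then delete everything between "<" and ">".
    tr '\n' ' ' | sed -e 's/<[^>]*>//g'
}

# Demo: the <a ...> tag is split across two input lines.
printf '<a\nhref="10.0/">10.0/</a>\n' | strip_tags
echo
```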

And something slightly off topic. Writing debugging output
(stderr) to the same file as your normal output (stdout) - that's
what "2>&1" does - will only make parsing more difficult.
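
To illustrate the point about 2>&1: sending stderr to its own file keeps the
listing clean. In the sketch below, fake_fetch is a stand-in for the wget
call so the example runs without a network, and Err is an assumed variable
name that does not appear in the original script.

```shell
# fake_fetch stands in for wget: page on stdout, diagnostics on stderr.
fake_fetch() {
    echo '<html>listing</html>'
    echo 'fetch: simulated warning' >&2
}

Tmp=$(mktemp)
Err=$(mktemp)

# The page goes to $Tmp and the diagnostics to $Err, so $Tmp holds only the page.
fake_fetch > "$Tmp" 2> "$Err"

echo "listing: $(cat "$Tmp")"
echo "errors:  $(cat "$Err")"
rm -f "$Tmp" "$Err"
```

With the real command the same redirection would read:
wget --quiet "http://$FtpSite$FtpDir" -O - > "$Tmp" 2> "$Err"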

Sam

Feb 6, 2012, 6:51:37 AM

Todd writes:

>> Your only option would be to take the
>> return HTML document and convert to plain text yourself, using elinks,
>> or something similar.
>
> elinks sounds like a text based web browser. Can it be used
> from a script to convert html to text?

Yes. There aren't many options for reformatting the output: it basically emits
whatever it would ordinarily display, as text/plain content, so you'll have
to work with however elinks ends up formatting the HTML.
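
For scripting, elinks has a -dump option that prints the rendered page and
exits. A sketch, assuming elinks is installed: elinks_dump and the ELINKS
override are illustrative names, and the two -no-* switches (which suppress
link numbers and the trailing reference list) can be dropped if they are
not wanted.

```shell
# elinks_dump: render a URL or local HTML file as plain text on stdout.
# ELINKS can name a different binary (or "echo" to preview the command line).
elinks_dump() {
    "${ELINKS:-elinks}" -dump -no-numbering -no-references "$1"
}
```

Typical use would be elinks_dump "http://$FtpSite$FtpDir" > "$Tmp".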

Dan Espen

Feb 6, 2012, 8:33:39 AM

Todd <To...@invalid.invalid> writes:

> On 02/05/2012 07:27 PM, Michael Black wrote:
>> On Sun, 5 Feb 2012, Todd wrote:
>>
>>> Hi All,
>>>
>>> Not having much luck with the man page.
>>>
>>> I use the following piece of code in a bash script to
>>> give me a directory listing:
>>>
>>> wget --quiet http://$FtpSite$FtpDir -O - > $Tmp 2>&1
>>>
>>> The value pumped into $Tmp in a bunch of HTML code.
>>> It is annoying to have to sift through it.
>>>
>> Is it because you are actually getting an html page?
>
> Yes. "bunch of HTML code."
>
>> Wouldn't the URL
>> start with ftp: to get an ftp directory?
>
> I did not give your the contents of the variable. Usually
> it is for an ftp site, but very often I feed it an http site.
>
> releases.mozilla.org is particularly a pain-in-the-ass about
> not always supporting ftp calls. I complain to them, they fix

There's probably a better API for monitoring Mozilla for changes.
Is this closer to what you want:

lynx --dump ftp://releases.mozilla.org/pub

--
Dan Espen

Chick Tower

Feb 6, 2012, 1:30:12 PM

I don't know for sure that elinks can do it, but lynx and links can
translate HTML to formatted text. See the -dump option in the man
pages. w3m might be able to do it, too, but I don't have it or elinks
installed, so I can't verify that.
--
Chick Tower

For e-mail: colm DOT sent DOT towerboy AT xoxy DOT net

root

Feb 6, 2012, 1:43:42 PM

Comparing lynx to w3m:
w3m does the better job with the -dump option. lynx adds all the
html links in the document as an appendix to the text: useful if
you want that.

Allodoxaphobia

Feb 6, 2012, 3:04:09 PM

On Sun, 05 Feb 2012 19:18:08 -0800, Todd wrote:

lynx -dump -nolist "http://$FtpSite$FtpDir" > dir.list.ing

HTH
Jonesy

Michael Black

Feb 6, 2012, 5:39:11 PM

I think that's assuming that lynx's "user mode" is set to novice (maybe
intermediate does it too). Advanced doesn't put the list of links at the
end of the page.

Michael

Chris F.A. Johnson

Feb 7, 2012, 4:37:04 PM

On 2012-02-06, root wrote:
...

> lynx adds all the html links in the document as an appendix to the
> text:useful if you want that.

If you don't want it, use the -nolist option.

--
Chris F.A. Johnson, <http://cfajohnson.com>
Author:
Pro Bash Programming: Scripting the GNU/Linux Shell (2009, Apress)
Shell Scripting Recipes: A Problem-Solution Approach (2005, Apress)
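
Pulling the browser suggestions in this thread together, a script could use
whichever of w3m, lynx or links happens to be installed. A sketch only:
dump_html is a hypothetical name and the order of preference is arbitrary
(w3m first, following the comparison earlier in the thread).

```shell
# dump_html: render a URL or local HTML file as text with the first
# text browser found on PATH. Returns 1 if none is available.
dump_html() {
    if command -v w3m >/dev/null 2>&1; then
        w3m -dump "$1"
    elif command -v lynx >/dev/null 2>&1; then
        # -nolist suppresses the appendix of links at the end of the dump
        lynx -dump -nolist "$1"
    elif command -v links >/dev/null 2>&1; then
        links -dump "$1"
    else
        echo "dump_html: no text browser found" >&2
        return 1
    fi
}
```

Typical use would be dump_html "http://$FtpSite$FtpDir" > "$Tmp", with the
exit status checked so the script can fall back to something else.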