Tnx.
Perl is a big hammer for such a small nail.
How about just typing this at your commandline:
find . -name "*.htm"
(that recurses down from your current directory. cd to \ if you want to
find ALL such files anywhere they may exist. But you probably want to
start at your Apache DocumentRoot).
--
David Filmer (http://DavidFilmer.com)
The best way to get a good answer is to ask a good question.
There may not be any files on a web server (all pages could be generated
by the web server software directly in memory) or an infinite number of
files (dynamically created based on user input). Further, if you do not
have direct access to the server but rather want to create this list for
a remote server, you are limited by the options of the protocol the web
server system supports (usually only HTTP for the general public). You'd
have to write a crawler, or use an existing one, that visits a page and
follows all the links on it, recursively, until "all" pages have been
visited. This is a rather limited approach as some pages might only be
accessible via links from third party web pages, so you would have to
index "the whole web" for a usable list.
--
Björn Höhrmann · mailto:bjo...@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
locate -r '\.html$' > htmlfiles.txt
--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
I'd use File::Find to loop through all files. Then for each file found
you could use one of the tools from http://validator.w3.org to check if
the file contains valid HTML code. You can also download the validator
code and install it locally to avoid calling their service a gazillion
times.
jue
Or, to find .htm or .html:
$ find . | grep -P 'html?$'
Or also .shtml and .pshtml:
$ find . | grep -P '[sp]?html?$'
Or to also find .xml
$ find . | grep -P '([sp]?html?|xml)$'
You get the idea. Also, grep with the -P arg uses a Perl style regex :-)
--
szr
Pero wrote:
> I want to write a search script in Perl.
> How do I make a list of all htm files on a Linux - Apache web server?
> $ find . | grep -P 'html?$'
That is quite wasteful, even if the current directory doesn't contain
millions of subdirectories and files.
And it would erroneously return ./test_html and such.
$ find . -type f \( -name "*.htm" -o -name "*.html" \)
$ find . -type f -regex ".*\.html?"
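To see the difference on a small scale, here is a rough sketch. The scratch tree and its file names are invented purely for illustration, and grep -E is used instead of -P only because the pattern does not need Perl syntax.

```shell
# Build a throwaway tree (all names are hypothetical, for illustration only).
tmp=$(mktemp -d)
mkdir -p "$tmp/sub" "$tmp/dir.html"
touch "$tmp/index.html" "$tmp/sub/page.htm" "$tmp/test_html" "$tmp/notes.txt"

# Unanchored pattern on the full listing: also catches test_html
# and the directory named dir.html.
loose=$(cd "$tmp" && find . | grep -E 'html?$' | sort)

# Filtering inside find: regular files only, and the extension must
# follow a literal dot.
strict=$(cd "$tmp" && find . -type f \( -name '*.htm' -o -name '*.html' \) | sort)

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
rm -rf "$tmp"
```

The parentheses matter: without them, -type f binds only to the first -name test, so a directory ending in .html would slip through.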
--
Affijn, Ruud
"Gewoon is een tijger."
Ah, yes, I forgot the *. in my examples. And I forgot you could use
regex with find.
--
szr
Aside from forgetting the *. which should have been at the beginning of
my patterns, is it really more wasteful? Does find not have to also check
each file it comes across too? Or is it just the over of piping the
final output from find over to grep? Other than that I don't see why it
would be more wasteful. On both my dual-core Linux system and
an old P2 400 also running Linux, I see no difference in speed, even on
a large sprawling directory. find does its thing, grep prunes its
results.
--
szr
Actually you can cd to "\", which takes you to the root of the current
drive you are in. If you want a true Unix style root have a look at
cygwin.
--
szr
They're not regular expressions: they're shell glob patterns.
--
Glenn Jackman
Write a wise saying and your name will live forever. -- Anonymous
I know that. I didn't mean it as a regex. The *.htm is anything, ending
with .htm
It is nice, though, that one can use just -regex when using find :-)
--
szr
Yes, absolutely.
>Does find not have to also check
>each file it comes across too?
Certainly. But you're piping *all* of them to grep, thus making both find
*and* grep process all of them.
>Or is it just the over of piping the
>final output from find over to grep?
That, too.
>Other then that I don't see why it
>would be more wasteful?
Because it:
a) creates, opens, and closes a pipe that is not necessary
b) spawns an additional process (grep) that is not necessary
c) ships *every* filename across that unnecessary pipe to that unnecessary
process to be filtered
.. when you could instead simply filter the filenames at the source, as
they're generated by find.
>On my both my Dual core Linux system as well as
>an old P2 400 also running Linux, I see no difference in speed, even on
>a large sprawling directory.
That's because
a) you're on a single-user machine, and
b) you're not examining a large enough directory to notice the difference.
Try that in a multi-user environment with typical production directory trees,
and the difference will become visible.
> find does it's thing, grep prunes it's results.
Pointless. find can both find *and* prune.
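As a rough illustration of point (c), the following sketch counts how many paths would be shipped through the pipe versus how many actually match. The scratch tree is made up for the example; a real server tree would be far larger, which is where the gap starts to cost something.

```shell
# Throwaway tree: 5 directories with 3 files each (hypothetical names).
tmp=$(mktemp -d)
for i in 1 2 3 4 5; do
    mkdir -p "$tmp/d$i"
    touch "$tmp/d$i/a.txt" "$tmp/d$i/b.log" "$tmp/d$i/c.html"
done

# Everything an unfiltered find prints has to cross the pipe to grep ...
piped=$(find "$tmp" | grep -c .)
# ... whereas filtering at the source emits only the matches.
matched=$(find "$tmp" -type f -name '*.html' | grep -c .)

echo "paths an unfiltered find would emit: $piped"
echo "paths that actually match:           $matched"
rm -rf "$tmp"
```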
Yep.
>> Or is it just the over of piping the
>> final output from find over to grep?
s/over/overhead/
> That, too.
>
>> Other then that I don't see why it
>> would be more wasteful?
>
> Because it:
> a) creates, opens, and closes a pipe that is not necessary
> b) spawns an additional process (grep) that is not necessary
> c) ships *every* filename across that unnecessary pipe to that
> unnecessary process to be filtered
> .. when you could instead simply filter the filenames at the source,
> as they're generated by find.
>
>> On my both my Dual core Linux system as well as
>> an old P2 400 also running Linux, I see no difference in speed, even
>> on a large sprawling directory.
>
> That's because
> a) you're on a single-user machine, and
> b) you're not examining a large enough directory to notice the
> difference.
> Try that in a multi-user environment with typical production
> directory trees, and the difference will become visible.
I logged into one of the large servers that I manage and ran the same
test, and found there to be a difference, especially when running it
using the system root (/) as the starting point. It is indeed better to
go the efficient route.
>> find does it's thing, grep prunes it's results.
>
> Pointless. find can both find *and* prune.
True. Wonderful, -regex, is.
--
szr
hrunkner:~/tmp 12:05 122% cd "\\"
cd: no such file or directory: \
hrunkner:~/tmp 12:05 123% mkdir \\
hrunkner:~/tmp 12:06 124% cd \\
hrunkner:~/tmp/\ 12:06 125%
Yes, after creating a directory named "\", I can cd to it.
> which takes you to the root of the current
> drive you are in.
There is no "current drive" on Unix.
> If you want a true Unix style root have a look at
> cygwin.
ITYM: "if you want a find command under Windows, have a look at cygwin."
On unix you already have a Unix style root, and can't use cygwin anyway.
hp
> find does it's thing, grep prunes it's results.
Be very careful with that approach, it can easily get you fired.
On a heavily loaded production server, not only make find do the
pruning itself, but nice it too.
Just accept that a wide find can take tons of minutes. When you need a
wide find, you shouldn't be in a hurry.
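A minimal sketch of what that might look like. DOCROOT and the output file are placeholders, not anything from this thread; substitute the tree you actually want to scan. Note that nice only lowers CPU scheduling priority, so the disk still gets hit just as hard.

```shell
# DOCROOT is a placeholder; it defaults to the current directory here.
DOCROOT=${DOCROOT:-.}
# Scratch file for the results (hypothetical; use any path you like).
out=$(mktemp)

# nice -n 19 asks for the lowest CPU priority; find does the filtering
# itself, so no second process or pipe is involved.
nice -n 19 find "$DOCROOT" -type f \( -name '*.htm' -o -name '*.html' \) > "$out"

echo "list written to $out"
```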
Hmm, for some reason I cannot fathom I must have thought DeFaria was
talking about a Windows system. My mistake.
--
szr