How to make list of all htm file...

Pero

Jun 21, 2008, 3:17:43 PM

I want to write a search script in Perl.
How do I make a list of all .htm files on a Linux/Apache web server?

Thanks.

David Filmer

Jun 21, 2008, 3:31:02 PM

Pero wrote:
> I want to write search script in perl.
> How to make list of all htm file on Linux - Apache web server?

Perl is a big hammer for such a small nail.

How about just typing this at your command line:

find . -name "*.htm"

(that recurses down from your current directory. cd to \ if you want to
find ALL such files anywhere they may exist. But you probably want to
start at your Apache DocumentRoot).
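
A minimal sketch of the DocumentRoot suggestion, assuming a POSIX shell. The function name list_htm_files and the default path /var/www/html are illustrative assumptions only; check the DocumentRoot directive in your own Apache configuration for the real location.

```shell
#!/bin/sh
# list_htm_files is a hypothetical helper, not part of any standard tool.
# Usage: list_htm_files [directory]
list_htm_files() {
    # /var/www/html is only an assumed default; pass your own DocumentRoot.
    docroot=${1:-/var/www/html}
    # -type f skips directories; -name takes a shell glob, not a regex.
    find "$docroot" -type f -name '*.htm'
}

# Example: write the list to a file for a later search script to read.
# list_htm_files /path/to/docroot > htmfiles.txt
```

Starting at the document root rather than the filesystem root keeps the search short and avoids listing files that Apache never serves.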

--
David Filmer (http://DavidFilmer.com)
The best way to get a good answer is to ask a good question.

Bjoern Hoehrmann

Jun 21, 2008, 3:31:10 PM

* Pero wrote in comp.lang.perl.misc:

>I want to write search script in perl.
>How to make list of all htm file on Linux - Apache web server?

There may not be any files on a web server (all pages could be generated
by the web server software directly in memory) or an infinite number of
files (dynamically created based on user input). Further, if you do not
have direct access to the server but rather want to create this list for
a remote server, you are limited by the options of the protocol the web
server system supports (usually only HTTP for the general public). You'd
have to write a crawler, or use an existing one, that visits a page and
follows all the links on it, recursively, until "all" pages have been
visited. This is a rather limited approach as some pages might only be
accessible via links from third party web pages, so you would have to
index "the whole web" for a usable list.
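
One step of such a crawler, pulling the link targets out of a page that has already been fetched, might be sketched in shell like this. The function name extract_links is made up for illustration, and the grep pattern only catches double-quoted href attributes; a real crawler would use a proper HTML parser and keep track of the pages it has already visited.

```shell
#!/bin/sh
# extract_links is a hypothetical helper: it reads HTML on stdin and prints
# the value of every double-quoted href attribute, one per line.
extract_links() {
    grep -o 'href="[^"]*"' | sed -e 's/^href="//' -e 's/"$//'
}

# Example with an inline page; a crawler would fetch each printed URL in
# turn and repeat until no unvisited links remain.
printf '%s\n' '<a href="a.htm">A</a> <a href="sub/b.htm">B</a>' | extract_links
```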
--
Björn Höhrmann · mailto:bjo...@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

Gunnar Hjalmarsson

Jun 21, 2008, 4:03:12 PM

Pero wrote:
> I want to write search script in perl.
> How to make list of all htm file on Linux - Apache web server?

locate -r '\.html$' > htmlfiles.txt

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl

Jürgen Exner

Jun 21, 2008, 4:20:22 PM

"Pero" <pe...@tupwerwt.ch> wrote:
>I want to write search script in perl.
>How to make list of all htm file on Linux - Apache web server?

I'd use File::Find to loop through all files. Then for each file found
you could use one of the tools from http://validator.w3.org to check if
the file contains valid HTML code. You can also download the validator
code and install it locally to avoid calling their service a gazillion
times.

jue

szr

Jun 21, 2008, 4:40:40 PM

David Filmer wrote:
> Pero wrote:
>> I want to write search script in perl.
>> How to make list of all htm file on Linux - Apache web server?
>
> Perl is a big hammer for such a small nail.
>
> How about just typing this at your commandline:
>
> find . -name "*.htm"
>
> (that recurses down from your current directory. cd to \ if you want
> to find ALL such files anywhere they may exist. But you probably
> want to start at your Apache DocumentRoot).

Or, to find .htm or .html:

$ find . | grep -P 'html?$'

Or also .shtml and .pshtml:

$ find . | grep -P '[sp]?html?$'

Or to also find .xml

$ find . | grep -P '([sp]?html?|xml)$'

You get the idea. Also, grep with the -P arg uses a Perl style regex :-)

--
szr

Andrew DeFaria

Jun 22, 2008, 12:35:58 AM

David Filmer wrote:
> Pero wrote:
>> I want to write search script in perl.
>> How to make list of all htm file on Linux - Apache web server?
>
> Perl is a big hammer for such a small nail.
>
> How about just typing this at your commandline:
>
> find . -name "*.htm"
>
> (that recurses down from your current directory. cd to \ if you want
> to find ALL such files anywhere they may exist. But you probably
> want to start at your Apache DocumentRoot).

"find" doesn't do this on Windows. On Unix there is no "\" to cd to. So which OS are you speaking of?

--
Andrew DeFaria
If all the world is a stage, where is the audience sitting?

Dr.Ruud

Jun 22, 2008, 8:53:05 AM

szr schreef:

> $ find . | grep -P 'html?$'

That is quite wasteful, even if the current directory doesn't contain
millions of subdirectories and files.

And it would erroneously return ./test_html and such.

$ find . -type f \( -name "*.htm" -o -name "*.html" \)

$ find . -type f -regex ".*\.html?"
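
The difference is easy to check in a scratch directory. The sketch below uses made-up file names and helper names, and grep -E rather than -P since not every grep build supports -P. It shows the unanchored pattern picking up test_html (and a directory named dir.html), while the find tests with an explicit dot do not. Note the \( \) grouping: without it, -type f applies only to the first -name test, because the implicit "and" binds tighter than -o.

```shell
#!/bin/sh
# Both helpers are hypothetical names used only for this comparison.

# Unanchored pattern: matches any path that merely ends in "htm" or "html".
match_with_grep() {
    find "$1" | grep -E 'html?$'
}

# Regular files only, with a literal dot before the extension. The \( \)
# grouping makes -type f apply to both -name tests.
match_with_find() {
    find "$1" -type f \( -name '*.htm' -o -name '*.html' \)
}

# Scratch tree with made-up names, removed again at the end.
tmp=$(mktemp -d)
mkdir "$tmp/dir.html"
touch "$tmp/a.htm" "$tmp/b.html" "$tmp/test_html"

echo "grep:"; match_with_grep "$tmp" | sort   # includes test_html and dir.html
echo "find:"; match_with_find "$tmp" | sort   # only a.htm and b.html

rm -rf "$tmp"
```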

--
Affijn, Ruud

"Gewoon is een tijger."

szr

Jun 22, 2008, 5:55:15 PM

Dr.Ruud wrote:
> szr schreef:
>
>> $ find . | grep -P 'html?$'
>
> That is quite wasteful, even if the current directory doesn't contain
> millions of subdirectories and files.
>
> And it would erroneously return ./test_html and such.
>
> $ find . -type f -name "*.htm" -or -name "*.html"
>
> $ find . -type f -regex ".*\.html?"

Ah, yes, I forgot the *. in my examples. And I forgot you could use
regex with find.

--
szr

szr

Jun 22, 2008, 5:59:41 PM

Dr.Ruud wrote:
> szr schreef:
>
>> $ find . | grep -P 'html?$'
>
> That is quite wasteful, even if the current directory doesn't contain
> millions of subdirectories and files.

Aside from forgetting *. which should have been at the beginning of my
patterns, is it really more wasteful? Does find not have to also check
each file it comes across too? Or is it just the over of piping the
final output from find over to grep? Other than that I don't see why it
would be more wasteful. On both my dual-core Linux system and an old
P2 400 also running Linux, I see no difference in speed, even on
a large sprawling directory. find does its thing, grep prunes its
results.

--
szr

szr

Jun 22, 2008, 6:02:09 PM

Andrew DeFaria wrote:
> David Filmer wrote:
>> Pero wrote:
>>
>>> I want to write search script in perl.
>>> How to make list of all htm file on Linux - Apache web server?
>>
>> Perl is a big hammer for such a small nail.
>>
>> How about just typing this at your commandline:
>>
>> find . -name "*.htm"
>>
>> (that recurses down from your current directory. cd to \ if you want
>> to find ALL such files anywhere they may exist. But you probably
>> want to start at your Apache DocumentRoot).
>
> "find" doesn't do this on Windows. On Unix there is no "\" to cd too.
> So which OS are you speaking of?

Actually you can cd to "\", which takes you to the root of the current
drive you are in. If you want a true Unix style root have a look at
cygwin.

--
szr

Glenn Jackman

Jun 23, 2008, 12:05:37 PM

They're not regular expressions: they're shell glob patterns.

--
Glenn Jackman
Write a wise saying and your name will live forever. -- Anonymous

szr

Jun 23, 2008, 2:13:51 PM

I know that. I didn't mean it as a regex. The *.htm matches anything
ending with .htm.

It is nice, though, that one can use just -regex when using find :-)

--
szr

Doug Miller

Jun 28, 2008, 10:11:23 PM

In article <g3mi0...@news4.newsguy.com>, "szr" <sz...@szromanMO.comVE> wrote:
>Dr.Ruud wrote:
>> szr schreef:
>>
>>> $ find . | grep -P 'html?$'
>>
>> That is quite wasteful, even if the current directory doesn't contain
>> millions of subdirectories and files.
>
>Aside form forgetting *. which should of been at the beginning of my
>patterns, is it really more wasteful?

Yes, absolutely.

>Does find not have to also check
>each file it comes across too?

Certainly. But you're piping *all* of them to grep, thus making both find
*and* grep process all of them.

>Or is it just the over of piping the
>final output from find over to grep?

That, too.

>Other then that I don't see why it
>would be more wasteful?

Because it:
a) creates, opens, and closes a pipe that is not necessary
b) spawns an additional process (grep) that is not necessary
c) ships *every* filename across that unnecessary pipe to that unnecessary
process to be filtered
.. when you could instead simply filter the filenames at the source, as
they're generated by find.
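
To make point (c) concrete, one can count how many names cross the pipe compared with how many survive the filter. This is only a sketch with a made-up scratch tree and hypothetical function names; on a real tree the first number is the total entry count, which is what grep would have to read.

```shell
#!/bin/sh
# names_piped: how many lines an unfiltered find would send down a pipe.
names_piped() {
    find "$1" | wc -l | tr -d ' '
}

# names_kept: how many lines find emits when it does the filtering itself.
names_kept() {
    find "$1" -type f -name '*.htm' | wc -l | tr -d ' '
}

# Scratch tree: 3 .htm files among 50 other files (made-up numbers).
tmp=$(mktemp -d)
i=1
while [ "$i" -le 50 ]; do
    touch "$tmp/file$i.txt"
    i=$((i + 1))
done
touch "$tmp/a.htm" "$tmp/b.htm" "$tmp/c.htm"

echo "piped to grep: $(names_piped "$tmp")"   # 54: 53 files plus the directory itself
echo "kept by find:  $(names_kept "$tmp")"    # 3

rm -rf "$tmp"
```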

>On my both my Dual core Linux system as well as
>an old P2 400 also running Linux, I see no difference in speed, even on
>a large sprawling directory.

That's because
a) you're on a single-user machine, and
b) you're not examining a large enough directory to notice the difference.
Try that in a multi-user environment with typical production directory trees,
and the difference will become visible.

> find does it's thing, grep prunes it's results.

Pointless. find can both find *and* prune.

szr

Jun 29, 2008, 12:46:35 AM

Doug Miller wrote:
> In article <g3mi0...@news4.newsguy.com>, "szr"
> <sz...@szromanMO.comVE> wrote:
>> Dr.Ruud wrote:
>>> szr schreef:
>>>
>>>> $ find . | grep -P 'html?$'
>>>
>>> That is quite wasteful, even if the current directory doesn't
>>> contain millions of subdirectories and files.
>>
>> Aside form forgetting *. which should of been at the beginning of my
>> patterns, is it really more wasteful?
>
> Yes, absolutely.
>
>> Does find not have to also check
>> each file it comes across too?
>
> Certainly. But you're piping *all* of them to grep, thus making both
> find *and* grep process all of them.

Yep.

>> Or is it just the over of piping the
>> final output from find over to grep?

s/over/overhead/

> That, too.
>
>> Other then that I don't see why it
>> would be more wasteful?
>
> Because it:
> a) creates, opens, and closes a pipe that is not necessary
> b) spawns an additional process (grep) that is not necessary
> c) ships *every* filename across that unnecessary pipe to that
> unnecessary process to be filtered
> .. when you could instead simply filter the filenames at the source,
> as
> they're generated by find.
>
>> On my both my Dual core Linux system as well as
>> an old P2 400 also running Linux, I see no difference in speed, even
>> on a large sprawling directory.
>
> That's because
> a) you're on a single-user machine, and
> b) you're not examining a large enough directory to notice the
> difference.
> Try that in a multi-user environment with typical production
> directory trees, and the difference will become visible.

I logged into one of the large servers that I manage and ran the same
test, and found there to be a difference, especially when running it
using the system root (/) as the starting point. It is indeed better to
go the efficient route.

>> find does it's thing, grep prunes it's results.
>
> Pointless. find can both find *and* prune.

True. Wonderful, -regex, is.

--
szr

Peter J. Holzer

Jun 29, 2008, 6:10:58 AM

On 2008-06-22 22:02, szr <sz...@szromanMO.comVE> wrote:
> Andrew DeFaria wrote:
>> David Filmer wrote:
>>> Pero wrote:
>>>
>>>> I want to write search script in perl.
>>>> How to make list of all htm file on Linux - Apache web server?

                                         ^^^^^

>>>
>>> Perl is a big hammer for such a small nail.
>>>
>>> How about just typing this at your commandline:
>>>
>>> find . -name "*.htm"
>>>
>>> (that recurses down from your current directory. cd to \ if you want
>>> to find ALL such files anywhere they may exist. But you probably
>>> want to start at your Apache DocumentRoot).
>>
>> "find" doesn't do this on Windows. On Unix there is no "\" to cd too.

                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

>> So which OS are you speaking of?
>
> Actually you can cd to "\",

hrunkner:~/tmp 12:05 122% cd "\\"
cd: no such file or directory: \
hrunkner:~/tmp 12:05 123% mkdir \\
hrunkner:~/tmp 12:06 124% cd \\
hrunkner:~/tmp/\ 12:06 125%

Yes, after creating a directory named "\", I can cd to it.

> which takes you to the root of the current
> drive you are in.

There is no "current drive" on Unix.

> If you want a true Unix style root have a look at
> cygwin.

ITYM: "if you want a find command under Windows, have a look at cygwin."

On Unix you already have a Unix style root, and can't use cygwin anyway.

hp

Dr.Ruud

Jun 29, 2008, 7:57:46 AM

szr schreef:

> find does it's thing, grep prunes it's results.

Be very careful with that approach, it can easily get you fired.

On a heavily loaded production server, not only make find do the
pruning itself, but run it under nice too.

Just accept that a wide find can take tons of minutes. When you need a
wide find, you shouldn't be in a hurry.
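
For example, the search can be started at the lowest CPU priority with the standard nice utility. This is only a sketch: polite_find is a made-up wrapper name, and nice affects CPU scheduling only, not disk I/O.

```shell
#!/bin/sh
# polite_find is a hypothetical wrapper: run find at the lowest scheduling
# priority (niceness 19) and let find itself do the filtering.
polite_find() {
    nice -n 19 find "$1" -type f \( -name '*.htm' -o -name '*.html' \)
}

# Example: polite_find /path/to/docroot > htmlfiles.txt
```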

szr

Jun 29, 2008, 4:25:32 PM

Peter J. Holzer wrote:
> On 2008-06-22 22:02, szr <sz...@szromanMO.comVE> wrote:
>> Andrew DeFaria wrote:
>>> David Filmer wrote:
>>>> Pero wrote:
>>>>
>>>>> I want to write search script in perl.
>>>>> How to make list of all htm file on Linux - Apache web server?
> ^^^^^
>>>>
>>>> Perl is a big hammer for such a small nail.
>>>>
>>>> How about just typing this at your commandline:
>>>>
>>>> find . -name "*.htm"
>>>>
>>>> (that recurses down from your current directory. cd to \ if you
>>>> want to find ALL such files anywhere they may exist. But you
>>>> probably want to start at your Apache DocumentRoot).
>>>
>>> "find" doesn't do this on Windows. On Unix there is no "\" to cd
>>> too.
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>> So which OS are you speaking of?
>>
>> Actually you can cd to "\",
>
> hrunkner:~/tmp 12:05 122% cd "\\"
> cd: no such file or directory: \
> hrunkner:~/tmp 12:05 123% mkdir \\
> hrunkner:~/tmp 12:06 124% cd \\
> hrunkner:~/tmp/\ 12:06 125%

Hmm, for some reason I cannot fathom, I must have thought DeFaria was
talking about a Windows system. My mistake.

--
szr