I've been using wget in mirroring mode, but it keeps fetching URLs
like
http://abc.com/def/ghi.html
http://abc.com//def/ghi.html
http://abc.com///def/ghi.html
http://abc.com////def/ghi.html
.
.
.
Now, when I run linkchecker on the same site, it does not run into this
problem but cleanly checks every page once and exits. Whatever
parameters I try with wget, however, I always run into the problem that
it keeps fetching the file, finds it to be no newer than what it had
before, and then at some point tries it again with one more forward
slash. This creates two problems.
1. I'm never sure if I got a complete mirror when I interrupt at some
point as I don't know where the whole thing starts and how.
2. My server will go down after a day or two because it writes massive
logs a lot faster than they are rotated away.
I've tried to disallow URLs as above from the server side before by
just redirecting //abc/def.html to /abc/def.html. Now, that works fine
for URLs like /abc//def.html but not for the ones that have the extra
forward slashes at the front because these still get logged by Apache
as single forward slashes.
What I need now is one of the following:
1. (preferred) A way of figuring out how my site is broken so that I
can fix it. Any suggestions how to catch that error when it first
occurs?
2. Different software to mirror the site which, like linkchecker, does
not run into the infinite loop. Unfortunately, I can't get linkchecker
to save what it downloaded without source modifications. Is anybody
aware of some piece of software I could use for this?
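
For option 1, one way to catch the error where it first occurs is to
scan each page's HTML for links whose path contains a repeated slash and
note which page emitted them. The following is only a minimal sketch
using the Python standard library; the names LinkCollector,
has_repeated_slash and suspect_links are made up for illustration, and
it assumes the bad URLs come from href/src attributes in the page
source rather than from redirects.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects the raw values of href and src attributes."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def has_repeated_slash(url):
    # Only the path is inspected, so the "//" after "http:" does not count.
    return "//" in urlsplit(url).path


def suspect_links(page_url, html):
    """Return (raw, resolved) pairs for links with a repeated slash."""
    collector = LinkCollector()
    collector.feed(html)
    found = []
    for raw in collector.links:
        resolved = urljoin(page_url, raw)
        # Check both forms, since resolving may tidy up a relative link.
        if has_repeated_slash(raw) or has_repeated_slash(resolved):
            found.append((raw, resolved))
    return found


if __name__ == "__main__":
    sample = ('<a href="/def/ghi.html">ok</a>'
              '<a href="http://abc.com//def/ghi.html">suspect</a>')
    for raw, resolved in suspect_links("http://abc.com/", sample):
        print("suspect link:", raw, "->", resolved)
```

Running this over the pages wget has already saved (passing each page's
original URL as page_url) should point to the template or CMS page that
generates the first doubled slash.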
Cheers,
Christian
>
>
> On Sun 28/09/08 1:29 AM , Christian Lerrahn lu...@penpal4u.net sent:
> Chris, I don't know how you are using wget. I grabbed this from a
> post of mine on another site's forums, so ignore the URL in this
> example, but it may help you work it out, especially "recursion
> depth".
>
>
> wget -r -k -l0 www.mag.mypclinuxos.com/html/112006/links.html
>
> Where:-
> -r = recursive, -k = convert-links, -l = maximum recursion depth (inf
> or 0 for infinite)
>
>
>
> ADDED
> (1) the 0 (zero) can be changed to 1 or 2 or 3 etc. (as 0 = infinite)
> (2) type wget --help for more info on all the switches
I last used
wget --mirror --no-parent --page-requisites --convert-links
--no-host-directories http://abc.com/
However, I've tried other options before and nothing seemed to fix the
problem. I can't really limit the recursion depth sensibly because
I'm unsure how many levels there would be (part of the site is in a CMS,
another part isn't). If nothing else works, I'll try to figure out the
number of levels. But I'm still unsure whether that would actually help.
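
A depth limit would only cap the loop, not remove it. A crawler avoids
the loop entirely if it treats URLs that differ only in repeated slashes
as the same page. Below is a rough sketch of that idea in Python; I do
not know whether linkchecker works this way internally, and the names
canonical_url and unique_pages are invented for this example.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url):
    """Collapse runs of slashes in the path and drop the fragment."""
    parts = urlsplit(url)
    path = re.sub(r"/{2,}", "/", parts.path) or "/"
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), path, parts.query, "")
    )


def unique_pages(urls):
    """Return each distinct page once, in first-seen order."""
    seen = set()
    ordered = []
    for url in urls:
        key = canonical_url(url)
        if key not in seen:
            seen.add(key)
            ordered.append(key)
    return ordered


if __name__ == "__main__":
    looping = [
        "http://abc.com/def/ghi.html",
        "http://abc.com//def/ghi.html",
        "http://abc.com///def/ghi.html",
        "http://abc.com////def/ghi.html",
    ]
    print(unique_pages(looping))
```

With the example URLs from the first message, all four variants reduce
to the single page http://abc.com/def/ghi.html.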
Cheers,
Christian