http://web.archive.org/web/20030413185309/www.gizmology.net/lovecraft/
...suitable for burning to CD and keeping. (The original site has
gone by the wayside and I want to snag this archive before it goes
away too.)
However, I'm running into a problem: because of archive.org's odd
JavaScript modifications to the web pages, I'm unable to pull down
anything more than the index file, even with the recursion depth set
to infinite. I've had some advice involving pulling down the first page,
removing an archive.org-added header, and running the command again
using it as a base, but it didn't work. The advice is in the comments
at this LJ entry:
http://www.livejournal.com/community/linux/687075.html
I've also tried about three or four different Windows-based
sitegrabbers with no better results.
Can anyone offer any advice on how to mirror this site archive
properly (be it with wget or any other application)? I'd really
appreciate it.
--
Chris Meadows aka  | If this post helped or entertained you, please rate
Robotech_Master    | it at http://svcs.affero.net/rm.php?r=robotech
robo...@eyrie.org  |
                   | Homepage: http://www.eyrie.org/~robotech
It's probably "protected" by a robots.txt file to prevent search
engines from indexing the archive. Create a file called .wgetrc in
your home directory, put the line:
robots = off
in this file, and try your wget again.
Here's the command that I use:
wget --exclude-domains rail2000.org -e robots=off -nH --cut-dirs=2 \
  --base=http://web.archive.org/web/20010202020600/http://www.rail2000.org/ \
  -r -l 3 -N -k -p -R js -Gbase \
  http://web.archive.org/web/20010202020600/www.rail2000.org/
However, you then need to manually remove the JavaScript that
archive.org appends to the end of each downloaded file. I'm not sure
how to do this short of writing a Perl script to fix each of the files.
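A cleanup script along those lines could look something like the following. This is only a rough sketch in Python rather than Perl, and it rests on an assumption I haven't checked against every page: that the unwanted code is one or more <script> elements at the very end of each file, with nothing but whitespace after them. The names strip_trailing_scripts and clean_tree are made up for this example.

```python
import re
import sys
from pathlib import Path

# Assumption: the unwanted code is one or more <script>...</script> elements
# at the very end of the file, followed by nothing but whitespace.
# The (?!</script\b) guard keeps a match from running across an earlier,
# legitimate script element and swallowing the page body.
TRAILING_SCRIPTS = re.compile(
    r"(?:\s*<script\b[^>]*>(?:(?!</script\b).)*</script\s*>)+\s*\Z",
    re.IGNORECASE | re.DOTALL,
)


def strip_trailing_scripts(html: str) -> str:
    """Return html without any <script> elements trailing at its end."""
    return TRAILING_SCRIPTS.sub("\n", html)


def clean_tree(root: str) -> int:
    """Rewrite every .htm/.html file under root in place; return how many changed."""
    changed = 0
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix.lower() not in (".htm", ".html"):
            continue
        # latin-1 maps every byte to one character, so the round trip is lossless.
        original = path.read_bytes().decode("latin-1")
        cleaned = strip_trailing_scripts(original)
        if cleaned != original:
            path.write_bytes(cleaned.encode("latin-1"))
            changed += 1
    return changed


# Only touches files when a directory is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    print(clean_tree(sys.argv[1]), "file(s) cleaned")
```

Run it with the mirror directory as its only argument, and try it on a copy first, since it rewrites files in place. Scripts that sit elsewhere in a page (in the head, say) are left alone.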
I believe your command would look like this:
wget --exclude-domains gizmology.net -e robots=off -nH --cut-dirs=3 \
  --base=http://web.archive.org/web/20030413185309/www.gizmology.net/lovecraft/ \
  -r -l 4 -N -k -p -R js -Gbase \
  http://web.archive.org/web/20030413185309/www.gizmology.net/lovecraft/
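In case the change in --cut-dirs between the two commands looks arbitrary, here is a small Python sketch of how -nH and --cut-dirs together decide where a file lands locally. It is a simplified model of the documented behaviour (drop the host name, then drop the first N directory components) and ignores wget's filename escaping and every other option. local_path is a made-up helper, not part of wget.

```python
from urllib.parse import urlsplit


def local_path(url: str, cut_dirs: int) -> str:
    """Approximate the relative save path under -nH --cut-dirs=cut_dirs."""
    path = urlsplit(url).path.lstrip("/")
    # Everything before the last "/" is a directory; the rest is the file name
    # (empty when the URL ends in "/").
    *dirs, filename = path.split("/")
    # --cut-dirs only removes leading directory components, never the file name.
    kept = dirs[cut_dirs:]
    return "/".join(kept + [filename])


if __name__ == "__main__":
    print(local_path(
        "http://web.archive.org/web/20010202020600/www.rail2000.org/", 2))
    print(local_path(
        "http://web.archive.org/web/20030413185309/www.gizmology.net/lovecraft/", 3))
```

Under this model the first command keeps www.rail2000.org/ as the top-level directory, while the second cuts one component more and keeps only lovecraft/.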