
Need Help: wget/archive.org


Robotech_Master

Feb 21, 2004, 1:17:47 PM
I'm trying to use wget to pull down a mirror of this site archive...

http://web.archive.org/web/20030413185309/www.gizmology.net/lovecraft/

...suitable for burning to CD and keeping. (The original site has
gone by the wayside and I want to snag this archive before it goes
away too.)

However, I'm running into a problem: due to archive.org's odd
JavaScript modifications to the web pages, I'm unable to pull down
anything more than the index file, even with infinite recursion
depth. I've had some advice that involved pulling down the first
page, removing an archive.org-added header, and running the command
again using it as a base, but it didn't work. The advice is in the
comments on this LJ entry:

http://www.livejournal.com/community/linux/687075.html

I've also tried about three or four different Windows-based
sitegrabbers with no better results.

Can anyone offer any advice on how to mirror this site archive
properly (be it with wget or any other application)? I'd really
appreciate it.
--
Chris Meadows aka | If this post helped or entertained you, please rate
Robotech_Master | it at http://svcs.affero.net/rm.php?r=robotech
robo...@eyrie.org |
| Homepage: http://www.eyrie.org/~robotech

Lewin A.R.W. Edwards

Feb 21, 2004, 8:38:31 PM
> I'm trying to use wget to pull down a mirror of this site archive...

It's probably "protected" by a robots.txt file to prevent search
engines from indexing the archive. Create a file called .wgetrc in
your home directory, put the line:

robots = off

in this file, and try your wget again.
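
(For a one-off run you can also pass the same setting on the command
line with -e robots=off instead of editing .wgetrc. A rough sketch of
what that might look like; the -r/-l inf/-k/-p mirroring flags here are
just placeholders, so substitute whatever options you were already
running:

wget -e robots=off -r -l inf -k -p \
  http://web.archive.org/web/20030413185309/www.gizmology.net/lovecraft/
)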

John Tseng

Mar 23, 2004, 7:36:14 PM
la...@larwe.com (Lewin A.R.W. Edwards) wrote in message news:<608b6569.04022...@posting.google.com>...

> > I'm trying to use wget to pull down a mirror of this site archive...

Here's the command that I use:

wget --exclude-domains rail2000.org -e robots=off -nH --cut-dirs=2 \
  --base=http://web.archive.org/web/20010202020600/http://www.rail2000.org/ \
  -r -l 3 -N -k -p -R js -Gbase \
  http://web.archive.org/web/20010202020600/www.rail2000.org/

However, you need to manually remove each of the JavaScript entries
from the end of the files. I'm not sure how to do this short of
writing a Perl script to fix each of the files.
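
One rough sketch of that cleanup (the marker string is an assumption:
check one of the downloaded pages to see exactly what archive.org
appended, and adjust the pattern before running this over everything):

# Truncate each saved page at the first <SCRIPT language="Javascript">
# tag, on the assumption that this is where the archive.org-appended
# block begins; pages that legitimately contain such a tag would need a
# more specific marker.
find . -name '*.html' -exec perl -0777 -pi -e \
  's/<SCRIPT language="Javascript">.*\z//si' {} \;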


I believe your command would look like this:

wget --exclude-domains gizmology.net -e robots=off -nH --cut-dirs=3 \
  --base=http://web.archive.org/web/20030413185309/www.gizmology.net/lovecraft/ \
  -r -l 4 -N -k -p -R js -Gbase \
  http://web.archive.org/web/20030413185309/www.gizmology.net/lovecraft/
