Newsgroups: comp.lang.lisp
From: D Herring <dherr...@at.tentpost.dot.com>
Date: Thu, 18 Dec 2008 21:24:13 -0500
Local: Thurs, Dec 18 2008 9:24 pm
Subject: Re: Recovering the ALU wiki

D Herring wrote:
> WalterGR wrote:
>>>>>> The ALU is dead! Long live the Google!

>>>>> WalterGR:

>>>>> Uh... is this not a cause for alarm?  I don't see anything on alu.org
>>>>> mentioning it.  Anyone know if somebody's working on the problem?
> ...
>> I e-mailed Nick Levine directly.  He doesn't know if the data for
>> wiki.alu.org is still available / recoverable and says:

>> "...I'm doing what I can to rescue the wiki from this end. But it's
>> better to be safe than sorry. If I were you, I'd recover what you
>> can."

>> Anyone want to help me do this?  E-mail me (walte...@aol.com
>> preferred) or respond here.

> Hmmm.  Scrape Google?  I see that their cache of wiki.alu.org has
> already updated to show alu.org, but some of the other pages are still
> cached.  A query of "site:wiki.alu.org" returns 344 pages (347 if you
> select the "omitted results").  I'll take a stab at it, but the history
> and metadata cannot be retrieved this way...

So I started in on it, but the Goog identified what I was doing and
disabled my scraping.  Got ~100 files (Bob_Bechtel to
Switch_Date_2001) before they caught me...

""
We're sorry...
... but your query looks similar to automated requests from a computer
virus or spyware application. To protect our users, we can't process
your request right now.
""

You'd think a site whose business model hinged on scraping others
would be a little more scraper-friendly.

Now I can't use Google cache even manually.

Here's what I did:
- query "site:wiki.alu.org"
- save the 4 pages as wiki1.html, wiki2.html, ...
- run `./pull.sh wiki1`

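# pull.sh
# (the quoted '_EOF' keeps the shell from expanding $1, $page, etc.
# while writing the file)
cat <<'_EOF' > pull.sh
#!/bin/bash
# Fetch the Google-cached copy of every result on a saved results page.

page=${1:?usage: ./pull.sh wiki1} # basename of the saved page, e.g. wiki1

rm -rf "$page"
mkdir "$page"
# put each anchor on its own line, then keep only the "Cached" links
sed -e 's,<a ,\n<a ,g' "$page.html" | grep Cached > "$page/links.txt"
cd "$page" || exit 1
# strip the leading <a href=" and everything from "+site" onward
sed -e 's,<a href=",,' -e 's,\+site.*,,' links.txt > urls.txt
for f in $(<urls.txt)
do
     # name the output file after the wiki page
     name=$(echo "$f" | sed -e 's,.*:wiki.alu.org/,,')
     wget -U nuweb -O "$name" "$f"
done
_EOF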

When I try again later, I'll pass wget an extra "-w 10 --random-wait"
to try and stay below their radar.  Also change "nuweb" to some other
user-agent string (in case they remember).
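
One caveat on the "-w" idea: wget only waits between retrievals inside a
single invocation, so in a loop that starts wget once per URL the flag
would have no effect.  A rough sketch of one workaround, assuming bash:
sleep in the loop instead, for a random 0.5x to 1.5x of the base wait
(about what --random-wait does).  The names random_wait and fetch_all and
the user-agent string are made up for illustration and are not part of
the script above.

```shell
#!/bin/bash
# Sketch only: per-URL fetch with a randomized pause between requests.

base_wait=10   # seconds, same base as the "-w 10" idea

# Print a whole number of seconds between base/2 and base/2 + base.
random_wait() {
    local base=$1
    local lo=$(( base / 2 ))
    local span=$(( base + 1 ))          # number of possible values
    echo $(( lo + RANDOM % span ))
}

# Read cache URLs from a file (one per line), fetch each one,
# then pause before the next request.
fetch_all() {
    local list=$1
    local url name
    while read -r url; do
        name=${url##*:wiki.alu.org/}    # same naming as the sed in pull.sh
        wget -U "some-other-agent" -O "$name" "$url"   # placeholder agent
        sleep "$(random_wait "$base_wait")"
    done < "$list"
}
```

To use it, source the file and call `fetch_all urls.txt` from inside the
page directory, in place of the for loop.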

- Daniel


 