Newsgroups: comp.lang.lisp
From: Kenny <kentil...@gmail.com>
Date: Thu, 18 Dec 2008 22:57:08 -0500
Local: Thurs, Dec 18 2008 10:57 pm
Subject: Re: Recovering the ALU wiki

D Herring wrote:
> D Herring wrote:

>> WalterGR wrote:

>>>>>>> The ALU is dead! Long live the Google!

>>>>>> WalterGR:

>>>>>> Uh... is this not a cause for alarm?  I don't see anything on alu.org
>>>>>> mentioning it.  Anyone know if somebody's working on the problem?

>> ...

>>> I e-mailed Nick Levine directly.  He doesn't know if the data for
>>> wiki.alu.org is still available / recoverable and says:

>>> "...I'm doing what I can to rescue the wiki from this end. But it's
>>> better to be safe than sorry. If I were you, I'd recover what you
>>> can."

>>> Anyone want to help me do this?  E-mail me (walte...@aol.com
>>> preferred) or respond here.

>> Hmmm.  Scrape Google?  I see that their cache of wiki.alu.org has
>> already updated to show alu.org, but some of the other pages are still
>> cached.  A query of "site:wiki.alu.org" returns 344 pages (347 if you
>> select the "omitted results").  I'll take a stab at it, but the history
>> and metadata cannot be retrieved this way...

> So I started in on it, but the Goog identified what I was doing and
> disabled my scraping.  Got ~100 files (Bob_Bechtel to Switch_Date_2001)
> before they caught me...

> ""
> We're sorry...
> ... but your query looks similar to automated requests from a computer
> virus or spyware application. To protect our users, we can't process
> your request right now.
> ""

> You'd think a site whose business model hinged on scraping others would
> be a little more scraper-friendly.

> Now I can't use Google cache even manually.

A LispNYCer, Robert Brown, is also a Googler. Maybe some social
engineering is in order?

I admit it would be more fun if it could be done programmatically.

:)

kt

> Here's what I did:
> - query "site:wiki.alu.org"
> - save the 4 pages as wiki1.html, wiki2.html, ...
> - run `./pull.sh wiki1`

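> # pull.sh
> cat <<'_EOF' > pull.sh
> #!/bin/bash
> # Usage: ./pull.sh wiki1   (reads wiki1.html, writes pages into wiki1/)
>
> page=$1 # e.g. wiki1
>
> rm -rf "$page"
> mkdir "$page"
> # Put each <a> tag on its own line, then keep only the "Cached" links.
> sed -e 's,<a ,\n<a ,g' "$page.html" | grep Cached > "$page/links.txt"
> cd "$page" || exit 1
> # Strip the leading <a href=" and the trailing +site... part of each link.
> sed -e 's,<a href=",,' -e 's,\+site.*,,' links.txt > urls.txt
> for f in $(<urls.txt)
> do
>     # Name the output file after the wiki page in the cache URL.
>     name=$(echo "$f" | sed -e 's,.*:wiki.alu.org/,,')
>     wget -U nuweb -O "$name" "$f"
> done
> _EOF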

> When I try again later, I'll pass wget an extra "-w 10 --random-wait" to
> try and stay below their radar.  Also change "nuweb" to some other
> user-agent string (in case they remember).

> - Daniel
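
A note on the "-w 10 --random-wait" idea above: as far as I know, wget
applies -w only between retrievals made within a single invocation, so
it would likely have no effect in pull.sh, which runs wget once per URL.
A minimal sketch of one alternative, sleeping inside the loop itself,
might look like the following. The function names, the USER_AGENT value,
and the 5 to 15 second range are hypothetical choices for illustration,
not part of Daniel's script.

```shell
#!/bin/bash
# Sketch only: same naming rule as pull.sh, plus an explicit pause per fetch.

# Hypothetical user-agent string; override by setting USER_AGENT beforehand.
USER_AGENT=${USER_AGENT:-example-agent}

# Strip everything up to and including ":wiki.alu.org/" from a cache URL
# (the same sed expression pull.sh uses to name its output files).
name_from_url() {
    printf '%s\n' "$1" | sed -e 's,.*:wiki.alu.org/,,'
}

# Print a whole number of seconds between 5 and 15 (assumed range).
random_delay() {
    awk 'BEGIN { srand(); print 5 + int(rand() * 11) }'
}

# fetch_all FILE: download each URL listed in FILE, one per line,
# sleeping for a randomized interval after every request.
fetch_all() {
    while IFS= read -r url; do
        [ -n "$url" ] || continue
        name=$(name_from_url "$url")
        wget -U "$USER_AGENT" -O "$name" "$url"
        sleep "$(random_delay)"
    done < "$1"
}
```

After sourcing this file, "fetch_all urls.txt" would walk the list that
pull.sh writes into each page directory. Nothing here guarantees staying
under any rate limit; it only spaces the requests out.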


 