Recovering the ALU wiki

WalterGR

Dec 18, 2008, 6:38:41 AM

to walt...@aol.com

>>>> The ALU is dead! Long live the Google!

>>> WalterGR:
>>>
>>> Uh... is this not a cause for alarm? I don't see anything on alu.org
>>> mentioning it. Anyone know if somebody's working on the problem?

>> D Herring:
>>
>> Yes. On Saturday, Nick Levine emailed that he was working on it (I
>> had contacted the ALU wrt this issue).

> WalterGR:
>
> Great, thanks. Did he mention whether they still have the data? If
> not, we / someone / I should begin recovering a "mirror" from the
> Google cache - before it spiders the site again and removes the pages
> from its index.

I e-mailed Nick Levine directly. He doesn't know if the data for
wiki.alu.org is still available / recoverable and says:

"...I'm doing what I can to rescue the wiki from this end. But it's
better to be safe than sorry. If I were you, I'd recover what you
can."

Anyone want to help me do this? E-mail me (walt...@aol.com
preferred) or respond here.

Walter

D Herring

Dec 18, 2008, 8:52:27 PM

WalterGR wrote:
>>>>> The ALU is dead! Long live the Google!
>
>>>> WalterGR:
>>>>
>>>> Uh... is this not a cause for alarm? I don't see anything on alu.org
>>>> mentioning it. Anyone know if somebody's working on the problem?

...

> I e-mailed Nick Levine directly. He doesn't know if the data for
> wiki.alu.org is still available / recoverable and says:
>
> "...I'm doing what I can to rescue the wiki from this end. But it's
> better to be safe than sorry. If I were you, I'd recover what you
> can."
>
> Anyone want to help me do this? E-mail me (walt...@aol.com
> preferred) or respond here.

Hmmm. Scrape Google? I see that their cache of wiki.alu.org has
already updated to show alu.org, but some of the other pages are still
cached. A query of "site:wiki.alu.org" returns 344 pages (347 if you
select the "omitted results"). I'll take a stab at it, but the history
and metadata cannot be retrieved this way...

archive.org is less helpful:
""
We're sorry, access to http://wiki.alu.org has been blocked by the
site owner via robots.txt.
""

- Daniel

D Herring

Dec 18, 2008, 9:24:13 PM

D Herring wrote:
> WalterGR wrote:
>>>>>> The ALU is dead! Long live the Google!
>>
>>>>> WalterGR:
>>>>>
>>>>> Uh... is this not a cause for alarm? I don't see anything on alu.org
>>>>> mentioning it. Anyone know if somebody's working on the problem?
> ...
>> I e-mailed Nick Levine directly. He doesn't know if the data for
>> wiki.alu.org is still available / recoverable and says:
>>
>> "...I'm doing what I can to rescue the wiki from this end. But it's
>> better to be safe than sorry. If I were you, I'd recover what you
>> can."
>>
>> Anyone want to help me do this? E-mail me (walt...@aol.com
>> preferred) or respond here.
>
> Hmmm. Scrape Google? I see that their cache of wiki.alu.org has
> already updated to show alu.org, but some of the other pages are still
> cached. A query of "site:wiki.alu.org" returns 344 pages (347 if you
> select the "omitted results"). I'll take a stab at it, but the history
> and metadata cannot be retrieved this way...

So I started in on it, but the Goog identified what I was doing and
disabled my scraping. Got ~100 files (Bob_Bechtel to
Switch_Date_2001) before they caught me...

""
We're sorry...
... but your query looks similar to automated requests from a computer
virus or spyware application. To protect our users, we can't process
your request right now.
""

You'd think a site whose business model hinged on scraping others
would be a little more scraper-friendly.

Now I can't use Google cache even manually.

Here's what I did:
- query "site:wiki.alu.org"
- save the 4 pages as wiki1.html, wiki2.html, ...
- run `./pull.sh wiki1`

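# pull.sh
# The quoted '_EOF' keeps $1, $page, and the backticks from being
# expanded while the script is being written out.
cat <<'_EOF' > pull.sh
#!/bin/bash
# Usage: ./pull.sh wiki1   (expects wiki1.html, a saved results page)

page=$1 # e.g. wiki1

rm -rf "$page"
mkdir "$page"
# Put each anchor on its own line (GNU sed), then keep the "Cached" links.
sed -e 's,<a ,\n<a ,g' "$page.html" | grep Cached > "$page/links.txt"
cd "$page" || exit 1
# Strip the leading <a href=" and everything from "+site" onward.
sed -e 's,<a href=",,' -e 's,\+site.*,,' links.txt > urls.txt
for f in $(<urls.txt)
do
    # Name each file after the wiki page it was cached from.
    name=$(echo "$f" | sed -e 's,.*:wiki.alu.org/,,')
    wget -U nuweb -O "$name" "$f"
done
_EOF
chmod +x pull.sh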

When I try again later, I'll pass wget an extra "-w 10 --random-wait"
to try and stay below their radar. Also change "nuweb" to some other
user-agent string (in case they remember).

- Daniel
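
A minimal sketch of the slower retry described above, not Daniel's actual script. One caveat worth noting: wget's -w/--wait and --random-wait only pause between retrievals made by a single wget invocation, so they have no effect in a loop that starts a fresh wget for every URL. The sketch therefore sleeps inside the loop. The names random_delay, fetch_all, and the FETCH override are hypothetical and introduced only for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print a delay between roughly 0.5x and 1.5x of the
# base number of seconds, similar in spirit to wget's --random-wait.
random_delay() {
    local base=$1
    local lo=$(( base / 2 ))      # lower bound, integer seconds
    local span=$(( base + 1 ))    # result ranges over lo .. lo+base
    echo $(( lo + RANDOM % span ))
}

# Fetch every URL listed in a file (one per line), sleeping between requests.
# FETCH is an assumption: set it to another command to try the loop
# without touching the network.
fetch_all() {
    local urls=$1 base=${2:-10} f name
    while read -r f; do
        [ -n "$f" ] || continue
        # Same naming rule as pull.sh: keep what follows ":wiki.alu.org/".
        name=${f##*:wiki.alu.org/}
        ${FETCH:-wget -U nuweb -O} "$name" "$f"
        sleep "$(random_delay "$base")"
    done < "$urls"
}
```

Usage would be something like `fetch_all urls.txt 10` from inside the per-page directory that pull.sh creates.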

Kenny

Dec 18, 2008, 10:57:08 PM

A LispNYCer is also a Googler, Robert Brown. Maybe some social
engineering is in order?

I admit it would be more fun if it could be done programmatically.

:)

kt

GP lisper

Dec 18, 2008, 11:02:43 PM

On Thu, 18 Dec 2008 21:24:13 -0500, <dher...@at.tentpost.dot.com> wrote:
>
>> Hmmm. Scrape Google? I see that their cache of wiki.alu.org has
>> already updated to show alu.org, but some of the other pages are still
>> cached. A query of "site:wiki.alu.org" returns 344 pages (347 if you
>> select the "omitted results"). I'll take a stab at it, but the history
>> and metadata cannot be retrieved this way...
>
> So I started in on it, but the Goog identified what I was doing and
> disabled my scraping. Got ~100 files (Bob_Bechtel to
> Switch_Date_2001) before they caught me...
>
> ""
> We're sorry...
> ... but your query looks similar to automated requests from a computer
> virus or spyware application. To protect our users, we can't process
> your request right now.

-) Go slower (which you mention)
-) Start-Stop a little (~ use a coin flip to continue)
-) Use a proxy, or better several proxies
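
One possible reading of the coin-flip suggestion, sketched under assumptions: before each request, flip a coin and either wait a short pause or take a long break. The function names and the 10 and 300 second values are illustrative, not something from this thread.

```shell
#!/bin/bash
# Hypothetical helper: succeed about half the time (a coin flip).
coin_flip() {
    [ $(( RANDOM % 2 )) -eq 0 ]
}

# Print how many seconds to wait before the next request: the short pause
# when the coin says "continue", the long break when it says "stop a while".
next_pause() {
    local short=${1:-10} long=${2:-300}
    if coin_flip; then
        echo "$short"
    else
        echo "$long"
    fi
}
```

In a download loop this would be used as `sleep "$(next_pause 10 300)"` between requests.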

--
"Most programmers use this on-line documentation nearly all of the
time, and thereby avoid the need to handle bulky manuals and perform
the translation from barbarous tongues." CMU CL User Manual
** Posted from http://www.teranews.com **

Kaz Kylheku

Dec 19, 2008, 1:32:03 AM

On 2008-12-19, D Herring <dher...@at.tentpost.dot.com> wrote:
> So I started in on it, but the Goog identified what I was doing and
> disabled my scraping. Got ~100 files (Bob_Bechtel to
> Switch_Date_2001) before they caught me...

If the tracking is naively cookie based, discover the offending cookie and blow
it off.

> ""
> We're sorry...
> ... but your query looks similar to automated requests from a computer
> virus or spyware application. To protect our users, we can't process
> your request right now.

I.e. ``to protect our cache from being used as a free web hosting service since
the cache supports our search engine function and isn't a direct source of
revenue.''

If you want free disk space from Google, you have to get an account,
and then they can find things about you and target you with advertising,
etc.

Tim Bradshaw

Dec 19, 2008, 9:44:27 AM

On Dec 18, 11:38 am, WalterGR <walte...@gmail.com> wrote:

> "...I'm doing what I can to rescue the wiki from this end. But it's
> better to be safe than sorry. If I were you, I'd recover what you
> can."
>

> Anyone want to help me do this? E-mail me (walte...@aol.com
> preferred) or respond here.

*Surely* they had backups, right? Or maybe they only had onsite ones
and they had a fire or something?

Tim Bradshaw

Dec 19, 2008, 9:45:30 AM

On Dec 19, 6:32 am, Kaz Kylheku <kkylh...@gmail.com> wrote:

> If the tracking is naively cookie based, discover the offending cookie and blow
> it off.

It won't be, because all the real bad guys would then do that
immediately.

Thomas F. Burdick

Dec 19, 2008, 10:00:06 AM

Seems more like they're hunting down and destroying copies of the RtL.
Two days ago, you could access most of the cached pages from
archive.org. Silly me, I figured, "okay, worst case scenario they can
be recovered from there." It seems they've since installed a
Robots.txt that archive.org is honoring and no longer serving up the
cached copies. I'm guessing this is incompetence not malice, but ...
remind me not to trust the ALU with a wet paper sack.

WalterGR

Dec 19, 2008, 11:36:21 AM

On Dec 19, 7:00 am, "Thomas F. Burdick" <tburd...@gmail.com> wrote:

> Two days ago, you could access most of the cached pages from

> archive.org. [snip] It seems they've since installed a

> Robots.txt that archive.org is honoring and no longer serving up the
> cached copies.

Yeah, that part's especially "hilarious." I was spidering archive.org
yesterday, and in the middle of it, they started honoring the robots
directive. I e-mailed Nick Levine, asking him to remove the
robots.txt, but I haven't heard back.

I, too, was blocked by Google, similarly after ~100 pages. Google's
spider has already returned to wiki.alu.org since it went offline, and
has started removing pages from its index and cache. For example,
yesterday you could get the cached copy of http://wiki.alu.org/Bob_Bechtel,
but today you can't.

If anyone wants to help spider Google (not that I'm encouraging this,
as it's against the TOS,) I've split the cached URLs *as of yesterday*
(minus the pages I grabbed) into groups of 50 URLs. Pick a URL set
from

http://waltergr.com/alu/google.txt

and write back to the group saying which URL set you're going to grab,
to avoid duplication. We can collect the files in one place later.

Walter
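
For anyone preparing a similar list, here is a small sketch of how a file of URLs could be cut into sets of 50 with the standard split utility. The names make_url_sets, cached_urls.txt, and the set_ prefix are placeholders, not the ones actually used for google.txt.

```shell
#!/bin/bash
# Split a list of URLs (one per line) into files of at most 50 lines each.
# split(1) names the output files <prefix>aa, <prefix>ab, and so on.
make_url_sets() {
    local list=$1 prefix=${2:-set_}
    split -l 50 "$list" "$prefix"
}
```

Calling `make_url_sets cached_urls.txt set_` would leave set_aa, set_ab, ... in the current directory, one file per volunteer.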

WalterGR

Dec 19, 2008, 12:02:47 PM

On Dec 19, 7:00 am, "Thomas F. Burdick" <tburd...@gmail.com> wrote:

> Two days ago, you could access most of the cached pages from
> archive.org. [snip] It seems they've since installed a
> Robots.txt that archive.org is honoring and no longer serving up the
> cached copies.

One thing I forgot. From archive.org's FAQ:

"Q: How can I get a copy of the pages on my Web site? If my site got
hacked or damaged, could I get a backup from the Archive?

A: Our terms of use do not cover backups for the general public.
However, you may use the Internet Archive Wayback Machine to locate
and access archived versions of your web site. We can't guarantee that
your site has been or will be archived. For siteowners only we offer
limited backup capabilities. Send your request to info at archive dot
org for more information."

Who would be the "siteowner" for the wiki? Previously, given it's a
community site, I'd say that one of us could have probably convinced
archive.org to help us out. But now that "they" have put up a
restrictive robots.txt, I think we're screwed unless we can contact
someone "in charge."

Walter

Kenny

Dec 19, 2008, 12:20:00 PM

Thomas F. Burdick wrote:
> On Dec 19, 15:44, Tim Bradshaw <tfb+goo...@tfeb.org> wrote:
>
>>On Dec 18, 11:38 am, WalterGR <walte...@gmail.com> wrote:
>>
>>
>>>"...I'm doing what I can to rescue the wiki from this end. But it's
>>>better to be safe than sorry. If I were you, I'd recover what you
>>>can."
>>
>>>Anyone want to help me do this? E-mail me (walte...@aol.com
>>>preferred) or respond here.
>>
>>*Surely* they had backups, right? Or maybe they only had onsite ones
>>and they had a fire or something?
>
>
> Seems more like they're hunting down and destroying copies of the RtL.

<pssst!> I grabbed a copy of the Highlight Film anyway <\pssst!>

> Two days ago, you could access most of the cached pages from
> archive.org. Silly me, I figured, "okay, worst case scenario they can
> be recovered from there." It seems they've since installed a
> Robots.txt that archive.org is honoring and no longer serving up the
> cached copies. I'm guessing this is incompetence not malice, but ...

A plot would be so much cooler. Once the marketing devices have been
1984ed and the flood of novices eliminated the secret guild trying to
get control of the grail just needs to track down and eliminate any
masters outside the... omigod. It /is/ the yobbos.

I shall begin work on the screenplay. The Last Lisper. Cruise will be
all over this.

more soon!

kxo

D Herring

Dec 19, 2008, 8:53:56 PM

WalterGR wrote:

> If anyone wants to help spider Google (not that I'm encouraging this,
> as it's against the TOS,) I've split the cached URLs *as of yesterday*
> (minus the pages I grabbed) into groups of 50 URLs. Pick a URL set
> from

Certainly I would never spider google...

FWIW, here's roughly 350 wiki pages.
http://tentpost.com/wiki.alu.org.tar.bz2

Looking at it, large swaths are missing (e.g.
http://wiki.alu.org/Local); but it's somewhat better than nothing.

- Daniel

Nick Levine

Dec 19, 2008, 11:54:01 PM

I've been asked to pass on the following message:

The Association of Lisp Users (ALU) regrets the current unavailability
of the wiki formerly available at http://wiki.alu.org . This is due to
circumstances totally outside its control.

The wiki was being maintained by third parties with whom communication
has proved difficult. The only role of the ALU itself in the wiki has
been to maintain its DNS records (in the same way that - for example -
it also maintains DNS records for planet.lisp.org but has nothing else
to do with that site). The ALU currently has neither read nor write
access to the wiki's contents.

It is the intent of the ALU to reconstitute and reestablish the
content of the wiki at alu.org as soon as possible. Furthermore, we
are thankful to those in the Lisp community who have been working to
recapture the wiki content and we will need the fruits of their
efforts as we rebuild the wiki.

Signed, The ALU Board of Directors

Kenny

Dec 20, 2008, 1:52:25 AM

Nick Levine wrote:
> I've been asked to pass on the following message:
>
> The Association of Lisp Users (ALU) regrets the current unavailability
> of the wiki formerly available at http://wiki.alu.org . This is due to
> circumstances totally outside its control.

Jeez. The yobbos have kidnapped it. Demanding ransom, apparently, or
Kenny's head, whichever comes first.

>
> The wiki was being maintained by third parties with whom communication
> has proved difficult.

You'll find a cell phone in the hollow of the oak tree.

> The only role of the ALU itself in the wiki has
> been to maintain its DNS records (in the same way that - for example -
> it also maintains DNS records for planet.lisp.org but has nothing else
> to do with that site). The ALU currently has neither read nor write
> access to the wiki's contents.

If you ever want to see your Road alive again do not contact the police.

>
> It is the intent of the ALU to reconstitute and reestablish the
> content of the wiki at alu.org as soon as possible. Furthermore, we
> are thankful to those in the Lisp community who have been working to
> recapture the wiki content and we will need the fruits of their
> efforts as we rebuild the wiki.
>
> Signed, The ALU Board of Directors

Courage.

kenneth

WalterGR

Dec 21, 2008, 8:16:49 AM

On Dec 19, 5:53 pm, D Herring <dherr...@at.tentpost.dot.com> wrote:
> WalterGR wrote:
> > If anyone wants to help spider Google (not that I'm encouraging this,
> > as it's against the TOS,) I've split the cached URLs *as of yesterday*
> > (minus the pages I grabbed) into groups of 50 URLs. Pick a URL set
> > from
>
> Certainly I would never spider google...
>
> FWIW, here's roughly 350 wiki pages.
> http://tentpost.com/wiki.alu.org.tar.bz2

Where did you recover these copies from? (They don't seem to contain
the usual archive.org / Google cache markers...)

Walter

D Herring

Dec 21, 2008, 12:13:37 PM

And I would never strip g__gle's 3 header lines...

BTW, specific queries allowed me to recover older (~September) copies
of /Local and /The_Road_To_Lisp_Survey, but these haven't been
uploaded yet.

- Daniel
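
For readers cleaning up their own cached copies, a minimal sketch of removing a three-line banner from saved pages. It assumes, as stated above, that the cache header occupies exactly the first three lines of each file; strip_cache_header is a hypothetical name, and the files are rewritten in place, so work on copies.

```shell
#!/bin/bash
# Drop the first three lines of each named file, in place.
# Assumption: the cache banner is exactly those three lines.
strip_cache_header() {
    local f tmp
    for f in "$@"; do
        tmp=$(mktemp) || return 1
        # tail -n +4 prints the file starting at line 4.
        tail -n +4 "$f" > "$tmp" && mv "$tmp" "$f"
    done
}
```

For example, `strip_cache_header wiki1/*` would process every page saved by one run of pull.sh.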

WalterGR

Dec 21, 2008, 12:52:50 PM

On Dec 21, 9:13 am, D Herring <dherr...@at.tentpost.dot.com> wrote:
> WalterGR wrote:
> > On Dec 19, 5:53 pm, D Herring <dherr...@at.tentpost.dot.com> wrote:

> >> FWIW, here's roughly 350 wiki pages.
> >> http://tentpost.com/wiki.alu.org.tar.bz2
>
> > Where did you recover these copies from? (They don't seem to contain
> > the usual archive.org / Google cache markers...)
>
> And I would never strip g__gle's 3 header lines...

Wow. It must be an off day. I'm usually not *that* dense...

> BTW, specific queries allowed me to recover older (~September) copies
> of /Local and /The_Road_To_Lisp_Survey, but these haven't been
> uploaded yet.

Strange. I have /The_Road_To_Lisp_Survey from Google's Dec 7, 2008
03:39:40 GMT visit, but the last-modified date provided by the wiki
says 2006-04-06.

BTW, it appears that http://clrfi.alu.org/ is gone too. Is that a
recent occurrence?

Walter

Nick Levine

Dec 21, 2008, 1:27:22 PM

> BTW, it appears that http://clrfi.alu.org/ is gone too. Is that a
> recent occurrence?

That was deliberate and happened a few months ago. The CLRFI process
was badly pitched. It ground to a halt and died without there ever
being any CLRFIs. It seemed inappropriate to keep serving the bytes.
Such was the overwhelming public interest (!) in the site that it's
taken until now for anybody to notice.

I recommend that anyone interested in what the CLRFI was trying to
achieve should look instead at the CDR - Common Lisp Document
Repository - hosted at http://cdr.eurolisp.org/

- nick (wearing his ex-CLRFI committee member's hat)