Message from discussion Accessing archive.org
Newsgroups: demon.service
From: Brian Widdas <br...@widdas.net>
Date: Wed, 14 Jan 2009 23:47:57 GMT
Local: Wed, Jan 14 2009 6:47 pm
Subject: Re: Accessing archive.org
On 2009-01-10, Alan Poulter <l...@poulter.demon.co.uk> wrote:

> Is anyone else having problems accessing archive.org? If you are please
> can you email me rather than reply here.

Hi.

The problems have now been fixed. The explanation is not short, so bear
with me:

Firstly, yes, something on web.archive.org was blocked by the IWF. Don't
ask what, I don't know.

Our filter uses a proxy to inspect suspect URLs. Where a URL is not
on the IWF list (ie, the server hosts some child abuse content, but only a
single URL is blocked), we have to proxy the connection on to the original
server the request was intended for.
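
As a rough illustration of that decision, here is a minimal Python sketch. The
list entry, the host set derived from it and the three outcome names are all
hypothetical; it also assumes that hosts with no listed URL bypass the proxy
entirely, which the post implies but does not state.

```python
from urllib.parse import urlsplit

# Hypothetical list entry; the real list is not public.
BLOCKED_URLS = {"http://example.org/blocked/page.html"}

# Assumption: any host that appears on the list counts as "suspect".
SUSPECT_HOSTS = {urlsplit(u).hostname for u in BLOCKED_URLS}


def route(url: str) -> str:
    """Return what the filter does with a requested URL."""
    if urlsplit(url).hostname not in SUSPECT_HOSTS:
        return "direct"   # host is not suspect, so the proxy is not involved
    if url in BLOCKED_URLS:
        return "block"    # this exact URL is on the list
    return "proxy"        # suspect host, unlisted URL: pass on to the origin
```

It is the last branch, passing the request on to the origin server, that
matters for the rest of this explanation.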

Here's where it gets interesting. The proxy sends various bits of
information with the request. One of these is the name of the proxy itself.
Unsurprisingly, this is 'iwfwebfilter.thus.net'.

It seems that archive.org use caches at their end to speed up access to
pages. When a page is requested, if it's not in the cache, it is built from
the archive and made available to the requestor. As part of this build
process, the server takes a hostname from the cache, along with the date
portion of the URL, etc, to create the 'base URL' of the page.

To explain: say you want to archive www.demon.net. In order to make the
page available on

    http://web.archive.org/web/20070107191318/http://www.demon.net/

you need to strip out all the references to http://www.demon.net/ in the
page (in links, images, CSS, javascript, etc) and replace them with the URL
above. Since a page may not change much, it's better to do it at request
time, so that a single copy of a page can span multiple archived instances.
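
The request-time rewrite could be sketched roughly as below. This is not the
Wayback Machine's code: the function names are made up, and a plain string
replace stands in for whatever link rewriting the real software does.

```python
ARCHIVE_HOST = "web.archive.org"


def archive_url(timestamp: str, original_url: str, host: str = ARCHIVE_HOST) -> str:
    """Build the archived address for one capture of a page."""
    return f"http://{host}/web/{timestamp}/{original_url}"


def rewrite_page(stored_html: str, original_url: str, timestamp: str) -> str:
    """Point every reference to the original site at the archived copy."""
    return stored_html.replace(original_url, archive_url(timestamp, original_url))
```

Because the timestamp is only supplied when the page is requested, the same
stored copy can be served under several capture dates.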

Unfortunately, the archive.org software would take the server name we
supplied and use it in place of 'web.archive.org', which is why you'd get a
URL like

    http://iwfwebfilter.thus.net/web/20070107191318/http://www.demon.net/

That server doesn't have any content there, so you'd get a 404.
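
In code terms the bug amounts to something like the sketch below. The
'supplied_host' key is a hypothetical stand-in for whichever piece of request
information carried the proxy's name, and the "fixed" variant is only one
plausible remedy, not necessarily what the Internet Archive did.

```python
CANONICAL_HOST = "web.archive.org"


def base_url_buggy(request_info: dict, timestamp: str, original_url: str) -> str:
    """Trust whatever host name arrived with the request."""
    host = request_info.get("supplied_host", CANONICAL_HOST)
    return f"http://{host}/web/{timestamp}/{original_url}"


def base_url_fixed(request_info: dict, timestamp: str, original_url: str) -> str:
    """Ignore the supplied name and always use the service's own host."""
    return f"http://{CANONICAL_HOST}/web/{timestamp}/{original_url}"
```

Given a request carrying 'iwfwebfilter.thus.net', the buggy version produces
the broken URL above, while the fixed version still produces the
web.archive.org one.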

However, this only happened on a cache miss. That is, if the page was
already in the cache, and it had the correct URLs, it would work just fine.
So some people would see that everything appeared to be as it should.

Equally unfortunately, a page with the iwfwebfilter.thus.net URLs could be
cached and then served up to non-Demon customers, which explains our
friends in Romania, and other reports of people who'd not been anywhere
near the Demon caches seeing 'iwfwebfilter.thus.net' where they'd been
expecting 'web.archive.org'.
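
A toy model of that cache behaviour, assuming (this is a guess at the design,
not something confirmed here) that the cache key did not include the host name
used to build the page:

```python
class PageCache:
    """Toy cache keyed on (timestamp, original URL) only."""

    def __init__(self) -> None:
        self._pages: dict[tuple[str, str], str] = {}

    def get(self, timestamp: str, original_url: str, supplied_host: str) -> str:
        key = (timestamp, original_url)  # note: the host is not part of the key
        if key not in self._pages:
            # Cache miss: build the base URL using whatever host was supplied.
            self._pages[key] = f"http://{supplied_host}/web/{timestamp}/{original_url}"
        return self._pages[key]  # cache hit: the supplied host is ignored
```

If the first request for a page arrives via the filter, every later visitor
gets the 'iwfwebfilter.thus.net' version. If the first request arrives
directly, even filtered users see the correct page.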

Shortly before 10pm this evening (albeit a more civilised time where
they're based), the Internet Archive fixed the bug and cleared the caches,
so the problem won't return. Nor can the same technique of mis-supplying a
hostname be used for mischief.

To summarise:
  * There was a bug in the Wayback Machine software, which we tickled
  * Demon didn't perform any content manipulation
  * Demon didn't unilaterally filter or block web.archive.org
  * The Internet Archive have now fixed the bug

As Richard is fond of saying, I'm writing to inform.
Brian
--
☺

