Kind Regards,
Jochen

automatem . software solutions
WEB: J...@AUTOMATEM.CO.NZ | WWW.AUTOMATEM.CO.NZ
PHONE: 0800 GO 4 AUTO (0800 464288)
MOBILE: 021 567 853
I'd love something we could integrate into the existing batch file, so
there's no extra workflow step.
If not, I'm beginning to think a server-side solution may be the most
transparent option.
Paul
Thanks for the link.
It's not appropriate for my current situation but I've bookmarked it
for future reference.
Paul
(wow, the 9rules network certainly has broadened its net over the
last 18 months...)
http://www.natlib.govt.nz/about-us/current-initiatives/web-harvest-2008
If you're responsible for a site hosted in .nz you may want to be aware
of:
"Will you honour the robots.txt protocol?
No.
We realise this may be a contentious decision, and we have given this
issue a great deal of thought. However, our current policy is to ignore
the robots.txt file and harvest as many files as possible from each
website unless we receive a request to do otherwise.
We believe it is best that we ignore robots.txt because we have a
responsibility and mandate to preserve the New Zealand internet so that
future New Zealanders can experience it just as we do - or as close as
is technically possible. However, robots.txt files currently block many
URLs. If we were to obey robots.txt we would only get a partial snapshot
of the internet.
We recognise that this policy can cause problems for websites, either by
overloading them with too much traffic, or by following links that cause
problems. In these cases we are happy to change the crawler's behaviour
at the webmaster's request."
- Douglas.
--
Douglas L Davey - Internet Systems Programmer, Waikato University
Any views or opinions presented are solely those of the author
I wonder where they bought / developed their crawler?
I also wonder how they plan to handle the content on sites like
YouTube / Wikipedia etc...?
Seems like a fairly large undertaking to me :)
"Will you honour the robots.txt protocol?
No.
It's Open Source: <http://webcurator.sourceforge.net/>
I'm a commissioner on the Library Information Advisory Council, which
advises the Minister Responsible for the National Library. I'm happy
to help in any way I can. I'm sympathetic to both sides: robots.txt
is made for a reason, but the law under which these guys are crawling
says that the NZ Internet is to be preserved whether it wants to be or
not :)
Let me know if you're not getting a response through the email
channels. I might be able to find you a real person.
Cheers;
Nat
I imagine the reason is something like: they're crawling every web
site in New Zealand and they'd be spamming. Did you try putting that
question to the e-mail address on the web site?
Nat
Copy of email just sent:
From: Michael Brandon-SearchMasters [mailto:mic...@searchmasters.co.nz]
Sent: Wednesday, 15 October 2008 3:35 p.m.
To: 'web-harv...@natlib.govt.nz'
Subject: re
http://www.natlib.govt.nz/about-us/current-initiatives/web-harvest-2008
I am very concerned that you have decided to ignore robots.txt
There are many URLs that are dynamically created, and many URLs that
may be considered private and not meant to be accessible to people coming in
from the search engines.
If it is good enough for Google to abide by robots.txt, then it should be
good enough for an archiver to do the same. Such pages are NOT meant to be
found AT ALL. The pages are absolute private property, and should be
considered the same as password protected pages.
Many dynamically generated pages have duplicate URLs, and one way of
stopping these being spidered is to robots.txt them.
On my sites, I have admin screens and other such pages that I don't want
found/spidered/copies created of.
If you want to be a responsible web citizen, then abide by the RULES.
I am very concerned that you should take matters into your own hands.
Yes, Google has run into the same issue, where people inadvertently add
robots.txt for pages that they should not. And how do they get around it?
They record only the URL if found, and then have no cache. This is a
reasonable compromise since for the page to be found, there must have been a
link to it.
I will be publishing this email on a number of internet groups that I am
part of. I found out about it on nz-we...@googlegroups.com
Kind Regards
Michael
Michael Brandon
Search Engine Mastery
Getting you to the top of the Search Engines
http://www.SearchMasters.co.nz
Ph: 09 8132307, Mob: 021 728889, Skype: SearchMasters
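
For reference, this is how a compliant crawler applies rules like the ones
Michael describes: it reads robots.txt and skips anything a Disallow line
covers. A minimal Python sketch using the standard library's robotparser; the
rules and paths below are hypothetical examples, not Michael's actual URLs.

    from urllib import robotparser

    # Hypothetical robots.txt along the lines Michael describes: keep admin
    # screens and dynamically generated duplicates away from compliant crawlers.
    ROBOTS_LINES = [
        "User-agent: *",
        "Disallow: /admin/",
        "Disallow: /videos/generated/",
    ]

    rp = robotparser.RobotFileParser()
    rp.parse(ROBOTS_LINES)

    # A crawler that honours the protocol runs this check before every fetch;
    # a harvester that ignores robots.txt simply never asks.
    for path in ("/admin/users", "/videos/generated/clip?id=1", "/about"):
        verdict = "fetch" if rp.can_fetch("ExampleBot", path) else "skip"
        print(path, "->", verdict)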
You'd be more effective if you listed the specific problems with
ignoring robots.txt, for example duplicate URLs for content means
unnecessary bandwidth and server load for the content provider.
> On my sites, I have admin screens and other such pages that I don't
> want found/spidered/copies created of.
This is poor security--if an open source spider can find your admin
pages, so can black hats. Use passwords for material that you don't
want accessed.
Cheers;
Nat
So they get fewer pages than they otherwise would.
Tut tut National Library!!!!
Michael
> I imagine the reason is something like: they're crawling every web
> site in New Zealand and they'd be spamming. Did you try putting that
> question to the e-mail address on the web site?
hmmm
miles
Henry
You could say the same thing about any country's National Library :-)
The National Library is required to do so by an Act of Parliament:
<http://legislation.govt.nz/act/public/2003/0019/latest/DLM191962.html>.
Our culture is increasingly happening online and I strongly
believe that it's important to preserve that. I'm sad that the
Internet Archive didn't start sooner, and that many sites still have
robots.txt files that prevent the I.A. from making useful and complete
archives.
Cheers;
Nat
> The National Library is required to do so by an Act of Parliament:
> <http://legislation.govt.nz/act/public/2003/0019/latest/DLM191962.html>.
> Our culture is increasingly happening online and I strongly
> believe that it's important to preserve that. I'm sad that the
> Internet Archive didn't start sooner, and that many sites still have
> robots.txt files that prevent the I.A. from making useful and complete
> archives.
Also from Courtney Johnston of Nat Lib:
"Our intentions really are good - to collect & preserve & make
accessible NZ's digital heritage for people in the future, the same
way we do already for books & newspapers & photographs"
<http://doing.nothing.net.nz/national-library-of-nz-web-harvest/#comments>
So, is a website a published piece of content in the same way a book
or a newspaper is? Or, to make my point more explicitly, if someone
has a robots.txt file asking for sites not to be spidered, could this
be considered an implicit choice that they don't want this site (or
part thereof) to be in the public domain in quite the same way as
something with no robots.txt file? In the same way that I might create
a paper 'zine for a few friends, but not want it archived by the
National Library.
If so, why shouldn't that choice be honoured by the National Library? And
yeah, I know, if it's on the internet, it's on the internet and it's
public - but there might be many reasons someone wants the content on
the internet, but not widely made public, and therefore uses
robots.txt to stop spidering and Googlability.
Or am I splitting hairs? :)
I *do* think IA is a good idea. But, and I'm not entirely sure why, I
do have this nagging feeling that making the "decision not to honour
the robots.txt protocol" seems a bit against the spirit of the web.
Mike
>
> On 15/10/2008, at 8:28 PM, Henry Maddocks wrote:
>> Of more concern is the fact they are scraping everyone's content
>> without permission and, it appears, in most cases without the content
>> providers' knowledge. Whose hare-brained idea was this?
>
> You could say the same thing about any country's National Library :-)
Just because some other country does it, that makes it ok? Anyone up
for invading Fiji?
I assume the molecules they have in their collection were either
purchased or gifted to them. Or did they walk into a gallery and
start lifting art from the walls?
I'm all for preserving our culture but just because it's bits doesn't
mean they can be so cavalier with everyone's rights. Imagine if the
Flickr crowd found out about this.
Henry
It's called "legal deposit". The law says every publisher must
deposit with the National Library a copy of every book published.
This is obviously not the case with art, photographs, unpublished
manuscripts, maps, ephemera, and the other collections in the library.
> Or, to make my point more explicitly, if someone has a robots.txt
> file asking for sites not to be spidered, could this be considered
> an implicit choice that they don't want this site (or part thereof)
> to be in the public domain in quite the same way as something with
> no robots.txt file?
That's a good question. I suggest that you require authentication if
you don't want something on the web but don't want it made public. I
was around when robots.txt was dreamed up, and it was created for
several reasons: bandwidth and CPU time were precious resources;
robots were getting tangled in /foo/../foo/../foo/.. type recursive
hells; and authentication wasn't widespread. The first two are still
good reasons to honour robots.txt, but authentication removes the
final reason.
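
A quick illustration of the "recursive hells" mentioned above: relative links
containing ".." can make one resource look like an endless supply of new URLs
unless the crawler normalises paths. A small Python sketch (the /foo path is
just an example):

    from posixpath import normpath

    # Each round appends another "../foo/", so a naive crawler that compares
    # raw URLs sees a brand-new one every time, yet they all resolve to the
    # same place.
    path = "/foo/"
    for _ in range(4):
        path += "../foo/"
        print(path, "->", normpath(path))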
> But, and I'm not entirely sure why, I do have this nagging feeling
> that making the "decision not to honour the robots.txt protocol"
> seems a bit against the spirit of the web.
Yes, I have a similar problem. The Internet Archive honour it, and
their archive is very patchy as a result. I can see the bind the
Library is in, and sympathise with both sides. It's a bit like
grumbling that the meter reader doesn't obey your "no trespassing"
signs. Yes, your signs are there for a reason but the meter reader
also has a job to do.
Cheers;
Nat
I apologize for conflating Henry's "molecules" comment with Mike's
ones about not wanting things public and nagging feelings. Dangers of
10:30pm email :)
Nat
They read HTML and try to use dynamic links that sit in text boxes (not
href, not action, nothing - just the contents of a text box) for people
to modify and embed in their sites. The links are incomplete. They take
what looks like a valid URL out of there and as a result they downloaded
the same video and the same image like 1000 times.
Their harvester comes from 149.20.55.4, which is a non-nz IP and comes
via an international circuit, just to make us pay.
They come to you loaded with cookies, so you won't even think they are
a bot and keep serving them dynamic URLs. They happily take it and come
back for more.
I couldn't understand where the increase in new sessions was coming
from. It was them, coming back for more with the wrong cookies, so I had to
open a new session and a DB entry for them.
Like I have nothing better to do than cleaning the DB after them.
I banned the IP. Should have done it long ago. Cost me a few bob in
traffic fees.
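
The post above doesn't say how the ban was applied (most likely at the
firewall or web server), but for anyone wanting an application-level
equivalent, here is a rough sketch of a WSGI middleware that rejects the
harvester's address before any session or database work happens. Only the IP
comes from the post; the rest is illustrative.

    # Sketch only: reject the harvester before sessions or DB rows get created.
    BLOCKED_IPS = {"149.20.55.4"}   # the address reported above

    class BlockHarvester:
        def __init__(self, app, blocked=BLOCKED_IPS):
            self.app = app          # the real WSGI application being wrapped
            self.blocked = blocked

        def __call__(self, environ, start_response):
            if environ.get("REMOTE_ADDR") in self.blocked:
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Forbidden\n"]
            return self.app(environ, start_response)

    # Usage: wrap your existing app, e.g. application = BlockHarvester(application)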
I believe they do. If I read their web site correctly, you can email
them at web-harvest-2008 and say what your domain is and how you'd
like them to handle its crawling.
Cheers;
Nat
I wonder how hard it would have been for them to have two bots: one for the
local circuit and one for international?
See: http://www.natlib.govt.nz/about-us/current-initiatives/web-harvest-2008
> What is the legal basis for the domain harvest?
>
> The National Library of New Zealand is a deposit library, which
> means that anyone who publishes a book in New Zealand must supply a
> copy to the National Library. This is called "legal deposit" and
> similar laws exist in many countries.
>
> In 2003 the National Library of New Zealand (Te Puna Mātauranga o
> Aotearoa) Act 2003 was passed to extend legal deposit to cover
> internet publications. Part 4 of the Act authorises the National
> Librarian to “make a copy” of any internet publication for the
> purpose of storing and using it.
>
> For more information see the National Library of New Zealand (Te
> Puna Mātauranga o Aotearoa) Act 2003 and the Minister's National
> Library Requirement (Electronic Documents) Notice 2006.
Basically, copyright is a legal construct. The law that creates it
also makes exceptions. One of the exceptions the law makes is for
archiving the New Zealand Internet.
Cheers;
Nat
I don't think it has anything to do with the NZ Internet. Did you check
the user agent string? I googled for that IP address and found:
149.20.55.4 - - [10/Aug/2008:07:49:50 +0800] "GET /units/laws/laws8504?&mysource_site_extension=page_info_pages HTTP/1.0" 200 16480 "http://units.handbooks.uwa.edu.au/units/laws/laws8504" "Mozilla/5.0 (compatible; archive.org_bot/heritrix-1.15.1-x +http://pandora.nla.gov.au/crawl.html)"
If you have the same user agent in your logs, it would appear that
you're being crawled by the Australian National Library and not the
New Zealand National Library.
Cheers;
Nat
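
If you want to check your own logs for the same thing, a rough Python sketch
is below. It assumes an Apache/nginx "combined" format log, where the user
agent is the last double-quoted field on each line, and the log path is a
placeholder.

    import re

    LOG_PATH = "access.log"       # placeholder; point this at your own log file
    TARGET_IP = "149.20.55.4"     # the address mentioned above

    agents = set()
    with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
        for line in log:
            if not line.startswith(TARGET_IP + " "):
                continue
            quoted = re.findall(r'"([^"]*)"', line)   # request, referer, user agent
            if quoted:
                agents.add(quoted[-1])                # user agent is the last field

    for agent in sorted(agents):
        print(agent)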
I agree, although I'll be curious to learn how many loops, blacklists,
irate emails, etc. were turned up by their crawl. I can also see it
from the National Library's point of view, with robots.txt in the way
of their legal requirement to archive the New Zealand Internet for
posterity. I'm glad I didn't have to make the call on whether to obey
robots.txt or not; it's not a straightforward decision.
Cheers;
Nat