Kind Regards,
Jochen

automatem . software solutions
WEB: J...@AUTOMATEM.CO.NZ | WWW.AUTOMATEM.CO.NZ
PHONE: 0800 GO 4 AUTO (0800 464288)
MOBILE: 021 567 853
I'd love something we could integrate into the existing batch file, so
there's no extra workflow step.
If not, I'm beginning to think a server-side solution may be the most
transparent option.
Paul
Thanks for the link.
It's not appropriate for my current situation but I've bookmarked it
for future reference.
Paul
(wow, the 9rules network certainly has broadened its net over the
last 18 months...)
http://www.natlib.govt.nz/about-us/current-initiatives/web-harvest-2008
If you're responsible for a site hosted in .nz you may want to be aware
of:
"Will you honour the robots.txt protocol?
No.
We realise this may be a contentious decision, and we have given this
issue a great deal of thought. However, our current policy is to ignore
the robots.txt file and harvest as many files as possible from each
website unless we receive a request to do otherwise.
We believe it is best that we ignore robots.txt because we have a
responsibility and mandate to preserve the New Zealand internet so that
future New Zealanders can experience it just as we do - or as close as
is technically possible. However, robots.txt files currently block many
URLs. If we were to obey robots.txt we would only get a partial snapshot
of the internet.
We recognise that this policy can cause problems for websites, either by
overloading them with too much traffic, or by following links that cause
problems. In these cases we are happy to change the crawler's behaviour
at the webmaster's request."
- Douglas.
--
Douglas L Davey - Internet Systems Programmer, Waikato University
Any views or opinions presented are solely those of the author
I wonder where they bought / developed their crawler?
I also wonder how they plan to handle the content on sites like
YouTube / Wikipedia etc...?
Seems like a fairly large undertaking to me :)
"Will you honour the robots.txt protocol?
No.
It's Open Source: <http://webcurator.sourceforge.net/>
I'm a commissioner on the Library Information Advisory Council, which
advises the Minister Responsible for the National Library. I'm happy
to help in any way I can. I'm sympathetic to both sides: robots.txt
is made for a reason, but the law under which these guys are crawling
says that the NZ Internet is to be preserved whether it wants to be or
not :)
Let me know if you're not getting a response through the email
channels. I might be able to find you a real person.
Cheers;
Nat
I imagine the reason is something like: they're crawling every web
site in New Zealand and they'd be spamming. Did you try putting that
question to the e-mail address on the web site?
Nat
Copy of email just sent:
From: Michael Brandon-SearchMasters [mailto:mic...@searchmasters.co.nz]
Sent: Wednesday, 15 October 2008 3:35 p.m.
To: 'web-harv...@natlib.govt.nz'
Subject: re
http://www.natlib.govt.nz/about-us/current-initiatives/web-harvest-2008
I am very concerned that you have decided to ignore robots.txt
There are many URLs that are dynamically created, and many URLs that
may be considered private and not meant to be accessible to people coming in
from the search engines.
If it is good enough for Google to abide by robots.txt, then it should be
good enough for an archiver to do the same. Such pages are NOT meant to be
found AT ALL. The pages are absolute private property, and should be
considered the same as password protected pages.
Many dynamically generated pages have duplicate URLs, and one way of
stopping these being spidered is to robots.txt them.
On my sites, I have admin screens and other such pages that I don't want
found/spidered/copies created of.
If you want to be a responsible web citizen, then abide by the RULES.
I am very concerned that you should take matters into your own hands.
Yes, Google has run into the same issue, where people inadvertently add
robots.txt for pages that they should not. And how do they get around it?
They record only the URL if found, and then have no cache. This is a
reasonable compromise since for the page to be found, there must have been a
link to it.
I will be publishing this email on a number of internet groups that I am
part of. I found out about it on nz-we...@googlegroups.com
Kind Regards
Michael
Michael Brandon
Search Engine Mastery
Getting you to the top of the Search Engines
http://www.SearchMasters.co.nz
Ph: 09 8132307, Mob: 021 728889, Skype: SearchMasters
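
For reference, this is how a compliant crawler applies rules like the ones
Michael describes: it reads robots.txt and skips anything a Disallow line
covers. A minimal Python sketch using the standard library's robotparser; the
rules and paths below are hypothetical examples, not Michael's actual URLs.

    from urllib import robotparser

    # Hypothetical robots.txt along the lines Michael describes: keep admin
    # screens and dynamically generated duplicates away from compliant crawlers.
    ROBOTS_LINES = [
        "User-agent: *",
        "Disallow: /admin/",
        "Disallow: /videos/generated/",
    ]

    rp = robotparser.RobotFileParser()
    rp.parse(ROBOTS_LINES)

    # A crawler that honours the protocol runs this check before every fetch;
    # a harvester that ignores robots.txt simply never asks.
    for path in ("/admin/users", "/videos/generated/clip?id=1", "/about"):
        verdict = "fetch" if rp.can_fetch("ExampleBot", path) else "skip"
        print(path, "->", verdict)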
You'd be more effective if you listed the specific problems with
ignoring robots.txt, for example duplicate URLs for content means
unnecessary bandwidth and server load for the content provider.
> On my sites, I have admin screens and other such pages that I don't
> want found/spidered/copies created of.
This is poor security--if an open source spider can find your admin
pages, so can black hats. Use passwords for material that you don't
want accessed.
Cheers;
Nat
So they get fewer pages than they otherwise would.
Tut tut National Library!!!!
Michael
> I imagine the reason is something like: they're crawling every web
> site in New Zealand and they'd be spamming. Did you try putting that
> question to the e-mail address on the web site?
hmmm
miles
Henry
You could say the same thing about any country's National Library :-)
The National Library is required to do so by an Act of Parliament:
<http://legislation.govt.nz/act/public/2003/0019/latest/DLM191962.html>.
Our culture is increasingly happening online and I strongly
believe that it's important to preserve that. I'm sad that the
Internet Archive didn't start sooner, and that many sites still have
robots.txt files that prevent the I.A. from making useful and complete
archives.
Cheers;
Nat
> The National Library is required to do so by an Act of Parliament:
> <http://legislation.govt.nz/act/public/2003/0019/latest/DLM191962.html>.
> Our culture is increasingly happening online and I strongly
> believe that it's important to preserve that. I'm sad that the
> Internet Archive didn't start sooner, and that many sites still have
> robots.txt files that prevent the I.A. from making useful and complete
> archives.
Also from Courtney Johnston of Nat Lib:
"Our intentions really are good - to collect & preserve & make
accessible NZ's digital heritage for people in the future, the same
way we do already for books & newspapers & photographs"
<http://doing.nothing.net.nz/national-library-of-nz-web-harvest/#comments>
So, is a website a published piece of content in the same way a book
or a newspaper is? Or, to make my point more explicitly, if someone
has a robots.txt file asking for sites not to be spidered, could this
be considered an implicit choice that they don't want this site (or
part thereof) to be in the public domain in quite the same way as
something with no robots.txt file? In the same way that I might create
a paper 'zine for a few friends, but not want it archived by the
National Library.
If so, why shouldn't that choice be honoured by the National Library? And
yeah, I know, if it's on the internet, it's on the internet and it's
public - but there might be many reasons someone wants the content on
the internet, but not widely made public, and therefore uses
robots.txt to stop spidering and Googlability.
Or am I splitting hairs? :)
I *do* think IA is a good idea. But, and I'm not entirely sure why, I
do have this nagging feeling that making the "decision not to honour
the robots.txt protocol" seems a bit against the spirit of the web.
Mike
>
> On 15/10/2008, at 8:28 PM, Henry Maddocks wrote:
>> Of more concern is the fact they are scraping everyone's content
>> without permission and, it appears, in most cases without the content
>> providers' knowledge. Whose hare-brained idea was this?
>
> You could say the same thing about any country's National Library :-)
Just because some other country does it, that makes it ok? Anyone up
for invading Fiji?
I assume the molecules they have in their collection were either
purchased or gifted to them. Or did they walk into a gallery and
start lifting art from the walls?
I'm all for preserving our culture but just because it's bits doesn't
mean they can be so cavalier with everyone's rights. Imagine if the
Flickr crowd found out about this.
Henry
It's called "legal deposit". The law says every publisher must
deposit with the National Library a copy of every book published.
This is obviously not the case with art, photographs, unpublished
manuscripts, maps, ephemera, and the other collections in the library.
> Or, to make my point more explicitly, if someone has a robots.txt
> file asking for sites not to be spidered, could this be considered
> an implicit choice that they don't want this site (or part thereof)
> to be in the public domain in quite the same way as something with
> no robots.txt file?
That's a good question. I suggest that you require authentication if
you don't want something on the web but don't want it made public. I
was around when robots.txt was dreamed up, and it was created for
several reasons: bandwidth and CPU time were precious resources;
robots were getting tangled in /foo/../foo/../foo/.. type recursive
hells; and authentication wasn't widespread. The first two are still
good reasons to honour robots.txt, but authentication removes the
final reason.
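
A quick illustration of the "recursive hells" mentioned above: relative links
containing ".." can make one resource look like an endless supply of new URLs
unless the crawler normalises paths. A small Python sketch (the /foo path is
just an example):

    from posixpath import normpath

    # Each round appends another "../foo/", so a naive crawler that compares
    # raw URLs sees a brand-new one every time, yet they all resolve to the
    # same place.
    path = "/foo/"
    for _ in range(4):
        path += "../foo/"
        print(path, "->", normpath(path))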
> But, and I'm not entirely sure why, I do have this nagging feeling
> that making the "decision not to honour the robots.txt protocol"
> seems a bit against the spirit of the web.
Yes, I have a similar problem. The Internet Archive honour it, and
their archive is very patchy as a result. I can see the bind the
Library is in, and sympathise with both sides. It's a bit like
grumbling that the meter reader doesn't obey your "no trespassing"
signs. Yes, your signs are there for a reason but the meter reader
also has a job to do.
Cheers;
Nat
I apologize for conflating Henry's "molecules" comment with Mike's
ones about not wanting things public and nagging feelings. Dangers of
10:30pm email :)
Nat
They read HTML and try to use dynamic links that sit in text boxes (not
href, not action, nothing - just the contents of a text box) for people
to modify and embed in their sites. The links are incomplete. They take
what looks like a valid URL out of there and as a result they downloaded
the same video and the same image like 1000 times.
Their harvester comes from 149.20.55.4, which is a non-nz IP and comes
via an international circuit, just to make us pay.
They come to you loaded with cookies, so you won't even think they are
a bot and keep serving them dynamic URLs. They happily take it and come
back for more.
I couldn't understand where the increase in new sessions was coming
from. It was them, coming back for more with the wrong cookies, so I had to
open a new session and a DB entry for them.
Like I have nothing better to do than cleaning the DB after them.
I banned the IP. Should have done it long ago. Cost me a few bob in
traffic fees.
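
The post above doesn't say how the ban was applied (most likely at the
firewall or web server), but for anyone wanting an application-level
equivalent, here is a rough sketch of a WSGI middleware that rejects the
harvester's address before any session or database work happens. Only the IP
comes from the post; the rest is illustrative.

    # Sketch only: reject the harvester before sessions or DB rows get created.
    BLOCKED_IPS = {"149.20.55.4"}   # the address reported above

    class BlockHarvester:
        def __init__(self, app, blocked=BLOCKED_IPS):
            self.app = app          # the real WSGI application being wrapped
            self.blocked = blocked

        def __call__(self, environ, start_response):
            if environ.get("REMOTE_ADDR") in self.blocked:
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Forbidden\n"]
            return self.app(environ, start_response)

    # Usage: wrap your existing app, e.g. application = BlockHarvester(application)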
I believe they do. If I read their web site correctly, you can email
them at web-harvest-2008 and say what your domain is and how you'd
like them to handle its crawling.
Cheers;
Nat
I wonder how hard it would have been for them to have two bots: one for the
local circuit and one for international?
See: http://www.natlib.govt.nz/about-us/current-initiatives/web-harvest-2008
> What is the legal basis for the domain harvest?
>
> The National Library of New Zealand is a deposit library, which
> means that anyone who publishes a book in New Zealand must supply a
> copy to the National Library. This is called "legal deposit" and
> similar laws exist in many countries.
>
> In 2003 the National Library of New Zealand (Te Puna Mātauranga o
> Aotearoa) Act 2003 was passed to extend legal deposit to cover
> internet publications. Part 4 of the Act authorises the National
> Librarian to “make a copy” of any internet publication for the
> purpose of storing and using it.
>
> For more information see the National Library of New Zealand (Te
> Puna Mātauranga o Aotearoa) Act 2003 and the Minister's National
> Library Requirement (Electronic Documents) Notice 2006.
Basically, copyright is a legal construct. The law that creates it
also makes exceptions. One of the exceptions the law makes is for
archiving the New Zealand Internet.
Cheers;
Nat
I don't think it has anything to do with the NZ Internet. Did you check
the user agent string? I googled for that IP address and found:
149.20.55.4 - - [10/Aug/2008:07:49:50 +0800] "GET /units/laws/laws8504?&mysource_site_extension=page_info_pages HTTP/1.0" 200 16480 "http://units.handbooks.uwa.edu.au/units/laws/laws8504" "Mozilla/5.0 (compatible; archive.org_bot/heritrix-1.15.1-x +http://pandora.nla.gov.au/crawl.html)"
If you have the same user agent in your logs, it would appear that
you're being crawled by the Australian National Library and not the
New Zealand National Library.
Cheers;
Nat
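
If you want to check your own logs for the same thing, a rough Python sketch
is below. It assumes an Apache/nginx "combined" format log, where the user
agent is the last double-quoted field on each line, and the log path is a
placeholder.

    import re

    LOG_PATH = "access.log"       # placeholder; point this at your own log file
    TARGET_IP = "149.20.55.4"     # the address mentioned above

    agents = set()
    with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
        for line in log:
            if not line.startswith(TARGET_IP + " "):
                continue
            quoted = re.findall(r'"([^"]*)"', line)   # request, referer, user agent
            if quoted:
                agents.add(quoted[-1])                # user agent is the last field

    for agent in sorted(agents):
        print(agent)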
I agree, although I'll be curious to learn how many loops, blacklists,
irate emails, etc. were turned up by their crawl. I can also see it
from the National Library's point of view, with robots.txt in the way
of their legal requirement to archive the New Zealand Internet for
posterity. I'm glad I didn't have to make the call on whether to obey
robots.txt or not; it's not a straightforward decision.
Cheers;
Nat