Fwd: [gcd-contact] Script for ComicRack


Donald Dale Milne

Mar 7, 2011, 1:58:06 PM
to GCD Tech List
From the contact list, for the tech team's consideration. - Don Milne

-------- Original Message --------
Subject: [gcd-contact] Script for ComicRack
Date: Mon, 7 Mar 2011 19:08:55 +0100
From: Maurizio <miz...@gmail.com>
Reply-To: gcd-c...@googlegroups.com
To: con...@comics.org

Hi guys,

I developed a quick script to scrape data from GCD. So far only a friend and
I have used it, for testing; before publishing it on the CR forum I'd like to
have your green light.
It uses no API, since there isn't one, but reads chunks of the pages
directly.

Are you interested in the code? I'm not a pro, though, so don't laugh ;-)

Let me know, I'll wait for your answer.

ciao,

Maurizio

--
GCD-Contact mailing list - gcd-c...@googlegroups.com
To unsubscribe send email to gcd-contact...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/gcd-contact

Mark

Mar 7, 2011, 2:45:50 PM
to gcd-...@googlegroups.com
Hi,

Not that I'm a tech or anything, but web scraping is almost broken by
design; then again, if we don't have an API I guess it's the only method.
It may cause additional load etc., but I think overall it could help
as long as we get a noteworthy reference. After all, it is targeting
comic readers and collectors.

Mark

Henry Andrews

Mar 8, 2011, 12:48:48 PM
to gcd-...@googlegroups.com, miz...@gmail.com
Hi Maurizio,
We would prefer that people not scrape the site as it increases the load on our
servers, at times to the point where no one else can use the site (when this
happens, we ban the offending IP addresses in order to restore service). We do
tolerate moderate usage, by which we mean anything from individual sites that
does not incur enough load for us to notice. So if you and your friend want to
keep using it that's probably fine. However, broadly publishing a system for
doing this will almost certainly cause problems. We would prefer that you not
distribute this code, and if we find a widespread scraping system incurring
heavy load we will most likely rework the pages in order to break it.

We already offer the entire database contents as a free download. We would
love to have an API and support other data download formats (among many other
features), and would welcome assistance in improving the site. Please feel free
to join us on the gcd-...@googlegroups.com mailing list if you are interested.

thanks,
-henry

---
Henry Andrews
Lead Developer / Board Member, Grand Comics Database

> -- GCD-Tech mailing list - gcd-...@googlegroups.com
> To unsubscribe send email to gcd-tech+u...@googlegroups.com
> For more options, visit this group at http://groups.google.com/group/gcd-tech
>

Jochen Garcke

Mar 9, 2011, 1:57:34 PM
to gcd-...@googlegroups.com, miz...@gmail.com
Hi Maurizio,

Following up on this.

What I am afraid of is that all ComicRack users (and I assume there are
quite a number of them) would, at the push of a button, try to automatically
fetch the data for all their comics. This would surely impact the performance
and stability of our server.

If the data were fetched on a per-issue basis, as in you would
try to get the data only for a given issue, the situation is somewhat
different, since the impact on our server is similar to normal web user
behaviour. Even then, by scraping the website you download much more
data than needed. There might be ways to get just the data of an issue
in some form; we could talk about this further.
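
To make the per-issue idea a bit more concrete, here is a rough Python sketch
of a client that fetches one issue at a time and pauses between requests. It
is only an illustration, not something we provide: `Throttle`, `fetch_issues`
and `fetch_one` are hypothetical names, and the interval is an arbitrary
placeholder, not a limit anyone has agreed on.

```python
import time


class Throttle:
    """Enforce a minimum delay between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        # min_interval is in seconds; clock and sleep are injectable for testing.
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch_issues(issue_ids, fetch_one, throttle):
    """Fetch issues one by one, never faster than the throttle allows.

    fetch_one is a hypothetical callable supplied by the caller; it stands in
    for whatever actually retrieves the data for a single issue.
    """
    results = {}
    for issue_id in issue_ids:
        throttle.wait()
        results[issue_id] = fetch_one(issue_id)
    return results
```

A caller would build something like `Throttle(min_interval=5.0)` once and pass
it to every `fetch_issues` call, so that even a whole-library update stays
close to the pace of a person browsing the site.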

Jochen

On 09.03.2011 04:48, Henry Andrews wrote:
> Hi Maurizio,
> We would prefer that people not scrape the site as it increases the load on our
