Web crawler proxy?


Starfish

May 11, 2012, 6:21:33 AM5/11/12
to golang-nuts
Hello!

I believe I'm about to write a web spider in Go.

Naturally I will have to http.Get(...) a lot of URLs, but since many
pages are static between sweeps I'm thinking about letting all traffic
go through a caching web proxy server (specified in http.Transport).

Is this a sensible idea, would it work, and may there be a better way?

Cheers!

Christoph Hack

May 11, 2012, 7:09:37 AM5/11/12
to golan...@googlegroups.com
It's probably better to store the Last-Modified and ETag headers that
might be included in the responses you fetch. Then you can resend those
values (as If-Modified-Since and If-None-Match) on successive crawls of
the same page, and web servers can respond with "304 Not Modified"
directly.

-christoph

Starfish

May 11, 2012, 7:35:31 AM5/11/12
to golang-nuts
I worry browsers and proxies have some smart features unknown to me, or
not implemented in Go. E.g., does Go support every HTTP compression
format, and what about the response 'Expires' header? Should I send a
request even though the URL has not expired?

Kyle Lemons

May 11, 2012, 11:56:47 AM5/11/12
to Starfish, golang-nuts
On Fri, May 11, 2012 at 4:35 AM, Starfish <ahn...@rocketmail.com> wrote:
> I worry browsers and proxies have some smart features unknown to me, or
> not implemented in Go. E.g., does Go support every HTTP compression
> format, and what about the response 'Expires' header? Should I send a
> request even though the URL has not expired?

The Go HTTP package is awesome and very complete. The interpretation and provision of headers to/from a request, however, are not automatic: it does not remember ETags for you or respect the Expires header, for instance.

To the particular question you posed: no, you should probably not resend a request when the server told you that it hasn't expired, especially if your goal is to conserve your bandwidth and not trigger a server's DoS protection. Definitely read up on the headers Christoph mentioned as well.

Starfish

May 11, 2012, 12:49:23 PM5/11/12
to golang-nuts
Good to know. I will simply study how Firefox behaves with Firebug and
imitate it. ETag and Expires headers will be preserved naturally.

Now I will just have to convince my team to go with Go ;)

Thanks!

André Moraes

May 11, 2012, 2:20:15 PM5/11/12
to Starfish, golang-nuts
On Fri, May 11, 2012 at 1:49 PM, Starfish <ahn...@rocketmail.com> wrote:
> Good to know. I will simply study how Firefox behaves with Firebug and
> imitate it. ETag and Expires headers will be preserved naturally.

Usually the interpretation of those headers is the responsibility of the
User-Agent, not the HTTP parsing package.

As a tip, you could write the User-Agent on top of the http package
and then write the crawler on top of the User-Agent package.

--
André Moraes
http://amoraes.info

Starfish

May 11, 2012, 7:00:54 PM5/11/12
to golang-nuts
Thank you! That makes sense.

We already have an old crawler written in Python using a library called
Scrapy. I don't know whether it's wise to write a crawler from scratch,
but the old solution is too slow anyway.