[sluggoster@gmail.com: Web crawling]

C. Titus Brown

Mar 30, 2009, 4:15:00 PM
to twill-dev
----- Forwarded message from Mike Orr <slugg...@gmail.com> -----

Date: Thu, 4 Oct 2007 20:15:53 -0700
From: Mike Orr <slugg...@gmail.com>
To: ti...@idyll.org
Subject: Web crawling

Hi Titus, I'm starting a project to crawl a website from a cron job.
I need to find links, parse the query string, change the query
parameters to form new URLs, and set cookies either manually or by
posting to forms.

mechanize.open seems to handle the page getting and cookies, although
I'm not sure how much I'm gaining over urllib2.urlopen. Then I'm
using BeautifulSoup to find the links, extracting the query string
manually, and using cgi.parse_qs to parse it into a dict. Then
urllib.urlencode to form the new query string and join it to the URL
base.
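
For illustration, the steps described above might be sketched roughly as
follows. This is only a sketch under several assumptions: it targets
Python 3, where parse_qs and urlencode live in urllib.parse rather than
cgi and urllib; it uses the standard library's html.parser in place of
BeautifulSoup so that it is self-contained; and the names LinkCollector
and with_query, along with the example.com URL and the page/sort
parameters, are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlencode, urljoin, urlsplit, urlunsplit


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Turn relative links into absolute ones.
                self.links.append(urljoin(self.base_url, href))


def with_query(url, **changes):
    """Return url with the given query parameters replaced or added."""
    parts = urlsplit(url)
    # parse_qs maps each name to a list of values.
    params = parse_qs(parts.query, keep_blank_values=True)
    for name, value in changes.items():
        params[name] = [str(value)]
    # doseq=True expands the value lists back into name=value pairs.
    new_query = urlencode(params, doseq=True)
    return urlunsplit(parts._replace(query=new_query))


# Hypothetical page content and base URL, for illustration only.
html = '<a href="/list?page=1&amp;sort=date">first</a>'
collector = LinkCollector("http://example.com/")
collector.feed(html)
next_urls = [with_query(link, page=2) for link in collector.links]
```

Keeping the query rewriting in one small function means the crawler loop
only has to deal with whole URLs; fetching pages and handling cookies is
left out of the sketch entirely.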

That seems like an awful lot of manual work so I'm wondering if
there's a good URL class or Anchor class to do it. Actually, I'm a
bit surprised that mechanize or BeautifulSoup don't have this built
in. Twill doesn't seem to offer anything for this purpose from what I
can tell; it uses BeautifulSoup but apparently only to prettify the
HTML, not to expose the DOM.

Which packages would you recommend for this?

--
Mike Orr <slugg...@gmail.com>

----- End forwarded message -----

--
C. Titus Brown, c...@msu.edu
