Using a queue for scraping


Thomas Levine

May 7, 2012, 3:26:04 AM
to scrap...@googlegroups.com
I've been using a queue for all of the scrapers I've written recently.

Here is an example. I've been calling the queue "Bucket-Wheel".

I define a class for each page type. Each class has __init__, load and parse methods.

I initialize the starting page (normally a sort of table of contents) and add it to the queue. This page is loaded (often just a GET request with error handling), and the response text is passed to the parse method. The parse method parses, saves to the database and returns more pages to be parsed. These pages are appended to the queue.
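
A minimal sketch of the structure described above, not the actual Bucket-Wheel code. The class names `TableOfContents` and `Chapter`, the `FAKE_SITE` dictionary standing in for GET requests, and the `saved` list standing in for the database are all assumptions made for illustration.

```python
from collections import deque

# Stand-in for the network: URL -> response text (assumption for this sketch).
FAKE_SITE = {
    "/toc": "/chapter/1\n/chapter/2",
    "/chapter/1": "First chapter text",
    "/chapter/2": "Second chapter text",
}

saved = []  # stand-in for the database


class Page:
    """Base page type: one subclass per kind of page on the site."""

    def __init__(self, url):
        self.url = url

    def load(self):
        # A real scraper would do a GET request with error handling here.
        return FAKE_SITE[self.url]

    def parse(self, text):
        raise NotImplementedError


class TableOfContents(Page):
    def parse(self, text):
        saved.append(("toc", self.url))
        # Each line of the fake table of contents links to a chapter.
        return [Chapter(url) for url in text.splitlines()]


class Chapter(Page):
    def parse(self, text):
        saved.append(("chapter", self.url, text))
        return []  # leaf page: nothing more to queue


def run(start_page):
    queue = deque([start_page])
    while queue:
        page = queue.popleft()
        text = page.load()
        # parse() saves its data and returns the pages to visit next.
        queue.extend(page.parse(text))


run(TableOfContents("/toc"))
```

After `run` finishes, `saved` holds one record for the table of contents followed by one per chapter, in the order the queue visited them.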

Once things are in this structure, some cool things become easy.
  • Management of state across runs
  • (Hyper)links can automatically be modeled in the database.
  • Specification of a complex order for the processing of pages
  • Distribution of work (not necessarily a great idea for scrapers though)
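
One way the first three points might look in practice is to keep the queue itself in a database table. This is only a sketch using Python's standard `sqlite3` module; the table layout and the helper names `open_queue`, `push`, `next_pending` and `mark_done` are hypothetical and not taken from the actual implementation.

```python
import sqlite3


def open_queue(path=":memory:"):
    # Pass a file path instead of ":memory:" to keep state across runs.
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS queue ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " page_type TEXT NOT NULL,"
        " url TEXT NOT NULL UNIQUE,"
        " parent_url TEXT,"  # the page that linked here
        " done INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def push(db, page_type, url, parent_url=None):
    # INSERT OR IGNORE keeps a URL from being queued twice.
    db.execute(
        "INSERT OR IGNORE INTO queue (page_type, url, parent_url)"
        " VALUES (?, ?, ?)",
        (page_type, url, parent_url),
    )
    db.commit()


def next_pending(db):
    # Oldest unfinished row first; a different ORDER BY gives a different
    # processing order. Returns None when nothing is left.
    return db.execute(
        "SELECT id, page_type, url FROM queue"
        " WHERE done = 0 ORDER BY id LIMIT 1"
    ).fetchone()


def mark_done(db, row_id):
    db.execute("UPDATE queue SET done = 1 WHERE id = ?", (row_id,))
    db.commit()


db = open_queue()
push(db, "toc", "/toc")
push(db, "chapter", "/chapter/1", parent_url="/toc")
push(db, "chapter", "/chapter/1", parent_url="/toc")  # duplicate, ignored

first = next_pending(db)
mark_done(db, first[0])
second = next_pending(db)
```

Because finished rows are only flagged rather than deleted, an interrupted run can pick up where it stopped, and the `parent_url` column records which page linked to which.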
Because of ScraperWiki's quirks, the current implementation is quite a hack, but I'm planning on eventually writing this more properly in a more general way.

Tom

Thad Guidry

May 7, 2012, 8:21:18 AM
to scrap...@googlegroups.com
Tom,

This is generally what I do, but right in Google Refine, with some pre-processing beforehand in either Python or the iMacros browser plugin when I am dealing with aspx pages and need to capture my lists of URLs.

The idea I use is to just chunk all the pages I really need into a Google Refine column called "HTML", and then I use Refine's built-in functions to parse and separate the elements and values I need to put into columns. That takes less code than ScraperWiki Python or Ruby, and I get free previews :)

What does not work so well with the above methods is when the query URLs no longer follow a smooth iterator sequence and you need to automate clicks or button pushing, basically automating a browser to generate dynamic URLs, which sometimes happens with aspx pages. Sometimes Mechanize gets me through, but typically I will just use iMacros and save the raw HTML pages as text blobs, output as a CSV file that Google Refine can easily absorb so I can begin my parsing.

I find myself Scraping less now, and instead just Grabbing whole pages...and then parsing out what I need later in an easier fashion with the ability to UNDO in seconds within Google Refine.

Thomas Levine

May 7, 2012, 12:58:59 PM
to scrap...@googlegroups.com
That actually sounds entirely different from what I've been doing. The queue does indeed allow for multiple parse steps, but I find the queue to be more helpful for very hierarchical websites, where one is linked to a table of contents, then a chapter, then a section, then a subsection, and so on.

For stateful requests, I just save all of those for last and run them all at once, storing the cookies outside the queue, to avoid logging in loads of times. And for POST requests I just use python-requests.
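
A rough sketch of that ordering, under assumptions: jobs are plain dictionaries with hypothetical `stateful`, `method`, `url` and `data` keys, and `session` is any object that behaves like a python-requests `Session`, which keeps cookies between calls. The helper names are made up for this example.

```python
def order_jobs(jobs):
    """Stateless jobs first, stateful ones last.

    sorted() is stable, so the order inside each group is kept.
    """
    return sorted(jobs, key=lambda job: job["stateful"])


def run_stateful(session, login_url, credentials, jobs):
    """Log in once, then run every stateful job on the same session.

    The cookies live on the session object, outside the queue, so there
    is no need to log in again for each page.
    """
    session.post(login_url, data=credentials).raise_for_status()
    responses = []
    for job in jobs:
        response = session.request(
            job["method"], job["url"], data=job.get("data")
        )
        response.raise_for_status()
        responses.append(response)
    return responses
```

With python-requests installed, `session` would be `requests.Session()`; its `post` and `request` methods handle both GET and POST jobs.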

reclosedev

May 7, 2012, 1:05:17 PM
to scrap...@googlegroups.com
Reminds me of the Scrapy project.
One thing that I would change early is passing the response object to the parse function instead of response.text, because lxml sometimes can't handle unicode HTML/XML documents that declare an encoding. So response.content should be passed to fromstring. Also, passing the response object instead of the text lets you check the status code, redirects, etc.
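
A small illustration of the difference, assuming lxml is installed. `FakeResponse` is a made-up stand-in for a python-requests response so the example runs without a network call, and the XML document is invented.

```python
from lxml import etree


class FakeResponse:
    """Stand-in for a requests response (assumption for this sketch)."""

    status_code = 200
    # Raw bytes, as response.content would return them.
    content = (
        b'<?xml version="1.0" encoding="utf-8"?>'
        b"<page><title>caf\xc3\xa9</title></page>"
    )

    @property
    def text(self):
        # Decoded unicode, as response.text would return it.
        return self.content.decode("utf-8")


def parse(response):
    # Having the whole response means the status can be checked first.
    if response.status_code != 200:
        raise RuntimeError("unexpected status %d" % response.status_code)
    # Bytes let lxml apply the encoding declared in the document itself.
    return etree.fromstring(response.content)


response = FakeResponse()

# The unicode text still carries an encoding declaration, which lxml
# refuses with a ValueError.
try:
    etree.fromstring(response.text)
    text_error = None
except ValueError as error:
    text_error = error

title = parse(response).findtext("title")
```

Here parsing `response.text` fails while parsing `response.content` succeeds, and `title` comes back with the accented character correctly decoded.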

On Monday, May 7, 2012, 11:26:04 UTC+4, Thomas Levine wrote: