I've been using a queue for all of the scrapers I've written recently.
Here is an example. I've been calling the queue "Bucket-Wheel".
I define a class for each page type. Each class has __init__, load and parse methods.
I initialize the starting page (normally a sort of table of contents) and add it to the queue. This page is loaded (often just a GET request with error handling), and the response text is passed to the parse method. The parse method parses, saves to the database and returns more pages to be parsed. These pages are appended to the queue.
Once things are in this structure, some cool things become easy.
- Management of state across runs
- (Hyper)links can automatically be modeled in the database.
- Specification of a complex order for the processing of pages
- Distribution of work (not necessarily a great idea for scrapers though)
Because of ScraperWiki's quirks, the current implementation is quite a hack, but I'm planning on eventually writing this more properly in a more general way.
Tom