Although it wasn't distributed, it had fancy concurrency control
and queuing policies; it could get a lot of the performance that would
be possible with a reasonable Unix box and internet connection.
Then I went through a phase of creating simpler and simpler web
crawlers. I kind of thought I was devolving until I saw the crawling
strategy Nutch uses and realized it was pretty much the same.
These days I'm a big believer in breadth-first crawling. The web
crawler runs in stages: stage N outputs a list of urls to stage N+1.
The crawler itself is pretty dumb: it grabs the URLs, writes the
contents into files or stuffs them into DB blobs. Concurrency control
can be ~simple~: for instance, just divide the list of tasks to do
into M sublists, fork into M children, and let each child do 1/M of
the work. (That's not the best strategy, but you can even do it in
Perl or PHP.)
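A minimal sketch of that split-and-fork idea, in Python rather than Perl or PHP. The names split_tasks and run_in_children are made up for illustration, the work callable stands in for whatever fetches and stores one URL, and os.fork means this only runs on Unix-like systems.

```python
import os

def split_tasks(tasks, m):
    """Divide tasks into m sublists of near-equal size (round-robin)."""
    return [tasks[i::m] for i in range(m)]

def run_in_children(tasks, m, work):
    """Fork m children; each child calls work(task) on its own sublist.

    Returns the number of children that exited with a non-zero status.
    `work` is a hypothetical callable, e.g. "fetch this URL and save it".
    """
    pids = []
    for sublist in split_tasks(tasks, m):
        pid = os.fork()
        if pid == 0:
            # Child process: do 1/M of the work, then exit without
            # returning into the parent's code path.
            status = 1
            try:
                for task in sublist:
                    work(task)
                status = 0
            finally:
                os._exit(status)
        pids.append(pid)

    # Parent process: wait for every child and count the failures.
    failed = 0
    for pid in pids:
        _, wait_status = os.waitpid(pid, 0)
        if not (os.WIFEXITED(wait_status) and os.WEXITSTATUS(wait_status) == 0):
            failed += 1
    return failed
```

A fixed split like this leaves some children idle if one sublist happens to contain all the slow hosts, which is presumably part of why it is not the best strategy.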
Once a stage of the crawl is done, I run some scripts that extract
whatever data comes out of the stage. The nice thing about having this
decoupled from the crawler is that you can fix bugs in your extractor
without having to re-run the crawl. The extractor sends URLs on to
stage N+1; you can even move URLs that were temporary failures in
stage N to stage N+1.
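Here is a rough sketch of what such an extractor step could look like, using only the Python standard library. It assumes the crawler left behind a mapping of URL to HTML for stage N; the names extract_links and build_next_stage, the seen set, and the temp_failures list are illustrative assumptions, not anything from the crawler described above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it is fed."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def extract_links(base_url, html):
    """Return absolute http(s) links found in html, fragments removed."""
    parser = LinkCollector()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute.startswith(("http://", "https://")):
            links.append(absolute)
    return links

def build_next_stage(pages, seen, temp_failures=()):
    """Build the URL list for stage N+1 from the pages fetched in stage N.

    pages: dict of url -> html fetched in stage N.
    seen: set of URLs already scheduled in any stage (updated in place).
    temp_failures: stage-N URLs that failed temporarily and get retried.
    """
    next_urls = []
    queued = set()
    for url in temp_failures:
        if url not in queued:
            queued.add(url)
            next_urls.append(url)
    for base_url, html in pages.items():
        for link in extract_links(base_url, html):
            if link not in seen and link not in queued:
                queued.add(link)
                next_urls.append(link)
    seen.update(queued)
    return next_urls
```

Because this runs on the saved pages rather than inside the crawler, a bug in extract_links can be fixed and the step re-run without fetching anything again.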
You'll usually see a rapid increase in the size of the stages, then
a gentle plateau, then it falls off and you're left with some
stragglers, which are all web traps. Terminate the crawl then... The
real advantage of breadth-first is that it easily shakes off common web
traps.
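One way to spot the "stragglers" point automatically is to watch the stage sizes. The rule below is only an assumed heuristic for illustration (stop once the latest stage is shrinking and is a small fraction of the peak); the function name and both thresholds are made up and would need tuning per site.

```python
def should_terminate(stage_sizes, tail_fraction=0.01, min_stages=3):
    """Guess whether a breadth-first crawl is down to stragglers.

    stage_sizes: number of URLs in each stage so far, oldest first.
    tail_fraction: assumed cutoff, as a fraction of the largest stage.
    min_stages: do not decide anything before this many stages exist.
    """
    if len(stage_sizes) < min_stages:
        return False
    latest = stage_sizes[-1]
    if latest == 0:
        return True  # nothing left to crawl
    peak = max(stage_sizes)
    shrinking = latest < stage_sizes[-2]
    return shrinking and latest <= peak * tail_fraction
```

With sizes like 1, 50, 4000, 9000, 9500, 3000, 40 this would say to stop at the last stage, since 40 is well under 1% of the 9500 peak.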
In early development or for small jobs you can do it manually and
have a lot of control over what's happening. In a more mature system
you can have higher-level optimization start and stop the stages, run
the extractor scripts, decide when to terminate a crawl, etc.
My current web crawler has a centralized work queue: other scripts
submit jobs to the crawler, which works through them, and runs
callback scripts when jobs are completed. It works pretty well.
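The queue-plus-callbacks shape can be sketched in a few lines. This is an in-process toy, not the crawler described above: the CrawlQueue class is hypothetical, callbacks are plain Python callables rather than separate scripts, and fetch is whatever function you pass in to retrieve a URL.

```python
from collections import deque

class CrawlQueue:
    """A single work queue: callers submit jobs, run() works through them."""

    def __init__(self, fetch):
        self._fetch = fetch    # callable(url) -> content; supplied by caller
        self._jobs = deque()   # pending (url, callback) pairs, FIFO order

    def submit(self, url, callback):
        """Queue a URL; callback(url, content, error) runs when it is done."""
        self._jobs.append((url, callback))

    def run(self):
        """Process jobs until the queue is empty; return how many were run."""
        completed = 0
        while self._jobs:
            url, callback = self._jobs.popleft()
            try:
                content, error = self._fetch(url), None
            except Exception as exc:
                content, error = None, exc
            # The callback sees failures too, so it can decide to resubmit.
            callback(url, content, error)
            completed += 1
        return completed
```

Since callbacks run while the queue is still being drained, a callback can call submit() itself, which is one way to chain an extractor step onto each finished job.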
Well, the first move would be to see if they have some form of export,
or, if the forums are open source, whether you can add that and then
have it taken up by the site :)
Or this might be an option (I haven't used it but I have written
crawlers and they are a pain).
http://news.idg.no/cw/art.cfm?id=E1888BDC-1A64-67EA-E4609525E2DBCDB9
--J