I'm implementing a web scraping tool (using JSoup) and struggling to understand how best to decompose it into actors. I've looked briefly at the webwords example, but it has rather a lot of pieces and appears to be out of date, so I thought I'd ask here.
The items to be extracted have sequential ids, so I was planning to submit a series of searches to the site, each covering a range of ids small enough to fit on one page of search results. I would then extract the detail links from each results page and fetch further details for each link (this involves two different sites). I will probably persist each page's worth of results once the information has been extracted. If a per-page aggregator actor receives messages with item details, it will have to keep them sorted by id.
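Concretely, for the aggregator I was imagining something along these lines (all the message and actor names here are placeholders I made up, not from any existing code), using a SortedMap so the details stay ordered by id as they arrive:

```scala
import akka.actor.Actor
import scala.collection.immutable.SortedMap

// Hypothetical messages for illustration only
case class ItemDetail(id: Long, fields: Map[String, String])

// One aggregator per results page; `expected` is how many
// detail messages this page should eventually produce
class PageAggregator(expected: Int) extends Actor {
  private var items = SortedMap.empty[Long, ItemDetail]

  def receive = {
    case d: ItemDetail =>
      items += d.id -> d          // SortedMap keeps entries ordered by id
      if (items.size == expected) {
        // persist items.values here, in id order, then shut down
        context.stop(self)
      }
  }
}
```

Does one-aggregator-per-page like this seem reasonable, or is there a more idiomatic shape?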
Am I correct that, rather than spawning ephemeral child actors for the detail links on each result page, I should think in terms of a pool of workers, where there might be fewer or more actor instances than links?
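In other words, I'd create a fixed-size router once and just fire all the link messages at it, something like this (again, a sketch with invented names; pool size 8 is arbitrary):

```scala
import akka.actor.{Actor, ActorSystem, Props}
import akka.routing.RoundRobinPool

case class FetchDetail(url: String)

class DetailWorker extends Actor {
  def receive = {
    case FetchDetail(url) =>
      // Jsoup.connect(url).get() and extraction would go here;
      // results would be sent on to an aggregator
  }
}

object Scraper extends App {
  val system = ActorSystem("scraper")
  // 8 workers share the mailbox load, however many links a page yields
  val workers = system.actorOf(
    RoundRobinPool(8).props(Props[DetailWorker]()), "detail-workers")
  // links.foreach(url => workers ! FetchDetail(url))
}
```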
Other challenges I have are:
1 - retrying each URL a few times before reporting failure. I was thinking of sending a message to self with a remaining-tries count.
2 - delaying in an actor. I haven't quite figured out how the site displays its search results (I think it polls or delays via JavaScript). I can fetch the results with a separate request after a Thread.sleep, but I know blocking like that is a no-no in an actor. I assume I'd use the Akka scheduler here?
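For what it's worth, I think these two points might combine: scheduleOnce can re-deliver the fetch message to self after a delay, which gives me both the retry loop and the non-blocking wait. Is something like this (names invented, delay arbitrary) the intended pattern?

```scala
import akka.actor.Actor
import scala.concurrent.duration._

case class Fetch(url: String, triesLeft: Int)
case class FetchFailed(url: String)

class Fetcher extends Actor {
  import context.dispatcher  // ExecutionContext the scheduler needs

  def receive = {
    case Fetch(url, triesLeft) =>
      try {
        val doc = org.jsoup.Jsoup.connect(url).get()
        // hand doc off for extraction...
      } catch {
        case _: java.io.IOException if triesLeft > 1 =>
          // Re-send to self after a delay instead of Thread.sleep,
          // so no thread is ever blocked
          context.system.scheduler.scheduleOnce(
            5.seconds, self, Fetch(url, triesLeft - 1))
        case _: java.io.IOException =>
          context.parent ! FetchFailed(url)
      }
  }
}
```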
Thanks for any pointers.