Actors for web scraping

Richard Rodseth

Sep 1, 2013, 4:32:27 PM
to akka...@googlegroups.com
I'm implementing a web scraping tool (using JSoup) and struggling to understand how best to decompose it into actors. I've looked briefly at the webwords example but it has rather a lot of pieces and is apparently not up-to-date, so I thought I'd ask here.

The items to be extracted have sequential ids, so I was planning to submit a series of searches to the site, each one for a range of ids smaller than one page of search results. I would then extract the detail links and then fetch further details for each link (this involves two different sites). I will probably persist each page's worth of results once the information has been extracted. If a per-page aggregator actor is receiving messages with item details, it will have to maintain them sorted by id.
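
As a rough illustration of the per-page aggregator idea, here is a minimal plain-Java sketch of the state such an actor might hold. The class and method names (PageAggregator, add, inIdOrder) are hypothetical, details are represented as plain strings, and it assumes every id in the page's range eventually produces a result, which may not hold for the real site.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical state for one page's aggregator. In an actor this would be
// private state touched only while handling one message at a time.
final class PageAggregator {
    private final int firstId;
    private final int lastId;
    // TreeMap keeps entries ordered by key, so arrival order does not matter.
    private final TreeMap<Integer, String> detailsById = new TreeMap<>();

    PageAggregator(int firstId, int lastId) {
        this.firstId = firstId;
        this.lastId = lastId;
    }

    // Records details for one item; returns true once the page is complete.
    boolean add(int id, String details) {
        if (id < firstId || id > lastId) {
            throw new IllegalArgumentException("id outside page range: " + id);
        }
        detailsById.put(id, details);
        return isComplete();
    }

    // Assumption: the page is complete when every id in the range has arrived.
    boolean isComplete() {
        return detailsById.size() == lastId - firstId + 1;
    }

    // Item details in ascending id order, ready to persist.
    List<String> inIdOrder() {
        return new ArrayList<>(detailsById.values());
    }
}
```

With something like this, the aggregator can persist inIdOrder() as soon as add returns true, regardless of the order in which detail results come back.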

Am I correct that rather than thinking of spawning ephemeral child actors for the detail links on each result page, I should think in terms of workers, where there might be fewer or more actor instances than the number of links?

Other challenges I have are:

1 - retrying each URL a few times before reporting failure. I was thinking of sending a message to self with a remaining tries count
2 - delaying in an actor. I haven't quite figured out how the site displays search results (I think it polls or delays in JavaScript). I can fetch the results with a separate request after a Thread.sleep, but I know that's a no-no in an actor. I assume I'd use the Akka scheduler here? 

Thanks for any pointers.
Richard 


Björn Antonsson

Sep 5, 2013, 8:44:01 AM
to Akka User List
Hi Richard,

On Sunday, 1 September 2013 at 22:32, Richard Rodseth wrote:
> I'm implementing a web scraping tool (using JSoup) and struggling to understand how best to decompose it into actors. I've looked briefly at the webwords example but it has rather a lot of pieces and is apparently not up-to-date, so I thought I'd ask here.
>
> The items to be extracted have sequential ids, so I was planning to submit a series of searches to the site, each one for a range of ids smaller than one page of search results. I would then extract the detail links and then fetch further details for each link (this involves two different sites). I will probably persist each page's worth of results once the information has been extracted. If a per-page aggregator actor is receiving messages with item details, it will have to maintain them sorted by id.
>
> Am I correct that rather than thinking of spawning ephemeral child actors for the detail links on each result page, I should think in terms of workers, where there might be fewer or more actor instances than the number of links?


I would go for a pool of workers and a master actor holding the list of links to follow. The workers could then report found links to the master and request new work from him.
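
To make the master's bookkeeping concrete, here is a minimal plain-Java sketch with no Akka dependency. The names (LinkMaster, reportLinks, requestWork) are hypothetical; in an actual Akka setup this state would live inside the master actor and the two methods would correspond to messages sent by the workers.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.HashSet;
import java.util.Optional;
import java.util.Queue;
import java.util.Set;

// Hypothetical model of the master's state. It is not thread-safe on purpose:
// an actor handles one message at a time, so no locking would be needed there.
final class LinkMaster {
    private final Queue<String> pending = new ArrayDeque<>(); // links not yet handed out
    private final Set<String> seen = new HashSet<>();         // every link ever reported

    // A worker reports links it found. Links already seen are ignored.
    // Returns how many of the reported links were new.
    int reportLinks(Collection<String> links) {
        int added = 0;
        for (String link : links) {
            if (seen.add(link)) {
                pending.add(link);
                added++;
            }
        }
        return added;
    }

    // A worker asks for work. Empty means there is nothing to hand out right now.
    Optional<String> requestWork() {
        return Optional.ofNullable(pending.poll());
    }

    int pendingCount() {
        return pending.size();
    }
}
```

Because workers pull work only when they are free, the number of workers is independent of the number of links on a page, and a slow site simply slows down how fast the queue drains.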

Two references on balancing workloads and work pulling are here in the Akka docs:

> Other challenges I have are:
>
> 1 - retrying each URL a few times before reporting failure. I was thinking of sending a message to self with a remaining tries count

Absolutely. Maybe use the scheduler to send it to the actor with a small delay, so you don't hammer the other side with requests.
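
The retry-with-a-count idea can be sketched in plain Java as follows. This uses a ScheduledExecutorService as a stand-in for the Akka scheduler, so it is not Akka code; the class name DelayedRetry and its parameters are hypothetical. In an actor, the scheduled task would instead be a message to self carrying the remaining tries.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: retries a failing call a limited number of times,
// waiting between attempts without blocking a thread in sleep.
final class DelayedRetry {
    private final ScheduledExecutorService scheduler;
    private final long delayMillis;

    DelayedRetry(ScheduledExecutorService scheduler, long delayMillis) {
        this.scheduler = scheduler;
        this.delayMillis = delayMillis;
    }

    // 'tries' is the total number of attempts; at least one is always made.
    <T> CompletableFuture<T> run(Callable<T> attempt, int tries) {
        CompletableFuture<T> result = new CompletableFuture<>();
        attemptOnce(attempt, tries, result);
        return result;
    }

    private <T> void attemptOnce(Callable<T> attempt, int triesLeft,
                                 CompletableFuture<T> result) {
        try {
            result.complete(attempt.call());
        } catch (Exception e) {
            if (triesLeft <= 1) {
                // Out of tries: report the last failure.
                result.completeExceptionally(e);
            } else {
                // Try again later with one fewer try remaining.
                scheduler.schedule(
                        () -> attemptOnce(attempt, triesLeft - 1, result),
                        delayMillis, TimeUnit.MILLISECONDS);
            }
        }
    }
}
```

The first attempt runs on the caller's thread and later attempts run on the scheduler's thread. A fixed delay is used here for simplicity; growing the delay on each retry would be gentler on the remote site.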
 
> 2 - delaying in an actor. I haven't quite figured out how the site displays search results (I think it polls or delays in JavaScript). I can fetch the results with a separate request after a Thread.sleep, but I know that's a no-no in an actor. I assume I'd use the Akka scheduler here?


Yes, the scheduler would be a good alternative here as well.
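
For illustration, here is a minimal plain-Java sketch of "submit the search now, fetch the results a bit later" without Thread.sleep. It uses the JDK's CompletableFuture.delayedExecutor (Java 9+) as a stand-in for the Akka scheduler; the class name DelayedFollowUp and the two function parameters are hypothetical placeholders for the real search and fetch requests. In an actor, the same effect would come from scheduling a one-off message to self and doing the second fetch when it arrives.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical helper: runs the first request immediately and the follow-up
// request after a delay, without parking the calling thread in between.
final class DelayedFollowUp {
    static <K, R> CompletableFuture<R> searchThenFetch(
            Supplier<K> submitSearch,
            Function<K, R> fetchResults,
            long delayMillis) {
        // First request runs right away on the caller's thread.
        K searchKey = submitSearch.get();
        // Tasks given to this executor start only after the delay has elapsed.
        Executor later = CompletableFuture.delayedExecutor(delayMillis, TimeUnit.MILLISECONDS);
        return CompletableFuture.supplyAsync(() -> fetchResults.apply(searchKey), later);
    }
}
```

The caller gets a future back immediately and is free to do other work while the site prepares the results.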

B/
 
> Thanks for any pointers.
> Richard


-- 
Björn Antonsson
Typesafe – Reactive Apps on the JVM
twitter: @bantonsson
