A web-scraping framework with parallel sessions and retry on time-out

Rudy Eschauzier

Aug 22, 2016, 5:36:14 AM
to CasperJS
Going through the CasperJS forums, I found several requests from people who would like to scrape a website and retry when a time-out occurs. People are also asking for a way to run several scraping sessions in parallel.

Attached is a simple framework that wraps around the CasperJS commands and achieves retry-on-timeout and parallel sessions automatically. The file includes an example where the script fetches the titles of 10 separate websites. It runs a maximum of 3 sessions in parallel; this number can be changed by setting the maxSessions variable. On a waitFor() timeout, the script will retry up to 2 more times; this number can be set through the maxTries variable.
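For anyone who wants the gist before opening the attachment, here is a minimal sketch of the retry-on-timeout idea. The maxTries name comes from the description above; the waitWithRetry() wrapper and everything else is my illustration, not the actual code in scrape-framework.js:

var casper = require('casper').create();

// Total attempts per waitFor: 1 initial try + 2 retries, matching the
// behaviour described above (the attached file may differ in detail).
var maxTries = 3;

// waitWithRetry: like casper.waitForSelector, but on a timeout it reopens
// the page and tries again until the attempts are used up.
function waitWithRetry(url, selector, onSuccess, triesLeft) {
    casper.waitForSelector(selector, onSuccess, function onTimeout() {
        if (triesLeft > 1) {
            this.echo('Timeout on ' + url + ', retrying...');
            this.thenOpen(url, function () {
                waitWithRetry(url, selector, onSuccess, triesLeft - 1);
            });
        } else {
            this.echo('Giving up on ' + url + ' after ' + maxTries + ' tries');
        }
    });
}

casper.start('http://example.com/', function () {
    waitWithRetry('http://example.com/', 'h1', function () {
        this.echo(this.getTitle());
    }, maxTries);
});

casper.run();

The parallel-session bookkeeping (the maxSessions pool) is the more involved part, so please read the attached scrape-framework.js for how that is actually done.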

I have used the script for several projects now, and it has saved me a lot of time in building robust and fast scraping solutions. I hope it will do the same for other people as well.

Please let me know if you have additional questions.
scrape-framework.js

Ken

May 25, 2017, 7:26:51 PM
to CasperJS
Wow Rudy this looks really cool! I haven't come across anything like that. Does it mean you iteratively call casper.start and create multiple browser tabs within 1 PhantomJS process?