I'm trying to set up a scrape that targets 1M unique URLs on the same site. The scrape runs through a proxy and a captcha breaker, so it's pretty slow, and it's prone to crashing because the target site goes down frequently (not from my scraping). Once the 1M pages are scraped, the job will only need to grab about 1,000 incremental URLs per day.
URL Format:
http://www.foo.com/000000001  # the number sequence is a 'pin'
http://www.foo.com/000000002
http://www.foo.com/000000003
etc.
Does my proposed setup make sense?
Set up MongoDB with 1M pins and a 'scraped' flag. For example:
{'pin': '000000001', 'scraped': False}
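To populate the collection up front, I'd do something like the following with pymongo. This is just a sketch: the database/collection names (foo, pins) are placeholders, and the indexes are my own addition so the 'scraped': False query and the per-pin updates stay fast at 1M documents.

    # Seed the 'pins' collection with 1M pin documents.
    # 'foo' / 'pins' are placeholder names, not anything official.
    from pymongo import MongoClient, ASCENDING

    client = MongoClient()  # assumes a local mongod on the default port
    pins = client.foo.pins

    pins.create_index([('scraped', ASCENDING)])  # cheap "scraped = False" lookups
    pins.create_index('pin', unique=True)        # pins are unique, so enforce it

    # Insert in batches so we never hold one giant list in memory.
    batch = []
    for i in range(1, 1_000_001):
        batch.append({'pin': f'{i:09d}', 'scraped': False})
        if len(batch) == 10_000:
            pins.insert_many(batch)
            batch = []
    if batch:
        pins.insert_many(batch)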
In the spider I would run a query to select 10,000 pins where 'scraped' is False, then build the corresponding URLs and feed them in as start_urls. Each scraped page would be inserted into another collection and that pin's 'scraped' flag set to True. After those 10,000 pins are done, I would run the scrape again, repeating until all 1M pins are scraped. A sketch of what I have in mind is below.
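Roughly, the spider could look like this. One tweak to my own description: I'm using Scrapy's start_requests instead of a prebuilt start_urls list, so the 10,000-pin batch is streamed from a cursor rather than materialized up front. All names here (foo, pins, results, FooSpider) are placeholders, and storing the raw body in parse is just illustrative:

    # Sketch of a spider that pulls one batch of unscraped pins per run.
    import scrapy
    from pymongo import MongoClient


    class FooSpider(scrapy.Spider):
        name = 'foo'

        def start_requests(self):
            client = MongoClient()
            self.pins = client.foo.pins
            self.results = client.foo.results
            # One 10,000-pin batch per run; the generator streams requests
            # instead of building a 10,000-element start_urls list.
            for doc in self.pins.find({'scraped': False}).limit(10_000):
                pin = doc['pin']
                yield scrapy.Request(
                    f'http://www.foo.com/{pin}',
                    callback=self.parse,
                    cb_kwargs={'pin': pin},  # cb_kwargs needs Scrapy 1.7+
                )

        def parse(self, response, pin):
            # Store the page, then flip the flag only after a successful
            # response, so a crash mid-run just means the pin is retried.
            self.results.insert_one({'pin': pin, 'body': response.text})
            self.pins.update_one({'pin': pin}, {'$set': {'scraped': True}})

Since the flag only flips after a successful response, a crash mid-batch just means those pins come back in the next run, which seems like the behavior I want given how often the site goes down.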
Does this setup make sense, or is there a more efficient way to do this?