1M Page Scrape Setup


Drew Friestedt

Sep 25, 2014, 10:12:04 AM
to scrapy...@googlegroups.com
I'm trying to set up a scrape that targets 1M unique URLs on the same site.  The scrape goes through a proxy and a captcha breaker, so it runs pretty slowly, and it's prone to crashing because the target site goes down frequently (not from my scraping).  Once the 1M pages are scraped, the scrape will grab about 1,000 incremental URLs per day.

URL Format:
http://www.foo.com/000000001 #the number sequence is a 'pin'
http://www.foo.com/000000002
http://www.foo.com/000000003
etc..

Does my proposed setup make sense? 

Set up MongoDB with 1M pins and a 'scraped' flag.  For example:
{'pin': '000000001', 'scraped': False}

In the scrape I would set up a query to select 10,000 pins where 'scraped' = False.  I would then append 10,000 URLs to start_urls[].  The resulting scrape would get inserted into another collection and the pin's 'scraped' flag would get set to True.  After the 10,000 pins are scraped I would run the scrape again, repeating until all 1M pins are scraped.
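A rough sketch of that bookkeeping with pymongo could look like the following (the database and collection names are placeholders, not anything from the actual project):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
pins = client["scrapedb"]["pins"]

# Seed the collection once with all 1M pins, marked unscraped.
pins.insert_many(
    [{"pin": str(n).zfill(9), "scraped": False} for n in range(1, 1000001)]
)

# Pull a batch of 10,000 unscraped pins and build the URLs for one run.
batch = pins.find({"scraped": False}).limit(10000)
start_urls = ["http://www.foo.com/" + doc["pin"] for doc in batch]

# After a pin's page has been parsed and stored, flip its flag.
pins.update_one({"pin": "000000001"}, {"$set": {"scraped": True}})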

Does this setup make sense or is there a more efficient way to do this? 

Nicolás Alejandro Ramírez Quiros

Sep 25, 2014, 10:45:52 AM
to scrapy...@googlegroups.com
If you already have the "pins" you want to crawl, just make a file with them, then crawl the site. When the spider stops, calculate the difference between the spider's output and your total, and launch the spider again with that; repeat as many times as needed.
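As a quick sketch of that diff-and-relaunch loop (file names here are just placeholders, one pin per line in each file):

# all_pins.txt: every pin you want; scraped_pins.txt: pins already in your output.
with open("all_pins.txt") as f:
    all_pins = set(line.strip() for line in f)
with open("scraped_pins.txt") as f:
    done = set(line.strip() for line in f)

# Write the remaining pins and feed this file to the next spider run.
with open("remaining_pins.txt", "w") as f:
    for pin in sorted(all_pins - done):
        f.write(pin + "\n")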

Travis Leleu

Sep 25, 2014, 12:45:44 PM
to scrapy-users
Drew,

Take a look at the start_requests() method in Scrapy's Spider class.  You'll override this method and have it yield a Request object for each page to scrape.  Ref: http://doc.scrapy.org/en/latest/topics/spiders.html?highlight=make_request#scrapy.spider.Spider.make_requests_from_url

I like to use start_requests() when I'm pulling from a database, because you can write the function as a generator and only pull from the db when you need to.  (I usually also mark the status as "QUEUED" in my DB once a URL has been handed to scrapy, and this is a good place to put that logic.)
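Something along these lines, as a sketch only (the spider name, collection, and status values are made up for illustration):

import scrapy
from pymongo import MongoClient


class PinSpider(scrapy.Spider):
    name = "pins"

    def __init__(self, *args, **kwargs):
        super(PinSpider, self).__init__(*args, **kwargs)
        self.pins = MongoClient()["scrapedb"]["pins"]

    def start_requests(self):
        # A generator: Scrapy only pulls the next Request when it has room,
        # so the whole 1M set never has to sit in memory at once.
        for doc in self.pins.find({"status": "NEW"}):
            self.pins.update_one({"_id": doc["_id"]},
                                 {"$set": {"status": "QUEUED"}})
            yield scrapy.Request("http://www.foo.com/" + doc["pin"],
                                 callback=self.parse)

    def parse(self, response):
        pin = response.url.rsplit("/", 1)[-1]
        self.pins.update_one({"pin": pin}, {"$set": {"status": "DONE"}})
        # ... extract and yield items here ...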

One gotcha with this that I've run into: if you query mongo and hold a cursor pointing to your results, that cursor will time out much quicker than I expected.  I implemented start_requests() as a generator, as described above, but the cursor would time out between URL retrievals!  (You can check whether the cursor has timed out and re-acquire the result set in start_requests(), or you can move to using a queuing data structure, as I tend to prefer.)
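One way around that gotcha, as a drop-in variant of the start_requests() sketched above, is to materialize small batches into a plain list so no cursor stays open between yields (names are placeholders again; pymongo's find() also accepts a no_cursor_timeout flag if you'd rather keep a single cursor):

    def start_requests(self):
        while True:
            # Grab a modest batch and exhaust the cursor immediately via list().
            batch = list(self.pins.find({"status": "NEW"}).limit(500))
            if not batch:
                break
            for doc in batch:
                self.pins.update_one({"_id": doc["_id"]},
                                     {"$set": {"status": "QUEUED"}})
                yield scrapy.Request("http://www.foo.com/" + doc["pin"],
                                     callback=self.parse)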

Hope this helps.  If you get stuck with start_requests(), feel free to send me a link to a pastebin and I'll check it out when I have time.

Thanks,
Travis




Drew Friestedt

Sep 25, 2014, 8:06:09 PM
to scrapy...@googlegroups.com
If I implement this recommendation, will Scrapy process more than one URL at a time?  After reading the documentation it looks like it will only process one URL at a time:

start_requests() > Query mongodb > select 1 pin >  parse url > update mongodb > call start_requests()

Can I construct a list of URLs and parse the list rather than an individual URL?


Travis Leleu

Sep 25, 2014, 8:15:03 PM
to scrapy...@googlegroups.com
Scrapy uses Twisted, the asynchronous networking library, so it will have multiple HTTP requests in flight "simultaneously".  Is that what you mean by "process one URL at a time"?

I.e., when I run a Scrapy crawl, I see it load around 20 URLs from my database into the internal queuing system.  Depending on your CONCURRENT_REQUESTS_PER_DOMAIN setting, Scrapy will make that many parallel requests to each domain.

In other words, Scrapy does not wait until the previous request has returned before issuing the next one.  If this seems confusing, you might want to read a little on how Twisted does asynchronous requests (I would offer more, but I don't really know much about that myself).
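For reference, these are ordinary settings.py values; the numbers below are only illustrative and would need tuning for the target site, proxy, and captcha service:

# settings.py -- example values only
CONCURRENT_REQUESTS = 32            # global cap on in-flight requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # parallel requests to any one domain
DOWNLOAD_DELAY = 0.5                # seconds between requests to the same domain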

Best of luck,
Travis

lnxpgn

Sep 25, 2014, 10:56:04 PM
to scrapy...@googlegroups.com
You can implement your own spider_idle signal handler to fetch new URLs from MongoDB whenever the spider goes idle.  That way you don't need to run Scrapy again and again.
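A sketch of that wiring, using the Scrapy API as it was around this time; the next_batch_of_requests() helper is hypothetical and would pull the next set of pins from MongoDB:

import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider


class PinSpider(scrapy.Spider):
    name = "pins"

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(PinSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.on_idle, signal=signals.spider_idle)
        return spider

    def on_idle(self, spider):
        requests = list(self.next_batch_of_requests())  # hypothetical helper
        if requests:
            for request in requests:
                self.crawler.engine.crawl(request, spider)
            # Keep the spider alive; without this it would close once idle.
            raise DontCloseSpider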

Drew Friestedt

Oct 3, 2014, 11:34:57 AM
to scrapy...@googlegroups.com
I got everything working great except for the final piece: restarting the scrape in spider_idle.  Calling self.start_requests() from the spider_idle handler does not seem to work.  I posted this question on Stack Overflow, but the initial feedback I got was to load all 1M URLs in start_requests(), which I'm trying to avoid entirely.


Thx

lnxpgn

Oct 4, 2014, 3:37:50 AM
to scrapy...@googlegroups.com
Triggering the built-in spider_idle signal has some conditions; see http://doc.scrapy.org/en/latest/topics/signals.html

To observe what's happening, you could fetch only a few URLs at a time instead of 5,000, and check these built-in settings:
CONCURRENT_ITEMS
CONCURRENT_REQUESTS
CONCURRENT_REQUESTS_PER_DOMAIN
CONCURRENT_REQUESTS_PER_IP
DOWNLOAD_DELAY
Make sure each of them has an appropriate value.

The spider_idle signal lets a spider keep running as long as it has new URLs to feed it.

If you require greater speed, you should run multiple spiders (using Scrapyd, or launched manually) on several machines.