Memory leak. Requests count only goes up and doesn't go down


ShapeR

Jul 23, 2015, 1:33:19 PM
to scrapy-users

My spider has a serious memory leak. After 15 minutes of running, its memory use is 5 GB, and prefs() shows around 900k live Request objects and little else. What can be the reason for this high number of live Request objects? The Request count only goes up and never comes down; all other object counts are close to zero.

My spider looks like this:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.http import HtmlResponse
# LinkCrawlItem is the project's own item class (import omitted here).

class ExternalLinkSpider(CrawlSpider):
    name = 'external_link_spider'
    allowed_domains = ['']
    start_urls = ['']

    # Follow every link on every page and run parse_obj on each response.
    rules = (Rule(LxmlLinkExtractor(allow=()), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        # Skip non-HTML responses (images, PDFs, etc.).
        if not isinstance(response, HtmlResponse):
            return
        # Emit an item for every extracted link (outside allowed_domains) that is not rel="nofollow".
        for link in LxmlLinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            if not link.nofollow:
                yield LinkCrawlItem(domain=link.url)

Here is the output of prefs():


HtmlResponse                        2   oldest: 0s ago 
ExternalLinkSpider                  1   oldest: 3285s ago
LinkCrawlItem                       2   oldest: 0s ago
Request                        1663405   oldest: 3284s ago


Any ideas or suggestions?

fernando vasquez

Jul 27, 2015, 12:48:13 PM
to scrapy-users, shape...@gmail.com
You are not processing Requests as fast as you capture them. I had the same problem, although the cause could be different. In my case the link extractor was capturing duplicate Requests, so I decided to filter out the duplicates myself. The problem with Scrapy is that the duplicate filter runs after the link extractor has already created the Request objects, so you end up with tons of Requests.

In conclusion: you might have duplicate requests, so filter them before the for loop.
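
A minimal sketch of that approach, assuming an in-spider set of already-seen URLs (the attribute name seen_urls is purely illustrative, and LinkCrawlItem is the project's own item class):

from scrapy.http import HtmlResponse
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.spiders import CrawlSpider

class ExternalLinkSpider(CrawlSpider):
    # name, allowed_domains, start_urls and rules as in the original spider ...

    def parse_obj(self, response):
        if not isinstance(response, HtmlResponse):
            return
        # Create the de-duplication set lazily on first use (illustrative only).
        if not hasattr(self, 'seen_urls'):
            self.seen_urls = set()
        for link in LxmlLinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            if link.nofollow or link.url in self.seen_urls:
                continue
            self.seen_urls.add(link.url)
            yield LinkCrawlItem(domain=link.url)

Note that this only avoids emitting duplicate items from the callback; the Requests scheduled by the CrawlSpider rule still go through Scrapy's own duplicate filter.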

Rolando Espinoza

Jul 27, 2015, 1:18:53 PM
to scrapy...@googlegroups.com
ShapeR, try using the JOBDIR setting to store the requests queue on disk:

$ scrapy crawl myspider -s JOBDIR=myspider-job

The myspider-job directory will be created, containing a requests.queue directory and a requests.seen file.
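
JOBDIR can also be set in the project's settings.py instead of on the command line; a minimal sketch, assuming the standard Scrapy settings module:

# settings.py -- equivalent to passing -s JOBDIR=myspider-job on the command line
JOBDIR = 'myspider-job'

Running the same crawl command again with the same JOBDIR resumes the crawl from the persisted queue instead of starting over.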

Regards,
Rolando


fernando vasquez

Jul 27, 2015, 2:04:02 PM
to scrapy...@googlegroups.com
The JOBDIR setting is good for keeping RAM usage down; however, it makes Scrapy very slow with a large request queue.


ShapeR

Jul 27, 2015, 3:43:39 PM
to scrapy-users, shape...@gmail.com
It seems it was a bug in Scrapy itself. I upgraded to the latest 1.0.1 release, and the request count and memory use dropped drastically, from 4 GB to 100-300 MB per spider.

ShapeR

Aug 15, 2015, 7:43:51 AM
to scrapy-users, shape...@gmail.com
OK, I was wrong that updating Scrapy fixed it. I had just accidentally set the scan depth to 10k instead of 100k; when I set it back, the same problem appeared. Requests just keep going up.

>>> prefs()
Live References

LinkCrawlItem                      33   oldest: 3s ago
HtmlResponse                       59   oldest: 5s ago
ExternalLinkSpider                  1   oldest: 352s ago
Request                        411836   oldest: 349s ago


The oldest Request stays in memory no matter what. So at a 100k page count, on some sites the spider can reach 40 GB of memory use, which is completely broken.
Can you elaborate on the duplicate requests filter? I don't work with Requests anywhere in my code, only with Responses.

As for JOBDIR: I guess it will become slow, and I don't think it's normal for this spider to consume 30 GB of memory.