Re: Scrapy and memory problems.

Ellison Marks

May 31, 2013, 3:15:29 PM
to scrapy...@googlegroups.com
You have a completely open SgmlLinkExtractor. Depending on the number and location of links on www.project.com, I wouldn't be surprised if your scraper was queuing up most of the internet to be scraped. allowed_domains only affects what the spider will actually crawl, not what will be queued up for it to crawl. Try using some of these parameters in your SgmlLinkExtractor (there is a sketch after the parameter list below):

http://doc.scrapy.org/en/latest/topics/link-extractors.html#topics-link-extractors

Parameters:
  • allow (a regular expression (or list of)) – a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be extracted. If not given (or empty), it will match all links.
  • deny (a regular expression (or list of)) – a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be excluded (ie. not extracted). It has precedence over the allow parameter. If not given (or empty) it won’t exclude any links.
  • allow_domains (str or list) – a single value or a list of strings containing domains which will be considered for extracting the links
  • deny_domains (str or list) – a single value or a list of strings containing domains which won’t be considered for extracting the links
  • deny_extensions (list) – a list of extensions that should be ignored when extracting links. If not given, it will default to the IGNORED_EXTENSIONS list defined in the scrapy.linkextractor module.
  • restrict_xpaths (str or list) – an XPath (or list of XPaths) which defines regions inside the response where links should be extracted from. If given, only the text selected by those XPaths will be scanned for links. See the example below.
  • tags (str or list) – a tag or a list of tags to consider when extracting links. Defaults to ('a', 'area').
  • attrs (list) – list of attributes which should be considered when looking for links to extract (only for those tags specified in the tags parameter). Defaults to ('href',)
  • canonicalize (boolean) – canonicalize each extracted url (using scrapy.utils.url.canonicalize_url). Defaults to True.
  • unique (boolean) – whether duplicate filtering should be applied to extracted links.
  • process_value (callable) – see process_value argument of BaseSgmlLinkExtractor class constructor
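
For instance, here is a rough sketch of how the original Rule could be tightened up. I don't know project.com's real URL layout, so the allow/deny patterns and the restrict_xpaths region below are made-up placeholders; adjust them to match the site:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class ConstrainedProjectSpider(CrawlSpider):

    name = "project_constrained"
    allowed_domains = ["project.com"]
    start_urls = ["http://www.project.com"]

    rules = (
        Rule(
            SgmlLinkExtractor(
                # only extract links that stay on project.com
                allow_domains=["project.com"],
                # only queue URLs matching this (hypothetical) pattern
                allow=(r"/articles/",),
                # never queue calendar pages or session-id URLs (hypothetical)
                deny=(r"/calendar/", r"sessionid="),
                # only scan the main content area for links (hypothetical XPath)
                restrict_xpaths=("//div[@id='content']",),
            ),
            callback="parse_site",
            follow=True,
        ),
    )

    def parse_site(self, response):
        pass  # parsing omitted, as in the original spider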

On Wednesday, May 29, 2013 11:00:46 PM UTC-7, |DB| Snype wrote:
So my Scrapy process keeps getting "KILLED" because it runs out of memory.

I have removed the pipeline and the parse code, and it still happens. The memory keeps increasing gradually until the process gets "KILLED".

Here is the code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from project.items import ProjectItem


class ProjectSpider(CrawlSpider):
   
    name = "project"
    allowed_domains = ["project.com"]
    start_urls = [
        "http://www.project.com"
    ]

    rules = (
        Rule(SgmlLinkExtractor(), callback="parse_site", follow=True),
    )   
       
    def parse_site(self, response):
        hxs = HtmlXPathSelector(response)

That's all. I'd like to hear what you guys think.

System info: 1 core with 1 GB of RAM. The process runs at 70-100% CPU and RAM usage gradually keeps increasing until the process gets "KILLED".
What do I think? It's the queue. Since I have follow=True, I am generating too many links. Ultimately the spider generates more requests than it can visit, and memory fills up.
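
For reference, these are the kinds of settings I was planning to try next to keep the queue in check. They are standard Scrapy settings, but the numbers and the JOBDIR path below are just placeholder values I picked:

# settings.py

# stop following links deeper than N hops from the start URL
DEPTH_LIMIT = 3

# keep the pending-request queues on disk instead of in RAM
JOBDIR = 'crawls/project-run-1'

# have Scrapy shut itself down cleanly before the kernel OOM-killer steps in
MEMUSAGE_ENABLED = True
MEMUSAGE_LIMIT_MB = 800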

I'd like to know what you guys think. TY!

Capi Etheriel

Jun 4, 2013, 9:51:18 AM
to scrapy...@googlegroups.com
On Friday, May 31, 2013 at 6:43:08 PM UTC-3, Cesar Gonzalez-Flores wrote:
On Friday, May 31, 2013 3:15:29 PM UTC-4, Ellison Marks wrote:
You have a completely open SgmlLinkExtractor. Depending on the number and location of links on www.project.com, I wouldn't be surprised if your scraper was queuing up most of the internet to be scraped. allowed_domains only affects what the spider will actually crawl, not what will be queued up for it to crawl. Try using some of these parameters in your SgmlLinkExtractor:


Wouldn't allowed_domains=['project.com'] limit the spider to project.com? I could easily see this becoming a runaway spider if that weren't the case. That being said, I agree with Ellison. Mind your extractor settings!

The Spider's allowed_domains effectively limits which requests the spider will actually crawl. But the CrawlSpider's parse method still generates a request for every link the extractor returns, following the spider's Rules; those requests are only filtered out afterwards, before being crawled.
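
A rough sketch of that difference, reusing the Rule from the original spider: allowed_domains on the spider filters requests that have already been generated, while allow_domains passed to the extractor keeps off-site links from being turned into requests at all.

    # spider attribute: off-site requests are still generated, then filtered out
    allowed_domains = ["project.com"]

    # extractor argument: off-site links are dropped at extraction time,
    # before any Request objects are created for them
    rules = (
        Rule(SgmlLinkExtractor(allow_domains=["project.com"]),
             callback="parse_site", follow=True),
    )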

Pablo Hoffman

Jun 5, 2013, 2:19:00 PM
to scrapy-users
You can always do some debugging to track down memory leaks. There are a few tools that Scrapy provides for that purpose; they are covered in the "Debugging memory leaks" page of the docs: http://doc.scrapy.org/en/latest/topics/leaks.html
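
For example, roughly (the trackref helpers and the telnet console both ship with Scrapy, but check the page above for the exact usage):

# While the spider is running, inspect live object counts with trackref:
from scrapy.utils.trackref import print_live_refs, get_oldest

print_live_refs()                    # counts of live Requests/Responses/Items, per class
oldest = get_oldest('HtmlResponse')  # oldest live response still referenced, or None
if oldest is not None:
    print(oldest.url)                # which page has been kept around the longest

# The telnet console (enabled by default on port 6023) exposes the same helpers:
#   telnet localhost 6023
#   >>> prefs()   # shortcut for print_live_refs()
#   >>> est()     # engine status report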

