You have a completely open SgmlLinkExtractor. Depending on the number and location of links on www.project.com, I wouldn't be surprised if your scraper was queuing up most of the internet to be scraped. allowed_domains only affects what the spider will actually crawl, not what will be queued up for it to crawl. Try using some of the following parameters in your SgmlLinkExtractor (there's a sketch after the parameter list):
http://doc.scrapy.org/en/latest/topics/link-extractors.html#topics-link-extractors

Parameters:
- allow (a regular expression, or list of regular expressions) – a single regular expression (or list of regular expressions) that the (absolute) URLs must match in order to be extracted. If not given (or empty), it will match all links.
- deny (a regular expression, or list of regular expressions) – a single regular expression (or list of regular expressions) that the (absolute) URLs must match in order to be excluded (i.e. not extracted). It takes precedence over the allow parameter. If not given (or empty), it won't exclude any links.
- allow_domains (str or list) – a single value or a list of strings containing domains which will be considered for extracting the links.
- deny_domains (str or list) – a single value or a list of strings containing domains which won't be considered for extracting the links.
- deny_extensions (list) – a list of extensions that should be ignored when extracting links. If not given, it defaults to the IGNORED_EXTENSIONS list defined in the scrapy.linkextractor module.
- restrict_xpaths (str or list) – an XPath (or list of XPaths) which defines regions inside the response where links should be extracted from. If given, only the text selected by those XPaths will be scanned for links. See the example below.
- tags (str or list) – a tag or a list of tags to consider when extracting links. Defaults to ('a', 'area').
- attrs (list) – a list of attributes which should be considered when looking for links to extract (only for those tags specified in the tags parameter). Defaults to ('href',).
- canonicalize (boolean) – canonicalize each extracted URL (using scrapy.utils.url.canonicalize_url). Defaults to True.
- unique (boolean) – whether duplicate filtering should be applied to extracted links.
- process_value (callable) – see the process_value argument of the BaseSgmlLinkExtractor class constructor.
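
For example, something along these lines. This is just a sketch: the allow/deny patterns, the content-div XPath, and the page sections are made up, since I don't know project.com's layout, so substitute whatever parts of the site you actually care about:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class ProjectSpider(CrawlSpider):
    name = "project"
    allowed_domains = ["project.com"]
    start_urls = ["http://www.project.com"]

    rules = (
        Rule(
            SgmlLinkExtractor(
                # Hypothetical patterns: only queue product/category pages
                allow=(r"/products/", r"/category/"),
                # Skip login pages and sort/filter permutations of the same page
                deny=(r"/login", r"\?sort="),
                # Only extract links that point at project.com
                allow_domains=("project.com",),
                # Hypothetical page region: only scan the main content div for links
                restrict_xpaths=("//div[@id='content']",),
            ),
            callback="parse_site",
            follow=True,
        ),
    )

    def parse_site(self, response):
        pass  # your parsing code here

Tightening allow/deny (or restrict_xpaths) is usually the quickest way to keep the request queue from growing without bound, since links that never match are never queued in the first place.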
On Wednesday, May 29, 2013 11:00:46 PM UTC-7, |DB| Snype wrote:
So my scrapy process keeps getting "KILLED" because it runs out of memory.
I have removed the pipeline and the parse code. It still happens. The memory keeps increasing gradually and then the process gets "KILLED".
Here is the code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from project.items import ProjectItem


class ProjectSpider(CrawlSpider):
    name = "project"
    allowed_domains = ["project.com"]
    start_urls = [
        "http://www.project.com"
    ]

    rules = (
        Rule(SgmlLinkExtractor(), callback="parse_site", follow=True),
    )

    def parse_site(self, response):
        hxs = HtmlXPathSelector(response)
That's all. I'd like to hear what you guys think.
System info: 1 core with 1 GB of RAM. The process runs at 70-100% CPU and the RAM usage gradually keeps increasing until it gets "KILLED".
What do I think? It's the queue. Since I have follow=True, I am generating too many links. Ultimately it generates more links than it can visit, and the memory fills up.
Would like to know what you guys think. TY!