Crawling slows down drastically towards the end

Hyder Alamgir

Apr 18, 2016, 4:10:59 AM
to scrapy-users
I've got a set of 25,000+ URLs that I need to scrape. I'm consistently seeing that after about 22,000 URLs the crawl rate drops drastically.

Take a look at these log lines to get some perspective:

2016-04-18 00:14:06 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-04-18 00:15:06 [scrapy] INFO: Crawled 5324 pages (at 5324 pages/min), scraped 0 items (at 0 items/min)
2016-04-18 00:16:06 [scrapy] INFO: Crawled 9475 pages (at 4151 pages/min), scraped 0 items (at 0 items/min)
2016-04-18 00:17:06 [scrapy] INFO: Crawled 14416 pages (at 4941 pages/min), scraped 0 items (at 0 items/min)
2016-04-18 00:18:07 [scrapy] INFO: Crawled 20575 pages (at 6159 pages/min), scraped 0 items (at 0 items/min)
2016-04-18 00:19:06 [scrapy] INFO: Crawled 22036 pages (at 1461 pages/min), scraped 0 items (at 0 items/min)
2016-04-18 00:20:06 [scrapy] INFO: Crawled 22106 pages (at 70 pages/min), scraped 0 items (at 0 items/min)
2016-04-18 00:21:06 [scrapy] INFO: Crawled 22146 pages (at 40 pages/min), scraped 0 items (at 0 items/min)
2016-04-18 00:22:06 [scrapy] INFO: Crawled 22189 pages (at 43 pages/min), scraped 0 items (at 0 items/min)
2016-04-18 00:23:06 [scrapy] INFO: Crawled 22229 pages (at 40 pages/min), scraped 0 items (at 0 items/min)

Here are my settings:

# -*- coding: utf-8 -*-

BOT_NAME = 'crawler'

SPIDER_MODULES = ['crawler.spiders']
NEWSPIDER_MODULE = 'crawler.spiders'

CONCURRENT_REQUESTS = 10
REACTOR_THREADPOOL_MAXSIZE = 100
LOG_LEVEL = 'INFO'
COOKIES_ENABLED = False
RETRY_ENABLED = False
DOWNLOAD_TIMEOUT = 15
DNSCACHE_ENABLED = True
DNSCACHE_SIZE = 1024000
DNS_TIMEOUT = 10
DOWNLOAD_MAXSIZE = 1024000 # 1000 KB (~1 MB)
DOWNLOAD_WARNSIZE = 819200 # 800 KB
REDIRECT_MAX_TIMES = 3
METAREFRESH_MAXDELAY = 10
ROBOTSTXT_OBEY = True
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36' #Chrome 41

DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'

#DOWNLOAD_DELAY = 1
#AUTOTHROTTLE_ENABLED = True
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 604800 # 7 days
COMPRESSION_ENABLED = True

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 400,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 550,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'crawler.middlewares.RandomizeProxies': 740,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware': 830,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
}

PROXY_LIST = '/etc/scrapyd/proxy_list.txt'
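The custom crawler.middlewares.RandomizeProxies middleware at priority 740 isn't shown in the thread, so this is only a guess at its shape: a minimal sketch that picks a random proxy from PROXY_LIST for each request and sets request.meta['proxy'], which the stock HttpProxyMiddleware (at 750, i.e. running after it) honours. Everything beyond the settings above is an assumption.

```python
import random


class RandomizeProxies:
    """Hypothetical sketch of the unseen crawler.middlewares.RandomizeProxies:
    assigns a random proxy from a list to each outgoing request."""

    def __init__(self, proxy_list_path=None, proxies=None):
        if proxies is None:
            # PROXY_LIST points at a file with one proxy URL per line
            with open(proxy_list_path) as f:
                proxies = [line.strip() for line in f if line.strip()]
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is the custom setting defined in the settings above
        return cls(proxy_list_path=crawler.settings.get('PROXY_LIST'))

    def process_request(self, request, spider):
        # HttpProxyMiddleware picks up request.meta['proxy'] downstream
        request.meta['proxy'] = random.choice(self.proxies)
```

If proxy rotation is the culprit, a few dead or rate-limited proxies in the list would show up as exactly this kind of tail slowdown, so it's worth ruling out.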

Memory and CPU consumption are under 10%.
tcptrack shows no unusual network activity.
iostat shows negligible disk I/O.


What can I look at to debug this?

vishal singh

Apr 18, 2016, 5:14:22 AM
to scrapy...@googlegroups.com
Disable DNSCACHE_ENABLED and HTTPCACHE_ENABLED, and check whether you get the same results.
Also try opening the last URLs manually in scrapy shell and check whether they take longer than usual.
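As a rough stand-in for the scrapy shell check suggested above, a small stdlib-only script can time the suspect URLs outside of Scrapy; the slow_urls list is a placeholder you would fill from your own URL set.

```python
import time
import urllib.request


def time_fetch(url, timeout=15):
    """Fetch a URL and return (elapsed_seconds, error_or_None)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
        return time.monotonic() - start, None
    except Exception as exc:
        return time.monotonic() - start, exc


# slow_urls = [...]  # placeholder: the URLs crawled after the slowdown
# for url in slow_urls:
#     elapsed, err = time_fetch(url)
#     print('%6.2fs  %s  %s' % (elapsed, err or 'OK', url))
```

If these URLs are fast when fetched directly, the bottleneck is inside the crawl (scheduler, proxies, per-domain concurrency) rather than the sites themselves.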


Hyder Alamgir

Apr 18, 2016, 8:54:07 AM
to scrapy-users
Disabling DNSCACHE_ENABLED and HTTPCACHE_ENABLED doesn't help; I'm still seeing the same slowdown.

Any idea how I go about finding out what the last few URLs are?

Besides, I've already set DOWNLOAD_TIMEOUT to 15 and DNS_TIMEOUT to 10, and in tcptrack I don't see any connections lasting longer than 15 seconds.
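One way to surface which URLs are the slow ones is a small timing middleware that logs any response slower than a threshold; this is a hypothetical sketch, not part of the setup above.

```python
import logging
import time

logger = logging.getLogger(__name__)


class SlowResponseLogger:
    """Hypothetical downloader middleware: timestamps each request and
    logs any response that took longer than `threshold` seconds, so the
    slow tail of URLs shows up directly in the crawl log."""

    def __init__(self, threshold=5.0):
        self.threshold = threshold

    def process_request(self, request, spider):
        # record start time on the request itself
        request.meta['_start'] = time.monotonic()

    def process_response(self, request, response, spider):
        elapsed = time.monotonic() - request.meta.get('_start', time.monotonic())
        if elapsed > self.threshold:
            logger.warning('Slow response (%.1fs): %s', elapsed, request.url)
        return response
```

Wiring it into DOWNLOADER_MIDDLEWARES near the stats middleware (e.g. at 845) would log the offending URLs as they come in, instead of having to reconstruct them afterwards.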