How to crawl all pages? Infinite MaxPagesToCrawl?

Christian LeMoussel

Feb 12, 2014, 9:32:38 AM
to abot-web...@googlegroups.com
How can I crawl all pages?

MaxPagesToCrawl is the maximum number of pages to crawl, and in CrawlDecisionMaker.cs I found:

public virtual CrawlDecision ShouldCrawlPage(PageToCrawl pageToCrawl, CrawlContext crawlContext)
{
    ......
    if (crawlContext.CrawledCount + 1 > crawlContext.CrawlConfiguration.MaxPagesToCrawl)
    {
        return new CrawlDecision { Allow = false, Reason = string.Format("MaxPagesToCrawl limit of [{0}] has been reached", crawlContext.CrawlConfiguration.MaxPagesToCrawl) };
    }
    ......
}

By default I set MaxPagesToCrawl=1000000.
Any idea how to make it infinite?

Christian

sjdi...@gmail.com

Feb 12, 2014, 10:21:13 AM
to Christian LeMoussel, abot-web...@googlegroups.com
You could just set it to the maximum value, 9223372036854775807 (long.MaxValue). However, you WILL run into memory and/or disk space issues if you plan on crawling that many pages in a single crawl. Each PoliteWebCrawler instance is contextual to the root URI used to start the crawl. A better approach for deep crawls is to create a new instance of PoliteWebCrawler for each new domain you encounter during your crawl.
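
As a rough sketch of the "one crawler per domain" idea, the links you collect could be bucketed by host before each bucket is handed to its own crawl. Nothing below is Abot code: DomainBatches and GroupByHost are hypothetical names, and treating Uri.Host as "the domain" is a simplification (subdomains end up in separate buckets).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of Abot: buckets absolute URIs by host.
public static class DomainBatches
{
    public static Dictionary<string, List<Uri>> GroupByHost(IEnumerable<Uri> uris)
    {
        // Host names are case-insensitive, so compare them that way.
        return uris
            .GroupBy(u => u.Host, StringComparer.OrdinalIgnoreCase)
            .ToDictionary(g => g.Key, g => g.ToList(), StringComparer.OrdinalIgnoreCase);
    }

    public static void Main()
    {
        var seen = new[]
        {
            new Uri("http://example.com/a"),
            new Uri("http://example.com/b"),
            new Uri("http://example.org/")
        };

        foreach (var batch in GroupByHost(seen))
        {
            // This is where a fresh crawler instance would be created, one per host.
            Console.WriteLine("{0}: {1} page(s)", batch.Key, batch.Value.Count);
        }
    }
}
```

Each bucket then stays small enough that a single crawl's state does not have to hold every page of every site.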


You could also change the code to skip the check when MaxPagesToCrawl is set to zero:


public virtual CrawlDecision ShouldCrawlPage(PageToCrawl pageToCrawl, CrawlContext crawlContext)
{
    ......
    if (crawlContext.CrawlConfiguration.MaxPagesToCrawl > 0 && crawlContext.CrawledCount + 1 > crawlContext.CrawlConfiguration.MaxPagesToCrawl)
    {
        return new CrawlDecision { Allow = false, Reason = string.Format("MaxPagesToCrawl limit of [{0}] has been reached", crawlContext.CrawlConfiguration.MaxPagesToCrawl) };
    }
    ......
}
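
To try that guard in isolation, here is a minimal sketch that runs without Abot. The CrawlDecision class below is a stand-in with only the two properties used above, and PageLimit.Check is a hypothetical helper, not something in the library. Treating zero (or a negative value) as "no limit" is the assumption being illustrated.

```csharp
using System;

// Stand-in for Abot's CrawlDecision, reduced to the two properties used here.
public sealed class CrawlDecision
{
    public bool Allow { get; set; }
    public string Reason { get; set; } = "";
}

// Hypothetical helper that isolates the page-limit check.
public static class PageLimit
{
    public static CrawlDecision Check(long crawledCount, long maxPagesToCrawl)
    {
        // Same test as "CrawledCount + 1 > MaxPagesToCrawl", written without
        // the addition; the limit is skipped entirely when it is zero or less.
        if (maxPagesToCrawl > 0 && crawledCount >= maxPagesToCrawl)
        {
            return new CrawlDecision
            {
                Allow = false,
                Reason = string.Format("MaxPagesToCrawl limit of [{0}] has been reached", maxPagesToCrawl)
            };
        }

        return new CrawlDecision { Allow = true };
    }
}

public static class Demo
{
    public static void Main()
    {
        Console.WriteLine(PageLimit.Check(1000000, 0).Allow); // True: zero means no limit
        Console.WriteLine(PageLimit.Check(4, 5).Allow);       // True: the fifth page is still allowed
        Console.WriteLine(PageLimit.Check(5, 5).Reason);      // MaxPagesToCrawl limit of [5] has been reached
    }
}
```

Keep in mind that removing the limit does not remove the memory and disk concerns of a very large single crawl.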

Steven


Christian LeMoussel

Feb 12, 2014, 1:14:34 PM
to abot-web...@googlegroups.com
Steven, thanks for your help.

All the pages are on the same domain.
I set isExternalPageCrawlingEnabled="false".