Retry crawling for unsuccessful HTTP responses by increasing MinCrawlDelayPerDomainMilliSeconds


agarwal....@gmail.com

Jun 14, 2020, 1:13:30 PM
to Abot Web Crawler
I am crawling a website. This is my configuration:

var config = new CrawlConfiguration
{
    MaxPagesToCrawl = 10000,
    MaxConcurrentThreads = 10,
    MinCrawlDelayPerDomainMilliSeconds = 200
};

I was able to crawl 8457 of the 10000 pages successfully, but the remaining 1543 came back with HTTP status NA.
If I increase MinCrawlDelayPerDomainMilliSeconds by some factor I get better results, but the disadvantage is that the overall time goes up: even the requests that completed fine at 200 milliseconds now take longer.

Is there a way to retry only the requests that failed (here, roughly those 1543 requests) with an increased MinCrawlDelayPerDomainMilliSeconds?


Thanks
Kunal

sjdi...@gmail.com

Jun 15, 2020, 4:09:35 PM
to agarwal....@gmail.com, Abot Web Crawler
You can do the following, which allows a configurable number of retries:


CrawlConfiguration configuration = new CrawlConfiguration
{
    MaxRetryCount = 3,
    //etc...
};
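
If you want to keep the fast 200 ms delay for the bulk of the crawl and only slow down for the failures, one option is a two-pass approach: record the pages that never got a response during the crawl, then replay just those afterwards at a slower pace. Here is a rough sketch (assuming Abot 2.x; MaxRetryCount and MinRetryDelayInMilliseconds are config properties, but the 1000 ms second-pass delay and the PageRequester replay loop are just illustrative choices, not the only way to do this):

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;
using Abot2.Core;
using Abot2.Crawler;
using Abot2.Poco;

private static async Task CrawlThenRetryFailuresAsync(Uri siteToCrawl)
{
    var config = new CrawlConfiguration
    {
        MaxPagesToCrawl = 10000,
        MaxConcurrentThreads = 10,
        MinCrawlDelayPerDomainMilliSeconds = 200,
        MaxRetryCount = 3,                 //in-flight retries for failed requests
        MinRetryDelayInMilliseconds = 1000 //wait between those retries
    };

    var failedUris = new ConcurrentBag<Uri>();

    var crawler = new PoliteWebCrawler(config);
    crawler.PageCrawlCompleted += (sender, args) =>
    {
        //HttpResponseMessage is null when no response came back (the "NA" status)
        if (args.CrawledPage.HttpResponseMessage == null)
            failedUris.Add(args.CrawledPage.Uri);
    };

    await crawler.CrawlAsync(siteToCrawl);

    //Second pass: re-request only the failures, spaced further apart
    var requester = new PageRequester(config, new WebContentExtractor());
    foreach (var uri in failedUris)
    {
        await Task.Delay(1000); //the larger delay applies only to the retried requests
        var page = await requester.MakeRequestAsync(uri);
        Console.WriteLine($"{uri} -> {page.HttpResponseMessage?.StatusCode.ToString() ?? "NA"}");
    }
}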

The other option is to use AbotX instead. It has dynamic AutoThrottling that slows the crawl down when it detects signs of server stress and speeds it back up when it starts getting more successful responses again.

        private static async Task DemoCrawlerX_Throttling(Uri siteToCrawl)
        {
            var config = GetSafeConfig(); //helper from the AbotX samples that returns a pre-configured CrawlConfigurationX
            config.AutoThrottling = new AutoThrottlingConfig
            {
                IsEnabled = true,
                ThresholdHigh = 2,
                ThresholdMed = 2,
                MinAdjustmentWaitTimeInSecs = 10
            };
            //Optional, configure how aggressively to speed up or down during throttling
            config.Accelerator = new AcceleratorConfig();
            config.Decelerator = new DeceleratorConfig();

            //Now the crawl is able to "Throttle" itself if the site being crawled
            //is showing signs of stress.
            using (var crawler = new CrawlerX(config))
            {
                crawler.PageCrawlCompleted += (sender, args) =>
                {
                    //Check out args.CrawledPage for any info you need
                };
                await crawler.CrawlAsync(siteToCrawl);
            }
        }
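
(For context, CrawlerX is AbotX's extended version of Abot's PoliteWebCrawler, so the PageCrawlCompleted handler above is still the place to inspect args.CrawledPage, e.g. whether HttpResponseMessage came back null, for any requests that fail even with throttling enabled.)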

Hope that helps