NullReferenceException at CrawlDecisionMaker.ShouldDownloadPageContent

Lloyd

Nov 8, 2021, 6:08:33 AM
to Abot Web Crawler
Hello,
I'm writing a basic WinForms app that starts the web crawler from a URL (validated and filtered) provided by the user once they click "OK". It gradually adds the crawled URLs to a list ("UrlTrovati"), which is then returned to the user through another form. The strange thing is that everything appears to run without interruption, yet Visual Studio keeps reporting exceptions in the background (not so background, really), which isn't exactly clean. The exceptions aren't caught by my try-catch and usually appear after Abot has been running for a while. The only one that breaks the flow (without closing the app) in VS is a NullReferenceException with this very short stack trace:
   at Abot2.Core.CrawlDecisionMaker.ShouldDownloadPageContent(CrawledPage crawledPage, CrawlContext crawlContext)
I tried manually setting the ShouldDownloadPageContentDecisionMaker, to no avail. This is the code that constructs the crawler and starts it:

private async Task StartWebCrawler()
{
    try
    {
        CrawlConfiguration crawlConfiguration = new CrawlConfiguration
        {
            CrawlTimeoutSeconds = (int)numeric_Timeout.Value,
            IsRespectRobotsDotTextEnabled = false,
            MaxCrawlDepth = (int)numeric_maxDepth.Value,
            MinCrawlDelayPerDomainMilliSeconds = 3000
        };

        PoliteWebCrawler webCrawler = new PoliteWebCrawler(crawlConfiguration);
        webCrawler.PageCrawlStarting += WebCrawler_ProcessPageCrawlStarting;
        webCrawler.PageCrawlCompleted += WebCrawler_PageCrawlCompleted;
        webCrawler.ShouldDownloadPageContentDecisionMaker = (crawledPage, crawlContext) =>
        {
            CrawlDecision decision = new CrawlDecision { Allow = true };
            return decision;
        };

        await webCrawler.CrawlAsync(new Uri(BaseURL));
    }
    catch (Exception exception)
    {
        // Placed to catch the exception directly rather than through VS; no success
        Debug.WriteLine(exception.Message);
    }
}
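
In case it's relevant, one variant I've been considering is a more defensive decision maker that null-checks its arguments before allowing the download (just a sketch, and the Reason text is my own):

webCrawler.ShouldDownloadPageContentDecisionMaker = (crawledPage, crawlContext) =>
{
    // Guard against a null page or URI before allowing the download
    if (crawledPage == null || crawledPage.Uri == null)
        return new CrawlDecision { Allow = false, Reason = "Null crawled page" };

    return new CrawlDecision { Allow = true };
};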

private void WebCrawler_ProcessPageCrawlStarting(object sender, PageCrawlStartingArgs e)
{
    PageToCrawl pageToCrawl = e.PageToCrawl;
    string urlTrovato = pageToCrawl.Uri.AbsoluteUri;
    UrlTrovati.Add(urlTrovato);
    e.CrawlContext.IsCrawlHardStopRequested = StopWebCrawler;
    Debug.WriteLine($"About to crawl URL {urlTrovato}, found on page {pageToCrawl.ParentUri.AbsoluteUri}");
}
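
One thing I'm not sure about: since Abot crawls on multiple threads, PageCrawlStarting can presumably fire from worker threads, and UrlTrovati is currently a plain list, which isn't thread-safe. A sketch of what I could swap it for (assuming nothing else requires it to be a List<string>):

using System.Collections.Concurrent;

// Thread-safe collection, since the PageCrawlStarting handler
// may be invoked concurrently from Abot's crawl threads
private readonly ConcurrentBag<string> UrlTrovati = new ConcurrentBag<string>();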

Based on the console output, it also throws other exceptions that don't break the flow in Visual Studio, such as:

Exception thrown: 'System.Net.Http.HttpRequestException' in Abot2.dll
Exception thrown: 'System.InvalidOperationException' in Abot2.dll
Exception thrown: 'System.InvalidOperationException' in System.Private.CoreLib.dll
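
To pin down where these come from, I could log first-chance exceptions with their full stack traces (a sketch using the standard AppDomain hook; I haven't actually wired this in yet):

// Fires for every exception at the moment it is thrown,
// whether or not something later catches it
AppDomain.CurrentDomain.FirstChanceException += (sender, eventArgs) =>
{
    Debug.WriteLine($"First-chance: {eventArgs.Exception}");
};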

What am I doing wrong?
