NullReferenceException at CrawlDecisionMaker.ShouldDownloadPageContent

Lloyd

Nov 8, 2021, 6:08:33 AM
to Abot Web Crawler
Hello,
I'm writing a basic WinForms app that starts the web crawler from a URL (validated and filtered) provided by the user once they click "OK". It gradually adds the crawled URLs to a list ("UrlTrovati"), which is then returned to the user through another form. The strange thing is that everything appears to run without interruption, yet Visual Studio keeps reporting exceptions in the background (not so background, really), which isn't exactly clean. The exceptions aren't caught by my try-catch and usually appear after Abot has been running for a while. The only one that breaks the flow (without closing the app) in VS is a NullReferenceException with this very short stack trace:
   at Abot2.Core.CrawlDecisionMaker.ShouldDownloadPageContent(CrawledPage crawledPage, CrawlContext crawlContext)
I tried manually setting the ShouldDownloadPageContentDecisionMaker, to no avail. This is the code that constructs the crawler and starts it:

private async Task StartWebCrawler()
{
    try
    {
        CrawlConfiguration crawlConfiguration = new CrawlConfiguration
        {
            CrawlTimeoutSeconds = (int)numeric_Timeout.Value,
            IsRespectRobotsDotTextEnabled = false,
            MaxCrawlDepth = (int)numeric_maxDepth.Value,
            MinCrawlDelayPerDomainMilliSeconds = 3000
        };

        PoliteWebCrawler webCrawler = new PoliteWebCrawler(crawlConfiguration);
        webCrawler.PageCrawlStarting += WebCrawler_ProcessPageCrawlStarting;
        webCrawler.PageCrawlCompleted += WebCrawler_PageCrawlCompleted;
        webCrawler.ShouldDownloadPageContentDecisionMaker = (crawledPage, crawlContext) =>
        {
            CrawlDecision decision = new CrawlDecision { Allow = true };
            return decision;
        };

        await webCrawler.CrawlAsync(new Uri(BaseURL));
    }
    catch (Exception exception)
    {
        // Placed to catch the exception directly rather than through VS; no success
        Debug.WriteLine(exception.Message);
    }
}
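
In case it's relevant, one variant I've been considering is a more defensive decision maker that null-checks its arguments before allowing the download (just a sketch, and the Reason text is my own):

webCrawler.ShouldDownloadPageContentDecisionMaker = (crawledPage, crawlContext) =>
{
    // Guard against a null page or URI before allowing the download
    if (crawledPage == null || crawledPage.Uri == null)
        return new CrawlDecision { Allow = false, Reason = "Null crawled page" };

    return new CrawlDecision { Allow = true };
};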

private void WebCrawler_ProcessPageCrawlStarting(object sender, PageCrawlStartingArgs e)
{
    PageToCrawl pageToCrawl = e.PageToCrawl;
    string urlTrovato = pageToCrawl.Uri.AbsoluteUri;
    UrlTrovati.Add(urlTrovato);
    e.CrawlContext.IsCrawlHardStopRequested = StopWebCrawler;
    Debug.WriteLine($"About to crawl URL {urlTrovato}, found on page {pageToCrawl.ParentUri.AbsoluteUri}");
}
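
One thing I'm not sure about: since Abot crawls on multiple threads, PageCrawlStarting can presumably fire from worker threads, and UrlTrovati is currently a plain list, which isn't thread-safe. A sketch of what I could swap it for (assuming nothing else requires it to be a List<string>):

using System.Collections.Concurrent;

// Thread-safe collection, since the PageCrawlStarting handler
// may be invoked concurrently from Abot's crawl threads
private readonly ConcurrentBag<string> UrlTrovati = new ConcurrentBag<string>();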

Based on the console output, it also throws other exceptions that don't break the flow in Visual Studio, such as:

Exception thrown: 'System.Net.Http.HttpRequestException' in Abot2.dll
Exception thrown: 'System.InvalidOperationException' in Abot2.dll
Exception thrown: 'System.InvalidOperationException' in System.Private.CoreLib.dll
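
To pin down where these come from, I could log first-chance exceptions with their full stack traces (a sketch using the standard AppDomain hook; I haven't actually wired this in yet):

// Fires for every exception at the moment it is thrown,
// whether or not something later catches it
AppDomain.CurrentDomain.FirstChanceException += (sender, eventArgs) =>
{
    Debug.WriteLine($"First-chance: {eventArgs.Exception}");
};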

What am I doing wrong?
