Getting empty e.CrawledPage.Content.Text - don't understand why

VLADIMIR KOZLOV

Apr 17, 2021, 4:36:15 AM
to Abot Web Crawler
Hello,
I'm always getting a null or empty e.CrawledPage.Content.Text in the PageCrawlCompleted(object sender, PageCrawlCompletedArgs e) event handler.

The code looks like this:

    var urls = new List<string>();
    for (int nmid = 17518000; nmid <= 17618000; nmid += 1)
        urls.Add("https://www.foo.bar/" + nmid.ToString() + "/product/data");
    var config3 = new CrawlConfiguration
    {
        MaxConcurrentThreads = 100,
        MaxPagesToCrawl = 1000000,
        DownloadableContentTypes = "application/json;charset=UTF-8",
        UserAgentString = _chrome,
        IsExternalPageCrawlingEnabled = false,
        IsExternalPageLinksCrawlingEnabled = false,
        MaxCrawlDepth = 1000000
    };
    var scheduler = new UrlScheduler(urls);
    var decisionDefault = new CrawlDecisionMaker();
    var crawler = new PoliteWebCrawler(config, decisionDefault, null, scheduler, null, null, null, null, null);
    var crawlResult = await crawler.CrawlAsync(new Uri("https://www.foo.bar/"));

    crawler.PageCrawlCompleted += PageCrawlCompleted;
}

class UrlScheduler : Scheduler
{
    /// <summary>
    /// Instantiate the URL queue with list of URLs.
    /// </summary>
    /// <param name="urls"></param>
    public UrlScheduler(IEnumerable<string> urls)
        : base()
    {
        this.Add(urls.Select(url => new PageToCrawl(new Uri(url))));
    }
}


foo.bar (and the URLs in the list) return valid JSON.
The log in debug mode shows no errors.
Thank you in advance.
Best,
Vlad
