Using the scheduler to preload URLs, but the links on those preloaded pages are not scheduled


ghislain borremans

Feb 19, 2021, 3:29:04 PM
to Abot Web Crawler
At present I load, for example, 10 URLs from a database with a scheduler.
I pass the scheduler to the crawler's constructor and these pages get crawled, but the links found on those pages are not processed.
I assume this is because I added a scheduler to the constructor.
Is it possible to use the scheduler to load the first set of pages and then disconnect it so that the built-in scheduler takes over?
If so, how can this be done?

best regards
Ghislain

sjdi...@gmail.com

Feb 22, 2021, 5:23:40 PM
to ghislain borremans, Abot Web Crawler
Hi,

Adding links to the scheduler should still allow Abot/AbotX to process the links on each page. However, depending on the root URI (the first link crawled), you may need to adjust the following config settings:

/// <summary>
/// Whether pages external to the root uri should be crawled
/// </summary>
public bool IsExternalPageCrawlingEnabled { get; set; }

/// <summary>
/// Whether pages external to the root uri should have their links crawled. NOTE: IsExternalPageCrawlingEnabled must be true for this setting to have any effect
/// </summary>
public bool IsExternalPageLinksCrawlingEnabled { get; set; }

Also, you can change what the crawler considers "internal" by passing a custom delegate to this property
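
As a rough illustration of such a delegate (a sketch, not something confirmed in this thread): the helper below builds a function with a (uriInQuestion, rootUri) => bool shape that treats any URL whose host matches one of the preloaded seed URLs as internal. SeedHostRules and BuildIsInternal are hypothetical names, the code uses only the .NET base class library, and it assumes every Uri is absolute.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of Abot.
public static class SeedHostRules
{
    // Returns a delegate that reports a URI as internal when its host
    // matches the root URI's host or the host of any seed URI.
    // Assumes absolute URIs; Uri.Host throws on relative ones.
    public static Func<Uri, Uri, bool> BuildIsInternal(IEnumerable<Uri> seedUris)
    {
        var seedHosts = new HashSet<string>(
            seedUris.Select(u => u.Host),
            StringComparer.OrdinalIgnoreCase);

        return (uriInQuestion, rootUri) =>
            seedHosts.Contains(uriInQuestion.Host) ||
            string.Equals(uriInQuestion.Host, rootUri.Host,
                          StringComparison.OrdinalIgnoreCase);
    }
}
```

Assuming the crawler exposes a property that accepts a Func<Uri, Uri, bool>, the result of BuildIsInternal(seedUris) could be assigned to it. Whether that changes which links get scheduled depends on the Abot version and the rest of the configuration.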


ghislain borremans

Feb 23, 2021, 11:59:16 AM
to Abot Web Crawler
Thank you for the feedback. I will test these settings.

On Monday, February 22, 2021 at 23:23:40 UTC+1, sjdirect wrote:

VLADIMIR KOZLOV

Apr 30, 2021, 2:16:41 AM
to Abot Web Crawler
I suffer from a similar problem.

This has no effect:
    crawler.IsInternalUriDecisionMaker = (uriInQuestion, rootUri) =>
    {
        return true;
    };
    crawler.ShouldDownloadPageContentDecisionMaker = (crawledPage, crawlContext) =>
    {
        var decision = new CrawlDecision { Allow = true };
        return decision;
    };
And neither does this:

    var config = new CrawlConfiguration
    {
        IsExternalPageCrawlingEnabled = true,
        IsExternalPageLinksCrawlingEnabled = true,
    };

Should I switch to the commercial version? Thank you.
Vlad