Seeding SiteToCrawl with multiple pages?

Tom

Nov 10, 2021, 1:59:07 PM
to Abot Web Crawler
I'm working on converting an Abot project into an AbotX project for parallelized crawling.

The way I read it, each site should be seeded into ParallelCrawlerEngine with a single URL:

new SiteToCrawl { Uri = new Uri("https://thesite.com") };

But how do you deal with a site that you want to seed with multiple URLs? E.g., a few key pages that regularly get new or updated links, which you want picked up and crawled right away.

I assume you don't want to create a SiteToCrawl for each one? 

Another thing that isn't clear to me from the AbotX docs: where do I inject custom implementations at the crawler level? I guess I should do this within the CrawlerInstanceCreated event handler, but I'm not sure this is right:

crawlEngine.CrawlerInstanceCreated += (sender, eventArgs) =>
{
    eventArgs.Crawler.Impls.HtmlParser = new MyHyperlinkParser();
    eventArgs.Crawler.Impls.CrawlDecisionMaker = new MyCrawlDecisionMaker();
};

?

Tom

Nov 10, 2021, 2:15:32 PM
to Abot Web Crawler
Also, if I take the cheat route and just add a new SiteToCrawl for every URL, is the parallel engine smart enough to understand that URLs from the same domain are part of the same site, even if they span multiple SiteToCrawls?

E.g.:

siteToCrawlProvider.AddSitesToCrawl
(
    new SiteToCrawl { Uri = new Uri("https://thesite.com/") },
    new SiteToCrawl { Uri = new Uri("https://thesite.com/sitemap1.xml") },
    new SiteToCrawl { Uri = new Uri("https://thesite.com/sitemap2.xml") }
);
