Seeding SiteToCrawl with multiple pages?

Tom

Nov 10, 2021, 1:59:07 PM

to Abot Web Crawler

I'm working on converting an Abot project into an AbotX project for parallelized crawling.

The way I read it, each site should be seeded into ParallelCrawlerEngine with a single URL:

new SiteToCrawl { Uri = new Uri("https://thesite.com") };

But how do you handle a site that you want to seed with multiple URLs? E.g. a few key pages on the site that regularly get new or updated links, which you want picked up and crawled right away.

I assume you don't want to create a SiteToCrawl for each one?

Another thing that isn't clear to me from the AbotX docs: where do I inject custom implementations at the crawler level? I guess I should do this within the CrawlerInstanceCreated event handler, but I'm not sure this is right:

crawlEngine.CrawlerInstanceCreated += (sender, eventArgs) =>
{
    eventArgs.Crawler.Impls.HtmlParser = new MyHyperlinkParser();
    eventArgs.Crawler.Impls.CrawlDecisionMaker = new MyCrawlDecisionMaker();
};

?

Tom

Nov 10, 2021, 2:15:32 PM

to Abot Web Crawler

Also, if I take the cheat route and just add a new SiteToCrawl for every URL, is the parallel engine smart enough to recognize that URLs from the same domain belong to the same site, even if they span multiple SiteToCrawl instances?

E.g.:

siteToCrawlProvider.AddSitesToCrawl
(
    new SiteToCrawl { Uri = new Uri("https://thesite.com/") },
    new SiteToCrawl { Uri = new Uri("https://thesite.com/sitemap1.xml") },
    new SiteToCrawl { Uri = new Uri("https://thesite.com/sitemap2.xml") }
);
