Best way to do page-by-page crawling


Anton Kheistver

Dec 4, 2020, 2:50:54 AM
to Abot Web Crawler

Hello,

My source site's URL structure looks like this:
http://example.com/page/1
http://example.com/page/2
and so on... But some pages may not contain what I'm looking for (images).

What is the most efficient way to crawl such a site with Abot?
For now, I use my own implementation of HyperLinkParser, which queues the next page by incrementing the page number in the URL of the page that was just crawled. Is there a more efficient way? I'm considering my own implementation of Scheduler, loaded with pre-calculated URLs.

Thanks!

sjdi...@gmail.com

Dec 4, 2020, 11:04:49 PM
to Anton Kheistver, Abot Web Crawler
Are you asking how to crawl a site that has a predictable URL structure but whose pages cannot be discovered just by following the links Abot crawls? Or are you asking how to avoid crawling pages that do not have images?


Anton Kheistver

Dec 5, 2020, 4:20:13 AM
to Abot Web Crawler
Yes, I'm asking how to crawl a site that has a predictable URL structure.
Maybe my current approach with a custom HyperLinkParser is not the best solution in this case.

On Saturday, December 5, 2020 at 05:04:49 UTC+1, sjdirect wrote:

sjdi...@gmail.com

Dec 5, 2020, 11:17:31 AM
to Anton Kheistver, Abot Web Crawler
Then I do think that creating an instance of the Scheduler class (Scheduler.cs) and loading it with the list of precalculated URLs is the correct way to go.
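
The "list of precalculated URLs" part can be sketched on its own, without touching Abot at all. The snippet below is only an illustration that uses the standard library: PageUrls, Build, firstPage and pageCount are hypothetical names, and the page count of 3 is an assumption, since the thread does not say how many pages the site has.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PageUrls
{
    // Builds baseUrl/firstPage, baseUrl/(firstPage + 1), ... for pageCount pages.
    // Hypothetical helper: the names and parameters are not part of Abot.
    public static IReadOnlyList<Uri> Build(string baseUrl, int firstPage, int pageCount)
    {
        // Drop a trailing slash so the result never contains "//" before the page number.
        var prefix = baseUrl.TrimEnd('/');

        return Enumerable.Range(firstPage, pageCount)
            .Select(n => new Uri($"{prefix}/{n}"))
            .ToList();
    }
}

public static class Program
{
    public static void Main()
    {
        // Assumed range: pages 1 through 3 of the example site from the question.
        foreach (var uri in PageUrls.Build("http://example.com/page", 1, 3))
        {
            Console.WriteLine(uri.AbsoluteUri);
        }
    }
}
```

Each resulting Uri could then be handed to the scheduler one at a time. How many pages to generate up front still has to come from somewhere, for example a known maximum page number.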

CS

Jan 16, 2021, 12:46:55 PM
to Abot Web Crawler
Hi,

Is there any sample or reference code that shows how to do this?

sjdi...@gmail.com

Jan 25, 2021, 3:45:56 PM
to CS, Abot Web Crawler
Hi,

You can just create an instance of the Scheduler class, call .Add() with every URL you want to crawl, then pass that scheduler into the crawler's constructor. I didn't compile this, so I'm guessing the code would look like...

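// Still uncompiled. Scheduler.Add() takes a PageToCrawl built from a Uri;
// SiteToCrawl is an AbotX type and is not accepted here.
var scheduler = new Scheduler();
scheduler.Add(new PageToCrawl(new Uri("http://yahoo.com/1")));
scheduler.Add(new PageToCrawl(new Uri("http://yahoo.com/2")));
scheduler.Add(new PageToCrawl(new Uri("http://yahoo.com/3")));

// The scheduler is the fourth constructor argument.
// The null arguments let the crawler fall back to its default implementations.
var crawler = new PoliteWebCrawler(
    null,
    null,
    null,
    scheduler,
    null,
    null,
    null,
    null,
    null);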

CS

Mar 7, 2021, 6:22:20 AM
to Abot Web Crawler
Thank you!