How to only crawl new and updated pages of a website?

Ramin Donyaee

Jul 5, 2024, 9:09:29 PM

to Abot Web Crawler

Let's say example.com was crawled today. How can I crawl only the new and updated pages the next time the crawler runs?

I can store the Last-Modified and ETag headers of each page to check whether it is new or updated, but one case puzzles me: a page that has not been updated may still contain links to pages with new or updated content. How should this case be handled?
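
To make the case concrete, here is a rough sketch of one possible approach. It is written in plain Python, not against Abot's C# API, and every name in it (`PageRecord`, `conditional_headers`, `recrawl`, and the `fetch` and `extract_links` callbacks) is hypothetical. It assumes the server honours conditional requests (`If-None-Match` / `If-Modified-Since`) and that the links found on each page are stored next to its validators. The idea is that a 304 response means the body is unchanged, so the stored links are still valid and can be followed without downloading the page again; each linked page then gets its own conditional request.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class PageRecord:
    """What is remembered about a page between crawls (hypothetical schema)."""
    etag: Optional[str] = None
    last_modified: Optional[str] = None
    links: List[str] = field(default_factory=list)


def conditional_headers(record: Optional[PageRecord]) -> Dict[str, str]:
    """Build conditional request headers from the stored validators, if any."""
    headers: Dict[str, str] = {}
    if record is not None:
        if record.etag:
            headers["If-None-Match"] = record.etag
        if record.last_modified:
            headers["If-Modified-Since"] = record.last_modified
    return headers


def recrawl(seed: str,
            store: Dict[str, PageRecord],
            fetch: Callable,
            extract_links: Callable) -> List[str]:
    """Visit every reachable page, but only download and parse changed ones.

    fetch(url, headers) is assumed to return (status, response_headers, body).
    Returns the URLs that were new or updated in this run.
    """
    changed: List[str] = []
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        record = store.get(url)
        status, resp_headers, body = fetch(url, conditional_headers(record))
        if status == 304 and record is not None:
            # Body unchanged, so its links are unchanged: reuse the stored ones.
            links = record.links
        elif status == 200:
            # New or updated page: parse it and refresh the stored record.
            links = extract_links(body)
            store[url] = PageRecord(resp_headers.get("ETag"),
                                    resp_headers.get("Last-Modified"),
                                    links)
            changed.append(url)
        else:
            continue  # errors and other statuses are ignored in this sketch
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return changed
```

With this scheme an unchanged page still costs one small conditional request, but its body is neither transferred nor parsed, and pages reachable only through it are still checked.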