How to only crawl new and updated pages of a website?


Ramin Donyaee

Jul 5, 2024, 9:09:29 PM
to Abot Web Crawler
Let's say example.com was crawled today. How can I crawl only the new and updated pages the next time the crawler runs?

I can persist the Last-Modified and ETag headers in storage to check whether a page is new or updated, but one case puzzles me: a page that has not itself been updated may still contain links to pages with new or updated content. How should I handle this case?
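To make the idea concrete, here is a rough sketch in Python rather than with Abot's actual API. It assumes the stored record for each page keeps its outgoing links alongside the ETag and Last-Modified values, so that a 304 Not Modified response can still be followed through to its children, and each child then gets its own conditional check. `PageRecord`, `Response`, `recrawl`, and the `fetch` callback are hypothetical names used only for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class PageRecord:
    """What is persisted per URL between crawls (hypothetical shape)."""
    etag: Optional[str] = None
    last_modified: Optional[str] = None
    outlinks: List[str] = field(default_factory=list)


@dataclass
class Response:
    """Minimal stand-in for an HTTP response plus the links parsed from it."""
    status: int
    etag: Optional[str] = None
    last_modified: Optional[str] = None
    links: List[str] = field(default_factory=list)


def conditional_headers(record: Optional[PageRecord]) -> Dict[str, str]:
    """Build If-None-Match / If-Modified-Since from a stored record."""
    headers: Dict[str, str] = {}
    if record is not None:
        if record.etag:
            headers["If-None-Match"] = record.etag
        if record.last_modified:
            headers["If-Modified-Since"] = record.last_modified
    return headers


def recrawl(
    start_url: str,
    store: Dict[str, PageRecord],
    fetch: Callable[[str, Dict[str, str]], Response],
) -> List[str]:
    """Return the URLs that were new or updated since the last crawl."""
    changed: List[str] = []
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        record = store.get(url)
        response = fetch(url, conditional_headers(record))
        if response.status == 304 and record is not None:
            # Unchanged page: skip processing, but still walk its stored links.
            links = record.outlinks
        elif response.status == 200:
            # New or updated page: refresh the stored validators and links.
            store[url] = PageRecord(
                response.etag, response.last_modified, list(response.links)
            )
            changed.append(url)
            links = response.links
        else:
            links = []
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return changed
```

With this approach every known URL is still requested on each run, but unchanged pages cost only a small 304 response, and this relies on the server actually honoring conditional requests.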

Thanks for the awesome package!