I'm using the AbotX parallel crawler engine inside a .NET worker app (i.e. one that uses the Microsoft.NET.Sdk.Worker SDK to run hosted "services" like my crawler).
When the service is stopped, I would like to dump out the list of URLs that have been collected for future crawling (i.e. what the crawler has discovered while crawling a site), so that the next time I start the crawler it can pick up right where it left off.
I don't see a clear way to do this; am I missing something?
I see the Pause/Resume option, but it doesn't seem to be designed for a complete shutdown and a later restart.
Poking around in the code, I see that the IScheduler interface has a GetNext() method that returns a PageToCrawl, so I've thought about simply looping over that and saving the results to disk. Is that going to work? Is there, or should there be, a better way?
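For reference, this is roughly what I had in mind (a rough sketch only; I'm assuming the scheduler implements Abot2's IScheduler with Count/GetNext()/Add(), that PageToCrawl takes a Uri in its constructor, and "pending-urls.txt" is just a placeholder file name I made up):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Abot2.Core;
using Abot2.Poco;

public class SchedulerPersistence
{
    private const string StateFile = "pending-urls.txt"; // placeholder path

    // Intended to run during the worker's StopAsync:
    // drain every still-queued page out of the scheduler and write its URL to disk.
    public void Save(IScheduler scheduler)
    {
        var urls = new List<string>();
        while (scheduler.Count > 0)
        {
            var page = scheduler.GetNext(); // removes the page from the queue
            if (page != null)
                urls.Add(page.Uri.AbsoluteUri);
        }
        File.WriteAllLines(StateFile, urls);
    }

    // Intended to run before starting the next crawl:
    // re-queue each saved URL so the crawl resumes where it stopped.
    public void Restore(IScheduler scheduler)
    {
        if (!File.Exists(StateFile))
            return;
        foreach (var line in File.ReadAllLines(StateFile))
            scheduler.Add(new PageToCrawl(new Uri(line)));
    }
}
```

My worry is whether draining the queue like this loses per-page state (crawl depth, parent URI, etc.) that the engine needs, which is part of why I'm asking if there's a supported way.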