Hi,
I am pleased to announce that the first version of the
URL Frontier API is ready to review.
At this stage, the focus is not really on the code but more on checking that the API itself makes sense, that nothing major has been omitted and that the documentation is understandable.
I will make a wider announcement to other groups (Heritrix, Scrapy etc...) with a bit more of a description, apologies in advance for cross-posting.
There is a minimal service implementation (by no means feature-complete, scalable or reliable) for you to play with but a more robust implementation will be coming in the next phase once the API has been consolidated and your feedback taken into account.
(An early
module for StormCrawler is available as well.)
I am looking forward to your comments. Any contribution would be fantastic from just asking a question to contributing code (more tests and better CLI?). If you develop a crawler, what would it take for you to make it compatible with the API or put it differently, if your crawler implements a URL frontier, could this be decoupled and exposed via the API?
Have a great day and thanks in advance
Julien