I have just released version 0.4 of URL Frontier; see
http://urlfrontier.net for more details. This latest release contains quite a few bug fixes and performance improvements.
For those of you who crawl in Java, the client code is on Maven:

<dependencies>
  <dependency>
    <groupId>com.github.crawler-commons</groupId>
    <artifactId>urlfrontier-API</artifactId>
    <version>0.4</version>
  </dependency>
</dependencies>

If you use StormCrawler, it already has a module for URLFrontier.
I have been using StormCrawler and URLFrontier at scale in the context of a Fed4Fire+ experiment and have fetched 300M URLs and discovered another billion URLs, all with a single Frontier instance. The content we crawl is stored in the WARC format and will be donated to our friends at CommonCrawl.
We are entering the final stages of the project with NLNet, but I am considering applying for a second round of funding to add more functionality and further improvements. I will keep you posted on this; in the meantime, please use
https://github.com/crawler-commons/url-frontier/discussions for sharing your feedback, questions or suggestions.
Have fun!
Julien
--
Open Source Solutions for Text Engineering