URL Frontier: call for review and feedback

DigitalPebble

Jan 28, 2021, 6:19:03 AM

to crawler...@googlegroups.com

Hi,

I am pleased to announce that the first version of the URL Frontier API is ready for review.

At this stage, the focus is not really on the code but more on checking that the API itself makes sense, that nothing major has been omitted and that the documentation is understandable.

Please use https://github.com/crawler-commons/url-frontier/discussions for sharing your feedback, questions or suggestions.

I will make a wider announcement to other groups (Heritrix, Scrapy, etc.) with a fuller description; apologies in advance for cross-posting.

There is a minimal service implementation (by no means feature-complete, scalable or reliable) for you to play with, but a more robust implementation will follow in the next phase, once the API has been consolidated and your feedback taken into account.

(An early module for StormCrawler is available as well.)

I am looking forward to your comments. Any contribution would be fantastic, from simply asking a question to contributing code (more tests and a better CLI?). If you develop a crawler, what would it take for you to make it compatible with the API? Or, to put it differently, if your crawler implements a URL frontier, could this be decoupled and exposed via the API?

Have a great day and thanks in advance

Julien

Julien Nioche

Apr 27, 2021, 11:27:15 AM

to crawler...@googlegroups.com

Hi,

I am pleased to announce that URL Frontier 0.2 has just been released. It includes changes and improvements based on the feedback I received after the initial release, as well as a robust and scalable implementation of the service based on RocksDB.

The service has been tested with StormCrawler (a module for URL Frontier is available and will be part of the imminent next release of SC) and is available as a Docker image.

For those of you who crawl in Java, the client code is on Maven:

<dependencies>
    <dependency>
        <groupId>com.github.crawler-commons</groupId>
        <artifactId>urlfrontier-API</artifactId>
        <version>0.2</version>
    </dependency>
</dependencies>

For other languages, the client code can be generated from the API. Instructions can be found on the website.

In the next couple of months, I will try to run it at scale and will let you know how it goes. In the meantime, please use https://github.com/crawler-commons/url-frontier/discussions for sharing your feedback, questions or suggestions.

Have fun!