Hi Julien,
yes, it definitely makes sense to define such an API as a new project under the
umbrella of crawler-commons. A reference implementation would also be a good fit.
> the operations that a web crawler typically does when communicating with a web frontier
Compiling this list of operations should be the first step. I think the API will
be more stable from the beginning if we take this definition step seriously.
Happy to get involved in the project!
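
To make the list concrete, here is a rough sketch of the operations from the quoted sentence as a toy in-memory frontier in Python. Everything in it is an assumption for discussion only: the class name InMemoryFrontier, the method names, the status strings and the round-robin selection are hypothetical and not a proposal for the final REST API.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List
from urllib.parse import urlsplit


@dataclass
class InMemoryFrontier:
    """Toy frontier; all names are hypothetical, not an agreed API."""

    # One FIFO queue of pending URLs per hostname.
    queues: Dict[str, Deque[str]] = field(
        default_factory=lambda: defaultdict(deque))
    # Last known status per URL (free-form strings in this sketch).
    status: Dict[str, str] = field(default_factory=dict)
    # Crawl delay in seconds per hostname; stored only, not enforced here.
    crawl_delay: Dict[str, float] = field(default_factory=dict)

    def put_url(self, url: str) -> None:
        # Queue a newly discovered URL under its hostname, skipping known URLs.
        if url not in self.status:
            self.status[url] = "DISCOVERED"
            self.queues[urlsplit(url).hostname].append(url)

    def get_next_urls(self, n: int) -> List[str]:
        # Take one URL per host in turn until n are collected
        # or every queue is empty.
        batch: List[str] = []
        while len(batch) < n:
            progressed = False
            for queue in self.queues.values():
                if queue and len(batch) < n:
                    batch.append(queue.popleft())
                    progressed = True
            if not progressed:
                break
        return batch

    def update_url(self, url: str, status: str) -> None:
        # Record the outcome the crawler reports for a URL.
        self.status[url] = status

    def set_crawl_delay(self, host: str, seconds: float) -> None:
        # Change the crawl rate for a particular hostname.
        self.crawl_delay[host] = seconds

    def active_hosts(self) -> List[str]:
        # Hosts that still have URLs waiting to be fetched.
        return [host for host, queue in self.queues.items() if queue]

    def stats(self) -> Dict[str, int]:
        return {
            "known": len(self.status),
            "queued": sum(len(q) for q in self.queues.values()),
            "active_hosts": len(self.active_hosts()),
        }
```

A crawler would only ever call these five or six operations, so agreeing on their inputs and outputs first should make the OpenAPI definition mostly a transcription exercise.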
Thanks,
Sebastian
On 9/12/19 11:05 PM, DigitalPebble wrote:
> Hi,
>
> I have been thinking about a new project for a while about developing a crawler-neutral URL frontier
> API; the idea being that it could be used with my own StormCrawler but also with Heritrix or other
> crawlers. This is a bit comparable to Frontera but w/o the dependency on Scrapy and more generic.
>
> The main task would be to design a REST API with OpenAPI
> <https://swagger.io/docs/specification/about/> for the operations that a web crawler typically does
> when communicating with a web frontier e.g. get the next N URLs to crawl, update the information
> about a URL, change the crawl rate for a particular hostname, get the list of active hosts, get
> stats, etc...
>
> When this is done, we could provide a set of client APIs using Swagger Codegen and maybe a simple
> reference implementation as well as a test / validation suite to check that implementations behave
> as expected.
>
> The beauty of it would be that if we can come up with a generic enough API, a compatible crawler
> would not need to know the details.
>
> Since this is a cross-crawler effort, like the rest of our project, I thought it would be a good
> place to host it.
>
> Any thoughts or objections?
>
> Thanks
>
> Julien
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://www.digitalpebble.com
> http://digitalpebble.blogspot.com
> @digitalpebble <https://twitter.com/digitalpebble>