
Proposal for a sub-project at crawler-commons


DigitalPebble

Sep 12, 2019, 5:05:14 PM
to crawler...@googlegroups.com
Hi,

For a while I have been thinking about a new project: developing a crawler-neutral URL frontier API, the idea being that it could be used with my own StormCrawler but also with Heritrix or other crawlers. This is somewhat comparable to Frontera, but without the dependency on Scrapy and more generic.

The main task would be to design a REST API with OpenAPI for the operations that a web crawler typically performs when communicating with a web frontier, e.g. get the next N URLs to crawl, update the information about a URL, change the crawl rate for a particular hostname, get the list of active hosts, get stats, etc.
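To make the list of operations concrete, here is a minimal in-memory sketch of such a frontier. All class, method, and field names here are illustrative inventions, not a proposed spec; a real implementation would be persistent, distributed, and politeness-aware.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field
from urllib.parse import urlparse

@dataclass
class URLInfo:
    """Illustrative record for a known URL."""
    url: str
    status: str = "DISCOVERED"          # e.g. DISCOVERED, FETCHED, ERROR
    metadata: dict = field(default_factory=dict)

class InMemoryFrontier:
    """Toy frontier covering the operations listed above."""

    def __init__(self):
        self.queues = defaultdict(deque)  # hostname -> URLs awaiting fetch
        self.delays = {}                  # hostname -> crawl delay (seconds)
        self.seen = {}                    # url -> URLInfo

    def put_url(self, info: URLInfo):
        """Add or update the information about a URL."""
        host = urlparse(info.url).netloc
        # Only queue URLs we have never seen that still need fetching.
        if info.url not in self.seen and info.status == "DISCOVERED":
            self.queues[host].append(info.url)
        self.seen[info.url] = info

    def get_urls(self, max_urls: int) -> list:
        """Return the next N URLs to crawl, host by host."""
        out = []
        for q in self.queues.values():
            while q and len(out) < max_urls:
                out.append(q.popleft())
        return out

    def set_delay(self, host: str, delay_secs: float):
        """Change the crawl rate for a particular hostname."""
        self.delays[host] = delay_secs

    def list_active_hosts(self) -> list:
        """Hosts that still have URLs queued."""
        return [h for h, q in self.queues.items() if q]

    def get_stats(self) -> dict:
        return {"known": len(self.seen),
                "queued": sum(len(q) for q in self.queues.values())}
```

The point of the API work would be to pin down exactly this kind of surface, so that any crawler could talk to any conforming frontier.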

When this is done, we could provide a set of client APIs using Swagger Codegen and maybe a simple reference implementation as well as a test / validation suite to check that implementations behave as expected. 

The beauty of it would be that if we can come up with a generic enough API, a compatible crawler would not need to know the details. 

Since this is a cross-crawler effort, like the rest of our work here, I thought crawler-commons would be a good place to host it.

Any thoughts or objections?

Thanks

Julien



Lewis John Mcgibbney

Sep 12, 2019, 6:43:30 PM
to crawler...@googlegroups.com
I think it’s an excellent idea, Julien.

--
You received this message because you are subscribed to the Google Groups "crawler-commons" group.
To unsubscribe from this group and stop receiving emails from it, send an email to crawler-commo...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/crawler-commons/CALv%2Baz1Ci0xhVUy%3Diq_bgNO-NH5BxSZB-hxH6kw2Uc9PfK3s4w%40mail.gmail.com.
--
Lewis
Dr. Lewis J. McGibbney Ph.D, B.Sc
Skype: lewis.john.mcgibbney



Sebastian Nagel

Sep 14, 2019, 11:35:01 AM
to crawler...@googlegroups.com
Hi Julien,

Yes, it definitely makes sense to define such an API as a new project under the umbrella
of crawler-commons. A reference implementation would also fit in perfectly.

> the operations that a web crawler typically does when communicating with a web frontier

This list of operations should be the first step. The API is more likely to be stable
from the beginning if we take this definition step seriously.

Happy to get involved in the project!

Thanks,
Sebastian


Julien Nioche

Sep 15, 2019, 12:43:05 PM
to crawler...@googlegroups.com
Thanks Lewis and Sebastian, I'll create a sub-project for it. Anyone else interested in taking part?

Julien




Aécio

Sep 15, 2019, 1:19:25 PM
to crawler...@googlegroups.com
I would be interested too, and I agree that the operations supported by the API should be the first step. Also, there are some details that greatly influence the implementation:
- scalability: what is the target, millions or billions of URLs? Is it a distributed frontier?
- crawl ordering policies: will it support advanced crawl ordering or only simple FIFO or random policies? How to customize ordering?
- multi-tenancy: will it support multiple crawling queues?
- API: REST only or a Java library?

Regards,
Aécio Santos

Avi Hayun

Sep 15, 2019, 3:22:55 PM
to crawler...@googlegroups.com

Julien Nioche

Sep 16, 2019, 3:59:49 AM
to crawler...@googlegroups.com
Hi Aécio

Thanks for your comments

> I would be interested too, and I agree that the operations supported by the API should be the first step. Also, there are some details that greatly influence the implementation:
> - scalability: what is the target, millions or billions of URLs? Is it a distributed frontier?

My focus is primarily on defining a common API, with perhaps a simple implementation for testing. An efficient implementation of the API could live in a separate project and would indeed be shaped by the aspects you listed. I am definitely interested in that too, of course!
 
> - crawl ordering policies: will it support advanced crawl ordering or only simple FIFO or random policies? How to customize ordering?

This is also something the work on the API will have to determine: should the API provide a way of specifying these, should it be left to a specific implementation, or should it be handled by the crawlers themselves?
 
> - multi-tenancy: will it support multiple crawling queues?

I'd think the API (and therefore its implementations) should provide a way of doing multi-tenancy.
 
> - API: REST only or a Java library?

Both ;-) I'm planning to use the Swagger toolbox to generate client code in various languages.

Thanks

Julien
 

Ken Krugler

Sep 17, 2019, 11:31:50 AM
to crawler...@googlegroups.com
Hi Julien,

As per our recent chat, I’m interested in reviewing the proposal.

— Ken


--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra

Julien Nioche

Dec 8, 2020, 4:28:33 AM
to crawler...@googlegroups.com
Hi, 

A quick update on the URL Frontier sub-project which I mentioned over a year ago (see the original message below). I managed to get some funding from NLnet [1] to work on it, as part of the NGI0 Discovery fund.

From a technical point of view, things have changed a bit since the original idea: we're now heading for a gRPC-based schema instead of a REST API. The overall goal remains the same.
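For readers unfamiliar with the difference, a gRPC approach means the frontier contract is expressed as a Protocol Buffers service definition rather than REST endpoints. The fragment below is purely an illustration of what that could look like; the service, RPC, and message names are invented here and are not the actual URL Frontier schema.

```protobuf
syntax = "proto3";

package urlfrontier.example;

// Hypothetical frontier service covering the operations
// discussed earlier in this thread.
service URLFrontierExample {
  // Get the next N URLs to crawl, optionally from one queue (e.g. a hostname).
  rpc GetURLs (GetParams) returns (stream URLItem) {}
  // Add or update information about URLs.
  rpc PutURLs (stream URLItem) returns (Ack) {}
  // Change the crawl delay for a particular queue.
  rpc SetDelay (DelayParams) returns (Ack) {}
  // Basic statistics about the frontier.
  rpc GetStats (Empty) returns (Stats) {}
}

message GetParams {
  uint32 max_urls = 1;
  string queue = 2;                 // empty = any queue
}

message URLItem {
  string url = 1;
  string queue = 2;
  map<string, string> metadata = 3;
}

message DelayParams {
  string queue = 1;
  uint32 delay_seconds = 2;
}

message Ack   { bool ok = 1; }
message Empty {}
message Stats {
  uint64 known_urls = 1;
  uint64 queued_urls = 2;
}
```

One practical advantage of this route is that `protoc` generates both client and server stubs in many languages from the single schema, much as Swagger Codegen would have for the REST variant.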

The plan is to have an initial release in a couple of months:
  • first release of the API schema
  • client code
  • a V1 reference implementation with basic functionalities, plus a test suite
  • initial documentation on the website
At that point, it would be great to have as much feedback from the community as possible before we go on to the next iteration.

The OpenAPI-based schema [2] I had started won't be used; I am sharing the URL just in case some of you are interested.

I'll give you an update when we are closer to the initial release and we can discuss things concretely; those of you who follow the project's repository [3] will probably see some commits very shortly though.

I am very excited about this, and I hope you are too.

Have a good day

Julien







Avi Hayun

Dec 8, 2020, 8:53:27 AM
to crawler...@googlegroups.com
Sounds great!

I love the idea.

Will be happy to follow.



Avi.
