
Proposal for a sub-project at crawler-commons


DigitalPebble

Sep 12, 2019, 5:05:14 PM
to crawler...@googlegroups.com
Hi,

For a while I have been thinking about a new project: developing a crawler-neutral URL frontier API, the idea being that it could be used with my own StormCrawler but also with Heritrix or other crawlers. This is somewhat comparable to Frontera, but without the dependency on Scrapy and more generic.

The main task would be to design a REST API with OpenAPI for the operations that a web crawler typically performs when communicating with a web frontier, e.g. get the next N URLs to crawl, update the information about a URL, change the crawl rate for a particular hostname, get the list of active hosts, get stats, etc.
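To make the list of operations concrete, here is a minimal in-memory sketch of such a frontier. All class, method, and field names here are illustrative inventions, not a proposed spec; a real implementation would be persistent, distributed, and politeness-aware.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field
from urllib.parse import urlparse

@dataclass
class URLInfo:
    """Illustrative record for a known URL."""
    url: str
    status: str = "DISCOVERED"          # e.g. DISCOVERED, FETCHED, ERROR
    metadata: dict = field(default_factory=dict)

class InMemoryFrontier:
    """Toy frontier covering the operations listed above."""

    def __init__(self):
        self.queues = defaultdict(deque)  # hostname -> URLs awaiting fetch
        self.delays = {}                  # hostname -> crawl delay (seconds)
        self.seen = {}                    # url -> URLInfo

    def put_url(self, info: URLInfo):
        """Add or update the information about a URL."""
        host = urlparse(info.url).netloc
        # Only queue URLs we have never seen that still need fetching.
        if info.url not in self.seen and info.status == "DISCOVERED":
            self.queues[host].append(info.url)
        self.seen[info.url] = info

    def get_urls(self, max_urls: int) -> list:
        """Return the next N URLs to crawl, host by host."""
        out = []
        for q in self.queues.values():
            while q and len(out) < max_urls:
                out.append(q.popleft())
        return out

    def set_delay(self, host: str, delay_secs: float):
        """Change the crawl rate for a particular hostname."""
        self.delays[host] = delay_secs

    def list_active_hosts(self) -> list:
        """Hosts that still have URLs queued."""
        return [h for h, q in self.queues.items() if q]

    def get_stats(self) -> dict:
        return {"known": len(self.seen),
                "queued": sum(len(q) for q in self.queues.values())}
```

The point of the API work would be to pin down exactly this kind of surface, so that any crawler could talk to any conforming frontier.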

When this is done, we could provide a set of client APIs using Swagger Codegen and maybe a simple reference implementation as well as a test / validation suite to check that implementations behave as expected. 

The beauty of it would be that if we can come up with a generic enough API, a compatible crawler would not need to know the details. 

Since this is a cross-crawler effort, like the rest of our work here, I thought crawler-commons would be a good place to host it.

Any thoughts or objections?

Thanks

Julien



Lewis John Mcgibbney

Sep 12, 2019, 6:43:30 PM
to crawler...@googlegroups.com
I think it’s an excellent idea, Julien.

--
You received this message because you are subscribed to the Google Groups "crawler-commons" group.
To unsubscribe from this group and stop receiving emails from it, send an email to crawler-commo...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/crawler-commons/CALv%2Baz1Ci0xhVUy%3Diq_bgNO-NH5BxSZB-hxH6kw2Uc9PfK3s4w%40mail.gmail.com.
--
Lewis
Dr. Lewis J. McGibbney Ph.D, B.Sc
Skype: lewis.john.mcgibbney



Sebastian Nagel

Sep 14, 2019, 11:35:01 AM
to crawler...@googlegroups.com
Hi Julien,

Yes, it definitely makes sense to define such an API as a new project under the umbrella
of crawler-commons. A reference implementation would also fit in perfectly.

> the operations that a web crawler typically does when communicating with a web frontier

This list of operations should be the first step. The API is more likely to be stable
from the beginning if we take this definition step seriously.

Happy to get involved in the project!

Thanks,
Sebastian


Julien Nioche

Sep 15, 2019, 12:43:05 PM
to crawler...@googlegroups.com
Thanks Lewis and Sebastian, I'll create a sub-project for it. Anyone else interested in taking part?

Julien




Aécio

Sep 15, 2019, 1:19:25 PM
to crawler...@googlegroups.com
I would be interested too, and I agree that the operations supported by the API should be the first step. Also, there are some details that greatly influence the implementation:
- scalability: what is the target, millions or billions of URLs? Is it a distributed frontier?
- crawl ordering policies: will it support advanced crawl ordering or only simple FIFO or random policies? How to customize ordering?
- multi-tenancy: will it support multiple crawling queues?
- API: REST only or a Java library?

Regards,
Aécio Santos

Avi Hayun

Sep 15, 2019, 3:22:55 PM
to crawler...@googlegroups.com

Julien Nioche

Sep 16, 2019, 3:59:49 AM
to crawler...@googlegroups.com
Hi Aécio

Thanks for your comments

> I would be interested too, and I agree that the operations supported by the API should be the first step. Also, there are some details that greatly influence the implementation:
> - scalability: what is the target, millions or billions of URLs? Is it a distributed frontier?

My focus is primarily on defining a common API, with perhaps a simple implementation for testing. An efficient implementation of the API could live in a separate project and would indeed be shaped by the aspects you listed. I am definitely interested in that too, of course!
 
> - crawl ordering policies: will it support advanced crawl ordering or only simple FIFO or random policies? How to customize ordering?

This is also something the work on the API will have to determine: should the API provide a way of specifying these, should it be left to a specific implementation, or should it be handled by the crawlers themselves?
 
> - multi-tenancy: will it support multiple crawling queues?

I'd think the API (and therefore its implementations) should provide a way of doing multi-tenancy.
 
> - API: REST only or a Java library?

Both ;-) I'm planning to use the Swagger toolbox to generate client code in various languages.

Thanks

Julien
 

Ken Krugler

Sep 17, 2019, 11:31:50 AM
to crawler...@googlegroups.com
Hi Julien,

As per our recent chat, I’m interested in reviewing the proposal.

— Ken


--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra

Julien Nioche

Dec 8, 2020, 4:28:33 AM
to crawler...@googlegroups.com
Hi, 

A quick update on the URL Frontier sub-project which I mentioned over a year ago (see the original message below). I managed to get some funding from NLnet [1] to work on it, as part of the NGI0 Discovery fund.

From a technical point of view, things have changed a bit since the original idea: we're now heading for a gRPC-based schema instead of a REST API. The overall goal remains the same.
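For readers unfamiliar with the difference, a gRPC approach means the frontier contract is expressed as a Protocol Buffers service definition rather than REST endpoints. The fragment below is purely an illustration of what that could look like; the service, RPC, and message names are invented here and are not the actual URL Frontier schema.

```protobuf
syntax = "proto3";

package urlfrontier.example;

// Hypothetical frontier service covering the operations
// discussed earlier in this thread.
service URLFrontierExample {
  // Get the next N URLs to crawl, optionally from one queue (e.g. a hostname).
  rpc GetURLs (GetParams) returns (stream URLItem) {}
  // Add or update information about URLs.
  rpc PutURLs (stream URLItem) returns (Ack) {}
  // Change the crawl delay for a particular queue.
  rpc SetDelay (DelayParams) returns (Ack) {}
  // Basic statistics about the frontier.
  rpc GetStats (Empty) returns (Stats) {}
}

message GetParams {
  uint32 max_urls = 1;
  string queue = 2;                 // empty = any queue
}

message URLItem {
  string url = 1;
  string queue = 2;
  map<string, string> metadata = 3;
}

message DelayParams {
  string queue = 1;
  uint32 delay_seconds = 2;
}

message Ack   { bool ok = 1; }
message Empty {}
message Stats {
  uint64 known_urls = 1;
  uint64 queued_urls = 2;
}
```

One practical advantage of this route is that `protoc` generates both client and server stubs in many languages from the single schema, much as Swagger Codegen would have for the REST variant.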

The plan is to have an initial release in a couple of months:
  • first release of the API schema
  • client code
  • a V1 reference implementation with basic functionalities, plus a test suite
  • initial documentation on the website
At that point, it would be great to have as much feedback from the community as possible before we go on to the next iteration.

The OpenAPI-based schema [2] I had started won't be used; I am sharing the URL just in case some of you are interested.

I'll give you an update when we are closer to the initial release and we can discuss things concretely; those of you who follow the project's repository [3] will probably see some commits very shortly though.

I am very excited about this, and I hope you are too.

Have a good day

Julien







Avi Hayun

Dec 8, 2020, 8:53:27 AM
to crawler...@googlegroups.com
Sounds great!

I love the idea.

Will be happy to follow.



Avi.
