Hi Alexander,
I am writing a book for Apress (a Springer imprint), aimed at a Python/PySpark-fluent audience, titled "Getting Structured Data from the Internet: Web Crawling on a Production Scale".
In the broad-crawling chapter, I am thinking about covering Frontera, since as far as I know it is the only open source, Python-based crawler suited to broad crawls.
I am familiar with Nutch 1.x, which provides the ability to attach an importance score to seed URLs that the OPIC scoring filter can then use. I was wondering whether Frontera includes something similar out of the box. I went through the documentation PDF but couldn't find it; section 3.0, "Crawl Frontier", hints at this, but I didn't notice any implementation. Does the "Basic" crawling strategy discussed at https://frontera.readthedocs.io/en/latest/topics/strategies.html already do some of that?
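To make the question concrete, this is roughly what I would hope to write for the book: a custom strategy in the spirit of the basic strategy from your docs, where each seed line carries a precomputed score and extracted links inherit a damped share of their domain's score. The tab-separated seed format, the damping constant, and the in-memory score table are all my own assumptions rather than anything I found in Frontera:

from urllib.parse import urlparse

from frontera.core.components import States
from frontera.strategy import BaseCrawlingStrategy


class DomainScoreStrategy(BaseCrawlingStrategy):
    """Sketch only: seeds carry a precomputed domain score (e.g. harmonic
    centrality) and extracted links inherit a damped share of it."""

    DAMPING = 0.85      # assumed constant, not something Frontera defines
    domain_scores = {}  # domain -> score; a real crawl would persist this

    def read_seeds(self, stream):
        # Assumed seed format: "<url>\t<score>", one per line.
        for line in stream:
            if isinstance(line, bytes):
                line = line.decode("utf-8")
            url, _, score = line.strip().partition("\t")
            if not url:
                continue
            domain = urlparse(url).netloc
            self.domain_scores[domain] = float(score) if score else 1.0
            self.schedule(self.create_request(url),
                          score=self.domain_scores[domain])

    def filter_extracted_links(self, request, links):
        return links

    def links_extracted(self, request, links):
        for link in links:
            if link.meta[b'state'] == States.NOT_CRAWLED:
                domain = urlparse(link.url).netloc
                base = self.domain_scores.get(domain, 0.1)
                # Links from high-centrality seed domains outrank links
                # from unknown or low-centrality domains.
                self.schedule(link, score=min(1.0, base * self.DAMPING))

    def page_crawled(self, response):
        response.meta[b'state'] = States.CRAWLED

    def request_error(self, request, error):
        request.meta[b'state'] = States.ERROR

Is that roughly the intended way to do it, or is there a built-in I am missing?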
As you rightly point out in your Hacker News post, there are a huge number of domains out there, and the only way to create a reasonable snapshot of the web is to index only the most important pages from each domain. In Nutch, this is guided by its implementation of the OPIC algorithm.
Using precomputed domain-level harmonic centrality from open data sources such as Web Data Commons or Common Crawl, in Nutch we can assign an importance score to the seed domains themselves, so that more pages get parsed and indexed from a high-scoring seed domain than from a low-scoring one.
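For the book, I plan to show that seed-scoring step as a small PySpark job along these lines: join the seed list against a host-level centrality table and emit a Nutch seed file whose per-URL nutch.score metadata the injector passes on to the scoring-opic plugin. The file paths and column names below are placeholders for whatever dump one actually uses:

from urllib.parse import urlparse

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("seed-scoring").getOrCreate()

# Plain list of seed URLs, one per line.
seeds = (spark.read.text("seeds.txt")
              .withColumnRenamed("value", "url"))

extract_domain = F.udf(lambda u: urlparse(u).netloc)
seeds = seeds.withColumn("domain", extract_domain("url"))

# Assumed schema: domain, harmonic_centrality (e.g. derived from the
# Common Crawl or Web Data Commons host-level graph dumps).
centrality = spark.read.csv("host_harmonic_centrality.csv",
                            header=True, inferSchema=True)

scored = (seeds.join(centrality, "domain", "left")
               .fillna({"harmonic_centrality": 0.0}))

# Normalize to [0, 1] and emit Nutch-style seed lines; the injector
# reads the per-URL nutch.score metadata and hands it to scoring-opic
# as the initial score of each seed.
max_hc = scored.agg(F.max("harmonic_centrality")).first()[0] or 1.0
(scored.withColumn("line",
                   F.format_string("%s\tnutch.score=%.4f", F.col("url"),
                                   F.col("harmonic_centrality") / max_hc))
       .select("line")
       .write.mode("overwrite")
       .text("scored_seeds"))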
Thanks,
Jay.