I would like to use Common Crawl as a search engine.
I want custom control, so I think I’ll run Athena SQL queries myself instead of using a search tool someone else has built.
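To make that concrete, here is roughly the kind of query I have in mind, sketched in Python with boto3. I’m assuming the "ccindex" database/table has been set up the way Common Crawl’s Athena instructions describe; the results bucket is a made-up placeholder, and the column names are worth double-checking against the actual table definition.

```python
# Sketch: run one Athena query against Common Crawl's columnar URL index.
# Assumes the "ccindex" database/table already exists in your AWS account and
# that "s3://my-athena-results/" (a placeholder) is a bucket you can write to.
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

query = """
SELECT url,
       warc_filename,
       warc_record_offset,
       warc_record_length
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2024-10'
  AND subset = 'warc'
  AND url_host_registered_domain = 'example.com'
LIMIT 100
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "ccindex"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then print the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows[1:]:  # first row is the column header
        print([col.get("VarCharValue") for col in row["Data"]])
```

The warc_filename / warc_record_offset / warc_record_length columns are what I would then use to pull the actual page bytes out of the archive.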
I was curious: why is Common Crawl hosted specifically on AWS? Is it because hosting there is cheap or offered freely? I think in principle it would be cool if the data had different “mirrors”. In other words, the data is the result of a crawling algorithm; technically, the specific vessel containing it is flexible.
What is the exact format of a CC index record for a page? Are we limited to keyword-searching page titles, or is there a good classification system for the pages: subject tags, but also maybe a “page type”, like academic, blog, newspaper, forum, social network, shopping site, wiki-site, etc.?
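To answer part of my own question, the easiest way to see what an index record actually contains is probably to hit the public CDX index API directly. A small sketch in Python (the crawl name is just one recent crawl, and the fields in the comment are the ones I expect to see, so worth double-checking):

```python
# Sketch: inspect what a Common Crawl index record contains by querying the
# public CDX index API for one crawl.
import json

import requests

resp = requests.get(
    "https://index.commoncrawl.org/CC-MAIN-2024-10-index",
    params={"url": "example.com/*", "output": "json", "limit": 5},
    timeout=30,
)
resp.raise_for_status()

# Each line is one JSON record; typical fields are urlkey, timestamp, url,
# mime, status, digest, length, offset, and filename (the WARC file holding
# the capture). I don't see a title, subject-tag, or page-type field here.
for line in resp.text.splitlines():
    record = json.loads(line)
    print(record["url"], record.get("mime"), record.get("status"))
```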
I think I read that CC does not allow full-text searching? Are there any plans toward that? I think it could be cool if the crawler could be separated into modular pieces, so people could assemble their own crawler from different parts. Maybe if full text is too much data, an NLP algorithm could generate a few topics or key terms for each page? I’m also thinking about needing to render pages fully if one is going to process their text content. On many sites, it is probably necessary to render the page in a headless browser, e.g. with Selenium, to actually get the real content, right?
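Thinking about it more, I believe the WARC files already hold the raw HTML exactly as the crawler fetched it (no JavaScript execution), so for the archived copies the text can be extracted directly; a headless browser would only matter if I re-fetched a JS-heavy page live. Here is a rough sketch of pulling one stored record and extracting its text; the filename, offset, and length are placeholders that would come from an index query like the ones above, and it uses the third-party warcio and beautifulsoup4 packages.

```python
# Sketch: fetch one archived capture out of a WARC file by byte range and
# extract its visible text. The filename/offset/length values are placeholders
# standing in for real values from an Athena or CDX index query.
from io import BytesIO

import requests
from bs4 import BeautifulSoup
from warcio.archiveiterator import ArchiveIterator

warc_filename = "crawl-data/CC-MAIN-2024-10/segments/.../warc/...warc.gz"  # placeholder
offset, length = 12345, 67890  # placeholder values from the index

# Fetch just the one gzipped record via an HTTP range request.
resp = requests.get(
    "https://data.commoncrawl.org/" + warc_filename,
    headers={"Range": f"bytes={offset}-{offset + length - 1}"},
    timeout=60,
)
resp.raise_for_status()

for record in ArchiveIterator(BytesIO(resp.content)):
    if record.rec_type == "response":
        html = record.content_stream().read()
        text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)
        print(text[:500])
```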
Those are just some thoughts. I’ll explore them now.
Thanks,
Julius