Searching Common Crawl


Peter Elbert

Nov 27, 2022, 2:43:24 PM
to common...@googlegroups.com
I would like to use Common Crawl as a search engine.

I want custom control, so I plan to run Athena SQL queries rather than use a search tool someone else has built.
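To make the Athena idea concrete, here is a minimal sketch of how such a query might be assembled. It assumes the Common Crawl columnar URL index has already been registered in Athena as a table named `ccindex` in a database named `ccindex`, and that the column names below match that table; the helper name `build_ccindex_query` and the example domain and crawl ID are hypothetical and only for illustration.

```python
def build_ccindex_query(domain: str, crawl: str, limit: int = 100) -> str:
    """Build an Athena SQL query string against an assumed `ccindex` table.

    The table, database, and column names are assumptions about how the
    columnar index was set up; adjust them to match your own Athena catalog.
    """
    # Double any single quotes so the values are safe inside SQL string literals.
    safe_domain = domain.replace("'", "''")
    safe_crawl = crawl.replace("'", "''")
    return (
        "SELECT url, warc_filename, warc_record_offset, warc_record_length\n"
        'FROM "ccindex"."ccindex"\n'
        # Filtering on the partition columns (crawl, subset) limits the data scanned.
        f"WHERE crawl = '{safe_crawl}'\n"
        "  AND subset = 'warc'\n"
        f"  AND url_host_registered_domain = '{safe_domain}'\n"
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    # Example values only; pick the crawl and domain you actually care about.
    print(build_ccindex_query("example.com", "CC-MAIN-2022-40", limit=10))
```

The resulting string can be pasted into the Athena console or submitted through whatever client you prefer. Note that this only searches URL-level metadata, not page text.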

I was curious: why is Common Crawl hosted specifically on AWS? Is it because it is cheap or offered for free? In principle, it would be good if the data had several “mirrors”. After all, the data is the output of a crawling algorithm, so the particular storage that holds it should be interchangeable.

What exactly is the format in which CC indexes a page? Are we limited to keyword searches on page titles, or is there a good classification system for the pages: subject tags, but also perhaps a “page type” such as academic, blog, newspaper, forum, social network, shopping site, wiki, etc.?
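As a partial illustration of the index question: as far as I can tell, each row of the URL index points at one captured page by WARC file name, byte offset, and record length, so a single record can be fetched with an HTTP range request. The sketch below only builds the URL and the `Range` header. The function name `warc_record_request` is hypothetical, and the base URL `https://data.commoncrawl.org/` is an assumption about where the files are served over HTTP.

```python
from typing import Dict, Tuple


def warc_record_request(
    warc_filename: str,
    offset: int,
    length: int,
    base_url: str = "https://data.commoncrawl.org/",  # assumed public HTTP endpoint
) -> Tuple[str, Dict[str, str]]:
    """Return the URL and headers needed to fetch a single WARC record.

    `offset` and `length` are the byte offset and record length taken from
    an index row. HTTP byte ranges are inclusive, hence the `- 1`.
    """
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    last_byte = offset + length - 1
    url = base_url.rstrip("/") + "/" + warc_filename.lstrip("/")
    headers = {"Range": f"bytes={offset}-{last_byte}"}
    return url, headers
```

The returned pair can be passed to any HTTP client; the response body would be one gzip-compressed WARC record, which still needs to be decompressed and parsed.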

I think I read that CC does not allow full-text searching. Are there any plans for that? It would be cool if the crawler could be separated into modular pieces, so that people could assemble their own crawler from them. If full text is too much data, maybe an NLP algorithm could generate a few topics or key terms for each page?

I’m also wondering about the need to render pages fully before processing their text content. On many sites it is probably necessary to render the page in a headless browser, for example with Selenium, to get the real page content, right?

Those are just some thoughts. I’ll explore them now.

Thanks,
Julius

Stan Srednyak

Nov 30, 2022, 12:53:06 PM
to common...@googlegroups.com
Hi,

You are asking great questions.

We thought about these some time ago, and we are building something along these lines at rorur.com. It is a decentralized search engine where the data will not be siloed at AWS but distributed among participating computers (which get the data using a distributed crawler). In addition, users can define their own ranking algorithms that the network can run. We have a working prototype; feel free to reach out if you want to participate.


Stan Srednyak
