Any Tutorials on Using CommonCrawl as Index for General Web Search Engine?


David Mackey

Jan 28, 2017, 3:48:04 AM
to Common Crawl
Hi All,

I'm interested in any tutorials or other articles on using the CommonCrawl data set to power a general web search engine. I haven't been able to find anything, so I'm wondering whether anyone is aware of any.

Thanks,
Dave

Laura Dietz

Jan 28, 2017, 9:55:54 AM
to common...@googlegroups.com
Hi Dave,

I guess, if you are a commercial search engine company, you already have all that data in CC.

I am an IR researcher, and I originally set out to create a full-text search index from CommonCrawl. I ran into a practical issue that led me to abandon the idea of building an index in favor of "streaming retrieval":



Suppose you want to create a full-text search index of 500TB of packed raw data. The index will probably be about 500TB as well. Storing 500TB on AWS infrastructure is a) prohibitively expensive and b) only possible in 1TB blocks, so you have to build a distributed file system on top of them. We tried that with Gluster and it was a disaster.

The other thought is that for research purposes, we do not need millisecond response times. Actually, often we need to run a few hundred queries in batch mode - this can take overnight or even a week - then we add more pipeline steps on top of that to fine tune the results, consolidate, summarize etc.

In this situation, you do not need a search index. Once you have a batch of queries, you can process the data in a streaming fashion, and keep track of one top-k result list per query. We call this streaming retrieval versus indexed retrieval.
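
A minimal sketch of the streaming idea described above, under stated assumptions: documents arrive as (doc_id, text) pairs, query terms are already lowercased, and the score is a plain count of query-term occurrences. The function name `streaming_retrieval` and the scoring rule are illustrative, not the code we actually ran.

```python
import heapq
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def streaming_retrieval(
    docs: Iterable[Tuple[str, str]],
    queries: Dict[str, List[str]],
    k: int = 10,
) -> Dict[str, List[Tuple[float, str]]]:
    """Make one pass over the documents, keeping one top-k list per query."""
    # One min-heap per query; the weakest retained result sits at heap[0].
    heaps: Dict[str, List[Tuple[float, str]]] = {qid: [] for qid in queries}
    for doc_id, text in docs:
        # Tokenize each document once and reuse the counts for every query.
        counts = Counter(text.lower().split())
        for qid, terms in queries.items():
            # Toy score (an assumption): total occurrences of the query terms.
            score = float(sum(counts[t] for t in terms))
            if score <= 0:
                continue
            entry = (score, doc_id)
            heap = heaps[qid]
            if len(heap) < k:
                heapq.heappush(heap, entry)
            elif entry > heap[0]:
                # Better than the current weakest result: swap it in.
                heapq.heapreplace(heap, entry)
    # Best results first.
    return {qid: sorted(heap, reverse=True) for qid, heap in heaps.items()}
```

Memory stays bounded by the number of queries times k, no matter how many documents are streamed past. A real run would replace the toy score with a proper ranking function that uses the precomputed collection statistics.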

Let's compare them side by side:
              | Indexed     | Streaming
--------------+-------------+----------------
storage       | 500TB       | low (1)
memory        | high (2)    | manageable (3)
processor     | same (4)    | same
response time | seconds     | days
index time    | days        | N/A


(1) We precompute collection statistics on a small subset of the common crawl
(2) needs to keep track of all words in the vocabulary; map-reduce to build inverted index of all term occurrences
(3) keeps track only of words in the query (or query expansion); result scoring on the fly; after reading one document, only its rank in the top-k result list (if any) needs to be maintained
(4) actually, it is a bit higher because you need to build the inverted index
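
For contrast, here is a toy sketch of the indexed side of the comparison, assuming whitespace tokenization and an in-memory dictionary. The names `build_inverted_index` and `lookup` are made up for illustration. The point is that the index holds one entry for every term in the vocabulary, which is what drives note (2).

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_inverted_index(docs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map every term in the collection to the ids of the documents containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs:
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def lookup(index: Dict[str, Set[str]], terms: List[str]) -> Set[str]:
    """Return the ids of documents that contain all of the given terms."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()
```

At Common Crawl scale the dictionary would not fit on one machine, hence the map-reduce job and the distributed storage mentioned above.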


Best,
Laura Dietz

--

Assistant Professor for Computer Science
TREMA lab: Text Retrieval, Extraction, Machine Learning, and Analysis
University of New Hampshire
http://www.cs.unh.edu/~dietz/
--
You received this message because you are subscribed to the Google Groups "Common Crawl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
To post to this group, send email to common...@googlegroups.com.
Visit this group at https://groups.google.com/group/common-crawl.
For more options, visit https://groups.google.com/d/optout.


Sebastian Nagel

Jan 28, 2017, 11:48:01 AM
to common...@googlegroups.com
Hi David,

Also have a look at the following. It's neither a tutorial nor an article, but it is a good example:

https://uidemo.commonsearch.org/
https://github.com/commonsearch

Best,
Sebastian

OneSpeedFast

Jan 28, 2017, 2:24:17 PM
to common...@googlegroups.com
Laura is right about identifying your urgency. When I wanted to extract certain fields, I didn't mind spending about 40 days downloading and parsing the entire crawl. Now I am beta testing a search API product that supplies live search data for any app, research, or business need.


Chillar Anand

Nov 16, 2022, 6:54:39 AM
to Common Crawl
Hi,

CommonSearch seems to be abandoned.

Are there any other search engines or tools built on top of CC?



Sebastian Nagel

Nov 16, 2022, 9:35:43 AM
to common...@googlegroups.com
Hi,

A few pointers (I'm sure there are more):

- https://www.alexandria.org/
  https://github.com/alexandria-org

- https://www.chatnoir.eu/
  https://www.chatnoir.eu/doc/architecture/
  https://github.com/chatnoir-eu

- https://github.com/ahcm/tantivy_warc_indexer

- https://quickwit.io/blog/commoncrawl/

- https://github.com/gaoalexander/web-search-engine

- https://github.com/hannesrabo/simple-search-engine

- https://link-archive.org/

Best,
Sebastian



Chillar Anand

Nov 16, 2022, 11:04:04 PM
to Common Crawl
Excellent sources.

Thanks, Sebastian.

Can we add this to https://commoncrawl.org/the-data/examples/ as well?
This will help users to discover these projects easily.

Are there any criteria for inclusion on the commoncrawl.org site?
I am also building a couple of projects using CC data. Just checking if those can be included too.

Sebastian Nagel

Nov 17, 2022, 1:52:43 AM
to common...@googlegroups.com
> Can we add this to https://commoncrawl.org/the-data/examples/ as well?

Yes, I'll make sure that all of them are listed there (some of them
are already).

> Are there any criteria for inclusion on the commoncrawl.org site?

There are no pinned formal criteria. But some level of code
quality and/or originality is required.

Otherwise, you might just browse repositories mentioning
"Common Crawl" on GitHub:
https://github.com/search?q=common+crawl
https://github.com/topics/common-crawl
https://github.com/topics/commoncrawl

> I am also building a couple of projects using CC data. Just checking if
> those can be included too.

Of course! Let us know about it.



Amirouche Boubekki

Nov 21, 2022, 4:49:46 PM
to common...@googlegroups.com
Hello all,
I dare to mention my project, which is available at
https://git.sr.ht/~amirouche/python-babelia
It is not specific to Common Crawl, but you can code the necessary
boilerplate in a week or so.

To index a document as a bag of words, you need to call pstore.index
inside a transaction [0].
There is an example use at [1].

[0] https://git.sr.ht/~amirouche/python-babelia/tree/main/item/pstore.py#L154

[1] https://git.sr.ht/~amirouche/python-babelia/tree/main/item/hyperdev/main.py#L267-276

The main differences compared to the alternatives above:

- The index can scale horizontally, same as tantivy, but accessing the
  index is much faster.

- It is more flexible and more cost-efficient than Elasticsearch.

As with tantivy and alexandria, it is possible to scale querying
horizontally too, even if that is not implemented yet.

I am looking for partners.

Common Screens

Dec 20, 2022, 11:30:57 AM
to Common Crawl
Amirouche,

I am looking to deploy a search engine built from Common Crawl WET files. We can be your partners for this effort; contact me at an...@dosvak.com.

Anil

Common Screens

Jan 8, 2023, 2:47:05 PM
to Common Crawl
I created PHP code to index WARC files into Solr in cloud mode. Solr can handle 2.5 billion documents per shard. A preview is available at https://visualsearch.org/. It will take approximately 6 months to index all 3.5 billion URLs. Check it out and let me know if you are interested in the code behind it.
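
As a rough back-of-the-envelope check of these figures, the sketch below takes the quoted per-shard capacity at face value and assumes six months is about 182 days. The function and parameter names are illustrative only.

```python
import math
from typing import Tuple


def indexing_estimate(total_docs: int, docs_per_shard: int, days: float) -> Tuple[int, float]:
    """Return (shards needed, sustained documents per second) for an indexing run."""
    shards = math.ceil(total_docs / docs_per_shard)
    docs_per_second = total_docs / (days * 24 * 60 * 60)
    return shards, docs_per_second


# Figures quoted in the post; 182 days for "6 months" is an assumption.
shards, rate = indexing_estimate(3_500_000_000, 2_500_000_000, days=182)
```

With these inputs the estimate comes to 2 shards and a sustained rate of a little over 220 documents per second.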