I wanted to download .com, .net, .org, .gov files of common crawl.

Bhavana

Jul 12, 2016, 1:12:31 PM

to Common Crawl

Hello Everyone,

I am new to Common Crawl and need to download the .com, .net, .org, and .gov files from Common Crawl and load them into Elasticsearch, so that I can search by title or by any content. Can you please suggest a procedure for downloading the files from Common Crawl?

Thanks,

Bhavana.

Sebastian Nagel

Jul 12, 2016, 4:00:18 PM

to common...@googlegroups.com

Hi Bhavana,

> load in to elastic search for searching based on title name or any content

Have a look at Common Search and their git repositories
https://about.commonsearch.org/
https://github.com/commonsearch

That's an excellent example of how to perform the desired task, plus some additional work, mainly to improve the ranking of search results.

> download .com, .net, .org, .gov

That would be about 80% of the Common Crawl data.

It doesn't make sense to pre-filter the data in that case.
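
To illustrate what such a pre-filter would amount to, here is a minimal sketch, assuming the filter is applied to the URL of each crawled page. The function name `has_wanted_tld` and the sample URLs are hypothetical; this is not code from Common Crawl or Common Search.

```python
from urllib.parse import urlsplit

# Assumption: the four top-level domains named in the question.
WANTED_TLDS = ("com", "net", "org", "gov")


def has_wanted_tld(url, tlds=WANTED_TLDS):
    """Return True if the URL's host name ends in one of the given TLDs."""
    host = urlsplit(url).hostname  # None when the string has no host part
    if not host:
        return False
    # Drop a trailing dot (fully qualified form) and compare the last label.
    last_label = host.rstrip(".").rsplit(".", 1)[-1]
    return last_label in tlds


if __name__ == "__main__":
    for sample in ("http://www.example.com/page", "http://example.co.uk/"):
        print(sample, has_wanted_tld(sample))
```

Since roughly 80% of the pages would pass such a check, it saves little compared with processing everything.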

Best,

Sebastian

Bhavana

Jul 12, 2016, 4:36:54 PM

to Common Crawl

Hello Sebastian,

Thanks for the reply.

I have gone through the Common Search Git repositories, but I didn't understand how the data is downloaded from Common Crawl.

Can you please describe a procedure for downloading the Common Crawl data? How much space will it take if I download it? I want to store the data in Elasticsearch so that I can search it.