I want to download the .com, .net, .org, and .gov files from Common Crawl

Bhavana

Jul 12, 2016, 1:12:31 PM
to Common Crawl
Hello Everyone,

I am new to Common Crawl and need to download the .com, .net, .org, and .gov files from Common Crawl and load them into Elasticsearch so that I can search by title or by any content. Can you please suggest a procedure for downloading the files from Common Crawl?

Thanks,
Bhavana.

Sebastian Nagel

Jul 12, 2016, 4:00:18 PM
to common...@googlegroups.com
Hi Bhavana,


> load in to elastic search for searching based on title name or any content
Have a look at Common Search and its Git repositories:
  https://about.commonsearch.org/
  https://github.com/commonsearch
That's an excellent example of how to perform the desired task,
plus some additional work, mainly to improve the ranking of search results.

> download .com, .net, .org, .gov
That would be about 80% of the Common Crawl data.
It doesn't make sense to pre-filter the data in that case.
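
To make concrete what "pre-filtering" by top-level domain would mean, here is a minimal sketch in Python. It is an illustration only, not part of Common Crawl's or Common Search's tooling: the names `WANTED_TLDS`, `has_wanted_tld`, and `filter_urls` are hypothetical, and the sketch assumes you already have a list of page URLs in hand. Since such a filter would keep roughly 80% of the data, it would save little.

```python
from urllib.parse import urlsplit

# The four top-level domains named in the question (hypothetical constant name).
WANTED_TLDS = {"com", "net", "org", "gov"}


def has_wanted_tld(url, wanted=WANTED_TLDS):
    """Return True if the URL's host name ends in one of the wanted TLDs."""
    # urlsplit().hostname is lowercased and excludes the port;
    # it is None when the string has no scheme/host part.
    host = urlsplit(url).hostname
    if not host:
        return False
    # Take the label after the last dot, ignoring a trailing root dot.
    tld = host.rstrip(".").rsplit(".", 1)[-1]
    return tld in wanted


def filter_urls(urls, wanted=WANTED_TLDS):
    """Keep only the URLs whose host is under one of the wanted TLDs."""
    return [u for u in urls if has_wanted_tld(u, wanted)]
```

For example, `filter_urls(["http://example.com/", "http://example.co.uk/"])` keeps only the first URL.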

Best,
Sebastian



Bhavana

Jul 12, 2016, 4:36:54 PM
to Common Crawl
Hello Sebastian,

Thanks for the reply.

I have gone through the Common Search Git repositories, but I didn't understand how the data is downloaded from Common Crawl.

Can you please describe a procedure for downloading the Common Crawl data? How much space will it take if I download it? I want to store the data in Elasticsearch so that I can search it.

Thanks,
Bhavana.