I want to get all of the world's active website URLs


kaleem....@gmail.com

Jan 17, 2018, 8:22:44 AM
to Common Crawl
I want to get the URLs of all active websites in the world, and also the links to all web pages of each website.

Sebastian Nagel

Jan 17, 2018, 8:31:58 AM
to common...@googlegroups.com
Hi,

Common Crawl provides sample snapshots of the web - 3 billion pages every month
from 20+ million domains or 50+ million hosts/sites.

Sebastian


On 01/17/2018 02:22 PM, kaleem....@gmail.com wrote:
> I want to get the URLs of all active websites in the world, and also the links to all web pages of each website.
>
> --
> You received this message because you are subscribed to the Google Groups "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to
> common-crawl...@googlegroups.com <mailto:common-crawl...@googlegroups.com>.
> To post to this group, send email to common...@googlegroups.com
> <mailto:common...@googlegroups.com>.
> Visit this group at https://groups.google.com/group/common-crawl.
> For more options, visit https://groups.google.com/d/optout.

Vallabh Kansagara

Jan 17, 2018, 10:23:22 PM
to Common Crawl
You may like my project; see https://github.com/vrkansagara/common-crawler

Tom Morris

Jan 18, 2018, 1:51:20 AM
to common...@googlegroups.com
On Wed, Jan 17, 2018 at 10:23 PM, Vallabh Kansagara <vrkan...@gmail.com> wrote:
> You may like my project, common-crawler

Could you explain a little bit more about what your project is and how it relates to CommonCrawl, which it seems to be attempting to associate itself with?

Tom 

kaleem asad

Jan 18, 2018, 2:46:12 AM
to common...@googlegroups.com
I want to get all domain names so that I can analyze what types of domain names people like: for example, which characters are used most often, and which characters are most common in the first 3 letters of domain names.


kaleem asad

Jan 18, 2018, 4:33:53 AM
to common...@googlegroups.com
I want to get all domain names so that I can analyze what types of domain names people like: for example, which characters are used most often, and which characters are most common in the first 3 letters of domain names.

Ivan Habernal

Jan 18, 2018, 4:41:15 AM
to Common Crawl
Dear kaleem,

> I want to get all domain names so that I can analyze what types of domain names people like: for example, which characters are used most often, and which characters are most common in the first 3 letters of domain names.

Welcome to the discussion forum about CommonCrawl. To answer your question, please have a first look here - the most helpful webpage to start with on the Internet: http://www.catb.org/%7Eesr/faqs/smart-questions.html

Once you finish that, you might have a look at the official tutorials at the CommonCrawl github page - they should definitely give you a good starting point. We also extracted URLs from CommonCrawl here: https://zoidberg.ukp.informatik.tu-darmstadt.de/jenkins/job/DKPro%20C4Corpus/org.dkpro.c4corpus$dkpro-c4corpus-doc/doclinks/1/#_list_of_urls_from_commoncrawl

Best,

Ivan

kaleem asad

Jan 18, 2018, 4:43:20 AM
to common...@googlegroups.com
thanks


Sebastian Nagel

Jan 18, 2018, 4:47:58 AM
to common...@googlegroups.com
If it's only about domain names, I would recommend the domain-level webgraph and rankings:
http://commoncrawl.org/2017/11/host-and-domain-level-web-graphs-augseptoct-2017/
There are 93 million domain names listed.

Sebastian
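
For the kind of analysis asked about in this thread (most-used characters, most common first 3 letters), a minimal Python sketch might look like the following. It assumes the domain names have already been extracted from the downloaded files into plain strings such as "example.com"; the function name domain_stats and the sample names are made up for illustration, and only the first label of each name (the part before the first dot) is counted.

```python
from collections import Counter


def domain_stats(domains):
    """Count characters and 3-letter prefixes over the first label of each domain.

    `domains` is assumed to be an iterable of plain names like "example.com".
    """
    chars = Counter()
    prefixes = Counter()
    for domain in domains:
        # Keep only the part before the first dot, e.g. "example" from "example.com".
        label = domain.strip().lower().split(".")[0]
        if not label:
            continue
        chars.update(label)            # one count per character in the label
        if len(label) >= 3:
            prefixes[label[:3]] += 1   # one count per domain for its 3-letter prefix
    return chars, prefixes


if __name__ == "__main__":
    # Hypothetical sample input, not real Common Crawl data.
    sample = ["example.com", "exchange.org", "excel.net", "data.co.uk"]
    chars, prefixes = domain_stats(sample)
    print(chars.most_common(3))
    print(prefixes.most_common(3))
```

Because the counters are updated one name at a time, the same function can be fed a generator that reads a large file line by line instead of holding every name in memory.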



Vallabh Kansagara

Jan 18, 2018, 9:19:45 AM
to Common Crawl

It fetches the Common Crawl index from the server and stores it in your local database for future crawling. It uses various CDX API filters to help crawl a particular page and index.

It's quite easy to get what you want from some of the pages. It will take a long time to crawl the whole index, but it can save money and time in my case. That's why I started developing this.

For more, please keep following the project. Thanks.

Tom Morris

Jan 18, 2018, 9:32:26 AM
to common...@googlegroups.com
On Thu, Jan 18, 2018 at 9:19 AM, Vallabh Kansagara <vrkan...@gmail.com> wrote:

> It fetches the Common Crawl index from the server and stores it in your local database for future crawling. It uses various CDX API filters to help crawl a particular page and index.

Using the Common Crawl Index service for bulk access (e.g. *.co.uk) is an abuse of the service that will negatively affect casual interactive users. It's also slow.

You should be downloading the index files (listed in e.g. https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-51/cc-index.paths.gz) and accessing them locally.

Tom
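
As a rough illustration of the download-and-process-locally approach described above, the Python sketch below turns the gzipped path listing into full download URLs. It assumes that each line of cc-index.paths.gz is a path relative to https://commoncrawl.s3.amazonaws.com/ (the host used in the link above); the function name index_file_urls and the sample path in the demo are illustrative only, not taken from the real listing.

```python
import gzip
import io

# Assumption: paths in the listing are relative to this bucket root.
BASE_URL = "https://commoncrawl.s3.amazonaws.com/"


def index_file_urls(paths_gz_bytes, base_url=BASE_URL):
    """Return one full URL per non-empty line of a gzipped path listing."""
    with gzip.open(io.BytesIO(paths_gz_bytes), "rt", encoding="utf-8") as fh:
        return [base_url + line.strip() for line in fh if line.strip()]


if __name__ == "__main__":
    # Hypothetical listing built in memory so the demo needs no network access.
    listing = gzip.compress(
        b"cc-index/collections/CC-MAIN-2017-51/indexes/cdx-00000.gz\n"
    )
    for url in index_file_urls(listing):
        print(url)
```

In practice the bytes would come from downloading the cc-index.paths.gz file once (for example with urllib.request.urlopen(...).read()), after which each listed index file can be fetched a single time and queried locally instead of through the index service.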
 


Vallabh Kansagara

Jan 25, 2018, 2:21:40 PM
to Common Crawl
What exactly does this mean for me?
