URL list


nuli...@gmail.com

unread,
Mar 6, 2016, 2:33:44 AM3/6/16
to Common Crawl
Hi,
I just discovered this project and wanted to ask a question, if I may. Is the URL list of all crawled pages available for download? I won't be able to handle the full data set with my computing resources, but I could manage just the URLs. I saw the URL search API, but it requires a pattern, and I want the complete list. It shouldn't be that big, or should it?

Thanks in advance
Nulik

OneSpeedFast

unread,
Mar 6, 2016, 3:16:10 AM3/6/16
to common...@googlegroups.com
I have never seen it available for download anywhere; I was also interested. Instead I ran all 150 TB of the latest crawl through a few of my servers and extracted hostnames. I will see if I can make it available somehow after some further processing. Currently I'm checking every root domain for an A record.


Tom Morris

unread,
Mar 6, 2016, 11:09:39 AM3/6/16
to common...@googlegroups.com
On Sun, Mar 6, 2016 at 2:33 AM, <nuli...@gmail.com> wrote:
I just discovered this project and wanted to ask a question, if I may.

Welcome! Of course you may ask questions.
 
Is the URL list of all crawled pages available for download?

The smallest pre-processed artefact currently available is the URL index, which runs a little over 100 GB. The URL list is probably 10-15 GB compressed, depending on whether you include counts, duplicates, etc., and can easily be created from the full index.
 
I won't be able to handle the full data set with my computing resources

Don't forget that you can also use AWS to process things. For pennies an hour, you get high bandwidth access to the index (and crawl) along with virtually unlimited processing power.

Tom 

Tom Morris

unread,
Mar 6, 2016, 11:24:53 AM3/6/16
to common...@googlegroups.com
On Sun, Mar 6, 2016 at 3:15 AM, OneSpeedFast <onespe...@gmail.com> wrote:
I have never seen it available for download anywhere; I was also interested. Instead I ran all 150 TB of the latest crawl through a few of my servers and extracted hostnames.

If all you wanted was host names, you could have reduced the amount of data you needed to process by several orders of magnitude. The latest crawl URL index is only 106 GB:

    $ aws s3 --no-sign-request ls s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2016-07/indexes/ --human-readable --summarize

I haven't looked at the latest crawl, but the November 2015 crawl had 25.8M hostnames, which rolled up into 17.3M pay-level domains (as computed using the Public Suffix List).

Top 20 PLDs in the CC-MAIN-2015-48 crawl by page count along with their Alexa ranks and page counts:

CC#  Alexa#  PLD                  Count
1    7       wikipedia.org        15210451
2    615     urbandictionary.com   9745333
3    180     stackexchange.com     9692651
4    ?       wordpress.com         8006693  (sub-domains ranked individually by Alexa)
5    1237    mlb.com               6216408
6    98      wikia.com             3932537
7    5       yahoo.com             3732758
8    7369    oclc.org              3638297
9    191     tripadvisor.com       2996572
10   1       google.com            2897723
11   113972  scribdassets.com      2871431
12   1215    cbslocal.com          2655388
13   514     photobucket.com       2425730
14   2313    rivals.com            2412046
15   240     deviantart.com        2306039
16   853     wiktionary.org        2004200
17   76      go.com                1990645
18   559     hotels.com            1888442
19   3070    flightaware.com       1856720
20   ?       typepad.com           1757766  (sub-domains ranked individually by Alexa)


 

Dominik Stadler

unread,
Mar 6, 2016, 3:09:40 PM3/6/16
to common...@googlegroups.com
Hi,

I was looking for the same thing a while ago; there does not seem to be a separate list of URLs available as part of CommonCrawl itself.

What I ended up doing is using the raw data of the URL index (which is considerably smaller than the full crawl and thus can be downloaded to a local machine) and extracting the URLs from there. I published the code for this at https://github.com/centic9/CommonCrawlDocumentDownload; look at https://github.com/centic9/CommonCrawlDocumentDownload/blob/master/src/main/java/org/dstadler/commoncrawl/index/DownloadURLIndex.java which retrieves URLs and writes the ones it finds to a JSON file. By default I only extract URLs for some file types; look at the classes "Extension" and "MimeTypes" for how to adjust it to download all file types.

If you also want to download the files themselves, there is
https://github.com/centic9/CommonCrawlDocumentDownload/tree/master/src/main/java/org/dstadler/commoncrawl/index
which expects a list of files in the JSON format created by "DownloadURLIndex".

Thanks... Dominik.


Ivan Habernal

unread,
Mar 12, 2016, 3:28:02 PM3/12/16
to Common Crawl
Hi Nulik,

we added URL extraction to the C4CorpusTools project, see

https://github.com/dkpro/dkpro-c4corpus#list-of-urls-from-commoncrawl

I ran it over the latest CommonCrawl and the extracted URLs (25 GB) are available in our S3 public bucket:

s3://ukp-research-data/c4corpus/common-crawl-full-url-list-2016-07
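
To see what is in the bucket with the AWS CLI, something along these lines should work (a rough, untested sketch; if the bucket is configured as requester-pays you may need to add --request-payer requester and use your own credentials):

$ aws s3 ls s3://ukp-research-data/c4corpus/common-crawl-full-url-list-2016-07/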

Hope it helps!

Best,

Ivan

On Sunday, March 6, 2016 at 8:33:44 AM UTC+1, nuli...@gmail.com wrote:

Andrew Berezovskyi

unread,
Mar 14, 2016, 9:26:23 AM3/14/16
to Common Crawl
Hello Ivan, thank you very much for your work!

Julie Sanje

unread,
May 1, 2016, 1:48:19 AM5/1/16
to Common Crawl
Hi Ivan, do you have an HTTP version of this to download? I'm pretty new to Common Crawl and Amazon S3. I've tried for hours to download from the public bucket but with no luck, and I don't see any good tutorials anywhere.

Thanks in advance

Julie

Andrew Berezovskyi

unread,
May 1, 2016, 6:50:46 AM5/1/16
to common...@googlegroups.com
Hi Julie,

You must have an AWS account to download the files. That way you can be charged for the traffic (so that those who provide the dataset won't be). Search for "AWS Free Tier".

You can start a server in AWS. If you do this in the right region, you will not be charged for any traffic. Use a tool like s3cmd for the download, and look into Spot Instances. You can also run Windows Server if you don't fancy the command line.

If you want to download the files to your own machine, be ready to pay around $0.08/GB. Use a tool like S3Browser for the download.
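
For example, from an instance in the right region, something like this should work as a rough, untested sketch (the prefix is from Ivan's message above; depending on how the bucket is set up you may need the --requester-pays flag):

$ s3cmd get --recursive --requester-pays s3://ukp-research-data/c4corpus/common-crawl-full-url-list-2016-07/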

/Andrew

Sent from my phone

Ivan Habernal

unread,
May 6, 2016, 3:26:50 AM5/6/16
to Common Crawl
Hi Juli,

unfortunately not, due to the transfer costs, as mentioned by Andrew. But you might have a look at our documentation for C4Corpus, which also describes how to run a simple free-tier AWS server and access/download any data publicly available on S3:

https://zoidberg.ukp.informatik.tu-darmstadt.de/jenkins/job/DKPro%20C4Corpus/org.dkpro.c4corpus$dkpro-c4corpus-doc/doclinks/1/

Beware of the transfer costs: you must run your instance in us-east-1 (Virginia), because that is where CommonCrawl and C4Corpus are located; otherwise standard fees for transfer between AWS regions apply.

Hope it helps,

Ivan

Julie Sanje

unread,
Jun 2, 2016, 7:42:44 PM6/2/16
to Common Crawl
Hi,

Thanks for the reply. I eventually paid a freelancer to try to access the bucket, but he advised me that the bucket was not public and that I would need some secret key from you. Is this true? Or does the guy just not know what he is doing?

Thanks

Andrew Berezovskyi

unread,
Jun 2, 2016, 8:20:24 PM6/2/16
to common...@googlegroups.com
Hi Julie,

The freelancer is right. Be sure to use IAM to limit his keys to S3 operations. And remember to revoke them after the job is done.

– Andrew

Tom Morris

unread,
Jun 2, 2016, 8:38:26 PM6/2/16
to common...@googlegroups.com
On Thu, Jun 2, 2016 at 7:42 PM, Julie Sanje <in...@sanje-angie.com> wrote:

Thanks for the reply. I eventually paid a freelancer to try to access the bucket, but he advised me that the bucket was not public and that I would need some secret key from you. Is this true? Or does the guy just not know what he is doing?

Your freelancer doesn't know what he's doing. I just checked, and the bucket is still available. At $0.09/GB, the 24 GB full URL list would cost you between two and three dollars to download.

Alternatively, you could do what I suggested back in March and download the ~85 GB CommonCrawl index from the free CommonCrawl bucket. This command downloads the first 1/300th of the index, extracts its 4.75 million URLs, and saves them locally, all in just 2 minutes:

$ aws --no-sign-request s3 cp s3://commoncrawl/cc-index/collections/CC-MAIN-2016-18/indexes/cdx-00000.gz - | gunzip | cut -d ' ' -f 3-999 | jq -r .url | gzip > cc-urls-00000.gz

Repeat for each of the other 299 files and you'll have the full list. 
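
If you'd rather script it than repeat the command by hand, here's an untested sketch of the loop (assuming the shards are numbered cdx-00000.gz through cdx-00299.gz, like the first one):

$ for i in $(seq -f "%05g" 0 299); do
    aws --no-sign-request s3 cp s3://commoncrawl/cc-index/collections/CC-MAIN-2016-18/indexes/cdx-$i.gz - \
      | gunzip | cut -d ' ' -f 3-999 | jq -r .url | gzip > cc-urls-$i.gz
  done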

Tom


Andrew Berezovskyi

unread,
Jun 2, 2016, 8:43:36 PM6/2/16
to common...@googlegroups.com
My apologies, I actually misread the part about the key. But the freelancer will still need a pair of AWS access keys (from you, Julie, not from somebody else) to access the "requester pays" S3 buckets.

– Andrew


Andrew Berezovskyi

unread,
Jun 2, 2016, 8:49:25 PM6/2/16
to Common Crawl, and...@berezovskyi.me
Or you may use the solution from Tom, which seems to work like a charm. I didn't know about the --no-sign-request flag before. Thank you, Tom!

Jay Glasgow

unread,
May 15, 2017, 10:12:28 AM5/15/17
to Common Crawl, dominik...@gmail.com
Dominik,

Thanks so much for your super code! It works rather nicely and is fairly approachable.

We are attempting to modify your code so that we can narrow the search a little. We had some success removing all document types/MIME types and going with just ".pdf", but now we would like to add specific search words for the URL, such as "TV" or "Insurance", so we can build a list of URLs for PDFs that feature something about "TV" or "Insurance" (as examples only).

Which files/lines in your code would we need to modify to keep only the URLs that match certain keywords before their extensions, so that the resulting commoncrawl-CC-MAIN-2017-13 file is smaller and more specific?

We are currently reducing the output file in an intermediate step, and from there the rest of your code works brilliantly! We would just like to remove that intermediate step and start with a smaller file, if that's possible. Can you please give us a little guidance?

=Jay