Getting all URLs for the Jan 2015 or Feb 2015 crawl


Aline Bessa

Apr 22, 2015, 9:42:27 AM
to common...@googlegroups.com
Hi all,

Is it possible to have access to all URLs in the Jan or Feb 2015 crawls? I want to sample them.

Thanks for all the help!

Tom Morris

Apr 22, 2015, 4:30:38 PM
to common...@googlegroups.com
If you don't care about the content type or other metadata associated with the URL, you should be able to generate a list from the index files pretty easily. It looks like the index for the latest crawl isn't world-readable (anyone know why?), but the previous one is.
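
(One way to see which index collections exist is simply to list the bucket anonymously; whether a given collection is actually readable shows up when you try to copy a shard from it, as below.)

    # List the index collections anonymously; a permissions error on a later
    # copy is what tells you a particular collection isn't world-readable.
    aws s3 ls --no-sign-request s3://aws-publicdatasets/common-crawl/cc-index/collections/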

A crude, brute-force way of doing this would be to run the following, or equivalent, for each of the 300 index files (~400 MB each):

    $ aws s3 cp --no-sign-request s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-11/indexes/cdx-00000.gz .
    $ zcat cdx-00000.gz | cut -f 4 -d '"' | gzip > cdx-00000-urls.gz

If you've got a good internet connection, you could use wget or curl and just pipe everything together, saving the ~120 GB of temporary space.
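
For the whole index, a minimal sketch (assuming the shards are named cdx-00000.gz through cdx-00299.gz, as the 300-file count above suggests) is just to wrap those two steps in a loop:

    # Sketch: repeat the copy + extract steps for every shard.
    # Assumes ~300 shards named cdx-00000.gz .. cdx-00299.gz and needs ~400 MB
    # of temporary space at a time; the URL is the 4th double-quote-delimited field.
    for i in $(seq -f "%05g" 0 299); do
        aws s3 cp --no-sign-request \
            "s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-11/indexes/cdx-$i.gz" .
        zcat "cdx-$i.gz" | cut -f 4 -d '"' | gzip > "cdx-$i-urls.gz"
        rm "cdx-$i.gz"
    done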


If you want to filter the list of URLs to only those which are HTML pages, or by some other criterion that requires the metadata, you'll need to process all the WAT files.  This is more like 10 TB instead of 120 GB, so you'll definitely want to do the processing within AWS.

You'll need to parse the included JSON to get the URI and any other metadata that you are interested in.
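
As a very rough sketch (not a full WAT reader): each WAT metadata record's payload is a single JSON line, so a crude shell pass with jq could look like the following. The path placeholder and the Envelope/WARC-Header-Metadata field names are assumptions about the WAT layout, so check one record by hand before relying on them.

    # Crude sketch: pull target URIs out of one WAT file with jq.
    # WAT_PATH is a placeholder -- substitute a real path from the crawl's WAT file listing.
    # Field names (Envelope -> WARC-Header-Metadata -> WARC-Target-URI) are assumed.
    WAT_PATH="common-crawl/crawl-data/CC-MAIN-2015-11/segments/<segment>/wat/<file>.warc.wat.gz"
    aws s3 cp --no-sign-request "s3://aws-publicdatasets/$WAT_PATH" - \
      | gunzip \
      | grep '^{' \
      | jq -r '.Envelope."WARC-Header-Metadata"."WARC-Target-URI" // empty'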

Alternatively, if you can make do with the 2014 crawl, you can download a 20 GB list of its URLs from the WebDataCommons project.

If you need fresher data before the WebDataCommons folks re-run their page-graph analysis (they seem to be focusing on microdata now), you could reuse their extraction software.

Hope that gives you some starting points to work with.

Tom



ikre...@gmail.com

Apr 22, 2015, 5:14:12 PM
to common...@googlegroups.com
Hi,

Sorry for the slow response, I've been pretty busy these days. The latest index was still finishing up. I've made sure it's public and updated now, and it's available at index.commoncrawl.org.

I've been thinking of adding a sampling API to the CDX server, but haven't had a chance to do so.

Does anyone have any suggestions what that might look like?

It's not hard to do, especially starting from the secondary index file, which has roughly every 3000th URL in it. (This file is available at s3://aws-publicdatasets/common-crawl/cc-index/collections/[CRAWL]/indexes/cluster.idx for each crawl so far [CC-MAIN-2015-06, CC-MAIN-2015-11, CC-MAIN-2015-14].)

You could get some basic data just by sampling lines from that file as well.
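
For example, a rough sketch of sampling that file for the most recent crawl (the step of 100 is arbitrary; bucket path as above):

    # Sketch: cluster.idx is plain text and relatively small, so just stream it
    # and keep every 100th line as a quick-and-dirty sample.
    aws s3 cp --no-sign-request \
        s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-14/indexes/cluster.idx - \
      | awk 'NR % 100 == 1'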

My thought was to add something like this to the query API: ...CC-MAIN-2015-14-index?url=*&sample=N&offset=K&line=L

which would divide the secondary index into N even parts, add an offset of K to that, and read the L-th line within that block.

The url could also be an exact/prefix/domain as in other queries.

This is just an off-the-top-of-my-head idea; any more specific suggestions for what you'd like to see as far as sampling goes are welcome!

Ilya

Tom Morris

Apr 23, 2015, 2:48:44 PM
to common...@googlegroups.com
On Wed, Apr 22, 2015 at 7:14 PM, Aline Bessa <ali...@gmail.com> wrote:

Thanks for the replies. I would like to know how many pages are available in the index. I thought that performing a query like "end", as long as there are no limits on the retrieved results, would give me a good approximation of that, assuming that only HTML files are available in the index. This is not true, though, right?

The index includes all URLs which were crawled, not just HTML pages. The total number of URLs is usually included as part of the crawl announcement. For example, the February 2015 announcement says it contains 1.9 billion pages.

Even though all MIME types are included, I think you'll find the vast majority of the pages are HTML. I sampled the first 3000 URLs in each of the 300 shards for the latest index and found over 97% were HTML. Here's the breakdown of the most common types (a rough sketch of this kind of tally follows the table):

879113 text/html
  5038 text/xml
  4828 application/pdf
  4013 application/rss+xml
  2802 unk
  1286 text/plain
   881 application/atom+xml
   729 image/jpeg
   343 text/calendar
   333 unknown/unknown
   126 application/octet-stream
   101 image/jpg
    77 application/xml
    50 application/msword
    27 application/x-tex
    24 image/gif
    22 application/vnd.openxmlformats-officedocument.word
    18 Application/RSS+XML
    18 application/vnd.ms-powerpoint
    16 application/x-sh
    10 application/x-zip
     9 application/vnd.ms-excel
     9 application/xhtml+xml
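
For reference, a rough sketch of how such a tally can be gathered (not necessarily the exact commands used; it assumes the index JSON carries a "mime" field):

    # Sketch: first 3000 index lines of each shard, tallied by the "mime" field
    # in the JSON portion of each line (field name assumed from the index format).
    for i in $(seq -f "%05g" 0 299); do
        aws s3 cp --no-sign-request \
            "s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-14/indexes/cdx-$i.gz" - \
          | gunzip | head -n 3000
    done \
      | grep -o '"mime": *"[^"]*"' | cut -d '"' -f 4 \
      | sort | uniq -c | sort -rn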


Tom Morris

Apr 25, 2015, 2:58:03 PM
to common...@googlegroups.com
An update on sampling using Unix command line tools...

On Wed, Apr 22, 2015 at 4:30 PM, Tom Morris <tfmo...@gmail.com> wrote:
If you don't care about the content type or other metadata associated with the URL, you should be able to generate a list from the index files pretty easily. 
... 
If you've got a good internet connection, you could use wget or curl and just pipe everything together, saving the ~120 GB of temporary space.

Actually it looks like awscli knows how to output to stdout now too.  Here's a one-liner which will generate a pseudo-random ~1% sample of the first shard in the March 2015 index.  It runs in under 3 minutes on my laptop (using ~25 Mb/s download bandwidth).

    time aws s3 cp --no-sign-request s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-14/indexes/cdx-00000.gz - | gunzip | cut -f 4 -d '"' | awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01) print $0}' | gzip > cdx-00000-url-1pct.txt.gz

You can adjust this by changing the probability threshold from .01 to something else or by tossing a grep stage into the pipe to filter on URLs matching certain patterns.
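
For instance, one way to do that (an untested variation on the one-liner above; the .edu pattern is just an example) would be:

    # Variation: add a grep stage to keep only URLs matching a pattern
    # (here anything containing ".edu/") before sampling.
    aws s3 cp --no-sign-request s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-14/indexes/cdx-00000.gz - \
      | gunzip | cut -f 4 -d '"' \
      | grep '\.edu/' \
      | awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01) print $0 }' \
      | gzip > cdx-00000-edu-1pct.txt.gz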

Somewhat surprisingly, it is only twice as fast (1.5 minutes) when running on an EC2 c4.large instance, where CPU time for the cut command dominates (but, of course, you can run multiple shards in parallel there to take advantage of the extra bandwidth).

Dominik Stadler

May 1, 2015, 10:44:20 AM
to common...@googlegroups.com
Hi,

That is interesting; it seems the newer crawls do not include as many other file types as before. The older URL index from around 2012/2013 seems to include a huge number of documents. A rough calculation that I did for documents that are useful for mass-testing Apache POI (mostly Microsoft Office documents, no PDFs) looks like this, i.e. more than 8 million files:

* Size of overall URL index: 233,689,120,776 bytes, i.e. ~217 GiB
* Header: 6 bytes
* Index blocks: 2,644
* Block size: 65,536 bytes
* => Data blocks: 3,563,169
* Approx. files per block: 2.421275
* Resulting approx. number of files: 8,627,412
* Avg. size per file: 221,613 bytes
* Needed storage: 1,911,954,989,425 bytes = ~1.7 TiB!
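
For reference, the arithmetic above can be reproduced roughly with a one-off awk script (all constants are copied from the list; small rounding differences against the figures above are expected):

    # Rough check of the numbers above; constants are taken from the list.
    awk 'BEGIN {
        total = 233689120776; header = 6; blocksize = 65536; indexblocks = 2644;
        datablocks = int((total - header) / blocksize) - indexblocks;  # ~3563169
        files = datablocks * 2.421275;                                 # ~8627412
        bytes = files * 221613;                                        # ~1.9e12
        printf "data blocks: %d\nfiles: %d\nstorage: %.0f bytes (~%.2f TiB)\n",
               datablocks, files, bytes, bytes / (1024 ^ 4);
    }'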

Thanks... Dominik.