Getting all URLs for the Jan 2015 or Feb 2015 crawl


Aline Bessa

Apr 22, 2015, 9:42:27 AM
to common...@googlegroups.com
Hi all,

Is it possible to have access to all URLs in the Jan or Feb 2015 crawls? I want to sample them.

Thanks for all the help!

Tom Morris

Apr 22, 2015, 4:30:38 PM
to common...@googlegroups.com
If you don't care about the content type or other metadata associated with the URL, you should be able to generate a list from the index files pretty easily. It looks like the index for the latest crawl isn't world-readable (anyone know why?), but the previous one is.
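
(One way to see which index collections exist is simply to list the bucket anonymously; whether a given collection is actually readable shows up when you try to copy a shard from it, as below.)

    # List the index collections anonymously; a permissions error on a later
    # copy is what tells you a particular collection isn't world-readable.
    aws s3 ls --no-sign-request s3://aws-publicdatasets/common-crawl/cc-index/collections/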

A crude, brute-force way of doing this would be to run the following, or equivalent, for each of the 300 index files (~400 MB each):

    $ aws s3 cp --no-sign-request s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-11/indexes/cdx-00000.gz .
    $ zcat cdx-00000.gz | cut -f 4 -d '"' | gzip > cdx-00000-urls.gz

If you've got a good internet connection, you could use wget or curl and just pipe everything together, saving the ~120 GB of temporary space.
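
For the whole index, a minimal sketch (assuming the shards are named cdx-00000.gz through cdx-00299.gz, as the 300-file count above suggests) is just to wrap those two steps in a loop:

    # Sketch: repeat the copy + extract steps for every shard.
    # Assumes ~300 shards named cdx-00000.gz .. cdx-00299.gz and needs ~400 MB
    # of temporary space at a time; the URL is the 4th double-quote-delimited field.
    for i in $(seq -f "%05g" 0 299); do
        aws s3 cp --no-sign-request \
            "s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-11/indexes/cdx-$i.gz" .
        zcat "cdx-$i.gz" | cut -f 4 -d '"' | gzip > "cdx-$i-urls.gz"
        rm "cdx-$i.gz"
    done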


If you want to filter the list of URLs to only those which are HTML pages, or by some other criterion that requires the metadata, you'll need to process all the WAT files.  This is more like 10 TB instead of 120 GB, so you'll definitely want to do the processing within AWS.

You'll need to parse the included JSON to get the URI and any other metadata that you are interested in.
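
As a very rough sketch (not a full WAT reader): each WAT metadata record's payload is a single JSON line, so a crude shell pass with jq could look like the following. The path placeholder and the Envelope/WARC-Header-Metadata field names are assumptions about the WAT layout, so check one record by hand before relying on them.

    # Crude sketch: pull target URIs out of one WAT file with jq.
    # WAT_PATH is a placeholder -- substitute a real path from the crawl's WAT file listing.
    # Field names (Envelope -> WARC-Header-Metadata -> WARC-Target-URI) are assumed.
    WAT_PATH="common-crawl/crawl-data/CC-MAIN-2015-11/segments/<segment>/wat/<file>.warc.wat.gz"
    aws s3 cp --no-sign-request "s3://aws-publicdatasets/$WAT_PATH" - \
      | gunzip \
      | grep '^{' \
      | jq -r '.Envelope."WARC-Header-Metadata"."WARC-Target-URI" // empty'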

Alternatively, if you can make do with the 2014 crawl, you can download a 20 GB list of its URLs from the WebDataCommons project.

If you need fresher data before the WebDataCommons folks re-run their page-graph analysis (they seem to be focusing on microdata now), you could reuse their extraction software.

Hope that gives you some starting points to work with.

Tom



ikre...@gmail.com

Apr 22, 2015, 5:14:12 PM
to common...@googlegroups.com
Hi,

Sorry for the slow response, I've been pretty busy these days. The latest index was still finishing up. I've made sure it's public and updated now, and it's available at index.commoncrawl.org.

I've been thinking of adding a sampling API to the CDX server, but haven't had a chance to do so.

Does anyone have any suggestions what that might look like?

It's not hard to do, especially starting from the secondary index file, which has roughly every 3000th URL in it. (This file is available at s3://aws-publicdatasets/common-crawl/cc-index/collections/[CRAWL]/indexes/cluster.idx for each crawl so far [CC-MAIN-2015-06, CC-MAIN-2015-11, CC-MAIN-2015-14].)

You could get some basic data just by sampling lines from that file as well.
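
For example, a rough sketch of sampling that file for the most recent crawl (the step of 100 is arbitrary; bucket path as above):

    # Sketch: cluster.idx is plain text and relatively small, so just stream it
    # and keep every 100th line as a quick-and-dirty sample.
    aws s3 cp --no-sign-request \
        s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-14/indexes/cluster.idx - \
      | awk 'NR % 100 == 1'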

My thought was to add something like this to the query API: ...CC-MAIN-2015-14-index?url=*&sample=N&offset=K&line=L

which would divide the secondary index into N even parts, add an offset of K to that, and read the L-th line within that block.

The url could also be an exact/prefix/domain as in other queries.

This is just an off-the-top-of-my-head idea; any more specific suggestions for what you'd like to see as far as sampling goes are welcome!

Ilya

Tom Morris

Apr 23, 2015, 2:48:44 PM
to common...@googlegroups.com
On Wed, Apr 22, 2015 at 7:14 PM, Aline Bessa <ali...@gmail.com> wrote:

Thanks for the replies. I would like to know how many pages are available in the index. I thought that performing a query like "end", as long as there are no limits on the retrieved results, would give me a good approximation of that, assuming that only HTML files are available in the index. This is not true, though, right?

The index includes all URLs which were crawled, not just HTML pages. The total number of URLs is usually included as part of the crawl announcement. For example, the February 2015 announcement says it contains 1.9 billion pages.

Even though all MIME types are included, I think you'll find the vast majority of the pages are HTML. I sampled the first 3000 URLs in each of the 300 shards for the latest index and found over 97% were HTML. Here's the breakdown of the most common types (a rough sketch of this kind of tally follows the table):

879113 text/html
  5038 text/xml
  4828 application/pdf
  4013 application/rss+xml
  2802 unk
  1286 text/plain
   881 application/atom+xml
   729 image/jpeg
   343 text/calendar
   333 unknown/unknown
   126 application/octet-stream
   101 image/jpg
    77 application/xml
    50 application/msword
    27 application/x-tex
    24 image/gif
    22 application/vnd.openxmlformats-officedocument.word
    18 Application/RSS+XML
    18 application/vnd.ms-powerpoint
    16 application/x-sh
    10 application/x-zip
     9 application/vnd.ms-excel
     9 application/xhtml+xml
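
For reference, a rough sketch of how such a tally can be gathered (not necessarily the exact commands used; it assumes the index JSON carries a "mime" field):

    # Sketch: first 3000 index lines of each shard, tallied by the "mime" field
    # in the JSON portion of each line (field name assumed from the index format).
    for i in $(seq -f "%05g" 0 299); do
        aws s3 cp --no-sign-request \
            "s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-14/indexes/cdx-$i.gz" - \
          | gunzip | head -n 3000
    done \
      | grep -o '"mime": *"[^"]*"' | cut -d '"' -f 4 \
      | sort | uniq -c | sort -rn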


Tom Morris

Apr 25, 2015, 2:58:03 PM
to common...@googlegroups.com
An update on sampling using Unix command line tools...

On Wed, Apr 22, 2015 at 4:30 PM, Tom Morris <tfmo...@gmail.com> wrote:
If you don't care about the content type or other metadata associated with the URL, you should be able to generate a list from the index files pretty easily. 
... 
If you've got a good internet connection, you could use wget or curl and just pipe everything together, saving the ~120 GB of temporary space.

Actually it looks like awscli knows how to output to stdout now too.  Here's a one-liner which will generate a pseudo-random ~1% sample of the first shard in the March 2015 index.  It runs in under 3 minutes on my laptop (using ~25 Mb/s download bandwidth).

    time aws s3 cp --no-sign-request s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-14/indexes/cdx-00000.gz - | gunzip | cut -f 4 -d '"' | awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01) print $0}' | gzip > cdx-00000-url-1pct.txt.gz

You can adjust this by changing the probability threshold from .01 to something else or by tossing a grep stage into the pipe to filter on URLs matching certain patterns.
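
For instance, one way to do that (an untested variation on the one-liner above; the .edu pattern is just an example) would be:

    # Variation: add a grep stage to keep only URLs matching a pattern
    # (here anything containing ".edu/") before sampling.
    aws s3 cp --no-sign-request s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-14/indexes/cdx-00000.gz - \
      | gunzip | cut -f 4 -d '"' \
      | grep '\.edu/' \
      | awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01) print $0 }' \
      | gzip > cdx-00000-edu-1pct.txt.gz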

Somewhat surprisingly, it is only twice as fast (1.5 minutes) when running on an EC2 c4.large instance, where CPU time for the cut command dominates (but, of course, you can run multiple shards in parallel there to take advantage of the extra bandwidth).

Dominik Stadler

May 1, 2015, 10:44:20 AM
to common...@googlegroups.com
Hi,

That is interesting; it seems the newer crawls do not include as many other file types as before. The older URL index from around 2012/2013 seems to include a huge number of documents. A rough calculation that I did for documents that are useful for mass-testing Apache POI (mostly Microsoft Office documents, no PDFs) looks like this, i.e. more than 8 million files:

* Size of overall URL index: 233,689,120,776 bytes, i.e. ~217 GiB
* Header: 6 bytes
* Index blocks: 2,644
* Block size: 65,536 bytes
* => Data blocks: 3,563,169
* Approx. files per block: 2.421275
* Resulting approx. number of files: 8,627,412
* Avg. size per file: 221,613 bytes
* Needed storage: 1,911,954,989,425 bytes = ~1.7 TiB!
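
For reference, the arithmetic above can be reproduced roughly with a one-off awk script (all constants are copied from the list; small rounding differences against the figures above are expected):

    # Rough check of the numbers above; constants are taken from the list.
    awk 'BEGIN {
        total = 233689120776; header = 6; blocksize = 65536; indexblocks = 2644;
        datablocks = int((total - header) / blocksize) - indexblocks;  # ~3563169
        files = datablocks * 2.421275;                                 # ~8627412
        bytes = files * 221613;                                        # ~1.9e12
        printf "data blocks: %d\nfiles: %d\nstorage: %.0f bytes (~%.2f TiB)\n",
               datablocks, files, bytes, bytes / (1024 ^ 4);
    }'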

Thanks... Dominik.