Inverted indices for Common Crawl data

Aline Bessa

Apr 27, 2015, 2:03:55 PM
to common...@googlegroups.com
Hi folks,

I am now playing with the CC-MAIN-2015-11 dump. I would love to be able to get all the URLs that have the word "bicycle", for example. Is there an inverted index available for this kind of stuff or do I need to download the data and build it?

Cheers! 

Tom Morris

Apr 27, 2015, 3:02:28 PM
to common...@googlegroups.com
On Mon, Apr 27, 2015 at 2:03 PM, Aline Bessa <ali...@gmail.com> wrote:

I am now playing with the CC-MAIN-2015-11 dump. I would love to be able to get all the URLs that have the word "bicycle", for example. Is there an inverted index available for this kind of stuff or do I need to download the data and build it?

The only index is the prefix index using inverted host/domain name pieces (i.e. en.wikipedia.org/wiki/ becomes, effectively, org.wikipedia.en/wiki).  You can get prefix matches for the inverted form, but nothing else is indexed.
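
In code, the inversion is roughly the following sketch (illustrative only; the actual index key format and separators may differ):

    # Rough sketch of the reversed-host key scheme (Python 3).
    from urllib.parse import urlparse

    def reversed_host_key(url):
        """Turn 'http://en.wikipedia.org/wiki/Bicycle' into
        'org.wikipedia.en/wiki/Bicycle'."""
        parsed = urlparse(url)
        host = parsed.netloc.split(":")[0]          # drop any port
        return ".".join(reversed(host.split("."))) + parsed.path

    key = reversed_host_key("http://en.wikipedia.org/wiki/Bicycle")
    print(key)                                      # org.wikipedia.en/wiki/Bicycle
    print(key.startswith("org.wikipedia.en/wiki"))  # prefix match -> True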

As an aside, how would you want to see the URL tokenized for an inverted index?  A simple punctuation-based tokenizer could easily miss things like goodbicycles.com or other concatenated variants.

You wouldn't need to download the entire crawl to do arbitrary pattern matching on the URLs -- you'd just need the index, or, better yet, do the processing directly on AWS.  The program that I posted the other day could be tweaked to generate a list of matches for something like $0.10-$0.15 in EC2 compute time.

Tom

Laura Dietz

Apr 27, 2015, 3:06:04 PM
to common...@googlegroups.com
Tom,

I am guessing that Aline is looking for an inverted index of the contents of the pages the URLs point to, not of the URLs themselves.

I would personally also be very interested in an inverted index of the contents, although I see that storing an inverted index of a 500 TB collection is challenging.

I was, however, thinking of trying to create an inverted index of WordNet words and Wikipedia titles for some portion of the CC. Is anyone already working on this?

Cheers,
Laura

Aline Bessa

Apr 27, 2015, 4:02:30 PM
to common...@googlegroups.com
Hi Laura,

Yes, this is what I was talking about. It is a pity that it doesn't exist yet, but I understand.

Thanks,

Aline Bessa

Apr 27, 2015, 4:31:27 PM
to common...@googlegroups.com
Is it at least possible to download a fair sample of Common Crawl's data? 100,000 pages or 1,000,000 pages would be excellent...

Laura Dietz

Apr 27, 2015, 4:37:23 PM
to common...@googlegroups.com
Aline,

Yes, you can download segments of the CC through Amazon's S3 service. While S3 usually costs money (unless you are lucky enough to have an AWS scholarship), you can create an AWS account for free, which comes with a small allowance, so you can look at a small subset of the data before spending money.

However, getting a fair sample of a graph (such as hyperlinked web pages) is a really hard problem. To make it worse, words are power-law distributed, which means that while stopwords are very frequent, most 'useful' words are rare and their counts are heavily skewed across an arbitrary sample of pages.

If you are looking for a smaller corpus that is carefully engineered to be mostly spam-free, I recommend ClueWeb12 Category B.

Cheers,
Laura


Aline Bessa

Apr 27, 2015, 5:23:14 PM
to common...@googlegroups.com
Thanks, Laura. I bought ClueWeb but it is taking time to arrive here. :-(

Stephen Merity

Apr 27, 2015, 6:39:40 PM
to common...@googlegroups.com
Hi everyone =]

First, I'd just reiterate that you can download all the files from Common Crawl completely for free. We've had universities download entire crawl archives to their local clusters for zero cost. No need for a credit card!

By prepending https://aws-publicdatasets.s3.amazonaws.com/ to the paths listed in warc.paths.gz / wat.paths.gz / wet.paths.gz, you can download the files like normal files over HTTP with zero authentication. You don't even need to create an AWS account if you'd prefer not to.
If you access the files via S3 credentials, there are still no bandwidth costs, but you will need to set up an AWS account.
These features are part of Amazon Public Data Sets, for which we thank them thoroughly. Using a cluster on EC2 is still the fastest way to process a crawl archive, given that the data sits right next to the servers, but free HTTP access makes local exploration or downloading to a local cluster easy.
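
A minimal sketch of that HTTP download path (Python; it assumes you already have wet.paths.gz for the crawl on disk, with one relative path per line):

    import gzip
    import urllib.request

    BASE = "https://aws-publicdatasets.s3.amazonaws.com/"

    # wet.paths.gz lists the relative paths of all WET files for a crawl.
    with gzip.open("wet.paths.gz", "rt") as f:
        paths = [line.strip() for line in f if line.strip()]

    # Fetch the first WET file over plain HTTPS; no AWS account needed.
    first_url = BASE + paths[0]
    urllib.request.urlretrieve(first_url, "sample.warc.wet.gz")
    print("downloaded", first_url)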

The idea of an inverted index for Common Crawl is an interesting one we've discussed many times. As already mentioned, the question is how it would be generated, stored and accessed. Inverted indexes, for reasonable query speed, usually require the index to be stored in memory or on SSD, both of which result in ongoing costs. We'd love to see an inverted index / search tool created for Common Crawl as a proof of concept (and indeed it would be quite doable with modern distributed search tools and a cluster of spot instances), but extending it so that it can remain a resource indefinitely without costing vast sums of money is an open problem.

If you have a small set of keywords that you're interested in, particularly as a corpus for NLP or similar, you could either (a) run a MapReduce job over all the files for filtering or (b) process the WAT files and only check for the existence of the keyword in the title of the HTML page. With some optimizations, (b) can be tremendously efficient - instead of decoding the JSON for each entry, you could just check for the existence of "bicycles" in the raw JSON text and decode the JSON for proper examination only if it's found. This way you don't spend time decoding JSON unless there's a chance your target token is in there. From the JSON entries, you can then retrieve the original HTML content from the WARC files, as the WAT files contain the offset and length of the specific gzip entry.
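
A rough sketch of approach (b) follows; it assumes each WAT JSON envelope sits on a single line of the gzipped file, and the nested field names are approximate, so check them against a real WAT record (case folding is omitted for brevity):

    import gzip
    import json

    def scan_wat(path, keyword="bicycle"):
        """Yield (url, title) pairs for WAT records whose HTML title
        contains the keyword."""
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                # Cheap pre-filter: skip the JSON parse entirely unless the
                # raw text contains the keyword somewhere.
                if keyword not in line or not line.startswith("{"):
                    continue
                try:
                    envelope = json.loads(line)
                except ValueError:
                    continue
                head = (envelope.get("Envelope", {})
                                .get("Payload-Metadata", {})
                                .get("HTTP-Response-Metadata", {})
                                .get("HTML-Metadata", {})
                                .get("Head", {}))
                title = head.get("Title", "")
                if keyword in title:
                    url = (envelope.get("Envelope", {})
                                   .get("WARC-Header-Metadata", {})
                                   .get("WARC-Target-URI", ""))
                    yield url, title

    for url, title in scan_wat("sample.warc.wat.gz"):
        print(url, "|", title)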

I'll reply to the idea of a small sampling of Common Crawl on the other email thread.

--
Regards,
Stephen Merity
Data Scientist @ Common Crawl

Laura Dietz

Apr 27, 2015, 7:15:34 PM
to common...@googlegroups.com
Stephen,


On 04/27/2015 06:39 PM, Stephen Merity wrote:
First, I'd just reiterate that you can download all the files from Common Crawl completely for free.

Thanks so much for the reiteration; I am ashamed to admit that I was completely unaware.


The idea of an inverted index for Common Crawl is an interesting one we've discussed many times. As already mentioned, the question is how it would be generated, stored and accessed. Inverted indexes, for reasonable query speed, usually require the index to be stored in memory or on SSD, both of which result in ongoing costs.

I would not be too concerned about having the inverted index on 'spinning rust' - I use Galago with indexes of a couple of terabytes, which use a disk-based B+ tree data structure and support on-the-fly merge sort.

I agree that the S3 setup is not ideal for making use of disk-based random-access data structures. The EBS service also does not seem to be drastically cheaper than SSD storage for an EC2 instance.

Galago supports a web-server mode with a JSON interface, but it would still need to run somewhere. And with 500 TB of compressed content, I fear the index would be of the same order of magnitude.

My approach would be to come up with something limited, but better than nothing. One idea is to limit the indexing vocabulary, e.g. to WordNet (minus very common words) plus Wikipedia titles. While my work gets a lot of mileage out of positional indices, maybe frequency counts or even just a Boolean index might have to do.
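
For concreteness, a minimal sketch of such a restricted Boolean index (the vocabulary and the tokenizer here are placeholders, not an actual implementation):

    import re
    from collections import defaultdict

    # Placeholder vocabulary: in practice this would be WordNet terms plus
    # Wikipedia titles loaded from files.
    VOCAB = {"bicycle", "cycling", "helmet"}

    TOKEN_RE = re.compile(r"[a-z0-9]+")

    def build_boolean_index(docs):
        """docs: iterable of (doc_id, text). Returns {term: set(doc_ids)},
        keeping only terms from the restricted vocabulary."""
        index = defaultdict(set)
        for doc_id, text in docs:
            for token in TOKEN_RE.findall(text.lower()):
                if token in VOCAB:
                    index[token].add(doc_id)
        return index

    # Toy usage:
    docs = [("url1", "Best bicycle helmets of 2015"),
            ("url2", "Cooking with cast iron")]
    index = build_boolean_index(docs)
    print(sorted(index["bicycle"]))   # ['url1']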


Cheers,
Laura

Aline Bessa

Apr 27, 2015, 7:54:45 PM
to common...@googlegroups.com


On Monday, April 27, 2015 at 6:39:40 PM UTC-4, Stephen Merity wrote:
If you have a small set of keywords that you're interested in, particularly as a corpus for NLP or similar, you could either (a) run a MapReduce job over all the files for filtering or (b) process the WAT files and only check for the existence of the keyword in the title of the HTML page. With some optimizations, (b) can be tremendously efficient - instead of decoding the JSON for each entry, you could just check for the existence of "bicycles" in the raw JSON text and decode the JSON for proper examination only if it's found. This way you don't spend time decoding JSON unless there's a chance your target token is in there. From the JSON entries, you can then retrieve the original HTML content from the WARC files, as the WAT files contain the offset and length of the specific gzip entry.


Hi. As for option (a), can I run the MapReduce job directly over the files in Common Crawl's S3 bucket on AWS, or do I have to download and uncompress the data first? Could you give me some additional hints on this?

Tom Morris

Apr 28, 2015, 1:28:14 PM
to common...@googlegroups.com
Hi Aline,

On Mon, Apr 27, 2015 at 7:54 PM, Aline Bessa <ali...@gmail.com> wrote:
Hi. As for option (a), can I run the MapReduce job directly over the files in Common Crawl's S3 bucket on AWS, or do I have to download and uncompress the data first? Could you give me some additional hints on this?

The best way to do this is to run a program on Amazon's AWS using the Common Crawl S3 bucket as input.  Depending on whether you prefer Java or Python, you could consider something like WDC Framework (Java) or Ilya's index generation code (Python) as a starting point.  There are also some examples & tutorials linked from the CC site (although some of them may not be updated for the most recent format).


If you are only interested in certain TLDs or PLDs (pay-level domains), particularly if you're excluding the ubiquitous .com TLD, you may want to start with the URL index rather than processing all WARCs or WETs.
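
As a sketch, filtering the URL index for a single TLD might look like the following (the file name is a placeholder, and it assumes index lines begin with a reversed, comma-separated host key such as 'edu,cmu)/...'; verify against the actual index format):

    import gzip

    def urls_in_tld(index_path, tld="edu"):
        """Yield index lines whose reversed host key starts with the given TLD."""
        prefix = tld + ","
        with gzip.open(index_path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                if line.startswith(prefix):
                    yield line.rstrip("\n")

    # Placeholder index file name; substitute a real URL index shard.
    for line in urls_in_tld("cdx-00000.gz", "edu"):
        print(line)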

While downloading the data is free and AWS processing isn't, the data is too big to download feasibly (unless you just want a few segments), and using spot instances costs only pennies an hour on EC2.

Tom