I am now playing with the CC-MAIN-2015-11 dump. I would love to be able to get all the URLs that have the word "bicycle", for example. Is there an inverted index available for this kind of thing, or do I need to download the data and build it myself?
Hi everyone =]

First, I'd just reiterate that you can download all the files from Common Crawl completely for free. We've had universities download entire crawl archives to their local clusters for zero cost. No need for a credit card!

By prepending https://aws-publicdatasets.s3.amazonaws.com/ to the paths listed in warc.paths.gz / wat.paths.gz / wet.paths.gz, you can download them like normal files with zero authentication. You don't even need to create an AWS account if you'd prefer not to. If you access the files via S3 credentials, there are still no bandwidth costs, but you will need to set up an AWS account. These features are part of Amazon Public Data Sets, for which we thank them thoroughly. Using a cluster on EC2 is still the fastest way to process a crawl archive, given the data sits right next to the servers, but this setup also makes local exploration, or downloading to a local cluster, easy.

The idea of an inverted index for Common Crawl is an interesting one we've discussed many times. As already mentioned, the question is how it would be generated, stored and accessed. For reasonable query speed, an inverted index usually needs to live in memory or on SSD, and both come with ongoing costs. We'd love to see an inverted index / search tool created for Common Crawl as a proof of concept (and indeed it would be quite doable with modern distributed search tools and a cluster of spot instances), but extending that so it remains a resource indefinitely without costing vast sums of money is an open problem.

If you have a small set of keywords you're interested in, particularly as a corpus for NLP or similar, you could either (a) run a MapReduce job over all the files for filtering, or (b) process the WAT files and only check for the keyword in the title of the HTML page. With some optimizations, (b) can be tremendously efficient: instead of decoding the JSON for each entry, you can first check whether "bicycles" appears anywhere in the raw JSON text and only decode the JSON for proper examination if it does. That way you don't spend time decoding JSON unless there's a chance your target token is in there. From the matching JSON entries you can then retrieve the original HTML content from the WARC files, since the WAT records contain the offset and length of the corresponding gzip entry.
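For approach (b), here is a minimal Python sketch of the "check the raw JSON before decoding it" idea. It assumes the requests library, the public-data-set base URL mentioned above, and WAT field names (Envelope/Container/Filename, Offset, Gzip-Metadata/Deflate-Length) recalled from the 2015-era WAT layout, so treat the exact keys and the example path as assumptions to verify against a real record:

# Sketch only: stream one WAT file over plain HTTPS, skip any record whose
# raw bytes don't mention the keyword, decode the JSON for the survivors,
# then pull the original HTML out of the WARC file with an HTTP Range
# request. Key names and the example path are assumptions to verify.
import gzip
import json

import requests  # third-party: pip install requests

BASE = "https://aws-publicdatasets.s3.amazonaws.com/"
KEYWORD = b"bicycle"


def matching_records(wat_path):
    """Yield decoded WAT JSON records whose raw line contains KEYWORD."""
    resp = requests.get(BASE + wat_path, stream=True)
    resp.raise_for_status()
    with gzip.open(resp.raw) as lines:
        for line in lines:
            # Cheap pre-filter: only JSON payload lines start with '{',
            # and we only pay for json.loads() when the keyword is present.
            if not line.startswith(b"{") or KEYWORD not in line:
                continue
            yield json.loads(line)


def fetch_warc_record(record):
    """Range-request the single gzipped WARC record referenced by a WAT
    entry and return its decompressed bytes (WARC + HTTP headers + HTML)."""
    container = record["Envelope"]["Container"]   # assumed key names
    warc_path = container["Filename"]             # may need the crawl segment prefix
    offset = int(container["Offset"])
    length = int(container["Gzip-Metadata"]["Deflate-Length"])
    headers = {"Range": "bytes=%d-%d" % (offset, offset + length - 1)}
    resp = requests.get(BASE + warc_path, headers=headers)
    resp.raise_for_status()
    return gzip.decompress(resp.content)


if __name__ == "__main__":
    # Hypothetical entry taken from wat.paths.gz for CC-MAIN-2015-11.
    example_wat = ("common-crawl/crawl-data/CC-MAIN-2015-11/"
                   "segments/.../wat/....warc.wat.gz")
    for rec in matching_records(example_wat):
        meta = rec["Envelope"]["WARC-Header-Metadata"]
        print(meta.get("WARC-Target-URI"))

The same BASE + path trick is what lets you fetch any of the WARC/WAT/WET files with plain HTTP tools such as wget or curl, with no AWS account at all.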
On Monday, April 27, 2015, at 18:39:40 UTC-4, Stephen Merity wrote:
> If you have a small set of keywords you're interested in, particularly as a corpus for NLP or similar, you could either (a) run a MapReduce job over all the files for filtering, or (b) process the WAT files and only check for the keyword in the title of the HTML page.

Hi. As for option (a), can I run the MapReduce job over the files on Common Crawl's AWS storage, or do I have to download the data and uncompress it first? Could you give me some additional hints on this?