I believe the raw data for 2012 from Sebastian's paper is available on S3 at s3(n)://aws-publicdatasets/common-crawl/index2012
Jordan
On Feb 28, 2014, at 2:06 AM, Alexander Czech <alexand...@googlemail.com> wrote:
> I'm planning a regionally restricted project (Austria) with the Common Crawl data, and it would be useful to know in advance how many URLs with the .at TLD were crawled. I have already looked at the summary of the 2012 corpus, but the .at TLD is too insignificant to be mentioned in Sebastian Spiegler's paper. Is there a raw statistical dataset behind the summary available, so that I don't have to run Sebastian Spiegler's GitHub code?
>
> Thanks in advance.