Which top level domains are crawled?


Alexander Czech

Feb 28, 2014, 5:06:50 AM
to common...@googlegroups.com
I'm planning a regionally restricted project (Austria) with the Common Crawl data, and it would be useful to know in advance how many URLs with the TLD .at have been crawled. I have already looked at the Summary of the 2012 Corpus, but the .at TLD is too insignificant to be mentioned in Sebastian Spiegler's paper. Is there a raw statistical dataset behind the summary available, so that I don't have to run Sebastian Spiegler's GitHub code?

Thanks in advance.

Jordan Mendelson

Mar 12, 2014, 5:41:16 PM
to common...@googlegroups.com
I believe the raw data for 2012 from Sebastian's paper is available on S3 at s3(n)://aws-publicdatasets/common-crawl/index2012
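
Once a list of crawled URLs is in hand (from that data or anywhere else), counting them per TLD is straightforward. Below is a minimal sketch, not tied to the actual layout of the 2012 index, which is not described in this thread. The names `tld_of` and `count_tlds` are made up for illustration, and the input is assumed to be plain absolute URLs with a scheme such as http://.

```python
from collections import Counter
from urllib.parse import urlsplit


def tld_of(url):
    """Return the last dot-separated label of the URL's host, or None.

    Hypothetical helper: assumes an absolute URL with a scheme.
    Hosts without a dot and IPv4 addresses yield None.
    """
    try:
        host = urlsplit(url).hostname  # already lowercased by urllib
    except ValueError:
        return None  # malformed URL, e.g. a broken IPv6 literal
    if not host:
        return None
    host = host.rstrip(".")  # tolerate a trailing root dot
    if "." not in host:
        return None
    label = host.rsplit(".", 1)[-1]
    if label.isdigit():
        return None  # last label of an IPv4 address, not a TLD
    return label


def count_tlds(urls):
    """Count URLs per TLD, skipping anything tld_of cannot parse."""
    counts = Counter()
    for url in urls:
        tld = tld_of(url.strip())
        if tld is not None:
            counts[tld] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "http://www.example.at/index.html",
        "https://Shop.Example.AT/cart",
        "http://example.com/",
        "http://192.168.0.1/admin",
        "not a url",
    ]
    for tld, n in count_tlds(sample).most_common():
        print(tld, n)
```

For a corpus-sized input the same per-URL function could be applied inside whatever batch job reads the data; the counting itself only needs one pass.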


Jordan

