Are all the crawls available through Athena's table?

A C

Nov 25, 2018, 12:23:52 PM
to Common Crawl
Hi,
First of all, thank you so much for such a brilliant initiative.

I have been trying to request some data from the index through AWS Athena (following this: http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/); however, it seems that only WARCs from 2017 and 2018 are available.

Are there any other ways to access all the indices with the same ease?

This is important for my use case, which is about measuring the longevity of websites.
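
One way to check which crawls the table covers is to list its partitions. The following is only a minimal sketch, assuming the table was created as in the linked blog post (named `ccindex` and partitioned by `crawl` and `subset`) and that the output of Athena's `SHOW PARTITIONS` statement has been saved as text lines of the form `crawl=.../subset=...`. The helper name and the sample lines are illustrative, not real query output.

```python
# Statement to run in the Athena console (table name is an assumption
# taken from the blog post setup).
SHOW_PARTITIONS_SQL = "SHOW PARTITIONS ccindex"


def crawls_from_partitions(lines):
    """Return the sorted crawl IDs found in Hive-style partition lines."""
    crawls = set()
    for line in lines:
        # Each line looks like "crawl=<id>/subset=<name>".
        for part in line.strip().split("/"):
            key, _, value = part.partition("=")
            if key == "crawl" and value:
                crawls.add(value)
    return sorted(crawls)


# Illustrative sample lines, not real query output.
sample = [
    "crawl=CC-MAIN-2018-43/subset=warc",
    "crawl=CC-MAIN-2018-43/subset=robotstxt",
    "crawl=CC-MAIN-2018-47/subset=warc",
]
print(crawls_from_partitions(sample))
```

The earliest and latest crawl IDs in the result show the time range the table currently covers.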

Cheers,
Azdine

Sebastian Nagel

Nov 25, 2018, 2:54:38 PM
to common...@googlegroups.com
Hi Azdine,

Yes, you're right. At present, only about 12 monthly crawls are contained in the columnar index. The
reason was simply that we wanted to be sure that the new index actually gets used.
It looks like it has now been adopted well by various users. :)

I'll put the inclusion of older crawls into the columnar index on the agenda
and let you know about our decision soon. It would also take some time to add the
remaining data to the columnar index.

Meanwhile you may use the index at https://index.commoncrawl.org/
together with one of the CDX clients:
https://pypi.org/project/cdx-toolkit/
https://github.com/ikreymer/cdx-index-client
It's easy to get the URL counts for specific domains/sites. You may also use
the clients to query the Internet Archive, which provides additional evidence
in case our crawler has not visited some of the sites.
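
As a rough illustration of what such a CDX lookup involves, here is a minimal sketch using only the Python standard library. It assumes the index server accepts queries of the form `<crawl>-index?url=...&output=json` and returns one JSON object per line with a 14-digit `timestamp` field; the crawl ID, the helper names, and the sample records are examples only. The actual HTTP request is left out, and in practice a client such as cdx-toolkit takes care of paging and retries.

```python
import json
from collections import Counter
from urllib.parse import urlencode

# Assumption: endpoint pattern of the index server; the crawl ID used
# below is only an example.
INDEX_SERVER = "https://index.commoncrawl.org"


def cdx_query_url(crawl_id, url_pattern):
    """Build a CDX query URL for one crawl and a URL prefix pattern."""
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_SERVER}/{crawl_id}-index?{params}"


def captures_per_year(json_lines):
    """Count captures per year from CDX JSON lines (one object per line)."""
    counts = Counter()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Timestamps have the form YYYYMMDDhhmmss.
        counts[record["timestamp"][:4]] += 1
    return dict(counts)


print(cdx_query_url("CC-MAIN-2018-47", "example.com/*"))

# Illustrative records, not real index output.
sample = [
    '{"url": "http://example.com/", "timestamp": "20171120101500", "status": "200"}',
    '{"url": "http://example.com/a", "timestamp": "20181115083000", "status": "200"}',
    '{"url": "http://example.com/b", "timestamp": "20181116090000", "status": "404"}',
]
print(captures_per_year(sample))
```

For a longevity study, the years (or crawls) in which a site has at least one capture give a first approximation of when it was alive.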

Best,
Sebastian
> --
> You received this message because you are subscribed to the Google Groups "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to
> common-crawl...@googlegroups.com <mailto:common-crawl...@googlegroups.com>.
> To post to this group, send email to common...@googlegroups.com
> <mailto:common...@googlegroups.com>.
> Visit this group at https://groups.google.com/group/common-crawl.
> For more options, visit https://groups.google.com/d/optout.

A C

Nov 25, 2018, 4:43:41 PM
to Common Crawl
Many thanks for the swift answer, Sebastian. It's a bit harder for non-techies to get around the CDX clients :) I also find it very convenient to be able to filter by language, for instance, when using the columnar index.

Sebastian Nagel

Nov 26, 2018, 12:01:25 PM
to common...@googlegroups.com
Hi Azdine,

I understand, we'll do our best to extend the columnar index.
The ability to select by any field is its greatest advantage
over the CDX index, which always requires a URL prefix pattern
and can only filter by language in a secondary step.
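
To make the "secondary step" concrete, here is a minimal sketch of filtering CDX records by language after they have been fetched for a URL prefix. It assumes each record is a dict whose optional `languages` field holds comma-separated ISO-639-3 codes; the field name, the helper name, and the sample records are assumptions for illustration. Records from crawls without language annotation simply lack the field.

```python
def filter_by_language(records, lang):
    """Keep the CDX records whose 'languages' field lists the given code."""
    kept = []
    for record in records:
        # Records from crawls without language annotation have no such field.
        languages = record.get("languages", "")
        if lang in languages.split(","):
            kept.append(record)
    return kept


# Illustrative records, not real index output.
records = [
    {"url": "http://example.com/", "languages": "eng"},
    {"url": "http://example.com/fr/", "languages": "fra,eng"},
    {"url": "http://example.com/old/"},  # no language annotation
]
for record in filter_by_language(records, "fra"):
    print(record["url"])
```

With the columnar index, the same restriction becomes a condition in the query itself, so no URL prefix is needed first.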

By the way, we added the language annotation to the CDX index
in August 2018 and to the columnar index in September.
I'll also put adding language annotations to "historic" crawls
on the list, but that'll take more time.

Best,
Sebastian

