Hi Azdine,
I understand. We'll do our best to extend the columnar index.
The ability to select by any field is its greatest advantage
over the CDX index, which always requires a URL prefix pattern
and can only filter by language in a secondary step.
By the way, we added the language annotation to the CDX index
in August 2018 and to the columnar index in September.
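To illustrate the difference, here is a minimal Python sketch of the two
query styles. It only builds the request URL and the SQL text and sends
nothing. The collection name CC-MAIN-2018-47, the pywb-style exact-match
filter "=languages:eng", the table name "ccindex"."ccindex" and the column
names crawl, subset and content_languages are assumptions based on the public
index documentation, not something stated in this thread. The two helper
function names are made up for the example.

```python
from urllib.parse import urlencode

# Assumed base URL of the CDX index server.
CDX_SERVER = "https://index.commoncrawl.org"


def cdx_query_url(collection, url_pattern, language=None):
    """Build a CDX query URL. A URL (prefix) pattern is always required;
    the language can only be applied as an additional filter."""
    params = [("url", url_pattern), ("output", "json")]
    if language is not None:
        # Assumed pywb-style filter: "=" requests an exact match on the field.
        params.append(("filter", "=languages:" + language))
    return f"{CDX_SERVER}/{collection}-index?{urlencode(params)}"


def columnar_language_sql(crawl, language, limit=10):
    """Build an SQL query for the columnar index that selects by language
    alone, without any URL pattern."""
    # Reject anything that is not a plain identifier, since the values are
    # interpolated into the SQL text.
    if not crawl.replace("-", "").isalnum() or not language.isalpha():
        raise ValueError("unexpected characters in crawl or language")
    return (
        "SELECT url, content_languages\n"
        'FROM "ccindex"."ccindex"\n'
        f"WHERE crawl = '{crawl}'\n"
        "  AND subset = 'warc'\n"
        # Matches only records annotated with exactly this one language.
        f"  AND content_languages = '{language}'\n"
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    print(cdx_query_url("CC-MAIN-2018-47", "example.com/*", "eng"))
    print(columnar_language_sql("CC-MAIN-2018-47", "eng"))
```

The first function cannot drop the url parameter, while the second one
has no URL condition at all. That is the difference described above.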
I'll also put adding language annotations to "historic" crawls
on the list, but that'll take more time.
Best,
Sebastian
On 11/25/18 10:43 PM, A C wrote:
> Many thanks for the swift answer, Sebastian. It's a bit harder for non-techies to get around the
> CDX clients :) I also find it very convenient to be able to filter based on language, for instance,
> when using the columnar index.
>
>
> On Sunday, November 25, 2018 at 8:54:38 PM UTC+1, Sebastian Nagel wrote:
>
> Hi Azdine,
>
> Yes, you're right. At present, only about 12 monthly crawls are contained in the columnar index.
> The reason was simply that we wanted to be sure the new index would actually be used.
> It looks like it has now been adopted well by various users. :)
>
> I'll put the inclusion of older crawls into the columnar index on the agenda
> and let you know about our decision soon. It'd also take some time to add the
> remaining data to the columnar index.
>
> Meanwhile you may use the index at https://index.commoncrawl.org/
> together with one of the CDX clients:
>
>   https://pypi.org/project/cdx-toolkit/
>   https://github.com/ikreymer/cdx-index-client
>
> It's easy to get the URL counts for specific domains/sites. You may also use
> the clients to query the Internet Archive which provides additional evidence
> in case our crawler may not have visited some of the sites.
>
> Best,
> Sebastian
>
> On 11/25/18 6:23 PM, A C wrote:
> > Hi,
> > First of all, thank you so much for such a brilliant initiative.
> >
> > I have been trying to request some data from the index file through AWS Athena (following this:
> > http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/). However, it
> > seems that only WARCs from 2017 and 2018 are available.
> >
> > Are there any other ways to access all the indices with the same ease?
> >
> > This is important for my use case, which is related to measuring the longevity of websites.
> >
> > Cheers,
> > Azdine
> >
> > --
> > You received this message because you are subscribed to the Google Groups "Common Crawl" group.
> > To unsubscribe from this group and stop receiving emails from it, send an email to
> > common-crawl...@googlegroups.com.
> > To post to this group, send email to common...@googlegroups.com.
> > Visit this group at https://groups.google.com/group/common-crawl.
> > For more options, visit https://groups.google.com/d/optout.