CC-MAIN-2017-30


al...@fromkyiv.com

Mar 1, 2018, 10:15:19 AM
to Common Crawl
Hi,

This index (CC-MAIN-2017-30) is not accessible. What is the reason?
Many thanks in advance!

Best regards,
Alex

Sebastian Nagel

Mar 1, 2018, 10:23:46 AM
to common...@googlegroups.com
Hi Alex,

Could you specify which service or data is not accessible?

Afaics, the URL index properly returns results:
http://index.commoncrawl.org/CC-MAIN-2017-30-index?url=commoncrawl.org&output=json
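For anyone who wants to script this check, here is a minimal sketch that builds the same query URL and parses the reply. It assumes the index server answers `output=json` with one JSON object per line; the function names and the `url`/`filename` fields read in the usage comment are illustrative assumptions, not something confirmed in this thread.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

INDEX_SERVER = "http://index.commoncrawl.org"


def build_index_query(crawl_id, url, output="json"):
    """Build the URL index query for one crawl, e.g. CC-MAIN-2017-30."""
    params = urlencode({"url": url, "output": output})
    return f"{INDEX_SERVER}/{crawl_id}-index?{params}"


def parse_index_response(body):
    """Parse a response body, assuming one JSON object per non-empty line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def fetch_index_records(crawl_id, url):
    """Query the index over HTTP (needs network access) and return the records."""
    with urlopen(build_index_query(crawl_id, url)) as response:
        return parse_index_response(response.read().decode("utf-8"))


# Usage (performs a real HTTP request, so it is left commented out):
# for record in fetch_index_records("CC-MAIN-2017-30", "commoncrawl.org"):
#     print(record.get("url"), record.get("filename"))
```

If this returns records, the index service itself is reachable, and the problem lies elsewhere (for example in direct access to the files on S3).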

Thanks,
Sebastian
> --
> You received this message because you are subscribed to the Google Groups "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to
> common-crawl...@googlegroups.com <mailto:common-crawl...@googlegroups.com>.
> To post to this group, send email to common...@googlegroups.com
> <mailto:common...@googlegroups.com>.
> Visit this group at https://groups.google.com/group/common-crawl.
> For more options, visit https://groups.google.com/d/optout.

al...@fromkyiv.com

Mar 1, 2018, 2:59:13 PM
to Common Crawl
Hi Sebastian,

When I try to access e.g. cc-index/cdx/CC-MAIN-2017-30/segments/1500549423183.57/crawldiagnostics/CC-MAIN-20170720121902-20170720141902-00000.cdx.gz via S3 (AWS CLI or Cloudberry Explorer), I get an Access Denied error.

Other indexes for 2017 are accessible this way.
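When comparing access across crawls, it can help to split such a key into its parts. The sketch below is a hypothetical helper that assumes the layout visible in the path above (`cc-index/cdx/<crawl>/segments/<segment>/<subset>/<file>`); the field names are my own labels, and the `commoncrawl` bucket name is taken from the `s3://commoncrawl/cc-index/cdx/` prefix mentioned elsewhere in this thread.

```python
from typing import NamedTuple


class CdxKey(NamedTuple):
    crawl_id: str   # e.g. "CC-MAIN-2017-30"
    segment: str    # e.g. "1500549423183.57"
    subset: str     # e.g. "crawldiagnostics"
    filename: str   # the per-WARC *.cdx.gz file name


def parse_cdx_key(key):
    """Split a per-WARC CDX key into crawl, segment, subset and file name."""
    parts = key.split("/")
    # Assumed layout: cc-index/cdx/<crawl>/segments/<segment>/<subset>/<file>
    if len(parts) != 7 or parts[:2] != ["cc-index", "cdx"] or parts[3] != "segments":
        raise ValueError(f"unexpected key layout: {key!r}")
    return CdxKey(parts[2], parts[4], parts[5], parts[6])


def s3_uri(key, bucket="commoncrawl"):
    """Return the s3:// URI for a key in the given bucket."""
    return f"s3://{bucket}/{key}"
```

Grouping failing keys by `crawl_id` makes it easy to see whether the Access Denied errors are limited to one crawl.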

Best regards, 


Sebastian Nagel

Mar 1, 2018, 4:09:12 PM
to common...@googlegroups.com
Hi Alex,

OK, you got me. :) The per-WARC CDX files in s3://commoncrawl/cc-index/cdx/
are temporary. For every WARC file a CDX file is generated, then all CDX files
are merged, sorted by SURT URL, and split again into 300 parts.
You can find more details in [1].
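The merge, sort and split steps can be pictured with a small in-memory sketch. This is not the actual Common Crawl pipeline: `simple_surt` is a simplified stand-in for real SURT canonicalization, the records are hypothetical dicts with only a `url` field, and the real job works on far more data than fits in one list.

```python
from urllib.parse import urlsplit


def simple_surt(url):
    """Simplified SURT-like sort key: reversed host labels, then path and query."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    key = ",".join(reversed(host.split("."))) + ")" + (parts.path or "/")
    if parts.query:
        key += "?" + parts.query
    return key


def merge_sort_split(per_warc_records, num_parts=300):
    """Merge per-WARC record lists, sort by SURT key, split into num_parts chunks."""
    merged = [rec for records in per_warc_records for rec in records]
    merged.sort(key=lambda rec: simple_surt(rec["url"]))
    total = len(merged)
    # Contiguous chunks of near-equal size; some are empty if total < num_parts.
    return [
        merged[i * total // num_parts:(i + 1) * total // num_parts]
        for i in range(num_parts)
    ]
```

Because each part covers a contiguous range of sorted keys, all captures of one domain end up next to each other, which the per-WARC files cannot offer.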

Since the cdx.gz files are temporary and have never been "officially"
released (cf. [2]), they're not included in the script which verifies
that the permissions are properly set.

The cdx.gz files of CC-MAIN-2017-30 are now readable, but I cannot
guarantee that for all cdx.gz files. We may also remove them in the future:
one URL index (either cdx.gz or cdx-*.gz) takes only around 0.4% of the
WARC volume, but keeping the same content twice isn't really necessary.
As of today there is even a third version [3].

May I ask what you are using the cdx.gz files for? Is there a reason
why the merged and sorted cdx-*.gz files are not suitable for your
use case? That may help us regarding the decision whether to remove
or keep them.

Thanks,
Sebastian


[1] https://groups.google.com/d/msg/common-crawl/PEIT5DBZyr0/eXE3W260AQAJ
[2] https://groups.google.com/d/topic/common-crawl/5Yvk6jFf65c/discussion
[3] http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/



al...@fromkyiv.com

Mar 5, 2018, 4:02:37 AM
to Common Crawl
Hi Sebastian,

My mistake, the merged files seem to suit my needs perfectly.
Thank you for the explanation!

Best regards,
Alex