Enabling CORS on the AWS CommonCrawl buckets?


Nick Mitchell

Apr 24, 2021, 9:10:41 PM
to Common Crawl
Howdy!

I had an idea to enable read-only browsing of the CommonCrawl S3 directory structure within a web page. Everything seems fine, except that the CommonCrawl S3 bucket does not have a CORS policy.

What would you think about establishing one? The data is already public; this proposal only covers allowing web apps to read directly from the CommonCrawl bucket, without the need for an intermediate proxy.
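
For concreteness, a read-only policy could look roughly like the sketch below. The `CORSRules` shape follows the JSON accepted by the AWS CLI's `put-bucket-cors` command; the helper name and the specific values (exposed headers, max age, the example origin) are assumptions for illustration, not anything the Common Crawl team has proposed.

```python
import json


def read_only_cors_config(allowed_origins, max_age_seconds=3000):
    """Build an S3-style CORS configuration that permits only read requests.

    Hypothetical helper: the values chosen here are illustrative assumptions.
    """
    return {
        "CORSRules": [
            {
                # Origins whose pages may read responses from the bucket.
                "AllowedOrigins": list(allowed_origins),
                # Read-only: no PUT, POST, or DELETE.
                "AllowedMethods": ["GET", "HEAD"],
                # Let browsers send any request header (e.g. Range).
                "AllowedHeaders": ["*"],
                # Response headers that page scripts are allowed to read.
                "ExposeHeaders": ["ETag"],
                # How long a browser may cache the preflight result.
                "MaxAgeSeconds": max_age_seconds,
            }
        ]
    }


if __name__ == "__main__":
    # "https://browser.example" is a placeholder origin.
    print(json.dumps(read_only_cors_config(["https://browser.example"]), indent=2))
```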

Thoughts? Thanks!
Nick
@starpit

Sebastian Nagel

Apr 26, 2021, 4:56:37 AM
to common...@googlegroups.com
Hi Nick,

this topic was discussed two years ago, see
https://groups.google.com/g/common-crawl/c/O6fluDTW9PM/m/fvZYW946DAAJ

We've decided not to enable CORS because we might run into some unforeseen issues
given that the content is not screened for unwanted (malicious or otherwise ugly) material.
In general, we assume that our users are aware of this fact, have the required technical
background to understand the risks and, last-but-not-least, have read and agreed to our ToU.

Of course, your use case is definitely a good and valid one. But with a wildcard CORS policy
I see no way to allow the "good" webapps and block the "bad" ones. So far...
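
To illustrate the concern, here is a simplified model of how a server decides which `Access-Control-Allow-Origin` value to send. It is a sketch under stated assumptions, not S3's actual matching logic, and the function name and origins are hypothetical. With a wildcard entry, every requesting origin gets the same answer, so the policy cannot tell one web app from another.

```python
from typing import Optional, Sequence


def allow_origin_header(request_origin: str,
                        allowed_origins: Sequence[str]) -> Optional[str]:
    """Return the Access-Control-Allow-Origin value, or None to withhold it.

    Simplified model: exact-match origins plus a bare "*" wildcard only.
    """
    if "*" in allowed_origins:
        # Wildcard: any origin is accepted, no questions asked.
        return "*"
    if request_origin in allowed_origins:
        # Explicit entry: echo the matching origin back.
        return request_origin
    # No match: the browser will block the page from reading the response.
    return None


if __name__ == "__main__":
    for origin in ("https://browser.example", "https://unknown.example"):
        print(origin, "->", allow_origin_header(origin, ["*"]))
```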

Best,
Sebastian
> --
> You received this message because you are subscribed to the Google Groups "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com
> <mailto:common-crawl...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/common-crawl/6487271e-3dda-4925-a729-2aadad5454b8n%40googlegroups.com
> <https://groups.google.com/d/msgid/common-crawl/6487271e-3dda-4925-a729-2aadad5454b8n%40googlegroups.com?utm_medium=email&utm_source=footer>.

Ed Summers

Apr 26, 2021, 5:10:40 AM
to Sebastian Nagel, common...@googlegroups.com
I'm not disagreeing with the decision regarding malicious content at all,
but after reading the previous thread I was curious to know if Amazon was
contacted and if they had an opinion or policy regarding setting CORS
headers for AWS Public Dataset S3 buckets?


Sebastian Nagel

Apr 26, 2021, 5:37:01 AM
to common...@googlegroups.com
Hi Ed,

no, we didn't contact the AWS Open Data Set team regarding this question.
Just discussed it internally and put it aside because there was no urgent need
(Colin was ok to use a proxy).

Best,
Sebastian

Nick Mitchell

Apr 26, 2021, 9:42:45 AM
to Common Crawl

Hi Sebastian, thanks for the nice response!

I agree that caution is needed here.

We have found that being able to browse the data sets makes the data quite approachable (see attached gif). This includes not only directory navigation, but also previewing the files (e.g. seeing the first couple hundred lines) to get a sense of the format -- this all goes such a long way towards making the project feel less opaque. It's not just something I access via query and via opaque libraries, but living data.

Reading through the TOU more carefully... is the position that, as long as one pays to host a proxy (thus avoiding CORS issues), the Open Data Set team is not against such use cases? If the position is against such use cases, that seems like a loophole that should be plugged in the TOU. If not, then the extra expense of a proxy seems rather pointless. And perhaps the TOU could be updated to state that any recasting of the data must be prefaced with a TOU agreement page?

In any case, mostly just brainstorming here!
Thanks again,
Nick
@starpit

commoncrawling.gif

Sebastian Nagel

Apr 26, 2021, 11:59:21 AM
to common...@googlegroups.com
Hi Nick,

> We have found that being able to browse the data sets makes the data quite approachable
> (see attached gif)

Got it.

Actually, some Amazon Open Data Sets have a data explorer.
Just 3 examples:

https://human-pangenomics.s3.amazonaws.com/index.html
https://landsat-pds.s3.amazonaws.com/index.html
https://openaq-fetches.s3.amazonaws.com/index.html

Assuming that they're hosted on the bucket itself, no special CORS configuration is required.
But I'll need to verify this.
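
The reasoning behind that assumption: a page served from the bucket's own host shares its origin (scheme, host, port) with the bucket's listing endpoint, so the browser applies no CORS check. A rough sketch of the comparison follows; the function names are hypothetical, the `example.org` page is a placeholder, and whether the explorers above really work this way is still unverified.

```python
from urllib.parse import urlsplit

# Ports implied by a scheme when the URL does not name one.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin_of(url: str):
    """Return the (scheme, host, port) triple that browsers compare."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port or DEFAULT_PORTS.get(scheme)
    return (scheme, host, port)


def is_same_origin(page_url: str, request_url: str) -> bool:
    """True if a script on page_url can read request_url without CORS."""
    return origin_of(page_url) == origin_of(request_url)


if __name__ == "__main__":
    # Explorer hosted on the bucket, listing that same bucket.
    print(is_same_origin(
        "https://landsat-pds.s3.amazonaws.com/index.html",
        "https://landsat-pds.s3.amazonaws.com/?list-type=2&prefix=c1/",
    ))
    # Explorer hosted elsewhere: cross-origin, so CORS headers are needed.
    print(is_same_origin(
        "https://example.org/browser.html",
        "https://commoncrawl.s3.amazonaws.com/?list-type=2",
    ))
```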

Some data sets also include static navigation pages:
https://landsat-pds.s3.amazonaws.com/c1/L8/001/003/LC08_L1GT_001003_20170516_20170516_01_RT/index.html

Right now, we have static navigation pages for some data sets:
https://commoncrawl.s3.amazonaws.com/crawl-data/index.html
The plan is to add the missing ones (e.g. the 2008 - 2012 crawls) and to provide more metadata
(e.g. from-to timestamps), also in machine-readable form.

But agreed: also adding an interactive browser would be nice.

Is your project an open source project?


> the TOU could be updated

Noted. However, this requires a lawyer, so it's nothing that can be done quickly.

Best,
Sebastian

Nick Mitchell

Apr 26, 2021, 12:29:37 PM
to Common Crawl
Greetings!

Thanks for the pointers to those projects, they look great. I checked out one, and it does indeed seem to use a proxy to communicate with S3 (rather than going directly to the S3 API).

re: open source, yes indeed. The UI in the gif is a client extension of Kui, which is a graphical CLI framework, and part of the Kubernetes org: https://github.com/kubernetes-sigs/kui

On top of Kui, this client adds some S3 support, and that's it. The rest (the REPL, the monaco editor previewer, etc.) are all Kui, and thus already open sourced. Any kind of branding is possible, e.g. it could fly official CommonCrawl banners, iconography, etc., if so desired. We will be open sourcing the S3 bits soon, and could also open source a CommonCrawl branding alongside... if so desired!

Anyway, again, this has been mostly brainstorming, based on some local activities we've had against CommonCrawl (happy to talk about those, too, soon). The ability to browse went a long way in our ability to tell stories about the data, and our analyses thereof.

The client currently previews the compressed data files (e.g. wat.gz), but to limit load, only shows the first couple hundred lines -- i.e. leaving out arbitrary pagination was an intentional choice to avoid undue load on the data sets. But we could also offer arbitrary pagination, if deemed valuable.
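
A bounded preview of this kind can be sketched as follows. This is not the Kui client's code; the function names are hypothetical. It assumes the compressed bytes arrive as an iterable of chunks (for example, successive reads from an HTTP response) and that the file may be a concatenation of several gzip members. Because it stops pulling chunks once enough lines are collected, only the start of the file is ever transferred.

```python
import gzip
import zlib
from typing import Iterable, Iterator, List

GZIP_WBITS = zlib.MAX_WBITS | 16  # tell zlib to expect a gzip header/trailer


def _inflate(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Yield decompressed bytes from a possibly multi-member gzip stream."""
    decomp = zlib.decompressobj(GZIP_WBITS)
    for chunk in chunks:
        data = chunk
        while data:
            out = decomp.decompress(data)
            if out:
                yield out
            if decomp.eof:
                # One member ended; leftover bytes belong to the next member.
                data = decomp.unused_data
                decomp = zlib.decompressobj(GZIP_WBITS)
            else:
                data = b""


def preview_gzip_lines(chunks: Iterable[bytes], max_lines: int = 200) -> List[str]:
    """Return at most max_lines decoded lines, reading as few chunks as needed."""
    if max_lines <= 0:
        return []
    lines: List[str] = []
    pending = b""
    for out in _inflate(chunks):
        pending += out
        # Everything before the last newline is complete; keep the remainder.
        *complete, pending = pending.split(b"\n")
        for raw in complete:
            lines.append(raw.decode("utf-8", errors="replace"))
            if len(lines) >= max_lines:
                return lines  # stop early: no further chunks are consumed
    if pending:
        lines.append(pending.decode("utf-8", errors="replace"))
    return lines


if __name__ == "__main__":
    # Two gzip members back to back, fed in small chunks.
    blob = gzip.compress(b"first\nsecond\n") + gzip.compress(b"third\nfourth\n")
    pieces = (blob[i:i + 16] for i in range(0, len(blob), 16))
    print(preview_gzip_lines(pieces, max_lines=3))
```

Arbitrary pagination would need more than this, since jumping to a later offset in a gzip stream generally requires knowing where a member starts.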

Nick