Hi Tim, hi Henry,
thanks for the suggestion, and also for creating this useful dataset(s).
> Option 1: we apply for our own AWS public dataset
Happy to introduce you to the Amazon Open Data Set team to
discuss this option. Would it also be an option to host it as
a subset of Common Crawl?
> Option 2: CC runs a special once only/once a year crawl where
> truncation is bumped to 1GB per file. We then run our processes over
> this new data set and publish that.
There are a couple of technical reasons for the 1 MiB limit (they just
make implementing a crawler much easier), e.g.
- buffering content while fetching over many simultaneously open
connections
- the 1 GB recommendation for WARC files
- a 1 GiB content limit would mean that there will be WARC files with
just a single capture
- btw., it also guarantees that offsets do not overflow when 32-bit
integers are used
But the major simplification is that the size limit allows us to ignore
the document size while sampling:
- assume a domain is allowed 2k page captures, given its popularity
in terms of hyperlink centrality
- with the 1 MiB limit the domain could contribute at most about 2 GiB
of data
- but with a much higher limit of 1 GiB it could be up to about 2 TiB
- that's a lot given that all of our "monthly" crawls are below
100 TiB of WARC files
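To make the arithmetic above concrete, here is a rough sketch. It assumes the worst case where every capture reaches the size limit, and it takes 100 TiB as the size of a whole crawl (the upper bound mentioned above); the function name is only for illustration:

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
TIB = 1024 ** 4


def max_domain_bytes(captures: int, limit_bytes: int) -> int:
    """Worst case: every capture of the domain hits the content limit."""
    return captures * limit_bytes


captures = 2000                      # page captures allowed for the domain
small = max_domain_bytes(captures, 1 * MIB)
large = max_domain_bytes(captures, 1 * GIB)
crawl = 100 * TIB                    # assumed upper bound for one crawl

print(f"1 MiB limit: {small / GIB:.2f} GiB")        # 1.95 GiB
print(f"1 GiB limit: {large / TIB:.2f} TiB")        # 1.95 TiB
print(f"share of a 100 TiB crawl: {large / crawl:.1%}")  # 2.0%
```

So a single domain could, in the worst case, take up about 2% of an entire crawl.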
One option could be to introduce variable limits depending on the MIME
type. A global, high limit could cause the crawler to accidentally
capture larger software archives, multimedia collections or the like.
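Just to illustrate what variable limits could look like, a minimal sketch. The MIME types and limit values are made up for the example and are not what the crawler uses or would use; only the 1 MiB default corresponds to the current limit:

```python
from typing import Optional, Tuple

MIB = 1024 ** 2

DEFAULT_LIMIT = 1 * MIB              # the current global limit

# Hypothetical per-type overrides, for illustration only.
LIMITS_BY_MIME = {
    "application/pdf": 32 * MIB,
    "application/epub+zip": 16 * MIB,
}


def content_limit(mime_type: Optional[str]) -> int:
    """Return the content limit in bytes for a (possibly missing) MIME type."""
    if not mime_type:
        return DEFAULT_LIMIT
    # Drop parameters such as "; charset=utf-8" and normalize case.
    base = mime_type.split(";", 1)[0].strip().lower()
    return LIMITS_BY_MIME.get(base, DEFAULT_LIMIT)


def truncate(payload: bytes, mime_type: Optional[str]) -> Tuple[bytes, bool]:
    """Cut the payload at the limit; the flag tells whether it was truncated."""
    limit = content_limit(mime_type)
    return payload[:limit], len(payload) > limit
```

Anything not listed (software archives, multimedia, unknown types) stays at the default, so a high limit is only granted to explicitly chosen types.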
Regarding variable limits: is there any kind of "metric" for how useful
truncated captures are for various MIME types? For example: MIME type
still identifiable, some forensic analysis possible, still readable, etc.
This matters since we somehow need to balance occupied storage against
usage. And there's a lot of potential to optimize the collections.
> are there recs for storing enriched data for a commoncrawl set?
You mean how to link metadata with Common Crawl captures or WARC
captures in general?
Best,
Sebastian