Archiving enriched Common Crawl data?


Tim Allison

Jul 6, 2022, 10:22:23 AM
to Common Crawl
I'm about a year out from wrapping up a project that has relied heavily on Common Crawl data.  Our focus was PDFs. 

We took one month of CC data (August/September 2020), extracted 6 million PDFs, and refetched 2 million truncated PDFs. Refetching was a major pain, and I'd like to prevent anyone from ever having to do that again. It would also be super useful to have a seamless set of data that isn't divided between original and refetched.
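For anyone facing the same refetching step, here's a rough sketch of how we thought about spotting truncated PDF captures in a WARC file before deciding what to refetch (this assumes the warcio library and uses a placeholder file path; it's not our exact pipeline):

    from warcio.archiveiterator import ArchiveIterator

    def find_truncated_pdfs(warc_path):
        """List target URIs of PDF responses cut off at the crawl's content limit."""
        truncated = []
        with open(warc_path, 'rb') as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type != 'response' or record.http_headers is None:
                    continue
                ctype = record.http_headers.get_header('Content-Type') or ''
                if 'application/pdf' not in ctype.lower():
                    continue
                # Cut-off payloads carry a WARC-Truncated header.
                if record.rec_headers.get_header('WARC-Truncated'):
                    truncated.append(record.rec_headers.get_header('WARC-Target-URI'))
        return truncated

    # e.g. find_truncated_pdfs('CC-MAIN-...-00000.warc.gz')  # placeholder path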

We ran a bunch of PDF parsers and recorded crashes, error/warning messages, extracted text and other metadata.
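As a rough illustration of that step (not the actual harness we used; the parser command below is a placeholder), each parser run can be wrapped so that crashes, warnings, and timeouts are recorded per file:

    import subprocess, time

    def run_parser(cmd, pdf_path, timeout=60):
        """Run an external PDF parser and record exit status, stderr tail, and runtime."""
        start = time.time()
        try:
            proc = subprocess.run(cmd + [pdf_path], capture_output=True,
                                  text=True, timeout=timeout)
            status = 'ok' if proc.returncode == 0 else 'error'
            stderr = proc.stderr[-2000:]   # keep the tail of any warnings
        except subprocess.TimeoutExpired:
            status, stderr = 'timeout', ''
        return {'file': pdf_path, 'status': status,
                'stderr': stderr, 'seconds': round(time.time() - start, 2)}

    # e.g. run_parser(['pdftotext'], 'doc.pdf')  # placeholder parser command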

We'd like to persist and share these data but we're not sure of the best method.

Some options:

Option 1: we apply for our own AWS public dataset (or are there other potential hosts?).

Option 2: CC runs a special once-only (or once-a-year) crawl where the truncation limit is bumped to 1 GB per file. We then run our processes over this new dataset and publish that.

Option 3?

Refetched docs aside, are there recommendations for storing enriched data for a Common Crawl set?

Is anyone else interested in a larger crawl?

Thank you!

Cheers,

             Tim

Henry S. Thompson

Jul 6, 2022, 11:40:50 AM
to common...@googlegroups.com
Tim Allison writes:

> We took one month of CC data (August/September 2020), extracted 6
> million PDFs, and refetched 2 million truncated PDFs. Refetching was
> a major pain, and I'd like to prevent anyone from ever having to do
> that again. It would also be super useful to have a seamless set of
> data that isn't divided between original and refetched.

I've done the same for August 2019, and have the same goals.

> We'd like to persist and share these data but we're not sure of the
> best method.

Ditto.

ht
--
Henry S. Thompson, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: h...@inf.ed.ac.uk
URL: http://www.ltg.ed.ac.uk/~ht/
[mail from me _always_ has a .sig like this -- mail without it is forged spam]

The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

Sebastian Nagel

Jul 11, 2022, 10:36:09 AM
to common...@googlegroups.com
Hi Tim, hi Henry,


thanks for the suggestions, and also for creating these useful datasets.


> Option 1: we apply for our own AWS public dataset

Happy to introduce you to the Amazon Open Data Set team to
discuss this option. Would it also be an option to host it as
a subset of Common Crawl?


> Option 2: CC runs a special once-only (or once-a-year) crawl where
> the truncation limit is bumped to 1 GB per file. We then run our
> processes over this new dataset and publish that.

There are a couple of technical reasons for the 1 MiB limit (they just
make implementing a crawler much easier), e.g.:
- buffering content while fetching over many simultaneously open
  connections
- the 1 GB recommendation for WARC files: a 1 GiB content limit would
  mean that there would be WARC files with just a single capture
- incidentally, it also guarantees that offsets do not overflow when
  32-bit integers are used

But the major simplification is that the size limit allows the crawler
to ignore document sizes while sampling:
- assume a domain is allowed 2k page captures, given the popularity of
  that domain in terms of hyperlink centrality
- with the 1 MiB limit the domain can contribute at most 2 GiB of data
- but with a much higher limit of 1 GiB it could be up to 2 TiB
- that's a lot, given that all of our "monthly" crawls are below
  100 TiB of WARC files
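Spelling out the worst case in the list above:

    # Worst-case per-domain contribution for 2k captures.
    MIB, GIB, TIB = 2**20, 2**30, 2**40
    captures = 2000
    print(captures * MIB / GIB)   # ~2 (GiB) with a 1 MiB limit
    print(captures * GIB / TIB)   # ~2 (TiB) with a 1 GiB limit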

One option could be to introduce variable limits depending on the MIME
type. A globally high limit could end up with the crawler accidentally
capturing large software archives, multimedia collections, or the like.
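Just to make the idea concrete (purely hypothetical numbers and names,
nothing from our actual crawler configuration):

    # Hypothetical per-MIME-type truncation limits with a global fallback.
    MIB = 2**20

    CONTENT_LIMITS = {
        'application/pdf': 32 * MIB,   # made-up value
        'text/html':        1 * MIB,
    }
    DEFAULT_LIMIT = 1 * MIB

    def content_limit(mime_type):
        return CONTENT_LIMITS.get(mime_type, DEFAULT_LIMIT)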

Regarding variable limits: is there any kind of "metric" for how useful
truncated captures are for various MIME types? - MIME type identifiable,
some forensic analysis possible, still readable, etc.
This matters since we somehow need to balance occupied storage against
usage. And there's a lot of potential to optimize the collections.
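To make "MIME type identifiable" concrete, a toy proxy would be to check
whether the leading magic bytes of a truncated payload still match the
claimed type (a sketch only, not a proposed metric):

    # Toy sniffer: does the payload still identify its type by magic bytes?
    MAGIC = {
        b'%PDF-':      'application/pdf',
        b'\x89PNG':    'image/png',
        b'PK\x03\x04': 'application/zip',
    }

    def sniff(payload):
        for magic, mime in MAGIC.items():
            if payload.startswith(magic):
                return mime
        return None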


> are there recommendations for storing enriched data for a Common Crawl set?

You mean how to link metadata with Common Crawl captures or WARC
captures in general?
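If so: one pattern (only a sketch, not a recommendation for a specific
format) is to key every enrichment record on roughly the fields the
columnar index exposes, so rows can be joined back to WARC captures.
All values below are placeholders:

    import json

    # One enrichment row per capture, e.g. written out as JSON Lines.
    enrichment = {
        'url': 'https://example.com/report.pdf',
        'warc_filename': 'crawl-data/CC-MAIN-2020-34/.../...warc.gz',
        'warc_record_offset': 12345678,
        'warc_record_length': 987654,
        'content_digest': 'sha1:...',
        # project-specific enrichment columns
        'parser': 'pdfbox',
        'parse_status': 'ok',
        'num_pages': 12,
    }

    print(json.dumps(enrichment))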


Best,
Sebastian