New large scale Common Crawl-based corpus focused on PDFs


Tim Allison

May 16, 2023, 2:17:24 PM
to Common Crawl
All,
  We recently published a corpus derived from a single month of Common Crawl data: 8 million PDFs, roughly 8 TB.  We refetched 2 million PDFs that were truncated in the crawl.  These are all packaged in zip files, with accompanying metadata.
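The packaging described above (PDFs bundled in zip archives alongside metadata) can be consumed with Python's standard zipfile module. A minimal sketch; the archive layout and member names here are invented for illustration, not taken from the published corpus:

```python
import io
import zipfile

def list_pdfs(zip_bytes: bytes) -> list[str]:
    """Return the names of PDF members inside a corpus zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return [name for name in zf.namelist() if name.lower().endswith(".pdf")]

# Build a tiny in-memory archive to demonstrate the helper
# (hypothetical member names, not the corpus's real layout).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("0000/abc.pdf", b"%PDF-1.4 ...")
    zf.writestr("0000/abc.json", b"{}")

print(list_pdfs(buf.getvalue()))  # ['0000/abc.pdf']
```

The same pattern streams straight from a downloaded archive without unpacking it to disk first.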
  For those on this list for whom Common Crawl is second nature, this will not seem like an accomplishment.  For those not up on Common Crawl, it is kind of huge, and I wanted to thank Sebastian Nagel and the Common Crawl team for such an important resource.
  Please let me know if you have any questions.

Best,

      Tim

Data:

C.L. Liu

May 16, 2023, 9:45:54 PM
to Common Crawl
Hi,

Thanks for your work. As someone who is not familiar with Common Crawl, I am impressed by the volume of data that was collected and refetched. It is truly a valuable resource.
I was wondering if it would be possible to obtain the data split by language. I am particularly interested in analyzing data in a specific language, and having the data split by language would be immensely helpful. If this is possible, please let me know.

Thank you once again for your hard work

Best regards,
Liu

Tim Allison wrote on Wednesday, May 17, 2023 at 2:17:24 AM [UTC+8]:

Tim Allison

May 17, 2023, 6:06:12 AM
to Common Crawl
We plan to run Apache Tika with language detection against the corpus and add a metadata table or two from that run.  I'd estimate that should be ready the first week of June, maybe?  I'll ping this thread when that table (or those tables) are ready.

Tim Allison

Jun 15, 2023, 10:23:36 AM
to Common Crawl
Dates on the calendar were closer than they appeared.  The Tika data is still a work in progress...

Two other press releases just dropped on this corpus:
2) DARPA's press release on the overall SafeDocs program: https://www.darpa.mil/news-events/2023-06-14 

Romain Beaumont

Jun 29, 2023, 4:50:59 AM
to Common Crawl
Hey,
Cool dataset!
https://github.com/rom1504/cc2dataset can process WAT files pretty fast
I ran it once for PDFs and found there are 5B deduplicated PDF links across all months of CC.
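Harvesting those links boils down to walking the JSON payloads of Common Crawl's WAT files and keeping the URLs that point at PDFs, deduplicating as you go. A self-contained sketch of that extraction step, using the Envelope / Payload-Metadata path that WAT records follow (cc2dataset does this at scale with Spark; this is just the core idea):

```python
import json

def pdf_links(wat_json: str, seen: set) -> list:
    """Pull novel .pdf links out of one WAT record's JSON payload.

    Follows the WAT layout: Envelope -> Payload-Metadata ->
    HTTP-Response-Metadata -> HTML-Metadata -> Links.
    """
    record = json.loads(wat_json)
    links = (record.get("Envelope", {})
                   .get("Payload-Metadata", {})
                   .get("HTTP-Response-Metadata", {})
                   .get("HTML-Metadata", {})
                   .get("Links", []))
    out = []
    for link in links:
        url = link.get("url", "")
        if url.lower().endswith(".pdf") and url not in seen:
            seen.add(url)   # dedup across records
            out.append(url)
    return out

# A toy WAT-style record with a duplicate PDF link.
record = json.dumps({"Envelope": {"Payload-Metadata": {"HTTP-Response-Metadata": {
    "HTML-Metadata": {"Links": [
        {"url": "https://example.com/a.pdf"},
        {"url": "https://example.com/page.html"},
        {"url": "https://example.com/a.pdf"},
    ]}}}}})
seen = set()
print(pdf_links(record, seen))  # ['https://example.com/a.pdf']
```

Note that suffix matching on `.pdf` undercounts (many PDFs sit behind extension-less URLs), so a production pass would also check Content-Type where available.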

To turn that into a dataset, some work would be needed to build a tool similar to img2dataset that lets users download the data (maybe starting here: https://github.com/rom1504/any2dataset/issues/6), and probably some work around removing 'unsafe' (to be defined) documents.
If you'd be interested in pursuing this as a next step in your work, I'm happy to give you pointers.

Best,
Romain

Chillar Anand

Jul 4, 2023, 5:38:47 AM
to Common Crawl
Tim,

Let me know if you need any help with it.

I am also extracting all pdfs that are in the Telugu language.


Best,
Chillar Anand

Tim Allison

Jul 17, 2023, 12:54:41 PM
to Common Crawl
All,
  
  I finally got around to running Tika against the dataset with tika-eval's language id -- not perfect, but decent -- and, more importantly, its out-of-vocabulary (OOV) statistic, which can indicate that the electronic text as stored is likely junk.
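The OOV idea is simple to sketch: tokenize the extracted text and measure what fraction of its alphabetic tokens fall outside a reference vocabulary for the detected language; a ratio near 1.0 suggests the stored text is junk (broken encoding, glyph soup). This is a simplified stand-in for tika-eval's actual statistic, with a toy vocabulary:

```python
import re

def oov_ratio(text: str, vocab: set) -> float:
    """Fraction of alphabetic tokens not found in the reference vocabulary.

    A value near 1.0 suggests the extracted text is likely junk.
    """
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", text)]
    if not tokens:
        return 1.0  # no alphabetic text at all: treat as junk
    oov = sum(1 for t in tokens if t not in vocab)
    return oov / len(tokens)

# Toy vocabulary standing in for a real per-language wordlist.
vocab = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}
print(oov_ratio("The quick brown fox", vocab))       # 0.0
print(oov_ratio("Xj3 qqzv vbnw glyphspill", vocab))  # 1.0
```

A real run would use a per-language vocabulary and a language-aware tokenizer, but the thresholding logic is the same.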
  
  I focused heavily on PDF-specific metadata items.
  
  There are two main tables: 
  a) the "container" files, the main PDFs -- one row per URL
  b) the "container" files plus the embedded files -- at least one row per URL, and possibly many rows per URL, one for each embedded file. 


  Let me know if you have any questions.

     Best,

          Tim

Tim Allison

Jul 17, 2023, 12:58:19 PM
to Common Crawl
Very cool!  The funding contract for this project ended last week.  :( Until I find a new source of funding, I regret I won't be able to spend much time on this.

I do look forward to digging into your links.  Thank you so much!

>> 'unsafe' (to be defined)

LOL... yes, that's one of the things we struggled with on the project that funded the gathering of this corpus, DARPA's SafeDocs program.  There are many ways for PDFs to be unsafe, far beyond the run-of-the-mill JavaScript stuff, which is now thankfully disabled by default in most PDF-consuming tools.

Tim Allison

Jul 17, 2023, 1:05:22 PM
to Common Crawl
I regret that I don't have great news for you for this corpus. :(

On the one hand, the language detector does (theoretically) identify Telugu.  On the other, when I search for "Telugu" files with > 50 alphabetic tokens and < 80% out-of-vocabulary tokens, I find only ~117 files where Telugu was apparently extracted from the electronic text as stored in the PDF.

This number does not cover image-only Telugu PDFs, nor PDFs that a human would see as Telugu but whose electronic text could not be extracted reliably (broken/missing fonts, missing Unicode mapping tables, etc.).
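The query described above amounts to a row filter over the metadata tables. A sketch of that filter; the column names (`lang`, `alpha_tokens`, `oov`) and language code are invented for illustration and will differ in the published tables:

```python
def telugu_candidates(rows: list) -> list:
    """Keep rows detected as Telugu ('te' is an assumed language code)
    with > 50 alphabetic tokens and < 80% out-of-vocabulary tokens."""
    return [r for r in rows
            if r.get("lang") == "te"
            and r.get("alpha_tokens", 0) > 50
            and r.get("oov", 1.0) < 0.80]

# Hypothetical metadata rows, one per PDF URL.
rows = [
    {"url": "a.pdf", "lang": "te", "alpha_tokens": 200, "oov": 0.35},
    {"url": "b.pdf", "lang": "te", "alpha_tokens": 40,  "oov": 0.10},  # too few tokens
    {"url": "c.pdf", "lang": "te", "alpha_tokens": 500, "oov": 0.95},  # mostly junk text
    {"url": "d.pdf", "lang": "en", "alpha_tokens": 300, "oov": 0.05},  # wrong language
]
print([r["url"] for r in telugu_candidates(rows)])  # ['a.pdf']
```

The token-count floor screens out near-empty extractions, and the OOV ceiling screens out glyph soup -- which is why image-only PDFs and broken-font PDFs, as noted above, never make it through this filter.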

On Tuesday, July 4, 2023 at 5:38:47 AM UTC-4 anand2...@gmail.com wrote: