Call for Contributions - Format-Specific Datasets

Sebastian Nagel

Oct 8, 2019, 3:45:10 AM
to Common Crawl
Hi all,

Recently there have been discussions about the truncation of PDF documents in the Common Crawl, and
requests to raise the content length limit in order to reduce the number of truncated PDF files. These
discussions led to the proposal to create a separate dataset dedicated to PDFs that also includes other
document formats used to store text (office documents, e-books, etc.).

We would use existing tools to download and package a format-specific dataset. The
challenge is therefore how to:
- select a representative sample: PageRank or harmonic centrality scores are not suitable
for choosing sites and determining the number of documents to be sampled per site
- define content limit thresholds: it is possible to fetch more than 1 MB of content, but
there must still be a limit
- create supportive formats similar to WAT and WET.
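
To make the first two points more concrete, here is a minimal Python sketch of one possible approach: a uniform cap on the number of documents per host, combined with a size ceiling. The function name, the (host, url, content_length) tuple layout and the 10 MB limit are assumptions for illustration only, not a decided design.

```python
import random
from collections import defaultdict


def sample_per_site(candidates, max_per_site, max_bytes, seed=0):
    """Pick up to max_per_site URLs per host, skipping documents over max_bytes.

    candidates: iterable of (host, url, content_length) tuples.
    This layout is hypothetical, chosen only for the example.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_host = defaultdict(list)
    for host, url, length in candidates:
        if length <= max_bytes:  # enforce the content limit threshold
            by_host[host].append(url)
    sample = {}
    for host, urls in by_host.items():
        k = min(max_per_site, len(urls))  # never ask for more than available
        sample[host] = sorted(rng.sample(urls, k))
    return sample


# Made-up candidate list; the 10 MB ceiling is an arbitrary example value.
candidates = [
    ("example.org", "https://example.org/a.pdf", 400_000),
    ("example.org", "https://example.org/b.pdf", 30_000_000),
    ("example.org", "https://example.org/c.pdf", 900_000),
    ("example.com", "https://example.com/report.pdf", 2_500_000),
]
picked = sample_per_site(candidates, max_per_site=2, max_bytes=10_000_000)
```

A flat per-host cap ignores how large or important a site is, so it is only a baseline; how to weight sites is exactly the open question raised above.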

Notably, we must first decide which document formats to include in the collection(s). From prior
discussions in this group and usage examples, the following candidates have been identified:
- PDF documents [1,2,3]
- spreadsheets [4]
- office documents [5,6]

There are many more document formats to consider, so please let us know about your interests and/or
use cases. Your suggestions regarding the questions above are very welcome!

Best,
Sebastian

[1] https://groups.google.com/d/topic/common-crawl/JrTi5EGg6EM/discussion
[2] https://groups.google.com/d/topic/common-crawl/JJW6fv1rUQw/discussion
[3] http://pdfinfo.net/en/about/
[4] https://kevinlubick.com/pubs/MSR2015-Fuse_spreadsheet_corpus.pdf
http://static.barik.net/fuse/
[5] https://github.com/centic9/CommonCrawlDocumentDownload
[6] https://www.decalage.info/en/download_mso_files

Tony Wong

Jan 20, 2022, 4:58:46 AM
to Common Crawl
Any progress on this?

How can I contribute?

Sebastian Nagel

Feb 1, 2022, 11:18:59 AM
to common...@googlegroups.com