> Is this real or an illusion?
Yes, the latest crawl has 1% PDF files (cf. ), which is high but not exceptionally so.
There was always some variance:
See  for how to download the statistics files, then run
zgrep '"mimetype_detected","application\\/pdf"' stats/CC-MAIN-*.gz
Note: only page captures with HTTP status 200 are counted here.
Redirects from /robots.txt are not included either, so the numbers in the
index might be slightly higher.
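
Where zgrep is not available, roughly the same filter can be sketched in Python. This is only an illustration: it assumes the statistics files are gzip-compressed text with one record per line and a JSON-escaped slash ("\/"), as the zgrep pattern above implies. The function names are made up for this example.

```python
import glob
import gzip

# Literal text the zgrep pattern matches: the slash is JSON-escaped as "\/".
NEEDLE = '"mimetype_detected","application\\/pdf"'


def matching_lines(lines, needle=NEEDLE):
    """Yield the lines that contain the needle as a plain substring."""
    for line in lines:
        if needle in line:
            yield line.rstrip("\n")


def grep_stats(pattern="stats/CC-MAIN-*.gz"):
    """Yield (path, line) pairs for every matching line in the files."""
    for path in sorted(glob.glob(pattern)):
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            for line in matching_lines(fh):
                yield path, line


if __name__ == "__main__":
    # Print in the same "file:line" shape zgrep uses for multiple files.
    for path, line in grep_stats():
        print(f"{path}:{line}")
```

The match is a plain substring test rather than a regular expression, so no escaping beyond the single backslash is needed.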
> Any idea what's going on? Different crawling strategy?
No, not really. I'd need to look into it. There are only two settings that
affect PDF files:
- (since October 2019) for sitemaps (but not for "normal" links)
  there is a suffix filter excluding URLs with a path component ending
- (since September 2018) re-fetching of URLs identified as PDFs is delayed
  compared to HTML pages (same for images, videos, etc.)
My first idea would be to look at how the PDFs are distributed over domains.
This is easily done using the columnar index.
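
As a rough sketch of such a per-domain count, the aggregation could look like the query below. It runs against a toy in-memory SQLite table so that it is self-contained. The column names url_host_registered_domain and content_mime_detected are assumed to match the columnar index schema, the table name ccindex is a placeholder, and the rows are made-up data, not crawl results. A real query would also need to restrict the crawl and subset.

```python
import sqlite3

# Made-up toy rows: (registered domain, detected MIME type).
ROWS = [
    ("example.com", "application/pdf"),
    ("example.com", "application/pdf"),
    ("example.org", "application/pdf"),
    ("example.org", "text/html"),
    ("example.net", "text/html"),
]

# Column names are assumptions about the columnar index schema.
QUERY = """
SELECT url_host_registered_domain AS domain, COUNT(*) AS n_pdf
FROM ccindex
WHERE content_mime_detected = 'application/pdf'
GROUP BY url_host_registered_domain
ORDER BY n_pdf DESC, domain ASC
"""


def pdfs_per_domain(rows):
    """Return (domain, pdf_count) pairs, domains with most PDFs first."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE ccindex ("
        "url_host_registered_domain TEXT, content_mime_detected TEXT)"
    )
    con.executemany("INSERT INTO ccindex VALUES (?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for domain, n_pdf in pdfs_per_domain(ROWS):
        print(domain, n_pdf)
```

If a handful of domains account for most of the PDF captures, that would show up at the top of this list.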
Do you have other ideas or are there other numbers you want to know?
On 5/25/21 3:04 PM, Tim Allison wrote:
> Hi All,
> I ran some counts on the index files comparing detected mimes in May 2021 and December 2019. It looks like there was a huge increase in
> PDFs: ~3million -> ~29 million.
> I thought I had seen this increase earlier (mid 2020 was last I looked?), but I didn't have time to follow up.
> NOTE: I did not look for unique digests!
> Any idea what's going on? Different crawling strategy? Is this real or an illusion?