PDFs galore?


Tim Allison

May 25, 2021, 9:04:43 AM
to Common Crawl
Hi All,
  I ran some counts on the index files comparing detected MIME types in May 2021 and December 2019.  It looks like there was a huge increase in PDFs: ~3 million -> ~29 million.

I thought I had seen this increase earlier (mid 2020 was last I looked?), but I didn't have time to follow up.

NOTE: I did not look for unique digests!

Any idea what's going on?  Different crawling strategy?  Is this real or an illusion?

Cheers,

        Tim

CC-MAIN-2019-51
text/html 2,324,897,452
application/xhtml+xml 556,572,002
text/plain 73,548,957
application/octet-stream 32,466,292
NULL 16,877,290
message/rfc822 4,182,827
application/rss+xml 3,548,519
image/jpeg 3,414,527
application/pdf 3,314,786
application/atom+xml 3,279,559
application/xml 2,472,155
application/binary 1,529,943
text/calendar 1,085,842
application/json 920,295
image/png 498,865
application/x-stata-do 488,441
text/x-perl 253,738
application/rdf+xml 246,618
application/zip 202,327
audio/mpeg 155,743

CC-MAIN-2021-21
text/html 2,671,608,780
application/xhtml+xml 414,297,587
text/plain 64,884,811
application/octet-stream 34,502,937
application/pdf 28,825,203
NULL 12,308,820
image/jpeg 4,230,991
application/rss+xml 3,399,706
application/xml 2,739,181
application/atom+xml 2,035,688
application/binary 977,552
application/json 769,477
text/calendar 762,553
image/png 729,016
message/rfc822 487,245
application/x-stata-do 360,759
text/x-perl 277,776
application/rdf+xml 230,500
application/zip 197,717
text/x-php 166,367


Sebastian Nagel

May 26, 2021, 1:41:23 PM
to common...@googlegroups.com
Hi Tim,

> Is this real or an illusion?

Yes, the latest crawl has 1% PDF files (cf. [1]), which is high but not exceptionally so.
There has always been some variance:

27082692 CC-MAIN-2019-04
31586278 CC-MAIN-2019-09
12288145 CC-MAIN-2019-13
12960647 CC-MAIN-2019-18
19423997 CC-MAIN-2019-22
16109119 CC-MAIN-2019-26
3453816 CC-MAIN-2019-30
4236832 CC-MAIN-2019-35
25529661 CC-MAIN-2019-39
11780097 CC-MAIN-2019-43
12206558 CC-MAIN-2019-47
3275012 CC-MAIN-2019-51
2716184 CC-MAIN-2020-05
3790477 CC-MAIN-2020-10
21430937 CC-MAIN-2020-16
3506961 CC-MAIN-2020-24
3122176 CC-MAIN-2020-29
3809774 CC-MAIN-2020-34
35100891 CC-MAIN-2020-40
23346577 CC-MAIN-2020-45
18345598 CC-MAIN-2020-50
26963788 CC-MAIN-2021-04
16358436 CC-MAIN-2021-10
14802128 CC-MAIN-2021-17
27808355 CC-MAIN-2021-21

See [2] for how to download the statistics files, then run
zgrep '"mimetype_detected","application\\/pdf"' stats/CC-MAIN-*.gz

Note: here only page captures with HTTP status 200 are counted.
Also redirects from /robots.txt are not included. So the numbers in the
index might be slightly higher.
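
To put a number on the variance, here is a minimal Python sketch that summarizes the per-crawl counts listed above. The values are copied from that list; the helper name `summarize` is only illustrative and not part of any Common Crawl tooling.

```python
from statistics import median

# Successful (HTTP 200) PDF captures per crawl, copied from the list above.
PDF_COUNTS = {
    "CC-MAIN-2019-04": 27082692, "CC-MAIN-2019-09": 31586278,
    "CC-MAIN-2019-13": 12288145, "CC-MAIN-2019-18": 12960647,
    "CC-MAIN-2019-22": 19423997, "CC-MAIN-2019-26": 16109119,
    "CC-MAIN-2019-30": 3453816,  "CC-MAIN-2019-35": 4236832,
    "CC-MAIN-2019-39": 25529661, "CC-MAIN-2019-43": 11780097,
    "CC-MAIN-2019-47": 12206558, "CC-MAIN-2019-51": 3275012,
    "CC-MAIN-2020-05": 2716184,  "CC-MAIN-2020-10": 3790477,
    "CC-MAIN-2020-16": 21430937, "CC-MAIN-2020-24": 3506961,
    "CC-MAIN-2020-29": 3122176,  "CC-MAIN-2020-34": 3809774,
    "CC-MAIN-2020-40": 35100891, "CC-MAIN-2020-45": 23346577,
    "CC-MAIN-2020-50": 18345598, "CC-MAIN-2021-04": 26963788,
    "CC-MAIN-2021-10": 16358436, "CC-MAIN-2021-17": 14802128,
    "CC-MAIN-2021-21": 27808355,
}


def summarize(counts):
    """Return min, max, median and the max/min ratio of the counts."""
    values = sorted(counts.values())
    return {
        "min": values[0],
        "max": values[-1],
        "median": median(values),
        "max_over_min": values[-1] / values[0],
    }


summary = summarize(PDF_COUNTS)
print(summary)
# How the December 2019 crawl compares to the median crawl:
print(PDF_COUNTS["CC-MAIN-2019-51"] / summary["median"])
```

Comparing a single crawl against the median, rather than against one other crawl, makes it easier to see whether that crawl is an outlier.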


> Any idea what's going on? Different crawling strategy?

No, not really. I'd need to look into it. There are only 2 settings which
affect PDF files:
- (since October 2019) for sitemaps (but not for "normal" links)
there is a suffix filter excluding URLs with a path component ending
in `.pdf`.
- (since September 2018) re-fetching of URLs identified as PDFs is delayed
compared to HTML pages (same for images, videos, etc.)

My first idea would be to look at how the PDFs are distributed over domains.
This is easily done using the columnar index.
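
A rough sketch of such a per-domain count, written as a Python helper that only builds the SQL text. It assumes the columnar index is registered in Athena (or a similar SQL engine) as "ccindex"."ccindex" with the columns crawl, subset, content_mime_detected and url_host_registered_domain; please check the table name and columns against your own setup. The function name `pdf_domains_query` is hypothetical.

```python
import re


def pdf_domains_query(crawl, limit=50):
    """Build a SQL query counting PDF captures per registered domain.

    Assumption: the columnar index is available as "ccindex"."ccindex"
    with the column names used below.
    """
    # Validate inputs so that nothing unexpected ends up in the SQL text.
    if not re.fullmatch(r"CC-MAIN-\d{4}-\d{2}", crawl):
        raise ValueError(f"unexpected crawl id: {crawl!r}")
    if not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    return (
        "SELECT url_host_registered_domain AS domain,\n"
        "       COUNT(*) AS pdf_captures\n"
        'FROM "ccindex"."ccindex"\n'
        f"WHERE crawl = '{crawl}'\n"
        "  AND subset = 'warc'\n"
        "  AND content_mime_detected = 'application/pdf'\n"
        "GROUP BY url_host_registered_domain\n"
        "ORDER BY pdf_captures DESC\n"
        f"LIMIT {limit}"
    )


query = pdf_domains_query("CC-MAIN-2021-21", limit=20)
print(query)
```

Running the same query for a low crawl (e.g. CC-MAIN-2019-51) and a high one and comparing the top domains should show whether a few hosts account for most of the difference.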

Do you have other ideas or are there other numbers you want to know?

Best,
Sebastian

[1] https://commoncrawl.github.io/cc-crawl-statistics/plots/mimetypes
[2] https://github.com/commoncrawl/cc-crawl-statistics#step-3-download-the-data


Tim Allison

May 26, 2021, 3:21:13 PM
to Common Crawl
Oh, wow.  This is fantastic. What this shows is that the December 2019 crawl was actually kind of low, and that there were crawls before that with a bunch more PDFs.

 Thank you!
