Hi Weijian,
> I am so curious about that whether the big difference in the number of PDFs was caused
> by different crawling strategies on crawling PDF files or by people's behavior of using
> PDFs.
My guess would be that the crawling strategy is the more important factor.
Another possible explanation is that more sites (or rather their CMSs) now
provide a "print" function. To the crawler, that is just a link pointing to
a PDF version of the page.
Unfortunately, I do not know what happened back in 2014. I've checked whether
any URL filters suppressing *.pdf were active: that wasn't the case.
But it could be that the Blekko seeds we relied upon in 2014 penalized PDFs
(or simply preferred HTML).
Since autumn 2016 we have maintained our own crawl frontier, and since Sep 2018
PDFs and other non-HTML content types have been delayed when selected for
refetch. That's done because PDFs and multimedia formats are often large and
compress poorly in WARC files.
Below is the share of the application/pdf content type, in percent, for each monthly crawl:
% monthly crawl
0.1621 CC-MAIN-2013-20
0.1562 CC-MAIN-2013-48
0.1719 CC-MAIN-2014-10
0.2105 CC-MAIN-2014-15 # << Apr 2014 : 5.5 million captures, 3.2 million unique URLs
0.1738 CC-MAIN-2014-23
0.1893 CC-MAIN-2014-35
0.1834 CC-MAIN-2014-41
0.2421 CC-MAIN-2014-42
0.2202 CC-MAIN-2014-49
0.1675 CC-MAIN-2014-52
0.1941 CC-MAIN-2015-14
0.1722 CC-MAIN-2015-18
0.1774 CC-MAIN-2015-22
0.1772 CC-MAIN-2015-27
0.1709 CC-MAIN-2015-32
0.1671 CC-MAIN-2015-35
0.1678 CC-MAIN-2015-40
0.1589 CC-MAIN-2015-48
0.1574 CC-MAIN-2016-07
0.1712 CC-MAIN-2016-18
0.2515 CC-MAIN-2016-22
0.2647 CC-MAIN-2016-26
0.2268 CC-MAIN-2016-30
0.2221 CC-MAIN-2016-36
0.2518 CC-MAIN-2016-40
0.1941 CC-MAIN-2016-44
0.1958 CC-MAIN-2016-50
0.2375 CC-MAIN-2017-04
0.2674 CC-MAIN-2017-09
0.6915 CC-MAIN-2017-13
0.7644 CC-MAIN-2017-17 # << Apr 2017 : 22.5 million captures
0.5987 CC-MAIN-2017-22
0.7102 CC-MAIN-2017-26
0.7463 CC-MAIN-2017-30
0.5556 CC-MAIN-2017-34
0.8228 CC-MAIN-2017-39
0.4501 CC-MAIN-2017-43
0.4945 CC-MAIN-2017-47
0.2841 CC-MAIN-2017-51
0.8428 CC-MAIN-2018-05
0.6064 CC-MAIN-2018-09
0.5160 CC-MAIN-2018-13
0.4462 CC-MAIN-2018-17
0.4110 CC-MAIN-2018-22
0.5468 CC-MAIN-2018-26
0.5305 CC-MAIN-2018-30
0.4556 CC-MAIN-2018-34
0.1419 CC-MAIN-2018-39 # << Sep 2018 : refetch of PDFs delayed
0.8036 CC-MAIN-2018-43
0.9513 CC-MAIN-2018-47
1.1398 CC-MAIN-2018-51
0.9150 CC-MAIN-2019-04
1.0602 CC-MAIN-2019-09
0.4675 CC-MAIN-2019-13 # << Mar 2019 : increased delay
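To put the two annotated April rows side by side, here is a small Python sketch (not part of cc-crawl-statistics). It uses only the figures from the table above; the implied crawl totals are a back-of-the-envelope estimate derived from those rounded numbers, not official counts, and the helper names are made up for illustration.

```python
# Figures copied from the annotated rows of the table above.
APR_2014 = {"crawl": "CC-MAIN-2014-15", "pdf_percent": 0.2105, "pdf_captures": 5.5e6}
APR_2017 = {"crawl": "CC-MAIN-2017-17", "pdf_percent": 0.7644, "pdf_captures": 22.5e6}


def implied_total_captures(row):
    """Estimate the total number of captures from the PDF count and its share in percent."""
    return row["pdf_captures"] / (row["pdf_percent"] / 100.0)


def growth(old, new, key):
    """Return how many times larger new[key] is than old[key]."""
    return new[key] / old[key]


if __name__ == "__main__":
    print("PDF captures grew %.1fx" % growth(APR_2014, APR_2017, "pdf_captures"))
    print("PDF share grew %.1fx" % growth(APR_2014, APR_2017, "pdf_percent"))
    for row in (APR_2014, APR_2017):
        print("%s: roughly %.1f billion captures in total"
              % (row["crawl"], implied_total_captures(row) / 1e9))
```

By these numbers the PDF captures grew about fourfold between the two April crawls, and most of that comes from the higher PDF share rather than from a larger crawl.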
> I noticed that the April 2014 dataset included only 2 million PDFs while the April
> 2017 dataset included 20 million PDFs which is 10 times more.
Well, for April 2014 I get 5.5 million captures, or 3.2 million unique URLs,
for "application/pdf" in the statistics derived from the CDX index.
That is what the web servers send as "Content-Type", which is notoriously
noisy and not necessarily correct. For more recent crawls, the MIME type
detected from the content by Apache Tika is also available.
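The gap between a declared and a detected type can be illustrated with a minimal Python sketch. It only checks for the "%PDF-" signature that PDF files begin with; Apache Tika's detection is far more thorough, and the function names and sample payloads here are made up for illustration.

```python
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(payload: bytes) -> bool:
    """Return True if the payload starts with the PDF signature."""
    return payload.startswith(PDF_MAGIC)


def compare(declared_content_type: str, payload: bytes) -> str:
    """Classify a capture by comparing the HTTP header with the content."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = declared_content_type.split(";")[0].strip().lower()
    declared = media_type == "application/pdf"
    detected = looks_like_pdf(payload)
    if declared and detected:
        return "pdf (header and content agree)"
    if declared:
        return "declared as pdf, content is something else"
    if detected:
        return "pdf served with another Content-Type"
    return "not a pdf"


if __name__ == "__main__":
    # Hypothetical captures: (Content-Type header, first bytes of the body).
    samples = [
        ("application/pdf", b"%PDF-1.4\n"),
        ("application/pdf; charset=utf-8", b"<html>Not found</html>"),
        ("application/octet-stream", b"%PDF-1.7\n"),
    ]
    for header, body in samples:
        print(header, "->", compare(header, body))
```

Counting by the header alone would include the second sample and miss the third, which is why the two kinds of statistics can differ.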
On Linux with git, Python3 and R (to plot via ggplot2) installed,
you can generate the MIME metrics by running:
git clone https://github.com/commoncrawl/cc-crawl-statistics.git
cd cc-crawl-statistics
pip3 install -r requirements.txt
pip3 install -r requirements_plot.txt
pip3 install awscli
# download the data
./get_stats.sh
# $PWD must be in PYTHONPATH
export PYTHONPATH=$PYTHONPATH:$PWD
mkdir data
zcat stats/CC-MAIN-*.gz | python3 plot/mimetype.py
zcat stats/CC-MAIN-*.gz | python3 plot/mimetype_detected.py
grep 'application/pdf' data/mimetypes_percentage.csv
...
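If you prefer to keep the column header next to the matching rows, the grep can be replaced by a few lines of Python. This is only a sketch: it assumes the CSV starts with a header row and that the MIME type appears as a complete field, neither of which I have verified here, and `select_rows` is a made-up helper, not part of cc-crawl-statistics.

```python
import csv
import os
import sys


def select_rows(lines, needle="application/pdf"):
    """Yield the first row (assumed to be a header) and every later row
    in which at least one field equals `needle`."""
    for i, row in enumerate(csv.reader(lines)):
        if i == 0 or needle in row:
            yield row


if __name__ == "__main__":
    # Default path taken from the grep command above.
    path = sys.argv[1] if len(sys.argv) > 1 else "data/mimetypes_percentage.csv"
    if os.path.exists(path):
        with open(path, newline="") as f:
            csv.writer(sys.stdout).writerows(select_rows(f))
    else:
        print("file not found: %s" % path, file=sys.stderr)
```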
Let me know if you need help. I can also send you the generated CSV files
which contain all crawls. Only the latest ones are shown on
https://commoncrawl.github.io/cc-crawl-statistics/plots/mimetypes
Best,
Sebastian