Common Crawl's Crawling Strategy on PDF links

Weijian Li

Mar 25, 2019, 10:37:27 PM
to Common Crawl
Hello everyone,

I am currently working on a project that extracts a particular type of URL from PDF files in the Common Crawl dataset. I noticed that the April 2014 dataset included only 2 million PDFs, while the April 2017 dataset included 20 million PDFs, ten times as many.
 
I am curious whether the big difference in the number of PDFs was caused by different strategies for crawling PDF files or by changes in how people use PDFs.

I have tried to find the relevant information on the Common Crawl website but could not find anything. What I currently know is that Common Crawl used seed donations from a startup called Blekko in 2014, and that since autumn 2016 it has maintained its own crawl frontier.

Could anyone point me to where I can find information about Common Crawl's strategy for crawling PDF files in April 2014 and April 2017? Or does anyone have an idea why the number of PDFs in April 2017 was ten times that of April 2014?

Any information would be greatly appreciated.   :) 

Weijian Li

Sebastian Nagel

Mar 29, 2019, 6:49:00 AM
to common...@googlegroups.com
Hi Weijian,

> I am curious whether the big difference in the number of PDFs was caused by
> different strategies for crawling PDF files or by changes in how people use PDFs.

My guess would be that the crawling strategy is the more important factor.
Another possibility is that more sites (or rather their CMS) provide
"print" functionality. For the crawler it's just a link pointing to
a PDF version of the page.

Unfortunately, I do not know what happened back in 2014. I've checked whether
any URL filters were active (suppressing *.pdf): that wasn't the case.
But it could be that the Blekko seeds we relied upon in 2014 did penalize PDFs
(or just prefer HTML).

Since autumn 2016 we maintain our own crawl frontier, and since September 2018 PDFs
and other non-HTML content types are delayed when selected for refetch. That's
done because PDFs and multimedia formats are often large and compress poorly
in WARC files.

Below is the relative amount (in percent) of the application/pdf content type in the monthly crawls:

% monthly crawl
0.1621 CC-MAIN-2013-20
0.1562 CC-MAIN-2013-48
0.1719 CC-MAIN-2014-10
0.2105 CC-MAIN-2014-15 # << Apr 2014 : 5.5 million captures, 3.2 million unique URLs
0.1738 CC-MAIN-2014-23
0.1893 CC-MAIN-2014-35
0.1834 CC-MAIN-2014-41
0.2421 CC-MAIN-2014-42
0.2202 CC-MAIN-2014-49
0.1675 CC-MAIN-2014-52
0.1941 CC-MAIN-2015-14
0.1722 CC-MAIN-2015-18
0.1774 CC-MAIN-2015-22
0.1772 CC-MAIN-2015-27
0.1709 CC-MAIN-2015-32
0.1671 CC-MAIN-2015-35
0.1678 CC-MAIN-2015-40
0.1589 CC-MAIN-2015-48
0.1574 CC-MAIN-2016-07
0.1712 CC-MAIN-2016-18
0.2515 CC-MAIN-2016-22
0.2647 CC-MAIN-2016-26
0.2268 CC-MAIN-2016-30
0.2221 CC-MAIN-2016-36
0.2518 CC-MAIN-2016-40
0.1941 CC-MAIN-2016-44
0.1958 CC-MAIN-2016-50
0.2375 CC-MAIN-2017-04
0.2674 CC-MAIN-2017-09
0.6915 CC-MAIN-2017-13
0.7644 CC-MAIN-2017-17 # << Apr 2017 : 22.5 million captures
0.5987 CC-MAIN-2017-22
0.7102 CC-MAIN-2017-26
0.7463 CC-MAIN-2017-30
0.5556 CC-MAIN-2017-34
0.8228 CC-MAIN-2017-39
0.4501 CC-MAIN-2017-43
0.4945 CC-MAIN-2017-47
0.2841 CC-MAIN-2017-51
0.8428 CC-MAIN-2018-05
0.6064 CC-MAIN-2018-09
0.5160 CC-MAIN-2018-13
0.4462 CC-MAIN-2018-17
0.4110 CC-MAIN-2018-22
0.5468 CC-MAIN-2018-26
0.5305 CC-MAIN-2018-30
0.4556 CC-MAIN-2018-34
0.1419 CC-MAIN-2018-39 # << Sep 2018 : refetch of PDFs delayed
0.8036 CC-MAIN-2018-43
0.9513 CC-MAIN-2018-47
1.1398 CC-MAIN-2018-51
0.9150 CC-MAIN-2019-04
1.0602 CC-MAIN-2019-09
0.4675 CC-MAIN-2019-13 # << Mar 2019 : increased delay

> I noticed that the April 2014 dataset included only 2 million PDFs, while the
> April 2017 dataset included 20 million PDFs, ten times as many.

Well, for April 2014 I get 5.5 million captures and 3.2 million unique URLs
for "application/pdf" in the statistics derived from the CDX index.
These counts are based on what the web servers send as "Content-Type", which is
notoriously noisy and not necessarily correct. For more recent crawls, the MIME
type detected from the content by Apache Tika is also available.

On Linux with git, Python 3 and R (for plotting via ggplot2) installed,
you can generate the MIME type metrics by running:

git clone https://github.com/commoncrawl/cc-crawl-statistics.git
cd cc-crawl-statistics
pip3 install -r requirements.txt
pip3 install -r requirements_plot.txt
pip3 install awscli
# download the data
./get_stats.sh
# $PWD must be in PYTHONPATH
export PYTHONPATH=$PYTHONPATH:$PWD
mkdir data
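# compute the per-crawl MIME type metrics (e.g. data/mimetypes_percentage.csv)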
zcat stats/CC-MAIN-*.gz | python3 plot/mimetype.py
zcat stats/CC-MAIN-*.gz | python3 plot/mimetype_detected.py
grep 'application/pdf' data/mimetypes_percentage.csv
...

Let me know if you need help. I can also send you the generated CSV files
which contain all crawls. Only the latest ones are shown on
https://commoncrawl.github.io/cc-crawl-statistics/plots/mimetypes

Best,
Sebastian

Weijian Li

Mar 30, 2019, 5:41:16 PM
to Common Crawl
Hi Sebastian,

Thank you so much for your help! I have further questions.
 

> Another possibility is that more sites (or rather their CMS) provide
> "print" functionality. For the crawler it's just a link pointing to
> a PDF version of the page.

What does this mean? Would you mind explaining it a little further?


> Unfortunately, I do not know what happened back in 2014. I've checked whether
> any URL filters were active (suppressing *.pdf): that wasn't the case.
> But it could be that the Blekko seeds we relied upon in 2014 did penalize PDFs
> (or just prefer HTML).
>
> Since autumn 2016 we maintain our own crawl frontier

Is the crawl frontier you have maintained since autumn 2016 similar to the seed donations from Blekko (by similar, I mean that both are lists of URI seeds)?

Thank you again for your answer.  :)

Best wishes,
Weijian

Greg Lindahl

Mar 30, 2019, 7:15:30 PM
to common...@googlegroups.com
On Fri, Mar 29, 2019 at 11:48:55AM +0100, Sebastian Nagel wrote:

> But it could be that the Blekko seeds we relied upon in 2014 did penalize PDFs
> (or just prefer HTML).

blekko never intentionally crawled any pdf. If I recall correctly, we
analyzed the apparent extension at the end of the path of an url, and
used that to decide if we would include the url in our frontier. I'm
pretty sure that was extremely narrowly limited to just paths ending
with /, .html (or aliases like .php, .aspx) and things we thought were
raw text (like .txt).

Looking through the one blekko frontier I've got lying around, I don't
see any url with a path ending with .pdf; I see a lot of urls like:

http://example.com/downloader.php?file=example.pdf

which blekko's naive algorithm would consider html -- the path ends
with .php -- and would include in the frontier. If that url was a
redirect to the actual pdf file, then CommonCrawl would have crawled
through to the pdf.
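
For illustration, here is a minimal Python sketch of the extension heuristic Greg
describes; the function name and the exact suffix list are assumptions, not
blekko's actual code:

from urllib.parse import urlsplit
import posixpath

# Extensions treated as HTML-like or raw text; the exact list is a guess based
# on Greg's description (.html and aliases such as .php/.aspx, plus .txt).
HTML_LIKE = {'html', 'htm', 'php', 'asp', 'aspx', 'txt'}

def include_in_frontier(url: str) -> bool:
    """Decide inclusion from the apparent extension of the URL path only;
    the query string is ignored, as in the heuristic described above."""
    path = urlsplit(url).path or '/'
    if path.endswith('/'):
        return True  # directory-style URL
    ext = posixpath.splitext(path)[1].lstrip('.').lower()
    if not ext:
        return True  # no dot-extension in the last path segment (e.g. ".../pdf")
    return ext in HTML_LIKE

# The dynamic-download URL is accepted (its path ends in .php), while a direct
# PDF link is rejected:
print(include_in_frontier('http://example.com/downloader.php?file=example.pdf'))  # True
print(include_in_frontier('http://example.com/report.pdf'))                       # False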

So, in this previous era of CommonCrawl, when blekko's frontier was
used as the entire seed, blekko's policy decision flowed through to
disfavor most pdf content in CommonCrawl. It was a strong selection
effect in favor of the limited number of websites that either used the
above style of url or put pdf content into urls ending with /.

After Sebastian came on-board and improved CommonCrawl's crawler to be
much more diverse, I'm sure that situation changed for the better.

-- greg
(former blekko CTO)


Sebastian Nagel

Apr 1, 2019, 9:07:46 AM
to common...@googlegroups.com
> Another possibility is that more sites (or rather their CMS) provide
> "print" functionality. For the crawler it's just a link pointing to
> a PDF version of the page.
>
> What does this mean? Would you mind explaining it a little further?

I mean pages which provide a link to view the same page as PDF, e.g.

https://www.consilium.europa.eu/en/press/press-releases/2018/05/25/copyright-rules-for-the-digital-environment-council-agrees-its-position/
which contains a link "Download as pdf" pointing to a PDF-version
(https://www.consilium.europa.eu/en/press/press-releases/2018/05/25/copyright-rules-for-the-digital-environment-council-agrees-its-position/pdf)

But this is only a guess; I do not have any evidence that such links
have become more widely used in recent years.


But these links are difficult to detect because they do not point to
"static" PDF files with a .pdf suffix; the PDF may be indicated only by
a path or query parameter: "/pdf", "print=pdf", "download=true&type=pdf",
"toPdf=true", "do=export_pdf", etc.


> Is the crawl frontier you have maintained since autumn 2016 similar to the seed
> donations from Blekko (by similar, I mean that both are lists of URI seeds)?

Yes, they're similar: basically, just a list of billions of pairs <URI, score>.


Best,
Sebastian

Greg Lindahl

Apr 1, 2019, 6:20:19 PM
to common...@googlegroups.com
On Mon, Apr 01, 2019 at 03:07:42PM +0200, Sebastian Nagel wrote:

> I mean pages which provide a link to view the same page as PDF, e.g.
>
> https://www.consilium.europa.eu/en/press/press-releases/2018/05/25/copyright-rules-for-the-digital-environment-council-agrees-its-position/
> which contains a link "Download as pdf" pointing to a PDF-version
> (https://www.consilium.europa.eu/en/press/press-releases/2018/05/25/copyright-rules-for-the-digital-environment-council-agrees-its-position/pdf)

And for those wondering how the blekko-era algorithm applies to this
particular example: since the one ending in /pdf doesn't have a dot
before the 'pdf', it IS included in the blekko frontier.

I have no idea what fraction of the web's pdfs have a path that ends
in '.pdf', but the question could certainly be explored in a
recent CommonCrawl monthly crawl.

-- greg


Sebastian Nagel

Apr 2, 2019, 4:35:12 AM
to common...@googlegroups.com
Hi Weijian, hi Greg,

> I have no idea what fraction of the web's pdfs have a path that ends
> in '.pdf', but the question could certainly be explored in a
> recent CommonCrawl monthly crawl.

That's actually pretty easy using the columnar index [1]:

SELECT COUNT(*) as count,
regexp_like(url_path, '(?i)\.pdf$') as has_suffix_pdf
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2019-13'
AND subset = 'warc'
AND content_mime_detected = 'application/pdf'
GROUP BY regexp_like(url_path, '(?i)\.pdf$');

and for the March crawl the results are:

   count     has_suffix_pdf
10877528     true
 1410617     false

So, a filter for *.pdf would block 80-90% of all PDFs. Even today we would get
a similar reduction in the number of PDFs if a simple suffix filter were enabled.
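
A quick Python back-of-the-envelope check of that figure from the two counts above:

pdf_suffix, no_suffix = 10877528, 1410617
print(pdf_suffix / (pdf_suffix + no_suffix))  # ~0.885, i.e. roughly 88.5% of the PDF captures end in .pdf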

Thanks for the clarifying remarks, Greg!


Best,
Sebastian


[1] http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/