Get the links to all files of a certain type for all domains

John

Jan 29, 2023, 10:31:32 PM
to Common Crawl
I'm trying to get all links ending in .pdf from the HTML of every .com and .net website indexed by Common Crawl. How can I go about doing this?

Sebastian Nagel

Jan 30, 2023, 5:04:22 AM
to common...@googlegroups.com
Hi John,

have a look at the columnar index


https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/

It allows for quick filtering by top-level domain, (regular expression
on) URL path or even identified MIME type. See the example queries:


https://github.com/commoncrawl/cc-index-table#query-the-table-in-amazon-athena
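
As a rough illustration of that kind of filter, here is a minimal sketch in Python that only builds the SQL text for Athena. It assumes the table was registered as "ccindex"."ccindex" as in the cc-index-table README, and that the columns are named url_host_tld, url_path, content_mime_detected, crawl and subset. The crawl label is a placeholder and build_pdf_query is a hypothetical helper, not part of any Common Crawl tooling.

```python
def build_pdf_query(crawl: str, tlds=("com", "net")) -> str:
    """Build an Athena SQL string selecting crawled PDF URLs for the given TLDs.

    Assumption: table and column names follow the cc-index-table README.
    """
    # Refuse anything that is not a plain TLD label before putting it into SQL.
    if not all(t.isalnum() for t in tlds):
        raise ValueError("TLDs must be alphanumeric labels such as 'com'")
    tld_list = ", ".join(f"'{t}'" for t in tlds)
    return (
        "SELECT url, warc_filename, warc_record_offset, warc_record_length\n"
        'FROM "ccindex"."ccindex"\n'
        f"WHERE crawl = '{crawl}'\n"
        "  AND subset = 'warc'\n"
        f"  AND url_host_tld IN ({tld_list})\n"
        # Match either on the URL path suffix or on the detected MIME type.
        "  AND (lower(url_path) LIKE '%.pdf'\n"
        "       OR content_mime_detected = 'application/pdf')"
    )


# 'CC-MAIN-2023-06' is only a placeholder; substitute the crawl you want to query.
print(build_pdf_query("CC-MAIN-2023-06"))
```

Restricting the query to a single crawl and subset keeps the amount of data Athena has to scan small, since both are partition columns of the table.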

Let me know if you need more help.

One important note: this approach only gives you links actually crawled
by the Common Crawl crawler. If you also want links that were seen but
not visited, you need to process the WAT or WARC files. But this is a
much larger project.

Best,
Sebastian

John

Feb 2, 2023, 8:27:14 PM
to Common Crawl
Hi! Thank you for the response. Does Common Crawl index image files such as JPG and PNG? If not, how could I search the HTML for these?

Sebastian Nagel

Feb 14, 2023, 8:57:06 AM
to common...@googlegroups.com
Hi John,

(sorry for the delayed response)

> Does common crawl index image files such as jpg and png?

No. The Common Crawl crawler tries hard to avoid fetching any images or
other media file formats. The focus is clearly on HTML pages, which make
up around 98% of the crawled URLs [1]. Images and media content might
get fetched only occasionally.

However, the WARC and WAT files include links to images.
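
To give an idea of what pulling those links out of a WAT record could look like, here is a minimal sketch. It assumes each WAT record's JSON payload has already been parsed into a dict, and that links sit under Envelope > Payload-Metadata > HTTP-Response-Metadata > HTML-Metadata > Links, each with a "url" field. Please verify that layout against a real WAT file. The function name image_links and the sample record are made up for illustration.

```python
from urllib.parse import urljoin, urlparse

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def image_links(wat_record: dict, extensions=IMAGE_EXTENSIONS) -> list:
    """Return absolute URLs of links in one WAT record whose path ends
    with one of the given extensions (assumed WAT JSON layout)."""
    envelope = wat_record.get("Envelope", {})
    # The page URL is used to resolve relative links.
    base_url = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI", "")
    links = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
        .get("Links", [])
    )
    found = []
    for link in links:
        url = link.get("url")
        if not url:
            continue
        absolute = urljoin(base_url, url)
        # Compare the path only, so query strings do not hide the extension.
        if urlparse(absolute).path.lower().endswith(tuple(extensions)):
            found.append(absolute)
    return found


# Invented sample record, shaped like the layout assumed above.
sample = {"Envelope": {
    "WARC-Header-Metadata": {"WARC-Target-URI": "https://example.com/gallery/"},
    "Payload-Metadata": {"HTTP-Response-Metadata": {"HTML-Metadata": {"Links": [
        {"path": "IMG@/src", "url": "cat.png"},
        {"path": "A@/href", "url": "/about.html"},
        {"path": "A@/href", "url": "https://example.net/photo.JPG?size=large"},
    ]}}},
}}
print(image_links(sample))
```

Passing extensions=(".pdf",) would apply the same idea to PDF links that were seen on a page but never fetched.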

Best,
Sebastian

[1] https://commoncrawl.github.io/cc-crawl-statistics/plots/mimetypes

