Get the links to all files of a certain type for all domains

John

Jan 29, 2023, 10:31:32 PM

to Common Crawl

I'm trying to get all links ending in .pdf from the HTML of every .com and .net website indexed by Common Crawl. How can I go about doing this?

Sebastian Nagel

Jan 30, 2023, 5:04:22 AM

to common...@googlegroups.com

Hi John,

have a look at the columnar index

https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/

It allows quick filtering by top-level domain, by URL path (including
with regular expressions), or even by detected MIME type. See the
example queries:

https://github.com/commoncrawl/cc-index-table#query-the-table-in-amazon-athena
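
As a rough sketch of what such a query could look like for the .pdf case, the snippet below only builds the SQL text, to be pasted into Athena. The helper name build_pdf_query is hypothetical, the table and column names are assumed to match the cc-index-table README, and the crawl label in the usage note is just an example value.

```python
def build_pdf_query(crawl, tlds=("com", "net"), limit=None):
    """Build an Athena SQL string for captures whose URL path ends in .pdf.

    Assumptions: the table is registered as "ccindex"."ccindex" and has the
    columns used below, as described in the cc-index-table README.
    """
    # Quote each top-level domain for the SQL IN (...) list.
    tld_list = ", ".join("'{}'".format(t) for t in tlds)
    sql = (
        "SELECT url, warc_filename, warc_record_offset, warc_record_length\n"
        'FROM "ccindex"."ccindex"\n'
        "WHERE crawl = '{crawl}'\n"
        "  AND subset = 'warc'\n"
        "  AND url_host_tld IN ({tlds})\n"
        "  AND lower(url_path) LIKE '%.pdf'"
    ).format(crawl=crawl, tlds=tld_list)
    # An optional LIMIT keeps exploratory runs small.
    if limit is not None:
        sql += "\nLIMIT {}".format(int(limit))
    return sql
```

Calling build_pdf_query("CC-MAIN-2023-06", limit=100) returns the query text for one crawl. Restricting on the crawl and subset columns should keep the amount of scanned data down, since the table is partitioned on them.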

Let me know if you need more help.

One important note: this approach only gives you links that were
actually fetched by the Common Crawl crawler. If you also want links
that were seen but not visited, you need to process the WAT or WARC
files. But this is a much larger project.

Best,
Sebastian

John

Feb 2, 2023, 8:27:14 PM

to Common Crawl

Hi! Thank you for the response. Does Common Crawl index image files such as jpg and png? If not, how could I search the HTML for these?

Sebastian Nagel

Feb 14, 2023, 8:57:06 AM

to common...@googlegroups.com

Hi John,

(sorry for the delayed response)

> Does common crawl index image files such as jpg and png?

No. The Common Crawl crawler tries hard to avoid fetching images or
other media file formats. The focus is clearly on HTML pages, which make
up around 98% of the crawled URLs [1]. Images and media content are
fetched only occasionally.

However, the WARC and WAT files include links to images.
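
As a small illustration for the WAT case: each entry in a record's Links list carries a "path" such as "IMG@/src" naming the tag and attribute it came from, so image links can be picked out by tag rather than by file extension. The sketch below assumes that layout; the function name image_links and the sample record are hypothetical.

```python
from urllib.parse import urljoin


def image_links(wat_record):
    """Return absolute URLs of links that came from <img> tags."""
    envelope = wat_record.get("Envelope", {})
    base = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI", "")
    links = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
        .get("Links", [])
    )
    return [
        urljoin(base, link["url"])
        for link in links
        # Keep only entries whose path marks an IMG tag and that have a URL.
        if link.get("path", "").upper().startswith("IMG@") and link.get("url")
    ]


# Hand-written sample record (illustrative only, not real crawl data).
sample_page = {
    "Envelope": {
        "WARC-Header-Metadata": {"WARC-Target-URI": "https://example.com/gallery/"},
        "Payload-Metadata": {
            "HTTP-Response-Metadata": {
                "HTML-Metadata": {
                    "Links": [
                        {"path": "IMG@/src", "url": "cat.jpg"},
                        {"path": "A@/href", "url": "page2.html"},
                        {"path": "IMG@/src", "url": "https://example.net/dog.png"},
                    ]
                }
            }
        },
    }
}

images = image_links(sample_page)
```

Filtering by tag also catches images served from URLs without a .jpg or .png extension, which an extension match would miss.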

Best,
Sebastian

[1] https://commoncrawl.github.io/cc-crawl-statistics/plots/mimetypes

