Get the links to all files of a certain type for all domains
John
Jan 29, 2023, 10:31:32 PM
to Common Crawl
I'm trying to get the links to all URLs ending in .pdf from the HTML of all .com and .net websites indexed by Common Crawl. How can I go about doing this?
Sebastian Nagel
Jan 30, 2023, 5:04:22 AM
One important note: this approach only gives you links that were actually
crawled by the Common Crawl crawler. If you also want links that were seen
but not visited, you need to process the WAT or WARC files. But this is a
much larger project.
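
For readers wondering what "processing the WAT files" might look like, here is a minimal sketch, not taken from the original reply. It assumes each WAT metadata record is a JSON document whose extracted links sit under Envelope > Payload-Metadata > HTTP-Response-Metadata > HTML-Metadata > Links, and that the page URL is in WARC-Target-URI. The function name and the TLD check are illustrative choices, and reading the records out of the gzipped WAT files (for example with a WARC library) is left out.

```python
import json
from urllib.parse import urljoin, urlsplit


def pdf_links_from_wat_record(record_json, tlds=("com", "net")):
    """Return absolute .pdf link targets found in one WAT metadata record.

    Hypothetical helper: `record_json` is the JSON payload of a single
    WAT record, passed in as a string.
    """
    envelope = json.loads(record_json).get("Envelope", {})
    page_url = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI", "")

    # Only keep pages whose host ends in one of the requested TLDs.
    host = urlsplit(page_url).hostname or ""
    if host.rsplit(".", 1)[-1] not in tlds:
        return []

    links = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
        .get("Links", [])
    )

    found = []
    for link in links:
        # Links may be relative, so resolve them against the page URL.
        target = urljoin(page_url, link.get("url", ""))
        if urlsplit(target).path.lower().endswith(".pdf"):
            found.append(target)
    return found
```

Checking the URL path rather than the whole URL means a link such as report.pdf?download=1 still counts, while a query string that merely ends in ".pdf" does not. Links found this way are only candidates: nothing guarantees the target was ever fetched or really is a PDF.
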
Best,
Sebastian
John
Feb 2, 2023, 8:27:14 PM
to Common Crawl
Hi! Thank you for the response. Does Common Crawl index image files such as JPG and PNG? If not, how could I search the HTML for these?
Sebastian Nagel
Feb 14, 2023, 8:57:06 AM
to common...@googlegroups.com
Hi John,
(sorry for the delayed response)
> Does common crawl index image files such as jpg and png?
No. The Common Crawl crawler tries hard to avoid fetching any images or
other media file formats. The focus is clearly on HTML pages, which make
up around 98% of the crawled URLs [1]. Images and media content might get
fetched only occasionally.
However, the WARC and WAT files include links to images.
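
As a rough illustration of searching the HTML for image links (a sketch under assumptions, not something from the original reply): once the HTML payload of a WARC response record is available as a string, the standard-library HTML parser can collect the src attribute of every img tag. The class and function names below are made up for this example, and only .jpg, .jpeg and .png are matched.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# Assumed set of extensions; extend as needed.
IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png")


class ImageLinkParser(HTMLParser):
    """Collects absolute URLs of <img src=...> targets that look like JPG/PNG."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if not src:
            return
        # Resolve relative references against the page URL.
        target = urljoin(self.base_url, src)
        if urlsplit(target).path.lower().endswith(IMAGE_SUFFIXES):
            self.image_links.append(target)


def image_links(html, base_url):
    """Return image link targets found in `html`, in document order."""
    parser = ImageLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.image_links
```

This only looks at img tags; images referenced from CSS, srcset attributes or plain anchors would need extra handling.
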
On 2/3/23 02:27, John wrote:
> Hi! Thank you for the response. Does common crawl index image files such
> as jpg and png? If not how could I search the HTML for these?
>
> On Monday, January 30, 2023 at 4:04:22 AM UTC-6 Sebastian Nagel wrote:
>
> Hi John,
>
> have a look at the columnar index
>
>