Is there a way to obtain only the URLs that have been crawled?


Humanities Clinic

Jun 28, 2019, 7:14:46 AM
to Common Crawl
Is there a way to obtain only the URLs that have been crawled? For example, I'd like to get all the URLs within a domain that match a certain regex.

Alternatively, because the files are very large, is there a web API that I can query directly rather than downloading the files in full?

Sebastian Nagel

Jun 28, 2019, 7:24:29 AM
to common...@googlegroups.com
Hi,

please have a look at the URL indexes:
https://index.commoncrawl.org/
(and the client linked from there)
http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/

There are a couple of options for your use case. They have already been discussed on this
list, e.g.
https://groups.google.com/d/topic/common-crawl/EBYaos2Yk1M/discussion
https://groups.google.com/d/topic/common-crawl/7l9VSQ00fgw/discussion
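
As a rough sketch of the index-server route, the Python snippet below builds a query URL for one crawl, parses the response (assumed to be one JSON object per line when output=json is requested) and keeps only the URLs matching a regex. The crawl ID "CC-MAIN-2019-26", the domain example.com and all function names are placeholders chosen for illustration, not part of any official client. Large domains may return results split into pages, which this sketch does not handle.

```python
import json
import re
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the URL index server mentioned above.
INDEX_SERVER = "https://index.commoncrawl.org"


def build_query_url(crawl_id, url_pattern, fields=("url",)):
    """Build an index query URL for one crawl (hypothetical helper).

    crawl_id is e.g. "CC-MAIN-2019-26"; url_pattern may end in "/*"
    to ask for everything below a domain or path prefix.
    """
    params = {"url": url_pattern, "output": "json", "fl": ",".join(fields)}
    return f"{INDEX_SERVER}/{crawl_id}-index?{urlencode(params)}"


def parse_index_lines(text):
    """Parse a response body that holds one JSON record per line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def urls_matching(records, regex):
    """Return the 'url' field of every record whose URL matches regex."""
    pattern = re.compile(regex)
    return [rec["url"] for rec in records if pattern.search(rec["url"])]


def fetch_urls(crawl_id, url_pattern, regex):
    """Query the index over HTTP and filter the result client-side.

    Needs network access; error handling (e.g. no captures found)
    is deliberately left out of this sketch.
    """
    with urlopen(build_query_url(crawl_id, url_pattern)) as resp:
        text = resp.read().decode("utf-8")
    return urls_matching(parse_index_lines(text), regex)
```

A call such as fetch_urls("CC-MAIN-2019-26", "example.com/*", r"/blog/") would then return only the matching URLs, without downloading any WARC files. Filtering is done client-side here to keep the example simple.
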

Feel free to ask for further advice!

Best,
Sebastian
