Hi Alex,
> is it possible to query subdomains without specifying the domain?
The search is always a prefix search on the field "urlkey" which looks like
com,example,subdomain)/path/to/terms-of-service.html
The pattern you enter is "translated" to match the "urlkey" aka SURT [1].
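
To make the "translation" more concrete, here is a rough sketch of how a URL could be turned into a urlkey-like string. The function `simple_urlkey` is a hypothetical helper written for illustration only; it is not the surt library [1] and skips most of the canonicalization the real implementation performs.

```python
from urllib.parse import urlsplit


def simple_urlkey(url: str) -> str:
    """Approximate a SURT-style urlkey (illustrative only, not the surt library)."""
    # Add a scheme if missing so that urlsplit() recognizes the host part.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = (parts.hostname or "").lower()
    # Assumption: a leading "www." is dropped, as is common in canonicalized keys.
    if host.startswith("www."):
        host = host[4:]
    # Reverse the host name parts: subdomain.example.com -> com,example,subdomain
    reversed_host = ",".join(reversed(host.split(".")))
    key = f"{reversed_host}){parts.path or '/'}"
    if parts.query:
        key += "?" + parts.query
    return key


print(simple_urlkey("http://subdomain.example.com/path/to/terms-of-service.html"))
# prints: com,example,subdomain)/path/to/terms-of-service.html
```

Because the top-level domain comes first in the key, a prefix search cannot start at the subdomain or the path.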
You need to specify at least the top-level domain and then filter
on the field "url"; see the API docs [2] for how to filter on a field.
For example, this may look like:
http://index.commoncrawl.org/CC-MAIN-2018-47-index?url=commoncrawl.org&matchType=domain&filter=~url:.*faq.*
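
If you build such a query in a script, it may be safer to let a library percent-encode the parameters. The following is a minimal sketch using only the parameters from the example URL above; `build_cdx_query` and `CDX_BASE` are hypothetical names, and no request is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

CDX_BASE = "http://index.commoncrawl.org"  # host taken from the example URL


def build_cdx_query(crawl: str, url: str, match_type: str, field_filter: str) -> str:
    """Assemble an index query URL with percent-encoded parameters."""
    params = {"url": url, "matchType": match_type, "filter": field_filter}
    return f"{CDX_BASE}/{crawl}-index?{urlencode(params)}"


query = build_cdx_query("CC-MAIN-2018-47", "commoncrawl.org", "domain", "~url:.*faq.*")
print(query)

# Decoding the query string gives back the original parameter values.
decoded = parse_qs(urlsplit(query).query)
print(decoded["filter"])
```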
However, the filter is only applied as a secondary step after the prefix
lookup, so it will take a very long time to get all terms-of-service URLs
for the .com top-level domain.
The faster solution for this problem is to use the columnar URL index [3,4]
and query the field "url_path". Recently, there was a similar question
in this group about finding job pages [5]. You may use Athena or Spark to
query the columnar index. The latter also allows you to process the
terms-of-service pages at scale. I've recently added an example of how to
do this to the cc-pyspark project [6].
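
As a rough sketch of what such a query on "url_path" might look like, the snippet below only assembles a SQL string; it does not run anything. Apart from "url_path", every name here is an assumption: the table name "ccindex"."ccindex" depends on how you set up the table, the other column names should be checked against the schema in [4], and the pattern '%terms-of-service%' is just an illustration. `build_path_query` and `sql_quote` are hypothetical helpers.

```python
def sql_quote(value: str) -> str:
    """Quote a string literal for SQL by doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_path_query(crawl: str, tld: str, path_pattern: str, limit: int = 100) -> str:
    """Return a SQL string that filters the columnar index on url_path."""
    return (
        "SELECT url, warc_filename, warc_record_offset, warc_record_length\n"
        'FROM "ccindex"."ccindex"\n'  # assumed table name, adjust to your setup
        f"WHERE crawl = {sql_quote(crawl)}\n"
        "  AND subset = 'warc'\n"  # assumed partition column and value
        f"  AND url_host_tld = {sql_quote(tld)}\n"  # assumed column name
        f"  AND url_path LIKE {sql_quote(path_pattern)}\n"
        f"LIMIT {int(limit)}"
    )


sql = build_path_query("CC-MAIN-2018-47", "com", "%terms-of-service%")
print(sql)
```

Restricting the query to a single crawl and top-level domain should keep the amount of scanned data down, since only the needed columns and partitions are read.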
Best,
Sebastian
[1] https://github.com/internetarchive/surt
[2] https://github.com/webrecorder/pywb/wiki/CDX-Server-API#filter
[3] http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
[4] https://github.com/commoncrawl/cc-index-table
[5] https://groups.google.com/d/topic/common-crawl/EBYaos2Yk1M/discussion
[6] https://github.com/commoncrawl/cc-pyspark