find urls containing "terms-of-service" using http://index.commoncrawl.org/ search pages?

Alex Henry

Nov 28, 2018, 1:30:51 PM
to Common Crawl
Hi, 

I'm trying to get a list of all URLs in the Common Crawl index that contain the substring "terms-of-service". Right now I'm just using http://index.commoncrawl.org/CC-MAIN-2018-47 to see if this kind of search can work.

I tried this query but got 0 results:  *terms-of-service* 

However, a query such as http://apple.com/* seems to work fine.

Clearly I'm missing something fundamental about this API. Is it possible to query subdomains without specifying the domain? If not, is there any way to do that without downloading the full index?

Thanks very much,

Alex

Sebastian Nagel

Nov 28, 2018, 2:12:20 PM
to common...@googlegroups.com
Hi Alex,

> is it possible to query subdomains without specifying the domain?

The search is always a prefix search on the field "urlkey", which looks like
com,example,subdomain)/path/to/terms-of-service.html
The pattern you enter is "translated" to match the "urlkey", aka SURT [1].
Because the key starts with the reversed host name, a pattern with a leading
wildcard such as *terms-of-service* cannot be turned into a prefix.
You need to specify at least the top-level domain and then filter
on the field "url"; see the API docs [2] for how to filter on a field.
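To illustrate why the prefix has to start with the domain, here is a minimal sketch of how a URL maps to a urlkey-like string. It is a simplification and not the actual SURT implementation from [1]: the function name simple_urlkey is hypothetical, and stripping "www." and lowercasing the path are assumptions made for the example.

```python
from urllib.parse import urlsplit


def simple_urlkey(url: str) -> str:
    """Approximate a SURT-style key: reversed host labels, ')', then the path.

    Simplified for illustration; query strings, ports and the other
    canonicalization rules of the real SURT library are ignored.
    """
    # Add a scheme if missing so urlsplit can find the host name.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):  # assumption: a leading "www." is dropped
        host = host[4:]
    reversed_host = ",".join(reversed(host.split(".")))
    path = (parts.path or "/").lower()  # assumption: path is lowercased
    return reversed_host + ")" + path


key = simple_urlkey("http://subdomain.example.com/path/to/terms-of-service.html")
print(key)  # com,example,subdomain)/path/to/terms-of-service.html
```

In this sketch the key begins with "com,", so any prefix search has to fix at least the top-level domain, and the "terms-of-service" part only appears at the end of the key.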

For example, this may look like:

http://index.commoncrawl.org/CC-MAIN-2018-47-index?url=commoncrawl.org&matchType=domain&filter=~url:.*faq.*
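A query URL like the one above could be assembled in Python roughly as follows. This is only a sketch: the helper name cdx_query_url is hypothetical, the parameters url, matchType and filter are taken from the example URL, and output=json is assumed to be supported as described in the API docs [2]. The sketch only builds the URL and does not send a request.

```python
from urllib.parse import urlencode


def cdx_query_url(collection: str, url_pattern: str, url_regex: str,
                  match_type: str = "domain") -> str:
    """Build an index query URL with a regex filter on the "url" field."""
    params = {
        "url": url_pattern,            # prefix part: must name the domain or TLD
        "matchType": match_type,       # e.g. "domain" to include subdomains
        "filter": "~url:" + url_regex,  # "~" marks a regex filter on a field
        "output": "json",              # assumption: one JSON record per line
    }
    # urlencode percent-encodes special characters such as ":" and "*".
    return ("http://index.commoncrawl.org/" + collection + "-index?"
            + urlencode(params))


query = cdx_query_url("CC-MAIN-2018-47", "commoncrawl.org", ".*faq.*")
print(query)
```

For the terms-of-service case, the regex would be something like ".*terms-of-service.*", with url_pattern set to a domain of interest.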

However, the filter is only applied as a secondary step to the records
matched by the prefix search, so it will take very long to get all
terms-of-service URLs for the .com top-level domain.

The faster solution for this problem is to use the columnar URL index [3,4]
and query the field "url_path". Recently, there was a similar question
in this group about finding job pages [5]. You may use Athena or Spark to query
the columnar index. The latter also allows you to process the terms-of-service
pages at scale. I've recently added an example of how to do this to the
cc-pyspark project [6].
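As a rough sketch of what such a query on the columnar index might look like, the snippet below builds a SQL string that could be pasted into Athena or passed to Spark SQL. Apart from "url_path", all names are assumptions based on the setup described in [4]: the table name "ccindex"."ccindex", the columns url, crawl, subset and url_host_tld, and the helper build_path_query itself, which is hypothetical. Please check the schema in [4] before running it.

```python
from typing import Optional


def build_path_query(crawl: str, path_substring: str,
                     tld: Optional[str] = None) -> str:
    """Build a SQL query selecting URLs whose path contains a substring."""

    def quote(value: str) -> str:
        # Escape single quotes so the value is a valid SQL string literal.
        return "'" + value.replace("'", "''") + "'"

    conditions = [
        "crawl = " + quote(crawl),    # assumed partition column: one crawl
        "subset = 'warc'",            # assumed partition column: fetched pages
        "url_path LIKE " + quote("%" + path_substring + "%"),
    ]
    if tld is not None:
        # Assumed column holding the top-level domain of the host name.
        conditions.append("url_host_tld = " + quote(tld))
    return ('SELECT url\nFROM "ccindex"."ccindex"\nWHERE '
            + "\n  AND ".join(conditions))


sql = build_path_query("CC-MAIN-2018-47", "terms-of-service", tld="com")
print(sql)
```

Restricting the query to a single crawl (and optionally a single top-level domain) should keep the amount of scanned data down. Note that "%" and "_" inside path_substring would act as LIKE wildcards in this sketch.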

Best,
Sebastian


[1] https://github.com/internetarchive/surt
[2] https://github.com/webrecorder/pywb/wiki/CDX-Server-API#filter
[3] http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
[4] https://github.com/commoncrawl/cc-index-table
[5] https://groups.google.com/d/topic/common-crawl/EBYaos2Yk1M/discussion
[6] https://github.com/commoncrawl/cc-pyspark

Alex Henry

Nov 28, 2018, 2:50:02 PM
to Common Crawl