Re: [cc] Searching for paths with the Common Crawl Index server (and some unexpected behaviour)


Tom Morris

Dec 26, 2020, 2:56:34 PM
The index server does prefix matches on URLs with inverted host names, so your examples, as transformed, would be:


You can do prefix matches on com.aardvark.* or com.aardvark/foo/*, but com.aardvark.*/foo or com.*/foo aren't prefix matches.
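To make the prefix rule concrete, here is a minimal Python sketch. It is not the index server's actual code: the helper names `inverted_key` and `is_prefix_pattern` are made up for illustration, and the key format simply follows the dotted notation used above (reversed host name followed by the path).

```python
from urllib.parse import urlsplit


def inverted_key(url: str) -> str:
    """Build a lookup key: host labels reversed, followed by the path.

    Hypothetical helper; mirrors the notation in this thread,
    e.g. https://www.aardvark.com/foo -> com.aardvark.www/foo
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    reversed_host = ".".join(reversed(host.split(".")))
    return reversed_host + parts.path


def is_prefix_pattern(pattern: str) -> bool:
    """True if the only wildcard (if any) is a single trailing '*'."""
    return "*" not in pattern[:-1]


key = inverted_key("https://www.aardvark.com/foo")

# A trailing wildcard is just "key starts with everything before the *".
matches_domain = key.startswith("com.aardvark.")

# Patterns from the thread: the first two are prefixes, the last two are not.
checks = {
    p: is_prefix_pattern(p)
    for p in ["com.aardvark.*", "com.aardvark/foo/*", "com.aardvark.*/foo", "com.*/foo"]
}
```

Because the keys are sorted, a trailing wildcard maps to one contiguous range of the index. A wildcard in the middle does not, which is why the path after it cannot be used to narrow the lookup.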

Having said that, the columnar index makes it easy to do a table scan matching against whatever pattern you like.
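As a rough illustration of what a table scan buys you, the sketch below filters rows on host and path independently. The rows, the `host`/`path` field names, and the `scan` function are all invented for the example; a real scan would run over the columnar index files with whatever query engine you prefer.

```python
import re

# Made-up sample rows standing in for records from the columnar index.
ROWS = [
    {"host": "www.aardvark.com", "path": "/foo"},
    {"host": "blog.aardvark.com", "path": "/bar"},
    {"host": "example.org", "path": "/foo/baz"},
]


def scan(rows, host_regex, path_regex):
    """Check every row against both patterns (a full scan, no index lookup)."""
    host_re = re.compile(host_regex)
    path_re = re.compile(path_regex)
    return [
        row
        for row in rows
        if host_re.search(row["host"]) and path_re.search(row["path"])
    ]


# Roughly '*.com/foo': any host ending in .com, path /foo or below it.
com_foo = scan(ROWS, r"\.com$", r"^/foo(/|$)")

# Roughly '*/foo': any host at all, same path condition.
any_foo = scan(ROWS, r"", r"^/foo(/|$)")
```

The trade-off is that every row is examined, so this is slower than a prefix lookup, but the pattern is no longer restricted to a prefix.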


On Fri, Dec 25, 2020 at 6:48 PM Andy Mackie wrote:

Hi all,

I'm trying to search for all URLs with a certain path, such as '*.com/foo' or even '*/foo'. On executing this query, I'd expect to see results such as:


Testing the Common Crawl Index server, I see that queries can include a path for a specific domain - the following query works:

When a sub-domain wildcard is introduced, however, the path component of the query is ignored (you receive every URL in the index for all sub-domains):

Is this expected behaviour?

Is there a way that the index server can support paths with sub-domain wildcards, please?

Or is there another way of achieving the same result, please, with existing tools? (I know that I can download the 200GB URL index files and create my own tool, but I'd rather use an existing tool if one exists).

