I'm trying to search for all URLs with a certain path, such as '*.com/foo' or even '*/foo'. On executing this query, I'd expect to see results such as:
Testing the Common Crawl Index server, I see that queries can include a path for a specific domain - the following query works:
When a sub-domain wildcard is introduced, however, the path component of the query is ignored (you receive every URL in the index for all sub-domains):
Is this expected behaviour?
Is there a way that the index server can support paths with sub-domain wildcards, please?
Or is there another way of achieving the same result with existing tools, please? (I know that I can download the 200 GB of URL index files and write my own tool, but I'd rather use an existing one if it exists.)
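For reference, the do-it-yourself route would look roughly like the sketch below. It assumes that each line of a downloaded index file ends with a JSON object containing a `url` field, optionally preceded by a key and timestamp; I have not verified that against every index file. The function names and the `host_suffix` parameter are my own, not part of any Common Crawl tool.

```python
import json
from urllib.parse import urlsplit


def path_matches(url, wanted_path):
    """Return True if the URL's path equals wanted_path, ignoring a trailing slash."""
    path = urlsplit(url).path or "/"
    return path.rstrip("/") == wanted_path.rstrip("/")


def filter_records(lines, wanted_path, host_suffix=""):
    """Yield URLs whose path matches wanted_path and whose host ends with host_suffix.

    Assumption: each line contains a JSON object with a "url" field,
    possibly preceded by other space-separated fields.
    """
    for line in lines:
        start = line.find("{")
        if start == -1:
            continue  # no JSON payload on this line
        try:
            record = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # skip lines that do not parse
        url = record.get("url", "")
        host = urlsplit(url).hostname or ""
        if host.endswith(host_suffix) and path_matches(url, wanted_path):
            yield url
```

Calling `filter_records(lines, "/foo", ".com")` would correspond to the '*.com/foo' case, and leaving `host_suffix` empty to the '*/foo' case. It still means scanning every index file, which is why I'd prefer server-side support.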