Re: [cc] Searching for paths with the Common Crawl Index server (and some unexpected behaviour)


Tom Morris

Dec 26, 2020, 2:56:34 PM
to common...@googlegroups.com
The index server does prefix matches on URLs with inverted host names, so your examples, as transformed, would be:

com.000/foo
com.aaa/foo
com.aardvark/foo

You can do prefix matches on com.aardvark.* or com.aardvark/foo/*, but com.aardvark.*/foo and com.*/foo aren't prefix matches, because the wildcard falls in the middle of the key rather than at the end.
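
To illustrate why only trailing wildcards work, here is a minimal sketch in Python. It is not the index server's code: `invert_host` and `prefix_match` are hypothetical helpers, the sample URLs are made up, and the dotted key format simply follows the examples above (the keys the server actually stores may use different punctuation).

```python
def invert_host(url_without_scheme: str) -> str:
    """Reverse the host labels, e.g. 'aardvark.com/foo' -> 'com.aardvark/foo'."""
    host, sep, path = url_without_scheme.partition("/")
    inverted = ".".join(reversed(host.split(".")))
    return inverted + sep + path


def prefix_match(keys, prefix):
    """Return every key that starts with the given prefix."""
    return [k for k in keys if k.startswith(prefix)]


# Hypothetical sample data, sorted the way an index of inverted keys would be.
keys = sorted(
    invert_host(u)
    for u in [
        "000.com/foo",
        "aaa.com/foo",
        "aardvark.com/foo",
        "aardvark.com/foo/baz",
        "www.aardvark.com/bar",
    ]
)

# Trailing wildcards are plain prefix lookups:
subdomains = prefix_match(keys, "com.aardvark.")     # com.aardvark.*
under_foo = prefix_match(keys, "com.aardvark/foo/")  # com.aardvark/foo/*

# A pattern like com.*/foo has no fixed prefix beyond "com.", so every
# key under "com." has to be inspected and filtered on its path.
com_foo = [k for k in prefix_match(keys, "com.") if k.partition("/")[2] == "foo"]
```

Because the keys sort by inverted host first, everything under one host (or one domain and its sub-domains) is contiguous, which is what makes the first two lookups cheap; the last one degenerates into a scan.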

Having said that, the columnar index makes it easy to do a table scan matching against whatever pattern you like.
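
As a rough sketch of such a table scan, the Python helper below only builds the SQL text for a query engine such as Athena; it does not run anything. The table name `ccindex.ccindex`, the column names (`url`, `url_path`, `url_host_tld`, `crawl`, `subset`) and the crawl label are assumptions about how the columnar index has been registered, so check them against your own setup. `build_path_scan_query` is a hypothetical name.

```python
def build_path_scan_query(path, crawl, tld=None, limit=100):
    """Build SQL that scans one crawl for URLs with an exact path.

    Table and column names are assumptions about the columnar index schema.
    """

    def quote(value):
        # Escape single quotes for a SQL string literal.
        return "'" + value.replace("'", "''") + "'"

    conditions = [
        f"crawl = {quote(crawl)}",        # restrict to one crawl partition
        "subset = 'warc'",                # successful fetches only (assumed)
        f"url_path = {quote(path)}",      # the path to match, e.g. '/foo'
    ]
    if tld is not None:
        conditions.append(f"url_host_tld = {quote(tld)}")  # e.g. '*.com/foo'

    where = "\n  AND ".join(conditions)
    return f"SELECT url\nFROM ccindex.ccindex\nWHERE {where}\nLIMIT {int(limit)}"


# '*.com/foo' restricted to a single crawl (the crawl label is an example value).
query = build_path_scan_query("/foo", "CC-MAIN-2020-50", tld="com")
print(query)
```

Leaving out `tld` gives the '*/foo' case. Filtering on the crawl partition keeps the scan, and therefore the cost, limited to one crawl.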


Tom


On Fri, Dec 25, 2020 at 6:48 PM Andy Mackie <andyma...@gmail.com> wrote:

Hi all,

I'm trying to search for all URLs with a certain path, such as '*.com/foo' or even '*/foo'. On executing this query, I'd expect to see results such as:

etc.

Testing the Common Crawl Index server, I see that queries can include a path for a specific domain - the following query works:


When a sub-domain wildcard is introduced, however, the path component of the query is ignored (you receive every URL in the index for all sub-domains):


Is this expected behaviour?

Is there a way that the index server can support paths with sub-domain wildcards, please?

Or is there another way of achieving the same result, please, with existing tools? (I know that I can download the 200GB URL index files and create my own tool, but I'd rather use an existing tool if one exists).

Thanks.
