Using Regex to filter URLs on index server


Y R

Jul 23, 2024, 10:25:46 AM
to Common Crawl
Hi all,
I want to find out which domains have blog(s) as a subdomain (like blog(s).example.com) or as part of the path (like example.com/blog(s)).

For example, let's check blogs.oracle.com

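To confirm that there are captures starting with blogs.oracle.com, I first searched without a regex (url=blogs.oracle.com/*), which returns captures.

Then I queried oracle.com with a regex filter on the URL:

url:oracle.com
matchType:domain
filter:~url:https?://blogs.*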

Besides this, if I set showNumPages to true, it returns {"pages": 38, "pageSize": 5, "blocks": 188}

So clearly it should have captures, but it doesn't return anything! 

Note that if I use "www" instead of "blogs" in the regex, as in "filter:~url:https?://www.*", it returns captures with "https://www.oracle.com/..."

Any help?

Sebastian Nagel

Jul 23, 2024, 5:28:54 PM
to common...@googlegroups.com
Hi,

The filter option is a purely secondary operation: it is applied
only to the results that are actually returned for oracle.com.
The option showNumPages just gives you the number of blocks in
the cluster.idx file which contain records from oracle.com.

To query for blogs.oracle.com just set

url:blogs.oracle.com
matchType:domain

without a filter. Domain here means including "all subdomains",
see [1].

Best,
Sebastian

[1] https://pywb.readthedocs.io/en/latest/manual/cdxserver_api.html
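As a rough illustration of the suggestion above, here is a minimal Python sketch that builds the query URL without a filter and then checks returned URLs for a blog(s) label on the client side. The index endpoint is taken from the links in this thread; the helper names build_query and has_blog_part are hypothetical, and the host check is only an approximation (it looks at every host label except the last one and does not know about public suffixes).

```python
from urllib.parse import urlencode, urlsplit
import re

# Endpoint as used in the links of this thread (one specific crawl).
INDEX_ENDPOINT = "https://index.commoncrawl.org/CC-MAIN-2024-26-index"

# Matches exactly "blog" or "blogs".
BLOG_LABEL = re.compile(r"blogs?")


def build_query(url, match_type=None, **params):
    """Build an index query URL (hypothetical helper, no request is sent)."""
    query = {"url": url}
    if match_type:
        query["matchType"] = match_type
    query.update(params)
    return INDEX_ENDPOINT + "?" + urlencode(query)


def has_blog_part(url):
    """Return True if a host label or a path segment is blog or blogs.

    Approximation: every host label except the last one is checked,
    so a registered domain named "blog" would also match.
    """
    parts = urlsplit(url)
    host_labels = (parts.hostname or "").split(".")[:-1]
    path_segments = [seg for seg in parts.path.split("/") if seg]
    return any(BLOG_LABEL.fullmatch(x) for x in host_labels + path_segments)


if __name__ == "__main__":
    # Query the subdomain directly instead of filtering oracle.com results.
    print(build_query("blogs.oracle.com", "domain",
                      fl="url", output="text", limit=100))
    print(has_blog_part("https://blogs.oracle.com/java/"))
    print(has_blog_part("https://www.oracle.com/cloud/"))
```

Checking the URLs locally avoids relying on the server-side filter, which only sees the results that the query itself returns.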

On 7/23/24 12:57, Y R wrote:
> Hi all
> I want to know which domain has blog(s) as a subdomain (like
> blog(s).example.com) or as a part of path (like example.com/blog(s)).
>
> For example, let's check blogs.oracle.com
>
> To see that there are captures start with blogs.oracle.com, I search it
> using this way:
> Search without regex
> <https://index.commoncrawl.org/CC-MAIN-2024-26-index?url=blogs.oracle.com/*&fl=url&output=text&limit=100>
>
> Now using Regex to filter URL
> <https://index.commoncrawl.org/CC-MAIN-2024-26-index?url=oracle.com&fl=url&output=text&limit=100&filter=~url:https?://blogs.*&matchType=domain> (no captures)