Hi David,

The query parameter "url" is required. In combination with matchType=domain it must contain a domain
name. Also, the regex to match the URL could be extended to allow up to 4 non-letter characters
(e.g., "-", "+", "%20") between the words "refer a friend". If the word "a" is also
optional, you get:
https://index.commoncrawl.org/CC-MAIN-2018-43-index?url=qhotels.co.uk&matchType=domain&filter=~url:.*refer[^a-zA-Z]{1,4}(?:a[^a-zA-Z]{1,4})?friend&output=json
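If you assemble such a URL in a script, it is safer to let a library percent-encode the regex in the filter parameter. A minimal Python sketch, assuming only the endpoint and parameters from the URL above (the helper name build_cdx_url is made up for illustration):

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Regex from the URL above: 1 to 4 non-letter characters between the
# words, with the word "a" optional.
REFER_REGEX = r"refer[^a-zA-Z]{1,4}(?:a[^a-zA-Z]{1,4})?friend"


def build_cdx_url(domain, crawl="CC-MAIN-2018-43"):
    """Assemble a CDX index query URL (hypothetical helper, not an official API)."""
    params = {
        "url": domain,                      # required; a domain name for matchType=domain
        "matchType": "domain",
        "filter": "~url:.*" + REFER_REGEX,  # "~" marks a regex filter on the url field
        "output": "json",
    }
    # urlencode percent-encodes brackets, braces, etc. in the regex.
    return f"https://index.commoncrawl.org/{crawl}-index?" + urlencode(params)


query_url = build_cdx_url("qhotels.co.uk")
print(query_url)

# Round-trip check: decoding the query string recovers the original filter.
decoded = parse_qs(urlsplit(query_url).query)
print(decoded["filter"][0])
```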
Of course, the required domain name isn't really practical.
The columnar index is much more efficient for this kind of query.
Here is one example (restricted to 100 URLs and to the .uk top-level domain),
which Amazon Athena answers within seconds (see [1] for instructions):
SELECT url
FROM "ccindex"."ccindex"
WHERE (crawl = 'CC-MAIN-2018-43')
AND subset = 'warc'
AND regexp_like(url_path, 'refer[^a-zA-Z]{1,4}(?:a[^a-zA-Z]{1,4})?friend')
AND url_host_tld = 'uk'
LIMIT 100
Most URLs use "refer-a-friend" but there are some exceptions:
 1  https://www.curtisrecruitment.co.uk/refer-a-friend/
 2  https://pinegreen.co.uk/refer-a-friend/
 3  https://www.fuelgenie.co.uk/refer-a-friend/login/
 4  https://www.fuelgenie.co.uk/refer-a-friend/terms-and-conditions/
 5  http://moriati.co.uk/graduates/refer-a-friend/
 6  http://www.family-care.co.uk/fostering/refer-friend/
 7  https://www.dobell.co.uk/refer-a-friend/
 8  http://www.familylore.co.uk/2011/10/refer-friend.html
 9  https://www.signaldrivingschool.co.uk/refer-a-friend
10  http://www.signaturesounds.co.uk/refer-a-friend.html
...
49  https://www.vitaminplanet.co.uk/refer-friend.aspx
53  http://enjoywellnesscentres.co.uk/refer_a_friend.html
56  http://www.999talk.co.uk/jd/pages/refer_a_friend.php
66  http://www.matfordbusinesscentre.co.uk/special-offers/refer-friend/
71  https://www.thisismoney.co.uk/money/saving/article-5505665/Natwest-pilots-refer-friend-deal-paying-500.html
88  https://www.promptexecutivehire.co.uk/refer_friend.php
...
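To see which spellings the pattern accepts, you can try it locally before running a query. A small Python sketch: the first four paths are taken from the results above, the last two are made-up variants for illustration. It assumes that re.search behaves like regexp_like here (both look for the pattern anywhere in the string) and that the pattern only uses syntax common to Python and Athena's regex engine.

```python
import re

# Same pattern as in the Athena query.
PATTERN = re.compile(r"refer[^a-zA-Z]{1,4}(?:a[^a-zA-Z]{1,4})?friend")

paths = [
    "/refer-a-friend/",          # from the results above
    "/fostering/refer-friend/",  # from the results above
    "/refer_a_friend.html",      # from the results above
    "/refer_friend.php",         # from the results above
    "/referafriend/",            # made-up variant: no separator
    "/Refer-A-Friend/",          # made-up variant: capitalized
]

# re.search finds the pattern anywhere in the path.
matches = {path: bool(PATTERN.search(path)) for path in paths}

for path, hit in matches.items():
    print(f"{path:28} {'match' if hit else 'no match'}")
```

The two made-up variants are not matched, because the pattern requires at least one non-letter character between the words and is case-sensitive.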
Of course, you may further improve the regular expression to catch more variants.
But if you remove the limit and the restriction to the .uk top-level domain, you'll probably get enough URLs to start with.
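If you generate the query from a script, the two restrictions can be made optional. A rough Python sketch, assuming the table and column names from the query above; the function build_query is hypothetical and only formats the SQL text, it does not talk to Athena:

```python
def build_query(crawl="CC-MAIN-2018-43", tld=None, limit=None):
    """Return the SQL from the example above; tld and limit are optional.

    Values are pasted into the SQL text as-is, so only pass trusted constants.
    """
    regex = "refer[^a-zA-Z]{1,4}(?:a[^a-zA-Z]{1,4})?friend"
    lines = [
        "SELECT url",
        'FROM "ccindex"."ccindex"',
        f"WHERE crawl = '{crawl}'",
        "  AND subset = 'warc'",
        f"  AND regexp_like(url_path, '{regex}')",
    ]
    if tld is not None:
        # Restrict to one top-level domain, e.g. 'uk'.
        lines.append(f"  AND url_host_tld = '{tld}'")
    if limit is not None:
        lines.append(f"LIMIT {int(limit)}")
    return "\n".join(lines)


# The restricted query from the example, and the unrestricted one.
restricted = build_query(tld="uk", limit=100)
unrestricted = build_query()
print(unrestricted)
```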
Best,
Sebastian
[1] http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/