Newbie Question

David Best

Jan 14, 2019, 9:49:54 AM
to Common Crawl
I want to run a search that extracts all site URLs that have "refer a friend" in the title or URL.

I see examples that seem to do this, but they all require a base URL to search within.

Can someone give me an example of how this would be done?

I have tried this url:


But I get this error:

Common Crawl Index Server Error

A url= param must be specified to query the cdx server

Any help would be greatly appreciated.

Thanks in advance,

David

David Best

Jan 14, 2019, 10:39:15 AM
to Common Crawl
As a follow-up to my question, I found what I think may work.


However the admin posted:

However, please do not do this for .com which makes about 50% of
the entire captures. It's much faster to download the entire
URL index (or the 50% of files which hold the .com TLD)
and process it offline.

I take it from this warning that running the query for .com would not be good?

If that is the case, is there a safe way to run the above query without harming the server?

Also, if downloading the URL index is possible, how would I do it and how large is it likely to be?

Thanks again in advance,

David

I also assume tha

Sebastian Nagel

Jan 14, 2019, 10:41:50 AM
to common...@googlegroups.com
Hi David,

The query param "url" is required. In combination with matchType=domain it must contain a domain
name. The regex to match the URL could also be extended to allow up to 4 non-letter characters
(e.g., "-", "+", "%20") between the words "refer a friend". If the word "a" is also
optional, you get:

https://index.commoncrawl.org/CC-MAIN-2018-43-index?url=qhotels.co.uk&matchType=domain&filter=~url:.*refer[^a-zA-Z]{1,4}(?:a[^a-zA-Z]{1,4})?friend&output=json
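
Assembling that query by hand is error-prone because the regex contains characters that need percent-encoding. The following is a minimal sketch of building the same URL in Python. The helper name `build_cdx_query` is hypothetical; the endpoint, crawl ID, and parameters are copied from the URL above. The sketch only constructs the URL and does not send a request.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Regex from the message above: 1 to 4 non-letter characters between the
# words, with the word "a" optional.
URL_REGEX = r".*refer[^a-zA-Z]{1,4}(?:a[^a-zA-Z]{1,4})?friend"


def build_cdx_query(domain, crawl="CC-MAIN-2018-43"):
    """Return an index-server query URL for one domain (hypothetical helper)."""
    params = {
        "url": domain,                  # required by the index server
        "matchType": "domain",          # as in the query above
        "filter": "~url:" + URL_REGEX,  # filter string as in the query above
        "output": "json",
    }
    # urlencode percent-encodes the regex metacharacters for us.
    return f"https://index.commoncrawl.org/{crawl}-index?{urlencode(params)}"


query = build_cdx_query("qhotels.co.uk")
print(query)

# Round trip: decoding the query string gives back the original filter.
decoded = parse_qs(urlsplit(query).query)
print(decoded["filter"][0])
```

The encoded URL looks different from the one above, but it decodes to the same parameters.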

Of course, having to specify a domain name isn't really practical for your use case.
The columnar index is much more efficient for this kind of query.

Here is one example (limited to 100 URLs and restricted to the .uk top-level domain)
which Amazon Athena answers within seconds (see [1] for instructions):

SELECT url
FROM "ccindex"."ccindex"
WHERE (crawl = 'CC-MAIN-2018-43')
AND subset = 'warc'
AND regexp_like(url_path, 'refer[^a-zA-Z]{1,4}(?:a[^a-zA-Z]{1,4})?friend')
AND url_host_tld = 'uk'
LIMIT 100


Most URLs use "refer-a-friend" but there are some exceptions:

1 https://www.curtisrecruitment.co.uk/refer-a-friend/
2 https://pinegreen.co.uk/refer-a-friend/
3 https://www.fuelgenie.co.uk/refer-a-friend/login/
4 https://www.fuelgenie.co.uk/refer-a-friend/terms-and-conditions/
5 http://moriati.co.uk/graduates/refer-a-friend/
6 http://www.family-care.co.uk/fostering/refer-friend/
7 https://www.dobell.co.uk/refer-a-friend/
8 http://www.familylore.co.uk/2011/10/refer-friend.html
9 https://www.signaldrivingschool.co.uk/refer-a-friend
10 http://www.signaturesounds.co.uk/refer-a-friend.html
...
49 https://www.vitaminplanet.co.uk/refer-friend.aspx
53 http://enjoywellnesscentres.co.uk/refer_a_friend.html
56 http://www.999talk.co.uk/jd/pages/refer_a_friend.php
66 http://www.matfordbusinesscentre.co.uk/special-offers/refer-friend/
71 https://www.thisismoney.co.uk/money/saving/article-5505665/Natwest-pilots-refer-friend-deal-paying-500.html
88 https://www.promptexecutivehire.co.uk/refer_friend.php
...

Of course, you may further improve the regular expression to catch more variants.
But if you remove the limit and the top-level domain restriction you'll probably get enough URLs to start with.
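
Before running the query at scale, it can help to check which variants the pattern catches. The sketch below applies the same regular expression with Python's `re` module to a few URLs from the list above. The two `example.co.uk` URLs are hypothetical and were added only to show variants the pattern misses.

```python
import re

# Same pattern as in the Athena query above.
PATTERN = re.compile(r"refer[^a-zA-Z]{1,4}(?:a[^a-zA-Z]{1,4})?friend")

samples = [
    "https://pinegreen.co.uk/refer-a-friend/",
    "http://www.family-care.co.uk/fostering/refer-friend/",
    "http://enjoywellnesscentres.co.uk/refer_a_friend.html",
    "https://www.promptexecutivehire.co.uk/refer_friend.php",
    "https://example.co.uk/referafriend/",    # hypothetical: no separator
    "https://example.co.uk/Refer-A-Friend/",  # hypothetical: upper case
]

# Map each URL to whether the pattern is found anywhere in it.
matches = {url: bool(PATTERN.search(url)) for url in samples}

for url, matched in matches.items():
    print("match   " if matched else "no match", url)
```

In Python's `re`, the pattern requires at least one non-letter separator and is case-sensitive, so the last two URLs are not matched. Those are two places where the expression could be extended.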


Best,
Sebastian

[1] http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/

Sebastian Nagel

Jan 14, 2019, 10:51:22 AM
to common...@googlegroups.com
Hi David,

> I take it from this warning that running the .com would not be good?

Yes, it will take a long time (hours or even days) and will cause a high load on the URL index server.

There are much faster ways to search URLs in the index:
- see my previous answer, posted a few minutes ago
- download the URL index files (300 files, about 300 GB per month) and then
  "grep" them offline (Linux or Cygwin required):
  zgrep -E 'refer[^a-zA-Z]{1,4}(a[^a-zA-Z]{1,4})?friend' cdx-*.gz
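
Where zgrep is not available, the same offline scan can be sketched in Python. This assumes the downloaded index files sit in the current directory and match the `cdx-*.gz` glob used above. The function name `grep_gz` is hypothetical. Like zgrep, it tests each whole line, not only the URL field.

```python
import glob
import gzip
import re

# Same extended regex as in the zgrep command above.
PATTERN = re.compile(r"refer[^a-zA-Z]{1,4}(a[^a-zA-Z]{1,4})?friend")


def grep_gz(paths, pattern=PATTERN):
    """Yield lines from gzip-compressed text files that match `pattern`."""
    for path in paths:
        # Stream line by line so a large file is never loaded into memory.
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
            for line in handle:
                if pattern.search(line):
                    yield line.rstrip("\n")


if __name__ == "__main__":
    # Assumption: index files were downloaded into the current directory.
    for hit in grep_gz(sorted(glob.glob("cdx-*.gz"))):
        print(hit)
```

Because the function is a generator, matches are printed as they are found, so partial results are available before all files have been read.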

Best,
Sebastian

David Best

Jan 14, 2019, 11:36:20 AM
to Common Crawl
Thanks for all of your help.

Now to see if I can get this to work.

David