How to get website names for a specific word

Davood Hadiannejad

Jul 4, 2022, 5:17:02 AM
to Common Crawl
Is there a way to get the names of websites that contain a specific word, for example the word "ebik"? I want to write a script that searches all websites in Germany and lists the ones that contain the word "ebik".

Sebastian Nagel

Jul 7, 2022, 11:07:22 AM
to common...@googlegroups.com
Hi Davood,

this is not a trivial task:
- you could get a list of URLs or site names by selecting
records in the .de top-level domain or with German ("deu")
content language using the Parquet index [1]
- alternatively, select records with "E-Bike" or similar terms
in the URL. If a site is seriously about e-bikes, it likely
puts the terms ("ebike", "e-bike", "pedelec", etc.) into the URL,
following SEO recommendations.
- captured HTML pages can be fetched from the archives using the
provided WARC filename, record offset and length.
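
As a rough sketch of the first two points, the helper below builds an SQL query against the columnar index. The table name `ccindex.ccindex`, the crawl label `CC-MAIN-2022-21`, and the choice to combine the .de/"deu" filter with the URL term filter in one query are assumptions made for illustration; `build_index_query` is a hypothetical name, and the resulting string would still have to be run with a SQL engine such as Athena or Spark.

```python
import re


def build_index_query(terms, crawl="CC-MAIN-2022-21", tld="de", language="deu"):
    """Return an SQL string that lists registered domains whose URLs contain any term.

    Column names follow the public columnar index schema; the crawl label
    and table name are example values and may need adjusting.
    """
    if not terms:
        raise ValueError("at least one search term is required")
    for term in terms:
        # Keep the sketch safe: accept only letters, digits and hyphens.
        if not re.fullmatch(r"[A-Za-z0-9-]+", term):
            raise ValueError(f"unsupported term: {term!r}")

    # One LIKE condition per term, matched case-insensitively against the URL.
    url_filter = " OR ".join(f"lower(url) LIKE '%{t.lower()}%'" for t in terms)

    return (
        "SELECT url_host_registered_domain, COUNT(*) AS pages\n"
        "FROM ccindex.ccindex\n"
        f"WHERE crawl = '{crawl}' AND subset = 'warc'\n"
        f"  AND (url_host_tld = '{tld}' OR content_languages LIKE '%{language}%')\n"
        f"  AND ({url_filter})\n"
        "GROUP BY url_host_registered_domain\n"
        "ORDER BY pages DESC"
    )


if __name__ == "__main__":
    print(build_index_query(["ebike", "e-bike", "pedelec"]))
```

Adding `warc_filename`, `warc_record_offset` and `warc_record_length` to the selected columns (and dropping the GROUP BY) would give the values needed to fetch the individual page captures.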

Searching the content of the web pages directly would require
significantly more resources.
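
To give an idea of the cost: every candidate record has to be downloaded and decompressed before its text can be searched. The sketch below fetches a single record using the WARC filename, record offset and length from the index. The base URL `https://data.commoncrawl.org/` and the helper names (`byte_range`, `fetch_record`, `record_mentions`) are assumptions for illustration, and the match is a plain substring test over the whole record, so it may also hit WARC or HTTP headers.

```python
import gzip
import urllib.request

# Assumed public HTTP endpoint for the crawl archives.
BASE_URL = "https://data.commoncrawl.org/"


def byte_range(offset, length):
    """Build the HTTP Range header value covering exactly one WARC record."""
    return f"bytes={offset}-{offset + length - 1}"


def fetch_record(warc_filename, offset, length):
    """Download one gzip-compressed WARC record and return its decompressed bytes."""
    request = urllib.request.Request(
        BASE_URL + warc_filename,
        headers={"Range": byte_range(offset, length)},
    )
    with urllib.request.urlopen(request) as response:
        # Each record is stored as its own gzip member, so the slice
        # can be decompressed without the rest of the file.
        return gzip.decompress(response.read())


def record_mentions(record_bytes, word):
    """Case-insensitive substring check over the whole record."""
    # Pages are not always UTF-8; undecodable bytes are replaced.
    text = record_bytes.decode("utf-8", errors="replace")
    return word.lower() in text.lower()
```

Calling `record_mentions(fetch_record(filename, offset, length), "ebik")` for every row returned by the index means one HTTP request per page, which is where the extra resources go.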

Best,
Sebastian

[1]
https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/