Hi Davood,
This is not a trivial task:
- you could get a list of URLs or site names by selecting
records in the .de top-level domain or with German ("deu")
content language using the Parquet index [1]
- alternatively, select records with "E-Bike" or similar terms
in the URL. A site that is seriously about e-bikes likely
puts the terms ("ebike", "e-bike", "pedelec", etc.) into the URL,
following SEO recommendations.
- captured HTML pages can then be fetched from the archives using the
WARC filename, record offset and length provided by the index.
Searching the content of the web pages directly would require significantly
more resources.
Best,
Sebastian
[1]
https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/