Hi Kim,
I second Ivan's point that the task is not trivial. But it's doable! One approach could be:
First, get pages/sites of the desired domain from DMOZ or Common Search:
http://www.dmoz.org/search?q=architecture
https://uidemo.commonsearch.org/?g=en&q=architecture
Optionally, do this also for other domains and/or add close but undesired domains (e.g., "home depot").
Second, get the locations in the Common Crawl archives via the CC index:
http://index.commoncrawl.org/CC-MAIN-2016-26-index?url=archiseek.com&matchType=domain&output=json
Third, "fetch" the pages of your domain from the Common Crawl archives on AWS S3
and train a classifier on the page content.
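For the second step, each line of the index response is a JSON record whose "filename", "offset" and "length" fields tell you where the capture sits in the WARC archives. A minimal sketch of building such a query (crawl id and domain are just the example values from above):

```shell
# Build the Common Crawl index query URL for one crawl and one domain.
idx=CC-MAIN-2016-26
domain=archiseek.com
query="http://index.commoncrawl.org/${idx}-index?url=${domain}&matchType=domain&output=json"
echo "$query"
# Fetch it with: curl -s "$query"
# Each result line is a JSON record with "filename", "offset", "length", etc.
```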
Now you could run the classifier over the whole Common Crawl data to get more
"architecture" pages, or even try to find links to architecture pages not included in
the Common Crawl data.
The Common Crawl index is also available on AWS S3 as gzipped files. In case you only want to
grep the URLs for keywords, that's the way to go:
idx=CC-MAIN-2016-26
# the index of one crawl is split into 300 gzipped shards: cdx-00000.gz ... cdx-00299.gz
for i in $(seq 0 299); do
  # stream each shard to stdout, decompress, and grep the CDX lines
  aws s3 cp --no-sign-request "s3://commoncrawl/cc-index/collections/$idx/indexes/cdx-$(printf %05d $i).gz" - \
    | gzip -dc \
    | grep -i architecture
done
For how to get the page content using the WARC file location and offset, see one of the previous threads in this group.
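In short, it's an HTTP range request against the public S3 bucket. A sketch with placeholder values (warcfile, offset and length below are made up; the real ones come from the "filename", "offset" and "length" fields of an index record):

```shell
# Placeholder values standing in for one index record (hypothetical):
warcfile='crawl-data/CC-MAIN-2016-26/segments/.../foo.warc.gz'
offset=1234567
length=8901

# HTTP byte ranges are inclusive, so the record spans offset .. offset+length-1:
range="$offset-$((offset + length - 1))"
echo "$range"

# Then fetch and decompress just that one WARC record, e.g.:
#   curl -s -r "$range" "https://commoncrawl.s3.amazonaws.com/$warcfile" | gzip -dc
```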
Best,
Sebastian