I think it's a pretty typical use case for Common Crawl datasets.
Briefly, you have two options. You can process all ~60,000 WET files from one monthly crawl, which contain the extracted plain text of webpages, and search them for your keywords.
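As a rough illustration of the WET route, here is a minimal sketch using the `warcio` library on a locally downloaded WET file. The file path and keyword list are placeholders, not anything specific:

```python
import sys

from warcio.archiveiterator import ArchiveIterator

KEYWORDS = {"climate", "renewable"}  # hypothetical keywords


def scan_wet(path):
    """Yield (target URI, matched keywords) for each WET text record."""
    with open(path, "rb") as stream:
        # warcio transparently handles the .gz compression
        for record in ArchiveIterator(stream):
            if record.rec_type != "conversion":  # WET records are "conversion"
                continue
            text = record.content_stream().read().decode("utf-8", errors="replace")
            hits = {kw for kw in KEYWORDS if kw in text.lower()}
            if hits:
                yield record.rec_headers.get_header("WARC-Target-URI"), hits


if __name__ == "__main__":
    # e.g. python scan_wet.py CC-MAIN-...-00000.warc.wet.gz
    for uri, hits in scan_wet(sys.argv[1]):
        print(uri, sorted(hits))
```

You would run something like this in parallel across the file listing for a given crawl, since a single machine working sequentially through ~60,000 files would take far too long.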
Alternatively, you can use the Common Crawl index to filter webpages first (by domain rank, URL pattern, etc.) and then process only the WARC records that contain the HTML source of those pages.
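A minimal sketch of the index route, assuming the public CDX index API at index.commoncrawl.org and the `requests` and `warcio` libraries; the crawl ID and URL pattern below are placeholders. Each index record gives you the WARC filename plus a byte offset and length, so you can fetch just that one record with an HTTP Range request instead of the whole file:

```python
import io
import json

import requests
from warcio.archiveiterator import ArchiveIterator

CDX_API = "https://index.commoncrawl.org/CC-MAIN-2024-10-index"  # hypothetical crawl


def lookup(url_pattern):
    """Return index records (JSON lines) matching a URL pattern."""
    resp = requests.get(CDX_API, params={"url": url_pattern, "output": "json"})
    resp.raise_for_status()
    return [json.loads(line) for line in resp.text.splitlines()]


def fetch_html(rec):
    """Fetch one WARC record by byte range and return its HTML payload."""
    start = int(rec["offset"])
    end = start + int(rec["length"]) - 1
    resp = requests.get(
        "https://data.commoncrawl.org/" + rec["filename"],
        headers={"Range": f"bytes={start}-{end}"},
    )
    resp.raise_for_status()
    # The range covers exactly one gzipped WARC record
    record = next(ArchiveIterator(io.BytesIO(resp.content)))
    return record.content_stream().read()


for rec in lookup("example.com/*"):
    print(rec["url"], len(fetch_html(rec)))
```

Note the CDX API itself doesn't expose domain ranks; you'd join your filtered URL list against a ranking source (e.g. the Common Crawl host-level web graph) as a separate step.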
I discuss both of these approaches in chapters 6 and 7 of my book. On the implementation side, I prefer running my own custom code on a cluster of c5.2xlarge AWS EC2 instances, but I'm sure you can find plenty of other options by searching through this forum.