Hey guys,
I am completely new to the Common Crawl dataset (and also to data science/data mining).
For an application I plan to develop, I need at least a list of the URLs of the most common distracting websites.
My idea is to go through the Common Crawl dataset and categorize each page as a shop, social media page, news page, or porn site.
To do that, I would probably first need an LLM trained to detect whether a URL plus its page text belongs to a shop, social media page, news page, or porn site. I would then need to run it over the Common Crawl data to get all of the sites categorized.
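To make the idea more concrete, here is a minimal sketch of the pipeline I have in mind. It is only an illustration under several assumptions: the keyword lists, the function names classify_page and categorize_domains, and the sample pages are all made up, and the simple keyword count is just a stand-in for the trained model. Real input would come from Common Crawl records rather than a hard-coded list.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit

# Hypothetical keyword lists, for illustration only.
# A trained classifier (for example an LLM) would replace this lookup.
KEYWORDS = {
    "shop": ("cart", "checkout", "buy now", "shipping"),
    "social_media": ("follow", "friends", "share", "profile"),
    "news": ("breaking", "headline", "reporter", "editorial"),
    "adult": ("xxx", "porn", "adult content"),
}


def classify_page(url, text):
    """Return the category whose keywords occur most often, or 'other'."""
    haystack = (url + " " + text).lower()
    scores = {
        category: sum(haystack.count(word) for word in words)
        for category, words in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"


def categorize_domains(pages):
    """Classify every (url, text) pair and take a majority vote per host."""
    votes = defaultdict(Counter)
    for url, text in pages:
        host = urlsplit(url).hostname or ""
        votes[host][classify_page(url, text)] += 1
    return {host: counts.most_common(1)[0][0] for host, counts in votes.items()}


# Made-up sample pages standing in for records extracted from the crawl.
sample_pages = [
    ("https://shop.example.com/item/1", "Add to cart and checkout with free shipping"),
    ("https://shop.example.com/item/2", "Buy now, shipping in two days"),
    ("https://news.example.org/today", "Breaking headline from our reporter"),
    ("https://blog.example.net/post", "Notes about gardening"),
]

result = categorize_domains(sample_pages)
print(result)
```

Voting per host rather than per page is a choice I made for the sketch, since what I need in the end is a list of sites, not of individual pages.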
However, I wonder whether someone has already done something similar to what I am trying to achieve and could share their work with me?
Or does someone already know of a categorized dataset that I could use?
This would be very helpful for me! Any hints, tips, and so on are also much appreciated!
Kind regards
Sandra