
Use Common Crawl to categorize URLs into shop/social media page/news page/porn site


Sandra Borst

Dec 2, 2024, 4:40:34 AM
to Common Crawl
Hey guys,
I am completely new to the Common Crawl dataset (and also to data science/data mining).

For an application I plan on developing, I need at least a list of the URLs of the most common distracting websites.

To get that, I thought about going through the Common Crawl dataset and categorizing each page as shop/social media page/news page/porn site.

For that, I would first need an LLM that is trained to detect whether a URL + text is a shop/social media page/news page/porn site. Then I would need to apply it to the Common Crawl data in order to get all of the sites categorized.

Since the Common Crawl data is way too much data, I thought about using only the most popular pages (with the help of a project already done here: GitHub - trendsci/linkrun: LinkRun - Data Engineering project done in 3 weeks during the Insight fellowship) and then proceeding in a way similar to this project:

However, I wonder whether someone has already done something closer to what I am trying to achieve and could share their work with me?

Or does someone already know of a categorized dataset that I could use?

This would be very helpful for me! Also, any hints, tips, and so on are much appreciated!

Kind regards
Sandra

Jen English

Dec 12, 2024, 2:55:26 PM
to Common Crawl
Hi Sandra,

One possibility noted by our team is this community-driven project, which categorizes pages into many of the categories you mention: http://dsi.ut-capitole.fr/blacklists/index_en.php
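
As a rough illustration of how per-category domain lists like these could be used, here is a minimal Python sketch that looks up the category of a URL by its host name, falling back to parent domains. The category names, the sample domains, and the helper names `build_index` and `categorize` are all hypothetical; the sketch assumes you have already read each category's domain list into memory, and it does not reflect the project's actual file layout.

```python
from urllib.parse import urlsplit


def build_index(category_to_domains):
    """Map each listed domain to its category (first category wins on duplicates)."""
    index = {}
    for category, domains in category_to_domains.items():
        for domain in domains:
            name = domain.strip().lower().rstrip(".")
            if name:  # skip blank lines
                index.setdefault(name, category)
    return index


def categorize(url, index):
    """Return the category of the URL's host or nearest listed parent domain, else None."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    labels = host.split(".")
    # Try the full host first, then progressively shorter parent domains.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in index:
            return index[candidate]
    return None


# Hypothetical sample data; real lists would be loaded from the downloaded files.
SAMPLE_LISTS = {
    "shopping": ["shop.example"],
    "social_networks": ["social.example"],
    "press": ["news.example"],
}
INDEX = build_index(SAMPLE_LISTS)

if __name__ == "__main__":
    for url in ("https://www.shop.example/cart", "https://unknown.example/"):
        print(url, "->", categorize(url, INDEX))
```

Note that URLs without a scheme (for example `shop.example/cart`) have no parsed host here and come back as `None`, so they would need to be normalized first. A lookup like this only covers domains that are already listed; anything else would still need a classifier of the kind Sandra describes.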

You could also look through our datasets of academic paper citations for potentially related projects:
https://huggingface.co/datasets/commoncrawl/citations
https://huggingface.co/datasets/commoncrawl/citations-annotated

Finally, you are welcome to join our Discord server and ask over there if anyone has worked on a relevant project: https://discord.com/invite/njaVFh7avF

Best,
Jen