Hi Juan Francisco,
I fear this question will stay mostly unanswered. At least, there is no
easy answer.
- first, there can be no general answer because different language
models use different content for model training. You need to read
the research papers to know whether CC is used, which subsets
are selected and how the data is filtered.
But about Common Crawl itself...
- there are research papers performing automatic topic classification
on parts of Common Crawl. But this would not differentiate between
journalistic and non-journalistic content. Journalists write about
any topic: politics, sports, health, IT security. But anybody else
can write about the same topics, even in a journalistic style.
However, what's central to journalism - careful research, etc. -
is not easy to detect automatically, and maybe not even manually.
- one approach to approximate the share of journalistic
content would be to take a list of known news sites,
look up their domain names and count the number of archived pages.
[1] used this approach to extract a news corpus for fake news
detection.
- there is a dedicated news data set [2] - however, at approx.
200 million pages per year it's small compared to the main crawls
- if you intersect the domain names of the news data set with
those of the main crawls, a rough estimate would be that the
news domains contribute 2-3% of the main crawls. Of course,
at the page level the intersection is much smaller because
- the news crawler follows news feeds and sitemaps and that
way is focused on recent news articles
- the main crawls sample URLs and may include more
non-journalistic content from news sites (real estate ads,
etc.)
- on the other hand, the news crawl certainly does not cover
all news sites
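
To make the domain-counting idea above more concrete, here is a minimal sketch in Python. It assumes you already have a plain list of archived page URLs (for example, extracted from a crawl's URL index) and your own list of news domains. The function names, the example domains and the rule that subdomains count towards the listed domain are my assumptions for illustration, not part of any Common Crawl tool.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lower-cased host name of a URL, without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def news_share(urls, news_domains):
    """Count pages per news domain and the fraction of all pages they cover.

    Assumption: a page belongs to a news domain if its host equals the
    domain or is a subdomain of it.
    """
    news_domains = {d.lower() for d in news_domains}
    per_host = Counter(host_of(u) for u in urls)
    news_counts = {}
    for host, n in per_host.items():
        for d in news_domains:
            if host == d or host.endswith("." + d):
                news_counts[d] = news_counts.get(d, 0) + n
                break
    total = sum(per_host.values())
    share = sum(news_counts.values()) / total if total else 0.0
    return news_counts, share


# Made-up URLs and domains, only to show the call.
urls = [
    "https://www.example-news.test/politics/a",
    "https://sport.example-news.test/b",
    "https://example-blog.test/post",
    "https://example-shop.test/item",
]
counts, share = news_share(urls, {"example-news.test"})
```

Note that this only measures how many pages come from news sites. It cannot tell a news article from a real estate ad on the same site, so the result is an upper bound on journalistic content for the listed domains.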
> Is there somewhere to check the typology and perhaps evolution of the
> types of web that feed CC?
You may just look at the data itself or some metrics over the last
10 years provided in [3,4,5].
Best,
Sebastian
[1] https://rowanzellers.com/grover/
[2] https://commoncrawl.org/2016/10/news-dataset-available/
[3] https://commoncrawl.github.io/cc-crawl-statistics/plots/domains
[4] https://commoncrawl.github.io/cc-crawl-statistics/
[5] https://docs.google.com/file/d/1_9698uglerxB9nAglvaHkEgU-iZNm1TvVGuCW7245-WGvZq47teNpb_uL5N9/edit