Journalistic web sources in CC

Juan Francisco Jiménez Jacinto

Oct 8, 2022, 11:23:35 AM

to Common Crawl

Dear CC Team,

I'm new to the group; thanks for the welcome.

I am conducting research on the role and presence of journalistic corpora in the datasets used to train artificial intelligence models.

Could you help me estimate roughly what percentage of the content in CC comes from journalistic websites? Is there somewhere to check the typology, and perhaps the evolution, of the types of websites that feed CC?

Thank you very much in advance,

Juan Francisco

Sebastian Nagel

Oct 16, 2022, 3:45:25 PM

to common...@googlegroups.com

Hi Juan Francisco,

I fear this question will stay mostly unanswered. At least, there is no
easy answer.

- first, there can be no general answer because different language
models use different content for model training. You need to read
the research papers to know whether CC is used, which subsets
are selected and how the data is filtered.

But about Common Crawl itself...

- there are research papers performing automatic topic classification
on parts of Common Crawl. But this would not differentiate between
journalistic and non-journalistic content. Journalists write about
any topic: politics, sports, health, IT security. But anybody else
can write about the same topics, even in a journalistic style.
However, what's central to journalism - careful research, etc. -
is not easy to detect (automatically, and maybe also manually).

- one approach to approximate the share of journalistic
content would be to take a list of known news sites,
look up the domain names and count the number of archived pages.
[1] used this approach to extract a news corpus for fake news
detection.

- there is a dedicated news data set [2] - however, it's small
compared to the main crawls (approx. 200 million pages per year)

- if you intersect the domain names of the news data set with
those of the main crawls, a rough estimate would be that the
news domains contribute 2-3% of the main crawls. Of course,
at the page level the intersection is much smaller because
  - the news crawler follows news feeds and sitemaps and that
    way is focused on recent news articles
  - the main crawls sample URLs and may include more
    non-journalistic content from news sites (real estate ads,
    etc.)
  - on the other hand, the news crawl may lack (and certainly
    does lack) some news sites
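
A minimal sketch of the domain-list approach described above, assuming you already have a list of page URLs sampled from a crawl and your own set of known news domains. The function names, the example URLs and the domain `example-news.com` are all hypothetical, and matching a host against its parent domains is only one possible way to "look up the domain names":

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased host name of a URL, or '' if there is none."""
    return (urlsplit(url).hostname or "").lower()


def is_news_host(host, news_domains):
    """True if the host, or any of its parent domains, is in news_domains."""
    parts = host.split(".")
    return any(".".join(parts[i:]) in news_domains for i in range(len(parts)))


def news_share(urls, news_domains):
    """Fraction of URLs whose host belongs to a known news domain."""
    total = 0
    news = 0
    for url in urls:
        host = host_of(url)
        if not host:
            continue  # skip entries without a usable host name
        total += 1
        if is_news_host(host, news_domains):
            news += 1
    return news / total if total else 0.0


# Hypothetical inputs, for illustration only.
sample_urls = [
    "https://www.example-news.com/politics/a",
    "https://blog.example.org/b",
    "https://example-news.com/sports/c",
    "https://shop.example.net/d",
]
known_news_domains = {"example-news.com"}
share = news_share(sample_urls, known_news_domains)
```

This counts every page under a news domain, so it would also count non-journalistic pages (real estate ads, etc.) hosted on news sites.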

> Is there somewhere to check the typology and perhaps evolution of the
> types of web that feed CC?

You may just look at the data itself or some metrics over the last
10 years provided in [3,4,5].

Best,
Sebastian

[1] https://rowanzellers.com/grover/
[2] https://commoncrawl.org/2016/10/news-dataset-available/
[3] https://commoncrawl.github.io/cc-crawl-statistics/plots/domains
[4] https://commoncrawl.github.io/cc-crawl-statistics/
[5] https://docs.google.com/file/d/1_9698uglerxB9nAglvaHkEgU-iZNm1TvVGuCW7245-WGvZq47teNpb_uL5N9/edit

Yonz

Jul 21, 2023, 2:02:44 PM

to Common Crawl

I wanted to get the latest news crawl but I wasn't able to find it, is it still being crawled? https://commoncrawl.org/2016/10/news-dataset-available/
-Yonz

Tom Morris

Jul 21, 2023, 3:42:56 PM

to common...@googlegroups.com

On Fri, Jul 21, 2023 at 2:02 PM Yonz <found...@gmail.com> wrote:

> I wanted to get the latest news crawl but I wasn't able to find it, is it still being crawled? https://commoncrawl.org/2016/10/news-dataset-available/

The last entry in https://data.commoncrawl.org/crawl-data/CC-NEWS/2023/07/warc.paths.gz is:

crawl-data/CC-NEWS/2023/07/CC-NEWS-20230721173757-00004.warc.gz

which seems pretty current to me.
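
To check how current such an entry is, the file name can be parsed. This is a minimal sketch that assumes the 14 digits after `CC-NEWS-` are a `YYYYMMDDhhmmss` timestamp, which is inferred from the file name above rather than from any documentation; the time zone is not stated, so the result is left as a naive datetime. The function name is hypothetical:

```python
import re
from datetime import datetime

# Assumed layout: CC-NEWS-<14-digit timestamp>-<5-digit serial>.warc.gz
_WARC_NAME = re.compile(r"CC-NEWS-(\d{14})-(\d{5})\.warc\.gz$")


def warc_timestamp(path):
    """Extract the (assumed) timestamp from a CC-NEWS WARC path."""
    match = _WARC_NAME.search(path)
    if match is None:
        raise ValueError(f"unexpected CC-NEWS file name: {path!r}")
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")


stamp = warc_timestamp(
    "crawl-data/CC-NEWS/2023/07/CC-NEWS-20230721173757-00004.warc.gz"
)
```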

Tom

Yonz

Jul 21, 2023, 4:59:58 PM

to Common Crawl

Thank you,
https://data.commoncrawl.org/crawl-data/CC-NEWS/2023/index.html only showed till month 5. For my learning, how did you go about finding it? Is there a different website to peek at the files under https://data.commoncrawl.org/crawl-data/ ?

I was only able to confirm after doing `aws s3 ls s3://commoncrawl/crawl-data/CC-NEWS/2023/`

-Yonatan

Tom Morris

Jul 21, 2023, 5:17:41 PM

to common...@googlegroups.com

On Fri, Jul 21, 2023 at 5:00 PM Yonz <found...@gmail.com> wrote:

> Thank you,
> https://data.commoncrawl.org/crawl-data/CC-NEWS/2023/index.html only showed till month 5. For my learning, how did you go about finding it? Is there a different website to peek at the files under https://data.commoncrawl.org/crawl-data/ ?

No secret site. I just followed the pattern and replaced the 5 with a 7. I suspect the index pages might be updated by hand, so they can get stale.
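
The pattern can be sketched in a few lines. The URL layout is taken from the `warc.paths.gz` link quoted earlier in this thread; the function names are hypothetical, and the sketch assumes the listing is a gzip-compressed text file with one path per line:

```python
import gzip

# Base of the per-month listings, as seen in the URL quoted in this thread.
BASE_URL = "https://data.commoncrawl.org/crawl-data/CC-NEWS"


def paths_url(year, month):
    """Build the warc.paths.gz URL for a given year and month."""
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    return f"{BASE_URL}/{year:04d}/{month:02d}/warc.paths.gz"


def last_entry(gz_bytes):
    """Return the last non-empty line of a gzip-compressed path listing."""
    lines = gzip.decompress(gz_bytes).decode("utf-8").splitlines()
    entries = [line.strip() for line in lines if line.strip()]
    if not entries:
        raise ValueError("path listing is empty")
    return entries[-1]


july_url = paths_url(2023, 7)
```

The bytes for `last_entry` could be fetched with, for example, `urllib.request.urlopen(july_url).read()`.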

Tom
