Number of articles in dataset nowadays; exponential growth?


Joseph Kwon

May 17, 2022, 9:55:17 AM
to Common Crawl
Hi all, for a project I scraped English articles from CC-NEWS, a few million articles over half a year or so. For September 2021 I collected 3.48 million articles, which is already much more than the number I'd seen when collecting from earlier in that year. However, I'm now seeing that recent months have over 5 million articles, and the scraping is still running. I'm wondering whether this is normal or whether something went wrong with my scraping in previous months. Is there a sudden spike in articles? Does anyone have an estimate of how many articles there are over some specified time periods?

Would appreciate any input on this. Thanks a bunch!

Sebastian Nagel

May 18, 2022, 2:53:40 AM
to common...@googlegroups.com
Hi Joseph,

> Is there a sudden spike in articles?

There shouldn't be spikes in the number of articles, nor any
exponential growth. During 2021 and 2022 so far, the number of
articles (all languages) in the news data set [1] crawled per month
ranges between 15 and 18 million.

Of course, at the level of individual news sites there may be spikes
if a feed/sitemap was lost and/or (re)discovered. There is also noise
caused by ad links in feeds/sitemaps or by expired news domains
now hosting different types of content.

If in doubt, could you share details on how your numbers were obtained?
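
For comparing counts across months, it can help to tally records by capture month in a way that is easy to reproduce. The following is only a minimal sketch, not a description of how the news crawler or anyone in this thread counts articles. It assumes you have already extracted (URL, capture timestamp) pairs from the records you kept, with timestamps in the form "2021-09-01T12:34:56Z". The function names and the sample data are hypothetical, and counting each URL at most once per month is an assumption made here so that repeated captures do not inflate the totals.

```python
from collections import Counter
from datetime import datetime


def month_of(timestamp: str) -> str:
    """Return 'YYYY-MM' for a timestamp such as '2021-09-01T12:34:56Z'."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.strftime("%Y-%m")


def articles_per_month(records):
    """Count unique URLs per capture month.

    `records` is an iterable of (url, timestamp) pairs. A URL seen more
    than once in the same month is counted only once (an assumption of
    this sketch, not a rule of the data set).
    """
    seen = set()
    counts = Counter()
    for url, timestamp in records:
        key = (month_of(timestamp), url)
        if key in seen:
            continue
        seen.add(key)
        counts[key[0]] += 1
    return dict(counts)


# Hypothetical sample data, for illustration only.
sample = [
    ("https://example.com/a", "2021-09-01T08:00:00Z"),
    ("https://example.com/a", "2021-09-02T09:30:00Z"),  # repeat capture
    ("https://example.com/b", "2021-09-15T10:00:00Z"),
    ("https://example.com/c", "2021-10-03T11:45:00Z"),
]
monthly = articles_per_month(sample)
```

Running the same tally with and without the per-month deduplication, or with and without a language filter, is one way to see which step accounts for a jump between months.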

Best,
Sebastian

[1] https://commoncrawl.org/2016/10/news-dataset-available/