
We continue our mission to preserve and provide open access to the public web at global scale.
We believe the dataset should be representative of the open web: its languages, communities, and cultures.
Our work increasingly focuses on improving the representation of languages that are often overlooked in large-scale web datasets. During Q2 we continued our efforts in language identification, multilingual data quality, and support for underrepresented languages.
2. Openness and Accessibility
We believe the systems built on public web data should be transparent, accessible, accountable, and fair.
As AI systems become more dependent on web-scale data, we continue to advocate for mechanisms that give publishers, creators, and communities greater visibility into how their content is used. During Q2 we remained active participants in discussions around AIPREF within the IETF and with collaborators, contributing to the development of open standards for expressing content preferences and improving transparency across the AI ecosystem.
Last year we launched the Opt-Out Registry, which we continue to refine and develop with helpful feedback from the community.
We believe scientific research should be reproducible, accessible, and built on shared public resources.
For nearly two decades, researchers have used Common Crawl data to study language, society, information systems, and the web itself. We continue to invest in tools, documentation, and resources that help researchers work effectively with our datasets, while the growing number of scholarly citations demonstrates the increasing impact of open web data on scientific discovery.