Hi all, Common Crawl users may be interested in NVIDIA NeMo Curator. This GPU-accelerated data-curation library includes data download, document deduplication, language identification, filtering, and other features often requested by Common Crawl users. It is helpful for preparing large-scale, high-quality datasets for pretraining and customization. Learn more here: https://github.com/NVIDIA/NeMo-Curator
--
Jen English
Program Manager, Common Crawl