Hello!
NeuCLIR 2022 has released its document collection. Topics will be released in the second half of June. Below we have compiled information about the document collection. This information is also accessible here.
The NeuCLIR1 document collection is available for download by those registered for TREC 2022 at https://trec.nist.gov/act_part/tracks2022.html. The collection consists of documents in three languages, Chinese, Persian, and Russian, drawn from the Common Crawl news collection. They were obtained by Common Crawl between August 1, 2016 and July 31, 2021; most of the documents were published within this five-year window. Text was extracted from each source webpage using the Python utility newspaper. The collection is distributed as JSONL: one JSON object per line, each representing a single document. Each document JSON structure consists of the following fields:
To ascertain the language of each document, its title and text were independently run through two automatic language identification tools, cld3 and VaLID. Documents where the tools agreed on the language, or where one of the tools agreed with the language recorded in the webpage metadata, were included in the collection; all others were removed. All documents longer than 24,000 characters (approximately 10 pages of text) were also removed, as were Chinese documents containing 75 or fewer characters, Persian documents containing 100 or fewer characters, and Russian documents containing 200 or fewer characters.
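For anyone who wants to sanity-check their copy of the collection, the selection rules above can be sketched roughly as follows. This is only an illustration, not the script used to build the collection. The field name "text", the language codes "zh", "fa", and "ru", the helper names, the reading of "agreed on the language" as the two tools agreeing with each other, and the choice to measure length on the text field alone are all assumptions.

```python
import json

# Thresholds taken from the description above.
MAX_CHARS = 24_000                                     # longer documents were removed
MIN_CHARS_EXCLUSIVE = {"zh": 75, "fa": 100, "ru": 200}  # this many or fewer: removed


def resolve_language(cld3_lang, valid_lang, metadata_lang=None):
    """Return the accepted language code, or None if the document is rejected.

    Assumed reading of the rule: accept when both tools agree with each
    other, or when either tool agrees with the webpage metadata.
    """
    if cld3_lang is not None and cld3_lang == valid_lang:
        return cld3_lang
    if metadata_lang is not None and metadata_lang in (cld3_lang, valid_lang):
        return metadata_lang
    return None


def passes_length_filter(text, lang):
    """Apply the per-language minimum and the shared maximum length.

    len() counts Unicode code points, which is assumed here to be what
    "characters" means in the description.
    """
    return MIN_CHARS_EXCLUSIVE[lang] < len(text) <= MAX_CHARS


def read_jsonl(lines):
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Usage on two made-up records (the real files are far larger).
sample = [
    json.dumps({"text": "я" * 250}),
    json.dumps({"text": "я" * 150}),
]
kept = [doc for doc in read_jsonl(sample) if passes_length_filter(doc["text"], "ru")]
```

With these two made-up Russian records, only the 250-character one survives the length filter.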
Each language's collection was limited to 5 million documents. After removing duplicates, the Russian collection was significantly above this limit. Therefore, we used Scikit-learn's implementation of random sampling without replacement to downsample it. Final collection statistics are as follows: