NeuCLIR1 Document Collection


Dawn Lawrie

May 23, 2022, 11:16:26 AM
to neuclir-participants

Hello!

NeuCLIR 2022 has released its document collection. Topics will be released in the second half of June. Below we have compiled information about the document collection. This information is also accessible here.

The NeuCLIR1 document collection is available for download by those registered for TREC 2022 at https://trec.nist.gov/act_part/tracks2022.html. The collection consists of documents in three languages: Chinese, Persian, and Russian, drawn from the Common Crawl news collection. The documents were obtained by Common Crawl between August 1, 2016 and July 31, 2021, and most were published within this five-year window. Text was extracted from each source webpage using the Python utility newspaper. The collection is distributed as JSONL: one JSON object per line, each representing a single document. Each document's JSON structure consists of the following fields:

  • id: document ID assigned by Common Crawl
  • cc_file: raw Common Crawl document
  • time: time of publication, or null
  • title: article headline or title
  • text: article body
  • url: address of the source Webpage
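As a sketch, reading such a JSONL file in Python might look like the following. Only the field names come from the list above; every sample value (document ID, file name, URL, and so on) is made up for illustration:

```python
import json

# A hypothetical document in the documented schema; the values are invented,
# only the field names follow the list above.
sample_line = json.dumps({
    "id": "doc-001",
    "cc_file": "cc-news-example.warc.gz",
    "time": None,                      # publication time may be null
    "title": "Example headline",
    "text": "Example article body.",
    "url": "https://example.com/article",
})

def read_jsonl(lines):
    """Yield one document dict per non-blank JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

docs = list(read_jsonl([sample_line]))
```

In practice, `lines` would be an open file handle over one of the per-language collection files.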

To ascertain the language of each document, its title and text were independently run through two automatic language identification tools, cld3 and VaLID. Documents where the two tools agreed on the language, or where one of the tools agreed with the language recorded in the webpage metadata, were included in the collection; all others were removed. Documents longer than 24,000 characters (approximately 10 pages of text) were also removed, as were Chinese documents of 75 or fewer characters, Persian documents of 100 or fewer characters, and Russian documents of 200 or fewer characters.
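The length filters above can be sketched as a simple predicate. The language codes and function name below are illustrative, and the language-identification step (cld3/VaLID agreement) is not reproduced:

```python
MAX_CHARS = 24_000                               # removed if longer than this
MIN_CHARS = {"zho": 75, "fas": 100, "rus": 200}  # removed if length <= this

def keep_document(text: str, lang: str) -> bool:
    """Return True if a document passes the length filters described above."""
    return MIN_CHARS[lang] < len(text) <= MAX_CHARS
```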


Each collection was limited to 5 million documents. After removing duplicates, the Russian collection was still significantly above this threshold, so we used scikit-learn's implementation of random sampling without replacement to downsample it. Final collection statistics are as follows:
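The downsampling step amounts to drawing a fixed-size random sample of document indices without replacement. The organizers used scikit-learn's implementation; the stdlib sketch below illustrates the same idea on a toy population (all numbers are stand-ins):

```python
import random

population_size = 12   # stand-in for the deduplicated Russian collection size
cap = 5                # stand-in for the 5 million document cap

rng = random.Random(0)  # fixed seed for reproducibility
# random.sample never draws the same index twice, i.e. without replacement
kept = sorted(rng.sample(range(population_size), cap))
```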

[Collection statistics table: included in the original post only as a screenshot, not reproduced here.]
-----------------------------------
¹ Tokens were identified by the spaCy tokenizers.

John M. Conroy

Jun 6, 2022, 9:05:51 AM
to neuclir-participants
Thank you, Dawn, and the organizers for this challenging and fun track!
What is the deadline for run submissions? The only information I see is that they are due in July.

John

Samy Ateia

Jun 21, 2022, 5:01:00 AM
to neuclir-participants
Hi Dawn, thanks for the great preparation.

I just finished indexing the collections for some test runs and stumbled upon a discrepancy in the document count for the Russian collection:
I indexed only 4,627,347 unique IDs from the Russian collection, even though the JSONL file has 4,627,542 lines. The other collections match.
I'm uncertain whether it is an error in my script or whether there are actually duplicate IDs in the collection. I will figure it out.

Best 

Samy

Samy Ateia

Jun 21, 2022, 11:38:52 AM
to neuclir-participants
Never mind, it was an error on my side; the collection is correct. I now have all documents indexed.

Dawn J. Lawrie

Jun 21, 2022, 11:39:53 AM
to Samy Ateia, neuclir-participants
Hi Samy,

Thanks for the confirmation.

Dawn



--
_________________________________________________
Dawn J. Lawrie Ph.D.
Senior Research Scientist
Human Language Technology Center of Excellence
Johns Hopkins University
810 Wyman Park Drive
Baltimore, MD 21211
law...@jhu.edu
https://hltcoe.jhu.edu/faculty/dawn-lawrie/

John M. Conroy

Jul 13, 2022, 9:32:20 AM
to neuclir-participants
Has there been an update to the schedule? What are the submission dates for the main and reranking tasks, or have they passed and I missed them?
John

John M. Conroy

Jul 13, 2022, 9:47:02 AM
to neuclir-participants
Ah, I found it. Google Groups keeps things more organized than I could ask. :>)
Thank you, Dawn, and the organizers for being organized. 
John


Jun 15, 2022, 10:09:16 AM
to neuclir-participants
Hi,

We would like to give you more details about important dates including next week's release of topics. We also include some details that may be helpful for system development.

--- Important Dates ---
ASAP: TREC Registration: https://ir.nist.gov/trecsubmit.open/application.html
Already: Document collection available for download (including translations of the documents into English)
June 22, 2022: Topics released with human translations of English topics into document languages
June 24, 2022: Machine translations and runs for reranking released
July 26, 2022 (AoE): Run submissions due
October 2022: Relevance judgements and individual evaluation scores released
Late October 2022: Initial system description papers due
November 14-18, 2022: TREC 2022 conference (at NIST and/or virtual)

-- Details --
  • For Chinese, documents have been distributed as they were written, in either Simplified or Traditional Chinese. Likewise, some human translations of topics will be in Traditional Chinese, while others are in Simplified Chinese.
    • We have released a script to convert between Traditional and Simplified, which works well except for some named entities.
  • Topics will be released in a JSON format.
  • Runs for reranking are produced by indexing the translated documents and ranking with BM25. These runs will be released in the TREC submission format.
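For reference, a TREC-format run is a plain-text file with one whitespace-separated line per retrieved document: topic ID, the literal `Q0`, document ID, rank, score, and a run tag. A minimal parsing sketch (the sample lines below are made up):

```python
# Two made-up lines in the standard TREC run format:
#   <topic-id> Q0 <doc-id> <rank> <score> <run-tag>
sample_run = """\
101 Q0 doc-123 1 12.7 bm25-baseline
101 Q0 doc-456 2 11.9 bm25-baseline
"""

def parse_run(text):
    """Parse TREC run lines into (topic, doc, rank, score, tag) tuples."""
    rows = []
    for line in text.splitlines():
        topic, _q0, doc, rank, score, tag = line.split()
        rows.append((topic, doc, int(rank), float(score), tag))
    return rows

rows = parse_run(sample_run)
```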
We are looking forward to your participation in the Track!

The NeuCLIR Organizers
