TREC 2021 corpus doubts


marcosfp 97

Jul 9, 2021, 11:36:59 AM
to TREC Health Misinformation Track
Hi again,

I have some doubts about this year's corpus. The track's instructions say that the whole corpus contains ~1B English documents. However, I have just downloaded it and it takes up 2.3 TB.

I also wanted to confirm with the track organizers that document IDs will have this format: "en.noclean.c4-train.01234-of-07168.0". I ask because the docnos in the example run have a different format (see below):

1 Q0 en.noclean.c4-train.04124-of-69102 1 14.8928003311 myGroupNameMyMethodName 
1 Q0 en.noclean.c4-train.03346-of-52165 2 14.7590999603 myGroupNameMyMethodName
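
To make the mismatch concrete, here is a minimal sketch that checks the docno column of a run line against the ID format quoted above. The pattern is an assumption derived only from that single example ID (five-digit shard numbers followed by a numeric suffix), and the names DOCNO_RE and check_run_line are hypothetical; this is not an official validator.

```python
import re

# Assumed pattern, inferred from the one example ID quoted in this post:
# "en.noclean.c4-train.01234-of-07168.0". Not confirmed by the organizers.
DOCNO_RE = re.compile(r"en\.noclean\.c4-train\.\d{5}-of-\d{5}\.\d+")


def check_run_line(line):
    """Return (docno, matches_assumed_format) for one six-column run line."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}")
    docno = fields[2]  # columns: topic, Q0, docno, rank, score, run tag
    return docno, DOCNO_RE.fullmatch(docno) is not None


if __name__ == "__main__":
    example = ("1 Q0 en.noclean.c4-train.04124-of-69102 1 "
               "14.8928003311 myGroupNameMyMethodName")
    print(check_run_line(example))
```

Under this assumed pattern, the docnos in the example run fail the check because they lack the trailing numeric suffix.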

Thanks in advance and good luck to everyone!
Best,
Marcos

Mark Smucker

Jul 14, 2021, 10:57:11 PM
to TREC Health Misinformation Track
Hi Marcos,

Thanks for the heads up on the mistake.  The website has been updated.

Mark
