TREC 2021 corpus doubts


marcosfp 97


Jul 9, 2021, 11:36:59 AM

to TREC Health Misinformation Track

Hi again,

I have some doubts about this year's corpus. The track's instructions say that the whole corpus contains ~1B English documents. However, I have just downloaded it and it takes up 2.3 TB.

Also, I wanted to confirm with the track organizers that document ids will have this format: "en.noclean.c4-train.01234-of-07168.0", since in the example run the docnos have a different format (see below):

1 Q0 en.noclean.c4-train.04124-of-69102 1 14.8928003311 myGroupNameMyMethodName

1 Q0 en.noclean.c4-train.03346-of-52165 2 14.7590999603 myGroupNameMyMethodName
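In case it helps anyone checking their own runs, here is a minimal sketch of a docno check. It assumes the ids follow the shape of the one example quoted above (a five-digit shard, "-of-", a five-digit shard count, then a dot and a numeric index), which the organizers have not confirmed, and that run lines have the usual six whitespace-separated fields. The names DOCNO_RE and bad_docnos are made up for illustration.

```python
import re

# ASSUMPTION: pattern inferred from the single example id
# "en.noclean.c4-train.01234-of-07168.0"; not an official specification.
DOCNO_RE = re.compile(r"^en\.noclean\.c4-train\.\d{5}-of-\d{5}\.\d+$")


def bad_docnos(run_lines):
    """Return (line_number, text) pairs for run lines that look malformed.

    A line is reported when it does not have six whitespace-separated
    fields, or when its third field (the docno) does not match DOCNO_RE.
    Blank lines are skipped.
    """
    bad = []
    for n, line in enumerate(run_lines, start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 6:
            bad.append((n, line.strip()))
            continue
        docno = fields[2]
        if not DOCNO_RE.match(docno):
            bad.append((n, docno))
    return bad


# First line is copied from the example run, second uses the id format
# quoted in this message.
example_run = [
    "1 Q0 en.noclean.c4-train.04124-of-69102 1 14.8928003311 myGroupNameMyMethodName",
    "1 Q0 en.noclean.c4-train.01234-of-07168.0 2 14.7590999603 myGroupNameMyMethodName",
]
result = bad_docnos(example_run)
```

With this pattern, the docno from the example run is flagged because it lacks the trailing ".<index>" part, while the id in the format quoted above passes.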

Thanks in advance and good luck to everyone!

Best,

Marcos

Mark Smucker


Jul 14, 2021, 10:57:11 PM

to TREC Health Misinformation Track

Hi Marcos,

Thanks for the heads up on the mistake. The website has been updated.

Mark
