Hi again,
I have a couple of questions about this year's corpus. The track instructions say that the whole corpus contains ~1B English documents. However, I have just downloaded it and it takes up about 2.3 TB. Is that the expected size?
Also, I wanted to confirm with the track organizers that document IDs will have this format: "en.noclean.c4-train.01234-of-07168.0". I ask because in the example run the docnos have a different format (see below):
1 Q0 en.noclean.c4-train.04124-of-69102 1 14.8928003311 myGroupNameMyMethodName
1 Q0 en.noclean.c4-train.03346-of-52165 2 14.7590999603 myGroupNameMyMethodName
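To make the difference concrete, here is a minimal sketch of the check I have in mind. It assumes that the format quoted above is the right one, that is, a five-digit shard number, a five-digit shard count, and a trailing numeric suffix after the last dot. The regular expression and the helper name `check_run_line` are my own and do not come from the track instructions.

```python
import re

# Assumed docno pattern, inferred only from the example ID
# "en.noclean.c4-train.01234-of-07168.0"; not an official specification.
DOCNO_RE = re.compile(r"^en\.noclean\.c4-train\.\d{5}-of-\d{5}\.\d+$")

def check_run_line(line):
    """Return True if a run line has six fields and a docno matching the assumed pattern."""
    fields = line.split()
    if len(fields) != 6:
        return False
    # Field order as in the example run: topic, Q0, docno, rank, score, run tag.
    return DOCNO_RE.match(fields[2]) is not None
```

With this pattern, the two example run lines above are rejected because their docnos lack the trailing numeric suffix.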
Thanks in advance and good luck to everyone!
Best,
Marcos