Dataset Questions


Ignatios Chatzigianellis

Jul 13, 2020, 5:09:50 PM
to TREC Health Misinformation Track
Hello all,

I'm new to TREC and this kind of format, and I have some questions about the provided data.
1) Is the data available for January through April our provided training dataset?
2) Is this data COVID-related only, or does it need to be cleaned up?
3) If (1) is actually the case, is the data labeled?

I would be very glad if someone could help me out, since I'm trying to figure out what my training dataset should be.

Thank you very much in advance.

Mark Smucker

Jul 14, 2020, 10:24:01 AM
to TREC Health Misinformation Track
The document collection (see https://trec-health-misinfo.github.io/ ) consists of 5 months of the Common Crawl news crawl ( https://commoncrawl.org/2016/10/news-dataset-available/ ). It is from this collection that you will retrieve documents.

Last year's Decision Track test collection could be useful as a training set, but there is no specific training set for the new document collection.

The topics are focused on COVID, but the document collection is all news.

Mark

Ignatios Chatzigianellis

Jul 14, 2020, 11:19:06 AM
to TREC Health Misinformation Track
Thank you very much for your quick answer Mark.

I did indeed download the first month of data yesterday and noticed, as you mentioned, that it's a broader collection, not necessarily COVID-specific.
So in order to get the COVID-specific data from the crawled source, one should first mine the related documents and then annotate them regarding their validity?
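For reference, one minimal way to pre-filter the news crawl for COVID-related documents is a keyword match over the text; the keyword list below is an illustrative assumption, not something prescribed by the track:

```python
# Minimal keyword pre-filter for COVID-related documents.
# The keyword list is an illustrative assumption, not part of the track guidelines.
COVID_KEYWORDS = {"covid", "coronavirus", "sars-cov-2"}

def is_covid_related(text: str) -> bool:
    """Return True if any COVID keyword appears in the text (case-insensitive)."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in COVID_KEYWORDS)

print(is_covid_related("New coronavirus cases reported in Italy"))  # True
print(is_covid_related("Stock markets rally on jobs report"))       # False
```

A filter like this only narrows the candidate pool; any labels for training still have to be assigned by hand, as discussed below in the thread.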

Thank you.

Maria Maistro

Jul 21, 2020, 3:23:06 AM
to TREC Health Misinformation Track
Dear Ignatios,

For tasks 1 and 2 you need to retrieve documents related to the given topics, which focus on COVID, and identify those which convey misinformation.

Once you have identified those documents and ranked them in a run in TREC format, you will need to submit that run for assessment (deadline September 1).
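For anyone unfamiliar with it, a run in TREC format is a whitespace-separated file with one line per retrieved document: topic ID, the literal string Q0, document ID, rank, score, and a run tag. A minimal sketch (the document IDs and scores are made up):

```python
def format_trec_run(topic_id, ranked_docs, run_tag):
    """Format (doc_id, score) pairs as TREC run lines:
    topic_id Q0 doc_id rank score run_tag
    Documents should already be sorted by descending score.
    """
    lines = []
    for rank, (doc_id, score) in enumerate(ranked_docs, start=1):
        lines.append(f"{topic_id} Q0 {doc_id} {rank} {score:.4f} {run_tag}")
    return "\n".join(lines)

# Example with made-up document IDs and scores:
print(format_trec_run(1, [("doc-001", 12.5), ("doc-002", 9.75)], "myrun"))
# 1 Q0 doc-001 1 12.5000 myrun
# 1 Q0 doc-002 2 9.7500 myrun
```

Tools such as trec_eval consume files in exactly this shape, one block of lines per topic.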

The labelling will be handled by the TREC organisers, who will return the effectiveness scores of your submissions (at some point in October). The labels will then be made publicly available.

If you need to train your system with COVID-specific documents from this collection, you will need to label the documents yourself, since no labels are available right now.

If you have further questions do not hesitate to ask,
Best,
Maria

Simão Gonçalves

Aug 16, 2020, 7:45:31 AM
to TREC Health Misinformation Track
The labelling will be handled by the TREC organisers, who will return the effectiveness scores of your submissions (at some point in October). The labels will then be made publicly available.

I'm also new to TREC; could you explain how the labelling process will be done?

Charles Clarke

Aug 16, 2020, 10:02:00 AM
to Simão Gonçalves, TREC Health Misinformation Track
NIST hires human assessors to label the documents. We should have more details about the labeling guidelines soon.

--Charlie


Mark Smucker

Aug 20, 2020, 9:36:25 PM
to TREC Health Misinformation Track
Hi All,

We've updated the guidelines to provide more information about how NIST will do the relevance assessments: https://trec-health-misinfo.github.io/

In particular, we have posted the guidelines that NIST assessors will use:

Please note that "correctness" is a simple computation based on a topic's answer and the NIST assessor's decision about whether or not a document contains a definitive answer to the topic's question. A document without a definitive answer is neither correct nor incorrect; it simply lacks an answer.
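That computation can be sketched as follows; the function and value names are assumptions for illustration, not the organisers' actual qrels format:

```python
def correctness(topic_answer, assessor_answer):
    """Compare a topic's known answer ("yes"/"no") with the assessor's
    judgement of the document's answer ("yes", "no", or None when the
    document gives no definitive answer). Names are illustrative only."""
    if assessor_answer is None:
        return "no answer"
    return "correct" if assessor_answer == topic_answer else "incorrect"

print(correctness("no", "no"))    # correct
print(correctness("no", "yes"))   # incorrect
print(correctness("no", None))    # no answer
```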

Mark

Charles Clarke

Aug 21, 2020, 9:52:12 AM
to Mark Smucker, TREC Health Misinformation Track
Instructions to the judges are:

Assume there is a search user who has a question of the form "Can X Y COVID-19?", where X is a treatment and Y is one of: cause, prevent, worsen, cure, or help. The user is searching the document collection for answers to this question. Your job is to assess documents on:
1) Does this document contain material that the search user might find useful in answering the question?
2) Does the document answer the question? If so, is the answer yes or no?
3) How credible is the document?

For many of these topics, finding documents that answer the question is much harder than just finding "topically relevant" documents in the traditional sense.  For some of these topics, finding documents in this collection that answer the question incorrectly is even harder. Traditional retrieval methods are unlikely to do well on anything but #1.
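The question template above can be instantiated mechanically; a small sketch (the treatment names are invented for illustration):

```python
# The five actions come from the instructions above; treatments are invented examples.
ACTIONS = ("cause", "prevent", "worsen", "cure", "help")

def make_question(treatment: str, action: str) -> str:
    """Build a topic question of the form "Can X Y COVID-19?"."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    return f"Can {treatment} {action} COVID-19?"

print(make_question("vitamin C", "prevent"))  # Can vitamin C prevent COVID-19?
```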
