Duplicate Documents in the Collection


Asma Ben Abacha

Aug 12, 2020, 3:30:56 AM
to TREC Health Misinformation Track
Dear organizers,

Many documents in the CC News collection have the same content (copy-paste of the same articles but with different URLs/document IDs). 
 
How are these documents going to be evaluated? Should we keep them all in the list of retrieved documents or do we need to filter them and keep only one? 
 
Thanks, 
Asma

====================================    

Dr. Asma Ben Abacha

Staff Scientist, National Institutes of Health (NIH)

U.S. National Library of Medicine (NLM)


====================================    


Charles Clarke

Aug 12, 2020, 9:19:55 AM
to Asma Ben Abacha, TREC Health Misinformation Track
Please retain them in the ranked list. We will have a broader announcement soon.

--Charlie


Maik Mam10eks

Sep 3, 2020, 5:51:52 AM
to TREC Health Misinformation Track
Dear organizers,

We have a Spark library that we have used to detect (near-)duplicate documents in TREC runs. We used this library in our ECIR 2020 reproducibility paper "The Effect of Content-Equivalent Near-Duplicates on the Evaluation of Search Engines." It can detect (near-)duplicates in this collection as well, and I would be happy to help.
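
For readers unfamiliar with the idea, here is a minimal sketch of near-duplicate detection using word shingles and Jaccard similarity. This is not the Spark library mentioned above; the function names, the shingle size, the 0.8 threshold, and the sample documents are all assumptions chosen for illustration.

```python
from itertools import combinations


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) of a text."""
    words = text.lower().split()
    if len(words) < n:
        # Texts shorter than n words become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(docs, threshold=0.8, n=3):
    """Return (id_a, id_b) pairs whose shingle similarity is >= threshold.

    docs maps a document ID to its text. The threshold is an assumed value,
    not one taken from the thread or the paper.
    """
    sets = {doc_id: shingles(text, n) for doc_id, text in docs.items()}
    pairs = []
    for a, b in combinations(sorted(sets), 2):
        if jaccard(sets[a], sets[b]) >= threshold:
            pairs.append((a, b))
    return pairs


# Hypothetical documents: doc-1 and doc-2 share the same text under different IDs.
docs = {
    "doc-1": "vitamin c does not cure the common cold according to studies",
    "doc-2": "vitamin c does not cure the common cold according to studies",
    "doc-3": "regular exercise can lower blood pressure in most adults",
}
print(near_duplicate_pairs(docs))
```

The all-pairs comparison is quadratic, so it only suits small sets such as the pooled documents of one topic; a corpus-scale run would need something like hashing-based candidate generation instead.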

Best regards,

Maik

Charles Clarke

Sep 3, 2020, 10:35:43 AM
to Maik Mam10eks, Mark D. Smucker, TREC Health Misinformation Track
Mark, do you want to try taking advantage of this before we finalize pools?

Maik, do you actually have a duplicate list immediately available for this corpus? If not, we probably can't wait for the track itself, since assessments start soon, but I would be interested anyway.

--Charlie

Maik Fröbe

Sep 3, 2020, 11:08:38 AM
to TREC Health Misinformation Track
@Charlie: I do not have the list available for this corpus. I would need the list of documents that will be judged, and then I can calculate near-duplicates within those documents. I think it would then take me three days to produce the list of redundant documents.

Best Regards,

Maik

Charles Clarke

Sep 3, 2020, 11:33:46 AM
to Maik Fröbe, TREC Health Misinformation Track, Ellen Voorhees
Ellen, would it be possible to send the pools to Maik? You may not be able to use the near-duplicates, but I think we might be able to use them when we think about new evaluation measures.

Maik, we may end up judging some duplicates, so you might be able to see how well your method works for this collection :-)

--Charlie



Maik Fröbe

Sep 3, 2020, 11:54:24 AM
to TREC Health Misinformation Track
That would be really nice!
I do not need to know which team submitted which document, only the set of pooled documents per topic.

Best Regards,

Maik

Yassine Mrabet

Sep 3, 2020, 4:27:23 PM
to Maik Fröbe, TREC Health Misinformation Track

I just read your very insightful ECIR paper, thanks for sharing!

The only thing is that, for the Adhoc track specifically, source credibility was mentioned as an evaluation criterion; so even if removing duplicates is only meant to make the assessors' job easier, I'm not sure how it will fit with that objective.



--Yassine


Charles Clarke

Sep 3, 2020, 4:33:27 PM
to Yassine Mrabet, Maik Fröbe, TREC Health Misinformation Track
Interesting point.

I think an ideal system should return the near duplicate from the most credible source. Not all near duplicates are equal. This definitely warrants some more consideration.
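
To make the idea concrete, here is a rough sketch of collapsing each near-duplicate group in a ranked list to its most credible member. Everything in it is an assumption for illustration: the function name, the per-document credibility scores, and the rule that the chosen document takes the rank of the group's first occurrence. It is not how the track evaluates runs.

```python
def most_credible_representatives(ranking, groups, credibility):
    """Collapse each near-duplicate group to its most credible member.

    ranking: document IDs in rank order.
    groups: list of sets of document IDs that are near-duplicates of each other.
    credibility: document ID -> score (higher is more credible);
                 documents without a score count as 0.0.
    """
    group_of = {doc: i for i, group in enumerate(groups) for doc in group}
    emitted = set()
    result = []
    for doc in ranking:
        g = group_of.get(doc)
        if g is None:
            # Not part of any near-duplicate group: keep as-is.
            result.append(doc)
            continue
        if g in emitted:
            # A representative of this group was already placed.
            continue
        emitted.add(g)
        # Only consider group members that actually appear in the ranking.
        candidates = [d for d in ranking if d in groups[g]]
        # max() keeps the first maximum, so ties go to the higher-ranked document.
        best = max(candidates, key=lambda d: credibility.get(d, 0.0))
        result.append(best)
    return result


# Hypothetical ranking in which "a" and "c" are near-duplicates.
ranking = ["a", "b", "c", "d"]
groups = [{"a", "c"}]
credibility = {"a": 0.2, "c": 0.9}
print(most_credible_representatives(ranking, groups, credibility))
```

In this example "c" replaces "a" at the top of the list because it has the higher assumed credibility score, and the later copy is dropped.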

--Charlie

Maik Fröbe

Sep 4, 2020, 2:49:05 AM
to TREC Health Misinformation Track
I agree that near-duplicates might not be useful for reducing the number of judgments, but I think they are interesting for evaluations. In particular, the number of judgments may even increase, since you have to identify a proper near-duplicate threshold for your use case (i.e., label some document pairs at different similarity thresholds).
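
As a sketch of what labeling pairs at different similarity thresholds could look like in practice, the snippet below buckets scored document pairs into similarity bands and draws a fixed-size sample from each band for manual labeling. The band edges, sample size, function name, and example pairs are assumptions for illustration, not values from the paper or the track.

```python
import random


def sample_pairs_per_band(scored_pairs, edges, per_band, seed=0):
    """Sample document pairs for labeling from each similarity band.

    scored_pairs: list of (id_a, id_b, similarity) tuples.
    edges: ascending band boundaries, e.g. [0.6, 0.8, 1.0] gives the bands
           [0.6, 0.8) and [0.8, 1.0]; the last band includes its upper edge.
    per_band: maximum number of pairs to sample from each band.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    bands = {(lo, hi): [] for lo, hi in zip(edges, edges[1:])}
    for a, b, sim in scored_pairs:
        for lo, hi in bands:
            is_last = hi == edges[-1]
            if lo <= sim < hi or (is_last and sim == hi):
                bands[(lo, hi)].append((a, b, sim))
                break
    return {
        band: rng.sample(pairs, min(per_band, len(pairs)))
        for band, pairs in bands.items()
    }


# Hypothetical scored pairs; the 0.3 pair falls below every band and is ignored.
scored = [
    ("a", "b", 0.65),
    ("a", "c", 0.70),
    ("b", "c", 0.95),
    ("c", "d", 1.00),
    ("d", "e", 0.30),
]
sample = sample_pairs_per_band(scored, edges=[0.6, 0.8, 1.0], per_band=1)
for band, pairs in sample.items():
    print(band, len(pairs))
```

Assessors would then label the sampled pairs as duplicate or not, and the lowest band in which pairs are still consistently judged duplicates suggests a threshold.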

Best regards,

Maik