Task 1 false positives

Damian Romero

Apr 30, 2022, 10:52:50 PM

to Dynamic Adversarial Data Collection (DADC) Workshop

Several times, the system has produced a false positive: it marked the model's answer as correct even though we are sure that answer does not correctly answer our question. We copy two examples below, and we would like to know whether there is anything we can do when this happens during the task.

Case 1: Incomplete answer

Relevant excerpt:

While doing press for Working Title's "Les Misérables" film adaptation, producer Eric Fellner stated that fellow producer Tim Bevan was working with writer Straughan and director Alfredson on developing a sequel to "Tinker Tailor Soldier Spy".

Question: What was Eric Fellner working on?*

Model marked this as the correct answer: Tinker Tailor Soldier Spy

The answer should have been: a sequel to "Tinker Tailor Soldier Spy"

Case 2: The model selects a string in the text that we did not select, and the system marks it as correct.

Relevant excerpt:

The morning walkway is on the eastern side, and it's open for pedestrians and bicycles in the morning to mid-afternoon during weekdays (5:00 am to 3:30 pm), and to pedestrians only for the remaining daylight hours (until 6:00 pm, or 9:00 pm during DST). The eastern walkway is reserved to pedestrians on weekends (5:00 am to 6:00 pm, or 9:00 pm during DST).

Question: At what times is the eastern walkway open to pedestrians only?*

Model marked this as the correct answer: 5:00 am to 3:30 pm

The answer should have been: 5:00 am to 6:00 pm, or 9:00 pm during DST

*Full disclosure: we are only paraphrasing our questions here to convey our point; however, these cases actually happened.

Damian Romero

May 5, 2022, 6:41:24 PM

to Dynamic Adversarial Data Collection (DADC) Workshop

Just pinging the organizing team to see if there are any updates on this matter, because we ran into two more false positives today while working on the task.

Sorry for insisting!

Max Bartolo

May 5, 2022, 7:20:00 PM

to Dynamic Adversarial Data Collection (DADC) Workshop

Hi Damian,

Great spot! I agree that those definitely seem like valid model-fooling examples. Whether the system considers the model fooled is currently based on an F1 score threshold between your annotated answer and the model answer.
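For illustration, here is a minimal sketch of how an F1-threshold check like this could behave on the two cases reported above. It assumes a SQuAD-style token-overlap F1 (lowercasing, stripping punctuation and articles) and a hypothetical threshold of 0.4; neither the exact metric nor the threshold value is stated in this thread, and the names `normalize`, `token_f1`, `model_fooled`, and `THRESHOLD` are made up for the example.

```python
import re
import string
from collections import Counter

# Hypothetical value: the real threshold used by the task is not given in this thread.
THRESHOLD = 0.4


def normalize(text):
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(annotated, predicted):
    """Token-overlap F1 between the annotator's answer and the model's answer."""
    gold, pred = normalize(annotated), normalize(predicted)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(gold) & Counter(pred)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def model_fooled(annotated, predicted, threshold=THRESHOLD):
    """Assumed rule: the model counts as fooled only if F1 falls below the threshold."""
    return token_f1(annotated, predicted) < threshold


case_1 = ('a sequel to "Tinker Tailor Soldier Spy"', "Tinker Tailor Soldier Spy")
case_2 = ("5:00 am to 6:00 pm, or 9:00 pm during DST", "5:00 am to 3:30 pm")

for annotated, predicted in (case_1, case_2):
    print(round(token_f1(annotated, predicted), 2), model_fooled(annotated, predicted))
# 0.8 False
# 0.53 False
```

Under these assumptions, both reported answers share enough tokens with the annotated answer to clear the threshold, so neither would be counted as model-fooling even though a human reader would judge them wrong. A stricter (higher) threshold would catch more such partial overlaps.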

If you encounter more of these examples, could you please prepend the text MODELFOOLED to either of the explanation boxes and provide the explanations as you'd normally do for model-fooling examples?

I'll update the task instructions tomorrow and look into changing the threshold to reduce the occurrence rates.

Thanks!

Max

Damian Romero

May 6, 2022, 10:34:20 AM

to Dynamic Adversarial Data Collection (DADC) Workshop

Thank you, Max!

Absolutely.

Damian
