Issue in subtask 2 evaluation script


Iman Saleh

Nov 5, 2019, 4:06:32 AM
to DeftEval 2020
Hello task organizers,

When I ran the semeval2020_06_evaluation_main.py evaluation script, I noticed that the method validate_labels(gold_rows, pred_rows) validates output labels against the labels found in the dev set. The issue is that some labels are not present in the dev set, for instance B-Alias-Term-frag and I-Alias-Term-frag. If my output contains any of these labels, the script throws an error. Please let me know what you think about this issue.
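
For illustration, here is a minimal sketch of the kind of check that seems to be failing; the function body and the row layout are my guesses, not the actual script code:

def validate_labels(gold_rows, pred_rows):
    # Sketch only: build the label inventory from the gold (dev) rows
    # and reject any predicted label that never appears there.
    known_labels = {row[-1] for row in gold_rows}
    for row in pred_rows:
        label = row[-1]
        if label not in known_labels:
            # B-Alias-Term-frag / I-Alias-Term-frag never occur in the dev set,
            # so predictions that use them are rejected here.
            raise ValueError(f"Unknown label in predictions: {label}")

# Example: the schema allows B-Alias-Term-frag, but the dev gold data never uses it
gold_rows = [["word", "B-Term"], ["word", "O"]]
pred_rows = [["word", "B-Alias-Term-frag"], ["word", "O"]]
validate_labels(gold_rows, pred_rows)  # raises ValueError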
Thanks.

Nicholas Miller

Nov 5, 2019, 10:37:34 AM
to DeftEval 2020
Hi Iman,

Thanks for your message. Yes, it looks like the dev set data doesn't cover all the possible labels, which is causing the error. We're working on a fix for that now.

Nick

Jayasimha Talur

Dec 1, 2019, 12:04:07 PM
to DeftEval 2020
Hi

On the Codalab evaluation page, it says that only "Term, Alias-Term, Referential-Term, Definition, Referential-Definition, and Qualifier" are used for computing the F1 score. Do "B-Alias-Term-frag" and "I-Alias-Term-frag" need to be included as well?
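
For reference, my understanding is that restricting the score to those types with scikit-learn would look roughly like this (the expansion into B-/I- tags is my assumption based on the Codalab description, not the actual config file):

from sklearn.metrics import classification_report

# Assumed evaluation label set: B-/I- tags for the six evaluated types
# listed on the Codalab page (not taken from the official config file).
eval_labels = []
for t in ["Term", "Alias-Term", "Referential-Term",
          "Definition", "Referential-Definition", "Qualifier"]:
    eval_labels += [f"B-{t}", f"I-{t}"]

# Toy example: labels= restricts precision/recall/F1 to the listed tags,
# so anything else (O, the *-frag tags, ...) does not enter the averages.
y_gold = ["B-Term", "I-Term", "O", "B-Definition", "I-Definition"]
y_pred = ["B-Term", "O", "O", "B-Definition", "I-Definition"]
report = classification_report(y_gold, y_pred, labels=eval_labels, output_dict=True)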

Thanks and Regards

Mingyu Han

Dec 3, 2019, 6:58:46 AM
to DeftEval 2020
Hi,

I tried to run the evaluation with the given training_phase.zip from reference_files via the Codalab framework, but something went wrong with subtask 2.
Is the given zip file wrong?

Here is the error:

WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
/usr/local/lib/python3.7/dist-packages/sklearn/metrics/classification.py:1437: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
/usr/local/lib/python3.7/dist-packages/sklearn/metrics/classification.py:1439: UndefinedMetricWarning: Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples.
  'recall', 'true', average, warn_for)
/usr/local/lib/python3.7/dist-packages/sklearn/metrics/classification.py:1437: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)
/usr/local/lib/python3.7/dist-packages/sklearn/metrics/classification.py:1439: UndefinedMetricWarning: Recall and F-score are ill-defined and being set to 0.0 due to no true samples.
  'recall', 'true', average, warn_for)
Traceback (most recent call last):
  File "/tmp/codalab/tmpS9y5OM/run/program/semeval2020_06_evaluation_main.py", line 84, in 
    main(cfg)
  File "/tmp/codalab/tmpS9y5OM/run/program/semeval2020_06_evaluation_main.py", line 69, in main
    task_2_report = task_2_eval_main(ref_path, res_path, output_dir, cfg['task_2']['eval_labels'])
  File "/tmp/codalab/tmpS9y5OM/run/program/evaluation_sub2.py", line 193, in task_2_eval_main
    report = evaluate(y_gold, y_pred, eval_labels)
  File "/tmp/codalab/tmpS9y5OM/run/program/evaluation_sub2.py", line 136, in evaluate
    return classification_report(y_gold, y_pred, labels=eval_labels, output_dict=True)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/metrics/classification.py", line 1926, in classification_report
    zip(headers, [i.item() for i in avg]))
  File "/usr/local/lib/python3.7/dist-packages/sklearn/metrics/classification.py", line 1926, in 
    zip(headers, [i.item() for i in avg]))
AttributeError: 'int' object has no attribute 'item'


Many thanks!

Sasha Spala

Dec 3, 2019, 10:07:36 AM
to Mingyu Han, DeftEval 2020

Hi Mingyu,

I’ll take a look at your submission file later today.

Best,
Sasha


Sasha Spala

Dec 4, 2019, 2:58:36 PM
to Mingyu Han, DeftEval 2020

Hi Mingyu,

This error was the result of labels that occur in the res or ref files that are not being evaluated. I’ve added some warnings and errors to help with this – you should now see a warning if there are labels in the reference files that aren’t specified in the config file for evaluation. I’ve updated the evaluation program in Codalab, and added these changes to Github.
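
Roughly, the behavior is as in the sketch below (an illustration only, not the exact code now in evaluation_sub2.py):

import warnings

def check_label_sets(ref_labels, res_labels, eval_labels):
    # Illustration only, not the actual evaluation_sub2.py code.
    eval_set = set(eval_labels)
    extra_ref = set(ref_labels) - eval_set
    extra_res = set(res_labels) - eval_set
    if extra_ref:
        # Reference files may contain labels outside the configured eval set;
        # this is surfaced as a warning rather than a hard failure.
        warnings.warn(f"Reference labels not in the eval config: {sorted(extra_ref)}")
    if extra_res:
        # Predicted labels outside the eval set are reported as an error.
        raise ValueError(f"Encountered unknown or unevaluated label(s): {sorted(extra_res)}")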

 

Best,
Sasha

Sasha Spala

Dec 4, 2019, 4:19:18 PM
to Jayasimha Talur, DeftEval 2020

Hi Jayasimha,

We are currently considering fragments (in this dataset, typically definition spans that are interrupted by a term) out of scope for this competition. Term or definition spans that contain a fragment will not be in the test set.


Best,

Sasha

 


Message has been deleted

Sasha Spala

Dec 5, 2019, 10:31:00 AM
to Mingyu Han, DeftEval 2020

Whoops, thanks for the catch! You've caught a leftover from my debugging.

I’ve updated this on the github and in Codalab.


Best,
Sasha

 

From: <semeval-202...@googlegroups.com> on behalf of Mingyu Han <hanm...@gmail.com>
Date: Wednesday, December 4, 2019 at 10:09 PM
To: DeftEval 2020 <semeval-202...@googlegroups.com>
Subject: Re: Issue in subtask 2 evaluation script

 

Thanks for fixing this problem, but it seems a new error has occurred:

FileNotFoundError: [Errno 2] No such file or directory: '/tmp/codalab/tmprlhUzV/run/input/res_120419'

I found that line 55 of 'semeval2020_06_evaluation_main.py' is now: res_path = input_dir.joinpath('res_120419')

After I deleted the '_120419' suffix it worked. Should it be deleted in Codalab as well?

 

Many thanks!


On Thursday, December 5, 2019, at 3:58:36 AM UTC+8, Sasha Spala wrote:

Dian Todorov

Jun 7, 2020, 9:30:06 AM
to DeftEval 2020
I still don't understand how the evaluation is supposed to work against the test files.

There are tags in the labeled test data that are missing from the evaluation configuration. How should we treat them? The evaluation configuration contains the following tags:



- B-Term
- I-Term
- B-Definition
- I-Definition
- B-Alias-Term
- I-Alias-Term
- B-Referential-Definition
- I-Referential-Definition
- B-Referential-Term
- I-Referential-Term
- B-Qualifier
- I-Qualifier

However, the test file contains examples with tags that are not part of the evaluation, for example:
is data/source_txt/t1_biology_0_606.deft 1772 1774 I-Secondary-Definition

I don't understand how these are treated; are they just ignored? They are bound not to work. When I try to run against the test files, I get the following exception:
  File "~/deft_corpus/evaluation/program/evaluation_sub2.py", line 86, in validate_res_labels
    raise ValueError(f"Encountered unknown or unevaluated label: {label}")
ValueError: Encountered unknown or unevaluated label: B-Secondary-Definition

How does the testing work exactly?

Dian Todorov

Jun 7, 2020, 10:53:22 AM
to DeftEval 2020
In other words,

Is it OK if we preprocess the labeled test data so that every unknown tag is replaced with 'O'?

Sasha Spala

Jun 15, 2020, 11:51:56 AM
to Dian Todorov, DeftEval 2020

Hi Dian,


We left the original set of labels in the evaluation set so that folks who want to evaluate on a label set outside of the official SemEval context are still able to do so. If you are working with a subset of evaluation labels (as in the SemEval task), the evaluation script treats any labels not in the provided eval set (in the config file) as "O". However, it is your responsibility to handle those OOV labels in your submitted results files as you wish. You may do this either by treating OOV labels predicted by your model as "O" or by instructing your model to predict only the labels in the eval set, whichever you prefer. Since those approaches may produce different results, we leave this choice to you. If you are trying to submit the provided ground truth to the evaluation script as a sanity check, you'll have to remove the unevaluated labels (by marking all OOV labels as "O").
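
As a rough example, that sanity-check preprocessing could look something like the sketch below (illustrative only; the tag column index depends on how your files are laid out):

def remap_oov_tags(lines, eval_labels, tag_col=4):
    # Sketch: replace any tag outside the evaluated set with "O" before submitting.
    # Assumes tab-separated rows; adjust tag_col to where the tag sits in your files.
    eval_set = set(eval_labels) | {"O"}
    cleaned = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > tag_col and fields[tag_col] not in eval_set:
            fields[tag_col] = "O"   # e.g. B-Secondary-Definition -> O
        cleaned.append("\t".join(fields))
    return cleaned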

 

Hope this helps! Let me know if you have any other questions or if I can clarify more.


Best,
Sasha

 

