PARSEME Shared Task 1.2 - Final Call for Participation

Agata Savary

Jun 19, 2020, 7:00:31 AM
to verbalmwe

PARSEME shared task 1.2 on semi-supervised identification of verbal multiword expressions

Final call for participation

(Apologies for cross-posting)

The third edition of the PARSEME shared task on automatic identification of verbal multiword expressions (VMWEs) aims at identifying **verbal MWEs** in running texts.  Verbal MWEs include, among others, verbal idioms (to let the cat out of the bag), light-verb constructions (to make a decision), verb-particle constructions (to give up), multi-verb constructions (to make do) and inherently reflexive verbs (s'évanouir 'to faint' in French).  Their identification is a well-known challenge for NLP applications, due to their complex characteristics including discontinuity, overlaps, non-compositionality, heterogeneity and syntactic variability.

Editions 1.0 (2017) and 1.1 (2018) showed that, while some systems reach high performance (F1 > 0.7) when identifying VMWEs seen in the training corpus, performance on unseen VMWEs remains very low (F1 < 0.2). Hence, this third edition puts **emphasis on discovering VMWEs that were not seen in the training corpus**.

We kindly ask potential participant teams to register using the expression of interest form:

Task updates and questions will be posted on the shared task website:

and announced on our public mailing list (anyone can join):

#### Publication and workshop

Shared task participants will be invited to submit a system description paper to a special track of the Joint Workshop on Multiword Expressions and Electronic Lexicons (MWE-LEX 2020), at COLING 2020, to be held on December 13, 2020, in Barcelona, Spain (postponed):

Submitted system description papers must follow the workshop submission instructions and will go through double-blind peer reviewing.  Their acceptance depends on the quality of the paper rather than on the ranking in the shared task.  Authors of the accepted papers will present their work as posters/demos in a dedicated session of the MWE-LEX 2020 workshop.  The submission of a system description paper is not mandatory.

Due to double-blind review, participants are asked to provide a nickname (i.e., a name that does not identify authors, universities, research groups, etc.) for their systems when submitting results and system description papers.

#### Provided corpora

The PARSEME team has prepared corpora in which VMWEs were manually annotated. The annotations follow the PARSEME 1.2 guidelines:

On March 23, 2020, we released, for each language: 

* a training corpus manually annotated for VMWEs;

* a development corpus to tune/optimize the systems' parameters; and

* a syntactically parsed raw corpus, not annotated for VMWEs, to support semi-supervised and unsupervised methods for VMWE discovery (depending on the language, its size ranges from 12 million to 2.5 billion tokens).

On July 1, 2020, we will release, for each language:

* a blind test corpus to be used as input to the systems; its VMWE annotations will be kept secret during the evaluation phase.

On July 3, 2020, participants will have to upload their annotated version of the test corpus at

Morphosyntactic annotations (parts of speech, lemmas, morphological features, and syntactic dependencies) are also provided, both for annotated and raw corpora.  Depending on the language, the information comes from treebanks (mostly Universal Dependencies v2) or from automatic parsers trained on UD v2 treebanks (e.g., UDPipe).

The annotated training and development corpora are released in the CUPT format (the CoNLL-U format with an extra column for the MWE annotations). The raw corpora are released in the CoNLL-U format. The blind test corpus will be released in the CUPT format, with an underspecified 11th column whose values are to be predicted. Reference annotations for the test corpus will be released after the evaluation phase.
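As a minimal illustration of the extra column, the snippet below sketches how its per-token values could be parsed. The exact value syntax shown here is an assumption for illustration (`*` for "no VMWE", `<id>:<category>` for the first token of a VMWE, a bare `<id>` for continuation tokens, `_` for an underspecified blind value, `;` separating overlapping VMWEs); the CUPT specification on the shared task website remains authoritative.

```python
def parse_mwe_column(value):
    """Return a list of (mwe_id, category_or_None) pairs for one token.

    Sketch only: assumes the value syntax described above,
    not the official CUPT parser.
    """
    if value in ("*", "_"):          # no VMWE / underspecified (blind) value
        return []
    pairs = []
    for part in value.split(";"):    # a token may belong to several VMWEs
        if ":" in part:              # first token of a VMWE: "id:CATEGORY"
            mwe_id, category = part.split(":", 1)
            pairs.append((int(mwe_id), category))
        else:                        # continuation token: bare "id"
            pairs.append((int(part), None))
    return pairs

print(parse_mwe_column("1:LVC.full"))  # first token of VMWE 1
print(parse_mwe_column("1;2:VID"))     # continuation of 1, start of 2
```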

The trial data, training and dev sets are available on the shared task's release repository:

The raw corpus is available on the corpus initiative website:

Corpora are available for the following languages: German (DE), Greek (EL), Basque (EU), French (FR), Irish (GA), Hebrew (HE), Hindi (HI), Italian (IT), Polish (PL), Brazilian Portuguese (PT), Romanian (RO), Swedish (SV), Turkish (TR), Chinese (ZH).

The amount of annotated data in the training, development, test, and raw corpus depends on the language.

#### Tracks

System results can be submitted in two tracks:

  * Closed track: Systems using only the provided training and development corpora (with VMWE and morpho-syntactic annotations) + provided raw corpora.

  * Open track: Systems may use the provided training corpus or not, plus any additional resources deemed useful (MWE lexicons, symbolic grammars, wordnets, other raw corpora, word embeddings and language models trained on external data, etc.). This track notably includes purely symbolic and rule-based systems.

In both tracks, the use of the corpora from the previous PARSEME shared tasks, and from the PARSEME source repositories, is strictly forbidden, as material may have moved during corpus splits.

Teams submitting systems in the open track will be requested to describe and provide references to all resources used at submission time. Teams are encouraged to favor freely available resources for better reproducibility of their results.

#### Evaluation metrics

Participants will provide the output produced by their systems on the test corpus in the CUPT format, with the 11th column containing their predictions. This output will be compared with the gold standard (ground truth) using both generic and specialised precision, recall and F1 scores.
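To make the MWE-based view concrete: under that metric, a predicted VMWE counts as a true positive only if it covers exactly the same set of tokens as a gold VMWE. The sketch below is a simplified illustration of such a score, not the official evaluation script, which remains authoritative.

```python
def prf(gold, pred):
    """MWE-based precision/recall/F1 on one corpus.

    gold, pred: collections of VMWEs, each represented as a frozenset
    of token ids; only an exact token-set match counts as a true
    positive. Simplified sketch, not the official PARSEME scorer.
    """
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [frozenset({3, 4}), frozenset({7, 9, 10})]
pred = [frozenset({3, 4}), frozenset({7, 9})]   # second one misses token 10
print(prf(gold, pred))  # -> (0.5, 0.5, 0.5)
```

Token-based scores differ in that they credit partial overlaps at the level of individual tokens, which is why both views are reported.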

The evaluation metrics will be the same as for the 1.1 edition, as described in:

Note that for the 1.2 edition, the published general ranking will emphasize three metrics:

   * global MWE-based

   * global Token-based

   * unseen MWE-based

A VMWE from the test corpus is considered seen if a VMWE with the same (multi-)set of lemmas is annotated at least once in the training or development corpus.
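Since the definition compares multisets of lemmas (order-independent, but sensitive to repetitions), the seen/unseen test can be sketched as follows; extracting the lemma sequences from the corpora is assumed to happen elsewhere, and the example lemmas are illustrative only.

```python
from collections import Counter

def lemma_key(lemmas):
    """Hashable, order-independent multiset key for a VMWE's lemmas."""
    return frozenset(Counter(lemmas).items())

# Hypothetical lemma sequences of VMWEs annotated in training + dev:
train_vmwes = [["make", "decision"], ["give", "up"]]
seen_keys = {lemma_key(lemmas) for lemmas in train_vmwes}

# A test VMWE is "seen" iff its lemma multiset occurs in seen_keys:
print(lemma_key(["decision", "make"]) in seen_keys)       # order ignored
print(lemma_key(["let", "cat", "out", "bag"]) in seen_keys)  # unseen
```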

#### Corpus split

For each language, the annotated sentences are shuffled and split in a way which ensures a minimum of 300 VMWEs in the test set that are unseen in the training + dev sets. This means that the natural sequence of sentences in a document will not be respected in the proposed corpus split. Note that the unseen ratio, that is, the proportion of unseen VMWEs with respect to all VMWEs in the test set, may vary across languages. To guide participants in this hard task, the number and rate of unseen VMWEs for the dev corpora are available on the shared task website.

#### Important dates (updated)


  * Feb 19, 2020: trial data and evaluation script released

  * Mar 23, 2020: training and development corpus + raw corpus released

  * Jul 01, 2020: blind test corpus released

  * Jul 03, 2020: submission of system results

  * Jul 09, 2020: announcement of results

  * Sep 02, 2020: shared task system description papers due (same as regular papers)

  * Oct 16, 2020: notification of acceptance

  * Nov 01, 2020: camera-ready system description papers due

  * Dec 13, 2020: shared task session at the MWE-LEX 2020 workshop at Coling 2020

#### Organizing team

Carlos Ramisch, Marie Candito, Bruno Guillaume, Agata Savary, Ashwini Vaidya, and Jakub Waszczuk
