PARSEME ST1.2 train/dev data released!


Carlos Ramisch

Mar 23, 2020, 7:59:35 PM
to verbalmwe

Dear all,


We are happy to announce the release of the **training and development data** for the PARSEME shared task 1.2 on semi-supervised identification of verbal multiword expressions (VMWEs):


https://gitlab.com/parseme/sharedtask-data/tree/master/1.2



### Languages


We provide full training and development sets for 14 languages: German (DE), Greek (EL), Basque (EU), French (FR), Irish (GA), Hebrew (HE), Hindi (HI), Italian (IT), Polish (PL), Brazilian Portuguese (PT), Romanian (RO), Swedish (SV), Turkish (TR) and Chinese (ZH).



### Annotated data


We provide .cupt files that contain VMWE annotations and morphosyntactic data. The annotation guidelines for VMWEs were slightly extended with respect to previous editions to accommodate Chinese and Swedish phenomena, and to fix minor issues in Hindi-specific tests, leading to PARSEME guidelines 1.2.
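
For readers new to the format: a .cupt file is a CoNLL-U file with an extra 11th column, PARSEME:MWE, where a code such as `2:VID` opens VMWE number 2 with category VID and a bare `2` marks a further token of that VMWE. Below is a minimal reading sketch in Python (not the official PARSEME tooling), assuming a well-formed file:

```python
from collections import defaultdict

def read_vmwes(path):
    """Yield, per sentence, a dict mapping VMWE id -> list of (FORM, LEMMA) pairs."""
    vmwes = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                       # blank line ends a sentence
                if vmwes:
                    yield dict(vmwes)
                vmwes = defaultdict(list)
                continue
            if line.startswith("#"):           # sentence metadata
                continue
            cols = line.split("\t")
            # Column 11 (PARSEME:MWE): "*" = no VMWE, "_" = not annotated
            if len(cols) < 11 or cols[10] in ("*", "_"):
                continue
            for code in cols[10].split(";"):   # e.g. "1:VID;2"
                mwe_id = code.split(":")[0]
                vmwes[mwe_id].append((cols[1], cols[2]))  # FORM, LEMMA
    if vmwes:
        yield dict(vmwes)
```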


The accompanying morphosyntactic information (POS tags, lemmas, morphological features and/or syntactic dependencies) uses the UD v2 scheme (the exact version of the UD-based data depends on the language). Depending on the language, the morphosyntactic information was manually or automatically annotated. All annotations are available under open licenses, notably various flavors of the Creative Commons license.


We remind you that the blind test data will be released on April 28, and the submission of system results is due on April 30.



### Additional raw corpora


We also provide "raw" corpora, meant to help identify VMWEs that were unseen at training time. Here are the instructions for downloading these raw corpora.


The raw corpora were automatically parsed with UD v2 tools (the exact version depending on the language) and are provided in the CoNLL-U format. Their sizes vary from language to language; see the raw corpora page for statistics.



### Split of the annotated data


We provide training (train), development (dev) and test sets for each language. The test set will be released later, after the evaluation phase is over. The data split was performed with unseen VMWE identification in mind. The split is random, but we controlled the following factors for each language:


  - Test contains about 300 VMWEs which are unseen in train+dev

  - Dev contains about 100 VMWEs which are unseen in train

  - The ratio of unseen VMWEs in test with respect to train+dev (resp. dev with respect to train) is as close as possible to an average (see below for details)


Unseen VMWEs are defined as in the evaluation script, that is, a VMWE in test (resp. dev) is considered unseen in train+dev (resp. train) if its multi-set of lemmas does not occur as an annotated VMWE, with the same multi-set of lemmas, in train+dev (resp. train). 
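
As a minimal sketch of this definition (not the evaluation script itself), unseenness can be checked by comparing lemma multisets, here encoded as sorted tuples and built on the (FORM, LEMMA) pairs produced by the reading sketch above:

```python
def lemma_multiset(vmwe):
    """Encode one VMWE (a list of (FORM, LEMMA) pairs) as a sorted
    tuple of lemmas, i.e. a hashable multiset."""
    return tuple(sorted(lemma for _form, lemma in vmwe))

def unseen_vmwes(eval_vmwes, ref_vmwes):
    """Return the VMWEs of the evaluated set (e.g. test) whose lemma
    multiset never occurs among the reference VMWEs (e.g. train+dev)."""
    seen = {lemma_multiset(v) for v in ref_vmwes}
    return [v for v in eval_vmwes if lemma_multiset(v) not in seen]
```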


The ratios of unseen VMWEs vary from language to language.  For most languages, the ratios of unseen VMWEs in test (with respect to train+dev) and in dev (with respect to train) are comparable, but this was not possible for languages with little data.


To choose the final split, we first estimated the number of sentences in test (resp. dev) needed to provide 300 (resp. 100) VMWEs unseen in train+dev (resp. train). Then, we ran several random splits and selected the one whose unseen ratio is as close as possible to the average.
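
A hypothetical sketch of that selection step (the organizers' actual scripts may differ) could look as follows, with `unseen_ratio(held_out, rest)` supplied by the caller, e.g. built on the lemma-multiset check above:

```python
import random

def choose_split(sentences, held_out_size, target_ratio, unseen_ratio,
                 trials=100, seed=0):
    """Among `trials` random splits, keep the one whose unseen ratio is
    closest to `target_ratio`. `unseen_ratio` is a caller-supplied
    function (hypothetical here) scoring held-out vs. remaining data."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        shuffled = list(sentences)
        rng.shuffle(shuffled)
        held_out, rest = shuffled[:held_out_size], shuffled[held_out_size:]
        ratio = unseen_ratio(held_out, rest)
        if best is None or abs(ratio - target_ratio) < abs(best[0] - target_ratio):
            best = (ratio, held_out, rest)
    return best
```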



### Guidelines to participants


During the system development phase, and for computing the results on the test sets, the participants are free to use train+dev in any way. In other words, the dev set can be added to the train set for machine learning purposes.


In both tracks, **no data from the previous editions should be used**.


The evaluation metrics will be the same as for edition 1.1. However, for edition 1.2, the published general ranking will emphasize three metrics (a scoring sketch follows the list):

* global MWE-based,

* global token-based,

* unseen MWE-based.
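
As an illustration of MWE-based scoring (a minimal sketch in the spirit of the shared task evaluation, not the official script), precision and recall can be computed over exact matches of annotated token sets; token-based scoring would instead count overlapping tokens:

```python
def mwe_based_f1(gold, pred):
    """MWE-based F1 under exact matching: gold and pred are sets of
    frozensets of token positions; a prediction counts only if it covers
    exactly the same tokens as a gold VMWE."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: one exact match out of two predictions and two gold VMWEs
gold = {frozenset({3, 5}), frozenset({7, 8, 9})}
pred = {frozenset({3, 5}), frozenset({7, 8})}
print(mwe_based_f1(gold, pred))  # P = R = 0.5, so F1 = 0.5
```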


Do not forget to register for the participants' mailing list. We will also post the latest updates on the shared task 1.2 website.


As seen in previous PARSEME shared task editions, supervised VMWE identifiers perform rather well on seen VMWEs, but very poorly on unseen ones. We hope that this highly multilingual dataset will foster the development of systems with an increased ability to identify VMWEs unseen at training time.


This has been a tremendous collective effort, possible only with the strong commitment of many annotators, language leaders, organizers and technical support experts. We would like to thank all contributors for the time and enthusiasm they invested in the creation of this resource. In particular, the following people helped us by managing language-specific annotations and preparing the raw corpora: Abigail Walsh, Archna Bhatia, Chaya Liebeskind, Federico Sangati, Johanna Monti, Menghan Jiang, Hongzhi Xu, Rafael Ehren, Renata Ramisch, Sara Stymne, Timm Lichte, Tunga Güngör, Uxoa Iñurrieta, Verginica Barbu Mititelu, Voula Giouli, Zeynep Yirmibeşoğlu.



All the best,

Carlos Ramisch, Bruno Guillaume, Agata Savary, Jakub Waszczuk, Marie Candito and Ashwini Vaidya



--
 Carlos RAMISCH
http://pageperso.lis-lab.fr/~carlos.ramisch

---------------------------------------------------------------------------------
 work_fr: (+33) 4 86 09 06 72 
 address:
    LIS-TALEP
    Parc Scientifique et Technologique de Luminy
    163, avenue de Luminy - Case 901
    13288 MARSEILLE CEDEX 9
    France
---------------------------------------------------------------------------------