Unimorph gaps and errors

Christian Chiarcos

Aug 24, 2017, 11:24:04 AM

to unimorph, Christian Chiarcos

Dear all,

in the Unimorph data, I observed 45 features that are not in line with Sylak-Glassman (2016).

There are three error types:
- gaps in S-G
- typos, misspellings
- unclear

Please see the table attached that lists the features together with the language and a manual error analysis.

Best,

Christian

PS: This illustrates feature validation as a possible application of the Unimorph ontology.

feats.xlsx

Seth Ebner

Aug 24, 2017, 3:25:08 PM

to unimorph

ara also uses the label "NDEF" instead of "INDF". I submitted an issue to the ara repo last week but haven't received a response.
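A mismatch like "NDEF" versus "INDF" is the kind of thing the feature validation mentioned above could catch automatically. The following is only a minimal sketch: `KNOWN_FEATURES` is a small illustrative subset rather than the full Sylak-Glassman (2016) inventory, and the function name `unknown_features` is hypothetical.

```python
# Illustrative subset only; a real check would load the complete feature
# inventory from the schema documentation.
KNOWN_FEATURES = {"N", "NOM", "SG", "PL", "DEF", "INDF"}


def unknown_features(bundle, known=KNOWN_FEATURES):
    """Return the features in a ';'-separated bundle that are not in `known`."""
    return [feat for feat in bundle.split(";") if feat not in known]


print(unknown_features("N;NOM;SG;NDEF"))  # the non-standard label is reported
print(unknown_features("N;NOM;SG;INDF"))  # nothing to report
```

Running such a check over every file in a language repository would give a per-language list of labels to review, comparable to the table attached to the first message.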

Ryan Cotterell

Aug 31, 2017, 10:04:00 AM

to Seth Ebner, unimorph

Thanks Seth,

We're still setting up the infrastructure to quickly handle issues like this. I think we need to assign individuals to sets of languages to make sure there is a single person responsible.

Thanks,

--Ryan

arya.m...@gmail.com

Jul 12, 2018, 12:40:37 PM

to unimorph

Reviving this thread because all of the ":" issues that Christian found now have pull requests. We should improve our argument template for the sake of Basque. The arguments are positional, and the spec elsewhere says that * can represent a missing or universally compatible value for a dimension. With that in mind, I propose accepting Christian's recommendations for Basque, but with * instead of ?.

Christian Chiarcos

Jul 12, 2018, 1:33:23 PM

to unimorph, arya.m...@gmail.com, chia...@informatik.uni-frankfurt.de

Dear Arya, dear all,

thanks for pointing that out. In my original error table, the ?? were not meant to be a recommendation but just a mark that the scheme required something that isn't there. As for *, it unfortunately comes with the naive interpretation as a wildcard, i.e., an arbitrary sequence of characters, and this can be problematic because it means losing position information if multiple features are concatenated. So if multiple sub-features involve the same ASCII character, and we cannot recover their position, we cannot tell them apart. Another placeholder that doesn't come with the sequential connotation would be better. In Perl regular expressions, such a placeholder would be ".", which matches exactly one character.
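To illustrate the difference between the two placeholders, here is a minimal sketch in Python. The fixed-width tag layout it assumes ("ARG", then a two-character case code, then one character each for person and number) and the sample tags are for illustration only and are not taken from any Unimorph data set.

```python
import fnmatch
import re

# Hypothetical sample tags; only the first two follow the assumed
# fixed-width layout.
tags = ["ARGNO3S", "ARGAB3S", "ARGNO3", "ARGNOXX3S"]

# Shell-style "*" matches any sequence of characters, including none,
# so the position of the "3" is not constrained.
glob_hits = [t for t in tags if fnmatch.fnmatchcase(t, "ARG*3*")]

# Regex "." matches exactly one character, so every slot keeps its position.
slot_pattern = re.compile(r"ARG..3.")
regex_hits = [t for t in tags if slot_pattern.fullmatch(t)]

print(glob_hits)   # all four tags match
print(regex_hits)  # only the two well-formed tags match
```

With "*", the malformed tags are accepted along with the well-formed ones. With one "." per slot, a tag of the wrong length is rejected.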

However, after I sent this table, we developed more detailed recommendations, mostly also communicated via this list, in particular the suggestion to replace the ARG mechanism with a ranking-based encoding (http://www.lrec-conf.org/proceedings/lrec2018/summaries/421.html). I discussed that with David Yarowsky and Christo Kirov at LREC. Also see our Unimorph fork at https://github.com/acoli-repo/unimorph for the implementation of these revisions, and https://github.com/acoli-repo/unimorph/tree/master/eus/src for the mapping for Basque.

In fact, a ranking-based encoding for polyvalent verbs in head-marking languages has a number of advantages: it prevents unrestricted tagset explosion (which is inevitable if a compositional scheme is used), it allows different features to be encoded independently rather than by concatenation (and thus in a way that is more consistent with the annotation for languages without double agreement), and it can accommodate different language-specific rankings (e.g., either based on expected morphological case, as currently recommended, or based on grammatical roles, as necessary for Kartvelian languages).

Moreover, if the highest-ranking argument (say, the subject, if defined as such for a particular language) goes unmarked, and all others get numerical indices (say, features of the direct object marked by -1, of the indirect object by -2, etc.), the annotations for this argument are identical to the annotations we would have in a language without double agreement. This is ideal for projection experiments.
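As a rough sketch of the indexing idea only: the function below leaves the features of the top-ranked argument unmarked and appends -1, -2, ... to the features of lower-ranked arguments. The function name `encode_ranked` and the exact surface form of the indexed tags are assumptions made for illustration; the mappings in the fork are the authoritative version.

```python
def encode_ranked(arguments):
    """Flatten per-argument feature lists into one ';'-separated bundle.

    `arguments` is ordered by rank, highest first. The top-ranked argument
    stays unmarked; the argument at rank i > 0 gets the suffix "-i".
    """
    tags = []
    for rank, feats in enumerate(arguments):
        suffix = "" if rank == 0 else f"-{rank}"
        tags.extend(f"{feat}{suffix}" for feat in feats)
    return ";".join(tags)


# One agreeing argument: same output as in a language without double agreement.
print(encode_ranked([["3", "SG"]]))

# Two agreeing arguments: the lower-ranked one carries the index.
print(encode_ranked([["3", "SG"], ["1", "PL"]]))
```

Because each feature is indexed on its own, no concatenated tag is created, and the bundle for the top-ranked argument can be compared directly across languages.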

Our suggestion is downward-compatible in the sense that *not a single* Unimorph data set at the time used the ARG schema as it had been de(/pre)scribed. Very different ideas had been implemented, including some resembling our own (e.g., ARG-encoding of individual features rather than by concatenation, non-marking of top-ranking arguments). All of this is documented in our GitHub repository, with a mapping for all languages with ARG-encoding.

Some other suggestions (fully downward-compatible, in that case) have been implemented as well, e.g., for recursive inflectional morphology as necessary for nominal morphology in Sumerian and languages from different language families in the Caucasus.

My idea was and is that the Unimorph community takes a thorough look at our fork and, if it finds approval (or if it inspires an alternative extension), that we merge our fork into the original repos.

All the best,

Christian

--
You received this message because you are subscribed to the Google Groups "unimorph" group.
To unsubscribe from this group and stop receiving emails from it, send an email to unimorph+u...@googlegroups.com.
To post to this group, send email to unim...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/unimorph/0d95c10f-cdc7-4898-860a-659ce7514fd4%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Arya McCarthy

Jul 12, 2018, 6:36:56 PM

to unimorph, Christian Chiarcos, chia...@informatik.uni-frankfurt.de

Hi Christian,

Thanks for the good point about wildcard, and for letting me know about your paper—it makes important points about the schema. I withdraw my recommendation to alter the schema for Basque’s sake—what you’ve done looks good, and it aligns well with my understanding of Basque.

-am

Ryan Cotterell

Jul 12, 2018, 7:21:59 PM

to Arya McCarthy, unimorph, Christian Chiarcos, chia...@informatik.uni-frankfurt.de

Yes, thanks Christian!
