restriction on syntactic or semantic Delaf categories


Denis Maurel

Mar 15, 2016, 12:04:50 PM
to Unitex-GramLab


Dear All

Anubhav and I recently parsed a scientific text containing the sequence: {vi,... , v,}

We were surprised by how Unitex interpreted it:

{vi,... , v,} -> form "vi", lemma same as form and syntactic or semantic Delaf category ".. , v,"
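
One way to see how such a reading can arise: if the content of a {inflected,lemma.codes} tag is split at the first comma and then at the first dot, the sequence above yields exactly this result. The Python sketch below is only an illustration under that assumption; naive_tag_reading is a hypothetical helper, not the actual Unitex code.

```python
def naive_tag_reading(token):
    """Read a {inflected,lemma.codes} token by splitting naively.

    Hypothetical illustration: split at the first comma, then at the
    first dot, with no check on which characters the codes contain.
    """
    inner = token[1:-1]                      # drop the curly brackets
    inflected, _, rest = inner.partition(",")
    lemma, _, codes = rest.partition(".")
    # An empty lemma is read as "same as the inflected form".
    return {"form": inflected, "lemma": lemma or inflected, "codes": codes}


reading = naive_tag_reading("{vi,... , v,}")
print(reading)
```

With no restriction on the characters allowed in the codes, nothing in this kind of reading distinguishes a formula from a dictionary entry.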

What do you think about introducing restrictions on the names of syntactic or semantic DELAF categories?

For instance:

category: letters, digits, hyphen, underscore
simple feature: letters, digits, hyphen, underscore
feature with value: name (letters, digits, hyphen, underscore) = value (all characters except + { } / : unless protected: \+ \{ \} \/ \:)
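
A minimal sketch of these three rules as Python regular expressions follows. It assumes that "letters" means unaccented ASCII letters, that a value must not be empty, and that a backslash protects whatever character follows it; none of these points is settled by the proposal, and the names below are hypothetical.

```python
import re

# Assumption: "letters" are ASCII letters only.
NAME = r"[A-Za-z0-9_-]+"
# A value is any run of characters other than + { } / : and backslash,
# or a backslash followed by one protected character.
VALUE = r"(?:\\.|[^+{}/:\\])+"

CATEGORY_RE = re.compile(NAME)
FEATURE_RE = re.compile(rf"{NAME}(?:={VALUE})?")


def is_valid_category(code):
    """True if code is a well-formed category name under the sketch."""
    return CATEGORY_RE.fullmatch(code) is not None


def is_valid_feature(code):
    """True if code is a simple feature or a feature with a value."""
    return FEATURE_RE.fullmatch(code) is not None


for candidate in ["N", ".. , v,"]:
    print(repr(candidate), is_valid_category(candidate))
```

Under these rules the category ".. , v," from the example above would be rejected, since it contains dots, spaces and commas.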


What do you think about this proposal?

Best regards,

Denis Maurel


____________________________________
Professor Denis Maurel
Université François Rabelais Tours
LI (Computer Science Research Laboratory)
EPU-DI
64 avenue Jean-Portalis
37200 Tours
France
Phone: 33-2.47.36.14.35
Fax: 33-2.47.36.14.22
mailto:denis....@univ-tours.fr

http://www.univ-tours.fr/maurel

http://www.li.univ-tours.fr
http://tln.li.univ-tours.fr/


Cvetana Krstev

Mar 18, 2016, 5:15:54 AM
to Unitex-GramLab, denis....@univ-tours.fr
Dear all,

I agree with Denis' proposal. In my own dictionaries (of Serbian) I was very restrictive: categories, simple features and features with values all match [a-zA-Z0-9]+
Unitex need not be so restrictive, but it would definitely be very good to impose some reasonable restriction.

Best wishes, Cvetana

eric.laporte

Mar 19, 2016, 12:58:42 PM
to Unitex-GramLab, denis....@univ-tours.fr
Hi,
I agree with Denis's proposal of excluding special characters from codes for POS and features. With this change, the rules for naming POS and features would be explicit, and escape sequences would be available for features with an '=' sign, as in "Principaute d'Andorre,Andorra.N+Toponym+Lat=42\.5+Lng=1\.5".
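
To illustrate how escape sequences keep such an entry unambiguous, here is a small Python sketch that splits a line only on unprotected separators. split_unescaped is a hypothetical helper written for this example; it does not reproduce how Unitex itself reads dictionaries.

```python
def split_unescaped(text, sep):
    """Split text on sep, ignoring any sep preceded by a backslash.

    Escape sequences are kept as written in the returned parts.
    """
    parts, current, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            current.append(text[i:i + 2])    # keep the escaped pair together
            i += 2
        elif ch == sep:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    parts.append("".join(current))
    return parts


entry = r"Principaute d'Andorre,Andorra.N+Toponym+Lat=42\.5+Lng=1\.5"
inflected, rest = split_unescaped(entry, ",")
lemma, codes = split_unescaped(rest, ".")
features = split_unescaped(codes, "+")
print(inflected, lemma, features)
```

Because the dots in the coordinates are protected, the only unprotected dot is the one that separates the lemma from the codes.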
In the list of special characters, I suggest adding "~" since it can be used in lexical masks to negate a feature (manual, Section 4.3.3).
I suggest specifying more precisely what we call 'letters' in codes for POS and features: letters of the Latin alphabet or also of the alphabet of the language? It would be interesting to know the opinion of users working on languages with other alphabets.
Presently, in POS codes that contain an inflectional code referring to a transducer, accented characters are accepted by Unitex but it cannot process them correctly, at least on my Windows computer. For example, if I name N_poète the transducer that inflects the noun poète, Unitex processes the dictionary but searches for N_pote.fst2 instead of N_poète.fst2.

Best,
Eric Laporte

Denis Maurel

Mar 29, 2016, 10:20:20 AM
to Unitex-GramLab


Dear All

We had problems with texts containing curly brackets, for instance mathematical texts. So we propose to parse the content of the curly brackets precisely, in order to decide whether it is a dictionary entry or a mathematical formula.

Following the mails from Eric and Cvetana, we propose to limit the characters used in dictionary entries. On page 43 of the Unitex manual, we read:

An entry of a DELAF is a line of text terminated by a newline that conforms to the following syntax:
apples,apple.N+conc:p/this is an example

We propose to limit the inflected form and the canonical form:
* as of now: any character, except comma, dot, plus, colon, slash, escape character, curly bracket
* with escape character before: comma, dot, plus, colon, slash, escape character, curly bracket

We propose to limit the sequence of grammatical and semantic information:
* unaccented Latin alphabet
* digits, underscore, hyphen, tilde, equals sign
* plus to introduce feature
* with escape character before: comma, dot, plus, colon, slash, escape character, curly bracket
* colon to introduce morphological features
* slash to introduce comment

We propose to limit the comment:
* any character, except curly bracket and escape character
* with escape character before: escape character, curly bracket

Do you agree with this proposal for Unitex version 3.2? Would you suggest any other specification?


Best regards,

Denis Maurel





Hi,

Here is my opinion on using the Latin alphabet versus other alphabets, and English masks versus localized ones, as a representative of Serbian, which uses the Latin and Cyrillic alphabets equally. At present, in the Cyrillic module all masks are, naturally, in Latin script (French), and so are all other codes: POS, syntactic/semantic markers, grammatical codes. In practice, only the words of the language are in Cyrillic. This seemed natural to me, and I never thought that localized masks would be very helpful. However, I have to admit that this approach considerably complicates converting a collection of graphs developed in Cyrillic to Latin and vice versa. We have produced a procedure that performs this task almost perfectly.

To conclude: I support English masks, keeping French masks for backward compatibility, and I don't think that localization is the most important thing to do.

Best wishes, Cvetana

Cvetana Krstev

Mar 30, 2016, 3:52:27 PM
to Unitex-GramLab, denis....@univ-tours.fr
This proposal seems reasonable to me, Cvetana



Cristian Martinez

Mar 31, 2016, 1:43:36 PM
to unitex-...@googlegroups.com, denis....@univ-tours.fr
Dear Denis,

Could you please tell me whether the following entries are compatible with your proposal?

Congo Brazzaville,Republic of the Congo.N+Toponym+Country+Iso=CG+Lat=\-1+Lng=15+Lang=en;fr+UID=T4567822
Republica Eslovaca,Slovakia.N+Toponym+Country+Iso=SK+Lat=48\.66667+Lng=19\.5+Lang=es;pt+UID=T3567542


Thanks,


Denis Maurel

Apr 4, 2016, 4:35:43 AM
to Cristian Martinez, Unitex-GramLab


Dear Cristian

I have read your two lines. Our description omitted the semi-colon, so we are adding it. We proposed that a hyphen need not be preceded by a backslash, and the same holds for a minus sign. However, a backslash before a character that does not need protection will be accepted.
So both of your lines are OK.

New proposal:

We propose to limit the inflected form and the canonical form:
* as of now: any character, except comma, dot, plus, colon, slash, escape character, curly bracket, semi-colon
* with escape character before: comma, dot, plus, colon, slash, escape character, curly bracket

We propose to limit the sequence of grammatical and semantic information:
* unaccented Latin alphabet
* digits, underscore, hyphen, tilde, equals sign, semi-colon
* plus to introduce feature
* with escape character before: comma, dot, plus, colon, slash, escape character, curly bracket
* colon to introduce morphological features
* slash to introduce comment

We propose to limit the comment:
* any character, except curly bracket and escape character
* with escape character before: escape character, curly bracket
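
The three parts of this new proposal can be sketched as a single Python regular expression and checked against the two entries above. This is only one possible reading: it assumes that a backslash followed by any character is accepted everywhere, that the canonical form may be empty, and that morphological features use the same characters as the other codes. ENTRY_RE and parse_entry are hypothetical names, not part of Unitex.

```python
import re

# Inflected and canonical forms: anything except the reserved characters,
# or a backslash followed by any character.
FORM = r"(?:[^,.+:/\\{};]|\\.)"
# Grammatical and semantic information: unaccented Latin letters, digits,
# underscore, tilde, equals sign, semi-colon, hyphen, or an escaped character.
CODE = r"(?:[A-Za-z0-9_~=;-]|\\.)"
# Comment: anything except curly brackets and backslash, unless escaped.
COMMENT = r"(?:[^{}\\]|\\.)"

ENTRY_RE = re.compile(
    rf"(?P<inflected>{FORM}+),(?P<lemma>{FORM}*)\."
    rf"(?P<codes>{CODE}+(?:\+{CODE}+)*)"
    rf"(?P<morph>(?::{CODE}+)*)"
    rf"(?:/(?P<comment>{COMMENT}*))?"
)


def parse_entry(line):
    """Return the fields of a DELAF line, or None if it breaks the rules."""
    match = ENTRY_RE.fullmatch(line)
    return match.groupdict() if match else None


lines = [
    r"Congo Brazzaville,Republic of the Congo.N+Toponym+Country+Iso=CG"
    r"+Lat=\-1+Lng=15+Lang=en;fr+UID=T4567822",
    r"Republica Eslovaca,Slovakia.N+Toponym+Country+Iso=SK"
    r"+Lat=48\.66667+Lng=19\.5+Lang=es;pt+UID=T3567542",
    "apples,apple.N+conc:p/this is an example",
    "{vi,... , v,}",
]
for line in lines:
    print(parse_entry(line) is not None, line)
```

Under this reading both toponym entries and the example from the manual are accepted, while the sequence {vi,... , v,} is rejected.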



Best regards,

Denis Maurel








