Normalizing nominal expressions: INFLEXION in input > Lowercase Lemma in output?


Nadia Ivanova

Jul 25, 2016, 10:50:46 PM
to Unitex-GramLab
Hello all,
I'm trying to use Unitex to normalize some strings in Spanish.

Let's say I have the input "AYUDANTES" and, in Replace mode, I want to output the lemma in lowercase except for the first letter, to get "Ayudante".
I'm using a variable to detect the noun, and then I created an empty box with this output:
<E>/$JobTitle.LEMMA$
(leaving lowercasing aside for the moment).

But it's not working. I read Sections 6.7.5 (Transducer outputs with variables) and 6.8 (Variables) in the manual but still can't figure it out. I suspect my syntax is wrong. My output does not appear in bold as in the manual's screenshots.

Later on I would like to detect nominal expressions and then normalise them in the same way:
AYUDANTES GENERALES -> Ayudante General

Any comments or suggestions would be much appreciated.
Thank you,
Nadia

Denis Maurel

Jul 26, 2016, 2:32:51 AM
to Nadia Ivanova, Unitex-GramLab


Dear Nadia

1) Maybe you forgot to declare your dictionary as a morphological-mode dictionary (Info/Preferences)?

2) To convert uppercase to lowercase, you can use a graph in replace mode with the morphological mode: < A/a B/b etc.>. If you use a cascade, put a symbol at the beginning and at the end of the word to be converted.
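
The letter-by-letter idea in point 2 can be imitated outside Unitex. The sketch below is plain Python, not Unitex graph syntax. The `{` and `}` delimiters and the name `lowercase_marked` are assumptions made for illustration; they stand in for the symbols a cascade would place around the word to be converted.

```python
import re

# Hypothetical delimiters that an earlier pass is assumed to have put around the word.
OPEN, CLOSE = "{", "}"

# Letter-by-letter mapping in the spirit of "A/a B/b etc."; some Spanish letters are included.
LETTER_MAP = {upper: upper.lower() for upper in "ABCDEFGHIJKLMNOPQRSTUVWXYZÁÉÍÓÚÜÑ"}

def lowercase_marked(text, keep_initial=True):
    """Lowercase only the words enclosed in the delimiters, then drop the delimiters."""
    def convert(match):
        word = match.group(1)
        # Optionally leave the first letter as it is, so that AYUDANTE becomes Ayudante.
        head, tail = (word[:1], word[1:]) if keep_initial else ("", word)
        return head + "".join(LETTER_MAP.get(ch, ch) for ch in tail)
    pattern = re.escape(OPEN) + r"(.*?)" + re.escape(CLOSE)
    return re.sub(pattern, convert, text)

print(lowercase_marked("PUESTO: {AYUDANTE} {GENERAL}"))
```

Text outside the delimiters is left untouched, which is the reason for marking the word boundaries before the conversion pass.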


Best regards,

Denis Maurel


____________________________________
Professor Denis Maurel
Université François Rabelais Tours
LI (Computer Science Research Laboratory)
EPU-DI
64 avenue Jean-Portalis
37200 Tours
France
Phone: 33-2.47.36.14.35
Fax: 33-2.47.36.14.22
mailto:denis....@univ-tours.fr

http://www.univ-tours.fr/maurel

http://www.li.univ-tours.fr
http://tln.li.univ-tours.fr/




eric.laporte

Jul 26, 2016, 4:00:55 AM
to Unitex-GramLab
Dear Nadia,    
The $JobTitle.LEMMA$ syntax is compatible with dictionary-entry variables (Section 6.4.4), not with input variables (6.7.5). It requires either the morphological mode (6.4) or the intersection-automaton option of Locate Pattern. It also requires declaring the dictionary as a morphological-mode dictionary (6.4.3).
The output will be lowercase if lemmas are lowercase in the dictionary. If you want to uppercase initials after generating lemmas, you need an additional substitution pass.
Whether box outputs in your graphs appear in bold or not depends on display options (Section 5.3.5), not on the syntax used to define or use variables.
Best,
Eric Laporte
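
The two-step logic described above (lemmas first, capitalized initials in a separate pass) can be sketched in plain Python. This is only an illustration of the order of operations, not Unitex code: the `TOY_LEMMAS` table and both function names are made up for the example.

```python
import re

# Toy stand-in for a morphological dictionary: inflected form -> lowercase lemma.
# These entries are invented for the example, not taken from a Unitex dictionary.
TOY_LEMMAS = {
    "ayudantes": "ayudante",
    "generales": "general",
    "administrativa": "administrativo",
}

def lemmatize_pass(text):
    """Pass 1: replace each known inflected form by its lowercase lemma."""
    def to_lemma(match):
        word = match.group(0)
        # Unknown words are returned unchanged.
        return TOY_LEMMAS.get(word.lower(), word)
    return re.sub(r"\w+", to_lemma, text)

def capitalize_pass(text):
    """Pass 2: uppercase the first letter of every word, leaving the rest untouched."""
    return re.sub(r"\w+", lambda m: m.group(0)[:1].upper() + m.group(0)[1:], text)

print(capitalize_pass(lemmatize_pass("AYUDANTES GENERALES")))
```

Keeping the two passes separate mirrors the point made in the reply: the case of the output comes from the dictionary, so capitalization has to be applied afterwards.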

Nadia Ivanova

Jul 26, 2016, 6:18:21 AM
to Unitex-GramLab
Thank you for your answer, Eric, and thank you both for taking the time to answer me. I really appreciate it.
For the moment I don't quite see how to combine pattern detection (which I'm quite familiar with) and normalisation in replace mode (from inflexion to lemma, which is quite new for me in Unitex). Do you have any examples of it, or previous topics with something similar? I'm trying to detect a pattern and then output it in a normalised form. Does it make sense? Is it possible?

As for the lemma output, I managed to get a lowercase lemma in the output for the noun and the adjective separately, but not for both together.
I can see on page 106 that we should be able to use several variables in a row in the output, but the manual does not tell me how to write them. In the manual's example, how are $year$ $month$ written? I tried two separate boxes, and also a single box with a space in between, but I only get the last variable in the output (with two boxes) or quite a weird result (with a single box).

I'm putting in the output box:
<E>/$Noun.LEMMA$  $Adjective.LEMMA$
(see the graph screenshot attached)

Thank you.
P. S. I have now declared the morphological-mode dictionary, thank you to you both for pointing it out to me.
Unitex-01.png

eric.laporte

Jul 26, 2016, 7:05:08 AM
to Unitex-GramLab
Nadia,
Nadia Ivanova wrote:
<<
For the moment I don't quite see how to combine pattern detection (which I'm quite familiar with) and normalisation in replace mode (from inflexion to lemma, which is quite new for me in Unitex).
Do you have any examples of it, or previous topics with something similar? I'm trying to detect a pattern and then output it in a normalised form. Does it make sense? Is it possible?
>>
Yes, it makes sense and it is possible.

<< 
I can see on page 106 that we should be able to use several variables in a row in the output, but the manual does not tell me how to write them. In the manual's example, how are $year$ $month$ written?
>>
In the same way as in your attached graph.

<<
I tried it in two separate boxes or the same box with a space in between
>>
When you define a dictionary-entry variable, you should add the defining output to a box which contains only one lexical mask, as in your attached graph. If the box contains two lexical masks, Unitex/GramLab will not create two dictionary-entry variables.

<<
but I only get the last one in the output (with two boxes)
>>
In the attached graph, the creation of the two dictionary-entry variables seems correct. If one of the outputs does not appear, it must be for some other reason.
Best,
Eric
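
As a rough mental model of how an output such as `$Noun.LEMMA$ $Adjective.LEMMA$` is filled in, the sketch below expands such a template in plain Python. It does not reproduce how Unitex does this internally; the `Entry` class, the `expand_output` function and the bindings are hypothetical, and only the `LEMMA` and `INFLECTED` fields mentioned in this thread are handled.

```python
import re
from dataclasses import dataclass

@dataclass
class Entry:
    """Minimal stand-in for a matched dictionary entry (hypothetical structure)."""
    inflected: str
    lemma: str

def expand_output(template, bindings):
    """Replace each $name.LEMMA$ or $name.INFLECTED$ by the bound entry's field."""
    def substitute(match):
        entry = bindings[match.group(1)]
        return entry.lemma if match.group(2) == "LEMMA" else entry.inflected
    return re.sub(r"\$(\w+)\.(LEMMA|INFLECTED)\$", substitute, template)

# One variable per matched word, mirroring "one lexical mask per box".
bindings = {
    "Noun": Entry(inflected="AYUDANTES", lemma="ayudante"),
    "Adjective": Entry(inflected="GENERALES", lemma="general"),
}
print(expand_output("$Noun.LEMMA$ $Adjective.LEMMA$", bindings))
```

Each variable is bound to exactly one matched entry, which is why a box holding two lexical masks cannot yield two separate variables.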

Nadia Ivanova

Jul 27, 2016, 1:50:10 AM
to Unitex-GramLab
Hi Eric, 
thank you very much for your support and for letting me know what I'm trying to do is possible and that I'm doing it right :-)
I persevered and made some progress, testing on a small dedicated corpus.

I'm getting there with inflexion > lemma output for N+Adj.; it is working now. The only difference from the previous graph is that I put my Noun and my Adjective each in a separate box surrounded by morphological-mode tags. That's probably what you meant when you talked about two lexical masks. In case it helps anybody in the future, I'm posting a screenshot with a comparison.

Thank you again!

I will now try to apply the same principle to more complex grammars as my pattern is a bit more complex than just N.+Adj. ;-)

Is it compatible with negative left context? 

BTW, there might be an error in the manual, or am I getting it wrong? On page 130, isn't it a left context rather than a right one? This is what the manual says:
In graphs like that of Figure 6.15, the negative right context does not need to match the same number of tokens as the box after it. For example, before the graph of Figure 6.16 recognizes too, the negative right context checks if it occurs in a phrase like too early or too many. 
Same here:

Figure 6.17: Advanced use of right contexts 


Cheers,
Nadia
Unitex_Lemma_Output_KO&OK.png

eric.laporte

Jul 27, 2016, 4:14:24 AM
to unitex-...@googlegroups.com
Hi Nadia,


On Wednesday, 27 July 2016 07:50:10 UTC+2, Nadia Ivanova wrote:
I'm getting there with inflexion > lemma output for N+Adj.; it is working now. The only difference from the previous graph is that I put my Noun and my Adjective each in a separate box surrounded by morphological-mode tags. That's probably what you meant when you talked about two lexical masks. In case it helps anybody in the future, I'm posting a screenshot with a comparison.
>>
That's not what I meant: I hadn't noticed that the space between the noun and the adjective was missing in your Unitex-01 graph. In the morphological mode, spaces in patterns must appear explicitly (enclosed in double quotes: " "). As it was, Unitex-01 searched for patterns like libronuevo. Your new graph works because it accepts an implicit space while it is outside the morphological mode. But instead of leaving and re-entering the morphological mode between the words of your patterns, you can insert the spaces explicitly. Alternatively, you can generate the text automaton and use the "automaton intersection" option of Locate Pattern instead of the morphological mode. Then you don't need to insert the spaces.
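
The effect of the missing space can be pictured with ordinary regular expressions. The sketch below is only an analogy in Python, not a description of the Unitex matcher; the word lists are toy stand-ins for the noun and adjective lexical masks.

```python
import re

# Toy word lists standing in for the noun and adjective lexical masks.
NOUN = r"(?P<noun>libro|ayudante)"
ADJ = r"(?P<adj>nuevo|general)"

# No separator between the two masks: only glued forms such as "libronuevo" match.
glued = re.compile(NOUN + ADJ)

# Explicit space between the masks, analogous to an explicit " " in the morphological mode.
spaced = re.compile(NOUN + " " + ADJ)

for candidate in ("libronuevo", "libro nuevo"):
    print(candidate, bool(glued.fullmatch(candidate)), bool(spaced.fullmatch(candidate)))
```

In this analogy the glued pattern accepts only the concatenated string, and the two-word sequence is found only once the space is part of the pattern.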

<<
Is it compatible with negative left context? 
>>
Left and right contexts are incompatible with the morphological mode, but I think they are compatible with the "automaton intersection" option of Locate Pattern.

<< 

BTW, there might be an error in the manual, or am I getting it wrong? On page 130, isn't it a left context rather than a right one? This is what the manual says:
In graphs like that of Figure 6.15, the negative right context does not need to match the same number of tokens as the box after it. For example, before the graph of Figure 6.16 recognizes too, the negative right context checks if it occurs in a phrase like too early or too many. 
Same here:

Figure 6.17: Advanced use of right contexts 

>>
Unitex/GramLab does not perform negative searches in left contexts. The contexts in Figures 6.15 and 6.16 are called negative right contexts because they search to the right from a position that is identifiable in the graph (and so do positive right contexts, see Figure 6.28).
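
A negative lookahead in a regular expression behaves in a comparable way and may help to see why the context counts as a right context even though it is written before the word. The sketch below is a Python analogy for the "too early / too many" example quoted from the manual, not Unitex syntax; the function name `find_too` is made up.

```python
import re

# The lookahead is written BEFORE "too" but looks to the RIGHT from that position.
# It may inspect more tokens ("too early", "too many") than the part that is matched ("too").
pattern = re.compile(r"\b(?!too (?:early|many)\b)too\b")

def find_too(text):
    """Return the start offsets of every 'too' not followed by 'early' or 'many'."""
    return [m.start() for m in pattern.finditer(text)]

print(find_too("too early, too late, too many"))
```

Only the occurrence in "too late" is reported, because the check starts at the position before "too" and reads forward from there.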
Best,
Eric

Nadia Ivanova

Jul 28, 2016, 9:31:31 AM
to Unitex-GramLab
Thank you again, Eric.
I read Section 7.7 about the "automaton intersection" option of Locate Pattern, but it's a bit abstract for me...
Anyway, my negative contexts are working fine with the morphological mode for some reason (I'm not complaining ;-)).
I take note of the advice to stay within the morphological mode, but I actually realised that splitting N+A expressions gives me finer granularity (as I use a variable for each box).

I have one more question, I hope you don't mind?

I'm using dictionary-entry variables and am wondering whether I can specify which inflexion I want to output.

Let's say I have 'Administrativo/a' in the input text. I can now get the output 'Administrativo' (with $xxx.INFLECTED$ or $xxx.LEMMA$, it's the same here) but not 'Administrativa': it looks like $xxx.INFLECTED$ only allows you to output the original inflexion. Is this correct? I had a look at inflexion dictionaries, but I don't think I can use them here... What would be a way, in a grammar, to get a feminine inflexion in the output from a lemma in the input? I can provide more details if necessary.

 

Thank you for all your help,

Kindest regards,
Nadia
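
The last question is not answered in this thread. Purely as an illustration of the idea of going from a lemma to a chosen inflected form, and outside of any graph, the Python sketch below reads a small table of lines laid out as `inflected,lemma.CODE:features` and indexes it "backwards". The entries, the feature codes (`ms`, `fs`, `mp`, `fp`) and the function names are assumptions made for the example; whether something equivalent can be done inside a Unitex grammar is not established here.

```python
from collections import defaultdict

# Toy lines laid out as: inflected,lemma.CODE:features
# Entries and feature codes are assumptions for this example only.
TOY_DELAF = """\
administrativo,administrativo.A:ms
administrativa,administrativo.A:fs
administrativos,administrativo.A:mp
administrativas,administrativo.A:fp
"""

def build_generation_index(delaf_text):
    """Map lemma -> {feature: inflected form}, i.e. the dictionary read 'backwards'."""
    index = defaultdict(dict)
    for line in delaf_text.splitlines():
        if not line.strip():
            continue
        inflected, rest = line.split(",", 1)
        lemma, rest = rest.split(".", 1)
        _code, _, features = rest.partition(":")
        for feature in features.split(":"):
            # An empty lemma field is treated as "same as the inflected form".
            index[lemma or inflected][feature] = inflected
    return index

def inflect(index, lemma, feature):
    """Return the requested form, or None if the toy dictionary does not list it."""
    return index.get(lemma, {}).get(feature)

index = build_generation_index(TOY_DELAF)
print(inflect(index, "administrativo", "fs"))
```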