How to get both M and F inflexions in output from "Vendedor/a"

Nadia Ivanova

Feb 10, 2017, 1:29:28 AM
to Unitex-GramLab
Hello everybody,

In my Spanish corpus, I have many lines looking like this:
Vendedor/a
INSTALADOR/A ELECTRICO
Encargado/a de Administracion
Peluqueros/as
etc.

My goal is to output both the masculine and the feminine singular inflexions, so I would have two outputs in Replace mode for each of them, like this:

Vendedor/a > vendedor
Vendedor/a > vendedora
INSTALADOR/A ELECTRICO > instalador eléctrico
INSTALADOR/A ELECTRICO > instaladora eléctrica (NB. Adjective in feminine as well)
Encargado/a de Administracion > encargado de administración
Encargado/a de Administracion > encargada de administración
Peluqueros/as > peluquero
Peluqueros/as > peluquera
 
I can't predict the noun coming first but the pattern is Noun/a.

Is it possible? If yes, how can I specify I want a feminine inflexion in output, given I don't really have it in input?
Currently, I'm getting only the masculine inflexion with an output variable.

Looking forward to your answers,

Thank you very much in advance,
Kind regards,
Nadia

P.S. I thought I already asked this question here but can't find it so I presume I didn't, just thought very hard about it :-)

Alexis Neme

Feb 10, 2017, 8:13:27 PM
to Unitex-GramLab
Dear Nadia,
 
Unitex is appropriate for tagging a corpus with lexical resources, not for string processing. Preprocessing with "Replace" is not appropriate for the task you describe either. A clean solution is to use find/replace with regular expressions (regex), for instance with Unix utilities such as grep/sed.


You describe the inflexion of job names for Spanish, which is not simple. It will be simpler to use a scripting language such as Python or Perl with its built-in regex library. This will give you more readability, flexibility and maintainability than the Unix utilities (grep/sed).
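A minimal sketch of the regex idea in Python, assuming the pattern is always a word followed by "/a" or "/as". The names SLASH_FORM and split_slash_form are made up for illustration and are not taken from the attached script:

```python
import re

# Hypothetical pattern: a word, a slash, then a short gender suffix ("a" or "as").
SLASH_FORM = re.compile(r"\b(\w+)/(as?)\b", re.IGNORECASE)


def split_slash_form(line):
    """Return (stem, suffix) for the first 'Noun/a' token in line, or None."""
    match = SLASH_FORM.search(line)
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    for sample in ["Vendedor/a", "INSTALADOR/A ELECTRICO", "Peluqueros/as", "Gerente"]:
        print(sample, "->", split_slash_form(sample))
```

The two captured groups are the raw material for any later step that builds the masculine and feminine forms.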

I don't know if you want to go in that direction, but take a look at my Python script (see attachment) to get an idea of what to expect from such a scripting language, including in terms of developer time. The script is clear and simple (for a Python programmer, at least; you can read it). Such a script can preprocess many files, or whatever else you need.

Let me know if you have further questions.

Cheers,
Alex 
    
  
================================
Here is the output of my Python script for the 4 examples you gave.
Of course, it also works for similar jobs like "Doutor/a; Engenheiro/a electrico":

"<<<" marks an input line
">>>" marks an output line
wc: word count
------OUTPUT -------------------------------
<<< line, wc: vendedor/a   1
>>> vendedor
>>> vendedor a
------------------
<<< line, wc:      instalador/a electrico   2
>>>      instalador electrico
>>>      instalador a electrica
------------------
<<< line, wc:      encargado/a de administracion   3
>>>      encargado  de  administracion  
>>>      encargad a  de  administracion  
------------------
<<< line, wc:      peluqueros/as   1
>>>      peluqueros
>>>      peluquer as

finalScript.py

Nadia Ivanova

Feb 12, 2017, 11:31:40 PM
to Unitex-GramLab
Dear Alexis,
Thank you very much for taking time to answer me and to write a Python script.

To give a bit of context, we are running a POC with Unitex (but also with some machine learning tools) to find out what each tool can and can't do for our job title normalisation task. And you are right, it's not a simple task!

We are looking for a solution at scale, as we have hundreds of thousands of raw job titles in a flow. I don't know Python, but my understanding is that the transformations your script is doing are very specific.
I'm rather looking at high-level patterns which we could apply to the whole corpus.

After checking grep/sed, I think it could be a solution for the first noun: append an "-a" when the word ends with a consonant, or replace an "-o" with an "-a". But it would probably be a bit more tricky to make the adjective agree when there is one, as in the example INSTALADOR/A ELECTRICO > instaladora eléctrica.
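The rules described above could be sketched in Python roughly as follows. This is only an illustration under several assumptions: the "Noun/a" form is the first word of the title, any word directly after it that ends in "-o" is an agreeing adjective, and plurals are reduced only for "-os" and "-ores". All function names are hypothetical. Restoring accents (electrico > eléctrico) is not attempted, since that would need a dictionary.

```python
import re

# Hypothetical pattern for a whole token such as "vendedor/a" or "peluqueros/as".
SLASH_FORM = re.compile(r"^(\w+)/(as?)$", re.IGNORECASE)


def singular(word, suffix):
    """Reduce a plural noun to singular when the slash suffix is plural ('as')."""
    if suffix != "as":
        return word
    if word.endswith("ores"):   # vendedores -> vendedor
        return word[:-2]
    if word.endswith("os"):     # peluqueros -> peluquero
        return word[:-1]
    return word                 # other plurals are left untouched


def feminine(word):
    """Naive rule: final -o becomes -a, a final consonant gets -a appended."""
    if word.endswith("o"):
        return word[:-1] + "a"
    if word[-1] not in "aeiou":
        return word + "a"
    return word                 # e.g. a noun ending in -e is left unchanged


def expand(title):
    """Return [masculine, feminine] for a title whose first word is 'Noun/a'."""
    words = title.lower().split()
    match = SLASH_FORM.match(words[0]) if words else None
    if match is None:
        return [" ".join(words)]
    noun = singular(match.group(1), match.group(2))
    rest = words[1:]
    fem_rest = list(rest)
    # Heuristic (an assumption): words right after the noun that end in -o agree with it.
    for i, word in enumerate(rest):
        if not word.endswith("o"):
            break
        fem_rest[i] = word[:-1] + "a"
    return [" ".join([noun] + rest), " ".join([feminine(noun)] + fem_rest)]


if __name__ == "__main__":
    for title in ["Vendedor/a", "INSTALADOR/A ELECTRICO",
                  "Encargado/a de Administracion", "Peluqueros/as"]:
        print(title, ">", expand(title))
```

The adjective loop stops at the first word that does not end in "-o", which is why "de administracion" is left alone. Titles with a noun ending in "-o" right after the head noun would be wrongly inflected by this heuristic.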

Anyway, thanks again.

Cheers,
Nadia

Anubhav Gupta

Feb 16, 2017, 1:29:47 AM
to Nadia Ivanova, Unitex-GramLab
Hi Nadia,

Unitex cannot output line breaks (new lines). 
However, with the help of morphological mode and CasSys, it is possible to get the output on a single line.
I have attached the solution; it is similar to the one provided in Python.
If you have a Spanish DELAS dictionary, then the solution (graph) can be straightforward (see the morphological mode dictionaries section in the manual).

Regards,
Anubhav

--
You received this message because you are subscribed to the Google Groups "Unitex-GramLab" group.
To unsubscribe from this group and stop receiving emails from it, send an email to unitex-gramlab+unsubscribe@googlegroups.com.
To post to this group, send email to unitex-gramlab@googlegroups.com.
Visit this group at https://groups.google.com/group/unitex-gramlab.
To view this discussion on the web visit https://groups.google.com/d/msgid/unitex-gramlab/162774e0-ad50-4d8f-bf8e-b9e0735b1682%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.



Archive.zip

Nadia Ivanova

Feb 20, 2017, 4:55:12 AM
to Unitex-GramLab, nadia....@jobseeker.com.au
Dear Anubhav,
Thank you very much for your solution.

I could open your files in Unitex and I understand what they are doing.

I will play with it and try to figure out how I can combine it with my main graph which is doing other transformations.
My real-life corpus does not contain only this kind of case with '-o/a', so I need to find a way to select only those and apply your solution only to them.
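Outside Unitex, one way to make that selection would be a small pre-filter that separates the lines containing a "/a" or "/as" form from the rest. This is a Python sketch, not part of the attached graphs, and HAS_SLASH_FORM and partition are hypothetical names:

```python
import re

# Hypothetical test: some word immediately followed by "/a" or "/as".
HAS_SLASH_FORM = re.compile(r"\w+/as?\b", re.IGNORECASE)


def partition(lines):
    """Split lines into (with_slash_form, others), preserving input order."""
    matched, others = [], []
    for line in lines:
        if HAS_SLASH_FORM.search(line):
            matched.append(line)
        else:
            others.append(line)
    return matched, others


if __name__ == "__main__":
    corpus = ["Vendedor/a", "Gerente", "Peluqueros/as", "Soporte 24/7"]
    selected, remaining = partition(corpus)
    print(selected)
    print(remaining)
```

Only the first list would then be passed to the gender-specific processing; the second list would go through the other transformations unchanged.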

Thank you again!

Kind regards,
Nadia
