How to match de-accented words (currently "unknown")?

Nadia Ivanova

Feb 2, 2017, 2:42:41 AM

to Unitex-GramLab

Hello everybody,

My Spanish corpus has many de-accented words that are currently not recognised by my graphs and show up in the list of unknown words generated after applying lexical resources.

Do you know a way to match de-accented words when the accented version is in the dictionary?

Any ideas welcome.

Thank you very much,

Kind regards,

Nadia

Noureddine Doumi

Feb 2, 2017, 5:51:44 AM

to Unitex-GramLab

Hello,

The same problem arises in Arabic. Vocalization marks are optional in written Arabic, so writers often omit them, and the result is a corpus containing partially vocalized or non-vocalized words. However, the Arabic DELA entries are fully vocalized, so when such a corpus is processed with the dico program, most of its words end up in the list of unknown words.
In my opinion, the dico program should be modified so that it matches not only character strings that are exactly equal (between the DELA entry and the corpus word) but also partially accented or vocalized words, as in Spanish or Arabic.
For Arabic, I developed and successfully tested an algorithm that does this, but I couldn't integrate it into the source code of the dico program.

Best regards,
Noureddine

Oto Vale

Feb 2, 2017, 7:37:57 AM

to Nadia Ivanova, Unitex-GramLab

Hello Nadia,

You can build a dictionary of de-accented forms from the accented forms selected in the general dictionary, such as:

edicion,edición.N:fs

telefono,teléfono.N:ms

telefonos,teléfono.N:mp

It would be better to add a tag to identify those forms:

edicion,edición.N+DEACC:fs

telefono,teléfono.N+DEACC:ms

telefonos,teléfono.N+DEACC:mp
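A minimal sketch of how such entries could be generated from existing DELAF lines, in Python. It is not part of Unitex: the function name `deaccented_entry`, the accent table and the default `DEACC` tag are illustrative assumptions. It assumes plain `form,lemma.CODES:inflection` lines with no escaped commas or dots and no trailing comments, and it places the tag before the first colon, where DELAF semantic codes normally go. Only the acute vowels and ü are mapped; ñ is left alone because it is a distinct letter.

```python
# Hypothetical helper, not part of Unitex.
# Maps accented Spanish vowels (and ü) to their plain counterparts.
ACCENT_MAP = str.maketrans("áéíóúüÁÉÍÓÚÜ", "aeiouuAEIOUU")


def deaccented_entry(line, tag="DEACC"):
    """Return a de-accented copy of a DELAF line, or None if nothing changes."""
    form, comma, rest = line.strip().partition(",")
    lemma, dot, codes = rest.partition(".")
    if not comma or not dot:
        return None  # not a "form,lemma.CODES" line
    plain = form.translate(ACCENT_MAP)
    if plain == form:
        return None  # no accent to remove
    if lemma == "":
        lemma = form  # an empty lemma means "same as the inflected form"
    # Insert the tag between the grammatical code and the inflectional codes.
    gram, colon, inflection = codes.partition(":")
    return f"{plain},{lemma}.{gram}+{tag}{colon}{inflection}"


# Sample lines (assumed input); "casa" has no accent and is skipped.
entries = ["edición,edición.N:fs", "teléfonos,teléfono.N:mp", "casa,casa.N:fs"]
for entry in entries:
    new_line = deaccented_entry(entry)
    if new_line is not None:
        print(new_line)
```

The generated lines would go into a separate dictionary file, so the original accented dictionary stays untouched.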

[]s

Oto Araujo Vale

Professor Associado

Universidade Federal de São Carlos

Rodovia Washington Luís, km 235 - SP-310

São Carlos - São Paulo - Brasil

CEP 13565-905

--
You received this message because you are subscribed to the Google Groups "Unitex-GramLab" group.
To unsubscribe from this group and stop receiving emails from it, send an email to unitex-gramlab+unsubscribe@googlegroups.com.
To post to this group, send email to unitex-gramlab@googlegroups.com.
Visit this group at https://groups.google.com/group/unitex-gramlab.
To view this discussion on the web visit https://groups.google.com/d/msgid/unitex-gramlab/ea5490e4-763e-453c-abd9-67e98ee7b062%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

eric.laporte

Feb 2, 2017, 10:36:11 AM

to Unitex-GramLab, nadia....@jobseeker.com.au

Hello Nadia,

You can also modify the Alphabet.txt file in the Spanish directory in your workspace. By adding a line EÉ, you allow the system to match an E in the text with an É in the dictionary, and so on with eé etc. (manual, section 14.2.1).
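The full set of lines for Spanish can simply be typed by hand. As an illustration, this Python sketch prints one candidate line per accented letter, unaccented character first as in the EÉ example above. The helper names and the letter list are assumptions, and the output still has to be pasted into Alphabet.txt in whatever encoding that file uses.

```python
import unicodedata

# Assumed list of accented letters to cover; adjust to the corpus.
SPANISH_ACCENTED = "ÁÉÍÓÚÜáéíóúü"


def strip_accents(ch):
    """Remove combining marks from a character (É -> E, ü -> u)."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def alphabet_lines(accented_letters):
    """Build two-character lines: unaccented letter, then accented letter."""
    lines = []
    for ch in accented_letters:
        base = strip_accents(ch)
        if base != ch:
            lines.append(base + ch)
    return lines


for line in alphabet_lines(SPANISH_ACCENTED):
    print(line)
```

ñ is deliberately left out of the list, since n and ñ distinguish different words in Spanish.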

Best,

Eric

Alexis Neme

Feb 2, 2017, 4:39:40 PM

to Unitex-GramLab

Dear Noureddine,

Since November 2010, the lookup in the dico program for Semitic languages has recognized partially diacriticized words. It is not documented in the manual, but it is described in my 2011 paper, section 4.2:

"Consequently, processing written Arabic text should take into account undiacriticized and partially diacriticized text. A lookup procedure in Unitex5 has been adjusted to deal with omission of diacritics in Arabic. This procedure finds in the diacriticized full-form dictionary all possible diacriticized candidate forms compatible with a given undiacriticized or partially diacriticized form."

The file Arabic_typo_rules.txt in the Unitex/Arabic directory spells out such rules, which are taken into account by the lookup procedure.
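To make the idea of "compatible candidate forms" concrete, here is a small Python sketch. It is not the Unitex implementation and does not read Arabic_typo_rules.txt; it only models diacritic omission (no letter substitutions), and the function names and the diacritic range U+064B to U+0652 are assumptions chosen for illustration. A dictionary form is treated as compatible when the text form can be obtained from it by deleting diacritics only.

```python
# Assumed set of Arabic diacritics: fathatan (U+064B) through sukun (U+0652).
DIACRITICS = {chr(code) for code in range(0x064B, 0x0653)}


def is_compatible(partial, full):
    """True if `partial` equals `full` with zero or more diacritics removed."""
    i = 0
    for ch in full:
        if i < len(partial) and partial[i] == ch:
            i += 1  # character present in both forms
        elif ch in DIACRITICS:
            continue  # diacritic omitted in the text form
        else:
            return False  # a base letter differs or is missing
    return i == len(partial)


def candidates(partial, dictionary_forms):
    """Return every fully diacriticized form compatible with `partial`."""
    return [form for form in dictionary_forms if is_compatible(partial, form)]


# Two vocalizations of the same skeleton k-t-b (illustrative data).
KATABA = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kaf+fatha, ta+fatha, ba+fatha
KUTIBA = "\u0643\u064F\u062A\u0650\u0628\u064E"  # kaf+damma, ta+kasra, ba+fatha

bare = "\u0643\u062A\u0628"            # no diacritics at all
partly = "\u0643\u064F\u062A\u0628"    # damma on the first letter only

print(len(candidates(bare, [KATABA, KUTIBA])))
print(len(candidates(partly, [KATABA, KUTIBA])))
```

With the bare skeleton both vocalized forms are candidates; once the text form carries a damma on the first letter, only the second one remains.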

You are right, we need to document these features for Arabic in the manual (we will do it within the next two months).

Meanwhile, let me know if you have any questions.

Cheers,

Alexis

Nadia Ivanova

Feb 2, 2017, 10:56:24 PM

to Unitex-GramLab, nadia....@jobseeker.com.au

Thank you very much, Oto,

I thought I could not edit a DELA but somebody told me I can de-compile it so I might try this solution and would definitely use it to add new entries or grammatical information.

Modifying the Alphabet.txt file (below) looks like an easier solution for this particular issue, though.

Thank you again for taking time to help me.

Cheers,

Nadia

Nadia Ivanova

Feb 2, 2017, 10:58:06 PM

to Unitex-GramLab, nadia....@jobseeker.com.au

Many thanks, Eric,

I think this is the easiest way to fix my issue!

I will try and apply it.

Cheers,

Nadia

Alexis Neme

Feb 3, 2017, 4:48:10 PM

to denis....@univ-tours.fr, unitex-...@googlegroups.com

Hi Denis,

>>> Is this module specific to semitic languages or it would be possible to extent it to other ones?

- Actually, it is called "Semitic languages", but in practice it covers only Arabic; Arabic_typo_rules.txt is the configuration file holding the default rules for diacritic omissions and letter substitutions.

The dictionary must also be compressed in -semitic mode.

The default compression is concatenative, but -semitic compression takes into account infixation for inflection, diacritic (i.e. letter) omissions and letter substitutions, as defined in the configuration files. (A file compressed in this mode is at least 3 times bigger than one compressed in concatenative mode.)

The lookup mode used with such a compressed file is called semitic, but it can be applied to any language.

- If you want to apply it to Hebrew or Syriac, you should create another configuration file for that language, check the implementation of the lookup procedure (dico.cpp) and make the appropriate modifications. It should be easy to adjust the dico.cpp code.

- It seems worth extending it to other languages that admit infixation (for inflection) and some typographical rules, such as Indonesian or Tagalog, at least for the inflectional part of these Austronesian languages, but I am not sure.

- This lookup procedure is NOT an adequate way to handle agglutinative languages such as Turkish, ...; it is better to use the morphological mode, I guess.

Hope this will help,

Alexis

Bests

----------

Alexis Neme

Computer Scientist - Arabic NLP
FR-PT-EN-AR (DE, Tagalog)

http://tasrif.univ-mlv.fr/About.html

UPEM - LIGM - Laboratoire d'Informatique Gaspard-Monge
Bureau 4B045, 5 Bd Descartes, Champs-sur-Marne
77454 Marne-la-Vallée Cedex 2, France

Tél : 00 33 1 60 95 77 17

On Fri, Feb 3, 2017 at 10:17 PM, Denis Maurel <mau...@univ-tours.fr> wrote:

Dear Alexis

Is this module specific to Semitic languages, or would it be possible to extend it to other ones?
Thanks

Best regards,

Denis Maurel

____________________________________
Professor Denis Maurel
Université François Rabelais Tours
LI (Computer Science Research Laboratory)
EPU-DI
64 avenue Jean-Portalis
37200 Tours
France
Phone: 33-2.47.36.14.35
Fax: 33-2.47.36.14.22
mailto:denis.maurel@univ-tours.fr

http://www.univ-tours.fr/maurel

http://www.li.univ-tours.fr
http://tln.li.univ-tours.fr/


Nadia Ivanova

Feb 8, 2017, 1:20:31 AM

to Unitex-GramLab, nadia....@jobseeker.com.au

Hello again,

Just an update to say I applied the solution you suggested, Eric (updating Alphabet.txt in my workspace) and it worked perfectly.

It addresses my issue better than generating a de-accented version of the dictionary, as I have an output variable which I want to keep accented, as a canonical version (I'm using LEMMA or INFLECTED in the output so not sure what would happen if I had both accented and de-accented versions of the same word).

Thank you again,

Kind regards,

Nadia
