Normalize unicode space


Anubhav Gupta

Oct 28, 2014, 11:24:50 AM
to unitex-...@googlegroups.com
Hi,

We are working on a project in which the documents contain various kinds of spaces, e.g. non-breaking space (U+00A0), figure space (U+2007), etc. Since these spaces differ from the regular space, Unitex treats each of them as a token, and as a result our graphs do not function properly.
One solution (suggested by Cristian Martinez) is to include these spaces in the Norm.txt file (attached).
We would like to ask the community to include the updated file in Unitex so that any text with Unicode spaces is normalized properly.

Regards
Anubhav GUPTA
Norm.txt
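
To illustrate the intended effect outside Unitex, here is a minimal Python sketch that maps every character in the Unicode general category Zs (space separators, which includes U+00A0 and U+2007) to a regular space. The function name `normalize_spaces` is hypothetical, and this is not the Unitex Normalize program; it only shows what "normalized properly" could mean for such text.

```python
import unicodedata

def normalize_spaces(text: str) -> str:
    """Replace every Unicode space separator (category Zs) with U+0020."""
    return "".join(
        " " if unicodedata.category(ch) == "Zs" else ch
        for ch in text
    )

# A non-breaking space and a figure space inside an amount.
sample = "10\u00a0000\u2007EUR"
print(normalize_spaces(sample))
```

Characters such as tab or newline are control characters (category Cc), not Zs, so this sketch leaves them untouched.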

eric.laporte

Oct 29, 2014, 12:19:35 PM
to unitex-...@googlegroups.com
Dear Anubhav,
Thanks for the interesting idea. The Norm.txt files are language-specific, so the one you uploaded is not necessarily appropriate for other languages. In addition, I don't know which language you use this file for, and it seems to have more lines than are needed for the special space characters. Could you provide another file containing only the lines for the special spaces, with each line consisting of 3 characters (the original character, a tab, and the substitution character)?
Thanks again,
Eric
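
The three-character line format requested above can be sketched in Python as follows. The list `SPECIAL_SPACES` and the helper `norm_lines` are hypothetical names chosen for illustration, and the selection of characters is only an example, not the content of the attached file.

```python
# Hypothetical subset of special spaces: non-breaking space,
# figure space, narrow no-break space. Extend as needed.
SPECIAL_SPACES = ["\u00a0", "\u2007", "\u202f"]

def norm_lines(chars, replacement=" "):
    """Return one 'original<TAB>replacement' line per character."""
    return [f"{ch}\t{replacement}" for ch in chars]

lines = norm_lines(SPECIAL_SPACES)
for line in lines:
    # Each line is 3 characters long: original, tab, substitute.
    print(repr(line), len(line))
```

Writing these lines to a file in the encoding expected by a given Unitex installation is deliberately left out of the sketch.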

Gilles Vollant

Oct 29, 2014, 12:21:18 PM
to eric.laporte, unitex-...@googlegroups.com

We can upload it as alternative_norm.txt in one (or several) ling/<LANGUAGE> directories on SVN.

 

From: unitex-...@googlegroups.com [mailto:unitex-...@googlegroups.com] On behalf of eric.laporte
Sent: Wednesday, October 29, 2014 17:20
To: unitex-...@googlegroups.com
Subject: [Unitex-GramLab] Re: Normalize unicode space


Denis Maurel

Oct 29, 2014, 12:57:13 PM
to Gilles Vollant, eric.laporte, unitex-...@googlegroups.com


Hi Gilles and Eric,

There are a lot of spaces in Unicode. If these spaces are not present in a text (as is often the case for {} currently), there is no problem. An alternative file is not a solution. We can surely change normalize.txt for European languages...

Best regards,

Denis Maurel


____________________________________
Professor Denis Maurel
Université François Rabelais Tours
LI (Computer Science Research Laboratory)
EPU-DI
64 avenue Jean-Portalis
37200 Tours
France
Phone: 33-2.47.36.14.35
Fax: 33-2.47.36.14.22
mailto:denis....@univ-tours.fr

http://www.univ-tours.fr/maurel

http://www.li.univ-tours.fr
http://tln.li.univ-tours.fr/
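
To get a sense of how many space characters Unicode defines, the following Python sketch lists every code point in general category Zs, using the Unicode tables bundled with the running Python interpreter. The helper name `space_separators` is hypothetical, and the exact list depends on the Unicode version of that interpreter.

```python
import sys
import unicodedata

def space_separators():
    """Return all code points whose general category is Zs."""
    return [
        cp for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == "Zs"
    ]

for cp in space_separators():
    # Print the code point and its official character name.
    print(f"U+{cp:04X}  {unicodedata.name(chr(cp), '<unnamed>')}")
```

Such a list could serve as a starting point for deciding which characters a normalization rule should cover.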



eric.laporte

Sep 23, 2015, 11:15:54 AM
to Unitex-GramLab
Hi,
The problem has been solved by Anubhav Gupta by updating the code of the Normalize program (revision 4019, September 2015).
Eric Laporte
