UTF-8


Denis Maurel

Nov 10, 2015, 11:23:13 AM
to Unitex-GramLab


Dear All,

Currently the default character encoding is UTF-16. To make things easier for linguists, I suggest making UTF-8 the default character encoding.
Do you agree with this minor interface change?

Best regards,

Denis Maurel


____________________________________
Professor Denis Maurel
Université François Rabelais Tours
LI (Computer Science Research Laboratory)
EPU-DI
64 avenue Jean-Portalis
37200 Tours
France
Phone: 33-2.47.36.14.35
Fax: 33-2.47.36.14.22
mailto:denis....@univ-tours.fr

http://www.univ-tours.fr/maurel

http://www.li.univ-tours.fr
http://tln.li.univ-tours.fr/


eric.laporte

Nov 12, 2015, 4:17:46 AM
to Unitex-GramLab, denis....@univ-tours.fr
Dear all,
I agree with Denis: UTF-8 is more standard and more compact for some languages. Thanks for pointing this out.
Eric

Gilles Vollant

Nov 12, 2015, 4:54:47 AM
to eric.laporte, Unitex-GramLab, denis....@univ-tours.fr

Just curious: do you think we should use UTF-8 with or without a BOM at the beginning of the file?

 

https://en.wikipedia.org/wiki/Byte_order_mark

 

http://www.prelude.me/index.php/2011/01/15/utf-8-avec-ou-sans-bom/

 

When I asked S. Paumier (a long time ago) about UTF-8 in the Java part, he told me that UTF-16 is better for memory-mapped files when the editor opens a very, very big file (because in UTF-16 we know that character N is at byte position N*2, whereas in UTF-8 we must read the file).

 

But perhaps this is not used now.
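
To illustrate the offset argument, here is a minimal Python sketch. It is not Unitex code: the sample string and the helper names `char_at_utf16` and `char_at_utf8` are made up for illustration, and the UTF-16 helper assumes every character fits in 2 bytes.

```python
text = "déjà vu"  # sample string; all characters fit in 2 bytes in UTF-16

utf16 = text.encode("utf-16-le")  # little-endian, no BOM
utf8 = text.encode("utf-8")       # 1 to 4 bytes per character


def char_at_utf16(data: bytes, n: int) -> str:
    """Direct access: character n starts at byte offset n * 2.

    Assumption: no character in the data needs more than 2 bytes.
    """
    return data[n * 2 : n * 2 + 2].decode("utf-16-le")


def char_at_utf8(data: bytes, n: int) -> str:
    """Sequential access: scan from the start, counting character starts."""
    index = -1
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # not a continuation byte: a character starts here
            index += 1
            if index == n:
                end = pos + 1
                # Extend over the continuation bytes of this character.
                while end < len(data) and data[end] & 0xC0 == 0x80:
                    end += 1
                return data[pos:end].decode("utf-8")
    raise IndexError(n)


print(char_at_utf16(utf16, 3), char_at_utf8(utf8, 3))
print(len(utf16), len(utf8))
```

Both helpers return the same character, but the UTF-8 one has to read every byte before it, which is the cost S. Paumier was pointing at for very large files.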

 

Regards

Gilles

 


Denis Maurel

Nov 12, 2015, 6:46:36 AM
to Gilles Vollant, eric.laporte, Unitex-GramLab
Dear Gilles,

We prefer to use UTF-8 without a BOM. We had some problems with files that have a BOM. I am not sure whether the "beginning of the text" mask can be used with them?
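
One plausible source of such problems, shown here as a minimal Python sketch: a UTF-8 BOM that is not stripped decodes to an invisible U+FEFF character at the start of the text, so the first visible character is no longer at position 0. This is an assumption about the cause, not confirmed Unitex behavior, and `strip_utf8_bom` is a hypothetical helper.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


without_bom = "texte".encode("utf-8")
with_bom = codecs.BOM_UTF8 + without_bom

# The plain "utf-8" codec keeps the BOM as a U+FEFF character.
leaked = with_bom.decode("utf-8")

# The "utf-8-sig" codec drops a leading BOM and accepts files without one.
clean = with_bom.decode("utf-8-sig")

print(repr(leaked[0]), repr(clean))
```

Reading with `utf-8-sig` and writing with plain `utf-8` is one way to accept both kinds of file while producing only BOM-less output.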

The link between character position and byte position holds only if one approximates UTF-16 as 2 bytes per character. But some characters are coded with 4 bytes.
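
A short Python sketch of this point, using an arbitrarily chosen sample character (U+1D11E, MUSICAL SYMBOL G CLEF) that lies outside the 2-byte range; the variable names are illustrative only:

```python
bmp_char = "\u00e9"         # U+00E9, coded with 2 bytes in UTF-16
astral_char = "\U0001D11E"  # U+1D11E, coded with 4 bytes (a surrogate pair)

bmp_size = len(bmp_char.encode("utf-16-le"))
astral_size = len(astral_char.encode("utf-16-le"))

# Three characters: "a", the 4-byte character, then "b" at character index 2.
encoded = ("a" + astral_char + "b").encode("utf-16-le")

# The N*2 rule predicts that "b" starts at byte 4, but it really starts at byte 6.
predicted = encoded[2 * 2 : 2 * 2 + 2]
actual = encoded[6:8]

print(bmp_size, astral_size, actual.decode("utf-16-le"))
```

After the first 4-byte character, every N*2 offset is off by 2 bytes, so direct access is exact only for text that stays within the 2-byte range.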

Best regards,

Denis Maurel


Anubhav Gupta

Nov 12, 2015, 9:00:45 AM
to Unitex-GramLab
Hi,

UTF-8 will be the default encoding in the UI with r4149.

Regards,
Anubhav Gupta