encoding UTF-8 versus Latin1


Conway, Mike

Mar 25, 2013, 8:12:18 AM
to <irod-chat@googlegroups.com>
Tony Edgin (from iPlant) and I have been bouncing back and forth on some vexing encoding issues they've been experiencing.  BTW Tony has been a very valuable contributor of some recent Jargon fixes, and iPlant has generally been a great collaborator!

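That being said, we've been working on bug [#1211] Re: [iROD-Chat:9536] jargon mangling UTF-8 characters

https://code.renci.org/gf/project/jargon/tracker/?action=TrackerItemEdit&tracker_id=31&tracker_item_id=1211

Tony had some weirdness with file names entered via iCommands and accessed via Jargon: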

One of our users has a file named Ù7Ï9AQV²Ö\x12×\u0084|Ñ^3Þ\x05.  (Don't
ask me why.)  When I use .getDataObjectsUnderPath to get the contents of the
parent directory, the file comes across as ?7?9AQV???|?^3?, where each ? is
the character 65533 (U+FFFD) in UTF-16.  ils displays the file name correctly, so
I assume the encoding error is occurring in Jargon.
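
A minimal sketch of the symptom described above, written in Python rather than Jargon's Java. It assumes the server sends one Latin-1 byte per character and the client decodes those bytes as UTF-8. The name used is a shortened, hypothetical stand-in for the real file name, and `errors="replace"` is chosen to mimic the replacement behaviour reported in the thread.

```python
# Shortened, hypothetical stand-in for the reported file name.
name = "Ù7Ï9AQV"

# One byte per character, as a Latin-1 (ISO-8859-1) sender would emit them.
wire_bytes = name.encode("iso-8859-1")

# A client that assumes UTF-8 cannot interpret 0xD9 or 0xCF on their own,
# so each one is replaced with U+FFFD REPLACEMENT CHARACTER (65533).
mangled = wire_bytes.decode("utf-8", errors="replace")

print([hex(b) for b in wire_bytes])
print([ord(ch) for ch in mangled])
```

The ASCII characters survive because they are encoded identically in both character sets; only the bytes above 0x7F are lost.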

The insight Tony had on Friday was:

I looked some more into this.  From iROD-Chat:9519, I gathered that by default iRODS uses the LATIN1 character set, AKA ISO-8859-1.  I changed the Jargon to use ISO-8859-1, and it can now list the file correctly as Ù7Ï9AQV²Öׄ|Ñ^3Þ.

I'm going to recommend that we recompile our iRODS to use UTF-8 as its character set, but you might want to change Jargon to use ISO-8859-1 by default.  
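
The effect of Tony's change can be sketched the same way: the bytes are untouched and only the character set used to decode them differs. This is an illustration in Python under the assumption that the bytes really are Latin-1; it is not the Jargon patch itself, and the byte values are a hypothetical shortened name.

```python
# Bytes of a hypothetical Latin-1 encoded name, as they would arrive on the wire.
wire_bytes = bytes([0xD9, 0x37, 0xCF, 0x39, 0x41, 0x51, 0x56])

# Decoding with ISO-8859-1 maps every byte to the code point of the same value,
# so the accented characters come back intact.
listed = wire_bytes.decode("iso-8859-1")
print(listed)

# ISO-8859-1 accepts every possible byte value, so decoding can never fail
# and re-encoding gives back exactly the original bytes.
every_byte = bytes(range(256))
round_trip = every_byte.decode("iso-8859-1").encode("iso-8859-1")
```

Because ISO-8859-1 never rejects input, a client using it will not raise an error if the server actually sends something else; it will simply show different characters.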


Wayne indicated:

I also remember working with the encoding and finding that LATIN1 worked best for something or other for the core.  I think there was an issue with the ODBC interface to Postgres.  But with that, we got the i-commands, the server and postgres reasonably happy and I think it solved something for someone.  


So in summary, there's currently a mismatch: Jargon (and probably most clients, and default web/RDF) works in UTF-*, while iRODS by default uses Latin1.  Options:

  1. finish making the char set configurable via a jargon.properties setting (that needs just a bit more effort).  Default to UTF-8 or Latin1 and allow users to override
  2. do a potential bit of core server work to pass the iRODS char set in the miscserverinfo response, and then either
    1. set the jargon char set to same
    2. not reset it but give warnings in the log
  3. just leave at UTF-8
  4. set the jargon default to latin1
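
One implication worth weighing between options 3 and 4 is what happens when the client's default is the wrong guess. The sketch below is Python, not Jargon code; the helper `decode_with` and the sample name are hypothetical. It compares the two directions of mismatch.

```python
def decode_with(raw: bytes, charset: str) -> str:
    """Decode raw bytes, substituting U+FFFD for anything undecodable."""
    return raw.decode(charset, errors="replace")

name = "Ù7Ï9"  # hypothetical sample name
latin1_bytes = name.encode("iso-8859-1")
utf8_bytes = name.encode("utf-8")

# Client assumes UTF-8, sender used Latin-1: the original bytes are
# replaced by U+FFFD and cannot be recovered from the decoded string.
lossy = decode_with(latin1_bytes, "utf-8")

# Client assumes Latin-1, sender used UTF-8: the text looks wrong
# (two characters per accented letter) but no information is lost.
mojibake = decode_with(utf8_bytes, "iso-8859-1")
repaired = mojibake.encode("iso-8859-1").decode("utf-8")
```

Under these assumptions a wrong Latin-1 guess is reversible after the fact, while a wrong UTF-8 guess with replacement is not.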


thoughts? implications?

MC





Mike Conway
DICE Center
Jargon, Java, Interface Developer

------------------------------------------------

Google voice/video: Michael....@gmail.com

Skype: michael.c.conway





Reagan Moore

Mar 25, 2013, 8:32:28 AM
to irod...@googlegroups.com
Mike:
I am curious what Academia Sinica did to display Chinese characters?

Reagan Moore

--
--
"iRODS: the Integrated Rule-Oriented Data-management System; A community driven, open source, data grid software solution" https://www.irods.org
 
iROD-Chat: http://groups.google.com/group/iROD-Chat
 
---
You received this message because you are subscribed to the Google Groups "iROD-Chat" group.
To unsubscribe from this group and stop receiving emails from it, send an email to irod-chat+...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Jean-Yves Nief

Mar 25, 2013, 8:34:29 AM
to irod...@googlegroups.com
Reagan Moore wrote:
> Mike:
> I am curious what Academia Sinica did to display chinese characters?
for the icommands, I think they have extended them with encoding conversion.
cheers,
JY
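
For readers unfamiliar with the term, "encoding conversion" here means transcoding bytes from one character set to another before they are stored or displayed. The sketch below shows the general idea in Python. The function name `to_utf8` and the choice of Big5 as the source encoding are assumptions for illustration only; the thread does not say which encodings Academia Sinica actually handled or how.

```python
def to_utf8(raw: bytes, source_charset: str) -> bytes:
    """Transcode raw bytes from source_charset to UTF-8.

    Decoding is strict, so bytes that are invalid in the source
    character set raise UnicodeDecodeError instead of being replaced.
    """
    return raw.decode(source_charset, errors="strict").encode("utf-8")

name = "中文.txt"                  # hypothetical file name
big5_bytes = name.encode("big5")   # as a Big5 terminal might supply it
utf8_bytes = to_utf8(big5_bytes, "big5")
```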

Wayne Schroeder

Mar 25, 2013, 11:41:33 AM
to iROD-Chat
I checked into this more, and it seems that finishSetup.pl has been
setting the encoding to Latin1 for a long time. The comment is:
    # Set the encoding to be LATIN1. This avoids a problem
    # in the current Postgres ODBC code that doesn't handle
    # non-latin encodings yet.
    $sql = "alter user $DATABASE_ADMIN_NAME set client_encoding to LATIN1;";
This is in the section for Postgres and has been in there since at
least 2.2 (I didn't check previous to that).

It's entirely possible that that's no longer necessary.
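
A practical consequence of a LATIN1 client encoding is that text with no Latin-1 equivalent cannot be delivered to the client in that encoding. The Python sketch below uses Python's own codec as a stand-in for that conversion step; it is not PostgreSQL or ODBC code, and `representable` is a hypothetical helper.

```python
def representable(text: str, charset: str) -> bool:
    """Return True if every character of text can be encoded in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

print(representable("Ù7Ï9", "iso-8859-1"))  # accented Latin letters fit
print(representable("中文", "iso-8859-1"))   # Chinese characters do not
print(representable("中文", "utf-8"))        # UTF-8 covers all of Unicode
```

This is one reason the question about Chinese characters and the question about the default character set are closely linked.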

There's some other setting that is used for MySQL that I think I was
thinking of before, but I don't think that's encoding.

If it helps, I'd be happy to extend the miscsvrinfo information to
include the encoding. The tricky part would be to get the setting into
the Agent, as it's just something set at the DBMS level now.

- Wayne -



Conway, Mike

Mar 25, 2013, 1:01:48 PM
to <irod-chat@googlegroups.com>
Thanks Wayne, that's sort of the 'introspection and discovery' aspect of iRODS playing out, that is, facilities so clients may discover how iRODS is configured, and what services might be available on that particular resource…

Does it make sense for clients to detect and attempt to use the same char set as the grid they are talking to?  I'm thinking about having the underlying iRODS connections set themselves to the char set that iRODS says it is supporting, and am curious as to what folks who have to support other languages think. All too often we run into char set issues in international use and tend not to focus enough on them, assuming it's all covered. I'm no different with the Jargon stuff!

MC








Jean-Yves Nief

Mar 25, 2013, 4:37:01 PM
to irod...@googlegroups.com
hello Mike,

Conway, Mike wrote:
> thanks wayne, that's sort of the 'introspection and discovery' aspect
> of iRODS playing out, that is, facilities so clients may discover how
> iRODS is configured, and what services might be available on that
> particular resource…
>
> Does it make sense for clients to detect and attempt to use the same
> char set as the grid they are talking to? I'm thinking about having
> iRODS connections underneath set themselves to the set that iRODS says
> it is supporting,
in my opinion, it does. It could be a very nice feature. However, it can
be defeated easily by configuration inconsistencies made by the iRODS
admins: all the iRODS servers and the database underneath must advertise
(and use!) the very same encoding.
cheers,
JY
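
The consistency concern above could be checked mechanically if each component reported its encoding. The Python sketch below is one way to compare such reports. The component names and the idea of a per-component report are assumptions for illustration, since nothing in iRODS exposes this at the time of the thread. Names are normalised first because the same character set goes by several aliases (LATIN1, ISO-8859-1, latin-1).

```python
import codecs

def canonical(charset: str) -> str:
    """Map an encoding alias to Python's canonical codec name."""
    return codecs.lookup(charset).name

def find_mismatches(advertised: dict[str, str]) -> dict[str, str]:
    """Return components whose encoding differs from the first one listed."""
    names = {component: canonical(cs) for component, cs in advertised.items()}
    if not names:
        return {}
    reference = next(iter(names.values()))
    return {c: n for c, n in names.items() if n != reference}

# Hypothetical reports from a catalog database and two servers.
consistent = find_mismatches(
    {"icat_db": "LATIN1", "server_a": "ISO-8859-1", "server_b": "latin-1"}
)
inconsistent = find_mismatches({"icat_db": "LATIN1", "server_a": "UTF-8"})
```

A check like this only catches what is advertised; it cannot tell whether a component actually uses the encoding it reports.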

Conway, Mike

Mar 26, 2013, 7:17:22 AM
to <irod-chat@googlegroups.com>
Good point, sounds like an item to consider in the add resource process (i.e. is there a validation step that would be required). This also seems to complicate cross-zone federations… I don't want to tug on that thread too much, as it's already a potential issue no matter what I do!

So I'd like to propose the following:

  • Can we add the char set to miscSvrInfo?  Jargon connections obtain this after logging in and cache it.
  • I will finish propagating the char set to jargon properties and leave it at UTF-8 for now, so as not to surprise anyone
  • I will add a property detect.char.set.on.server or something like this, and if that is set to true, it will see if it can auto-detect and set the char set based on the logged in resource/zone.  This can be false by default for now
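
The proposal above amounts to a small decision rule, sketched here in Python rather than Jargon's Java. Only the property name `detect.char.set.on.server` comes from the message, where it is itself tentative. The `encoding` property key, the `charset` field of the server info, and the function name are hypothetical.

```python
import codecs

DEFAULT_CHARSET = "UTF-8"  # stays the default so nobody is surprised

def resolve_charset(properties: dict, server_info: dict) -> str:
    """Pick the charset for a connection from client settings and server info."""
    configured = properties.get("encoding", DEFAULT_CHARSET)
    detect = properties.get("detect.char.set.on.server", "false").lower() == "true"
    advertised = server_info.get("charset")  # hypothetical miscSvrInfo field

    if detect and advertised:
        try:
            codecs.lookup(advertised)  # only accept names the client knows
            return advertised
        except LookupError:
            pass  # unknown name: fall back to the configured value
    return configured
```

With detection off, the server's value is ignored entirely, which matches the intent of leaving behaviour unchanged by default.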

MC



