Thanks for using the PRF.
> TAO VERSION: 1.4.10
You are using an ANCIENT version of TAO. Please upgrade to ACE+TAO+CIAO
x.7.9 (i.e., ACE 5.7.9, TAO 1.7.9, and CIAO 0.7.9), which you can
download from
http://download.dre.vanderbilt.edu
under the heading: "Latest Micro Release Kit."
The DOC group at Vanderbilt University only provides "best effort"
support for non-sponsors for the latest release, as described in
http://www.dre.vanderbilt.edu/~schmidt/DOC_ROOT/ACE/docs/ACE-bug-process.html
Thus, if you need more "predictable" help for earlier versions of
ACE+TAO, I recommend that you check out
http://www.dre.vanderbilt.edu/support.html
for a list of companies that will provide you with ACE+TAO commercial
support.
Thanks,
Doug
> ACE VERSION: 5.4.10
>
>
>
> HOST MACHINE and OPERATING SYSTEM:
>
> Windows XP Pro
>
>
>
> COMPILER NAME AND VERSION (AND PATCHLEVEL):
>
> Borland C++ Builder 5.5.1
>
>
>
> THE $ACE_ROOT/ace/config.h FILE [if you use a link to a platform-
>
> specific file, simply state which one]:
>
> #include "ace/config-win32.h"
>
>
>
> BUILD METHOD USED:
>
> Borland make
>
>
>
> AREA/CLASS/EXAMPLE AFFECTED:
>
> Codeset, Naming Service
>
>
>
> DOES THE PROBLEM AFFECT:
>
> COMPILATION?
>
> No
>
> LINKING?
>
> No
>
> EXECUTION?
>
> Yes
>
> OTHER (please specify)?
>
> Both ACE/TAO and application are affected
>
>
>
> SYNOPSIS:
>
> Naming service behaves strangely when native codeset is set
>
>
>
> DESCRIPTION:
>
> Our CORBA server needs to pass string information to a Java client using
> character sets other than ISO-8859-1 (such as CP1250). I implemented a codeset
> translator and changed the native codeset per available TAO documentation,
> however the CORBA server failed to communicate with the naming service unless
> it also has the same native codeset. Once the server and naming service are
> speaking with the same native codeset (other than the default 8859-1), the Java
> client can no longer communicate with the naming service to find CORBA objects.
>
>
>
> I suspect that the object names the server is binding to (hard-coded,
> presumably in US-ASCII) are getting mangled by the codeset translator, which is
> why Java can't find them. If that's the case, is there a way to use the
> default codeset for the naming calls and the native codeset for the server
> objects? How do I get TAO to juggle the two codesets correctly?
>
>
>
>
>
> REPEAT BY:
>
> Start naming service, start CORBA server, start Java client, boom
>
>
>
>
>
>
>
> J Chris Trawick, CISSP-ISSAP
>
> Software Engineer
>
> chris....@atex.com
>
>
>
> Phone +1 (321) 435-0218
>
> Mobile +1 (813) 416-7686
>
>
>
> Atex
>
> 410 N Wickham Rd
>
> Suite 100
>
> Melbourne, FL 32935
>
>
> Bringing new life to your media business
>
> www.atex.com
>
>
>
>
>
>
>
>
>
>
> _______________________________________________
> tao-users mailing list
> tao-...@list.isis.vanderbilt.edu
> http://list.isis.vanderbilt.edu/mailman/listinfo/tao-users
In the meantime, how would you do this in the latest release? That is, how would you configure or code the client and server to communicate with the naming service using a different codeset than what the server needs for its business functions?
Chris Trawick wrote:
> Thanks for the quick response. I considered the upgrade before, but
> it didn't seem that codeset functionality changed all that much
> between these releases. Is it likely to affect this issue? I'm
> starting the upgrade now, but I expect porting will keep me from
> testing this again for another couple of days.
>
The codeset code has been pretty stable for a long time. Doug's comment
is more about the support policy for older releases than about any
functional change in this area.

> In the meantime, how would you do this in the latest release? That
> is, how would you configure or code the client and server to
> communicate with the naming service using a different codeset than
> what the server needs for its business functions?
>
When you are adding a novel codeset, you also need to make sure it is
identified in the ACE codeset registry. The easiest way to do that is to
construct a text file containing the details of all the codesets you
intend to use, including ISO-8859-1 and UTF-16, and run the mkcsregdb
utility found in $ACE_ROOT/apps/mkcsregdb. This will generate a new
instance of ace/Codeset_Registry_db.cpp with properly formatted registry
entries.
I searched the existing source file, code_set_registry1.2g.txt, and did
not find an entry for CP1250.
Then in your svc.conf file you need to first include your translator,
which should go from CP1250 to whatever native codeset you use, such as
ISO-8859-1 or UTF-16. You also need to declare in the Resource factory
settings that you support both the native codeset and the various
desired translated codesets.
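A minimal svc.conf sketch of that arrangement might look like the following. The library name `UTF8_CP1250` and factory `UTF8_CP1250_Factory` are hypothetical placeholders for your own translator, and the codeset id is an assumed value that must match your registry:

```text
# Load a hypothetical CP1250 translator and declare CP1250 as the
# native char codeset (library/factory names and ids are assumptions):
dynamic UTF8_CP1250_Factory Service_Object * UTF8_CP1250:_make_UTF8_CP1250_Factory() ""
static Resource_Factory "-ORBNativeCharCodeSet 0x100204e2 -ORBCharCodesetTranslator UTF8_CP1250_Factory"
```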
>> Our CORBA server needs to pass string information to a Java client using
>> character sets other than ISO-8859-1 (such as CP1250). I implemented a codeset
>> translator and changed the native codeset per available TAO documentation,
>> however the CORBA server failed to communicate with the naming service unless
>> it also has the same native codeset. Once the server and naming service are
>> speaking with the same native codeset (other than the default 8859-1), the Java
>> client can no longer communicate with the naming service to find CORBA objects.
>>
Can you share your server's svc.conf file with me?
Also, let me understand if I am interpreting your problem correctly:
Naming service is storing names that ought to be 8859-1 encoded, which
both your TAO server and Java client access or manipulate.
The Java client resolves the TAO server based on the name binding, and
the server needs to send CP1250 encoded strings to the client.
But this doesn't work.
What ORB are you using on the Java client side? How did you specify its
use of CP1250?
>>
>>
>> I suspect that the object names the server is binding to (hard-coded,
>> presumably in US-ASCII) are getting mangled by the codeset translator, which is
>> why Java can't find them. If that's the case, is there a way to use the
>> default codeset for the naming calls and the native codeset for the server
>> objects? How do I get TAO to juggle the two codesets correctly?
"Native" codeset refers to the codeset used on the host system on which
the application runs. It defines the encoding of codepoints used for
I/O, rendering, etc. Is your TAO server running on a host that is
natively using CP1250? What about the Java application?
What is the origin of the strings that you need to send in CP1250? If,
for example, you are on a system that uses 8859-1 natively but reading
from a file or database that has CP1250 encoded data, you could pass
that in an octet sequence and avoid the whole translation mess.
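To illustrate the octet-sequence idea, here is a hedged IDL sketch; the interface and type names are invented for this example:

```idl
// Hypothetical IDL: carry CP1250-encoded bytes opaquely, so the ORB
// performs no character translation on them.
typedef sequence<octet> RawText;

interface Article {
  // By convention between the peers, the returned bytes are CP1250;
  // the ORB just moves octets untouched.
  RawText body();
};
```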
I will have more questions for you, but would like to know more about
your specific instance before going deeper.
Thanks,
Phil
--
Phil Mesnier
Principal Software Engineer and Partner, http://www.ociweb.com
Object Computing, Inc. +01.314.579.0066 x225
Already done. I used a more complete codeset registry that included all of my target codesets and rebuilt TAO.
> Then in your svc.conf file you need to first include your translator,
> which should go from CP1250 to whatever native codeset you use, such as
> ISO-8859-1 or UTF-16. You also need to declare in the Resource factory
> settings that you support both the native codeset and the various
> desired translated codesets.
The reason we need the other codeset translators is that 8859-1 doesn't have the characters we need from other codesets, so how would building a translator between (say) CP1250 and ISO-8859-1 help? Would it be to translate object names? Those names are already in ISO-8859-1, so I doubt translating them as CP1250 would help. In fact, if I understand things right that's exactly what the problem is.
> Can you share your server's svc.conf file with me?
Attached.
> Also, let me understand if I am interpreting your problem correctly:
>
> Naming service is storing names that ought to be 8859-1 encoded, which
> both your TAO server and Java client access or manipulate.
>
> The Java client resolves the TAO server based on the name binding, and
> the server needs to send CP1250 encoded strings to the client.
>
> But this doesn't work.
Close enough. It would be more accurate to say that Java's looking up objects using names in Unicode, which in a default configuration (with the TAO-bundled ISO-8859-1 translator) encodes to the same binary representation as specified by the server. However, when a CP1250 translator is used instead on the TAO side, it no longer works.
My theory is that the server's binding calls to the naming service run through the same native codeset translators as everything else. Naturally, this means that the same native input bytes (e.g., an object or function name) will be encoded to different transport output bytes if using different translators. I believe that is why, for example (and as well-documented on ace-users), the naming service must use the same native codeset as the server and client: so that when these names are translated from the native codeset to the transport codeset (UTF-8) on the server/client side and back again on the naming service side, the names evaluate to the same sets of bytes and bind to the remote objects.
If that's correct, then the solution is clearly to bind a CORBA object server with its own native codeset using naming service calls in a different codeset. That way the naming service can still use ISO-8859-1 natively and correctly interact with Unicode clients such as Java. How do I do that?
If that's not correct, then I still need a way to get these objects to find each other. Is there another way?
> What ORB are you using on the Java client side? How did you specify its
> use of CP1250?
We're using JacORB. Since Java is natively Unicode, I wasn't aware that specifying a native codeset would be required or desirable. In fact, that seems a little weird to me. I figured using a transport codeset of UTF-8 (which is TAO's default) would allow the Java side to be rather codeset agnostic.
If there were any way to make Java natively anything other than Unicode, I might try CP1250 and see if it works, but I don't know of any way to do that. Thoughts?
Hello,
Sorry it's taken me so long to respond.
Chris Trawick wrote:
>> When you are adding a novel codeset, you also need to make sure it is
>> identified in the ACE codeset registry. [...]
>
> Already done. I used a more complete codeset registry that included
> all of my target codesets and rebuilt TAO.
>
Great. How many codesets do you use?
>
>> Then in your svc.conf file you need to first include your translator,
>> which should go from CP1250 to whatever native codeset you use, such as
>> ISO-8859-1 or UTF-16. You also need to declare in the Resource factory
>> settings that you support both the native codeset and the various
>> desired translated codesets.
>
> The reason we need the other codeset translators is that 8859-1
> doesn't have the characters we need from other codesets, so how would
> building a translator between (say) CP1250 and ISO-8859-1 help? Would
> it be to translate object names? Those names are already in
> ISO-8859-1, so I doubt translating them as CP1250 would help. In
> fact, if I understand things right that's exactly what the problem is.
>
It all depends on how the strings are being used. Assuming the naming
service is using 8859-1 (latin1) exclusively, you need to provide a way
for all clients of the naming service to send and receive latin1 encoded
strings.
Many codesets map the latin1 character set to the code range 00-7F,
meaning that translating those codepoints is trivial, and codepoints
outside that range trigger a MARSHAL exception.
If a larger codeset, say utf-8 is used in place of latin1, then you have
a wider range of mappable codepoints.
>
>> Can you share your server's svc.conf file with me?
>
> Attached.
>
Thanks. I presume from the contents that you are natively using CP1250,
and supporting that and UTF-8 on the wire. You had mentioned using Java
before. Would that happen to be the ORB that ships with WebSphere?
>
>> Also, let me understand if I am interpreting your problem correctly:
>>
>> Naming service is storing names that ought to be 8859-1 encoded, which
>> both your TAO server and Java client access or manipulate.
>>
>> The Java client resolves the TAO server based on the name binding, and
>> the server needs to send CP1250 encoded strings to the client.
>>
>> But this doesn't work.
>
> Close enough. It would be more accurate to say that Java's looking
> up objects using names in Unicode, which in a default configuration
> (with the TAO-bundled ISO-8859-1 translator) encodes to the same
> binary representation as specified by the server. However, when a
> CP1250 translator is used instead on the TAO side, it no longer
> works.
Later releases of TAO support utf-8 as well as latin1. Java uses Unicode
internally, but that codeset cannot be used for IIOP strings as it uses
fixed 16-bit codepoints. JacORB handles this by using Java primitives to
convert the Unicode internal format to either utf-8 or latin1 as
demanded by the transport. I imagine your Java ORB has the same
capacity. When looking up references to CP1250, I encountered an IBM
websphere page: http://tinyurl.com/3yvdtmy which describes this capability.
>
> My theory is that the server's binding calls to the naming service
> run through the same native codeset translators as everything else.
I'm not sure, but I don't believe TAO 1.4 supplies any char codeset
translators by default. It does provide a Unicode BOM translator for
wchar, but the naming service doesn't use wchars. If client and server
use the same native codeset, no translation is performed.
Now if you supply translators, then you can have different native
codesets for peer applications. Of course you need to take care to use
compatible character sets within these varied codesets. Otherwise, as
you have said, you won't be able to properly map the codepoints from one
codeset to the other.
> Naturally, this means that the same native input bytes (e.g., an
> object or function name) will be encoded to different transport
> output bytes if using different translators. I believe that is why,
> for example (and as well-documented on ace-user), the naming service
> must use the same native codeset as the server and client: So that
No, the whole point of translation is to facilitate communication
between different native codesets. It's the character set that matters.
> when these names are translated from the native codeset to the
> transport codeset (UTF-8) on the server/client side and back again on
> the naming service side, the names evaluate to the same sets of bytes
> and bound to remote objects.
Sure. You are correct that the end strings must be able to compare
identically in the naming service.
Now, do you use another translator on the naming service from
utf8(T)-latin1(N)? Incidentally, later versions of TAO also ship with
such a translator included. Of course you can declare your naming
service as using utf8 natively, and save a second translation.
>
> If that's correct, then the solution is clearly to bind a CORBA
> object server with its own native codeset using naming service calls
> in a different codeset. That way the naming service can still use
> ISO-8859-1 natively and correctly interact with Unicode clients such
> as Java. How do I do that?
Now I see below that you are using JacORB. It already supports using
latin1 as a transport codeset. As long as you don't stray from the
latin1 character set there is no problem. If you stray from that
character set, then you will get a MARSHAL exception raised trying to
marshal such a string.
>
> If that's not correct, then I still need a way to get these objects
> to find each other. Is there another way?
>
No, I think you are on the right track.
>
>> What ORB are you using on the Java client side? How did you specify its
>> use of CP1250?
>
> We're using JacORB. Since Java is natively Unicode, I wasn't aware
> that specifying a native codeset would be required or desirable. In
> fact, that seems a little weird to me. I figured using a transport
> codeset of UTF-8 (which is TAO's default) would allow the Java side
> to be rather codeset agnostic.
>
I think TAO's default is latin1, and by-and-large JacORB to TAO string
passing works fine. Of course, I generally only work in English which is
fully coded by latin1. Users of other languages would probably want to
explicitly set TAO's default to utf-8 in order to use the expanded
character sets provided by that codeset.
> If there were any way to make Java natively anything other than
> Unicode, I might try CP1250 and see if it works, but I don't know of
> any way to do that. Thoughts?
It's a little more tedious to expand JacORB's allowed suite of
acceptable codesets. There have been strides towards improving that,
probably Java already supports Unicode to CP1250.
-Phil
Sounds promising, but if I understand you correctly it sounds like your application is performing the UTF-8 transform on its own rather than using TAO's codeset translation. I was hoping that TAO would be able to do the transform itself (a chief requirement of CORBA) because refactoring it into our application would be a prodigious effort.

Did you go that route because TAO ultimately couldn't handle it on its own or was there another reason?

From: Andrew L. Shwaika [mailto:a...@solvo.ru]
Sent: Friday, June 11, 2010 1:19 AM
To: Chris Trawick
Cc: Phil Mesnier; tao-...@list.isis.vanderbilt.edu
Subject: Re: [tao-users] Naming service and native codesets
Hello,
We are using TAO and the JDK 1.6 ORB:

- ace/Codeset_Registry_db.cpp must have the record:

  {"X/Open UTF-8; UCS Transformation Format 8 (UTF-8)","UTF-8",0x05010001,1,{0x1000},6},

- Run all TAO applications with this svc.conf:

  static Resource_Factory '-ORBNativeCharCodeSet 0x05010001 -ORBNativeWcharCodeSet 0x00010100'

- Run Java applications with this ORB option:

  Properties props = new Properties();
  props.setProperty("com.sun.CORBA.codeset.charsets", "0x05010001");
  orb = ORB.init(args, props);

- Our applications built with TAO manually convert all string data
  between the native encoding and UTF-8.

This solution works even with IIOP.NET.
BTW, we are using more exotic native charset encodings like KOI8-R on Linux.
Sincerely yours,
-Andrew
Currently we only support one (CP1252, which is close enough to ISO-8859-1 that it hasn't been a problem). As we expand to more international markets, we need more. I'm targeting support for the character sets supported by Windows Input Language, all of which seem to be in the OSF registry (except for Vietnamese, which isn't a target market yet). Also, for technical reasons beyond my control (i.e., legacy software), I prefer the entirely 8-bit sets over the multi-byte ones.
> It all depends on how the strings are being used. Assuming the naming
> service is using 8859-1 (latin1) exclusively, you need to provide a way
> for all clients of the naming service to send and receive latin1
> encoded
> strings.
Precisely. So, if the native codeset is CP1250, how do I let TAO know that I'm expecting or passing Latin1 strings instead?
> Many codesets map the latin1 character set to the code range 00-7F,
> meaning that translating those codepoints is trivial, and codepoints
> outside that range trigger a MARSHAL exception.
Yes. Something in the naming service calls is going outside of that range. It's nothing we're providing as input, so it must be an artifact of the protocol itself.
> If a larger codeset, say utf-8 is used in place of latin1, then you
> have
> a wider range of mappable codepoints.
Are you suggesting using UTF-8 as a native codeset? That's simply not practical for our application.
> Thanks. I presume from the contents that you are natively using CP1250,
> and supporting that and UTF-8 on the wire. You had mentioned using Java
> before. Would that happen to be that which is in Websphere?
Java's hosted on Tomcat.
> Later releases of TAO support utf-8 as well as latin1.
TAO 1.4 uses UTF-8 on the wire, not Latin1. If you mean to suggest that we use UTF-8 natively, that's not really an option.
> Java uses
> Unicode
> internally, but that codeset cannot be used for IIOP strings as it uses
> fixed 16-bit codepoints. JacORB handles this by using Java primitives
> to
> convert the Unicode internal format to either utf-8 or latin1 as
> demanded by the transport. I imagine your Java ORB has the same
> capacity. When looking up references to CP1250, I encountered an IBM
> websphere page: http://tinyurl.com/3yvdtmy which describes this
> capability.
We're not using WebSphere. Our JacORB uses UTF-8 on the wire.
> I'm not sure, but I don't believe TAO 1.4 supplies any char codeset
> translators by default. It does provide a Unicode BOM translator for
> wchar, but the naming service doesn't use wchars. If client and server
> use the same native codeset, no translation is performed.
TAO 1.4 uses an ISO-8859-1 to UTF-8 translator by default. It's called UTF8_Latin1_Factory. Even when I specify a native codeset, it still tries the default first and rejects it based on ncs value.
> Now if you supply translators, then you can have different native
> codesets for peer applications. Of course you need to take care to use
> compatible character sets within these varied codesets. Otherwise, as
> you have said, you won't be able to properly map the codepoints from
> one
> codeset to the other.
I believe you're confusing issues here. One of these peers needs to use one codeset when communicating with one peer and a different codeset when communicating with another. If the only way to do that is for all peers to use the same codeset, then what good is CORBA?
> No, the whole point of translation is to facilitate communication
> between different native codesets. It's the character set that matters.
All right, so how does one peer use two codesets for different calls?
> Now, do you use another translator on the naming service from
> utf8(T)-latin1(N)? Incidentally, later versions of TAO also ship with
> such a translator included. Of course you can declare your naming
> service as using utf8 natively, and save a second translation.
My version of TAO (1.4) ships with this very translator, and uses it by default. It's called UTF8_Latin1_Factory, and its source is in ACE_Wrappers/TAO/tao/Codeset/
> Now I see below that you are using JacORB. It already supports using
> latin1 as a transport codeset. As long as you don't stray from the
> latin1 character set there is no problem. If you stray from that
> character set, then you will get a MARSHAL exception raised trying to
> marshal such a string.
No, JacORB is using UTF-8 as a transport codeset, same as TAO.
> I think TAO's default is latin1, and by-and-large JacORB to TAO string
> passing works fine. Of course, I generally only work in English which
> is
> fully coded by latin1. Users of other languages would probably want to
> explicitly set TAO's default to utf-8 in order to use the expanded
> character sets provided by that codeset.
Are you speaking of native or transport codesets? TAO's default native codeset is ISO-8859-1. TAO's default transport codeset is UTF-8. JacORB's transport codeset is UTF-8 (see TAO/tao/Codeset/Codeset_Manager_i.cpp). UTF-8 natively is not practical.
> It's a little more tedious to expand JacORB's allowed suite of
> acceptable codesets. There have been strides towards improving that,
> and
> probably Java already supports Unicode to CP1250.
I don't need to expand it. It already supports UTF-8 on the wire, which as you say is all I need.
Chris Trawick wrote:
>>> Already done. I used a more complete codeset registry that included
>>> all of my target codesets and rebuilt TAO.
>> Great. How many codesets do you use?
>
> Currently we only support one (CP1252, which is close enough to
> ISO-8859-1 that it hasn't been a problem). As we expand to more
> international markets, we need more. I'm targeting support for the
> character sets supported by Windows Input Language, all of which seem
> to be in the OSF registry (except for Vietnamese, which isn't a
> target market yet). Also, for technical reasons beyond my control
> (i.e., legacy software), I prefer the entirely 8-bit sets over the
> multi-byte ones.
>
Ok. I think you mean you prefer octet-based sets versus wide-character
sets. An 8-bit code set will only have 256 unique codepoint values.
UTF-8, for instance, has thousands of codepoints, which may encode to
several octets.
Just so we are straight, the terms I am using are:
code point = a single numeric value that represents a glyph in some
rendering.
character set = a collection of glyphs which may (or may not) have
common code point values.
code set = a collection of code points. Code sets may contain (or
support) one or more character sets.
Native code set (NCS) = a property of an application instance, declares
how the application expects to interpret code point values in I/O.
Transmission code set(TCS) = a property of the connection between two
application instances, declares how octets containing code point values
are to be interpreted.
Translator = a utility to transform code point values from the NCS
representation of characters to the TCS representation of those same
characters.
Conversion Code set (CCS) = any alternative to the native code set that
may be recognized by the advertising application instance. A server
application may advertise a single NCS and many CCS in its IOR. It will
do this for both char and wchar based code sets.
Translation may occur during message origination, message reception, or
both.
>
>> It all depends on how the strings are being used. Assuming the naming
>> service is using 8859-1 (latin1) exclusively, you need to provide a way
>> for all clients of the naming service to send and receive latin1
>> encoded
>> strings.
>
> Precisely. So, if the native codeset is CP1250, how do I let TAO know
> that I'm expecting or passing Latin1 strings instead?
>
You provide a CP1250(NCS) -> Latin1(TCS) translator. In the svc.conf
file you load that translator service object, and you declare CP1250 to
be your native code set.
TAO has a codeset library which contains a Codeset_Manager object which
on start up compares the loaded translators to the defined NCS, and then
constructs a tagged component with the NCS and supported CCS values.
>
>> Many codesets map the latin1 character set to the code range 00-7F,
>> meaning that translating those codepoints is trivial, and codepoints
>> outside that range trigger a MARSHAL exception.
>
> Yes. Something in the naming service calls is going outside of that
> range. It's nothing we're providing as input, so it must be an
> artifact of the protocol itself.
>
Can you be more specific? The codeset translation only occurs on
character and string data you supply, it does not affect things such as
operation names or other protocol-provided data.
You can also get a CODESET_INCOMPATIBLE exception if a client
application cannot match the NCS or any CCS supplied in the IOR to any
of its own NCS or CCS values.
If you get a MARSHAL exception out of a codeset translator, it is
because a code point supplied to the translator is out of its range of
acceptable input.
>
>> If a larger codeset, say utf-8 is used in place of latin1, then you
>> have
>> a wider range of mappable codepoints.
>
> Are you suggesting using UTF-8 as a native codeset? That's simply not
> practical for our application.
>
I was suggesting that, but only for the naming service.
However, if that's not practical, so be it. BTW, I would like to take
this moment to suggest that you open a support contract with OCI. I
would like to be able to provide you with greater support, but this is
taking a fair amount of my time. Please contact sa...@ociweb.com.
(Sorry for the plug, list.)
Anyway, I admit I don't fully understand the context of your
application, but I do understand a lot about codeset use in CORBA.
>> Later releases of TAO support utf-8 as well as latin1.
>
> TAO 1.4 uses UTF-8 on the wire, not Latin1. If you mean to suggest
> that we use UTF-8 natively, that's not really an option.
>
You're right. I couldn't remember at the time when this support was added.
>> I'm not sure, but I don't believe TAO 1.4 supplies any char codeset
>> translators by default. It does provide a Unicode BOM translator for
>> wchar, but the naming service doesn't use wchars. If client and server
>> use the same native codeset, no translation is performed.
>
> TAO 1.4 uses an ISO-8859-1 to UTF-8 translator by default. It's
> called UTF8_Latin1_Factory. Even when I specify a native codeset, it
> still tries the default first and rejects it based on ncs value.
>
Right. The supplied UTF8->Latin1 translator is specifically for use with
latin1 as the NCS; it was added to ease interaction with UTF-8 native
peers.
>
>> Now if you supply translators, then you can have different native
>> codesets for peer applications. Of course you need to take care to use
>> compatible character sets within these varied codesets. Otherwise, as
>> you have said, you won't be able to properly map the codepoints from
>> one
>> codeset to the other.
>
> I believe you're confusing issues here. One of these peers needs to
> use one native codeset when communicating with one peer and a
> different codeset when communicating with another. If the only way to
> do that is for all peers to use the same codeset, then what good is
> CORBA?
>
What am I confusing?
It sounds like you have 3 peer applications, A, B, and C. A and B use
NCS cp1250, and C uses NCS latin1. A is a client to both B & C, and B is
also a client of C. So if A and B both load a latin1->cp1250 translator,
then they will support latin1 as a CCS. Thus when they get the IOR from
C, they will match its NCS to their CCS. When A gets the IOR from B, B's
NCS will match A's NCS and thus no translator will be needed.
Does that make sense?
Does that describe your requirement?
>
>> No, the whole point of translation is to facilitate communication
>> between different native codesets. It's the character set that matters.
>
> All right, so how does one peer use two codesets for different calls?
>
See above, I think I've answered this already.
>
>> Now I see below that you are using JacORB. It already supports using
>> latin1 as a transport codeset. As long as you don't stray from the
>> latin1 character set there is no problem. If you stray from that
>> character set, then you will get a MARSHAL exception raised trying to
>> marshal such a string.
>
> No, JacORB is using UTF-8 as a transport codeset, same as TAO.
>
Well, that depends on the JacORB version. You are right that later
versions interrogate the system to determine the appropriate native
codeset, but default to latin1 as a last resort. Older versions of
JacORB just used latin1. Again, I don't recall when the additional
initialization code was added.
>
>> I think TAO's default is latin1, and by-and-large JacORB to TAO string
>> passing works fine. Of course, I generally only work in English which
>> is
>> fully coded by latin1. Users of other languages would probably want to
>> explicitly set TAO's default to utf-8 in order to use the expanded
>> character sets provided by that codeset.
>
> Are you speaking of native or transport codesets? TAO's default
> native codeset is ISO-8859-1. TAO's default transport codeset is UTF-8.
> JacORB's transport codeset is UTF-8 (see
> TAO/tao/Codeset/Codeset_Manager_i.cpp). UTF-8 natively is not practical.
>
I think I was speaking of NCS above. TAO will want to use the NCS as the
TCS whenever it can, since it removes the need for character transformation.
>
>> It's a little more tedious to expand JacORB's allowed suite of
>> acceptable codesets. There have been strides towards improving that,
>> and probably Java already supports Unicode to CP1250.
>
> I don't need to expand it. It already supports UTF-8 on the wire,
> which as you say is all I need.
>
It is all you need if UTF-8 contains all the character sets you are
interested in passing between your applications. For your applications
that use CP1250 as NCS, you will need to provide a UTF8(TCS) ->
CP1250(NCS) translator. If all the character sets you wish to use are
mapped to the same codepoint ranges in both codesets then the
translation process is a no-op, but as I've described above, you still
need this so that IORs can be properly advertised as supporting both
CP1250 and UTF8, and that IORs advertising support for both may be
accepted by clients using either.
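To make the no-op case concrete, here is a minimal, self-contained sketch
of the TCS(utf8) -> NCS(cp1250) direction of such a translator. This is
NOT TAO's actual ACE translator interface, and the function name is my
own; it only illustrates the point above: code points in the shared 00-7F
range pass through byte-for-byte, while anything the translator cannot
map is the condition that ultimately surfaces as a MARSHAL exception.

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Sketch only: decode UTF-8 input and map it into CP1250. Code points
// 00-7F are encoded identically in UTF-8, Latin1, and CP1250, so for
// them "translation" is a byte-for-byte no-op. Unmappable input is
// reported as failure; in TAO the CDR layer would then raise
// CORBA::MARSHAL.
std::optional<std::string> utf8_to_cp1250(const std::string& utf8)
{
  std::string out;
  for (std::size_t i = 0; i < utf8.size(); ++i) {
    unsigned char b = static_cast<unsigned char>(utf8[i]);
    if (b < 0x80) {
      out.push_back(static_cast<char>(b));  // shared codepoint range: no-op
    } else {
      // A real translator would decode the multi-byte UTF-8 sequence and
      // look the code point up in a CP1250 mapping table; this sketch
      // treats all such sequences as unmappable.
      return std::nullopt;  // caller signals failure (-> MARSHAL)
    }
  }
  return out;
}
```

ASCII-only strings, such as the names the naming service itself passes
around, translate unchanged, which is why everything appears to work
until a string strays outside the shared range.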
I hope this clarifies things a bit. And please let me know about opening
a support contract with OCI.
Best regards,
Andrew L. Shwaika wrote:
>
> Hello,
>
> Yes, you are right. Our application has to perform UTF-8 conversion to
> the locale character set.
> This is done because of the lack of a codeset id in OSF's Character and
> Code Set Registry for our locale (KOI8-R), and I didn't find any way in
> TAO to determine the current locale and offer it in the connection
> profile. Even if your locale is UTF-8, TAO will use only LATIN-1 until
> you change it in svc.conf.
>
You are correct that TAO doesn't consider the locale when selecting the
NCS; that was not part of the originally funded use case, and no one has
brought up its absence before you. If you have code that does that, it
would be really cool if you could contribute it to TAO to make it easier
to detect and configure the NCS.
> BTW, I saw another ORB that does this in a more sophisticated way: if
> the current locale is not compatible with LATIN-1, it adds the UTF-8
> codeset to the connection profile's native charsets as preferred, and
> converts all character data from/to UTF-8 using whatever standard
> conversion library the platform provides (libiconv on Linux, for
> example).
>
The challenge for TAO in this regard is its portability. Just as
detection of the locale and extraction of the NCS definition vary from
platform to platform, or possibly between versions of similar platforms,
so does the system-supplied ability to map between codesets.
TAO's existing codeset translation mechanism is a little brittle, but by
avoiding platform-specific tricks, it is certainly portable. It should
be possible to make a UTF8-on-wire to any native codeset converter based
on iconv or other utility. I think a few years back someone did that,
but for some reason or other it wasn't integrated with the release.
So once again, you are welcome and encouraged to contribute such a
translator, or fund the development of the same.
> TAO Server A uses NCS CP1250 and TCS UTF8.
> TAO Naming Service B uses NCS Latin1 and TCS UTF8.
> JacORB Client C uses Unicode and TCS UTF8, and uses the naming service
> to find TAO Server A.
> Given: translation between CP1250 and Latin1 is so lossy it's silly to
> contemplate.
>
OK.
> TAO Server A is fully capable of passing Latin1 strings to Naming
> Service B and CP1250 strings to JacORB Client C at the same time.
>
Right.
> So, question is this: How TAO Server A can tell TAO that strings
> passed to Naming Service B are in one codeset while strings passed to
> JacORB client C are in another?
>
OK, the answer is "it doesn't have to!" This is the whole point of the
codeset translation subsystem.
When server A connects as a client to the naming service, it does so by
examining the naming service IOR, which contains a codeset component
declaring latin1 as NCS and utf8 as CCS. You can verify this by writing
the naming service IOR to a file, then running $ACE_ROOT/bin/catior -f
<ior file name>. If catior isn't in your bin directory, you can build it
from the source in $TAO_ROOT/utils/catior.
Anyway, the codeset manager in server A looks at the codesets listed in
the IOR and observes that utf8 is supported by both itself and the
naming service. It will then supply the utf8->cp1250 translator to the
transport used for communicating with the naming service.
The naming service is informed of this decision on receipt of the first
message from server A: a "codeset service context" is appended to the
first request, declaring that utf8 is used as the transmission codeset.
The naming service's codeset manager then supplies its utf8->latin1
translator to its local transport object.
A similar negotiation happens between client C and both the naming
service and Server A.
Does that make sense?
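The selection step described above can be sketched as follows. This is a
simplified reading of the CORBA codeset-negotiation rules, not TAO's
actual Codeset_Manager code; the registry ids in the test values are
from the OSF registry (0x00010001 = ISO-8859-1, 0x05010001 = UTF-8,
0x100204E2 = CP1250), and all names are my own.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// A peer's advertised codesets: one native code set plus the conversion
// code sets for which it has translators loaded.
struct CodesetInfo {
  uint32_t ncs;                 // native code set
  std::vector<uint32_t> ccs;    // conversion code sets
};

// Returns the transmission code set the client would select after reading
// the server's IOR, or 0 if the peers share nothing (a real ORB would
// raise CODESET_INCOMPATIBLE).
uint32_t select_tcs(const CodesetInfo& client, const CodesetInfo& server)
{
  auto supports = [](const CodesetInfo& p, uint32_t cs) {
    return p.ncs == cs ||
           std::find(p.ccs.begin(), p.ccs.end(), cs) != p.ccs.end();
  };
  if (client.ncs == server.ncs)      // same NCS: no translation needed
    return client.ncs;
  if (supports(client, server.ncs))  // client can convert to server's NCS
    return server.ncs;
  if (supports(server, client.ncs))  // server can convert to client's NCS
    return client.ncs;
  for (uint32_t cs : client.ccs)     // otherwise: first common CCS
    if (supports(server, cs))
      return cs;
  return 0;                          // incompatible
}
```

For server A (NCS cp1250, CCS utf8) talking to the naming service (NCS
latin1, CCS utf8), neither NCS matches, so the common CCS utf8 wins, and
each side runs its own utf8 translator.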
You are not understanding.
Here's some code for a single peer:

  // outgoing call on one CORBA interface:
  namingService->someCall("This string is encoded in Latin1");
  // reply returned from another interface the same peer implements:
  return CORBA::string_dup("This string is encoded in CP1250");
Same peer. Same thread. Same NCS setting. Two different CORBA interfaces, one being called and the other being implemented. How do I tell TAO that the call is in Latin1 and the return is in CP1250? That it "just knows" is not credible.
Chris, you are not understanding what I am trying to explain.
Codeset translation is behavior related to the connection between two
peers and thus transparent to the application code.
Your server has two connections, one to the naming service which is
using one translator, and a second connection to the client which uses a
separate translator.
In fact, for your scenario, both connections in server A will use the
same translator, utf8->cp1250.
It is irrelevant to server A that both peers happen to be using
different NCS values, it only knows that the transports for each
connection are using utf8 as the transmission codeset.
I think you are worrying about character sets. The code set ISO-8859-1
is known as "latin1" because it encodes only the latin1 character set.
If you look in the code set registry file, you will see that each
codeset definition defines a value for "char_values", which are
character set identifiers. In this case latin1 is 0x0011. If you grep
for 0x0011, you will see that a lot of code sets contain that character
set, and many do not. I assume that cp1250 does, but that code set is
not listed in the version of the code set registry included in
ACE/apps/mkcsregdb.
So, back to the point, there is no way to constrain or enforce the use
of a particular character set within a code set, apart from detecting a
marshal exception if the value supplied is outside the range expected.
Since you know you are writing an application in which these limitations
exist, you must anticipate this exception and be able to deal with it.
No, I'm pretty sure I'm understanding you perfectly. The thing is you're answering a different question than I'm asking.
> Your server has two connections, one to the naming service which is
> using one translator, and a second connection to the client which uses
> a
> separate translator.
Perfect! *Now* we're getting somewhere. How do I tell TAO that the two translators have different native codesets?
> It is irrelevant to server A that both peers happen to be using
> different NCS values, it only knows that the transports for each
> connection are using utf8 as the transmission codeset.
I contend it's quite relevant. For example, as you suggested I set up the naming service to use UTF8 natively. When the server uses ISO-8859-1 natively, everything goes fine. When the server uses CP1250, I get a marshal exception. In all cases, the TAO logs clearly show it's using UTF8 for TCS. Clearly, accessing the naming service with anything other than either Latin1 or UTF8 as a native codeset simply doesn't work. So we're back to my original question yet again: How do I make it work?
> I think you are worrying about character sets. the code set ISO-8859-1
> is known as "latin1" because it encodes only the character set for
> latin1.
>
> If you look in the code set registry file, you will see that each
> codeset definition defines a value for "char_values" which are
> character
> set identifiers. In this case latin1 is 0x0011. If you grep for 0x0011,
> you will see that a lot of code sets contain that character set, and
> many do not. I assume that cp1250 does, but that code set is not listed
> in version of the code set registry included in ACE/apps/mkcsregdb.
I merely used Latin1 because you substituted that term every time I used ISO-8859-1. I thought it might get us on the same page (so to speak), but clearly that's failed.
You don't, and you can't. You only have one native codeset in an
application, or more precisely, one native codeset per ORB.
This is actually part of the CORBA specification. Since one of the
motivations of CORBA is to shield application developers from
distribution details, it hides this configuration. Since the ORB has the
responsibility to select the TCS based on information advertised in the
IOR, it can only do so if it only has a single NCS. Further, when the
ORB produces an IOR it must advertise its NCS, and the spec only defines
a single one for char based text, and one for wchar based text.
The ability to provide distinct configurations to multiple ORBs was
added in the TAO 1.5 era. This started in OCI TAO 1.5a and was back
ported to DOC TAO at some point. Using this would let you configure two
ORBs so that each could have a distinct NCS.
Short of doing that, you have a single NCS, cp1250, and must use codeset
translators to support multiple CCS values.
Alternatively you could use char/string for communicating exclusively
with the naming service, and use wchar/wstring types for other
communication. Then you can specify your NCS-C to be utf8 or latin1, and
your NCS-W to be cp1250. Of course then you have the whole 8-16 (or 32)
bit conversion mess to deal with, which I suspect you don't want.
>
>> It is irrelevant to server A that both peers happen to be using
>> different NCS values, it only knows that the transports for each
>> connection are using utf8 as the transmission codeset.
>
> I contend it's quite relevant. For example, as you suggested I set up
> the naming service to use UTF8 natively. When the server uses ISO-8859-1
> natively, everything goes fine. When the server uses CP1250, I get a
> marshal exception. In all cases, the TAO logs clearly show it's using
> UTF8 for TCS. Clearly, accessing the naming service with anything other
> than either Latin1 or UTF8 as a native codeset simply doesn't work. So
> we're back to my original question yet again: How do I make it work?
>
It isn't relevant. The only thing that matters, at either side of the
connection, is the conversion between the NCS and the negotiated TCS.
>
Have you validated that your utf8->cp1250 translator works correctly?
If you run your test with both the naming service and your test client
at the highest debug level, -ORBDebugLevel 10, you will be able to tell
from the logs which side raises the marshal exception.
If your naming service is running with utf8 as its NCS, then the
translator runs only in the client; the naming service does no
translation and therefore ought not raise a marshal exception. So if you
get an exception, it is coming from the translator in your client
application, trying to convert native cp1250 to transmitted utf8.
Note that the translator itself does not raise the exception, it merely
returns 1/0 from read/write primitives. The higher-level CDR objects
will actually raise the exception.
When you run the naming service with latin1 as the native codeset and
utf8 as the transmission codeset, there will be a translation occurring
on the naming service side of the connection as well, which will throw a
marshal exception back to the client if any of the code point values
cannot be mapped to latin1.
>
>> I think you are worrying about character sets. the code set ISO-8859-1
>> is known as "latin1" because it encodes only the character set for
>> latin1.
>>
>> If you look in the code set registry file, you will see that each
>> codeset definition defines a value for "char_values" which are
>> character
>> set identifiers. In this case latin1 is 0x0011. If you grep for 0x0011,
>> you will see that a lot of code sets contain that character set, and
>> many do not. I assume that cp1250 does, but that code set is not listed
>> in version of the code set registry included in ACE/apps/mkcsregdb.
>
> I merely used Latin1 because you substituted that term every time I
> used ISO-8859-1. I thought it might get us on the same page (so to
> speak), but clearly that's failed.
That's fine. Because you were asking a somewhat undefined question ("how
do I use 2 native codesets?"), I was trying to map it to something that
made sense, such as "how do I use multiple character sets correctly?" I
think we are converging on an understanding of the problem.