
VMS and Unicode


Denny Rich

Feb 11, 2004, 1:18:53 PM
We are expecting to receive EDI messages coded in Unicode. We run
multiple ES45s in a homogeneous cluster, now on VMS 7.3-1.

As I understand the Unicode issue, it means that any given character
may consist of a byte stream of from 1 to many bytes. I don't know if
the byte count is fixed or variable.

Using this bytestream, it is possible to encode most symbols and
characters in most major languages.

We need to map some of those characters to the ASCII that our
applications know how to handle.

So, there will be some bit pattern that identifies the "euro currency
character" (I picked this because its probably not in the ASCII
character set). When this is encountered, our ASCII-based software can
make some internal 'notation' that the currency to follow is in Euros
rather than in yen or pesos.

Similar process for the little "degree mark" (small raised circle)
that is a common part of European street addresses, as well as part
of temperature specifications. Our application could note that we are
talking about "degrees", or a floor in a particular building.

We envision feeding this Unicode stream to VMS, where there will be
some "transformation" done that will make Unicode understandable to a
VMS application. At least, the transformation will translate the
Unicode into some representation meaningful to our application.

One way to do this would be to build a "transform program" that has a
big lookup table in it and provides Unicode-to-ASCII conversion for
characters of interest and based partly on the context of the
character within the message. This could be home-grown or commercial.
You could even do this in hardware, using a couple of big EPROMS for
the translation. Very fast.
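
For illustration, a minimal C sketch of that lookup-table idea, assuming
the incoming stream has already been decoded into code points; the code
points and replacement strings below are made-up examples, not anything
our application actually requires:

#include <stdio.h>

/* Illustrative mapping from selected Unicode code points to ASCII
   replacement text; the entries are examples only, not a complete table. */
struct xlat { unsigned long code; const char *ascii; };

static const struct xlat table[] = {
    { 0x20ACUL, "EUR"   },   /* euro currency sign */
    { 0x00B0UL, " deg " },   /* degree sign        */
    { 0x00A3UL, "GBP"   },   /* pound sign         */
    { 0UL,      0       }
};

/* Return the ASCII replacement for one decoded code point.
   Code points 0-127 are already ASCII; anything unmapped becomes "?". */
static const char *to_ascii(unsigned long code, char one[2])
{
    const struct xlat *p;

    if (code < 0x80UL) {
        one[0] = (char) code;
        one[1] = '\0';
        return one;
    }
    for (p = table; p->ascii != 0; p++)
        if (p->code == code)
            return p->ascii;
    return "?";
}

int main(void)
{
    /* "37(degree)C 5(euro)" as already-decoded code points */
    unsigned long msg[] = { '3', '7', 0xB0UL, 'C', ' ', '5', 0x20ACUL, 0UL };
    char one[2];
    int i;

    for (i = 0; msg[i] != 0UL; i++)
        fputs(to_ascii(msg[i], one), stdout);
    fputs("\n", stdout);     /* prints: 37 deg C 5EUR */
    return 0;
}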

I took a cursory look at the VMS doc CD but didn't come up with
anything startling.

So, questions:
Have I missed something in the Doc set? (is this easier than I
think?)
How have others addressed this problem?
Is there a commercial product for making this transformation?

Surely, someone has run into this sort of application before now. What
did you do?

Thanks,

Denny
VMS System Manager

rmo...@rmoore.dyndns.org

Feb 11, 2004, 2:48:55 PM
The DEC C RTL contains routines to convert between ASCII and Unicode.
They are documented in the OpenVMS 7.3-1 C RTL Reference, chapter 10.
There is a special kit to install to get Unicode support. Once that kit
is installed, you use the iconv() function in C to convert between
character sets.
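
For example, a minimal sketch along these lines; the codeset names
"UTF-8" and "ISO8859-1" are assumptions, so use whatever names the
installed conversion tables actually register:

#include <stdio.h>
#include <string.h>
#include <iconv.h>

int main(void)
{
    /* The codeset names are assumptions; check which conversion tables
       the installed kit provides. */
    iconv_t cd = iconv_open("ISO8859-1", "UTF-8");     /* to-code, from-code */
    char in[] = "37\xc2\xb0" "C";                      /* "37 degrees C" in UTF-8 */
    char out[64];
    char *inp = in, *outp = out;
    size_t inleft = strlen(in), outleft = sizeof(out) - 1;

    if (cd == (iconv_t) -1) {
        perror("iconv_open");
        return 1;
    }
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
        perror("iconv");

    *outp = '\0';
    printf("%s\n", out);       /* Latin-1 bytes, if the conversion exists */
    iconv_close(cd);
    return 0;
}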

JF Mezei

Feb 11, 2004, 6:49:59 PM
Denny Rich wrote:
> One way to do this, would be to build a "transform program" that has a
> big lookup table in it and provides Unicode-to-ASCII conversion for
> characters of interest

> I took a cursory look at the VMS doc CD but didn't come up with
> anything startling.


$HELP ICONV

Theoretically, VMS has all the hooks in place to do the conversion you want.
But in practice, they only went halfway and left out the popular
translation tables. Look for "UTF8" in a recent subject in this newsgroup;
someone provided the exact name of the kit with the additional translation
tables, which can be found on the VMS CD (something like II18N____).

Bob Koehler

Feb 12, 2004, 9:43:49 AM
In article <d28306e.04021...@posting.google.com>, denny...@swagelok.com (Denny Rich) writes:
>
> As I understand the Unicode issue, it means that any given character
> may consist of a byte stream of from 1 to many bytes. I don't know if
> the byte count is fixed or variable.

Single byte Unicode maps to ASCII, and is the most common.

> Using this bytestream, it is possible to encode most symbols and
> characters in most major languages.
>
> We need to map some of those characters to the ASCII that our
> applications know how to handle.
>
> So, there will be some bit pattern that identifies the "euro currency
> character" (I picked this because its probably not in the ASCII
> character set). When this is encountered, our ASCII-based software can
> make some internal 'notation' that the currency to follow is in Euros
> rather than in yen or pesos.

Euro is in ASCII now.

> Similar process for the little "degree mark" (small raised circle)
> that is a common part of european street addresses, as well as part
> of temperature specifications. Our application could note that we are
> talking about "degrees", or a floor in a particular building.

The degree mark is in the DEC Multinational Character Set; there's a standard
ISO character set that's almost exactly the same.

Things you will find in Unicode but won't find in ASCII include Asian
character sets like kanji and hiragana, as well as European
characters not included in DEC Multinational.

> One way to do this, would be to build a "transform program" that has a
> big lookup table in it and provides Unicode-to-ASCII conversion for
> characters of interest and based partly on the context of the
> character within the message. This could be home-grown or commercial.
> You could even do this in hardware, using a couple of big EPROMS for
> the translation. Very fast.
>
> I took a cursory look at the VMS doc CD but didn't come up with
> anything startling.

LIB$MOVTC is a good place to start, but it was designed for single-byte
character sets. LIB$TRA_EBC_ASC and LIB$TRA_ASC_EBC use it.

Of course, if you program in Macro-32 MOVTC is an instruction.
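
A rough sketch of the LIB$MOVTC approach, using an illustrative 256-byte
table rather than a real character-set mapping; the argument order is as
I recall it from the LIBRTL reference, so double-check there:

#include <stdio.h>
#include <descrip.h>
#include <lib$routines.h>

int main(void)
{
    /* 256-byte translation table: identity for 0-127, everything else
       collapsed to '?' except one illustrative substitution. */
    char table[256];
    char src[] = "temp: 37\260C";       /* \260 = degree sign in Latin-1 */
    char dst[sizeof(src)];
    int i;

    struct dsc$descriptor_s src_d = { sizeof(src) - 1, DSC$K_DTYPE_T, DSC$K_CLASS_S, src };
    struct dsc$descriptor_s dst_d = { sizeof(dst) - 1, DSC$K_DTYPE_T, DSC$K_CLASS_S, dst };
    struct dsc$descriptor_s tbl_d = { sizeof(table),   DSC$K_DTYPE_T, DSC$K_CLASS_S, table };
    $DESCRIPTOR(fill_d, " ");

    for (i = 0; i < 256; i++)
        table[i] = (i < 128) ? (char) i : '?';
    table[0260] = 'o';                  /* degree sign -> 'o', example only */

    /* Argument order: source, fill character, table, destination
       (check the LIBRTL reference). */
    lib$movtc(&src_d, &fill_d, &tbl_d, &dst_d);
    dst[sizeof(dst) - 1] = '\0';
    printf("%s\n", dst);                /* temp: 37oC */
    return 0;
}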


Richard Brodie

Feb 12, 2004, 11:23:00 AM

"Denny Rich" <denny...@swagelok.com> wrote in message
news:d28306e.04021...@posting.google.com...

> As I understand the Unicode issue, it means that any given character
> may consist of a byte stream of from 1 to many bytes. I don't know if
> the byte count is fixed or variable.

Unicode has a number of encodings. One of the more useful
and interesting encodings is UTF-8, where the characters in the
range 0-127 (i.e. those corresponding to the ASCII characters)
are encoded using a single byte. This means that plain ASCII
text is unchanged but the format can accommodate the full Unicode
character set.
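
To make the variable width concrete, here is a small C sketch of the
standard UTF-8 length rules (one byte up to U+007F, two up to U+07FF,
three up to U+FFFF, four beyond):

#include <stdio.h>

/* Encode one Unicode code point as UTF-8; returns the byte count (1-4),
   or 0 if the value is out of range.  Standard UTF-8 rules, with no
   special handling for surrogates. */
static int utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80UL) {                       /* ASCII: unchanged        */
        out[0] = (unsigned char) cp;
        return 1;
    } else if (cp < 0x800UL) {               /* e.g. degree sign U+00B0 */
        out[0] = 0xC0 | (unsigned char) (cp >> 6);
        out[1] = 0x80 | (unsigned char) (cp & 0x3F);
        return 2;
    } else if (cp < 0x10000UL) {             /* e.g. euro sign U+20AC   */
        out[0] = 0xE0 | (unsigned char) (cp >> 12);
        out[1] = 0x80 | (unsigned char) ((cp >> 6) & 0x3F);
        out[2] = 0x80 | (unsigned char) (cp & 0x3F);
        return 3;
    } else if (cp < 0x110000UL) {            /* supplementary planes    */
        out[0] = 0xF0 | (unsigned char) (cp >> 18);
        out[1] = 0x80 | (unsigned char) ((cp >> 12) & 0x3F);
        out[2] = 0x80 | (unsigned char) ((cp >> 6) & 0x3F);
        out[3] = 0x80 | (unsigned char) (cp & 0x3F);
        return 4;
    }
    return 0;
}

int main(void)
{
    unsigned char b[4];
    int i, n = utf8_encode(0x20ACUL, b);     /* euro sign */

    for (i = 0; i < n; i++)
        printf("%02X ", b[i]);
    printf("\n");                            /* prints: E2 82 AC */
    return 0;
}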

Since the character set for all HTML documents is Unicode, this
kind of issue comes up all the time in HTML/XML processing.
Pick your favourite language, and see what library support it has
for web plumbing.

Something quick and dirty would be to encode all the
characters not representable in ASCII as XML entities -
two line Python prototype:

$ python
Python 2.3 (#0, Aug 4 2003, 08:46:25) [DECC] on OpenVMS
Type "help", "copyright", "credits" or "license" for more information.
>>> temperature = u"37°C"
>>> print temperature.encode("ascii", "xmlcharrefreplace")
37&#176;C

The application specific way of handling the extra characters may
be more interesting. Specifying that is the hard part of the problem,
as I see it: everything else is, as you say, just using a lookup table.


Craig A. Berry

Feb 12, 2004, 11:46:42 AM
In article <d28306e.04021...@posting.google.com>,
denny...@swagelok.com (Denny Rich) wrote:

> As I understand the Unicode issue,

I'm not sure anyone fully understands Unicode, but there is a lot
more information at <http://www.unicode.org>.

> it means that any given character
> may consist of a byte stream of from 1 to many bytes. I don't know if
> the byte count is fixed or variable.

It depends on what encoding has been chosen. There are fixed-width
encodings (UCS-2 and UCS-4) but for data transfer purposes, UTF-8 is
most common because, among other reasons, it is byte order independent,
and yes, it does have varying width characters. XML documents are
assumed to be encoded in UTF-8 unless specified otherwise. Java class
files are stored in UTF-8 (but Java string classes store strings in
UCS-2).


> Using this bytestream, it is possible to encode most symbols and
> characters in most major languages.
>
> We need to map some of those characters to the ASCII that our
> applications know how to handle.

"that our applications know how to handle" is a huge qualification.
Which ASCII? All are pretty much in agreement about the 0-127 values,
but there are a large variety of so-called "upper ASCII"
implementations defined by vendors and the ISO standards committees
that determine how the 8th bit is interpreted. FWIW, a UTF-8-encoded
byte is also identical to an ASCII character up to and including the
7th bit. If your software is a set of VMS applications, there's a good
chance you are talking about the DEC Multinational Character Set or one
of the DEC national sets.

> So, there will be some bit pattern that identifies the "euro currency
> character" (I picked this because its probably not in the ASCII
> character set). When this is encountered, our ASCII-based software can
> make some internal 'notation' that the currency to follow is in Euros
> rather than in yen or pesos.
>
> Similar process for the little "degree mark" (small raised circle)
> that is a common part of european street addresses, as well as part
> of temperature specifications. Our application could note that we are
> talking about "degrees", or a floor in a particular building.
>
> We envision feeding this Unicode stream to VMS, where there will be
> some "transformation" done that will make Unicode understandable to a
> VMS application. at least, the transformation will translate the
> Unicode into some representation meaningful to our application.

Yep, but you'll have to define what "representation meaningful to our
application" really means. You may be talking about a simple character
conversion, where some Unicode character has an appropriate mapping to
a character in the local version of ASCII. But do you know that all
the characters you'll be getting have an appropriate equivalent in one
or another version of ASCII? Remember there are as many Unicode
characters as can be stored in a 32-bit integer, but each version of
ASCII has only as many characters as can be stored in an 8-bit integer.

Matters of language, font, and output device are separate from but
overlap with character set issues. If you are getting Unicode text
that may be either in French or Finnish, you may need to determine
which it is and convert to a different version of ASCII accordingly.
Whether the resulting text will display properly depends on the
capabilities of your output device. For example, someone has mentioned
that the euro symbol is now in DEC/hp's proprietary ASCII, and recent
versions of DECWindows have fonts that will display it properly, but
that doesn't necessarily mean it will display on a VT terminal or
terminal emulator.


> One way to do this, would be to build a "transform program" that has a
> big lookup table in it and provides Unicode-to-ASCII conversion for
> characters of interest and based partly on the context of the
> character within the message. This could be home-grown or commercial.
> You could even do this in hardware, using a couple of big EPROMS for
> the translation. Very fast.

Yikes. Don't reinvent the wheel if what you're doing is character
conversion. If you need to expand certain characters into some other
form of representation (such as converting the degree symbol into the
English word "degrees") then you may have to build your own translator.

> I took a cursory look at the VMS doc CD but didn't come up with
> anything startling.

Read chapter 10 of the CRTL manual entitled, "Developing International
Software". I'd give you a URL but they keep changing them so it's
hardly worth it. As others have mentioned, you have to install the
optional internationalization kit in order to get the UTF-8
conversions.

Since you have a recent OS version, I believe you have Java out of the
box, which includes all sorts of character set manipulation
capabilities, IIRC in the java.text namespace.

I have a soft spot for Perl. Perl 5.8 and later includes Unicode
support. See

$ perldoc perluniintro

or

http://www.perldoc.com/perl5.8.0/pod/perluniintro.html

You can do conversions with the bundled Encode module like so:

use Encode 'from_to';
from_to($data, "iso-8859-1", "utf-8");

Per Schröder

Feb 13, 2004, 10:49:40 AM

Bob Koehler wrote:

> In article <d28306e.04021...@posting.google.com>,
> denny...@swagelok.com (Denny Rich) writes:
>>
>> As I understand the Unicode issue, it means that any given character
>> may consist of a byte stream of from 1 to many bytes. I don't know if
>> the byte count is fixed or variable.
>

> Single byte Unicode maps to ASCII, and is the most common.

Unicode contains more than 70,000 characters. There are several different
encodings used, but the "variable byte encoding" is most likely UTF-8. A
property of UTF-8 is that all ASCII characters are encoded as-is (one byte)
in a UTF-8 file.

Note that ASCII is a 7-bit encoding and contains only 128 code points.

For more general info, see here:

http://www.cl.cam.ac.uk/~mgk25/unicode.html
http://www.unicode.org

>
>> Using this bytestream, it is possible to encode most symbols and
>> characters in most major languages.
>>
>> We need to map some of those characters to the ASCII that our
>> applications know how to handle.
>>
>> So, there will be some bit pattern that identifies the "euro currency
>> character" (I picked this because its probably not in the ASCII
>> character set). When this is encountered, our ASCII-based software can
>> make some internal 'notation' that the currency to follow is in Euros
>> rather than in yen or pesos.
>

> Euro is in ASCII now.

No, it is not. ASCII contains only 128 code points.
DEC Multinational was a character set that extended the 128 code points to
use all 256 code points available in an 8-bit byte.
The ISO 8859-1 (aka Latin-1) standard was developed later. It is very similar
to DEC Multinational and differs in only a few places.
When the euro symbol was added, the ISO 8859-15 character set was
created. It is similar to ISO 8859-1.

http://wwwwbs.cs.tu-berlin.de/user/czyborra/charsets/
http://www.cs.tut.fi/~jkorpela/latin9.html

Do you think all these character sets are becoming messy? Prepare to use
Unicode instead!


>>
>> I took a cursory look at the VMS doc CD but didn't come up with
>> anything startling.
>

> LIB$MOVTC is a good place to start, but was designed for single byte
> character sets. LIB$TRA_EBC_ASC and LIB$TRA_ASC_EBC use it.
>
> Of course, if you program in Macro-32 MOVTC is an instruction.

In your C program, use the type wchar_t to represent a single Unicode
code point. This datatype is 4 bytes long on OpenVMS. (Some people
(Java, Windows) got it all wrong and thought 2 bytes would do. Trying to
fit more than 70,000 code points into that becomes messy.)

On VMS, use UTF-8 as the external representation. On Windows, UTF-16 is
popular (normally 2 bytes per code point), but it can become messy on
OpenVMS, since VMS has its own ideas about how lines (records) are
separated in files. Use UTF-8 for the fewest surprises.

Use C RTL routines to convert between external (file) encodings and the
internal wchar_t format. Also use wide character string routines in C to
perform string operations.

http://h71000.www7.hp.com/commercial/c/docs/5763p016.html#widechar_sect
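
A minimal sketch of that flow; the locale name passed to setlocale() is
an assumption and must match whatever UTF-8 locale the installed
internationalization kit provides:

#include <stdio.h>
#include <stdlib.h>
#include <locale.h>

int main(void)
{
    /* The locale name is an assumption; it must name a UTF-8 codeset
       provided by the installed internationalization kit. */
    char utf8[] = "37\xc2\xb0" "C";     /* "37 degrees C" as external UTF-8 */
    wchar_t wide[32];
    size_t n;

    if (setlocale(LC_CTYPE, "en_US.utf-8") == NULL) {
        fputs("UTF-8 locale not available\n", stderr);
        return 1;
    }

    /* External UTF-8 bytes -> internal wchar_t code points */
    n = mbstowcs(wide, utf8, sizeof(wide) / sizeof(wide[0]));
    if (n == (size_t) -1) {
        fputs("conversion failed\n", stderr);
        return 1;
    }

    /* Each element now holds one code point; wide[2] should be the
       degree sign, U+00B0. */
    printf("%lu code points, third is U+%04lX\n",
           (unsigned long) n, (unsigned long) wide[2]);
    return 0;
}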

Regards
/Per Schröder
http://developer.mimer.se

Richard Brodie

Feb 13, 2004, 11:24:43 AM

"Per Schröder" <p...@mimer.se> wrote in message news:c0irbr$dbv$1...@yggdrasil.glocalnet.com...

> In your C program, use the type wchar_t to represent a single Unicode
> code point. This datatype is 4 bytes long on OpenVMS. (Some people
> (Java, Windows) got it all wrong and thought 2 bytes would do.)

That's a bit harsh. They use 2 bytes because that was one of the fundamental
design principles in the Unicode specification (number 1, as it happens).
Three major revisions of Unicode later, the landscape looks little different.

