Unicode, MUMPS encoding and do MUMPS characters stay in 0-255?


fmql

Dec 3, 2010, 1:52:56 AM
to Hardhats
I want to safely encode any and all MUMPS data as JSON. In a nutshell,
this means handling characters by code as follows:
- 32 to 126 as is
- except " and \, which are escaped with a backslash
- 0 to 31 as Unicode escapes (0 -> \u0000), except for a few like chr(10),
which is encoded as \n
- 127 to 255 (and beyond, if relevant) as Unicode escapes (127 -> \u007f, etc.)
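
As a reference for checking a MUMPS routine's output, here is a minimal Python sketch of the rules above. The name json_escape is hypothetical. One detail the list leaves open: a JSON \u escape carries a 16-bit UTF-16 code unit, so a code point above 65535 has to be written as a surrogate pair.

```python
# Short escapes for the control characters JSON names, plus " and \.
SHORT_ESCAPES = {8: "\\b", 9: "\\t", 10: "\\n", 12: "\\f", 13: "\\r",
                 34: '\\"', 92: "\\\\"}

def json_escape(text):
    """Return the body of a JSON string literal (without the outer quotes)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp in SHORT_ESCAPES:
            out.append(SHORT_ESCAPES[cp])
        elif 32 <= cp <= 126:
            out.append(ch)                      # printable ASCII as is
        elif cp <= 0xFFFF:
            out.append("\\u%04x" % cp)          # four hex digits, zero-padded
        else:
            # Beyond the 16-bit range: split into a UTF-16 surrogate pair.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            out.append("\\u%04x\\u%04x" % (high, low))
    return "".join(out)

print(json_escape('say "hi"\n' + chr(127) + chr(233)))
```

Only the last branch goes beyond the 0-255 range in the question; it matters only if the data can hold code points above 65535.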

Two questions:
- Is there a nice, succinct way to match characters in MUMPS, so I can
avoid a big mapping array à la ...

JESCCHRS(127)="u007f",JESCCHRS(128)="u0080",JESCCHRS(129)="u0081" ...
- Am I safe in thinking that I only need to handle 0-255, and that
chr(256) onwards will NEVER appear in a MUMPS array? That seems to be
true in my GT.M: after "S X=$C(256 ...)", X == "".

Thx,
Conor

kdt...@gmail.com

Dec 3, 2010, 7:44:21 AM
to Hardhats
I don't think that is true. For example, I think that Jordan is using
characters 128-255 for locale-specific characters.

Kevin

Bhaskar, K.S

Dec 3, 2010, 9:57:47 AM
to hard...@googlegroups.com, fmql
[KSB] In M mode, GT.M will never report a character as having a $char() value greater than 255.  In UTF-8 mode, it will.  For example:

GTM>write $zchset
UTF-8
GTM>For i=1040:16:1072 Write ! For j=0:1:15 Write $Char(i+j)," "

А Б В Г Д Е Ж З И Й К Л М Н О П
Р С Т У Ф Х Ц Ч Ш Щ Ъ Ы Ь Э Ю Я
а б в г д е ж з и й к л м н о п
GTM>

You can use the ZWRITE command to dump strings containing unprintable characters in a portable, printable form to the current device.

Regards
-- Bhaskar


--
http://groups.google.com/group/Hardhats
To unsubscribe, send email to Hardhats+u...@googlegroups.com


-- 
GT.M - Rock solid. Lightning fast. Secure. No compromises.

David Whitten

Dec 3, 2010, 11:43:19 AM
to hard...@googlegroups.com, fmql
I believe that creates strings that are MUMPS-portable, though, i.e. able to be written from one MUMPS system and read into another. It produces what you would have to type as an expression in a MUMPS program to represent that string. (Programs must be written in M mode.)

For example
GTM>s X=$C(27) ZWRITE X
X=$C(27)

GTM>s X="hello"_$C(10+3)_"world"_$C(10) ZWRITE X
X="hello"_$C(13)_"world"_$C(10)

Is there some function, probably a Z function, that produces an ASCII string representing the UTF-8 string for a value (suitable for use in a JSON representation)?

David

Bhaskar, K.S

Dec 3, 2010, 12:04:49 PM
to hard...@googlegroups.com

[KSB2] No, there is no such built-in function. I'd suggest opening a
pipe to another GT.M process, operating in M mode, to which you pass the
string and have it zwrite the string back.

However, GT.M will tolerate printable non-ASCII characters in string
literals and in comments in M code:

GTM>set x="ϨϩϪϫϬϭϮϯϰϱϲϳϴϵ϶Ϸ"

GTM>zwr x
x="ϨϩϪϫϬϭϮϯϰϱϲϳϴϵ϶Ϸ"

GTM>write "Hello" ; ;zero;eins;deux;tres;quattro;пять;ستة;सात;捌;ஒன்பது
Hello
GTM>

Regards
-- Bhaskar

fmql

Dec 3, 2010, 1:05:42 PM
to Hardhats
Thanks all. Just to clarify and fill in some blanks ...

1) Is there a portable way for a program to know the character set of
its VM? I see $ZCHSET ($ZCHSET="M" or $ZCHSET="UTF-8") is in GT.M, but
this "Z" variable doesn't seem to be in Cache.

2) What is the difference between the character sets "ASCII" and "M"?
Is "M" Latin-1? It certainly seems to decode as Latin-1 (though Sam
insists it's not!)

3) Can someone with character set UTF-8 tell me if $ASCII, despite its
name, works on characters after 255? Is X=$A($C(X)) when X > 255? I get
-1 from $A($C(X)) if X > 255 on my systems, but that's because they are "M".

On a UTF-8 system, does the (misnamed?) $ASCII keep working à la "ord" in
Python? Python has "chr", à la "$C", and "ord", its opposite, which is
equivalent to "$A" for the ASCII characters. What happens when you go
to UTF-8? Does $A still act like "ord"?

4) Then on serializing the characters from 128 on into the form \u#### ...
If the answer to 3 is yes ($A keeps working on UTF-8 systems), then I
just need I $A(STRVAL,I)>127 S SERIALVALUE="\u"_NEXTHEX. (BTW, I'm
right in thinking there is no formal support for hex numbers in MUMPS,
right?)

Thanks again,
Conor

Bhaskar, K.S

Dec 3, 2010, 1:50:39 PM
to hard...@googlegroups.com

On 12/03/2010 01:05 PM, fmql wrote:
> Thanks all. Just to clarify, fill in some blanks ...
>
> 1) is there a portable way for a program to know the character set of
> its VM? I see zchset ($ZCHSET="M" or $ZCHSET="UTF-8" ) is in GT/M but
> this "Z" variable doesn't seem to be in Cache?

[KSB3] I can't speak for other MUMPS implementations, but one sneaky
possibility might be $l($c(256)):

GTM>write $zchset
M
GTM>w $length($c(255))
1
GTM>w $length($c(256))
0
GTM>

and

GTM>write $zchset
UTF-8
GTM>w $length($c(255))
1
GTM>w $length($c(256))
1
GTM>

> 2) what is the difference between the character sets "ASCII" and "M".
> Is "M" Latin-1? It certainly seems to decode as latin-1 (though Sam
> insists it's not!)

[KSB] ASCII and the official MUMPS character sets only ascribe meanings
to $c(0) through $c(127). In M mode $c(128) through $c(255) are simply
byte values whose meaning is whatever the application chooses to use for
them. Encodings in the ISO-8859 family are commonly used by applications
running on GT.M. But the encodings for different languages for $c(128)
through $c(255) are different.

I hope this clarifies rather than confuses!
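
To illustrate why bytes 128-255 have no fixed meaning in M mode, the Python sketch below decodes one byte under three ISO-8859 encodings. The byte value 0xE4 and the three encodings are arbitrary examples, and the helper name code_points is hypothetical. The practical consequence for JSON is that the \u#### escape for such a byte depends on which encoding the application assumes.

```python
# One byte, three ISO-8859 encodings, three different characters.
RAW = bytes([0xE4])

def code_points(raw, encodings=("iso-8859-1", "iso-8859-5", "iso-8859-7")):
    """Map each encoding name to the Unicode code point it assigns to raw."""
    return {enc: ord(raw.decode(enc)) for enc in encodings}

for enc, cp in code_points(RAW).items():
    print(enc, "U+%04X" % cp, chr(cp))
```

Only under ISO-8859-1 (Latin-1) does the byte value equal the Unicode code point, which is why M-mode data can appear to "decode as Latin-1" without actually being Latin-1.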

> 3) Can someone with character set UTF-8 tell me if $ASCII, despite its
> name, works on characters after 255? Is X=$A($C(X) when X > 255? I get
> -1 for X if X > 255 in my systems but that's because they are "M".

[KSB] Yes, although the name $ASCII() is something of a misnomer, we
retained it for upward compatibility when we added Unicode support.

GTM>w $ascii("ϵ")
1013
GTM>w $ascii($char(999))
999
GTM>
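
For comparison with the Python functions mentioned in the question, the same two checks look like this with ord and chr. The helper round_trips and its default limit are hypothetical, chosen only to exercise values beyond 255.

```python
# ord() plays the role of $ASCII and chr() the role of $CHAR.
assert ord("\u03f5") == 1013       # the character from the $ascii example
assert ord(chr(999)) == 999        # round trip for a value beyond 255

def round_trips(limit=0x3000):
    """True if ord(chr(n)) == n for every n below limit."""
    return all(ord(chr(n)) == n for n in range(limit))

print(round_trips())
```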

> On a utf-8, does the (misnamed?) ASCII keeps working ala "ord" in
> python? Python has "chr" ala "$c" and "ord", its opposite, which is
> equivalent to "$a" for the ascii characters. What happens when you go
> to utf-8? Does $a still act like "ord"?
>
> 4) then on serializing the utf for 128 on into the form \u#### ...
> If the answer to 3 is yes - $a keeps working on utf-8 systems - then I
> just need I $A(STRVAL(I)) > 128 S SERIALVALUE="\u"_NEXTHEX. (BTW, I'm
> right in thinking there is no formal support for hex numbers in MUMPS,
> right?)

[KSB] I am sorry to have to take away this excuse for you to write code
(I personally find programming to be therapeutic)! 8-)

The utility functions %UTF2HEX and %HEX2UTF are described in the Unicode
support technical bulletin. For all GT.M documentation, go to
http://fis-gtm.com and click on the User Documentation tab. Source code
is included in the distribution if you want to modify them.

Regards
-- Bhaskar


David Whitten

Dec 3, 2010, 2:24:59 PM
to hard...@googlegroups.com
On Fri, Dec 3, 2010 at 12:05 PM, fmql <care...@gmail.com> wrote:
> Thanks all. Just to clarify, fill in some blanks ...
>
> 1) is there a portable way for a program to know the character set of
> its VM? I see zchset ($ZCHSET="M" or $ZCHSET="UTF-8" ) is in GT/M but
> this "Z" variable doesn't seem to be in Cache?

Character sets have been implemented in different ways for different M implementations.
Bhaskar has already talked about GT.M.
Cache seems to focus on character sets in the context of Cache Server Pages and XML.
This link shows what I have found:
http://vista.intersystems.com/csp/docbook/DocBook.UI.SearchBM.cls?CurrPage=1&KeyWord=UTF-&Search=Search&Type=Word&Include=All&BkFilter=&TpFilter=
 
> 2) what is the difference between the character sets "ASCII" and "M".
> Is "M" Latin-1? It certainly seems to decode as latin-1 (though Sam
> insists it's not!)

"M" is the name used in the standard, and I think it only applies to the first 128 possible values stored in a byte (0 through 127), which is the true ASCII sequence; ASCII does not apply when the highest (8th) bit is 1.
 
> 3) Can someone with character set UTF-8 tell me if $ASCII, despite its
> name, works on characters after 255? Is X=$A($C(X) when X > 255? I get
> -1 for X if X > 255 in my systems but that's because they are "M".

Again, this is an implementation-specific issue.
 
> On a utf-8, does the (misnamed?) ASCII keeps working ala "ord" in
> python? Python has "chr" ala "$c" and "ord", its opposite, which is
> equivalent to "$a" for the ascii characters. What happens when you go
> to utf-8? Does $a still act like "ord"?
>
> 4) then on serializing the utf for 128 on into the form \u#### ...
> If the answer to 3 is yes - $a keeps working on utf-8 systems - then I
> just need I $A(STRVAL(I)) > 128 S SERIALVALUE="\u"_NEXTHEX. (BTW, I'm
> right in thinking there is no formal support for hex numbers in MUMPS,
> right?)

MUMPS does not support hex notation for numbers. Realistically, if it had provided
support for a non-decimal notation for numbers, it would have had octal, as that was
the most common at the time on PDP machinery.  At one time there were discussions
about standardizing bit-streams but that isn't relevant here, I guess.
 
> Thanks again,
> Conor

By the way, are you participating remotely in the VistA Technical Meeting this weekend ?

David
713-870-3834

 

fmql

Dec 4, 2010, 3:16:29 AM
to Hardhats
So using Bhaskar's function, something like this gives JSON from an M
string, and I think beyond too, when UTF-8 is the default character
set. Well, it should work on GT.M. On Cache it removes control
characters and characters after 127, so it's valid JSON but ...

JSONSTRING(MSTR)
 N JSTR S JSTR=""
 N I F I=1:1:$L(MSTR) D
 . N NC S NC=$E(MSTR,I)
 . N CD S CD=$A(NC) Q:CD'=+CD  ; being careful with $A. Don't know what Cache does when CD > 255
 . ; \b,\t,\n,\f,\r get special handling; " and \ are escaped with \; 32 to 126 are themselves; all else is 4 hex unicode.
 . S JSTR=JSTR_$S(CD=8:"\b",CD=9:"\t",CD=10:"\n",CD=12:"\f",CD=13:"\r",CD=34:$C(92)_$C(34),CD=92:$C(92)_$C(92),(CD>31&(CD<127)):NC,$L($T(FUNC^%UTF2HEX)):"\u00"_$$FUNC^%UTF2HEX(CD),1:"")
 Q JSTR

One shortcoming I see is that for bigger Unicode values the hex string has
more than two digits, so prefixing a fixed \u00 makes the escape too long.
It's late ... what's the fastest, most succinct way to 0-pad a MUMPS string?

Conor


David Whitten

Dec 4, 2010, 9:03:46 AM
to hard...@googlegroups.com
The easiest way to zero-pad a number (zeros at the beginning of the string) is:  $TR($J(val,length)," ",0)

I.e., the two-argument $JUSTIFY right-justifies the string, padding it with spaces on the left to the length specified,
and the $TRANSLATE then converts those spaces to zeros.
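
The same pad-then-translate idea, sketched in Python for comparison (the helper name zero_pad is hypothetical). Like $JUSTIFY, str.rjust never truncates, so a value already longer than the requested length comes back unchanged; and as with $TRANSLATE, any space inside the value itself would also become a zero.

```python
def zero_pad(val, length):
    """Right-justify with spaces, then turn the spaces into zeros."""
    return str(val).rjust(length).replace(" ", "0")

print(zero_pad("E9", 4))      # a two-digit hex value padded to four digits
print(zero_pad("1F600", 4))   # already longer than four: left unchanged
```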

David

fmql

Dec 5, 2010, 1:38:01 AM
to Hardhats
Thanks Dave for the $J tip. That, one fix and Cache's $ZHEX does it, I
think ...

JSONSTRING(MSTR)
 N JSTR S JSTR=""
 N I F I=1:1:$L(MSTR) D
 . N NC S NC=$E(MSTR,I)
 . N CD S CD=$A(NC) Q:CD=""  ; Check "" though GT/M and Cache say $A works for all unicode
 . ; \b,\t,\n,\f,\r separated - ",\ escaped with \ - 32 to 126 themselves; others 4 hex unicode.
 . S JSTR=JSTR_$S(CD=8:"\b",CD=9:"\t",CD=10:"\n",CD=12:"\f",CD=13:"\r",CD=34:$C(92)_$C(34),CD=92:$C(92)_$C(92),(CD>31&(CD<127)):NC,$L($T(FUNC^%UTF2HEX)):"\u"_$TR($J($$FUNC^%UTF2HEX(NC),4)," ","0"),1:"\u"_$TR($J($ZHEX(CD),4)," ","0"))
 Q JSTR

It may not be the most efficient, and it hasn't been tested on UTF-8 VMs, but it
works on 0-255 on M VMs.
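
One way to check a routine like this is to compare its output for each code against an independent encoder. The sketch below uses Python's standard json module to produce the expected escape for codes 0-255, treating each code as the Unicode code point of the same number (an assumption that matches Latin-1, not necessarily the application's encoding). The helper name expected_escapes is hypothetical. json.dumps writes hex digits in lowercase, while a hex utility may return uppercase; JSON accepts both, so the hex digits should be compared case-insensitively. On a UTF-8 VM it is also worth confirming that the hex value being padded is the code point (what $ASCII returns) rather than the hex of the character's UTF-8 bytes.

```python
import json

def expected_escapes(codes=range(256)):
    """Map each code to the JSON string body that json.dumps emits for it."""
    # json.dumps wraps its result in double quotes; strip them off.
    return {cp: json.dumps(chr(cp))[1:-1] for cp in codes}

table = expected_escapes()
for cp in (9, 34, 65, 127, 233):
    print(cp, table[cp])
```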

Conor