I think I have found a bug in SYMENC^MXMLUTL for GT.M implementations (and possibly others).
Here is the code on my system. The header is as follows:
MXMLUTL ;mjk/alb - MXML Build Utilities ;12/11/2002 15:30
;;2.2;XML PROCESSING UTILITIES;;May 18, 2014;Build 11
SYMENC(STR) ; -- replace reserved xml symbols with their encoding.
N A,I,X,Y,Z,NEWSTR,QT
S (Y,Z)="",QT=""""
I STR["&" S NEWSTR=STR D  S STR=Y_Z
. F X=1:1 S Y=Y_$PIECE(NEWSTR,"&",X)_"&amp;",Z=$PIECE(STR,"&",X+1,999) Q:Z'["&"
I STR["<" F  S STR=$PIECE(STR,"<",1)_"&lt;"_$PIECE(STR,"<",2,99) Q:STR'["<"
I STR[">" F  S STR=$PIECE(STR,">",1)_"&gt;"_$PIECE(STR,">",2,99) Q:STR'[">"
I STR["'" F  S STR=$PIECE(STR,"'",1)_"&apos;"_$PIECE(STR,"'",2,99) Q:STR'["'"
I STR[QT F  S STR=$PIECE(STR,QT,1)_"&quot;"_$PIECE(STR,QT,2,99) Q:STR'[QT
;
F I=1:1:$L(STR) D
. S X=$E(STR,I)
. S A=$A(X)
. IF A<31 S STR=$P(STR,X,1)_$P(STR,X,2,99)
Q STR
;
Before I discuss the real bug, I'll mention a major concern about this function: it uses "99" as the last argument to $PIECE, on the assumption that the input string will never contain more than 99 instances of a disallowed character. This is a bad assumption. An entire HTML document is often stored in one string, and strings in GT.M can run to a megabyte or so, so an input with more than 99 instances is quite likely. When that happens, everything past the 99th piece is silently dropped.
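To illustrate the truncation, here is a minimal sketch. TRUNC99 is a hypothetical label of my own, not part of MXMLUTL, and the replacement string is assumed to be the usual &lt; entity. It runs the same $PIECE statement on a string of 150 "<" characters:

```mumps
TRUNC99 ; hypothetical demo label: effect of the hard-coded 99
 N I,STR
 ; build a string of 150 "<" characters
 S STR="" F I=1:1:150 S STR=STR_"<"
 ; same statement as in SYMENC (assuming the entity is &lt;)
 F  S STR=$PIECE(STR,"<",1)_"&lt;"_$PIECE(STR,"<",2,99) Q:STR'["<"
 ; count how many entities ended up in the result
 W "occurrences encoded: ",$L(STR,"&lt;")-1,!
 Q
```

By my reading this should report 98 rather than 150: the first pass keeps only pieces 1 through 99, so everything after that is gone before the loop gets to it.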
But the bug is the second part:
F I=1:1:$L(STR) D
. S X=$E(STR,I)
. S A=$A(X)
. IF A<31 S STR=$P(STR,X,1)_$P(STR,X,2,99)
Q STR
Notice this sample usage:
ASTRON>S STR="X"_$CHAR(13)_"X"
ASTRON>SET STR2=$$SYMENC^MXMLUTL(STR)
ASTRON>W $L(STR2)
0
ASTRON>
What happens, at least on GT.M, is that $L(STR) is evaluated only ONCE, when the FOR loop starts. (As far as I know, standard M evaluates the FOR limit once, so this is probably not GT.M-specific.) Once characters have been removed from the string, the index runs past the end of the string. At that point X="", A=-1 (which is less than 31), and $P(STR,"",1)="", so the entire string is lost.
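The three facts behind this can be checked in isolation. This is only a sketch, and FORLIM is a hypothetical label, not part of MXMLUTL:

```mumps
FORLIM ; hypothetical demo label, not part of MXMLUTL
 N I,N,STR
 S STR="ABCDE",N=0
 ; the limit $L(STR) is computed once (5), although STR shrinks inside the loop
 F I=1:1:$L(STR) S STR=$E(STR,1,$L(STR)-1),N=N+1
 W "iterations: ",N,!                              ; expect 5
 ; past the end of the string, $E returns "" and $A of "" is -1
 W "$A of empty string: ",$A(""),!                 ; expect -1
 ; $PIECE with an empty delimiter returns the empty string
 W "length of result: ",$L($P("XX","",1)),!        ; expect 0
 Q
```

If the limit were re-evaluated on each pass, the loop would stop after 3 iterations instead of 5.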
But even if $L(STR) were evaluated on every pass, this would still be bad programming. Let me rewrite the code to demonstrate. Here, to avoid the confusion of ASCII control characters, I am going to strip out "*" characters instead of characters with $ASCII()<31.
STRIP(STR) ;
N I
F I=1:1:$L(STR) D
. S X=$E(STR,I)
. I X="*" S STR=$P(STR,X,1)_$P(STR,X,2,99)
Q STR
;
Here is input and output:
ASTRON>SET STR="X**X**X**"
ASTRON>w $$STRIP^TMGFIX(STR)
XX*X**
ASTRON>
What is happening here is that the index (I) is incremented to the next position even though a character has just been removed, which moves the NEXT character into the current position. So after the increment, that next character is skipped. Note also that $P(STR,X,1)_$P(STR,X,2,99) removes the FIRST "*" in the string, which is not necessarily the one at position I.
This is how this example code should be rewritten:
STRIP2(STR) ;
N I SET I=1
F  Q:I>$L(STR) DO
. S X=$E(STR,I)
. I X'="*" S I=I+1 QUIT
. S STR=$E(STR,1,I-1)_$E(STR,I+1,$L(STR))
Q STR
;
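As a cross-check for STRIP2, the two-argument form of $TRANSLATE deletes every character listed in its second argument, so it should give the same result for this single-character case. STRIP3 is a hypothetical name, assumed to live in the same routine as STRIP2:

```mumps
STRIP3(STR) ; hypothetical cross-check for STRIP2: delete every "*"
 ; two-argument $TRANSLATE removes all characters found in the second argument
 Q $TR(STR,"*")
```

For the input above, both $$STRIP2("X**X**X**") and $$STRIP3("X**X**X**") should return XXX. This only works for plain deletion; it cannot do the %## encoding used in my replacement for SYMENC.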
And this is how I am going to replace all of it in my installation. As a bonus, I bet using $FIND and $EXTRACT will be faster than using $PIECE, because it avoids restarting the search from the beginning of the string each time.
SYMENC(STR) ; -- replace reserved xml symbols with their encoding.
S STR=$$REPLACE(STR,"&","&amp;")
S STR=$$REPLACE(STR,"<","&lt;")
S STR=$$REPLACE(STR,">","&gt;")
S STR=$$REPLACE(STR,"'","&apos;")
S STR=$$REPLACE(STR,"""","&quot;")
;
N I SET I=1
F  Q:I>$L(STR) DO
. N X S X=$E(STR,I)
. N A S A=$A(X)
. I A>31 S I=I+1 Q
. ;"S STR=$E(STR,1,I-1)_$E(STR,I+1,$L(STR)) Q <-- use if simple strip wanted
. S STR=$E(STR,1,I-1)_"%"_$$HEX(A)_$E(STR,I+1,$L(STR)) ;"<-- encodes as %## e.g. %0A
. S I=I+3
Q STR
;
REPLACE(STR,SRCHSTR,REPLSTR) ;//kt added
N SLEN,RLEN,POS S SLEN=$L(SRCHSTR),RLEN=$LENGTH(REPLSTR),POS=1
F  Q:POS=0 D
. S POS=$F(STR,SRCHSTR,POS) Q:POS=0
. S STR=$EXTRACT(STR,1,POS-SLEN-1)_REPLSTR_$EXTRACT(STR,POS,$LENGTH(STR))
. S POS=POS-SLEN+RLEN
Q STR
;
HEX(NUM) ;"supports only 0-FF
N D S D="0123456789ABCDEF"
Q $E(D,NUM\16+1)_$E(D,NUM#16+1)
;
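A quick way to exercise the replacement is sketched below. TESTENC is a hypothetical label, assumed to sit in the same routine as the new SYMENC, and the expected values assume the entity strings are &amp;, &lt;, &gt;, &apos; and &quot;:

```mumps
TESTENC ; hypothetical test label, same routine as the new SYMENC
 ; reserved symbols: expect a&lt;b &amp; &quot;c&quot;
 W $$SYMENC("a<b & ""c"""),!
 ; the carriage return that emptied the string before: expect X%0DX
 W $$SYMENC("X"_$C(13)_"X"),!
 ; length of that result: expect 5, not 0
 W $L($$SYMENC("X"_$C(13)_"X")),!
 Q
```

The "&" replacement has to stay first, otherwise the "&" in the other entities would be encoded a second time.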
Does anyone disagree with this?
Thanks
Kevin