Major BUGs in SYMENC^MXMLUTL

163 views
Skip to first unread message

Kevin Toppenberg

unread,
May 23, 2018, 3:20:19 PM5/23/18
to Hardhats
I think I have found a bugs in SYMENC^MXMLUTL for GT.M implementations (and possibly others):

Here is the code on my system.  The header is as follows:

MXMLUTL ;mjk/alb - MXML Build Utilities ;12/11/2002  15:30                                             
;;2.2;XML PROCESSING UTILITIES;;May 18, 2014;Build 11

SYMENC(STR) ; -- replace reserved xml symbols with their encoding.
  N A,I,X,Y,Z,NEWSTR,QT
 S (Y,Z)="",QT=""""
 I STR["&" S NEWSTR=STR D  S STR=Y_Z
 . F X=1:1  S Y=Y_$PIECE(NEWSTR,"&",X)_"&",Z=$PIECE(STR,"&",X+1,999) Q:Z'["&"
 I STR["<" F  S STR=$PIECE(STR,"<",1)_"&lt;"_$PIECE(STR,"<",2,99) Q:STR'["<"   
 I STR[">" F  S STR=$PIECE(STR,">",1)_"&gt;"_$PIECE(STR,">",2,99) Q:STR'[">"   
 I STR["'" F  S STR=$PIECE(STR,"'",1)_"&apos;"_$PIECE(STR,"'",2,99) Q:STR'["'" 
 I STR[QT F  S STR=$PIECE(STR,QT,1)_"&quot;"_$PIECE(STR,QT,2,99) Q:STR'[QT     
 ;
 F I=1:1:$L(STR) D     
 . S X=$E(STR,I)       
 . S A=$A(X)           
 . IF A<31 S STR=$P(STR,X,1)_$P(STR,X,2,99)
 Q STR                                     
 ;                                         

Before I discuss the real bug, I'll mention major concerns about this function: It is using "99" in the piece command, with the idea that there would never be more than 99 instances of a disallowed character in the input string.   This is a bad assumption.  Often an entire HTML document can be stored in 1 string. And strings in GT.M can be 1-2 mb or longer.  So it is very likely that there would be more than 99 instances.  

But the bug is the second part:

 F I=1:1:$L(STR) D     
 . S X=$E(STR,I)       
 . S A=$A(X)           
 . IF A<31 S STR=$P(STR,X,1)_$P(STR,X,2,99)
 Q STR 

Notice this sample usage:

ASTRON>S STR="X"_$CHAR(13)_"X"                                                                          ASTRON>SET STR2=$$SYMENC^MXMLUTL(STR)                                                                    
ASTRON>W $L(STR2)
0
ASTRON>

What happens, at least on GT.M, is that the $L(STR) is only called ONCE, when running through the FOR loop.  Thus, when the string has characters removed, the index extends beyond end of the string, and then X="", A=-1, and $P(STR,"",1)="", and the entire string is lost.

But even if $L(STR) was evaluated every time, this is still bad programming.  Let me rewrite the code to demonstrate.  Here, to avoid the confusion of ASCII control characters, I am going to strip out "*" chars instead of those for $ASCII()<31.

STRIP(STR) ;
N I
F I=1:1:$L(STR) D
. S X=$E(STR,I)
. I X="*" S STR=$P(STR,X,1)_$P(STR,X,2,99)
Q STR
;
  
Here is input and output: 
SET STR="X**X**X**"
ASTRON>w $$STRIP^TMGFIX(STR)
XX*X**
ASTRON>

What is happening here is that the index (I) is being increased to the next character, even though the current character has been removed at the index position, which moves the NEXT character to the current position.  So when increased, it skips over this next character. 

This is how this example code should be rewritten

STRIP2(STR) ;
 N I SET I=1
 F  Q:I>$L(STR)  DO
 . S X=$E(STR,I)         
 . I X'="*" S I=I+1 QUIT
 . S STR=$E(STR,1,I-1)_$E(STR,I+1,$L(STR))
 Q STR                                     
 ;
 
And this is how I am going to replace all of it in my installation.  As a bonus, I bet using $FIND and $EXTRACT will be faster than using $PIECE as it prevents having to start the search over again at the start of the string.

SYMENC(STR) ; -- replace reserved xml symbols with their encoding.
  S STR=$$REPLACE(STR,"&","&amp;")
  S STR=$$REPLACE(STR,"<","&lt;")
  S STR=$$REPLACE(STR,">","&gt;")
  S STR=$$REPLACE(STR,"'","&apos;")
  S STR=$$REPLACE(STR,"""","&quot;")
  ;
  N I SET I=1
  F  Q:I>$L(STR)  DO
. N X S X=$E(STR,I)
. N A S A=$A(X)
  . I A>31 S I=I+1 Q
  . ;"S STR=$E(STR,1,I-1)_$E(STR,I+1,$L(STR)) Q <-- use if simple strip wanted
  . S STR=$E(STR,1,I-1)_"%"_$$HEX(A)_$E(STR,I+1,$L(STR))  ;"<-- encodes as %## e.g. %0A 
  . S I=I+3
  Q STR                                     
 ;
REPLACE(STR,SRCHSTR,REPLSTR) ;//kt added
  N SLEN,RLEN,POS S SLEN=$L(SRCHSTR),RLEN=$LENGTH(REPLSTR),POS=1
  F  Q:POS=0  D
  . S POS=$F(STR,SRCHSTR,POS) Q:POS=0
  . S STR=$EXTRACT(STR,1,POS-SLEN-1)_REPLSTR_$EXTRACT(STR,POS,$LENGTH(STR))
  . S POS=POS-SLEN+RLEN
  Q STR
  ;
HEX(NUM) ;"supports only 0-FF
  N D S D="0123456789ABCDEF"
  Q $E(D,NUM\16+1)_$E(D,NUM#16+1)
  ;


Does anyone disagree with this?


Thanks
Kevin

Kevin Toppenberg

unread,
May 23, 2018, 3:26:41 PM5/23/18
to Hardhats
A slight further improvement of efficiency changes the code to only measure the length of the string if it has changed.  

STRIP2(STR) ;
 N I S I=1
 N LEN S LEN=$L(STR)
 F  Q:I>LEN  DO
 . S X=$E(STR,I)         
 . I X'="*" S I=I+1 QUIT
 . S STR=$E(STR,1,I-1)_$E(STR,I+1,$L(STR))
 . S LEN=$L(STR)
 Q STR                                     
 ;
 

Sam Habiel

unread,
May 23, 2018, 3:42:42 PM5/23/18
to hardhats
Kevin,

Can you install my Enhanced XML utilities which fixes a decent amount
of bugs and retest?

Here's a link to the KIDS file:
https://github.com/shabiel/VISTA-xml-processing-utilities/releases/download/2.5/XML_PROCESSING_UTILITIES_2P5.KID.

I honestly didn't read your email. If you still find that there is a
bug, I will address it and make a new KIDS build.

--Sam
> --
> --
> http://groups.google.com/group/Hardhats
> To unsubscribe, send email to Hardhats+u...@googlegroups.com
>
> ---
> You received this message because you are subscribed to the Google Groups
> "Hardhats" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to hardhats+u...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Kevin Toppenberg

unread,
May 23, 2018, 8:26:07 PM5/23/18
to Hardhats
Sam

I spent about 2 hrs fixing this bug.  The least you could do would be to read the code for 5 minutes.  If I install your kids patch, it may well overwrite my fixes.  And I don't want to take the time to do that right now.  I'm up to my neck in a very difficult bug fix in my own code.  I was frustrated to have to stop and fix the MXMLUTL stuff. 

Kevin

Kevin Toppenberg

unread,
May 23, 2018, 8:57:17 PM5/23/18
to Hardhats
Do you have your version of MXMLUTL somewhere online so I could view it directly?  I could tell if it was fixed that way.

Kevin

Kevin Toppenberg

unread,
May 23, 2018, 8:59:53 PM5/23/18
to Hardhats
I used an editor to view the KIDS file directly.  "RTN","MXMLUTL" seems to have this same bug.

Kevin

Sam Habiel

unread,
May 24, 2018, 9:33:35 AM5/24/18
to hardhats
Kevin,

I spoke to you yesterday and you showed me that there is a new bug. So
I will add that to the next build.

--Sam

Ed de Moel

unread,
May 24, 2018, 1:26:53 PM5/24/18
to Hardhats
Looking at the code segment

 F I=1:1:$L(STR) D     
 . S X=$E(STR,I)       
 . S A=$A(X)           
 . IF A<31 S STR=$P(STR,X,1)_$P(STR,X,2,99)
 Q STR

I'd code that as

 New controls,ii
 Set controls="" For ii=0:1:31,127 Set controls=controls_$Char(ii)
 Quit $Translate(STR,controls)

(Of course, if the input string is encoded in UTF-7 of UTF-8, you'd first want to convert any embedded escape sequences, otherwise this way of getting rid of control-characters may end up quite destructive.)

Ed de Moel

unread,
May 24, 2018, 1:46:13 PM5/24/18
to Hardhats
... and, indeed: I added $Char(31) and $Char(127) as control-characters...

Kevin Toppenberg

unread,
May 27, 2018, 2:00:04 PM5/27/18
to Hardhats
How does that handle other characters < #32?

Kevin

Kevin Toppenberg

unread,
May 27, 2018, 2:05:37 PM5/27/18
to Hardhats
Sorry, I wasn't reading the code carefully enough.  It does handle all the chars (and #127).

But for underlying efficiency, my code looks at each position in the string once, where as doing a $translate will scan the entire string for each possible control character.  If one has a long string (e.g. 1 mb in case of HTML documents), this might be a factor.  I guess it would be a comparison between assembly level implementation of $translate vs char-by-char in mumps.

Kevin

Sam Habiel

unread,
May 30, 2018, 12:01:17 PM5/30/18
to hardhats
I fixed this bug and made a new release. I chose Ed's implementation over yours: it will be much faster as $TR is implemented in C/assembly not in M.


--Sam



---
You received this message because you are subscribed to the Google Groups "Hardhats" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hardhats+unsubscribe@googlegroups.com.

Kevin Toppenberg

unread,
May 30, 2018, 2:06:57 PM5/30/18
to Hardhats
What about the first part of the function with using max of 99 pieces?

Kevin


On Wednesday, May 30, 2018 at 12:01:17 PM UTC-4, Sam Habiel wrote:
I fixed this bug and made a new release. I chose Ed's implementation over yours: it will be much faster as $TR is implemented in C/assembly not in M.


--Sam
On Sun, May 27, 2018 at 2:05 PM, Kevin Toppenberg <kdt...@gmail.com> wrote:
Sorry, I wasn't reading the code carefully enough.  It does handle all the chars (and #127).

But for underlying efficiency, my code looks at each position in the string once, where as doing a $translate will scan the entire string for each possible control character.  If one has a long string (e.g. 1 mb in case of HTML documents), this might be a factor.  I guess it would be a comparison between assembly level implementation of $translate vs char-by-char in mumps.

Kevin


On Sunday, May 27, 2018 at 2:00:04 PM UTC-4, Kevin Toppenberg wrote:
How does that handle other characters < #32?

Kevin

On Thursday, May 24, 2018 at 1:46:13 PM UTC-4, Ed de Moel wrote:
... and, indeed: I added $Char(31) and $Char(127) as control-characters...


---
You received this message because you are subscribed to the Google Groups "Hardhats" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hardhats+u...@googlegroups.com.

Sam Habiel

unread,
May 30, 2018, 2:43:18 PM5/30/18
to hardhats
99 -> 99999.

Hope that's enough.

--Sam



---
You received this message because you are subscribed to the Google Groups "Hardhats" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hardhats+unsubscribe@googlegroups.com.

Kevin Toppenberg

unread,
May 31, 2018, 9:44:10 PM5/31/18
to Hardhats
That is probably enough.  

Thanks
Kevin

Wally Fort

unread,
Jun 1, 2018, 12:50:45 PM6/1/18
to Hardhats
Most of this code was written when strings were limited to 255.

To make Ed's strip even shorter.
STRIP(STR)
Q $TR(STR,$C(1,2,3,4,5,6,7,8,9,10...32,127))
Reply all
Reply to author
Forward
0 new messages