I want to describe our problems and solutions with UTF-8 performance. Maybe this will be useful for someone else in the community.
### The problem
We started experiencing performance problems when large messages were sent to and from our M application. Our application builds and parses strings containing the full JSON data of network messages.
### Incorrect theory
At first we thought the problem was the string building itself. Our code concatenates strings in the conventional M manner:
set result=result_stringForTheCurrentIteration
We thought this resulted in O(n^2) execution time in the length of the string.
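In loop form the pattern looks roughly like this (a simplified sketch; `$$getNextPiece` is a hypothetical stand-in for our real message-assembly code), which is why a quadratic copy-on-append cost seemed plausible:

 new result,piece
 set result=""
 ; keep appending the next piece to the end of the growing result string
 for  set piece=$$getNextPiece() quit:piece=""  set result=result_piece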
### Discoveries
However, experiments made us realise that the string building code was not the problem after all. We noticed that our code was only slow when the input messages contained non-ASCII characters!
We also learned more about M performance from this extremely interesting post by Bhaskar:
https://groups.google.com/g/comp.lang.mumps/c/MSVKLt0X6R4/m/zqBx52MTAgAJ
### Correct diagnosis
Instead, we figured out that it is the string manipulation routines in M that are slow for large non-ASCII strings.
The following code has O(n) performance, where n is the value of `index` (bounded by the length of the string):
s ch=$extract(largeString,index)
This is not surprising. Characters in a UTF-8 encoded string have a variable byte length: ASCII characters are 1 byte, other characters consist of 2-4 bytes. To find the character at a particular index, the implementation of `$extract` has to start from the beginning of the string and traverse all the preceding characters to find the byte position of the one it should return.
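As an illustration, a character-by-character scan like the following sketch therefore ends up costing O(n^2) for a large non-ASCII string, because each `$extract` call is itself O(i):

 new i,ch
 ; visiting every character this way re-traverses the string on each call
 for i=1:1:$length(largeString) do
 . set ch=$extract(largeString,i) ; O(i) in UTF-8 mode
 . ; ... process ch ...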
GT.M seems to have an optimisation so that if a string consists of only ASCII characters then `$extract` can fetch characters in O(1) time.
### Solution
Our solution is to switch from `$extract` and `$length` to `$zextract` and `$zlength` for string manipulation. The Z variants ignore UTF-8 and treat strings as sequences of bytes. Because of that `$zextract` works in O(1) time in all cases.
The complication with this solution is that we have to write code to handle multi-byte characters ourselves. Fortunately this turned out to be pretty simple in our case.
Every byte of a multi-byte UTF-8 character has a value of 128 or larger, while a 1-byte character has a value of 127 or smaller. Because of this it is easy to distinguish them.
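For example, the character "é" is encoded as the two bytes 195 and 169 (hex C3 A9), both above 127. A quick illustration (assuming the process runs in UTF-8 mode):

 set s="é"
 write $length(s),! ; 1 - one character
 write $zlength(s),! ; 2 - two bytes
 write $ascii(s,1),! ; 233 - the Unicode code point of é
 write $zascii(s,1),! ; 195 - first byte, above 127
 write $zascii(s,2),! ; 169 - second byte, above 127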
Have a look at the Wikipedia page for an explanation:
https://en.wikipedia.org/wiki/UTF-8#Encoding
In our case we have to examine the 1-byte characters to generate correct JSON. The multi-byte characters, however, can simply be copied byte by byte to the output.
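To show the shape of the approach, here is a minimal sketch (not our production code) of a byte-wise escaper for JSON string values, built on `$zlength`, `$zextract` and `$zascii`; the label name and the set of escapes it handles are illustrative only:

escape(in) ; sketch: escape a UTF-8 string for use inside a JSON string literal
 new out,i,byte,code
 set out=""
 for i=1:1:$zlength(in) do
 . set byte=$zextract(in,i),code=$zascii(byte)
 . ; every byte of a multi-byte UTF-8 character is 128 or larger: copy it unchanged
 . if code>127 set out=out_byte quit
 . ; 1-byte (ASCII) characters may need escaping for JSON
 . if byte="""" set out=out_"\""" quit
 . if byte="\" set out=out_"\\" quit
 . if code=10 set out=out_"\n" quit
 . if code=9 set out=out_"\t" quit
 . ; (other control characters below 32 would need \u escapes; omitted for brevity)
 . set out=out_byte
 quit out

Note that the output is still built with plain concatenation, which, as discussed above, turned out not to be a bottleneck.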
In this way we have obtained O(n) execution time for our JSON generation and parsing routines.