Adventures with UTF-8 performance

125 views
Skip to first unread message

Jens Lideström

unread,
Nov 1, 2022, 5:31:45 AM11/1/22
to
I want to describe our problems and solutions with UTF-8 performance. Maybe this will be useful for someone else in the community.

### The problem

We started experiencing performance problems when large messages were sent to and from our M application. Our application builds and parses strings containing the full JSON data of network messages.

### Incorrect theory

At first we though the problem was the string building itself. Our code concatenates strings in the conventional M manner:

set result=result_stringForTheCurrentIteration

We though this resulted in a O(n^2) execution time in the length of the string.

### Discoveries

However, experiments made us realise that the string building code was indeed not the problem. We noted that our code was only slow when the input messages contained non-ASCII characters!

We also learned more about M performance from this extremely interesting post by Bhaskar:

https://groups.google.com/g/comp.lang.mumps/c/MSVKLt0X6R4/m/zqBx52MTAgAJ

### Correct diagnosis

Instead, we figured out that it is the string manipulation routines in M that are slow for large non-ASCII strings.

The following code will have O(n) performance in the length of the string (and the value of `index`):

s ch=$extract(largeString,index)

This is not surprising. Characters in a UTF-8 encoded string have a variable byte length: ASCII characters are 1 byte, other characters consists of 2-4 bytes. To find the character at a particular index the implementation of `$extract` has to start from the beginning and traverse all the characters to find the byte position of the one that it should return.

GT.M seems to have an optimisation so that if a string consists of only ASCII characters then `$extract` can fetch characters in O(1) time.

### Solution

Our solution is to switch from `$extract` and `$length` and instead use `$zextract` and `$zlength` for string manipulation. The Z variants ignore UTF-8 and treats strings as sequences of bytes. Because of that `$extract` can work in O(1) time in all cases.

The complication with this solution is that we have write code to handle multi-byte characters ourselves. Fortunately this turned out to be pretty simple in our case.

All bytes of multi-byte UTF-8 characters have a value that is 128 or larger, while a 1-byte character has a value that is 127 or smaller. Because of this it is easy to distinguish them.

Have a look at the Wikipedia page for an explanation: https://en.wikipedia.org/wiki/UTF-8#Encoding

In our case we have to examine the 1-byte characters to generate correct JSON. The multi-byte characters however can be simply copied byte-by-byte to the output.

In this way we have obtained a O(n) execution time of our JSON generation and parsing routines.

Onix Man

unread,
Nov 4, 2022, 1:43:17 AM11/4/22
to
вторник, 1 ноября 2022 г. в 12:31:45 UTC+3, jens.li...@vgregion.se:
it would be faster if the internal strings used UCS2 encoding(char16_t) in this case, there is no need to produce $length and $z duplicates functions. But gtm/db go your own way.

Jens Lideström

unread,
Nov 4, 2022, 4:02:50 AM11/4/22
to
On Friday, November 4, 2022 at 6:43:17 AM UTC+1, a1443...@gmail.com wrote:
> it would be faster if the internal strings used UCS2 encoding(char16_t) in this case, there is no need to produce $length and $z duplicates functions. But gtm/db go your own way.

I don't think it would be faster.

With UTF-16 you still has to deal with surrogate pairs to represent all Unicode.

https://en.wikipedia.org/wiki/UTF-16#Code_points_from_U+010000_to_U+10FFFF

Alex Maslov

unread,
Nov 8, 2022, 11:23:42 AM11/8/22
to
пятница, 4 ноября 2022 г. в 11:02:50 UTC+3, jens.li...@vgregion.se:
> I don't think it would be faster.
>
> With UTF-16 you still has to deal with surrogate pairs to represent all Unicode.

Surrogate pairs are rear guests in European languages at least.
Having some experience in both worlds (ISC and GT.M) I can confirm that ISC's approach based on UCS-2 provides much less extra expenses for Unicode support than GT.M's one.

Jens Lideström

unread,
Nov 16, 2022, 7:21:20 AM11/16/22
to
It's interesting to hear about your experience, Alex!

I guess it comes down to whether $length and buddies must give the correct result in the presence of surrogate pairs, or it's acceptable to assume that every character is 2 bytes.

Java, for example, uses UTF-16 in it's string interface and might give weird results for surrogate pairs.
Reply all
Reply to author
Forward
0 new messages