You might want to look at the implementations in SIMDe for _mm_madd_epi16 and _mm_maddubs_epi16 (note to future readers: you might want to check the latest version of those files, since implementations in SIMDe get improved from time to time and it is very unlikely I'll remember to update this answer). These implementations are simply copied from there.
Conversion from UTF-16 to UTF-8 will be covered in this article. Since the fast path supports only the Basic Multilingual Plane, we don't need to convert surrogate pairs. Converting from UTF-32 works the same way, except that twice as much data is read per iteration and it is compressed into UTF-16 format first.
The images in this article represent binary or hexadecimal numbers in big-endian convention as usual in math, while all sequences of bytes or words are shown in little-endian as they are laid in memory.
The encoding algorithm will follow the same general idea (as described in the previous part): we extract information into a bitmask, then shuffle using a precomputed control register. One notable difference is that since all supported UTF-16 characters are two bytes long, it is possible to process all 16 bytes of input at each iteration. However, UTF-8 output may be larger than UTF-16 input if three-byte characters are present, in which case the output from one iteration may not fit into 16 bytes. For this reason, we will start with a code path supporting only one- and two-byte UTF-8 characters, and then extend it to three-byte characters.
This results in a bitmask with 8 words, each spanning two bytes. Ideally, we would use an _mm_movemask_epi16 intrinsic to extract one bit from every 16-bit word, but no such instruction exists in SSE. So we compact the lower bytes using a byte shuffle and use _mm_movemask_epi8 instead:
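To make the missing instruction concrete, here is a scalar model (not the SSE code itself) of what the shuffle-plus-movemask combination computes; the name movemask_epi16 is made up for illustration:

```python
def movemask_epi16(words):
    """Scalar model of the missing _mm_movemask_epi16: gather the top
    bit of each of the eight 16-bit words into an 8-bit mask."""
    mask = 0
    for i, w in enumerate(words):
        mask |= ((w >> 15) & 1) << i
    return mask
```

After a word compare every word is 0x0000 or 0xFFFF, so the low byte of each word carries the same information as its top bit; that is why packing the low bytes with a byte shuffle and applying _mm_movemask_epi8 yields exactly this result.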
If you recall how UTF-8 is encoded (see table from previous part), you will notice that the lower bits of UTF-8 output can be taken from the lower bits of levelA and levelB variables. The higher bits can then be overwritten with proper UTF-8 byte headers.
The SSE shuffle only operates on 16-byte registers, so both input and output must fit into 16 bytes. Since we consider every output character to take 1 or 2 bytes in UTF-8, the output surely fits into a 16-byte register. As for the levelA and levelB values, each of them contains eight 16-bit words, but we only need the lower half of every word. Hence, we can mix all of these lower bytes together into one register:
It remains only to overwrite the higher bits with UTF-8 headers, and the UTF-8 output will be obtained. This is achieved in the same way as we validated headers in decoding. The LUT entry contains a 16-byte bitmask showing which of the bits must be overwritten (depending on whether a byte is a 1-byte start, a 2-byte start, or a continuation byte). Only the last bit of each header is zero, so multiplying the bitmask by two produces the header itself.
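The mask-times-two trick can be verified with a small scalar model (illustrative only; apply_utf8_header is a hypothetical name, not the article's code):

```python
def apply_utf8_header(byte, mask):
    """Overwrite the bits selected by `mask` with the UTF-8 header.
    Every header ends in a zero bit, so (mask << 1) truncated to
    8 bits reproduces the header itself:
      mask 0b10000000 -> header 0b00000000 (ASCII byte)
      mask 0b11100000 -> header 0b11000000 (2-byte lead)
      mask 0b11000000 -> header 0b10000000 (continuation)"""
    header = (mask << 1) & 0xFF
    return (byte & ~mask & 0xFF) | header
```

In the SSE code the same effect is achieved vector-wide: AND with the complement of the LUT bitmask, then OR with the doubled bitmask.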
Since we always process the whole 16 bytes of input, advancing the input pointer is trivial. As for the output pointer, we can put the number of bytes encoded into the LUT entry, since it varies depending on the types of the characters.
The main problem with three-byte characters is that the output from 16 bytes of input can take up to 24 bytes of output, i.e. the data is expanding due to these characters. Since we cannot do our fancy shuffles across 24 bytes, we have to split input data into two halves: each half contains 8 bytes, being four UTF-16 words. These halves are processed independently most of the time, although for better performance we still try to process them together where possible.
Another thing to note is that a third level is added. Previously, we had the levelA and levelB variables, aligned to the 0-th and 6-th bits respectively. Now we also have levelC, aligned to the 12-th bit, which goes into the shuffle to produce the first bytes of the three-byte UTF-8 characters.
We need full information about the byte lengths of all input characters. Every character will take 2 bits in the LUT index, with three possible values. It does not matter exactly what these values look like, as long as they are different for every byte length. I tried several ways and chose the one which looked fastest to me.
First, let's check which characters won't fit into 1 byte, and which characters won't fit into 2 bytes. This is done easily by comparing either the original word or the levelB shifted value against a threshold:
These two bitmasks together contain all the information we need, but they take 32 bytes. They contain 16 words, each being 0x0000 or 0xFFFF, so we can compress every word into a byte and combine the bitmasks together:
The lower 8 bits of allMask correspond to the first four characters, and the higher 8 bits correspond to the last four characters. Since shuffling would be performed separately for these halves, we should split them now:
The values to be shuffled are not ready yet. Since we shuffle each half separately, we want to obtain two 16-byte values: one should contain all the needed bytes for the first four characters, and the other should contain the same data for the last four characters. For every character we need three bytes:
As usual, there are many ways to achieve this, and I'll show the one I found to be the fastest. First, let's take the original data and shift right by 4 bits the higher byte in every word. There is no SSE instruction to shift only odd bytes, but we can emulate it using _mm_maddubs_epi16 intrinsic:
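A scalar model of the intended effect may help (the real code gets there via _mm_maddubs_epi16; this sketch only shows the target transformation, and the function name is made up):

```python
def shift_high_bytes_right4(words):
    """Model of the emulated odd-byte shift: in every 16-bit word,
    keep the low byte (A) and replace the high byte with the
    original word's bits 12..15 (C)."""
    return [(w & 0xFF) | ((w >> 12) << 8) for w in words]
```

With _mm_maddubs_epi16 the per-word result is low_byte * 1 + high_byte * 16, whose high byte is the original high byte shifted right by 4; the junk this mixes into other bit positions is harmless because the later header rewrite overwrites those bits.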
Now levAC contains 8 words, and every word has (A) in its lower byte and (C) in its higher byte. Recall that we have already computed (B) in levelB variable when producing LUT index. Now we combine lower halves of these variables into level0, and higher halves into level1:
It does not matter how exactly the (A), (B), (C) bytes are packed in these values, as long as you properly account for their order when precomputing the shuffle mask. From now on, the processing is done separately for the two halves. Shuffling allows us to move every byte into its final place in the UTF-8 output:
Unlike UTF-8, validation of UTF-16 input is very simple. Surrogate words are the only possible problem in the input. They are the code units in the range [0xD800..0xE000), and we can check for them easily at the very beginning:
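As a scalar model of the check (the SSE version does the same range test vector-wide):

```python
def has_surrogates(words):
    """Reject inputs containing UTF-16 surrogate code units,
    i.e. words in the range [0xD800, 0xE000)."""
    return any(0xD800 <= w < 0xE000 for w in words)
```

Equivalently, a word is a surrogate exactly when (w & 0xF800) == 0xD800, which is one AND and one compare per vector.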
However, SIMD procedures can be really fast and convert several numbers in parallel. There is only one "but": the input data has to be regular and valid, i.e. the input string must contain only ASCII digits. Recently, I updated the article about SSE parsing with the benchmark results. The speed-ups are really impressive; for example, the SSSE3 parser is 7 to 9 times faster than naive scalar code.
The obvious question is how these powerful SIMD procedures can be used to convert real data. By real I mean possibly broken inputs that contain series of numbers of different lengths separated with characters from a predefined set.
At this point we have converted 16 one-digit numbers. Now, by merging adjacent numbers from t0, we may obtain two-digit numbers. SSSE3 introduced the instruction _mm_maddubs_epi16 (pmaddubsw), which multiplies vectors of bytes and adds intermediate 16-bit values, producing a vector of 16-bit values. Thus, when we multiply t0 by a vector filled with weights (1, 10) we get eight two-digit numbers, each saved in a 16-bit field:
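The pair-merging step can be sketched with a scalar model of pmaddubsw (illustrative only; the weight order within each pair depends on how the digits are laid out in memory):

```python
def maddubs(bytes_, weights):
    """Scalar model of pmaddubsw: multiply corresponding bytes and
    add each adjacent pair into one 16-bit field."""
    return [bytes_[i] * weights[i] + bytes_[i + 1] * weights[i + 1]
            for i in range(0, len(bytes_), 2)]
```

For the digits of "1234" stored in memory order, weighting each pair as (10, 1) merges them into the two-digit numbers 12 and 34; repeating the trick with _mm_madd_epi16 and weights (1, 100) then yields four-digit numbers, and so on.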
If we need narrower numbers we simply stop the above procedure at an earlier step; for example, for four-digit numbers the result is t2. Since the intermediate vectors have different element sizes (8-, 16-, 32-bit), a cast to another vector type might be needed. Widening can be done with the pmovsx instructions (intrinsics _mm_cvtepi16_epi32, _mm_cvtepi8_epi32, etc.); narrowing with the mentioned pack instruction, which is also available in different variants.
The same conversion using a naive scalar procedure would require 16 subtractions, 14 multiplications and 14 additions. More advanced SWAR procedures would require fewer multiplications, but still not as few as the SIMD procedure.
Since SIMD procedures impose a specific layout of digits within a vector, an arbitrary input has to be properly aligned. In order to do this, we need to identify spans of digits in the input and move each span onto certain subarrays of a vector. Let's assume at this point that the input contains either digits or separators (denoted with _).
The solution with a precalculated array is suitable only for SSE, as span patterns have 16 bits. In the case of AVX2 and AVX512, where vectors are wider, such a table would simply be too large: 2^32 or 2^64 entries, respectively. Additionally, the AVX2 version of the pshufb instruction works on lanes, i.e. 128-bit halves of a vector, thus it is impossible to shuffle across the whole input.
In practice we need two masks: one with the positions of digits, another with the positions of separators. If the or'ed mask has some zero elements, it means that there are invalid characters. Below is a schema:
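A scalar model of this classification (a sketch of the idea, not the SSE code; the function name is made up):

```python
def classify(chunk, separators):
    """Build bitmasks of digit and separator positions; any position
    present in neither mask is an invalid character."""
    digit_mask = sum(1 << i for i, c in enumerate(chunk) if c.isdigit())
    sep_mask = sum(1 << i for i, c in enumerate(chunk) if c in separators)
    all_valid = (digit_mask | sep_mask) == (1 << len(chunk)) - 1
    return digit_mask, sep_mask, all_valid
```

In the SSE version each mask comes from a byte compare followed by _mm_movemask_epi8, and the validity check is a single comparison of the or'ed mask against 0xFFFF.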
The most generic SSE implementation works in time proportional to the size of the separator set. Each character from the set is broadcast into a vector and then compared with the input vector. The resulting byte mask is or'ed with the previous result.
The STTNI instruction pcmpestrm (_mm_cmpestrm) from SSE4.2 can also be used in some cases. The instruction takes two arguments. One is an input vector. The other is interpreted either as a set of individual characters (up to 16) or as character ranges (up to 8). The result is a mask of the input characters matching the set (or ranges). In the example below the set variant was used.
Similarly to the unsigned conversion, the signed conversion also requires a span pattern. But in this case the sign characters '+' and '-' are also included in the pattern. For this sample string:
The second phase checks if the sign characters are placed properly. For each span combination we need to precalculate the mask of valid sign positions. Then, after or'ing the masks for '+' and '-', we simply verify that they are placed at valid positions.
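The rule the precalculated masks encode can be stated as a scalar model (a sketch under the assumption that a sign is valid only at the start of a digit span; the real code just tests the sign mask against a per-pattern table entry):

```python
def signs_valid(chunk):
    """Model of the sign check: '+'/'-' is valid only when immediately
    followed by a digit and not immediately preceded by a digit or
    another sign."""
    for i, c in enumerate(chunk):
        if c in "+-":
            if i + 1 >= len(chunk) or not chunk[i + 1].isdigit():
                return False
            if i > 0 and (chunk[i - 1].isdigit() or chunk[i - 1] in "+-"):
                return False
    return True
```

The table lookup replaces this per-character loop with a single AND/compare of the sign mask against the precalculated valid-position mask.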
AVX512VBMI defines the powerful instruction _mm512_permutex2var_epi8 (vperm2b), which does a lookup in a 128-byte table using as indices the seven lowest bits of each byte of a ZMM register. The invocation of the intrinsic function is a bit unusual:
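A scalar model of the semantics (not the intrinsic itself) shows both the 128-byte table and the oddity, namely that the index vector is the middle argument of the intrinsic:

```python
def permutex2var_epi8(table_a, indices, table_b):
    """Model of vperm2b: look up each index byte in the 128-byte table
    formed by concatenating two 64-byte registers; only the 7 low bits
    of each index are used."""
    table = list(table_a) + list(table_b)
    return [table[i & 0x7F] for i in indices]
```

Note that the argument order mirrors the intrinsic: first table half, then the indices, then the second table half.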