UTF-8 vectorization

Mark Figura

unread,

Dec 22, 2016, 10:42:44 AM12/22/16

to mechanical-sympathy

Hello everybody (long time, first time),

I came across an interesting post a few days ago with a method for counting the number of characters in UTF-8: http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html

Rather than checking byte-by-byte, it loads as many bytes as fit into the word-size of the of the system and works on them in parallel.

Does anyone know any good resources that can help explain how one might come up with lines of code like the following 2?

u = ((u & (ONEMASK * 0x80)) >> 7) & ((~u) >> 6);
count += (u * ONEMASK) >> ((sizeof(size_t) - 1) * 8);

In particular, I'm not sure how the multiplications fit in. Taking "ONEMASK * 0x80" as a constant, the 1st line is pretty straight-forward, but I haven't a clue for the 2nd.

Apologies if I've managed to push the entire content of one of my early CS classes out of my head. :)

Thanks!

Mark

Tavian Barnes

unread,

Dec 22, 2016, 11:38:48 AM12/22/16

to mechanica...@googlegroups.com

ONEMASK is something like 0x01010101 (twice as long for 64-bit). We're trying to count UTF-8 continuation bytes, which all start with 10xxxxxx (leading bytes are either 0xxxxxxx or 11xxxxxx). So ((u & (ONEMASK * 0x80)) >> 7) is ((u & 0x80808080) >> 7), which is a mask with a 0x01 byte for every corresponding 1xxxxxxx byte, and 0x00 otherwise. Then the & (~u >> 6) masks out the x1xxxxxx bytes, leaving a 0x01 byte for only the 10xxxxxx bytes.

The multiplication by ONEMASK is a quick way to compute u + (u << 8) + (u << 16) + ... (since u * ONEMASK == u * 0x01 + u * 0x0100 + u * 0x010000 + ...). There's no hex carries that can occur (at most 8 1's added together), so grabbing the leftmost byte is equivalent to just adding each byte of u together, which counts the number of 0x01 bytes in u. For example:

u = 0x000063C3B474C3A9 (côté)

((u & 0x8080808080808080) >> 7) == 0x0000000101000101

~u >> 6 == 0x03FFFE70F12E2CF1

u = 0x0000000101000101 & 0x03FFFE70F12E2CF1 == 0x0000000001000001

u * 0x0101010101010101 ==

0x00|00000001000001

0x00|00000100000100

0x00|00010000010000

0x00|01000001000000

0x01|00000100000000

0x00|00010000000000

0x00|01000000000000

+ 0x01|00000000000000

--------------------

0x02|02020202010101

Notice how the value of the leftmost byte is just the sum of all the separate bytes in u. In this case, we find that there are 2 UTF-8 trailing bytes in u (the second bytes of ô and é).

Mark Figura

unread,

Dec 22, 2016, 12:38:47 PM12/22/16

to mechanical-sympathy

Ah... Thanks! I hope I get to put that to use some time, but at least the multiplication makes sense now.

Tony Finch

unread,

Dec 23, 2016, 10:30:49 AM12/23/16

to mechanical-sympathy

Mark Figura <mfi...@alum.wpi.edu> wrote:
>
> Does anyone know any good resources that can help explain how one might
> come up with lines of code like the following 2?

Get the book "Hacker's Delight" by Hank Warren. It's a treasure-trove of
clever micro-optimizations along these lines.

My collection of links has several more things along these lines...

http://aggregate.org/MAGIC/

http://aggregate.org/SWAR/

http://www.jjj.de/fxt/#fxtbook

http://graphics.stanford.edu/~seander/bithacks.html

http://www.inwap.com/pdp10/hbaker/hakmem/hacks.html

http://www.cl.cam.ac.uk/~am21/hakmemc.html

http://www.cl.cam.ac.uk/~am21/progtricks.html

Tony.
--
f.anthony.n.finch <d...@dotat.at> http://dotat.at/ - I xn--zr8h punycode
Viking: Southwesterly backing southerly 6 to gale 8, perhaps severe gale 9
later. Very rough. Wintry showers. Good.

Reply all

Reply to author

Forward