Many compilers for IA-32 generate "repne scasb" in order tto find the length of a given C string. However, it is possible to implement strlen (and many other string functions) using the SSE2 instruction set: pcmpeqb + pmovmaskb until there is a set bit then bsf to find its index. On Core2 it is roughly 9.3 times faster and about 6.5 times faster on Pentium 4. The same technique could obviously be applied to wider data types as well.
My question is: why isn't the "repne scas" idiom microcoded internally to implement the above (or better!) sequence using the inherent power of the on-chip SIMD execution engines? The same applies to repe cmp (that is, strcmp, memcmp)...
Piotr Wyderski wrote: > Many compilers for IA-32 generate "repne scasb" in order > tto find the length of a given C string. However, it is possible > to implement strlen (and many other string functions) using the SSE2 > instruction set: pcmpeqb + pmovmaskb until there > is a set bit then bsf to find its index. On Core2 it is roughly > 9.3 times faster and about 6.5 times faster on Pentium 4.
I assume this is measured starting with 16-byte aligned strings?
To handle arbitrary starting alignment you have to specialcase the first block, probably by POR'ing in a 16*16=256-byte lookup table value, or you could use a single 31-byte table with a MOVUPS unaligned load?
However, if you're willing to look at misaligned data, then it is probably much better to simply handle the first block with a MOVUPS load, do the testing, then move on to aligned accesses from then on.
This is because (a) quite a few strings will be at least 8-byte aligned anyway, and (b) most C-style zero-terminated strings are quite short, needing only a single iteration of a 16-byte wide operation.
> The same technique could obviously be applied to wider > data types as well.
> My question is: why isn't the "repne scas" idiom microcoded > internally to implement the above (or better!) sequence using > the inherent power of the on-chip SIMD execution engines?
According to discussions I've had with a well-know cpu architect, hardware could indeed do a much better job, and afaik, the current 'Fast String' support on some chips is a step in that direction.
The main problem is how to make it totally general, what with variously cached/uncached memory buffers in particular.
> The same applies to repe cmp (that is, strcmp, memcmp)...
See above.
Terje
> Best regards > Piotr Wyderski
-- - <Terje.Mathi...@hda.hydro.com> "almost all programming can be viewed as an exercise in caching"
Terje Mathisen wrote: > I assume this is measured starting with 16-byte aligned strings?
Yes, as I can align all the dynamicly allocated strings to that boundary with my custom memory allocator and enforce an appropriate alignment of the static strings using language extensions (#pragma, __attribute__ etc.).
> To handle arbitrary starting alignment you have to specialcase the first > block, probably by POR'ing in a 16*16=256-byte lookup table value, or you > could use a single 31-byte table with a MOVUPS unaligned load?
Why such a complex solution? :-) In case of strlen()/memchr() all you need is to check whether the string is aligned (test with 0x0f + a statically predicted "not taken" branch to the unaligned mode handler) and if it is not, then align it down manually, perform the comparison operation, mask out the invalid "prefix bytes" and stop if found/follow the aligned path if not found. This way you'll check the overlapping part two times, but who cares? :-) The movups-based solution also have its merits in case of short strings, but perhaps a generic instruction-based while loop will be enough, since the unaligned part requires at most 15 iterations?
> According to discussions I've had with a well-know cpu architect, hardware > could indeed do a much better job, and afaik, the current 'Fast String' > support on some chips is a step in that direction.
But is there any "fast string support"? Almost everywhere the rep family is as slow as its direct replacement based on test/jnz. Only the block copy/fill operations are optimized.
> The main problem is how to make it totally general
Have you seen the SSE4's pcmpestri instruction? It is pretty darn general. :-) Do they have a large array of 256 comparators to execute it in a single cycle, or unrolled it in time (which means 8+ cycles, hence quite useless).
Piotr Wyderski wrote: > Terje Mathisen wrote: >> To handle arbitrary starting alignment you have to specialcase the >> first block, probably by POR'ing in a 16*16=256-byte lookup table >> value, or you could use a single 31-byte table with a MOVUPS unaligned >> load?
> Why such a complex solution? :-) In case of strlen()/memchr() all you need > is to check whether the string is aligned (test with 0x0f + a statically > predicted "not taken" branch to the unaligned mode handler) and if it is > not, then align it down manually, perform the comparison operation, mask > out the invalid "prefix bytes" and stop if found/follow the aligned path if > not found. This way you'll check the overlapping part two times, but who > cares? :-) The movups-based solution also have its merits in case of > short strings, but perhaps a generic instruction-based while loop will > be enough, since the unaligned part requires at most 15 iterations?
There is one very important situation that Alpha did consider:
If you have a misaligned string of less than 8/16 (Alpha/SSE) bytes from the end of the last allocated page, then you'll get a page fault by trying to read past the end of the page.
Alpha handled this by having an instruction which could convert the lower three bits of an address into a mask, said mask to be used together with an aligned load (which it also had: Load from 0xnnnn3 and you would get the quadword starting at 0xnnnn0). This made sure that all load operations were aligned, and therefore couldn't cross a page boundary.
With SSE you don't have a similarly fast way of turning a potentially misaligned load in an aligned load + mask, and since average C string lengths are just a few bytes (single digits afair from reading some paper 10+ years ago), the first block handling is _crucial_.
> Have you seen the SSE4's pcmpestri instruction? It is pretty darn > general. :-)
Sure.
In fact, it is so general that there's just two possible ways to implement it:
a) Microcode, which then will effectively expand to a _lot_ of code handling alignment, locating the shortest (implicit) string length input, do a bunch of 16-byte loads from the two source addresses, masking the compare if the shortest string ends anywhere inside, do the compare, then loop.
b) A _huge_ slice of hardware directly implementing all the logic of (a), which would turn this into one of the CISC'iest non-OS instructions ever.
The documentation I have doesn't specify enough detail, but my guess would be that it simply returns the index of the first non-equal byte, then it is up to the calling code to figure out which string is longer.
The C-style version is in some ways even more hairy, it requires two sets of comparators (but not three):
One 16-byte wide set which compares pairs of input bytes and another which checks for zero bytes. The second can of course re-use the first hw with one of the inputs forced to zero, and what with all the other stuff going on, I don't believe the actual XOR+merge 8/16 bits part is going to be the bottleneck.
> Do they have a large array of 256 comparators to execute it in a single > cycle, or unrolled it in time (which means 8+ cycles, hence quite useless).
16 or 32 is enough, unless it actually works on 64-byte cache blocks, in the latter case they also need one or two 64-byte barrel shifters to align the strings with each other.
Terje
-- - <Terje.Mathi...@hda.hydro.com> "almost all programming can be viewed as an exercise in caching"
Piotr Wyderski wrote: > Terje Mathisen wrote: > But is there any "fast string support"? Almost everywhere the rep family > is as slow as its direct replacement based on test/jnz. Only the block > copy/fill operations are optimized.
>> The main problem is how to make it totally general
> Have you seen the SSE4's pcmpestri instruction? It is pretty darn > general. :-) > Do they have a large array of 256 comparators to execute it in a single > cycle, or unrolled it in time (which means 8+ cycles, hence quite useless).
OK, I've found some more docs about these soon-to-be opcodes, and you're right, it is pretty darn general, but not what I'd think of as a "string" operation, more like a building block.
They don't need anything like 256 comparators though:
The way they've probably implemented it is to use the same 16 byte-wide comparators as needed by lots of existing opcodes, with one of the immediate bits (which really should be considered part of the opcode, resulting in 64 times as many opcodes I guess), used to determine if pairs of byte operations should be joined into words.
Each of these comparators must be two-state (</=) or (=/>), so that the two bits used for equal_any/range/equal_each/ordered can be used to select the terms to merge.
What it does look like is an opcode that can only ever be used as part of library code, or in the form of specific canned operations that are output from a compiler.
I bet you could use this opcode plus a store and have a two-instruction Turing-equivalent machine, using self-modification for all interesting stuff. :-)
Terje
-- - <Terje.Mathi...@hda.hydro.com> "almost all programming can be viewed as an exercise in caching"
Terje Mathisen wrote: > There is one very important situation that Alpha did consider:
> If you have a misaligned string of less than 8/16 (Alpha/SSE) bytes from > the end of the last allocated page, then you'll get a page fault by trying > to read past the end of the page.
> Alpha handled this by having an instruction which could convert the lower > three bits of an address into a mask, said mask to be used together with > an aligned load (which it also had: Load from 0xnnnn3 and you would get > the quadword starting at 0xnnnn0). This made sure that all load operations > were aligned, and therefore couldn't cross a page boundary.
> With SSE you don't have a similarly fast way of turning a potentially > misaligned load in an aligned load + mask, and since average C string > lengths are just a few bytes (single digits afair from reading some paper > 10+ years ago), the first block handling is _crucial_.
Yes, it is exactly the reason I convert all misaligned string base addresses into down-aligned ones. This way you'll never cross a page boundary, as the alignment process simply clears the four least significant bits of the string address and can use the fast movaps instruction. All you need is to mask out the "false positives" located before the actual string itself, and the mask computation process can be performed using a simple piece of general-purpose code which runs in parallel with the SSE2-based comparator (namely, it is expected to slide under the pmovmskb latency).
Terje Mathisen wrote: > They don't need anything like 256 comparators though:
How is that possible? The documentation explicitely says that every SIMD vector A item is compared with every SIMD vector B item and then the results are merged accordingly. Since there are up to 16 entries in a vector, it means that 256 byte-wide comparisons must be performed. Complete unrolling in space requires 256 byte-wide comparators (or 16 128-bit-wide, if you prefer), which is a considerable piece of silicon. Complete unrolling in time requires just one 128-bit comparator and 16 passes, which is SLOW.
> What it does look like is an opcode that can only ever be used as part of > library code, or in the form of specific canned operations that are output > from a compiler.
But if an instruction is slow, then it is useless (provided that there is a faster replacement). I am pretty sure that a carefully tuned variant of pcmpeqb + ptest will give you much faster strlen()/strcmp()/memcmp() than pcmpestri. Another DAS/AAM? No, thanks... :-)
> I bet you could use this opcode plus a store and have a two-instruction > Turing-equivalent machine, using self-modification for all interesting > stuff. :-)
Frankly, I had a very similar impression and the term "overdesigned" was the first (well, the first non-insulting...) word that came to mind. :-)))
>> They don't need anything like 256 comparators though:
> How is that possible? The documentation explicitely says > that every SIMD vector A item is compared with every > SIMD vector B item and then the results are merged
I _really_ don't think that's what it means!
In fact, I'm 99% positive this instruction does NOT entail any form of barrel shift/full crossbar to actually compare all bytes against all bytes.
What it does is simply (!) 16+16+16 (a[x] == 0, a[x] == b[x], a[x] < b[x]) parallel byte compares, with the immediate mask determining how these intermediate results should be used/combined.
First the W bit enabled merging of pairs of initial results, trivial except for the (< less than) which has to combine the byte == and < tests:
> accordingly. Since there are up to 16 entries in a vector, > it means that 256 byte-wide comparisons must be performed. > Complete unrolling in space requires 256 byte-wide comparators > (or 16 128-bit-wide, if you prefer), which is a considerable > piece of silicon. Complete unrolling in time requires just > one 128-bit comparator and 16 passes, which is SLOW.
Forget about that, it simply cannot be the case.
>> What it does look like is an opcode that can only ever be used as part >> of library code, or in the form of specific canned operations that are >> output from a compiler.
> But if an instruction is slow, then it is useless (provided that > there is a faster replacement). I am pretty sure that a carefully > tuned variant of pcmpeqb + ptest will give you much faster > strlen()/strcmp()/memcmp() than pcmpestri. Another DAS/AAM? > No, thanks... :-)
Assuming it works as I (now) expect, it should have latency comparable with a 64-bit SUB or CMP, i.e. 1 or 2 cycles.
Terje -- - <Terje.Mathi...@hda.hydro.com> "almost all programming can be viewed as an exercise in caching"
On Aug 14, 9:46 am, Terje Mathisen <terje.mathi...@hda.hydro.com> wrote:
> Piotr Wyderski wrote: > > Terje Mathisen wrote:
> >> They don't need anything like 256 comparators though:
BUT, the very FIRST thing to do is to profile a wide range of programs, see how much time is consumed by str* functions, and decide whether this is even worth arguing about [it might, or might not be; the last time I did this was along ago, but at the time, bcopy/memcpy was the only high-runner.]
The second thing, for code one owns, is to take a look and make sure there aren't silly things like for (i = 0; i <= strlen(s); i++) {stuff}, since many compilers won't move the strlen(s) out of the loop.
Needless to say, profiling Dhrystone doesn't count :-)
John Mashey wrote: > BUT, the very FIRST thing to do is to profile a wide range of > programs, see how much time is consumed by str* functions, and decide > whether this is even worth arguing about [it might, or might not be; > the last time I did this was along ago, but at the time, bcopy/memcpy > was the only high-runner.]
In my case SIMD does help a lot, therefore I don't care about "a wide range of programs". ;-) My programs should provide the highest performance available, the remaining ones can be slow (most of cases) or even should be slow (our competitors)...
> The second thing, for code one owns, is to take a look and make sure > there aren't silly things like for (i = 0; i <= strlen(s); i++)
Indeed, but this kind of code transformation it is the very first optimization I do.
>>> They don't need anything like 256 comparators though:
>> How is that possible? The documentation explicitely says >> that every SIMD vector A item is compared with every >> SIMD vector B item and then the results are merged
>I _really_ don't think that's what it means!
>In fact, I'm 99% positive this instruction does NOT entail any form of >barrel shift/full crossbar to actually compare all bytes against all bytes.
>What it does is simply (!) 16+16+16 (a[x] == 0, a[x] == b[x], a[x] < >b[x]) parallel byte compares, with the immediate mask determining how >these intermediate results should be used/combined.
The Intel manual really does say something different.
>> accordingly. Since there are up to 16 entries in a vector, >> it means that 256 byte-wide comparisons must be performed. >> Complete unrolling in space requires 256 byte-wide comparators >> (or 16 128-bit-wide, if you prefer), which is a considerable >> piece of silicon. Complete unrolling in time requires just >> one 128-bit comparator and 16 passes, which is SLOW.
>Forget about that, it simply cannot be the case.
>>> What it does look like is an opcode that can only ever be used as part >>> of library code, or in the form of specific canned operations that are >>> output from a compiler.
>> But if an instruction is slow, then it is useless (provided that >> there is a faster replacement). I am pretty sure that a carefully >> tuned variant of pcmpeqb + ptest will give you much faster >> strlen()/strcmp()/memcmp() than pcmpestri. Another DAS/AAM? >> No, thanks... :-)
>Assuming it works as I (now) expect, it should have latency comparable >with a 64-bit SUB or CMP, i.e. 1 or 2 cycles.
One way to implement the "equal any" mode would be using 256 byte compares (the other way would be to implement the loop from the pseudo code). However, you could also build a 256 bit wide lookup (1 bit wide) from one xmm register, and index into it with each byte of the other xmm register. Interesting instruction, I bet the range mode could also help with audio/video compression. Also note that these instructions do not have an alignment restriction.
John Mashey wrote: > On Aug 14, 9:46 am, Terje Mathisen <terje.mathi...@hda.hydro.com> > wrote: >> Piotr Wyderski wrote: >>> Terje Mathisen wrote: >>>> They don't need anything like 256 comparators though:
> BUT, the very FIRST thing to do is to profile a wide range of > programs, see how much time is consumed by str* functions, and decide > whether this is even worth arguing about [it might, or might not be; > the last time I did this was along ago, but at the time, bcopy/memcpy > was the only high-runner.]
This is of course the reason only REP MOVS has been targeted until now.
> The second thing, for code one owns, is to take a look and make sure > there aren't silly things like for (i = 0; i <= strlen(s); i++) > {stuff}, since many compilers won't move the strlen(s) out of the > loop.
A bad programmer can write bad code in any language, the loop above doesn't even come close. What about re-initializing an entire 3D workspace, with all objects/lists/surfaces etc between every frame?
Even a "RealityEngine N+1" would have at least some problems at that point. :-(
> Needless to say, profiling Dhrystone doesn't count :-)
No, there was no need to tell me that.
Anyway, with the extreme flexibility of those new opcodes, and assuming they do run in "CMP reg,reg" time, it more or less becomes a limited-functionality sub-processor. Parallel state machines?
Terje -- - <Terje.Mathi...@hda.hydro.com> "almost all programming can be viewed as an exercise in caching"
Tim McCaffrey wrote: > In article <gs29p4-k7n....@osl016lin.hda.hydro.com>, > terje.mathi...@hda.hydro.com says... >> Piotr Wyderski wrote: >>> Terje Mathisen wrote:
>>>> They don't need anything like 256 comparators though: >>> How is that possible? The documentation explicitely says >>> that every SIMD vector A item is compared with every >>> SIMD vector B item and then the results are merged >> I _really_ don't think that's what it means!
>> In fact, I'm 99% positive this instruction does NOT entail any form of >> barrel shift/full crossbar to actually compare all bytes against all > bytes. >> What it does is simply (!) 16+16+16 (a[x] == 0, a[x] == b[x], a[x] < >> b[x]) parallel byte compares, with the immediate mask determining how >> these intermediate results should be used/combined.
> The Intel manual really does say something different.
That was the manual I was reading, but not the initial part (2.3.1) where it does state "up to 256 compares".
WOW!
What can I say? Mea Culpa. :-(
This is such a break with all cpu architecture design I've witnessed over the last 20+ years, this is even more of an "everything_including_the_kitchen_sink" opcode.
It must use a really big hunk of transistors, it is a true coprocessor, tightly integrated with implicit registers/flags for input/output.
The way they've remapped all the flag register fields to allow a single PCMP?STR* instruction to return all sorts of information at once is quite amazing, it really looks like an attempt to make 8 and 16-bit searching _very_ fast. I.e. this looks like FAST's original idea, where my old "Algorithms and Data Structures" worked for years to have students construct and implement a 16-wide parallel search engine.
Terje -- - <Terje.Mathi...@hda.hydro.com> "almost all programming can be viewed as an exercise in caching"
In article <622bp4-3o4....@osl016lin.hda.hydro.com>, terje.mathi...@hda.hydro.com says...
>This is such a break with all cpu architecture design I've witnessed >over the last 20+ years, this is even more of an >"everything_including_the_kitchen_sink" opcode.
>It must use a really big hunk of transistors, it is a true coprocessor, >tightly integrated with implicit registers/flags for input/output.
>The way they've remapped all the flag register fields to allow a single >PCMP?STR* instruction to return all sorts of information at once is >quite amazing, it really looks like an attempt to make 8 and 16-bit >searching _very_ fast. I.e. this looks like FAST's original idea, where >my old "Algorithms and Data Structures" worked for years to have >students construct and implement a 16-wide parallel search engine.
I can't put my finger on why I feel this way, but I'm willing to make a (small) bet that this is what Mr. Glew has been working on the last few years. It just seems like his "style".
I notice it breaks alot of the PPro rules as well: Uses a GPR, one XMM, one (unrestricted alignment) memory access (or another XMM), and a GPR and a Flags result all in one instruction. Must cause all kinds of havoc syncronizing the execution pipes.
Tim McCaffrey wrote: > I can't put my finger on why I feel this way, but I'm willing to make a > (small) bet that this is what Mr. Glew has been working on the last few
I'd take that (very small) bet: I'll buy you a beer the next (first?) time we meet if you're right. (If you're not, forget about it, I don't like beer. :-)
> years. It just seems like his "style".
Andy, if this is indeed you, then I'd like (at some point in time) to get the full story about what important workload could conceivably get sufficient speedup from this.
> I notice it breaks alot of the PPro rules as well: Uses a GPR, one XMM,
Which is why I don't entirely like to think about Andy behind it.
> one (unrestricted alignment) memory access (or another XMM), and a GPR > and a Flags result all in one instruction. Must cause all kinds of > havoc syncronizing the execution pipes.
Right.
This cannot really be a part of the usual pipes at all, it seem like a huge hunk of semi-independent hardware, but with a private pipe in/out of the register array.
The main hiccup wouldn't be the need to read potentially unaligned input data, but the way it has to return information to three different locations, all implied:
XMM0 for the SIMD data, ECX for the index found and the flags for all sorts of branchable indications.
OTOH we've always had a few special-case instructions with extra outputs: MUL and DIV are the most obvious, returning two int regs + flags.
Writing to an SSE register in parallel with a normal int return might actually be easier than writing to two integer registers at once?
Terje -- - <Terje.Mathi...@hda.hydro.com> "almost all programming can be viewed as an exercise in caching"
On Aug 15, 2:44 pm, timcaff...@aol.com (Tim McCaffrey) wrote:
> I notice it breaks alot of the PPro rules as well: Uses a GPR, one XMM, > one (unrestricted alignment) memory access (or another XMM), and a GPR > and a Flags result all in one instruction. Must cause all kinds of > havoc syncronizing the execution pipes.
Once the FCMP*I instructions went in (FP comparison, int flags result) the pipeline guys built a dedicated bus from the FPU to EFLAGS to make these fast. The FPUs do both 80-bit and SSE, so having a EFLAGS result is straightforward (in the sense of interlocks and pipelining).
>> BUT, the very FIRST thing to do is to profile a wide range of >> programs, see how much time is consumed by str* functions, and decide >> whether this is even worth arguing about [it might, or might not be; >> the last time I did this was along ago, but at the time, bcopy/memcpy >> was the only high-runner.]
> In my case SIMD does help a lot, therefore I don't care about > "a wide range of programs". ;-) My programs should provide > the highest performance available, the remaining ones > can be slow (most of cases) or even should be slow > (our competitors)...
If performance is a concern why scan for the zero terminator to determine the length of a string? You could just stored the length with the string itself.
Patrick de Zeester wrote: > If performance is a concern why scan for the zero terminator to > determine the length of a string? You could just stored the length with > the string itself.
strlen() is just a simple and clean example of what can be done the SIMD way, hence it is a toy function to play with. But the field of possible applications is much, much wider. Anyway, even in the case of memcmp()/strcmp() the cached length field doesn't help.