When declared with typedef, it is a lot nicer to use than most other
traditional options (eg, functions and "float *" pointers and similar).
>>
>> Otherwise, it is like saying that no one can use inline ASM, or that one
>> will refuse to use a compiler which supports using inline ASM.
>
> No. Inline asm has its place, for example in system headers,
> to access features which are not otherwise accessible.
>
> However, if I would have to jump through hoops like the above
> for a new architecture...
>
No one is forced to use the language extensions...
It also isn't anywhere near as low-level a design as something like
"xmmintrin.h" was; it is, if anything, more as if part of GLSL had been
glued onto C.
There is a common subset that exists with GCC's notation though, so it
is possible to use this and write code that works on both compilers.
The main limitation (relative to GCC) is that the vectors are
more-or-less required to be padded to a power-of-2 size and drawn from a
more limited set of possible combinations of types and sizes, whereas
GCC's vector system is a little more open-ended as to what it allows.
Though, it is possible I could consider going over to allowing more
open-ended vectors, using a type representation more like what I did for
"_BitInt" support.
Though, likely, it would be vaguely similar:
  Packed types: Byte/Word/DWord/QWord, Half/Single/Double;
  Would still only allow vectors of primitive-type elements;
  Vector sizes: 2/3/4/N;
  Storage sizes: 8B, 16B, or a multiple of 16B;
  N-element storage being padded up to whatever fits in one of the above.
So, for example:
float v13 __attribute__((vector_size(52)));
Would be padded to 64-bytes and 16 elements.
Unlike GCC's system, it does borrow a few minor things from xmmintrin,
though they behave a bit differently, eg:
__m128 is cast-compatible with other 128-bit vector types;
It is cast compatible with __int128;
...
Similarly, it is possible to use cast-conversion to sidestep some other
traditional conversions, eg:
double f;
uint64_t fi;
memcpy(&fi, &f, 8); //traditional approach.
fi=(uint64_t)((__m64)f); //alternative (non-memcpy)
But, nothing says one can't still use memcpy if they want.
However, memcpy, unions, and pointer derefs have the drawback that they
force the values to be spilled to memory and use memory ops, whereas the
casts allow doing it via register operations.
So, eg, __m64 and __m128 are sort of like the equivalent of "void *" for
SIMD and floating-point types, whereas direct casting between
incompatible vector types either won't work, or will try to perform a
value-type conversion.
It is unclear whether or not a large vector type would be allowed to be
bit-cast via _BitInt, eg:
float v13 __attribute__((vector_size(52)));
_BitInt(416) lvi;
lvi=(_BitInt(416))v13; //should this be allowed?...
...
But, I guess it could be nicer if notations were more consistent.
Unlike GLSL though, neither BGBCC's nor GCC's vector system includes a
way to natively express things like matrices or matrix math.
>> One can just write traditional scalar code, and have it perform as such.
>> Its performance may suck in comparison, but it isn't precluded.
>
> Sure.
>
> However, even with the sorry state of auto-vectorization, it often
> generates better code than pure scalar code (and compilers are
> indeed getting better at this). You are saying you do not want
> this, and want to force the user to write your specialized version
> if decent (even if non-optimal) performance is required.
>
> Your architecture, your choice. Just count me out.
>
I would be less opposed to auto vectorization if it could be done in
ways which weren't prone to occasionally break stuff or make performance
worse than in the baseline language.
I also have similar reservations about strict aliasing semantics.
But, granted, then one can argue that it is an uphill battle to convince
programmers to use the 'restrict' keyword or similar (or argue that
'restrict' is useless if the compiler uses strict aliasing semantics by
default, ...), but alas.
>> IME, auto vectorization only sometimes helps on traditional targets, and
>> frequently (if the optimizer is overzealous) can turn into a wrecking
>> ball on the performance front (performing worse, sometimes
>> significantly, than its non vectorized counterpart).
>
> Rarely, and if it actually turns out to be a problem, you can turn
> it off.
>
Except when the #pragma that is supposed to disable it, or the command
line options to turn it off, weren't working. Though, this was with MSVC
on X64, which seems to have lost the notion of being able to do an
optimized build with vectorization disabled (well, unless one uses
"/Os" rather than "/O2").
One case like this, I ended up needing to resort to doing a
type-conversion via manual bit twiddling, as this was the only option I
found that worked in this case, and breaking the vectorization with a
function call and bit-twiddling on floating point values was faster than
just letting the compiler continue to do what it wanted to do in this case.
> One problem, of course, is C's over-dependence on pointers, which
> make a lot of vectorization options impossible because the compiler
> cannot tell that there is no aliased load and stores.
>
Some of the compilers assume non-aliasing in situations where the
aliasing does occur. This can lead to code misbehaving and crashing.
Compiler writers then pass blame off on the code for containing strict
aliasing violations or similar (and/or treat strict aliasing as opt-out
rather than opt-in).
Ironically, that leaves MSVC as one of the few compilers where it is
disabled by default.
>> If it were up to me though, there would be explicit attributes to tell
>> the compiler what it should or should not vectorize.
>
> There is the OpenMP SIMD directive, for example.
>
Possible.
>> Or, maybe, if some standardized way were defined to specify vector types
>> in C (preferably more concise than GCC's notation).
>>
>> Say, for example, if the compiler allowed:
>> float[[vector(4)]] vec; //define a 4-element floating point vector.
>> Or:
>> [[vector(4)]] float vec; //basically equivalent.
>
> Unless this becomes part of a language standard, this is even worse.
>
Potentially...
The C attributes are supposed to be for things which don't change the
functional behavior of the program, though this is not so true with many
traditional uses of __declspec or __attribute__...
>>
>>> [...]
>>>
>>>> My approach also tries to be much less of an "ugly wart" on the C
>>>> language, so does things in ways that I feel are more consistent with
>>>> traditional C semantics (can use native operators, cast conversions, ...).
>>>
>>> There are other programming languages than C. What would you
>>> propose for Fortran, for example?
>>>
>>
>> Dunno, not enough overlap between use-cases.
>>
>> BJX2 is not intended for supercomputers or scientific computing
>
>> rather
>> I was more intending it for robot and machine control tasks.
>
> I use Fortran because it is a nice general-purpose language (and
> because I know it well), not especially because it is the language
> of supercomputers.
>
> Of course, if your target is the embedded market for C, then you
> are likely not even using a hosted implementation, right? In that
> case, Fortran with its big libraries is probably the wrong language
> for you.
>
Looking at it some, Fortran seems to support explicit operations over
whole arrays; that is not auto-vectorization, it is explicit
vectorization.
At present, programs mostly run as one of:
  Bare ROM (size limited);
  Binary loaded at boot time (has most "kernel" functionality
statically linked);
  Application binary, loaded by the kernel/shell.
At present, in the "TestKern OS", the kernel and shell are essentially
the same program. It launches binaries, and any system calls return back
to it. Most of what would be separate utilities in a traditional Unix
are integrated into the shell as well...
So, by architectural analogy, it is as if BusyBox were shoved into the
Linux kernel, which was itself reduced to starting itself up, trying to
launch an "autoexec" shell script or program, and then dumping the user
at the command line in the default case.
In other cases, the binary is statically linked with all the kernel
stuff (filesystem and memory manager), and then put on the SDcard by
itself to boot and run.
At present, it uses a single big address space, and doesn't use a
preemptive scheduler. Partly this is because multiple address spaces and
preemptive scheduling are a bad situation for real-time programs.
Instead, the main alternatives are cooperative multi-threading, and an
explicit task-scheduler loop (typically operating using something more
akin to "continuation passing style"). For similar reasons, there is no
garbage collector, ...
One wants to be able to schedule events down to a time window of a few
microseconds for things like servomotor control and similar.
Some other parts were intended partly for processing image data from
camera modules, but had a non-zero overlap with what was needed for a
software rasterizer (and using them to implement a software-rasterized
OpenGL backend seemed like a reasonable test case), ...
This part involves lots of small-vector low-precision SIMD, and some
more obscure operators like taking the dot-products of vectors and
comparing them against a threshold value, ...
...
But, as noted, I am primarily testing it with Doom and Quake and similar...