
Proposal for Single instructions for string library functions on My 66000


Stephen Fuld

Jun 21, 2021, 7:47:08 PM
I was content to assume the use of VVM for most/all of the C string
library functions on Mitch’s My 66000. The capabilities, especially the
ones “underneath” the ISA level to stream values from/to memory and do
multiple byte compares in a single cycle, make a huge improvement in the
performance of these functions.

However, once Mitch introduced the MM (Memory Move) instruction, which
makes a single instruction out of what would otherwise be a short VVM
sequence of instructions, that made me try to think about the issues
involved in adding single instructions to implement (perhaps some of)
the other string functions. This is what I have so far.

A few caveats.

First, IANAHG, and this is certainly not a detailed proposal. I don’t know
enough to make one. It is just some thoughts on the issue. Many of
these ideas and statements may be wrong, and of course, I welcome
corrections.

Second, I looked a little but could not find information giving the
frequency of use of each of the functions, so any ideas there are just
my, probably poor, intuition. Again, I welcome more information.

Third, I realize that Mitch added the MM instruction to speed up
structure moves, and this means higher use of the MM instruction than
strictly as a library function replacement. But this provides the
infrastructure making the implementation of similar instructions easier.
(i.e. arbitrary number of TLB misses per instruction,
interruptable/resumable instructions, etc.) And several of the proposed
instructions have other uses too.

So, is it worth considering?

The big pro to this is performance. While the VVM solution eliminates
the cost of fetching, decoding and executing the multiple instructions
in the loop on other than the first pass, there is still more cost to
executing the several instructions in the loop once (on the first pass)
than doing so for a single instruction. However, this is usually a one
time cost, and may not be huge when talking about a longer running
function. But if the function is, in fact, long running, then it is
more likely to encounter an interrupt, in which case the “first pass”
cost is encountered again. Similarly, a single instruction takes less
space in the instruction stream and the I-cache than a sequence of
several instructions. (Note that this assumes in-lining the function.
If it is called, there is even more overhead.) And functions like
strcat require two VVM loops to complete, although upon an interrupt,
only one of them must incur the extra cost. Again, I didn’t think this
was huge, but the fact that Mitch thought it was worthwhile enough to
eliminate made me reconsider.

For some functions, there are opportunities for further substantial
performance gains. (see below)

The big con is the cost of implementing any new instruction. First,
there is the cost of the gates to do it, although for many of the
functions, they are already there, so it adds only the logic to invoke
them.

Second is the cost of additional op-codes. While there are twenty some
functions, I don’t propose anywhere close to that many op codes. I
suspect that some of the functions, especially the ones that require an
additional potential character substitution per character (e.g. the
localization functions) aren’t good candidates for single instruction
implementation. I also probably wouldn’t do the errno lookup function,
as presumably it is infrequently called and never in the critical path.

And many of the functions are essentially small modifications of others,
e.g. strcmp and strncmp. These two can be handled using a single op
code by using the Carry meta instruction to indicate “use the n
version” and to specify the register that contains n. Applying this logic
to other similar cases reduces the number of op codes further. Even
where you don’t need the additional register, you could use the presence
of the Carry indicator bit to modify the instruction, e.g. strchr and
strrchr. There are several choices for how to implement this. I don’t
pretend to know the best one.
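As a sketch of the folding described above, here is C code treating strcmp and strncmp as one routine whose behavior is selected by a flag and a count, standing in for the Carry-supplied "use the n version" bit and register. The function name and exact semantics are illustrative only, not part of any actual ISA proposal.

```c
#include <stddef.h>
#include <stdbool.h>

/* Illustrative only: one comparison routine covering both strcmp and
   strncmp; "use_n" and "n" model what the Carry modifier would supply. */
static int str_compare(const char *a, const char *b, bool use_n, size_t n) {
    for (size_t i = 0; ; i++) {
        if (use_n && i == n)              /* strncmp: stop after n bytes */
            return 0;
        unsigned char ca = (unsigned char)a[i];
        unsigned char cb = (unsigned char)b[i];
        if (ca != cb)
            return ca < cb ? -1 : 1;
        if (ca == '\0')                   /* terminator ends both forms */
            return 0;
    }
}
```

One opcode then serves both library functions; only the presence of the modifier changes the stopping rule.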

Lastly, there are a number of functions, mostly the “nested loop” ones
that would gain substantial benefit from being able to use an
instruction that implements another string function as a “building
block” to speed them up, even without a dedicated op code. See below
for an example.

Combining all of these, I think you could get down to a single digit
number of new op codes for most of the desired functionality.

The “nested loop” functions are the ones, such as strpbrk that require
you to code a nested loop, the outer loop going over the first string,
the inner loop going over the second string. The code that is in the
outer loop, but not in the inner loop is just loading the next byte and
checking it for a value of zero. This will work, but there is a
performance issue. VVM loops can’t be nested. So, assuming you use VVM
for the inner loop, the outer loop will cost relatively a lot, as it
can’t use the streaming and multi byte compare capabilities of VVM.
But, if you have the single instruction that searches for a character
match in a string (strchr), you can use this single instruction (plus
perhaps the Carry modifier), as the inner loop, thus enabling you to use
VVM for the outer loop. So while you still have essentially a nested
loop (the strchr instruction is essentially a loop), you have
substantially sped up the operation.
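A minimal C sketch of this building-block arrangement, using the library strchr as a stand-in for the proposed single-instruction inner loop (the outer loop is the part that could then be VVM-ized):

```c
#include <string.h>
#include <stddef.h>

/* strpbrk via a strchr "building block": the outer loop walks s and
   could run under VVM; the inner search over the accept set collapses
   to a single call, modeling the proposed strchr-like instruction. */
static char *my_strpbrk(const char *s, const char *accept) {
    for (; *s != '\0'; s++) {
        if (strchr(accept, *s) != NULL)   /* inner "instruction" */
            return (char *)s;
    }
    return NULL;
}
```

The nested loop is still there conceptually, but the inner one is now a single operation rather than an un-VVM-able instruction sequence.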

One last thing. While thinking about this, I had another idea regarding
some of these nested loop functions. I am guessing that for a
non-trivial percentage of these, the string containing the “to be looked
for” characters is, in fact, a compile time constant. An example of
this is searching text for any “white space” character. For these
cases, perhaps we can use the features built into the My 66000, together
with some things that hardware does better than software, to improve on
even the method I outlined in the previous paragraph.

The idea is as follows. Let’s use strspn as an example. If the compiler
sees that the second string is a compile time constant, it builds a 256
bit map, with a one bit set for the value of each character in the
string. It then emits an instruction giving the addresses of the first
and this newly constructed second string as the two source operands. The
result operand will be used to hold the count. When the instruction is
encountered, the hardware loads the 256 bit (32 byte) bit map into one
of the available buffers. It also loads the starting bytes of the first
string into another streaming buffer. It then starts going through the
first string, using the value of each byte as an index into the second
string (this ability to use a byte value as an index into a 256 bit map
is the thing that the hardware can do better (faster) than software).
The hardware proceeds through the first string, looking up each byte,
looking for a one bit, and incrementing a counter for each such byte. The
presence of the carry flag could be used to reverse the sense of the
test, giving strcspn. The hope is that it can do one byte per cycle, or
perhaps one byte per two cycles. In any event, it certainly is much
faster than a nested loop as it becomes a single loop. I don’t know
if the advantages of this are worth the extra implementation cost, but I
wanted to mention it.
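A hedged C sketch of the bit-map scheme: the map the compiler would emit is built at runtime here for illustration, and the lookup loop models what the proposed instruction would do, including the carry-driven inversion that yields strcspn. All names are made up for the example.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Build the 256-bit (32-byte) membership map the compiler would emit
   for a compile-time-constant accept set. */
static void build_map(uint8_t map[32], const char *set) {
    memset(map, 0, 32);
    for (; *set; set++) {
        unsigned char c = (unsigned char)*set;
        map[c >> 3] |= (uint8_t)(1u << (c & 7));
    }
}

/* Model of the proposed instruction: walk s, using each byte as an
   index into the map; invert=0 gives strspn, invert=1 gives strcspn
   (the sense-reversal the carry flag would select). */
static size_t strspn_map(const char *s, const uint8_t map[32], int invert) {
    size_t n = 0;
    for (;; n++) {
        unsigned char c = (unsigned char)s[n];
        int hit = (map[c >> 3] >> (c & 7)) & 1;
        if (c == '\0' || hit == invert)
            break;
    }
    return n;
}
```

The inner loop over the second string is gone entirely; each byte of the first string costs one table lookup, which is the part hardware can do in a cycle or two.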

Let me conclude by re-emphasizing that this whole idea (single
instructions for string functions) might not make sense, or might not be
worthwhile, or it might be the wrong way to implement the functionality,
etc. But I want to present it to get reactions and potential improvements.




--
- Stephen Fuld
(e-mail address disguised to prevent spam)

MitchAlsup

Jun 21, 2021, 8:26:49 PM
Stephen has done a good job of lining up the pros and cons on converting
well known libraries into instructions, in fact, I have done so also in the case
of transcendental instructions.
<
I did fret about putting MM in My 66000 ISA.
<
I did fret about leaving some of the str* and mem* functions out of ISA.
<
By incorporating these functions into ISA you invoke the near necessity of
function unit microcode. Each of these functions has different sequencing
rules and necessities, and many of the sub-cases are sub-sets of each other.
For these kinds of sequences, a microcode sequencer is de rigueur. Today
the name microcode has a bad taste in the minds of the almost knowing
and almost understanding. Just the marketing would be an uphill struggle.
<
In the end, I left these out as I thought that getting 3×-16× performance benefits
via something that already did make it into ISA was enough. Architecture is as
much about what to leave out as it is about what to put in.

Brian G. Lucas

Jun 21, 2021, 10:39:56 PM
The MM instruction is widely useful no matter what the source language is.

IMHO implementing the C string library instructions is "preparing for the
previous war". I think we need to wait until we see what happens with Rust
(and perhaps Go and others) and determine what string (or array) primitives
are hot spots in applications written in more modern languages.

Brian

Marcus

Jun 22, 2021, 4:17:57 AM
Agree. However if one is eager enough, it should be possible to pull
ample data from something like Mozilla Firefox / Servo. It uses C++
with custom string classes as well as Rust, /and/ it's heavy on string
processing.

In general I think that while optimizing string operations in existing
software at the hardware level is good (and VVM feels like it's pretty
close to what can be done in this space), it probably does not reach the
full potential if one starts thinking about how strings /could/ be
handled in hardware+software.

/Marcus

Thomas Koenig

Jun 22, 2021, 4:54:30 AM
Stephen Fuld <sf...@alumni.cmu.edu.invalid> schrieb:

> However, once Mitch introduced the MM (Memory Move) instruction, which
> makes a single instruction out of what would otherwise be a short VVM
> sequence of instructions, that made me try to think about the issues
> involved in adding single instructions to implement (perhaps some of)
> the other string functions. This is what I have so far.

[...]

While C was an amazing language design for its time and especially
for the hardware constraints of the machine it was developed for,
some of its features have not aged well. Null-terminated strings
are one of these features.

I wouldn't try to implement those in hardware. The mem* functions,
however, are fair game (and probably already covered by the
MM instruction).

Terje Mathisen

Jun 22, 2021, 7:06:17 AM
This idea tends to break down when all text is utf8, i.e. you can still
handle all the 7-bit US ASCII chars this way but you have to exit out of
the inner loop each time you get to an extended unicode wide char.

I try to write such code that I can simply ignore all the utf8 issues,
but that isn't always possible.

I.e. for my next word count implementation I've already figured out
that I can count characters by skipping (setting the char count
increment to zero) all intermediate utf8 byte values, but counting words
the same way only works if all utf8 sequences that end with a given
final byte value are members of the same set: Either all space/word
separators, or all in-word chars.
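Terje's counting trick can be sketched in C: every UTF-8 continuation byte matches the pattern 10xxxxxx, so counting only the bytes that do not match it counts code points without decoding anything:

```c
#include <stddef.h>

/* Count UTF-8 code points by skipping continuation bytes, i.e. setting
   the count increment to zero for every byte of the form 10xxxxxx. */
static size_t utf8_strlen(const char *s) {
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)  /* not a continuation byte */
            count++;
    }
    return count;
}
```

As the post notes, the same skip trick only works for word counting if every multi-byte sequence classifies the same way as its final byte.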

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Stephen Fuld

Jun 22, 2021, 11:21:18 AM
On 6/21/2021 5:26 PM, MitchAlsup wrote:
> On Monday, June 21, 2021 at 6:47:08 PM UTC-5, Stephen Fuld wrote:

big snip

>> Let me conclude by re-emphasizing that this whole idea (single
>> instructions for string functions) might not make sense, or might not be
>> worthwhile, or it might be the wrong way to implement the functionality,
>> etc. But I want to present it to get reactions and potential improvements.
>>
>>
> <
> Stephen has done a good job of lining up the pros and cons on converting
> well known libraries into instructions, in fact, I have done so also in the case
> of transcendental instructions.
> <
> I did fret about putting MM in My 66000 ISA.
> <
> I did fret about leaving some of the str* and mem* functions out of ISA.

I did sense some of your hesitation. Your response helps to clarify why.



> <
> By incorporating these functions into ISA you invoke the near necessity of
> function unit microcode. Each of these functions has different sequencing
> rules and necessities, and many of the sub-cases are sub-sets of each other.
> For these kinds of sequences, a microcode sequencer is dé rigueur.

OK. That brings up a couple of questions. Does the MM implementation
use microcode? Do the transcendental instructions use microcode?

I'm sure you see where I am going with these questions. If MM uses
microcode, then you have already "bitten the bullet". If not, what
subset of the kinds of instructions I have been talking about can you
also implement without microcode?

> Today
> the name microcode has a bad taste in the minds of the almost knowing
> and almost understanding.

:-)


> Just the marketing would be an uphill struggle.

:-( But you are probably right.


> <
> In the end, I left these out as I thought that getting 3×-16× performance benefits
> via something that already did make it into ISA was enough. Architecture is as
> much about what to leave out as it is to what to put in.

Of course. Thanks for your thoughtful response.

Stephen Fuld

Jun 22, 2021, 11:28:17 AM
On 6/21/2021 7:39 PM, Brian G. Lucas wrote:
> On 6/21/21 6:47 PM, Stephen Fuld wrote:

big snip

>> Let me conclude by re-emphasizing that this whole idea (single
>> instructions for string functions) might not make sense, or might not
>> be worthwhile, or it might be the wrong way to implement the
>> functionality, etc.  But I want to present it to get reactions and
>> potential improvements.
>>
>>
> The MM instruction is widely useful no matter what the source language is.
>
> IMHO implementing the C string library instructions is "preparing for
> the previous war".  I think we need to wait until we see what happens
> with Rust (and perhaps Go and others) and determine what string (or
> array) primitives are hot spots in applications written in more modern
> languages.

Certainly a valid point, although there will be a huge amount
of C/C++ code (and Java and other popular languages that have similar
functions) in use for a long time.

I spent a little time looking at the Rust book to see if what I proposed
was applicable. It seems it might be, though more of the functionality
is in the language proper rather than in a library. It would take more
study or someone more versed in Rust than I am to know for sure. I
haven't looked at Go at all.

Thanks for your thoughts.

Stephen Fuld

Jun 22, 2021, 11:37:38 AM
On 6/22/2021 1:54 AM, Thomas Koenig wrote:
> Stephen Fuld <sf...@alumni.cmu.edu.invalid> schrieb:
>
>> However, once Mitch introduced the MM (Memory Move) instruction, which
>> makes a single instruction out of what would otherwise be a short VVM
>> sequence of instructions, that made me try to think about the issues
>> involved in adding single instructions to implement (perhaps some of)
>> the other string functions. This is what I have so far.
>
> [...]
>
> While C was an amazing language design for its time and especially
> for the hardware constraints of the machine it was developed for,
> some of its features have not aged well. Null-terminated strings
> are one of these features.

I never liked null-terminated strings, but they certainly have become
popular. Is there a consensus on what to replace them with in newer,
general purpose languages?


> I wouldn't try to implement those in hardware.

Certainly a valid position.


> The mem* functions,
> however, are fair game (and probably already covered by the
> MM instruction).

I don't think so. As I understand it, the MM instruction only does mem
move (and therefore mem copy), but not the others. You are suggesting a
subset of what I suggested, and I have no problems with that.

Stephen Fuld

Jun 22, 2021, 11:40:50 AM
On 6/22/2021 4:06 AM, Terje Mathisen wrote:
> Stephen Fuld wrote:

big snip

>> Let me conclude by re-emphasizing that this whole idea (single
>> instructions for string functions) might not make sense, or might not
>> be worthwhile, or it might be the wrong way to implement the
>> functionality, etc.  But I want to present it to get reactions and
>> potential improvements.
>
> This idea tends to break down when all text is utf8, i.e. you can still
> handle all the 7-bit US ASCII chars this way but you have to exit out of
> the inner loop each time you get to an extended unicode wide char.
>
> I try to write such code that I can simply ignore all the utf8 issues,
> but that isn't always possible.

Good point. And, of course, my proposal doesn't handle wide characters.

MitchAlsup

Jun 22, 2021, 11:40:54 AM
On Tuesday, June 22, 2021 at 10:21:18 AM UTC-5, Stephen Fuld wrote:
> On 6/21/2021 5:26 PM, MitchAlsup wrote:
> > On Monday, June 21, 2021 at 6:47:08 PM UTC-5, Stephen Fuld wrote:
> big snip
> >> Let me conclude by re-emphasizing that this whole idea (single
> >> instructions for string functions) might not make sense, or might not be
> >> worthwhile, or it might be the wrong way to implement the functionality,
> >> etc. But I want to present it to get reactions and potential improvements.
> >>
> >>
> > <
> > Stephen has done a good job of lining up the pros and cons on converting
> > well known libraries into instructions, in fact, I have done so also in the case
> > of transcendental instructions.
> > <
> > I did fret about putting MM in My 66000 ISA.
> > <
> > I did fret about leaving some of the str* and mem* functions out of ISA.
> I did sense some of your hesitation. Your response helps to clarify why.
> > <
> > By incorporating these functions into ISA you invoke the near necessity of
> > function unit microcode. Each of these functions has different sequencing
> > rules and necessities, and many of the sub-cases are sub-sets of each other.
> > For these kinds of sequences, a microcode sequencer is dé rigueur.
<
> OK. That brings up a couple of questions. Does the MM implementation
> use microcode? Do the transcendental instructions use microcode?
<
Transcendentals did not need microcode as the body of the polynomial
evaluation is identical for each one so this part is a pure sequencer.
Then the end cases are simply a HW switch on certain data values.
This could be done with microcode, but remains inside the bounds of
what one can do with a sequencer.
<
str* and mem* have enough special cases that the HW switch to
the right part of the sequencer might as well be microcode--although
it could be done without--you would need to figure out a way to
effectively call a HW sequence from within a sequence, and then
return whence you started. Doable, but the times I have encountered
such sequences I had access to microcode and let it handle the
scenarios.

Thomas Koenig

Jun 22, 2021, 1:00:00 PM
Stephen Fuld <sf...@alumni.cmu.edu.invalid> schrieb:
> On 6/22/2021 1:54 AM, Thomas Koenig wrote:
>> Stephen Fuld <sf...@alumni.cmu.edu.invalid> schrieb:
>>
>>> However, once Mitch introduced the MM (Memory Move) instruction, which
>>> makes a single instruction out of what would otherwise be a short VVM
>>> sequence of instructions, that made me try to think about the issues
>>> involved in adding single instructions to implement (perhaps some of)
>>> the other string functions. This is what I have so far.
>>
>> [...]
>>
>> While C was an amazing language design for its time and especially
>> for the hardware constraints of the machine it was developed for,
>> some of its features have not aged well. Null-terminated strings
>> are one of these features.
>
> I never liked null-terminated strings, but they certainly have become
> popular. Is there a consensus on what to replace them with in newer,
> general purpose languages?

Let's google a bit.

Fortran: Length + data (older than C, but character variables are
newer, so I think this counts) (OK, I knew that before)

Go: It has slices, which look a bit like the array descriptors
(or dope vectors) used under the hood in Fortran, except
they are user-visible.

Rust: Strings are not null-terminated.

C#: Strings are not null-terminated.

Swift: Strings are not null-terminated.

Those are probably the major modern languages that are likely
to be compiled directly to machine code. It stands to reason
that they have to store a length somewhere.
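The shape all of these share, a length carried alongside the data, can be sketched in C (names are illustrative):

```c
#include <stddef.h>
#include <string.h>

/* Minimal length+data string view, the common shape of the languages
   listed above (Go slices, Rust &str, and friends). The data need not
   be null-terminated; the length travels with the pointer. */
typedef struct {
    const char *data;
    size_t len;
} str_view;

/* Bridge from a C string: the length is computed once, then carried,
   so later operations never re-scan for a terminator. */
static str_view sv_from_cstr(const char *s) {
    str_view v = { s, strlen(s) };
    return v;
}
```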

[...]

Marcus

Jun 23, 2021, 2:53:51 AM
C++: Length + data (but with a null terminator at the end of the data, in
order to be compatible(ish) with C strings).

And do not forget that major string-heavy C++ applications often have
their own string classes, that typically are not null-terminated.

Stephen Fuld

Jun 23, 2021, 11:32:36 AM
First, thank you for the research. I have been thinking about your
proposal, and I think it boils down to several questions/issues. These
assume that there are single instructions for each of the mem* functions
in the C string library, but no others.

1. How easy is it for the various compilers to recognize the idioms and
generate code to use the new instructions? In C it is easy as they are
all function calls to well known names. I am sure it varies between
languages, and for some it might be a little harder.

2. How much help do these provide for the "traditional" C string
functions? As I noted, there is a lot of C code out there, and will be
for a long time. For example, memchr looking for a null is equivalent
to strlen. So this could be used to speed up the first part of strcat.

3. A related question: are there any minor modifications to the mem*
instructions that would aid in other cases, especially the C string cases?
As an example, an option (the carry modifier?) to allow "early
termination" if a null is encountered before the count is exhausted.
How much would these "mess up" the implementation?

4. How much improvement to overall performance would these provide?

5. And last, but certainly not least, a question for Mitch: could these
be implemented within the hardware constraints you have laid out, i.e.
no microcode?
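Hedged C sketches of points 2 and 3. The memchr-based strlen shows the equivalence noted in point 2 (though memchr needs a bound, which is part of what point 3's early-termination option would address); memcpy_z is a purely hypothetical illustration of those semantics, not an existing instruction or libc function.

```c
#include <string.h>
#include <stddef.h>

/* Point 2: memchr scanning for '\0' is strlen, bounded by max. */
static size_t strlen_via_memchr(const char *s, size_t max) {
    const char *p = memchr(s, '\0', max);
    return p ? (size_t)(p - s) : max;
}

/* Point 3 (hypothetical semantics): copy up to n bytes, but stop early
   once a '\0' has been copied; returns the number of bytes written. */
static size_t memcpy_z(char *dst, const char *src, size_t n) {
    size_t i;
    for (i = 0; i < n; i++) {
        dst[i] = src[i];
        if (src[i] == '\0')
            return i + 1;
    }
    return n;
}
```

A mem* instruction with the early-termination option would subsume copies like memcpy_z, giving the str* family most of the hardware benefit without dedicated opcodes.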

MitchAlsup

Jun 23, 2021, 11:44:23 AM
Note: The LOOP instruction in My 66000 was designed to deal with both
counted and null terminated loops (and a few more). It seems to me that
providing all of the instructions one can synthesize with VVM and My
66000 instructions would consume a lot of space.
<
Secondly: Using VVM one is running in the 8-32 I/C range on not-that-wide
implementations. So it is hard to see direct HW implementations at the
instruction level running "that much faster"; most of the loops are governed
by cache access width (both VVM and direct HW implementations.)
>
> 5. And last, but certainly not least, a question for Mitch, could these
> be implemented within the hardware constraints you have laid out of no
> microcode?
<
There comes a time where sequences are best described in tabular formats;
one can have a direct translation into microcode, or one can pass the table
through the great gate eater in Verilog synthesis and have the same sequencer
without the ability to program it after fabrication.

Stephen Fuld

Jun 23, 2021, 1:24:22 PM
Yes. That was one of the things I was talking about in my original post
as "existing logic".


> It seems to me that
> providing all of the instructions one can synthesize with VVM and My
> 66000 instructions would consume a lot of space.

Sure, but no one is proposing that! The question is whether there is a
subset of "all" that is worthwhile. So far, there is, consisting of a
single member, MM. My question is, are there more?


> <
> Secondly: Using VVM one is running in the 8-32 I/C range on not that wide
> implementations. So It is hard to see direct HW implementations at the
> instruction level running "that much faster" most of the loops are governed
> by cache access width (both VVM and direct HW implementation.)

I understand. Thus my surprise that you implemented MM. The fact that
you did led to my trying to see if it was worthwhile to go further.

As I said, ISTM that the main advantages of a single instruction are
lower cost "start up" (and resume after interrupt) and lower
memory/I-cache usage. Once you are up and going, I agree that there is
essentially no advantage.

MitchAlsup

Jun 23, 2021, 7:49:52 PM
MM made the list, after much consideration, mainly because it is a much more
compact way to move stuff around in memory without having to pass through
registers. LDM and STM made the cut but are so underutilized it would cause
no undue harm to remove them--the vast majority of the LDM/STM uses are
performed with ENTER and EXIT.
>
> As I said, ISTM that the main advantages of a single instruction is
> lower cost "start up" (and resume after interrupt), and lower
> memory/I-cache usage. Once you are up and going, I agree that there is
> essentially no advantage.
<
MM, ENTER, and EXIT made the cut for code density reasons.
ABS and CopySign made the cut for power reasons and more advanced
implementations can perform these in zero cycles (on the forwarding path).
<
The one I keep fretting about is BMM--bit matrix multiply--where you take
an operand (64-bits) and a matrix (64 entries of 64-bits each in memory)
and perform a BMM between them, delivering a 64-bit result. There are all
sorts of clever bit manipulation that can be so expressed, and I can see a
way of executing it in 8-ish cycles.
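For concreteness, here is one common reading of a 64×64 BMM in C, assuming modulo-2 (XOR) accumulation; the post does not pin down the exact semantics, so treat this as an assumption. The operand is a row vector; each set bit selects a matrix row to XOR into the result.

```c
#include <stdint.h>

/* 64x64 bit-matrix multiply over GF(2), as an assumed reading of BMM:
   result bit j is the parity over k of (x[k] & m[k][j]), computed here
   row-at-a-time by XORing every matrix row selected by a set bit of x. */
static uint64_t bmm64(uint64_t x, const uint64_t m[64]) {
    uint64_t r = 0;
    for (int k = 0; k < 64; k++)
        if ((x >> k) & 1)   /* row k selected by operand bit k */
            r ^= m[k];      /* XOR = addition mod 2 */
    return r;
}
```

With the identity matrix (row k = 1<<k) the result is the operand itself; permutation matrices give arbitrary bit permutations, one example of the clever bit manipulation mentioned above.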

Ivan Godard

Jun 23, 2021, 10:13:54 PM
On 6/23/2021 4:49 PM, MitchAlsup wrote:
> On Wednesday, June 23, 2021 at 12:24:22 PM UTC-5, Stephen Fuld wrote:

<snip>

>> I understand. Thus my surprise that you implemented MM. The fact that
>> you did led to my trying to see if it was worthwhile to go further.
> <
> MM made the list, after much consideration, mainly because it is much more
> compact way to move stuff around in memory without having to pass through
> registers. LDM and STM made the cut but are so underutilized it would cause
> no undo harm to remove them--the vast majority of the LDM/STM uses are
> performed with ENTER and EXIT.
>>
>> As I said, ISTM that the main advantages of a single instruction is
>> lower cost "start up" (and resume after interrupt), and lower
>> memory/I-cache usage. Once you are up and going, I agree that there is
>> essentially no advantage.
> <
> MM, ENTER, and EXIT made the cut for code density reasons.
> ABS and CopySign made the cut for power reasons and more advanced
> implementations can perform these in zero cycles (on the forwarding path).
> <
> The one I keep fretting about is BMM--bit matrix multiply--where you take
> an operand (64-bits) and a matrix (64 entries of 64-bits each in memory)
> And perform a BMM between these delivering a 64-bit result. There are all
> sorts of clever bit manipulation that can be so expressed, and I can see a
> way of executing it in 8-ish cycles.

ENTER and EXIT should make the list on RAS grounds; code density is a
useful side benefit.

MM is a special case of streams. I always prefer general solutions.

BMM is attractive functionality, but is better handled by a dedicated
co-processor IMO. Though that opinion may come because I have never
found a good way to fit it into the architecture paradigm.

Terje Mathisen

Jun 24, 2021, 2:23:34 AM
MitchAlsup wrote:
> On Wednesday, June 23, 2021 at 12:24:22 PM UTC-5, Stephen Fuld wrote:
>>> Secondly: Using VVM one is running in the 8-32 I/C range on not that wide
>>> implementations. So It is hard to see direct HW implementations at the
>>> instruction level running "that much faster" most of the loops are governed
>>> by cache access width (both VVM and direct HW implementation.)
> <
>> I understand. Thus my surprise that you implemented MM. The fact that
>> you did led to my trying to see if it was worthwhile to go further.
> <
> MM made the list, after much consideration, mainly because it is much more
> compact way to move stuff around in memory without having to pass through
> registers. LDM and STM made the cut but are so underutilized it would cause
> no undo harm to remove them--the vast majority of the LDM/STM uses are
> performed with ENTER and EXIT.

I am equally confident MM is worthwhile; there are just so many
instances of copying going on that having a single approved method that
is both near-light-speed fast and very compact (in code size) makes a
lot of sense.

We have been following Intel's and AMD's various attempts to do the same
for REP MOVS with various "fast strings" implementations, and it is slowly
getting to where it can match unrolled SSE/AVX copy blocks, without
having any of those pesky alignment/block size limitations.
>>
>> As I said, ISTM that the main advantages of a single instruction is
>> lower cost "start up" (and resume after interrupt), and lower
>> memory/I-cache usage. Once you are up and going, I agree that there is
>> essentially no advantage.
> <
> MM, ENTER, and EXIT made the cut for code density reasons.
> ABS and CopySign made the cut for power reasons and more advanced
> implementations can perform these in zero cycles (on the forwarding path).

They fall out from your fast transcendentals?
> <
> The one I keep fretting about is BMM--bit matrix multiply--where you take
> an operand (64-bits) and a matrix (64 entries of 64-bits each in memory)
> And perform a BMM between these delivering a 64-bit result. There are all
> sorts of clever bit manipulation that can be so expressed, and I can see a
> way of executing it in 8-ish cycles.

8!!!

We are talking about 512 bytes of data in the matrix, so just loading it
needs 8 64-byte cache line loads: I think you must envision a way to
stream the matrix past the operand during those cache loads, or would
this be implemented with a dedicated 512-byte matrix cache inside the
CPU so that it could be reused multiple times for an array of operands?

Ivan Godard

Jun 24, 2021, 3:57:39 AM
They fall out of bit set/clear/test instructions in the ALU/shifter. I
don't see why you'd want them as FP codes, unless you were using split
int/FP regs and didn't want to pay for the moves.



MitchAlsup

Jun 24, 2021, 12:22:29 PM
The SW implementations (to look at and define what HW does) are replete
with these; however, the actual HW does them internally.
<
> > <
> > The one I keep fretting about is BMM--bit matrix multiply--where you take
> > an operand (64-bits) and a matrix (64 entries of 64-bits each in memory)
> > And perform a BMM between these delivering a 64-bit result. There are all
> > sorts of clever bit manipulation that can be so expressed, and I can see a
> > way of executing it in 8-ish cycles.
<
> 8!!!
<
Not a typo.
>
> We are talking about 512 bytes of data in the matrix, so just loading it
> needs 8 64-byte cache line loads: I think you must envision a way to
> stream the matrix past the operand during those cache loads, or would
> this be implemented with a dedicated 512-byte matrix cache inside the
> CPU so that it could be reused multiple times for an array of operands?
<
The amount of "math logic" is so small that it is entirely cache bound.
So if you make it capable of accessing a whole cache line in a cycle,
you can perform the BMM on eight (8) rows per cycle.

Terje Mathisen

Jun 25, 2021, 2:10:31 AM
I.e. my cache line/cycle streaming speed guess/suggestion.

I've been used to thinking in cache line chunks since the Larrabee tech
review. :-)

robf...@gmail.com

unread,
Jun 25, 2021, 5:33:00 AM6/25/21
to

I am a little confused by the terminology. I thought a bit-matrix multiply multiplied bits between two 64-bit register values treated as 8x8 matrices of bits. This can be done in a single clock cycle and requires only two 64-bit registers. Larger registers could be used for larger bit multiplies. A 256-bit SIMD register could hold a 16x16 bit matrix.

Michael S

unread,
Jun 25, 2021, 6:00:47 AM6/25/21
to
On Friday, June 25, 2021 at 12:33:00 PM UTC+3, robf...@gmail.com wrote:
> I am a little confused by the terminology??? I thought a bit-matrix multiply was multiplying bits between two 64-bit register values as 8x8 matrixes of bits. This can be done in a single clock cycle and requires only two 64-bit registers. Larger registers could be used for larger bit multiplies. A 256-bit SIMD register could hold a 16x16 bit matrix.

You seem less confused than I am.
For starters, I would like to understand whether the operations are modulo-2 or not.
Later on, I'd like to know what BMM is good for.

MitchAlsup

unread,
Jun 25, 2021, 10:40:09 AM6/25/21
to
On Friday, June 25, 2021 at 5:00:47 AM UTC-5, Michael S wrote:
> On Friday, June 25, 2021 at 12:33:00 PM UTC+3, robf...@gmail.com wrote:
> > I am a little confused by the terminology??? I thought a bit-matrix multiply was multiplying bits between two 64-bit register values as 8x8 matrixes of bits. This can be done in a single clock cycle and requires only two 64-bit registers. Larger registers could be used for larger bit multiplies. A 256-bit SIMD register could hold a 16x16 bit matrix.
<
You could make an 8×8 version and then use loops to make the larger one.
But if you do this, you are going to be in the 64 cycle range for a 64×64 BMM
while my target is 8 cycles. The "math" is so simple that orchestrating the
loop in SW is just slow (and counterproductive.)
<
> You seems less confused than I am.
> For starter, I would like to understand if operations are modulo-2 or not.
<
# define index(i,j) (((j)<<3)+(i))
# define bit(x,n)   (((x)>>(n))&1)

uint64_t BMM( uint64_t S1, uint64_t S2, int modulo2 )
{
    uint64_t i, j, k,
             rd = 0;

    for( i = 0; i < 8; i++ )
        for( j = 0; j < 8; j++ )
            for( k = 0; k < 8; k++ )
                if( modulo2 )
                    rd ^= (uint64_t)(bit(S1,index(i,k)) & bit(S2,index(k,j))) << index(i,j);
                else
                    rd |= (uint64_t)(bit(S1,index(i,k)) & bit(S2,index(k,j))) << index(i,j);
    return rd;
}
<
> Later on, I'd like to know what BMM is good for.
<
Moving arbitrary bit positions around to arbitrary bit result locations while
also performing bit multiplication (either AND or XOR) and accumulation (OR).

BGB

unread,
Jun 25, 2021, 8:39:47 PM6/25/21
to
Expecting much standardization between these languages is probably a stretch.


Though, one possibility for strings could be (for bare character pointers):
Pointer points at start of string data;
String data ends in NUL byte.

However, this is not the end of the story:
str[-1]==00, Plain NUL-terminated string; we are pointing at the start.
str[-1]==01..7F, We are pointing somewhere in the string interior;
str[-1]==80..BF, We are pointing somewhere in the string interior;
str[-1]==C0..EF && str[0]==80..BF, String interior.
str[-1]==C0..EF && str[0]!=80..BF, Start of string, Reverse VLN.
...

Reverse VLN:
80..BF: Length (0000..003F)
C0..DF: Length (0040..07FF)
E0..EF: Length (0800..FFFF)
...

Essentially, a Reverse VLN is like a UTF-8 codepoint, but encoded
backwards. The Reverse VLN is preceded by either a meta-type-tag or a
NUL byte.


One advantage of these strings is that they are partially backwards
compatible with C strings, but can allow some more capabilities (such as
scanning backwards to find the start of a string if given a pointer to
its interior).

Unlike plain C strings, they would require double-ended termination, but
for string tables, the start and end terminators between adjacent short
strings could be merged.


Possibly, strings longer than a certain minimum could be encoded by
default with a length prefix. This format will assume that the character
data is stored as either ASCII or UTF-8.

A pair of NUL bytes could also encode the start or end marker of a
string table.


A similar scheme can be used for UTF-16 strings but with backwards
surrogate pairs or similar as the start-of-string length marker.


As for whether special string instructions belong in an ISA, I don't
personally believe so. Packed byte-compare comes close, but arguably has
other uses as well.

MitchAlsup

unread,
Jun 25, 2021, 8:56:14 PM6/25/21
to
This reminds me of using IBM TSS360 where we would pad file names
with spaces "file " so that the disk sweeper could not come along and
remove the files (without being told to do so explicitly with file name in
quotation marks with the proper number of spaces !)

Stephen Fuld

unread,
Jun 29, 2021, 2:54:47 PM6/29/21
to
On 6/23/2021 4:49 PM, MitchAlsup wrote:
> On Wednesday, June 23, 2021 at 12:24:22 PM UTC-5, Stephen Fuld wrote:
>> On 6/23/2021 8:44 AM, MitchAlsup wrote:

snip

>>> Note: The LOOP instruction in My 66000 was designed to deal with both
>>> counted and null terminated loops (and a few more).
>> Yes. That was one of the things I was talking about in my original post
>> as "existing logic".
>>> It seems to me that
>>> providing all of the instructions one can synthesize with VVM and My
>>> 66000 instructions would consume a lot of space.
>> Sure, but no one is proposing that! The question is whether there is a
>> subset of "all" that is worthwhile? So far, there is, consisting of a
>> single member, MM. My question is, are there more?
>>> <
>>> Secondly: Using VVM one is running in the 8-32 I/C range on not that wide
>>> implementations. So It is hard to see direct HW implementations at the
>>> instruction level running "that much faster" most of the loops are governed
>>> by cache access width (both VVM and direct HW implementation.)
> <
>> I understand. Thus my surprise that you implemented MM. The fact that
>> you did led to my trying to see if it was worthwhile to go further.
> <
> MM made the list, after much consideration, mainly because it is much more
> compact way to move stuff around in memory without having to pass through
> registers. LDM and STM made the cut but are so underutilized it would cause
> no undo harm to remove them--the vast majority of the LDM/STM uses are
> performed with ENTER and EXIT.

I have thought about making STM/LDM "variants" of Enter/Exit, but you
need another register specifier to hold the memory address. If the goal
is to eliminate the LDM/STM op codes, and they are infrequently used, I
suppose you could precede Enter/Exit with a Carry meta instruction that
indicates what register contains the memory address. I am not sure how
much saving the two op-codes is worth.




>> As I said, ISTM that the main advantages of a single instruction is
>> lower cost "start up" (and resume after interrupt), and lower
>> memory/I-cache usage. Once you are up and going, I agree that there is
>> essentially no advantage.
> <
> MM, ENTER, and EXIT made the cut for code density reasons.

So the second of my reasons above (the memory/I-cache usage). Fine. I
think you are saying that MM would occur often enough that the savings
justify the cost. And by omitting the others, that they do not occur
often enough. You may be right. As I said in the OP, I don't have any
good statistics.

My guess is that perhaps the next most used instruction would be
essentially a variant of strchr that optionally had an n-character
limit. Besides strchr and memchr, this provides strlen by making the
searched-for character a null and n very large, allows the "outer loop"
of the nested loop functions to use VVM for a big savings on those, and
speeds up the first part of strcat and strncat.

MitchAlsup

unread,
Jun 29, 2021, 4:25:41 PM6/29/21
to
No, ENTER and EXIT imply the stack pointer (SP); the 2 register specifiers
are the start and stop registers (start==stop implies all 32 registers).
<
> >> As I said, ISTM that the main advantages of a single instruction is
> >> lower cost "start up" (and resume after interrupt), and lower
> >> memory/I-cache usage. Once you are up and going, I agree that there is
> >> essentially no advantage.
> > <
> > MM, ENTER, and EXIT made the cut for code density reasons.
> So the second of my reasons above (the memory/I-cache usage). Fine. I
> think you are saying that MM would occur often enough that the savings
> justify the cost. And by omitting the others, that they do not occur
> often enough. You may be right. As I said in the OP, I don't have any
> good statistics.
<
In any event MM went in and came out several times until Brian found
ways to use it for struct assignments, at which point it made the cut.
>
> My guess is that perhaps the next most used instruction would be
> essentially a variant of strchr that optionally had an n character
> limit. Besides strchr and memchr, this provides strlen by making the
> searched for character a null and n very large, allows the "outer loop"
> of the nested loop functions to use VVM for a big savings on those, and
> speeds up the first part of strcat and strncat.
<
I specifically architected VEC and especially LOOP to deal with the NULL
terminated C strings and the count to N loop limits simultaneously.
Basically, all of the str* and mem* that are leaf routines vectorize.

Stephen Fuld

unread,
Jun 29, 2021, 5:49:01 PM6/29/21
to
Yes, I realize that. Hence my suggestion about using Carry. The
register specified in the Carry instruction would contain the memory
address for starting the LDM/STM, and the fact that the Enter/Exit was
under the shadow of the Carry would tell the HW to use the Carry
register rather than the stack pointer. The start and stop specifiers
in the Enter/Exit would work exactly as they do now.



> <
>>>> As I said, ISTM that the main advantages of a single instruction is
>>>> lower cost "start up" (and resume after interrupt), and lower
>>>> memory/I-cache usage. Once you are up and going, I agree that there is
>>>> essentially no advantage.
>>> <
>>> MM, ENTER, and EXIT made the cut for code density reasons.
>> So the second of my reasons above (the memory/I-cache usage). Fine. I
>> think you are saying that MM would occur often enough that the savings
>> justify the cost. And by omitting the others, that they do not occur
>> often enough. You may be right. As I said in the OP, I don't have any
>> good statistics.
> <
> In any event MM went in and came out several times until Brian found
> ways to use it for struct assignments, at which point it made the cut.

So, perhaps I missed this: it appears that an important factor for you
is whether the compiler generates the instruction, not whether it would
benefit a library function. Is that correct?



>>
>> My guess is that perhaps the next most used instruction would be
>> essentially a variant of strchr that optionally had an n character
>> limit. Besides strchr and memchr, this provides strlen by making the
>> searched for character a null and n very large, allows the "outer loop"
>> of the nested loop functions to use VVM for a big savings on those, and
>> speeds up the first part of strcat and strncat.
> <
> I specifically architected VEC and especially LOOP to deal with the NULL
> terminated C strings and the count to N loop limits simultaneously.

I understand the advantages of VVM, and especially its ability to deal
with compare value and count as both termination conditions. As I have
said several times, I am a big fan. :-) But the functionality provided
by MM obviously vectorizes extremely well with VVM, yet you felt it was
worthwhile to get a little bit more performance out of it.

> Basically, all of the str* and mem* that are leaf routines vectorize.

I spent some time looking at ones like strpbrk or strcspn, and the best
VVM implementation I could come up with vectorizes the search of the
presumably shorter second string/set. But you really want to take
advantage of VVM on the longer string. Hence my idea of making what was
the inner/shorter loop an instruction, to allow VVM to work its magic on
the longer string. Is there a better implementation?

MitchAlsup

unread,
Jun 29, 2021, 7:33:00 PM6/29/21
to
But at this point you start losing code density.
> > <
> >>>> As I said, ISTM that the main advantages of a single instruction is
> >>>> lower cost "start up" (and resume after interrupt), and lower
> >>>> memory/I-cache usage. Once you are up and going, I agree that there is
> >>>> essentially no advantage.
> >>> <
> >>> MM, ENTER, and EXIT made the cut for code density reasons.
> >> So the second of my reasons above (the memory/I-cache usage). Fine. I
> >> think you are saying that MM would occur often enough that the savings
> >> justify the cost. And by omitting the others, that they do not occur
> >> often enough. You may be right. As I said in the OP, I don't have any
> >> good statistics.
> > <
> > In any event MM went in and came out several times until Brian found
> > ways to use it for struct assignments, at which point it made the cut.
<
> So, perhaps I missed that it appears that an important factor for you is
> whether the compiler generates the instruction, not that it would
> benefit a library function. Is that correct?
<
MM made the cut because it takes the place of a couple of setup instructions
and a loop of what would have been 5 instructions {LD/ST/ADD/CMP/BC}.
So the compiler can use it and it is cheaper and denser than a call+ret.
> >>
> >> My guess is that perhaps the next most used instruction would be
> >> essentially a variant of strchr that optionally had an n character
> >> limit. Besides strchr and memchr, this provides strlen by making the
> >> searched for character a null and n very large, allows the "outer loop"
> >> of the nested loop functions to use VVM for a big savings on those, and
> >> speeds up the first part of strcat and strncat.
> > <
> > I specifically architected VEC and especially LOOP to deal with the NULL
> > terminated C strings and the count to N loop limits simultaneously.
<
> I understand the advantages of VVM, and especially its ability to deal
> with compare value and count as both termination conditions. As I have
> said several times, I am a big fan. :-) But the functionality provided
> by MM obviously vectorizes extememly well with VVM, yet you felt it was
> worthwhile to get a little bit more performance out f it.
<
Yes, it does vectorize well, but MM is also considerably denser.
<
> > Basically, all of the str* and mem* that are leaf routines vectorize.
<
> I spent some time looking at ones like strpbrk or strcspn, and the best
> VVM implementation I would come up with vectorizes the search of the
> presumably shorter shorter second string/set. But you really want to
> take advantage of VVM on the longer string. Hence my idea of making
> what was the inner/shorter loop an instruction to allow VVM to work its
> magic on the longer string. Is there a better implementation?
<
But the inner (shorter) loops are far from being "an instruction"
<
From the Apple library, lightly updated to modern C:
<
char *strpbrk( const char *s1, const char *s2 )
{
    const char *scanp;
    int c, sc;

    while ((c = *s1++) != 0) {
        for (scanp = s2; (sc = *scanp++) != 0;)
            if (sc == c)
                return ((char *)(s1 - 1));
    }
    return (NULL);
}



strpbrk:
    LDSB  R4,[R1]
    ADD   R1,R1,#1
    BEQ0  R4,exit
    MOV   R3,R2
loop:
    LDSB  R5,[R3]
    ADD   R3,R3,#1
    BEQ0  R5,strpbrk
    CMP   R6,R4,R5
    BNE   R6,loop
    ADD   R1,R1,#-1
    RET
exit:
    MOV   R1,#0
    RET

size_t strcspn( const char *s1, const char *s2 )
{
    register const char *p, *spanp;
    register char c, sc;

    /*
     * Stop as soon as we find any character from s2. Note that there
     * must be a NUL in s2; it suffices to stop when we find that, too.
     */
    for (p = s1;;) {
        c = *p++;
        spanp = s2;
        do {
            if ((sc = *spanp++) == c)
                return (p - 1 - s1);
        } while (sc != 0);
    }
}


strcspn:
    MOV   R3,R1
loop:
    LDSB  R5,[R3]
    ADD   R3,R3,#1
    MOV   R6,R2
doloop:
    LDSB  R7,[R6]
    ADD   R6,R6,#1
    CMP   R8,R5,R7
    PEQ   R8,{3,{111}}
    ADD   R3,R3,#-1
    ADD   R1,R3,-R1
    RET
    BNE0  R7,doloop
    BR    loop

Marcus

unread,
Jun 30, 2021, 3:27:45 AM6/30/21
to
On 2021-06-30, MitchAlsup wrote:

[snip]
For kicks, the MRISC32 GCC output is:

strpbrk:
    LDUB  R6,R1,#0
    ADD   R1,R1,#1
    BZ    R6,exit
    MOV   R3,R2
loop:
    LDUB  R4,R3,#0
    ADD   R3,R3,#1
    SNE   R5,R4,R6
    BZ    R4,strpbrk
    BS    R5,loop
    ADD   R1,R1,#-1
    RET
exit:
    MOV   R1,R6
    RET

It's remarkable how similar our ISAs are in _certain_ areas ;-) BTW I
noticed that our C compilers use different signs for "char" - one of the
many vexing parts of the C standard.

/Marcus

Thomas Koenig

unread,
Jun 30, 2021, 5:31:50 AM6/30/21
to
Marcus <m.de...@this.bitsnbites.eu> schrieb:
> BTW I
> noticed that our C compilers use different signs for "char" - one of the
> many vexing parts of the C standard.

signed char deals "gracefully" with broken code like

    char ch;
    while ((ch = getchar()) != EOF)
        putchar(ch);

which works normally as long as you don't input 0xff (if your
EOF happens to be -1).

I wonder how this particular idiom influenced ABI designer's choice
of using signed vs. unsigned char for default char.

(If you want to be warned about this, use

cc -Wall -Wextra -Werror -funsigned-char

which will issue an error if your cc is a relatively recent gcc
or clang.)

Marcus

unread,
Jun 30, 2021, 5:59:21 AM6/30/21
to
Usually this is not a problem, as you usually do == or != comparisons
with characters. Where you do < > comparisons you either look at ASCII
characters (where the sign bit is always 0) or you are careful to do the
right type conversions.

In some places, though, it can bite you. E.g. the original Doom source
assumed x86 style signed char, and on machines/compilers that use
unsigned char by default you got funny steering / control errors.

Fixed independently in different Doom ports:


https://github.com/mbitsnbites/mc1-doom/commit/14d04b02f199c36ebfe488799a473ff2f8af62a3


https://github.com/smunaut/doom_riscv/commit/da1dbb098e429d67dec52ebc600e2fabaf282cef

...etc.

/Marcus

Thomas Koenig

unread,
Jun 30, 2021, 7:02:26 AM6/30/21
to
Marcus <m.de...@this.bitsnbites.eu> schrieb:

[signed vs. unsigned chars]

> In some places, though, it can bite you. E.g. the original Doom source
> assumed x86 style signed char, and on machines/compilers that use
> unsigned char by default you got funny steering / control errors.

Aaaaah... now I know why Apple made the char type signed for
their aarch64 platform. They obviously wanted to play Doom
without patching it.

Marcus

unread,
Jun 30, 2021, 7:26:45 AM6/30/21
to
On 2021-06-30, Thomas Koenig wrote:
That must be it! :-)

I did a quick grep in the GCC repo and got the following map (some of
these are controlled by ifdef:s so it's probably not the real truth):

aarch64 DEFAULT_SIGNED_CHAR = 0
alpha DEFAULT_SIGNED_CHAR = 1
arc DEFAULT_SIGNED_CHAR = 0
arm DEFAULT_SIGNED_CHAR = 0
avr DEFAULT_SIGNED_CHAR = 1
bfin DEFAULT_SIGNED_CHAR = 1
bpf DEFAULT_SIGNED_CHAR = 1
c6x DEFAULT_SIGNED_CHAR = 1
cr16 DEFAULT_SIGNED_CHAR = 1
cris DEFAULT_SIGNED_CHAR = 1
csky DEFAULT_SIGNED_CHAR = 0
epiphany DEFAULT_SIGNED_CHAR = 0
fr30 DEFAULT_SIGNED_CHAR = 1
frv DEFAULT_SIGNED_CHAR = 1
ft32 DEFAULT_SIGNED_CHAR = 1
gcn DEFAULT_SIGNED_CHAR = 1
h8300 DEFAULT_SIGNED_CHAR = 0/1
i386 DEFAULT_SIGNED_CHAR = 1
ia64 DEFAULT_SIGNED_CHAR = 1
iq2000 DEFAULT_SIGNED_CHAR = 1
lm32 DEFAULT_SIGNED_CHAR = 0
m32c DEFAULT_SIGNED_CHAR = 1
m32r DEFAULT_SIGNED_CHAR = 1
m68k DEFAULT_SIGNED_CHAR = 1
mcore DEFAULT_SIGNED_CHAR = 0
microblaze DEFAULT_SIGNED_CHAR = 1
mips DEFAULT_SIGNED_CHAR = 0/1
mmix DEFAULT_SIGNED_CHAR = 1
mn10300 DEFAULT_SIGNED_CHAR = 0
moxie DEFAULT_SIGNED_CHAR = 0
mrisc32 DEFAULT_SIGNED_CHAR = 0
msp430 DEFAULT_SIGNED_CHAR = 0
nds32 DEFAULT_SIGNED_CHAR = 1
nios2 DEFAULT_SIGNED_CHAR = 1
nvptx DEFAULT_SIGNED_CHAR = 1
or1k DEFAULT_SIGNED_CHAR = 1
pa DEFAULT_SIGNED_CHAR = 1
pdp11 DEFAULT_SIGNED_CHAR = 1
pru DEFAULT_SIGNED_CHAR = 0
riscv DEFAULT_SIGNED_CHAR = 0
rl78 DEFAULT_SIGNED_CHAR = 0
rs6000 DEFAULT_SIGNED_CHAR = 0/1
rx DEFAULT_SIGNED_CHAR = 0
s390 DEFAULT_SIGNED_CHAR = 0
sh DEFAULT_SIGNED_CHAR = 1
sparc DEFAULT_SIGNED_CHAR = 1
stormy16 DEFAULT_SIGNED_CHAR = 0
tilegx DEFAULT_SIGNED_CHAR = 1
tilepro DEFAULT_SIGNED_CHAR = 1
v850 DEFAULT_SIGNED_CHAR = 1
vax DEFAULT_SIGNED_CHAR = 1
visium DEFAULT_SIGNED_CHAR = 0
xtensa DEFAULT_SIGNED_CHAR = 0

For MRISC32 i went with ARM+AArch64+RISC-V > x86, but in the end it
really does not matter since, well, there will always be portability
problems and you will always have to specify "unsigned char" or "signed
char" (or better yet, "uint8_t" or "int8_t") if you care about
portability.

/Marcus

MitchAlsup

unread,
Jun 30, 2021, 10:48:58 AM6/30/21
to
With a list like the above, one would be foolish not to include signed
and unsigned byte LDs.

luke.l...@gmail.com

unread,
Jul 12, 2021, 8:17:24 PM7/12/21
to
On Tuesday, June 22, 2021 at 9:54:30 AM UTC+1, Thomas Koenig wrote:
> Stephen Fuld <sf...@alumni.cmu.edu.invalid> schrieb:

> While C was an amazing language design for its time and especially
> for the hardware constraints of the machine it was developed for,
> some of its features have not aged well. Null-terminated strings
> are one of these features.

it isn't going away. MSRPC (aka DCE/RPC) uses length specifiers: first heavy cost, a 16-bit integer added to every string, so a zero-byte (empty) string is 2 bytes. second: if strings are over 65535 bytes *you can't have them*; you are forced to use a 4-byte length encoder, and now you have a *4*-byte overhead for short strings. want to do both dynamically with escape sequencing? that strncpy in microcode is looking reeeal attractive by comparison, ehn?

then also you are forgetting that UTF-8 is the de-facto web/internet encoding format, and that is a whole new ballgame for which specialist opcodes are well known (except by me, sigh, i just heard about them)

> I wouldn't try to implement those in hardware. The mem* functions,
> however, are fair game (and probably already covered by the
> MM instruction).

it is seriously worthwhile looking up "speculative load" that has been introduced into SVE, RVV and SVP64.

these are suited to Horizontal-first Vector ISAs although could likely be adapted to VVM Vertical-First.

basically they say "when doing a LOAD the length may be ARBITRARILY TRUNCATED to the number of elements that would succeed WITHOUT a page fault"

or to anything the hardware feels like as long as it's one or more.

with VL now truncated the remaining Vector ops can operate safely. some of these will be "find first zero" typically using cntlz (inverted) on a parallel cmpeqz

overall in RVV you end up with a stunning 13 instructions, where the hardware is free to choose the number of Lanes, could be 1 could be 4 could be 64 could be 10,000.

strlen, similar size, note the fail-first load:
https://github.com/gsauthof/riscv/blob/master/strlen.s

on VVM, the Vertical-First ISA, the number of elements to be SIMDified at the hardware backend would be determined by the first LD operation.

the hardware *starts out* with the *intention* of performing a parallel execution of, say, loading 64 bytes simultaneously, the number of parallel elements having been arbitrarily set to 64.

however at the FFIRSTed LD it goes, "wwwhoops, actually this is totally misaligned, and i can only load 12 bytes without a page fault"

so the hardware goes, "ok i know i meant to load 64 elements but actually i'm only gonna do 12".

now a quick and dirty hack way of doing this is to create a "fake" (hidden) predicate mask, which is implicitly ANDed with all VVM Loop Vector ops following the FFIRSTed LD.

it starts off at 0b111111111111...1111 and if say the 13th byte would segfault, then this fake hidden predicate mask would be set to 0b0000000111111111111.

that way, the hardware VVM loop detector need not try desperately to back out of any decisions, it simply automatically ANDs that hidden mask into all Vector operations up until the end of the loop. it also only increments the loop counter by 12 not 64.

10 years ago, Power ISA retired the string-copy instructions they added in 1994; this should tell you everything you need to know about whether they're a good idea to add in 2021.

l.


luke.l...@gmail.com

unread,
Jul 12, 2021, 8:20:23 PM7/12/21
to
On Tuesday, June 22, 2021 at 4:40:50 PM UTC+1, Stephen Fuld wrote:
.
> Good point. And, of course, my proposal doesn't handle wide characters.

FFIRST LOAD has this covered: simply use the vLDhff or vLDwff instruction for 16- or 32-bit speculative vector loads instead of vldbff (byte).

l.

MitchAlsup

unread,
Jul 12, 2021, 8:54:55 PM7/12/21
to
On Monday, July 12, 2021 at 7:17:24 PM UTC-5, luke.l...@gmail.com wrote:
> On Tuesday, June 22, 2021 at 9:54:30 AM UTC+1, Thomas Koenig wrote:
> > Stephen Fuld <sf...@alumni.cmu.edu.invalid> schrieb:
> > While C was an amazing language design for its time and especially
> > for the hardware constraints of the machine it was developed for,
> > some of its features have not aged well. Null-terminated strings
> > are one of these features.
> it isn't going away. MSRPC (aka DCE/RPC) uses length-specifiers: first heavy cost, a 16 bit integer added to every string. a zero byte string (empty) is therefore 2 bytes. second: if strings are over 65535 bytes *you can't have them* you are forced to use a 4 byte length encoder, now you have a *4* byte overhead for short strings...want to do bith dynamicalky with escaoe sequencing? that strncpy in microcode is looking reeeal attractive by comparison, ehn?
>
> then also you are forgetting that UTF8 is the de-facto web internet encoding format, and that is a whole new ballgane for which specislist opcodes are well known (except by me, sigh, i just heard about them)
>
> > I wouldn't try to implement those in hardware. The mem* functions,
> > however, are fair game (and probably already covered by the
> > MM instruction).
>
> it is seriously worthwhile looking up "speculative load" that has been introduced into SVE, RVV and SVP64.
>
> these are suited to Horizontal-first Vector ISAs although could likely be adapted to VVM Vertical-First.
>
> basically they say "when doing a LOAD the length may be ARBITRARILY TRUNCATED to the number of elements that would succeed WITHOUT a page fault"
>
> or to anything the hardware feels like as long as it's one or more.
>
> with VL now truncated the remaining Vector ops can operate safely. some of these will be "find first zero" typically using cntlz (inverted) on a parallel cmpeqz
>
> overall in RVV you end up with a stunning 13 instructions, where the hardware is free to choose the number of Lanes, could be 1 could be 4 could be 64 could be 10,000.
>
> strlen, similar size, note the fail-first load:
> https://github.com/gsauthof/riscv/blob/master/strlen.s
>
> on VVM, the Vertical-First ISA, the number of elements to be SIMDified at the hardware vackend would be determined by the first LD operation.
<
My 66000 VVM compiled code:
<
    GLOBAL  strlen
    ENTRY   strlen
strlen:
    MOV   R2,#0
    VEC   R4,{R2}
    LDUB  R3,[R1+R2]
    LOOP  T,R2,R3!=#0   // LOOP type 3
    MOV   R1,R2
    RET
>
> the hardware *starts out* with the *intention* of performing a parallel execution of say loading 64 bytes simultaneously, the number of parallel elements having arbitrarily set to 64.
>
> however at the FFIRSTed LD it goes, "wwwhoops, actually this is totally misaligned, and i can only load 12 bytes wiyjout a page fault"
>
> so the hardware goes, "ok i know i meant to load 64 elements but actuslly i only gonna do 12".
>
> now a quick and dirty hack way of doing this is to create a "fake" (hidden) predicate mask, which is implicitly ANDed eith all VVM Loop Vector ops following the FFIRSTed LD.
>
> it starts off at 0b111111111111...1111 and if say the 13th byte would segfault, then this fake hidden predicate mask would be set to 0b0000000111111111111.
<
These shenanigans are why CRAY vectors are not good for string handling.........

luke.l...@gmail.com

unread,
Jul 12, 2021, 9:10:22 PM7/12/21
to
On Tuesday, July 13, 2021 at 1:54:55 AM UTC+1, MitchAlsup wrote:

> My 66000 VVM compiled code:
> <
> GLOBAL strlen
> ENTRY strlen
> strlen:
> MOV R2,#0
> VEC R4,{R2}
> LDUB R3,[R1+R2]
> LOOP T,R2,R3!=#0 // LOOP type 3
> MOV R1,R2
> RET

how many bytes can be parallel-loaded by LDUB being autovectorised?
(and, what's a loop type 3? :) )

is the possibility of performing more than one byte-load terminated by the use of R3!=0?

> > it starts off at 0b111111111111...1111 and if say the 13th byte would segfault, then this fake hidden predicate mask would be set to 0b0000000111111111111.
> <
> These shenanigans are why CRAY vectors are not good for string handling.........

the original one? absolutely agree. with SVE / RVV / SVP64 FFIRST.Load i respectfully disagree.

the idea there of a hidden predicate mask was to add speculative FFIRST.LOAD to *VVM* (Vertical-First), as a first cut, illustrating the basic principle.

for Cray-style (Horizontal-First) there are no "shenanigans": VL is truncated, there and then. subsequent instructions *automatically* operate on only the data that was LOADed.

that "fake predicate" was a first cut of an idea of how to add the same concept to VVM.

l.


Stephen Fuld

unread,
Jul 12, 2021, 9:29:36 PM7/12/21
to
On 7/12/2021 5:17 PM, luke.l...@gmail.com wrote:
> On Tuesday, June 22, 2021 at 9:54:30 AM UTC+1, Thomas Koenig wrote:
>> Stephen Fuld <sf...@alumni.cmu.edu.invalid> schrieb:
>
>> While C was an amazing language design for its time and especially
>> for the hardware constraints of the machine it was developed for,
>> some of its features have not aged well. Null-terminated strings
>> are one of these features.

First of all, you messed up the attributions. The comments above are
Thomas Koenig's, not mine.


> it isn't going away.

I tend to agree.


> MSRPC (aka DCE/RPC) uses length-specifiers: first heavy cost, a 16
> bit integer added to every string. a zero byte string (empty) is
> therefore 2 bytes. second: if strings are over 65535 bytes *you can't
> have them* you are forced to use a 4 byte length encoder, now you have a
> *4* byte overhead for short strings...want to do bith dynamicalky with
> escaoe sequencing? that strncpy in microcode is looking reeeal
> attractive by comparison, ehn?

Ignoring the escape and multi-byte characters, I have seen systems that
support an encoded length to handle exactly that problem. For example,
if the first byte is zero, the length is zero. If the high-order bit of
the first byte is zero, the remaining seven bits encode the length (up
to 127). If the first bit is one but the second bit is zero, the
remaining 14 bits in the first two bytes specify the length, up to 16K.
If the first two bits are one, you use the remaining 22 bits in the
first three bytes for very long strings. So, compared to null
terminated strings, for pretty short strings the additional storage
overhead is zero, and it is one extra byte for strings between 127 and
16K. Overall, pretty negligible.

Easier to use than null terminated for some operations, harder for
others. Typical engineering trade off.

MitchAlsup

unread,
Jul 12, 2021, 9:47:27 PM7/12/21
to
On Monday, July 12, 2021 at 8:10:22 PM UTC-5, luke.l...@gmail.com wrote:
> On Tuesday, July 13, 2021 at 1:54:55 AM UTC+1, MitchAlsup wrote:
>
> > My 66000 VVM compiled code:
> > <
> > GLOBAL strlen
> > ENTRY strlen
> > strlen:
> > MOV R2,#0
> > VEC R4,{R2}
> > LDUB R3,[R1+R2]
> > LOOP T,R2,R3!=#0 // LOOP type 3
> > MOV R1,R2
> > RET
> how many bytes can be parallel-loaded by LDUB being autovectorised?
<
Cache width per cycle. A small machine will be able to do 16 iterations of the loop
per cycle while larger machines may be able to do 32-to-64 per cycle.
<
> (and, what's a loop type 3? :) )
<
The LOOP instruction has 32 sub-variants: type 1 is a purely counted loop, type 2 is
a purely data-terminated loop with increments, and type 3 is counted and data==0 terminated.
The syntax of the LOOP instruction was a bit hard to read/parse by eye, so we had the
compiler spit out a token which simplifies the job.
>
> is the possibility of performing more than one byte-load terminated by the use of R3!=0?
<
No, each lane of the loop performs this on its byte and the loop controller assimilates it
into what smells like a VL register but it actually goes into the LOOP instruction (which
is a branch instruction) which decides if the loop terminates, and which values are to
be left in the scalar registers.
<
> > > it starts off at 0b111111111111...1111 and if say the 13th byte would segfault, then this fake hidden predicate mask would be set to 0b0000000111111111111.
> > <
> > These shenanigans are why CRAY vectors are not good for string handling.........
<
> the original one? absolutely agree. with SVE / RVV / SVP64 FFIRST.Load i respectfully disagree.
<
Your code size seems to be about 2× what My 66000 code size happens to be.

luke.l...@gmail.com

unread,
Jul 12, 2021, 9:55:29 PM7/12/21
to
On Tuesday, July 13, 2021 at 2:29:36 AM UTC+1, Stephen Fuld wrote:

> First of all, you messed up the attributions. The comments above are
> Thomas Koenig's, not mine.

apologies, i am using a very small device and the google groups interface is awesomely dire. it converts HTML formatted reply indentation into "nothingness" (i.e. ignores them entirely). this ignoringness basically removes one level of reply attribution, without people's knowledge or consent. it's done it again to my replies.

your replies are now attributed to me, thanks to google . if you were to switch to plaintext replies rather than "rich text" it would not mess up.

> > it isn't going away.
> I tend to agree.
> MSRPC (aka DCE/RPC) uses length-specifiers: first heavy cost, a 16

(see? you used HTML reply, and now my reply is attributed to you, according to google. this is because you use an HTML formatted mailer)

> bit integer added to every string. a zero byte string (empty) is
> therefore 2 bytes. second: if strings are over 65535 bytes *you can't
> have them* you are forced to use a 4 byte length encoder, now you have a
> *4* byte overhead for short strings...want to do bith dynamicalky with
> escaoe sequencing? that strncpy in microcode is looking reeeal
> attractive by comparison, ehn?

i recognise my style (and bad phone peck spelling) so the next para must be yours.

> Ignoring the escape and multi-byte characters, I have seen systems that
> support an encoded length to handle exactly that problem. For example,
> if the first byte is zero, the length is zero. If the high order bit of
> the first byte is zero, the remaining seven bits encode the length (up
> to 127). If the first bit is one but the second bit is zero remaining
> 14 bits in the first two bytes specify the length, up to 16K. If the
> first two bits are one, you use the remaining 22 bits in the first three
> bytes for very long strings. So, compared to null terminated strings,
> for pretty short strings, the additional storage overhead is zero, and
> it is one extra byte for strings between 127 and 16K. Overall, pretty
> negligible.

in storage space, yes. in terms of implementing in hardware, as a special
microcoded operation, quite risky. if the hardware microcode could be
*defined* (a la Transmeta) i would say that was a route worth pursuing.

an example we are seriously considering proposing to OPF is javascript
style FP to INT rounding. it has modulo 2^32 in the actual specification.
whoever thought that it was a good idea to convert FP to INT by doing
modulo arithmetic instead of saturation (like any sensible rounding would)
needs their frickin head examined.

typical implementations of FP to INT rounding are FORTY FIVE
instructions and involve FIVE branches.

in hardware however it is far simpler, and, here's the kicker, javascript
is hardly likely to drop this or change the spec any time soon.

consequently as a high profile stable and expensive operation it is
easy to justify adding.

various string routines, not standardised, very tricky.

l.

luke.l...@gmail.com

unread,
Jul 12, 2021, 10:04:28 PM7/12/21
to
On Tuesday, July 13, 2021 at 2:47:27 AM UTC+1, MitchAlsup wrote:
> On Monday, July 12, 2021 at 8:10:22 PM UTC-5, luke.l...@gmail.com wrote:

> > is the possibility of performing more than one byte-load terminated by the use of R3!=0?
> <
> No, each lane of the loop performs this on its byte and the loop controller assimilates it
> into what smells like a VL register but it actually goes into the LOOP instruction (which
> is a branch instruction) which decides if the loop terminates, and which values are to
> be left in the scalar registers.

niice. i have to assimilate this. effectively it automatically incorporates the set-before-first capability into LOOP.

l.


Stefan Monnier

unread,
Jul 12, 2021, 11:22:13 PM7/12/21
to
>> > it isn't going away.

Agreed, tho zero-terminated strings are rarely important for performance
nowadays I'd think.

>> bit integer added to every string. a zero byte string (empty) is
>> therefore 2 bytes. second: if strings are over 65535 bytes *you can't
>> have them* you are forced to use a 4 byte length encoder, now you have a
>> *4* byte overhead for short strings...

Nowadays you have to accommodate strings longer than 4GB, so you'll want
64bit for the length. In Emacs, a string object comes with 2
length fields, both of them `ptrdiff_t` (one of them is for the length
in bytes, the other for the length in "characters").

We haven't made an effort to assess whether it was the optimal/best choice,
but small strings are never the dominant factor in total heap size.

> in storage space, yes. in terms of implementing in hardware, as a special
> microcoded operation, quite risky. if the hardware microcode could be
> *defined* (a la Transmeta) i would say that was a route worth pursuing.

Agreed, I don't see much need for special hardware support for string
processing (whether zero-terminated or with a specified length).
If it's a significant part of your total processing time, there's a good
chance that the better way to speed it up is to use a different
representation rather than to try and squeeze a few extra percents from
the hardware.

> typical implementations of FP to INT rounding are FORTY FIVE
> instructions and involve FIVE branches.

But JS compilers should rarely need to generate such code, because they
should do enough analysis that in 99% of the cases they know that the
inputs were themselves integers and to do the whole computation without
any FP ops at all.

> in hardware however it is far simpler, and, here's the kicker, javascript
> is hardly likely to drop this or change the spec any time soon.

But JS compilers will still want to use integer ops instead of FP ops
whenever possible even if you can make this operation a bit faster, so
I'd expect this special hardware support to be largely unused.


Stefan

Thomas Koenig

unread,
Jul 13, 2021, 2:48:59 AM7/13/21
to
luke.l...@gmail.com <luke.l...@gmail.com> schrieb:
> On Tuesday, June 22, 2021 at 9:54:30 AM UTC+1, Thomas Koenig wrote:
>> Stephen Fuld <sf...@alumni.cmu.edu.invalid> schrieb:
>
>> While C was an amazing language design for its time and especially
>> for the hardware constraints of the machine it was developed for,
>> some of its features have not aged well. Null-terminated strings
>> are one of these features.
>
> it isn't going away. MSRPC (aka DCE/RPC) uses length-specifiers:
> first heavy cost, a 16 bit integer added to every string.

Unless a lot of your strings are very small, that should not be
an issue.

And, of course, this means that you cannot use strings to
store arbitrary data.

You need a 64-bit (or, less often these days, a 32-bit) pointer
to the string anyway.

>a zero
> byte string (empty) is therefore 2 bytes. second: if strings are
> over 65535 bytes *you can't have them* you are forced to use a 4
> byte length encoder, now you have a *4* byte overhead for short
> strings...want to do bith dynamicalky with escaoe sequencing?

Naw, just put an 8-byte length on the string and be done with it.
This is what gfortran does (switched from 4-byte string length
some time ago).

If you want to play games, put the string data for short strings
into a union with the metadata, as is done for C++ strings
(minus the NULL terminator of course).

> that
> strncpy in microcode is looking reeeal attractive by comparison,
> ehn?

Not really.

> then also you are forgetting that UTF8 is the de-facto web
> internet encoding format, and that is a whole new ballgane for
> which specislist opcodes are well known (except by me, sigh,
> i just heard about them)

Which specialized opcodes for which architecture?

MitchAlsup

unread,
Jul 13, 2021, 12:12:54 PM7/13/21
to
A) There are 48 ways to do {float,double}FP<->INT{signed,unsigned}
B) there are instructions in My 66000 to do any and all of them.

Ivan Godard

unread,
Jul 13, 2021, 8:37:27 PM7/13/21
to
48 ways? Two FP radices, five FP rounding modes, two int signednesses, 4
integer overflow behaviors. That's 80. If you are changing width in the
process (we don't) then it's 4 FP widths X 4 integer widths, which
explodes to much more than 48 and even if only {float, double}->single
sized int it runs the count to 160 cases. You are probably not dealing
with decimal, so we're back to 80, but not 48.

How do you get 48?

luke.l...@gmail.com

unread,
Jul 13, 2021, 8:52:52 PM7/13/21
to
On Tuesday, July 13, 2021 at 2:47:27 AM UTC+1, MitchAlsup wrote:

> Your code size seems to be about 2× what My 66000 code size happens to be.

yes. i am both impressed and also now challenged to unpack this. so, walking through it:

strlen:
MOV R2,#0
VEC R4,{R2}
LDUB R3,[R1+R2]
LOOP T,R2,R3!=#0 // LOOP type 3
MOV R1,R2
RET


let us assume that the hardware is a multi-issue precise-capable microarchitecture capable of doing 32 LDUBs per loop. given this is actually only 8 64-bit LDs it is not unreasonable.

let us assume that the Vec hardware attempts all 32. also that byte 17 crosses a page boundary and in speculative load terms throws a page fault

if the VVM hardware at this point says "ok only 17 are possible, cancel the other 15 elements" by pulling their Shadow "Die" flag, we have unwound the speculative in-flight execution to the 17 element mark.

the internal loop counter can then also unwind to 17.

also, the 15 speculatively-running LOOP R3!=0 tests can *also be Shadow cancelled*.

this means that although the Loop counter started out at 32 *it is safe to truncate to 17* all based on simple application of Precise Exception Shadow handling of speculative execution.

although i probably have some details a bit fuzzy, is this basically how VVM works?

because if so it *already has FFIRST.LOAD behaviour inherently built-in* without an explicit ffirst load instruction being needed.

l.


MitchAlsup

unread,
Jul 13, 2021, 10:00:11 PM7/13/21
to
There are 5 ways to round Double into UnSigned, Signed, or Float; each.
There are 5 ways to round Float into UnSigned or Signed; each.
There are 5 ways to round UnSigned into Double or Float; each
There are 5 ways to round Signed into Double or Float; each
And there are assignment conversions:
UnSigned = Signed,
Signed = UnSigned , and
Double = Float.
<
15+10+10+10+3 = 48
<
The signed = unsigned is not a MOV; it saturates at SPOSMAX.
The unsigned = signed is not a MOV; it saturates negatives to zero.

MitchAlsup

unread,
Jul 13, 2021, 10:10:01 PM7/13/21
to
On Tuesday, July 13, 2021 at 7:52:52 PM UTC-5, luke.l...@gmail.com wrote:
> On Tuesday, July 13, 2021 at 2:47:27 AM UTC+1, MitchAlsup wrote:
> > Your code size seems to be about 2× what My 66000 code size happens to be.
> yes. i am both impressed and also now challenged to unpack this. do, walking through it:
> strlen:
> MOV R2,#0
> VEC R4,{R2}
> LDUB R3,[R1+R2]
> LOOP T,R2,R3!=#0 // LOOP type 3
> MOV R1,R2
> RET
> let us assume that the hardware is a multi-issue precise-capable microarchitecture apable of doing 32 LDUBs per loop. given this is actually only 8 64-bit LDs it is not unreasonable.
<
32 bytes per cycle read is not unreasonable for a higher end machine--this may take 2 or even 4
cache banks to be performed in 1 cycle.
>
> let us assume that the Vec hardware attempts all 32. also that byte 17 crosses a page boundary and in speculative load terms throws a page fault
>
> if the VVM hardware at this point says "ok 17 is only possible, cancel the other 15 elements" by pulling their Shadow "Die" fkag, we have unwound the speculative in-flight execution to the 17 element mark.
<
It is more like 15 of them all detect the page fault and cancel themselves, letting the vector
termination sequencer clean up the mess. The 17 which got data are compared to zero,
and if a zero is detected, the loop terminates with Ri containing the proper strlen. If no zero was
found, the page fault control transfer is taken with IP pointing at LDUB and Ri properly set.
<
After page fault is "handled" control returns, the 17th LDUB is performed, then the LOOP
instruction is performed, which transfers control back to the VEC instruction, instead of
the LDUB instruction, the vectorized loop is setup again and then vectorized running continues.
>
> the internal loop counter can then also unwind to 17.
>
> also, the 15 speculatively-running LOOP R3!=0 tests can *also be Shadow cancelled*.
>
> this means that although the Loop counter started out at 32 *it is safe to truncate to 17* all based on simple application of Precise Exception Shadow handling of speculative execution.
>
> although i probably have some details a bit fuzzy, is this basically how VVM works?
<
Tolerably good guess.
>
> because if so it *already has FFIRST.LOAD behaviour inherently built-in* without an explicit ffirst load instruction being needed.
<
Precisely! Now go figure out how you don't need that instruction, either.
>
> l.

luke.l...@gmail.com

unread,
Jul 14, 2021, 12:30:14 AM7/14/21
to
On Wednesday, July 14, 2021 at 3:10:01 AM UTC+1, MitchAlsup wrote:

> > although i probably have some details a bit fuzzy, is this basically how VVM works?
> <
> Tolerably good guess.

awesome. ok now i am impressed. only frickin took me 2 years to grok.

so i would say, in conclusion, the code is so short and so effective there is no need for strlen op, strcpy etc.

> Precisely! Now go figure out how you don't need that instruction, either.

yehyeh, that's the hard part. i only just grokked Vertical-First Vectorisation, last week, the implications will take a while to propagate, after 3 years of studying Horizontal-first.

due to time constraints i may have to go with that "hint" idea, that rather than the arch automatically working out the number of safe parallel elements, the compiler tells you.

yes, i know, memory aliasing's a pain. haven't got my head round it yet.

l.

Stephen Fuld

unread,
Jul 14, 2021, 12:58:56 AM7/14/21
to
On 7/12/2021 6:55 PM, luke.l...@gmail.com wrote:
> On Tuesday, July 13, 2021 at 2:29:36 AM UTC+1, Stephen Fuld wrote:
>
>> First of all, you messed up the attributions. The comments above are
>> Thomas Koenig's, not mine.
>
> apologies, i am using a very small device and the google groups interface is awesomely dire. it converts HTML formatted reply indentation into "nothingness" (i.e. ignores them entirely). this ignoringness basically removes one level of reply attribution, without people's knowledge or consent. it's done it again to my replies.
>
> your replies are now attributed to me, thanks to google . if you were to switch to plaintext replies rather than "rich text" it would not mess up.

I am using Thunderbird (latest version). There is an option in the
account compose options to use HTML. I do NOT have that checked. Other
than that, I don't know how to do what you want me to. If anyone can
help, please do. But I should note that I think others in this group
use Google groups, and, in general, no one else does what your system does.

I am willing to experiment, but I don't know what to try.

Ivan Godard

unread,
Jul 14, 2021, 2:49:09 AM7/14/21
to
So you ignore integer overflow behavior, float to double, short float,
long float, and decimal. While you can get away without the rest (most
do), I think you need float to double.

Branimir Maksimovic

unread,
Jul 14, 2021, 7:57:35 AM7/14/21
to
On 2021-07-14, Stephen Fuld <sf...@alumni.cmu.edu.invalid> wrote:
> On 7/12/2021 6:55 PM, luke.l...@gmail.com wrote:
>> On Tuesday, July 13, 2021 at 2:29:36 AM UTC+1, Stephen Fuld wrote:
>>
>>> First of all, you messed up the attributions. The comments above are Thomas
>>> Koenig's, not mine.
>>
>> apologies, i am using a very small device and the google groups interface is
>> awesomely dire. it converts HTML formatted reply indentation into
>> "nothingness" (i.e. ignores them entirely). this ignoringness basically
>> removes one level of reply attribution, without people's knowledge or
>> consent. it's done it again to my replies.
>>
>> your replies are now attributed to me, thanks to google . if you were to
>> switch to plaintext replies rather than "rich text" it would not mess up.
>
> I am using Thunderbird (latest version). There is an option in the account
> compose options to use HTML. I do NOT have that checked. Other than that, I
> don't know how to do what you want me to. If anyone can help, please do.
> But I should note that I think others in this group use Google groups, and,
> in general, no one else does what your system does.

I use https://usenet-news.net/
slrn+vim on macOS :P


--
bmaxa now listens Ob-Neob Radio

Stefan Monnier

unread,
Jul 14, 2021, 9:35:18 AM7/14/21
to
Stephen Fuld [2021-07-13 21:58:53] wrote:
> I am using Thunderbird (latest version). There is an option in the account
> compose options to use HTML. I do NOT have that checked. Other than that,
> I don't know how to do what you want me to. If anyone can help, please do.
> But I should note that I think others in this group use Google groups, and,
> in general, no one else does what your system does.
>
> I am willing to experiment, but I don't know what to try.

FWIW, your messages look fine from my end (no HTML, proper quoting, ...).


Stefan

David Brown

unread,
Jul 14, 2021, 9:46:38 AM7/14/21
to
I concur - looking at the source for your messages, they are all
absolutely fine.

Google groups doesn't need HTML posts or anything else in order to screw
up formatting, attributions, indentation, paragraph separation, code
snippets and every thing else. Perhaps they have even found a way to
make it do different bad things when you are using a small device rather
than a computer.


Terje Mathisen

unread,
Jul 14, 2021, 11:09:34 AM7/14/21
to
Do you?

double float2double(float x)

is basically a zero-extend/NOP (except when presented with a subnormal
float input) which just pads the exponent field from 8 to 11 bits
(adding (1023-127) to it) and adds 29 zero bits to the mantissa.

The key here is that there is no way this can generate any kind of
exception afaik?

The subnormal input will just be normalized.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

EricP

unread,
Jul 14, 2021, 12:25:13 PM7/14/21
to
Convert double/single: the source might be a signalling NaN.

Convert double to single might overflow exponent or underflow.
On x86/x64 those can trigger exceptions.




MitchAlsup

unread,
Jul 14, 2021, 12:38:46 PM7/14/21
to
Float to Double exists:: I counted the encoding again and came up with 48
again, and found (double)Rd=(float)Rs1, but I don't see how it did not get
accounted for in the "5 ways" above.

There is no HW support for short floats or long doubles or decimal

MitchAlsup

unread,
Jul 14, 2021, 12:40:36 PM7/14/21
to
Yes, any reasonable HW must perform these "little diversions" within the
name of convert...........

EricP

unread,
Jul 14, 2021, 4:17:51 PM7/14/21
to
I didn't want the possibility of FP stores to throw conversion exceptions
as the x86 can do. That gets into nasty things like the LSQ having to
sync with and query the FPU control register trap flags.

If a down-convert store of fp64 to fp32 or fp16 value overflows then
it saturates to the max value representable at the dest size
(it could saturate to infinity but that didn't seem correct),
or underflows to zero.

If one doesn't want that behavior then one does a down-conversion check
beforehand which may trigger exceptions.
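A C sketch of that saturating down-convert (the function name is mine; clamping an infinite input to the largest finite float is one possible reading of the judgement call mentioned above, not a claim about any real ISA):

```c
#include <assert.h>
#include <float.h>
#include <math.h>

/* Saturating fp64 -> fp32 down-convert: overflow clamps to the largest
   finite float rather than trapping or going to infinity; underflow
   simply flushes toward zero via the cast. */
static float down_convert_sat(double d)
{
    if (isnan(d))
        return (float)d;          /* NaN passes through */
    if (d > FLT_MAX)
        return FLT_MAX;           /* clamp overflow (and +inf, here) */
    if (d < -FLT_MAX)
        return -FLT_MAX;
    return (float)d;              /* in range, or underflow toward 0 */
}
```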


Ivan Godard

unread,
Jul 14, 2021, 4:41:30 PM7/14/21
to
Why not explicitly down-convert always? The program asked for it, after all.

luke.l...@gmail.com

unread,
Jul 14, 2021, 4:44:07 PM7/14/21
to
On Wednesday, July 14, 2021 at 5:25:13 PM UTC+1, EricP wrote:

> Convert double/single source might be a signalled Nan.
>
> Convert double to single might overflow exponent or underflow.
> On x86/x64 those can trigger exceptions.

https://libre-soc.org/openpower/sv/int_fp_mv/

there's 4 main types (different rules)
* rust / llvm / java / SPIR-V - these are saturated without error
* standard IEEE754 conversion - provided by almost all languages and all CPUs
* x86 and Power ISA "saturation with NaN converted to minimum valid integer"
* the borked-in-the-brain, utterly-insane, javascript "modulo 2^32 / 2^64" conversion

l.

Ivan Godard

unread,
Jul 14, 2021, 5:22:50 PM7/14/21
to
These are actually two distinct steps, which may be combined for
encoding reasons.

The first step is converting the FP to another FP that has an integral
value, using the governing rounding rules. This is an IEEE standard
operation; note that the result has no fraction part, but the exponent
may cause the integral value represented to be larger than that
representable in an integer of the same size. This step can raise
inexact if the argument was not already integral, and for some rounding
modes and maximal FP arguments can raise FP overflow with saturation to
infinity. A signalling NaN should signal.

The second step is to convert the integral FP value to an integer value,
which may overflow. This step should the same overflow detection and
reporting mechanism used for any other integer overflows in the ISA. A
quiet NaN should raise whatever integer invalid-datum exception is
defined by the ISA. Most ISAs have no such exception, for a NaN would be UB.

MitchAlsup

unread,
Jul 14, 2021, 6:51:54 PM7/14/21
to
On Wednesday, July 14, 2021 at 3:17:51 PM UTC-5, EricP wrote:
> MitchAlsup wrote:
> > On Wednesday, July 14, 2021 at 11:25:13 AM UTC-5, EricP wrote:
> > <<<<<<<
> >> Convert double/single source might be a signalled Nan.
> >>
> >> Convert double to single might overflow exponent or underflow.
> >> On x86/x64 those can trigger exceptions.
> > <<<<<<<
> > Yes, any reasonable HW must perform these "little diversions" within the
> > name of convert...........
<
I'm lost::
<
> I didn't want the possibility of FP stores to throw conversion exceptions
> as the x86 can do. That gets into nasty things like the LSQ having to
> sync with and query the FPU control register trap flags.
<
We were talking about the CVT instruction not the ST instruction.
<
>
> If a down-convert store of fp64 to fp32 or fp16 value overflows then
> it saturates to the max value representable at the dest size
> (it could saturate to infinity but that didn't seem correct),
> or underflows to zero.
<
CVT should operate under the guise of "value that best represents the operand".
This could be Infinity, denorm, zero, or QNaN.
>
> If one doesn't want that behavior then one does a down-conversion check
> beforehand which may trigger exceptions.
<
Which is why we were talking about the CVT instruction and not the ST instruction.

MitchAlsup

unread,
Jul 14, 2021, 6:57:30 PM7/14/21
to
On Wednesday, July 14, 2021 at 4:22:50 PM UTC-5, Ivan Godard wrote:
> On 7/14/2021 1:44 PM, luke.l...@gmail.com wrote:
> > On Wednesday, July 14, 2021 at 5:25:13 PM UTC+1, EricP wrote:
> >
> >> Convert double/single source might be a signalled Nan.
> >>
> >> Convert double to single might overflow exponent or underflow.
> >> On x86/x64 those can trigger exceptions.
> >
> > https://libre-soc.org/openpower/sv/int_fp_mv/
> >
> > there's 4 main types (different rules)
> > * rust / llvm / java / SPIR-V - these are saturated without error
> > * standard IEEE754 conversion - provided by almost all languages and all CPUs
> > * x86 and Power ISA "saturation with NaN converted to minimum valid integer"
> > * the borked-in-the-brain, utterly-insane, javascript "modulo 2^32 / 2^64" conversion
> >
> > l.
> >
> These are actually two distinct steps, which may be combined for
> encoding reasons.
>
> The first step is converting the FP to another FP that has an integral
> value, using the governing rounding rules.
<
The first step is converting the FP to another FP that has no fractional
significance, using the governing rounding rules.
<
> This is an IEE standard
> operation; not that the result has no fraction part, but the exponent
> may cause the integral value represented to be larger than that
> representable in an integer of the same size. This step can raise
> inexact if the argument was not already integral, and for some rounding
> modes and maximal FP arguments can raise FP overflow with saturation to
> infinity. A signalling NaN should signal.
>
> The second step is to convert the integral FP value to an integer value,
> which may overflow. This step should
<
?contain?

MitchAlsup

unread,
Jul 14, 2021, 8:46:10 PM7/14/21
to
Also note:
There are 12=2×6 ways to round a {double, float} to the same container.
<
{I.e. get rid of the fractional significance.}
<
These do not convert between types, but round within a single type and use
an instruction spelled RND rather than CVT.

EricP

unread,
Jul 14, 2021, 11:27:48 PM7/14/21
to
MitchAlsup wrote:
> On Wednesday, July 14, 2021 at 3:17:51 PM UTC-5, EricP wrote:
>> MitchAlsup wrote:
>>> On Wednesday, July 14, 2021 at 11:25:13 AM UTC-5, EricP wrote:
>>> <<<<<<<
>>>> Convert double/single source might be a signalled Nan.
>>>>
>>>> Convert double to single might overflow exponent or underflow.
>>>> On x86/x64 those can trigger exceptions.
>>> <<<<<<<
>>> Yes, any reasonable HW must perform these "little diversions" within the
>>> name of convert...........
> <
> I'm lost::
> <

I'm just muddying the waters a bit I suppose.

For example, to contrast Intel's FP approach with one FADD instruction,
vs Alpha's 4 FADDx instructions (one for each data type).

The design consequences propagate to the store instruction
when one tries to store an fp64 register value to 32-bit memory.

>> I didn't want the possibility of FP stores to throw conversion exceptions
>> as the x86 can do. That gets into nasty things like the LSQ having to
>> sync with and query the FPU control register trap flags.
> <
> We were talking about the CVT instruction not the ST instruction.
> <

As I mentioned the x86 FP store does do a convert which can cause
exceptions if one stores an fp64 to 32-bit memory operand.

>> If a down-convert store of fp64 to fp32 or fp16 value overflows then
>> it saturates to the max value representable at the dest size
>> (it could saturate to infinity but that didn't seem correct),
>> or underflows to zero.
> <
> CVT should operate under the guise of "value that best represents the operand".
> This could be Infnity, denorm, zero, or QNaN.

As I mentioned, Intel FP store can do a conversion and can trap.

Alpha, in contrast, does a straight forward bit-wise transform on the
value to be stored. It is your responsibility to make sure that the
value in the source register was the correct data type to begin with.
(chops high order bits off the exponent, low order bits off the mantissa,
packs remaining bits together.)
(Alpha also has multiple memory formats to deal with:
VAX 32-bit F, VAX 64-bit G (both weirdo middle endian layout)
IEEE 32-bit S, IEEE 64-bit T.)

>> If one doesn't want that behavior then one does a down-conversion check
>> beforehand which may trigger exceptions.
> <
> Which is why we were talking about the CVT instruction and not the ST instruction.

I'm pointing out that when one stores an fp64 to fp32 memory
there is a down-conversion taking place.
That conversion can be done like Intel (with traps),
like Alpha (with bit truncates), or maybe with saturates.

Intel has one FADD instruction that generates the same output results.

Alpha has an FADDx instruction for each data type F, G, S, T.
If you are going to store an 'S' 32-bit IEEE type then the last
instruction had to generate an S type into the dest register.
The FP store instruction STF assumes the source register contains an
F data type and therefore the store's conversion to 32-bits "works".



George Neuner

unread,
Jul 15, 2021, 8:48:20 AM7/15/21
to
On Tue, 13 Jul 2021 21:58:53 -0700, Stephen Fuld
<sf...@alumni.cmu.edu.invalid> wrote:

>I am using Thunderbird (latest version). There is an option in the
>account compose options to use HTML. I do NOT have that checked. Other
>than that, I don't know how to do what you want me to. If anyone can
>help, please do. But I should note that I think others in this group
>use Google groups, and, in general, no one else does what your system does.
>
>I am willing to experiment, but I don't know what to try.

*Disclaimer* I have never played with these settings.

In the general Options under Composition there is a "Send Options"
button. It allows you to declare mail domains that expect messages to
be in plain text. Also you can control what to do when a message has
multiple recipients that expect different formats.

Perhaps if you set Google / Gmail as a text domain? But you probably
would have to prevent TBird from also sending in HTML format.


George

David Brown

unread,
Jul 15, 2021, 9:44:59 AM7/15/21
to
That all applies to email, not Usenet. When you post to a newsserver,
you are not sending to google or a gmail account.

His posts are in plain text, not HTML - there is nothing wrong with his
settings as far as I can see. It is google groups that is broken.

luke.l...@gmail.com

unread,
Jul 15, 2021, 9:46:47 AM7/15/21
to
On Thursday, July 15, 2021 at 4:27:48 AM UTC+1, EricP wrote:

> Alpha, in contrast, does a straight forward bit-wise transform on the
> value to be stored. It is your responsibility to make sure the the
> value in the source register was the correct data type to begin with.
> (chops high order bits off the exponent, low order bit off mantissa,
> packs remaining bits together.)

Power ISA likewise. Single precision takes place *at* single precision, the bits are simply distributed over 64 bit *as if* and in identical format to FP64. to the extent that, should a decision be made to use FP64 operations they may do so *without conversion*.

the only conversion needed is FP64-to-FP32 in order to round and check range, but the other way FP32-to-FP64 is never needed.

LDST for FP32 knows exactly what to do: transfer the appropriate bits to/from the 64 bit FPR reg. no conversion needed there either.

it is very elegant, a little weird, and takes getting used to.

l.


luke.l...@gmail.com

unread,
Jul 15, 2021, 9:54:01 AM7/15/21
to
On Thursday, July 15, 2021 at 2:44:59 PM UTC+1, David Brown wrote:

> His posts are in plain text, not HTML - there is nothing wrong with his
> settings as far as I can see. It is google groups that is broken.

you can see online what happened:

https://groups.google.com/g/comp.arch/c/0n9Ko6wjmHU/m/KNjd3pSKAAAJ

Stephen replied to me, but look carefully: my text starting "MSRPC" is not properly indented; it appears in a subtly different font and colour.

replying to that, google groups via a mobile chrome browser was unable to cope, and crushed the indentation.

i suspect therefore it is a unique interaction: not one factor but two.

i find there is no problem with indentation replying to *anyone else's* messages: it appears to be a unique interaction between Thunderbird and google groups when used with chrome on mobile.

l.

Stephen Fuld

unread,
Jul 15, 2021, 10:57:49 AM7/15/21
to
Thanks to all who replied. I agree with David that the options George
referred to are for e-mail, not newsgroups. And for newsgroups, there
is no "domain" to put in the text list.

Luke may be right that it is some weird interaction. Can someone else
using Thunderbird reply so Luke can test his hypothesis?

In the meantime, since no one else seems to have a problem with my
posts, and there isn't an obvious change to make, I am not going to
change anything. But I will double check any of Luke's replies to my
posts to see if the attributions are correct.

MitchAlsup

unread,
Jul 15, 2021, 11:57:11 AM7/15/21
to
On Thursday, July 15, 2021 at 8:46:47 AM UTC-5, luke.l...@gmail.com wrote:
> On Thursday, July 15, 2021 at 4:27:48 AM UTC+1, EricP wrote:
>
> > Alpha, in contrast, does a straight forward bit-wise transform on the
> > value to be stored. It is your responsibility to make sure the the
> > value in the source register was the correct data type to begin with.
> > (chops high order bits off the exponent, low order bit off mantissa,
> > packs remaining bits together.)
> Power ISA likewise. Single-precision arithmetic takes place *at* single precision, but the bits are simply distributed over 64 bits in a format identical to FP64, to the extent that, should a decision be made to use FP64 operations, they may proceed *without conversion*.
>
> the only conversion needed is FP64-to-FP32, in order to round and check range; the other way, FP32-to-FP64, is never needed.
<
Going FP32-to-FP64, one only has to worry about denorms becoming norms.
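The asymmetry described above can be checked in software (a Python sketch, not ISA code; the helper name `as_fp32` is made up for illustration): FP32-to-FP64 widening is always exact, and an FP32 denorm widens to an ordinary normal FP64 value.

```python
import struct

def as_fp32(x):
    """Round a Python float (FP64) to the nearest FP32 value, returned as FP64."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

# Narrowing FP64 -> FP32 rounds (0.1 is not exactly representable in FP32),
# but the widened result round-trips bit-for-bit: FP32 -> FP64 never rounds.
v = as_fp32(0.1)
assert struct.pack('<f', v) == struct.pack('<f', as_fp32(v))

# An FP32 subnormal widens to a perfectly ordinary (normal) FP64 value.
tiny = as_fp32(1.4e-45)          # smallest positive FP32 subnormal, 2**-149
assert tiny > 0.0
assert tiny == 2.0 ** -149       # exact in FP64, where it is a normal number
```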

David Brown

unread,
Jul 15, 2021, 12:05:39 PM7/15/21
to
I've looked a bit closer now - it was helpful that Luke pointed out
exactly which part he had written but which appeared as /your/ text in
your post.

There is nothing wrong with your Thunderbird settings. But you made an
error when replying to Luke. You were trying to snip some of his reply,
including the start of the paragraph containing "MSRPC...", and only
keeping that part. Unfortunately, you also deleted the ">" indent
symbol and thus the bit you left was formatted as coming from you, not Luke.

So nothing dramatic or complicated is going on - you just deleted one
character too many.

Of course, the blame still lies with the idiotic and broken google
groups - if it followed Usenet standards and got its line lengths and
paragraphs formatted correctly, you would not have been faced with a
single big line but a proper quoted paragraph, and you'd have seen the
mistake immediately.

Luke, if at all possible, /please/ get a proper newsreader and proper
newsserver. (I realise it may not be possible.)


George Neuner

unread,
Jul 15, 2021, 1:19:30 PM7/15/21
to
On Thu, 15 Jul 2021 15:44:56 +0200, David Brown
<david...@hesbynett.no> wrote:

>On 15/07/2021 14:48, George Neuner wrote:
>> On Tue, 13 Jul 2021 21:58:53 -0700, Stephen Fuld
>> <sf...@alumni.cmu.edu.invalid> wrote:
>>
>>> I am using Thunderbird (latest version). There is an option in the
>>> account compose options to use HTML. I do NOT have that checked. Other
>>> than that, I don't know how to do what you want me to. If anyone can
>>> help, please do. But I should note that I think others in this group
>>> use Google groups, and, in general, no one else does what your system does.
>>>
>>> I am willing to experiment, but I don't know what to try.
>>
>> *Disclaimer* I have never played with these settings.
>>
>> In the general Options under Composition there is a "Send Options"
>> button. It allows you to declare mail domains that expect messages to
>> be in plain text. Also you can control what to do when a message has
>> multiple recipients that expect different formats.
>>
>> Perhaps if you set Google / Gmail as a text domain? But you probably
>> would have to prevent TBird from also sending in HTML format.
>>
>
>That all applies to email, not Usenet. When you post to a newsserver,
>you are not sending to google or a gmail account.

In TBird it /can/ apply to NN posts also: all that matters is what
domain is being addressed.

>His posts are in plain text, not HTML - there is nothing wrong with his
>settings as far as I can see. It is google groups that is broken.

I know Google Groups is broken.

The issue here is what TBird does for NNTP. Certainly HTML /can/ be
sent via NNTP, and by default TBird sends /email/ in both text and
HTML format.

What I don't know - and can't seem to find out easily - is whether
Tbird does the same for NNTP. If so, Google could be confused even
more than usual. I don't use Tbird for NNTP [I use a dedicated
client] so I don't know what TBird does with it.

George

Stephen Fuld

unread,
Jul 15, 2021, 1:25:19 PM7/15/21
to
Thank you! I appreciate the extra effort you went to on this. And I
apologize to Luke for my error.



> Of course, the blame still lies with the idiotic and broken google
> groups - if it followed Usenet standards and got its line lengths and
> paragraphs formatted correctly, you would not have been faced with a
> single big line but a proper quoted paragraph, and you'd have seen the
> mistake immediately.
>
> Luke, if at all possible, /please/ get a proper newsreader and proper
> newsserver. (I realise it may not be possible.)


Since he said Chrome on a tiny device, I assume he is using Android. I
know nothing about what apps may be available for a newsreader under
Android (I know the situation with IOS isn't good). But as long as you
only want the text newsgroups, there is at least one good, free
solution, Eternal September. I am a happy, long time user, but have no
other connection with it. There are probably others.

luke.l...@gmail.com

unread,
Jul 15, 2021, 2:39:56 PM7/15/21
to
On Thursday, July 15, 2021 at 6:25:19 PM UTC+1, Stephen Fuld wrote:
> On 7/15/2021 9:05 AM, David Brown wrote:
> > So nothing dramatic or complicated is going on - you just deleted a
> > character too much.
> Thank you! I appreciate the extra effort you went to on this.

likewise, David.

> And I apologize to Luke for my error.

it may be more involved: the original de-attribution was before the
one you looked at, David.

> Since he said Chrome on a tiny device, I assume he is using Android.

with google play entirely disabled. it's a 6-year-old Asia-market Samsung N9005;
i currently only install apps via F-Droid, now.

l.

Terje Mathisen

unread,
Jul 15, 2021, 3:50:36 PM7/15/21
to
EricP wrote:
> MitchAlsup wrote:
>> On Wednesday, July 14, 2021 at 11:25:13 AM UTC-5, EricP wrote:
>> <<<<<<<
>>> Convert double/single source might be a signalled Nan.
>>> Convert double to single might overflow exponent or underflow. On
>>> x86/x64 those can trigger exceptions.
>> <<<<<<<
>> Yes, any reasonable HW must perform these "little diversions" within the
>> name of convert...........
>
> I didn't want the possibility of FP stores to throw conversion exceptions
> as the x86 can do. That gets into nasty things like the LSQ having to
> sync with and query the FPU control register trap flags.
>
> If a down-convert store of fp64 to fp32 or fp16 value overflows then
> it saturates to the max value representable at the dest size
> (it could saturate to infinity but that didn't seem correct),
> or underflows to zero.

Rather the opposite:

Overflow has to go to Inf, underflow to Zero!

(At least if you want to claim IEEE 754 compliance.)
>
> If one doesn't want that behavior then one does a down-conversion check
> beforehand which may trigger exceptions.
>
ok.
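Terje's rule is what an IEEE 754 round-to-nearest narrowing actually produces, and it can be checked from software (a Python sketch using a C-style cast via ctypes; `narrow_to_fp32` is an illustrative name, not an ISA operation):

```python
import ctypes
import math

def narrow_to_fp32(x):
    """C-style (float) cast of a double, i.e. IEEE 754 round-to-nearest narrowing."""
    return ctypes.c_float(x).value

# IEEE 754 narrowing: overflow goes to Inf, underflow goes to zero,
# rather than saturating at the largest finite FP32 value.
assert math.isinf(narrow_to_fp32(1e40))       # far above FP32 max (~3.4e38)
assert narrow_to_fp32(-1e40) == -math.inf
assert narrow_to_fp32(1e-50) == 0.0           # far below the smallest FP32 subnormal
```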

Ivan Godard

unread,
Jul 15, 2021, 5:42:04 PM7/15/21
to
I use Thunderbird, and he hasn't reported any issues when he has
responded to my posts. However, I go through eternal-september, so the
problem may indeed be Google Groups.

Stephen Fuld

unread,
Jul 15, 2021, 6:11:02 PM7/15/21
to
I too use eternal-september.

David Brown

unread,
Jul 16, 2021, 4:07:41 AM7/16/21
to
On 15/07/2021 19:19, George Neuner wrote:
> On Thu, 15 Jul 2021 15:44:56 +0200, David Brown
> <david...@hesbynett.no> wrote:
>
>> On 15/07/2021 14:48, George Neuner wrote:
>>> On Tue, 13 Jul 2021 21:58:53 -0700, Stephen Fuld
>>> <sf...@alumni.cmu.edu.invalid> wrote:
>>>
>>>> I am using Thunderbird (latest version). There is an option in the
>>>> account compose options to use HTML. I do NOT have that checked. Other
>>>> than that, I don't know how to do what you want me to. If anyone can
>>>> help, please do. But I should note that I think others in this group
>>>> use Google groups, and, in general, no one else does what your system does.
>>>>
>>>> I am willing to experiment, but I don't know what to try.
>>>
>>> *Disclaimer* I have never played with these settings.
>>>
>>> In the general Options under Composition there is a "Send Options"
>>> button. It allows you to declare mail domains that expect messages to
>>> be in plain text. Also you can control what to do when a message has
>>> multiple recipients that expect different formats.
>>>
>>> Perhaps if you set Google / Gmail as a text domain? But you probably
>>> would have to prevent TBird from also sending in HTML format.
>>>
>>
>> That all applies to email, not Usenet. When you post to a newsserver,
>> you are not sending to google or a gmail account.
>
> In TBird it /can/ apply to NN posts also: all that matters is what
> domain is being addressed.

It may be that Thunderbird can send in HTML to newsgroups. But I don't
know what domain settings you would use in the "Send options" box to do
that, as Usenet posts do not send to a domain. Certainly if you are
using, say, news.eternal-september.org as the news server then your
posts have no connection to Google domains when they are sent.

>
>> His posts are in plain text, not HTML - there is nothing wrong with his
>> settings as far as I can see. It is google groups that is broken.
>
> I know Google Groups is broken.
>
> The issue here is what TBird does for NNTP. Certainly HTML /can/ be
> sent via NNTP, and by default TBird sends /email/ in both text and
> HTML format.

Thunderbird sends in both HTML and plain text if you send an email in
HTML, and the receiver email address is not in the "HTML domains" list
(empty by default), and you have the appropriate option for sending in
both formats enabled (it is by default).

So if you have your email set up to send HTML, then /email/ will go in
both formats (unless you hold down shift while clicking "Write",
"Reply", etc., - then you get plain text if HTML is your standard, or
HTML if plain text is your standard). AFAIK, the standard for Usenet
posts is always plain text. But then, I always make sure I have plain
text for normal emails.

>
> What I don't know - and can't seem to find out easily - is whether
> Tbird does the same for NNTP. If so, Google could be confused even
> more than usual. I don't use Tbird for NNTP [I use a dedicated
> client] so I don't know what TBird does with it.
>

All the posts here have been plain text, as far as I have seen. But it
is possible that my server (news.eternal-september.org - the same as
Stephen and many others here) automatically converts any HTML posts to
plain text, and that has thrown me off. It would be surprising, to me
at least, if the server passed on HTML for feeds to other news servers
(including Google) but stripped it for its own client connections.

What Stephen could do here is look in the "Sent" folder of his "Local
folders", where Thunderbird (by default) stores posts sent to the news
server. Using ctrl-U on one of these, he could see the exact source of
the post sent to the server - then we would know for sure if he had sent
HTML or plain text.


Terje Mathisen

unread,
Jul 16, 2021, 7:42:26 AM7/16/21
to
I have used SeaMonkey since it was first forked out from the original
Mozilla all-in-one project, the mail/news part is more or less the same
as Thunderbird.

I'm confident Google Groups is the culprit.

Stephen Fuld

unread,
Jul 16, 2021, 11:35:29 AM7/16/21
to
Done. See below:

From: Stephen Fuld <sf...@alumni.cmu.edu.invalid>
Date: Thu, 15 Jul 2021 15:11:00 -0700
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0)
Gecko/20100101
Thunderbird/78.12.0
MIME-Version: 1.0
In-Reply-To: <scqa39$mab$1...@dont-email.me>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 8bit

In the third line from the bottom, it clearly says text/plain
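For anyone who wants to check their own Sent copies the same way, Python's stdlib `email` module can parse the raw source and report the declared content type (a sketch; the headers below are reconstructed in the shape of the dump above, not the real message):

```python
from email import message_from_string

# Headers reconstructed in the same shape as the dump above (address elided).
raw = (
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=utf-8; format=flowed\n"
    "Content-Transfer-Encoding: 8bit\n"
    "\n"
    "body text\n"
)
msg = message_from_string(raw)
assert msg.get_content_type() == "text/plain"   # plain text, not text/html
assert msg.get_param("format") == "flowed"      # Thunderbird's format=flowed
```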