
The Answer to Life, the Universe, and Everything


Quadibloc

Jun 11, 2022, 9:31:48 PM
One advantage to 24-bit computers is that in addition to handling 6-bit
upper-case only character codes, they can also fit three eight-bit
characters in a word, and so use modern character codes like ASCII
and EBCDIC that have room for lower-case letters!

It's true that they eventually found a use for the extra eighth bit
in a byte with ASCII, by going to the 8859-1 code, with accented
letters for a lot of foreign languages. Or one could use the high
bit to indicate two-byte characters in a DBCS for Chinese or
Japanese.

But many computer systems didn't bother much with furrin
languages. And so there's that wasted bit! So instead of a
24-bit word that handles both 6-bit characters and 8-bit
characters...

maybe we should want a computer that can handle 6-bit
characters and 7-bit characters! (Especially if four-bit
BCD-coded decimal digits are not a concern.) So perhaps
there should have been more computers with a 42-bit
word length. (ASI did make _one_ computer with a 21-bit
word length...)

John Savard

Stefan Monnier

Jun 11, 2022, 11:41:40 PM
Quadibloc [2022-06-11 18:31:46] wrote:
> maybe we should want a computer that can handle 6-bit
> characters and 7-bit characters! (Especially if four-bit
> BCD-coded decimal digits are not a concern.) So perhaps
> there should have been more computers with a 42-bit
> word length. (ASI did make _one_ comuter with a 21-bit
> word length...)

Fractional bit length anyone?


Stefan

Michael S

Jun 12, 2022, 9:05:56 AM
Itanium (42.(6) bits per instruction, since a 128-bit bundle holds three instructions and 128/3 = 42.666...) appears to satisfy both your and John's wishes simultaneously.

Stefan Monnier

Jun 12, 2022, 10:36:04 AM
Michael S [2022-06-12 06:05:54] wrote:
> Itanium (42.(6) bits per instruction) appears to satisfy both yours and
> John's wishes simultaneously.

I guess the Next Frontier is irrational bit length, then?
Unless Itanium's demise means we have to re-invent fractional bit
length first? Or should we go straight to imaginary bit length?


Stefan

David Brown

Jun 12, 2022, 10:58:04 AM
Such machines would be of no use without software support. Fortunately,
there is already a proposal for extending integer types in C that will
cover this (and more) :

<https://open-std.org/JTC1/SC22/WG21/docs/papers/2018/p0989r0.pdf>

Stefan Monnier

Jun 12, 2022, 11:25:56 AM
David Brown [2022-06-12 16:58:00] wrote:
> Such machines would be of no use without software support. Fortunately,
> there is already a proposal for extending integer types in C that will cover
> this (and more) :
>
> <https://open-std.org/JTC1/SC22/WG21/docs/papers/2018/p0989r0.pdf>

Sadly, it hasn't yet made it into a standard; furthermore, it was being
discussed for C++, so it's probably going to be many more years before it
makes it into C.


Stefan

Ivan Godard

Jun 12, 2022, 12:40:26 PM
Some languages have integral types with arbitrary static bounds, with
bounds checking. Mary is the only one I know of that had biased bounds:
    type twentieth range(1900, 1999);
fits in a byte.
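
For illustration, a minimal C sketch of what a compiler might emit for such a biased
byte representation (the helper names here are invented, not anything from Mary):

#include <stdint.h>

/* A value in 1900..1999 stored with a bias of 1900, so it fits in one byte. */
typedef uint8_t twentieth_rep;                 /* stored value = year - 1900 */

static inline twentieth_rep twentieth_pack(int year)          { return (twentieth_rep)(year - 1900); }
static inline int           twentieth_unpack(twentieth_rep r) { return 1900 + r; }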

MitchAlsup

Jun 12, 2022, 1:02:28 PM
PL/I had widths applied and range-checked.
Ada has an entire system of width application and range checking.

Ivan Godard

Jun 12, 2022, 2:02:05 PM
But not with biased representation as far as I know; please correct me.

Niklas Holsti

Jun 12, 2022, 2:03:10 PM
On 2022-06-12 19:40, Ivan Godard wrote:
> On 6/12/2022 7:58 AM, David Brown wrote:
>> On 12/06/2022 05:41, Stefan Monnier wrote:
>>> Quadibloc [2022-06-11 18:31:46] wrote:
>>>> maybe we should want a computer that can handle 6-bit
>>>> characters and 7-bit characters! (Especially if four-bit
>>>> BCD-coded decimal digits are not a concern.) So perhaps
>>>> there should have been more computers with a 42-bit
>>>> word length. (ASI did make _one_ comuter with a 21-bit
>>>> word length...)
>>>
>>> Fractional bit length anyone?
>>>
>>
>> Such machines would be of no use without software support.
>> Fortunately, there is already a proposal for extending integer types
>> in C that will cover this (and more) :
>>
>> <https://open-std.org/JTC1/SC22/WG21/docs/papers/2018/p0989r0.pdf>


Good fun, that.


> Some languages have integral types with arbitrary static bounds, with
> bounds checking. Mary is the only one I know that had biased bounds:
>     type twentieth range(1900, 1999);
> fit in a byte.


Some Ada compilers do biased representations, too, but I believe it is
not a requirement of the Ada standard.

Example source file twenth.ads:

package twenth is
type twentieth is range 1900 .. 1999 with Size => 8;
end twenth;

Compiling with the gcc-based GNAT Ada compiler, using "gnatmake twenth.ads":

gcc -c twenth.ads
twenth.ads:2:46: warning: size clause forces biased representation
for "twentieth"

This works also with a Size of 7 bits, the smallest possible size for
this numeric range even with bias.

Ivan Godard

Jun 12, 2022, 2:14:01 PM
I stand corrected. This didn't exist the last time I had anything to do
with Ada; thank you.

MitchAlsup

Jun 12, 2022, 2:33:52 PM
Ada has the equivalent of "type twentieth range(1900, 1999);".

David Brown

Jun 12, 2022, 3:29:03 PM
On 12/06/2022 20:03, Niklas Holsti wrote:
> On 2022-06-12 19:40, Ivan Godard wrote:
>> On 6/12/2022 7:58 AM, David Brown wrote:
>>> On 12/06/2022 05:41, Stefan Monnier wrote:
>>>> Quadibloc [2022-06-11 18:31:46] wrote:
>>>>> maybe we should want a computer that can handle 6-bit
>>>>> characters and 7-bit characters! (Especially if four-bit
>>>>> BCD-coded decimal digits are not a concern.) So perhaps
>>>>> there should have been more computers with a 42-bit
>>>>> word length. (ASI did make _one_ comuter with a 21-bit
>>>>> word length...)
>>>>
>>>> Fractional bit length anyone?
>>>>
>>>
>>> Such machines would be of no use without software support.
>>> Fortunately, there is already a proposal for extending integer types
>>> in C that will cover this (and more) :
>>>
>>> <https://open-std.org/JTC1/SC22/WG21/docs/papers/2018/p0989r0.pdf>
>
>
> Good fun, that.
>

I expect many people read the starting sections, taking it somewhat
seriously - but only a few bother to get towards the end or look at the
publication date!


Quadibloc

Jun 12, 2022, 4:47:08 PM
On Sunday, June 12, 2022 at 1:29:03 PM UTC-6, David Brown wrote:

> I expect many people read the starting sections, taking it somewhat
> seriously - but only a few bother to get towards the end or look at the
> publication date!

2018-04-01
which is April 1, 2018.

Oh, dear, then the integer types in C won't get extended!

However, specifying 24-bit integers as "short long int" and extending
that principle further is far too complicated and confusing; one wants
an explicit method to specify the number of bits involved.

John Savard

Marcus

Jun 13, 2022, 12:54:27 AM
Since C99 we have stdint.h, which provides fixed-size integers, e.g.:

int16_t, uint8_t, int64_t, etc.

I doubt that int24_t/uint24_t would find their way into the standard
unless they're supported in HW by all popular ISAs, though.
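
For what it's worth, a program can at least detect the optional exact-width type at
compile time; a minimal C99 sketch (the fallback choice below is just an assumption):

#include <stdint.h>

#ifdef INT24_MAX
typedef int24_t sample_t;        /* this implementation provides an exact 24-bit type */
#else
typedef int_least32_t sample_t;  /* fall back to the smallest mandatory type that fits */
#endif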

/Marcus

Marcus

Jun 13, 2022, 4:11:35 AM
On 2022-06-12, Quadibloc wrote:
> One advantage to 24-bit computers is that in addition to handling 6-bit
> upper-case only character codes, they can also fit three eight-bit
> characters in a word, and so use modern character codes like ASCII
> and EBCDIC that have room for lower-case letters!
>
> It's true that they eventually found a use for the extra eighth bit
> in a byte with ASCII, by going to the 8859-1 code, with accented
> letters for a lot of foreign languages. Or one could use the high
> bit to indicate two-byte characters in a DBCS for Chinese or
> Japanese.

UTF-8 is a thing.

> But many computer systems didn't bother much with furrin
> languages. And so there's that wasted bit! So instead of a
> 24-bit word that handles both 6-bit characters and 8-bit
> characters...
>
> maybe we should want a computer that can handle 6-bit
> characters and 7-bit characters! (Especially if four-bit
> BCD-coded decimal digits are not a concern.) So perhaps
> there should have been more computers with a 42-bit
> word length. (ASI did make _one_ comuter with a 21-bit
> word length...)

I still consider text processing an "unsolved problem" in modern CPU
architectures and programming languages. Traversing text character-by-
character, and possibly decoding along the way (as with UTF-8, for
certain operations), is just terribly inefficient. The predominant
paradigms in use are best suited for decades old un-pipelined CISC
(memory-memory) architectures.

Ideally you would want to be able to handle (most) text strings as
efficiently as integer numbers. E.g. copy, compare, branch-if-..., etc
with the same cost as a single register-register integer operation.

/Marcus

David Brown

Jun 13, 2022, 4:17:50 AM
The types int24_t/uint24_t /are/ in the standard, and have been since
C99 - along with all other intNN_t and related types for all bit sizes.
However, they are all optional (except int_least8_t, int_least16_t,
int_least32_t, int_least64_t, the matching int_fastNN_t types, and their
unsigned pairs). A C implementation must support the intNN_t type if
and only if it provides a type of that size and characteristic (no
padding bits, two's complement representation) as a standard or extended
integer type.

So if a C compiler for an architecture already provides 24-bit types
(and they turn up in some DSP's and other specialised devices), then
int24_t and uint24_t are required by C99.

(Conversely, an implementation that does /not/ support 8-bit, 16-bit,
32-bit and/or 64-bit types does not have to have types such as int32_t
that it does not support, without breaking conformity with the
standards. Again, you see real-world modern DSP's that do not have
int8_t or int16_t types.)


There is also the proposed "_BitInt(N)" feature for the next C standard,
C23. (clang/llvm has an implementation of approximately the same
feature as an extension.) The main target is FPGAs and other situations
where C code is used on specialist hardware. A _BitInt(32) is not
exactly the same as an int32_t (promotion rules are different, for a
start), but it will be very close.

<https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2763.pdf>

(This is from June, not April :-) )
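
For the curious, a minimal sketch of what using such a type might look like, assuming a
compiler that already accepts C23-style _BitInt (recent clang does, approximately):

/* Bit-precise types are excluded from the integer promotions, so arithmetic on
   unsigned _BitInt(24) wraps modulo 2^24 instead of being done in int. */
typedef unsigned _BitInt(24) u24;

u24 wrap_add(u24 a, u24 b)
{
    return a + b;
}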

Terje Mathisen

Jun 13, 2022, 10:39:11 AM
You can _almost_ do this with SIMD compares to find the first
difference, then a (possibly wide/complicated) lookup of the
non-matching positions via a collating order table to determine <=>.

People have been able to make JSON or XML parsing close to an order of
magnitude faster this way, but it is not intellectually trivial. :-(

OTOH, people consistently misbehave even when comparing simple fp values
as soon as the first NaN turns up.

Terje


--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

anti...@math.uni.wroc.pl

Jun 13, 2022, 7:47:55 PM
Representation is a matter for the compiler. For example, a Pascal compiler
could in principle allocate all numbers on the heap and represent them
by (hidden) pointers; that is a stupid but legal representation.
For range types a biased representation is legal, but it is a matter
for compiler developers to decide if it is worth the trouble.
AFAIK most (all??) Pascal compilers decided that biased representation
has fewer benefits than drawbacks, but at least for GNU Pascal
it was considered, with the result "we do not want it now".

--
Waldek Hebisch

Quadibloc

Jun 14, 2022, 9:49:41 AM
On Monday, June 13, 2022 at 2:11:35 AM UTC-6, Marcus wrote:

> I still consider text processing an "unsolved problem" in modern CPU
> architectures and programming languages. Traversing text character-by-
> character, and possibly decoding along the way (as with UTF-8, for
> certain operations), is just terribly inefficient. The predominant
> paradigms in use are best suited for decades old un-pipelined CISC
> (memory-memory) architectures.
>
> Ideally you would want to be able to handle (most) text strings as
> efficiently as integer numbers. E.g. copy, compare, branch-if-..., etc
> with the same cost as a single register-register integer operation.

Much of what people want to _do_ with text pretty much has to be done
character by character.

My approach to solving the problem, therefore, is this:

Yes, it would be wasteful to tie up a modern CPU - just as it is wasteful
to tie up a 360/195 - processing text character by character. So, just as
the floating-point unit has its own pipeline, and the integer unit has another
pipeline... send character string instructions off to the character processing
unit.

It handles those instructions in its own slow way, but without tying up
the main CPU which continues to do integer and FP operations at
breakneck speed.

Whether the character box looks like a 6800 or like a 360/85 is a matter
of taste.

John Savard

BGB

Jun 15, 2022, 3:02:27 PM
Yeah, BGBCC already implements _BitInt(N) ...

Semantics are mostly that it either represents the value as one of the
supported native types, or as a large value-type object (for N larger
than 128 bits; with storage padded to a multiple of 128 bits).



Though, FWIW:
BGBCC also implements "short float" as a way to specify scalar Binary16
values.

Though, with the funkiness that Binary16 is 16-bit in memory, but
represented using Binary64 in registers (and in non-aliased local
variables), which has a non-zero risk of funky semantic effects (due to
the much larger dynamic range and precision).

Similar also applies for "float", where in both cases the size and
format of the in-memory representation depends on whether or not someone
took the address of the variable.


However, the difference is large enough that a calculation
performed using Binary16 may produce "noticeably different" results from
one calculated using Binary64 and passed off as Binary16.

For now, I have mostly been hand-waving this as "not really too much
of an issue in practice".

This differs when using SIMD ops, where intermediate results are
actually stored in the specified formats.



There is not currently any C type for FP8 formats (FP8U or FP8S), but I
could potentially add these, say, maybe __float8u_t, __float8s_t.

Though, the use-case for scalar FP8 is unclear; existing use-cases are
mostly as a space-saving alternative to Binary16 vectors.

These could be added as-is with existing instructions; they would mostly
just require 3-op load/store sequences (byte load/store with two
converter ops).


Though, FP8U is used as one of the texture-formats for my LDTEX instruction.

Previously I was going to use RGB30A, but:
FP8U is cheaper to encode/decode than RGBA30A;
For textures with an alpha-channel, FP8U has better quality;
The loss of quality and dynamic range for opaque textures is minor.

FP8U is limited to positive values, but this isn't likely a big issue
for textures. In theory, it is still possible to use RGBA64F textures
for negative pixels.


However, at present, there is no direct (C level) support for FP8 values
(but whether or not they would be useful enough to justify actual types
is debatable, compiler intrinsics being "good enough").


Started debating for a moment whether A-law support could be added, but
this is a little too niche (at most it would maybe justify an intrinsic,
and even this is debatable). Encoding A-law by going through the FPU could
probably be faster than the current process of doing it via integer ops,
though.

...

Thomas Koenig

Jun 16, 2022, 6:26:27 AM
Terje Mathisen <terje.m...@tmsw.no> schrieb:
> Marcus wrote:

>> I still consider text processing an "unsolved problem" in modern CPU
>> architectures and programming languages. Traversing text character-by-
>> character, and possibly decoding along the way (as with UTF-8, for
>> certain operations), is just terribly inefficient. The predominant
>> paradigms in use are best suited for decades old un-pipelined CISC
>> (memory-memory) architectures.
>>
>> Ideally you would want to be able to handle (most) text strings as
>> efficiently as integer numbers. E.g. copy, compare, branch-if-..., etc
>> with the same cost as a single register-register integer operation.
>
> You can _almost_ do this with SIMD compares to find the first
> difference, then a (possibly wide/complicated) lookup of the
> non-matching positions via a collating order table to determine <=>.

So, what are the instructions that would be needed to make this easier?

A "find first non-matching byte" instruction could help, which
would do an xor of two registers and then count the number of
tailing zero bytes (up to a maximum). This would just be (for 64 bits)
an xor, a bytwise AND and an eight-bit count trailing zeros with a mask
generated by the maximum value.
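
In plain C, a scalar sketch of that operation could look like this (assuming a 64-bit
little-endian machine and a compiler with __builtin_ctzll, e.g. gcc or clang):

#include <stdint.h>

/* Index (0..7) of the first differing byte between two 64-bit words, or 8 if equal. */
static inline int first_diff_byte(uint64_t a, uint64_t b)
{
    uint64_t x = a ^ b;
    return x ? __builtin_ctzll(x) / 8 : 8;
}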

Range detection might profit from a bytewise "in range" comparison
with an upper and lower bound given in two separate words (iff it is
any cheaper than the straightforward two comparisons and one and).

> People have been able to make JSON or XML parsing close to an order of
> magnitude faster this way, but it is not intellectually trivial. :-(

Coming from you, that sounds rather scary.

> OTOH, people consistently misbehave even when comparing simple fp values
> as soon as the first NaN turns up.

There is the article "Do Programmers Understand IEEE Floating Point?"
with the sobering observation "Many developers do not understand core
floating point behavior particularly well, yet believe they do."
https://users.cs.northwestern.edu/~pdinda/Papers/ipdps18.pdf

Terje Mathisen

Jun 16, 2022, 8:18:30 AM
Thomas Koenig wrote:
> Terje Mathisen <terje.m...@tmsw.no> schrieb:
>> Marcus wrote:
>
>>> I still consider text processing an "unsolved problem" in modern CPU
>>> architectures and programming languages. Traversing text character-by-
>>> character, and possibly decoding along the way (as with UTF-8, for
>>> certain operations), is just terribly inefficient. The predominant
>>> paradigms in use are best suited for decades old un-pipelined CISC
>>> (memory-memory) architectures.
>>>
>>> Ideally you would want to be able to handle (most) text strings as
>>> efficiently as integer numbers. E.g. copy, compare, branch-if-..., etc
>>> with the same cost as a single register-register integer operation.
>>
>> You can _almost_ do this with SIMD compares to find the first
>> difference, then a (possibly wide/complicated) lookup of the
>> non-matching positions via a collating order table to determine <=>.
>
> So, what are the instructions that would be needed to make this easier?
>
> A "find first non-matching byte" instruction could help, which
> would do an xor of two registers and then count the number of
> tailing zero bytes (up to a maximum). This would just be (for 64 bits)
> an xor, a bytwise AND and an eight-bit count trailing zeros with a mask
> generated by the maximum value.

Today you do that with SIMD compare, followed by extract mask (to get a
single bit from each compare) and then a find first 0/1 instruction. In
combination this is a bit too much overhead. Having Mill-style None
markers, and/or a fast way to generate write masks, as well as a
collapsing copy/store which elides None/masked bytes would also be helpful.
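
Roughly, with SSE2 intrinsics the sequence looks like this (a sketch only; the function
name is invented, and __builtin_ctz assumes gcc or clang):

#include <emmintrin.h>

/* Index (0..15) of the first differing byte between two 16-byte blocks, or 16 if all match. */
static inline int first_diff_byte16(const void *p, const void *q)
{
    __m128i  a = _mm_loadu_si128((const __m128i *)p);
    __m128i  b = _mm_loadu_si128((const __m128i *)q);
    unsigned m = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) ^ 0xFFFFu;
    return m ? __builtin_ctz(m) : 16;
}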

Not coincidentally, the Mill does all of these things pretty much
perfectly imho.

>
> Range detection might profit from a bytewise "in range" comparison
> with an upper and lower bound given in two separate words (iff it is
> any cheaper than the straightforward two comparisons and one and).

Yeah.

x86/SIMD got a huge chunk of silicon a decade+ ago, in the form of an
opcode which does 256 (?) simultaneous compare ops. I have never used
it, and don't know why it was included, but you can supposedly use it
for all sorts of strange stuff.
>
>> People have been able to make JSON or XML parsing close to an order of
>> magnitude faster this way, but it is not intellectually trivial. :-(
>
> Coming from you, that sounds rather scary.
>
>> OTOH, people consistently misbehave even when comparing simple fp values
>> as soon as the first NaN turns up.
>
> There is the article "Do Programmers Understand IEEE Floating Point?"
> with the sobering observation "Many developers do not understand core
> floating point behavior particularly well, yet believe they do."
> https://users.cs.northwestern.edu/~pdinda/Papers/ipdps18.pdf

Too true. :-(

Michael S

Jun 16, 2022, 11:43:31 AM
On Thursday, June 16, 2022 at 3:18:30 PM UTC+3, Terje Mathisen wrote:
>
> x86/SIMD got a huge chunk of silicon a decade+ ago, in the form of an
> opcode which does 256 (?) simultaneous compare ops. I have never used
> it, and don't know why it was included, but you can supposedly use it
> for all sorts of strange stuff.
>

"decade+ ago" sounds like SSE4.2 (Nov2008), but I am not sure that the
rest fits.
My impression was that on all Intel and AMD implementations to this day,
SSE4.2 is a rather small chunk of silicon, on some cores even tiny.
On "big" cores they tend to do 32 comparisons per clock per execution port,
so when there are 2 or 3 SSE4.2-capable SIMD ALUs per core you can see
throughput figures like 1 instruction per 4 clocks with latency around
10 clocks.
The AMD Zen series appears to be the fastest, with measured (by AgnerF) throughput
of 1/3 and latency of 8-11 clocks.
Unfortunately, I cannot find any info about SSE4.2 performance of Intel's
Alder Lake P cores (Golden Cove). It's not mentioned in Intel's otherwise
rather comprehensive instruction tables, and Agner has not published Golden Cove
numbers yet.

Back at introduction (Nehalem) there were already two SSE4.2 ALUs, but the
throughput was a little lower than expected - 1 per 5 clocks. I would guess
that in pre-SandyB uArchs Intel had an additional bottleneck around transferring
results from SIMD to GPRs.

MitchAlsup

Jun 16, 2022, 1:09:32 PM
On Thursday, June 16, 2022 at 5:26:27 AM UTC-5, Thomas Koenig wrote:
> Terje Mathisen <terje.m...@tmsw.no> schrieb:
> > Marcus wrote:
>
> >> I still consider text processing an "unsolved problem" in modern CPU
> >> architectures and programming languages. Traversing text character-by-
> >> character, and possibly decoding along the way (as with UTF-8, for
> >> certain operations), is just terribly inefficient. The predominant
> >> paradigms in use are best suited for decades old un-pipelined CISC
> >> (memory-memory) architectures.
> >>
> >> Ideally you would want to be able to handle (most) text strings as
> >> efficiently as integer numbers. E.g. copy, compare, branch-if-..., etc
> >> with the same cost as a single register-register integer operation.
> >
> > You can _almost_ do this with SIMD compares to find the first
> > difference, then a (possibly wide/complicated) lookup of the
> > non-matching positions via a collating order table to determine <=>.
> So, what are the instructions that would be needed to make this easier?
>
> A "find first non-matching byte" instruction could help, which
> would do an xor of two registers and then count the number of
> tailing zero bytes (up to a maximum). This would just be (for 64 bits)
> an xor, a bytwise AND and an eight-bit count trailing zeros with a mask
> generated by the maximum value.
<
88110 had compare instructions that had "any byte different" and
"any halfword different"
>
> Range detection might profit from a bytewise "in range" comparison
> with an upper and lower bound given in two separate words (iff it is
> any cheaper than the straightforward two comparisons and one and).
> > People have been able to make JSON or XML parsing close to an order of
> > magnitude faster this way, but it is not intellectually trivial. :-(
> Coming from you, that sounds rather scary.
> > OTOH, people consistently misbehave even when comparing simple fp values
> > as soon as the first NaN turns up.
> There is the article "Do Programmers Understand IEEE Floating Point?"
> with the sobering observation "Many developers do not understand core
> floating point behavior particularly well, yet believe they do."
> https://users.cs.northwestern.edu/~pdinda/Papers/ipdps18.pdf
<
This is an education and management problem, not an architecture problem.
<
And there are hundreds of papers like this illustrating how the typical
programmer, faced with a unit calculation involving FP, just tries to "get the
job done", leaving all sorts of subtle bugs in the program.

JimBrakefield

Jun 16, 2022, 5:41:56 PM
On Thursday, June 16, 2022 at 5:26:27 AM UTC-5, Thomas Koenig wrote:
Rather than doing a table lookup or character translate from a memory byte array,
produce a single-bit result using 64, 128 or 256 bits from the register file (1, 2 or 4 registers).
In FPGA land this is called a LUT, here a 6-, 7- or 8-bit LUT. I have not seen such an instruction
anywhere. It should pipeline or parallelize nicely.

MitchAlsup

Jun 16, 2022, 6:51:29 PM
What if you need {Alphabetic, numeric, white-space, Capital, Lower,
separator, separator-second }
<
We rarely need single bit lookups.
<
c = getchar();
if( parsetable[c] & alphabetic ) do
{ c = getchar(); }
while( parsetable[c] & ( alphabetic | numeric ) );

JimBrakefield

Jun 16, 2022, 7:00:35 PM
For 7-bit ASCII you need two 64-bit registers for each LUT.
I don't like to use additional process state, but 14 registers used for LUTs is doable.
However, the "LUT engine/functional unit" could easily contain registers for several LUTs
as well as a complete byte translation table. It would just need a prefetch op-code (or several) to
load the LUT engine.

Thomas Koenig

Jun 17, 2022, 11:27:28 AM
Terje Mathisen <terje.m...@tmsw.no> schrieb:
> Marcus wrote:

>> I still consider text processing an "unsolved problem" in modern CPU
>> architectures and programming languages. Traversing text character-by-
>> character, and possibly decoding along the way (as with UTF-8, for
>> certain operations), is just terribly inefficient. The predominant
>> paradigms in use are best suited for decades old un-pipelined CISC
>> (memory-memory) architectures.
>>
>> Ideally you would want to be able to handle (most) text strings as
>> efficiently as integer numbers. E.g. copy, compare, branch-if-..., etc
>> with the same cost as a single register-register integer operation.
>
> You can _almost_ do this with SIMD compares to find the first
> difference, then a (possibly wide/complicated) lookup of the
> non-matching positions via a collating order table to determine <=>.
>
> People have been able to make JSON or XML parsing close to an order of
> magnitude faster this way, but it is not intellectually trivial. :-(

I believe I may have found the library you are referring to,
https://github.com/simdjson/simdjson . The numbers they give are
rather impressive. Unfortunately, it is C++, which means that
the first intellectual challenge is to find out where the
actual work is being done :-)

Terje Mathisen

Jun 17, 2022, 3:04:21 PM
Michael S wrote:
> On Thursday, June 16, 2022 at 3:18:30 PM UTC+3, Terje Mathisen wrote:
>>
>> x86/SIMD got a huge chunk of silicon a decade+ ago, in the form of an
>> opcode which does 256 (?) simultaneous compare ops. I have never used
>> it, and don't know why it was included, but you can supposedly use it
>> for all sorts of strange stuff.
>>
>
> "decade+ ago" sounds like SSE4.2 (Nov2008), but I am not sure that the
> rest fits.

Like I said, I read about it but never used it myself so I might be
mis-remembering all of it. :-(

> My impression was that on all Intel or AMD implementations up to this day
> SSE4.2 is rather small chunk of silicon, on some cores even tiny.
> On "big" cores they tend to do 32 comparisons per clock per execution port,
> so when there are 2 or 3 SSE4.2-capable SIMD ALUs per core you can see a
> throughput figures like 1 instruction per 4 clocks with latency around
> 10 clocks.
> AMD Zen series appears to be the fastest with measured (by AgnerF) throughput
> of 1/3 and latency of 8-11 clocks.
> Unfortunately, I can not find any info about SSE4.2 performance of Intel
> Alder Lake P cores (Golden Cove). It's not mentioned in Intel's otherwise
> rather comprehensive instruction tables and Agner also did not Golden Cove
> numbers yet.
>
> Back at introduction (Nehalem) there were already two SSE4.2 ALUs, but the
> throughput was a little lower than expected - 1 per 5 clocks. I would guess
> that in pre-SandyB uArchs Intel had additional bottleneck around transfer of
> results from SIMD to GPRs.
>

Terje Mathisen

Jun 17, 2022, 3:11:59 PM
Thomas Koenig wrote:
> Terje Mathisen <terje.m...@tmsw.no> schrieb:
>> Marcus wrote:
>
>>> I still consider text processing an "unsolved problem" in modern CPU
>>> architectures and programming languages. Traversing text character-by-
>>> character, and possibly decoding along the way (as with UTF-8, for
>>> certain operations), is just terribly inefficient. The predominant
>>> paradigms in use are best suited for decades old un-pipelined CISC
>>> (memory-memory) architectures.
>>>
>>> Ideally you would want to be able to handle (most) text strings as
>>> efficiently as integer numbers. E.g. copy, compare, branch-if-..., etc
>>> with the same cost as a single register-register integer operation.
>>
>> You can _almost_ do this with SIMD compares to find the first
>> difference, then a (possibly wide/complicated) lookup of the
>> non-matching positions via a collating order table to determine <=>.
>>
>> People have been able to make JSON or XML parsing close to an order of
>> magnitude faster this way, but it is not intellectually trivial. :-(
>
> I belive I may have found the library you are referring to,
> https://github.com/simdjson/simdjson . The numbers they give are

That sounds right.

> rather impressive. Unfortunately, it is C++ which means that
> the first intellectual challenge is to find out where the
> actual work is being done :-)
>
:-(

Shades of FFTW/Boost?

Thomas Koenig

Jun 18, 2022, 8:49:45 AM
Terje Mathisen <terje.m...@tmsw.no> schrieb:
Probably. It is possible to have all the inline stuff in a single
header, which is 32k lines and 1.3 MB (but, in all fairness,
also contains a lot of comments).

Fortunately, they have described their method of
character classification in an article, to be found at
https://arxiv.org/pdf/1902.08318.pdf where they write

"Instead of a comparison, we use the AVX2 vpshufb in- struction to
acts as a vectorized table lookup to do a vectorized classification
[21]. The vpshufb instruction uses the least significant 4 bits
of each byte (low nibble) as an index into a 16-byte table.

[...]

By doing one lookup, followed by a 4-bit right shift and a second
lookup (using a different table), we can separate the characters
into one of two categories: structural characters and white-space
characters. The first lookup maps the low nibbles (least significant
4 bits) of each byte to a byte value; the second lookup maps
the high nibble (most significant 4 bits) of each byte to a byte
value. The two byte values are combined with a bitwise AND."

This limits the classification they can do. It works for JSON,
which has (by design or accident) a set of characters for which
this is enough, but not in the general case.
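
For reference, the two-nibble trick expressed with AVX2 intrinsics, as a sketch; the two
16-entry tables are left as parameters, since their contents encode whichever character
classes you care about (each must be the 16-byte table broadcast to both 128-bit lanes):

#include <immintrin.h>

/* Classify 32 bytes at once: one shuffle keyed on the low nibble, one on the
   high nibble, then AND the two results into per-byte class flags. */
static inline __m256i classify32(__m256i input, __m256i lo_table, __m256i hi_table)
{
    __m256i lo = _mm256_and_si256(input, _mm256_set1_epi8(0x0F));
    __m256i hi = _mm256_and_si256(_mm256_srli_epi16(input, 4), _mm256_set1_epi8(0x0F));
    return _mm256_and_si256(_mm256_shuffle_epi8(lo_table, lo),
                            _mm256_shuffle_epi8(hi_table, hi));
}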

Terje Mathisen

Jun 19, 2022, 10:16:35 AM
This happens to exactly mirror some of my own code. :-)

I would have dearly loved to have just one additional bit, i.e.
Altivec's permute() used 5-bit indices into a pair of 16-byte registers.
With AVX-256 and AVX-512 (nee Larrabee), it would have been very obvious
to at least extend the 4-bit lookups into 5 and 6, at which point quite
a few table lookup algorithms become possible.

For bulk (throughput) programming I have considered cascading tables, so
that the second lookup uses one of several possible tables based on what
the first lookup returned, but this only works when you can do them in
parallel, i.e. do 2 or 4 second-stage lookups and use the output of the
first lookup to select which to use: even using 5 regs to hold those
lookup tables, this does give us a full 6-bit classifier, and we even get
to decide which bits to use for each part.

I have considered such an algorithm for a new word count implementation,
where the current 64kB classification table lookup would be replaced by
byte shuffle/permute operations, but that would mean giving up on one of
the (imho) best features of the current version: The ability to specify
on the command line which characters are separators, which one or two
chars define a newline and which chars are possible inside a word.

Stephen Fuld

Jun 21, 2022, 7:56:01 PM
This paper raised a number of questions in my mind. If their claim that
JSON parsing takes up to 95% of some CPUs is true, it is an example of what was
mentioned earlier about new requirements leading to new instructions.
So is there a better way than "repurposing" the SIMD instructions?
Terje mentioned adding a version of AltiVec's permute instruction.


1. Is there a better way, perhaps a more "purpose built" instruction?
If so, should it be specialized to JSON or is there a more generally
useful instruction?

2. What about an attached processor that builds the parse tree from the
string representation "autonomously"?

3. How would this be done with My 66000's VVM?


--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Marcus

Jun 22, 2022, 3:03:34 PM
My feelings exactly.


> Terje mentioned adding a version of AltaVec's permute instruction.
>
>
> 1.    Is there a better way, perhaps a more "purpose built" instruction?
> If so, should it be specialized to JSON or is there a more generally
> useful instruction?
>
> 2.    What about an attached processor that builds the parse tree from
> the string representation "autonomously"?
>
> 3.    How does this done with MY 66000's VVM?
>
>

My bet is that if you did something along the lines of 1 & 2, with support for things
like parsing, string comparisons and perhaps regex and similar common
string processing techniques, you should be able to get a many-fold
performance increase and power consumption reduction for many text-
intensive workloads. In some domains (e.g. web servers, JSON/text-based
APIs, and database engines), it just may be worth the effort.

Exactly how such a machine would work, I don't know. My gut feeling is
that you'd want a way to reference and cache immutable strings (in a
special purpose "string store") from the main instruction stream, and
have asynchronous/concurrent string processing going on in an attached
processor. Each string, once pulled into the string store, can be
preprocessed if needed to attach metadata and classifications, so that
certain processing operations can be accelerated (e.g. strings could be
sorted, hashed and checked for uniqueness to speed up string
comparisons).

(Just thoughts - nothing tested)

/Marcus

MitchAlsup

Jun 22, 2022, 4:23:00 PM
On Tuesday, June 21, 2022 at 6:56:01 PM UTC-5, Stephen Fuld wrote:
> On 6/18/2022 5:49 AM, Thomas Koenig wrote:
> > Terje Mathisen <terje.m...@tmsw.no> schrieb:

> >
> > Fortunately, they have described their method of
> > character classification in an article, to be found at
> > https://arxiv.org/pdf/1902.08318.pdf
<
> This paper raised a number of questions in my mind. If their claim that
> JSON parsing takes up to 95% of some CPUs, it is an example of what was
> mentioned earlier about new requirements leading to new instructions.
> So is there a better way than "repurposing" the SIMD instructions.
> Terje mentioned adding a version of AltaVec's permute instruction.
>
If whatever arrives in JSON format takes 95% of the CPU just to parse,
it CANNOT be very important! On the other hand, a parser that can get
95% of a CPU on a cycle-by-cycle basis is running exceptionally well.
<
So that sentence is at best misleading.
>
> 1. Is there a better way, perhaps a more "purpose built" instruction?
> If so, should it be specialized to JSON or is there a more generally
> useful instruction?
>
> 2. What about an attached processor that builds the parse tree from the
> string representation "autonomously"?
>
> 3. How does this done with MY 66000's VVM?
>
Probably something like a YACC parser, using a character indirection table
and My 66000 TableTransfer (TT = switch) instruction.

Stephen Fuld

Jun 22, 2022, 6:57:15 PM
On 6/22/2022 1:22 PM, MitchAlsup wrote:
> On Tuesday, June 21, 2022 at 6:56:01 PM UTC-5, Stephen Fuld wrote:
>> On 6/18/2022 5:49 AM, Thomas Koenig wrote:
>>> Terje Mathisen <terje.m...@tmsw.no> schrieb:
>
>>>
>>> Fortunately, they have described their method of
>>> character classification in an article, to be found at
>>> https://arxiv.org/pdf/1902.08318.pdf
> <
>> This paper raised a number of questions in my mind. If their claim that
>> JSON parsing takes up to 95% of some CPUs, it is an example of what was
>> mentioned earlier about new requirements leading to new instructions.
>> So is there a better way than "repurposing" the SIMD instructions.
>> Terje mentioned adding a version of AltaVec's permute instruction.
>>
> If whatever arrives in JSON format takes 95% of the CPU just to parse
> it CANNOT be very important! On the other hand a parser that can get
> 95% of a CPU on a cycle by cycle basis is running exceptionally well.
> <
> So that sentence is at best misleading.

I tracked down the original paper referenced in the one above. It is at

https://vldb.org/pvldb/vol11/p1576-palkar.pdf

It points to several implementations of fast JSON parsers, but doesn't
address the source of the high CPU usage numbers, or exactly what they mean.

It does say that the main source of slowdown is that the typical parser
processes its input one character at a time, typically spending a few
instructions per input character. The use of SIMD instructions is
designed to minimize that, by processing several characters in parallel.




>>
>> 1. Is there a better way, perhaps a more "purpose built" instruction?
>> If so, should it be specialized to JSON or is there a more generally
>> useful instruction?
>>
>> 2. What about an attached processor that builds the parse tree from the
>> string representation "autonomously"?
>>
>> 3. How does this done with MY 66000's VVM?
>>
> Probably something like a YACC parser, using a character indirection table
> and My 66000 TableTransfer (TT = switch) instruction.

Based on what I can see from the papers, that would lose to an
architecture with SIMD instructions. I was hoping that VVM would offer
a better way, but it appears that, for this class of problem, it doesn't. :-(

MitchAlsup

Jun 22, 2022, 9:17:25 PM
while( c != EOF )
{   // parse individual tokens
    switch( table[c] )
    {
    case ALPHABETIC:
        p = &buffer[0];
        while( table[ c = getchar() ] & ALPHANUMERIC ) *p++ = c;
        symbol( buffer, p - buffer );
        break;
    case NUMERIC:
        p = &buffer[0];
        while( table[ c = getchar() ] & NUMERIC_CONTINUATION ) *p++ = c;
        number( buffer, p - buffer );
        break;
    // ... other cases ...
    }
}

The inner loops of each of these (important cases) can be as wide as the
cache port with a little Predication for getchar().

Thomas Koenig

Jun 26, 2022, 10:55:46 AM
Stephen Fuld <sf...@alumni.cmu.edu.invalid> schrieb:

> 1. Is there a better way, perhaps a more "purpose built" instruction?
> If so, should it be specialized to JSON or is there a more generally
> useful instruction?

An SIMD "found in bitmap" instruction would go a long way, I think
(already discussesd here). Let's say you have 256-bit-registers.

Look at the value of each byte in the source register and set each
byte in the target register to 0x00 or 0xff if the corresponding
bit in a 256-bit "mask" register is set.

Find first non-matching byte in an SIMD register.
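
A scalar reference model of the first of those, just to pin down the intended semantics
(the function name and the four-word layout of the 256-bit mask are illustrative only):

#include <stdint.h>

/* For each input byte, test the corresponding bit in a 256-bit set (four 64-bit
   words here) and produce 0xFF if the bit is set, 0x00 otherwise. */
static inline void found_in_bitmap(uint8_t *dst, const uint8_t *src,
                                   int n, const uint64_t set[4])
{
    for (int i = 0; i < n; i++) {
        uint8_t c = src[i];
        dst[i] = ((set[c >> 6] >> (c & 63)) & 1) ? 0xFF : 0x00;
    }
}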

>
> 2. What about an attached processor that builds the parse tree from the
> string representation "autonomously"?

Memory to memory? Unless it's really specialized for one special
type of parsing, it would have to be programmed somehow. Would
it make sense to put something like lex and yacc in hardware?
I guess it could be done; the tables could be in static RAM,
and the algorithms written in hardware. I have no idea at
all how well this would work, or how difficult it would be.

Stephen Fuld

Jun 26, 2022, 11:29:38 AM
Yeah, but . . .

I am assuming that you would have the code at "ALPHABETIC" be a VVM
loop to take advantage of the cache port width.

But I think the benefits of this, while there, are not as great as they
could/should be, since many identifiers are quite short (e.g. "if",
"end", "for", etc.). Similarly for NUMERIC, as you have noted, most of
the immediate values are quite short.

Also, a question for you. If we did something like you suggest above,
and are parsing, for example, abc+def, when the "+" is encountered, and
the code exits the VVM loop, does its load hit in the VVM buffer? What
about when you then start a new VVM loop to get the "def"? Does it pick
up the existing buffer, or start a new one with a wide cache transfer?

I am far from an expert in this area, and please, others chime in here,
but what about adding a new instruction that takes a character in one
source register and outputs in a destination register some bits,
similar to the result of your Compare instruction, that indicate, if set,
things like alphabetic, numeric, null, upper case, etc. Then you might
be able to handle parsing, say, an entire line with a single VVM loop.
Such an instruction would also speed up functions like "isnumeric",
convert to upper case, etc. It may be appropriate to accept a second
source to the instruction that could specify one of perhaps several
schemes for encodings other than ASCII or UTF-8.

This might allow VVM to be performance competitive with SIMD schemes
such as the one Thomas proposed.

MitchAlsup

Jun 26, 2022, 11:56:50 AM
What if the language were Arabic, Sanskrit, or Chinese, where the mapping
from character to meaning is entirely different? That is:: at least the
string to be searched has to be given to the instruction as an operand.

Stephen Fuld

Jun 26, 2022, 12:33:17 PM
That was the idea for the second source (see below), to allow for
alternatives. But I guess there might be an issue (in almost any
solution that tries to go beyond character by character) for things like
Arabic that run right to left. Loading the "next" address might be
problematic.

Stefan Monnier

Jun 26, 2022, 12:46:17 PM
>> I am far from an expert in this area, and please, others chime in here,
>> but what about adding a new instruction that takes a character in one
>> source register, and outputs in a destination register, some bits,
>> similar to your result of a Compare instruction, that indicate, if set,
>> things like alphabetic, numeric, null, upper case, etc.
> <
> What if the language was Arabic, Sanskrit, or Chinese where the mapping
> from character to meaning was entirely different. That is:: at least the
> string to be searched has to be given to the instruction as an operand.

Thinking of it in those high-level terms doesn't seem like it will lead
anywhere. Better start from actual code of lexers/parsers (many/most of
which don't parse human language anyway).


Stefan

MitchAlsup

Jun 26, 2022, 1:38:19 PM
On Sunday, June 26, 2022 at 11:33:17 AM UTC-5, Stephen Fuld wrote:
> On 6/26/2022 8:56 AM, MitchAlsup wrote:

> >> I am far from an expert in this area, and please, others chime in here,
> >> but what about adding a new instruction that takes a character in one
> >> source register, and outputs in a destination register, some bits,
> >> similar to your result of a Compare instruction, that indicate, if set,
> >> things like alphabetic, numeric, null, upper case, etc.
> > <
> > What if the language was Arabic, Sanskrit, or Chinese where the mapping
> > from character to meaning was entirely different. That is:: at least the
> > string to be searched has to be given to the instruction as an operand.
<
> That was the idea for the second source (see below), to allow for
> alternatives. But I guess there might be an issue (in almost any
> solution that tries to go beyond character by character) for things like
> Arabic that run right to left. Loading the "next" address might be
> problematic.
<
Back in 1982 we were working on cash register applications when the
corporation decided it wanted to be able to display Arabic and Hebrew
from programs written in BASIC.
<
We had the device driver do the interpretation of L->R versus R->L
so some ASCII text would go out in the normal L-R direction, and
when Arabic or Hebrew was encountered, each letter would push
the preceding letter to the right. Looks strange on the display, but
the customers liked it immensely.
<
The data remained in little endian order.

Quadibloc

Jun 27, 2022, 5:08:52 AM
On Sunday, June 26, 2022 at 9:56:50 AM UTC-6, MitchAlsup wrote:
> On Sunday, June 26, 2022 at 10:29:38 AM UTC-5, Stephen Fuld wrote:

> > I am far from an expert in this area, and please, others chime in here,
> > but what about adding a new instruction that takes a character in one
> > source register, and outputs in a destination register, some bits,
> > similar to your result of a Compare instruction, that indicate, if set,
> > things like alphabetic, numeric, null, upper case, etc.

> What if the language was Arabic, Sanskrit, or Chinese where the mapping
> from character to meaning was entirely different. That is:: at least the
> string to be searched has to be given to the instruction as an operand.

Chinese, of course, requires more than 256 characters. The solution for most
other cases, though, would be to provide a "translate" instruction - the one that
the System/360 had was adaptable to just that purpose - you could translate
a character string to a string of bytes which coded character types if you liked.

John Savard

Stephen Fuld

Jun 28, 2022, 12:00:22 PM
Good point! The original paper was about parsing JSON, which is
required to be UTF-8. While providing a more general solution
at low cost would be "a good thing", let's not get so carried away with
generality that we make solving the primary problem impossible or
impracticable.

I still think my original proposal would allow a huge speedup for
parsing JSON on MY66000 by allowing the use of VVM for many more
characters within a single VVM loop.

MitchAlsup

Jun 28, 2022, 1:06:20 PM
I am left wondering::
a) after JSON gets parsed what is done with it ?
b) does what comes out not run for long enough that JSON parsing
....ends up as noise ?

Stephen Fuld

Jun 28, 2022, 2:10:55 PM
I don't know the answer to those questions. It would take someone with
more knowledge of what the servers are doing than I have to provide the
answers. I was relying on the assertion in the original paper and its
references that the parsing could take up to 95% of the CPU time.

Stefan Monnier

Jun 28, 2022, 3:42:14 PM
> I am left wondering::
> a) after JSON gets parsed what is done with it ?
> b) does what comes out not run for long enough that JSON parsing
> ....ends up as noise ?

IIUC nowadays JSON is heavily used to send datastructures between
separate microservices as a safer alternative to sending pointers to
shared datastructures between threads.

So sometimes they may send a lot more data than the part actually used,
for no other reason than because it works and is convenient.


Stefan

MitchAlsup

Jun 28, 2022, 3:48:48 PM
On Tuesday, June 28, 2022 at 2:42:14 PM UTC-5, Stefan Monnier wrote:
> > I am left wondering::
> > a) after JSON gets parsed what is done with it ?
> > b) does what comes out not run for long enough that JSON parsing
> > ....ends up as noise ?
> IIUC nowadays JSON is heavily used to send datastructures between
> separate microservers as a safer alternative to sending pointers to
> shared datastructures between threads.
<
Certainly much of this gets cached, too.
>
> So sometimes they may send a lot more data then the part actually used
> for no other reason than because it works ans is convenient.
<
But doesn't the work carried by JSON consume a lot of cycles making the
parsing cycles noise?
<
That is: it takes a long time to compile a Linux kernel, but nobody cares
because you use that image for years.
>
>
> Stefan

Terje Mathisen

Jun 28, 2022, 4:43:35 PM
Even the original 8088 had XLAT, an opcode to translate bytes in AL into
the corresponding value stored in a 256-byte table addressed by BX.

The idea was that you could translate a full buffer with very tight
code, you just needed SI -> source, DI -> dest (can be the same as SI),
CX = buffer size, BX -> translation table:

next:
lodsb ; Load a byte into AL, update SI
xlat ; al = bx[al]
stosb ; Store back, update DI
loop next ; Decrement CX, loop unless decremented to zero.

I.e. 4 instructions and 5 total bytes!

Stephen Fuld

Jun 28, 2022, 4:58:16 PM
Yes. The reason I didn't suggest something equivalent was that I was
trying to avoid the need for the load of the translated byte from the
256 byte table. My suggestion, while not as general, should be faster.

MitchAlsup

Jun 28, 2022, 5:02:26 PM
On Tuesday, June 28, 2022 at 3:43:35 PM UTC-5, Terje Mathisen wrote:
> Quadibloc wrote:
> > On Sunday, June 26, 2022 at 9:56:50 AM UTC-6, MitchAlsup wrote:
> >> On Sunday, June 26, 2022 at 10:29:38 AM UTC-5, Stephen Fuld wrote:
> >
> >>> I am far from an expert in this area, and please, others chime in here,
> >>> but what about adding a new instruction that takes a character in one
> >>> source register, and outputs in a destination register, some bits,
> >>> similar to your result of a Compare instruction, that indicate, if set,
> >>> things like alphabetic, numeric, null, upper case, etc.
> >
> >> What if the language was Arabic, Sanskrit, or Chinese where the mapping
> >> from character to meaning was entirely different. That is:: at least the
> >> string to be searched has to be given to the instruction as an operand.
> >
> > Chinese, of course, requires more than 256 characters. The solution for most
> > other cases, though, would be to provide a "translate" instruction - the one that
> > the System/360 had was adaptable to just that purpose - you could translate
> > a character string to a string of bytes which coded character types if you liked.
<
I was thinking about this the other day, and wondered why we don't let various
programmers around the world program in their own language, and when one
runs into プリントf() one knows that it is printf(). This is merely 'ln -s' applied at
the symbol table.
<
> Even the original 8088 had XLAT, an opcode to translate bytes in AL into
> the corresponding value stored in a 256-byte table addressed by BX.
>
> The idea was that you could translate a full buffer with very tight
> code, you just needed SI -> source, DI -> dest (can be the same as SI),
> CX = buffer size, BX -> translation table:
>
> next:
> lodsb ; Load a byte into AL, update SI
> xlat ; al = bx[al]
> stosb ; Store back, update DI
> loop next ; Decrement CX, loop unless decremented to zero.
>
> I.e. 4 instructions and 5 total bytes!
<
This "no workie so well" when characters require 2 bytes or decoding
to determine if the character is 1 or 2 bytes.

Stefan Monnier

Jun 28, 2022, 5:38:02 PM
>> IIUC nowadays JSON is heavily used to send datastructures between
>> separate microservers as a safer alternative to sending pointers to
>> shared datastructures between threads.
> Certainly much of this gets cached, too.

I don't think so.

>> So sometimes they may send a lot more data then the part actually used
>> for no other reason than because it works ans is convenient.
> But doesn't the work carried by JSON consume a lot of cycles making the
> parsing cycles noise?

JSON does not "carry any work" it's just a big tree datastructure
represented as text. The receiving process reads it and then it uses
the part of it that it needs and often just disregards most of what
it received (because it's easier to do it that way than to somewhat
describe to the sender which parts are needed).

GraphQL tries to reduce this waste by providing a standardized way for
the client to request specific parts, but even so, the server often
returns a lot more data than the client really needs.


Stefan

Stefan Monnier

Jun 28, 2022, 5:41:56 PM
> This "no workie so well" when characters require 2 bytes or decoding
> to determine if the character is 1 or 2 bytes.

AFAIK most lexers don't directly work on "characters encoded as utf-8",
for that reason. They work at the level of bytes, or on sequences of
"entities" which can be characters or tokens, returned by a lower
level lexer.


Stefan

MitchAlsup

Jun 28, 2022, 5:46:43 PM
On Tuesday, June 28, 2022 at 4:38:02 PM UTC-5, Stefan Monnier wrote:
> >> IIUC nowadays JSON is heavily used to send datastructures between
> >> separate microservers as a safer alternative to sending pointers to
> >> shared datastructures between threads.
> > Certainly much of this gets cached, too.
> I don't think so.
> >> So sometimes they may send a lot more data then the part actually used
> >> for no other reason than because it works ans is convenient.
> > But doesn't the work carried by JSON consume a lot of cycles making the
> > parsing cycles noise?
> JSON does not "carry any work" it's just a big tree datastructure
> represented as text. The receiving process reads it and then it uses
> the part of it that it needs and often just disregards most of what
> it received (because it's easier to do it that way than to somewhat
> describe to the sender which parts are needed).
<
So, are you saying that the requestor sends over an encyclopedia* and
the provider uses 2 paragraphs. Yes, that sounds really efficient. Yep,
Oh Boy is that a good idea.
<
(*) specification
>
> GraphQL tries to reduce this waste by providing a standardized way for
> the client to request specific parts, but even so, the server often
> returns a lot more data than the client really needs.
<
Does the client even know what he is looking for ?
>
>
> Stefan

MitchAlsup

Jun 28, 2022, 5:47:34 PM
Most lexers also represent a scant fraction of the time spent using what
got lexed.
>
> Stefan

EricP

Jun 28, 2022, 6:26:25 PM
rant:ON {

I looked at the JSON wikipedia page - it is almost as dumb as XML.
The notion of converting all integer binary values to text
to communicate between compatible applications, when
all machines use 2's complement little-endian or can
trivially convert, is just plain stupid. Same for IEEE floats.

Or repeatedly sending the key names as text *per field*
instead of references to a prior dictionary definition.

And the argument that "oh well you can edit the file" is crap because
once you have a binary format spec you can trivially build an editor.
Or a binary file decoder/encoder if you are really set on editing text.
And often they are sending this over sockets to transaction servers
so the ability to edit in Notepad is irrelevant.

Also having a spec for key:value pair formats tells you nothing
about the format each application requires for { aggregates },
or field order, or which fields are mandatory or optional.
Once you define this sometimes you can skip key names, just have values.

Apologies to those forced to use JSON but this is jaw droppingly dumb.
Why does anyone bother with this convert-to-text foolishness?
Maybe they are uncomfortable programming with "binary values".

} rant:OFF



Timothy McCaffrey

Jun 28, 2022, 8:03:13 PM
I agree with everything you said, however :)

This is the world of microservices and/or serverless applications. Having the JSON be human-editable allows
the developer to manually build some unit test cases (JSON is sometimes used to configure things as well).
Overall performance is not the overriding priority here; the point is to be able to update your app on the fly
(in pieces, if necessary) with no downtime. Compromises were made to achieve this goal :(.

It all kind of makes sense if you get into it, and the lack of type/syntax checking goes right along with
using JavaScript in the first place (I do not like JS because it seems like they tried to find the worst
aspects of programming languages over the last 60 years and put them in one place).

But, I agree, the whole thing seems like a train wreck in progress.

- Tim

MitchAlsup

unread,
Jun 28, 2022, 8:29:30 PM6/28/22
to
On Tuesday, June 28, 2022 at 7:03:13 PM UTC-5, timca...@aol.com wrote:

> It all kind of makes sense if you get into it, and the lack of type/syntax checking goes right along with
> using JavaScript in the first place (I do not like JS because it seems like they tried to find the worst
> aspects of programming languages over the last 60 years and put them in one place).
<
You forgot the trademark symbol: ™

MitchAlsup

unread,
Jun 28, 2022, 8:34:16 PM6/28/22
to
On Tuesday, June 28, 2022 at 7:03:13 PM UTC-5, timca...@aol.com wrote:
>
> This is the world of microservices and/or serverless applications. Having the JSON be human editable allows
> the developer to manually build some unit test cases (JSON is sometimes used to configure things as well).
> Overall performance is not the overriding priority here, the point is to be able to update your app on the fly
> (in pieces, if necessary) with no downtime. Compromises were made to achieve this goal :(.
<
This reminds me of the saying that "programmers should develop programs in LISP and then when it is
working convert that into any other programming language for performance."

But do you really want a customer updating a banking application "on the fly" ? Of course not !

And there seems to be the underlying tenet that 'developer' and 'user' are somehow connected.
Or that the user has some notion of what kind of query they are applying to the server.

Ivan Godard

unread,
Jun 28, 2022, 9:30:27 PM6/28/22
to
Something I've been repeating to teammates all my life: "You are not the
customer".

Stefan Monnier

unread,
Jun 28, 2022, 11:06:45 PM6/28/22
to
>> JSON does not "carry any work" it's just a big tree datastructure
>> represented as text. The receiving process reads it and then it uses
>> the part of it that it needs and often just disregards most of what
>> it received (because it's easier to do it that way than to somehow
>> describe to the sender which parts are needed).
> <
> So, are you saying that the requestor sends over an encyclopedia* and
> the provider uses 2 paragraphs. Yes, that sounds really efficient. Yep,
> Oh Boy is that a good idea.

That's sometimes a bit like that, yes. But don't assume it's because
the programmer is an idiot. There can be very good reasons to do it
this way.

> Does the client even know what he is looking for ?

You seem here to be prejudiced against the programmer.


Stefan

Stefan Monnier

unread,
Jun 28, 2022, 11:12:10 PM6/28/22
to
> I looked at the JSON wikipedia page - it is almost as dumb as XML.

JSON sucks in all kinds of ways, but you can also look at it from the
glass-half-full side: as a replacement for XML it's a very substantial
improvement, because it's much simpler, faster to parse, and
significantly more compact.
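
To put a rough number on the compactness point, a tiny Python sketch with an
invented record (not taken from any real API):

    # The same made-up record as XML and as JSON; XML's closing tags and
    # angle brackets cost extra even in this trivial case.
    import json

    as_xml = b"<user><id>42</id><name>Ada</name><active>true</active></user>"
    as_json = json.dumps({"id": 42, "name": "Ada", "active": True}).encode("utf-8")

    print(len(as_xml), len(as_json))   # 61 bytes vs 41 bytes here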


Stefan

Robert Swindells

unread,
Jun 29, 2022, 6:33:37 AM6/29/22
to
On Tue, 28 Jun 2022 17:34:13 -0700 (PDT), MitchAlsup wrote:

> On Tuesday, June 28, 2022 at 7:03:13 PM UTC-5, timca...@aol.com wrote:
>>
>> This is the world of microservices and/or serverless applications.
>> Having the JSON be human editable allows the developer to manually
>> build some unit test cases (JSON is sometimes used to configure things
>> as well). Overall performance is not the overriding priority here, the
>> point is to be able to update your app on the fly (in pieces, if
>> necessary) with no downtime. Compromises were made to achieve this goal
>> :(.
> <
> This reminds me of the saying that "programmers should develop programs
> in LISP and then when it is working convert that into any other
> programming language for performance."

Lisp compilers have generated good output for a long time; when did you
hear the saying?

Someone recently suggested rewriting a Lisp application in Python; I
offered to put in some delay loops if they just wanted it to run slower.

To get back to the thread, I don't see what was wrong with binary
serialization formats like ASN.1/BER or XDR.

Anton Ertl

unread,
Jun 29, 2022, 10:25:24 AM6/29/22
to
Robert Swindells <r...@fdy2.co.uk> writes:
>To get back to the thread, I don't see what was wrong with binary
>serialization formats like ASN.1/BER or XDR.

Binary serialization formats in general:

Much harder for development (testing/debugging), because you then need
tools to translate this stuff from/to human-readable form.

XDR in particular: It's been >30 years since I had anything to do with
it, but my memory is that the protocol was compiled into the two
communication partners, so if you needed to change it, it was hard
going.

I don't know if the rwho protocol uses XDR, but it certainly used a
binary format, and it produced totally bogus uptime numbers for
several years when we started mixing 32-bit and 64-bit machines.
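
One plausible failure mode of that kind, sketched in Python (this is not a
claim about the actual rwho/whod wire layout, just an illustration of mixing
native field widths on the wire):

    # A sender built for 32 bits writes two native 4-byte longs back to back;
    # a receiver built for 64 bits expects one 8-byte long in the same spot
    # and fuses the two fields into one absurd number.
    import struct

    boot_time, idle_secs = 1_656_500_000, 300        # invented example values

    wire = struct.pack("<ii", boot_time, idle_secs)  # what the 32-bit side sends
    (fused,) = struct.unpack("<q", wire)             # what the 64-bit side reads

    print(fused)   # 1290146688800 -- read as seconds, an "uptime" of ~40,000 years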

It seems to me that binary formats tend to die after a while for
reasons like these (unless they have very good change management, and
few do), so the plain text formats tend to survive. Efficiency is not
everything. Of course, for cases where efficiency is really
important, such as NFS, binary formats are de rigueur.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

MitchAlsup

unread,
Jun 29, 2022, 12:21:58 PM6/29/22
to
On Wednesday, June 29, 2022 at 5:33:37 AM UTC-5, Robert Swindells wrote:
> On Tue, 28 Jun 2022 17:34:13 -0700 (PDT), MitchAlsup wrote:
>
> > On Tuesday, June 28, 2022 at 7:03:13 PM UTC-5, timca...@aol.com wrote:
> >>
> >> This is the world of microservices and/or serverless applications.
> >> Having the JSON be human editable allows the developer to manually
> >> build some unit test cases (JSON is sometimes used to configure things
> >> as well). Overall performance is not the overriding priority here, the
> >> point is to be able to update your app on the fly (in pieces, if
> >> necessary) with no downtime. Compromises were made to achieve this goal
> >> :(.
> > <
> > This reminds me of the saying that "programmers should develop programs
> > in LISP and then when it is working convert that into any other
> > programming language for performance."
> Lisp compilers have generated good output for a long time, when did you
> hear the saying?
<
Circa 1983.

Terje Mathisen

unread,
Jun 29, 2022, 12:25:51 PM6/29/22
to
JSON parsing is the modern equivalent to the classic phone bill
benchmark, i.e. ASCII/BCD parsing and very simple mathematical operations.

I suspect most/a lot of JSON parsing happens when decoding REST API
calls, leading to a single SQL call or server-local table lookup.

Terje Mathisen

unread,
Jun 29, 2022, 12:31:33 PM6/29/22
to
Sure, that is one of the (several) reasons I'm not actually using code
like this today, but it was implemented this way for very good reasons
around 1979.

Terje Mathisen

unread,
Jun 29, 2022, 12:39:55 PM6/29/22
to
JSON has two good features imho:

a) You can trivially write test code with the request and expected reply
as clear text/ascii in whatever language you prefer.

b) It makes debugging network traces much easier, at least until you
employ zero trust and TLS encrypt every single microservice connection.

Additionally, the structure is so repeatable that when you have a bulk
data transfer (as lines of JSON), any standard compression algo will be
able to reduce the size overhead to maybe 2X instead of 10-100X. :-)
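
A quick Python sketch of the repeatability point, with an invented record
layout (the exact ratio will vary, but the repeated key names and structure
compress away almost entirely):

    # Many JSON lines sharing the same key names; a standard compressor
    # removes most of the per-record key/structure overhead.
    import json
    import zlib

    lines = "\n".join(
        json.dumps({"seq": i, "sensor": "temp", "value": 20.0 + (i % 7) * 0.25})
        for i in range(10_000)
    ).encode("utf-8")

    packed = zlib.compress(lines, 9)
    print(len(lines), len(packed))   # roughly half a megabyte of text shrinking
                                     # to a small fraction of its original size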

OTOH, this also applies to compressed XML; AFAIK this is the basis for
all recent Microsoft Office formats?

Stephen Fuld

unread,
Jun 29, 2022, 3:06:11 PM6/29/22
to
Without commenting on the correctness of your suspicion (I have no
knowledge in the area), two points. If it is a local table lookup, then
JSON parsing is likely to be a significant part of the work. And if it
results in an SQL call, that call will be more significant, but the call
itself will require parsing the SQL statement, so instructions for
faster parsing do double duty here. :-)

Andy Valencia

unread,
Jun 29, 2022, 8:37:28 PM6/29/22
to
Stefan Monnier <mon...@iro.umontreal.ca> writes:
> > I looked at the JSON wikipedia page - it is almost as dumb as XML.
> JSON sucks in all kinds of ways, but you can also look at it from the
> glass-half-full side: as a replacement for XML it's a very substantial
> improvement, because it's much simpler, faster to parse, and
> significantly more compact.

And it varies in ways which help, and has far fewer odd little nooks.
Unlike XML, which offers endless options which just trip you up.

Andy Valencia
Home page: https://www.vsta.org/andy/
To contact me: https://www.vsta.org/contact/andy.html

JohnG

unread,
Jun 30, 2022, 1:59:22 AM6/30/22
to
JSON tends to be a good (well, adequate) answer for sharing at the edges of trust boundaries as an interchange format - logs, config files, query results, etc. Places where you probably want text anyway (and can't get away with just a .csv). If a service is using it internally, it's probably because programmer cycles are in shorter supply than machine cycles or that performance path isn't the pain point yet.

Within a trusted set of services (i.e. all controlled by the same entity), you usually want a binary, serialized format with support on a lot of architectures and for a lot of languages and this is then used to cover everything from on-disk, to over-the-wire, to the IPC format. That usually means you have a compiled description language for the protocol and that gives you the methods to suck up and decode the messages into native data structures and vice-versa. Yes, that means both sides have to be sync'd or at least have some sort of versioning and ability to negotiate down to the last common version.

Google's Protocol Buffers (https://en.wikipedia.org/wiki/Protocol_Buffers, https://developers.google.com/protocol-buffers) are a good example of current state of the art. And yes, if you come from a background where you count nanoseconds (e.g. automated trading, or HPC), you make different choices and you can insist that all parties are on the same arch, same software version, on the same linux distro, built from the same compiler version, etc. But 99% of workloads are throughput centric and it's easier to scale by adding more nodes, more switches, more shards, more data centers, etc. than it is to hire the 5 people on the planet who might know how to rewrite and maintain your services as bare-metal code.

-JohnG

EricP

unread,
Jun 30, 2022, 1:11:06 PM6/30/22
to
It might be acceptable for low-end batch style file processing
where performance, efficiency and file size don't matter,
but it is still a lowest common denominator.

The problem is it doesn't stay in the low-end and winds up getting used
in high volume communication links for transaction processing systems.
I've seen this multiple times with in-house systems,
and cloud systems may use it because it initially appears
to be a good match with the client browser JavaScript.

There are many self-defining protocols that are like this,
the ones that convert numbers to text just increase the overhead.
So I'll refer to them all as messaging protocol P.

What protocol P does is inject a JIT compiler (lexer, parser, AST builder,
symbol table builder - those field key names need to be indexed so
they can be found - and a semantic checker against a metadata definition)
into the front-end processing of an inbound communications link.

The semantic checker validates that for this transaction
the correct fields of the correct type and correct size
or range of values are present in the correct order.
So the infinitely variable format message actually wasn't.

After all that the front end copies the data out into a fixed format
record with fixed size fields and fixed data types that the TP needs
so it can talk to the fixed format database,
or other servers which also have fixed format messages.
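
A toy Python version of that front-end step, with an invented transaction
spec and fixed layout (nothing here is taken from a real system):

    # Parse the JSON text, check the fields this transaction requires,
    # then copy them out into a fixed-size, fixed-type record.
    import json
    import struct

    REQUIRED = {"account": int, "amount_cents": int, "currency": str}   # invented spec
    FIXED_FMT = "<qq3s"      # int64 account, int64 amount, 3-byte currency code

    def to_fixed_record(message: bytes) -> bytes:
        tree = json.loads(message)               # lex + parse + build the tree
        for key, typ in REQUIRED.items():        # semantic check against the spec
            if not isinstance(tree.get(key), typ):
                raise ValueError(f"field {key!r} missing or not {typ.__name__}")
        return struct.pack(FIXED_FMT, tree["account"], tree["amount_cents"],
                           tree["currency"].encode("ascii"))

    rec = to_fixed_record(b'{"account": 42, "amount_cents": 1999, "currency": "EUR", "note": "ignored"}')
    print(len(rec))   # 19 bytes, however verbose the incoming JSON was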

And on the way back out, it injects the equivalent of a JIT
disassembler into the front-end to convert the fixed format reply
into this potentially-variable-but-not-actually message format.

Putting this into communication links has absolutely predictable
consequences for development costs (none of that work is free),
performance, and run-time CPU charges when rolled out to production.

Which is why I suspect there is an interest in SIMD JSON parsing.

> Within a trusted set of services (i.e. all controlled by the same entity), you usually want a binary, serialized format with support on a lot of architectures and for a lot of languages and this is then used to cover everything from on-disk, to over-the-wire, to the IPC format. That usually means you have a compiled description language for the protocol and that gives you the methods to suck up and decode the messages into native data structures and vice-versa. Yes, that means both sides have to be sync'd or at least have some sort of versioning and ability to negotiate down to the last common version.

Right, but the question is do they NEED a variable message format,
or are they just designing it this way and using these tools because
that's how it was done before on some other application and
programmers are familiar with it.

> Google's Protocol Buffers (https://en.wikipedia.org/wiki/Protocol_Buffers, https://developers.google.com/protocol-buffers) are a good example of current state of the art. And yes, if you come from a background where you count nanoseconds (e.g. automated trading, or HPC), you make different choices and you can insist that all parties are on the same arch, same software version, on the same linux distro, built from the same compiler version, etc. But 99% of workloads are throughput centric and it's easier to scale by adding more nodes, more switches, more shards, more data centers, etc. than it is to hire the 5 people on the planet who might know how to rewrite and maintain your services as bare-metal code.
>
> -JohnG

Well, at least it is binary.
But I don't think it changes anything I said.
System designers need to question whether they really REQUIRE this
flexibility, because it is not free and in my experience they rarely do.



Tim Rentsch

unread,
Jul 1, 2022, 7:52:16 AM7/1/22
to
MitchAlsup <Mitch...@aol.com> writes:

> This reminds me of the saying that "programmers should develop
> programs in LISP and then when it is working convert that into
> any other programming language for performance."

This advice is humorous on so many levels.

Tim Rentsch

unread,
Jul 1, 2022, 8:41:56 AM7/1/22
to
EricP <ThatWould...@thevillage.com> writes:

[...]

> rant:ON {
>
> I looked at the JSON wikipedia page - it is almost as dumb as XML.

Let me note here that I am not a fan of either JSON or XML. On a
visceral level I share your reaction to these monstrosities.

> And the argument that "oh well you can edit the file" is crap because
> once you have a binary format spec you can trivially build an editor.
>
> [...]
>
> Why does anyone bother with this convert-to-text foolishness?
> Maybe they are uncomfortable programming with "binary values".

On these subpoints I would like to offer some counterpoint.

Building a minimal editor for a binary format may be fairly
simple, but building a full-fledged editor is not. Editors
are used for much more than simple editing. (Please imagine
here a rant of my own against the various "registry editors"
in MS Windows.)

Even if a new editor could be created instantly with a snap
of the fingers, it would be Yet Another Tool whose skill
curve would have to be climbed. I would much rather reuse
the tools I already know (and have used for decades) than
have to learn a new one. Well, unless of course there is
some significant compensating advantage for the new regime,
but I don't see that here.

Furthermore it isn't just editors. There are lots of text-based
tools (grep, sed, awk, diff, etc.) that can be usefully applied
for plain text input, but are essentially useless for almost any
binary format input.

It is IMO much better that JSON and XML are text-based rather
than using a binary format, whatever their other shortcomings
might be. And I think anyone who has done a non-trivial amount
of system administration work (which, sadly, I myself have) would
agree with that.

> } rant:OFF

(Despite my comments above I appreciated your ranting comments.)

Michael S

unread,
Jul 1, 2022, 9:01:24 AM7/1/22
to
If your language does not feature reflection, then JSON is practically impossible to de-serialize in a generic way.
All C++ JSON parsers are really just parsers rather than de-serializers.

If your language does feature reflection, then a generic JSON de-serializer is possible, but typically at the cost of
the quality of reporting of semantic mismatches.
I.e. you expect field 'bar' to be a float array but got it as an integer scalar. How do you tell the user what exactly went
wrong, with the indices of the problematic line and character?
Most likely, in something like Java or C# the user ends up seeing a cryptic exception printout.
And exploits based on clever mis-formatting of the message are not completely unlikely, although probably more
so in Python/JavaScript than in Java/C#/Swift/Go.
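
As an illustration of how little a generic checker can say once the text is
parsed, here is a small Python sketch that at least reports the path of the
mismatch (the schema and field names are invented; line/column positions are
already lost at this point):

    # Walk the parsed tree against an expected shape and report the path of
    # the first offending field; this is about the best you can easily do
    # without keeping source positions around.
    import json

    SCHEMA = {"name": str, "bar": [float]}        # 'bar' must be an array of floats

    def check(value, expected, path="$"):
        if isinstance(expected, dict):
            if not isinstance(value, dict):
                raise TypeError(f"{path}: expected object, got {type(value).__name__}")
            for key, sub in expected.items():
                check(value.get(key), sub, f"{path}.{key}")
        elif isinstance(expected, list):
            if not isinstance(value, list):
                raise TypeError(f"{path}: expected array, got {type(value).__name__}")
            for i, item in enumerate(value):
                check(item, expected[0], f"{path}[{i}]")
        elif not isinstance(value, expected):
            raise TypeError(f"{path}: expected {expected.__name__}, got {type(value).__name__}")

    check(json.loads('{"name": "x", "bar": 3}'), SCHEMA)
    # TypeError: $.bar: expected array, got int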


Michael S

unread,
Jul 1, 2022, 9:08:15 AM7/1/22
to
Last I looked at it, Protocol Buffers were fine. They were not over-flexible beyond practical needs.
I ended up rolling my own, simpler, Interface Definition Language, because back then Protocol Buffers
had no support for C, and for us C was the most important language.

But that was 5 or 7 years ago, so, possibly, not everything now is like it was then.

Robert Swindells

unread,
Jul 1, 2022, 9:37:53 AM7/1/22
to
Only if you think being wrong is funny.

Timothy McCaffrey

unread,
Jul 1, 2022, 10:43:08 AM7/1/22
to
On Thursday, June 30, 2022 at 1:59:22 AM UTC-4, JohnG wrote:

> Google's Protocol Buffers (https://en.wikipedia.org/wiki/Protocol_Buffers, https://developers.google.com/protocol-buffers) are a good example of current state of the art. And yes, if you come from a background where you count nanoseconds (e.g. automated trading, or HPC), you make different choices and you can insist that all parties are on the same arch, same software version, on the same linux distro, built from the same compiler version, etc. But 99% of workloads are throughput centric and it's easier to scale by adding more nodes, more switches, more shards, more data centers, etc. than it is to hire the 5 people on the planet who might know how to rewrite and maintain your services as bare-metal code.
>
> -JohnG

Cap'n Proto (https://capnproto.org/news/) was written by (one of) the Protocol Buffers developers, using Lessons Learned (tm).
The only problem with it is that it requires an OO language for the interfaces (I really don't see a way to use it directly with C).
- Tim

Jean-Marc Bourguet

unread,
Sep 17, 2022, 2:21:26 PM9/17/22
to
Ivan Godard <iv...@millcomputing.com> writes:

> On 6/12/2022 10:02 AM, MitchAlsup wrote:
>> On Sunday, June 12, 2022 at 11:40:26 AM UTC-5, Ivan Godard wrote:
>>> On 6/12/2022 7:58 AM, David Brown wrote:
>>>> On 12/06/2022 05:41, Stefan Monnier wrote:
>>>>> Quadibloc [2022-06-11 18:31:46] wrote:
>>>>>> maybe we should want a computer that can handle 6-bit
>>>>>> characters and 7-bit characters! (Especially if four-bit
>>>>>> BCD-coded decimal digits are not a concern.) So perhaps
>>>>>> there should have been more computers with a 42-bit
>>>>>> word length. (ASI did make _one_ comuter with a 21-bit
>>>>>> word length...)
>>>>>
>>>>> Fractional bit length anyone?
>>>>>
>>>>
>>>> Such machines would be of no use without software support. Fortunately,
>>>> there is already a proposal for extending integer types in C that will
>>>> cover this (and more) :
>>>>
>>>> <https://open-std.org/JTC1/SC22/WG21/docs/papers/2018/p0989r0.pdf>
>>>>
>>> Some languages have integral types with arbitrary static bounds, with
>>> bounds checking. Mary is the only one I know that had biased bounds:
>>> type twentieth range(1900, 1999);
>>> fit in a byte.
>> <
>> PL/1 had widths applied and range checked.
>> ADA has an entire system of width application and range checking.
>
> But not with biased representation as far as I know; please correct me.

GNAT's default representation doesn't use biasing, but it will use a
biased representation to satisfy a representation clause. See
https://gcc.gnu.org/onlinedocs/gnat_rm/Biased-Representation.html
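
For anyone unfamiliar with the term, a biased representation just stores the
value minus the lower bound; a tiny Python sketch of the 1900..1999 example
above (purely illustrative):

    # What is stored is year - 1900, so one byte suffices even though the
    # values themselves need 11 bits.
    LO, HI = 1900, 1999

    def encode(year):
        if not LO <= year <= HI:
            raise ValueError("out of range")   # the bounds check
        return year - LO                       # biased: 0..99, fits in a byte

    def decode(stored):
        return stored + LO

    assert decode(encode(1984)) == 1984
    print(encode(1999))   # 99 -- the largest stored value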

Yours,

--
Jean-Marc