Is there a need for a byte integer data type?


Pascal Jasmin

Oct 13, 2025, 10:43:41 AM
to Forum
It seems like the only way to store a byte list is:

 a. {~ 12 0 _22

Convert back to numeric with:

 a. i. a. {~ 12 0 _22
12 0 234
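As a check, the stored list really is J's 1-byte literal type (datatype is the standard-library utility):

 datatype a. {~ 12 0 _22
literal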

I understand that CPU math can be faster on bytes than on larger-sized integers.

Is there internal use of bytes that makes this not a concern? Should I just use the above approach for storage?

Henry Rich

Oct 13, 2025, 12:57:12 PM
to forum
There are some special combinations using a.&i. .
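For example, the bonded form (toint is a hypothetical name; the byte list is from the first message):

 toint =: a.&i.   NB. bytes to integer codes; the combination with special support
 toint a. {~ 12 0 _22
12 0 234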

We may implement 1-byte integers someday. 

Henry Rich


Devon McCormick

Oct 14, 2025, 2:06:05 PM
to fo...@jsoftware.com
My own initial reaction to this is that it's not such a great idea.  For there to be a significant performance improvement, one would have to be dealing with large arrays of small integers.  This might be the case when dealing with pictures represented as three-dimensional RGB arrays but it seems that, even in this case, most arithmetic would be in danger of promoting the single-byte integers to floating point or full integer values; this would probably negate any performance enhancement.

Another concern is the general one that the size of an interpreter is proportional to the number of datatypes it handles, since any new data type has to be considered in relation to all the existing data types.  If we were going to push for such an increase, it would seem that giving J true Booleans, with one bit per Boolean value, might be more generally useful.
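For scale: today a J boolean occupies a byte per element, which 7!:5 (space in bytes taken by a name) will show:

 b =: 1000000 $ 0 1
 7!:5 <'b'   NB. roughly 1e6 bytes plus a small header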

--

Devon McCormick

Flâneur


Marshall Lochbaum

Oct 14, 2025, 3:48:01 PM
to fo...@jsoftware.com
I have some notes on the topic based on Dyalog and BQN experience here:

https://mlochbaum.github.io/BQN/implementation/primitive/types.html

I think you're underestimating how much work bit booleans are. With 2D
or higher-rank arrays, even simple operations can be very tricky to
implement well. On the other hand much of J's code for 1-byte characters
could be directly reused for integers, as structural manipulations and
searching don't depend on arithmetic or ordering.

Promotion to a higher type can be a practical problem which is one
reason I emphasize 2-byte integers on that page. There's also the issue
of getting primitives to output a small type when applicable, which is
not that hard in a new interpreter but may be more of an annoyance when
changing a mature one.

Decreasing element type by a factor of 2 halves the per-element cost of
arithmetic, so it will be significant as long as the arrays aren't so
small that the program mostly spends on constant per-array costs. This
holds at about 1000 elements in CBQN, which I wouldn't say is too large.
The following benchmark for <./\ seems reasonably representative:

https://mlochbaum.github.io/bencharray/output/plot/scan-min.svg

(main page is https://mlochbaum.github.io/bencharray/pages/summary.html)
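A comparable measurement in J's own terms might look like the following (sizes are illustrative; timespacex is used the same way in later messages of this thread):

 10 timespacex '<./\ i1' [ i1 =. ? 1000000 $ 1000   NB. 8-byte integers
 10 timespacex '<./\ f1' [ f1 =. ? 1000000 $ 0      NB. 8-byte floats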

Marshall

Ak

Oct 14, 2025, 8:27:23 PM
to fo...@jsoftware.com

Can you share, more concretely, the rank/size of the information that you are storing?

Also, what is the elementwise range of an individual value in your overall case?


Ak

Ak

Oct 14, 2025, 10:56:08 PM
to fo...@jsoftware.com
...most arithmetic would be in danger of promoting the single-byte integers to floating point or full integer values...

More generally, it would be necessary to be able either to suppress a promotion or to specify the numeric type to hold for the operation. (There was a tangential forum discussion relating to the bug in x A. y usage that breaks its sound use when moving between A. y <-> x A. y.)

A concrete example: once we have the ability in J to target GPUs, APUs, NPUs, etc. natively, consider these penalties caused by promotion.

A certain Processor Unit (PU) is able to process a tensor operation on an Integer Array (IA) of shape 128 x 128 x 128 in an Efficient Time (ET) that is some fixed constant number of cycles.

At any point where IA is treated as float, some form of the following takes place. IA will immediately be factored into subsets of 64 x 64 x 64, or worse 32 x 32 x 32, Factor Bricks (FB), depending on the PU. Once formed, each FB is individually processed, imposing a Penalty Factor (PF). After processing the FB group, the solution must be recomposed into IA.

In the naive case (associative and commutative properties), the imposed PF is the Factor Size (FS) raised to the rank, all squared, times ET:

PF = (FS ^ rank)^2 * ET

that is, (2^3)^2 * ET for 64-wide FBs and (4^3)^2 * ET for 32-wide FBs.
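A quick check of those penalty factors in J, taking ET as one unit of cycles:

 ET =. 1
 ((128%64)^3)^2 * ET   NB. 64-wide bricks: FS is 2, rank 3
64
 ((128%32)^3)^2 * ET   NB. 32-wide bricks: FS is 4, rank 3
4096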

It can worsen dramatically if the FBs are not correctly aligned for processing. In many cases incorrect solutions will pervade (not even considering the value discrepancies commonly caused by moving between integer and floating point).

It is easy to imagine a trivial seven-step series of piped tensor operations where a promotion could be triggered at any individual step. Each trigger imposes a fresh PF, so any pipe would be subject to a multiplicative PF (determined by the order of promotion triggers).

Yikes!


Ak

Henry Rich

Oct 15, 2025, 3:23:58 AM
to forum
With 2- & 4-byte integers we do not promote to a larger size, reasoning that you chose the short type wisely. Presumably we would do the same with bytes. Saturating addition would be needed.
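A minimal sketch of what saturating signed-byte addition could look like (satadd is a hypothetical name; the byte range is modeled on ordinary integers):

 satadd =: _128 >. 127 <. +   NB. clamp the exact sum into the signed-byte range
 100 satadd 50
127
 _100 satadd _50
_128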

Henry Rich

Ak

Oct 15, 2025, 5:44:08 AM
to fo...@jsoftware.com
I do recall that is true for the general case. 

Corner case:
      datatype A. 0 1 2 3
extended


The critical point remains. We need the ability to specify that an operator treats its arguments as a particular type. 


Ak

Pascal Jasmin

Oct 15, 2025, 8:02:08 AM
to fo...@jsoftware.com
It's about neural network/LLM math where memory is scarce. Primary research targets bit sizes smaller than 8, but Henry's advice/reminder that a.&i. has special code is useful. The research goal is to emulate FP in the -1 to 1 range at small bit widths with "scaled integer multiplication".
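A rough sketch of scaled integer multiplication, assuming an 8-bit scale of 127 and hypothetical names:

 enc =: 3 : '<. 0.5 + 127 * y'          NB. quantize a value in _1..1 to an integer code
 dec =: 3 : 'y % 127'                   NB. decode back to a float
 qmul =: 4 : '<. 0.5 + (x * y) % 127'   NB. multiply entirely in the integer domain
 dec 0.5 qmul&enc 0.25                  NB. true product is 0.125
0.125984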

Pascal Jasmin

Oct 15, 2025, 10:08:56 AM
to fo...@jsoftware.com
Saturation would work for both the RGB case and "network matrix weights". If the user/programmer needs overflow, then they "should" use a general number format.

Ak

Oct 15, 2025, 12:59:50 PM
to fo...@jsoftware.com
Interesting, and how are you storing it (data shape, storage format, storage frequency)?

Which operations does that consist of (neural network/LLM math)?

Pascal Jasmin

Oct 16, 2025, 10:25:39 AM
to fo...@jsoftware.com
The proof of concept doesn't seem to work out well.

B3 =: #. inv 64 $ i.8   NB. placeholder table: row k is the 3-bit result for the 6-bit operand pair k
mul3 =: (B3 {~ #.@:,.)  NB. stitch two n x 3 bit arrays into n x 6, decode rows to 0..63, look up in B3

7 threads on an 8-core processor:

  10 timespacex 'mul3~ t.'''' "2 a'[a =. ? 7 1000 1000 3 $ 2

0.0257503 5.9535e6


   10 timespacex '*~ t.'''' "2 a'[a =. ? 7 1000 1000 $ 0

0.00483459 8.39254e6


5x slower for minimal memory savings compared to floating point.


Though I don't understand why the reported memory is so high when creating the bit array (it doesn't affect the timings above):


 timespacex ' ? 7 1000 1000 3 $ 2'

0.0357271 3.01991e8

timespacex ' ? 7 1000 1000 $ 0'

0.0409945 7.54984e7



JVERSION

Engine: j9.7.0-beta8/j64avx2/windows

Pascal Jasmin

Oct 16, 2025, 11:17:55 AM
to 'Pascal Jasmin' via forum
Apologies for a correction no one may care about, but the correct comparison is

 10 timespacex 'mul3~ t.''''"3 a'[a =. ? 7 1000 1000 3 $ 2

0.0169003 2.51701e7

vs. 

  10 timespacex '*~ t.'''' "2 a'[a =. ? 7 1000 1000 $ 0

0.00483459 8.39254e6


Speed is down to 4x worse, but now with more memory use.

Ak

Oct 16, 2025, 5:53:19 PM
to fo...@jsoftware.com
Hi Pascal,

Before I comment on your first message, I want the correct context for your POC.

Can you please explain/comment mul3 in more detail?

I want to understand how you would describe it. Please dumb it down for me.

Maybe a step-by-step explanation of one execution on one item.


I appreciate your patience. 

Akinbo


Pascal Jasmin

Oct 16, 2025, 9:21:53 PM
to fo...@jsoftware.com
B3 is a list of 64 3-bit numbers that I could hand-tune to represent 3-bit floating point numbers between -1 and 1, such that the product of any two such numbers also falls into this range.

mul3 (multiplication) just concatenates two 3-bit numbers into a 6-bit number. #. converts that into an integer, and a lookup from B3 is the result. B3 would represent buckets that round results to the nearest point.
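Spelled out on one item, using the placeholder B3 from the proof of concept:

 ] pair =. 1 1 1 , 1 0 0   NB. concatenate the two 3-bit operands into 6 bits
1 1 1 1 0 0
 #. pair                   NB. decode base 2: an index into the 64-row table
60
 B3 {~ 60                  NB. row 60 of the table; with the placeholder B3 this is #: 4
1 0 0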

I had hoped that this could be faster than multiplication of floating point numbers.  I don't understand why it doesn't use less memory.

Clifford Reiter

Oct 17, 2025, 8:08:17 AM
to fo...@jsoftware.com
mul3 was designed to work on rank-2 inputs, so the change to rank 3 gives a different shape of result.

Pascal Jasmin

Oct 17, 2025, 9:09:25 AM
to fo...@jsoftware.com
To operate at the intended rank with the infinite-rank definition, yes, that is annoying:

  1 1 1 mul3&,: 1 0 0  NB. (,: 1 1 1) mul3 ,: 1 0 0

1 0 0

Conceptually simpler is mul3 as a rank-1 function, with , instead of ,.

mul3b =: (B3 {~ #.@:,)("1 1) 

1 1 1 mul3b 1 0 0

1 0 0


   10 timespacex 'mul3b~ t.''''"3 a'[a =. ? 7 1000 1000 3 $ 2

0.0962772 4.19882e6

8 times slower. The memory improvement, I think, is explained by lower-rank operations meaning smaller intermediate results.

Intermediate memory is not as big a deal as the storage requirement, so there can be some edge cases for this, but FP math is surprisingly fast.

Ak

Oct 18, 2025, 1:44:43 AM
to fo...@jsoftware.com
I had hoped that this could be faster than multiplication of floating point numbers.  I don't understand why it doesn't use less memory.
(From your first email)

Though I don't understand why reported memory is so high with creating bit array (doesn't affect timings above)
(From your second email)


Three things to consider on this point:

1. Aside from the size of the arrays, there are many more major operation invocations than with the primitive multiplication operator (a structural operation, a conversion, a search), so I would expect size and time increases. Multiplication only needs to treat the arguments as they are. Where are you expecting time or space improvements?

I still might not be clear on the case you are treating. Are you building an alternative to the bdot operator (m b.)? I believe this operator is able to treat your arguments bitwise (without conversion). Maybe something like an appropriate combination of these will realize time/space improvements:

     1 1 1 (7 b.)//.@(2 b./) 1 0 0
or
     2 7 7 (22 b.)//.@(17 b./) 3 3 2          NB. (bitwise)


2. The cache properties of the executing host. Meaning there might be a more amenable shape for handling the arguments efficiently.

3. The arrangement of the axes of your arguments, for example 1000 1000 7 vs 1000 7 1000 vs 7 1000 1000. Since your use of multiplication here is associative and commutative, you can absolutely dial in the shapes (by transpose or reshape as needed) that your host executes best. Maybe Henry or Marshall can provide some insights on access patterns if it is relevant here, or correct me if my thinking is incorrect.


Akinbo

Pascal Jasmin

Oct 18, 2025, 8:58:02 AM
to fo...@jsoftware.com
The technique I was interested in has been done / is actively being worked on in a C/asm library. In their words, it aims to "support mixed-precision matrix multiplication (int1/2/3/4 x int8/fp16/fp32) without the need for dequantization by utilizing lookup tables":

https://github.com/microsoft/T-MAC/blob/main/README.md




The specific part of the project is qmemm_lut:
  
https://github.com/microsoft/T-MAC/blob/main/deploy/tuned/kernels.h



The 7 in the dimensions of my examples was about timing with 7 threading chunks.

Ak

Oct 19, 2025, 2:24:29 AM
to fo...@jsoftware.com
I see. 

If I am understanding the Techniques section correctly, the method designs in some failure conditions.


I suspect the following congenital features/symptoms will be exhibited in solutions:

For comparisons using multiple runs on a Constant Input Group Cig_0 (where Cig_0 reuses the same arguments for a sample execution):

          Different implementations on the same arguments
1. A timing delta attributed to algorithmic improvement will come with inconsistent solutions as measured against the baseline solution. Meaning, many solution varieties will be the general case.

          Same implementation on different arguments
2. For comparisons between Cig_0, Cig_1, through Cig_n, timing deltas within an algorithmic implementation will be susceptible to branch-prediction-failure penalties that have a combinatorial characteristic (given the earliest sequential layer depth at which the failure occurs). This property is caused by the architecture of the algorithm. Moreover, it is not only an accumulation that can result in an incorrect lookup: given the order of failure positions in a series (or in an individual threading chunk), all work (composition, synchronization, etc.) must be rescheduled and computed anew. I don't know what error-detection or error-correction scheme is implemented in T-MAC.

          Algorithmic approach
3. More generally for this use case, lookup tables lead to other degeneracies that cannot be cured.

There are better algorithmic tools to treat the principal question. I don't have experience running this algorithm; maybe you can share where my thinking can be improved.



Akinbo


 
