Row / Column / width / depth major, transpose option

Erich Plondke

Jul 5, 2016, 1:09:36 PM
to gemmlowp

Can you confirm or deny my understanding about ordering?

I think the EightBitIntGemm interface assumes column major if the transpose option is false, and row major if the transpose option is true?

For both the LHS and RHS matrix, "Depth" is the "K" direction; depth must be equal in both matrices... in kernel.h, it says "In a matrix product of a MxK matrix times a KxN matrix, we call K the 'depth'."

In kernel.h, it says that:

"In the LHS, width-major means row-major, while in the RHS, width-major means column-major." 

I would expect that the LHS K direction is rows, and the RHS K direction is columns.  So wouldn't LHS depth-major be row-major?

I think the default "depth-major" in the code is actually width-major, and vice versa.  For both the LHS and RHS the data is arranged sequentially for parallel execution, rather than accumulation.

I wrote a test with 256 contiguous bytes, array[i] = i, and passed that buffer in as a 16x16 matrix, transposed.

If I make my kernel CellOrder::WidthMajor, I see the data come in in chunks of row-major order (0,1,2,3) on the LHS.  If I make my kernel DepthMajor, I see column-major order (0x0,0x10,0x20,0x30) on the LHS.  It's OK, I just think it's not the "Depth" direction so I want to make sure I haven't screwed up the ordering somewhere.
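For reference, the test data is just this (my own throwaway code, shown only so the numbers above make sense; the function name is mine):

    #include <cstdint>

    // 256 contiguous bytes, array[i] = i, viewed as a 16x16 matrix.
    std::uint8_t array[256];
    void FillTestMatrix() {
      for (int i = 0; i < 256; ++i) array[i] = static_cast<std::uint8_t>(i);
    }
    // Read row-major, entry (r, c) is array[r * 16 + c]:
    //   walking along a row gives 0, 1, 2, 3, ...
    //   walking down a column gives 0x00, 0x10, 0x20, 0x30, ...

So the values themselves tell me which direction the pack stage walked when filling the cells.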


Erich Plondke

Jul 5, 2016, 6:20:11 PM
to gemmlowp
One more ordering question: the output of the kernel is always column-major? 

Benoit Jacob

Jul 6, 2016, 5:17:48 AM
to Erich Plondke, gemmlowp
On Tue, Jul 5, 2016 at 1:09 PM, Erich Plondke <eplo...@gmail.com> wrote:

Can you confirm or deny my understanding about ordering?

I think the EightBitIntGemm interface assumes column major if the transpose option is false, and row major if the transpose option is true?

That's correct.

First, I would advise against using the legacy EightBitIntGemm interface. Prefer the public/gemmlowp.h interface, which is explicit in this respect, especially if you are going to care about internal details such as kernels (below in your email).
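With the public interface, the storage order of each operand is spelled out in the MatrixMap type, so there is no transpose flag to remember. Roughly like this, from memory - please double-check the exact signature in public/gemmlowp.h; the function and variable names here are just placeholders:

    #include "public/gemmlowp.h"

    // Placeholders: lhs_data is M x K row-major, rhs_data is K x N column-major,
    // result_data is M x N column-major; the offset/mult/shift parameters are
    // the usual quantization parameters.
    void ExampleGemm(const std::uint8_t* lhs_data, const std::uint8_t* rhs_data,
                     std::uint8_t* result_data, int M, int K, int N,
                     int lhs_offset, int rhs_offset, int result_offset,
                     int result_mult_int, int result_shift) {
      gemmlowp::GemmContext context;
      gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::RowMajor>
          lhs(lhs_data, M, K);
      gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::ColMajor>
          rhs(rhs_data, K, N);
      gemmlowp::MatrixMap<std::uint8_t, gemmlowp::MapOrder::ColMajor>
          result(result_data, M, N);
      gemmlowp::Gemm<std::uint8_t, gemmlowp::DefaultL8R8BitDepthParams>(
          &context, lhs, rhs, &result, lhs_offset, rhs_offset, result_offset,
          result_mult_int, result_shift);
    }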

In any case, the transpose_* flags in the EightBitIntGemm interface are used here, which explains their effect:


This confirms that transpose_*=true means RowMajor, false means ColMajor.

 

For both the LHS and RHS matrix, "Depth" is the "K" direction; depth must be equal in both matrices... in kernel.h, it says "In a matrix product of a MxK matrix times a KxN matrix, we call K the 'depth'."

Correct.
 

In kernel.h, it says that:

"In the LHS, width-major means row-major, while in the RHS, width-major means column-major." 

Correct.
 

I would expect that the LHS K direction is rows, and the RHS K direction is columns.  So wouldn't LHS depth-major be row-major?

No, in the LHS the K direction is columns and in the RHS the K direction is rows, since K is always the depth on either side.

"Width" is always the dimension opposite to "depth". So in the LHS, width is rows, and in the RHS, width is columns.

"width-major" is defined as "storing contiguously all the matrix entries sharing a common 'width' index", just like "row-major" means "storing contiguously all the matrix entries sharing a common 'row' index".

Thus, it is consistent that "width-major" means row-major in the LHS and means column-major in the RHS.
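If it helps, here is the cell-level index arithmetic as I think of it (an illustration only, not code from gemmlowp):

    // A block of Width W and Depth D; (w, d) are an entry's width/depth indices.
    //
    // width-major: entries sharing a 'width' index are contiguous,
    //              i.e. d varies fastest:  offset = w * D + d
    //              (row-major for the LHS, column-major for the RHS)
    //
    // depth-major: entries sharing a 'depth' index are contiguous,
    //              i.e. w varies fastest:  offset = d * W + w
    //              (column-major for the LHS, row-major for the RHS)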
 

I think the default "depth-major" in the code is actually width-major, and vice versa.

I hope that the above clarifies it.

 
 For both the LHS and RHS the data is arranged sequentially for parallel execution, rather than accumulation.

Note that the kernel format only applies to small pieces of the LHS and RHS, sized to fit in registers. Beyond that, these pieces are always stored 'depth-major', so that the inner loop (the kernel) can traverse the matrices in the depth direction for accumulation. That is explained in the example in kernel.h.

On Tue, Jul 5, 2016 at 6:20 PM, Erich Plondke <eplo...@gmail.com> wrote:
One more ordering question: the output of the kernel is always column-major? 

Yes, that is correct. The unpack stage then takes care of properly mapping that to the destination matrix's storage order. 

Cheers,
Benoit
 

I wrote a test with 256 bytes contiguous, array[i] = i, and pass to a test with 16x16, transposed.

If I make my kernel CellOrder::WidthMajor, I see the data come in in chunks of row-major order (0,1,2,3) on the LHS.  If I make my kernel DepthMajor, I see column-major order (0x0,0x10,0x20,0x30) on the LHS.  It's OK, I just think it's not the "Depth" direction so I want to make sure I haven't screwed up the ordering somewhere.



Erich Plondke

Jul 6, 2016, 11:39:43 AM
to Benoit Jacob, gemmlowp
On Wed, Jul 6, 2016 at 4:17 AM, Benoit Jacob <benoi...@google.com> wrote:


For both the LHS and RHS matrix, "Depth" is the "K" direction; depth must be equal in both matrices... in kernel.h, it says "In a matrix product of a MxK matrix times a KxN matrix, we call K the 'depth'."

Correct.
 

So the only way this makes sense to me is if the LHS and RHS are swapped from what I expect.  


A LHS of MxK would be M rows each of length K.  A RHS would be N columns each of length K.  K is the depth direction, and the direction of accumulation, and the direction we take the "dot product".

But instead if I flip around LHS and RHS opposite of High School Matrix Multiplies: (https://www.mathsisfun.com/algebra/images/matrix-multiply-a.gif), and we are doing the dot product of the LHS columns and RHS rows then it makes sense; and the LHS is K columns and the RHS is K rows.

But the reference kernels (Neon/SSE) seem to not accumulate the products of consecutive LHS and RHS elements, even though they use the default CellOrder (DepthMajor).  Instead they vectorize in that direction.

The picture at https://github.com/google/gemmlowp/blob/master/internal/kernel_neon.h#L83 seems to match what we do in the code: 

* Multiply the first four consecutive elements into separate accumulators: https://github.com/google/gemmlowp/blob/master/internal/kernel_neon.h#L121
* Accumulate them with the second set of values: https://github.com/google/gemmlowp/blob/master/internal/kernel_neon.h#L141

If we are doing "dot product" down the depth direction, I would expect the products of consecutive elements would be added to each other.  Instead we accumulate the products of the first four elements with the products of the second four elements.  It seems like that would mean "width major" since depth is the order of accumulation.



If we have a kernel that has a KernelFormat<KernelSideFormat<CellFormat<2, 4>, 1>, KernelSideFormat<CellFormat<2, 4>, 1> >, and in memory we found:

LHS: A, B, C, D, E, F, G, H
RHS: a, b, c, d, e, f, g, h

I would expect we should compute Aa+Bb+Cc+Dd, Ea+Fb+Gc+Hd, Ae+Bf+Cg+Dh, and Ee+Ff+Gg+Hh in our kernel if we have depth-major ordering.

Instead, I think that we are expected to compute Aa+Cc+Ee+Gg, Ba+Dc+Fe+Hg, Ab+Cd+Ef+Gh, and Bb+Dd+Ff+Hh.  This is perfect if I have two element wide vectors that add elementwise, but it seems like the memory format of the cell has consecutive elements in the width direction, not the depth direction.
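To spell out how I'm reading the DepthMajor layout (my own notation, just so we are talking about the same thing):

    // CellFormat<Width=2, Depth=4> with CellOrder::DepthMajor:
    // the width index varies fastest in memory, so the bytes map to
    // (width, depth) coordinates as:
    //
    //   LHS: A=(0,0) B=(1,0) C=(0,1) D=(1,1) E=(0,2) F=(1,2) G=(0,3) H=(1,3)
    //   RHS: a=(0,0) b=(1,0) c=(0,1) d=(1,1) e=(0,2) f=(1,2) g=(0,3) h=(1,3)
    //
    // and Result(lhs_w, rhs_w) = sum over the depth index gives:
    //
    //   Result(0,0) = Aa + Cc + Ee + Gg    Result(1,0) = Ba + Dc + Fe + Hg
    //   Result(0,1) = Ab + Cd + Ef + Gh    Result(1,1) = Bb + Dd + Ff + Hh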

I have a new kernel working (passing tests), but I'm concerned that I'm doing something wrong because I have to use WidthMajor to get cells in a format where consecutive elements in memory are along the dot product dimension.
 

On Tue, Jul 5, 2016 at 6:20 PM, Erich Plondke <eplo...@gmail.com> wrote:
One more ordering question: the output of the kernel is always column-major? 

Yes, that is correct. The unpack stage then takes care of properly mapping that to the destination matrix's storage order. 

 I see.  This probably means I need to swap my LHS and RHS strategies so I am able to store contiguous vectors. I seem to have row-major on the brain.

Benoit Jacob

Jul 7, 2016, 6:48:00 AM
to Erich Plondke, gemmlowp
On Wed, Jul 6, 2016 at 11:39 AM, Erich Plondke <eplo...@gmail.com> wrote:


On Wed, Jul 6, 2016 at 4:17 AM, Benoit Jacob <benoi...@google.com> wrote:


For both the LHS and RHS matrix, "Depth" is the "K" direction; depth must be equal in both matrices... in kernel.h, it says "In a matrix product of a MxK matrix times a KxN matrix, we call K the 'depth'."

Correct.
 

So the only way this makes sense to me is if the LHS and RHS are swapped from what I expect.  


A LHS of MxK would be M rows each of length K.  A RHS would be N columns each of length K.  K is the depth direction, and the direction of accumulation, and the direction we take the "dot product".

That is exactly what I have in mind, too.
 

But instead if I flip around LHS and RHS opposite of High School Matrix Multiplies: (https://www.mathsisfun.com/algebra/images/matrix-multiply-a.gif), and we are doing the dot product of the LHS columns and RHS rows then it makes sense; and the LHS is K columns and the RHS is K rows.

I don't understand what "flipping around" is needed to make this agree with the above. The matrix multiplication picture above matches what I have in mind, too. Moreover: the two viewpoints that you are describing above:

"the LHS is K columns"

and, earlier, in the above paragraph,

"A LHS of MxK would be M rows each of length K"

are compatible: if the LHS is "M rows each of length K" then it is K columns.

Thus, so far, all of what you wrote seems to be perfectly compatible with what I have in mind (except I do not understand what "flipping around" would be needed).
 

But the reference kernels (Neon/SSE) seem to not accumulate the products of consecutive LHS and RHS elements, even though they use the default CellOrder (DepthMajor).  Instead they vectorize in that direction.

The picture at https://github.com/google/gemmlowp/blob/master/internal/kernel_neon.h#L83 seems to match what we do in the code: 

* Multiply the first four consecutive elements into separate accumulators: https://github.com/google/gemmlowp/blob/master/internal/kernel_neon.h#L121
* Accumulate them with the second set of values: https://github.com/google/gemmlowp/blob/master/internal/kernel_neon.h#L141

If we are doing "dot product" down the depth direction, I would expect the products of consecutive elements would be added to each other.  Instead we accumulate the products of the first four elements with the products of the second four elements.  It seems like that would mean "width major" since depth is the order of accumulation.



If we have a kernel that has a KernelFormat<KernelSideFormat<CellFormat<2, 4>, 1>, KernelSideFormat<CellFormat<2, 4>, 1> >, and in memory we found:

LHS: A, B, C, D, E, F, G, H
RHS: a, b, c, d, e, f, g, h

I would expect we should compute Aa+Bb+Cc+Dd, Ea+Fb+Gc+Hd, Ae+Bf+Cg+Dh, and Ee+Ff+Gg+Hh in our kernel if we have depth-major ordering.

Instead, I think that we are expected to compute Aa+Cc+Ee+Gg, Ba+Dc+Fe+Hg, Ab+Cd+Ef+Gh, and Bb+Dd+Ff+Hh.  This is perfect if I have two element wide vectors that add elementwise, but it seems like the memory format of the cell has consecutive elements in the width direction, not the depth direction.

I have a new kernel working (passing tests), but I'm concerned that I'm doing something wrong because I have to use WidthMajor to get cells in a format where consecutive elements in memory are along the dot product dimension.

Ah, thanks for the explanation, I think I understand now what the confusion is about.

First, if your kernel is passing tests, then it's good. The only "contract" that your kernel must satisfy is that it produces correct results; since a kernel is a piece of code computing a narrow/deep (rectangular*rectangular) matrix product, the contract is only concerned with the correctness of the matrix product result.

That means that it is entirely up to each kernel how that matrix product is going to be computed. One way to compute matrix products is to compute each entry of the result as a dot-product, as you explain above.

There are other ways of computing matrix products; the one in the existing kernel that you linked to is *not* working by computing dot products in the destination matrix. Instead, as you found, it takes a vector of N entries in the LHS and multiplies it by one individual (scalar) entry of the RHS, using ARM NEON multiply-by-scalar instructions, like this:

  vmlal.u16 q4, d2, d0[0]

It accumulates that (as you found) into N accumulators.

A high-level picture of what this is doing is that it splits the matrix*matrix product into individual matrix*vector products (seeing RHS columns as separate vectors) and then again splits those into vector*scalar products (seeing LHS columns as vectors and RHS vector entries as scalars).
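In scalar pseudocode, that strategy looks roughly like this (an illustration of the idea only, not the actual kernel code; WL/WR are the LHS/RHS cell widths and D the depth):

    // One depth-major LHS cell (WL wide, D deep) times one depth-major RHS cell
    // (WR wide, D deep), accumulating into a WL x WR block of accumulators.
    // At each depth step, a whole LHS "width" vector is multiplied by a single
    // scalar RHS entry - which is what the NEON multiply-by-scalar instruction
    // does with whole registers at once.
    for (int d = 0; d < D; ++d) {
      for (int wr = 0; wr < WR; ++wr) {
        const std::int32_t rhs_scalar = rhs_cell[d * WR + wr];
        for (int wl = 0; wl < WL; ++wl) {
          acc[wr * WL + wl] += lhs_cell[d * WL + wl] * rhs_scalar;
        }
      }
    }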

But that is just one approach to making a kernel and if you prefer a dot-product approach, that is absolutely OK too.

In fact, gemmlowp's kernel interface, where each kernel provides the entire inner loop, is designed to allow such differences of design between different kernels. It is a departure from the more rigid kernel interface in Eigen, which pretty much dictated one single kernel design.

We found that different architectures, data types, etc, could often require different kernel designs to perform optimally, so it would be very interesting to hear what you found on particular hardware leading to your particular kernel's design choices!

Cheers,
Benoit

Erich Plondke

Jul 7, 2016, 3:00:01 PM
to Benoit Jacob, gemmlowp
On Thu, Jul 7, 2016 at 5:47 AM, Benoit Jacob <benoi...@google.com> wrote:


On Wed, Jul 6, 2016 at 11:39 AM, Erich Plondke <eplo...@gmail.com> wrote:


On Wed, Jul 6, 2016 at 4:17 AM, Benoit Jacob <benoi...@google.com> wrote:


For both the LHS and RHS matrix, "Depth" is the "K" direction; depth must be equal in both matrices... in kernel.h, it says "In a matrix product of a MxK matrix times a KxN matrix, we call K the 'depth'."

Correct.
 

So the only way this makes sense to me is if the LHS and RHS are swapped from what I expect.  


A LHS of MxK would be M rows each of length K.  A RHS would be N columns each of length K.  K is the depth direction, and the direction of accumulation, and the direction we take the "dot product".

That is exactly what I have in mind, too.
 

But instead if I flip around LHS and RHS opposite of High School Matrix Multiplies: (https://www.mathsisfun.com/algebra/images/matrix-multiply-a.gif), and we are doing the dot product of the LHS columns and RHS rows then it makes sense; and the LHS is K columns and the RHS is K rows.

I don't understand what "flipping around" is needed to make this agree with the above. The matrix multiplication picture above matches what I have in mind, too. Moreover: the two viewpoints that you are describing above:

"the LHS is K columns"

and, earlier, in the above paragraph,

"A LHS of MxK would be M rows each of length K"

are compatible: if the LHS is "M rows each of length K" then it is K columns.

Thus, so far, all of what you wrote seems to be perfectly compatible with what I have in mind (except I do not understand what "flipping around" would be needed).

Good! I'm glad I'm not crazy. :-)

I was trying to find a way to explain to myself how the "LHS" could be K rows of length M instead of M rows of length K and still have the matrix multiply work.
 
 

But the reference kernels (Neon/SSE) seem to not accumulate the products of consecutive LHS and RHS elements, even though they use the default CellOrder (DepthMajor).  Instead they vectorize in that direction.

The picture at https://github.com/google/gemmlowp/blob/master/internal/kernel_neon.h#L83 seems to match what we do in the code: 

* Multiply the first four consecutive elements into separate accumulators: https://github.com/google/gemmlowp/blob/master/internal/kernel_neon.h#L121
* Accumulate them with the second set of values: https://github.com/google/gemmlowp/blob/master/internal/kernel_neon.h#L141

If we are doing "dot product" down the depth direction, I would expect the products of consecutive elements would be added to each other.  Instead we accumulate the products of the first four elements with the products of the second four elements.  It seems like that would mean "width major" since depth is the order of accumulation.



If we have a kernel that has a KernelFormat<KernelSideFormat<CellFormat<2, 4>, 1>, KernelSideFormat<CellFormat<2, 4>, 1> >, and in memory we found:

LHS: A, B, C, D, E, F, G, H
RHS: a, b, c, d, e, f, g, h

I would expect we should compute Aa+Bb+Cc+Dd, Ea+Fb+Gc+Hd, Ae+Bf+Cg+Dh, and Ee+Ff+Gg+Hh in our kernel if we have depth-major ordering.

Instead, I think that we are expected to compute Aa+Cc+Ee+Gg, Ba+Dc+Fe+Hg, Ab+Cd+Ef+Gh, and Bb+Dd+Ff+Hh.  This is perfect if I have two element wide vectors that add elementwise, but it seems like the memory format of the cell has consecutive elements in the width direction, not the depth direction.

I have a new kernel working (passing tests), but I'm concerned that I'm doing something wrong because I have to use WidthMajor to get cells in a format where consecutive elements in memory are along the dot product dimension.

Ah, thanks for the explanation, I think I understand now what the confusion is about.

First, if your kernel is passing tests, then it's good. The only "contract" that your kernel must satisfy is that it produces correct results; since a kernel is a piece of code computing a narrow/deep (rectangular*rectangular) matrix product, the contract is only concerned with the correctness of the matrix product result.

That means that it is entirely up to each kernel how that matrix product is going to be computed. One way to compute matrix products is to compute each entry of the result as a dot-product, as you explain above.

There are other ways of computing matrix products; the one in the existing kernel that you linked to is *not* working by computing dot products in the destination matrix. Instead, as you found, it takes a vector of N entries in the LHS and multiplies it by one individual (scalar) entry of the RHS, using ARM NEON multiply-by-scalar instructions, like this:

  vmlal.u16 q4, d2, d0[0]

It accumulates that (as you found) into N accumulators.

A high-level picture of what this is doing is that it splits the matrix*matrix product into individual matrix*vector products (seeing RHS columns as separate vectors) and then again splits those into vector*scalar products (seeing LHS columns as vectors and RHS vector entries as scalars).

But that is just one approach to making a kernel and if you prefer a dot-product approach, that is absolutely OK too.
 
Great!  I'm just still confused as to how a layout whose consecutive entries run along the width direction ends up being called "DepthMajor" (which is how the ARM kernel is defined, and is the default).  I would consider it "WidthMajor", since the consecutive elements are along the width (parallel) dimension.  So it's just the docs / naming that is a bit confusing; I had to print out values and tinker with the ordering to convince myself that I was getting the right data in the right locations.
 
In fact, gemmlowp's kernel interface, where each kernel provides the entire inner loop, is designed to allow such differences of design between different kernels. It is a departure from the more rigid kernel interface in Eigen, which pretty much dictated one single kernel design.

Yes, that's very nice for us.

I noticed that we don't have the option of row vs. column major output.  Does the outside code (tensorflow) tend to favor column major order?
 
We found that different architectures, data types, etc, could often require different kernel designs to perform optimally, so it would be very interesting to hear what you found on particular hardware leading to your particular kernel's design choices!
 
I'd be happy to share this with you in a separate thread. :-)

Benoit Jacob

Jul 11, 2016, 12:08:41 PM
to Erich Plondke, gemmlowp
I have been confused too! When I introduced the notions of width/depth as an alternative to row/column, to have a uniform addressing scheme for both the LHS and RHS, I wondered how to name the corresponding storage orders, and I have confused myself a lot with my own conventions!

The rationale for the current convention is as follows:

The accepted convention is to call "row-major" the storage order whereby entries with the same "row" index (i.e. entries in the same row) are stored contiguously.

Likewise, the accepted convention is to call "column-major" the storage order whereby entries with the same "column" index (i.e. entries in the same column) are stored contiguously.

So to be consistent with these accepted conventions, I felt that I had no choice but to call "width-major" the storage order whereby entries with the same "width" index are stored contiguously.

Yet, like you, I've subsequently found it confusing. My best defense is that this confusing terminology preexists in the accepted "row-major" and "column-major"; it just doesn't feel quite as confusing there. Probably because "row" and "column" have a dual meaning: they denote both an index (like "width" does) and a slice of the matrix at a constant index (unlike "width"). It seems that the intuitiveness of the "row-major" and "column-major" terminology comes from that latter meaning, which does not carry over to "width".

 
 
In fact, gemmlowp's kernel interface, where each kernel provides the entire inner loop, is designed to allow such differences of design between different kernels. It is a departure from the more rigid kernel interface in Eigen, which pretty much dictated one single kernel design.

Yes, that's very nice for us.

I noticed that we don't have the option of row vs. column major output.

That's correct. gemmlowp is not symmetric in this respect: the kernels' output is always column-major. That does mean that there is a penalty for using row-major output.

However, note that the penalty does not have to be large. The conversion to a final row-major destination is done in the Unpack stage. There is a slow generic implementation (in unpack.h), and there may be architecture-specific fast paths (e.g. unpack_neon.h). Those fast paths are each defined for one particular destination storage order.

At the moment there is no fast path for row-major destinations, so those would fall back to slow generic code. However, it would not be difficult to write a fast path (as another specialization in unpack_neon.h) for a row-major destination, doing the transposing. Since Unpack (once we have a fast path) is cheap and is often mostly memory-bound anyways, I believe that such a row-major (transposing) path would be almost as fast as column-major.

The rationale for sticking to only column-major output in kernels was thus to avoid one degree of freedom that I thought would not be missed too much. Happy to revisit as needed (and being more "symmetric" would be nice).

 
Does the outside code (tensorflow) tend to favor column major order?

Note - the calling code may or may not be TensorFlow.

What makes it difficult to answer the present question is that there is a mismatch of abstraction levels between GEMM, which is a "low-level" numerical primitive, and Neural Network operators such as "Fully Connected layer" and "Convolutional layer", which are higher-level operators that may, as an implementation detail, happen to call GEMM, and still have a few degrees of freedom in how they call GEMM.

For example, one may choose to put the weights in the LHS of the GEMM and put the activations in the RHS,

ResultActivationsMatrix = WeightsMatrix * InputActivationsMatrix,

or one may choose the opposite approach, putting the weights in the RHS of the GEMM:

ResultActivationsMatrix = InputActivationsMatrix * WeightsMatrix.

Of course, the mathematical operation of matrix transposition allows one to "translate" between these two approaches, since for any two matrices A and B, one has

transpose(A * B) = transpose(B) * transpose(A).

That means that the freedom to choose the order of operands when mapping NN operators (such as a Convolutional layer) to GEMM primitives gives the freedom to choose the storage order of the result (in exchange for flipping the two operands to the GEMM).

Given how difficult it is to design a GEMM implementation that is both efficient and generic (for all 2^3 = 8 combinations of storage orders), and given how easy it is for an NN inference implementation to adapt to one preferred GEMM output storage order, I found it a reasonable compromise to have the GEMM implementation just pick one and have the caller adapt to it. Happy to revisit as needed.

Cheers,
Benoit