Can you confirm or deny my understanding about ordering? I think the EightBitIntGemm interface assumes column-major if the transpose option is false, and row-major if the transpose option is true?
For both the LHS and RHS matrices, "Depth" is the "K" direction, and the depth must be equal in both matrices. In kernel.h, it says: "In a matrix product of a MxK matrix times a KxN matrix, we call K the 'depth'."
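To pin down that terminology, here is a plain reference triple loop. This is an illustrative sketch, not gemmlowp code; the function name and the row-major layouts are my own choices. The shared inner dimension K, which both operands must agree on, is what kernel.h calls the "depth":

#include <cstdint>

// Reference product C(MxN) = A(MxK) * B(KxN), all row-major for simplicity.
void ReferenceGemm(const std::uint8_t* a, const std::uint8_t* b,
                   std::int32_t* c, int m, int n, int k) {
  for (int row = 0; row < m; ++row) {
    for (int col = 0; col < n; ++col) {
      std::int32_t acc = 0;
      for (int depth = 0; depth < k; ++depth) {  // accumulate along K
        acc += std::int32_t(a[row * k + depth]) *
               std::int32_t(b[depth * n + col]);
      }
      c[row * n + col] = acc;
    }
  }
}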
In kernel.h, it also says: "In the LHS, width-major means row-major, while in the RHS, width-major means column-major."
I would expect that in the LHS the K direction runs along the rows, and in the RHS it runs along the columns. So wouldn't LHS depth-major be row-major?
I think the default "depth-major" in the code is actually width-major, and vice versa.
For both the LHS and RHS, the data seems to be arranged sequentially for parallel execution rather than for accumulation.
One more ordering question: the output of the kernel is always column-major?
I wrote a test with 256 contiguous bytes, array[i] = i, and passed it to a 16x16, transposed GEMM. If I make my kernel CellOrder::WidthMajor, I see the data come in in chunks of row-major order (0, 1, 2, 3) on the LHS. If I make my kernel DepthMajor, I see column-major order (0x00, 0x10, 0x20, 0x30) on the LHS. That's OK; I just think that's not the "Depth" direction, so I want to make sure I haven't screwed up the ordering somewhere.
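For what it's worth, the two chunk patterns described above can be reproduced with plain index arithmetic. This self-contained sketch (it does not use gemmlowp at all) just prints the contiguous chunk versus the stride-16 chunk:

#include <cstdio>

int main() {
  // array[i] = i for 256 contiguous bytes, viewed as a 16x16 matrix.
  unsigned char lhs[256];
  for (int i = 0; i < 256; ++i) lhs[i] = static_cast<unsigned char>(i);
  const int stride = 16;
  // Consecutive bytes (the pattern WidthMajor showed on the LHS):
  std::printf("contiguous chunk: %d %d %d %d\n",
              lhs[0], lhs[1], lhs[2], lhs[3]);  // 0 1 2 3
  // Stride-16 bytes (the pattern DepthMajor showed on the LHS):
  std::printf("strided chunk: 0x%02x 0x%02x 0x%02x 0x%02x\n",
              lhs[0 * stride], lhs[1 * stride], lhs[2 * stride],
              lhs[3 * stride]);  // 0x00 0x10 0x20 0x30
  return 0;
}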
> For both the LHS and RHS matrices, "Depth" is the "K" direction, and the depth must be equal in both matrices. In kernel.h, it says: "In a matrix product of a MxK matrix times a KxN matrix, we call K the 'depth'."

Correct.

On Tue, Jul 5, 2016 at 6:20 PM, Erich Plondke <eplo...@gmail.com> wrote:
> One more ordering question: the output of the kernel is always column-major?

Yes, that is correct. The unpack stage then takes care of properly mapping that to the destination matrix's storage order.
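As a rough illustration of what that mapping involves, here is a hedged sketch of copying a column-major result block into a row-major destination. The function name and parameters are illustrative only, and gemmlowp's actual unpack stage does more than this (for example, applying output offsets and requantization):

#include <cstdint>

// Minimal sketch: map a column-major kernel-output block into a
// row-major destination matrix.
void UnpackColMajorBlockToRowMajor(const std::int32_t* block, int rows,
                                   int cols, std::int32_t* dst,
                                   int dst_row_stride) {
  for (int c = 0; c < cols; ++c) {
    for (int r = 0; r < rows; ++r) {
      // block is column-major: entry (r, c) lives at r + rows * c.
      dst[r * dst_row_stride + c] = block[r + rows * c];
    }
  }
}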
On Wed, Jul 6, 2016 at 4:17 AM, Benoit Jacob <benoi...@google.com> wrote:
> > For both the LHS and RHS matrices, "Depth" is the "K" direction, and the depth must be equal in both matrices...
> Correct.

So the only way this makes sense to me is if the LHS and RHS are swapped from what I expect.

What I expect from https://github.com/google/gemmlowp/blob/master/internal/kernel.h#L46: an LHS of MxK would be M rows, each of length K. An RHS would be N columns, each of length K. K is the depth direction, the direction of accumulation, and the direction in which we take the "dot product".

But if instead I flip the LHS and RHS around, opposite to high-school matrix multiplication (https://www.mathsisfun.com/algebra/images/matrix-multiply-a.gif), and we take the dot product of LHS columns with RHS rows, then it makes sense; the LHS is K columns and the RHS is K rows.
But the reference kernels (NEON/SSE) seem not to accumulate the products of consecutive LHS and RHS elements, even though they use the default CellOrder (DepthMajor); instead they vectorize in that direction. The picture at https://github.com/google/gemmlowp/blob/master/internal/kernel_neon.h#L83 seems to match what the code does:

* Load up 8 bytes: https://github.com/google/gemmlowp/blob/master/internal/kernel_neon.h#L113
* Zero-extend them to halfwords: https://github.com/google/gemmlowp/blob/master/internal/kernel_neon.h#L121
* Multiply the first four consecutive elements into separate accumulators: https://github.com/google/gemmlowp/blob/master/internal/kernel_neon.h#L121
* Accumulate them with the second set of values: https://github.com/google/gemmlowp/blob/master/internal/kernel_neon.h#L141

If we are taking the "dot product" down the depth direction, I would expect the products of consecutive elements to be added to each other. Instead, we accumulate the products of the first four elements with the products of the second four elements. That seems like it would mean "width-major", since depth is the direction of accumulation.

Suppose we have a kernel with KernelFormat<KernelSideFormat<CellFormat<2, 4>, 1>, KernelSideFormat<CellFormat<2, 4>, 1> >, and in memory we find:

LHS: A, B, C, D, E, F, G, H
RHS: a, b, c, d, e, f, g, h

With depth-major ordering, I would expect the kernel to compute Aa+Bb+Cc+Dd, Ea+Fb+Gc+Hd, Ae+Bf+Cg+Dh, and Ee+Ff+Gg+Hh. Instead, I think we are expected to compute Aa+Cc+Ee+Gg, Ba+Dc+Fe+Hg, Ab+Cd+Ef+Gh, and Bb+Dd+Ff+Hh. That is perfect if I have two-element-wide vectors that multiply and add elementwise, but it seems to mean that the memory format of the cell has consecutive elements along the width direction, not the depth direction.

I have a new kernel working (passing tests), but I'm concerned that I'm doing something wrong, because I have to use WidthMajor to get cells in a format where consecutive elements in memory lie along the dot-product dimension.
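To make the two candidate layouts concrete, here is a scalar check of the 2x4-cell example above. The values for A..H and a..h are arbitrary, and the two index formulas encode the two layouts being contrasted in this thread, not gemmlowp's official definitions:

#include <cstdio>

int main() {
  int L[8] = {1, 2, 3, 4, 5, 6, 7, 8};  // A..H as laid out in memory
  int R[8] = {1, 2, 3, 4, 5, 6, 7, 8};  // a..h as laid out in memory
  const int W = 2, D = 4;  // CellFormat<Width = 2, Depth = 4>
  for (int wr = 0; wr < W; ++wr) {
    for (int wl = 0; wl < W; ++wl) {
      int depth_consecutive = 0, width_consecutive = 0;
      for (int d = 0; d < D; ++d) {
        // Consecutive memory along depth: Aa+Bb+Cc+Dd, Ea+Fb+Gc+Hd, ...
        depth_consecutive += L[d + D * wl] * R[d + D * wr];
        // Consecutive memory along width: Aa+Cc+Ee+Gg, Ba+Dc+Fe+Hg, ...
        width_consecutive += L[wl + W * d] * R[wr + W * d];
      }
      std::printf("entry(%d,%d): depth-consecutive %d, width-consecutive %d\n",
                  wl, wr, depth_consecutive, width_consecutive);
    }
  }
  return 0;
}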
On Wed, Jul 6, 2016 at 11:39 AM, Erich Plondke <eplo...@gmail.com> wrote:
> What I expect from https://github.com/google/gemmlowp/blob/master/internal/kernel.h#L46: an LHS of MxK would be M rows, each of length K. An RHS would be N columns, each of length K. K is the depth direction, the direction of accumulation, and the direction in which we take the "dot product".

That is exactly what I have in mind, too.

> But if instead I flip the LHS and RHS around, opposite to high-school matrix multiplication, and we take the dot product of LHS columns with RHS rows, then it makes sense; the LHS is K columns and the RHS is K rows.

I don't understand what "flipping around" is needed to make this agree with the above. The matrix-multiplication picture you linked matches what I have in mind, too. Moreover, the two viewpoints you describe:

"the LHS is K columns"

and, earlier,

"an LHS of MxK would be M rows, each of length K"

are compatible: if the LHS is "M rows, each of length K", then it is K columns.

Thus, so far, all of what you wrote seems perfectly compatible with what I have in mind (except that I do not understand what "flipping around" would be needed).
> I have a new kernel working (passing tests), but I'm concerned that I'm doing something wrong, because I have to use WidthMajor to get cells in a format where consecutive elements in memory lie along the dot-product dimension.

Ah, thanks for the explanation; I think I understand now what the confusion is about.

First, if your kernel is passing tests, then it's good. The only "contract" that your kernel must satisfy is that it produces correct results; since a kernel is a piece of code computing a narrow/deep (rectangular*rectangular) matrix product, the contract is only concerned with the correctness of that matrix product's result.

That means it is entirely up to each kernel how the matrix product is computed. One way to compute a matrix product is to compute each entry of the result as a dot product, as you explain above.

There are other ways of computing matrix products; the existing kernel that you linked to is *not* working by computing dot products in the destination matrix. Instead, as you found, it takes a vector of N entries of the LHS and multiplies it by one individual (scalar) entry of the RHS, using ARM NEON multiply-by-scalar instructions, like this:

vmlal.u16 q4, d2, d0[0]

It accumulates that (as you found) into N accumulators.

A high-level picture of what this does: it splits the matrix*matrix product into individual matrix*vector products (seeing the RHS columns as separate vectors), and then splits each of those into vector*scalar products (seeing LHS columns as vectors and RHS vector entries as scalars).

But that is just one approach to writing a kernel, and if you prefer a dot-product approach, that is absolutely OK too.
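A scalar sketch of that decomposition follows; the function name and the column-major layouts are chosen for illustration, not taken from gemmlowp. The inner update scales a whole LHS column by one scalar RHS entry, which is exactly the shape that maps onto NEON's multiply-by-scalar instructions:

#include <cstdint>

// C(rows x cols) += A(rows x depth) * B(depth x cols), all column-major.
// Instead of a dot product per destination entry, each step multiplies an
// entire LHS column by one scalar RHS entry and accumulates it into a
// destination column -- the pattern behind "vmlal.u16 q4, d2, d0[0]".
void GemmViaColumnTimesScalar(const std::int32_t* lhs, const std::int32_t* rhs,
                              std::int32_t* dst, int rows, int depth,
                              int cols) {
  for (int c = 0; c < cols; ++c) {
    for (int k = 0; k < depth; ++k) {
      const std::int32_t rhs_scalar = rhs[k + depth * c];  // one RHS entry
      for (int r = 0; r < rows; ++r) {  // whole LHS column at once
        dst[r + rows * c] += lhs[r + rows * k] * rhs_scalar;
      }
    }
  }
}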
In fact, gemmlowp's kernel interface, where each kernel provides the entire inner loop, is designed to allow such differences of design between different kernels. It is a departure from the more rigid kernel interface in Eigen, which pretty much dictated one single kernel design.
We found that different architectures, data types, etc., often require different kernel designs to perform optimally, so it would be very interesting to hear what you found on your particular hardware that led to your kernel's design choices!
> In fact, gemmlowp's kernel interface, where each kernel provides the entire inner loop, is designed to allow such differences of design between different kernels.

Yes, that's very nice for us.

I noticed that we don't have the option of row- vs. column-major output.
Does the outside code (TensorFlow) tend to favor column-major order?