Hi Amit,
OK, it took me a while to go through your results. When reporting times,
it's usually better to divide by 1e9 to get seconds rather than
nanoseconds (so many digits to parse...).
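
To illustrate the point about units, here is a minimal sketch in plain Java. The class name TimingExample, the helper toSeconds, and the square-root loop are made up for illustration; they stand in for whatever your benchmark actually measures.

```java
public class TimingExample {
    // Convert a System.nanoTime() difference to seconds.
    public static double toSeconds(long nanos) {
        return nanos / 1e9;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();

        // Placeholder workload (hypothetical); put the code to be timed here.
        double sum = 0.0;
        for (int i = 0; i < 10_000_000; i++) {
            sum += Math.sqrt(i);
        }

        long elapsed = System.nanoTime() - start;
        // Printing sum keeps the loop from being optimized away entirely.
        System.out.printf("sum = %.3e, elapsed = %.3f s%n", sum, toSeconds(elapsed));
    }
}
```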
The numbers you report are correct, but the reason dgemv seems faster is
that it only does matrix-vector multiplication, even if you feed it
two matrices. There are no checks for this on the native BLAS side. So
dgemv only does a fraction of the work of dgemm.
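
To make the difference in work concrete, here is a rough sketch in plain Java of what the two operations compute on column-major arrays. This is not the BLAS implementation and not the jblas API: the class GemvVsGemm and its simplified gemv/gemm signatures are assumptions for illustration (the real routines also take scaling factors, transpose flags, leading dimensions and strides). The point is only that the matrix-vector loop reads the first n entries of its second argument, i.e. the first column if you hand it a matrix, and so does roughly 1/k of the work of a product with a k-column matrix.

```java
import java.util.Arrays;

public class GemvVsGemm {
    // y = A * x, with A stored column-major as an m-by-n matrix.
    // Only x[0..n-1] is read, even if x is longer (e.g. a whole matrix).
    public static double[] gemv(int m, int n, double[] a, double[] x) {
        double[] y = new double[m];
        for (int j = 0; j < n; j++) {
            for (int i = 0; i < m; i++) {
                y[i] += a[i + j * m] * x[j];
            }
        }
        return y;
    }

    // C = A * B, with A m-by-n and B n-by-k, both column-major.
    // This repeats the gemv work once per column of B.
    public static double[] gemm(int m, int n, int k, double[] a, double[] b) {
        double[] c = new double[m * k];
        for (int col = 0; col < k; col++) {
            for (int j = 0; j < n; j++) {
                double bjc = b[j + col * n];
                for (int i = 0; i < m; i++) {
                    c[i + col * m] += a[i + j * m] * bjc;
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3, 4}; // A = [[1, 3], [2, 4]]
        double[] b = {5, 6, 7, 8}; // B = [[5, 7], [6, 8]]

        // Matrix-vector: only the first column of B, (5, 6), is used.
        System.out.println(Arrays.toString(gemv(2, 2, a, b)));    // [23.0, 34.0]
        // Matrix-matrix: the full product, column by column.
        System.out.println(Arrays.toString(gemm(2, 2, 2, a, b))); // [23.0, 34.0, 31.0, 46.0]
    }
}
```

The matrix-vector result equals the first column of the full product, which is why the dgemv timings look so good when compared against dgemm.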
If you take that into account, you get the following picture:
Cases 1, 2 and 4: matrix-matrix multiplication
mmul uses dgemm internally, so the two are equally fast. dgemv does not
compute a matrix-matrix product, so its timing is not comparable.
Case 3: matrix-vector multiplication
mmul recognizes that this is actually a matrix-vector multiplication and
switches to Java code, which is faster than both dgemv and dgemm because
those copy the data to native memory and back.
I hope this clears it up. The main mistake was assuming that dgemv
would also do matrix-matrix multiplication.
Best,
-M