Precision of dgetrf_

27 views
Skip to first unread message

Aleksei Nikiforov

unread,
Jul 10, 2023, 11:17:32 AM7/10/23
to openbl...@googlegroups.com
Good day.

I've noticed that dgetrf_ may produce different results on different
platforms.

Here's somewhat minimized example:

#include <stdio.h>

void dgetrf_(int *m, int *n, double *a, int *lda, int *ipiv, int *info);

int main(int argc, char **argv)
{
int m = 5;
int n = 5;
int lda = 5;

double a[] = {
0.72857546562820341,
-0.78281609716933998,
0.86046155246780609,
-1.0006287963975689,
1.5611716761308798,
0.41911784786867096,
-0.6454528622624528,
0.41128486717490204,
-0.94252011314907469,
1.5177029668084923,
0.3450071123222726,
-0.42495485776980968,
0.50896485191514396,
-0.55082163903276304,
0.84641169749513157,
0.81046059079204757,
-0.10383394280963526,
-0.051601994141029284,
0.017058483822232351,
0.72535356734870426,
-0.51269675072665255,
-0.098946294965133214,
-0.75082005371623339,
-0.46603827004164611,
0.48025114546825054,
-0.61770032879577774,
0.68022029332664247,
0.21512495606941925,
-0.76615979420838609,
-0.81191869547374351,
-0.20692033461495593,
-0.22374837296402705,
-0.012269267666604688,
0.93102477795925398,
-0.14859077429722012,
0.48403391266183848,
0.46697504485264119,
0.86276744808951711,
-0.34663362784875729,
0.11161289032174304,
-0.24023098337965051,
-0.27876090035355339,
-0.18123863893499198,
0.12825184063873868,
-0.16530575508712297,
0.24255849205301655,
0.3254377222939866,
0.086062864141845608,
-0.86081670246286279,
0.1582069164833744,
};

int ipiv[] = {
682,
607260352,
682,
607541760,
0,
32,
0,
48,
682,
244581389,
0,
0,
0,
1,
0,
607588400,
0,
160,
0,
689,
1023,
-1836775248,
1023,
-1471162984,
1023,
-1704443792,
1023,
-1704427472,
1023,
-1836596224,
1023,
-1835840160,
1023,
-1835839280,
1023,
-1471162984,
1023,
-1471162984,
1007,
1946288288,
1007,
1946288528,
1007,
1946269104,
1023,
-1471162984,
1023,
-1471162984,
1023,
-1471162984,
};

int info[] = {
682,
636589680,
682,
606873584,
0,
80,
0,
81,
682,
256022213,
682,
635388704,
682,
266471653,
1,
4,
682,
607371552,
682,
607635712,
682,
244616901,
682,
607570288,
0,
160,
0,
33,
682,
607664976,
682,
607861008,
1869766506,
1987051632,
0,
33,
682,
607548752,
0,
0,
0,
0,
0,
97,
1023,
-1881495688,
20,
0,
682,
607542000,
};

dgetrf_(&m, &n, a, &lda, ipiv, info);

for (int i = 0; i < 50; ++i)
{
printf("a[%d] = %e\n", i, a[i]);
}

return 0;
}

I compile it with "gcc blas.c -O0 -g -ggdb -lopenblas -o blas" and run.
On linux on x86_64 a[24] = 6.938894e-17, on linux on s390x a[24] =
0.000000e+00.

It looks like on x86_64 there are separate vector multiply and vector
add instructions used:

https://github.com/xianyi/OpenBLAS/blob/develop/kernel/x86_64/dgemv_n_4.c#L155-L158

On s390x similar code uses combined vector multiply and add instructions:

https://github.com/xianyi/OpenBLAS/blob/develop/kernel/zarch/dgemv_n_4.c#L318-L325

Due to that, precision is higher, but results are different.

I've also noticed that in other parts of x86_64 code, like here,
combined instructions are used as well:

https://github.com/xianyi/OpenBLAS/blob/develop/kernel/x86_64/dgemv_n_microk_haswell-4.c#L62-L63

So I have a question: are both results considered within acceptable
precision or is one of results invalid?

Kind regards,
Aleksei Nikiforov

Nima Sahraneshin

unread,
Jul 10, 2023, 12:01:35 PM7/10/23
to Aleksei Nikiforov, openbl...@googlegroups.com
Hi,

I think here we have a memory problem. Are you passing the memory location correctly? such a value for the ipiv vector and info means that without doing LU decomposition the operation has been ended.

Best regards,
Nima

--
You received this message because you are subscribed to the Google Groups "OpenBLAS-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openblas-dev...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openblas-dev/4151c9a0-76d6-99c9-0de6-76bfdc12cb5a%40linux.ibm.com.

Nima Sahraneshin

unread,
Jul 10, 2023, 12:06:34 PM7/10/23
to Aleksei Nikiforov, openbl...@googlegroups.com
Also your matrix size is 5x5 but the ipiv size is something like 60. It should be 5, the number of rows.

Aleksei Nikiforov

unread,
Jul 11, 2023, 4:28:44 AM7/11/23
to OpenBLAS-dev
Good day,

you're correct, the example can be minimized more, like this:

    };

    int ipiv[] = {
        682,
        607260352,
        682,
        607541760,
        0,
    };

    int info[] = {
        682,

    };

     dgetrf_(&m, &n, a, &lda, ipiv, info);

     for (int i = 0; i < 25; ++i)

     {
        printf("a[%d] = %e\n", i, a[i]);
     }

    return 0;
}

Still, result stays same, a[24] = 6.938894e-17 on x86_64 and a[24] = 0.000000e+00 on s390x.

Kind regards,
Aleksei Nikiforov


понедельник, 10 июля 2023 г. в 18:06:34 UTC+2, unix...@gmail.com:

Nima Sahraneshin

unread,
Jul 11, 2023, 4:37:01 AM7/11/23
to Aleksei Nikiforov, OpenBLAS-dev
Hi Aleksei,

Could you please share with me the test code?
It does not make sense to me to select row #682 from a 5x5 matrix for pivoting. Also the info is 682 so, it means that the selected pivot was 0.  I think here we are working with some uninitialized memory or wrong memory address.

Best regards,
Nima

 

Aleksei Nikiforov

unread,
Jul 11, 2023, 5:13:28 AM7/11/23
to OpenBLAS-dev
Good day,

sure, but the real example is really big.

I'm running into it when running test "test_forward_mode_AD_linalg_det_singular_cpu_float64" from file "test/test_ops_fwd_gradients.py" from pytorch (https://github.com/pytorch/pytorch/).

When inputs are following:
(Pdb) print(inputs)
(tensor([[[ 0.7286, -0.7828,  0.8605, -1.0006,  1.5612],
        [ 0.4191, -0.6455,  0.4113, -0.9425,  1.5177],
        [ 0.3450, -0.4250,  0.5090, -0.5508,  0.8464],
        [ 0.8105, -0.1038, -0.0516,  0.0171,  0.7254],
        [-0.5127, -0.0989, -0.7508, -0.4660,  0.4803]],

       [[-0.6177,  0.6802,  0.2151, -0.7662, -0.8119],
        [-0.2069, -0.2237, -0.0123,  0.9310, -0.1486],
        [ 0.4840,  0.4670,  0.8628, -0.3466,  0.1116],
        [-0.2402, -0.2788, -0.1812,  0.1283, -0.1653],
        [ 0.2426,  0.3254,  0.0861, -0.8608,  0.1582]]], dtype=torch.float64,
      requires_grad=True),) 

And expected results are following:

(Pdb) print(numerical_vJu)
[[tensor([-0.0134, -0.0296], dtype=torch.float64)]]

On s390x I'm getting different results:
 
(Pdb) print(analytical_vJu)
((tensor([-0.0000, -0.0296], dtype=torch.float64),),)

Here's code of function:


I've tracked it down the call stack and difference started with call to openblas function dgetrf_ I've used as base for example I've posted in initial post. Some memory might be uninitialized before call, but to me it didn't look like it's the source of current issue.

Kind regards,
Aleksei Nikiforov

вторник, 11 июля 2023 г. в 10:37:01 UTC+2, unix...@gmail.com:

Nima Sahraneshin

unread,
Aug 10, 2023, 5:30:02 AM8/10/23
to Aleksei Nikiforov, OpenBLAS-dev
Dear Aleksei,

Did you find the problem? 

Best regards, 
Nima

Aleksei Nikiforov

unread,
Aug 14, 2023, 6:19:42 AM8/14/23
to OpenBLAS-dev
Hi,

yes, as I have written in original post, the problem is introduced by using FMA instructions in some cases, and not using them in other cases. I'm asking if results are considered valid in both cases, or if only one of results is correct.

Kind regards,
Aleksei Nikiforov

четверг, 10 августа 2023 г. в 11:30:02 UTC+2, unix...@gmail.com:
Reply all
Reply to author
Forward
0 new messages