Precision of dgetrf

Aleksei Nikiforov

unread,

Jul 10, 2023, 11:17:32 AM7/10/23

to openbl...@googlegroups.com

Good day.

I've noticed that dgetrf_ may produce different results on different
platforms.

Here's somewhat minimized example:

#include <stdio.h>

void dgetrf_(int *m, int *n, double *a, int *lda, int *ipiv, int *info);

int main(int argc, char **argv)
{
int m = 5;
int n = 5;
int lda = 5;

double a[] = {
0.72857546562820341,
-0.78281609716933998,
0.86046155246780609,
-1.0006287963975689,
1.5611716761308798,
0.41911784786867096,
-0.6454528622624528,
0.41128486717490204,
-0.94252011314907469,
1.5177029668084923,
0.3450071123222726,
-0.42495485776980968,
0.50896485191514396,
-0.55082163903276304,
0.84641169749513157,
0.81046059079204757,
-0.10383394280963526,
-0.051601994141029284,
0.017058483822232351,
0.72535356734870426,
-0.51269675072665255,
-0.098946294965133214,
-0.75082005371623339,
-0.46603827004164611,
0.48025114546825054,
-0.61770032879577774,
0.68022029332664247,
0.21512495606941925,
-0.76615979420838609,
-0.81191869547374351,
-0.20692033461495593,
-0.22374837296402705,
-0.012269267666604688,
0.93102477795925398,
-0.14859077429722012,
0.48403391266183848,
0.46697504485264119,
0.86276744808951711,
-0.34663362784875729,
0.11161289032174304,
-0.24023098337965051,
-0.27876090035355339,
-0.18123863893499198,
0.12825184063873868,
-0.16530575508712297,
0.24255849205301655,
0.3254377222939866,
0.086062864141845608,
-0.86081670246286279,
0.1582069164833744,
};

int ipiv[] = {
682,
607260352,
682,
607541760,
0,
32,
0,
48,
682,
244581389,
0,
0,
0,
1,
0,
607588400,
0,
160,
0,
689,
1023,
-1836775248,
1023,
-1471162984,
1023,
-1704443792,
1023,
-1704427472,
1023,
-1836596224,
1023,
-1835840160,
1023,
-1835839280,
1023,
-1471162984,
1023,
-1471162984,
1007,
1946288288,
1007,
1946288528,
1007,
1946269104,
1023,
-1471162984,
1023,
-1471162984,
1023,
-1471162984,
};

int info[] = {
682,
636589680,
682,
606873584,
0,
80,
0,
81,
682,
256022213,
682,
635388704,
682,
266471653,
1,
4,
682,
607371552,
682,
607635712,
682,
244616901,
682,
607570288,
0,
160,
0,
33,
682,
607664976,
682,
607861008,
1869766506,
1987051632,
0,
33,
682,
607548752,
0,
0,
0,
0,
0,
97,
1023,
-1881495688,
20,
0,
682,
607542000,
};

dgetrf_(&m, &n, a, &lda, ipiv, info);

for (int i = 0; i < 50; ++i)
{
printf("a[%d] = %e\n", i, a[i]);
}

return 0;
}

I compile it with "gcc blas.c -O0 -g -ggdb -lopenblas -o blas" and run.
On linux on x86_64 a[24] = 6.938894e-17, on linux on s390x a[24] =
0.000000e+00.

It looks like on x86_64 there are separate vector multiply and vector
add instructions used:

https://github.com/xianyi/OpenBLAS/blob/develop/kernel/x86_64/dgemv_n_4.c#L155-L158

On s390x similar code uses combined vector multiply and add instructions:

https://github.com/xianyi/OpenBLAS/blob/develop/kernel/zarch/dgemv_n_4.c#L318-L325

Due to that, precision is higher, but results are different.

I've also noticed that in other parts of x86_64 code, like here,
combined instructions are used as well:

https://github.com/xianyi/OpenBLAS/blob/develop/kernel/x86_64/dgemv_n_microk_haswell-4.c#L62-L63

So I have a question: are both results considered within acceptable
precision or is one of results invalid?

Kind regards,
Aleksei Nikiforov

Nima Sahraneshin

unread,

Jul 10, 2023, 12:01:35 PM7/10/23

to Aleksei Nikiforov, openbl...@googlegroups.com

Hi,

I think here we have a memory problem. Are you passing the memory location correctly? such a value for the ipiv vector and info means that without doing LU decomposition the operation has been ended.

Best regards,

Nima

--
You received this message because you are subscribed to the Google Groups "OpenBLAS-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openblas-dev...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openblas-dev/4151c9a0-76d6-99c9-0de6-76bfdc12cb5a%40linux.ibm.com.

Nima Sahraneshin

unread,

Jul 10, 2023, 12:06:34 PM7/10/23

to Aleksei Nikiforov, openbl...@googlegroups.com

Also your matrix size is 5x5 but the ipiv size is something like 60. It should be 5, the number of rows.

Aleksei Nikiforov

unread,

Jul 11, 2023, 4:28:44 AM7/11/23

to OpenBLAS-dev

Good day,

you're correct, the example can be minimized more, like this:

};

int ipiv[] = {
682,
607260352,
682,
607541760,
0,

};

int info[] = {
682,

};

dgetrf_(&m, &n, a, &lda, ipiv, info);

for (int i = 0; i < 25; ++i)

{
printf("a[%d] = %e\n", i, a[i]);
}

return 0;
}

Still, result stays same, a[24] = 6.938894e-17 on x86_64 and a[24] = 0.000000e+00 on s390x.

Kind regards,

Aleksei Nikiforov

понедельник, 10 июля 2023 г. в 18:06:34 UTC+2, unix...@gmail.com:

Nima Sahraneshin

unread,

Jul 11, 2023, 4:37:01 AM7/11/23

to Aleksei Nikiforov, OpenBLAS-dev

Hi Aleksei,

Could you please share with me the test code?

It does not make sense to me to select row #682 from a 5x5 matrix for pivoting. Also the info is 682 so, it means that the selected pivot was 0. I think here we are working with some uninitialized memory or wrong memory address.

Best regards,

Nima

To view this discussion on the web visit https://groups.google.com/d/msgid/openblas-dev/5b42a309-f066-47ed-9692-2ea875c6d6cdn%40googlegroups.com.

Aleksei Nikiforov

unread,

Jul 11, 2023, 5:13:28 AM7/11/23

to OpenBLAS-dev

Good day,

sure, but the real example is really big.

I'm running into it when running test "test_forward_mode_AD_linalg_det_singular_cpu_float64" from file "test/test_ops_fwd_gradients.py" from pytorch (https://github.com/pytorch/pytorch/).

When inputs are following:

(Pdb) print(inputs)
(tensor([[[ 0.7286, -0.7828, 0.8605, -1.0006, 1.5612],
        [ 0.4191, -0.6455, 0.4113, -0.9425, 1.5177],
        [ 0.3450, -0.4250, 0.5090, -0.5508, 0.8464],
        [ 0.8105, -0.1038, -0.0516, 0.0171, 0.7254],
        [-0.5127, -0.0989, -0.7508, -0.4660, 0.4803]],

       [[-0.6177, 0.6802, 0.2151, -0.7662, -0.8119],
        [-0.2069, -0.2237, -0.0123, 0.9310, -0.1486],
        [ 0.4840, 0.4670, 0.8628, -0.3466, 0.1116],
        [-0.2402, -0.2788, -0.1812, 0.1283, -0.1653],
        [ 0.2426, 0.3254, 0.0861, -0.8608, 0.1582]]], dtype=torch.float64,
      requires_grad=True),)

And expected results are following:

(Pdb) print(numerical_vJu)
[[tensor([-0.0134, -0.0296], dtype=torch.float64)]]

On s390x I'm getting different results:

(Pdb) print(analytical_vJu)
((tensor([-0.0000, -0.0296], dtype=torch.float64),),)

Here's code of function:

https://github.com/pytorch/pytorch/blob/main/torch/autograd/gradcheck.py#L1389-L1421

I've tracked it down the call stack and difference started with call to openblas function dgetrf_ I've used as base for example I've posted in initial post. Some memory might be uninitialized before call, but to me it didn't look like it's the source of current issue.

Kind regards,

Aleksei Nikiforov

вторник, 11 июля 2023 г. в 10:37:01 UTC+2, unix...@gmail.com:

Nima Sahraneshin

unread,

Aug 10, 2023, 5:30:02 AM8/10/23

to Aleksei Nikiforov, OpenBLAS-dev

Dear Aleksei,

Did you find the problem?

Best regards,

Nima

To view this discussion on the web visit https://groups.google.com/d/msgid/openblas-dev/ad916afd-edfd-40ff-9ed8-1005aafcec01n%40googlegroups.com.

Aleksei Nikiforov

unread,

Aug 14, 2023, 6:19:42 AM8/14/23

to OpenBLAS-dev

Hi,

yes, as I have written in original post, the problem is introduced by using FMA instructions in some cases, and not using them in other cases. I'm asking if results are considered valid in both cases, or if only one of results is correct.

Kind regards,

Aleksei Nikiforov

четверг, 10 августа 2023 г. в 11:30:02 UTC+2, unix...@gmail.com:

Reply all

Reply to author

Forward

Precision of dgetrf_

Aleksei Nikiforov

Nima Sahraneshin

Nima Sahraneshin

Aleksei Nikiforov

Nima Sahraneshin

Aleksei Nikiforov

Nima Sahraneshin

Aleksei Nikiforov