Getting error from openblas: DGEMM parameter 10 had an illegal value

Yusuf Işık

Jul 3, 2015, 11:52:12 AM
to tor...@googlegroups.com
Hi,

I am getting an OpenBLAS-related error during the backward step of my model. The error message is:

"On entry to SGEMM parameter number 10 had an illegal value"

or

"On entry to DGEMM parameter number 10 had an illegal value".

I have prepared a toy example to illustrate the problem and attached it to the message. After the statement:

mlp:backward(x,gc)

I get the error:

"On entry to DGEMM parameter number 10 had an illegal value".

Any help would be appreciated. Thanks,

Yusuf
toy_example.lua

soumith

Jul 3, 2015, 12:22:39 PM
to torch7 on behalf of Yusuf Işık
Thanks for the isolated example.

I can reproduce this on OS X, but not on Linux, where the example runs without errors.
I'll look into this.


Yusuf Işık

Jul 3, 2015, 3:25:23 PM
to tor...@googlegroups.com
Thanks for your help. I am getting this error on a machine with Ubuntu 12.04 installed.

Yusuf

soumith

Jul 9, 2015, 3:19:01 AM
to torch7 on behalf of Yusuf Işık
Can you try using OpenBLAS instead of your system BLAS? I am still having trouble reproducing it on Linux, but I am still looking into it.


Yusuf Işık

Jul 9, 2015, 9:53:42 AM
to tor...@googlegroups.com
Hi,

I have made a fresh install of Torch on a new machine running Ubuntu 14.04, using the commands from the official site. To be sure, I also ran ldd on libTH.so, and the output shows that it links against the libopenblas.so shared library. I still get the same error, although after the error message the script also outputs the result.

I also tried the CUDA back-end, and this time, I got the error:

"On entry to SGEMM parameter number 10 had an illegal value"

and then it gives a Segmentation fault error and quits without the result.

Have you tried it with the CUDA back-end instead of OpenBLAS? Thanks,

Yusuf

Francisco Vitor Suzano Massa

Jul 9, 2015, 10:33:16 AM
to tor...@googlegroups.com
I have the same problem on Ubuntu 14.04 with OpenBLAS (from the July 2014 developer branch).
The issue comes from nn.Linear receiving a zero-strided gradOutput tensor in its backward pass. The issue with addmm is reported here: https://github.com/torch/torch7/issues/58
In fact, the Mean module outputs a zero-strided tensor in its backward pass. This reduced code snippet reproduces the problem on my machine:

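require 'nn'

-- Linear followed by Mean over dimension 1
m1 = nn.Sequential()
m1:add(nn.Linear(5,3)):add(nn.Mean(1))

x = torch.rand(10,5)

y = m1:forward(x)
-- the backward pass triggers the GEMM error
m1:backward(x,y)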


The same thing happens on CUDA.
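
As background on the error message itself: in the reference BLAS interface, the tenth argument of SGEMM/DGEMM is LDB, the leading dimension of the second matrix, and the routine rejects it when it is smaller than the number of rows of that matrix as stored. A leading dimension derived from a zero stride would fail that check. The sketch below mirrors the check in plain Lua; the function name is hypothetical, and how Torch maps tensor strides onto LDB is not shown here.

```lua
-- Sketch of the check reference BLAS GEMM applies to LDB (its 10th argument).
-- gemm_ldb_is_legal is a hypothetical name used only for illustration.
--   transb: "N" if B is used as stored, anything else if it is transposed
--   n:      number of columns of the result
--   k:      shared inner dimension of the product
--   ldb:    leading dimension of B (column-major storage)
local function gemm_ldb_is_legal(transb, n, k, ldb)
   -- B is stored with k rows when not transposed, and n rows otherwise.
   local nrowb = (transb == "N") and k or n
   return ldb >= math.max(1, nrowb)
end

print(gemm_ldb_is_legal("N", 3, 5, 5)) -- true
print(gemm_ldb_is_legal("N", 3, 5, 0)) -- false: a zero leading dimension is rejected
```

When this check fails, reference BLAS reports the position of the offending argument, which is where the "parameter number 10" in the message comes from.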

Francisco Vitor Suzano Massa

Jul 9, 2015, 10:49:41 AM
to tor...@googlegroups.com
A workaround for the moment is to add an nn.Copy with the forceCopy option after the nn.Linear, so that it runs before the nn.Linear in the backward pass. Like this:
m1 = nn.Sequential()
m1:add(nn.Linear(5,3)):add(nn.Copy(nil,nil,true)):add(nn.Mean(1))

The reason it didn't happen in the second branch of your code (which uses Sum) is that you had a non-linearity between the Linear and the Sum, which makes the gradOutput that reaches the Linear contiguous.
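
To illustrate what "zero-strided" and "contiguous" mean here, the following is a minimal sketch, assuming the torch package from Torch7 is installed. It uses expand as a stand-in for a gradient that repeats one row without copying it; it does not call nn.Mean itself, and the variable names are only illustrative.

```lua
require 'torch'

-- Expanding a 1x3 tensor to 10x3 repeats the single row without copying it,
-- so the stride along the expanded dimension is 0.
local row = torch.rand(1, 3)
local expanded = row:expand(10, 3)
print(expanded:stride(1))       -- 0
print(expanded:isContiguous())  -- false

-- contiguous() returns a real copy when the tensor is not contiguous,
-- which has ordinary row-major strides again.
local copied = expanded:contiguous()
print(copied:stride(1))         -- 3
print(copied:isContiguous())    -- true
```

A forced copy or an intermediate module that writes into a fresh tensor has the same effect as the contiguous() call above: the tensor handed on has regular strides.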

Yusuf Işık

Jul 10, 2015, 4:19:52 AM
to tor...@googlegroups.com
I have tried the workaround and it works. Thanks, 

Best regards,

Yusuf