Illegal memory access during BasePrefetchingDataLayer<>::Forward_gpu()


Nick Carlevaris-Bianco

Oct 13, 2014, 9:53:51 AM
to caffe...@googlegroups.com
I am getting a strange error somewhat randomly during training. Often the network will train successfully for many iterations, but then I get the following error:

F1013 09:33:06.971670  4890 math_functions.cpp:91] Check failed: error == cudaSuccess (77 vs. 0)  an illegal memory access was encountered
*** Check failure stack trace: ***
    @     0x7ffff2577b9d  google::LogMessage::Fail()
    @     0x7ffff2579c9f  google::LogMessage::SendToLog()
    @     0x7ffff257778c  google::LogMessage::Flush()
    @     0x7ffff257a53d  google::LogMessageFatal::~LogMessageFatal()
    @           0x4f3b98  caffe::caffe_copy<>()
    @           0x548a4e  caffe::BasePrefetchingDataLayer<>::Forward_gpu()
    @           0x537c3f  caffe::Net<>::ForwardFromTo()
    @           0x537f7f  caffe::Net<>::ForwardPrefilled()
    @           0x5144de  caffe::Solver<>::Solve()
    @           0x4292e9  train()
    @           0x422ddb  main
    @     0x7fffee26f76d  (unknown)
    @           0x4269ed  (unknown)


The model is small, using only a few hundred MB on a 4GB card, so the card is not running out of memory. Has anyone encountered an error similar to this?

Xijing Dai

Dec 9, 2014, 11:29:25 AM
to caffe...@googlegroups.com
Did you find the solution?

Cheers

刘盛中

Jan 4, 2015, 10:08:40 PM
to caffe...@googlegroups.com
I get an error like yours. Did you solve it?
I am training models on Ubuntu 12.04 with a GTX 980 and CUDA 6.5. The error comes up randomly during training.
Here is the error information:

F0105 11:01:44.109680 13344 math_functions.cu:81] Check failed: error == cudaSuccess (77 vs. 0)  an illegal memory access was encountered
*** Check failure stack trace: ***
    @     0x7f56f3b68b7d  google::LogMessage::Fail()
    @     0x7f56f3b6ac7f  google::LogMessage::SendToLog()
    @     0x7f56f3b6876c  google::LogMessage::Flush()
    @     0x7f56f3b6b51d  google::LogMessageFatal::~LogMessageFatal()
    @           0x4a9102  caffe::caffe_gpu_memcpy()
    @           0x51e92a  caffe::SyncedMemory::gpu_data()
    @           0x475161  caffe::Blob<>::gpu_data()
    @           0x52e3c5  caffe::ConvolutionLayer<>::Forward_gpu()
    @           0x465b9f  caffe::Net<>::ForwardFromTo()
    @           0x465edf  caffe::Net<>::ForwardPrefilled()
    @           0x487b96  caffe::Solver<>::Test()
    @           0x488746  caffe::Solver<>::TestAll()
    @           0x48893f  caffe::Solver<>::Solve()
    @           0x424cff  train()
    @           0x41eccb  main
    @     0x7f56f104676d  (unknown)
    @           0x42228d  (unknown)
Aborted (core dumped)




On Monday, October 13, 2014 at 9:53:51 PM UTC+8, Nick Carlevaris-Bianco wrote:

Huaishuo

Jan 7, 2015, 1:48:11 AM
to caffe...@googlegroups.com
Hi, I get an error like yours. Did you solve it?


On Wednesday, December 10, 2014 at 12:29:25 AM UTC+8, Xijing Dai wrote:

Pavel Machalek

Feb 17, 2015, 8:02:18 PM
to caffe...@googlegroups.com
I get the same error:

I0218 00:38:05.609284  1490 solver.cpp:224] Learning Rate Policy: step
I0218 00:38:05.609299  1490 solver.cpp:267] Iteration 0, Testing net (#0)
I0218 00:55:57.192952  1490 solver.cpp:318]     Test net output #0: loss = 299034 (* 1 = 299034 loss)
F0218 00:55:57.284987  1490 im2col.cu:59] Check failed: error == cudaSuccess (77 vs. 0)  an illegal memory access was encountered
*** Check failure stack trace: ***
    @     0x7f2fbf6c3b8d  google::LogMessage::Fail()
    @     0x7f2fbf6c5c8f  google::LogMessage::SendToLog()
    @     0x7f2fbf6c377c  google::LogMessage::Flush()
    @     0x7f2fbf6c652d  google::LogMessageFatal::~LogMessageFatal()
    @           0x56b4c9  caffe::im2col_gpu<>()
    @           0x563bb9  caffe::ConvolutionLayer<>::Forward_gpu()
    @           0x52188f  caffe::Net<>::ForwardFromTo()
    @           0x521c1f  caffe::Net<>::ForwardPrefilled()
    @           0x53c400  caffe::Solver<>::Step()
    @           0x53cea7  caffe::Solver<>::Solve()
    @           0x4172d8  train()
    @           0x41175b  main
    @     0x7f2fbc95bec5  (unknown)
    @           0x415a47  (unknown)
Aborted (core dumped)

zhen zhou

Mar 26, 2015, 11:09:40 AM
to caffe...@googlegroups.com
I got the same error in im2col.cu. Has anyone found a solution?

Xiang E. Xiang

Apr 9, 2015, 8:06:57 PM
to caffe...@googlegroups.com
Same for me! Has nobody solved it or found a workaround?

Ting Lee

May 5, 2015, 5:21:09 AM
to caffe...@googlegroups.com
Same for me!

On Monday, October 13, 2014 at 3:53:51 PM UTC+2, Nick Carlevaris-Bianco wrote:

Yoann

May 5, 2015, 5:41:58 AM
to caffe...@googlegroups.com
Same for me. Do you know how to solve this?

F0505 11:34:24.536881  2253 math_functions.cpp:91] Check failed: error == cudaSuccess (77 vs. 0)  an illegal memory access was encountered
*** Check failure stack trace: ***
    @     0x7f5d50d40b7d  google::LogMessage::Fail()
    @     0x7f5d50d42c7f  google::LogMessage::SendToLog()
    @     0x7f5d50d4076c  google::LogMessage::Flush()
    @     0x7f5d50d4351d  google::LogMessageFatal::~LogMessageFatal()
    @           0x494cc8  caffe::caffe_copy<>()
    @           0x4e98ee  caffe::BasePrefetchingDataLayer<>::Forward_gpu()
    @           0x47d4df  caffe::Net<>::ForwardFromTo()
    @           0x47d81f  caffe::Net<>::ForwardPrefilled()
    @           0x4709ba  caffe::Solver<>::Solve()
    @           0x424ed9  train()
    @           0x41ebdb  main
    @     0x7f5d4cd6476d  (unknown)
    @           0x4225d9  (unknown)

Fatemeh Saleh

Aug 10, 2015, 10:17:58 PM
to Caffe Users
Hi,
I got the same error after 640 iterations. I finally figured out that the problem was the number of outputs in the last layer of the network. I was using PASCAL Context with 33 labels, but because of the background class the number of outputs should be 34.
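
For anyone hitting the same thing, here is a minimal sketch of what that fix looks like in the train prototxt. It assumes a SoftmaxWithLoss setup; the layer/blob names and kernel size are only illustrative, not taken from my actual model:

layer {
  name: "score"                 # hypothetical final scoring layer
  type: "Convolution"
  bottom: "fc7"                 # whatever your last feature blob is called
  top: "score"
  convolution_param {
    num_output: 34              # 33 PASCAL Context labels + 1 background class
    kernel_size: 1
  }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "score"
  bottom: "label"               # every label value must lie in [0, num_output - 1]
  top: "loss"
}

If any label value falls outside [0, num_output - 1], the GPU loss kernels index out of bounds, which typically surfaces later as the "illegal memory access" (error 77) shown above.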

李阳

Oct 29, 2015, 11:49:24 AM
to Caffe Users
I checked the number of outputs and it is right, but I still get the same problem at math_functions.cu:81:
CUDA_CHECK(cudaMemcpy(Y, X, N, cudaMemcpyDefault));  // NOLINT(caffe/alt_fn)
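// (cudaMemcpy copies N bytes from X to Y, with cudaMemcpyDefault letting the runtime infer the copy direction. The error that CUDA_CHECK reports here may also have been produced by an earlier, asynchronous kernel launch rather than by this copy itself.)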

Check failed: error == cudaSuccess (77 vs. 0)  an illegal memory access was encountered

What should I do to solve this? Any suggestions?

On Tuesday, August 11, 2015 at 10:17:58 AM UTC+8, Fatemeh Saleh wrote:

Mohamed Ezz

Feb 26, 2016, 9:12:33 AM
to Caffe Users
(See debugger backtrace below.)
I'm seeing the same problem with a fairly small dataset (1500 images) on a 12GB Nvidia Titan X GPU.
nvidia-smi shows VRAM usage of 6800 MB out of the 12200 MB available.

The labels are numpy arrays of type np.uint8, serialized to LevelDB.

I ran Caffe under gdb, and this is the error and backtrace.

Any ideas?

F0226 14:05:57.918788 17788 base_data_layer.cu:25] Check failed: error == cudaSuccess (77 vs. 0)  an illegal memory access was encountered
*** Check failure stack trace: ***
    @     0x7ffff692edaa  (unknown)
    @     0x7ffff692ece4  (unknown)
    @     0x7ffff692e6e6  (unknown)
    @     0x7ffff6931687  (unknown)
    @     0x7ffff7221be7  caffe::BasePrefetchingDataLayer<>::Forward_gpu()
    @           0x419b28  caffe::Layer<>::Forward()
    @     0x7ffff7118102  caffe::Net<>::ForwardFromTo()
    @     0x7ffff7117ea1  caffe::Net<>::ForwardPrefilled()
    @     0x7ffff71182a4  caffe::Net<>::Forward()
    @     0x7ffff70baeb3  caffe::Net<>::ForwardBackward()
    @     0x7ffff70a8825  caffe::Solver<>::Step()
    @     0x7ffff70a821b  caffe::Solver<>::Solve()
    @           0x414e36  train()
    @           0x416da6  main
    @     0x7ffff5e40ec5  (unknown)
    @           0x413c09  (unknown)
    @              (nil)  (unknown)


Program received signal SIGABRT, Aborted.
0x00007ffff5e55cc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56      ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00007ffff5e55cc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007ffff5e590d8 in __GI_abort () at abort.c:89
#2  0x00007ffff6936ec3 in ?? () from /usr/lib/x86_64-linux-gnu/libglog.so.0
#3  0x00007ffff692edaa in google::LogMessage::Fail() () from /usr/lib/x86_64-linux-gnu/libglog.so.0
#4  0x00007ffff692ece4 in google::LogMessage::SendToLog() () from /usr/lib/x86_64-linux-gnu/libglog.so.0
#5  0x00007ffff692e6e6 in google::LogMessage::Flush() () from /usr/lib/x86_64-linux-gnu/libglog.so.0
#6  0x00007ffff6931687 in google::LogMessageFatal::~LogMessageFatal() () from /usr/lib/x86_64-linux-gnu/libglog.so.0
#7  0x00007ffff7221be7 in caffe::BasePrefetchingDataLayer<float>::Forward_gpu (this=0x4d1a700, bottom=std::vector of length 0, capacity 0, 
    top=std::vector of length 1, capacity 1 = {...}) at src/caffe/layers/base_data_layer.cu:25
#8  0x0000000000419b28 in caffe::Layer<float>::Forward (this=0x4d1a700, bottom=std::vector of length 0, capacity 0, top=std::vector of length 1, capacity 1 = {...})
    at ./include/caffe/layer.hpp:486
#9  0x00007ffff7118102 in caffe::Net<float>::ForwardFromTo (this=0x49a1af0, start=0, end=69) at src/caffe/net.cpp:600
#10 0x00007ffff7117ea1 in caffe::Net<float>::ForwardPrefilled (this=0x49a1af0, loss=0x7fffffffdc9c) at src/caffe/net.cpp:620
#11 0x00007ffff71182a4 in caffe::Net<float>::Forward (this=0x49a1af0, bottom=std::vector of length 0, capacity 0, loss=0x7fffffffdc9c) at src/caffe/net.cpp:634
#12 0x00007ffff70baeb3 in caffe::Net<float>::ForwardBackward (this=0x49a1af0, bottom=std::vector of length 0, capacity 0) at ./include/caffe/net.hpp:87
#13 0x00007ffff70a8825 in caffe::Solver<float>::Step (this=0x7100b0, iters=100000000) at src/caffe/solver.cpp:228
#14 0x00007ffff70a821b in caffe::Solver<float>::Solve (this=0x7100b0, resume_file=0x0) at src/caffe/solver.cpp:306
#15 0x0000000000414e36 in train () at tools/caffe.cpp:212
#16 0x0000000000416da6 in main (argc=2, argv=0x7fffffffe528) at tools/caffe.cpp:394

maina...@gmail.com

Jul 26, 2017, 4:17:22 PM
to Caffe Users
Hi Fatemeh,
Is it possible to share your train file? I think I am missing something. I have 2 pixel classes, 0 and 1, but if I set num_output to 2 there is a shape mismatch between the predictions and the ground-truth labels. Is it something with the GT? How should I modify the GT? My GT is binary, only 0 and 1.

wangy...@gmail.com

Jul 27, 2017, 9:53:13 PM
to Caffe Users

Your answer works for me, +1!

On Tuesday, August 11, 2015 at 10:17:58 AM UTC+8, Fatemeh Saleh wrote: