Hi all,
I'd appreciate any insight into this...
I'm running a fully convolutional net (longjon's future branch). I had a ton of problems until I got it to run, but it finally runs.
However, after about 6000 iterations it fails on:
F0709 22:02:38.919351 30207 syncedmem.cpp:51] Check failed: error == cudaSuccess (2 vs. 0) out of memory
Now, that doesn't make sense at all: I have a K80 with 12 GB (actually I have a bunch of them):
caffe@iltlvl383:~$ ./caffe/build/tools/caffe device_query -gpu=0
I0709 21:50:32.001188 32412 caffe.cpp:73] Querying device ID = 0
I0709 21:50:35.214002 32412 common.cpp:157] Device id: 0
I0709 21:50:35.214074 32412 common.cpp:158] Major revision number: 3
I0709 21:50:35.214087 32412 common.cpp:159] Minor revision number: 7
I0709 21:50:35.214097 32412 common.cpp:160] Name: Tesla K80
I0709 21:50:35.214107 32412 common.cpp:161] Total global memory: 12079136768
I0709 21:50:35.214126 32412 common.cpp:162] Total shared memory per block: 49152
I0709 21:50:35.214136 32412 common.cpp:163] Total registers per block: 65536
I0709 21:50:35.214146 32412 common.cpp:164] Warp size: 32
I0709 21:50:35.214156 32412 common.cpp:165] Maximum memory pitch: 2147483647
I0709 21:50:35.214165 32412 common.cpp:166] Maximum threads per block: 1024
I0709 21:50:35.214174 32412 common.cpp:167] Maximum dimension of block: 1024, 1024, 64
I0709 21:50:35.214184 32412 common.cpp:170] Maximum dimension of grid: 2147483647, 65535, 65535
I0709 21:50:35.214193 32412 common.cpp:173] Clock rate: 823500
I0709 21:50:35.214201 32412 common.cpp:174] Total constant memory: 65536
I0709 21:50:35.214210 32412 common.cpp:175] Texture alignment: 512
I0709 21:50:35.214220 32412 common.cpp:176] Concurrent copy and execution: Yes
I0709 21:50:35.214236 32412 common.cpp:178] Number of multiprocessors: 13
I0709 21:50:35.214246 32412 common.cpp:179] Kernel execution timeout: No
while all the net reports needing is:
I0709 21:48:47.162098 30207 net.cpp:219] Memory required for data: 122306400
which is obviously a lot less than what I have.
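Just to sanity-check the two numbers from the logs above (the byte counts are taken straight from the net.cpp and device_query output):

```python
# numbers copied from the logs above
data_bytes = 122306400        # "Memory required for data" (net.cpp:219)
total_bytes = 12079136768     # "Total global memory" (device_query)

print(f"data: {data_bytes / 2**20:.1f} MiB")   # roughly 116.6 MiB
print(f"card: {total_bytes / 2**30:.2f} GiB")  # roughly 11.25 GiB
```

So the data blobs alone are around 1% of the card's memory.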
In addition, this is a smaller model than the one I'm actually trying to run: I've reduced the number of possible pixel classes from 100 to 10 (just for testing).
I'm also running with batch_size=1 and iter_size=1.
This is my solver definition:
net: "./models/models/V3_FCN/train_val_s32.prototxt"
test_iter: 500
test_interval: 10000 # py solving tests
display: 500
#average_loss: 20
lr_policy: "fixed"
base_lr: 1e-4
momentum: 0.9
iter_size: 1
# base_lr: 1e-9
# momentum: 0.99
max_iter: 100000
weight_decay: 0.0005
snapshot: 6000
test_initialization: false
snapshot_prefix: "./models/snapshots/V3_FCN/snapshot"
So, if anybody can tell me why my memory explodes, how to solve this, or whether there's any way to make the run use less memory, I'd appreciate it.
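In case it helps with diagnosing, here's a small sketch of how I could watch whether GPU memory creeps up over iterations while training runs (it just polls nvidia-smi; the helper names are made up, and it assumes nvidia-smi is on the PATH):

```python
# Sketch: poll nvidia-smi next to a running caffe train, to see whether
# used memory grows steadily toward the failing iteration.
import subprocess
import time

QUERY = ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"]

def parse_used_mib(text):
    # nvidia-smi prints one line per visible GPU; each line is used MiB
    return [int(line) for line in text.split() if line]

def sample_gpus():
    # one poll of nvidia-smi; returns a list of used-MiB values per GPU
    return parse_used_mib(subprocess.check_output(QUERY).decode())

# to run alongside training, something like:
#   while True:
#       print(time.strftime("%H:%M:%S"), sample_gpus())
#       time.sleep(60)

# example of the parsing on canned output from two GPUs:
print(parse_used_mib("1200\n11000\n"))
```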
THX