Check failed: error == cudaSuccess (74 vs. 0) misaligned address


Song Andyᅣ

Oct 4, 2019, 5:16:13 AM
to Caffe Users
Hi, I'm trying to install caffe from source in order to run the network implemented in the following link (https://github.com/NVlabs/ssn_superpixels). The caffe source is downloaded in the directory "video_prop_networks/lib/caffe/".

I have successfully downloaded all of the prerequisites and installed caffe as per the instructions (step 4) in the README.md in the above repository. However, once I run training (running the train_ssn.py file in the ssn_superpixels repository) I get the following error:

F1004 08:53:56.852197    14 math_functions.hpp:176] Check failed: error == cudaSuccess (74 vs. 0)  misaligned address
*** Check failure stack trace: ***
Aborted (core dumped)

I am not completely sure what version of caffe is downloaded in "ssn_superpixels/lib/video_prop_networks/lib/caffe", but I am trying to run the code in a docker container whose Dockerfile is based on the one in "caffe/docker/gpu/":

FROM nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04
LABEL maintainer caffe-maint@googlegroups.com

RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential \
        cmake \
        git \
        wget \
        vim \
        libatlas-base-dev \
        libboost-all-dev \
        libgflags-dev \
        libgoogle-glog-dev \
        libhdf5-serial-dev \
        libleveldb-dev \
        liblmdb-dev \
        libopencv-dev \
        libprotobuf-dev \
        libsnappy-dev \
        protobuf-compiler \
        python-dev \
        python-numpy \
        python-pip \
        python-setuptools \
        python-scipy && \
    rm -rf /var/lib/apt/lists/*

ENV CAFFE_ROOT=/opt/caffe
WORKDIR $CAFFE_ROOT

# FIXME: use ARG instead of ENV once DockerHub supports this
# https://github.com/docker/hub-feedback/issues/460
ARG CLONE_TAG=1.0

RUN git clone -b ${CLONE_TAG} --depth 1 https://github.com/BVLC/caffe.git . && \
    python -m pip install --upgrade pip && \
    cd python && for req in $(cat requirements.txt) pydot; do pip install $req; done && cd .. && \
    git clone https://github.com/NVIDIA/nccl.git && cd nccl

# modify the Makefile.config.example before proceeding

RUN make -j install && cd .. && rm -rf nccl && \
    mkdir build && cd build && \
    cmake -DUSE_CUDNN=1 -DUSE_NCCL=1 .. && \
    make -j"$(nproc)"

ENV PYCAFFE_ROOT $CAFFE_ROOT/python
ENV PYTHONPATH $PYCAFFE_ROOT:$PYTHONPATH
ENV PATH $CAFFE_ROOT/build/tools:$PYCAFFE_ROOT:$PATH
RUN echo "$CAFFE_ROOT/build/lib" >> /etc/ld.so.conf.d/caffe.conf && ldconfig

WORKDIR /workspace

My machine is running CUDA version 10.1 (as per nvidia-smi) and has an NVIDIA GeForce RTX 2080 (which only supports CUDA versions 10.0 and upwards). Please help me resolve the address misalignment issue. Thanks in advance.


Khan Engr zubair

Jan 6, 2020, 2:13:55 PM
to Caffe Users
I'm having the same problem. Did you manage to solve it? Even modifying cudnn_conv_layer.cpp doesn't help.

Yaa Yang

May 14, 2020, 10:41:58 PM
to Caffe Users


On Friday, October 4, 2019 at 5:16:13 PM UTC+8, Song Andyᅣ wrote:
1. If you are running on multiple GPUs and changing cudnn_conv_layer.cpp does not help, first test whether a single GPU works. If a single GPU is OK but multiple GPUs are not, the problem may be the combination of NCCL and CUDA versions.

2. Try Caffe's official multi-GPU example, such as MNIST + LeNet. If MNIST + LeNet works, then reduce the size of the layers in your network as far as possible so that each layer has few parameters, to verify whether the problem is the size of the data NCCL has to communicate.

3. You can then use the nccl-tests tool to run a data test across your cards. If you are on NCCL 2.x, use https://github.com/NVIDIA/nccl-tests; this tool is useful for testing the data exchanged during GPU communication. If you are on NCCL 1.x, the tests are integrated into NCCL itself.

4. nccl-tests benchmarks communication between multiple cards; you can set the number of GPUs and the maximum message size of the communication (refer to the README.md of nccl-tests). If NCCL is not compatible with your CUDA version, a "misaligned address" error will be reported once the amount of data exceeds a certain size; at that point you can consider replacing the versions of CUDA and NCCL. One combination that worked for me is CUDA 10.1 + NCCL 2.3.5-5 + caffe_windows.

Of course, this is the result of my tests on Ubuntu 18.04. I will be honored if it is helpful to your question!
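To make step 3 concrete, here is a sketch of how nccl-tests could be built and run; the CUDA path, the message-size sweep, and the GPU count are assumptions you would adjust for your machine:

```shell
# Clone and build nccl-tests (for NCCL 2.x); CUDA_HOME is an assumed path
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make CUDA_HOME=/usr/local/cuda

# Sweep all-reduce message sizes from 8 bytes up to 256 MB, doubling each
# step (-f 2), across 2 GPUs (-g 2). If the run aborts with "misaligned
# address" above some size, suspect the NCCL/CUDA version combination.
./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2
```

If the sweep fails only above a certain message size, that matches the symptom described in point 4 above.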

E Merth

May 17, 2020, 6:11:26 PM
to Caffe Users
This is a bit of a guess: try reducing your batch sizes. With RTX cards and CUDA 10, Caffe gives me odd memory errors when the GPU runs out of memory. They should be "out of memory" messages, but sometimes they are address error messages instead.
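For example, the batch size lives in the data layer's `data_param` of the train prototxt; a minimal sketch (the layer name and LMDB path here are illustrative, not taken from the ssn_superpixels repo):

```
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  data_param {
    source: "train_lmdb"   # illustrative path
    batch_size: 4          # try halving this until the error disappears
    backend: LMDB
  }
}
```

Watching `nvidia-smi` memory usage while lowering `batch_size` should tell you whether the "misaligned address" is really an out-of-memory condition in disguise.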