Need basic debugging tips

bdrc

Oct 23, 2016, 7:35:59 PM
to sailfish-cfd
I am trying to run some of the examples, such as ldc_2d.py, with OpenCL on AMD 5870 graphics cards, which support OpenCL 1.2. I believe I have installed everything correctly on a Debian Jessie system, and the AMD proprietary drivers are installed and appear to be working. I have two cards installed, but perhaps they are not configured correctly. I would consider buying newer cards only if I knew that would make a difference. I am fairly set on using OpenCL with AMD cards, but I am open to alternatives if someone has experience with them actually working.

I can use the following with the CPU and it works, but it just hangs if I choose a GPU.

python ldc_2d.py --debug_single_process --opencl-interactive-select

Here I select a GPU and then it stalls, although sometimes it does go through a few iterations.

[ 67393  INFO MainProcess] Initializing subdomain.
[ 67394  INFO MainProcess] Relaxation model: bgk
[ 67394  INFO MainProcess] Actual lattice size is: [258, 258]
[ 67394  INFO MainProcess] Required memory:
[ 67394  INFO MainProcess] . distributions: 5 MiB
[ 67395  INFO MainProcess] . fields: 0 MiB
[ 67436  INFO MainProcess] Fluid node fraction: 98.8%
[ 67481  INFO MainProcess] On-GPU invalid result check disabled as the device does not support all required features.
/home/schrodinger/sailfish-mj/sailfish/backend_opencl.py:160: UserWarning: Received OpenCL source code in Unicode, should be ASCII string. Attempting conversion.
  return cl.Program(self.ctx, preamble + source).build() #'-cl-single-precision-constant -cl-fast-relaxed-math')
[ 69095 WARNING MainProcess] Running infinite simulation.
[ 69105  INFO MainProcess] Starting simulation.

(Only once in a while does it get this far before it hangs, and I can't get a traceback. The lines below are from the CPU, so the numbers are off, but the MLUPS on the GPU are even lower.)

[  5955  INFO MainProcess] iteration:2000  speed:272.77 MLUPS
[  6197  INFO MainProcess] iteration:3000  speed:267.16 MLUPS

I have tried running it under pdb and pudb. The program runs if I step through it one line at a time, but it still hangs if I use continue to let it run freely.
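
One way to get a traceback out of a run that hangs, without stepping through it in a debugger, might be a watchdog that dumps the Python stack if the program is still running after a deadline. The sketch below uses the standard-library faulthandler module from Python 3 (for Python 2.7, as used in this thread, it was available as a separate package). run_with_watchdog and stalled_step are hypothetical names for illustration and are not part of Sailfish; the timeout values are arbitrary.

```python
import faulthandler
import tempfile
import time


def run_with_watchdog(work, timeout_s, log_file):
    """Call work(); if it is still running after timeout_s seconds,
    dump the stack of every thread to log_file (the call is not interrupted)."""
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=log_file)
    try:
        return work()
    finally:
        # Disarm the watchdog once work() has returned or raised.
        faulthandler.cancel_dump_traceback_later()


def stalled_step():
    # Hypothetical stand-in for a simulation loop that stops making progress.
    time.sleep(1.0)
    return "done"


if __name__ == "__main__":
    with tempfile.TemporaryFile(mode="w+") as log_file:
        result = run_with_watchdog(stalled_step, 0.2, log_file)
        log_file.seek(0)
        print(log_file.read())  # stack dump written while stalled_step was sleeping
    print(result)
```

If the hang is inside a driver call, the dump would only show the Python frame that made the call, and in a multi-process run the watchdog would have to be armed inside the subdomain process rather than the master.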

I have read through a lot of the posts that talk about a similar hanging problem, but they never mention how they solved the problem or if it can be solved.

Thanks for any tips.

bdrc

Oct 23, 2016, 10:08:14 PM
to sailfish-cfd
I can actually get a traceback so I thought I would add it here.

If I run without the single-process flag, I get an EOF read error. If I use no arguments, it still stalls, but I can get the traceback shown at the bottom. Some attempts got through a few thousand iterations.

python ldc_2d.py --opencl-interactive-select

:~/sailfish-cfd/examples$ ./ldc_2d.py --opencl-interactive-select
[   964  INFO Master/debian] Machine master starting with PID 2515 at 2016-10-24 01:46:24 UTC
[   965  INFO Master/debian] Simulation started with: ./ldc_2d.py --opencl-interactive-select
[   977  INFO Master/debian] Sailfish version: f111f6e4a0953357f0871374aa825bc2eaafc2a0
[   977  INFO Master/debian] Handling subdomains: [0]
[   977  INFO Master/debian] Subdomain -> GPU map: {0: 0}
[   978  INFO Master/debian] Selected backend: opencl
Choose device(s):
[0] <pyopencl.Device 'Cypress' on 'AMD Accelerated Parallel Processing' at 0x2d81fa0>
[1] <pyopencl.Device 'Cypress' on 'AMD Accelerated Parallel Processing' at 0x371da00>
[2] <pyopencl.Device 'AMD FX(tm)-8320 Eight-Core Processor' on 'AMD Accelerated Parallel Processing' at 0x32251d0>
Process Subdomain/0:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/home/schrodinger/sailfish-cfd/sailfish/master.py", line 49, in _start_subdomain_runner
    backend = backend_class(config, gpu_id)
  File "/home/schrodinger/sailfish-cfd/sailfish/backend_opencl.py", line 45, in __init__
    self.ctx = cl.create_some_context(True)
  File "/usr/lib/python2.7/dist-packages/pyopencl/__init__.py", line 873, in create_some_context
    answer = get_input("Choice, comma-separated [0]:")
  File "/usr/lib/python2.7/dist-packages/pyopencl/__init__.py", line 803, in get_input
    user_input = raw_input(prompt)
EOFError: EOF when reading a line
Choice, comma-separated [0]:

The above hangs.
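
The EOFError in the traceback above is consistent with how multiprocessing sets up its children: before running the target, it replaces the child's sys.stdin with os.devnull, so an interactive prompt issued from a subdomain process has nothing to read. Below is a minimal sketch of that behaviour, written for Python 3 and assuming a POSIX host where the fork start method is available. ask_for_device and prompt_in_child are hypothetical names, not Sailfish or PyOpenCL functions.

```python
import multiprocessing as mp


def ask_for_device(result_queue):
    """Runs in the child: try to read a choice from stdin and report what happened."""
    try:
        choice = input("Choice, comma-separated [0]:")
        result_queue.put("read: " + choice)
    except EOFError:
        # The child's stdin points at os.devnull, so there is nothing to read.
        result_queue.put("EOFError")


def prompt_in_child():
    ctx = mp.get_context("fork")  # assumption: POSIX host, as in this thread
    queue = ctx.Queue()
    child = ctx.Process(target=ask_for_device, args=(queue,))
    child.start()
    outcome = queue.get(timeout=30)
    child.join()
    return outcome


if __name__ == "__main__":
    print(prompt_in_child())
```

This would also be consistent with the prompt working under --debug_single_process, where the selection runs in the process that owns the terminal.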

If I use no arguments, I can at least interrupt with the keyboard and get a traceback.

./ldc_2d.py
[   999  INFO Master/debian] Machine master starting with PID 11003 at 2016-10-24 02:04:21 UTC
[   999  INFO Master/debian] Simulation started with: ./ldc_2d.py
[  1006  INFO Master/debian] Sailfish version: f111f6e4a0953357f0871374aa825bc2eaafc2a0
[  1007  INFO Master/debian] Handling subdomains: [0]
[  1007  INFO Master/debian] Subdomain -> GPU map: {0: 0}
[  1009  INFO Master/debian] Selected backend: opencl
[  1438  INFO Subdomain/0] Initializing subdomain.
[  1439  INFO Subdomain/0] Required memory:
[  1439  INFO Subdomain/0] . distributions: 5 MiB
[  1439  INFO Subdomain/0] . fields: 0 MiB
[  1513  INFO Subdomain/0] On-GPU invalid result check disabled as the device does not support all required features.
/home/schrodinger/sailfish-cfd/sailfish/backend_opencl.py:159: UserWarning: Received OpenCL source code in Unicode, should be ASCII string. Attempting conversion.
  return cl.Program(self.ctx, preamble + source).build() #'-cl-single-precision-constant -cl-fast-relaxed-math')
[  2829 WARNING Subdomain/0] Running infinite simulation.
[  2830  INFO Subdomain/0] Starting simulation.
^CTraceback (most recent call last):
  File "./ldc_2d.py", line 41, in <module>
    ctrl.run()
  File "/home/schrodinger/sailfish-cfd/sailfish/controller.py", line 793, in run
    return self._finish_simulation(subdomain_specs, summary_receiver)
  File "/home/schrodinger/sailfish-cfd/sailfish/controller.py", line 708, in _finish_simulation
    self._simulation_process.join()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 145, in join
    res = self._popen.wait(timeout)
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 154, in wait
    return self.poll(0)
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 135, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
Process Master/debian:
    func(*targs, **kargs)
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/util.py", line 325, in _exit_function
  File "/usr/lib/python2.7/multiprocessing/process.py", line 261, in _bootstrap
    util._exit_function()
  File "/usr/lib/python2.7/multiprocessing/util.py", line 325, in _exit_function
    p.join()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 145, in join
    p.join()
    res = self._popen.wait(timeout)
  File "/usr/lib/python2.7/multiprocessing/process.py", line 145, in join
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 154, in wait
    return self.poll(0)
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 135, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
    res = self._popen.wait(timeout)
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 154, in wait
    return self.poll(0)
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 135, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
Error in sys.exitfunc:
Traceback (most recent call last):
  File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/usr/lib/python2.7/multiprocessing/util.py", line 325, in _exit_function
    p.join()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 145, in join
    res = self._popen.wait(timeout)
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 154, in wait
    return self.poll(0)
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 135, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
schrodinger@debian:~/sailfish-cfd/examples$ ./ldc_2d.py
[   987  INFO Master/debian] Machine master starting with PID 11019 at 2016-10-24 02:04:32 UTC
[   987  INFO Master/debian] Simulation started with: ./ldc_2d.py
[   994  INFO Master/debian] Sailfish version: f111f6e4a0953357f0871374aa825bc2eaafc2a0
[   994  INFO Master/debian] Handling subdomains: [0]
[   995  INFO Master/debian] Subdomain -> GPU map: {0: 0}
[   995  INFO Master/debian] Selected backend: opencl
[  1439  INFO Subdomain/0] Initializing subdomain.
[  1440  INFO Subdomain/0] Required memory:
[  1440  INFO Subdomain/0] . distributions: 5 MiB
[  1440  INFO Subdomain/0] . fields: 0 MiB
[  1518  INFO Subdomain/0] On-GPU invalid result check disabled as the device does not support all required features.
/home/schrodinger/sailfish-cfd/sailfish/backend_opencl.py:159: UserWarning: Received OpenCL source code in Unicode, should be ASCII string. Attempting conversion.
  return cl.Program(self.ctx, preamble + source).build() #'-cl-single-precision-constant -cl-fast-relaxed-math')
[  2804 WARNING Subdomain/0] Running infinite simulation.
[  2804  INFO Subdomain/0] Starting simulation.
^CTraceback (most recent call last):
  File "./ldc_2d.py", line 41, in <module>
    ctrl.run()
  File "/home/schrodinger/sailfish-cfd/sailfish/controller.py", line 793, in run
    return self._finish_simulation(subdomain_specs, summary_receiver)
  File "/home/schrodinger/sailfish-cfd/sailfish/controller.py", line 708, in _finish_simulation
    self._simulation_process.join()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 145, in join
    res = self._popen.wait(timeout)
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 154, in wait
    return self.poll(0)
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 135, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/usr/lib/python2.7/multiprocessing/util.py", line 325, in _exit_function
    p.join()
Process Master/debian:
  File "/usr/lib/python2.7/multiprocessing/process.py", line 145, in join
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 261, in _bootstrap
    res = self._popen.wait(timeout)
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 154, in wait
    util._exit_function()
  File "/usr/lib/python2.7/multiprocessing/util.py", line 325, in _exit_function
    return self.poll(0)
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 135, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
    p.join()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 145, in join
Error in sys.exitfunc:
    res = self._popen.wait(timeout)
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 154, in wait
    return self.poll(0)
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 135, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
Traceback (most recent call last):
  File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/usr/lib/python2.7/multiprocessing/util.py", line 325, in _exit_function
    p.join()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 145, in join
    res = self._popen.wait(timeout)
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 154, in wait
    return self.poll(0)
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 135, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
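
Both KeyboardInterrupt tracebacks end in os.waitpid underneath self._simulation_process.join(), which suggests the master is only waiting for a child process and the stall itself is probably somewhere in that child. Below is a minimal sketch of bounding such a wait with a timeout, assuming Python 3 on a POSIX host with the fork start method. stuck_worker, wait_with_deadline and demo are hypothetical names and not part of Sailfish.

```python
import multiprocessing as mp
import time


def stuck_worker():
    # Hypothetical stand-in for a child process that never finishes on its own.
    time.sleep(60)


def wait_with_deadline(proc, deadline_s):
    """Join proc for at most deadline_s seconds; terminate it if still alive."""
    proc.join(deadline_s)
    if proc.is_alive():
        proc.terminate()
        proc.join()  # reap the terminated child
        return "timed out"
    return "finished"


def demo():
    ctx = mp.get_context("fork")  # assumption: POSIX host
    proc = ctx.Process(target=stuck_worker)
    proc.start()
    return wait_with_deadline(proc, 0.5)


if __name__ == "__main__":
    print(demo())
```

A bounded join like this only tells the parent that the child is stuck; finding out where it is stuck still requires looking inside the child process.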

bdrc

Sep 21, 2017, 7:49:42 AM
to sailfish-cfd
I was never able to solve this problem, but I wanted to report back that I no longer have it. I finally bought a 7970 and tried again. I can run most examples now, except that I can't use --mode=visualization. That appears to be a Python 3 issue.

Using the proprietary drivers packaged in Jessie finally worked, and I could run examples on the GPU. I saw a lot of misinformation about these drivers, for example that they had been removed for some time. Previously, a Gallium driver would get installed instead of the AMD driver, which could have been the issue before.

I had less luck with AMDGPU on Stretch with this card. It required a custom kernel and had an issue with the 64-bit pointer flag. That is why I decided to go back to Jessie, and I am happy that I finally got an example to work.